Re: [PATCH v5 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-05-31 Thread Keqian Zhu
Hi Marc,

A gentle reminder. :)

BRs,
Keqian

On 2021/5/7 19:03, Keqian Zhu wrote:
> Hi Marc,
> 
> This rebases to newest mainline kernel.
> 
> Thanks,
> Keqian
> 
> 
> We have two pathes to build stage2 mapping for MMIO regions.
> 
> Create time's path and stage2 fault path.
> 
> Patch#1 removes the creation time's mapping of MMIO regions
> Patch#2 tries stage2 block mapping for host device MMIO at fault path
> 
> Changelog:
> 
> v5:
>  - Rebase to newest mainline.
> 
> v4:
>  - use get_vma_page_shift() handle all cases. (Marc)
>  - get rid of force_pte for device mapping. (Marc)
> 
> v3:
>  - Do not need to check memslot boundary in device_rough_page_shift(). (Marc)
> 
> 
> Keqian Zhu (2):
>   kvm/arm64: Remove the creation time's mapping of MMIO regions
>   kvm/arm64: Try stage2 block mapping for host device MMIO
> 
>  arch/arm64/kvm/mmu.c | 99 
>  1 file changed, 54 insertions(+), 45 deletions(-)
> 


[PATCH] KVM: halt polling: Make the adjustment of polling time clearer

2021-05-19 Thread Keqian Zhu
When we have "block_ns > halt_poll_ns" and "block_ns < max_halt_poll_ns",
then "halt_poll_ns < max_halt_poll_ns" is true, so we can drop this extra
condition.

We want to make sure halt_poll_ns is not zero before shrinking it. Putting
that condition in the shrinking primitive makes the code clearer.

No functional change.

Signed-off-by: Keqian Zhu 
---
 Documentation/virt/kvm/halt-polling.rst | 21 ++---
 virt/kvm/kvm_main.c | 11 ++-
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/Documentation/virt/kvm/halt-polling.rst b/Documentation/virt/kvm/halt-polling.rst
index 4922e4a15f18..d9f699395a7f 100644
--- a/Documentation/virt/kvm/halt-polling.rst
+++ b/Documentation/virt/kvm/halt-polling.rst
@@ -47,17 +47,16 @@ Thus this is a per vcpu (or vcore) value.
 During polling if a wakeup source is received within the halt polling interval,
 the interval is left unchanged. In the event that a wakeup source isn't
 received during the polling interval (and thus schedule is invoked) there are
-two options, either the polling interval and total block time[0] were less than
-the global max polling interval (see module params below), or the total block
-time was greater than the global max polling interval.
-
-In the event that both the polling interval and total block time were less than
-the global max polling interval then the polling interval can be increased in
-the hope that next time during the longer polling interval the wake up source
-will be received while the host is polling and the latency benefits will be
-received. The polling interval is grown in the function grow_halt_poll_ns() and
-is multiplied by the module parameters halt_poll_ns_grow and
-halt_poll_ns_grow_start.
+two options, either the total block time[0] were less than the global max
+polling interval (see module params below), or the total block time was greater
+than the global max polling interval.
+
+In the event that the total block time were less than the global max polling
+interval then the polling interval can be increased in the hope that next time
+during the longer polling interval the wake up source will be received while the
+host is polling and the latency benefits will be received. The polling interval
+is grown in the function grow_halt_poll_ns() and is multiplied by the module
+parameters halt_poll_ns_grow and halt_poll_ns_grow_start.
 
 In the event that the total block time was greater than the global max polling
 interval then the host will never poll for long enough (limited by the global
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b4feb92dc79..13a9996c4ccb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2906,6 +2906,9 @@ static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
unsigned int old, val, shrink;
 
old = val = vcpu->halt_poll_ns;
+   if (!old)
+   return;
+
shrink = READ_ONCE(halt_poll_ns_shrink);
if (shrink == 0)
val = 0;
@@ -3003,12 +3006,10 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
if (block_ns <= vcpu->halt_poll_ns)
;
/* we had a long block, shrink polling */
-   else if (vcpu->halt_poll_ns &&
-   block_ns > vcpu->kvm->max_halt_poll_ns)
+   else if (block_ns > vcpu->kvm->max_halt_poll_ns)
shrink_halt_poll_ns(vcpu);
-   /* we had a short halt and our poll time is too small */
-   else if (vcpu->halt_poll_ns < vcpu->kvm->max_halt_poll_ns &&
-   block_ns < vcpu->kvm->max_halt_poll_ns)
+   /* we had a short block, grow polling */
+   else if (block_ns < vcpu->kvm->max_halt_poll_ns)
grow_halt_poll_ns(vcpu);
} else {
vcpu->halt_poll_ns = 0;
-- 
2.19.1
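
For reference, the decision flow after this change can be modelled by the small
userspace sketch below. It is only a sketch: the constants stand in for the real
module parameters and the vcpu/kvm fields, and the grow/shrink helpers are
simplified versions of the real primitives.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_HALT_POLL_NS        200000ULL /* stand-in for kvm->max_halt_poll_ns */
#define HALT_POLL_NS_GROW       2         /* stand-in for halt_poll_ns_grow */
#define HALT_POLL_NS_GROW_START 10000     /* stand-in for halt_poll_ns_grow_start */
#define HALT_POLL_NS_SHRINK     0         /* 0 means "reset to zero" */

static uint64_t halt_poll_ns;             /* models vcpu->halt_poll_ns */

static void shrink_halt_poll_ns(void)
{
        /* The patch moves the "don't shrink when already zero" check here. */
        if (!halt_poll_ns)
                return;
        halt_poll_ns = HALT_POLL_NS_SHRINK ? halt_poll_ns / HALT_POLL_NS_SHRINK : 0;
}

static void grow_halt_poll_ns(void)
{
        uint64_t val = halt_poll_ns * HALT_POLL_NS_GROW;

        if (val < HALT_POLL_NS_GROW_START)
                val = HALT_POLL_NS_GROW_START;
        if (val > MAX_HALT_POLL_NS)
                val = MAX_HALT_POLL_NS;
        halt_poll_ns = val;
}

static void adjust(uint64_t block_ns)
{
        if (block_ns <= halt_poll_ns)
                ;                               /* polling paid off, keep it */
        else if (block_ns > MAX_HALT_POLL_NS)
                shrink_halt_poll_ns();          /* long block, shrink polling */
        else if (block_ns < MAX_HALT_POLL_NS)
                grow_halt_poll_ns();            /* short block, grow polling */
}

int main(void)
{
        uint64_t samples[] = { 50000, 50000, 500000, 50000 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                adjust(samples[i]);
                printf("block_ns=%" PRIu64 " -> halt_poll_ns=%" PRIu64 "\n",
                       samples[i], halt_poll_ns);
        }
        return 0;
}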



[PATCH v5 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-05-07 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level), so try to use
block mapping in stage2 to speed up both map and unmap.

Compared to normal memory mapping, we should consider two more
points when trying block mapping for an MMIO region:

1. For normal memory mapping, the PA (host physical address) and
HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
the HVA to request a hugepage, so we don't need to consider PA
alignment when verifying block mapping. But for device memory
mapping, the PA and HVA may have different alignment.

2. For normal memory mapping, we are sure the hugepage size properly
fits into the vma, so we don't check whether the mapping size exceeds
the boundary of the vma. But for device memory mapping, we should pay
attention to this.

This adds get_vma_page_shift() to get the page shift for both normal
memory and device MMIO regions, and checks these two points when
selecting the block mapping size for an MMIO region.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 61 
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 16efd47439a6..2a3d3b760550 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
+{
+   unsigned long pa;
+
+   if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
+   return huge_page_shift(hstate_vma(vma));
+
+   if (!(vma->vm_flags & VM_PFNMAP))
+   return PAGE_SHIFT;
+
+   VM_BUG_ON(is_vm_hugetlb_page(vma));
+
+   pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
+
+#ifndef __PAGETABLE_PMD_FOLDED
+   if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PUD_SIZE) <= vma->vm_end)
+   return PUD_SHIFT;
+#endif
+
+   if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PMD_SIZE) <= vma->vm_end)
+   return PMD_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -853,7 +882,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   /* Let's check if we will get back a huge page backed by hugetlbfs */
+   /*
+* Let's check if we will get back a huge page backed by hugetlbfs, or
+* get block mapping for device MMIO region.
+*/
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
@@ -862,15 +894,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   if (is_vm_hugetlb_page(vma))
-   vma_shift = huge_page_shift(hstate_vma(vma));
-   else
-   vma_shift = PAGE_SHIFT;
-
-   if (logging_active ||
-   (vma->vm_flags & VM_PFNMAP)) {
+   /*
+* logging_active is guaranteed to never be true for VM_PFNMAP
+* memslots.
+*/
+   if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
+   } else {
+   vma_shift = get_vma_page_shift(vma, hva);
}
 
switch (vma_shift) {
@@ -943,8 +975,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
 
if (kvm_is_device_pfn(pfn)) {
+   /*
+* If the page was identified as device early by looking at
+* the VMA flags, vma_pagesize is already representing the
+* largest quantity we can map.  If instead it was mapped
+* via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
+* and must not be upgraded.
+*
+* In both cases, we don't let transparent_hugepage_adjust()
+* change things at the last minute.
+*/
device = true;
-   force_pte = true;
} else if (logging_active && !write_fault) {
/*
 * Only actually map the page as writable if this was a write
@@ -965,7 +1006,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 * If we are not forced to use page mapping, check if we are
 * backed by a THP and thus use block mapping if possible.
 */
-   if (vma_pag
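
To make the PA/HVA alignment point from the commit message concrete, here is a
small userspace sketch. It assumes 4K pages with 4-level paging and only models
the congruence and vma-boundary checks; it is not the real get_vma_page_shift().

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PUD_SHIFT  30
#define PMD_SIZE   (1ULL << PMD_SHIFT)
#define PUD_SIZE   (1ULL << PUD_SHIFT)

#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))
#define ALIGN(x, a)      (((x) + (a) - 1) & ~((a) - 1))

struct vma_model {
        uint64_t vm_start, vm_end;      /* HVA range covered by the vma */
        uint64_t pa_base;               /* PA that backs vm_start */
};

/* Mirrors the shape of the checks in get_vma_page_shift() above. */
static int vma_page_shift(const struct vma_model *vma, uint64_t hva)
{
        uint64_t pa = vma->pa_base + (hva - vma->vm_start);

        if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
            ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
            ALIGN(hva, PUD_SIZE) <= vma->vm_end)
                return PUD_SHIFT;

        if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
            ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
            ALIGN(hva, PMD_SIZE) <= vma->vm_end)
                return PMD_SHIFT;

        return PAGE_SHIFT;
}

int main(void)
{
        /* A 1GB BAR whose HVA is 1GB aligned but whose PA is only 4K aligned:
         * neither the PUD nor the PMD congruence check passes. */
        struct vma_model misaligned = {
                .vm_start = 0x40000000ULL, .vm_end = 0x80000000ULL,
                .pa_base  = 0x100001000ULL,
        };
        /* The same BAR with a 1GB-aligned PA: PUD-sized blocks are fine. */
        struct vma_model aligned = {
                .vm_start = 0x40000000ULL, .vm_end = 0x80000000ULL,
                .pa_base  = 0x200000000ULL,
        };

        printf("misaligned PA -> page shift %d\n",
               vma_page_shift(&misaligned, 0x40200000ULL));
        printf("aligned PA    -> page shift %d\n",
               vma_page_shift(&aligned, 0x40200000ULL));
        return 0;
}

Running it prints shift 12 for the misaligned BAR and shift 30 for the aligned
one, which is exactly the degradation the commit message describes.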

[PATCH v5 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-05-07 Thread Keqian Zhu
The MMIO regions may be unmapped for many reasons and can be remapped
by the stage2 fault path. Mapping MMIO regions at creation time has
become a minor optimization and makes these two mapping paths hard to
keep in sync.

Remove the mapping code while keeping the useful sanity check.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..16efd47439a6 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1346,7 +1346,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
-   bool writable = !(mem->flags & KVM_MEM_READONLY);
int ret = 0;
 
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1363,8 +1362,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
-* between them, so iterate over all of them to find out if we can map
-* any of them right now.
+* between them, so iterate over all of them.
 *
 * ++
 * +---++   ++
@@ -1375,51 +1373,21 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 */
do {
struct vm_area_struct *vma;
-   hva_t vm_start, vm_end;
 
vma = find_vma_intersection(current->mm, hva, reg_end);
if (!vma)
break;
 
-   /*
-* Take the intersection of this VMA with the memory region
-*/
-   vm_start = max(hva, vma->vm_start);
-   vm_end = min(reg_end, vma->vm_end);
-
if (vma->vm_flags & VM_PFNMAP) {
-   gpa_t gpa = mem->guest_phys_addr +
-   (vm_start - mem->userspace_addr);
-   phys_addr_t pa;
-
-   pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-   pa += vm_start - vma->vm_start;
-
/* IO region dirty page logging not allowed */
if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
-   goto out;
-   }
-
-   ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-   vm_end - vm_start,
-   writable);
-   if (ret)
break;
+   }
}
-   hva = vm_end;
+   hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
 
-   if (change == KVM_MR_FLAGS_ONLY)
-   goto out;
-
-   spin_lock(&kvm->mmu_lock);
-   if (ret)
-   unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
-   else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-   stage2_flush_memslot(kvm, memslot);
-   spin_unlock(&kvm->mmu_lock);
-out:
mmap_read_unlock(current->mm);
return ret;
 }
-- 
2.19.1
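
What survives in kvm_arch_prepare_memory_region() after this patch is only the
sanity check. A userspace model of that walk (stand-in types and flag bits, not
the kernel code) is sketched below.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VM_PFNMAP (1u << 0)                     /* stand-in flag bit */

struct vma_model {
        uint64_t vm_start, vm_end;
        unsigned int vm_flags;
};

/* Stand-in for find_vma_intersection(): first vma (array sorted by address)
 * with vm_end > start and vm_start < end. */
static const struct vma_model *find_intersection(const struct vma_model *vmas,
                                                 int n, uint64_t start,
                                                 uint64_t end)
{
        for (int i = 0; i < n; i++)
                if (vmas[i].vm_end > start && vmas[i].vm_start < end)
                        return &vmas[i];
        return NULL;
}

/* Walk every vma intersecting [hva, reg_end) and refuse dirty logging if any
 * of them is a VM_PFNMAP (i.e. MMIO) mapping -- the check that survives. */
static int check_memslot(const struct vma_model *vmas, int n,
                         uint64_t hva, uint64_t reg_end, bool log_dirty)
{
        do {
                const struct vma_model *vma =
                        find_intersection(vmas, n, hva, reg_end);

                if (!vma)
                        break;          /* a hole: the fault path will cope */

                if ((vma->vm_flags & VM_PFNMAP) && log_dirty)
                        return -1;      /* IO region dirty logging refused */

                hva = vma->vm_end < reg_end ? vma->vm_end : reg_end;
        } while (hva < reg_end);

        return 0;
}

int main(void)
{
        struct vma_model vmas[] = {
                { 0x1000, 0x3000, 0 },          /* anonymous memory */
                { 0x4000, 0x8000, VM_PFNMAP },  /* device BAR */
        };

        printf("no dirty logging:   %d\n",
               check_memslot(vmas, 2, 0x1000, 0x8000, false));
        printf("with dirty logging: %d\n",
               check_memslot(vmas, 2, 0x1000, 0x8000, true));
        return 0;
}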



[PATCH v5 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-05-07 Thread Keqian Zhu
Hi Marc,

This rebases to newest mainline kernel.

Thanks,
Keqian


We have two paths to build stage2 mappings for MMIO regions:
the creation-time path and the stage2 fault path.

Patch#1 removes the creation time's mapping of MMIO regions
Patch#2 tries stage2 block mapping for host device MMIO at fault path

Changelog:

v5:
 - Rebase to newest mainline.

v4:
 - use get_vma_page_shift() to handle all cases. (Marc)
 - get rid of force_pte for device mapping. (Marc)

v3:
 - Do not need to check memslot boundary in device_rough_page_shift(). (Marc)


Keqian Zhu (2):
  kvm/arm64: Remove the creation time's mapping of MMIO regions
  kvm/arm64: Try stage2 block mapping for host device MMIO

 arch/arm64/kvm/mmu.c | 99 
 1 file changed, 54 insertions(+), 45 deletions(-)

-- 
2.19.1



Re: [PATCH v4 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-04-22 Thread Keqian Zhu
Hi Gavin,

On 2021/4/23 9:35, Gavin Shan wrote:
> Hi Keqian,
> 
> On 4/22/21 5:41 PM, Keqian Zhu wrote:
>> On 2021/4/22 10:12, Gavin Shan wrote:
>>> On 4/21/21 4:28 PM, Keqian Zhu wrote:
>>>> On 2021/4/21 14:38, Gavin Shan wrote:
>>>>> On 4/16/21 12:03 AM, Keqian Zhu wrote:
> 
> [...]
> 
>>>
>>> Yeah, Sorry that I missed that part. Something associated with Santosh's
>>> patch. The flag can be not existing until the page fault happened on
>>> the vma. In this case, the check could be not working properly.
>>>
>>>[PATCH] KVM: arm64: Correctly handle the mmio faulting
>> Yeah, you are right.
>>
>> If that happens, we won't try to use block mapping for memslot with 
>> VM_PFNMAP.
>> But it keeps a same logic with old code.
>>
>> 1. When without dirty-logging, we won't try block mapping for it, and we'll
>> finally know that it's device, so won't try to do adjust THP (Transparent 
>> Huge Page)
>> for it.
>> 2. If userspace wrongly enables dirty logging for this memslot, we'll 
>> force_pte for it.
>>
> 
> It's not about the patch itself and just want more discussion to get more 
> details.
> The patch itself looks good to me. I got two questions as below:
> 
> (1) The memslot fails to be added if it's backed by MMIO region and dirty 
> logging is
> enabled in kvm_arch_prepare_memory_region(). As Santosh reported, the 
> corresponding
> vma could be associated with MMIO region and VM_PFNMAP is missed. In this 
> case,
> kvm_arch_prepare_memory_region() isn't returning error, meaning the memslot 
> can be
> added successfully and block mapping isn't used, as you mentioned. The 
> question is
> the memslot is added, but the expected result would be failure.
Sure. I think we could try to populate the final vma flags in
kvm_arch_prepare_memory_region(), maybe through GUP or some better method.
It would be nice if you could try to solve this. :)

> 
> (2) If dirty logging is enabled on the MMIO memslot, everything should be 
> fine. If
> the dirty logging isn't enabled and VM_PFNMAP isn't set yet in 
> user_mem_abort(),
> block mapping won't be used and PAGE_SIZE is picked, but the failing IPA might
> be good candidate for block mapping. It means we miss something for blocking
> mapping?
Right. This issue can also be solved by populating the final vma flags in
kvm_arch_prepare_memory_region().


> 
> By the way, do you have idea why dirty logging can't be enabled on MMIO 
> memslot?
IIUC, an MMIO region is of device memory type, so it's associated with device
state and actions. Normal memory can be written out of order and repeatedly,
but device memory can't. A write to MMIO triggers a device action based on the
current device state, and what we read from MMIO also depends on the current
device state. Thus the dirty-logging policy for normal memory can't be applied
to MMIO.



> I guess Marc might know the history. For example, QEMU is taking "/dev/mem" or
> "/dev/kmem" to back guest's memory, the vma is marked as MMIO, but dirty 
> logging
> and migration isn't supported?
The MMIO region is part of the device state. We need an extra kernel driver to
support migration of a pass-through device, as how to save and restore the
device state is closely tied to the specific device type. You can refer to VFIO
migration for more detail.

Thanks,
Keqian


Re: [PATCH] KVM: arm64: Correctly handle the mmio faulting

2021-04-22 Thread Keqian Zhu



On 2021/4/22 16:00, Santosh Shukla wrote:
> On Thu, Apr 22, 2021 at 1:07 PM Tarun Gupta (SW-GPU)
>  wrote:
>>
>>
>>
>> On 4/22/2021 12:20 PM, Marc Zyngier wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Thu, 22 Apr 2021 03:02:00 +0100,
>>> Gavin Shan  wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 4/21/21 9:59 PM, Marc Zyngier wrote:
>>>>> On Wed, 21 Apr 2021 07:17:44 +0100,
>>>>> Keqian Zhu  wrote:
>>>>>> On 2021/4/21 14:20, Gavin Shan wrote:
>>>>>>> On 4/21/21 12:59 PM, Keqian Zhu wrote:
>>>>>>>> On 2020/10/22 0:16, Santosh Shukla wrote:
>>>>>>>>> The Commit:6d674e28 introduces a notion to detect and handle the
>>>>>>>>> device mapping. The commit checks for the VM_PFNMAP flag is set
>>>>>>>>> in vma->flags and if set then marks force_pte to true such that
>>>>>>>>> if force_pte is true then ignore the THP function check
>>>>>>>>> (/transparent_hugepage_adjust()).
>>>>>>>>>
>>>>>>>>> There could be an issue with the VM_PFNMAP flag setting and checking.
>>>>>>>>> For example consider a case where the mdev vendor driver register's
>>>>>>>>> the vma_fault handler named vma_mmio_fault(), which maps the
>>>>>>>>> host MMIO region in-turn calls remap_pfn_range() and maps
>>>>>>>>> the MMIO's vma space. Where, remap_pfn_range implicitly sets
>>>>>>>>> the VM_PFNMAP flag into vma->flags.
>>>>>>>> Could you give the name of the mdev vendor driver that triggers this 
>>>>>>>> issue?
>>>>>>>> I failed to find one according to your description. Thanks.
>>>>>>>>
>>>>>>>
>>>>>>> I think it would be fixed in driver side to set VM_PFNMAP in
>>>>>>> its mmap() callback (call_mmap()), like vfio PCI driver does.
>>>>>>> It means it won't be delayed until page fault is issued and
>>>>>>> remap_pfn_range() is called. It's determined from the beginning
>>>>>>> that the vma associated the mdev vendor driver is serving as
>>>>>>> PFN remapping purpose. So the vma should be populated completely,
>>>>>>> including the VM_PFNMAP flag before it becomes visible to user
>>>>>>> space.
>>>>>
>>>>> Why should that be a requirement? Lazy populating of the VMA should be
>>>>> perfectly acceptable if the fault can only happen on the CPU side.
>>>>>
> 
> Right.
> Hi keqian,
> You can refer to case
> http://lkml.iu.edu/hypermail/linux/kernel/2010.3/00952.html
Hi Santosh,

Yeah, thanks for that.

BRs,
Keqian


Re: [PATCH v4 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-04-22 Thread Keqian Zhu
Hi Gavin,

On 2021/4/22 10:12, Gavin Shan wrote:
> Hi Keqian,
> 
> On 4/21/21 4:28 PM, Keqian Zhu wrote:
>> On 2021/4/21 14:38, Gavin Shan wrote:
>>> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>>>> The MMIO regions may be unmapped for many reasons and can be remapped
>>>> by stage2 fault path. Map MMIO regions at creation time becomes a
>>>> minor optimization and makes these two mapping path hard to sync.
>>>>
>>>> Remove the mapping code while keep the useful sanity check.
>>>>
>>>> Signed-off-by: Keqian Zhu 
>>>> ---
>>>>arch/arm64/kvm/mmu.c | 38 +++---
>>>>1 file changed, 3 insertions(+), 35 deletions(-)
>>>>
>>>
>>> After removing the logic to create stage2 mapping for VM_PFNMAP region,
>>> I think the "do { } while" loop becomes unnecessary and can be dropped
>>> completely. It means the only sanity check is to see if the memory slot
>>> overflows IPA space or not. In that case, KVM_MR_FLAGS_ONLY can be
>>> ignored because the memory slot's base address and length aren't changed
>>> when we have KVM_MR_FLAGS_ONLY.
>> Maybe not exactly. Here we do an important sanity check that we shouldn't
>> log dirty for memslots with VM_PFNMAP.
>>
> 
> Yeah, Sorry that I missed that part. Something associated with Santosh's
> patch. The flag can be not existing until the page fault happened on
> the vma. In this case, the check could be not working properly.
> 
>   [PATCH] KVM: arm64: Correctly handle the mmio faulting
Yeah, you are right.

If that happens, we won't try to use block mapping for a memslot with VM_PFNMAP,
but that keeps the same logic as the old code.

1. Without dirty logging, we won't try block mapping for it, and we'll
eventually find out that it's a device, so we won't try the THP (Transparent
Huge Page) adjustment for it.
2. If userspace wrongly enables dirty logging for this memslot, we'll force_pte
for it.

Thanks,
Keqian


Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-21 Thread Keqian Zhu



On 2021/4/21 15:52, Gavin Shan wrote:
> Hi Keqian,
> 
> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>> The MMIO region of a device maybe huge (GB level), try to use
>> block mapping in stage2 to speedup both map and unmap.
>>
>> Compared to normal memory mapping, we should consider two more
>> points when try block mapping for MMIO region:
>>
>> 1. For normal memory mapping, the PA(host physical address) and
>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>> the HVA to request hugepage, so we don't need to consider PA
>> alignment when verifing block mapping. But for device memory
>> mapping, the PA and HVA may have different alignment.
>>
>> 2. For normal memory mapping, we are sure hugepage size properly
>> fit into vma, so we don't check whether the mapping size exceeds
>> the boundary of vma. But for device memory mapping, we should pay
>> attention to this.
>>
>> This adds get_vma_page_shift() to get page shift for both normal
>> memory and device MMIO region, and check these two points when
>> selecting block mapping size for MMIO region.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>   arch/arm64/kvm/mmu.c | 61 
>>   1 file changed, 51 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c59af5ca01b0..5a1cc7751e6d 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>> *memslot,
>>   return PAGE_SIZE;
>>   }
>>   +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long 
>> hva)
>> +{
>> +unsigned long pa;
>> +
>> +if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
>> +return huge_page_shift(hstate_vma(vma));
>> +
>> +if (!(vma->vm_flags & VM_PFNMAP))
>> +return PAGE_SHIFT;
>> +
>> +VM_BUG_ON(is_vm_hugetlb_page(vma));
>> +
> 
> I don't understand how VM_PFNMAP is set for hugetlbfs related vma.
> I think they are exclusive, meaning the flag is never set for
> hugetlbfs vma. If it's true, VM_PFNMAP needn't be checked on hugetlbfs
> vma and the VM_BUG_ON() becomes unnecessary.
Yes, but we're not sure all drivers follow this rule. Adding a BUG_ON() is
a way to catch the issue.

> 
>> +pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>> +
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
>> +ALIGN(hva, PUD_SIZE) <= vma->vm_end)
>> +return PUD_SHIFT;
>> +#endif
>> +
>> +if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
>> +ALIGN(hva, PMD_SIZE) <= vma->vm_end)
>> +return PMD_SHIFT;
>> +
>> +return PAGE_SHIFT;
>> +}
>> +
> 
> There is "switch(...)" fallback mechanism in user_mem_abort(). 
> PUD_SIZE/PMD_SIZE
> can be downgraded accordingly if the addresses fails in the alignment check
> by fault_supports_stage2_huge_mapping(). I think it would make 
> user_mem_abort()
> simplified if the logic can be moved to get_vma_page_shift().
> 
> Another question if we need the check from 
> fault_supports_stage2_huge_mapping()
> if VM_PFNMAP area is going to be covered by block mapping. If so, the 
> "switch(...)"
> fallback mechanism needs to be part of get_vma_page_shift().
Yes, good suggestion. My idea is to keep this series simple and do further
optimization in another patch series. Would you mind sending a patch?

Thanks,
Keqian
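
For readers following the thread, the fallback mechanism being discussed works
roughly as in the userspace sketch below; supports_huge() is only a stand-in for
fault_supports_stage2_huge_mapping(), and its behaviour here is arbitrary.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PUD_SHIFT  30

/* Stand-in for fault_supports_stage2_huge_mapping(); here we simply pretend
 * that only PMD-sized (and smaller) mappings pass the memslot/IPA checks. */
static bool supports_huge(int shift)
{
        return shift <= PMD_SHIFT;
}

/* Models the switch (vma_shift) fallback in user_mem_abort(): start from the
 * shift derived from the vma and downgrade until the fault supports it. */
static int pick_mapping_shift(int vma_shift)
{
        switch (vma_shift) {
        case PUD_SHIFT:
                if (supports_huge(PUD_SHIFT))
                        break;
                vma_shift = PMD_SHIFT;
                /* fall through */
        case PMD_SHIFT:
                if (supports_huge(PMD_SHIFT))
                        break;
                vma_shift = PAGE_SHIFT;
                /* fall through */
        case PAGE_SHIFT:
        default:
                break;
        }
        return vma_shift;
}

int main(void)
{
        printf("vma allows PUD -> use shift %d\n", pick_mapping_shift(PUD_SHIFT));
        printf("vma allows PMD -> use shift %d\n", pick_mapping_shift(PMD_SHIFT));
        return 0;
}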


Re: [PATCH v4 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-04-21 Thread Keqian Zhu
Hi Gavin,

On 2021/4/21 14:38, Gavin Shan wrote:
> Hi Keqian,
> 
> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>> The MMIO regions may be unmapped for many reasons and can be remapped
>> by stage2 fault path. Map MMIO regions at creation time becomes a
>> minor optimization and makes these two mapping path hard to sync.
>>
>> Remove the mapping code while keep the useful sanity check.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>   arch/arm64/kvm/mmu.c | 38 +++---
>>   1 file changed, 3 insertions(+), 35 deletions(-)
>>
> 
> After removing the logic to create stage2 mapping for VM_PFNMAP region,
> I think the "do { } while" loop becomes unnecessary and can be dropped
> completely. It means the only sanity check is to see if the memory slot
> overflows IPA space or not. In that case, KVM_MR_FLAGS_ONLY can be
> ignored because the memory slot's base address and length aren't changed
> when we have KVM_MR_FLAGS_ONLY.
Maybe not exactly. Here we do an important sanity check: we must not allow
dirty logging for memslots with VM_PFNMAP.


> 
> It seems the patch isn't based on "next" branch because find_vma() was
> replaced by find_vma_intersection() by one of my patches :)
Yep, I remember it. I will replace it at the next merge window...

Thanks,
Keqian

> 
> Thanks,
> Gavin
> 
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 8711894db8c2..c59af5ca01b0 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>   {
>>   hva_t hva = mem->userspace_addr;
>>   hva_t reg_end = hva + mem->memory_size;
>> -bool writable = !(mem->flags & KVM_MEM_READONLY);
>>   int ret = 0;
>> if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
>> @@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>   mmap_read_lock(current->mm);
>>   /*
>>* A memory region could potentially cover multiple VMAs, and any holes
>> - * between them, so iterate over all of them to find out if we can map
>> - * any of them right now.
>> + * between them, so iterate over all of them.
>>*
>>* ++
>>* +---++   ++
>> @@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>*/
>>   do {
>>   struct vm_area_struct *vma = find_vma(current->mm, hva);
>> -hva_t vm_start, vm_end;
>> if (!vma || vma->vm_start >= reg_end)
>>   break;
>>   -/*
>> - * Take the intersection of this VMA with the memory region
>> - */
>> -vm_start = max(hva, vma->vm_start);
>> -vm_end = min(reg_end, vma->vm_end);
>> -
>>   if (vma->vm_flags & VM_PFNMAP) {
>> -gpa_t gpa = mem->guest_phys_addr +
>> -(vm_start - mem->userspace_addr);
>> -phys_addr_t pa;
>> -
>> -pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
>> -pa += vm_start - vma->vm_start;
>> -
>>   /* IO region dirty page logging not allowed */
>>   if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
>>   ret = -EINVAL;
>> -goto out;
>> -}
>> -
>> -ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
>> -vm_end - vm_start,
>> -writable);
>> -if (ret)
>>   break;
>> +}
>>   }
>> -hva = vm_end;
>> +hva = min(reg_end, vma->vm_end);
>>   } while (hva < reg_end);
>>   -if (change == KVM_MR_FLAGS_ONLY)
>> -goto out;
>> -
>> -spin_lock(&kvm->mmu_lock);
>> -if (ret)
>> -unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
>> -else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
>> -stage2_flush_memslot(kvm, memslot);
>> -spin_unlock(&kvm->mmu_lock);
>> -out:
>>   mmap_read_unlock(current->mm);
>>   return ret;
>>   }
>>
> 
> .
> 


Re: [PATCH] KVM: arm64: Correctly handle the mmio faulting

2021-04-21 Thread Keqian Zhu
Hi Gavin,

On 2021/4/21 14:20, Gavin Shan wrote:
> Hi Keqian and Santosh,
> 
> On 4/21/21 12:59 PM, Keqian Zhu wrote:
>> On 2020/10/22 0:16, Santosh Shukla wrote:
>>> The Commit:6d674e28 introduces a notion to detect and handle the
>>> device mapping. The commit checks for the VM_PFNMAP flag is set
>>> in vma->flags and if set then marks force_pte to true such that
>>> if force_pte is true then ignore the THP function check
>>> (/transparent_hugepage_adjust()).
>>>
>>> There could be an issue with the VM_PFNMAP flag setting and checking.
>>> For example consider a case where the mdev vendor driver register's
>>> the vma_fault handler named vma_mmio_fault(), which maps the
>>> host MMIO region in-turn calls remap_pfn_range() and maps
>>> the MMIO's vma space. Where, remap_pfn_range implicitly sets
>>> the VM_PFNMAP flag into vma->flags.
>> Could you give the name of the mdev vendor driver that triggers this issue?
>> I failed to find one according to your description. Thanks.
>>
> 
> I think it would be fixed in driver side to set VM_PFNMAP in
> its mmap() callback (call_mmap()), like vfio PCI driver does.
> It means it won't be delayed until page fault is issued and
> remap_pfn_range() is called. It's determined from the beginning
> that the vma associated the mdev vendor driver is serving as
> PFN remapping purpose. So the vma should be populated completely,
> including the VM_PFNMAP flag before it becomes visible to user
> space.
> 
> The example can be found from vfio driver in drivers/vfio/pci/vfio_pci.c:
> vfio_pci_mmap:   VM_PFNMAP is set for the vma
> vfio_pci_mmap_fault: remap_pfn_range() is called
Right. I have discussed the above with Marc. I want to find the driver so it
can be fixed. However, AFAICS, there is no driver that matches the description...

Thanks,
Keqian

> 
> Thanks,
> Gavin
> 
>>>
>>> Now lets assume a mmio fault handing flow where guest first access
>>> the MMIO region whose 2nd stage translation is not present.
>>> So that results to arm64-kvm hypervisor executing guest abort handler,
>>> like below:
>>>
>>> kvm_handle_guest_abort() -->
>>>   user_mem_abort()--> {
>>>
>>>  ...
>>>  0. checks the vma->flags for the VM_PFNMAP.
>>>  1. Since VM_PFNMAP flag is not yet set so force_pte _is_ false;
>>>  2. gfn_to_pfn_prot() -->
>>>  __gfn_to_pfn_memslot() -->
>>>  fixup_user_fault() -->
>>>  handle_mm_fault()-->
>>>  __do_fault() -->
>>> vma_mmio_fault() --> // vendor's mdev fault handler
>>>  remap_pfn_range()--> // Here sets the VM_PFNMAP
>>> flag into vma->flags.
>>>  3. Now that force_pte is set to false in step-2),
>>> will execute transparent_hugepage_adjust() func and
>>> that lead to Oops [4].
>>>   }
>>>
>>> The proposition is to check is_iomap flag before executing the THP
>>> function transparent_hugepage_adjust().
>>>
>>> [4] THP Oops:
>>>> pc: kvm_is_transparent_hugepage+0x18/0xb0
>>>> ...
>>>> ...
>>>> user_mem_abort+0x340/0x9b8
>>>> kvm_handle_guest_abort+0x248/0x468
>>>> handle_exit+0x150/0x1b0
>>>> kvm_arch_vcpu_ioctl_run+0x4d4/0x778
>>>> kvm_vcpu_ioctl+0x3c0/0x858
>>>> ksys_ioctl+0x84/0xb8
>>>> __arm64_sys_ioctl+0x28/0x38
>>>
>>> Tested on Huawei Kunpeng Taishan-200 arm64 server, Using VFIO-mdev device.
>>> Linux tip: 583090b1
>>>
>>> Fixes: 6d674e28 ("KVM: arm/arm64: Properly handle faulting of device 
>>> mappings")
>>> Signed-off-by: Santosh Shukla 
>>> ---
>>>   arch/arm64/kvm/mmu.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 3d26b47..ff15357 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>>> phys_addr_t fault_ipa,
>>>* If we are not forced to use page mapping, check if we are
>>>* backed by a THP and thus use block mapping if possible.
>>>*/
>>> -if (vma_pagesize == PAGE_SIZE && !force_pte)
>>> +if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
>>>   vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>>  , _ipa);
>>>  &pfn, &fault_ipa);
>>>
>>
> 
> .
> 
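
As a schematic illustration of the driver-side approach discussed above (set
VM_PFNMAP in the mmap() callback as vfio-pci does, and only insert PFNs from the
fault handler), a hypothetical driver fragment might look like this. It is not
code from any in-tree driver and it elides locking, error handling and device
specifics:

#include <linux/fs.h>
#include <linux/mm.h>

/* Map the whole BAR lazily on the first CPU fault, vfio-pci style. */
static vm_fault_t demo_vma_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        /* In a real driver the pfn would come from the device BAR; reusing
         * vm_pgoff here just keeps the sketch short. */
        unsigned long pfn = vma->vm_pgoff;

        if (remap_pfn_range(vma, vma->vm_start, pfn,
                            vma->vm_end - vma->vm_start, vma->vm_page_prot))
                return VM_FAULT_SIGBUS;
        return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct demo_vm_ops = {
        .fault = demo_vma_fault,
};

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
        /*
         * Setting the flags here, before the vma becomes visible to
         * userspace, is what makes the VM_PFNMAP checks in
         * kvm_arch_prepare_memory_region()/user_mem_abort() reliable even
         * before any fault has happened. (Pre-6.3 style; newer kernels
         * would use vm_flags_set().)
         */
        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        vma->vm_ops = &demo_vm_ops;
        return 0;
}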


Re: [PATCH] KVM: arm64: Correctly handle the mmio faulting

2021-04-20 Thread Keqian Zhu
Hi Santosh,

On 2020/10/22 0:16, Santosh Shukla wrote:
> The Commit:6d674e28 introduces a notion to detect and handle the
> device mapping. The commit checks for the VM_PFNMAP flag is set
> in vma->flags and if set then marks force_pte to true such that
> if force_pte is true then ignore the THP function check
> (/transparent_hugepage_adjust()).
> 
> There could be an issue with the VM_PFNMAP flag setting and checking.
> For example consider a case where the mdev vendor driver register's
> the vma_fault handler named vma_mmio_fault(), which maps the
> host MMIO region in-turn calls remap_pfn_range() and maps
> the MMIO's vma space. Where, remap_pfn_range implicitly sets
> the VM_PFNMAP flag into vma->flags.
Could you give the name of the mdev vendor driver that triggers this issue?
I failed to find one according to your description. Thanks.


BRs,
Keqian


> 
> Now lets assume a mmio fault handing flow where guest first access
> the MMIO region whose 2nd stage translation is not present.
> So that results to arm64-kvm hypervisor executing guest abort handler,
> like below:
> 
> kvm_handle_guest_abort() -->
>  user_mem_abort()--> {
> 
> ...
> 0. checks the vma->flags for the VM_PFNMAP.
> 1. Since VM_PFNMAP flag is not yet set so force_pte _is_ false;
> 2. gfn_to_pfn_prot() -->
> __gfn_to_pfn_memslot() -->
> fixup_user_fault() -->
> handle_mm_fault()-->
> __do_fault() -->
>vma_mmio_fault() --> // vendor's mdev fault handler
> remap_pfn_range()--> // Here sets the VM_PFNMAP
>   flag into vma->flags.
> 3. Now that force_pte is set to false in step-2),
>will execute transparent_hugepage_adjust() func and
>that lead to Oops [4].
>  }
> 
> The proposition is to check is_iomap flag before executing the THP
> function transparent_hugepage_adjust().
> 
> [4] THP Oops:
>> pc: kvm_is_transparent_hugepage+0x18/0xb0
>> ...
>> ...
>> user_mem_abort+0x340/0x9b8
>> kvm_handle_guest_abort+0x248/0x468
>> handle_exit+0x150/0x1b0
>> kvm_arch_vcpu_ioctl_run+0x4d4/0x778
>> kvm_vcpu_ioctl+0x3c0/0x858
>> ksys_ioctl+0x84/0xb8
>> __arm64_sys_ioctl+0x28/0x38
> 
> Tested on Huawei Kunpeng Taishan-200 arm64 server, Using VFIO-mdev device.
> Linux tip: 583090b1
> 
> Fixes: 6d674e28 ("KVM: arm/arm64: Properly handle faulting of device 
> mappings")
> Signed-off-by: Santosh Shukla 
> ---
>  arch/arm64/kvm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3d26b47..ff15357 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>* If we are not forced to use page mapping, check if we are
>* backed by a THP and thus use block mapping if possible.
>*/
> - if (vma_pagesize == PAGE_SIZE && !force_pte)
> + if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
>   vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>  , _ipa);
>  &pfn, &fault_ipa);
> 


Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-16 Thread Keqian Zhu
Hi Marc,

On 2021/4/16 22:44, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 15:08:09 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/15 22:03, Keqian Zhu wrote:
>>> The MMIO region of a device maybe huge (GB level), try to use
>>> block mapping in stage2 to speedup both map and unmap.
>>>
>>> Compared to normal memory mapping, we should consider two more
>>> points when try block mapping for MMIO region:
>>>
>>> 1. For normal memory mapping, the PA(host physical address) and
>>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>>> the HVA to request hugepage, so we don't need to consider PA
>>> alignment when verifing block mapping. But for device memory
>>> mapping, the PA and HVA may have different alignment.
>>>
>>> 2. For normal memory mapping, we are sure hugepage size properly
>>> fit into vma, so we don't check whether the mapping size exceeds
>>> the boundary of vma. But for device memory mapping, we should pay
>>> attention to this.
>>>
>>> This adds get_vma_page_shift() to get page shift for both normal
>>> memory and device MMIO region, and check these two points when
>>> selecting block mapping size for MMIO region.
>>>
>>> Signed-off-by: Keqian Zhu 
>>> ---
>>>  arch/arm64/kvm/mmu.c | 61 
>>>  1 file changed, 51 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index c59af5ca01b0..5a1cc7751e6d 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>>> *memslot,
>>> return PAGE_SIZE;
>>>  }
>>>  
[...]

>>> +   /*
>>> +* logging_active is guaranteed to never be true for VM_PFNMAP
>>> +* memslots.
>>> +*/
>>> +   if (logging_active) {
>>> force_pte = true;
>>> vma_shift = PAGE_SHIFT;
>>> +   } else {
>>> +   vma_shift = get_vma_page_shift(vma, hva);
>>> }
>> I use a if/else manner in v4, please check that. Thanks very much!
> 
> That's fine. However, it is getting a bit late for 5.13, and we don't
> have much time to left it simmer in -next. I'll probably wait until
> after the merge window to pick it up.
OK, no problem. Thanks! :)

BRs,
Keqian


Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 22:03, Keqian Zhu wrote:
> The MMIO region of a device maybe huge (GB level), try to use
> block mapping in stage2 to speedup both map and unmap.
> 
> Compared to normal memory mapping, we should consider two more
> points when try block mapping for MMIO region:
> 
> 1. For normal memory mapping, the PA(host physical address) and
> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request hugepage, so we don't need to consider PA
> alignment when verifing block mapping. But for device memory
> mapping, the PA and HVA may have different alignment.
> 
> 2. For normal memory mapping, we are sure hugepage size properly
> fit into vma, so we don't check whether the mapping size exceeds
> the boundary of vma. But for device memory mapping, we should pay
> attention to this.
> 
> This adds get_vma_page_shift() to get page shift for both normal
> memory and device MMIO region, and check these two points when
> selecting block mapping size for MMIO region.
> 
> Signed-off-by: Keqian Zhu 
> ---
>  arch/arm64/kvm/mmu.c | 61 
>  1 file changed, 51 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..5a1cc7751e6d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
> *memslot,
>   return PAGE_SIZE;
>  }
>  
> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
> +{
> + unsigned long pa;
> +
> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
> + return huge_page_shift(hstate_vma(vma));
> +
> + if (!(vma->vm_flags & VM_PFNMAP))
> + return PAGE_SHIFT;
> +
> + VM_BUG_ON(is_vm_hugetlb_page(vma));
> +
> + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> + return PUD_SHIFT;
> +#endif
> +
> + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> + return PMD_SHIFT;
> +
> + return PAGE_SHIFT;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - /* Let's check if we will get back a huge page backed by hugetlbfs */
> + /*
> +  * Let's check if we will get back a huge page backed by hugetlbfs, or
> +  * get block mapping for device MMIO region.
> +  */
>   mmap_read_lock(current->mm);
>   vma = find_vma_intersection(current->mm, hva, hva + 1);
>   if (unlikely(!vma)) {
> @@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - if (is_vm_hugetlb_page(vma))
> - vma_shift = huge_page_shift(hstate_vma(vma));
> - else
> - vma_shift = PAGE_SHIFT;
> -
> - if (logging_active ||
> - (vma->vm_flags & VM_PFNMAP)) {
> + /*
> +  * logging_active is guaranteed to never be true for VM_PFNMAP
> +  * memslots.
> +  */
> + if (logging_active) {
>   force_pte = true;
>   vma_shift = PAGE_SHIFT;
> + } else {
> + vma_shift = get_vma_page_shift(vma, hva);
>   }
I used an if/else structure in v4; please check that. Thanks very much!


BRs,
Keqian

>  
>   switch (vma_shift) {
> @@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>  
>   if (kvm_is_device_pfn(pfn)) {
> + /*
> +  * If the page was identified as device early by looking at
> +  * the VMA flags, vma_pagesize is already representing the
> +  * largest quantity we can map.  If instead it was mapped
> +  * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
> +  * and must not be upgraded.
> +  *
> +  * In both cases, we don't let transpare

[PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level), so try to use
block mapping in stage2 to speed up both map and unmap.

Compared to normal memory mapping, we should consider two more
points when trying block mapping for an MMIO region:

1. For normal memory mapping, the PA (host physical address) and
HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
the HVA to request a hugepage, so we don't need to consider PA
alignment when verifying block mapping. But for device memory
mapping, the PA and HVA may have different alignment.

2. For normal memory mapping, we are sure the hugepage size properly
fits into the vma, so we don't check whether the mapping size exceeds
the boundary of the vma. But for device memory mapping, we should pay
attention to this.

This adds get_vma_page_shift() to get the page shift for both normal
memory and device MMIO regions, and checks these two points when
selecting the block mapping size for an MMIO region.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 61 
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c59af5ca01b0..5a1cc7751e6d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
+{
+   unsigned long pa;
+
+   if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
+   return huge_page_shift(hstate_vma(vma));
+
+   if (!(vma->vm_flags & VM_PFNMAP))
+   return PAGE_SHIFT;
+
+   VM_BUG_ON(is_vm_hugetlb_page(vma));
+
+   pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
+
+#ifndef __PAGETABLE_PMD_FOLDED
+   if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PUD_SIZE) <= vma->vm_end)
+   return PUD_SHIFT;
+#endif
+
+   if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PMD_SIZE) <= vma->vm_end)
+   return PMD_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   /* Let's check if we will get back a huge page backed by hugetlbfs */
+   /*
+* Let's check if we will get back a huge page backed by hugetlbfs, or
+* get block mapping for device MMIO region.
+*/
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
@@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   if (is_vm_hugetlb_page(vma))
-   vma_shift = huge_page_shift(hstate_vma(vma));
-   else
-   vma_shift = PAGE_SHIFT;
-
-   if (logging_active ||
-   (vma->vm_flags & VM_PFNMAP)) {
+   /*
+* logging_active is guaranteed to never be true for VM_PFNMAP
+* memslots.
+*/
+   if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
+   } else {
+   vma_shift = get_vma_page_shift(vma, hva);
}
 
switch (vma_shift) {
@@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
 
if (kvm_is_device_pfn(pfn)) {
+   /*
+* If the page was identified as device early by looking at
+* the VMA flags, vma_pagesize is already representing the
+* largest quantity we can map.  If instead it was mapped
+* via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
+* and must not be upgraded.
+*
+* In both cases, we don't let transparent_hugepage_adjust()
+* change things at the last minute.
+*/
device = true;
-   force_pte = true;
} else if (logging_active && !write_fault) {
/*
 * Only actually map the page as writable if this was a write
@@ -876,7 +917,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 * If we are not forced to use page mapping, check if we are
 * backed by a THP and thus use block mapping if possible.
 */
-   if (vma_pag

[PATCH v4 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi,

We have two paths to build stage2 mappings for MMIO regions:
the creation-time path and the stage2 fault path.

Patch#1 removes the creation time's mapping of MMIO regions
Patch#2 tries stage2 block mapping for host device MMIO at fault path

Changelog:

v4:
 - use get_vma_page_shift() to handle all cases. (Marc)
 - get rid of force_pte for device mapping. (Marc)

v3:
 - Do not need to check memslot boundary in device_rough_page_shift(). (Marc)

Thanks,
Keqian


Keqian Zhu (2):
  kvm/arm64: Remove the creation time's mapping of MMIO regions
  kvm/arm64: Try stage2 block mapping for host device MMIO

 arch/arm64/kvm/mmu.c | 99 
 1 file changed, 54 insertions(+), 45 deletions(-)

-- 
2.19.1



[PATCH v4 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-04-15 Thread Keqian Zhu
The MMIO regions may be unmapped for many reasons and can be remapped
by the stage2 fault path. Mapping MMIO regions at creation time has
become a minor optimization and makes these two mapping paths hard to
keep in sync.

Remove the mapping code while keeping the useful sanity check.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..c59af5ca01b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
-   bool writable = !(mem->flags & KVM_MEM_READONLY);
int ret = 0;
 
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
-* between them, so iterate over all of them to find out if we can map
-* any of them right now.
+* between them, so iterate over all of them.
 *
 * ++
 * +---++   ++
@@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 */
do {
struct vm_area_struct *vma = find_vma(current->mm, hva);
-   hva_t vm_start, vm_end;
 
if (!vma || vma->vm_start >= reg_end)
break;
 
-   /*
-* Take the intersection of this VMA with the memory region
-*/
-   vm_start = max(hva, vma->vm_start);
-   vm_end = min(reg_end, vma->vm_end);
-
if (vma->vm_flags & VM_PFNMAP) {
-   gpa_t gpa = mem->guest_phys_addr +
-   (vm_start - mem->userspace_addr);
-   phys_addr_t pa;
-
-   pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-   pa += vm_start - vma->vm_start;
-
/* IO region dirty page logging not allowed */
if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
-   goto out;
-   }
-
-   ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-   vm_end - vm_start,
-   writable);
-   if (ret)
break;
+   }
}
-   hva = vm_end;
+   hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
 
-   if (change == KVM_MR_FLAGS_ONLY)
-   goto out;
-
-   spin_lock(&kvm->mmu_lock);
-   if (ret)
-   unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
-   else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-   stage2_flush_memslot(kvm, memslot);
-   spin_unlock(&kvm->mmu_lock);
-out:
mmap_read_unlock(current->mm);
return ret;
 }
-- 
2.19.1



Re: [PATCH 1/5] KVM: arm64: Divorce the perf code from oprofile helpers

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 18:42, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 07:59:26 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/14 21:44, Marc Zyngier wrote:
>>> KVM/arm64 is the sole user of perf_num_counters(), and really
>>> could do without it. Stop using the obsolete API by relying on
>>> the existing probing code.
>>>
>>> Signed-off-by: Marc Zyngier 
>>> ---
>>>  arch/arm64/kvm/perf.c | 7 +--
>>>  arch/arm64/kvm/pmu-emul.c | 2 +-
>>>  include/kvm/arm_pmu.h | 4 
>>>  3 files changed, 6 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
>>> index 739164324afe..b8b398670ef2 100644
>>> --- a/arch/arm64/kvm/perf.c
>>> +++ b/arch/arm64/kvm/perf.c
>>> @@ -50,12 +50,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
>>>  
>>>  int kvm_perf_init(void)
>>>  {
>>> -   /*
>>> -* Check if HW_PERF_EVENTS are supported by checking the number of
>>> -* hardware performance counters. This could ensure the presence of
>>> -* a physical PMU and CONFIG_PERF_EVENT is selected.
>>> -*/
>>> -   if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0)
>>> +   if (kvm_pmu_probe_pmuver() != 0xf)
>> The probe() function may be called many times
>> (kvm_arm_pmu_v3_set_attr also calls it).  I don't know whether the
>> first calling is enough. If so, can we use a static variable in it,
>> so the following calling can return the result right away?
> 
> No, because that wouldn't help with crappy big-little implementations
> that could have PMUs with different versions. We want to find the
> version at the point where the virtual PMU is created, which is why we
> call the probe function once per vcpu.
I see.

But AFAICS the pmuver is placed in kvm->arch, and the probe function is called
once per VM. Maybe I'm missing something.

> 
> This of course is broken in other ways (BL+KVM is a total disaster
> when it comes to PMU), but making this static would just make it
> worse.
OK.

Thanks,
Keqian


Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 18:23, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 03:20:52 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/14 17:05, Marc Zyngier wrote:
>>> + Santosh, who found some interesting bugs in that area before.
>>>
>>> On Wed, 14 Apr 2021 07:51:09 +0100,
>>> Keqian Zhu  wrote:
>>>>
>>>> The MMIO region of a device maybe huge (GB level), try to use
>>>> block mapping in stage2 to speedup both map and unmap.
>>>>
>>>> Compared to normal memory mapping, we should consider two more
>>>> points when try block mapping for MMIO region:
>>>>
>>>> 1. For normal memory mapping, the PA(host physical address) and
>>>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>>>> the HVA to request hugepage, so we don't need to consider PA
>>>> alignment when verifing block mapping. But for device memory
>>>> mapping, the PA and HVA may have different alignment.
>>>>
>>>> 2. For normal memory mapping, we are sure hugepage size properly
>>>> fit into vma, so we don't check whether the mapping size exceeds
>>>> the boundary of vma. But for device memory mapping, we should pay
>>>> attention to this.
>>>>
>>>> This adds device_rough_page_shift() to check these two points when
>>>> selecting block mapping size.
>>>>
>>>> Signed-off-by: Keqian Zhu 
>>>> ---
>>>>  arch/arm64/kvm/mmu.c | 37 +
>>>>  1 file changed, 33 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index c59af5ca01b0..1a6d96169d60 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
>>>> address, short lsb)
>>>>send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>>>  }
>>>>  
>>>> +/*
>>>> + * Find a max mapping size that properly insides the vma. And hva and pa 
>>>> must
>>>> + * have the same alignment to this mapping size. It's rough as there are 
>>>> still
>>>> + * other restrictions, will be checked by 
>>>> fault_supports_stage2_huge_mapping().
>>>> + */
>>>> +static short device_rough_page_shift(struct vm_area_struct *vma,
>>>> +   unsigned long hva)
>>>
>>> My earlier question still stands. Under which circumstances would this
>>> function return something that is *not* the final mapping size? I
>>> really don't see a reason why this would not return the final mapping
>>> size.
>>
>> IIUC, all the restrictions are about alignment and area boundary.
>>
>> That's to say, HVA, IPA and PA must have same alignment within the
>> mapping size.  And the areas are memslot and vma, which means the
>> mapping size must properly fit into the memslot and vma.
>>
>> In this function, we just checked the alignment of HVA and PA, and
>> the boundary of vma.  So we still need to check the alignment of HVA
>> and IPA, and the boundary of memslot.  These will be checked by
>> fault_supports_stage2_huge_mapping().
> 
> But that's no different from what we do with normal memory, is it? So
> it really feels like we should have *one* function that deals with
> establishing the basic mapping size from the VMA (see below for what I
> have in mind).
Right. And it looks better.

> 
>>
>>>
>>>> +{
>>>> +  phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>>>> +
>>>> +#ifndef __PAGETABLE_PMD_FOLDED
>>>> +  if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>>>> +  ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
>>>> +  ALIGN(hva, PUD_SIZE) <= vma->vm_end)
>>>> +  return PUD_SHIFT;
>>>> +#endif
>>>> +
>>>> +  if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>>>> +  ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
>>>> +  ALIGN(hva, PMD_SIZE) <= vma->vm_end)
>>>> +  return PMD_SHIFT;
>>>> +
>>>> +  return PAGE_SHIFT;
>>>> +}
>>>> +
>>>>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_

Re: [PATCH 1/5] KVM: arm64: Divorce the perf code from oprofile helpers

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/14 21:44, Marc Zyngier wrote:
> KVM/arm64 is the sole user of perf_num_counters(), and really
> could do without it. Stop using the obsolete API by relying on
> the existing probing code.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/perf.c | 7 +--
>  arch/arm64/kvm/pmu-emul.c | 2 +-
>  include/kvm/arm_pmu.h | 4 
>  3 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
> index 739164324afe..b8b398670ef2 100644
> --- a/arch/arm64/kvm/perf.c
> +++ b/arch/arm64/kvm/perf.c
> @@ -50,12 +50,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
>  
>  int kvm_perf_init(void)
>  {
> - /*
> -  * Check if HW_PERF_EVENTS are supported by checking the number of
> -  * hardware performance counters. This could ensure the presence of
> -  * a physical PMU and CONFIG_PERF_EVENT is selected.
> -  */
> - if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0)
> + if (kvm_pmu_probe_pmuver() != 0xf)
The probe() function may be called many times (kvm_arm_pmu_v3_set_attr() also
calls it).
I'm not sure whether probing only once is enough. If it is, can we use a static
variable in it, so that subsequent calls can return the cached result right away?
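
As a rough sketch of what I mean (purely illustrative; the caching scheme is an
assumption and the real probing body is elided):

static u8 kvm_pmu_probe_pmuver(void)
{
	static u8 pmuver = 0xf;		/* 0xf: IMP DEF, treated as "no PMU" */
	static bool probed;

	if (probed)
		return pmuver;
	probed = true;

	/* ... existing probe via a temporary perf event, updating pmuver ... */

	return pmuver;
}

Callers such as kvm_arm_pmu_v3_set_attr() would then get the cached result
without re-creating a perf event each time.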

Thanks,
Keqian


Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-14 Thread Keqian Zhu
Hi Marc,

On 2021/4/14 17:05, Marc Zyngier wrote:
> + Santosh, who found some interesting bugs in that area before.
> 
> On Wed, 14 Apr 2021 07:51:09 +0100,
> Keqian Zhu  wrote:
>>
>> The MMIO region of a device maybe huge (GB level), try to use
>> block mapping in stage2 to speedup both map and unmap.
>>
>> Compared to normal memory mapping, we should consider two more
>> points when try block mapping for MMIO region:
>>
>> 1. For normal memory mapping, the PA(host physical address) and
>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>> the HVA to request hugepage, so we don't need to consider PA
>> alignment when verifing block mapping. But for device memory
>> mapping, the PA and HVA may have different alignment.
>>
>> 2. For normal memory mapping, we are sure hugepage size properly
>> fit into vma, so we don't check whether the mapping size exceeds
>> the boundary of vma. But for device memory mapping, we should pay
>> attention to this.
>>
>> This adds device_rough_page_shift() to check these two points when
>> selecting block mapping size.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  arch/arm64/kvm/mmu.c | 37 +
>>  1 file changed, 33 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c59af5ca01b0..1a6d96169d60 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
>> address, short lsb)
>>  send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>  }
>>  
>> +/*
>> + * Find a max mapping size that properly insides the vma. And hva and pa 
>> must
>> + * have the same alignment to this mapping size. It's rough as there are 
>> still
>> + * other restrictions, will be checked by 
>> fault_supports_stage2_huge_mapping().
>> + */
>> +static short device_rough_page_shift(struct vm_area_struct *vma,
>> + unsigned long hva)
> 
> My earlier question still stands. Under which circumstances would this
> function return something that is *not* the final mapping size? I
> really don't see a reason why this would not return the final mapping
> size.

IIUC, all the restrictions are about alignment and area boundaries.

That is to say, the HVA, IPA and PA must have the same alignment within the
mapping size, and the mapping size must fit properly into both the memslot and
the vma.

In this function we only check the alignment of the HVA and PA, and the boundary
of the vma. The alignment of the HVA and IPA, and the boundary of the memslot,
still need to be checked; that is done by fault_supports_stage2_huge_mapping().

> 
>> +{
>> +phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>> +
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
>> +ALIGN(hva, PUD_SIZE) <= vma->vm_end)
>> +return PUD_SHIFT;
>> +#endif
>> +
>> +if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
>> +ALIGN(hva, PMD_SIZE) <= vma->vm_end)
>> +return PMD_SHIFT;
>> +
>> +return PAGE_SHIFT;
>> +}
>> +
>>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
>> *memslot,
>> unsigned long hva,
>> unsigned long map_size)
>> @@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>  
>> -/* Let's check if we will get back a huge page backed by hugetlbfs */
>> +/*
>> + * Let's check if we will get back a huge page backed by hugetlbfs, or
>> + * get block mapping for device MMIO region.
>> + */
>>  mmap_read_lock(current->mm);
>>  vma = find_vma_intersection(current->mm, hva, hva + 1);
>>  if (unlikely(!vma)) {
>> @@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  
>>  if (is_vm_hugetlb_page(vma))
>>  vma_shift = huge_page_shift(hstate_vma(vma));
>> +else if (vma->vm_flags & VM_PFNMAP)
>> +vma_shift = devi

Re: [RFC PATCH v2 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-14 Thread Keqian Zhu
Hi Marc,

On 2021/4/8 15:28, Keqian Zhu wrote:
> Hi Marc,
> 
> On 2021/4/7 21:18, Marc Zyngier wrote:
>> On Tue, 16 Mar 2021 13:43:38 +0000,
>> Keqian Zhu  wrote:
>>>
[...]

>>>  
>>> +/*
>>> + * Find a mapping size that properly insides the intersection of vma and
>>> + * memslot. And hva and pa have the same alignment to this mapping size.
>>> + * It's rough because there are still other restrictions, which will be
>>> + * checked by the following fault_supports_stage2_huge_mapping().
>>
>> I don't think these restrictions make complete sense to me. If this is
>> a PFNMAP VMA, we should use the biggest mapping size that covers the
>> VMA, and not more than the VMA.
> But as described by kvm_arch_prepare_memory_region(), the memslot may not 
> fully
> cover the VMA. If that's true and we just consider the boundary of the VMA, 
> our
> block mapping may beyond the boundary of memslot. Is this a problem?
Emm... sorry, I missed something. fault_supports_stage2_huge_mapping() will
check the memslot boundary, so we don't need to check it here. I have sent v3;
please review that.

BRs,
Keqian


[PATCH v3 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-04-14 Thread Keqian Zhu
The MMIO regions may be unmapped for many reasons and can be remapped
by the stage2 fault path. Mapping MMIO regions at creation time has become
a minor optimization and makes these two mapping paths hard to keep in sync.

Remove the mapping code while keeping the useful sanity check.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..c59af5ca01b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
-   bool writable = !(mem->flags & KVM_MEM_READONLY);
int ret = 0;
 
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
-* between them, so iterate over all of them to find out if we can map
-* any of them right now.
+* between them, so iterate over all of them.
 *
 * ++
 * +---++   ++
@@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 */
do {
struct vm_area_struct *vma = find_vma(current->mm, hva);
-   hva_t vm_start, vm_end;
 
if (!vma || vma->vm_start >= reg_end)
break;
 
-   /*
-* Take the intersection of this VMA with the memory region
-*/
-   vm_start = max(hva, vma->vm_start);
-   vm_end = min(reg_end, vma->vm_end);
-
if (vma->vm_flags & VM_PFNMAP) {
-   gpa_t gpa = mem->guest_phys_addr +
-   (vm_start - mem->userspace_addr);
-   phys_addr_t pa;
-
-   pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-   pa += vm_start - vma->vm_start;
-
/* IO region dirty page logging not allowed */
if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
-   goto out;
-   }
-
-   ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-   vm_end - vm_start,
-   writable);
-   if (ret)
break;
+   }
}
-   hva = vm_end;
+   hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
 
-   if (change == KVM_MR_FLAGS_ONLY)
-   goto out;
-
-   spin_lock(&kvm->mmu_lock);
-   if (ret)
-   unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
-   else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-   stage2_flush_memslot(kvm, memslot);
-   spin_unlock(&kvm->mmu_lock);
-out:
mmap_read_unlock(current->mm);
return ret;
 }
-- 
2.19.1



[PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-14 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level); try to use
block mapping in stage2 to speed up both map and unmap.

Compared to normal memory mapping, we should consider two more
points when trying block mapping for an MMIO region:

1. For normal memory mapping, the PA (host physical address) and
HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
the HVA to request a hugepage, so we don't need to consider PA
alignment when verifying block mapping. But for device memory
mapping, the PA and HVA may have different alignment.

2. For normal memory mapping, we are sure the hugepage size properly
fits into the vma, so we don't check whether the mapping size exceeds
the boundary of the vma. But for device memory mapping, we should pay
attention to this.

This adds device_rough_page_shift() to check these two points when
selecting the block mapping size.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 37 +
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c59af5ca01b0..1a6d96169d60 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
address, short lsb)
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
 }
 
+/*
+ * Find the max mapping size that properly fits inside the vma, and for which
+ * hva and pa have the same alignment. It's rough as there are still other
+ * restrictions, which will be checked by fault_supports_stage2_huge_mapping().
+ */
+static short device_rough_page_shift(struct vm_area_struct *vma,
+unsigned long hva)
+{
+   phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
+
+#ifndef __PAGETABLE_PMD_FOLDED
+   if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PUD_SIZE) <= vma->vm_end)
+   return PUD_SHIFT;
+#endif
+
+   if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PMD_SIZE) <= vma->vm_end)
+   return PMD_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
 static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
   unsigned long hva,
   unsigned long map_size)
@@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   /* Let's check if we will get back a huge page backed by hugetlbfs */
+   /*
+* Let's check if we will get back a huge page backed by hugetlbfs, or
+* get block mapping for device MMIO region.
+*/
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
@@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
if (is_vm_hugetlb_page(vma))
vma_shift = huge_page_shift(hstate_vma(vma));
+   else if (vma->vm_flags & VM_PFNMAP)
+   vma_shift = device_rough_page_shift(vma, hva);
else
vma_shift = PAGE_SHIFT;
 
-   if (logging_active ||
-   (vma->vm_flags & VM_PFNMAP)) {
+   if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
}
@@ -855,7 +884,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
if (kvm_is_device_pfn(pfn)) {
device = true;
-   force_pte = true;
+   force_pte = (vma_pagesize == PAGE_SIZE);
} else if (logging_active && !write_fault) {
/*
 * Only actually map the page as writable if this was a write
-- 
2.19.1



[PATCH v3 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-14 Thread Keqian Zhu
Hi,

We have two pathes to build stage2 mapping for MMIO regions.

Create time's path and stage2 fault path.

Patch#1 removes the creation time's mapping of MMIO regions
Patch#2 tries stage2 block mapping for host device MMIO at fault path

Changelog:

v3:
 - Do not need to check memslot boundary in device_rough_page_shift(). (Marc)

Thanks,
Keqian

Keqian Zhu (2):
  kvm/arm64: Remove the creation time's mapping of MMIO regions
  kvm/arm64: Try stage2 block mapping for host device MMIO

 arch/arm64/kvm/mmu.c | 75 +---
 1 file changed, 36 insertions(+), 39 deletions(-)

-- 
2.19.1



Re: [RFC PATCH v2 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-13 Thread Keqian Zhu
Hi Marc,

I think I have fully tested this patch. The next step is to add some restriction
on the HVA in the vfio module, so that we can build block mappings for it with a
higher probability.

Is there anything to improve? If not, could you apply it? ^_^

Thanks,
Keqian

On 2021/4/7 21:18, Marc Zyngier wrote:
> On Tue, 16 Mar 2021 13:43:38 +,
> Keqian Zhu  wrote:
>>
>> The MMIO region of a device maybe huge (GB level), try to use
>> block mapping in stage2 to speedup both map and unmap.
>>
>> Compared to normal memory mapping, we should consider two more
>> points when try block mapping for MMIO region:
>>
>> 1. For normal memory mapping, the PA(host physical address) and
>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>> the HVA to request hugepage, so we don't need to consider PA
>> alignment when verifing block mapping. But for device memory
>> mapping, the PA and HVA may have different alignment.
>>
>> 2. For normal memory mapping, we are sure hugepage size properly
>> fit into vma, so we don't check whether the mapping size exceeds
>> the boundary of vma. But for device memory mapping, we should pay
>> attention to this.
>>
>> This adds device_rough_page_shift() to check these two points when
>> selecting block mapping size.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>
>> Mainly for RFC, not fully tested. I will fully test it when the
>> code logic is well accepted.
>>
>> ---
>>  arch/arm64/kvm/mmu.c | 42 ++
>>  1 file changed, 38 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c59af5ca01b0..224aa15eb4d9 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -624,6 +624,36 @@ static void kvm_send_hwpoison_signal(unsigned long 
>> address, short lsb)
>>  send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>  }
>>  
>> +/*
>> + * Find a mapping size that properly insides the intersection of vma and
>> + * memslot. And hva and pa have the same alignment to this mapping size.
>> + * It's rough because there are still other restrictions, which will be
>> + * checked by the following fault_supports_stage2_huge_mapping().
> 
> I don't think these restrictions make complete sense to me. If this is
> a PFNMAP VMA, we should use the biggest mapping size that covers the
> VMA, and not more than the VMA.
> 
>> + */
>> +static short device_rough_page_shift(struct kvm_memory_slot *memslot,
>> + struct vm_area_struct *vma,
>> + unsigned long hva)
>> +{
>> +size_t size = memslot->npages * PAGE_SIZE;
>> +hva_t sec_start = max(memslot->userspace_addr, vma->vm_start);
>> +hva_t sec_end = min(memslot->userspace_addr + size, vma->vm_end);
>> +phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>> +
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PUD_SIZE) >= sec_start &&
>> +ALIGN(hva, PUD_SIZE) <= sec_end)
>> +return PUD_SHIFT;
>> +#endif
>> +
>> +if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PMD_SIZE) >= sec_start &&
>> +ALIGN(hva, PMD_SIZE) <= sec_end)
>> +return PMD_SHIFT;
>> +
>> +return PAGE_SHIFT;
>> +}
>> +
>>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
>> *memslot,
>> unsigned long hva,
>> unsigned long map_size)
>> @@ -769,7 +799,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>  
>> -/* Let's check if we will get back a huge page backed by hugetlbfs */
>> +/*
>> + * Let's check if we will get back a huge page backed by hugetlbfs, or
>> + * get block mapping for device MMIO region.
>> + */
>>  mmap_read_lock(current->mm);
>>  vma = find_vma_intersection(current->mm, hva, hva + 1);
>>  if (unlikely(!vma)) {
>> @@ -780,11 +813,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  
>>  if (is_vm_hugetlb_page(vma))
>>  vma_shift = huge_page_shift(hstate_vma(vma));
>> +else if (v

Re: [PATCH] KVM: arm/arm64: Fix KVM_VGIC_V3_ADDR_TYPE_REDIST read

2021-04-13 Thread Keqian Zhu


On 2021/4/12 23:00, Eric Auger wrote:
> When reading the base address of the a REDIST region
> through KVM_VGIC_V3_ADDR_TYPE_REDIST we expect the
> redistributor region list to be populated with a single
> element.
> 
> However list_first_entry() expects the list to be non empty.
Indeed, list_first_entry() always returns a non-NULL pointer. If the list
is empty, it mistakes the list head for the first element.
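
A small illustration of the difference (not from the patch itself):

/* On an empty list, head->next == head, so list_first_entry() hands back
 * the list head itself re-cast via container_of(): never NULL, but bogus. */
rdreg = list_first_entry(&vgic->rd_regions,
			 struct vgic_redist_region, list);

/* list_first_entry_or_null() checks list_empty() first and yields NULL,
 * which the "if (!rdreg)" fallback can actually catch. */
rdreg = list_first_entry_or_null(&vgic->rd_regions,
				 struct vgic_redist_region, list);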

> Instead we should use list_first_entry_or_null which effectively
> returns NULL if the list is empty.
> 
> Fixes: dbd9733ab674 ("KVM: arm/arm64: Replace the single rdist region by a 
> list")
> Cc:  # v4.18+
> Signed-off-by: Eric Auger 
> Reported-by: Gavin Shan 
> ---
>  arch/arm64/kvm/vgic/vgic-kvm-device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c 
> b/arch/arm64/kvm/vgic/vgic-kvm-device.c
> index 44419679f91a..5eaede3e3b5a 100644
> --- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
> +++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
> @@ -87,8 +87,8 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>   r = vgic_v3_set_redist_base(kvm, 0, *addr, 0);
>   goto out;
>   }
> - rdreg = list_first_entry(&vgic->rd_regions,
> -  struct vgic_redist_region, list);
> + rdreg = list_first_entry_or_null(&vgic->rd_regions,
> +  struct vgic_redist_region, list);
>   if (!rdreg)
>   addr_ptr = _value;
>   else
> 


Re: [RFC PATCH v2 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-08 Thread Keqian Zhu
Hi Marc,

On 2021/4/7 21:18, Marc Zyngier wrote:
> On Tue, 16 Mar 2021 13:43:38 +,
> Keqian Zhu  wrote:
>>
>> The MMIO region of a device maybe huge (GB level), try to use
>> block mapping in stage2 to speedup both map and unmap.
>>
>> Compared to normal memory mapping, we should consider two more
>> points when try block mapping for MMIO region:
>>
>> 1. For normal memory mapping, the PA(host physical address) and
>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>> the HVA to request hugepage, so we don't need to consider PA
>> alignment when verifing block mapping. But for device memory
>> mapping, the PA and HVA may have different alignment.
>>
>> 2. For normal memory mapping, we are sure hugepage size properly
>> fit into vma, so we don't check whether the mapping size exceeds
>> the boundary of vma. But for device memory mapping, we should pay
>> attention to this.
>>
>> This adds device_rough_page_shift() to check these two points when
>> selecting block mapping size.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>
>> Mainly for RFC, not fully tested. I will fully test it when the
>> code logic is well accepted.
>>
>> ---
>>  arch/arm64/kvm/mmu.c | 42 ++
>>  1 file changed, 38 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c59af5ca01b0..224aa15eb4d9 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -624,6 +624,36 @@ static void kvm_send_hwpoison_signal(unsigned long 
>> address, short lsb)
>>  send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>  }
>>  
>> +/*
>> + * Find a mapping size that properly insides the intersection of vma and
>> + * memslot. And hva and pa have the same alignment to this mapping size.
>> + * It's rough because there are still other restrictions, which will be
>> + * checked by the following fault_supports_stage2_huge_mapping().
> 
> I don't think these restrictions make complete sense to me. If this is
> a PFNMAP VMA, we should use the biggest mapping size that covers the
> VMA, and not more than the VMA.
But as described by kvm_arch_prepare_memory_region(), the memslot may not fully
cover the VMA. If that's true and we only consider the boundary of the VMA, our
block mapping may go beyond the boundary of the memslot. Is this a problem?

> 
>> + */
>> +static short device_rough_page_shift(struct kvm_memory_slot *memslot,
>> + struct vm_area_struct *vma,
>> + unsigned long hva)
>> +{
>> +size_t size = memslot->npages * PAGE_SIZE;
>> +hva_t sec_start = max(memslot->userspace_addr, vma->vm_start);
>> +hva_t sec_end = min(memslot->userspace_addr + size, vma->vm_end);
>> +phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>> +
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PUD_SIZE) >= sec_start &&
>> +ALIGN(hva, PUD_SIZE) <= sec_end)
>> +return PUD_SHIFT;
>> +#endif
>> +
>> +if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PMD_SIZE) >= sec_start &&
>> +ALIGN(hva, PMD_SIZE) <= sec_end)
>> +return PMD_SHIFT;
>> +
>> +return PAGE_SHIFT;
>> +}
>> +
>>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
>> *memslot,
>> unsigned long hva,
>> unsigned long map_size)
>> @@ -769,7 +799,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>  
>> -/* Let's check if we will get back a huge page backed by hugetlbfs */
>> +/*
>> + * Let's check if we will get back a huge page backed by hugetlbfs, or
>> + * get block mapping for device MMIO region.
>> + */
>>  mmap_read_lock(current->mm);
>>  vma = find_vma_intersection(current->mm, hva, hva + 1);
>>  if (unlikely(!vma)) {
>> @@ -780,11 +813,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  
>>  if (is_vm_hugetlb_page(vma))
>>  vma_shift = huge_page_shift(hstate_vma(vma));
>> +else if (vma->vm_flags & VM

Re: [RFC PATCH v2 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-03-31 Thread Keqian Zhu
Kind ping...

On 2021/3/16 21:43, Keqian Zhu wrote:
> Hi all,
> 
> We have two pathes to build stage2 mapping for MMIO regions.
> 
> Create time's path and stage2 fault path.
> 
> Patch#1 removes the creation time's mapping of MMIO regions
> Patch#2 tries stage2 block mapping for host device MMIO at fault path
> 
> Thanks,
> Keqian
> 
> Keqian Zhu (2):
>   kvm/arm64: Remove the creation time's mapping of MMIO regions
>   kvm/arm64: Try stage2 block mapping for host device MMIO
> 
>  arch/arm64/kvm/mmu.c | 80 +++-
>  1 file changed, 41 insertions(+), 39 deletions(-)
> 


Re: [PATCH 1/7] KVM: arm64: Add two page mapping counters for kvm_stat

2021-03-23 Thread Keqian Zhu
Hi Yoan,

On 2021/3/20 0:17, Yoan Picchi wrote:
> Add a counter for when a regular page is mapped, and another
> for when a huge page is mapped.
> 
> Signed-off-by: Yoan Picchi 
> ---
>  arch/arm64/include/asm/kvm_host.h | 2 ++
>  arch/arm64/kvm/guest.c| 2 ++
>  arch/arm64/kvm/mmu.c  | 7 +++
>  3 files changed, 11 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 3d10e6527..863603285 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -561,6 +561,8 @@ struct kvm_vcpu_stat {
>   u64 wfi_exit_stat;
>   u64 mmio_exit_user;
>   u64 mmio_exit_kernel;
> + u64 regular_page_mapped;
> + u64 huge_page_mapped;
>   u64 exits;
>  };
>  
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 9bbd30e62..14b15fb8f 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -38,6 +38,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>   VCPU_STAT("wfi_exit_stat", wfi_exit_stat),
>   VCPU_STAT("mmio_exit_user", mmio_exit_user),
>   VCPU_STAT("mmio_exit_kernel", mmio_exit_kernel),
> + VCPU_STAT("regular_page_mapped", regular_page_mapped),
> + VCPU_STAT("huge_page_mapped", huge_page_mapped),
>   VCPU_STAT("exits", exits),
>   VCPU_STAT("halt_poll_success_ns", halt_poll_success_ns),
>   VCPU_STAT("halt_poll_fail_ns", halt_poll_fail_ns),
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f..3996b28da 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -914,6 +914,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   mark_page_dirty(kvm, gfn);
>   }
>  
> + if (ret >= 0) {
> + if (vma_pagesize == PAGE_SIZE) {
> + vcpu->stat.regular_page_mapped++;
> + } else {
> + vcpu->stat.huge_page_mapped++;
> + }
> + }
This looks too simple, as we don't always map a new regular page or huge page
here. We may just relax permissions or change the mapping size, etc. If we mix
these operations, the stat result doesn't seem very valuable...
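
As a sketch of one possible refinement (assuming the intent is to count only
newly installed translations; fault_status and FSC_PERM are the existing names
in user_mem_abort()):

	/*
	 * Only count faults that installed a new translation; permission
	 * faults (FSC_PERM) merely relax permissions on an existing entry.
	 * This still doesn't separate remaps that only change the mapping size.
	 */
	if (!ret && fault_status != FSC_PERM) {
		if (vma_pagesize == PAGE_SIZE)
			vcpu->stat.regular_page_mapped++;
		else
			vcpu->stat.huge_page_mapped++;
	}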

Thanks,
Keqian

>  out_unlock:
>   spin_unlock(&kvm->mmu_lock);
>   kvm_set_pfn_accessed(pfn);
> 


Re: [PATCH v14 05/13] iommu/smmuv3: Implement attach/detach_pasid_table

2021-03-22 Thread Keqian Zhu
Hi Eric,

On 2021/3/19 21:15, Auger Eric wrote:
> Hi Keqian,
> 
> On 3/2/21 9:35 AM, Keqian Zhu wrote:
>> Hi Eric,
>>
>> On 2021/2/24 4:56, Eric Auger wrote:
>>> On attach_pasid_table() we program STE S1 related info set
>>> by the guest into the actual physical STEs. At minimum
>>> we need to program the context descriptor GPA and compute
>>> whether the stage1 is translated/bypassed or aborted.
>>>
>>> On detach, the stage 1 config is unset and the abort flag is
>>> unset.
>>>
>>> Signed-off-by: Eric Auger 
>>>
>> [...]
>>
>>> +
>>> +   /*
>>> +* we currently support a single CD so s1fmt and s1dss
>>> +* fields are also ignored
>>> +*/
>>> +   if (cfg->pasid_bits)
>>> +   goto out;
>>> +
>>> +   smmu_domain->s1_cfg.cdcfg.cdtab_dma = cfg->base_ptr;
>> only the "cdtab_dma" field of "cdcfg" is set, we are not able to locate a 
>> specific cd using arm_smmu_get_cd_ptr().
>>
>> Maybe we'd better use a specialized function to fill other fields of "cdcfg" 
>> or add a sanity check in arm_smmu_get_cd_ptr()
>> to prevent calling it under nested mode?
>>
>> As now we just call arm_smmu_get_cd_ptr() during finalise_s1(), no problem 
>> found. Just a suggestion ;-)
> 
> forgive me for the delay. yes I can indeed make sure that code is not
> called in nested mode. Please could you detail why you would need to
> call arm_smmu_get_cd_ptr()?
I accidentally called this function in nested mode when verifying the SMMU MPAM
feature. :)

Yes, in nested mode the context descriptor is owned by the guest, so the
hypervisor does not need to care about its content.
Maybe we'd better give arm_smmu_get_cd_ptr() an explicit comment (or a sanity
check) so that callers pay attention to this? :)
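
Something as simple as the following at the top of arm_smmu_get_cd_ptr() might
be enough (sketch only, using the field names from this series):

	/*
	 * In nested mode the CD table lives in guest memory and is owned by
	 * the guest; only its GPA is programmed into the STE, so the host
	 * must not try to walk it from here.
	 */
	if (WARN_ON(smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED))
		return NULL;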

Thanks,
Keqian

> 
> Thanks
> 
> Eric
>>
>> Thanks,
>> Keqian
>>
>>
>>> +   smmu_domain->s1_cfg.set = true;
>>> +   smmu_domain->abort = false;
>>> +   break;
>>> +   default:
>>> +   goto out;
>>> +   }
>>> +   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
>>> +   list_for_each_entry(master, &smmu_domain->devices, domain_head)
>>> +   arm_smmu_install_ste_for_dev(master);
>>> +   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>>> +   ret = 0;
>>> +out:
>>> +   mutex_unlock(&smmu_domain->init_mutex);
>>> +   return ret;
>>> +}
>>> +
>>> +static void arm_smmu_detach_pasid_table(struct iommu_domain *domain)
>>> +{
>>> +   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>>> +   struct arm_smmu_master *master;
>>> +   unsigned long flags;
>>> +
>>> +   mutex_lock(&smmu_domain->init_mutex);
>>> +
>>> +   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
>>> +   goto unlock;
>>> +
>>> +   smmu_domain->s1_cfg.set = false;
>>> +   smmu_domain->abort = false;
>>> +
>>> +   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
>>> +   list_for_each_entry(master, &smmu_domain->devices, domain_head)
>>> +   arm_smmu_install_ste_for_dev(master);
>>> +   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>>> +
>>> +unlock:
>>> +   mutex_unlock(&smmu_domain->init_mutex);
>>> +}
>>> +
>>>  static bool arm_smmu_dev_has_feature(struct device *dev,
>>>  enum iommu_dev_features feat)
>>>  {
>>> @@ -2939,6 +3026,8 @@ static struct iommu_ops arm_smmu_ops = {
>>> .of_xlate   = arm_smmu_of_xlate,
>>> .get_resv_regions   = arm_smmu_get_resv_regions,
>>> .put_resv_regions   = generic_iommu_put_resv_regions,
>>> +   .attach_pasid_table = arm_smmu_attach_pasid_table,
>>> +   .detach_pasid_table = arm_smmu_detach_pasid_table,
>>> .dev_has_feat   = arm_smmu_dev_has_feature,
>>> .dev_feat_enabled   = arm_smmu_dev_feature_enabled,
>>> .dev_enable_feat= arm_smmu_dev_enable_feature,
>>>
>>
> 
> .
> 


[RFC PATCH v2 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-03-16 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level); try to use
block mapping in stage2 to speed up both map and unmap.

Compared to normal memory mapping, we should consider two more
points when trying block mapping for an MMIO region:

1. For normal memory mapping, the PA (host physical address) and
HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
the HVA to request a hugepage, so we don't need to consider PA
alignment when verifying block mapping. But for device memory
mapping, the PA and HVA may have different alignment.

2. For normal memory mapping, we are sure the hugepage size properly
fits into the vma, so we don't check whether the mapping size exceeds
the boundary of the vma. But for device memory mapping, we should pay
attention to this.

This adds device_rough_page_shift() to check these two points when
selecting the block mapping size.

Signed-off-by: Keqian Zhu 
---

Mainly for RFC, not fully tested. I will fully test it when the
code logic is well accepted.

---
 arch/arm64/kvm/mmu.c | 42 ++
 1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c59af5ca01b0..224aa15eb4d9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -624,6 +624,36 @@ static void kvm_send_hwpoison_signal(unsigned long 
address, short lsb)
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
 }
 
+/*
+ * Find a mapping size that properly fits inside the intersection of the vma
+ * and the memslot, and for which hva and pa have the same alignment.
+ * It's rough because there are still other restrictions, which will be
+ * checked by the following fault_supports_stage2_huge_mapping().
+ */
+static short device_rough_page_shift(struct kvm_memory_slot *memslot,
+struct vm_area_struct *vma,
+unsigned long hva)
+{
+   size_t size = memslot->npages * PAGE_SIZE;
+   hva_t sec_start = max(memslot->userspace_addr, vma->vm_start);
+   hva_t sec_end = min(memslot->userspace_addr + size, vma->vm_end);
+   phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
+
+#ifndef __PAGETABLE_PMD_FOLDED
+   if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PUD_SIZE) >= sec_start &&
+   ALIGN(hva, PUD_SIZE) <= sec_end)
+   return PUD_SHIFT;
+#endif
+
+   if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PMD_SIZE) >= sec_start &&
+   ALIGN(hva, PMD_SIZE) <= sec_end)
+   return PMD_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
 static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
   unsigned long hva,
   unsigned long map_size)
@@ -769,7 +799,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   /* Let's check if we will get back a huge page backed by hugetlbfs */
+   /*
+* Let's check if we will get back a huge page backed by hugetlbfs, or
+* get block mapping for device MMIO region.
+*/
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
@@ -780,11 +813,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
if (is_vm_hugetlb_page(vma))
vma_shift = huge_page_shift(hstate_vma(vma));
+   else if (vma->vm_flags & VM_PFNMAP)
+   vma_shift = device_rough_page_shift(memslot, vma, hva);
else
vma_shift = PAGE_SHIFT;
 
-   if (logging_active ||
-   (vma->vm_flags & VM_PFNMAP)) {
+   if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
}
@@ -855,7 +889,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
if (kvm_is_device_pfn(pfn)) {
device = true;
-   force_pte = true;
+   force_pte = (vma_pagesize == PAGE_SIZE);
} else if (logging_active && !write_fault) {
/*
 * Only actually map the page as writable if this was a write
-- 
2.19.1



[RFC PATCH v2 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-03-16 Thread Keqian Zhu
The MMIO regions may be unmapped for many reasons and can be remapped
by the stage2 fault path. Mapping MMIO regions at creation time has become
a minor optimization and makes these two mapping paths hard to keep in sync.

Remove the mapping code while keeping the useful sanity check.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..c59af5ca01b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
-   bool writable = !(mem->flags & KVM_MEM_READONLY);
int ret = 0;
 
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
-* between them, so iterate over all of them to find out if we can map
-* any of them right now.
+* between them, so iterate over all of them.
 *
 * ++
 * +---++   ++
@@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 */
do {
struct vm_area_struct *vma = find_vma(current->mm, hva);
-   hva_t vm_start, vm_end;
 
if (!vma || vma->vm_start >= reg_end)
break;
 
-   /*
-* Take the intersection of this VMA with the memory region
-*/
-   vm_start = max(hva, vma->vm_start);
-   vm_end = min(reg_end, vma->vm_end);
-
if (vma->vm_flags & VM_PFNMAP) {
-   gpa_t gpa = mem->guest_phys_addr +
-   (vm_start - mem->userspace_addr);
-   phys_addr_t pa;
-
-   pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-   pa += vm_start - vma->vm_start;
-
/* IO region dirty page logging not allowed */
if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
-   goto out;
-   }
-
-   ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-   vm_end - vm_start,
-   writable);
-   if (ret)
break;
+   }
}
-   hva = vm_end;
+   hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
 
-   if (change == KVM_MR_FLAGS_ONLY)
-   goto out;
-
-   spin_lock(&kvm->mmu_lock);
-   if (ret)
-   unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
-   else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-   stage2_flush_memslot(kvm, memslot);
-   spin_unlock(&kvm->mmu_lock);
-out:
mmap_read_unlock(current->mm);
return ret;
 }
-- 
2.19.1



[RFC PATCH v2 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-03-16 Thread Keqian Zhu
Hi all,

We have two pathes to build stage2 mapping for MMIO regions.

Create time's path and stage2 fault path.

Patch#1 removes the creation time's mapping of MMIO regions
Patch#2 tries stage2 block mapping for host device MMIO at fault path

Thanks,
Keqian

Keqian Zhu (2):
  kvm/arm64: Remove the creation time's mapping of MMIO regions
  kvm/arm64: Try stage2 block mapping for host device MMIO

 arch/arm64/kvm/mmu.c | 80 +++-
 1 file changed, 41 insertions(+), 39 deletions(-)

-- 
2.19.1



Re: [PATCH 2/4] KVM: arm64: Use find_vma_intersection()

2021-03-15 Thread Keqian Zhu
Hi Gavin,

On 2021/3/16 11:52, Gavin Shan wrote:
> Hi Keqian,
> 
> On 3/15/21 8:42 PM, Gavin Shan wrote:
>> On 3/15/21 7:04 PM, Keqian Zhu wrote:
>>> On 2021/3/15 12:18, Gavin Shan wrote:
>>>> find_vma_intersection() has been existing to search the intersected
>>>> vma. This uses the function where it's applicable, to simplify the
>>>> code.
>>>>
>>>> Signed-off-by: Gavin Shan 
>>>> ---
>>>>   arch/arm64/kvm/mmu.c | 10 ++
>>>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index 84e70f953de6..286b603ed0d3 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -421,10 +421,11 @@ static void stage2_unmap_memslot(struct kvm *kvm,
>>>>* ++
>>>>*/
>>>>   do {
>>>> -struct vm_area_struct *vma = find_vma(current->mm, hva);
>>>> +struct vm_area_struct *vma;
>>>>   hva_t vm_start, vm_end;
>>>> -if (!vma || vma->vm_start >= reg_end)
>>>> +vma = find_vma_intersection(current->mm, hva, reg_end);
>>> Nit: Keep a same style may be better(Assign vma when declare it).
>>> Other looks good to me.
>>>
>>
>> Yeah, I agree. I will adjust the code in v2 and included your r-b.
>> Thanks for your time to review.
>>
> 
> After rechecking the code, I think it'd better to keep current style
> because there is a follow-on validation on @vma. Keeping them together
> seems a good idea. I think it wouldn't a big deal to you. So I will
> keep current style with your r-b in v2.
Sure, both are OK. ;-)

Thanks,
Keqian
> 
> vma = find_vma_intersection(current->mm, hva, reg_end);
> if (!vma)
>  break;
> Thanks,
> Gavin
>  
>>>> +if (!vma)
>>>>   break;
>>>>   /*
>>>> @@ -1330,10 +1331,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>>>* ++
>>>>*/
>>>>   do {
>>>> -struct vm_area_struct *vma = find_vma(current->mm, hva);
>>>> +struct vm_area_struct *vma;
>>>>   hva_t vm_start, vm_end;
>>>> -if (!vma || vma->vm_start >= reg_end)
>>>> +vma = find_vma_intersection(current->mm, hva, reg_end);
>>>> +if (!vma)
>>>>   break;
>>>>   /*
>>>>
>>>
>>
> 
> .
> 


Re: [PATCH 4/4] KVM: arm64: Don't retrieve memory slot again in page fault handler

2021-03-15 Thread Keqian Zhu
Hi Gavin,

On 2021/3/15 17:56, Gavin Shan wrote:
> Hi Keqian,
> 
> On 3/15/21 7:25 PM, Keqian Zhu wrote:
>> On 2021/3/15 12:18, Gavin Shan wrote:
>>> We needn't retrieve the memory slot again in user_mem_abort() because
>>> the corresponding memory slot has been passed from the caller. This
>> I think you are right, though fault_ipa will be adjusted when we try to use 
>> block mapping,
>> the fault_supports_stage2_huge_mapping() makes sure we're not trying to map 
>> anything
>> not covered by the memslot, so the adjusted fault_ipa still belongs to the 
>> memslot.
>>
> 
> Yeah, it's correct. Besides, the @logging_active is determined
> based on the passed memory slot. It means user_mem_abort() can't
> support memory range which spans multiple memory slot.
> 
>>> would save some CPU cycles. For example, the time used to write 1GB
>>> memory, which is backed by 2MB hugetlb pages and write-protected, is
>>> dropped by 6.8% from 928ms to 864ms.
>>>
>>> Signed-off-by: Gavin Shan 
>>> ---
>>>   arch/arm64/kvm/mmu.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index a5a8ade9fde4..4a4abcccfafb 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -846,7 +846,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>>> phys_addr_t fault_ipa,
>>>*/
>>>   smp_rmb();
>>>   -pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>>> +pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
>>> +   write_fault, &writable, NULL);
>> It's better to update the code comments at same time.
>>
> 
> I guess you need some comments here? If so, I would add something
> like below in v2:
> 
> /*
>  * gfn_to_pfn_prot() can be used either with unnecessary overhead
>  * introduced to locate the memory slot because the memory slot is
>  * always fixed even @gfn is adjusted for huge pages.
>  */
Looks good.

See the comments above "smp_rmb();". What I actually meant is that the comment's
mention of "gfn_to_pfn_prot" should be changed to "__gfn_to_pfn_memslot". :)

Thanks,
Keqian

> 
>>>   if (pfn == KVM_PFN_ERR_HWPOISON) {
>>>   kvm_send_hwpoison_signal(hva, vma_shift);
>>>   return 0;
>>> @@ -912,7 +913,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>>> phys_addr_t fault_ipa,
>>>   /* Mark the page dirty only if the fault is handled successfully */
>>>   if (writable && !ret) {
>>>   kvm_set_pfn_dirty(pfn);
>>> -mark_page_dirty(kvm, gfn);
>>> +mark_page_dirty_in_slot(kvm, memslot, gfn);
>>>   }
>>> out_unlock:
>>>
> 
> Thanks,
> Gavin
> 
> 
> .
> 


Re: [PATCH 4/4] KVM: arm64: Don't retrieve memory slot again in page fault handler

2021-03-15 Thread Keqian Zhu
Hi Gavin,

On 2021/3/15 12:18, Gavin Shan wrote:
> We needn't retrieve the memory slot again in user_mem_abort() because
> the corresponding memory slot has been passed from the caller. This
I think you are right. Although fault_ipa will be adjusted when we try to use
block mapping, fault_supports_stage2_huge_mapping() makes sure we're not trying
to map anything not covered by the memslot, so the adjusted fault_ipa still
belongs to the memslot.

> would save some CPU cycles. For example, the time used to write 1GB
> memory, which is backed by 2MB hugetlb pages and write-protected, is
> dropped by 6.8% from 928ms to 864ms.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/kvm/mmu.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index a5a8ade9fde4..4a4abcccfafb 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -846,7 +846,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>*/
>   smp_rmb();
>  
> - pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> + pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
> +write_fault, &writable, NULL);
It's better to update the code comments at the same time.

>   if (pfn == KVM_PFN_ERR_HWPOISON) {
>   kvm_send_hwpoison_signal(hva, vma_shift);
>   return 0;
> @@ -912,7 +913,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   /* Mark the page dirty only if the fault is handled successfully */
>   if (writable && !ret) {
>   kvm_set_pfn_dirty(pfn);
> - mark_page_dirty(kvm, gfn);
> + mark_page_dirty_in_slot(kvm, memslot, gfn);
>   }
>  
>  out_unlock:
> 

Thanks,
Keqian.


Re: [PATCH 2/4] KVM: arm64: Use find_vma_intersection()

2021-03-15 Thread Keqian Zhu
Hi Gavin,

On 2021/3/15 12:18, Gavin Shan wrote:
> find_vma_intersection() has been existing to search the intersected
> vma. This uses the function where it's applicable, to simplify the
> code.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/kvm/mmu.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 84e70f953de6..286b603ed0d3 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -421,10 +421,11 @@ static void stage2_unmap_memslot(struct kvm *kvm,
>* ++
>*/
>   do {
> - struct vm_area_struct *vma = find_vma(current->mm, hva);
> + struct vm_area_struct *vma;
>   hva_t vm_start, vm_end;
>  
> - if (!vma || vma->vm_start >= reg_end)
> + vma = find_vma_intersection(current->mm, hva, reg_end);
Nit: keeping the same style may be better (assign vma when declaring it).
Otherwise this looks good to me.

Thanks,
Keqian


> + if (!vma)
>   break;
>  
>   /*
> @@ -1330,10 +1331,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>* ++
>*/
>   do {
> - struct vm_area_struct *vma = find_vma(current->mm, hva);
> + struct vm_area_struct *vma;
>   hva_t vm_start, vm_end;
>  
> - if (!vma || vma->vm_start >= reg_end)
> + vma = find_vma_intersection(current->mm, hva, reg_end);
> + if (!vma)
>   break;
>  
>   /*
> 


Re: [PATCH 1/4] KVM: arm64: Hide kvm_mmu_wp_memory_region()

2021-03-15 Thread Keqian Zhu
Hi Gavin,

This function has only been used by mmu.c since its introduction in commit
c64735554c0a, so please feel free to add:

Reviewed-by: Keqian Zhu 


Thanks,
Keqian

On 2021/3/15 12:18, Gavin Shan wrote:
> We needn't expose the function as it's only used by mmu.c.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/include/asm/kvm_host.h | 1 -
>  arch/arm64/kvm/mmu.c  | 2 +-
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 3d10e6527f7d..688f2df1957b 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -632,7 +632,6 @@ void kvm_arm_resume_guest(struct kvm *kvm);
>   })
>  
>  void force_vm_exit(const cpumask_t *mask);
> -void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
>  
>  int handle_exit(struct kvm_vcpu *vcpu, int exception_index);
>  void handle_exit_early(struct kvm_vcpu *vcpu, int exception_index);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..84e70f953de6 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -555,7 +555,7 @@ static void stage2_wp_range(struct kvm_s2_mmu *mmu, 
> phys_addr_t addr, phys_addr_
>   * Acquires kvm_mmu_lock. Called with kvm->slots_lock mutex acquired,
>   * serializing operations for VM memory regions.
>   */
> -void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
> +static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
>  {
>   struct kvm_memslots *slots = kvm_memslots(kvm);
>   struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
> 


Re: [PATCH 3/4] KVM: arm64: Fix address check for memory slot

2021-03-15 Thread Keqian Zhu
Hi Gavin,

FYI, this has been fixed by Marc in commit 262b003d059c.

Thanks,
Keqian

On 2021/3/15 12:18, Gavin Shan wrote:
> The last (IPA) page can't be specified when a new memory slot is
> added. The error -EFAULT is returned when the memory slot is added
> with the following parameters for the VM, which has 40-bits IPA
> limit. The host has 4KB base page size. It's not correct because
> the last (IPA) page is still usable.
> 
>struct kvm_userspace_memory_region {
>   __u32 slot;   /* 1*/
>   __u32 flags;  /* 0*/
>   __u64 guest_phys_addr;/* 0xfff000 */
>   __u64 memory_size;/* 0x1000   */
>   __u64 userspace_addr;
>};
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/kvm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 286b603ed0d3..a5a8ade9fde4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1313,7 +1313,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>* Prevent userspace from creating a memory region outside of the IPA
>* space addressable by the KVM guest IPA space.
>*/
> - if (memslot->base_gfn + memslot->npages >=
> + if (memslot->base_gfn + memslot->npages >
>   (kvm_phys_size(kvm) >> PAGE_SHIFT))
>   return -EFAULT;
>  
> 


Re: [RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-03-12 Thread Keqian Zhu
Hi Marc,

On 2021/3/12 16:52, Marc Zyngier wrote:
> On Thu, 11 Mar 2021 14:28:17 +,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/3/11 16:43, Marc Zyngier wrote:
>>> Digging this patch back from my Inbox...
>> Yeah, thanks ;-)
>>
>>>
>>> On Fri, 22 Jan 2021 08:36:50 +,
>>> Keqian Zhu  wrote:
>>>>
>>>> The MMIO region of a device maybe huge (GB level), try to use block
>>>> mapping in stage2 to speedup both map and unmap.
[...]

>>>>break;
>>>>  
>>>> -  pa += PAGE_SIZE;
>>>> +  pa += pgsize;
>>>>}
>>>>  
>>>>kvm_mmu_free_memory_cache(&cache);
>>>
>>> There is one issue with this patch, which is that it only does half
>>> the job. A VM_PFNMAP VMA can definitely be faulted in dynamically, and
>>> in that case we force this to be a page mapping. This conflicts with
>>> what you are doing here.
>> Oh yes, these two paths should keep a same mapping logic.
>>
>> I try to search the "force_pte" and find out some discussion [1]
>> between you and Christoffer.  And I failed to get a reason about
>> forcing pte mapping for device MMIO region (expect that we want to
>> keep a same logic with the eager mapping path). So if you don't
>> object to it, I will try to implement block mapping for device MMIO
>> in user_mem_abort().
>>
>>>
>>> There is also the fact that if we can map things on demand, why are we
>>> still mapping these MMIO regions ahead of time?
>>
>> Indeed. Though this provides good *startup* performance for guest
>> accessing MMIO, it's hard to keep the two paths in sync. We can keep
>> this minor optimization or delete it to avoid hard maintenance,
>> which one do you prefer?
> 
> I think we should be able to get rid of the startup path. If we can do
> it for memory, I see no reason not to do it for MMIO.
OK, I will do.

> 
>> BTW, could you please have a look at my another patch series[2]
>> about HW/SW combined dirty log? ;)
> 
> I will eventually, but while I really appreciate your contributions in
> terms of features and bug fixes, I would really *love* it if you were
> a bit more active on the list when it comes to reviewing other
> people's code.
> 
> There is no shortage of patches that really need reviewing, and just
> pointing me in the direction of your favourite series doesn't really
> help. I have something like 200+ patches that need careful reviewing
> in my inbox, and they all deserve the same level of attention.
> 
> To make it short, help me to help you!
My apologies, and I can't agree more.

I have noticed this, and I have reviewed several patches from the IOMMU
community. Some patches require a lot of background knowledge, so they are hard
to review. I will dig into them in the future.

Thanks for your valuable advice. :)

Thanks,
Keqian


> 
> Thanks,
> 
>   M.
> 


Re: [RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-03-11 Thread Keqian Zhu
Hi Marc,

On 2021/3/11 16:43, Marc Zyngier wrote:
> Digging this patch back from my Inbox...
Yeah, thanks ;-)

> 
> On Fri, 22 Jan 2021 08:36:50 +,
> Keqian Zhu  wrote:
>>
>> The MMIO region of a device maybe huge (GB level), try to use block
>> mapping in stage2 to speedup both map and unmap.
>>
>> Especially for unmap, it performs TLBI right after each invalidation
>> of PTE. If all mapping is of PAGE_SIZE, it takes much time to handle
>> GB level range.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  arch/arm64/include/asm/kvm_pgtable.h | 11 +++
>>  arch/arm64/kvm/hyp/pgtable.c | 15 +++
>>  arch/arm64/kvm/mmu.c | 12 
>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
>> b/arch/arm64/include/asm/kvm_pgtable.h
>> index 52ab38db04c7..2266ac45f10c 100644
>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>> @@ -82,6 +82,17 @@ struct kvm_pgtable_walker {
>>  const enum kvm_pgtable_walk_flags   flags;
>>  };
>>  
>> +/**
>> + * kvm_supported_pgsize() - Get the max supported page size of a mapping.
>> + * @pgt:Initialised page-table structure.
>> + * @addr:   Virtual address at which to place the mapping.
>> + * @end:End virtual address of the mapping.
>> + * @phys:   Physical address of the memory to map.
>> + *
>> + * The smallest return value is PAGE_SIZE.
>> + */
>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>> phys);
>> +
>>  /**
>>   * kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
>>   * @pgt:Uninitialised page-table structure to initialise.
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index bdf8e55ed308..ab11609b9b13 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -81,6 +81,21 @@ static bool kvm_block_mapping_supported(u64 addr, u64 
>> end, u64 phys, u32 level)
>>  return IS_ALIGNED(addr, granule) && IS_ALIGNED(phys, granule);
>>  }
>>  
>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>> phys)
>> +{
>> +u32 lvl;
>> +u64 pgsize = PAGE_SIZE;
>> +
>> +for (lvl = pgt->start_level; lvl < KVM_PGTABLE_MAX_LEVELS; lvl++) {
>> +if (kvm_block_mapping_supported(addr, end, phys, lvl)) {
>> +pgsize = kvm_granule_size(lvl);
>> +break;
>> +}
>> +}
>> +
>> +return pgsize;
>> +}
>> +
>>  static u32 kvm_pgtable_idx(struct kvm_pgtable_walk_data *data, u32 level)
>>  {
>>  u64 shift = kvm_granule_shift(level);
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 7d2257cc5438..80b403fc8e64 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -499,7 +499,8 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>phys_addr_t pa, unsigned long size, bool writable)
>>  {
>> -phys_addr_t addr;
>> +phys_addr_t addr, end;
>> +unsigned long pgsize;
>>  int ret = 0;
>>  struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
>>  struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>> @@ -509,21 +510,24 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
>> guest_ipa,
>>  
>>  size += offset_in_page(guest_ipa);
>>  guest_ipa &= PAGE_MASK;
>> +end = guest_ipa + size;
>>  
>> -for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
>> +for (addr = guest_ipa; addr < end; addr += pgsize) {
>>  ret = kvm_mmu_topup_memory_cache(&cache,
>>   kvm_mmu_cache_min_pages(kvm));
>>  if (ret)
>>  break;
>>  
>> +pgsize = kvm_supported_pgsize(pgt, addr, end, pa);
>> +
>>  spin_lock(&kvm->mmu_lock);
>> -ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
>> +ret = kvm_pgtable_stage2_map(pgt, addr, pgsize, pa, prot,
>>   &cache);
>>  spin_unlock(&kvm->mmu_lock);
>>  if (ret)
>>  break;
>>  
>> -pa += PAGE_SIZE;
>> +pa += pgsize

Re: [RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-03-02 Thread Keqian Zhu
Hi Marc,

Do you have further suggestions on this? Block mapping does bring an obvious benefit.

Thanks,
Keqian

On 2021/1/25 19:25, Keqian Zhu wrote:
> Hi Marc,
> 
> On 2021/1/22 17:45, Marc Zyngier wrote:
>> On 2021-01-22 08:36, Keqian Zhu wrote:
>>> The MMIO region of a device maybe huge (GB level), try to use block
>>> mapping in stage2 to speedup both map and unmap.
>>>
>>> Especially for unmap, it performs TLBI right after each invalidation
>>> of PTE. If all mapping is of PAGE_SIZE, it takes much time to handle
>>> GB level range.
>>
>> This is only on VM teardown, right? Or do you unmap the device more ofet?
>> Can you please quantify the speedup and the conditions this occurs in?
> 
> Yes, and there are some other paths (including what your patch series handles)
> that will do the unmap action:
> 
> 1、guest reboot without S2FWB: stage2_unmap_vm(), which only unmaps guest
> regular RAM.
> 2、userspace deletes memslot: kvm_arch_flush_shadow_memslot().
> 3、rollback of device MMIO mapping: kvm_arch_prepare_memory_region().
> 4、rollback of dirty log tracking: if we enable hugepage for guest RAM, then
> after dirty log is stopped, the newly created block mappings will unmap all
> page mappings.
> 5、mmu notifier: kvm_unmap_hva_range(). AFAICS, we will use this path when VM
> teardown or guest resets pass-through devices. The bugfix[1] gives the reason
> for unmapping the MMIO region when guest resets pass-through devices.
> 
> For unmap of MMIO regions, which this patch addresses:
> point 1 does not apply.
> point 2 occurs when userspace unplugs pass-through devices.
> point 3 can occur, but rarely.
> point 4 does not apply.
> point 5 occurs on VM teardown or when the guest resets pass-through devices.
> 
> And I had a look at your patch series; it can solve:
> For VM teardown, elide CMO and perform TLBI VMALL instead of individual
> invalidations (but the current kernel does not go through this path on VM
> teardown).
> For rollback of dirty log tracking, elide CMO.
> For kvm_unmap_hva_range, if the event is MMU_NOTIFY_UNMAP, elide CMO.
> 
> (But I doubt the CMOs in unmap. As we perform CMOs in user_mem_abort when
> installing a new stage2 mapping for the VM, maybe the CMO in unmap is
> unnecessary under all conditions :-) ?)
> 
> So it shows that we are solving different parts of unmap, so they are not
> conflicting. At least this patch can still speed up map of the device MMIO
> region, and speed up unmap of the device MMIO region even if we do not need
> to perform CMO and TLBI ;-).
> 
> speedup: unmap 8GB MMIO on FPGA.
> 
>             before opt       after opt
> cost        30+ minutes      949ms
> 
> Thanks,
> Keqian
> 
>>
>> I have the feeling that we are just circling around another problem,
>> which is that we could rely on a VM-wide TLBI when tearing down the
>> guest. I worked on something like that[1] a long while ago, and parked
>> it for some reason. Maybe it is worth reviving.
>>
>> [1] 
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/elide-cmo-tlbi
>>
>>>
>>> Signed-off-by: Keqian Zhu 
>>> ---
>>>  arch/arm64/include/asm/kvm_pgtable.h | 11 +++
>>>  arch/arm64/kvm/hyp/pgtable.c | 15 +++
>>>  arch/arm64/kvm/mmu.c | 12 
>>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h
>>> b/arch/arm64/include/asm/kvm_pgtable.h
>>> index 52ab38db04c7..2266ac45f10c 100644
>>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>>> @@ -82,6 +82,17 @@ struct kvm_pgtable_walker {
>>>  const enum kvm_pgtable_walk_flagsflags;
>>>  };
>>>
>>> +/**
>>> + * kvm_supported_pgsize() - Get the max supported page size of a mapping.
>>> + * @pgt:Initialised page-table structure.
>>> + * @addr:Virtual address at which to place the mapping.
>>> + * @end:End virtual address of the mapping.
>>> + * @phys:Physical address of the memory to map.
>>> + *
>>> + * The smallest return value is PAGE_SIZE.
>>> + */
>>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>>> phys);
>>> +
>>>  /**
>>>   * kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
>>>   * @pgt:Uninitialised page-table structure to initialise.
>>> diff --git a/arch/arm64/kvm

Re: [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log

2021-03-02 Thread Keqian Zhu
Hi everyone,

Any comments are welcome :).

Thanks,
Keqian

On 2021/1/26 20:44, Keqian Zhu wrote:
> The intention:
> 
> On the arm64 platform, we track the dirty log of vCPUs through guest memory
> aborts. KVM occupies some guest vCPU time to change the stage2 mapping and
> mark pages dirty. This has a heavy side effect on the VM, especially when
> multiple vCPUs race and some of them block on the kvm mmu_lock.
> 
> DBM is a HW auxiliary approach to log dirty pages. The MMU changes a PTE to be
> writable if its DBM bit is set. Then KVM doesn't occupy vCPU time to log dirty
> pages.
> 
> About this patch series:
> 
> The biggest problem of applying DBM to stage2 is that software must scan PTs
> to collect the dirty state, which may cost much time and affect the downtime
> of migration.
> 
> This series realizes a SW/HW combined dirty log that can effectively solve
> this problem (the SMMU side can also use this approach for DMA dirty log
> tracking).
> 
> The core idea is that we do not enable hardware dirty tracking at the start
> (we do not add the DBM bit). When an arbitrary PT takes a fault, we do soft
> tracking for this PT and enable hardware tracking for its *nearby* PTs (e.g.
> add the DBM bit for the nearby 16 PTs). Then when syncing the dirty log, we
> already know all PTs with hardware dirty tracking enabled, so we do not need
> to scan all PTs.
> 
>        mem abort point                  mem abort point
>              ↓                                ↓
>     -----------------------------------------------------
>     |        |        |        |        |        |       |
>     -----------------------------------------------------
>              ↑                                ↑
>       set DBM bit of                   set DBM bit of
>    this PT section (64PTEs)        this PT section (64PTEs)
> 
> We may worry that when the dirty rate is very high we still need to scan too
> many PTs. Our main concern is the VM stop time. With Qemu dirty rate
> throttling, the dirty memory is kept close to the VM stop threshold, so there
> are only a few PTs to scan after the VM stops.
> 
> It has the advantage of hardware tracking, which minimizes the side effect on
> vCPUs, and also the advantage of software tracking, which controls the vCPU
> dirty rate. Moreover, software tracking helps us to scan PTs at some fixed
> points, which greatly reduces scanning time. And the biggest benefit is that
> we can apply this solution to DMA dirty tracking.
> 
> Test:
> 
> Host: Kunpeng 920 with 128 CPUs and 512G RAM. Transparent Hugepage disabled
>   (ensures the test result is not affected by dissolving of block page tables
>   at the early stage of migration).
> VM:   16 CPUs, 16GB RAM. Run 4 pairs of (redis_benchmark + redis_server).
> 
> Each configuration was run 5 times for the software dirty log and for the
> SW/HW combined dirty log.
> 
> Test result:
> 
> Gain 5%~7% improvement of redis QPS during VM migration.
> VM downtime is not affected fundamentally.
> About 56.7% of DBM is effectively used.
> 
> Keqian Zhu (7):
>   arm64: cpufeature: Add API to report system support of HWDBM
>   kvm: arm64: Use atomic operation when update PTE
>   kvm: arm64: Add level_apply parameter for stage2_attr_walker
>   kvm: arm64: Add some HW_DBM related pgtable interfaces
>   kvm: arm64: Add some HW_DBM related mmu interfaces
>   kvm: arm64: Only write protect selected PTE
>   kvm: arm64: Start up SW/HW combined dirty log
> 
>  arch/arm64/include/asm/cpufeature.h  |  12 +++
>  arch/arm64/include/asm/kvm_host.h|   6 ++
>  arch/arm64/include/asm/kvm_mmu.h |   7 ++
>  arch/arm64/include/asm/kvm_pgtable.h |  45 ++
>  arch/arm64/kvm/arm.c | 125 ++
>  arch/arm64/kvm/hyp/pgtable.c | 130 ++-
>  arch/arm64/kvm/mmu.c |  47 +-
>  arch/arm64/kvm/reset.c   |   8 +-
>  8 files changed, 351 insertions(+), 29 deletions(-)
> 
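
For illustration, the fault-then-scan idea described in the cover letter above
can be modelled by a standalone toy program (the 64-PTE section size follows
the diagram; everything else is a simplification, not the actual KVM code):

#include <stdio.h>
#include <stdbool.h>

#define NR_PTES      1024
#define SECTION_PTES 64              /* "nearby" PTEs that get the DBM bit */
#define NR_SECTIONS  (NR_PTES / SECTION_PTES)

static bool pte_hw_dirty[NR_PTES];   /* what the MMU would record via DBM  */
static bool sw_dirty[NR_PTES];       /* software dirty bitmap              */
static bool section_has_dbm[NR_SECTIONS];

/* Stage2 write fault on pte i: log it in software, arm HW for its section. */
static void handle_write_fault(int i)
{
	int sec = i / SECTION_PTES;

	sw_dirty[i] = true;
	section_has_dbm[sec] = true;   /* DBM set on the whole 64-PTE section */
}

/* Dirty log sync: only sections that ever got DBM need to be scanned. */
static int sync_dirty_log(void)
{
	int scanned = 0;

	for (int sec = 0; sec < NR_SECTIONS; sec++) {
		if (!section_has_dbm[sec])
			continue;
		for (int j = sec * SECTION_PTES; j < (sec + 1) * SECTION_PTES; j++) {
			scanned++;
			if (pte_hw_dirty[j])
				sw_dirty[j] = true;
		}
	}
	return scanned;
}

int main(void)
{
	handle_write_fault(5);      /* a vCPU faults on page 5          */
	pte_hw_dirty[7] = true;     /* later, HW dirties page 7 via DBM */

	printf("PTEs scanned at sync: %d, page 7 dirty: %d\n",
	       sync_dirty_log(), sw_dirty[7]);
	return 0;
}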


Re: [PATCH v14 05/13] iommu/smmuv3: Implement attach/detach_pasid_table

2021-03-02 Thread Keqian Zhu
Hi Eric,

On 2021/2/24 4:56, Eric Auger wrote:
> On attach_pasid_table() we program STE S1 related info set
> by the guest into the actual physical STEs. At minimum
> we need to program the context descriptor GPA and compute
> whether the stage1 is translated/bypassed or aborted.
> 
> On detach, the stage 1 config is unset and the abort flag is
> unset.
> 
> Signed-off-by: Eric Auger 
> 
[...]

> +
> + /*
> +  * we currently support a single CD so s1fmt and s1dss
> +  * fields are also ignored
> +  */
> + if (cfg->pasid_bits)
> + goto out;
> +
> + smmu_domain->s1_cfg.cdcfg.cdtab_dma = cfg->base_ptr;
Only the "cdtab_dma" field of "cdcfg" is set, so we are not able to locate a
specific CD using arm_smmu_get_cd_ptr().

Maybe we'd better use a specialized function to fill the other fields of
"cdcfg", or add a sanity check in arm_smmu_get_cd_ptr() to prevent calling it
under nested mode?

As we currently only call arm_smmu_get_cd_ptr() during finalise_s1(), no
problem is seen. Just a suggestion ;-)
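
For concreteness, a standalone model of the kind of check meant above (the
names, the nested-stage test and the 64-byte CD stride are illustrative
assumptions, not the actual driver code):

#include <stdio.h>

/* Standalone model only; the names mirror the driver loosely. */
enum smmu_domain_stage { STAGE_S1, STAGE_S2, STAGE_NESTED };

struct smmu_domain {
	enum smmu_domain_stage stage;
	char *cdtab;            /* host-owned CD table, NULL under nested mode */
};

/* Refuse to walk a CD table we do not own (it is guest-owned under nested). */
static void *get_cd_ptr(struct smmu_domain *d, unsigned int ssid)
{
	if (d->stage == STAGE_NESTED || !d->cdtab)
		return NULL;
	return d->cdtab + ssid * 64;    /* 64-byte CDs, for illustration */
}

int main(void)
{
	struct smmu_domain nested = { .stage = STAGE_NESTED, .cdtab = NULL };

	printf("nested CD ptr: %p\n", get_cd_ptr(&nested, 0));
	return 0;
}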

Thanks,
Keqian


> + smmu_domain->s1_cfg.set = true;
> + smmu_domain->abort = false;
> + break;
> + default:
> + goto out;
> + }
> + spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> + list_for_each_entry(master, &smmu_domain->devices, domain_head)
> + arm_smmu_install_ste_for_dev(master);
> + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> + ret = 0;
> +out:
> + mutex_unlock(&smmu_domain->init_mutex);
> + return ret;
> +}
> +
> +static void arm_smmu_detach_pasid_table(struct iommu_domain *domain)
> +{
> + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> + struct arm_smmu_master *master;
> + unsigned long flags;
> +
> + mutex_lock(&smmu_domain->init_mutex);
> +
> + if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + goto unlock;
> +
> + smmu_domain->s1_cfg.set = false;
> + smmu_domain->abort = false;
> +
> + spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> + list_for_each_entry(master, &smmu_domain->devices, domain_head)
> + arm_smmu_install_ste_for_dev(master);
> + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> +
> +unlock:
> + mutex_unlock(&smmu_domain->init_mutex);
> +}
> +
>  static bool arm_smmu_dev_has_feature(struct device *dev,
>enum iommu_dev_features feat)
>  {
> @@ -2939,6 +3026,8 @@ static struct iommu_ops arm_smmu_ops = {
>   .of_xlate   = arm_smmu_of_xlate,
>   .get_resv_regions   = arm_smmu_get_resv_regions,
>   .put_resv_regions   = generic_iommu_put_resv_regions,
> + .attach_pasid_table = arm_smmu_attach_pasid_table,
> + .detach_pasid_table = arm_smmu_detach_pasid_table,
>   .dev_has_feat   = arm_smmu_dev_has_feature,
>   .dev_feat_enabled   = arm_smmu_dev_feature_enabled,
>   .dev_enable_feat= arm_smmu_dev_enable_feature,
> 


Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-03-01 Thread Keqian Zhu
Hi Robin,

I am going to send v2 next week, to address the issues you reported. Many thanks!
And do you have any further comments on patches #4, #5 and #6?

Thanks,
Keqian

On 2021/2/5 3:50, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>> update the access flag and the dirty state of TTD by hardware. It is
>> essential to track dirty pages of DMA.
>>
>> This adds feature detection, none functional change.
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
>>   include/linux/io-pgtable.h  |  1 +
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 8ca7415d785d..0f0fe71cc10d 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>> iommu_domain *domain,
>>   .pgsize_bitmap= smmu->pgsize_bitmap,
>>   .ias= ias,
>>   .oas= oas,
>> +.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>   .coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>   .tlb= _smmu_flush_ops,
>>   .iommu_dev= smmu->dev,
>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>> arm_smmu_device *smmu)
>>   if (reg & IDR0_HYP)
>>   smmu->features |= ARM_SMMU_FEAT_HYP;
>>   +switch (FIELD_GET(IDR0_HTTU, reg)) {
> 
> We need to accommodate the firmware override as well if we need this to be 
> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
> stack[1].
> 
>> +case IDR0_HTTU_NONE:
>> +break;
>> +case IDR0_HTTU_HA:
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +break;
>> +case IDR0_HTTU_HAD:
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>> +break;
>> +default:
>> +dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>> +return -ENXIO;
>> +}
>> +
>>   /*
>>* The coherency feature as set by FW is used in preference to the ID
>>* register, but warn on mismatch.
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index 96c2e9565e00..e91bea44519e 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -33,6 +33,10 @@
>>   #define IDR0_ASID16(1 << 12)
>>   #define IDR0_ATS(1 << 10)
>>   #define IDR0_HYP(1 << 9)
>> +#define IDR0_HTTUGENMASK(7, 6)
>> +#define IDR0_HTTU_NONE0
>> +#define IDR0_HTTU_HA1
>> +#define IDR0_HTTU_HAD2
>>   #define IDR0_COHACC(1 << 4)
>>   #define IDR0_TTFGENMASK(3, 2)
>>   #define IDR0_TTF_AARCH642
>> @@ -286,6 +290,8 @@
>>   #define CTXDESC_CD_0_TCR_TBI0(1ULL << 38)
>> #define CTXDESC_CD_0_AA64(1UL << 41)
>> +#define CTXDESC_CD_0_HD(1UL << 42)
>> +#define CTXDESC_CD_0_HA(1UL << 43)
>>   #define CTXDESC_CD_0_S(1UL << 44)
>>   #define CTXDESC_CD_0_R(1UL << 45)
>>   #define CTXDESC_CD_0_A(1UL << 46)
>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>   #define ARM_SMMU_FEAT_RANGE_INV(1 << 15)
>>   #define ARM_SMMU_FEAT_BTM(1 << 16)
>>   #define ARM_SMMU_FEAT_SVA(1 << 17)
>> +#define ARM_SMMU_FEAT_HTTU_HA(1 << 18)
>> +#define ARM_SMMU_FEAT_HTTU_HD(1 << 19)
>>   u32features;
>> #define ARM_SMMU_OPT_SKIP_PREFETCH(1 << 0)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index ea727eb1a1a9..1a00ea8562c7 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -97,6 +97,7 @@ struct io_pgtable_cfg {
>>   unsigned longpgsize_bitmap;
>>   unsigned intias;
>>   unsigned intoas;
>> +boolhttu_hd;
> 
> This is very specific to the AArch64 stage 1 format, not a generic capability 
> - I think it should be a quirk flag rather than a common field.
> 
> Robin.
> 
> [1] 
> https://jpbrucker.net/git/linux/commit/?h=sva/current=1ef7d512fb9082450dfe0d22ca4f7e35625a097b
> 
>>   boolcoherent_walk;
>>   const struct iommu_flush_ops*tlb;
>>   struct device*iommu_dev;
>>
> .
> 


Re: [PATCH v11 01/13] vfio: VFIO_IOMMU_SET_PASID_TABLE

2021-02-22 Thread Keqian Zhu
Hi Eric,

On 2021/2/22 18:53, Auger Eric wrote:
> Hi Keqian,
> 
> On 2/2/21 1:34 PM, Keqian Zhu wrote:
>> Hi Eric,
>>
>> On 2020/11/16 19:00, Eric Auger wrote:
>>> From: "Liu, Yi L" 
>>>
>>> This patch adds an VFIO_IOMMU_SET_PASID_TABLE ioctl
>>> which aims to pass the virtual iommu guest configuration
>>> to the host. This latter takes the form of the so-called
>>> PASID table.
>>>
>>> Signed-off-by: Jacob Pan 
>>> Signed-off-by: Liu, Yi L 
>>> Signed-off-by: Eric Auger 
>>>
>>> ---
>>> v11 -> v12:
>>> - use iommu_uapi_set_pasid_table
>>> - check SET and UNSET are not set simultaneously (Zenghui)
>>>
>>> v8 -> v9:
>>> - Merge VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE into a single
>>>   VFIO_IOMMU_SET_PASID_TABLE ioctl.
>>>
>>> v6 -> v7:
>>> - add a comment related to VFIO_IOMMU_DETACH_PASID_TABLE
>>>
>>> v3 -> v4:
>>> - restore ATTACH/DETACH
>>> - add unwind on failure
>>>
>>> v2 -> v3:
>>> - s/BIND_PASID_TABLE/SET_PASID_TABLE
>>>
>>> v1 -> v2:
>>> - s/BIND_GUEST_STAGE/BIND_PASID_TABLE
>>> - remove the struct device arg
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c | 65 +
>>>  include/uapi/linux/vfio.h   | 19 ++
>>>  2 files changed, 84 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>>> b/drivers/vfio/vfio_iommu_type1.c
>>> index 67e827638995..87ddd9e882dc 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -2587,6 +2587,41 @@ static int vfio_iommu_iova_build_caps(struct 
>>> vfio_iommu *iommu,
>>> return ret;
>>>  }
>>>  
>>> +static void
>>> +vfio_detach_pasid_table(struct vfio_iommu *iommu)
>>> +{
>>> +   struct vfio_domain *d;
>>> +
>>> +   mutex_lock(>lock);
>>> +   list_for_each_entry(d, >domain_list, next)
>>> +   iommu_detach_pasid_table(d->domain);
>>> +
>>> +   mutex_unlock(>lock);
>>> +}
>>> +
>>> +static int
>>> +vfio_attach_pasid_table(struct vfio_iommu *iommu, unsigned long arg)
>>> +{
>>> +   struct vfio_domain *d;
>>> +   int ret = 0;
>>> +
>>> +   mutex_lock(>lock);
>>> +
>>> +   list_for_each_entry(d, >domain_list, next) {
>>> +   ret = iommu_uapi_attach_pasid_table(d->domain, (void __user 
>>> *)arg);
>> This design is not very clear to me. This assumes all iommu_domains share 
>> the same pasid table.
>>
>> As I understand, it's reasonable when there is only one group in the domain, 
>> and only one domain in the vfio_iommu.
>> If more than one group in the vfio_iommu, the guest may put them into 
>> different guest iommu_domain, then they have different pasid table.
>>
>> Is this the use scenario?
> 
> the vfio_iommu is attached to a container. all the groups within a
> container share the same set of page tables (linux
> Documentation/driver-api/vfio.rst). So to me if you want to use
> different pasid tables, the groups need to be attached to different
> containers. Does that make sense to you?
OK, so this is what I understand about the design. A small question: when we
perform attach_pasid_table on a container, maybe we ought to do a sanity check
to make sure that only one group is in this container, instead of iterating
over all domains?
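
A rough standalone model of such a check (names are illustrative, not the
actual VFIO code):

#include <errno.h>
#include <stdio.h>

struct vfio_container {
	int nr_groups;
};

/* Only accept a pasid table when exactly one group sits in the container,
 * since with several groups it is ambiguous which guest domain owns it. */
static int attach_pasid_table(struct vfio_container *c, unsigned long arg)
{
	(void)arg;
	if (c->nr_groups != 1)
		return -EINVAL;
	/* ... program the single backing domain here ... */
	return 0;
}

int main(void)
{
	struct vfio_container one = { .nr_groups = 1 };
	struct vfio_container two = { .nr_groups = 2 };

	printf("%d %d\n", attach_pasid_table(&one, 0), attach_pasid_table(&two, 0));
	return 0;
}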

To be frank, my main concern is that if we put each group into a different
container under nested mode, then we give up the possibility that they share
stage2 page tables, which saves host memory and reduces the time needed to
prepare the environment for a VM.

To me, I'd like to understand "the container shares page tables" as:
1) share the stage2 page table under nested mode.
2) share the stage1 page table under non-nested mode.

Because when we perform "map" on a container:
1) under nested mode, we set up the stage2 mapping.
2) under non-nested mode, we set up the stage1 mapping.

Indeed, to realize stage2 mapping sharing, we would have to do much more work
to refactor SMMU_DOMAIN...

Hope you can consider this. :)

Thanks,
Keqian

> 
> Thanks
> 
> Eric
>>
>> Thanks,
>> Keqian
>>
>>> +   if (ret)
>>> +   goto unwind;
>>> +   }
>>> +   goto unlock;
>>> +unwind:
>>> +   list_for_each_entry_continue_reverse(d, >domain_list, next) {
>

Re: [PATCH v13 02/15] iommu: Introduce bind/unbind_guest_msi

2021-02-18 Thread Keqian Zhu
Hi Eric,

On 2021/2/12 16:55, Auger Eric wrote:
> Hi Keqian,
> 
> On 2/1/21 12:52 PM, Keqian Zhu wrote:
>> Hi Eric,
>>
>> On 2020/11/18 19:21, Eric Auger wrote:
>>> On ARM, MSI are translated by the SMMU. An IOVA is allocated
>>> for each MSI doorbell. If both the host and the guest are exposed
>>> with SMMUs, we end up with 2 different IOVAs allocated by each.
>>> guest allocates an IOVA (gIOVA) to map onto the guest MSI
>>> doorbell (gDB). The Host allocates another IOVA (hIOVA) to map
>>> onto the physical doorbell (hDB).
>>>
>>> So we end up with 2 untied mappings:
>>>  S1S2
>>> gIOVA->gDB
>>>   hIOVA->hDB
>>>
>>> Currently the PCI device is programmed by the host with hIOVA
>>> as MSI doorbell. So this does not work.
>>>
>>> This patch introduces an API to pass gIOVA/gDB to the host so
>>> that gIOVA can be reused by the host instead of re-allocating
>>> a new IOVA. So the goal is to create the following nested mapping:
>> Does the gDB can be reused under non-nested mode?
> 
> Under non nested mode the hIOVA is allocated within the MSI reserved
> region exposed by the SMMU driver, [0x800, 80f]. see
> iommu_dma_prepare_msi/iommu_dma_get_msi_page in dma_iommu.c. this hIOVA
> is programmed in the physical device so that the physical SMMU
> translates it into the physical doorbell (hDB = host physical ITS
So, AFAIU, under non-nested mode, on the SMMU side we reuse the workflow of the
non-virtualization scenario.

> doorbell). The gDB is not used at pIOMMU programming level. It is only
> used when setting up the KVM irq route.
> 
> Hope this answers your question.
Thanks for your explanation!
> 

Thanks,
Keqian

>>
>>>
>>>  S1S2
>>> gIOVA->gDB ->hDB
>>>
>>> and program the PCI device with gIOVA MSI doorbell.
>>>
>>> In case we have several devices attached to this nested domain
>>> (devices belonging to the same group), they cannot be isolated
>>> on guest side either. So they should also end up in the same domain
>>> on guest side. We will enforce that all the devices attached to
>>> the host iommu domain use the same physical doorbell and similarly
>>> a single virtual doorbell mapping gets registered (1 single
>>> virtual doorbell is used on guest as well).
>>>
>> [...]
>>
>>> + *
>>> + * The associated IOVA can be reused by the host to create a nested
>>> + * stage2 binding mapping translating into the physical doorbell used
>>> + * by the devices attached to the domain.
>>> + *
>>> + * All devices within the domain must share the same physical doorbell.
>>> + * A single MSI GIOVA/GPA mapping can be attached to an iommu_domain.
>>> + */
>>> +
>>> +int iommu_bind_guest_msi(struct iommu_domain *domain,
>>> +dma_addr_t giova, phys_addr_t gpa, size_t size)
>>> +{
>>> +   if (unlikely(!domain->ops->bind_guest_msi))
>>> +   return -ENODEV;
>>> +
>>> +   return domain->ops->bind_guest_msi(domain, giova, gpa, size);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
>>> +
>>> +void iommu_unbind_guest_msi(struct iommu_domain *domain,
>>> +   dma_addr_t iova)
>> nit: s/iova/giova
> sure
>>
>>> +{
>>> +   if (unlikely(!domain->ops->unbind_guest_msi))
>>> +   return;
>>> +
>>> +   domain->ops->unbind_guest_msi(domain, iova);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_unbind_guest_msi);
>>> +
>> [...]
>>
>> Thanks,
>> Keqian
>>
> 
> Thanks
> 
> Eric
> 
> .
> 


Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-02-17 Thread Keqian Zhu
Hi Yi,

On 2021/2/9 19:57, Yi Sun wrote:
> On 21-02-07 18:40:36, Keqian Zhu wrote:
>> Hi Yi,
>>
>> On 2021/2/7 17:56, Yi Sun wrote:
>>> Hi,
>>>
>>> On 21-01-28 23:17:41, Keqian Zhu wrote:
>>>
>>> [...]
>>>
>>>> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
>>>> +   struct vfio_dma *dma)
>>>> +{
>>>> +  struct vfio_domain *d;
>>>> +
>>>> +  list_for_each_entry(d, >domain_list, next) {
>>>> +  /* Go through all domain anyway even if we fail */
>>>> +  iommu_split_block(d->domain, dma->iova, dma->size);
>>>> +  }
>>>> +}
>>>
>>> This should be a switch to prepare for dirty log start. Per Intel
>>> Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
>>> It enables Accessed/Dirty Flags in second-level paging entries.
>>> So, a generic iommu interface here is better. For Intel iommu, it
>>> enables SLADE. For ARM, it splits block.
>> Indeed, a generic interface name is better.
>>
>> The vendor iommu driver plays vendor's specific actions to start dirty log, 
>> and Intel iommu and ARM smmu may differ. Besides, we may add more actions in 
>> ARM smmu driver in future.
>>
>> One question: Though I am not familiar with Intel iommu, I think it also 
>> should split block mapping besides enable SLADE. Right?
>>
> I am not familiar with ARM smmu. :) So I want to clarify if the block
> in smmu is big page, e.g. 2M page? Intel Vtd manages the memory per
Yes, for ARM, the "block" is a big page :).

> page, 4KB/2MB/1GB. There are two ways to manage dirty pages.
> 1. Keep default granularity. Just set SLADE to enable the dirty track.
> 2. Split big page to 4KB to get finer granularity.
According to your statement, I see that VT-d's SLADE behaves like SMMU HTTU.
They are both based on page tables.

Right, we should give more freedom to the iommu vendor driver, so a generic
interface is better:
1) As you said, set SLADE when enabling the dirty log.
2) IOMMUs of other architectures may have completely different dirty tracking
mechanisms.

> 
> But question about the second solution is if it can benefit the user
> space, e.g. live migration. If my understanding about smmu block (i.e.
> the big page) is correct, have you collected some performance data to
> prove that the split can improve performance? Thanks!
The purpose of splitting block mappings is to reduce the amount of dirty bytes,
which depends on the actual DMA transactions.
Take an extreme example: if a DMA writes one byte, then under a 1G mapping the
dirty amount reported to userspace is 1G, but under a 4K mapping the dirty
amount is just 4K.
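
The arithmetic behind that example, as a tiny standalone program:

#include <stdio.h>

/* Dirty bytes reported for a single 1-byte DMA write, per mapping granule. */
static unsigned long long reported_dirty(unsigned long long granule)
{
	unsigned long long written = 1;   /* bytes actually written by the DMA */

	return ((written + granule - 1) / granule) * granule;
}

int main(void)
{
	printf("1G map: %llu bytes, 4K map: %llu bytes\n",
	       reported_dirty(1ULL << 30), reported_dirty(4096));
	return 0;
}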

I will detail the commit message in v2.

Thanks,
Keqian


Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-02-07 Thread Keqian Zhu



On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> During dirty log tracking, user will try to retrieve dirty log from
>> iommu if it supports hardware dirty log. This adds a new interface
[...]

>>   static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>>   {
>>   unsigned long granule, page_sizes;
>> @@ -957,6 +1046,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>>   .iova_to_phys= arm_lpae_iova_to_phys,
>>   .split_block= arm_lpae_split_block,
>>   .merge_page= arm_lpae_merge_page,
>> +.sync_dirty_log= arm_lpae_sync_dirty_log,
>>   };
>> return data;
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index f1261da11ea8..69f268069282 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2822,6 +2822,47 @@ size_t iommu_merge_page(struct iommu_domain *domain, 
>> unsigned long iova,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_merge_page);
>>   +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
>> + size_t size, unsigned long *bitmap,
>> + unsigned long base_iova, unsigned long bitmap_pgshift)
>> +{
>> +const struct iommu_ops *ops = domain->ops;
>> +unsigned int min_pagesz;
>> +size_t pgsize;
>> +int ret;
>> +
>> +min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
>> +
>> +if (!IS_ALIGNED(iova | size, min_pagesz)) {
>> +pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
>> +   iova, size, min_pagesz);
>> +return -EINVAL;
>> +}
>> +
>> +if (!ops || !ops->sync_dirty_log) {
>> +pr_err("don't support sync dirty log\n");
>> +return -ENODEV;
>> +}
>> +
>> +while (size) {
>> +pgsize = iommu_pgsize(domain, iova, size);
>> +
>> +ret = ops->sync_dirty_log(domain, iova, pgsize,
>> +  bitmap, base_iova, bitmap_pgshift);
> 
> Once again, we have a worst-of-both-worlds iteration that doesn't make much 
> sense. iommu_pgsize() essentially tells you the best supported size that an 
> IOVA range *can* be mapped with, but we're iterating a range that's already 
> mapped, so we don't know if it's relevant, and either way it may not bear any 
> relation to the granularity of the bitmap, which is presumably what actually 
> matters.
> 
> Logically, either we should iterate at the bitmap granularity here, and the 
> driver just says whether the given iova chunk contains any dirty pages or 
> not, or we just pass everything through to the driver and let it do the whole 
> job itself. Doing a little bit of both is just an overcomplicated mess.
> 
> I'm skimming patch #7 and pretty much the same comments apply, so I can't be 
> bothered to repeat them there...
> 
> Robin.
Sorry that I missed these comments...

As I clarified in #4, due to an unsuitable variable name, the @pgsize actually
is the max size that meets the alignment requirement and fits into the
pgsize_bitmap.

All iommu interfaces require that @size fits into the pgsize_bitmap, to
simplify their implementation. And the logic is very similar to "unmap" here.
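
A standalone model of that size selection (an illustration of the intent, not
the kernel's iommu_pgsize()):

#include <stdio.h>

/* Pick the largest size in pgsize_bitmap that both the iova alignment and the
 * remaining length allow. */
static unsigned long long pick_pgsize(unsigned long long pgsize_bitmap,
				      unsigned long long iova,
				      unsigned long long size)
{
	for (int bit = 63; bit >= 0; bit--) {
		unsigned long long cand = 1ULL << bit;

		if (!(pgsize_bitmap & cand) || cand > size)
			continue;
		if (iova & (cand - 1))      /* iova must be aligned to the size */
			continue;
		return cand;
	}
	return 0;
}

int main(void)
{
	unsigned long long bitmap = (1ULL << 12) | (1ULL << 21) | (1ULL << 30);

	/* 2M-aligned iova, 4M left to process: picks the 2M size. */
	printf("%llu\n", pick_pgsize(bitmap, 0x200000, 0x400000));
	return 0;
}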

Thanks,
Keqian

> 
>> +if (ret)
>> +break;
>> +
>> +pr_debug("dirty_log_sync: iova 0x%lx pagesz 0x%zx\n", iova,
>> + pgsize);
>> +
>> +iova += pgsize;
>> +size -= pgsize;
>> +}
>> +
>> +return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
>> +
>>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>>   {
>>   const struct iommu_ops *ops = dev->bus->iommu_ops;
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 754b62a1bbaf..f44551e4a454 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -166,6 +166,10 @@ struct io_pgtable_ops {
>> size_t size);
>>   size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
>>phys_addr_t phys, size_t size, int prot);
>> +int (*sync_dirty_log)(struct io_pgtable_ops *ops,
>> +  unsigned long iova, size_t size,
>> +  unsigned long *bitmap, unsigned long base_iova,
>> +  unsigned long bitmap_pgshift);
>>   };
>> /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> in

Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-02-07 Thread Keqian Zhu
Hi Robin,

On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> During dirty log tracking, user will try to retrieve dirty log from
>> iommu if it supports hardware dirty log. This adds a new interface
>> named sync_dirty_log in iommu layer and arm smmuv3 implements it,
>> which scans leaf TTD and treats it's dirty if it's writable (As we
>> just enable HTTU for stage1, so check AP[2] is not set).
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 +++
>>   drivers/iommu/io-pgtable-arm.c  | 90 +
>>   drivers/iommu/iommu.c   | 41 ++
>>   include/linux/io-pgtable.h  |  4 +
>>   include/linux/iommu.h   | 17 
>>   5 files changed, 179 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 2434519e4bb6..43d0536b429a 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2548,6 +2548,32 @@ static size_t arm_smmu_merge_page(struct iommu_domain 
>> *domain, unsigned long iov
>>   return ops->merge_page(ops, iova, paddr, size, prot);
>>   }
>>   +static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
>> +   unsigned long iova, size_t size,
>> +   unsigned long *bitmap,
>> +   unsigned long base_iova,
>> +   unsigned long bitmap_pgshift)
>> +{
>> +struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
>> +struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
>> +
>> +if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
>> +dev_err(smmu->dev, "don't support HTTU_HD and sync dirty log\n");
>> +return -EPERM;
>> +}
>> +
>> +if (!ops || !ops->sync_dirty_log) {
>> +pr_err("don't support sync dirty log\n");
>> +return -ENODEV;
>> +}
>> +
>> +/* To ensure all inflight transactions are completed */
>> +arm_smmu_flush_iotlb_all(domain);
> 
> What about transactions that arrive between the point that this completes, 
> and the point - potentially much later - that we actually access any given 
> PTE during the walk? I don't see what this is supposed to be synchronising 
> against, even if it were just a CMD_SYNC (I especially don't see why we'd 
> want to knock out the TLBs).
The idea is that pgtable may be updated by HTTU *before* or *after* actual DMA 
access.

1) For PCI ATS. As SMMU spec (3.13.6.1 Hardware flag update for ATS & PRI) 
states:

"In addition to the behavior that is described earlier in this section, if 
hardware-management of Dirty state is enabled
and an ATS request for write access (with NW == 0) is made to a page that is 
marked Writable Clean, the SMMU
assumes a write will be made to that page and marks the page as Writable Dirty 
before returning the ATS response
that grants write access. When this happens, the modification to the page data 
by a device is not visible before
the page state is visible as Writable Dirty."

The problem is that guest memory may be dirtied *after* we actually handle it.

2) For inflight DMA. As SMMU spec (3.13.4 HTTU behavior summary) states:

"In addition, the completion of a TLB invalidation operation makes TTD updates 
that were caused by
transactions that are themselves completed by the completion of the TLB 
invalidation visible. Both
broadcast and explicit CMD_TLBI_* invalidations have this property."

The problem is that we should flush all DMA transactions after the guest stops.

The key to solving these problems is to invalidate the related TLB entries:
1) TLBI flushes inflight DMA translations (before dirty_log_sync()).
2) If a DMA translation uses the ATC and occurs after we have handled the dirty
memory, then the ATC has been invalidated, so the access will re-mark the page
as dirty (in dirty_log_clear()).
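
As an aside, the per-leaf dirty test described in the quoted patch ("treat it
as dirty if it is writable, i.e. AP[2] is not set") can be modelled standalone
(bit positions follow the Arm VMSAv8-64 stage-1 format; this is an
illustration, not the driver code):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PTE_AP_RDONLY (1ULL << 7)    /* AP[2], read-only when set */
#define PTE_DBM       (1ULL << 51)   /* dirty-bit-modifier */

/* With DBM set, hardware clears AP[2] on the first write, so a writable
 * (AP[2] == 0) leaf is treated as dirty. */
static bool leaf_is_dirty(uint64_t pte)
{
	return !(pte & PTE_AP_RDONLY);
}

int main(void)
{
	uint64_t clean = PTE_DBM | PTE_AP_RDONLY;  /* writable-clean */
	uint64_t dirty = PTE_DBM;                  /* AP[2] cleared by a write */

	printf("clean: %d, dirty: %d\n", leaf_is_dirty(clean), leaf_is_dirty(dirty));
	return 0;
}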

Thanks,
Keqian

> 
>> +
>> +return ops->sync_dirty_log(ops, iova, size, bitmap,
>> +base_iova, bitmap_pgshift);
>> +}
>> +
>>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args 
>> *args)
>>   {
>>   return iommu_fwspec_add_ids(dev, args->args, 1);
>> @@ -2649,6 +2675,7 @@ static struct iommu_ops arm_smmu_ops = {
>>   .domain_set_attr= arm_smmu_domain_set_attr,
>>   .split_block= arm_smmu_split_block,
>>   .merge_page= arm_smmu_m

Re: [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor

2021-02-07 Thread Keqian Zhu
Hi Robin,

On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> When stop dirty log tracking, we need to recover all block descriptors
>> which are splited when start dirty log tracking. This adds a new
>> interface named merge_page in iommu layer and arm smmuv3 implements it,
>> which reinstall block mappings and unmap the span of page mappings.
>>
>> It's caller's duty to find contiuous physical memory.
>>
>> During merging page, other interfaces are not expected to be working,
>> so race condition does not exist. And we flush all iotlbs after the merge
>> procedure is completed to ease the pressure of iommu, as we will merge a
>> huge range of page mappings in general.
> 
> Again, I think we need better reasoning than "race conditions don't exist 
> because we don't expect them to exist".
Sure, because they can't. ;-)

> 
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 ++
>>   drivers/iommu/io-pgtable-arm.c  | 78 +
>>   drivers/iommu/iommu.c   | 75 
>>   include/linux/io-pgtable.h  |  2 +
>>   include/linux/iommu.h   | 10 +++
>>   5 files changed, 185 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 5469f4fca820..2434519e4bb6 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2529,6 +2529,25 @@ static size_t arm_smmu_split_block(struct 
>> iommu_domain *domain,
>>   return ops->split_block(ops, iova, size);
>>   }
[...]

>> +
>> +size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>> +size_t size, int prot)
>> +{
>> +phys_addr_t phys;
>> +dma_addr_t p, i;
>> +size_t cont_size, merged_size;
>> +size_t merged = 0;
>> +
>> +while (size) {
>> +phys = iommu_iova_to_phys(domain, iova);
>> +cont_size = PAGE_SIZE;
>> +p = phys + cont_size;
>> +i = iova + cont_size;
>> +
>> +while (cont_size < size && p == iommu_iova_to_phys(domain, i)) {
>> +p += PAGE_SIZE;
>> +i += PAGE_SIZE;
>> +cont_size += PAGE_SIZE;
>> +}
>> +
>> +merged_size = __iommu_merge_page(domain, iova, phys, cont_size,
>> +prot);
> 
> This is incredibly silly. The amount of time you'll spend just on walking the 
> tables in all those iova_to_phys() calls is probably significantly more than 
> it would take the low-level pagetable code to do the entire operation for 
> itself. At this level, any knowledge of how mappings are actually constructed 
> is lost once __iommu_map() returns, so we just don't know, and for this 
> operation in particular there seems little point in trying to guess - the 
> driver backend still has to figure out whether something we *think* might me 
> mergeable actually is, so it's better off doing the entire operation in a 
> single pass by itself.
>
> There's especially little point in starting all this work *before* checking 
> that it's even possible...
>
> Robin.

Well, this looks silly indeed. But the iova->phys info is only stored in the
pgtable. It seems that there is no other method to find contiguous physical
addresses :-( (actually, vfio_iommu_replay() has similar logic).
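
For illustration, the kind of scan meant here, as a standalone toy model
(iommu_iova_to_phys() is replaced by a fake lookup; this is not the kernel
code):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Fake iova->phys lookup standing in for iommu_iova_to_phys(). */
static uint64_t iova_to_phys(uint64_t iova)
{
	if (iova < 2 * PAGE_SIZE)                 /* contiguous for the first 8K */
		return 0x80000000ULL + iova;
	return 0x90000000ULL + iova;              /* then a discontinuity        */
}

/* Length of the physically contiguous run starting at iova, page by page. */
static uint64_t contiguous_len(uint64_t iova, uint64_t size)
{
	uint64_t phys = iova_to_phys(iova);
	uint64_t len = PAGE_SIZE;

	while (len < size && iova_to_phys(iova + len) == phys + len)
		len += PAGE_SIZE;
	return len;
}

int main(void)
{
	printf("contiguous run: %llu bytes\n",
	       (unsigned long long)contiguous_len(0, 4 * PAGE_SIZE));
	return 0;
}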

We put the procedure of finding contiguous physical addresses in the common
iommu layer, because this is common logic for all types of iommu driver.

If a vendor iommu driver thinks (iova, phys, cont_size) is not merge-able, it
can make its own decision about how to map it. This is the same as iommu_map(),
which provides (iova, paddr, pgsize) to the vendor driver, and the vendor
driver can make its own decision about how to map it.

Do I understand your idea correctly?

Thanks,
Keqian
> 
>> +iova += merged_size;
>> +size -= merged_size;
>> +merged += merged_size;
>> +
>> +if (merged_size != cont_size)
>> +break;
>> +}
>> +iommu_flush_iotlb_all(domain);
>> +
>> +return merged;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_merge_page);
>> +
>>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>>   {
>>   const struct iommu_ops *ops = dev->bus->iommu_ops;
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
&g

Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-02-07 Thread Keqian Zhu
Hi Yi,

On 2021/2/7 17:56, Yi Sun wrote:
> Hi,
> 
> On 21-01-28 23:17:41, Keqian Zhu wrote:
> 
> [...]
> 
>> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
>> + struct vfio_dma *dma)
>> +{
>> +struct vfio_domain *d;
>> +
>> +list_for_each_entry(d, >domain_list, next) {
>> +/* Go through all domain anyway even if we fail */
>> +iommu_split_block(d->domain, dma->iova, dma->size);
>> +}
>> +}
> 
> This should be a switch to prepare for dirty log start. Per Intel
> Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
> It enables Accessed/Dirty Flags in second-level paging entries.
> So, a generic iommu interface here is better. For Intel iommu, it
> enables SLADE. For ARM, it splits block.
Indeed, a generic interface name is better.

The vendor iommu driver performs vendor-specific actions to start the dirty
log, and Intel iommu and ARM SMMU may differ. Besides, we may add more actions
in the ARM SMMU driver in the future.

One question: though I am not familiar with Intel iommu, I think it should also
split block mappings besides enabling SLADE. Right?
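
A standalone sketch of the generic-interface idea (the ops layout and names are
assumptions for illustration, not the kernel API):

#include <stdio.h>

/* One dirty_log_start hook; each vendor driver prepares in its own way. */
struct dirty_log_ops {
	const char *name;
	void (*dirty_log_start)(void);
};

static void arm_smmu_dirty_log_start(void)
{
	printf("arm-smmu-v3: split block mappings, rely on HTTU HD\n");
}

static void intel_iommu_dirty_log_start(void)
{
	printf("intel-iommu: enable SLADE in second-level paging entries\n");
}

static void iommu_dirty_log_start(const struct dirty_log_ops *ops)
{
	if (ops && ops->dirty_log_start)
		ops->dirty_log_start();
}

int main(void)
{
	struct dirty_log_ops arm = { "arm-smmu-v3", arm_smmu_dirty_log_start };
	struct dirty_log_ops intel = { "intel-iommu", intel_iommu_dirty_log_start };

	iommu_dirty_log_start(&arm);
	iommu_dirty_log_start(&intel);
	return 0;
}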

Thanks,
Keqian


Re: [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page

2021-02-07 Thread Keqian Zhu
Hi Robin,

On 2021/2/5 3:51, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> Block descriptor is not a proper granule for dirty log tracking. This
>> adds a new interface named split_block in iommu layer and arm smmuv3
>> implements it, which splits block descriptor to an equivalent span of
>> page descriptors.
>>
>> During spliting block, other interfaces are not expected to be working,
>> so race condition does not exist. And we flush all iotlbs after the split
>> procedure is completed to ease the pressure of iommu, as we will split a
>> huge range of block mappings in general.
> 
> "Not expected to be" is not the same thing as "can not". Presumably the whole 
> point of dirty log tracking is that it can be run speculatively in the 
> background, so is there any actual guarantee that the guest can't, say, issue 
> a hotplug event that would cause some memory to be released back to the host 
> and unmapped while a scan might be in progress? Saying effectively "there is 
> no race condition as long as you assume there is no race condition" isn't all 
> that reassuring...
Sorry for my inaccurate expression. "Not expected to be" is inappropriate here;
the actual meaning is "can not".

The only user of these newly added interfaces is vfio_iommu_type1 for now, and
vfio_iommu_type1 always acquires "iommu->lock" before invoking them.

> 
> That said, it's not very clear why patches #4 and #5 are here at all, given 
> that patches #6 and #7 appear quite happy to handle block entries.
Splitting blocks into pages is very important for dirty page tracking. Page
mappings can greatly reduce the amount of dirty memory to handle. The KVM
stage2 MMU side also has this logic.

Yes, #6 (log_sync) and #7 (log_clear) are designed to apply to both block and
page mappings. The "split" operation may fail (e.g. without BBML1/2, or on
ENOMEM), but we can still track dirty state at block granule, which is still a
much better choice compared to the full-dirty policy.

> 
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  20 
>>   drivers/iommu/io-pgtable-arm.c  | 122 
>>   drivers/iommu/iommu.c   |  40 +++
>>   include/linux/io-pgtable.h  |   2 +
>>   include/linux/iommu.h   |  10 ++
>>   5 files changed, 194 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 9208881a571c..5469f4fca820 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2510,6 +2510,25 @@ static int arm_smmu_domain_set_attr(struct 
>> iommu_domain *domain,
>>   return ret;
>>   }
>>   +static size_t arm_smmu_split_block(struct iommu_domain *domain,
>> +   unsigned long iova, size_t size)
>> +{
>> +struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
>> +struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
>> +
>> +if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
>> +dev_err(smmu->dev, "don't support BBML1/2 and split block\n");
>> +return 0;
>> +}
>> +
>> +if (!ops || !ops->split_block) {
>> +pr_err("don't support split block\n");
>> +return 0;
>> +}
>> +
>> +return ops->split_block(ops, iova, size);
>> +}
>> +
>>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args 
>> *args)
>>   {
>>   return iommu_fwspec_add_ids(dev, args->args, 1);
>> @@ -2609,6 +2628,7 @@ static struct iommu_ops arm_smmu_ops = {
>>   .device_group= arm_smmu_device_group,
>>   .domain_get_attr= arm_smmu_domain_get_attr,
>>   .domain_set_attr= arm_smmu_domain_set_attr,
>> +.split_block= arm_smmu_split_block,
>>   .of_xlate= arm_smmu_of_xlate,
>>   .get_resv_regions= arm_smmu_get_resv_regions,
>>   .put_resv_regions= generic_iommu_put_resv_regions,
>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>> index e299a44808ae..f3b7f7115e38 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -79,6 +79,8 @@
>>   #define ARM_LPAE_PTE_SH_IS(((arm_lpae_iopte)3) << 8)
>>   #define ARM_LPAE_P

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-06 Thread Keqian Zhu
Hi Robin,

On 2021/2/5 19:48, Robin Murphy wrote:
> On 2021-02-05 09:13, Keqian Zhu wrote:
>> Hi Robin and Jean,
>>
>> On 2021/2/5 3:50, Robin Murphy wrote:
>>> On 2021-01-28 15:17, Keqian Zhu wrote:
>>>> From: jiangkunkun 
>>>>
>>>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>>>> update the access flag and the dirty state of TTD by hardware. It is
>>>> essential to track dirty pages of DMA.
>>>>
>>>> This adds feature detection, none functional change.
>>>>
>>>> Co-developed-by: Keqian Zhu 
>>>> Signed-off-by: Kunkun Jiang 
>>>> ---
>>>>drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
>>>>drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
>>>>include/linux/io-pgtable.h  |  1 +
>>>>3 files changed, 25 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>>>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index 8ca7415d785d..0f0fe71cc10d 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>>>> iommu_domain *domain,
>>>>.pgsize_bitmap= smmu->pgsize_bitmap,
>>>>.ias= ias,
>>>>.oas= oas,
>>>> +.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>>>.coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>>>.tlb= _smmu_flush_ops,
>>>>.iommu_dev= smmu->dev,
>>>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>>>> arm_smmu_device *smmu)
>>>>if (reg & IDR0_HYP)
>>>>smmu->features |= ARM_SMMU_FEAT_HYP;
>>>>+switch (FIELD_GET(IDR0_HTTU, reg)) {
>>>
>>> We need to accommodate the firmware override as well if we need this to be 
>>> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
>>> stack[1].
>> Robin, Thanks for pointing it out.
>>
>> Jean, I see that the IORT HTTU flag overrides the hardware register info 
>> unconditionally. I have some concern about it:
>>
>> If the override flag has HTTU but hardware doesn't support it, then driver 
>> will use this feature but receive access fault or permission fault from SMMU 
>> unexpectedly.
>> 1) If IOPF is not supported, then kernel can not work normally.
>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
>> based dma dirty tracking (this series).
> 
> Yes, if the IORT describes the SMMU incorrectly, things will not work well. 
> Just like if it describes the wrong base address or the wrong interrupt 
> numbers, things will also not work well. The point is that incorrect firmware 
> can be updated in the field fairly easily; incorrect hardware can not.
Agree.

> 
> Say the SMMU designer hard-codes the ID register field to 0x2 because the 
> SMMU itself is capable of HTTU, and they assume it's always going to be wired 
> up coherently, but then a customer integrates it to a non-coherent 
> interconnect. Firmware needs to override that value to prevent an OS thinking 
> that the claimed HTTU capability is ever going to work.
> 
> Or say the SMMU *is* integrated correctly, but due to an erratum discovered 
> later in the interconnect or SMMU itself, it turns out DBM doesn't always 
> work reliably, but AF is still OK. Firmware needs to downgrade the indicated 
> level of support from that which was intended to that which works reliably.
> 
> Or say someone forgets to set an integration tieoff so their SMMU reports 0x0 
> even though it and the interconnect *can* happily support HTTU. In that case, 
> firmware may want to upgrade the value to *allow* an OS to use HTTU despite 
> the ID register being wrong.
Fair enough. A mask can realize "downgrade", but not "upgrade". You give a
reasonable point for upgrade.

BTW, my original intention was that a mask could provide some convenience for
the BIOS maker, as the override flag could then stay the same for SMMUs
regardless of whether they support HTTU or not. But it shows that a mask cannot
cover every scenario.

> 
>> As the IORT spec doesn't give an explicit explanation for HTTU override, can 
>> we comprehend it as a mask for HTTU related hardware register?
>> So the logic becomes: smmu->feature = HTTU override &

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-06 Thread Keqian Zhu
Hi Robin,

On 2021/2/6 0:11, Robin Murphy wrote:
> On 2021-02-05 11:48, Robin Murphy wrote:
>> On 2021-02-05 09:13, Keqian Zhu wrote:
>>> Hi Robin and Jean,
>>>
>>> On 2021/2/5 3:50, Robin Murphy wrote:
>>>> On 2021-01-28 15:17, Keqian Zhu wrote:
>>>>> From: jiangkunkun 
>>>>>
>>>>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>>>>> update the access flag and the dirty state of TTD by hardware. It is
>>>>> essential to track dirty pages of DMA.
>>>>>
>>>>> This adds feature detection, none functional change.
>>>>>
>>>>> Co-developed-by: Keqian Zhu 
>>>>> Signed-off-by: Kunkun Jiang 
>>>>> ---
>>>>>drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
>>>>>drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
>>>>>include/linux/io-pgtable.h  |  1 +
>>>>>3 files changed, 25 insertions(+)
>>>>>
>>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>>>>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> index 8ca7415d785d..0f0fe71cc10d 100644
>>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>>>>> iommu_domain *domain,
>>>>>.pgsize_bitmap= smmu->pgsize_bitmap,
>>>>>.ias= ias,
>>>>>.oas= oas,
>>>>> +.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>>>>.coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>>>>.tlb= _smmu_flush_ops,
>>>>>.iommu_dev= smmu->dev,
>>>>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>>>>> arm_smmu_device *smmu)
>>>>>if (reg & IDR0_HYP)
>>>>>smmu->features |= ARM_SMMU_FEAT_HYP;
>>>>>+switch (FIELD_GET(IDR0_HTTU, reg)) {
>>>>
>>>> We need to accommodate the firmware override as well if we need this to be 
>>>> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
>>>> stack[1].
>>> Robin, Thanks for pointing it out.
>>>
>>> Jean, I see that the IORT HTTU flag overrides the hardware register info 
>>> unconditionally. I have some concern about it:
>>>
>>> If the override flag has HTTU but hardware doesn't support it, then driver 
>>> will use this feature but receive access fault or permission fault from 
>>> SMMU unexpectedly.
>>> 1) If IOPF is not supported, then kernel can not work normally.
>>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
>>> based dma dirty tracking (this series).
>>
>> Yes, if the IORT describes the SMMU incorrectly, things will not work well. 
>> Just like if it describes the wrong base address or the wrong interrupt 
>> numbers, things will also not work well. The point is that incorrect 
>> firmware can be updated in the field fairly easily; incorrect hardware can 
>> not.
>>
>> Say the SMMU designer hard-codes the ID register field to 0x2 because the 
>> SMMU itself is capable of HTTU, and they assume it's always going to be 
>> wired up coherently, but then a customer integrates it to a non-coherent 
>> interconnect. Firmware needs to override that value to prevent an OS 
>> thinking that the claimed HTTU capability is ever going to work.
>>
>> Or say the SMMU *is* integrated correctly, but due to an erratum discovered 
>> later in the interconnect or SMMU itself, it turns out DBM doesn't always 
>> work reliably, but AF is still OK. Firmware needs to downgrade the indicated 
>> level of support from that which was intended to that which works reliably.
>>
>> Or say someone forgets to set an integration tieoff so their SMMU reports 
>> 0x0 even though it and the interconnect *can* happily support HTTU. In that 
>> case, firmware may want to upgrade the value to *allow* an OS to use HTTU 
>> despite the ID register being wrong.
>>
>>> As the IORT spec doesn't give an explicit explanation for HTTU override, 
>>> can we comprehend it as a mask for HTTU related hardware register?
>>> So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;
>>
>> No, it literally states that the OS must use the value of the firmware field 
>> *instead* of the value from the hardware field.
> 
> Oops, apologies for an oversight there - I've been reviewing IORT spec 
> updates lately so naturally had the newest version open already. Turns out 
> these descriptions were only clarified in the most recent release, so if you 
> were looking at an older document they *were* horribly vague.
Yep, my local version is E, which was released in July 2020. I downloaded
version E.a just now, thanks. ;-)

Thanks,
Keqian


Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-06 Thread Keqian Zhu
Hi Jean,

On 2021/2/5 17:51, Jean-Philippe Brucker wrote:
> Hi Keqian,
> 
> On Fri, Feb 05, 2021 at 05:13:50PM +0800, Keqian Zhu wrote:
>>> We need to accommodate the firmware override as well if we need this to be 
>>> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
>>> stack[1].
>> Robin, Thanks for pointing it out.
>>
>> Jean, I see that the IORT HTTU flag overrides the hardware register info 
>> unconditionally. I have some concern about it:
>>
>> If the override flag has HTTU but hardware doesn't support it, then driver 
>> will use this feature but receive access fault or permission fault from SMMU 
>> unexpectedly.
>> 1) If IOPF is not supported, then kernel can not work normally.
>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
>> based dma dirty tracking (this series).
>>
>> As the IORT spec doesn't give an explicit explanation for HTTU override, can 
>> we comprehend it as a mask for HTTU related hardware register?
> 
> To me "Overrides the value of SMMU_IDR0.HTTU" is clear enough: disregard
> the value of SMMU_IDR0.HTTU and use the one specified by IORT instead. And
> that's both ways, since there is no validity mask for the IORT value: if
> there is an IORT table, always ignore SMMU_IDR0.HTTU.
> 
> That's how the SMMU driver implements the COHACC bit, which has the same
> wording in IORT. So I think we should implement HTTU the same way.
OK, and Robin said that the latest IORT spec literally states it.

> 
> One complication is that there is no equivalent override for device tree.
> I think it can be added later if necessary, because unlike IORT it can be
> tri state (property not present, overriden positive, overridden negative).
Yeah, that would be more flexible. ;-)

> 
> Thanks,
> Jean
> 
> .
> 
Thanks,
Keqian


Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-05 Thread Keqian Zhu
Hi Robin and Jean,

On 2021/2/5 3:50, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>> update the access flag and the dirty state of TTD by hardware. It is
>> essential to track dirty pages of DMA.
>>
>> This adds feature detection, none functional change.
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
>>   include/linux/io-pgtable.h  |  1 +
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 8ca7415d785d..0f0fe71cc10d 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>> iommu_domain *domain,
>>   .pgsize_bitmap= smmu->pgsize_bitmap,
>>   .ias= ias,
>>   .oas= oas,
>> +.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>   .coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>   .tlb= _smmu_flush_ops,
>>   .iommu_dev= smmu->dev,
>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>> arm_smmu_device *smmu)
>>   if (reg & IDR0_HYP)
>>   smmu->features |= ARM_SMMU_FEAT_HYP;
>>   +switch (FIELD_GET(IDR0_HTTU, reg)) {
> 
> We need to accommodate the firmware override as well if we need this to be 
> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
> stack[1].
Robin, Thanks for pointing it out.

Jean, I see that the IORT HTTU flag overrides the hardware register info
unconditionally. I have some concerns about it:

If the override flag advertises HTTU but the hardware doesn't support it, then
the driver will use this feature but receive access faults or permission faults
from the SMMU unexpectedly.
1) If IOPF is not supported, then the kernel can not work normally.
2) If IOPF is supported, the kernel will perform useless actions, such as
HTTU-based DMA dirty tracking (this series).

As the IORT spec doesn't give an explicit explanation of the HTTU override, can
we read it as a mask over the HTTU-related hardware register?
So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;
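
For illustration, the difference between the two readings can be shown with a
tiny standalone program ("override" replaces the ID register value outright,
while the proposed "mask" reading can only downgrade it; values and names here
are illustrative, not driver code):

#include <stdio.h>
#include <stdint.h>

#define IDR0_HTTU_NONE 0
#define IDR0_HTTU_HA   1
#define IDR0_HTTU_HAD  2

/* "Override": the firmware (IORT) value replaces the ID register outright. */
static uint32_t httu_as_override(uint32_t idr0_httu, uint32_t fw_httu)
{
	(void)idr0_httu;
	return fw_httu;
}

/* Proposed "mask" reading: can only remove capability bits, never add them. */
static uint32_t httu_as_mask(uint32_t idr0_httu, uint32_t fw_httu)
{
	return idr0_httu & fw_httu;
}

int main(void)
{
	/* HW reports none, firmware claims HAD: override upgrades, a mask cannot. */
	printf("override: %u, mask: %u\n",
	       httu_as_override(IDR0_HTTU_NONE, IDR0_HTTU_HAD),
	       httu_as_mask(IDR0_HTTU_NONE, IDR0_HTTU_HAD));
	return 0;
}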

> 
>> +case IDR0_HTTU_NONE:
>> +break;
>> +case IDR0_HTTU_HA:
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +break;
>> +case IDR0_HTTU_HAD:
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>> +break;
>> +default:
>> +dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>> +return -ENXIO;
>> +}
>> +
>>   /*
>>* The coherency feature as set by FW is used in preference to the ID
>>* register, but warn on mismatch.
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index 96c2e9565e00..e91bea44519e 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -33,6 +33,10 @@
>>   #define IDR0_ASID16(1 << 12)
>>   #define IDR0_ATS(1 << 10)
>>   #define IDR0_HYP(1 << 9)
>> +#define IDR0_HTTUGENMASK(7, 6)
>> +#define IDR0_HTTU_NONE0
>> +#define IDR0_HTTU_HA1
>> +#define IDR0_HTTU_HAD2
>>   #define IDR0_COHACC(1 << 4)
>>   #define IDR0_TTFGENMASK(3, 2)
>>   #define IDR0_TTF_AARCH642
>> @@ -286,6 +290,8 @@
>>   #define CTXDESC_CD_0_TCR_TBI0(1ULL << 38)
>> #define CTXDESC_CD_0_AA64(1UL << 41)
>> +#define CTXDESC_CD_0_HD(1UL << 42)
>> +#define CTXDESC_CD_0_HA(1UL << 43)
>>   #define CTXDESC_CD_0_S(1UL << 44)
>>   #define CTXDESC_CD_0_R(1UL << 45)
>>   #define CTXDESC_CD_0_A(1UL << 46)
>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>   #define ARM_SMMU_FEAT_RANGE_INV(1 << 15)
>>   #define ARM_SMMU_FEAT_BTM(1 << 16)
>>   #define ARM_SMMU_FEAT_SVA(1 << 17)
>> +#define ARM_SMMU_FEAT_

Re: [RFC] Use SMMU HTTU for DMA dirty page tracking

2021-02-04 Thread Keqian Zhu
Hi Jean and Kevin,

FYI, I have send out the SMMUv3 HTTU support for DMA dirty tracking[1] a week 
ago.

Thanks,
Keqian

[1] 
https://lore.kernel.org/linux-iommu/20210128151742.18840-1-zhukeqi...@huawei.com/

On 2020/5/27 17:14, Jean-Philippe Brucker wrote:
> On Wed, May 27, 2020 at 08:40:47AM +, Tian, Kevin wrote:
>>> From: Xiang Zheng 
>>> Sent: Wednesday, May 27, 2020 2:45 PM
>>>
>>>
>>> On 2020/5/27 11:27, Tian, Kevin wrote:
> From: Xiang Zheng
> Sent: Monday, May 25, 2020 7:34 PM
>
> [+cc Kirti, Yan, Alex]
>
> On 2020/5/23 1:14, Jean-Philippe Brucker wrote:
>> Hi,
>>
>> On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote:
>>> Hi all,
>>>
>>> Is there any plan for enabling SMMU HTTU?
>>
>> Not outside of SVA, as far as I know.
>>
>
>>> I have seen the patch locates in the SVA series patch, which adds
>>> support for HTTU:
>>> https://www.spinics.net/lists/arm-kernel/msg798694.html
>>>
>>> HTTU reduces the number of access faults on SMMU fault queue
>>> (permission faults also benifit from it).
>>>
>>> Besides reducing the faults, HTTU also helps to track dirty pages for
>>> device DMA. Is it feasible to utilize HTTU to get dirty pages on device
>>> DMA during VFIO live migration?
>>
>> As you know there is a VFIO interface for this under discussion:
>> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-
> kwankh...@nvidia.com/
>> It doesn't implement an internal API to communicate with the IOMMU
> driver
>> about dirty pages.

 We plan to add such API later, e.g. to utilize A/D bit in VT-d 2nd-level
 page tables (Rev 3.0).

>>>
>>> Thank you, Kevin.
>>>
>>> When will you send this series patches? Maybe(Hope) we can also support
>>> hardware-based dirty pages tracking via common APIs based on your
>>> patches. :)
>>
>> Yan is working with Kirti on basic live migration support now. After that
>> part is done, we will start working on A/D bit support. Yes, common APIs
>> are definitely the goal here.
>>
>>>
>
>>
>>> If SMMU can track dirty pages, devices are not required to implement
>>> additional dirty pages tracking to support VFIO live migration.
>>
>> It seems feasible, though tracking it in the device might be more
>> efficient. I might have misunderstood but I think for live migration of
>> the Intel NIC they trap guest accesses to the device and introspect its
>> state to figure out which pages it is accessing.

 Does HTTU implement A/D-like mechanism in SMMU page tables, or just
 report dirty pages in a log buffer? Either way tracking dirty pages in 
 IOMMU
 side is generic thus doesn't require device-specific tweak like in Intel 
 NIC.

>>>
>>> Currently HTTU just implement A/D-like mechanism in SMMU page tables.
>>> We certainly
>>> expect SMMU can also implement PML-like feature so that we can avoid
>>> walking the
>>> whole page table to get the dirty pages.
> 
> There is no reporting of dirty pages in log buffer. It might be possible
> to do software logging based on PRI or Stall, but that requires special
> support in the endpoint as well as the SMMU.
> 
>> Is there a link to HTTU introduction?
> 
> I don't know any gentle introduction, but there are sections D5.4.11
> "Hardware management of the Access flag and dirty state" in the ARM
> Architecture Reference Manual (DDI0487E), and section 3.13 "Translation
> table entries and Access/Dirty flags" in the SMMU specification
> (IHI0070C). HTTU stands for "Hardware Translation Table Update".
> 
> In short, when HTTU is enabled, the SMMU translation performs an atomic
> read-modify-write on the leaf translation table descriptor, setting some
> bits depending on the type of memory access. This can be enabled
> independently on both stage-1 and stage-2 tables (equivalent to your 1st
> and 2nd page tables levels, I think).
> 
> Thanks,
> Jean
> ___
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> .
> 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v11 01/13] vfio: VFIO_IOMMU_SET_PASID_TABLE

2021-02-02 Thread Keqian Zhu
Hi Eric,

On 2020/11/16 19:00, Eric Auger wrote:
> From: "Liu, Yi L" 
> 
> This patch adds an VFIO_IOMMU_SET_PASID_TABLE ioctl
> which aims to pass the virtual iommu guest configuration
> to the host. This latter takes the form of the so-called
> PASID table.
> 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Eric Auger 
> 
> ---
> v11 -> v12:
> - use iommu_uapi_set_pasid_table
> - check SET and UNSET are not set simultaneously (Zenghui)
> 
> v8 -> v9:
> - Merge VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE into a single
>   VFIO_IOMMU_SET_PASID_TABLE ioctl.
> 
> v6 -> v7:
> - add a comment related to VFIO_IOMMU_DETACH_PASID_TABLE
> 
> v3 -> v4:
> - restore ATTACH/DETACH
> - add unwind on failure
> 
> v2 -> v3:
> - s/BIND_PASID_TABLE/SET_PASID_TABLE
> 
> v1 -> v2:
> - s/BIND_GUEST_STAGE/BIND_PASID_TABLE
> - remove the struct device arg
> ---
>  drivers/vfio/vfio_iommu_type1.c | 65 +
>  include/uapi/linux/vfio.h   | 19 ++
>  2 files changed, 84 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 67e827638995..87ddd9e882dc 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2587,6 +2587,41 @@ static int vfio_iommu_iova_build_caps(struct 
> vfio_iommu *iommu,
>   return ret;
>  }
>  
> +static void
> +vfio_detach_pasid_table(struct vfio_iommu *iommu)
> +{
> + struct vfio_domain *d;
> +
> + mutex_lock(&iommu->lock);
> + list_for_each_entry(d, &iommu->domain_list, next)
> + iommu_detach_pasid_table(d->domain);
> +
> + mutex_unlock(&iommu->lock);
> +}
> +
> +static int
> +vfio_attach_pasid_table(struct vfio_iommu *iommu, unsigned long arg)
> +{
> + struct vfio_domain *d;
> + int ret = 0;
> +
> + mutex_lock(&iommu->lock);
> +
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + ret = iommu_uapi_attach_pasid_table(d->domain, (void __user *)arg);
This design is not very clear to me. It assumes that all iommu_domains share
the same PASID table.

As I understand it, that is reasonable when there is only one group in the
domain and only one domain in the vfio_iommu.
If there is more than one group in the vfio_iommu, the guest may put them into
different guest iommu_domains, and then they have different PASID tables.

Is this the intended use scenario?

Thanks,
Keqian

> + if (ret)
> + goto unwind;
> + }
> + goto unlock;
> +unwind:
> + list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) {
> + iommu_detach_pasid_table(d->domain);
> + }
> +unlock:
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
>  static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
>  struct vfio_info_cap *caps)
>  {
> @@ -2747,6 +2782,34 @@ static int vfio_iommu_type1_unmap_dma(struct 
> vfio_iommu *iommu,
>   -EFAULT : 0;
>  }
>  
> +static int vfio_iommu_type1_set_pasid_table(struct vfio_iommu *iommu,
> + unsigned long arg)
> +{
> + struct vfio_iommu_type1_set_pasid_table spt;
> + unsigned long minsz;
> + int ret = -EINVAL;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_set_pasid_table, flags);
> +
> + if (copy_from_user(&spt, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (spt.argsz < minsz)
> + return -EINVAL;
> +
> + if (spt.flags & VFIO_PASID_TABLE_FLAG_SET &&
> + spt.flags & VFIO_PASID_TABLE_FLAG_UNSET)
> + return -EINVAL;
> +
> + if (spt.flags & VFIO_PASID_TABLE_FLAG_SET)
> + ret = vfio_attach_pasid_table(iommu, arg + minsz);
> + else if (spt.flags & VFIO_PASID_TABLE_FLAG_UNSET) {
> + vfio_detach_pasid_table(iommu);
> + ret = 0;
> + }
> + return ret;
> +}
> +
>  static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>   unsigned long arg)
>  {
> @@ -2867,6 +2930,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>   return vfio_iommu_type1_unmap_dma(iommu, arg);
>   case VFIO_IOMMU_DIRTY_PAGES:
>   return vfio_iommu_type1_dirty_pages(iommu, arg);
> + case VFIO_IOMMU_SET_PASID_TABLE:
> + return vfio_iommu_type1_set_pasid_table(iommu, arg);
>   default:
>   return -ENOTTY;
>   }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2f313a238a8f..78ce3ce6c331 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -14,6 +14,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #define VFIO_API_VERSION 0
>  
> @@ -1180,6 +1181,24 @@ struct vfio_iommu_type1_dirty_bitmap_get {
>  
>  #define VFIO_IOMMU_DIRTY_PAGES _IO(VFIO_TYPE, VFIO_BASE + 17)
>  
> +/*
> + * VFIO_IOMMU_SET_PASID_TABLE - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> + *   struct vfio_iommu_type1_set_pasid_table)
> + 

Re: [PATCH v11 03/13] vfio: VFIO_IOMMU_SET_MSI_BINDING

2021-02-02 Thread Keqian Zhu
Hi Eric,

On 2020/11/16 19:00, Eric Auger wrote:
> This patch adds the VFIO_IOMMU_SET_MSI_BINDING ioctl which aim
> to (un)register the guest MSI binding to the host. This latter
> then can use those stage 1 bindings to build a nested stage
> binding targeting the physical MSIs.
[...]

> +static int vfio_iommu_type1_set_msi_binding(struct vfio_iommu *iommu,
> + unsigned long arg)
> +{
> + struct vfio_iommu_type1_set_msi_binding msi_binding;
> + unsigned long minsz;
> + int ret = -EINVAL;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_set_msi_binding,
> + size);
> +
> + if (copy_from_user(&msi_binding, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (msi_binding.argsz < minsz)
> + return -EINVAL;
We could check that BIND and UNBIND are not set simultaneously, just like
VFIO_IOMMU_SET_PASID_TABLE does.
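e.g. something like this (sketch only, assuming the two flags are separate
bits as in the PASID table ioctl):

	if (msi_binding.flags & VFIO_IOMMU_BIND_MSI &&
	    msi_binding.flags & VFIO_IOMMU_UNBIND_MSI)
		return -EINVAL;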

> +
> + if (msi_binding.flags == VFIO_IOMMU_UNBIND_MSI) {
> + vfio_unbind_msi(iommu, msi_binding.iova);
> + ret = 0;
> + } else if (msi_binding.flags == VFIO_IOMMU_BIND_MSI) {
> + ret = vfio_bind_msi(iommu, msi_binding.iova,
> + msi_binding.gpa, msi_binding.size);
> + }
> + return ret;
> +}
> +

Thanks,
Keqian
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v13 06/15] iommu/smmuv3: Implement attach/detach_pasid_table

2021-02-02 Thread Keqian Zhu
Hi Eric,

On 2020/11/18 19:21, Eric Auger wrote:
> On attach_pasid_table() we program STE S1 related info set
> by the guest into the actual physical STEs. At minimum
> we need to program the context descriptor GPA and compute
> whether the stage1 is translated/bypassed or aborted.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> v7 -> v8:
> - remove smmu->features check, now done on domain finalize
> 
> v6 -> v7:
> - check versions and comment the fact we don't need to take
>   into account s1dss and s1fmt
> v3 -> v4:
> - adapt to changes in iommu_pasid_table_config
> - different programming convention at s1_cfg/s2_cfg/ste.abort
> 
> v2 -> v3:
> - callback now is named set_pasid_table and struct fields
>   are laid out differently.
> 
> v1 -> v2:
> - invalidate the STE before changing them
> - hold init_mutex
> - handle new fields
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 89 +
>  1 file changed, 89 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 412ea1bafa50..805acdc18a3a 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2661,6 +2661,93 @@ static void arm_smmu_get_resv_regions(struct device 
> *dev,
>   iommu_dma_get_resv_regions(dev, head);
>  }
>  
> +static int arm_smmu_attach_pasid_table(struct iommu_domain *domain,
> +struct iommu_pasid_table_config *cfg)
> +{
> + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> + struct arm_smmu_master *master;
> + struct arm_smmu_device *smmu;
> + unsigned long flags;
> + int ret = -EINVAL;
> +
> + if (cfg->format != IOMMU_PASID_FORMAT_SMMUV3)
> + return -EINVAL;
> +
> + if (cfg->version != PASID_TABLE_CFG_VERSION_1 ||
> + cfg->vendor_data.smmuv3.version != PASID_TABLE_SMMUV3_CFG_VERSION_1)
> + return -EINVAL;
> +
> + mutex_lock(&smmu_domain->init_mutex);
> +
> + smmu = smmu_domain->smmu;
> +
> + if (!smmu)
> + goto out;
> +
> + if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + goto out;
> +
> + switch (cfg->config) {
> + case IOMMU_PASID_CONFIG_ABORT:
> + smmu_domain->s1_cfg.set = false;
> + smmu_domain->abort = true;
> + break;
> + case IOMMU_PASID_CONFIG_BYPASS:
> + smmu_domain->s1_cfg.set = false;
> + smmu_domain->abort = false;
I didn't test it, but it seems that this will trigger the BUG() in
arm_smmu_write_strtab_ent(), at the line "BUG_ON(ste_live && !nested);".
Maybe I'm missing something?

> + break;
> + case IOMMU_PASID_CONFIG_TRANSLATE:
> + /* we do not support S1 <-> S1 transitions */
> + if (smmu_domain->s1_cfg.set)
> + goto out;
> +
> + /*
> +  * we currently support a single CD so s1fmt and s1dss
> +  * fields are also ignored
> +  */
> + if (cfg->pasid_bits)
> + goto out;
> +
> + smmu_domain->s1_cfg.cdcfg.cdtab_dma = cfg->base_ptr;
> + smmu_domain->s1_cfg.set = true;
> + smmu_domain->abort = false;
> + break;
> + default:
> + goto out;
> + }
> + spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> + list_for_each_entry(master, &smmu_domain->devices, domain_head)
> + arm_smmu_install_ste_for_dev(master);
> + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> + ret = 0;
> +out:
> + mutex_unlock(&smmu_domain->init_mutex);
> + return ret;
> +}
> +
[...]

Thanks,
Keqian
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v13 03/15] iommu/arm-smmu-v3: Maintain a SID->device structure

2021-02-01 Thread Keqian Zhu
Hi Eric,

On 2021/2/2 1:19, Auger Eric wrote:
> Hi Keqian,
> 
> On 2/1/21 1:26 PM, Keqian Zhu wrote:
>> Hi Eric,
>>
>> On 2020/11/18 19:21, Eric Auger wrote:
>>> From: Jean-Philippe Brucker 
>>>
>>> When handling faults from the event or PRI queue, we need to find the
>>> struct device associated to a SID. Add a rb_tree to keep track of SIDs.
>>>
>>> Signed-off-by: Jean-Philippe Brucker 
>> [...]
>>
>>>  }
>>>  
>>> +static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
>>> + struct arm_smmu_master *master)
[...]

>>> kfree(master);
>>
>> Thanks,
>> Keqian
>>
> Thank you for the review. Jean will address this issues in his own
> series and on my end I will rebase on this latter.
> 
> Best Regards
> 
> Eric
>

Yeah, and I hope this series can be accepted soon ;-)

Thanks,
Keqian
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v13 05/15] iommu/smmuv3: Get prepared for nested stage support

2021-02-01 Thread Keqian Zhu
Hi Eric,

On 2020/11/18 19:21, Eric Auger wrote:
> When nested stage translation is setup, both s1_cfg and
> s2_cfg are set.
> 
> We introduce a new smmu domain abort field that will be set
> upon guest stage1 configuration passing.
> 
> arm_smmu_write_strtab_ent() is modified to write both stage
> fields in the STE and deal with the abort field.
> 
> In nested mode, only stage 2 is "finalized" as the host does
> not own/configure the stage 1 context descriptor; guest does.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> v10 -> v11:
> - Fix an issue reported by Shameer when switching from with vSMMU
>   to without vSMMU. Despite the spec does not seem to mention it
>   seems to be needed to reset the 2 high 64b when switching from
>   S1+S2 cfg to S1 only. Especially dst[3] needs to be reset (S2TTB).
>   On some implementations, if the S2TTB is not reset, this causes
>   a C_BAD_STE error
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 64 +
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 +
>  2 files changed, 56 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 18ac5af1b284..412ea1bafa50 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1181,8 +1181,10 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>* three cases at the moment:
>*
>* 1. Invalid (all zero) -> bypass/fault (init)
> -  * 2. Bypass/fault -> translation/bypass (attach)
> -  * 3. Translation/bypass -> bypass/fault (detach)
> +  * 2. Bypass/fault -> single stage translation/bypass (attach)
> +  * 3. Single or nested stage Translation/bypass -> bypass/fault (detach)
> +  * 4. S2 -> S1 + S2 (attach_pasid_table)
> +  * 5. S1 + S2 -> S2 (detach_pasid_table)

The following line "BUG_ON(ste_live && !nested);" forbids this transition.
And looking at the 6th patch, the transition actually seems to be S1 + S2 -> abort.
So after detach, the state is not the same as it was before attach. Does that
match our expectation?

>*
>* Given that we can't update the STE atomically and the SMMU
>* doesn't read the thing in a defined order, that leaves us
> @@ -1193,7 +1195,8 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>* 3. Update Config, sync
>*/
>   u64 val = le64_to_cpu(dst[0]);
> - bool ste_live = false;
> + bool s1_live = false, s2_live = false, ste_live;
> + bool abort, nested = false, translate = false;
>   struct arm_smmu_device *smmu = NULL;
>   struct arm_smmu_s1_cfg *s1_cfg;
>   struct arm_smmu_s2_cfg *s2_cfg;
> @@ -1233,6 +1236,8 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>   default:
>   break;
>   }
> + nested = s1_cfg->set && s2_cfg->set;
> + translate = s1_cfg->set || s2_cfg->set;
>   }
>  
>   if (val & STRTAB_STE_0_V) {
> @@ -1240,23 +1245,36 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>   case STRTAB_STE_0_CFG_BYPASS:
>   break;
>   case STRTAB_STE_0_CFG_S1_TRANS:
> + s1_live = true;
> + break;
>   case STRTAB_STE_0_CFG_S2_TRANS:
> - ste_live = true;
> + s2_live = true;
> + break;
> + case STRTAB_STE_0_CFG_NESTED:
> + s1_live = true;
> + s2_live = true;
>   break;
>   case STRTAB_STE_0_CFG_ABORT:
> - BUG_ON(!disable_bypass);
>   break;
>   default:
>   BUG(); /* STE corruption */
>   }
>   }
>  
> + ste_live = s1_live || s2_live;
> +
>   /* Nuke the existing STE_0 value, as we're going to rewrite it */
>   val = STRTAB_STE_0_V;
>  
>   /* Bypass/fault */
> - if (!smmu_domain || !(s1_cfg->set || s2_cfg->set)) {
> - if (!smmu_domain && disable_bypass)
> +
> + if (!smmu_domain)
> + abort = disable_bypass;
> + else
> + abort = smmu_domain->abort;
> +
> + if (abort || !translate) {
> + if (abort)
>   val |= FIELD_PREP(STRTAB_STE_0_CFG, 
> STRTAB_STE_0_CFG_ABORT);
>   else
>   val |= FIELD_PREP(STRTAB_STE_0_CFG, 
> STRTAB_STE_0_CFG_BYPASS);
> @@ -1274,8 +1292,16 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>   return;
>   }
>  
> + BUG_ON(ste_live && !nested);
> +
> + if (ste_live) {
> + /* First invalidate the live STE */
> + dst[0] = cpu_to_le64(STRTAB_STE_0_CFG_ABORT);
> + 

Re: [PATCH v13 03/15] iommu/arm-smmu-v3: Maintain a SID->device structure

2021-02-01 Thread Keqian Zhu
Hi Jean,

On 2021/2/1 23:15, Jean-Philippe Brucker wrote:
> On Mon, Feb 01, 2021 at 08:26:41PM +0800, Keqian Zhu wrote:
>>> +static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
>>> + struct arm_smmu_master *master)
>>> +{
>>> +   int i;
>>> +   int ret = 0;
>>> +   struct arm_smmu_stream *new_stream, *cur_stream;
>>> +   struct rb_node **new_node, *parent_node = NULL;
>>> +   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
>>> +
>>> +   master->streams = kcalloc(fwspec->num_ids,
>>> + sizeof(struct arm_smmu_stream), GFP_KERNEL);
>>> +   if (!master->streams)
>>> +   return -ENOMEM;
>>> +   master->num_streams = fwspec->num_ids;
>> This is not rolled back on failure.
> 
> No need, the caller frees master
OK.

> 
>>> +
>>> +   mutex_lock(>streams_mutex);
>>> +   for (i = 0; i < fwspec->num_ids && !ret; i++) {
>> Checking ret here makes it hard to decide the start index of the rollback.
>>
>> If we fail here, the start index is (i-2).
>> If we fail inside the loop, the start index is (i-1).
>>
> [...]
>>> +   if (ret) {
>>> +   for (; i > 0; i--)
>> Should this be (i >= 0)?
>> And the start index seems incorrect.
> 
> Indeed, this whole bit is wrong. I'll fix it while resending the IOPF
> series.
> 
> Thanks,
> Jean
OK, I am glad it helps.

Thanks,
Keqian
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log

2021-02-01 Thread Keqian Zhu



On 2021/2/1 21:17, Marc Zyngier wrote:
> On 2021-02-01 13:12, Keqian Zhu wrote:
>> Hi Marc,
>>
>> Do you have time to have a look at this? Thanks ;-)
> 
> Not immediately. I'm busy with stuff that is planned to go
> in 5.12, which isn't the case for this series. I'll get to
> it eventually.
> 
> Thanks,
> 
> M.
Sure, there is no hurry. Please concentrate on your urgent work first. ;-)
Thanks.

Keqian.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log

2021-02-01 Thread Keqian Zhu
Hi Marc,

Do you have time to have a look at this? Thanks ;-)

Keqian.

On 2021/1/26 20:44, Keqian Zhu wrote:
> The intention:
> 
> On the arm64 platform, we track the dirty log of vCPUs through guest memory
> aborts. KVM occupies some of the guest's vCPU time to change the stage2
> mapping and mark pages dirty. This has a heavy side effect on the VM,
> especially when multiple vCPUs race and some of them block on the kvm mmu_lock.
> 
> DBM is a HW auxiliary approach to log dirty. The MMU changes a PTE to be
> writable if its DBM bit is set. Then KVM doesn't need to occupy vCPU time to
> log dirty.
> 
> About this patch series:
> 
> The biggest problem of applying DBM for stage2 is that software must scan PTs
> to collect the dirty state, which may cost much time and affect the downtime
> of migration.
> 
> This series realize a SW/HW combined dirty log that can effectively solve this
> problem (The smmu side can also use this approach to solve dma dirty log 
> tracking).
> 
> The core idea is that we do not enable hardware dirty tracking at the start
> (we do not add the DBM bit). When an arbitrary PT takes a fault, we perform
> software tracking for this PT and enable hardware tracking for its *nearby*
> PTs (e.g. add the DBM bit for the nearby 16 PTs). Then, when syncing the
> dirty log, we already know all the PTs that have hardware dirty tracking
> enabled, so we do not need to scan all PTs.
> 
>     mem abort point                     mem abort point
>           ↓                                   ↓
>  ---------------------------------------------------------
>  |       |       |       |       |       |       |       |
>  ---------------------------------------------------------
>           ↑                                   ↑
>     set DBM bit of                      set DBM bit of
>  this PT section (64PTEs)          this PT section (64PTEs)
> 
> We may worry that when the dirty rate is very high we still need to scan too
> many PTs. We are mainly concerned about the VM stop time. With QEMU dirty rate
> throttling, the dirty memory is close to the VM stop threshold, so there are
> only a few PTs to scan after the VM stops.
> 
> It has the advantage of hardware tracking, which minimizes the side effect on
> vCPUs, and also the advantage of software tracking, which controls the vCPU
> dirty rate. Moreover, software tracking lets us scan PTs at some fixed points,
> which greatly reduces scanning time. And the biggest benefit is that we can
> apply this solution to DMA dirty tracking.
> 
> Test:
> 
> Host: Kunpeng 920 with 128 CPUs and 512G RAM. Transparent Hugepage disabled
>   (to ensure the test result is not affected by dissolving of block page
>   tables at the early stage of migration).
> VM:   16 CPUs, 16GB RAM. Runs 4 pairs of (redis_benchmark + redis_server).
> 
> Each configuration was run 5 times, for software dirty log and SW/HW combined dirty log.
> 
> Test result:
> 
> Gained a 5%~7% improvement in redis QPS during VM migration.
> VM downtime is not fundamentally affected.
> About 56.7% of the DBM bits were effectively used.
> 
> Keqian Zhu (7):
>   arm64: cpufeature: Add API to report system support of HWDBM
>   kvm: arm64: Use atomic operation when update PTE
>   kvm: arm64: Add level_apply parameter for stage2_attr_walker
>   kvm: arm64: Add some HW_DBM related pgtable interfaces
>   kvm: arm64: Add some HW_DBM related mmu interfaces
>   kvm: arm64: Only write protect selected PTE
>   kvm: arm64: Start up SW/HW combined dirty log
> 
>  arch/arm64/include/asm/cpufeature.h  |  12 +++
>  arch/arm64/include/asm/kvm_host.h|   6 ++
>  arch/arm64/include/asm/kvm_mmu.h |   7 ++
>  arch/arm64/include/asm/kvm_pgtable.h |  45 ++
>  arch/arm64/kvm/arm.c | 125 ++
>  arch/arm64/kvm/hyp/pgtable.c | 130 ++-
>  arch/arm64/kvm/mmu.c |  47 +-
>  arch/arm64/kvm/reset.c   |   8 +-
>  8 files changed, 351 insertions(+), 29 deletions(-)
> 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v13 04/15] iommu/smmuv3: Allow s1 and s2 configs to coexist

2021-02-01 Thread Keqian Zhu
Hi Eric,

On 2020/11/18 19:21, Eric Auger wrote:
> In true nested mode, both s1_cfg and s2_cfg will coexist.
> Let's remove the union and add a "set" field in each
> config structure telling whether the config is set and needs
> to be applied when writing the STE. In legacy nested mode,
> only the 2d stage is used. In true nested mode, the "set" field
nit: s/2d/2nd

> will be set when the guest passes the pasid table.
nit: ... the "set" filed of s1_cfg and s2_cfg will be set ...

> 
> Signed-off-by: Eric Auger 
> 
> ---
> v12 -> v13:
> - does not dynamically allocate s1-cfg and s2_cfg anymore. Add
>   the set field
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 43 +
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 ++--
>  2 files changed, 31 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 1e4acc7f3d3c..18ac5af1b284 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1195,8 +1195,8 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>   u64 val = le64_to_cpu(dst[0]);
>   bool ste_live = false;
>   struct arm_smmu_device *smmu = NULL;
> - struct arm_smmu_s1_cfg *s1_cfg = NULL;
> - struct arm_smmu_s2_cfg *s2_cfg = NULL;
> + struct arm_smmu_s1_cfg *s1_cfg;
> + struct arm_smmu_s2_cfg *s2_cfg;
>   struct arm_smmu_domain *smmu_domain = NULL;
>   struct arm_smmu_cmdq_ent prefetch_cmd = {
>   .opcode = CMDQ_OP_PREFETCH_CFG,
> @@ -1211,13 +1211,24 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>   }
>  
>   if (smmu_domain) {
> + s1_cfg = &smmu_domain->s1_cfg;
> + s2_cfg = &smmu_domain->s2_cfg;
> +
>   switch (smmu_domain->stage) {
>   case ARM_SMMU_DOMAIN_S1:
> - s1_cfg = _domain->s1_cfg;
> + s1_cfg->set = true;
> + s2_cfg->set = false;
>   break;
>   case ARM_SMMU_DOMAIN_S2:
> + s1_cfg->set = false;
> + s2_cfg->set = true;
> + break;
>   case ARM_SMMU_DOMAIN_NESTED:
> - s2_cfg = _domain->s2_cfg;
> + /*
> +  * Actual usage of stage 1 depends on nested mode:
> +  * legacy (2d stage only) or true nested mode
> +  */
> + s2_cfg->set = true;
>   break;
>   default:
>   break;
> @@ -1244,7 +1255,7 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>   val = STRTAB_STE_0_V;
>  
>   /* Bypass/fault */
> - if (!smmu_domain || !(s1_cfg || s2_cfg)) {
> + if (!smmu_domain || !(s1_cfg->set || s2_cfg->set)) {
>   if (!smmu_domain && disable_bypass)
>   val |= FIELD_PREP(STRTAB_STE_0_CFG, 
> STRTAB_STE_0_CFG_ABORT);
>   else
> @@ -1263,7 +1274,7 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>   return;
>   }
>  
> - if (s1_cfg) {
> + if (s1_cfg->set) {
>   BUG_ON(ste_live);
>   dst[1] = cpu_to_le64(
>FIELD_PREP(STRTAB_STE_1_S1DSS, 
> STRTAB_STE_1_S1DSS_SSID0) |
> @@ -1282,7 +1293,7 @@ static void arm_smmu_write_strtab_ent(struct 
> arm_smmu_master *master, u32 sid,
>   FIELD_PREP(STRTAB_STE_0_S1FMT, s1_cfg->s1fmt);
>   }
>  
> - if (s2_cfg) {
> + if (s2_cfg->set) {
>   BUG_ON(ste_live);
>   dst[2] = cpu_to_le64(
>FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
> @@ -1846,24 +1857,24 @@ static void arm_smmu_domain_free(struct iommu_domain 
> *domain)
>  {
>   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>   struct arm_smmu_device *smmu = smmu_domain->smmu;
> + struct arm_smmu_s1_cfg *s1_cfg = _domain->s1_cfg;
> + struct arm_smmu_s2_cfg *s2_cfg = _domain->s2_cfg;
>  
>   iommu_put_dma_cookie(domain);
>   free_io_pgtable_ops(smmu_domain->pgtbl_ops);
>  
>   /* Free the CD and ASID, if we allocated them */
> - if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
> - struct arm_smmu_s1_cfg *cfg = _domain->s1_cfg;
> -
> + if (s1_cfg->set) {
>   /* Prevent SVA from touching the CD while we're freeing it */
>   mutex_lock(_smmu_asid_lock);
> - if (cfg->cdcfg.cdtab)
> + if (s1_cfg->cdcfg.cdtab)
>   arm_smmu_free_cd_tables(smmu_domain);
> - arm_smmu_free_asid(>cd);
> + arm_smmu_free_asid(_cfg->cd);
>   mutex_unlock(_smmu_asid_lock);
> - } else {
> - struct arm_smmu_s2_cfg *cfg = 

Re: [PATCH v13 03/15] iommu/arm-smmu-v3: Maintain a SID->device structure

2021-02-01 Thread Keqian Zhu
Hi Eric,

On 2020/11/18 19:21, Eric Auger wrote:
> From: Jean-Philippe Brucker 
> 
> When handling faults from the event or PRI queue, we need to find the
> struct device associated to a SID. Add a rb_tree to keep track of SIDs.
> 
> Signed-off-by: Jean-Philippe Brucker 
[...]

>  }
>  
> +static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
> +   struct arm_smmu_master *master)
> +{
> + int i;
> + int ret = 0;
> + struct arm_smmu_stream *new_stream, *cur_stream;
> + struct rb_node **new_node, *parent_node = NULL;
> + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
> +
> + master->streams = kcalloc(fwspec->num_ids,
> +   sizeof(struct arm_smmu_stream), GFP_KERNEL);
> + if (!master->streams)
> + return -ENOMEM;
> + master->num_streams = fwspec->num_ids;
This is not rolled back on failure.

> +
> + mutex_lock(>streams_mutex);
> + for (i = 0; i < fwspec->num_ids && !ret; i++) {
Checking ret here makes it hard to decide the start index of the rollback.

If we fail here, the start index is (i-2).
If we fail inside the loop, the start index is (i-1).

> + u32 sid = fwspec->ids[i];
> +
> + new_stream = >streams[i];
> + new_stream->id = sid;
> + new_stream->master = master;
> +
> + /*
> +  * Check the SIDs are in range of the SMMU and our stream table
> +  */
> + if (!arm_smmu_sid_in_range(smmu, sid)) {
> + ret = -ERANGE;
> + break;
> + }
> +
> + /* Ensure l2 strtab is initialised */
> + if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
> + ret = arm_smmu_init_l2_strtab(smmu, sid);
> + if (ret)
> + break;
> + }
> +
> + /* Insert into SID tree */
> + new_node = &(smmu->streams.rb_node);
> + while (*new_node) {
> + cur_stream = rb_entry(*new_node, struct arm_smmu_stream,
> +   node);
> + parent_node = *new_node;
> + if (cur_stream->id > new_stream->id) {
> + new_node = &((*new_node)->rb_left);
> + } else if (cur_stream->id < new_stream->id) {
> + new_node = &((*new_node)->rb_right);
> + } else {
> + dev_warn(master->dev,
> +  "stream %u already in tree\n",
> +  cur_stream->id);
> + ret = -EINVAL;
> + break;
> + }
> + }
> +
> + if (!ret) {
> + rb_link_node(&new_stream->node, parent_node, new_node);
> + rb_insert_color(&new_stream->node, &smmu->streams);
> + }
> + }
> +
> + if (ret) {
> + for (; i > 0; i--)
Should this be (i >= 0)?
And the start index seems incorrect.
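For example, something like the below (untested sketch, reusing the names from
this patch) keeps the rollback index unambiguous by breaking out right where
the failure happens, so that on exit 'i' always points at the first entry that
was *not* inserted:

	for (i = 0; i < fwspec->num_ids; i++) {
		/* ... SID range check / l2 strtab init / rb search may set ret ... */
		if (ret)
			break;		/* streams[i] was NOT inserted */

		rb_link_node(&new_stream->node, parent_node, new_node);
		rb_insert_color(&new_stream->node, &smmu->streams);
	}

	if (ret) {
		while (i--)		/* erases streams[i-1] ... streams[0] */
			rb_erase(&master->streams[i].node, &smmu->streams);
		kfree(master->streams);
	}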

> + rb_erase(&master->streams[i].node, &smmu->streams);
> + kfree(master->streams);
> + }
> + mutex_unlock(>streams_mutex);
> +
> + return ret;
> +}
> +
> +static void arm_smmu_remove_master(struct arm_smmu_master *master)
> +{
> + int i;
> + struct arm_smmu_device *smmu = master->smmu;
> + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
> +
> + if (!smmu || !master->streams)
> + return;
> +
> + mutex_lock(>streams_mutex);
> + for (i = 0; i < fwspec->num_ids; i++)
> + rb_erase(>streams[i].node, >streams);
> + mutex_unlock(>streams_mutex);
> +
> + kfree(master->streams);
> +}
> +
>  static struct iommu_ops arm_smmu_ops;
>  
>  static struct iommu_device *arm_smmu_probe_device(struct device *dev)
>  {
> - int i, ret;
> + int ret;
>   struct arm_smmu_device *smmu;
>   struct arm_smmu_master *master;
>   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> @@ -2331,27 +2447,12 @@ static struct iommu_device 
> *arm_smmu_probe_device(struct device *dev)
>  
>   master->dev = dev;
>   master->smmu = smmu;
> - master->sids = fwspec->ids;
> - master->num_sids = fwspec->num_ids;
>   INIT_LIST_HEAD(>bonds);
>   dev_iommu_priv_set(dev, master);
>  
> - /* Check the SIDs are in range of the SMMU and our stream table */
> - for (i = 0; i < master->num_sids; i++) {
> - u32 sid = master->sids[i];
> -
> - if (!arm_smmu_sid_in_range(smmu, sid)) {
> - ret = -ERANGE;
> - goto err_free_master;
> - }
> -
> - /* Ensure l2 strtab is initialised */
> - if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
> - ret = arm_smmu_init_l2_strtab(smmu, 

Re: [PATCH v13 02/15] iommu: Introduce bind/unbind_guest_msi

2021-02-01 Thread Keqian Zhu
Hi Eric,

On 2020/11/18 19:21, Eric Auger wrote:
> On ARM, MSI are translated by the SMMU. An IOVA is allocated
> for each MSI doorbell. If both the host and the guest are exposed
> with SMMUs, we end up with 2 different IOVAs allocated by each.
> guest allocates an IOVA (gIOVA) to map onto the guest MSI
> doorbell (gDB). The Host allocates another IOVA (hIOVA) to map
> onto the physical doorbell (hDB).
> 
> So we end up with 2 untied mappings:
>  S1S2
> gIOVA->gDB
>   hIOVA->hDB
> 
> Currently the PCI device is programmed by the host with hIOVA
> as MSI doorbell. So this does not work.
> 
> This patch introduces an API to pass gIOVA/gDB to the host so
> that gIOVA can be reused by the host instead of re-allocating
> a new IOVA. So the goal is to create the following nested mapping:
Can the gDB be reused under non-nested mode?

> 
>  S1S2
> gIOVA->gDB ->hDB
> 
> and program the PCI device with gIOVA MSI doorbell.
> 
> In case we have several devices attached to this nested domain
> (devices belonging to the same group), they cannot be isolated
> on guest side either. So they should also end up in the same domain
> on guest side. We will enforce that all the devices attached to
> the host iommu domain use the same physical doorbell and similarly
> a single virtual doorbell mapping gets registered (1 single
> virtual doorbell is used on guest as well).
> 
[...]

> + *
> + * The associated IOVA can be reused by the host to create a nested
> + * stage2 binding mapping translating into the physical doorbell used
> + * by the devices attached to the domain.
> + *
> + * All devices within the domain must share the same physical doorbell.
> + * A single MSI GIOVA/GPA mapping can be attached to an iommu_domain.
> + */
> +
> +int iommu_bind_guest_msi(struct iommu_domain *domain,
> +  dma_addr_t giova, phys_addr_t gpa, size_t size)
> +{
> + if (unlikely(!domain->ops->bind_guest_msi))
> + return -ENODEV;
> +
> + return domain->ops->bind_guest_msi(domain, giova, gpa, size);
> +}
> +EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
> +
> +void iommu_unbind_guest_msi(struct iommu_domain *domain,
> + dma_addr_t iova)
nit: s/iova/giova

> +{
> + if (unlikely(!domain->ops->unbind_guest_msi))
> + return;
> +
> + domain->ops->unbind_guest_msi(domain, iova);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unbind_guest_msi);
> +
[...]

Thanks,
Keqian

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v13 01/15] iommu: Introduce attach/detach_pasid_table API

2021-02-01 Thread Keqian Zhu
Hi Eric,

On 2020/11/18 19:21, Eric Auger wrote:
> In virtualization use case, when a guest is assigned
> a PCI host device, protected by a virtual IOMMU on the guest,
> the physical IOMMU must be programmed to be consistent with
> the guest mappings. If the physical IOMMU supports two
> translation stages it makes sense to program guest mappings
> onto the first stage/level (ARM/Intel terminology) while the host
> owns the stage/level 2.
> 
> In that case, it is mandated to trap on guest configuration
> settings and pass those to the physical iommu driver.
> 
> This patch adds a new API to the iommu subsystem that allows
> to set/unset the pasid table information.
> 
> A generic iommu_pasid_table_config struct is introduced in
> a new iommu.h uapi header. This is going to be used by the VFIO
> user API.
> 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Ashok Raj 
> Signed-off-by: Jacob Pan 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v12 -> v13:
> - Fix config check
> 
> v11 -> v12:
> - add argsz, name the union
> ---
>  drivers/iommu/iommu.c  | 68 ++
>  include/linux/iommu.h  | 21 
>  include/uapi/linux/iommu.h | 54 ++
>  3 files changed, 143 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index b53446bb8c6b..978fe34378fb 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2171,6 +2171,74 @@ int iommu_uapi_sva_unbind_gpasid(struct iommu_domain 
> *domain, struct device *dev
>  }
>  EXPORT_SYMBOL_GPL(iommu_uapi_sva_unbind_gpasid);
>  
> +int iommu_attach_pasid_table(struct iommu_domain *domain,
> +  struct iommu_pasid_table_config *cfg)
> +{
> + if (unlikely(!domain->ops->attach_pasid_table))
> + return -ENODEV;
> +
> + return domain->ops->attach_pasid_table(domain, cfg);
> +}
Missing EXPORT_SYMBOL_GPL here?
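i.e. presumably this wants the same treatment as the other helpers in this
file (just a guess at the intended line):

EXPORT_SYMBOL_GPL(iommu_attach_pasid_table);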

> +
> +int iommu_uapi_attach_pasid_table(struct iommu_domain *domain,
> +   void __user *uinfo)
> +{
> + struct iommu_pasid_table_config pasid_table_data = { 0 };
> + u32 minsz;
> +
> + if (unlikely(!domain->ops->attach_pasid_table))
> + return -ENODEV;
> +
> + /*
> +  * No new spaces can be added before the variable sized union, the
> +  * minimum size is the offset to the union.
> +  */
> + minsz = offsetof(struct iommu_pasid_table_config, vendor_data);
> +
> + /* Copy minsz from user to get flags and argsz */
> + if (copy_from_user(_table_data, uinfo, minsz))
> + return -EFAULT;
> +
> + /* Fields before the variable size union are mandatory */
> + if (pasid_table_data.argsz < minsz)
> + return -EINVAL;
> +
> + /* PASID and address granu require additional info beyond minsz */
> + if (pasid_table_data.version != PASID_TABLE_CFG_VERSION_1)
> + return -EINVAL;
> + if (pasid_table_data.format == IOMMU_PASID_FORMAT_SMMUV3 &&
> + pasid_table_data.argsz <
> + offsetofend(struct iommu_pasid_table_config, 
> vendor_data.smmuv3))
> + return -EINVAL;
> +
> + /*
> +  * User might be using a newer UAPI header which has a larger data
> +  * size, we shall support the existing flags within the current
> +  * size. Copy the remaining user data _after_ minsz but not more
> +  * than the current kernel supported size.
> +  */
> + if (copy_from_user((void *)_table_data + minsz, uinfo + minsz,
> +min_t(u32, pasid_table_data.argsz, 
> sizeof(pasid_table_data)) - minsz))
> + return -EFAULT;
> +
> + /* Now the argsz is validated, check the content */
> + if (pasid_table_data.config < IOMMU_PASID_CONFIG_TRANSLATE ||
> + pasid_table_data.config > IOMMU_PASID_CONFIG_ABORT)
> + return -EINVAL;
> +
> + return domain->ops->attach_pasid_table(domain, _table_data);
> +}
> +EXPORT_SYMBOL_GPL(iommu_uapi_attach_pasid_table);
> +
> +void iommu_detach_pasid_table(struct iommu_domain *domain)
> +{
> + if (unlikely(!domain->ops->detach_pasid_table))
> + return;
> +
> + domain->ops->detach_pasid_table(domain);
> +}
> +EXPORT_SYMBOL_GPL(iommu_detach_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
> struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index b95a6f8db6ff..464fcbecf841 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -223,6 +223,8 @@ struct iommu_iotlb_gather {
>   * @cache_invalidate: invalidate translation caches
>   * @sva_bind_gpasid: bind guest pasid and mm
>   * @sva_unbind_gpasid: unbind guest pasid and mm
> + * @attach_pasid_table: attach a pasid table
> + * @detach_pasid_table: detach the pasid table
>   * @def_domain_type: device default domain type, return value:
>   *   - IOMMU_DOMAIN_IDENTITY: 

Re: [PATCH v3 2/2] vfio/iommu_type1: Fix some sanity checks in detach group

2021-01-28 Thread Keqian Zhu



On 2021/1/28 7:46, Alex Williamson wrote:
> On Fri, 22 Jan 2021 17:26:35 +0800
> Keqian Zhu  wrote:
> 
>> vfio_sanity_check_pfn_list() is used to check whether pfn_list and
>> notifier are empty when remove the external domain, so it makes a
>> wrong assumption that only external domain will use the pinning
>> interface.
>>
>> Now we apply the pfn_list check when a vfio_dma is removed and apply
>> the notifier check when all domains are removed.
>>
>> Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 33 ++---
>>  1 file changed, 10 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 161725395f2f..d8c10f508321 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -957,6 +957,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
>> struct vfio_dma *dma,
>>  
>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>> +WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
>>  vfio_unmap_unpin(iommu, dma, true);
>>  vfio_unlink_dma(iommu, dma);
>>  put_task_struct(dma->task);
>> @@ -2250,23 +2251,6 @@ static void vfio_iommu_unmap_unpin_reaccount(struct 
>> vfio_iommu *iommu)
>>  }
>>  }
>>  
>> -static void vfio_sanity_check_pfn_list(struct vfio_iommu *iommu)
>> -{
>> -struct rb_node *n;
>> -
>> -n = rb_first(>dma_list);
>> -for (; n; n = rb_next(n)) {
>> -struct vfio_dma *dma;
>> -
>> -dma = rb_entry(n, struct vfio_dma, node);
>> -
>> -if (WARN_ON(!RB_EMPTY_ROOT(>pfn_list)))
>> -break;
>> -}
>> -/* mdev vendor driver must unregister notifier */
>> -WARN_ON(iommu->notifier.head);
>> -}
>> -
>>  /*
>>   * Called when a domain is removed in detach. It is possible that
>>   * the removed domain decided the iova aperture window. Modify the
>> @@ -2366,10 +2350,10 @@ static void vfio_iommu_type1_detach_group(void 
>> *iommu_data,
>>  kfree(group);
>>  
>>  if (list_empty(>external_domain->group_list)) {
>> -vfio_sanity_check_pfn_list(iommu);
>> -
>> -if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>> +if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
>> +WARN_ON(iommu->notifier.head);
>>  vfio_iommu_unmap_unpin_all(iommu);
>> +}
>>  
>>  kfree(iommu->external_domain);
>>  iommu->external_domain = NULL;
>> @@ -2403,10 +2387,12 @@ static void vfio_iommu_type1_detach_group(void 
>> *iommu_data,
>>   */
>>  if (list_empty(>group_list)) {
>>  if (list_is_singular(>domain_list)) {
>> -if (!iommu->external_domain)
>> +if (!iommu->external_domain) {
>> +WARN_ON(iommu->notifier.head);
>>  vfio_iommu_unmap_unpin_all(iommu);
>> -else
>> +} else {
>>  vfio_iommu_unmap_unpin_reaccount(iommu);
>> +}
>>  }
>>  iommu_domain_free(domain->domain);
>>  list_del(>next);
>> @@ -2488,9 +2474,10 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>  struct vfio_iommu *iommu = iommu_data;
>>  struct vfio_domain *domain, *domain_tmp;
>>  
>> +WARN_ON(iommu->notifier.head);
> 
> I don't see that this does any harm, but isn't it actually redundant?
> It seems vfio-core only calls the iommu backend release function after
> removing all the groups, so the tests in _detach_group should catch all
> cases.  We're expecting the vfio bus/mdev driver to remove the notifier
> when a device is closed, which necessarily occurs before detaching the
> group.  Thanks,
Right. The devices of a specific group must be closed before detaching that group.
Detaching the last group already checks this, so vfio_iommu_type1_release doesn't
need to do the check again.

Could you please queue this patch and delete this check, by the way? Thanks. ;-)

Keqian.

> 
> Alex
> 
>> +
>>  if (iommu->external_domain) {
>>  vfio_release_domain(iommu->external_domain, true);
>> -vfio_sanity_check_pfn_list(iommu);
>>  kfree(iommu->external_domain);
>>  }
>>  
> 
> .
> 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v3 2/2] vfio/iommu_type1: Fix some sanity checks in detach group

2021-01-28 Thread Keqian Zhu



On 2021/1/28 7:46, Alex Williamson wrote:
> On Fri, 22 Jan 2021 17:26:35 +0800
> Keqian Zhu  wrote:
> 
>> vfio_sanity_check_pfn_list() is used to check whether pfn_list and
>> notifier are empty when remove the external domain, so it makes a
>> wrong assumption that only external domain will use the pinning
>> interface.
>>
>> Now we apply the pfn_list check when a vfio_dma is removed and apply
>> the notifier check when all domains are removed.
>>
>> Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 33 ++---
>>  1 file changed, 10 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 161725395f2f..d8c10f508321 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -957,6 +957,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
>> struct vfio_dma *dma,
>>  
>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>> +WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
>>  vfio_unmap_unpin(iommu, dma, true);
>>  vfio_unlink_dma(iommu, dma);
>>  put_task_struct(dma->task);
>> @@ -2250,23 +2251,6 @@ static void vfio_iommu_unmap_unpin_reaccount(struct 
>> vfio_iommu *iommu)
>>  }
>>  }
>>  
>> -static void vfio_sanity_check_pfn_list(struct vfio_iommu *iommu)
>> -{
>> -struct rb_node *n;
>> -
>> -n = rb_first(>dma_list);
>> -for (; n; n = rb_next(n)) {
>> -struct vfio_dma *dma;
>> -
>> -dma = rb_entry(n, struct vfio_dma, node);
>> -
>> -if (WARN_ON(!RB_EMPTY_ROOT(>pfn_list)))
>> -break;
>> -}
>> -/* mdev vendor driver must unregister notifier */
>> -WARN_ON(iommu->notifier.head);
>> -}
>> -
>>  /*
>>   * Called when a domain is removed in detach. It is possible that
>>   * the removed domain decided the iova aperture window. Modify the
>> @@ -2366,10 +2350,10 @@ static void vfio_iommu_type1_detach_group(void 
>> *iommu_data,
>>  kfree(group);
>>  
>>  if (list_empty(>external_domain->group_list)) {
>> -vfio_sanity_check_pfn_list(iommu);
>> -
>> -if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>> +if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
>> +WARN_ON(iommu->notifier.head);
>>  vfio_iommu_unmap_unpin_all(iommu);
>> +}
>>  
>>  kfree(iommu->external_domain);
>>  iommu->external_domain = NULL;
>> @@ -2403,10 +2387,12 @@ static void vfio_iommu_type1_detach_group(void 
>> *iommu_data,
>>   */
>>  if (list_empty(>group_list)) {
>>  if (list_is_singular(>domain_list)) {
>> -if (!iommu->external_domain)
>> +if (!iommu->external_domain) {
>> +WARN_ON(iommu->notifier.head);
>>  vfio_iommu_unmap_unpin_all(iommu);
>> -else
>> +} else {
>>  vfio_iommu_unmap_unpin_reaccount(iommu);
>> +}
>>  }
>>  iommu_domain_free(domain->domain);
>>  list_del(>next);
>> @@ -2488,9 +2474,10 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>  struct vfio_iommu *iommu = iommu_data;
>>  struct vfio_domain *domain, *domain_tmp;
>>  
>> +WARN_ON(iommu->notifier.head);
> 
> I don't see that this does any harm, but isn't it actually redundant?
> It seems vfio-core only calls the iommu backend release function after
> removing all the groups, so the tests in _detach_group should catch all
> cases.  We're expecting the vfio bus/mdev driver to remove the notifier
> when a device is closed, which necessarily occurs before detaching the
> group.  Thanks,
> 
> Alex
Hi Alex,

Sorry, today I was busy sending out the SMMU HTTU based DMA dirty log tracking
series. I will reply to you tomorrow. Thanks!

Keqian.

> 
>> +
>>  if (iommu->external_domain) {
>>  vfio_release_domain(iommu->external_domain, true);
>> -vfio_sanity_check_pfn_list(iommu);
>>  kfree(iommu->external_domain);
>>  }
>>  
> 
> .
> 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

In the past, if the vfio_iommu is not of pinned_page_dirty_scope and the
vfio_dma is iommu_mapped, we populate the full dirty bitmap for this
vfio_dma. Now we can try to get the dirty log from the iommu before making
that lousy decision.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/vfio/vfio_iommu_type1.c | 97 -
 1 file changed, 94 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3b8522ebf955..1cd10f3e7ed4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -999,6 +999,25 @@ static bool vfio_group_supports_hwdbm(struct vfio_group 
*group)
return true;
 }
 
+static int vfio_iommu_dirty_log_clear(struct vfio_iommu *iommu,
+ dma_addr_t start_iova, size_t size,
+ unsigned long *bitmap_buffer,
+ dma_addr_t base_iova, size_t pgsize)
+{
+   struct vfio_domain *d;
+   unsigned long pgshift = __ffs(pgsize);
+   int ret;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   ret = iommu_clear_dirty_log(d->domain, start_iova, size,
+   bitmap_buffer, base_iova, pgshift);
+   if (ret)
+   return ret;
+   }
+
+   return 0;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
  struct vfio_dma *dma, dma_addr_t base_iova,
  size_t pgsize)
@@ -1010,13 +1029,28 @@ static int update_user_bitmap(u64 __user *bitmap, 
struct vfio_iommu *iommu,
unsigned long shift = bit_offset % BITS_PER_LONG;
unsigned long leftover;
 
+   if (iommu->pinned_page_dirty_scope || !dma->iommu_mapped)
+   goto bitmap_done;
+
+   /* try to get dirty log from IOMMU */
+   if (!iommu->num_non_hwdbm_groups) {
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   if (iommu_sync_dirty_log(d->domain, dma->iova, 
dma->size,
+   dma->bitmap, dma->iova, 
pgshift))
+   return -EFAULT;
+   }
+   goto bitmap_done;
+   }
+
/*
 * mark all pages dirty if any IOMMU capable device is not able
 * to report dirty pages and all pages are pinned and mapped.
 */
-   if (!iommu->pinned_page_dirty_scope && dma->iommu_mapped)
-   bitmap_set(dma->bitmap, 0, nbits);
+   bitmap_set(dma->bitmap, 0, nbits);
 
+bitmap_done:
if (shift) {
bitmap_shift_left(dma->bitmap, dma->bitmap, shift,
  nbits + shift);
@@ -1078,6 +1112,18 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, 
struct vfio_iommu *iommu,
 */
bitmap_clear(dma->bitmap, 0, dma->size >> pgshift);
vfio_dma_populate_bitmap(dma, pgsize);
+
+   /* Clear iommu dirty log to re-enable dirty log tracking */
+   if (!iommu->pinned_page_dirty_scope &&
+   dma->iommu_mapped && !iommu->num_non_hwdbm_groups) {
+   ret = vfio_iommu_dirty_log_clear(iommu, dma->iova,
+   dma->size, dma->bitmap, dma->iova,
+   pgsize);
+   if (ret) {
+   pr_warn("dma dirty log clear failed!\n");
+   return ret;
+   }
+   }
}
return 0;
 }
@@ -2780,6 +2826,48 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu 
*iommu,
-EFAULT : 0;
 }
 
+static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
+struct vfio_dma *dma)
+{
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, >domain_list, next) {
+   /* Go through all domain anyway even if we fail */
+   iommu_split_block(d->domain, dma->iova, dma->size);
+   }
+}
+
+static void vfio_dma_dirty_log_stop(struct vfio_iommu *iommu,
+   struct vfio_dma *dma)
+{
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   /* Go through all domain anyway even if we fail */
+   iommu_merge_page(d->domain, dma->iova, dma->size,
+d->prot | dma->prot);
+   }
+}
+
+static void vfio_iommu_dirty_log_switch(struct vfio_iommu *iommu, bool start)
+{
+   struct rb_node *n;
+
+   /* Split and merge even if all iommu don't support HWDBM now */
+   for (n = rb_first(

[RFC PATCH 02/11] iommu/arm-smmu-v3: Enable HTTU for SMMU stage1 mapping

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

If HTTU is supported, we enable the HA/HD bits in the SMMU CD (stage 1
mapping), and set the DBM bit for writable TTDs.

The dirty state information is encoded using the access permission
bits AP[2] (stage 1) or S2AP[1] (stage 2) in conjunction with the
DBM (Dirty Bit Modifier) bit, where DBM means writable and AP[2]/
S2AP[1] means dirty.
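For illustration only (not part of this patch): with this encoding, a leaf
PTE that has DBM set is hardware-writable, and the SMMU records a write by
clearing AP[2] (ARM_LPAE_PTE_AP_RDONLY) on the first write, so a dirty check
on such an entry looks roughly like:

	/* Sketch: a DBM-writable leaf that is no longer read-only is dirty. */
	static inline bool arm_lpae_pte_is_dirty(arm_lpae_iopte pte)
	{
		return (pte & ARM_LPAE_PTE_DBM) &&
		       !(pte & ARM_LPAE_PTE_AP_RDONLY);
	}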

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 +
 drivers/iommu/io-pgtable-arm.c  | 7 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0f0fe71cc10d..8cc9d7536b08 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1036,6 +1036,11 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
*smmu_domain, int ssid,
FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) |
CTXDESC_CD_0_V;
 
+   if (smmu->features & ARM_SMMU_FEAT_HTTU_HA)
+   val |= CTXDESC_CD_0_HA;
+   if (smmu->features & ARM_SMMU_FEAT_HTTU_HD)
+   val |= CTXDESC_CD_0_HD;
+
/* STALL_MODEL==0b10 && CD.S==0 is ILLEGAL */
if (smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
val |= CTXDESC_CD_0_S;
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..e299a44808ae 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -72,6 +72,7 @@
 
 #define ARM_LPAE_PTE_NSTABLE   (((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM   (((arm_lpae_iopte)1) << 51)
 #define ARM_LPAE_PTE_AF(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
@@ -81,7 +82,7 @@
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK  (((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)13) << 51)
 #define ARM_LPAE_PTE_ATTR_MASK (ARM_LPAE_PTE_ATTR_LO_MASK |\
 ARM_LPAE_PTE_ATTR_HI_MASK)
 /* Software bit for solving coherency races */
@@ -379,6 +380,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, 
unsigned long iova,
 static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
   int prot)
 {
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
arm_lpae_iopte pte;
 
if (data->iop.fmt == ARM_64_LPAE_S1 ||
@@ -386,6 +388,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
arm_lpae_io_pgtable *data,
pte = ARM_LPAE_PTE_nG;
if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
pte |= ARM_LPAE_PTE_AP_RDONLY;
+   else if (cfg->httu_hd)
+   pte |= ARM_LPAE_PTE_DBM;
+
if (!(prot & IOMMU_PRIV))
pte |= ARM_LPAE_PTE_AP_UNPRIV;
} else {
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 09/11] vfio/iommu_type1: Add HWDBM status maintanance

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

We are going to optimize dirty log tracking based on the iommu HWDBM
feature, but the dirty log from the iommu is useful only when all
iommu-backed groups are connected to an iommu with the HWDBM feature.
This patch maintains a counter for this.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/vfio/vfio_iommu_type1.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0b4dedaa9128..3b8522ebf955 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -74,6 +74,7 @@ struct vfio_iommu {
boolnesting;
booldirty_page_tracking;
boolpinned_page_dirty_scope;
+   uint64_tnum_non_hwdbm_groups;
 };
 
 struct vfio_domain {
@@ -102,6 +103,7 @@ struct vfio_group {
struct list_headnext;
boolmdev_group; /* An mdev group */
boolpinned_page_dirty_scope;
+   booliommu_hwdbm;/* Valid for non-mdev group */
 };
 
 struct vfio_iova {
@@ -976,6 +978,27 @@ static void vfio_update_pgsize_bitmap(struct vfio_iommu 
*iommu)
}
 }
 
+static int vfio_dev_has_feature(struct device *dev, void *data)
+{
+   enum iommu_dev_features *feat = data;
+
+   if (!iommu_dev_has_feature(dev, *feat))
+   return -ENODEV;
+
+   return 0;
+}
+
+static bool vfio_group_supports_hwdbm(struct vfio_group *group)
+{
+   enum iommu_dev_features feat = IOMMU_DEV_FEAT_HWDBM;
+
+   if (iommu_group_for_each_dev(group->iommu_group, &feat,
+vfio_dev_has_feature))
+   return false;
+
+   return true;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
  struct vfio_dma *dma, dma_addr_t base_iova,
  size_t pgsize)
@@ -2189,6 +2212,12 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
 * capable via the page pinning interface.
 */
iommu->pinned_page_dirty_scope = false;
+
+   /* Update the hwdbm status of group and iommu */
+   group->iommu_hwdbm = vfio_group_supports_hwdbm(group);
+   if (!group->iommu_hwdbm)
+   iommu->num_non_hwdbm_groups++;
+
	mutex_unlock(&iommu->lock);
vfio_iommu_resv_free(_resv_regions);
 
@@ -2342,6 +2371,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
struct vfio_domain *domain;
struct vfio_group *group;
bool update_dirty_scope = false;
+   bool update_iommu_hwdbm = false;
LIST_HEAD(iova_copy);
 
	mutex_lock(&iommu->lock);
@@ -2380,6 +2410,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
 
vfio_iommu_detach_group(domain, group);
update_dirty_scope = !group->pinned_page_dirty_scope;
+   update_iommu_hwdbm = !group->iommu_hwdbm;
list_del(>next);
kfree(group);
/*
@@ -2417,6 +2448,8 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
 */
if (update_dirty_scope)
update_pinned_page_dirty_scope(iommu);
+   if (update_iommu_hwdbm)
+   iommu->num_non_hwdbm_groups--;
	mutex_unlock(&iommu->lock);
 }
 
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 11/11] vfio/iommu_type1: Add support for manual dirty log clear

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

Currently we clear the dirty log immediately after syncing it to
userspace. This may cause redundant dirty handling if userspace
handles the dirty log iteratively:

After vfio clears the dirty log, new dirty state starts to accumulate.
This new dirty state will be reported to userspace even if it was
generated before userspace handled the same dirty page.

That is to say, we should minimize the time gap between clearing the
dirty log and handling it. This gives userspace an interface to clear
the dirty log explicitly.
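
(A rough sketch of the userspace flow this is meant to enable; the helper
names and the clear flag below are assumptions for illustration, while the
actual uapi additions live in the include/uapi/linux/vfio.h hunk of this
patch:)

  /* Hypothetical migration loop: sync, transmit, then explicitly clear. */
  for (;;) {
          /* VFIO_IOMMU_DIRTY_PAGES + GET_BITMAP: fetch the dirty bitmap
           * without the kernel clearing it implicitly. */
          get_dirty_bitmap(container_fd, iova, size, bitmap);

          /* Send the dirtied pages to the destination. */
          send_dirty_pages(bitmap);

          /* Assumed CLEAR operation: clear exactly the bits just handled,
           * so pages re-dirtied in the meantime are reported next round. */
          clear_dirty_bitmap(container_fd, iova, size, bitmap);

          if (dirty_rate_is_low_enough())
                  break;
  }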

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/vfio/vfio_iommu_type1.c | 103 ++--
 include/uapi/linux/vfio.h   |  28 -
 2 files changed, 126 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1cd10f3e7ed4..a32dc684b86e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -73,6 +73,7 @@ struct vfio_iommu {
boolv2;
boolnesting;
booldirty_page_tracking;
+   booldirty_log_manual_clear;
boolpinned_page_dirty_scope;
uint64_tnum_non_hwdbm_groups;
 };
@@ -1018,6 +1019,78 @@ static int vfio_iommu_dirty_log_clear(struct vfio_iommu 
*iommu,
return 0;
 }
 
+static int vfio_iova_dirty_log_clear(u64 __user *bitmap,
+struct vfio_iommu *iommu,
+dma_addr_t iova, size_t size,
+size_t pgsize)
+{
+   struct vfio_dma *dma;
+   struct rb_node *n;
+   dma_addr_t start_iova, end_iova, riova;
+   unsigned long pgshift = __ffs(pgsize);
+   unsigned long bitmap_size;
+   unsigned long *bitmap_buffer = NULL;
+   bool clear_valid;
+   int rs, re, start, end, dma_offset;
+   int ret = 0;
+
+   bitmap_size = DIRTY_BITMAP_BYTES(size >> pgshift);
+   bitmap_buffer = kvmalloc(bitmap_size, GFP_KERNEL);
+   if (!bitmap_buffer) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   if (copy_from_user(bitmap_buffer, bitmap, bitmap_size)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+   dma = rb_entry(n, struct vfio_dma, node);
+   if (!dma->iommu_mapped)
+   continue;
+   if ((dma->iova + dma->size - 1) < iova)
+   continue;
+   if (dma->iova > iova + size - 1)
+   break;
+
+   start_iova = max(iova, dma->iova);
+   end_iova = min(iova + size, dma->iova + dma->size);
+
+   /* Similar logic as the tail of vfio_iova_dirty_bitmap */
+
+   clear_valid = false;
+   start = (start_iova - iova) >> pgshift;
+   end = (end_iova - iova) >> pgshift;
+   bitmap_for_each_set_region(bitmap_buffer, rs, re, start, end) {
+   clear_valid = true;
+   riova = iova + (rs << pgshift);
+   dma_offset = (riova - dma->iova) >> pgshift;
+   bitmap_clear(dma->bitmap, dma_offset, re - rs);
+   }
+
+   if (clear_valid)
+   vfio_dma_populate_bitmap(dma, pgsize);
+
+   if (clear_valid && !iommu->pinned_page_dirty_scope &&
+   dma->iommu_mapped && !iommu->num_non_hwdbm_groups) {
+   ret = vfio_iommu_dirty_log_clear(iommu, start_iova,
+   end_iova - start_iova,  bitmap_buffer,
+   iova, pgsize);
+   if (ret) {
+   pr_warn("dma dirty log clear failed!\n");
+   goto out;
+   }
+   }
+
+   }
+
+out:
+   kfree(bitmap_buffer);
+   return ret;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
  struct vfio_dma *dma, dma_addr_t base_iova,
  size_t pgsize)
@@ -1067,6 +1140,10 @@ static int update_user_bitmap(u64 __user *bitmap, struct 
vfio_iommu *iommu,
 DIRTY_BITMAP_BYTES(nbits + shift)))
return -EFAULT;
 
+   if (shift && iommu->dirty_log_manual_clear)
+   bitmap_shift_right(dma->bitmap, dma->bitmap, shift,
+  nbits + shift);
+
return 0;
 }
 
@@ -1105,6 +1182,9 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, 
struct vfio_iommu *iommu,
if (ret)
return ret;
 
+   if (iommu-&g

[RFC PATCH 08/11] iommu/arm-smmu-v3: Add HWDBM device feature reporting

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

We have implemented the interfaces required to support iommu dirty
log tracking. The last step is to report this feature to the upper
user, so the user can build higher-level policy on it. This adds a
new dev feature named IOMMU_DEV_FEAT_HWDBM in the iommu layer. For
arm smmuv3, it is equivalent to ARM_SMMU_FEAT_HTTU_HD.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 ++
 include/linux/iommu.h   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0c24503d29d3..cbde0489cf31 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2629,6 +2629,8 @@ static bool arm_smmu_dev_has_feature(struct device *dev,
switch (feat) {
case IOMMU_DEV_FEAT_SVA:
return arm_smmu_master_sva_supported(master);
+   case IOMMU_DEV_FEAT_HWDBM:
+   return !!(master->smmu->features & ARM_SMMU_FEAT_HTTU_HD);
default:
return false;
}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 1cb6cd0cfc7b..77e561ed57fd 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -160,6 +160,7 @@ struct iommu_resv_region {
 enum iommu_dev_features {
IOMMU_DEV_FEAT_AUX, /* Aux-domain feature */
IOMMU_DEV_FEAT_SVA, /* Shared Virtual Addresses */
+   IOMMU_DEV_FEAT_HWDBM,   /* Hardware Dirty Bit Management */
 };
 
 #define IOMMU_PASID_INVALID(-1U)
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

When dirty log tracking stops, we need to recover all block descriptors
that were split when tracking started. This adds a new interface named
merge_page to the iommu layer, implemented by arm smmuv3, which
reinstalls block mappings and unmaps the corresponding span of page
mappings.

It is the caller's duty to find continuous physical memory.

While pages are being merged, other interfaces are not expected to be
in use, so no race condition exists. And we flush all IOTLBs after the
merge procedure completes to ease the pressure on the iommu, as we
generally merge a huge range of page mappings.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 ++
 drivers/iommu/io-pgtable-arm.c  | 78 +
 drivers/iommu/iommu.c   | 75 
 include/linux/io-pgtable.h  |  2 +
 include/linux/iommu.h   | 10 +++
 5 files changed, 185 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5469f4fca820..2434519e4bb6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2529,6 +2529,25 @@ static size_t arm_smmu_split_block(struct iommu_domain 
*domain,
return ops->split_block(ops, iova, size);
 }
 
+static size_t arm_smmu_merge_page(struct iommu_domain *domain, unsigned long 
iova,
+ phys_addr_t paddr, size_t size, int prot)
+{
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2 and merge page\n");
+   return 0;
+   }
+
+   if (!ops || !ops->merge_page) {
+   pr_err("don't support merge page\n");
+   return 0;
+   }
+
+   return ops->merge_page(ops, iova, paddr, size, prot);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2629,6 +2648,7 @@ static struct iommu_ops arm_smmu_ops = {
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
+   .merge_page = arm_smmu_merge_page,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index f3b7f7115e38..17390f258eb1 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -800,6 +800,83 @@ static size_t arm_lpae_split_block(struct io_pgtable_ops 
*ops,
return __arm_lpae_split_block(data, iova, size, lvl, ptep);
 }
 
+static size_t __arm_lpae_merge_page(struct arm_lpae_io_pgtable *data,
+   unsigned long iova, phys_addr_t paddr,
+   size_t size, int lvl, arm_lpae_iopte *ptep,
+   arm_lpae_iopte prot)
+{
+   arm_lpae_iopte pte, *tablep;
+   struct io_pgtable *iop = &data->iop;
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return 0;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return 0;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt))
+   return size;
+
+   /* Race does not exist */
+   if (cfg->bbml == 1) {
+   prot |= ARM_LPAE_PTE_NT;
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   io_pgtable_tlb_flush_walk(iop, iova, size,
+ ARM_LPAE_GRANULE(data));
+
+   prot &= ~(ARM_LPAE_PTE_NT);
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   } else {
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   }
+
+   tablep = iopte_deref(pte, data);
+   __arm_lpae_free_pgtable(data, lvl + 1, tablep);
+   return size;
+   } else if (iopte_leaf(pte, lvl, iop->fmt)) {
+   /* The size is too small, already merged */
+   return size;
+   }
+
+   /* Keep on walkin */
+   ptep = iopte_deref(pte, data);
+   return __arm_lpae_merge_page(data, iova, paddr, size, lvl + 1, ptep, 
prot);
+}
+
+static size_t 

[RFC PATCH 07/11] iommu/arm-smmu-v3: Clear dirty log according to bitmap

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

After the dirty log is retrieved, the user should clear it to re-enable
dirty log tracking for the dirtied pages. This adds a new interface
named clear_dirty_log, implemented by arm smmuv3, which clears the
dirty state (as we only enable HTTU for stage1, this means setting the
AP[2] bit) of the TTDs specified by the user-provided bitmap.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++
 drivers/iommu/io-pgtable-arm.c  | 95 +
 drivers/iommu/iommu.c   | 71 +++
 include/linux/io-pgtable.h  |  4 +
 include/linux/iommu.h   | 17 
 5 files changed, 211 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 43d0536b429a..0c24503d29d3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2574,6 +2574,29 @@ static int arm_smmu_sync_dirty_log(struct iommu_domain 
*domain,
base_iova, bitmap_pgshift);
 }
 
+static int arm_smmu_clear_dirty_log(struct iommu_domain *domain,
+   unsigned long iova, size_t size,
+   unsigned long *bitmap,
+   unsigned long base_iova,
+   unsigned long bitmap_pgshift)
+{
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
+   dev_err(smmu->dev, "don't support HTTU_HD and clear dirty 
log\n");
+   return -EPERM;
+   }
+
+   if (!ops || !ops->clear_dirty_log) {
+   pr_err("don't support clear dirty log\n");
+   return -ENODEV;
+   }
+
+   return ops->clear_dirty_log(ops, iova, size, bitmap, base_iova,
+   bitmap_pgshift);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2676,6 +2699,7 @@ static struct iommu_ops arm_smmu_ops = {
.split_block= arm_smmu_split_block,
.merge_page = arm_smmu_merge_page,
.sync_dirty_log = arm_smmu_sync_dirty_log,
+   .clear_dirty_log= arm_smmu_clear_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 6cfe1ef3fedd..2256e37bcb3a 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -966,6 +966,100 @@ static int arm_lpae_sync_dirty_log(struct io_pgtable_ops 
*ops,
 bitmap, base_iova, bitmap_pgshift);
 }
 
+static int __arm_lpae_clear_dirty_log(struct arm_lpae_io_pgtable *data,
+ unsigned long iova, size_t size,
+ int lvl, arm_lpae_iopte *ptep,
+ unsigned long *bitmap,
+ unsigned long base_iova,
+ unsigned long bitmap_pgshift)
+{
+   arm_lpae_iopte pte;
+   struct io_pgtable *iop = &data->iop;
+   unsigned long offset;
+   size_t base, next_size;
+   int nbits, ret, i;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return -EINVAL;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt)) {
+   if (pte & ARM_LPAE_PTE_AP_RDONLY)
+   return 0;
+
+   /* Ensure all corresponding bits are set */
+   nbits = size >> bitmap_pgshift;
+   offset = (iova - base_iova) >> bitmap_pgshift;
+   for (i = offset; i < offset + nbits; i++) {
+   if (!test_bit(i, bitmap))
+   return 0;
+   }
+
+   /* Race does not exist */
+   pte |= ARM_LPAE_PTE_AP_RDONLY;
+   __arm_lpae_set_pte(ptep, pte, &iop->cfg);
+   return 0;
+   } else {
+   /* To traverse next level */
+   next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+   ptep = iopte_deref(pte, data);
+   for (base = 0; base < size; base += n

[RFC PATCH 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU

2021-01-28 Thread Keqian Zhu
Hi all,

This patch series implement a new dirty log tracking method for vfio dma.

Intention:

As we know, vfio live migration is an important and valuable feature, but
there are still many hurdles to clear, including migration of interrupts,
device state, DMA dirty log tracking, etc.

For now, the only dirty log tracking interface is pinning. It has some
drawbacks:
1. Only smart vendor drivers are aware of this.
2. It's coarse-grained: the pinned scope is generally bigger than what the
   device actually accesses.
3. It can't track dirty state continuously and precisely; vfio reports the
   whole pinned scope as dirty, so it doesn't work well with iterative
   dirty log handling.

About SMMU HTTU:

HTTU (Hardware Translation Table Update) is a feature of ARM SMMUv3 that can
update the access flag and/or the dirty state of a TTD (Translation Table
Descriptor) in hardware. With HTTU, a stage1 TTD is classified into 3 types:

                        DBM bit    AP[2] (readonly bit)
  1. writable_clean        1               1
  2. writable_dirty        1               0
  3. readonly              0               1

If HTTU_HD (manage dirty state) is enabled, the SMMU can change a TTD from
writable_clean to writable_dirty. Software can then scan the TTDs to sync the
dirty state into a dirty bitmap. With this feature, we can track the dirty
log of DMA continuously and precisely.
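
(Put differently: with HTTU_HD enabled and DBM set, a leaf TTD is dirty
exactly when its AP[2] read-only bit is clear. A minimal sketch of the check
the sync path relies on, using the bit positions from io-pgtable-arm; the
helper name is illustrative:)

  /* Sketch: a stage1 leaf TTD with DBM set is "dirty" once hardware has
   * cleared AP[2] to make it writable (see the sync_dirty_log patch). */
  static bool ttd_is_dirty(u64 ttd)
  {
          const u64 ap2_rdonly = 1ULL << 7;    /* AP[2], read-only bit */
          const u64 dbm        = 1ULL << 51;   /* Dirty Bit Modifier  */

          return (ttd & dbm) && !(ttd & ap2_rdonly);
  }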

About this series:

Patch 1-3: Add feature detection for smmu HTTU and enable HTTU for the smmu
           stage1 mapping. Also add feature detection for smmu BBML. We need
           to split block mappings when dirty log tracking starts and merge
           page mappings when it stops, which requires the break-before-make
           procedure. But that might cause problems while the TTD is live:
           the I/O streams might not tolerate translation faults, so BBML
           should be used.

Patch 4-7: Add four interfaces (split_block, merge_page, sync_dirty_log and
           clear_dirty_log) to the IOMMU layer; they are essential to
           implement DMA dirty log tracking for vfio. We implement these
           interfaces for arm smmuv3.

Patch   8: Add HWDBM (Hardware Dirty Bit Management) device feature reporting
           in the IOMMU layer.

Patch9-11: Implement a new dirty log tracking method for vfio based on iommu
           HWDBM. A new ioctl operation named VFIO_DIRTY_LOG_MANUAL_CLEAR is
           added, which can eliminate some redundant dirty handling in
           userspace.

Optimizations To Do:

1. We recognized that each smmu_domain (a vfio_container may have several
   smmu_domains) has its own stage1 mapping, and we must scan all these
   mappings to sync the dirty state. We plan to refactor smmu_domain to
   support more than one smmu in one smmu_domain, so these smmus can share
   the same stage1 mapping.
2. We also recognized that scanning TTDs is a performance hotspot. Recently,
   I have implemented a SW/HW combined dirty log tracking at the MMU side
   [1], which can effectively solve this problem. This idea can be applied
   to the smmu side too.

Thanks,
Keqian


[1] 
https://lore.kernel.org/linux-arm-kernel/2021012612.27136-1-zhukeqi...@huawei.com/

jiangkunkun (11):
  iommu/arm-smmu-v3: Add feature detection for HTTU
  iommu/arm-smmu-v3: Enable HTTU for SMMU stage1 mapping
  iommu/arm-smmu-v3: Add feature detection for BBML
  iommu/arm-smmu-v3: Split block descriptor to a span of page
  iommu/arm-smmu-v3: Merge a span of page to block descriptor
  iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log
  iommu/arm-smmu-v3: Clear dirty log according to bitmap
  iommu/arm-smmu-v3: Add HWDBM device feature reporting
  vfio/iommu_type1: Add HWDBM status maintenance
  vfio/iommu_type1: Optimize dirty bitmap population based on iommu
HWDBM
  vfio/iommu_type1: Add support for manual dirty log clear

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 138 ++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 +
 drivers/iommu/io-pgtable-arm.c  | 392 +++-
 drivers/iommu/iommu.c   | 227 
 drivers/vfio/vfio_iommu_type1.c | 235 +++-
 include/linux/io-pgtable.h  |  14 +
 include/linux/iommu.h   |  55 +++
 include/uapi/linux/vfio.h   |  28 +-
 8 files changed, 1093 insertions(+), 10 deletions(-)

-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 03/11] iommu/arm-smmu-v3: Add feature detection for BBML

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

When altering a translation table descriptor for some specific reasons,
the break-before-make procedure is required. But it might cause problems
while the TTD is live: the I/O streams might not tolerate translation
faults.

If the SMMU supports BBML level 1 or BBML level 2, we can change the block
size without using break-before-make.

This adds feature detection for BBML; no functional change.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++
 include/linux/io-pgtable.h  |  1 +
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8cc9d7536b08..9208881a571c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1947,7 +1947,7 @@ static int arm_smmu_domain_finalise_s2(struct 
arm_smmu_domain *smmu_domain,
 static int arm_smmu_domain_finalise(struct iommu_domain *domain,
struct arm_smmu_master *master)
 {
-   int ret;
+   int ret, bbml;
unsigned long ias, oas;
enum io_pgtable_fmt fmt;
struct io_pgtable_cfg pgtbl_cfg;
@@ -1988,12 +1988,20 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
return -EINVAL;
}
 
+   if (smmu->features & ARM_SMMU_FEAT_BBML2)
+   bbml = 2;
+   else if (smmu->features & ARM_SMMU_FEAT_BBML1)
+   bbml = 1;
+   else
+   bbml = 0;
+
pgtbl_cfg = (struct io_pgtable_cfg) {
.pgsize_bitmap  = smmu->pgsize_bitmap,
.ias= ias,
.oas= oas,
.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
.coherent_walk  = smmu->features & ARM_SMMU_FEAT_COHERENCY,
+   .bbml   = bbml,
.tlb= &arm_smmu_flush_ops,
.iommu_dev  = smmu->dev,
};
@@ -3328,6 +3336,20 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
 
/* IDR3 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+   switch (FIELD_GET(IDR3_BBML, reg)) {
+   case IDR3_BBML0:
+   break;
+   case IDR3_BBML1:
+   smmu->features |= ARM_SMMU_FEAT_BBML1;
+   break;
+   case IDR3_BBML2:
+   smmu->features |= ARM_SMMU_FEAT_BBML2;
+   break;
+   default:
+   dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
+   return -ENXIO;
+   }
+
if (FIELD_GET(IDR3_RIL, reg))
smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e91bea44519e..11e526ab7239 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -55,6 +55,10 @@
 #define IDR1_SIDSIZE   GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3  0xc
+#define IDR3_BBML  GENMASK(12, 11)
+#define IDR3_BBML0 0
+#define IDR3_BBML1 1
+#define IDR3_BBML2 2
 #define IDR3_RIL   (1 << 10)
 
 #define ARM_SMMU_IDR5  0x14
@@ -612,6 +616,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_SVA  (1 << 17)
 #define ARM_SMMU_FEAT_HTTU_HA  (1 << 18)
 #define ARM_SMMU_FEAT_HTTU_HD  (1 << 19)
+#define ARM_SMMU_FEAT_BBML1(1 << 20)
+#define ARM_SMMU_FEAT_BBML2(1 << 21)
u32 features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 1a00ea8562c7..26583beeb5d9 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -99,6 +99,7 @@ struct io_pgtable_cfg {
unsigned intoas;
boolhttu_hd;
boolcoherent_walk;
+   int bbml;
const struct iommu_flush_ops*tlb;
struct device   *iommu_dev;
 
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

A block descriptor is not a proper granule for dirty log tracking. This
adds a new interface named split_block to the iommu layer, implemented
by arm smmuv3, which splits a block descriptor into an equivalent span
of page descriptors.

While a block is being split, other interfaces are not expected to be
in use, so no race condition exists. And we flush all IOTLBs after the
split procedure completes to ease the pressure on the iommu, as we
generally split a huge range of block mappings.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  20 
 drivers/iommu/io-pgtable-arm.c  | 122 
 drivers/iommu/iommu.c   |  40 +++
 include/linux/io-pgtable.h  |   2 +
 include/linux/iommu.h   |  10 ++
 5 files changed, 194 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9208881a571c..5469f4fca820 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2510,6 +2510,25 @@ static int arm_smmu_domain_set_attr(struct iommu_domain 
*domain,
return ret;
 }
 
+static size_t arm_smmu_split_block(struct iommu_domain *domain,
+  unsigned long iova, size_t size)
+{
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2 and split block\n");
+   return 0;
+   }
+
+   if (!ops || !ops->split_block) {
+   pr_err("don't support split block\n");
+   return 0;
+   }
+
+   return ops->split_block(ops, iova, size);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2609,6 +2628,7 @@ static struct iommu_ops arm_smmu_ops = {
.device_group   = arm_smmu_device_group,
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
+   .split_block= arm_smmu_split_block,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index e299a44808ae..f3b7f7115e38 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -79,6 +79,8 @@
 #define ARM_LPAE_PTE_SH_IS (((arm_lpae_iopte)3) << 8)
 #define ARM_LPAE_PTE_NS(((arm_lpae_iopte)1) << 5)
 #define ARM_LPAE_PTE_VALID (((arm_lpae_iopte)1) << 0)
+/* Block descriptor bits */
+#define ARM_LPAE_PTE_NT(((arm_lpae_iopte)1) << 16)
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK  (((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
@@ -679,6 +681,125 @@ static phys_addr_t arm_lpae_iova_to_phys(struct 
io_pgtable_ops *ops,
return iopte_to_paddr(pte, data) | iova;
 }
 
+static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,
+unsigned long iova, size_t size, int lvl,
+arm_lpae_iopte *ptep);
+
+static size_t arm_lpae_do_split_blk(struct arm_lpae_io_pgtable *data,
+   unsigned long iova, size_t size,
+   arm_lpae_iopte blk_pte, int lvl,
+   arm_lpae_iopte *ptep)
+{
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
+   arm_lpae_iopte pte, *tablep;
+   phys_addr_t blk_paddr;
+   size_t tablesz = ARM_LPAE_GRANULE(data);
+   size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+   int i;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return 0;
+
+   tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
+   if (!tablep)
+   return 0;
+
+   blk_paddr = iopte_to_paddr(blk_pte, data);
+   pte = iopte_prot(blk_pte);
+   for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz)
+   __arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
+
+   if (cfg->bbml == 1) {
+   /* Race does not exist */
+   blk_pte |= ARM_LPAE_PTE_NT;
+   __arm_lpae_set_pte(ptep, blk_pte, cfg);
+   io_pgtable_tlb_flush_walk(&data->iop, iova, size, size);
+   }
+   /* Race does not exist */
+   pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
+
+   /* Have splited it into page? */
+   if (lvl == (ARM_LPAE_MAX_LE

[RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

During dirty log tracking, the user will try to retrieve the dirty log
from the iommu if it supports hardware dirty logging. This adds a new
interface named sync_dirty_log in the iommu layer, implemented by arm
smmuv3, which scans leaf TTDs and treats a TTD as dirty if it is
writable (as we only enable HTTU for stage1, this means AP[2] is not set).

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 +++
 drivers/iommu/io-pgtable-arm.c  | 90 +
 drivers/iommu/iommu.c   | 41 ++
 include/linux/io-pgtable.h  |  4 +
 include/linux/iommu.h   | 17 
 5 files changed, 179 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2434519e4bb6..43d0536b429a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2548,6 +2548,32 @@ static size_t arm_smmu_merge_page(struct iommu_domain 
*domain, unsigned long iov
return ops->merge_page(ops, iova, paddr, size, prot);
 }
 
+static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
+  unsigned long iova, size_t size,
+  unsigned long *bitmap,
+  unsigned long base_iova,
+  unsigned long bitmap_pgshift)
+{
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
+   dev_err(smmu->dev, "don't support HTTU_HD and sync dirty 
log\n");
+   return -EPERM;
+   }
+
+   if (!ops || !ops->sync_dirty_log) {
+   pr_err("don't support sync dirty log\n");
+   return -ENODEV;
+   }
+
+   /* To ensure all inflight transactions are completed */
+   arm_smmu_flush_iotlb_all(domain);
+
+   return ops->sync_dirty_log(ops, iova, size, bitmap,
+   base_iova, bitmap_pgshift);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2649,6 +2675,7 @@ static struct iommu_ops arm_smmu_ops = {
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
.merge_page = arm_smmu_merge_page,
+   .sync_dirty_log = arm_smmu_sync_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 17390f258eb1..6cfe1ef3fedd 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -877,6 +877,95 @@ static size_t arm_lpae_merge_page(struct io_pgtable_ops 
*ops, unsigned long iova
return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
 }
 
+static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,
+unsigned long iova, size_t size,
+int lvl, arm_lpae_iopte *ptep,
+unsigned long *bitmap,
+unsigned long base_iova,
+unsigned long bitmap_pgshift)
+{
+   arm_lpae_iopte pte;
+   struct io_pgtable *iop = &data->iop;
+   size_t base, next_size;
+   unsigned long offset;
+   int nbits, ret;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return -EINVAL;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt)) {
+   if (pte & ARM_LPAE_PTE_AP_RDONLY)
+   return 0;
+
+   /* It is writable, set the bitmap */
+   nbits = size >> bitmap_pgshift;
+   offset = (iova - base_iova) >> bitmap_pgshift;
+   bitmap_set(bitmap, offset, nbits);
+   return 0;
+   } else {
+   /* To traverse next level */
+   next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+   ptep = iopte_deref(pte, data);
+   for (base = 0; base < size; base += next_size) {
+   ret = __arm_lpae_sync_dirty_log(data,
+   iova + base, next_size, lvl + 1,
+  

[RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-01-28 Thread Keqian Zhu
From: jiangkunkun 

An SMMU that supports HTTU (Hardware Translation Table Update) can
update the access flag and the dirty state of TTDs in hardware. This is
essential to track dirty pages of DMA.

This adds feature detection; no functional change.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
 include/linux/io-pgtable.h  |  1 +
 3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ca7415d785d..0f0fe71cc10d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.pgsize_bitmap  = smmu->pgsize_bitmap,
.ias= ias,
.oas= oas,
+   .httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
.coherent_walk  = smmu->features & ARM_SMMU_FEAT_COHERENCY,
.tlb= &arm_smmu_flush_ops,
.iommu_dev  = smmu->dev,
@@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
if (reg & IDR0_HYP)
smmu->features |= ARM_SMMU_FEAT_HYP;
 
+   switch (FIELD_GET(IDR0_HTTU, reg)) {
+   case IDR0_HTTU_NONE:
+   break;
+   case IDR0_HTTU_HA:
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+   break;
+   case IDR0_HTTU_HAD:
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
+   break;
+   default:
+   dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
+   return -ENXIO;
+   }
+
/*
 * The coherency feature as set by FW is used in preference to the ID
 * register, but warn on mismatch.
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96c2e9565e00..e91bea44519e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,10 @@
 #define IDR0_ASID16(1 << 12)
 #define IDR0_ATS   (1 << 10)
 #define IDR0_HYP   (1 << 9)
+#define IDR0_HTTU  GENMASK(7, 6)
+#define IDR0_HTTU_NONE 0
+#define IDR0_HTTU_HA   1
+#define IDR0_HTTU_HAD  2
 #define IDR0_COHACC(1 << 4)
 #define IDR0_TTF   GENMASK(3, 2)
 #define IDR0_TTF_AARCH64   2
@@ -286,6 +290,8 @@
 #define CTXDESC_CD_0_TCR_TBI0  (1ULL << 38)
 
 #define CTXDESC_CD_0_AA64  (1UL << 41)
+#define CTXDESC_CD_0_HD(1UL << 42)
+#define CTXDESC_CD_0_HA(1UL << 43)
 #define CTXDESC_CD_0_S (1UL << 44)
 #define CTXDESC_CD_0_R (1UL << 45)
 #define CTXDESC_CD_0_A (1UL << 46)
@@ -604,6 +610,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_RANGE_INV(1 << 15)
 #define ARM_SMMU_FEAT_BTM  (1 << 16)
 #define ARM_SMMU_FEAT_SVA  (1 << 17)
+#define ARM_SMMU_FEAT_HTTU_HA  (1 << 18)
+#define ARM_SMMU_FEAT_HTTU_HD  (1 << 19)
u32 features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index ea727eb1a1a9..1a00ea8562c7 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -97,6 +97,7 @@ struct io_pgtable_cfg {
unsigned long   pgsize_bitmap;
unsigned intias;
unsigned intoas;
+   boolhttu_hd;
boolcoherent_walk;
const struct iommu_flush_ops*tlb;
struct device   *iommu_dev;
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log

2021-01-26 Thread Keqian Zhu
The intention:

On the arm64 platform, we track the dirty log of vCPUs through guest memory
aborts. KVM occupies some guest vCPU time to change the stage2 mapping and
mark pages dirty. This has a heavy side effect on the VM, especially when
multiple vCPUs race and some of them block on the kvm mmu_lock.

DBM is a HW auxiliary approach to log dirty pages. The MMU changes a PTE to
be writable if its DBM bit is set, so KVM doesn't have to occupy vCPU time
to log dirty pages.

About this patch series:

The biggest problem with applying DBM to stage2 is that software must scan
the PTs to collect dirty state, which may cost much time and affect the
downtime of migration.

This series realizes a SW/HW combined dirty log that can effectively solve
this problem (the smmu side can also use this approach for DMA dirty log
tracking).

The core idea is that we do not enable hardware dirty tracking at the start
(we do not add the DBM bit). When an arbitrary PT takes a fault, we perform
software tracking for this PT and enable hardware tracking for its *nearby*
PTs (e.g. add the DBM bit for the nearby 16 PTs). Then, when syncing the
dirty log, we already know all PTs with hardware dirty tracking enabled, so
we do not need to scan all PTs.

        mem abort point                     mem abort point
              ↓                                   ↓
   ------------------------------------------------------------
   |       |       |       |       |       |       |       |
   ------------------------------------------------------------
       ↑                                 ↑
   set DBM bit of                    set DBM bit of
   this PT section (64 PTEs)         this PT section (64 PTEs)
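
(A condensed sketch of the fault-path idea; handle_dirty_fault() is an
illustrative wrapper, while mark_page_dirty() and
kvm_arm_enable_nearby_hwdbm() are the helpers used/introduced in this
series, and the 64-PTE granule is a design choice rather than an
architectural requirement:)

  /* On a dirty-logging fault, log the faulting page in software and arm
   * hardware (DBM) tracking for the surrounding PT section, recording the
   * section in a per-memslot bitmap so that sync only scans sections that
   * were ever armed. */
  static void handle_dirty_fault(struct kvm *kvm, gfn_t gfn)
  {
          mark_page_dirty(kvm, gfn);              /* software tracking      */
          kvm_arm_enable_nearby_hwdbm(kvm, gfn);  /* DBM for nearby 64 PTEs */
  }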

We may worry that when the dirty rate is very high we still need to scan too
many PTs. Our main concern is the VM stop time. With QEMU dirty rate
throttling, the dirty memory converges towards the VM stop threshold, so
there are only a few PTs to scan after the VM stops.

This approach has the advantage of hardware tracking, which minimizes the
side effect on vCPUs, and also the advantage of software tracking, which
controls the vCPU dirty rate. Moreover, software tracking lets us scan PTs
at a few fixed points, which greatly reduces scanning time. And the biggest
benefit is that we can apply this solution to DMA dirty tracking.

Test:

Host: Kunpeng 920 with 128 CPUs and 512G RAM. Transparent Hugepage is
      disabled (to ensure the test result is not affected by the dissolution
      of block page tables at the early stage of migration).
VM:   16 CPUs, 16GB RAM. Run 4 pairs of (redis_benchmark + redis_server).

Each configuration was run 5 times for software dirty log and for SW/HW
combined dirty log.

Test result:

Gained a 5%~7% improvement in redis QPS during VM migration.
VM downtime is not fundamentally affected.
About 56.7% of the DBM bits were effectively used.

Keqian Zhu (7):
  arm64: cpufeature: Add API to report system support of HWDBM
  kvm: arm64: Use atomic operation when update PTE
  kvm: arm64: Add level_apply parameter for stage2_attr_walker
  kvm: arm64: Add some HW_DBM related pgtable interfaces
  kvm: arm64: Add some HW_DBM related mmu interfaces
  kvm: arm64: Only write protect selected PTE
  kvm: arm64: Start up SW/HW combined dirty log

 arch/arm64/include/asm/cpufeature.h  |  12 +++
 arch/arm64/include/asm/kvm_host.h|   6 ++
 arch/arm64/include/asm/kvm_mmu.h |   7 ++
 arch/arm64/include/asm/kvm_pgtable.h |  45 ++
 arch/arm64/kvm/arm.c | 125 ++
 arch/arm64/kvm/hyp/pgtable.c | 130 ++-
 arch/arm64/kvm/mmu.c |  47 +-
 arch/arm64/kvm/reset.c   |   8 +-
 8 files changed, 351 insertions(+), 29 deletions(-)

-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 7/7] kvm: arm64: Start up SW/HW combined dirty log

2021-01-26 Thread Keqian Zhu
We do not enable hardware dirty tracking at the start (we do not add the
DBM bit). When an arbitrary PT takes a fault, we perform software tracking
for this PT and enable hardware tracking for its nearby PTs (add the DBM
bit for the nearby 64 PTs). Then, when syncing the dirty log, we already
know all PTs with hardware dirty tracking enabled, so we do not need to
scan all PTs.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/include/asm/kvm_host.h |   6 ++
 arch/arm64/kvm/arm.c  | 125 ++
 arch/arm64/kvm/mmu.c  |   7 +-
 arch/arm64/kvm/reset.c|   8 +-
 4 files changed, 141 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 8fcfab0c2567..e9ea5b546326 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -99,6 +99,8 @@ struct kvm_s2_mmu {
 };
 
 struct kvm_arch_memory_slot {
+   #define HWDBM_GRANULE_SHIFT 6  /* 64 pages per bit */
+   unsigned long *hwdbm_bitmap;
 };
 
 struct kvm_arch {
@@ -565,6 +567,10 @@ struct kvm_vcpu_stat {
u64 exits;
 };
 
+int kvm_arm_init_hwdbm_bitmap(struct kvm_memory_slot *memslot);
+void kvm_arm_destroy_hwdbm_bitmap(struct kvm_memory_slot *memslot);
+void kvm_arm_enable_nearby_hwdbm(struct kvm *kvm, gfn_t gfn);
+
 int kvm_vcpu_preferred_target(struct kvm_vcpu_init *init);
 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 04c44853b103..9e05d45fa6be 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1257,9 +1257,134 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
return r;
 }
 
+static unsigned long kvm_hwdbm_bitmap_bytes(struct kvm_memory_slot *memslot)
+{
+   unsigned long nbits = DIV_ROUND_UP(memslot->npages, 1 << 
HWDBM_GRANULE_SHIFT);
+
+   return ALIGN(nbits, BITS_PER_LONG) / 8;
+}
+
+static unsigned long *kvm_second_hwdbm_bitmap(struct kvm_memory_slot *memslot)
+{
+   unsigned long len = kvm_hwdbm_bitmap_bytes(memslot);
+
+   return (void *)memslot->arch.hwdbm_bitmap + len;
+}
+
+/*
+ * Allocate twice the space. Refer to kvm_arch_sync_dirty_log() to see why
+ * the second space is needed.
+ */
+int kvm_arm_init_hwdbm_bitmap(struct kvm_memory_slot *memslot)
+{
+   unsigned long bytes = 2 * kvm_hwdbm_bitmap_bytes(memslot);
+
+   if (!system_supports_hw_dbm())
+   return 0;
+
+   if (memslot->arch.hwdbm_bitmap) {
+   /* Inherited from old memslot */
+   bitmap_clear(memslot->arch.hwdbm_bitmap, 0, bytes * 8);
+   } else {
+   memslot->arch.hwdbm_bitmap = kvzalloc(bytes, 
GFP_KERNEL_ACCOUNT);
+   if (!memslot->arch.hwdbm_bitmap)
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
+void kvm_arm_destroy_hwdbm_bitmap(struct kvm_memory_slot *memslot)
+{
+   if (!memslot->arch.hwdbm_bitmap)
+   return;
+
+   kvfree(memslot->arch.hwdbm_bitmap);
+   memslot->arch.hwdbm_bitmap = NULL;
+}
+
+/* Add DBM for nearby pagetables but do not across memslot */
+void kvm_arm_enable_nearby_hwdbm(struct kvm *kvm, gfn_t gfn)
+{
+   struct kvm_memory_slot *memslot;
+
+   memslot = gfn_to_memslot(kvm, gfn);
+   if (memslot && kvm_slot_dirty_track_enabled(memslot) &&
+   memslot->arch.hwdbm_bitmap) {
+   unsigned long rel_gfn = gfn - memslot->base_gfn;
+   unsigned long dbm_idx = rel_gfn >> HWDBM_GRANULE_SHIFT;
+   unsigned long start_page, npages;
+
+   if (!test_and_set_bit(dbm_idx, memslot->arch.hwdbm_bitmap)) {
+   start_page = dbm_idx << HWDBM_GRANULE_SHIFT;
+   npages = 1 << HWDBM_GRANULE_SHIFT;
+   npages = min(memslot->npages - start_page, npages);
+   kvm_stage2_set_dbm(kvm, memslot, start_page, npages);
+   }
+   }
+}
+
+/*
+ * We have to find a place to clear hwdbm_bitmap, and clearing hwdbm_bitmap
+ * means clearing the DBM bit of all related pgtables. Note that between
+ * clearing the DBM bit and flushing the TLB, HW dirty logging may still
+ * occur, so we must scan all related pgtables after the TLB flush. Given the
+ * above, the best choice is to clear hwdbm_bitmap before syncing the HW
+ * dirty log.
+ */
 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 {
+   unsigned long *second_bitmap = kvm_second_hwdbm_bitmap(memslot);
+   unsigned long start_page, npages;
+   unsigned int end, rs, re;
+   bool has_hwdbm = false;
+
+   if (!memslot->arch.hwdbm_bitmap)
+   return;
 
+   end = kvm_hwdbm_bitmap_bytes(memslot) * 8;
+   bitmap_clear(second_bitmap, 0, end);
+
+   spin_lock(&kvm->mmu_lock);
+   bitmap_for_each_set_region(memslot->arch.hwdbm_bitmap, rs, re, 0, end) {
+   has_hwdb

[RFC PATCH 6/7] kvm: arm64: Only write protect selected PTE

2021-01-26 Thread Keqian Zhu
This function write protects all PTEs between the ffs and fls of mask.
There may be unset bits in this range. It works well under pure
software dirty logging, as software dirty logging is not active during
this process.

But it will unexpectedly clear the dirty status of PTEs when hardware
dirty logging is enabled. So change it to write protect only the
selected PTEs.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 18717fd12731..2f8c6770a4dc 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -589,10 +589,14 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm 
*kvm,
gfn_t gfn_offset, unsigned long mask)
 {
phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
-   phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
-   phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+   phys_addr_t start, end;
+   int rs, re;
 
-   stage2_wp_range(>arch.mmu, start, end);
+   bitmap_for_each_set_region(&mask, rs, re, 0, BITS_PER_LONG) {
+   start = (base_gfn + rs) << PAGE_SHIFT;
+   end = (base_gfn + re) << PAGE_SHIFT;
+   stage2_wp_range(>arch.mmu, start, end);
+   }
 }
 
 /*
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 2/7] kvm: arm64: Use atomic operation when update PTE

2021-01-26 Thread Keqian Zhu
We are about to add HW_DBM support for stage2 dirty logging, so software
updates of a PTE may race with the MMU trying to set the access flag or
dirty state.

Use atomic operations to avoid reverting these bits set by the MMU.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/hyp/pgtable.c | 41 
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bdf8e55ed308..4915ba35f93b 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -153,10 +153,34 @@ static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte)
return __va(kvm_pte_to_phys(pte));
 }
 
+/*
+ * We may race with the MMU trying to set the access flag or dirty state,
+ * use atomic operations to avoid reverting these bits.
+ *
+ * Return original PTE.
+ */
+static kvm_pte_t kvm_update_pte(kvm_pte_t *ptep, kvm_pte_t bit_set,
+   kvm_pte_t bit_clr)
+{
+   kvm_pte_t old_pte, pte = *ptep;
+
+   do {
+   old_pte = pte;
+   pte &= ~bit_clr;
+   pte |= bit_set;
+
+   if (old_pte == pte)
+   break;
+
+   pte = cmpxchg_relaxed(ptep, old_pte, pte);
+   } while (pte != old_pte);
+
+   return old_pte;
+}
+
 static void kvm_set_invalid_pte(kvm_pte_t *ptep)
 {
-   kvm_pte_t pte = *ptep;
-   WRITE_ONCE(*ptep, pte & ~KVM_PTE_VALID);
+   kvm_update_pte(ptep, 0, KVM_PTE_VALID);
 }
 
 static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
@@ -723,18 +747,7 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 
level, kvm_pte_t *ptep,
return 0;
 
data->level = level;
-   data->pte = pte;
-   pte &= ~data->attr_clr;
-   pte |= data->attr_set;
-
-   /*
-* We may race with the CPU trying to set the access flag here,
-* but worst-case the access flag update gets lost and will be
-* set on the next access instead.
-*/
-   if (data->pte != pte)
-   WRITE_ONCE(*ptep, pte);
-
+   data->pte = kvm_update_pte(ptep, data->attr_set, data->attr_clr);
return 0;
 }
 
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 5/7] kvm: arm64: Add some HW_DBM related mmu interfaces

2021-01-26 Thread Keqian Zhu
This adds set_dbm, clear_dbm and sync_dirty interfaces in the mmu layer.
They simply wrap the corresponding pgtable-layer interfaces, adding
reschedule semantics.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/include/asm/kvm_mmu.h |  7 +++
 arch/arm64/kvm/mmu.c | 30 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e52d82aeadca..51dd31ba1b94 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -183,6 +183,13 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t 
size,
 void **haddr);
 void free_hyp_pgds(void);
 
+void kvm_stage2_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long npages);
+void kvm_stage2_set_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long npages);
+void kvm_stage2_sync_dirty(struct kvm *kvm, struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long npages);
+
 void stage2_unmap_vm(struct kvm *kvm);
 int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu);
 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7d2257cc5438..18717fd12731 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -609,6 +609,36 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
+void kvm_stage2_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long npages)
+{
+   phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+   phys_addr_t addr = base_gfn << PAGE_SHIFT;
+   phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+   stage2_apply_range_resched(kvm, addr, end, 
kvm_pgtable_stage2_clear_dbm);
+}
+
+void kvm_stage2_set_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long npages)
+{
+   phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+   phys_addr_t addr = base_gfn << PAGE_SHIFT;
+   phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+   stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_set_dbm);
+}
+
+void kvm_stage2_sync_dirty(struct kvm *kvm, struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long npages)
+{
+   phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+   phys_addr_t addr = base_gfn << PAGE_SHIFT;
+   phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+   stage2_apply_range_resched(kvm, addr, end, 
kvm_pgtable_stage2_sync_dirty);
+}
+
 static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
 {
__clean_dcache_guest_page(pfn, size);
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 3/7] kvm: arm64: Add level_apply parameter for stage2_attr_walker

2021-01-26 Thread Keqian Zhu
In order to change PTEs of some specific levels, the level_apply
parameter can be used as a level mask.

This has no functional change for the current code.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/hyp/pgtable.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4915ba35f93b..0f8a319f16fe 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -734,6 +734,7 @@ struct stage2_attr_data {
kvm_pte_t   attr_clr;
kvm_pte_t   pte;
u32 level;
+   u32 level_apply;
 };
 
 static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
@@ -743,6 +744,9 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, 
kvm_pte_t *ptep,
kvm_pte_t pte = *ptep;
struct stage2_attr_data *data = arg;
 
+   if (!(data->level_apply & BIT(level)))
+   return 0;
+
if (!kvm_pte_valid(pte))
return 0;
 
@@ -753,14 +757,15 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 
level, kvm_pte_t *ptep,
 
 static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
u64 size, kvm_pte_t attr_set,
-   kvm_pte_t attr_clr, kvm_pte_t *orig_pte,
-   u32 *level)
+   kvm_pte_t attr_clr, u32 level_apply,
+   kvm_pte_t *orig_pte, u32 *level)
 {
int ret;
kvm_pte_t attr_mask = KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI;
struct stage2_attr_data data = {
.attr_set   = attr_set & attr_mask,
.attr_clr   = attr_clr & attr_mask,
+   .level_apply= level_apply,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_attr_walker,
@@ -783,7 +788,7 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable 
*pgt, u64 addr,
 int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
 {
return stage2_update_leaf_attrs(pgt, addr, size, 0,
-   KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
+   KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W, -1,
NULL, NULL);
 }
 
@@ -791,7 +796,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable 
*pgt, u64 addr)
 {
kvm_pte_t pte = 0;
stage2_update_leaf_attrs(pgt, addr, 1, KVM_PTE_LEAF_ATTR_LO_S2_AF, 0,
-&pte, NULL);
+-1, &pte, NULL);
dsb(ishst);
return pte;
 }
@@ -800,7 +805,7 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, 
u64 addr)
 {
kvm_pte_t pte = 0;
stage2_update_leaf_attrs(pgt, addr, 1, 0, KVM_PTE_LEAF_ATTR_LO_S2_AF,
-&pte, NULL);
+-1, &pte, NULL);
/*
 * "But where's the TLBI?!", you scream.
 * "Over in the core code", I sigh.
@@ -813,7 +818,7 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, 
u64 addr)
 bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr)
 {
kvm_pte_t pte = 0;
-   stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, &pte, NULL);
+   stage2_update_leaf_attrs(pgt, addr, 1, 0, 0, -1, &pte, NULL);
return pte & KVM_PTE_LEAF_ATTR_LO_S2_AF;
 }
 
@@ -833,7 +838,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, 
u64 addr,
if (prot & KVM_PGTABLE_PROT_X)
clr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
 
-   ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level);
+   ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, -1, NULL, &level);
if (!ret)
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
return ret;
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH 4/7] kvm: arm64: Add some HW_DBM related pgtable interfaces

2021-01-26 Thread Keqian Zhu
This adds set_dbm, clear_dbm and sync_dirty interfaces in the pgtable
layer. (1) set_dbm: Set the DBM bit for the last-level PTEs of a
specified range; TLBI is performed. (2) clear_dbm: Clear the DBM bit
for the last-level PTEs of a specified range; TLBI is not performed.
(3) sync_dirty: Scan the last-level PTEs of a specified range and log
a page as dirty if its PTE is writable.

Besides, save the dirty state of a PTE if it is invalidated by map or
unmap.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/include/asm/kvm_pgtable.h | 45 ++
 arch/arm64/kvm/hyp/pgtable.c | 70 
 2 files changed, 115 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index 52ab38db04c7..8984b5227cfc 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -204,6 +204,51 @@ int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 
addr, u64 size);
  */
 int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size);
 
+/**
+ * kvm_pgtable_stage2_clear_dbm() - Clear DBM of guest stage-2 address range
+ *  without TLB invalidation (only last level).
+ * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:  Intermediate physical address from which to clear DBM,
+ * @size:  Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Note that it is the caller's responsibility to invalidate the TLB after
+ * calling this function to ensure that the disabled HW dirty are visible
+ * to the CPUs.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_set_dbm() - Set DBM of guest stage-2 address range to
+ *enable HW dirty (only last level).
+ * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:  Intermediate physical address from which to set DBM.
+ * @size:  Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_sync_dirty() - Sync HW dirty state into memslot.
+ * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:  Intermediate physical address from which to sync.
+ * @size:  Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
 /**
  * kvm_pgtable_stage2_mkyoung() - Set the access flag in a page-table entry.
  * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0f8a319f16fe..b6f0d2f3aee4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -43,6 +43,7 @@
 
 #define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
 
+#define KVM_PTE_LEAF_ATTR_HI_S2_DBMBIT(51)
 #define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
 
 struct kvm_pgtable_walk_data {
@@ -485,6 +486,11 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot 
prot,
return 0;
 }
 
+static bool stage2_pte_writable(kvm_pte_t pte)
+{
+   return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
+}
+
 static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
   kvm_pte_t *ptep,
   struct stage2_map_data *data)
@@ -509,6 +515,11 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, 
u32 level,
/* There's an existing valid leaf entry, so perform break-before-make */
kvm_set_invalid_pte(ptep);
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
+
+   /* Save the possible hardware dirty info */
+   if ((level == KVM_PGTABLE_MAX_LEVELS - 1) && stage2_pte_writable(*ptep))
+   mark_page_dirty(data->mmu->kvm, addr >> PAGE_SHIFT);
+
kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
 out:
data->phys += granule;
@@ -547,6 +558,10 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 
level, kvm_pte_t *ptep,
if (kvm_pte_valid(pte))
put_page(page);
 
+   /*
+* HW DBM is not working during page merging, so we don't
+* need to save possible hardware dirty info here.
+*/
return 0;
}
 
@@ -707,6 +722,10 @@ static int stage2_unmap_walker(u64 addr, u64 end, u32 
level, kvm_pte_t *ptep,
kvm_call_hyp(__

[RFC PATCH 1/7] arm64: cpufeature: Add API to report system support of HWDBM

2021-01-26 Thread Keqian Zhu
Though we already have a CPU capability named ARM64_HW_DBM, it is a
LOCAL_CPU cap and is conditionally compiled under CONFIG_ARM64_HW_AFDBM.

This adds an API that reports system-wide support of HW DBM.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/include/asm/cpufeature.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index 9a555809b89c..dfded86c7684 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -664,6 +664,18 @@ static inline bool system_supports_mixed_endian(void)
return val == 0x1;
 }
 
+static inline bool system_supports_hw_dbm(void)
+{
+   u64 mmfr1;
+   u32 val;
+
+   mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+   val = cpuid_feature_extract_unsigned_field(mmfr1,
+   ID_AA64MMFR1_HADBS_SHIFT);
+
+   return val == 0x2;
+}
+
 static __always_inline bool system_supports_fpsimd(void)
 {
return !cpus_have_const_cap(ARM64_HAS_NO_FPSIMD);
-- 
2.19.1
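
For context, a consumer of this new helper (for example, KVM deciding whether
it can rely on stage-2 DBM for dirty logging) would gate its setup on the
system-wide capability. A minimal sketch, assuming a hypothetical wrapper name
that is not part of this series:

/* Illustrative only: a possible caller, not from this patch. */
static bool kvm_arm_support_hw_dirty_log(void)
{
	/* Stage-2 dirty bits are only trustworthy if every CPU has DBM. */
	return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && system_supports_hw_dbm();
}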

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-01-25 Thread Keqian Zhu
I forgot to give the link to the bugfix I mentioned below :-).

[1] https://lkml.org/lkml/2020/5/1/1294

On 2021/1/25 19:25, Keqian Zhu wrote:
> Hi Marc,
> 
> On 2021/1/22 17:45, Marc Zyngier wrote:
>> On 2021-01-22 08:36, Keqian Zhu wrote:
>>> The MMIO region of a device may be huge (GB level), so try to use block
>>> mapping in stage2 to speed up both map and unmap.
>>>
>>> Especially for unmap, it performs a TLBI right after each invalidation of
>>> a PTE. If all mappings are of PAGE_SIZE, it takes much time to handle a
>>> GB-level range.
>>
>> This is only on VM teardown, right? Or do you unmap the device more often?
>> Can you please quantify the speedup and the conditions this occurs in?
> 
> Yes, and there are some other paths (including what your patch series handles)
> that will do the unmap action:
> 
> 1. guest reboot without S2FWB: stage2_unmap_vm(), which only unmaps guest
> regular RAM.
> 2. userspace deletes a memslot: kvm_arch_flush_shadow_memslot().
> 3. rollback of device MMIO mapping: kvm_arch_prepare_memory_region().
> 4. rollback of dirty log tracking: if we enable hugepages for guest RAM, then
> after dirty logging is stopped, the newly created block mappings will unmap
> all of the page mappings.
> 5. mmu notifier: kvm_unmap_hva_range(). AFAICS, we will use this path on VM
> teardown or when the guest resets pass-through devices. The bugfix[1] gives
> the reason for unmapping the MMIO region when the guest resets pass-through
> devices.
> 
> For unmap of MMIO regions, which is what this patch addresses:
> point 1 does not apply.
> point 2 occurs when userspace unplugs pass-through devices.
> point 3 can occur, but rarely.
> point 4 does not apply.
> point 5 occurs on VM teardown or when the guest resets pass-through devices.
> 
> And I had a look at your patch series; it can solve:
> For VM teardown, elide CMOs and perform a VM-wide TLBI (VMALL) instead of
> invalidating individually (but the current kernel does not go through this
> path on VM teardown).
> For rollback of dirty log tracking, elide CMOs.
> For kvm_unmap_hva_range, if the event is MMU_NOTIFY_UNMAP, elide CMOs.
> 
> (But I doubt the CMOs in unmap. As we perform CMOs in user_mem_abort when
> installing a new stage2 mapping for the VM, maybe the CMO in unmap is
> unnecessary under all conditions :-) ?)
> 
> So we are solving different parts of unmap, and they are not conflicting.
> At least this patch can still speed up the map of a device MMIO region, and
> speed up its unmap even if we do not need to perform CMOs and TLBIs ;-).
> 
> speedup: unmap 8GB MMIO on FPGA.
> 
>         before        after opt
> cost    30+ minutes   949ms
> 
> Thanks,
> Keqian
> 
>>
>> I have the feeling that we are just circling around another problem,
>> which is that we could rely on a VM-wide TLBI when tearing down the
>> guest. I worked on something like that[1] a long while ago, and parked
>> it for some reason. Maybe it is worth reviving.
>>
>> [1] 
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/elide-cmo-tlbi
>>
>>>
>>> Signed-off-by: Keqian Zhu 
>>> ---
>>>  arch/arm64/include/asm/kvm_pgtable.h | 11 +++
>>>  arch/arm64/kvm/hyp/pgtable.c | 15 +++
>>>  arch/arm64/kvm/mmu.c | 12 
>>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h
>>> b/arch/arm64/include/asm/kvm_pgtable.h
>>> index 52ab38db04c7..2266ac45f10c 100644
>>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>>> @@ -82,6 +82,17 @@ struct kvm_pgtable_walker {
>>>  const enum kvm_pgtable_walk_flags flags;
>>>  };
>>>
>>> +/**
>>> + * kvm_supported_pgsize() - Get the max supported page size of a mapping.
>>> + * @pgt:Initialised page-table structure.
>>> + * @addr:Virtual address at which to place the mapping.
>>> + * @end:End virtual address of the mapping.
>>> + * @phys:Physical address of the memory to map.
>>> + *
>>> + * The smallest return value is PAGE_SIZE.
>>> + */
>>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>>> phys);
>>> +
>>>  /**
>>>   * kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
>>>   * @pgt:Uninitialised page-table structure to initialise.
>>> diff --git a/arch/arm64/kvm

Re: [RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-01-25 Thread Keqian Zhu
Hi Marc,

On 2021/1/22 17:45, Marc Zyngier wrote:
> On 2021-01-22 08:36, Keqian Zhu wrote:
>> The MMIO region of a device may be huge (GB level), so try to use block
>> mapping in stage2 to speed up both map and unmap.
>>
>> Especially for unmap, it performs a TLBI right after each invalidation of
>> a PTE. If all mappings are of PAGE_SIZE, it takes much time to handle a
>> GB-level range.
> 
> This is only on VM teardown, right? Or do you unmap the device more often?
> Can you please quantify the speedup and the conditions this occurs in?

Yes, and there are some other paths (including what your patch series handles)
that will do the unmap action:

1. guest reboot without S2FWB: stage2_unmap_vm(), which only unmaps guest
regular RAM.
2. userspace deletes a memslot: kvm_arch_flush_shadow_memslot().
3. rollback of device MMIO mapping: kvm_arch_prepare_memory_region().
4. rollback of dirty log tracking: if we enable hugepages for guest RAM, then
after dirty logging is stopped, the newly created block mappings will unmap
all of the page mappings.
5. mmu notifier: kvm_unmap_hva_range(). AFAICS, we will use this path on VM
teardown or when the guest resets pass-through devices. The bugfix[1] gives
the reason for unmapping the MMIO region when the guest resets pass-through
devices.

For unmap of MMIO regions, which is what this patch addresses:
point 1 does not apply.
point 2 occurs when userspace unplugs pass-through devices.
point 3 can occur, but rarely.
point 4 does not apply.
point 5 occurs on VM teardown or when the guest resets pass-through devices.

And I had a look at your patch series; it can solve:
For VM teardown, elide CMOs and perform a VM-wide TLBI (VMALL) instead of
invalidating individually (but the current kernel does not go through this
path on VM teardown).
For rollback of dirty log tracking, elide CMOs.
For kvm_unmap_hva_range, if the event is MMU_NOTIFY_UNMAP, elide CMOs.

(But I doubt the CMOs in unmap. As we perform CMOs in user_mem_abort when
installing a new stage2 mapping for the VM, maybe the CMO in unmap is
unnecessary under all conditions :-) ?)

So we are solving different parts of unmap, and they are not conflicting.
At least this patch can still speed up the map of a device MMIO region, and
speed up its unmap even if we do not need to perform CMOs and TLBIs ;-).

speedup: unmap 8GB MMIO on FPGA.

        before        after opt
cost    30+ minutes   949ms
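
For a rough sense of scale (back-of-the-envelope, not a measurement from this
thread): at page granularity an 8GB range needs

    8 GiB / 4 KiB = 2,097,152 page-level PTEs (one TLBI each)
    8 GiB / 2 MiB =     4,096 block entries
    8 GiB / 1 GiB =         8 block entries

so collapsing the range into block mappings removes almost all of the per-entry
invalidation work, which is consistent with the minutes-to-milliseconds gap
above.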

Thanks,
Keqian

> 
> I have the feeling that we are just circling around another problem,
> which is that we could rely on a VM-wide TLBI when tearing down the
> guest. I worked on something like that[1] a long while ago, and parked
> it for some reason. Maybe it is worth reviving.
> 
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/elide-cmo-tlbi
> 
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  arch/arm64/include/asm/kvm_pgtable.h | 11 +++
>>  arch/arm64/kvm/hyp/pgtable.c | 15 +++
>>  arch/arm64/kvm/mmu.c | 12 
>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h
>> b/arch/arm64/include/asm/kvm_pgtable.h
>> index 52ab38db04c7..2266ac45f10c 100644
>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>> @@ -82,6 +82,17 @@ struct kvm_pgtable_walker {
>>  const enum kvm_pgtable_walk_flags flags;
>>  };
>>
>> +/**
>> + * kvm_supported_pgsize() - Get the max supported page size of a mapping.
>> + * @pgt:Initialised page-table structure.
>> + * @addr:Virtual address at which to place the mapping.
>> + * @end:End virtual address of the mapping.
>> + * @phys:Physical address of the memory to map.
>> + *
>> + * The smallest return value is PAGE_SIZE.
>> + */
>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>> phys);
>> +
>>  /**
>>   * kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
>>   * @pgt:Uninitialised page-table structure to initialise.
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index bdf8e55ed308..ab11609b9b13 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -81,6 +81,21 @@ static bool kvm_block_mapping_supported(u64 addr,
>> u64 end, u64 phys, u32 level)
>>  return IS_ALIGNED(addr, granule) && IS_ALIGNED(phys, granule);
>>  }
>>
>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>> phys)
>> +{
>> +u32 lvl;
>> +u64 pgsize = PAGE_SIZE;
>> +
>> +for (lvl = pgt->start_level; lvl &

[PATCH] vfio/iommu_type1: Maintain a counter for non_pinned_groups

2021-01-24 Thread Keqian Zhu
With this counter, we never need to traverse all groups to update the
pinned-page dirty scope of a vfio_iommu.

Suggested-by: Alex Williamson 
Signed-off-by: Keqian Zhu 
---
 drivers/vfio/vfio_iommu_type1.c | 40 +
 1 file changed, 5 insertions(+), 35 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0b4dedaa9128..bb4bbcc79101 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -73,7 +73,7 @@ struct vfio_iommu {
bool    v2;
bool    nesting;
bool    dirty_page_tracking;
-   bool    pinned_page_dirty_scope;
+   uint64_t    num_non_pinned_groups;
 };
 
 struct vfio_domain {
@@ -148,7 +148,6 @@ static int put_pfn(unsigned long pfn, int prot);
 static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
   struct iommu_group *iommu_group);
 
-static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu);
 /*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
@@ -714,7 +713,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
group = vfio_iommu_find_iommu_group(iommu, iommu_group);
if (!group->pinned_page_dirty_scope) {
group->pinned_page_dirty_scope = true;
-   update_pinned_page_dirty_scope(iommu);
+   iommu->num_non_pinned_groups--;
}
 
goto pin_done;
@@ -991,7 +990,7 @@ static int update_user_bitmap(u64 __user *bitmap, struct 
vfio_iommu *iommu,
 * mark all pages dirty if any IOMMU capable device is not able
 * to report dirty pages and all pages are pinned and mapped.
 */
-   if (!iommu->pinned_page_dirty_scope && dma->iommu_mapped)
+   if (iommu->num_non_pinned_groups && dma->iommu_mapped)
bitmap_set(dma->bitmap, 0, nbits);
 
if (shift) {
@@ -1622,33 +1621,6 @@ static struct vfio_group 
*vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
return group;
 }
 
-static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu)
-{
-   struct vfio_domain *domain;
-   struct vfio_group *group;
-
-   list_for_each_entry(domain, &iommu->domain_list, next) {
-   list_for_each_entry(group, &domain->group_list, next) {
-   if (!group->pinned_page_dirty_scope) {
-   iommu->pinned_page_dirty_scope = false;
-   return;
-   }
-   }
-   }
-
-   if (iommu->external_domain) {
-   domain = iommu->external_domain;
-   list_for_each_entry(group, &domain->group_list, next) {
-   if (!group->pinned_page_dirty_scope) {
-   iommu->pinned_page_dirty_scope = false;
-   return;
-   }
-   }
-   }
-
-   iommu->pinned_page_dirty_scope = true;
-}
-
 static bool vfio_iommu_has_sw_msi(struct list_head *group_resv_regions,
  phys_addr_t *base)
 {
@@ -2057,8 +2029,6 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 * addition of a dirty tracking group.
 */
group->pinned_page_dirty_scope = true;
-   if (!iommu->pinned_page_dirty_scope)
-   update_pinned_page_dirty_scope(iommu);
mutex_unlock(&iommu->lock);
 
return 0;
@@ -2188,7 +2158,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 * demotes the iommu scope until it declares itself dirty tracking
 * capable via the page pinning interface.
 */
-   iommu->pinned_page_dirty_scope = false;
+   iommu->num_non_pinned_groups++;
mutex_unlock(&iommu->lock);
vfio_iommu_resv_free(&group_resv_regions);
 
@@ -2416,7 +2386,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
 * to be promoted.
 */
if (update_dirty_scope)
-   update_pinned_page_dirty_scope(iommu);
+   iommu->num_non_pinned_groups--;
mutex_unlock(&iommu->lock);
 }
 
-- 
2.19.1
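
To summarise the rule the counter encodes, here is a rough model (illustrative
only; locking and list walking are omitted and the model_* names are
hypothetical, not part of the patch): a group is counted while its dirty scope
is unknown, dropped from the count once it proves pinned-only scope (or
detaches), and dirty reporting falls back to "all pages dirty" whenever the
count is non-zero.

/* Rough model of the num_non_pinned_groups invariant (illustrative only). */
static void model_attach(struct vfio_iommu *iommu)
{
	/* A newly attached IOMMU-backed group has unknown dirty scope. */
	iommu->num_non_pinned_groups++;
}

static void model_promote(struct vfio_iommu *iommu, struct vfio_group *group)
{
	/* The first pin_pages call from the group proves pinned-only scope. */
	if (!group->pinned_page_dirty_scope) {
		group->pinned_page_dirty_scope = true;
		iommu->num_non_pinned_groups--;
	}
}

static bool model_report_all_dirty(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
	/* Any counted group forces full-bitmap dirty reporting. */
	return iommu->num_non_pinned_groups && dma->iommu_mapped;
}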

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 2/2] vfio/iommu_type1: Fix some sanity checks in detach group

2021-01-22 Thread Keqian Zhu
vfio_sanity_check_pfn_list() is used to check whether pfn_list and
notifier are empty when removing the external domain, so it makes the
wrong assumption that only the external domain will use the pinning
interface.

Now we apply the pfn_list check when a vfio_dma is removed and apply
the notifier check when all domains are removed.

Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
Signed-off-by: Keqian Zhu 
---
 drivers/vfio/vfio_iommu_type1.c | 33 ++---
 1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 161725395f2f..d8c10f508321 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -957,6 +957,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
+   WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
vfio_unmap_unpin(iommu, dma, true);
vfio_unlink_dma(iommu, dma);
put_task_struct(dma->task);
@@ -2250,23 +2251,6 @@ static void vfio_iommu_unmap_unpin_reaccount(struct 
vfio_iommu *iommu)
}
 }
 
-static void vfio_sanity_check_pfn_list(struct vfio_iommu *iommu)
-{
-   struct rb_node *n;
-
-   n = rb_first(&iommu->dma_list);
-   for (; n; n = rb_next(n)) {
-   struct vfio_dma *dma;
-
-   dma = rb_entry(n, struct vfio_dma, node);
-
-   if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list)))
-   break;
-   }
-   /* mdev vendor driver must unregister notifier */
-   WARN_ON(iommu->notifier.head);
-}
-
 /*
  * Called when a domain is removed in detach. It is possible that
  * the removed domain decided the iova aperture window. Modify the
@@ -2366,10 +2350,10 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
kfree(group);
 
if (list_empty(&iommu->external_domain->group_list)) {
-   vfio_sanity_check_pfn_list(iommu);
-
-   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+   WARN_ON(iommu->notifier.head);
vfio_iommu_unmap_unpin_all(iommu);
+   }
 
kfree(iommu->external_domain);
iommu->external_domain = NULL;
@@ -2403,10 +2387,12 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
 */
if (list_empty(&domain->group_list)) {
if (list_is_singular(&iommu->domain_list)) {
-   if (!iommu->external_domain)
+   if (!iommu->external_domain) {
+   WARN_ON(iommu->notifier.head);
vfio_iommu_unmap_unpin_all(iommu);
-   else
+   } else {
vfio_iommu_unmap_unpin_reaccount(iommu);
+   }
}
iommu_domain_free(domain->domain);
list_del(&domain->next);
@@ -2488,9 +2474,10 @@ static void vfio_iommu_type1_release(void *iommu_data)
struct vfio_iommu *iommu = iommu_data;
struct vfio_domain *domain, *domain_tmp;
 
+   WARN_ON(iommu->notifier.head);
+
if (iommu->external_domain) {
vfio_release_domain(iommu->external_domain, true);
-   vfio_sanity_check_pfn_list(iommu);
kfree(iommu->external_domain);
}
 
-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 0/2] vfio/iommu_type1: some fixes

2021-01-22 Thread Keqian Zhu
v3:
 - Populate bitmap unconditionally.
 - Sanity check notifier when removing all domains.

v2:
 - Address suggestions from Alex.
 - Remove unnecessary patches.
 

Keqian Zhu (2):
  vfio/iommu_type1: Populate full dirty when detach non-pinned group
  vfio/iommu_type1: Fix some sanity checks in detach group

 drivers/vfio/vfio_iommu_type1.c | 50 +
 1 file changed, 26 insertions(+), 24 deletions(-)

-- 
2.19.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 1/2] vfio/iommu_type1: Populate full dirty when detach non-pinned group

2021-01-22 Thread Keqian Zhu
If a group with non-pinned-page dirty scope is detached with dirty
logging enabled, we should fully populate the dirty bitmaps at the
time it's removed since we don't know the extent of its previous DMA,
nor will the group be present to trigger the full bitmap when the user
retrieves the dirty bitmap.

Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages 
tracking")
Suggested-by: Alex Williamson 
Signed-off-by: Keqian Zhu 
---
 drivers/vfio/vfio_iommu_type1.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0b4dedaa9128..161725395f2f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -236,6 +236,18 @@ static void vfio_dma_populate_bitmap(struct vfio_dma *dma, 
size_t pgsize)
}
 }
 
+static void vfio_iommu_populate_bitmap_full(struct vfio_iommu *iommu)
+{
+   struct rb_node *n;
+   unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
+
+   for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+   struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+   bitmap_set(dma->bitmap, 0, dma->size >> pgshift);
+   }
+}
+
 static int vfio_dma_bitmap_alloc_all(struct vfio_iommu *iommu, size_t pgsize)
 {
struct rb_node *n;
@@ -2415,8 +2427,11 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
 * Removal of a group without dirty tracking may allow the iommu scope
 * to be promoted.
 */
-   if (update_dirty_scope)
+   if (update_dirty_scope) {
update_pinned_page_dirty_scope(iommu);
+   if (iommu->dirty_page_tracking)
+   vfio_iommu_populate_bitmap_full(iommu);
+   }
mutex_unlock(&iommu->lock);
 }
 
-- 
2.19.1
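
As a rough illustration of what full population amounts to (a standalone
userspace sketch, not driver code): with a 4K IOMMU page size,
bitmap_set(dma->bitmap, 0, dma->size >> pgshift) marks one bit per 4K page,
so a 2MB vfio_dma gets 512 bits set.

#include <stdio.h>

int main(void)
{
	unsigned long pgsize_bitmap = 1UL << 12;		/* smallest IOMMU page: 4K */
	unsigned long pgshift = __builtin_ctzl(pgsize_bitmap);	/* what __ffs() computes */
	unsigned long dma_size = 2UL << 20;			/* a 2MB mapping */

	/* Mirrors bitmap_set(dma->bitmap, 0, dma->size >> pgshift). */
	printf("%lu bits set\n", dma_size >> pgshift);		/* prints 512 */
	return 0;
}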

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-01-22 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level), so try to use block
mapping in stage2 to speed up both map and unmap.

Especially for unmap, it performs a TLBI right after each invalidation of
a PTE. If all mappings are of PAGE_SIZE, it takes much time to handle a
GB-level range.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/include/asm/kvm_pgtable.h | 11 +++
 arch/arm64/kvm/hyp/pgtable.c | 15 +++
 arch/arm64/kvm/mmu.c | 12 
 3 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index 52ab38db04c7..2266ac45f10c 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -82,6 +82,17 @@ struct kvm_pgtable_walker {
const enum kvm_pgtable_walk_flags   flags;
 };
 
+/**
+ * kvm_supported_pgsize() - Get the max supported page size of a mapping.
+ * @pgt:   Initialised page-table structure.
+ * @addr:  Virtual address at which to place the mapping.
+ * @end:   End virtual address of the mapping.
+ * @phys:  Physical address of the memory to map.
+ *
+ * The smallest return value is PAGE_SIZE.
+ */
+u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 phys);
+
 /**
  * kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
  * @pgt:   Uninitialised page-table structure to initialise.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bdf8e55ed308..ab11609b9b13 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -81,6 +81,21 @@ static bool kvm_block_mapping_supported(u64 addr, u64 end, 
u64 phys, u32 level)
return IS_ALIGNED(addr, granule) && IS_ALIGNED(phys, granule);
 }
 
+u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 phys)
+{
+   u32 lvl;
+   u64 pgsize = PAGE_SIZE;
+
+   for (lvl = pgt->start_level; lvl < KVM_PGTABLE_MAX_LEVELS; lvl++) {
+   if (kvm_block_mapping_supported(addr, end, phys, lvl)) {
+   pgsize = kvm_granule_size(lvl);
+   break;
+   }
+   }
+
+   return pgsize;
+}
+
 static u32 kvm_pgtable_idx(struct kvm_pgtable_walk_data *data, u32 level)
 {
u64 shift = kvm_granule_shift(level);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7d2257cc5438..80b403fc8e64 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -499,7 +499,8 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
  phys_addr_t pa, unsigned long size, bool writable)
 {
-   phys_addr_t addr;
+   phys_addr_t addr, end;
+   unsigned long pgsize;
int ret = 0;
struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
@@ -509,21 +510,24 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
guest_ipa,
 
size += offset_in_page(guest_ipa);
guest_ipa &= PAGE_MASK;
+   end = guest_ipa + size;
 
-   for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
+   for (addr = guest_ipa; addr < end; addr += pgsize) {
ret = kvm_mmu_topup_memory_cache(&cache,
 kvm_mmu_cache_min_pages(kvm));
if (ret)
break;
 
+   pgsize = kvm_supported_pgsize(pgt, addr, end, pa);
+
spin_lock(&kvm->mmu_lock);
-   ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
+   ret = kvm_pgtable_stage2_map(pgt, addr, pgsize, pa, prot,
 &cache);
spin_unlock(&kvm->mmu_lock);
if (ret)
break;
 
-   pa += PAGE_SIZE;
+   pa += pgsize;
}
 
kvm_mmu_free_memory_cache(&cache);
-- 
2.19.1
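
To make the granule selection concrete, here is a standalone model of the walk
kvm_supported_pgsize() performs for a 4K-granule stage-2 table (illustrative
only; the real code derives the sizes from kvm_granule_size() and starts at
pgt->start_level, and kvm_block_mapping_supported() has further level checks):

#include <stdint.h>
#include <stdio.h>

/* Model of the alignment/size rule used to pick the mapping granule. */
static uint64_t pick_granule(uint64_t ipa, uint64_t end, uint64_t pa)
{
	/* 4K pages: level 1 = 1GB block, level 2 = 2MB block, level 3 = 4KB page */
	static const uint64_t granule[] = { 1ULL << 30, 1ULL << 21, 1ULL << 12 };

	for (int i = 0; i < 3; i++) {
		uint64_t g = granule[i];

		if (end - ipa >= g && !(ipa & (g - 1)) && !(pa & (g - 1)))
			return g;
	}
	return 1ULL << 12;	/* the smallest return value is PAGE_SIZE */
}

int main(void)
{
	/* An 8GB, 1GB-aligned MMIO window maps with eight 1GB block entries. */
	uint64_t ipa = 0x80000000ULL, pa = 0x2080000000ULL, size = 8ULL << 30;
	uint64_t mapped = 0, entries = 0;

	while (mapped < size) {
		mapped += pick_granule(ipa + mapped, ipa + size, pa + mapped);
		entries++;
	}
	printf("%llu stage-2 entries\n", (unsigned long long)entries);	/* prints 8 */
	return 0;
}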

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

