Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log

2021-04-20 Thread Keqian Zhu
Hi Ben,

On 2021/4/20 3:20, Ben Gardon wrote:
> On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu  wrote:
>>
>> Currently during start dirty logging, if we're with init-all-set,
>> we write protect huge pages and leave normal pages untouched, for
>> that we can enable dirty logging for these pages lazily.
>>
>> Actually enable dirty logging lazily for huge pages is feasible
>> too, which not only reduces the time of start dirty logging, also
>> greatly reduces side-effect on guest when there is high dirty rate.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  arch/x86/kvm/mmu/mmu.c | 48 ++
>>  arch/x86/kvm/x86.c | 37 +---
>>  2 files changed, 54 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 2ce5bc2ea46d..98fa25172b9a 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct 
>> kvm_rmap_head *rmap_head,
>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
>>   * @mask: indicates which pages we should protect
>>   *
>> - * Used when we do not need to care about huge page mappings: e.g. during 
>> dirty
>> - * logging we do not have any such mappings.
>> + * Used when we do not need to care about huge page mappings.
>>   */
>>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>>  struct kvm_memory_slot *slot,
>> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm 
>> *kvm,
>>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages 
>> to
>>   * enable dirty logging for them.
>>   *
>> - * Used when we do not need to care about huge page mappings: e.g. during 
>> dirty
>> - * logging we do not have any such mappings.
>> + * We need to care about huge page mappings: e.g. during dirty logging we 
>> may
>> + * have any such mappings.
>>   */
>>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>> struct kvm_memory_slot *slot,
>> gfn_t gfn_offset, unsigned long mask)
>>  {
>> +   gfn_t start, end;
>> +
>> +   /*
>> +* Huge pages are NOT write protected when we start dirty log with
>> +* init-all-set, so we must write protect them at here.
>> +*
>> +* The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
>> +* of memslot has no such restriction, so the range can cross two 
>> large
>> +* pages.
>> +*/
>> +   if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
>> +   start = slot->base_gfn + gfn_offset + __ffs(mask);
>> +   end = slot->base_gfn + gfn_offset + __fls(mask);
>> +   kvm_mmu_slot_gfn_write_protect(kvm, slot, start, 
>> PG_LEVEL_2M);
>> +
>> +   /* Cross two large pages? */
>> +   if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
>> +   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
>> +   kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
>> +  PG_LEVEL_2M);
>> +   }
>> +
>> +   /*
>> +* RFC:
>> +*
>> +* 1. I don't return early when kvm_mmu_slot_gfn_write_protect() 
>> returns
>> +* true, because I am not very clear about the relationship between
>> +* legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
>> +* manner.
>> +*
>> +* The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
>> +* writable large page mapping in legacy mmu mapping or tdp mmu 
>> mapping.
>> +* Do we still have normal mapping in that case? (e.g. We have large
>> +* mapping in legacy mmu and normal mapping in tdp mmu).
> 
> Right, we can't return early because the two MMUs could map the page
> in different ways, but each MMU could also map the page in multiple
> ways independently.
> For example, if the legacy MMU was being used and we were running a
> nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
> still need kvm_mmu_slot_gfn_write_protect  calls for both levels.
> I don't think there's a case where we can return early here with the
> information that the first calls to kvm_mmu_slot_gfn_write_protect
> access.
Thanks for the detailed

Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface

2021-04-20 Thread Keqian Zhu
Hi Baolu,

Cheers for your quick reply.

On 2021/4/20 10:09, Lu Baolu wrote:
> Hi Keqian,
> 
> On 4/20/21 9:25 AM, Keqian Zhu wrote:
>> Hi Baolu,
>>
>> On 2021/4/19 21:33, Lu Baolu wrote:
>>> Hi Keqian,
>>>
>>> On 2021/4/19 17:32, Keqian Zhu wrote:
>>>>>> +EXPORT_SYMBOL_GPL(iommu_split_block);
>>>>> Do you really have any consumers of this interface other than the dirty
>>>>> bit tracking? If not, I don't suggest to make this as a generic IOMMU
>>>>> interface.
>>>>>
>>>>> There is an implicit requirement for such interfaces. The
>>>>> iommu_map/unmap(iova, size) shouldn't be called at the same time.
>>>>> Currently there's no such sanity check in the iommu core. A poorly
>>>>> written driver could mess up the kernel by misusing this interface.
>>>> Yes, I don't think up a scenario except dirty tracking.
>>>>
>>>> Indeed, we'd better not make them as a generic interface.
>>>>
>>>> Do you have any suggestion that underlying iommu drivers can share these 
>>>> code but
>>>> not make it as a generic iommu interface?
>>>>
>>>> I have a not so good idea. Make the "split" interfaces as a static 
>>>> function, and
>>>> transfer the function pointer to start_dirty_log. But it looks weird and 
>>>> inflexible.
>>>
>>> I understand splitting/merging super pages is an optimization, but not a
>>> functional requirement. So is it possible to let the vendor iommu driver
>>> decide whether splitting super pages when starting dirty bit tracking
>>> and the opposite operation during when stopping it? The requirement for
>> Right. If I understand you correct, actually that is what this series does.
> 
> I mean to say no generic APIs, jut do it by the iommu subsystem itself.
> It's totally transparent to the upper level, just like what map() does.
> The upper layer doesn't care about either super page or small page is
> in use when do a mapping, right?
> 
> If you want to consolidate some code, how about putting them in
> start/stop_tracking()?

Yep, this reminds me. What we want to reuse is the "chunk by chunk" logic of split().
We can implement switch_dirty_log to work "chunk by chunk" too (just the same as sync/clear),
then the vendor iommu driver can invoke its own private implementation of split().
So we can completely remove split() from the IOMMU core layer.

Example code logic:

iommu.c:

switch_dirty_log(big range)
{
	for_each_iommu_page(big range)
		ops->switch_dirty_log(iommu_pgsize);
}

vendor iommu driver:

switch_dirty_log(iommu_pgsize)
{
	if (enable) {
		ops->split_block(iommu_pgsize);
		/* And other actions, such as enabling the hardware capability */
	} else {
		for_each_contiguous_physical_range(iommu_pgsize)
			ops->merge_page();
	}
}

Besides, the vendor iommu driver can invoke split() in clear_dirty_log instead of in switch_dirty_log.
The benefit is that we usually clear the dirty log gradually during dirty tracking, so we can also split
large page mappings gradually, which speeds up start_dirty_log and has less side effect on DMA performance.
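For illustration, a rough sketch of that alternative in the same pseudocode style (the helper names here are made up):

vendor iommu driver:

clear_dirty_log(iova, size, bitmap)
{
	/* Split only the chunk whose dirty log is being cleared... */
	split_block(iova, size);
	/* ...then read out and clear the hardware dirty bits for that chunk. */
	clear_hw_dirty_bits(iova, size, bitmap);
}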

Does it look good to you?

Thanks,
Keqian


Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface

2021-04-19 Thread Keqian Zhu
Hi Baolu,

On 2021/4/19 21:33, Lu Baolu wrote:
> Hi Keqian,
> 
> On 2021/4/19 17:32, Keqian Zhu wrote:
>>>> +EXPORT_SYMBOL_GPL(iommu_split_block);
>>> Do you really have any consumers of this interface other than the dirty
>>> bit tracking? If not, I don't suggest to make this as a generic IOMMU
>>> interface.
>>>
>>> There is an implicit requirement for such interfaces. The
>>> iommu_map/unmap(iova, size) shouldn't be called at the same time.
>>> Currently there's no such sanity check in the iommu core. A poorly
>>> written driver could mess up the kernel by misusing this interface.
>> Yes, I don't think up a scenario except dirty tracking.
>>
>> Indeed, we'd better not make them as a generic interface.
>>
>> Do you have any suggestion that underlying iommu drivers can share these 
>> code but
>> not make it as a generic iommu interface?
>>
>> I have a not so good idea. Make the "split" interfaces as a static function, 
>> and
>> transfer the function pointer to start_dirty_log. But it looks weird and 
>> inflexible.
> 
> I understand splitting/merging super pages is an optimization, but not a
> functional requirement. So is it possible to let the vendor iommu driver
> decide whether splitting super pages when starting dirty bit tracking
> and the opposite operation during when stopping it? The requirement for
Right. If I understand you correctly, that is actually what this series does.
We implemented split/merge in the IOMMU core layer, but don't force the vendor driver to use it.

The problem is that when we expose these interfaces to the vendor IOMMU driver, we will also
expose them to the upper-layer driver.

> upper layer is that starting/stopping dirty bit tracking and
> mapping/unmapping are mutually exclusive.
OK, I will explicitly add the hints. Thanks.
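For example, the hint could simply be a note in the kernel-doc of the switch interface; a sketch of the wording:

/**
 * iommu_switch_dirty_log() - start or stop dirty log tracking
 *
 * Note: starting/stopping dirty log tracking is mutually exclusive with
 * iommu_map()/iommu_unmap() on the same domain; callers must not run
 * them concurrently.
 */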

Thanks,
Keqian
> 
>>
>> On the other hand, if a driver calls map/unmap with split/merge at the same 
>> time,
>> it's a bug of driver, it should follow the rule.
>>
> 
> Best regards,
> baolu
> .
> 


Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface

2021-04-19 Thread Keqian Zhu
Hi Baolu,

On 2021/4/14 15:14, Lu Baolu wrote:
> On 4/13/21 4:54 PM, Keqian Zhu wrote:
>> Block(largepage) mapping is not a proper granule for dirty log tracking.
>> Take an extreme example, if DMA writes one byte, under 1G mapping, the
>> dirty amount reported is 1G, but under 4K mapping, the dirty amount is
>> just 4K.
>>
>> This adds a new interface named iommu_split_block in IOMMU base layer.
>> A specific IOMMU driver can invoke it during start dirty log. If so, the
>> driver also need to realize the split_block iommu ops.
>>
>> We flush all iotlbs after the whole procedure is completed to ease the
>> pressure of IOMMU, as we will hanle a huge range of mapping in general.
>>
>> Signed-off-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/iommu.c | 41 +
>>   include/linux/iommu.h | 11 +++
>>   2 files changed, 52 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 667b2d6d2fc0..bb413a927870 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2721,6 +2721,47 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
>>   +int iommu_split_block(struct iommu_domain *domain, unsigned long iova,
>> +  size_t size)
>> +{
>> +const struct iommu_ops *ops = domain->ops;
>> +unsigned int min_pagesz;
>> +size_t pgsize;
>> +bool flush = false;
>> +int ret = 0;
>> +
>> +if (unlikely(!ops || !ops->split_block))
>> +return -ENODEV;
>> +
>> +min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
>> +if (!IS_ALIGNED(iova | size, min_pagesz)) {
>> +pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
>> +   iova, size, min_pagesz);
>> +return -EINVAL;
>> +}
>> +
>> +while (size) {
>> +flush = true;
>> +
>> +pgsize = iommu_pgsize(domain, iova, size);
>> +
>> +ret = ops->split_block(domain, iova, pgsize);
>> +if (ret)
>> +break;
>> +
>> +pr_debug("split handled: iova 0x%lx size 0x%zx\n", iova, pgsize);
>> +
>> +iova += pgsize;
>> +size -= pgsize;
>> +}
>> +
>> +if (flush)
>> +iommu_flush_iotlb_all(domain);
>> +
>> +return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_split_block);
> 
> Do you really have any consumers of this interface other than the dirty
> bit tracking? If not, I don't suggest to make this as a generic IOMMU
> interface.
> 
> There is an implicit requirement for such interfaces. The
> iommu_map/unmap(iova, size) shouldn't be called at the same time.
> Currently there's no such sanity check in the iommu core. A poorly
> written driver could mess up the kernel by misusing this interface.

Yes, I can't think of a scenario other than dirty tracking.

Indeed, we'd better not make them a generic interface.

Do you have any suggestion for how underlying iommu drivers can share this code
without making it a generic iommu interface?

I have a not-so-good idea: make the "split" interfaces static functions and pass the
function pointer to start_dirty_log. But that looks weird and inflexible.

On the other hand, if a driver calls map/unmap and split/merge at the same time,
that's a bug in the driver; it should follow the rule.

> 
> This also applies to iommu_merge_page().
>

Thanks,
Keqian


Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-16 Thread Keqian Zhu
Hi Marc,

On 2021/4/16 22:44, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 15:08:09 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/15 22:03, Keqian Zhu wrote:
>>> The MMIO region of a device maybe huge (GB level), try to use
>>> block mapping in stage2 to speedup both map and unmap.
>>>
>>> Compared to normal memory mapping, we should consider two more
>>> points when try block mapping for MMIO region:
>>>
>>> 1. For normal memory mapping, the PA(host physical address) and
>>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>>> the HVA to request hugepage, so we don't need to consider PA
>>> alignment when verifing block mapping. But for device memory
>>> mapping, the PA and HVA may have different alignment.
>>>
>>> 2. For normal memory mapping, we are sure hugepage size properly
>>> fit into vma, so we don't check whether the mapping size exceeds
>>> the boundary of vma. But for device memory mapping, we should pay
>>> attention to this.
>>>
>>> This adds get_vma_page_shift() to get page shift for both normal
>>> memory and device MMIO region, and check these two points when
>>> selecting block mapping size for MMIO region.
>>>
>>> Signed-off-by: Keqian Zhu 
>>> ---
>>>  arch/arm64/kvm/mmu.c | 61 
>>>  1 file changed, 51 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index c59af5ca01b0..5a1cc7751e6d 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>>> *memslot,
>>> return PAGE_SIZE;
>>>  }
>>>  
[...]

>>> +   /*
>>> +* logging_active is guaranteed to never be true for VM_PFNMAP
>>> +* memslots.
>>> +*/
>>> +   if (logging_active) {
>>> force_pte = true;
>>> vma_shift = PAGE_SHIFT;
>>> +   } else {
>>> +   vma_shift = get_vma_page_shift(vma, hva);
>>> }
>> I use a if/else manner in v4, please check that. Thanks very much!
> 
> That's fine. However, it is getting a bit late for 5.13, and we don't
> have much time to left it simmer in -next. I'll probably wait until
> after the merge window to pick it up.
OK, no problem. Thanks! :)

BRs,
Keqian


Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework

2021-04-16 Thread Keqian Zhu
Hi Baolu,

On 2021/4/15 18:21, Lu Baolu wrote:
> Hi,
> 
> On 2021/4/15 15:43, Keqian Zhu wrote:
>>>> design it as not switchable. I will modify the commit message of patch#12, 
>>>> thanks!
>>> I am not sure that I fully get your point. But I can't see any gaps of
>>> using iommu_dev_enable/disable_feature() to switch dirty log on and off.
>>> Probably I missed anything.
>> IOMMU_DEV_FEAT_HWDBM just tells user whether underlying IOMMU driver supports
>> dirty tracking, it is not used to management the status of dirty log 
>> tracking.
>>
>> The feature reporting is per device, but the status management is per 
>> iommu_domain.
>> Only when all devices in a domain support HWDBM, we can start dirty log for 
>> the domain.
> 
> So why not
> 
> for_each_device_attached_domain()
> iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM)
Looks reasonable, but the problem is that we just need to enable dirty log once per domain.

> 
> ?
>>
>> And I think we'd better not mix the feature reporting and status management. 
>> Thoughts?
>>
> 
> I am worrying about having two sets of APIs for single purpose. From
> vendor iommu driver's point of view, this feature is per device. Hence,
> it still needs to do the same thing.
Yes, we can unify the granule of feature reporting and status management.

The basic granule of dirty tracking is the iommu_domain, which I think is very reasonable.
We need an interface to report the feature per iommu_domain; then the logic is much clearer.

Every time we add a device to or remove a device from the domain, we should update
the feature (e.g., maintain a counter of devices that do not support it).
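A minimal sketch of that idea (the field and helper names below are hypothetical, not existing API):

/* In struct iommu_domain: number of attached devices lacking HWDBM. */
unsigned int hwdbm_incapable_devs;

/* Called from the attach/detach paths (hypothetical helper). */
static void iommu_domain_update_hwdbm(struct iommu_domain *domain,
				      struct device *dev, bool attach)
{
	if (!iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_HWDBM))
		domain->hwdbm_incapable_devs += attach ? 1 : -1;
}

/* Dirty log tracking can start only if every attached device supports HWDBM. */
static bool iommu_domain_support_hwdbm(struct iommu_domain *domain)
{
	return domain->hwdbm_incapable_devs == 0;
}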

What do you think about this idea?

Thanks,
Keqian


Re: [PATCH 3/3] vfio/iommu_type1: Add support for manual dirty log clear

2021-04-16 Thread Keqian Zhu
Hi Alex,

On 2021/4/16 4:43, Alex Williamson wrote:
> On Tue, 13 Apr 2021 17:14:45 +0800
> Keqian Zhu  wrote:
> 
>> From: Kunkun Jiang 
>>
>> In the past, we clear dirty log immediately after sync dirty
>> log to userspace. This may cause redundant dirty handling if
>> userspace handles dirty log iteratively:
>>
>> After vfio clears dirty log, new dirty log starts to generate.
>> These new dirty log will be reported to userspace even if they
>> are generated before userspace handles the same dirty page.
>>
>> That's to say, we should minimize the time gap of dirty log
>> clearing and dirty log handling. We can give userspace the
>> interface to clear dirty log.
> 
> IIUC, a user would be expected to clear the bitmap before copying the
> dirty pages, therefore you're trying to reduce that time gap between
> clearing any copy, but it cannot be fully eliminated and importantly,
> if the user clears after copying, they've introduced a race.  Correct?
Yep, that's totally correct. If the user clears after copying, they may lose dirty info.
I'll enhance the doc.
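Roughly, the intended per-iteration ordering in userspace would be (pseudocode, names abbreviated):

/* 1. Sync the dirty log from the IOMMU into a bitmap. */
get_dirty_bitmap(iova, size, bitmap);
/* 2. Clear the dirty log *before* copying, so writes that land while we copy
 *    are reported again in the next iteration instead of being lost. */
clear_dirty_bitmap(iova, size, bitmap);
/* 3. Copy the pages marked dirty in the bitmap. */
copy_dirty_pages(bitmap);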

> 
> What results do you have to show that this is a worthwhile optimization?
This optimization is inspired by KVM[1]. The results differ with the guest workload.
In theory, the higher the dirty rate, the better the result. But sorry, I tested it
on our FPGA, where the dirty rate is heavily limited, so the result is not obvious.

> 
> I really don't like the semantics that testing for an IOMMU capability
> enables it.  It needs to be explicitly controllable feature, which
> suggests to me that it might be a flag used in combination with _GET or
> a separate _GET_NOCLEAR operations.  Thanks,
Yes, good suggestion. We should give userspace a choice.

Thanks,
Keqian

[1] 
https://lore.kernel.org/kvm/1543251253-24762-1-git-send-email-pbonz...@redhat.com/

> 
> Alex
> 
> 
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 100 ++--
>>  include/uapi/linux/vfio.h   |  28 -
>>  2 files changed, 123 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 77950e47f56f..d9c4a27b3c4e 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -78,6 +78,7 @@ struct vfio_iommu {
>>  boolv2;
>>  boolnesting;
>>  booldirty_page_tracking;
>> +booldirty_log_manual_clear;
>>  boolpinned_page_dirty_scope;
>>  boolcontainer_open;
>>  };
>> @@ -1242,6 +1243,78 @@ static int vfio_iommu_dirty_log_sync(struct 
>> vfio_iommu *iommu,
>>  return ret;
>>  }
>>  
>> +static int vfio_iova_dirty_log_clear(u64 __user *bitmap,
>> + struct vfio_iommu *iommu,
>> + dma_addr_t iova, size_t size,
>> + size_t pgsize)
>> +{
>> +struct vfio_dma *dma;
>> +struct rb_node *n;
>> +dma_addr_t start_iova, end_iova, riova;
>> +unsigned long pgshift = __ffs(pgsize);
>> +unsigned long bitmap_size;
>> +unsigned long *bitmap_buffer = NULL;
>> +bool clear_valid;
>> +int rs, re, start, end, dma_offset;
>> +int ret = 0;
>> +
>> +bitmap_size = DIRTY_BITMAP_BYTES(size >> pgshift);
>> +bitmap_buffer = kvmalloc(bitmap_size, GFP_KERNEL);
>> +if (!bitmap_buffer) {
>> +ret = -ENOMEM;
>> +goto out;
>> +}
>> +
>> +if (copy_from_user(bitmap_buffer, bitmap, bitmap_size)) {
>> +ret = -EFAULT;
>> +goto out;
>> +}
>> +
>> +for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
>> +dma = rb_entry(n, struct vfio_dma, node);
>> +if (!dma->iommu_mapped)
>> +continue;
>> +if ((dma->iova + dma->size - 1) < iova)
>> +continue;
>> +if (dma->iova > iova + size - 1)
>> +break;
>> +
>> +start_iova = max(iova, dma->iova);
>> +end_iova = min(iova + size, dma->iova + dma->size);
>> +
>> +/* Similar logic as the tail of vfio_iova_dirty_bitmap */
>> +
>> +clear_valid = false;
>> +start = (start_iova - iova) >> pgshift;
>> +end = (

[RFC PATCH v2 1/2] KVM: x86: Support write protect gfn with min_level

2021-04-16 Thread Keqian Zhu
Under some circumstances, we just need to write protect a large-page
gfn. This prepares for write protecting large pages lazily during
dirty log tracking.

No functional or performance change expected.

Signed-off-by: Keqian Zhu 
---
 arch/x86/kvm/mmu/mmu.c  |  9 +
 arch/x86/kvm/mmu/mmu_internal.h |  3 ++-
 arch/x86/kvm/mmu/page_track.c   |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c  | 16 
 arch/x86/kvm/mmu/tdp_mmu.h  |  3 ++-
 5 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 486aa94ecf1d..2ce5bc2ea46d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1265,20 +1265,21 @@ int kvm_cpu_dirty_log_size(void)
 }
 
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-   struct kvm_memory_slot *slot, u64 gfn)
+   struct kvm_memory_slot *slot, u64 gfn,
+   int min_level)
 {
struct kvm_rmap_head *rmap_head;
int i;
bool write_protected = false;
 
-   for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
+   for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
write_protected |= __rmap_write_protect(kvm, rmap_head, true);
}
 
if (is_tdp_mmu_enabled(kvm))
write_protected |=
-   kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn);
+   kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn, 
min_level);
 
return write_protected;
 }
@@ -1288,7 +1289,7 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 
gfn)
struct kvm_memory_slot *slot;
 
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-   return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
+   return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn, 
PG_LEVEL_4K);
 }
 
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1f6f98c76bdf..4c7c42bb8cf8 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -104,7 +104,8 @@ bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t 
gfn,
 void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-   struct kvm_memory_slot *slot, u64 gfn);
+   struct kvm_memory_slot *slot, u64 gfn,
+   int min_level);
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
u64 start_gfn, u64 pages);
 
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 34bb0ec69bd8..91a9f7e0fd91 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -100,7 +100,7 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
kvm_mmu_gfn_disallow_lpage(slot, gfn);
 
if (mode == KVM_PAGE_TRACK_WRITE)
-   if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
+   if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
kvm_flush_remote_tlbs(kvm);
 }
 EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 018d82e73e31..6cf0284e2e6a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1338,15 +1338,22 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
  * Returns true if an SPTE was set and a TLB flush is needed.
  */
 static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t gfn)
+ gfn_t gfn, int min_level)
 {
struct tdp_iter iter;
u64 new_spte;
bool spte_set = false;
 
+   BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
+
rcu_read_lock();
 
-   tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1) {
+   for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
+  min_level, gfn, gfn + 1) {
+   if (!is_shadow_present_pte(iter.old_spte) ||
+   !is_last_spte(iter.old_spte, iter.level))
+   continue;
+
if (!is_writable_pte(iter.old_spte))
break;
 
@@ -1368,7 +1375,8 @@ static bool write_protect_gfn(struct kvm *kvm, struct 
kvm_mmu_page *root,
  * Returns true if an SPTE was set and a TLB flush is needed.
  */
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-  struct kvm_memory_slot *slot, gfn_t gfn)
+  struct kvm_memory_slot *slot, gfn_t gfn,
+  int min_lev

[RFC PATCH v2 0/2] KVM: x86: Enable dirty logging lazily for huge pages

2021-04-16 Thread Keqian Zhu
Currently, when starting dirty logging with init-all-set, we write
protect huge pages and leave normal pages untouched, so that we can
enable dirty logging for these pages lazily.

Enabling dirty logging lazily for huge pages is feasible too, which
not only reduces the time to start dirty logging, but also greatly
reduces the side effect on the guest when there is a high dirty rate.

Thanks,
Keqian

Keqian Zhu (2):
  KVM: x86: Support write protect gfn with min_level
  KVM: x86: Not wr-protect huge page with init_all_set dirty log

 arch/x86/kvm/mmu/mmu.c  | 57 -
 arch/x86/kvm/mmu/mmu_internal.h |  3 +-
 arch/x86/kvm/mmu/page_track.c   |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c  | 16 ++---
 arch/x86/kvm/mmu/tdp_mmu.h  |  3 +-
 arch/x86/kvm/x86.c  | 37 ++---
 6 files changed, 76 insertions(+), 42 deletions(-)

-- 
2.23.0



[RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log

2021-04-16 Thread Keqian Zhu
Currently, when starting dirty logging with init-all-set, we write
protect huge pages and leave normal pages untouched, so that we can
enable dirty logging for these pages lazily.

Enabling dirty logging lazily for huge pages is feasible too, which
not only reduces the time to start dirty logging, but also greatly
reduces the side effect on the guest when there is a high dirty rate.

Signed-off-by: Keqian Zhu 
---
 arch/x86/kvm/mmu/mmu.c | 48 ++
 arch/x86/kvm/x86.c | 37 +---
 2 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2ce5bc2ea46d..98fa25172b9a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct 
kvm_rmap_head *rmap_head,
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
  * @mask: indicates which pages we should protect
  *
- * Used when we do not need to care about huge page mappings: e.g. during dirty
- * logging we do not have any such mappings.
+ * Used when we do not need to care about huge page mappings.
  */
 static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 struct kvm_memory_slot *slot,
@@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm 
*kvm,
  * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
  * enable dirty logging for them.
  *
- * Used when we do not need to care about huge page mappings: e.g. during dirty
- * logging we do not have any such mappings.
+ * We need to care about huge page mappings: e.g. during dirty logging we may
+ * have any such mappings.
  */
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
 {
+   gfn_t start, end;
+
+   /*
+* Huge pages are NOT write protected when we start dirty log with
+* init-all-set, so we must write protect them at here.
+*
+* The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
+* of memslot has no such restriction, so the range can cross two large
+* pages.
+*/
+   if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
+   start = slot->base_gfn + gfn_offset + __ffs(mask);
+   end = slot->base_gfn + gfn_offset + __fls(mask);
+   kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
+
+   /* Cross two large pages? */
+   if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
+   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
+   kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
+  PG_LEVEL_2M);
+   }
+
+   /*
+* RFC:
+*
+* 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
+* true, because I am not very clear about the relationship between
+* legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
+* manner.
+*
+* The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
+* writable large page mapping in legacy mmu mapping or tdp mmu mapping.
+* Do we still have normal mapping in that case? (e.g. We have large
+* mapping in legacy mmu and normal mapping in tdp mmu).
+*
+* 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
+* page mapping exist. If it exists but is clean, we can return early.
+* However, we have to do invasive change.
+*/
+
+   /* Then we can handle the PT level pages */
if (kvm_x86_ops.cpu_dirty_log_size)
kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
else
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eca63625aee4..dfd676ffa7da 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10888,36 +10888,19 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 */
kvm_mmu_zap_collapsible_sptes(kvm, new);
} else {
-   /* By default, write-protect everything to log writes. */
-   int level = PG_LEVEL_4K;
+   /*
+* If we're with initial-all-set, we don't need to write protect
+* any page because they're reported as dirty already.
+*/
+   if (kvm_dirty_log_manual_protect_and_init_set(kvm))
+   return;
 
if (kvm_x86_ops.cpu_dirty_log_size) {
-   /*
-* Clear all dirty bits, unless pages are treated as
-* dirty from the get-go.
-*/
-   if (!kvm_dirty_log_manual_protec

Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 22:03, Keqian Zhu wrote:
> The MMIO region of a device maybe huge (GB level), try to use
> block mapping in stage2 to speedup both map and unmap.
> 
> Compared to normal memory mapping, we should consider two more
> points when try block mapping for MMIO region:
> 
> 1. For normal memory mapping, the PA(host physical address) and
> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request hugepage, so we don't need to consider PA
> alignment when verifing block mapping. But for device memory
> mapping, the PA and HVA may have different alignment.
> 
> 2. For normal memory mapping, we are sure hugepage size properly
> fit into vma, so we don't check whether the mapping size exceeds
> the boundary of vma. But for device memory mapping, we should pay
> attention to this.
> 
> This adds get_vma_page_shift() to get page shift for both normal
> memory and device MMIO region, and check these two points when
> selecting block mapping size for MMIO region.
> 
> Signed-off-by: Keqian Zhu 
> ---
>  arch/arm64/kvm/mmu.c | 61 
>  1 file changed, 51 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..5a1cc7751e6d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
> *memslot,
>   return PAGE_SIZE;
>  }
>  
> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
> +{
> + unsigned long pa;
> +
> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
> + return huge_page_shift(hstate_vma(vma));
> +
> + if (!(vma->vm_flags & VM_PFNMAP))
> + return PAGE_SHIFT;
> +
> + VM_BUG_ON(is_vm_hugetlb_page(vma));
> +
> + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> + return PUD_SHIFT;
> +#endif
> +
> + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> + return PMD_SHIFT;
> +
> + return PAGE_SHIFT;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - /* Let's check if we will get back a huge page backed by hugetlbfs */
> + /*
> +  * Let's check if we will get back a huge page backed by hugetlbfs, or
> +  * get block mapping for device MMIO region.
> +  */
>   mmap_read_lock(current->mm);
>   vma = find_vma_intersection(current->mm, hva, hva + 1);
>   if (unlikely(!vma)) {
> @@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - if (is_vm_hugetlb_page(vma))
> - vma_shift = huge_page_shift(hstate_vma(vma));
> - else
> - vma_shift = PAGE_SHIFT;
> -
> - if (logging_active ||
> - (vma->vm_flags & VM_PFNMAP)) {
> + /*
> +  * logging_active is guaranteed to never be true for VM_PFNMAP
> +  * memslots.
> +  */
> + if (logging_active) {
>   force_pte = true;
>   vma_shift = PAGE_SHIFT;
> + } else {
> + vma_shift = get_vma_page_shift(vma, hva);
>   }
I use an if/else manner in v4, please check that. Thanks very much!


BRs,
Keqian

>  
>   switch (vma_shift) {
> @@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>  
>   if (kvm_is_device_pfn(pfn)) {
> + /*
> +  * If the page was identified as device early by looking at
> +  * the VMA flags, vma_pagesize is already representing the
> +  * largest quantity we can map.  If instead it was mapped
> +  * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
> +  * and must not be upgraded.
> +  *
> +  * In both cases, we don't let transpare

[PATCH v4 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-04-15 Thread Keqian Zhu
The MMIO regions may be unmapped for many reasons and can be remapped
by the stage2 fault path. Mapping MMIO regions at creation time is only a
minor optimization and makes these two mapping paths hard to keep in sync.

Remove the mapping code while keeping the useful sanity check.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..c59af5ca01b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
-   bool writable = !(mem->flags & KVM_MEM_READONLY);
int ret = 0;
 
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
-* between them, so iterate over all of them to find out if we can map
-* any of them right now.
+* between them, so iterate over all of them.
 *
 * ++
 * +---++   ++
@@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 */
do {
struct vm_area_struct *vma = find_vma(current->mm, hva);
-   hva_t vm_start, vm_end;
 
if (!vma || vma->vm_start >= reg_end)
break;
 
-   /*
-* Take the intersection of this VMA with the memory region
-*/
-   vm_start = max(hva, vma->vm_start);
-   vm_end = min(reg_end, vma->vm_end);
-
if (vma->vm_flags & VM_PFNMAP) {
-   gpa_t gpa = mem->guest_phys_addr +
-   (vm_start - mem->userspace_addr);
-   phys_addr_t pa;
-
-   pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-   pa += vm_start - vma->vm_start;
-
/* IO region dirty page logging not allowed */
if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
-   goto out;
-   }
-
-   ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-   vm_end - vm_start,
-   writable);
-   if (ret)
break;
+   }
}
-   hva = vm_end;
+   hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
 
-   if (change == KVM_MR_FLAGS_ONLY)
-   goto out;
-
-   spin_lock(&kvm->mmu_lock);
-   if (ret)
-   unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
-   else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-   stage2_flush_memslot(kvm, memslot);
-   spin_unlock(&kvm->mmu_lock);
-out:
mmap_read_unlock(current->mm);
return ret;
 }
-- 
2.19.1



[PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level), so try to use
block mapping in stage2 to speed up both map and unmap.

Compared to normal memory mapping, we should consider two more
points when trying block mapping for an MMIO region:

1. For normal memory mapping, the PA (host physical address) and
HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
the HVA to request a hugepage, so we don't need to consider PA
alignment when verifying block mapping. But for device memory
mapping, the PA and HVA may have different alignment.

2. For normal memory mapping, we are sure the hugepage size properly
fits into the vma, so we don't check whether the mapping size exceeds
the boundary of the vma. But for device memory mapping, we should pay
attention to this.

This adds get_vma_page_shift() to get the page shift for both normal
memory and device MMIO regions, and checks these two points when
selecting the block mapping size for an MMIO region.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 61 
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c59af5ca01b0..5a1cc7751e6d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
+{
+   unsigned long pa;
+
+   if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
+   return huge_page_shift(hstate_vma(vma));
+
+   if (!(vma->vm_flags & VM_PFNMAP))
+   return PAGE_SHIFT;
+
+   VM_BUG_ON(is_vm_hugetlb_page(vma));
+
+   pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
+
+#ifndef __PAGETABLE_PMD_FOLDED
+   if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PUD_SIZE) <= vma->vm_end)
+   return PUD_SHIFT;
+#endif
+
+   if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PMD_SIZE) <= vma->vm_end)
+   return PMD_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   /* Let's check if we will get back a huge page backed by hugetlbfs */
+   /*
+* Let's check if we will get back a huge page backed by hugetlbfs, or
+* get block mapping for device MMIO region.
+*/
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
@@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   if (is_vm_hugetlb_page(vma))
-   vma_shift = huge_page_shift(hstate_vma(vma));
-   else
-   vma_shift = PAGE_SHIFT;
-
-   if (logging_active ||
-   (vma->vm_flags & VM_PFNMAP)) {
+   /*
+* logging_active is guaranteed to never be true for VM_PFNMAP
+* memslots.
+*/
+   if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
+   } else {
+   vma_shift = get_vma_page_shift(vma, hva);
}
 
switch (vma_shift) {
@@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
 
if (kvm_is_device_pfn(pfn)) {
+   /*
+* If the page was identified as device early by looking at
+* the VMA flags, vma_pagesize is already representing the
+* largest quantity we can map.  If instead it was mapped
+* via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
+* and must not be upgraded.
+*
+* In both cases, we don't let transparent_hugepage_adjust()
+* change things at the last minute.
+*/
device = true;
-   force_pte = true;
} else if (logging_active && !write_fault) {
/*
 * Only actually map the page as writable if this was a write
@@ -876,7 +917,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 * If we are not forced to use page mapping, check if we are
 * backed by a THP and thus use block mapping if possible.
 */
-   if (vma_pag

[PATCH v4 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi,

We have two paths to build stage2 mappings for MMIO regions:

the creation-time path and the stage2 fault path.

Patch#1 removes the creation time's mapping of MMIO regions
Patch#2 tries stage2 block mapping for host device MMIO at fault path

Changelog:

v4:
 - use get_vma_page_shift() handle all cases. (Marc)
 - get rid of force_pte for device mapping. (Marc)

v3:
 - Do not need to check memslot boundary in device_rough_page_shift(). (Marc)

Thanks,
Keqian


Keqian Zhu (2):
  kvm/arm64: Remove the creation time's mapping of MMIO regions
  kvm/arm64: Try stage2 block mapping for host device MMIO

 arch/arm64/kvm/mmu.c | 99 
 1 file changed, 54 insertions(+), 45 deletions(-)

-- 
2.19.1



Re: [PATCH 1/5] KVM: arm64: Divorce the perf code from oprofile helpers

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 18:42, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 07:59:26 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/14 21:44, Marc Zyngier wrote:
>>> KVM/arm64 is the sole user of perf_num_counters(), and really
>>> could do without it. Stop using the obsolete API by relying on
>>> the existing probing code.
>>>
>>> Signed-off-by: Marc Zyngier 
>>> ---
>>>  arch/arm64/kvm/perf.c | 7 +--
>>>  arch/arm64/kvm/pmu-emul.c | 2 +-
>>>  include/kvm/arm_pmu.h | 4 
>>>  3 files changed, 6 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
>>> index 739164324afe..b8b398670ef2 100644
>>> --- a/arch/arm64/kvm/perf.c
>>> +++ b/arch/arm64/kvm/perf.c
>>> @@ -50,12 +50,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
>>>  
>>>  int kvm_perf_init(void)
>>>  {
>>> -   /*
>>> -* Check if HW_PERF_EVENTS are supported by checking the number of
>>> -* hardware performance counters. This could ensure the presence of
>>> -* a physical PMU and CONFIG_PERF_EVENT is selected.
>>> -*/
>>> -   if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0)
>>> +   if (kvm_pmu_probe_pmuver() != 0xf)
>> The probe() function may be called many times
>> (kvm_arm_pmu_v3_set_attr also calls it).  I don't know whether the
>> first calling is enough. If so, can we use a static variable in it,
>> so the following calling can return the result right away?
> 
> No, because that wouldn't help with crappy big-little implementations
> that could have PMUs with different versions. We want to find the
> version at the point where the virtual PMU is created, which is why we
> call the probe function once per vcpu.
I see.

But AFAICS the pmuver is placed in kvm->arch, and the probe function is called
once per VM. Maybe I'm missing something.

> 
> This of course is broken in other ways (BL+KVM is a total disaster
> when it comes to PMU), but making this static would just make it
> worse.
OK.

Thanks,
Keqian


Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/15 18:23, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 03:20:52 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/14 17:05, Marc Zyngier wrote:
>>> + Santosh, who found some interesting bugs in that area before.
>>>
>>> On Wed, 14 Apr 2021 07:51:09 +0100,
>>> Keqian Zhu  wrote:
>>>>
>>>> The MMIO region of a device maybe huge (GB level), try to use
>>>> block mapping in stage2 to speedup both map and unmap.
>>>>
>>>> Compared to normal memory mapping, we should consider two more
>>>> points when try block mapping for MMIO region:
>>>>
>>>> 1. For normal memory mapping, the PA(host physical address) and
>>>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>>>> the HVA to request hugepage, so we don't need to consider PA
>>>> alignment when verifing block mapping. But for device memory
>>>> mapping, the PA and HVA may have different alignment.
>>>>
>>>> 2. For normal memory mapping, we are sure hugepage size properly
>>>> fit into vma, so we don't check whether the mapping size exceeds
>>>> the boundary of vma. But for device memory mapping, we should pay
>>>> attention to this.
>>>>
>>>> This adds device_rough_page_shift() to check these two points when
>>>> selecting block mapping size.
>>>>
>>>> Signed-off-by: Keqian Zhu 
>>>> ---
>>>>  arch/arm64/kvm/mmu.c | 37 +
>>>>  1 file changed, 33 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index c59af5ca01b0..1a6d96169d60 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
>>>> address, short lsb)
>>>>send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>>>  }
>>>>  
>>>> +/*
>>>> + * Find a max mapping size that properly insides the vma. And hva and pa 
>>>> must
>>>> + * have the same alignment to this mapping size. It's rough as there are 
>>>> still
>>>> + * other restrictions, will be checked by 
>>>> fault_supports_stage2_huge_mapping().
>>>> + */
>>>> +static short device_rough_page_shift(struct vm_area_struct *vma,
>>>> +   unsigned long hva)
>>>
>>> My earlier question still stands. Under which circumstances would this
>>> function return something that is *not* the final mapping size? I
>>> really don't see a reason why this would not return the final mapping
>>> size.
>>
>> IIUC, all the restrictions are about alignment and area boundary.
>>
>> That's to say, HVA, IPA and PA must have same alignment within the
>> mapping size.  And the areas are memslot and vma, which means the
>> mapping size must properly fit into the memslot and vma.
>>
>> In this function, we just checked the alignment of HVA and PA, and
>> the boundary of vma.  So we still need to check the alignment of HVA
>> and IPA, and the boundary of memslot.  These will be checked by
>> fault_supports_stage2_huge_mapping().
> 
> But that's no different from what we do with normal memory, is it? So
> it really feels like we should have *one* function that deals with
> establishing the basic mapping size from the VMA (see below for what I
> have in mind).
Right. And it looks better.

> 
>>
>>>
>>>> +{
>>>> +  phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>>>> +
>>>> +#ifndef __PAGETABLE_PMD_FOLDED
>>>> +  if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>>>> +  ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
>>>> +  ALIGN(hva, PUD_SIZE) <= vma->vm_end)
>>>> +  return PUD_SHIFT;
>>>> +#endif
>>>> +
>>>> +  if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>>>> +  ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
>>>> +  ALIGN(hva, PMD_SIZE) <= vma->vm_end)
>>>> +  return PMD_SHIFT;
>>>> +
>>>> +  return PAGE_SHIFT;
>>>> +}
>>>> +
>>>>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_

Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework

2021-04-15 Thread Keqian Zhu



On 2021/4/15 15:03, Lu Baolu wrote:
> On 4/15/21 2:18 PM, Keqian Zhu wrote:
>> Hi Baolu,
>>
>> Thanks for the review!
>>
>> On 2021/4/14 15:00, Lu Baolu wrote:
>>> Hi Keqian,
>>>
>>> On 4/13/21 4:54 PM, Keqian Zhu wrote:
>>>> Some types of IOMMU are capable of tracking DMA dirty log, such as
>>>> ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the
>>>> dirty log tracking framework in the IOMMU base layer.
>>>>
>>>> Three new essential interfaces are added, and we maintaince the status
>>>> of dirty log tracking in iommu_domain.
>>>> 1. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking
>>>> 2. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
>>>> 3. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap
>>>>
>>>> A new dev feature are added to indicate whether a specific type of
>>>> iommu hardware supports and its driver realizes them.
>>>>
>>>> Signed-off-by: Keqian Zhu 
>>>> Signed-off-by: Kunkun Jiang 
>>>> ---
>>>>drivers/iommu/iommu.c | 150 ++
>>>>include/linux/iommu.h |  53 +++
>>>>2 files changed, 203 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>> index d0b0a15dba84..667b2d6d2fc0 100644
>>>> --- a/drivers/iommu/iommu.c
>>>> +++ b/drivers/iommu/iommu.c
>>>> @@ -1922,6 +1922,7 @@ static struct iommu_domain 
>>>> *__iommu_domain_alloc(struct bus_type *bus,
>>>>domain->type = type;
>>>>/* Assume all sizes by default; the driver may override this later 
>>>> */
>>>>domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
>>>> +mutex_init(&domain->switch_log_lock);
>>>>  return domain;
>>>>}
>>>> @@ -2720,6 +2721,155 @@ int iommu_domain_set_attr(struct iommu_domain 
>>>> *domain,
>>>>}
>>>>EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
>>>>+int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,
>>>> +   unsigned long iova, size_t size, int prot)
>>>> +{
>>>> +const struct iommu_ops *ops = domain->ops;
>>>> +int ret;
>>>> +
>>>> +if (unlikely(!ops || !ops->switch_dirty_log))
>>>> +return -ENODEV;
>>>> +
>>>> +mutex_lock(&domain->switch_log_lock);
>>>> +if (enable && domain->dirty_log_tracking) {
>>>> +ret = -EBUSY;
>>>> +goto out;
>>>> +} else if (!enable && !domain->dirty_log_tracking) {
>>>> +ret = -EINVAL;
>>>> +goto out;
>>>> +}
>>>> +
>>>> +ret = ops->switch_dirty_log(domain, enable, iova, size, prot);
>>>> +if (ret)
>>>> +goto out;
>>>> +
>>>> +domain->dirty_log_tracking = enable;
>>>> +out:
>>>> +mutex_unlock(&domain->switch_log_lock);
>>>> +return ret;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(iommu_switch_dirty_log);
>>>
>>> Since you also added IOMMU_DEV_FEAT_HWDBM, I am wondering what's the
>>> difference between
>>>
>>> iommu_switch_dirty_log(on) vs. 
>>> iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM)
>>>
>>> iommu_switch_dirty_log(off) vs. 
>>> iommu_dev_disable_feature(IOMMU_DEV_FEAT_HWDBM)
>> Indeed. As I can see, IOMMU_DEV_FEAT_AUX is not switchable, so enable/disable
>> are not applicable for it. IOMMU_DEV_FEAT_SVA is switchable, so we can use 
>> these
>> interfaces for it.
>>
>> IOMMU_DEV_FEAT_HWDBM is used to indicate whether hardware supports HWDBM, so 
>> we should
> 
> Previously we had iommu_dev_has_feature() and then was cleaned up due to
> lack of real users. If you have a real case for it, why not bringing it
> back?
Yep, good suggestion.
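IIRC the removed helper looked roughly like this (quoting from memory, so the exact form may differ):

bool iommu_dev_has_feature(struct device *dev, enum iommu_dev_features feat)
{
	const struct iommu_ops *ops = dev->bus->iommu_ops;

	if (ops && ops->dev_has_feat)
		return ops->dev_has_feat(dev, feat);

	return false;
}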

> 
>> design it as not switchable. I will modify the commit message of patch#12, 
>> thanks!
> 
> I am not sure that I fully get your point. But I can't see any gaps of
> using iommu_dev_enable/disable_feature() to switch dirty log on and off.
> Probably I missed anything.
IOMMU_DEV_FEAT_HWDBM just tells the user whether the underlying IOMMU driver supports dirty tracking,

Re: [PATCH 1/5] KVM: arm64: Divorce the perf code from oprofile helpers

2021-04-15 Thread Keqian Zhu
Hi Marc,

On 2021/4/14 21:44, Marc Zyngier wrote:
> KVM/arm64 is the sole user of perf_num_counters(), and really
> could do without it. Stop using the obsolete API by relying on
> the existing probing code.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/perf.c | 7 +--
>  arch/arm64/kvm/pmu-emul.c | 2 +-
>  include/kvm/arm_pmu.h | 4 
>  3 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
> index 739164324afe..b8b398670ef2 100644
> --- a/arch/arm64/kvm/perf.c
> +++ b/arch/arm64/kvm/perf.c
> @@ -50,12 +50,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
>  
>  int kvm_perf_init(void)
>  {
> - /*
> -  * Check if HW_PERF_EVENTS are supported by checking the number of
> -  * hardware performance counters. This could ensure the presence of
> -  * a physical PMU and CONFIG_PERF_EVENT is selected.
> -  */
> - if (IS_ENABLED(CONFIG_ARM_PMU) && perf_num_counters() > 0)
> + if (kvm_pmu_probe_pmuver() != 0xf)
The probe() function may be called many times (kvm_arm_pmu_v3_set_attr also calls it).
I don't know whether the first call is enough. If so, can we use a static variable
in it, so that subsequent calls can return the result right away?

Thanks,
Keqian


Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework

2021-04-15 Thread Keqian Zhu
Hi Baolu,

Thanks for the review!

On 2021/4/14 15:00, Lu Baolu wrote:
> Hi Keqian,
> 
> On 4/13/21 4:54 PM, Keqian Zhu wrote:
>> Some types of IOMMU are capable of tracking DMA dirty log, such as
>> ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the
>> dirty log tracking framework in the IOMMU base layer.
>>
>> Three new essential interfaces are added, and we maintaince the status
>> of dirty log tracking in iommu_domain.
>> 1. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking
>> 2. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
>> 3. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap
>>
>> A new dev feature are added to indicate whether a specific type of
>> iommu hardware supports and its driver realizes them.
>>
>> Signed-off-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/iommu.c | 150 ++
>>   include/linux/iommu.h |  53 +++
>>   2 files changed, 203 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index d0b0a15dba84..667b2d6d2fc0 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1922,6 +1922,7 @@ static struct iommu_domain 
>> *__iommu_domain_alloc(struct bus_type *bus,
>>   domain->type = type;
>>   /* Assume all sizes by default; the driver may override this later */
>>   domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
>> +mutex_init(&domain->switch_log_lock);
>> return domain;
>>   }
>> @@ -2720,6 +2721,155 @@ int iommu_domain_set_attr(struct iommu_domain 
>> *domain,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
>>   +int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,
>> +   unsigned long iova, size_t size, int prot)
>> +{
>> +const struct iommu_ops *ops = domain->ops;
>> +int ret;
>> +
>> +if (unlikely(!ops || !ops->switch_dirty_log))
>> +return -ENODEV;
>> +
>> +mutex_lock(&domain->switch_log_lock);
>> +if (enable && domain->dirty_log_tracking) {
>> +ret = -EBUSY;
>> +goto out;
>> +} else if (!enable && !domain->dirty_log_tracking) {
>> +ret = -EINVAL;
>> +goto out;
>> +}
>> +
>> +ret = ops->switch_dirty_log(domain, enable, iova, size, prot);
>> +if (ret)
>> +goto out;
>> +
>> +domain->dirty_log_tracking = enable;
>> +out:
>> +mutex_unlock(&domain->switch_log_lock);
>> +return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_switch_dirty_log);
> 
> Since you also added IOMMU_DEV_FEAT_HWDBM, I am wondering what's the
> difference between
> 
> iommu_switch_dirty_log(on) vs. iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM)
> 
> iommu_switch_dirty_log(off) vs. 
> iommu_dev_disable_feature(IOMMU_DEV_FEAT_HWDBM)
Indeed. As far as I can see, IOMMU_DEV_FEAT_AUX is not switchable, so enable/disable
are not applicable to it. IOMMU_DEV_FEAT_SVA is switchable, so we can use these
interfaces for it.

IOMMU_DEV_FEAT_HWDBM is used to indicate whether the hardware supports HWDBM, so we should
design it as not switchable. I will modify the commit message of patch #12, thanks!

> 
>> +
>> +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
>> + size_t size, unsigned long *bitmap,
>> + unsigned long base_iova, unsigned long bitmap_pgshift)
>> +{
>> +const struct iommu_ops *ops = domain->ops;
>> +unsigned int min_pagesz;
>> +size_t pgsize;
>> +int ret = 0;
>> +
>> +if (unlikely(!ops || !ops->sync_dirty_log))
>> +return -ENODEV;
>> +
>> +min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
>> +if (!IS_ALIGNED(iova | size, min_pagesz)) {
>> +pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
>> +   iova, size, min_pagesz);
>> +return -EINVAL;
>> +}
>> +
>> +mutex_lock(&domain->switch_log_lock);
>> +if (!domain->dirty_log_tracking) {
>> +ret = -EINVAL;
>> +goto out;
>> +}
>> +
>> +while (size) {
>> +pgsize = iommu_pgsize(domain, iova, size);
>> +
>> +ret = ops->sync_dirty_log(domain, iova, pgsize,
>> +  bitmap, base_iova, bitmap_pgshift);
> 
> Any reason why do you want to do this in a per-4K page manner? This can
> lead to a lot of indirect calls and bad performance.
> 
> How about a sync_dirty_pages()?
The function name of iommu_pgsize() is a bit puzzling. Actually it tries to
compute the max page size that fits into the remaining size, so the pgsize can be
a large page size even if the underlying mapping is 4K. __iommu_unmap() also has
similar logic.
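
For readers puzzled by the name, a simplified model of the rule is below. This is
only an illustration of "largest supported page size that is aligned with the iova
and fits into the remaining size", not a copy of the in-tree helper:

/*
 * Simplified model of the max-fit rule discussed above (illustration only).
 */
static size_t max_fitting_pgsize(unsigned long pgsize_bitmap,
				 unsigned long iova, size_t size)
{
	unsigned int pgsize_idx;
	size_t pgsize;

	/* Largest power of two that still fits into 'size'. */
	pgsize_idx = __fls(size);

	/* Respect the alignment of the current iova. */
	if (iova) {
		unsigned int align_idx = __ffs(iova);

		pgsize_idx = min(pgsize_idx, align_idx);
	}

	/* Mask of all candidate sizes up to pgsize_idx... */
	pgsize = (1UL << (pgsize_idx + 1)) - 1;

	/* ...restricted to what the hardware supports; pick the biggest. */
	pgsize &= pgsize_bitmap;

	return pgsize ? 1UL << __fls(pgsize) : 0;
}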

BRs,
Keqian


Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-14 Thread Keqian Zhu
Hi Marc,

On 2021/4/14 17:05, Marc Zyngier wrote:
> + Santosh, who found some interesting bugs in that area before.
> 
> On Wed, 14 Apr 2021 07:51:09 +0100,
> Keqian Zhu  wrote:
>>
>> The MMIO region of a device may be huge (GB level); try to use
>> block mapping in stage2 to speed up both map and unmap.
>>
>> Compared to normal memory mapping, we should consider two more
>> points when trying block mapping for an MMIO region:
>>
>> 1. For normal memory mapping, the PA (host physical address) and
>> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
>> the HVA to request a hugepage, so we don't need to consider PA
>> alignment when verifying block mapping. But for device memory
>> mapping, the PA and HVA may have different alignment.
>>
>> 2. For normal memory mapping, we are sure the hugepage size properly
>> fits into the vma, so we don't check whether the mapping size exceeds
>> the boundary of the vma. But for device memory mapping, we should pay
>> attention to this.
>>
>> This adds device_rough_page_shift() to check these two points when
>> selecting block mapping size.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  arch/arm64/kvm/mmu.c | 37 +
>>  1 file changed, 33 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c59af5ca01b0..1a6d96169d60 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
>> address, short lsb)
>>  send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>  }
>>  
>> +/*
>> + * Find a max mapping size that properly fits inside the vma. And hva and pa must
>> + * have the same alignment to this mapping size. It's rough as there are still
>> + * other restrictions, which will be checked by fault_supports_stage2_huge_mapping().
>> + */
>> +static short device_rough_page_shift(struct vm_area_struct *vma,
>> + unsigned long hva)
> 
> My earlier question still stands. Under which circumstances would this
> function return something that is *not* the final mapping size? I
> really don't see a reason why this would not return the final mapping
> size.

IIUC, all the restrictions are about alignment and area boundaries.

That is to say, HVA, IPA and PA must have the same alignment within the mapping size.
And the areas are the memslot and the vma, which means the mapping size must properly
fit into both the memslot and the vma.

In this function, we only check the alignment of HVA and PA, and the boundary of the
vma. We still need to check the alignment of HVA and IPA, and the boundary of the
memslot. These will be checked by fault_supports_stage2_huge_mapping().
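
A tiny standalone illustration of the "same alignment within the block size" rule
(user-space style, made-up addresses; not kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SZ_2M	(2UL * 1024 * 1024)

/* HVA and PA must share the same offset within the candidate block size. */
static bool same_offset_in_block(uint64_t hva, uint64_t pa, uint64_t blk)
{
	return (hva & (blk - 1)) == (pa & (blk - 1));
}

int main(void)
{
	uint64_t hva = 0x40000000;
	uint64_t pa_misaligned = 0x80100000;	/* off by 1M: no 2M block possible */
	uint64_t pa_aligned = 0x80000000;	/* same offset: a 2M block works */

	printf("%d\n", same_offset_in_block(hva, pa_misaligned, SZ_2M)); /* 0 */
	printf("%d\n", same_offset_in_block(hva, pa_aligned, SZ_2M));    /* 1 */
	return 0;
}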

> 
>> +{
>> +phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>> +
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
>> +ALIGN(hva, PUD_SIZE) <= vma->vm_end)
>> +return PUD_SHIFT;
>> +#endif
>> +
>> +if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
>> +ALIGN(hva, PMD_SIZE) <= vma->vm_end)
>> +return PMD_SHIFT;
>> +
>> +return PAGE_SHIFT;
>> +}
>> +
>>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
>> *memslot,
>> unsigned long hva,
>> unsigned long map_size)
>> @@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>  
>> -/* Let's check if we will get back a huge page backed by hugetlbfs */
>> +/*
>> + * Let's check if we will get back a huge page backed by hugetlbfs, or
>> + * get block mapping for device MMIO region.
>> + */
>>  mmap_read_lock(current->mm);
>>  vma = find_vma_intersection(current->mm, hva, hva + 1);
>>  if (unlikely(!vma)) {
>> @@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  
>>  if (is_vm_hugetlb_page(vma))
>>  vma_shift = huge_page_shift(hstate_vma(vma));
>> +else if (vma->vm_flags & VM_PFNMAP)
>> +vma_shift = devi

Re: [RFC PATCH] KVM: x86: Support write protect huge pages lazily

2021-04-14 Thread Keqian Zhu
Hi Ben,

On 2021/4/14 0:43, Ben Gardon wrote:
> On Tue, Apr 13, 2021 at 2:39 AM Keqian Zhu  wrote:
>>
>>
>>
>> On 2021/4/13 1:19, Ben Gardon wrote:
>>> On Tue, Apr 6, 2021 at 4:42 PM Sean Christopherson  
>>> wrote:
>>>>
>>>> +Ben
>>>>
>>>> On Tue, Apr 06, 2021, Keqian Zhu wrote:
>>>>> Hi Paolo,
>>>>>
>>>>> I plan to rework this patch and do full test. What do you think about 
>>>>> this idea
>>>>> (enable dirty logging for huge pages lazily)?
>>>>
>>>> Ben, don't you also have something similar (or maybe the exact opposite?) 
>>>> in the
>>>> hopper?  This sounds very familiar, but I can't quite connect the dots 
>>>> that are
>>>> floating around my head...
>>>
>>> Sorry for the late response, I was out of office last week.
>> Never mind, Sean has told me. :)
>>
>>>
>>> Yes, we have two relevant features I'd like to reconcile somehow:
>>> 1.) Large page shattering - Instead of clearing a large TDP mapping,
>>> flushing the TLBs, then replacing it with an empty TDP page table, go
>>> straight from the large mapping to a fully pre-populated table. This
>>> is slightly slower because the table needs to be pre-populated, but it
>>> saves many vCPU page faults.
>>> 2.) Eager page splitting - split all large mappings down to 4k when
>>> enabling dirty logging, using large page shattering. This makes
>>> enabling dirty logging much slower, but speeds up the first round (and
>>> later rounds) of gathering / clearing the dirty log and reduces the
>>> number of vCPU page faults. We've prefered to do this when enabling
>>> dirty logging because it's a little less perf-sensitive than the later
>>> passes where latency and convergence are critical.
>> OK, I see. I think the lock stuff is an important part, so one question is
>> whether the shattering process is designed to be locked (i.e., protecting the
>> mapping) or lock-less?
>>
>> If it's locked, the vCPU thread may be blocked for a long time (for arm, there
>> is one mmu_lock per VM). If it's lock-less, how can we ensure synchronization of
>> the mapping?
> 
> The TDP MMU for x86 could do it under the MMU read lock, but the
> legacy / shadow x86 MMU and other architectures would need the whole
> MMU lock.
> While we do increase the time required to address a large SPTE, we can
> completely avoid the vCPU needing the MMU lock on an access to that
> SPTE as the translation goes straight from a large, writable SPTE, to
> a 4k spte with either the d bit cleared or write protected. If it's
> write protected, the fault can (at least on x86) be resolved without
> the MMU lock.
That sounds good! In terms of locking, x86 is better than arm64. For arm64,
we must hold the whole MMU lock both to split large pages and to change
permissions for 4K pages.

> 
> When I'm able to put together a large page shattering series, I'll do
> some performance analysis and see how it changes things, but that part
OK.

> is sort of orthogonal to this change. The more I think about it, the
> better the init-all-set approach for large pages sounds, compared to
> eager splitting. I'm definitely in support of this patch and am happy
> to help review when you send out the v2 with TDP MMU support and such.
Thanks a lot. :)

> 
>>
>>>
>>> Large page shattering can happen in the NPT page fault handler or the
>>> thread enabling dirty logging / clearing the dirty log, so it's
>>> more-or-less orthogonal to this patch.
>>>
>>> Eager page splitting on the other hand takes the opposite approach to
>>> this patch, frontloading as much of the work to enable dirty logging
>>> as possible. Which approach is better is going to depend a lot on the
>>> guest workload, your live migration constraints, and how the
>>> user-space hypervisor makes use of KVM's growing number of dirty
>>> logging options. In our case, the time to migrate a VM is usually less
>>> of a concern than the performance degradation the guest experiences,
>>> so we want to do everything we can to minimize vCPU exits and exit
>>> latency.
>> Yes, makes sense to me.
>>
>>>
>>> I think this is a reasonable change in principle if we're not write
>>> protecting 4k pages already, but it's hard to really validate all the
>>> performance implications. With this change we'd move pretty much all
>>> the work to the first pass of clearing the dirty log, which is
>>> probably an improvement since it's much more granular. The downside is
>> Yes, at least splitting large pages lazily is better than the current logic.
>>
>>> that we do more work when we'd really like to be converging the dirty
>>> set as opposed to earlier when we know all pages are dirty anyway.
>> I think the dirty collection procedure is not affected; am I missing something?
> 
> Oh yeah, good point. Since the splitting of large SPTEs is happening
> in the vCPU threads it wouldn't slow dirty log collection at all. We
> would have to do slightly more work to write protect the large SPTEs
> that weren't written to, but that's a relatively small amount of work.
Indeed.


BRs,
Keqian


Re: [RFC PATCH v2 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-14 Thread Keqian Zhu
Hi Marc,

On 2021/4/8 15:28, Keqian Zhu wrote:
> Hi Marc,
> 
> On 2021/4/7 21:18, Marc Zyngier wrote:
>> On Tue, 16 Mar 2021 13:43:38 +0000,
>> Keqian Zhu  wrote:
>>>
[...]

>>>  
>>> +/*
>>> + * Find a mapping size that properly fits inside the intersection of vma and
>>> + * memslot. And hva and pa have the same alignment to this mapping size.
>>> + * It's rough because there are still other restrictions, which will be
>>> + * checked by the following fault_supports_stage2_huge_mapping().
>>
>> I don't think these restrictions make complete sense to me. If this is
>> a PFNMAP VMA, we should use the biggest mapping size that covers the
>> VMA, and not more than the VMA.
> But as described by kvm_arch_prepare_memory_region(), the memslot may not fully
> cover the VMA. If that's true and we just consider the boundary of the VMA, our
> block mapping may go beyond the boundary of the memslot. Is this a problem?
emm... Sorry, I missed something. fault_supports_stage2_huge_mapping() will check
the boundary of the memslot, so we don't need to check it here. I have sent v3,
please check that.

BRs,
Keqian


[PATCH v3 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-14 Thread Keqian Zhu
Hi,

We have two paths to build stage2 mappings for MMIO regions:
the creation-time path and the stage2 fault path.

Patch #1 removes the creation time's mapping of MMIO regions.
Patch #2 tries stage2 block mapping for host device MMIO in the fault path.

Changelog:

v3:
 - Do not need to check memslot boundary in device_rough_page_shift(). (Marc)

Thanks,
Keqian

Keqian Zhu (2):
  kvm/arm64: Remove the creation time's mapping of MMIO regions
  kvm/arm64: Try stage2 block mapping for host device MMIO

 arch/arm64/kvm/mmu.c | 75 +---
 1 file changed, 36 insertions(+), 39 deletions(-)

-- 
2.19.1



[PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-14 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level); try to use
block mapping in stage2 to speed up both map and unmap.

Compared to normal memory mapping, we should consider two more
points when trying block mapping for an MMIO region:

1. For normal memory mapping, the PA (host physical address) and
HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
the HVA to request a hugepage, so we don't need to consider PA
alignment when verifying block mapping. But for device memory
mapping, the PA and HVA may have different alignment.

2. For normal memory mapping, we are sure the hugepage size properly
fits into the vma, so we don't check whether the mapping size exceeds
the boundary of the vma. But for device memory mapping, we should pay
attention to this.

This adds device_rough_page_shift() to check these two points when
selecting block mapping size.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 37 +
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c59af5ca01b0..1a6d96169d60 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long 
address, short lsb)
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
 }
 
+/*
+ * Find a max mapping size that properly fits inside the vma. And hva and pa must
+ * have the same alignment to this mapping size. It's rough as there are still
+ * other restrictions, which will be checked by fault_supports_stage2_huge_mapping().
+ */
+static short device_rough_page_shift(struct vm_area_struct *vma,
+unsigned long hva)
+{
+   phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
+
+#ifndef __PAGETABLE_PMD_FOLDED
+   if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PUD_SIZE) <= vma->vm_end)
+   return PUD_SHIFT;
+#endif
+
+   if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
+   ALIGN(hva, PMD_SIZE) <= vma->vm_end)
+   return PMD_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
 static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
   unsigned long hva,
   unsigned long map_size)
@@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   /* Let's check if we will get back a huge page backed by hugetlbfs */
+   /*
+* Let's check if we will get back a huge page backed by hugetlbfs, or
+* get block mapping for device MMIO region.
+*/
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
@@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
if (is_vm_hugetlb_page(vma))
vma_shift = huge_page_shift(hstate_vma(vma));
+   else if (vma->vm_flags & VM_PFNMAP)
+   vma_shift = device_rough_page_shift(vma, hva);
else
vma_shift = PAGE_SHIFT;
 
-   if (logging_active ||
-   (vma->vm_flags & VM_PFNMAP)) {
+   if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
}
@@ -855,7 +884,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
if (kvm_is_device_pfn(pfn)) {
device = true;
-   force_pte = true;
+   force_pte = (vma_pagesize == PAGE_SIZE);
} else if (logging_active && !write_fault) {
/*
 * Only actually map the page as writable if this was a write
-- 
2.19.1



[PATCH v3 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-04-14 Thread Keqian Zhu
The MMIO regions may be unmapped for many reasons and can be remapped
by the stage2 fault path. Mapping MMIO regions at creation time is only a
minor optimization and makes these two mapping paths hard to keep in sync.

Remove the mapping code while keeping the useful sanity check.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..c59af5ca01b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
-   bool writable = !(mem->flags & KVM_MEM_READONLY);
int ret = 0;
 
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
-* between them, so iterate over all of them to find out if we can map
-* any of them right now.
+* between them, so iterate over all of them.
 *
 * ++
 * +---++   ++
@@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 */
do {
struct vm_area_struct *vma = find_vma(current->mm, hva);
-   hva_t vm_start, vm_end;
 
if (!vma || vma->vm_start >= reg_end)
break;
 
-   /*
-* Take the intersection of this VMA with the memory region
-*/
-   vm_start = max(hva, vma->vm_start);
-   vm_end = min(reg_end, vma->vm_end);
-
if (vma->vm_flags & VM_PFNMAP) {
-   gpa_t gpa = mem->guest_phys_addr +
-   (vm_start - mem->userspace_addr);
-   phys_addr_t pa;
-
-   pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-   pa += vm_start - vma->vm_start;
-
/* IO region dirty page logging not allowed */
if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
-   goto out;
-   }
-
-   ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-   vm_end - vm_start,
-   writable);
-   if (ret)
break;
+   }
}
-   hva = vm_end;
+   hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
 
-   if (change == KVM_MR_FLAGS_ONLY)
-   goto out;
-
-   spin_lock(&kvm->mmu_lock);
-   if (ret)
-   unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, 
mem->memory_size);
-   else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-   stage2_flush_memslot(kvm, memslot);
-   spin_unlock(&kvm->mmu_lock);
-out:
mmap_read_unlock(current->mm);
return ret;
 }
-- 
2.19.1



Re: [RFC PATCH v2 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-13 Thread Keqian Zhu
Hi Marc,

I think I have fully tested this patch. The next step is to add some restrictions
on the HVA in the vfio module, so we can build block mappings for it with a higher
probability.

Is there anything to improve? If not, could you apply it? ^_^

Thanks,
Keqian

On 2021/4/7 21:18, Marc Zyngier wrote:
> On Tue, 16 Mar 2021 13:43:38 +,
> Keqian Zhu  wrote:
>>
>> The MMIO region of a device may be huge (GB level); try to use
>> block mapping in stage2 to speed up both map and unmap.
>>
>> Compared to normal memory mapping, we should consider two more
>> points when trying block mapping for an MMIO region:
>>
>> 1. For normal memory mapping, the PA (host physical address) and
>> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
>> the HVA to request a hugepage, so we don't need to consider PA
>> alignment when verifying block mapping. But for device memory
>> mapping, the PA and HVA may have different alignment.
>>
>> 2. For normal memory mapping, we are sure the hugepage size properly
>> fits into the vma, so we don't check whether the mapping size exceeds
>> the boundary of the vma. But for device memory mapping, we should pay
>> attention to this.
>>
>> This adds device_rough_page_shift() to check these two points when
>> selecting block mapping size.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>
>> Mainly for RFC, not fully tested. I will fully test it when the
>> code logic is well accepted.
>>
>> ---
>>  arch/arm64/kvm/mmu.c | 42 ++
>>  1 file changed, 38 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c59af5ca01b0..224aa15eb4d9 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -624,6 +624,36 @@ static void kvm_send_hwpoison_signal(unsigned long 
>> address, short lsb)
>>  send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>  }
>>  
>> +/*
>> + * Find a mapping size that properly fits inside the intersection of vma and
>> + * memslot. And hva and pa have the same alignment to this mapping size.
>> + * It's rough because there are still other restrictions, which will be
>> + * checked by the following fault_supports_stage2_huge_mapping().
> 
> I don't think these restrictions make complete sense to me. If this is
> a PFNMAP VMA, we should use the biggest mapping size that covers the
> VMA, and not more than the VMA.
> 
>> + */
>> +static short device_rough_page_shift(struct kvm_memory_slot *memslot,
>> + struct vm_area_struct *vma,
>> + unsigned long hva)
>> +{
>> +size_t size = memslot->npages * PAGE_SIZE;
>> +hva_t sec_start = max(memslot->userspace_addr, vma->vm_start);
>> +hva_t sec_end = min(memslot->userspace_addr + size, vma->vm_end);
>> +phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>> +
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PUD_SIZE) >= sec_start &&
>> +ALIGN(hva, PUD_SIZE) <= sec_end)
>> +return PUD_SHIFT;
>> +#endif
>> +
>> +if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PMD_SIZE) >= sec_start &&
>> +ALIGN(hva, PMD_SIZE) <= sec_end)
>> +return PMD_SHIFT;
>> +
>> +return PAGE_SHIFT;
>> +}
>> +
>>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
>> *memslot,
>> unsigned long hva,
>> unsigned long map_size)
>> @@ -769,7 +799,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>  
>> -/* Let's check if we will get back a huge page backed by hugetlbfs */
>> +/*
>> + * Let's check if we will get back a huge page backed by hugetlbfs, or
>> + * get block mapping for device MMIO region.
>> + */
>>  mmap_read_lock(current->mm);
>>  vma = find_vma_intersection(current->mm, hva, hva + 1);
>>  if (unlikely(!vma)) {
>> @@ -780,11 +813,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  
>>  if (is_vm_hugetlb_page(vma))
>>  vma_shift = huge_page_shift(hstate_vma(vma));
>> +else if (v

Re: [RFC PATCH] KVM: x86: Support write protect huge pages lazily

2021-04-13 Thread Keqian Zhu



On 2021/4/13 1:19, Ben Gardon wrote:
> On Tue, Apr 6, 2021 at 4:42 PM Sean Christopherson  wrote:
>>
>> +Ben
>>
>> On Tue, Apr 06, 2021, Keqian Zhu wrote:
>>> Hi Paolo,
>>>
>>> I plan to rework this patch and do full test. What do you think about this 
>>> idea
>>> (enable dirty logging for huge pages lazily)?
>>
>> Ben, don't you also have something similar (or maybe the exact opposite?) in 
>> the
>> hopper?  This sounds very familiar, but I can't quite connect the dots that 
>> are
>> floating around my head...
> 
> Sorry for the late response, I was out of office last week.
Never mind, Sean has told me. :)

> 
> Yes, we have two relevant features I'd like to reconcile somehow:
> 1.) Large page shattering - Instead of clearing a large TDP mapping,
> flushing the TLBs, then replacing it with an empty TDP page table, go
> straight from the large mapping to a fully pre-populated table. This
> is slightly slower because the table needs to be pre-populated, but it
> saves many vCPU page faults.
> 2.) Eager page splitting - split all large mappings down to 4k when
> enabling dirty logging, using large page shattering. This makes
> enabling dirty logging much slower, but speeds up the first round (and
> later rounds) of gathering / clearing the dirty log and reduces the
> number of vCPU page faults. We've prefered to do this when enabling
> dirty logging because it's a little less perf-sensitive than the later
> passes where latency and convergence are critical.
OK, I see. I think the lock stuff is an important part, so one question is whether
the shattering process is designed to be locked (i.e., protecting the mapping) or
lock-less?

If it's locked, the vCPU thread may be blocked for a long time (for arm, there is
one mmu_lock per VM). If it's lock-less, how can we ensure synchronization of the
mapping?

> 
> Large page shattering can happen in the NPT page fault handler or the
> thread enabling dirty logging / clearing the dirty log, so it's
> more-or-less orthogonal to this patch.
> 
> Eager page splitting on the other hand takes the opposite approach to
> this patch, frontloading as much of the work to enable dirty logging
> as possible. Which approach is better is going to depend a lot on the
> guest workload, your live migration constraints, and how the
> user-space hypervisor makes use of KVM's growing number of dirty
> logging options. In our case, the time to migrate a VM is usually less
> of a concern than the performance degradation the guest experiences,
> so we want to do everything we can to minimize vCPU exits and exit
> latency.
Yes, makes sense to me.

> 
> I think this is a reasonable change in principle if we're not write
> protecting 4k pages already, but it's hard to really validate all the
> performance implications. With this change we'd move pretty much all
> the work to the first pass of clearing the dirty log, which is
> probably an improvement since it's much more granular. The downside is
Yes, at least splitting large pages lazily is better than the current logic.

> that we do more work when we'd really like to be converging the dirty
> set as opposed to earlier when we know all pages are dirty anyway.
I think the dirty collection procedure is not affected; am I missing something?

> 
>>
>>> PS: As dirty log of TDP MMU has been supported, I should add more code.
>>>
>>> On 2020/8/28 16:11, Keqian Zhu wrote:
>>>> Currently during enable dirty logging, if we're with init-all-set,
>>>> we just write protect huge pages and leave normal pages untouched,
>>>> for that we can enable dirty logging for these pages lazily.
>>>>
>>>> It seems that enable dirty logging lazily for huge pages is feasible
>>>> too, which not only reduces the time of start dirty logging, also
>>>> greatly reduces side-effect on guest when there is high dirty rate.
> 
> The side effect on the guest would also be greatly reduced with large
> page shattering above.
Sure.

> 
>>>>
>>>> (These codes are not tested, for RFC purpose :-) ).
>>>>
>>>> Signed-off-by: Keqian Zhu 
>>>> ---
>>>>  arch/x86/include/asm/kvm_host.h |  3 +-
>>>>  arch/x86/kvm/mmu/mmu.c  | 65 ++---
>>>>  arch/x86/kvm/vmx/vmx.c  |  3 +-
>>>>  arch/x86/kvm/x86.c  | 22 +--
>>>>  4 files changed, 62 insertions(+), 31 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h 
>>>> b/arch/x86/include/asm/kvm_host.h
>>>> index 5303dbc5c9bc..201a068cf43d 100644
>>>> --- a/arch/x8

[PATCH 2/3] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

In the past, if the vfio_iommu is not of pinned_page_dirty_scope and
the vfio_dma is iommu_mapped, we populate the full dirty bitmap for this
vfio_dma. Now we can try to get the dirty log from the iommu before making
that lousy decision.

The bitmap population:

In detail, if all vfio_groups are of pinned_page_dirty_scope, the
dirty bitmap population is not affected. If there are vfio_groups
not of pinned_page_dirty_scope and their domains support HWDBM,
then we can try to get the dirty log from the IOMMU. Otherwise, fall back
to the full dirty bitmap.

Consider DMA and group hotplug:

Start dirty log for a newly added DMA range, and stop dirty log for a
DMA range that is going to be removed.

A domain may not support HWDBM at start but support it after some groups
are hotplugged (attaching a first group with HWDBM, or detaching all
groups without HWDBM). A domain may support HWDBM at start but no longer
support it after some groups are hotplugged (attaching a group without
HWDBM, or detaching all groups without HWDBM). So our policy is to
switch dirty log for domains dynamically.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/vfio/vfio_iommu_type1.c | 166 ++--
 1 file changed, 159 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9cb9ce021b22..77950e47f56f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1202,6 +1202,46 @@ static void vfio_update_pgsize_bitmap(struct vfio_iommu 
*iommu)
}
 }
 
+static int vfio_iommu_dirty_log_clear(struct vfio_iommu *iommu,
+ dma_addr_t start_iova, size_t size,
+ unsigned long *bitmap_buffer,
+ dma_addr_t base_iova,
+ unsigned long pgshift)
+{
+   struct vfio_domain *d;
+   int ret = 0;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   ret = iommu_clear_dirty_log(d->domain, start_iova, size,
+   bitmap_buffer, base_iova, pgshift);
+   if (ret) {
+   pr_warn("vfio_iommu dirty log clear failed!\n");
+   break;
+   }
+   }
+
+   return ret;
+}
+
+static int vfio_iommu_dirty_log_sync(struct vfio_iommu *iommu,
+struct vfio_dma *dma,
+unsigned long pgshift)
+{
+   struct vfio_domain *d;
+   int ret = 0;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   ret = iommu_sync_dirty_log(d->domain, dma->iova, dma->size,
+  dma->bitmap, dma->iova, pgshift);
+   if (ret) {
+   pr_warn("vfio_iommu dirty log sync failed!\n");
+   break;
+   }
+   }
+
+   return ret;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
  struct vfio_dma *dma, dma_addr_t base_iova,
  size_t pgsize)
@@ -1212,13 +1252,22 @@ static int update_user_bitmap(u64 __user *bitmap, 
struct vfio_iommu *iommu,
unsigned long copy_offset = bit_offset / BITS_PER_LONG;
unsigned long shift = bit_offset % BITS_PER_LONG;
unsigned long leftover;
+   int ret;
 
-   /*
-* mark all pages dirty if any IOMMU capable device is not able
-* to report dirty pages and all pages are pinned and mapped.
-*/
-   if (iommu->num_non_pinned_groups && dma->iommu_mapped)
+   if (!iommu->num_non_pinned_groups || !dma->iommu_mapped) {
+   /* nothing to do */
+   } else if (!iommu->num_non_hwdbm_groups) {
+   /* try to get dirty log from IOMMU */
+   ret = vfio_iommu_dirty_log_sync(iommu, dma, pgshift);
+   if (ret)
+   return ret;
+   } else {
+   /*
+* mark all pages dirty if any IOMMU capable device is not able
+* to report dirty pages and all pages are pinned and mapped.
+*/
bitmap_set(dma->bitmap, 0, nbits);
+   }
 
if (shift) {
bitmap_shift_left(dma->bitmap, dma->bitmap, shift,
@@ -1236,6 +1285,12 @@ static int update_user_bitmap(u64 __user *bitmap, struct 
vfio_iommu *iommu,
 DIRTY_BITMAP_BYTES(nbits + shift)))
return -EFAULT;
 
+   /* Recover the bitmap if it'll be used to clear hardware dirty log */
+   if (shift && iommu->num_non_pinned_groups && dma->iommu_mapped &&
+   !iommu->num_non_hwdbm_groups)
+   bitmap_shift_right(dma->bitmap, dma->bitmap, shift,
+  nbits + shift);
+
return 0;
 }
 
@@ -1274,6 +

[PATCH 1/3] vfio/iommu_type1: Add HWDBM status maintenance

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

We are going to optimize dirty log tracking based on the iommu
HWDBM feature, but the dirty log from the iommu is useful only
when all iommu-backed groups have the HWDBM feature.

This maintains a counter in vfio_iommu, which is used by
the dirty bitmap population policy in the next patch.

This also maintains a counter in vfio_domain, which is used
by the switch-dirty-log policy in the next patch.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/vfio/vfio_iommu_type1.c | 44 +
 1 file changed, 44 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 45cbfd4879a5..9cb9ce021b22 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -73,6 +73,7 @@ struct vfio_iommu {
unsigned intvaddr_invalid_count;
uint64_tpgsize_bitmap;
uint64_tnum_non_pinned_groups;
+   uint64_tnum_non_hwdbm_groups;
wait_queue_head_t   vaddr_wait;
boolv2;
boolnesting;
@@ -85,6 +86,7 @@ struct vfio_domain {
struct iommu_domain *domain;
struct list_headnext;
struct list_headgroup_list;
+   uint64_tnum_non_hwdbm_groups;
int prot;   /* IOMMU_CACHE */
boolfgsp;   /* Fine-grained super pages */
 };
@@ -116,6 +118,7 @@ struct vfio_group {
struct list_headnext;
boolmdev_group; /* An mdev group */
boolpinned_page_dirty_scope;
+   booliommu_hwdbm;/* For iommu-backed group */
 };
 
 struct vfio_iova {
@@ -2252,6 +2255,44 @@ static void vfio_iommu_iova_insert_copy(struct 
vfio_iommu *iommu,
list_splice_tail(iova_copy, iova);
 }
 
+static int vfio_dev_enable_feature(struct device *dev, void *data)
+{
+   enum iommu_dev_features *feat = data;
+
+   if (iommu_dev_feature_enabled(dev, *feat))
+   return 0;
+
+   return iommu_dev_enable_feature(dev, *feat);
+}
+
+static bool vfio_group_supports_hwdbm(struct vfio_group *group)
+{
+   enum iommu_dev_features feat = IOMMU_DEV_FEAT_HWDBM;
+
+   return !iommu_group_for_each_dev(group->iommu_group, &feat,
+vfio_dev_enable_feature);
+}
+
+/*
+ * Called after a new group is added to the group_list of domain, or before an
+ * old group is removed from the group_list of domain.
+ */
+static void vfio_iommu_update_hwdbm(struct vfio_iommu *iommu,
+   struct vfio_domain *domain,
+   struct vfio_group *group,
+   bool attach)
+{
+   /* Update the HWDBM status of group, domain and iommu */
+   group->iommu_hwdbm = vfio_group_supports_hwdbm(group);
+   if (!group->iommu_hwdbm && attach) {
+   domain->num_non_hwdbm_groups++;
+   iommu->num_non_hwdbm_groups++;
+   } else if (!group->iommu_hwdbm && !attach) {
+   domain->num_non_hwdbm_groups--;
+   iommu->num_non_hwdbm_groups--;
+   }
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 struct iommu_group *iommu_group)
 {
@@ -2409,6 +2450,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
vfio_iommu_detach_group(domain, group);
if (!vfio_iommu_attach_group(d, group)) {
+   list_add(&group->next, &d->group_list);
+   vfio_iommu_update_hwdbm(iommu, d, group, true);
iommu_domain_free(domain->domain);
kfree(domain);
goto done;
@@ -2435,6 +2477,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
+   list_add(&domain->next, &iommu->domain_list);
vfio_update_pgsize_bitmap(iommu);
+   vfio_iommu_update_hwdbm(iommu, domain, group, true);
 done:
/* Delete the old one and insert new iova list */
+   vfio_iommu_iova_insert_copy(iommu, &iova_copy);
@@ -2618,6 +2661,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
continue;
 
vfio_iommu_detach_group(domain, group);
+   vfio_iommu_update_hwdbm(iommu, domain, group, false);
update_dirty_scope = !group->pinned_page_dirty_scope;
+   list_del(&group->next);
kfree(group);
-- 
2.19.1



[PATCH 0/3] vfio/iommu_type1: Implement dirty log tracking based on IOMMU HWDBM

2021-04-13 Thread Keqian Zhu
Hi everyone,

This patch series implements vfio dma dirty log tracking based on IOMMU HWDBM
(hardware dirty bit management, such as SMMU with HTTU or Intel IOMMU with SLADE).

This patch series is split from the series[1] that contains both the IOMMU part and
the VFIO part. Please refer to the new IOMMU part[2] for review or test.

Intention:

As we know, vfio live migration is an important and valuable feature, but there
are still many hurdles to solve, including migration of interrupts, device state,
DMA dirty log tracking, etc.

For now, the only dirty log tracking interface is pinning. It has some drawbacks:
1. Only smart vendor drivers are aware of this.
2. It's coarse-grained; the pinned scope is generally bigger than what the
   device actually accesses.
3. It can't track dirty state continuously and precisely; vfio populates the whole
   pinned scope as dirty, so it doesn't work well with iterative dirty log handling.

About this series:

Implement a new dirty log tracking method for vfio based on iommu HWDBM. A new
ioctl operation named VFIO_DIRTY_LOG_MANUAL_CLEAR is added, which can eliminate
some redundant dirty handling in userspace.
   
Optimizations Todo:

1. We recognized that each smmu_domain (a vfio_container may have several
   smmu_domains) has its own stage1 mapping, and we must scan all these mappings
   to sync dirty state. We plan to refactor smmu_domain to support more than one
   smmu in one smmu_domain, so these smmus can share the same stage1 mapping.
2. We also recognized that scanning TTDs is a performance hotspot. Recently, I
   have implemented a SW/HW combined dirty log tracking at the MMU side[3], which
   can effectively solve this problem. This idea can be applied to the smmu side
   too.

Thanks,
Keqian

[1] 
https://lore.kernel.org/linux-iommu/20210310090614.26668-1-zhukeqi...@huawei.com/
[2] 
https://lore.kernel.org/linux-iommu/20210413085457.25400-1-zhukeqi...@huawei.com/
  
[3] 
https://lore.kernel.org/linux-arm-kernel/2021012612.27136-1-zhukeqi...@huawei.com/

Kunkun Jiang (3):
  vfio/iommu_type1: Add HWDBM status maintenance
  vfio/iommu_type1: Optimize dirty bitmap population based on iommu
HWDBM
  vfio/iommu_type1: Add support for manual dirty log clear

 drivers/vfio/vfio_iommu_type1.c | 310 ++--
 include/uapi/linux/vfio.h   |  28 ++-
 2 files changed, 326 insertions(+), 12 deletions(-)

-- 
2.19.1



[PATCH 3/3] vfio/iommu_type1: Add support for manual dirty log clear

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

In the past, we cleared the dirty log immediately after syncing the dirty
log to userspace. This may cause redundant dirty handling if
userspace handles the dirty log iteratively:

After vfio clears the dirty log, new dirty log starts to be generated.
These new dirty bits will be reported to userspace even if they
were generated before userspace handled the same dirty pages.

That is to say, we should minimize the time gap between dirty log
clearing and dirty log handling. We can give userspace an
interface to clear the dirty log.
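
For context, the intended userspace loop then looks roughly like the sketch below.
VFIO_IOMMU_DIRTY_PAGES and its START/GET_BITMAP flags are existing uapi; the
argument structures, helpers and the clear operation's flag are placeholders for
the additions in this patch (see the uapi hunk for the authoritative names):

	/* Pseudo-C sketch; start_args/get_args/clear_args, transfer() and
	 * converged() are placeholders, not real uapi or helpers.
	 */
	ioctl(container, VFIO_IOMMU_DIRTY_PAGES, &start_args);	/* START */
	do {
		/* 1. Fetch the dirty bitmap; it is no longer auto-cleared. */
		ioctl(container, VFIO_IOMMU_DIRTY_PAGES, &get_args);

		/* 2. Handle (e.g. transfer) the pages reported dirty. */
		transfer(bitmap, iova, size);

		/* 3. Only now clear the dirty log for exactly those pages,
		 *    minimizing the gap between handling and clearing.
		 */
		ioctl(container, VFIO_IOMMU_DIRTY_PAGES, &clear_args);
	} while (!converged());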

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/vfio/vfio_iommu_type1.c | 100 ++--
 include/uapi/linux/vfio.h   |  28 -
 2 files changed, 123 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 77950e47f56f..d9c4a27b3c4e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -78,6 +78,7 @@ struct vfio_iommu {
boolv2;
boolnesting;
booldirty_page_tracking;
+   booldirty_log_manual_clear;
boolpinned_page_dirty_scope;
boolcontainer_open;
 };
@@ -1242,6 +1243,78 @@ static int vfio_iommu_dirty_log_sync(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static int vfio_iova_dirty_log_clear(u64 __user *bitmap,
+struct vfio_iommu *iommu,
+dma_addr_t iova, size_t size,
+size_t pgsize)
+{
+   struct vfio_dma *dma;
+   struct rb_node *n;
+   dma_addr_t start_iova, end_iova, riova;
+   unsigned long pgshift = __ffs(pgsize);
+   unsigned long bitmap_size;
+   unsigned long *bitmap_buffer = NULL;
+   bool clear_valid;
+   int rs, re, start, end, dma_offset;
+   int ret = 0;
+
+   bitmap_size = DIRTY_BITMAP_BYTES(size >> pgshift);
+   bitmap_buffer = kvmalloc(bitmap_size, GFP_KERNEL);
+   if (!bitmap_buffer) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   if (copy_from_user(bitmap_buffer, bitmap, bitmap_size)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+   dma = rb_entry(n, struct vfio_dma, node);
+   if (!dma->iommu_mapped)
+   continue;
+   if ((dma->iova + dma->size - 1) < iova)
+   continue;
+   if (dma->iova > iova + size - 1)
+   break;
+
+   start_iova = max(iova, dma->iova);
+   end_iova = min(iova + size, dma->iova + dma->size);
+
+   /* Similar logic as the tail of vfio_iova_dirty_bitmap */
+
+   clear_valid = false;
+   start = (start_iova - iova) >> pgshift;
+   end = (end_iova - iova) >> pgshift;
+   bitmap_for_each_set_region(bitmap_buffer, rs, re, start, end) {
+   clear_valid = true;
+   riova = iova + (rs << pgshift);
+   dma_offset = (riova - dma->iova) >> pgshift;
+   bitmap_clear(dma->bitmap, dma_offset, re - rs);
+   }
+
+   if (clear_valid)
+   vfio_dma_populate_bitmap(dma, pgsize);
+
+   if (clear_valid && !iommu->pinned_page_dirty_scope &&
+   dma->iommu_mapped && !iommu->num_non_hwdbm_groups) {
+   ret = vfio_iommu_dirty_log_clear(iommu, start_iova,
+   end_iova - start_iova,  bitmap_buffer,
+   iova, pgshift);
+   if (ret) {
+   pr_warn("dma dirty log clear failed!\n");
+   goto out;
+   }
+   }
+
+   }
+
+out:
+   kfree(bitmap_buffer);
+   return ret;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
  struct vfio_dma *dma, dma_addr_t base_iova,
  size_t pgsize)
@@ -1329,6 +1402,10 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, 
struct vfio_iommu *iommu,
if (ret)
return ret;
 
+   /* Do not clear dirty automatically when manual_clear enabled */
+   if (iommu->dirty_log_manual_clear)
+   continue;
+
/* Clear iommu dirty log to re-enable dirty log tracking */
if (iommu->num_non_pinned_groups && dma->iommu_mapped &&
!iommu->num_non_hwdbm_groups) {
@@ -2946,6 +3023,11 @@ static int vfio_io

[PATCH v3 11/12] iommu/arm-smmu-v3: Realize clear_dirty_log iommu ops

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

After the dirty log is retrieved, the user should clear it to re-enable
dirty log tracking for these dirtied pages. This clears the dirty state
of the leaf TTDs specified by the user-provided bitmap (as we only enable
HTTU for stage 1, that means setting the AP[2] bit).
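
One subtlety worth noting: a block-sized leaf TTD is only write-protected again
when every page it covers is set in the user bitmap, otherwise dirty state that
userspace has not handled yet would be lost. A minimal standalone model of that
rule (equivalent to the test_bit() loop in the io-pgtable hunk below, shown only
for illustration):

static bool block_fully_covered(const unsigned long *bitmap,
				unsigned long offset, unsigned long nbits)
{
	/* True only if no zero bit exists in [offset, offset + nbits). */
	return find_next_zero_bit(bitmap, offset + nbits, offset) >=
	       offset + nbits;
}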

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 ++
 drivers/iommu/io-pgtable-arm.c  | 95 +
 include/linux/io-pgtable.h  |  4 +
 3 files changed, 124 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9eb209a07acc..59bb1d198631 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2656,6 +2656,30 @@ static int arm_smmu_sync_dirty_log(struct iommu_domain 
*domain,
   bitmap_pgshift);
 }
 
+static int arm_smmu_clear_dirty_log(struct iommu_domain *domain,
+   unsigned long iova, size_t size,
+   unsigned long *bitmap,
+   unsigned long base_iova,
+   unsigned long bitmap_pgshift)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   return -ENODEV;
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (!ops || !ops->clear_dirty_log) {
+   pr_err("io-pgtable don't realize clear dirty log\n");
+   return -ENODEV;
+   }
+
+   return ops->clear_dirty_log(ops, iova, size, bitmap, base_iova,
+   bitmap_pgshift);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2759,6 +2783,7 @@ static struct iommu_ops arm_smmu_ops = {
.merge_page = arm_smmu_merge_page,
.switch_dirty_log   = arm_smmu_switch_dirty_log,
.sync_dirty_log = arm_smmu_sync_dirty_log,
+   .clear_dirty_log= arm_smmu_clear_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 67a208a05ab2..e3ef0f50611c 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -966,6 +966,100 @@ static int arm_lpae_sync_dirty_log(struct io_pgtable_ops 
*ops,
 bitmap, base_iova, bitmap_pgshift);
 }
 
+static int __arm_lpae_clear_dirty_log(struct arm_lpae_io_pgtable *data,
+ unsigned long iova, size_t size,
+ int lvl, arm_lpae_iopte *ptep,
+ unsigned long *bitmap,
+ unsigned long base_iova,
+ unsigned long bitmap_pgshift)
+{
+   arm_lpae_iopte pte;
+   struct io_pgtable *iop = &data->iop;
+   unsigned long offset;
+   size_t base, next_size;
+   int nbits, ret, i;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return -EINVAL;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt)) {
+   if (pte & ARM_LPAE_PTE_AP_RDONLY)
+   return 0;
+
+   /* Ensure all corresponding bits are set */
+   nbits = size >> bitmap_pgshift;
+   offset = (iova - base_iova) >> bitmap_pgshift;
+   for (i = offset; i < offset + nbits; i++) {
+   if (!test_bit(i, bitmap))
+   return 0;
+   }
+
+   /* Race does not exist */
+   pte |= ARM_LPAE_PTE_AP_RDONLY;
+   __arm_lpae_set_pte(ptep, pte, &iop->cfg);
+   return 0;
+   } else {
+   /* To traverse next level */
+   next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+   ptep = iopte_deref(pte, data);
+   for (base = 0; base < size; base += next_size) {
+   ret = __arm_lpae_clear_dirty_log(data,
+   iova + base, next_s

[PATCH v3 08/12] iommu/arm-smmu-v3: Realize merge_page iommu ops

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

This reinstalls block mappings and unmaps the span of page mappings.
The BBML1 or BBML2 feature is required.

Merging pages does not work concurrently with other pgtable ops,
as the only intended user is vfio, which always holds a lock, so race
conditions are not considered in the pgtable ops.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 28 
 drivers/iommu/io-pgtable-arm.c  | 78 +
 include/linux/io-pgtable.h  |  2 +
 3 files changed, 108 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index cfa83fa03c89..4d8495d88be2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2563,6 +2563,33 @@ static int arm_smmu_split_block(struct iommu_domain 
*domain,
return 0;
 }
 
+static int arm_smmu_merge_page(struct iommu_domain *domain,
+  unsigned long iova, phys_addr_t paddr,
+  size_t size, int prot)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   size_t handled_size;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2, can't merge page\n");
+   return -ENODEV;
+   }
+   if (!ops || !ops->merge_page) {
+   pr_err("io-pgtable don't realize merge page\n");
+   return -ENODEV;
+   }
+
+   handled_size = ops->merge_page(ops, iova, paddr, size, prot);
+   if (handled_size != size) {
+   pr_err("merge page failed\n");
+   return -EFAULT;
+   }
+
+   return 0;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2663,6 +2690,7 @@ static struct iommu_ops arm_smmu_ops = {
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
+   .merge_page = arm_smmu_merge_page,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 4c4eec3c0698..9028328b99b0 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -800,6 +800,83 @@ static size_t arm_lpae_split_block(struct io_pgtable_ops 
*ops,
return __arm_lpae_split_block(data, iova, size, lvl, ptep);
 }
 
+static size_t __arm_lpae_merge_page(struct arm_lpae_io_pgtable *data,
+   unsigned long iova, phys_addr_t paddr,
+   size_t size, int lvl, arm_lpae_iopte *ptep,
+   arm_lpae_iopte prot)
+{
+   arm_lpae_iopte pte, *tablep;
+   struct io_pgtable *iop = &data->iop;
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return 0;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return 0;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt))
+   return size;
+
+   /* Race does not exist */
+   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_BBML1) {
+   prot |= ARM_LPAE_PTE_NT;
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   io_pgtable_tlb_flush_walk(iop, iova, size,
+ ARM_LPAE_GRANULE(data));
+
+   prot &= ~(ARM_LPAE_PTE_NT);
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   } else {
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   }
+
+   tablep = iopte_deref(pte, data);
+   __arm_lpae_free_pgtable(data, lvl + 1, tablep);
+   return size;
+   } else if (iopte_leaf(pte, lvl, iop->fmt)) {
+   /* The size is too small, already merged */
+   return size;
+   }
+
+   /* Keep on walkin */
+   ptep = iopte_deref(pte, data);
+   return __arm_lpae_merge_page(data, iova, paddr, size, lvl + 1, ptep, 
prot);
+}
+
+static size_t arm_lpae_merge_page(struct io_pgtable_ops *ops, unsigned long 
iova,
+ phys_addr_t paddr, size_t size, int 
iommu_prot)
+{
+   

[PATCH v3 10/12] iommu/arm-smmu-v3: Realize sync_dirty_log iommu ops

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

During dirty log tracking, the user will try to retrieve the dirty log from
the iommu if it supports hardware dirty log. Scan the leaf TTDs and treat one
as dirty if it is writable. As we only enable HTTU for stage 1, this means
checking whether AP[2] is clear.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 30 +++
 drivers/iommu/io-pgtable-arm.c  | 90 +
 include/linux/io-pgtable.h  |  4 +
 3 files changed, 124 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 52c6f3e74d6f..9eb209a07acc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2627,6 +2627,35 @@ static int arm_smmu_switch_dirty_log(struct iommu_domain 
*domain, bool enable,
return 0;
 }
 
+static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
+  unsigned long iova, size_t size,
+  unsigned long *bitmap,
+  unsigned long base_iova,
+  unsigned long bitmap_pgshift)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   return -ENODEV;
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (!ops || !ops->sync_dirty_log) {
+   pr_err("io-pgtable don't realize sync dirty log\n");
+   return -ENODEV;
+   }
+
+   /*
+* Flush iotlb to ensure all inflight transactions are completed.
+* See doc IHI0070Da 3.13.4 "HTTU behavior summary".
+*/
+   arm_smmu_flush_iotlb_all(domain);
+   return ops->sync_dirty_log(ops, iova, size, bitmap, base_iova,
+  bitmap_pgshift);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2729,6 +2758,7 @@ static struct iommu_ops arm_smmu_ops = {
.split_block= arm_smmu_split_block,
.merge_page = arm_smmu_merge_page,
.switch_dirty_log   = arm_smmu_switch_dirty_log,
+   .sync_dirty_log = arm_smmu_sync_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 9028328b99b0..67a208a05ab2 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -877,6 +877,95 @@ static size_t arm_lpae_merge_page(struct io_pgtable_ops 
*ops, unsigned long iova
return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
 }
 
+static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,
+unsigned long iova, size_t size,
+int lvl, arm_lpae_iopte *ptep,
+unsigned long *bitmap,
+unsigned long base_iova,
+unsigned long bitmap_pgshift)
+{
+   arm_lpae_iopte pte;
+   struct io_pgtable *iop = &data->iop;
+   size_t base, next_size;
+   unsigned long offset;
+   int nbits, ret;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return -EINVAL;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt)) {
+   if (pte & ARM_LPAE_PTE_AP_RDONLY)
+   return 0;
+
+   /* It is writable, set the bitmap */
+   nbits = size >> bitmap_pgshift;
+   offset = (iova - base_iova) >> bitmap_pgshift;
+   bitmap_set(bitmap, offset, nbits);
+   return 0;
+   } else {
+   /* To traverse next level */
+   next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+   ptep = iopte_deref(pte, data);
+   for (base = 0; base < size; base += next_size) {
+   ret = __arm_lpae_sync_dirty_log(data,
+   iova + base, next_size, lvl + 1,
+   ptep, bitmap, base_iova, 
bitmap_pgshift);
+

[PATCH v3 02/12] iommu: Add iommu_split_block interface

2021-04-13 Thread Keqian Zhu
Block (large page) mapping is not a proper granule for dirty log tracking.
Take an extreme example: if DMA writes one byte, under a 1G mapping the
dirty amount reported is 1G, but under a 4K mapping the dirty amount is
just 4K.

This adds a new interface named iommu_split_block to the IOMMU base layer.
A specific IOMMU driver can invoke it when starting dirty log. If so, the
driver also needs to implement the split_block iommu op.

We flush all iotlbs after the whole procedure is completed to ease the
pressure on the IOMMU, as we will generally handle a huge range of mappings.
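
For orientation, a driver that opts in is expected to use the interface roughly as
in the hedged sketch below when its switch_dirty_log path is asked to enable
tracking (the exact call site and the hardware-enable step are assumptions for
illustration):

static int example_start_dirty_log(struct iommu_domain *domain,
				   unsigned long iova, size_t size)
{
	int ret;

	/* Split block mappings so dirty state is tracked at page granule. */
	ret = iommu_split_block(domain, iova, size);
	if (ret)
		return ret;

	/* ...then enable hardware dirty bit management for the range
	 * (driver specific, e.g. HTTU on SMMUv3); placeholder here.
	 */
	return 0;
}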

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/iommu.c | 41 +
 include/linux/iommu.h | 11 +++
 2 files changed, 52 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 667b2d6d2fc0..bb413a927870 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2721,6 +2721,47 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
 
+int iommu_split_block(struct iommu_domain *domain, unsigned long iova,
+ size_t size)
+{
+   const struct iommu_ops *ops = domain->ops;
+   unsigned int min_pagesz;
+   size_t pgsize;
+   bool flush = false;
+   int ret = 0;
+
+   if (unlikely(!ops || !ops->split_block))
+   return -ENODEV;
+
+   min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+   if (!IS_ALIGNED(iova | size, min_pagesz)) {
+   pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
+  iova, size, min_pagesz);
+   return -EINVAL;
+   }
+
+   while (size) {
+   flush = true;
+
+   pgsize = iommu_pgsize(domain, iova, size);
+
+   ret = ops->split_block(domain, iova, pgsize);
+   if (ret)
+   break;
+
+   pr_debug("split handled: iova 0x%lx size 0x%zx\n", iova, 
pgsize);
+
+   iova += pgsize;
+   size -= pgsize;
+   }
+
+   if (flush)
+   iommu_flush_iotlb_all(domain);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_split_block);
+
 int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,
   unsigned long iova, size_t size, int prot)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7f9ed9f520e2..c6c90ac069e3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -208,6 +208,7 @@ struct iommu_iotlb_gather {
  * @device_group: find iommu group for a particular device
  * @domain_get_attr: Query domain attributes
  * @domain_set_attr: Change domain attributes
+ * @split_block: Split block mapping into page mapping
  * @switch_dirty_log: Perform actions to start|stop dirty log tracking
  * @sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
  * @clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap
@@ -267,6 +268,8 @@ struct iommu_ops {
   enum iommu_attr attr, void *data);
 
/* Track dirty log */
+   int (*split_block)(struct iommu_domain *domain, unsigned long iova,
+  size_t size);
int (*switch_dirty_log)(struct iommu_domain *domain, bool enable,
unsigned long iova, size_t size, int prot);
int (*sync_dirty_log)(struct iommu_domain *domain,
@@ -529,6 +532,8 @@ extern int iommu_domain_get_attr(struct iommu_domain 
*domain, enum iommu_attr,
 void *data);
 extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
 void *data);
+extern int iommu_split_block(struct iommu_domain *domain, unsigned long iova,
+size_t size);
 extern int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,
  unsigned long iova, size_t size, int prot);
 extern int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long 
iova,
@@ -929,6 +934,12 @@ static inline int iommu_domain_set_attr(struct 
iommu_domain *domain,
return -EINVAL;
 }
 
+static inline int iommu_split_block(struct iommu_domain *domain,
+   unsigned long iova, size_t size)
+{
+   return -EINVAL;
+}
+
 static inline int iommu_switch_dirty_log(struct iommu_domain *domain,
 bool enable, unsigned long iova,
 size_t size, int prot)
-- 
2.19.1



[PATCH v3 06/12] iommu/arm-smmu-v3: Add feature detection for BBML

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

When altering a translation table descriptor for certain reasons,
we require the break-before-make procedure. But it might cause problems when
the TTD is live: the I/O streams might not tolerate translation faults.

If the SMMU supports BBM level 1 or BBM level 2, we can change the block
size without using the break-before-make sequence.

This adds feature detection for BBML; no functional change expected.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++
 include/linux/io-pgtable.h  |  8 
 3 files changed, 33 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 369c0ea7a104..443ac19c6da9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2030,6 +2030,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
if (smmu->features & ARM_SMMU_FEAT_HD)
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
 
+   if (smmu->features & ARM_SMMU_FEAT_BBML1)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
+   else if (smmu->features & ARM_SMMU_FEAT_BBML2)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML2;
+
pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
if (!pgtbl_ops)
return -ENOMEM;
@@ -3373,6 +3378,20 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
 
/* IDR3 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+   switch (FIELD_GET(IDR3_BBML, reg)) {
+   case IDR3_BBML0:
+   break;
+   case IDR3_BBML1:
+   smmu->features |= ARM_SMMU_FEAT_BBML1;
+   break;
+   case IDR3_BBML2:
+   smmu->features |= ARM_SMMU_FEAT_BBML2;
+   break;
+   default:
+   dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
+   return -ENXIO;
+   }
+
if (FIELD_GET(IDR3_RIL, reg))
smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 26d6b935b383..a74125675544 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -54,6 +54,10 @@
 #define IDR1_SIDSIZE   GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3  0xc
+#define IDR3_BBML  GENMASK(12, 11)
+#define IDR3_BBML0 0
+#define IDR3_BBML1 1
+#define IDR3_BBML2 2
 #define IDR3_RIL   (1 << 10)
 
 #define ARM_SMMU_IDR5  0x14
@@ -615,6 +619,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_E2H  (1 << 18)
 #define ARM_SMMU_FEAT_HA   (1 << 19)
 #define ARM_SMMU_FEAT_HD   (1 << 20)
+#define ARM_SMMU_FEAT_BBML1(1 << 21)
+#define ARM_SMMU_FEAT_BBML2(1 << 22)
u32 features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 64cee6831c97..9e7163ec9447 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -84,6 +84,12 @@ struct io_pgtable_cfg {
 *  attributes set in the TCR for a non-coherent page-table walker.
 *
 * IO_PGTABLE_QUIRK_ARM_HD: Support hardware management of dirty status.
+*
+* IO_PGTABLE_QUIRK_ARM_BBML1: ARM SMMU supports BBM Level 1 behavior
+*  when changing block size.
+*
+* IO_PGTABLE_QUIRK_ARM_BBML2: ARM SMMU supports BBM Level 2 behavior
+*  when changing block size.
 */
#define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
#define IO_PGTABLE_QUIRK_NO_PERMS   BIT(1)
@@ -92,6 +98,8 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_ARM_TTBR1  BIT(5)
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
#define IO_PGTABLE_QUIRK_ARM_HD BIT(7)
+   #define IO_PGTABLE_QUIRK_ARM_BBML1  BIT(8)
+   #define IO_PGTABLE_QUIRK_ARM_BBML2  BIT(9)
unsigned long   quirks;
unsigned long   pgsize_bitmap;
unsigned intias;
-- 
2.19.1



[PATCH v3 03/12] iommu: Add iommu_merge_page interface

2021-04-13 Thread Keqian Zhu
If block (large page) mappings are split when dirty log tracking starts, then
when it stops we need to recover them for better DMA performance.

This adds a new interface named iommu_merge_page to the IOMMU base layer.
A specific IOMMU driver can invoke it when stopping dirty log tracking; if so,
the driver also needs to implement the merge_page iommu op.

We flush all IOTLBs once after the whole procedure completes to ease the
pressure on the IOMMU, as we will generally handle a huge range of mappings.
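
For illustration, the expected caller looks roughly like the sketch below (the
function name is hypothetical; a failed merge is only reported, since it merely
costs DMA performance):

/* Hypothetical stop-dirty-log path in an IOMMU driver or in VFIO. */
static void example_stop_dirty_log(struct iommu_domain *domain,
                                   unsigned long iova, size_t size, int prot)
{
        /* Try to recover block mappings over the tracked IOVA range. */
        if (iommu_merge_page(domain, iova, size, prot))
                pr_warn("failed to merge page mappings, DMA performance may drop\n");
}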

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/iommu.c | 75 +++
 include/linux/iommu.h | 12 +++
 2 files changed, 87 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index bb413a927870..8f0d71bafb3a 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2762,6 +2762,81 @@ int iommu_split_block(struct iommu_domain *domain, 
unsigned long iova,
 }
 EXPORT_SYMBOL_GPL(iommu_split_block);
 
+static int __iommu_merge_page(struct iommu_domain *domain,
+ unsigned long iova, phys_addr_t paddr,
+ size_t size, int prot)
+{
+   const struct iommu_ops *ops = domain->ops;
+   unsigned int min_pagesz;
+   size_t pgsize;
+   int ret = 0;
+
+   if (unlikely(!ops || !ops->merge_page))
+   return -ENODEV;
+
+   min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+   if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
+   pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 
0x%x\n",
+   iova, , size, min_pagesz);
+   return -EINVAL;
+   }
+
+   while (size) {
+   pgsize = iommu_pgsize(domain, iova | paddr, size);
+
+   ret = ops->merge_page(domain, iova, paddr, pgsize, prot);
+   if (ret)
+   break;
+
+   pr_debug("merge handled: iova 0x%lx pa %pa size 0x%zx\n",
+iova, &paddr, pgsize);
+
+   iova += pgsize;
+   paddr += pgsize;
+   size -= pgsize;
+   }
+
+   return ret;
+}
+
+int iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
+size_t size, int prot)
+{
+   phys_addr_t phys;
+   dma_addr_t p, i;
+   size_t cont_size;
+   bool flush = false;
+   int ret = 0;
+
+   while (size) {
+   flush = true;
+
+   phys = iommu_iova_to_phys(domain, iova);
+   cont_size = PAGE_SIZE;
+   p = phys + cont_size;
+   i = iova + cont_size;
+
+   while (cont_size < size && p == iommu_iova_to_phys(domain, i)) {
+   p += PAGE_SIZE;
+   i += PAGE_SIZE;
+   cont_size += PAGE_SIZE;
+   }
+
+   ret = __iommu_merge_page(domain, iova, phys, cont_size, prot);
+   if (ret)
+   break;
+
+   iova += cont_size;
+   size -= cont_size;
+   }
+
+   if (flush)
+   iommu_flush_iotlb_all(domain);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_merge_page);
+
 int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,
   unsigned long iova, size_t size, int prot)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c6c90ac069e3..fea3ecabff3d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -209,6 +209,7 @@ struct iommu_iotlb_gather {
  * @domain_get_attr: Query domain attributes
  * @domain_set_attr: Change domain attributes
  * @split_block: Split block mapping into page mapping
+ * @merge_page: Merge page mapping into block mapping
  * @switch_dirty_log: Perform actions to start|stop dirty log tracking
  * @sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
  * @clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap
@@ -270,6 +271,8 @@ struct iommu_ops {
/* Track dirty log */
int (*split_block)(struct iommu_domain *domain, unsigned long iova,
   size_t size);
+   int (*merge_page)(struct iommu_domain *domain, unsigned long iova,
+ phys_addr_t phys, size_t size, int prot);
int (*switch_dirty_log)(struct iommu_domain *domain, bool enable,
unsigned long iova, size_t size, int prot);
int (*sync_dirty_log)(struct iommu_domain *domain,
@@ -534,6 +537,8 @@ extern int iommu_domain_set_attr(struct iommu_domain 
*domain, enum iommu_attr,
 void *data);
 extern int iommu_split_block(struct iommu_domain *domain, unsigned long iova,
 size_t size);
+extern int iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
+   size_t size, int prot);
 extern int iommu_switch_d

[PATCH v3 12/12] iommu/arm-smmu-v3: Add HWDBM device feature reporting

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

We have implemented the interfaces required to support iommu
dirty log tracking. The last step is reporting this feature to
upper-layer users, so they can build higher-level policy on top of it.

There is a new dev feature named IOMMU_DEV_FEAT_HWDBM in the iommu
layer. For arm smmuv3, it is equivalent to ARM_SMMU_FEAT_HD and is
enabled by default when supported. Other types of IOMMU can enable
it by default or when the user invokes enable_feature.
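
As a rough illustration (not part of this patch), an upper layer can gate its
dirty tracking policy on this feature through the existing per-device feature
helpers; the wrapper name below is hypothetical:

/* Fall back to a software dirty policy when the IOMMU behind this
 * device cannot track dirty state in hardware. */
static bool example_can_use_hw_dirty_log(struct device *dev)
{
        return iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_HWDBM);
}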

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 59bb1d198631..2d716ee5621f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2712,6 +2712,9 @@ static bool arm_smmu_dev_has_feature(struct device *dev,
switch (feat) {
case IOMMU_DEV_FEAT_SVA:
return arm_smmu_master_sva_supported(master);
+   case IOMMU_DEV_FEAT_HWDBM:
+   /* No requirement for device, require HTTU HD of smmu */
+   return !!(master->smmu->features & ARM_SMMU_FEAT_HD);
default:
return false;
}
@@ -2728,6 +2731,9 @@ static bool arm_smmu_dev_feature_enabled(struct device 
*dev,
switch (feat) {
case IOMMU_DEV_FEAT_SVA:
return arm_smmu_master_sva_enabled(master);
+   case IOMMU_DEV_FEAT_HWDBM:
+   /* HTTU HD is enabled if supported */
+   return arm_smmu_dev_has_feature(dev, feat);
default:
return false;
}
-- 
2.19.1



[PATCH v3 09/12] iommu/arm-smmu-v3: Realize switch_dirty_log iommu ops

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

This realizes switch_dirty_log by invoking iommu_split_block() and
iommu_merge_page(). HTTU HD feature is required.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 38 +
 1 file changed, 38 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4d8495d88be2..52c6f3e74d6f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2590,6 +2590,43 @@ static int arm_smmu_merge_page(struct iommu_domain 
*domain,
return 0;
 }
 
+static int arm_smmu_switch_dirty_log(struct iommu_domain *domain, bool enable,
+unsigned long iova, size_t size, int prot)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   return -ENODEV;
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (enable) {
+   /*
+* For SMMU, the hardware dirty management is always enabled if
+* hardware supports HTTU HD. The action to start dirty log is
+* splitting block mappings.
+*
+* We don't return an error even if the split operation fails, as we
+* can still track dirty state at block granularity, which is still a
+* much better choice compared to the full-dirty policy.
+*/
+   iommu_split_block(domain, iova, size);
+   } else {
+   /*
+* For SMMU, the hardware dirty management is always enabled if
+* hardware supports HTTU HD. The action to stop dirty log is
+* merging page mappings.
+*
+* We don't return an error even if the merge operation fails, as it
+* just affects the performance of DMA transactions.
+*/
+   iommu_merge_page(domain, iova, size, prot);
+   }
+
+   return 0;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2691,6 +2728,7 @@ static struct iommu_ops arm_smmu_ops = {
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
.merge_page = arm_smmu_merge_page,
+   .switch_dirty_log   = arm_smmu_switch_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
-- 
2.19.1



[PATCH v3 05/12] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

As nested mode is not upstreamed yet, we just aim to support dirty
log tracking for stage 1 with io-pgtable mappings (that is, SVA
mappings are not supported). If HTTU is supported, we enable the HA/HD
bits in the SMMU CD, and set the DBM bit for writable TTDs.

The dirty state information is encoded using the access permission
bits AP[2] (stage 1) or S2AP[1] (stage 2) in conjunction with the
DBM (Dirty Bit Modifier) bit, where DBM means writable and AP[2]/
S2AP[1] means dirty.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
 drivers/iommu/io-pgtable-arm.c  | 7 ++-
 include/linux/io-pgtable.h  | 3 +++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b6d965504f44..369c0ea7a104 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1921,6 +1921,7 @@ static int arm_smmu_domain_finalise_s1(struct 
arm_smmu_domain *smmu_domain,
  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
+ CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
cfg->cd.mair= pgtbl_cfg->arm_lpae_s1_cfg.mair;
 
@@ -2026,6 +2027,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
 
if (smmu_domain->non_strict)
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+   if (smmu->features & ARM_SMMU_FEAT_HD)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
 
pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
if (!pgtbl_ops)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..94d790b8ed27 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -72,6 +72,7 @@
 
 #define ARM_LPAE_PTE_NSTABLE   (((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM   (((arm_lpae_iopte)1) << 51)
 #define ARM_LPAE_PTE_AF(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
@@ -81,7 +82,7 @@
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK  (((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)13) << 51)
 #define ARM_LPAE_PTE_ATTR_MASK (ARM_LPAE_PTE_ATTR_LO_MASK |\
 ARM_LPAE_PTE_ATTR_HI_MASK)
 /* Software bit for solving coherency races */
@@ -379,6 +380,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, 
unsigned long iova,
 static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
   int prot)
 {
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
arm_lpae_iopte pte;
 
if (data->iop.fmt == ARM_64_LPAE_S1 ||
@@ -386,6 +388,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
arm_lpae_io_pgtable *data,
pte = ARM_LPAE_PTE_nG;
if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
pte |= ARM_LPAE_PTE_AP_RDONLY;
+   else if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_HD)
+   pte |= ARM_LPAE_PTE_DBM;
+
if (!(prot & IOMMU_PRIV))
pte |= ARM_LPAE_PTE_AP_UNPRIV;
} else {
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index a4c9ca2c31f1..64cee6831c97 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -82,6 +82,8 @@ struct io_pgtable_cfg {
 *
 * IO_PGTABLE_QUIRK_ARM_OUTER_WBWA: Override the outer-cacheability
 *  attributes set in the TCR for a non-coherent page-table walker.
+*
+* IO_PGTABLE_QUIRK_ARM_HD: Support hardware management of dirty status.
 */
#define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
#define IO_PGTABLE_QUIRK_NO_PERMS   BIT(1)
@@ -89,6 +91,7 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_NON_STRICT BIT(4)
#define IO_PGTABLE_QUIRK_ARM_TTBR1  BIT(5)
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
+   #define IO_PGTABLE_QUIRK_ARM_HD BIT(7)
unsigned long   quirks;
unsigned long   pgsize_bitmap;
unsigned intias;
-- 
2.19.1



[PATCH v3 04/12] iommu/arm-smmu-v3: Add support for Hardware Translation Table Update

2021-04-13 Thread Keqian Zhu
From: Jean-Philippe Brucker 

If the SMMU supports it and the kernel was built with HTTU support,
enable hardware update of access and dirty flags. This is essential for
shared page tables, to reduce the number of access faults on the fault
queue. Normal DMA with io-pgtables doesn't currently use the access or
dirty flags.

We can enable HTTU even if CPUs don't support it, because the kernel
always checks for HW dirty bit and updates the PTE flags atomically.

Signed-off-by: Jean-Philippe Brucker 
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  2 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 41 ++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  8 
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index bb251cab61f3..ae075e675892 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -121,10 +121,12 @@ static struct arm_smmu_ctx_desc 
*arm_smmu_alloc_shared_cd(struct mm_struct *mm)
if (err)
goto out_free_asid;
 
+   /* HA and HD will be filtered out later if not supported by the SMMU */
tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - vabits_actual) |
  FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) |
  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) |
  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) |
+ CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
 
switch (PAGE_SIZE) {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8594b4a83043..b6d965504f44 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1012,10 +1012,17 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
*smmu_domain, int ssid,
 * this substream's traffic
 */
} else { /* (1) and (2) */
+   u64 tcr = cd->tcr;
+
cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
cdptr[2] = 0;
cdptr[3] = cpu_to_le64(cd->mair);
 
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   tcr &= ~CTXDESC_CD_0_TCR_HD;
+   if (!(smmu->features & ARM_SMMU_FEAT_HA))
+   tcr &= ~CTXDESC_CD_0_TCR_HA;
+
/*
 * STE is live, and the SMMU might read dwords of this CD in any
 * order. Ensure that it observes valid values before reading
@@ -1023,7 +1030,7 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
*smmu_domain, int ssid,
 */
arm_smmu_sync_cd(smmu_domain, ssid, true);
 
-   val = cd->tcr |
+   val = tcr |
 #ifdef __BIG_ENDIAN
CTXDESC_CD_0_ENDI |
 #endif
@@ -3196,6 +3203,28 @@ static int arm_smmu_device_reset(struct arm_smmu_device 
*smmu, bool bypass)
return 0;
 }
 
+static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
+{
+   u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | ARM_SMMU_FEAT_HD);
+   u32 features = 0;
+
+   switch (FIELD_GET(IDR0_HTTU, reg)) {
+   case IDR0_HTTU_ACCESS_DIRTY:
+   features |= ARM_SMMU_FEAT_HD;
+   fallthrough;
+   case IDR0_HTTU_ACCESS:
+   features |= ARM_SMMU_FEAT_HA;
+   }
+
+   if (smmu->dev->of_node)
+   smmu->features |= features;
+   else if (features != fw_features)
+   /* ACPI IORT sets the HTTU bits */
+   dev_warn(smmu->dev,
+"IDR0.HTTU overridden by FW configuration (0x%x)\n",
+fw_features);
+}
+
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
u32 reg;
@@ -3256,6 +3285,8 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
smmu->features |= ARM_SMMU_FEAT_E2H;
}
 
+   arm_smmu_get_httu(smmu, reg);
+
/*
 * The coherency feature as set by FW is used in preference to the ID
 * register, but warn on mismatch.
@@ -3441,6 +3472,14 @@ static int arm_smmu_device_acpi_probe(struct 
platform_device *pdev,
if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
smmu->features |= ARM_SMMU_FEAT_COHERENCY;
 
+   switch (FIELD_GET(ACPI_IORT_SMMU_V3_HTTU_OVERRIDE, iort_smmu->flags)) {
+   case IDR0_HTTU_ACCESS_DIRTY:
+   smmu->features |= ARM_SMMU_FEAT_HD;
+   fallthrough;
+   case IDR0_HTTU_ACCESS:
+   smmu->features |= ARM_SMMU_FEAT_HA;
+   }
+
return 0;
 }
 #else
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 

[PATCH v3 07/12] iommu/arm-smmu-v3: Realize split_block iommu ops

2021-04-13 Thread Keqian Zhu
From: Kunkun Jiang 

This splits a block descriptor into a span of page descriptors. The BBML1
or BBML2 feature is required.

Splitting a block is not designed to run concurrently with other pgtable ops;
the only intended user is vfio, which always holds a lock, so race conditions
are not considered in the pgtable ops.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  27 +
 drivers/iommu/io-pgtable-arm.c  | 122 
 include/linux/io-pgtable.h  |   2 +
 3 files changed, 151 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 443ac19c6da9..cfa83fa03c89 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2537,6 +2537,32 @@ static int arm_smmu_domain_set_attr(struct iommu_domain 
*domain,
return ret;
 }
 
+static int arm_smmu_split_block(struct iommu_domain *domain,
+   unsigned long iova, size_t size)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   size_t handled_size;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2, can't split block\n");
+   return -ENODEV;
+   }
+   if (!ops || !ops->split_block) {
+   pr_err("io-pgtable don't realize split block\n");
+   return -ENODEV;
+   }
+
+   handled_size = ops->split_block(ops, iova, size);
+   if (handled_size != size) {
+   pr_err("split block failed\n");
+   return -EFAULT;
+   }
+
+   return 0;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2636,6 +2662,7 @@ static struct iommu_ops arm_smmu_ops = {
.device_group   = arm_smmu_device_group,
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
+   .split_block= arm_smmu_split_block,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 94d790b8ed27..4c4eec3c0698 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -79,6 +79,8 @@
 #define ARM_LPAE_PTE_SH_IS (((arm_lpae_iopte)3) << 8)
 #define ARM_LPAE_PTE_NS(((arm_lpae_iopte)1) << 5)
 #define ARM_LPAE_PTE_VALID (((arm_lpae_iopte)1) << 0)
+/* Block descriptor bits */
+#define ARM_LPAE_PTE_NT(((arm_lpae_iopte)1) << 16)
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK  (((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
@@ -679,6 +681,125 @@ static phys_addr_t arm_lpae_iova_to_phys(struct 
io_pgtable_ops *ops,
return iopte_to_paddr(pte, data) | iova;
 }
 
+static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,
+unsigned long iova, size_t size, int lvl,
+arm_lpae_iopte *ptep);
+
+static size_t arm_lpae_do_split_blk(struct arm_lpae_io_pgtable *data,
+   unsigned long iova, size_t size,
+   arm_lpae_iopte blk_pte, int lvl,
+   arm_lpae_iopte *ptep)
+{
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
+   arm_lpae_iopte pte, *tablep;
+   phys_addr_t blk_paddr;
+   size_t tablesz = ARM_LPAE_GRANULE(data);
+   size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+   int i;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return 0;
+
+   tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
+   if (!tablep)
+   return 0;
+
+   blk_paddr = iopte_to_paddr(blk_pte, data);
+   pte = iopte_prot(blk_pte);
+   for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz)
+   __arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
+
+   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_BBML1) {
+   /* Race does not exist */
+   blk_pte |= ARM_LPAE_PTE_NT;
+   __arm_lpae_set_pte(ptep, blk_pte, cfg);
+   io_pgtable_tlb_flush_walk(&data->iop, iova, size, size);
+   }
+   /* Race does not exist */
+   pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
+
+   /* Have we split it down to page granularity? */
+   if (lvl == (ARM_LPAE_MAX_LEVELS - 1))
+  

[PATCH v3 01/12] iommu: Introduce dirty log tracking framework

2021-04-13 Thread Keqian Zhu
Some types of IOMMU are capable of tracking DMA dirty log, such as
ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the
dirty log tracking framework in the IOMMU base layer.

Three new essential interfaces are added, and we maintain the status
of dirty log tracking in iommu_domain:
1. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking
2. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
3. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap

A new dev feature is added to indicate whether a specific type of
iommu hardware supports dirty log tracking and whether its driver
implements these interfaces.
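
As a rough illustration (not part of this patch), a caller such as VFIO is
expected to drive the three interfaces in the order sketched below. The wrapper
function is hypothetical, error handling is omitted, and iommu_clear_dirty_log()
is assumed to take the same bitmap arguments as the sync variant:

static void example_track_dirty_log(struct iommu_domain *domain,
                                    unsigned long iova, size_t size, int prot,
                                    unsigned long *bitmap,
                                    unsigned long base_iova,
                                    unsigned long bitmap_pgshift)
{
        /* Start dirty log tracking over [iova, iova + size). */
        iommu_switch_dirty_log(domain, true, iova, size, prot);

        /* Periodically: collect dirty bits, then clear the hardware state. */
        iommu_sync_dirty_log(domain, iova, size, bitmap,
                             base_iova, bitmap_pgshift);
        iommu_clear_dirty_log(domain, iova, size, bitmap,
                              base_iova, bitmap_pgshift);

        /* Stop dirty log tracking. */
        iommu_switch_dirty_log(domain, false, iova, size, prot);
}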

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
 drivers/iommu/iommu.c | 150 ++
 include/linux/iommu.h |  53 +++
 2 files changed, 203 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..667b2d6d2fc0 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1922,6 +1922,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct 
bus_type *bus,
domain->type = type;
/* Assume all sizes by default; the driver may override this later */
domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
+   mutex_init(&domain->switch_log_lock);
 
return domain;
 }
@@ -2720,6 +2721,155 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
 
+int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,
+  unsigned long iova, size_t size, int prot)
+{
+   const struct iommu_ops *ops = domain->ops;
+   int ret;
+
+   if (unlikely(!ops || !ops->switch_dirty_log))
+   return -ENODEV;
+
+   mutex_lock(&domain->switch_log_lock);
+   if (enable && domain->dirty_log_tracking) {
+   ret = -EBUSY;
+   goto out;
+   } else if (!enable && !domain->dirty_log_tracking) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = ops->switch_dirty_log(domain, enable, iova, size, prot);
+   if (ret)
+   goto out;
+
+   domain->dirty_log_tracking = enable;
+out:
+   mutex_unlock(&domain->switch_log_lock);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_switch_dirty_log);
+
+int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
+size_t size, unsigned long *bitmap,
+unsigned long base_iova, unsigned long bitmap_pgshift)
+{
+   const struct iommu_ops *ops = domain->ops;
+   unsigned int min_pagesz;
+   size_t pgsize;
+   int ret = 0;
+
+   if (unlikely(!ops || !ops->sync_dirty_log))
+   return -ENODEV;
+
+   min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+   if (!IS_ALIGNED(iova | size, min_pagesz)) {
+   pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
+  iova, size, min_pagesz);
+   return -EINVAL;
+   }
+
+   mutex_lock(&domain->switch_log_lock);
+   if (!domain->dirty_log_tracking) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   while (size) {
+   pgsize = iommu_pgsize(domain, iova, size);
+
+   ret = ops->sync_dirty_log(domain, iova, pgsize,
+ bitmap, base_iova, bitmap_pgshift);
+   if (ret)
+   break;
+
+   pr_debug("dirty_log_sync handle: iova 0x%lx pagesz 0x%zx\n",
+iova, pgsize);
+
+   iova += pgsize;
+   size -= pgsize;
+   }
+out:
+   mutex_unlock(&domain->switch_log_lock);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
+
+static int __iommu_clear_dirty_log(struct iommu_domain *domain,
+  unsigned long iova, size_t size,
+  unsigned long *bitmap,
+  unsigned long base_iova,
+  unsigned long bitmap_pgshift)
+{
+   const struct iommu_ops *ops = domain->ops;
+   size_t pgsize;
+   int ret = 0;
+
+   if (unlikely(!ops || !ops->clear_dirty_log))
+   return -ENODEV;
+
+   while (size) {
+   pgsize = iommu_pgsize(domain, iova, size);
+
+   ret = ops->clear_dirty_log(domain, iova, pgsize, bitmap,
+  base_iova, bitmap_pgshift);
+   if (ret)
+   break;
+
+   pr_debug("dirty_log_clear handled: iova 0x%lx pagesz 0x%zx\n",
+iova, pgsize);
+
+   iova += pgsize;
+   size -= pgsize;
+   }
+
+   return ret;
+}
+
+int iommu_clear_dirty_log(struct iommu_domain *domain,
+ 

[PATCH v3 00/12] iommu/smmuv3: Implement hardware dirty log tracking

2021-04-13 Thread Keqian Zhu


This patch series is split from the series[1] that contains both the IOMMU part
and the VFIO part. The new VFIO part will be sent out in another series.

[1] 
https://lore.kernel.org/linux-iommu/20210310090614.26668-1-zhukeqi...@huawei.com/

changelog:

v3:
 - Merge start_dirty_log and stop_dirty_log into switch_dirty_log. (Yi Sun)
 - Maintain the dirty log status in iommu_domain.
 - Update commit message to make patch easier to review.

v2:
 - Address all comments of RFC version, thanks for all of you ;-)
 - Add a bugfix that start dirty log for newly added dma ranges and domain.



Hi everyone,

This patch series introduces a framework for IOMMU dirty log tracking, and the
smmuv3 driver implements this framework. This new feature can be used by VFIO
DMA dirty tracking.

Intention:

Some types of IOMMU are capable of tracking DMA dirty log, such as
ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the
dirty log tracking framework in the IOMMU base layer.

Three new essential interfaces are added, and we maintain the status
of dirty log tracking in iommu_domain:
1. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking
2. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
3. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap

About SMMU HTTU:

HTTU (Hardware Translation Table Update) is a feature of ARM SMMUv3: the hardware
can update the access flag and/or the dirty state of a TTD (Translation Table
Descriptor). With HTTU, a stage 1 TTD is classified into 3 types:

                      DBM bit   AP[2] (read-only bit)
   1. writable_clean     1              1
   2. writable_dirty     1              0
   3. readonly           0              1

If HTTU_HD (hardware management of the dirty state) is enabled, the SMMU can
change a TTD from writable_clean to writable_dirty when it writes through that
TTD. Software can then scan the TTDs to sync the dirty state into a dirty bitmap.
With this feature, we can track the dirty log of DMA continuously and precisely.
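
For stage 1 this scan boils down to a check like the sketch below (illustrative
only; the macro and function names are hypothetical, the bit positions follow the
ARM_LPAE_PTE_DBM definition added in this series and the existing AP[2]/read-only
bit of the LPAE descriptor format):

/* A leaf TTD is dirty when DBM (bit 51) is set and AP[2], the read-only
 * bit (bit 7), has been cleared by the SMMU on a write. */
#define EXAMPLE_TTD_DBM         (((u64)1) << 51)
#define EXAMPLE_TTD_AP_RDONLY   (((u64)1) << 7)

static inline bool example_ttd_is_dirty(u64 ttd)
{
        return (ttd & EXAMPLE_TTD_DBM) && !(ttd & EXAMPLE_TTD_AP_RDONLY);
}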

About this series:

Patch 1-3:  Introduce the dirty log tracking framework in the IOMMU base layer,
    and two common interfaces that can be used by many types of IOMMU.

Patch 4-6:  Add feature detection for SMMU HTTU and enable HTTU for SMMU stage 1
    mapping. Also add feature detection for SMMU BBML. We need to split
    block mappings when dirty log tracking starts and merge page mappings
    when it stops, which requires a break-before-make procedure. But that
    might cause problems while the TTD is live: the I/O streams might not
    tolerate translation faults. So BBML should be used.

Patch 7-12: We implement these interfaces for arm smmuv3.

Thanks,
Keqian

Jean-Philippe Brucker (1):
  iommu/arm-smmu-v3: Add support for Hardware Translation Table Update

Keqian Zhu (3):
  iommu: Introduce dirty log tracking framework
  iommu: Add iommu_split_block interface
  iommu: Add iommu_merge_page interface

Kunkun Jiang (8):
  iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  iommu/arm-smmu-v3: Add feature detection for BBML
  iommu/arm-smmu-v3: Realize split_block iommu ops
  iommu/arm-smmu-v3: Realize merge_page iommu ops
  iommu/arm-smmu-v3: Realize switch_dirty_log iommu ops
  iommu/arm-smmu-v3: Realize sync_dirty_log iommu ops
  iommu/arm-smmu-v3: Realize clear_dirty_log iommu ops
  iommu/arm-smmu-v3: Add HWDBM device feature reporting

 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |   2 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 217 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  14 +
 drivers/iommu/io-pgtable-arm.c| 392 +-
 drivers/iommu/iommu.c | 266 
 include/linux/io-pgtable.h|  23 +
 include/linux/iommu.h |  76 
 7 files changed, 988 insertions(+), 2 deletions(-)

-- 
2.19.1



Re: [PATCH] KVM: arm/arm64: Fix KVM_VGIC_V3_ADDR_TYPE_REDIST read

2021-04-13 Thread Keqian Zhu


On 2021/4/12 23:00, Eric Auger wrote:
> When reading the base address of a REDIST region
> through KVM_VGIC_V3_ADDR_TYPE_REDIST we expect the
> redistributor region list to be populated with a single
> element.
> 
> However list_first_entry() expects the list to be non empty.
Indeed, list_first_entry() always returns a non-null ptr. If the list
is empty, it will mistake the list head for the first element.

> Instead we should use list_first_entry_or_null which effectively
> returns NULL if the list is empty.
> 
> Fixes: dbd9733ab674 ("KVM: arm/arm64: Replace the single rdist region by a 
> list")
> Cc:  # v4.18+
> Signed-off-by: Eric Auger 
> Reported-by: Gavin Shan 
> ---
>  arch/arm64/kvm/vgic/vgic-kvm-device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c 
> b/arch/arm64/kvm/vgic/vgic-kvm-device.c
> index 44419679f91a..5eaede3e3b5a 100644
> --- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
> +++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
> @@ -87,8 +87,8 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>   r = vgic_v3_set_redist_base(kvm, 0, *addr, 0);
>   goto out;
>   }
> - rdreg = list_first_entry(&vgic->rd_regions,
> -  struct vgic_redist_region, list);
> + rdreg = list_first_entry_or_null(&vgic->rd_regions,
> +  struct vgic_redist_region, list);
>   if (!rdreg)
>   addr_ptr = &undef_value;
>   else
> 


[PATCH] vfio/iommu_type1: Remove unused pinned_page_dirty_scope in vfio_iommu

2021-04-11 Thread Keqian Zhu
pinned_page_dirty_scope is optimized out by commit 010321565a7d
("vfio/iommu_type1: Mantain a counter for non_pinned_groups"),
but it appeared again due to an issue when merging branches.
We can safely remove it here.

Signed-off-by: Keqian Zhu 
---

However, I'm not clear about the root problem. Is there a bug in git?

---
 drivers/vfio/vfio_iommu_type1.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 45cbfd4879a5..4d1f10a33d74 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -77,7 +77,6 @@ struct vfio_iommu {
boolv2;
boolnesting;
booldirty_page_tracking;
-   boolpinned_page_dirty_scope;
boolcontainer_open;
 };
 
-- 
2.19.1



Re: [RFC PATCH] KVM: x86: Support write protect huge pages lazily

2021-04-08 Thread Keqian Zhu
Hi Sean,

On 2021/4/8 23:52, Sean Christopherson wrote:
> On Thu, Apr 08, 2021, Keqian Zhu wrote:
>> Hi Ben,
>>
>> Do you have any similar idea that can share with us?
> 
> Doh, Ben is out this week, he'll be back Monday.  Sorry for gumming up the 
> works :-/
Please don't mind. I'm glad we can have some intersection of idea.


Re: [RFC PATCH] KVM: x86: Support write protect huge pages lazily

2021-04-08 Thread Keqian Zhu
Hi Ben,

Do you have any similar idea that can share with us?


Thanks
Keqian

On 2021/4/7 7:42, Sean Christopherson wrote:
> +Ben
> 
> On Tue, Apr 06, 2021, Keqian Zhu wrote:
>> Hi Paolo,
>>
>> I plan to rework this patch and do full test. What do you think about this 
>> idea
>> (enable dirty logging for huge pages lazily)?
> 
> Ben, don't you also have something similar (or maybe the exact opposite?) in 
> the
> hopper?  This sounds very familiar, but I can't quite connect the dots that 
> are
> floating around my head...
>  
>> PS: As dirty log of TDP MMU has been supported, I should add more code.
>>
>> On 2020/8/28 16:11, Keqian Zhu wrote:
>>> Currently during enable dirty logging, if we're with init-all-set,
>>> we just write protect huge pages and leave normal pages untouched,
>>> for that we can enable dirty logging for these pages lazily.
>>>
>>> It seems that enable dirty logging lazily for huge pages is feasible
>>> too, which not only reduces the time of start dirty logging, also
>>> greatly reduces side-effect on guest when there is high dirty rate.
>>>
>>> (These codes are not tested, for RFC purpose :-) ).
>>>
>>> Signed-off-by: Keqian Zhu 
>>> ---
>>>  arch/x86/include/asm/kvm_host.h |  3 +-
>>>  arch/x86/kvm/mmu/mmu.c  | 65 ++---
>>>  arch/x86/kvm/vmx/vmx.c  |  3 +-
>>>  arch/x86/kvm/x86.c  | 22 +--
>>>  4 files changed, 62 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h 
>>> b/arch/x86/include/asm/kvm_host.h
>>> index 5303dbc5c9bc..201a068cf43d 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -1296,8 +1296,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
>>> accessed_mask,
>>>  
>>>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>>>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>> - struct kvm_memory_slot *memslot,
>>> - int start_level);
>>> + struct kvm_memory_slot *memslot);
>>>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>>const struct kvm_memory_slot *memslot);
>>>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 43fdb0c12a5d..4b7d577de6cd 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -1625,14 +1625,45 @@ static bool __rmap_set_dirty(struct kvm *kvm, 
>>> struct kvm_rmap_head *rmap_head)
>>>  }
>>>  
>>>  /**
>>> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
>>> + * kvm_mmu_write_protect_largepage_masked - write protect selected 
>>> largepages
>>>   * @kvm: kvm instance
>>>   * @slot: slot to protect
>>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
>>>   * @mask: indicates which pages we should protect
>>>   *
>>> - * Used when we do not need to care about huge page mappings: e.g. during 
>>> dirty
>>> - * logging we do not have any such mappings.
>>> + * @ret: true if all pages are write protected
>>> + */
>>> +static bool kvm_mmu_write_protect_largepage_masked(struct kvm *kvm,
>>> +   struct kvm_memory_slot *slot,
>>> +   gfn_t gfn_offset, unsigned long mask)
>>> +{
>>> +   struct kvm_rmap_head *rmap_head;
>>> +   bool protected, all_protected;
>>> +   gfn_t start_gfn = slot->base_gfn + gfn_offset;
>>> +   int i;
>>> +
>>> +   all_protected = true;
>>> +   while (mask) {
>>> +   protected = false;
>>> +   for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
>>> +   rmap_head = __gfn_to_rmap(start_gfn + __ffs(mask), i, 
>>> slot);
>>> +   protected |= __rmap_write_protect(kvm, rmap_head, false);
>>> +   }
>>> +
>>> +   all_protected &= protected;
>>> +   /* clear the first set bit */
>>> +   mask &= mask - 1;
>>> +   }
>>> +
>>> +   return all_protected;
>>> +}
>>> +
>>> +/**
>>> + * kvm_mmu_write_protect_pt_masked - write protect selected

Re: [RFC PATCH v2 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-08 Thread Keqian Zhu
Hi Marc,

On 2021/4/7 21:18, Marc Zyngier wrote:
> On Tue, 16 Mar 2021 13:43:38 +,
> Keqian Zhu  wrote:
>>
>> The MMIO region of a device maybe huge (GB level), try to use
>> block mapping in stage2 to speedup both map and unmap.
>>
>> Compared to normal memory mapping, we should consider two more
>> points when try block mapping for MMIO region:
>>
>> 1. For normal memory mapping, the PA(host physical address) and
>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>> the HVA to request hugepage, so we don't need to consider PA
>> alignment when verifing block mapping. But for device memory
>> mapping, the PA and HVA may have different alignment.
>>
>> 2. For normal memory mapping, we are sure hugepage size properly
>> fit into vma, so we don't check whether the mapping size exceeds
>> the boundary of vma. But for device memory mapping, we should pay
>> attention to this.
>>
>> This adds device_rough_page_shift() to check these two points when
>> selecting block mapping size.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>
>> Mainly for RFC, not fully tested. I will fully test it when the
>> code logic is well accepted.
>>
>> ---
>>  arch/arm64/kvm/mmu.c | 42 ++
>>  1 file changed, 38 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c59af5ca01b0..224aa15eb4d9 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -624,6 +624,36 @@ static void kvm_send_hwpoison_signal(unsigned long 
>> address, short lsb)
>>  send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>  }
>>  
>> +/*
>> + * Find a mapping size that properly insides the intersection of vma and
>> + * memslot. And hva and pa have the same alignment to this mapping size.
>> + * It's rough because there are still other restrictions, which will be
>> + * checked by the following fault_supports_stage2_huge_mapping().
> 
> I don't think these restrictions make complete sense to me. If this is
> a PFNMAP VMA, we should use the biggest mapping size that covers the
> VMA, and not more than the VMA.
But as described by kvm_arch_prepare_memory_region(), the memslot may not fully
cover the VMA. If that's true and we just consider the boundary of the VMA, our
block mapping may extend beyond the boundary of the memslot. Is this a problem?

> 
>> + */
>> +static short device_rough_page_shift(struct kvm_memory_slot *memslot,
>> + struct vm_area_struct *vma,
>> + unsigned long hva)
>> +{
>> +size_t size = memslot->npages * PAGE_SIZE;
>> +hva_t sec_start = max(memslot->userspace_addr, vma->vm_start);
>> +hva_t sec_end = min(memslot->userspace_addr + size, vma->vm_end);
>> +phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>> +
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PUD_SIZE) >= sec_start &&
>> +ALIGN(hva, PUD_SIZE) <= sec_end)
>> +return PUD_SHIFT;
>> +#endif
>> +
>> +if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>> +ALIGN_DOWN(hva, PMD_SIZE) >= sec_start &&
>> +ALIGN(hva, PMD_SIZE) <= sec_end)
>> +return PMD_SHIFT;
>> +
>> +return PAGE_SHIFT;
>> +}
>> +
>>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot 
>> *memslot,
>> unsigned long hva,
>> unsigned long map_size)
>> @@ -769,7 +799,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>  
>> -/* Let's check if we will get back a huge page backed by hugetlbfs */
>> +/*
>> + * Let's check if we will get back a huge page backed by hugetlbfs, or
>> + * get block mapping for device MMIO region.
>> + */
>>  mmap_read_lock(current->mm);
>>  vma = find_vma_intersection(current->mm, hva, hva + 1);
>>  if (unlikely(!vma)) {
>> @@ -780,11 +813,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  
>>  if (is_vm_hugetlb_page(vma))
>>  vma_shift = huge_page_shift(hstate_vma(vma));
>> +else if (vma->vm_flags & VM

Re: [PATCH] KVM: MMU: protect TDP MMU pages only down to required level

2021-04-06 Thread Keqian Zhu



On 2021/4/7 7:38, Sean Christopherson wrote:
> On Tue, Apr 06, 2021, Keqian Zhu wrote:
>> Hi Paolo,
>>
>> I'm just going to fix this issue, and found that you have done this ;-)
> 
> Ha, and meanwhile I'm having a serious case of deja vu[1].  It even received a
> variant of the magic "Queued, thanks"[2].  Doesn't appear in either of the 
> 5.12
> pull requests though, must have gotten lost along the way.
Good job. We should pick them up :)

> 
> [1] https://lkml.kernel.org/r/20210213005015.1651772-3-sea...@google.com
> [2] https://lkml.kernel.org/r/b5ab72f2-970f-64bd-891c-48f1c3035...@redhat.com
> 
>> Please feel free to add:
>>
>> Reviewed-by: Keqian Zhu 
>>
>> Thanks,
>> Keqian
>>
>> On 2021/4/2 20:17, Paolo Bonzini wrote:
>>> When using manual protection of dirty pages, it is not necessary
>>> to protect nested page tables down to the 4K level; instead KVM
>>> can protect only hugepages in order to split them lazily, and
>>> delay write protection at 4K-granularity until KVM_CLEAR_DIRTY_LOG.
>>> This was overlooked in the TDP MMU, so do it there as well.
>>>
>>> Fixes: a6a0b05da9f37 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
>>> Cc: Ben Gardon 
>>> Signed-off-by: Paolo Bonzini 
>>> ---
>>>  arch/x86/kvm/mmu/mmu.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index efb41f31e80a..0d92a269c5fa 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -5538,7 +5538,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>> flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
>>> start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
>>> if (is_tdp_mmu_enabled(kvm))
>>> -   flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_4K);
>>> +   flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
>   write_unlock(&kvm->mmu_lock);
>>>  
>>> /*
>>>
> .
> 


Re: [RFC PATCH] KVM: x86: Support write protect huge pages lazily

2021-04-06 Thread Keqian Zhu
Hi Paolo,

I plan to rework this patch and do full test. What do you think about this idea
(enable dirty logging for huge pages lazily)?

Best Regards,
Keqian

PS: As dirty log of TDP MMU has been supported, I should add more code.

On 2020/8/28 16:11, Keqian Zhu wrote:
> Currently during enable dirty logging, if we're with init-all-set,
> we just write protect huge pages and leave normal pages untouched,
> for that we can enable dirty logging for these pages lazily.
> 
> It seems that enable dirty logging lazily for huge pages is feasible
> too, which not only reduces the time of start dirty logging, also
> greatly reduces side-effect on guest when there is high dirty rate.
> 
> (These codes are not tested, for RFC purpose :-) ).
> 
> Signed-off-by: Keqian Zhu 
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +-
>  arch/x86/kvm/mmu/mmu.c  | 65 ++---
>  arch/x86/kvm/vmx/vmx.c  |  3 +-
>  arch/x86/kvm/x86.c  | 22 +--
>  4 files changed, 62 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5303dbc5c9bc..201a068cf43d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1296,8 +1296,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
> accessed_mask,
>  
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> -   struct kvm_memory_slot *memslot,
> -   int start_level);
> +   struct kvm_memory_slot *memslot);
>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  const struct kvm_memory_slot *memslot);
>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 43fdb0c12a5d..4b7d577de6cd 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1625,14 +1625,45 @@ static bool __rmap_set_dirty(struct kvm *kvm, struct 
> kvm_rmap_head *rmap_head)
>  }
>  
>  /**
> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> + * kvm_mmu_write_protect_largepage_masked - write protect selected largepages
>   * @kvm: kvm instance
>   * @slot: slot to protect
>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
>   * @mask: indicates which pages we should protect
>   *
> - * Used when we do not need to care about huge page mappings: e.g. during 
> dirty
> - * logging we do not have any such mappings.
> + * @ret: true if all pages are write protected
> + */
> +static bool kvm_mmu_write_protect_largepage_masked(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn_offset, unsigned long mask)
> +{
> + struct kvm_rmap_head *rmap_head;
> + bool protected, all_protected;
> + gfn_t start_gfn = slot->base_gfn + gfn_offset;
> + int i;
> +
> + all_protected = true;
> + while (mask) {
> + protected = false;
> + for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> + rmap_head = __gfn_to_rmap(start_gfn + __ffs(mask), i, 
> slot);
> + protected |= __rmap_write_protect(kvm, rmap_head, false);
> + }
> +
> + all_protected &= protected;
> + /* clear the first set bit */
> + mask &= mask - 1;
> + }
> +
> + return all_protected;
> +}
> +
> +/**
> + * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> + * @kvm: kvm instance
> + * @slot: slot to protect
> + * @gfn_offset: start of the BITS_PER_LONG pages we care about
> + * @mask: indicates which pages we should protect
>   */
>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>struct kvm_memory_slot *slot,
> @@ -1679,18 +1710,25 @@ EXPORT_SYMBOL_GPL(kvm_mmu_clear_dirty_pt_masked);
>  
>  /**
>   * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for 
> selected
> - * PT level pages.
> - *
> - * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages 
> to
> - * enable dirty logging for them.
> - *
> - * Used when we do not need to care about huge page mappings: e.g. during 
> dirty
> - * logging we do not have any such mappings.
> + * dirty pages.
>   */
>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>   struct kvm_memory_slot *slot,
>   gfn_t gfn_offset, 

[PATCH] KVM: x86: Remove unused function declaration

2021-04-06 Thread Keqian Zhu
kvm_mmu_slot_largepage_remove_write_access() is decared but not used,
just remove it.

Signed-off-by: Keqian Zhu 
---
 arch/x86/include/asm/kvm_host.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3768819693e5..9c0af0971c9f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1440,8 +1440,6 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
   const struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
   struct kvm_memory_slot *memslot);
-void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
-   struct kvm_memory_slot *memslot);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
-- 
2.23.0



Re: [PATCH] KVM: MMU: protect TDP MMU pages only down to required level

2021-04-06 Thread Keqian Zhu
Hi Paolo,

I'm just going to fix this issue, and found that you have done this ;-)
Please feel free to add:

Reviewed-by: Keqian Zhu 

Thanks,
Keqian

On 2021/4/2 20:17, Paolo Bonzini wrote:
> When using manual protection of dirty pages, it is not necessary
> to protect nested page tables down to the 4K level; instead KVM
> can protect only hugepages in order to split them lazily, and
> delay write protection at 4K-granularity until KVM_CLEAR_DIRTY_LOG.
> This was overlooked in the TDP MMU, so do it there as well.
> 
> Fixes: a6a0b05da9f37 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
> Cc: Ben Gardon 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/mmu/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index efb41f31e80a..0d92a269c5fa 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5538,7 +5538,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>   flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
>   start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
>   if (is_tdp_mmu_enabled(kvm))
> - flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_4K);
> + flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
>   write_unlock(&kvm->mmu_lock);
>  
>   /*
> 


Re: [RFC PATCH v2 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-03-31 Thread Keqian Zhu
Kind ping...

On 2021/3/16 21:43, Keqian Zhu wrote:
> Hi all,
> 
> We have two pathes to build stage2 mapping for MMIO regions.
> 
> Create time's path and stage2 fault path.
> 
> Patch#1 removes the creation time's mapping of MMIO regions
> Patch#2 tries stage2 block mapping for host device MMIO at fault path
> 
> Thanks,
> Keqian
> 
> Keqian Zhu (2):
>   kvm/arm64: Remove the creation time's mapping of MMIO regions
>   kvm/arm64: Try stage2 block mapping for host device MMIO
> 
>  arch/arm64/kvm/mmu.c | 80 +++-
>  1 file changed, 41 insertions(+), 39 deletions(-)
> 


Re: [PATCH v14 05/13] iommu/smmuv3: Implement attach/detach_pasid_table

2021-03-22 Thread Keqian Zhu
Hi Eric,

On 2021/3/19 21:15, Auger Eric wrote:
> Hi Keqian,
> 
> On 3/2/21 9:35 AM, Keqian Zhu wrote:
>> Hi Eric,
>>
>> On 2021/2/24 4:56, Eric Auger wrote:
>>> On attach_pasid_table() we program STE S1 related info set
>>> by the guest into the actual physical STEs. At minimum
>>> we need to program the context descriptor GPA and compute
>>> whether the stage1 is translated/bypassed or aborted.
>>>
>>> On detach, the stage 1 config is unset and the abort flag is
>>> unset.
>>>
>>> Signed-off-by: Eric Auger 
>>>
>> [...]
>>
>>> +
>>> +   /*
>>> +* we currently support a single CD so s1fmt and s1dss
>>> +* fields are also ignored
>>> +*/
>>> +   if (cfg->pasid_bits)
>>> +   goto out;
>>> +
>>> +   smmu_domain->s1_cfg.cdcfg.cdtab_dma = cfg->base_ptr;
>> only the "cdtab_dma" field of "cdcfg" is set, we are not able to locate a 
>> specific cd using arm_smmu_get_cd_ptr().
>>
>> Maybe we'd better use a specialized function to fill other fields of "cdcfg" 
>> or add a sanity check in arm_smmu_get_cd_ptr()
>> to prevent calling it under nested mode?
>>
>> As now we just call arm_smmu_get_cd_ptr() during finalise_s1(), no problem 
>> found. Just a suggestion ;-)
> 
> forgive me for the delay. yes I can indeed make sure that code is not
> called in nested mode. Please could you detail why you would need to
> call arm_smmu_get_cd_ptr()?
I accidentally called this function in nested mode when verifying the SMMU MPAM
feature. :)

Yes, in nested mode the context descriptor is owned by the guest, so the
hypervisor does not need to care about its content.
Maybe we'd better add an explicit comment to arm_smmu_get_cd_ptr() so that
callers pay attention to this? :)

Thanks,
Keqian

> 
> Thanks
> 
> Eric
>>
>> Thanks,
>> Keqian
>>
>>
>>> +   smmu_domain->s1_cfg.set = true;
>>> +   smmu_domain->abort = false;
>>> +   break;
>>> +   default:
>>> +   goto out;
>>> +   }
>>> +   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
>>> +   list_for_each_entry(master, &smmu_domain->devices, domain_head)
>>> +   arm_smmu_install_ste_for_dev(master);
>>> +   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>>> +   ret = 0;
>>> +out:
>>> +   mutex_unlock(&smmu_domain->init_mutex);
>>> +   return ret;
>>> +}
>>> +
>>> +static void arm_smmu_detach_pasid_table(struct iommu_domain *domain)
>>> +{
>>> +   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>>> +   struct arm_smmu_master *master;
>>> +   unsigned long flags;
>>> +
>>> +   mutex_lock(&smmu_domain->init_mutex);
>>> +
>>> +   if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
>>> +   goto unlock;
>>> +
>>> +   smmu_domain->s1_cfg.set = false;
>>> +   smmu_domain->abort = false;
>>> +
>>> +   spin_lock_irqsave(&smmu_domain->devices_lock, flags);
>>> +   list_for_each_entry(master, &smmu_domain->devices, domain_head)
>>> +   arm_smmu_install_ste_for_dev(master);
>>> +   spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>>> +
>>> +unlock:
>>> +   mutex_unlock(&smmu_domain->init_mutex);
>>> +}
>>> +
>>>  static bool arm_smmu_dev_has_feature(struct device *dev,
>>>  enum iommu_dev_features feat)
>>>  {
>>> @@ -2939,6 +3026,8 @@ static struct iommu_ops arm_smmu_ops = {
>>> .of_xlate   = arm_smmu_of_xlate,
>>> .get_resv_regions   = arm_smmu_get_resv_regions,
>>> .put_resv_regions   = generic_iommu_put_resv_regions,
>>> +   .attach_pasid_table = arm_smmu_attach_pasid_table,
>>> +   .detach_pasid_table = arm_smmu_detach_pasid_table,
>>> .dev_has_feat   = arm_smmu_dev_has_feature,
>>> .dev_feat_enabled   = arm_smmu_dev_feature_enabled,
>>> .dev_enable_feat= arm_smmu_dev_enable_feature,
>>>
>>
> 
> .
> 


Re: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-03-18 Thread Keqian Zhu
Hi Baolu,

On 2021/3/19 8:33, Lu Baolu wrote:
> On 3/18/21 7:53 PM, Shenming Lu wrote:
>> On 2021/3/18 17:07, Tian, Kevin wrote:
 From: Shenming Lu 
 Sent: Thursday, March 18, 2021 3:53 PM

 On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many
>>> devices allow I/O faulting only in selective contexts. However, there
>>> is no standard way (e.g. PCISIG) for the device to report whether
>>> arbitrary I/O fault is allowed. Then we may have to maintain device
>>> specific knowledge in software, e.g. in an opt-in table to list devices
>>> which allows arbitrary faults. For devices which only support selective
>>> faulting, a mediator (either through vendor extensions on vfio-pci-core
>>> or a mdev wrapper) might be necessary to help lock down non-faultable
>>> mappings and then enable faulting on the rest mappings.
>>
>> For devices which only support selective faulting, they could tell it to 
>> the
>> IOMMU driver and let it filter out non-faultable faults? Do I get it 
>> wrong?
>
> Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
> selectively page-pinning. The matter is that 'they' imply some device
> specific logic to decide which pages must be pinned and such knowledge
> is outside of VFIO.
>
>  From enabling p.o.v we could possibly do it in phased approach. First
> handles devices which tolerate arbitrary DMA faults, and then extends
> to devices with selective-faulting. The former is simpler, but with one
> main open whether we want to maintain such device IDs in a static
> table in VFIO or rely on some hints from other components (e.g. PF
> driver in VF assignment case). Let's see how Alex thinks about it.

 Hi Kevin,

 You mentioned selective-faulting some time ago. I still have some doubt
 about it:
 There is already a vfio_pin_pages() which is used for limiting the IOMMU
 group dirty scope to pinned pages, could it also be used for indicating
 the faultable scope is limited to the pinned pages and the rest mappings
 is non-faultable that should be pinned and mapped immediately? But it
 seems to be a little weird and not exactly to what you meant... I will
 be grateful if you can help to explain further. :-)

>>>
>>> The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
>>> pages that are not faultable (based on its specific knowledge) and then
>>> the rest memory becomes faultable.
>>
>> Ahh...
>> Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device,
>> only the page faults within the pinned range are valid in the registered
>> iommu fault handler...
> 
> Isn't it opposite? The pinned pages will never generate any page faults.
> I might miss some contexts here.
It seems that vfio_pin_pages() just pins some pages and records the pinned scope
in the pfn_list of the vfio_dma.
No mapping is established, so we still have page faults.

IIUC, vfio_pin_pages() is used to
1. pin pages for non-iommu backed devices.
2. mark the dirty scope for both non-iommu backed and iommu backed devices.
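
To make the selective-faulting idea above concrete, here is a minimal sketch of
how a vendor driver could lock down its non-faultable range up front and leave
the rest of the IOVA space to be faulted in on demand. The helper name and the
pinned range are hypothetical, and the vfio_pin_pages() prototype assumed here
is the current struct-device based one:

#include <linux/iommu.h>
#include <linux/slab.h>
#include <linux/vfio.h>

/* Hypothetical helper: pin a device-specific, non-faultable PFN range. */
static int pin_non_faultable_range(struct device *dev,
				   unsigned long start_pfn, int npage)
{
	unsigned long *user_pfn, *phys_pfn;
	int i, ret;

	user_pfn = kcalloc(npage, sizeof(*user_pfn), GFP_KERNEL);
	phys_pfn = kcalloc(npage, sizeof(*phys_pfn), GFP_KERNEL);
	if (!user_pfn || !phys_pfn) {
		ret = -ENOMEM;
		goto out;
	}

	for (i = 0; i < npage; i++)
		user_pfn[i] = start_pfn + i;

	/* vfio_pin_pages() returns the number of pages pinned on success. */
	ret = vfio_pin_pages(dev, user_pfn, npage,
			     IOMMU_READ | IOMMU_WRITE, phys_pfn);
	ret = (ret == npage) ? 0 : (ret < 0 ? ret : -EFAULT);
out:
	kfree(user_pfn);
	kfree(phys_pfn);
	return ret;
}

The rest of the mappings then become candidates for IOPF handling, which matches
the split described above.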

Thanks,
Keqian

> 
>> I have another question here, for the IOMMU backed devices, they are already
>> all pinned and mapped when attaching, is there a need to call 
>> vfio_pin_pages()
>> to lock down pages for them? Did I miss something?...
> 
> Best regards,
> baolu
> .
> 


Re: [PATCH v2 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-03-17 Thread Keqian Zhu
On 2021/3/17 18:44, Yi Sun wrote:
> On 21-03-10 17:06:09, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> During dirty log tracking, user will try to retrieve dirty log from
>> iommu if it supports hardware dirty log.
>>
>> This adds a new interface named sync_dirty_log in iommu layer and
>> arm smmuv3 implements it, which scans leaf TTD and treats it's dirty
>> if it's writable (As we just enable HTTU for stage1, so check whether
>> AP[2] is not set).
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>
>> changelog:
>>
>> v2:
>>  - Add new sanity check in arm_smmu_sync_dirty_log(). (smmu_domain->stage != 
>> ARM_SMMU_DOMAIN_S1)
>>  - Document the purpose of flush_iotlb in arm_smmu_sync_dirty_log(). (Robin)
>>  
>> ---
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 30 +++
>>  drivers/iommu/io-pgtable-arm.c  | 90 +
>>  drivers/iommu/iommu.c   | 38 +
>>  include/linux/io-pgtable.h  |  4 +
>>  include/linux/iommu.h   | 18 +
>>  5 files changed, 180 insertions(+)
>>
> Please split iommu common interface out. Thanks!
Yes, I will do it in v3.

> 
> [...]
> 
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 2a10294b62a3..44dfb78f9050 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2850,6 +2850,44 @@ int iommu_stop_dirty_log(struct iommu_domain *domain, 
>> unsigned long iova,
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_stop_dirty_log);
>>  
>> +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
>> + size_t size, unsigned long *bitmap,
>> + unsigned long base_iova, unsigned long bitmap_pgshift)
> 
> One open question: shall we add PASID as one parameter to make iommu
> know which address space to visit?
> 
> For live migration, the pasid should not be necessary. But considering
Sure, for live migration we just need to care about level/stage 2 mapping under 
nested mode.

> future extension, it may be required.
It sounds like a good idea. I will consider this, thanks!
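
For the record, the extension being discussed would presumably just grow the
prototype by one argument, something like the following (illustrative only, not
part of this series):

int iommu_sync_dirty_log(struct iommu_domain *domain, ioasid_t pasid,
			 unsigned long iova, size_t size,
			 unsigned long *bitmap, unsigned long base_iova,
			 unsigned long bitmap_pgshift);

A reserved value such as INVALID_IOASID could then keep today's whole-domain
behaviour for the live migration case.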

> 
> BRs,
> Yi Sun
> .
> 
Thanks,
Keqian


[RFC PATCH v2 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-03-16 Thread Keqian Zhu
The MMIO region of a device may be huge (GB level), so try to use
block mapping in stage2 to speed up both map and unmap.

Compared to normal memory mapping, we should consider two more
points when trying block mapping for an MMIO region:

1. For normal memory mapping, the PA (host physical address) and
HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
the HVA to request a hugepage, so we don't need to consider PA
alignment when verifying block mapping. But for device memory
mapping, the PA and HVA may have different alignment.

2. For normal memory mapping, we are sure the hugepage size properly
fits into the vma, so we don't check whether the mapping size exceeds
the boundary of the vma. But for device memory mapping, we should pay
attention to this.

This adds device_rough_page_shift() to check these two points when
selecting block mapping size.
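
As a concrete illustration of point 1, with made-up addresses (these numbers
are only an example, not from any real device):

/*
 * PMD_SIZE = 2MiB (0x200000):
 *
 *   hva = 0x40200000, pa = 0x8010200000
 *     hva & (PMD_SIZE - 1) = 0x0
 *     pa  & (PMD_SIZE - 1) = 0x0       -> same offset, a PMD block is usable
 *
 *   hva = 0x40200000, pa = 0x8010300000
 *     hva & (PMD_SIZE - 1) = 0x0
 *     pa  & (PMD_SIZE - 1) = 0x100000  -> offsets differ, fall back to PAGE_SIZE
 */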

Signed-off-by: Keqian Zhu 
---

Mainly for RFC, not fully tested. I will fully test it when the
code logic is well accepted.

---
 arch/arm64/kvm/mmu.c | 42 ++
 1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c59af5ca01b0..224aa15eb4d9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -624,6 +624,36 @@ static void kvm_send_hwpoison_signal(unsigned long 
address, short lsb)
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
 }
 
+/*
+ * Find a mapping size that fits properly inside the intersection of vma and
+ * memslot. And hva and pa have the same alignment to this mapping size.
+ * It's rough because there are still other restrictions, which will be
+ * checked by the following fault_supports_stage2_huge_mapping().
+ */
+static short device_rough_page_shift(struct kvm_memory_slot *memslot,
+struct vm_area_struct *vma,
+unsigned long hva)
+{
+   size_t size = memslot->npages * PAGE_SIZE;
+   hva_t sec_start = max(memslot->userspace_addr, vma->vm_start);
+   hva_t sec_end = min(memslot->userspace_addr + size, vma->vm_end);
+   phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
+
+#ifndef __PAGETABLE_PMD_FOLDED
+   if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PUD_SIZE) >= sec_start &&
+   ALIGN(hva, PUD_SIZE) <= sec_end)
+   return PUD_SHIFT;
+#endif
+
+   if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
+   ALIGN_DOWN(hva, PMD_SIZE) >= sec_start &&
+   ALIGN(hva, PMD_SIZE) <= sec_end)
+   return PMD_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
 static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
   unsigned long hva,
   unsigned long map_size)
@@ -769,7 +799,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   /* Let's check if we will get back a huge page backed by hugetlbfs */
+   /*
+* Let's check if we will get back a huge page backed by hugetlbfs, or
+* get block mapping for device MMIO region.
+*/
mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, hva, hva + 1);
if (unlikely(!vma)) {
@@ -780,11 +813,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
if (is_vm_hugetlb_page(vma))
vma_shift = huge_page_shift(hstate_vma(vma));
+   else if (vma->vm_flags & VM_PFNMAP)
+   vma_shift = device_rough_page_shift(memslot, vma, hva);
else
vma_shift = PAGE_SHIFT;
 
-   if (logging_active ||
-   (vma->vm_flags & VM_PFNMAP)) {
+   if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
}
@@ -855,7 +889,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
if (kvm_is_device_pfn(pfn)) {
device = true;
-   force_pte = true;
+   force_pte = (vma_pagesize == PAGE_SIZE);
} else if (logging_active && !write_fault) {
/*
 * Only actually map the page as writable if this was a write
-- 
2.19.1



[RFC PATCH v2 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions

2021-03-16 Thread Keqian Zhu
The MMIO regions may be unmapped for many reasons and can be remapped
by the stage2 fault path. Mapping MMIO regions at creation time is thus
only a minor optimization, and it makes these two mapping paths hard to
keep in sync.

Remove the mapping code while keeping the useful sanity check.

Signed-off-by: Keqian Zhu 
---
 arch/arm64/kvm/mmu.c | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..c59af5ca01b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1301,7 +1301,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 {
hva_t hva = mem->userspace_addr;
hva_t reg_end = hva + mem->memory_size;
-   bool writable = !(mem->flags & KVM_MEM_READONLY);
int ret = 0;
 
if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
@@ -1318,8 +1317,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
mmap_read_lock(current->mm);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
-* between them, so iterate over all of them to find out if we can map
-* any of them right now.
+* between them, so iterate over all of them.
 *
 * ++
 * +---++   ++
@@ -1330,50 +1328,20 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 */
do {
struct vm_area_struct *vma = find_vma(current->mm, hva);
-   hva_t vm_start, vm_end;
 
if (!vma || vma->vm_start >= reg_end)
break;
 
-   /*
-* Take the intersection of this VMA with the memory region
-*/
-   vm_start = max(hva, vma->vm_start);
-   vm_end = min(reg_end, vma->vm_end);
-
if (vma->vm_flags & VM_PFNMAP) {
-   gpa_t gpa = mem->guest_phys_addr +
-   (vm_start - mem->userspace_addr);
-   phys_addr_t pa;
-
-   pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
-   pa += vm_start - vma->vm_start;
-
/* IO region dirty page logging not allowed */
if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
-   goto out;
-   }
-
-   ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
-   vm_end - vm_start,
-   writable);
-   if (ret)
break;
+   }
}
-   hva = vm_end;
+   hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
 
-   if (change == KVM_MR_FLAGS_ONLY)
-   goto out;
-
-   spin_lock(&kvm->mmu_lock);
-   if (ret)
-   unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, 
mem->memory_size);
-   else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
-   stage2_flush_memslot(kvm, memslot);
-   spin_unlock(&kvm->mmu_lock);
-out:
mmap_read_unlock(current->mm);
return ret;
 }
-- 
2.19.1



[RFC PATCH v2 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-03-16 Thread Keqian Zhu
Hi all,

We have two paths to build stage2 mappings for MMIO regions:
the creation-time path and the stage2 fault path.

Patch#1 removes the creation time's mapping of MMIO regions
Patch#2 tries stage2 block mapping for host device MMIO at fault path

Thanks,
Keqian

Keqian Zhu (2):
  kvm/arm64: Remove the creation time's mapping of MMIO regions
  kvm/arm64: Try stage2 block mapping for host device MMIO

 arch/arm64/kvm/mmu.c | 80 +++-
 1 file changed, 41 insertions(+), 39 deletions(-)

-- 
2.19.1



Re: [PATCH v2 04/11] iommu/arm-smmu-v3: Split block descriptor when start dirty log

2021-03-16 Thread Keqian Zhu
Hi Yi,

On 2021/3/16 17:17, Yi Sun wrote:
> On 21-03-10 17:06:07, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> Block descriptor is not a proper granule for dirty log tracking.
>> Take an extreme example, if DMA writes one byte, under 1G mapping,
>> the dirty amount reported to userspace is 1G, but under 4K mapping,
>> the dirty amount is just 4K.
>>
>> This adds a new interface named start_dirty_log in iommu layer and
>> arm smmuv3 implements it, which splits block descriptor to an span
>> of page descriptors. Other types of IOMMU will perform architecture
>> specific actions to start dirty log.
>>
>> To allow code reuse, the split_block operation is realized as an
>> iommu_ops too. We flush all iotlbs after the whole procedure is
>> completed to ease the pressure of iommu, as we will hanle a huge
>> range of mapping in general.
>>
>> Spliting block does not simultaneously work with other pgtable ops,
>> as the only designed user is vfio, which always hold a lock, so race
>> condition is not considered in the pgtable ops.
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>
>> changelog:
>>
>> v2:
>>  - Change the return type of split_block(). size_t -> int.
>>  - Change commit message to properly describe race condition. (Robin)
>>  - Change commit message to properly describe the need of split block.
>>  - Add a new interface named start_dirty_log(). (Sun Yi)
>>  - Change commit message to explain the realtionship of split_block() and 
>> start_dirty_log().
>>
>> ---
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  52 +
>>  drivers/iommu/io-pgtable-arm.c  | 122 
>>  drivers/iommu/iommu.c   |  48 
>>  include/linux/io-pgtable.h  |   2 +
>>  include/linux/iommu.h   |  24 
>>  5 files changed, 248 insertions(+)
>>
> Could you please split iommu common interface to a separate patch?
> This may make review and comments easier.
Yup, good suggestion.

> 
> IMHO, I think the start/stop interfaces could be merged into one, e.g:
> int iommu_domain_set_hwdbm(struct iommu_domain *domain, bool enable,
>unsigned long iova, size_t size,
>int prot);
Looks good, this reduces some code. But I have a concern that it causes a
loss of flexibility, as we must pass the same arguments when starting and
stopping the dirty log. What's your opinion about this?
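
To make the concern concrete, the caller side with the merged interface would
look roughly like this (using the prototype you propose above; the VFIO call
sites are only illustrative):

	/* start dirty log: split blocks so we can track at page granule */
	ret = iommu_domain_set_hwdbm(domain, true, dma->iova, dma->size, prot);
	...
	/* stop dirty log: merge pages back; the iova/size/prot must match what
	 * was passed at start time, which is the flexibility I worry about */
	ret = iommu_domain_set_hwdbm(domain, false, dma->iova, dma->size, prot);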

> 
> Same comments to patch 5.
OK. Thanks.

> 
> BRs,
> Yi Sun
> 
>> -- 
>> 2.19.1
> .
Thanks,
Keqian


Re: [PATCH 2/4] KVM: arm64: Use find_vma_intersection()

2021-03-15 Thread Keqian Zhu
Hi Gavin,

On 2021/3/16 11:52, Gavin Shan wrote:
> Hi Keqian,
> 
> On 3/15/21 8:42 PM, Gavin Shan wrote:
>> On 3/15/21 7:04 PM, Keqian Zhu wrote:
>>> On 2021/3/15 12:18, Gavin Shan wrote:
>>>> find_vma_intersection() has been existing to search the intersected
>>>> vma. This uses the function where it's applicable, to simplify the
>>>> code.
>>>>
>>>> Signed-off-by: Gavin Shan 
>>>> ---
>>>>   arch/arm64/kvm/mmu.c | 10 ++
>>>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index 84e70f953de6..286b603ed0d3 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -421,10 +421,11 @@ static void stage2_unmap_memslot(struct kvm *kvm,
>>>>* ++
>>>>*/
>>>>   do {
>>>> -struct vm_area_struct *vma = find_vma(current->mm, hva);
>>>> +struct vm_area_struct *vma;
>>>>   hva_t vm_start, vm_end;
>>>> -if (!vma || vma->vm_start >= reg_end)
>>>> +vma = find_vma_intersection(current->mm, hva, reg_end);
>>> Nit: Keep a same style may be better(Assign vma when declare it).
>>> Other looks good to me.
>>>
>>
>> Yeah, I agree. I will adjust the code in v2 and included your r-b.
>> Thanks for your time to review.
>>
> 
> After rechecking the code, I think it'd better to keep current style
> because there is a follow-on validation on @vma. Keeping them together
> seems a good idea. I think it wouldn't a big deal to you. So I will
> keep current style with your r-b in v2.
Sure, both are OK. ;-)

Thanks,
Keqian
> 
> vma = find_vma_intersection(current->mm, hva, reg_end);
> if (!vma)
>  break;
> Thanks,
> Gavin
>  
>>>> +if (!vma)
>>>>   break;
>>>>   /*
>>>> @@ -1330,10 +1331,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>>>* ++
>>>>*/
>>>>   do {
>>>> -struct vm_area_struct *vma = find_vma(current->mm, hva);
>>>> +struct vm_area_struct *vma;
>>>>   hva_t vm_start, vm_end;
>>>> -if (!vma || vma->vm_start >= reg_end)
>>>> +vma = find_vma_intersection(current->mm, hva, reg_end);
>>>> +if (!vma)
>>>>   break;
>>>>   /*
>>>>
>>>
>>
> 
> .
> 


Re: [PATCH 4/4] KVM: arm64: Don't retrieve memory slot again in page fault handler

2021-03-15 Thread Keqian Zhu
Hi Gavin,

On 2021/3/15 17:56, Gavin Shan wrote:
> Hi Keqian,
> 
> On 3/15/21 7:25 PM, Keqian Zhu wrote:
>> On 2021/3/15 12:18, Gavin Shan wrote:
>>> We needn't retrieve the memory slot again in user_mem_abort() because
>>> the corresponding memory slot has been passed from the caller. This
>> I think you are right, though fault_ipa will be adjusted when we try to use 
>> block mapping,
>> the fault_supports_stage2_huge_mapping() makes sure we're not trying to map 
>> anything
>> not covered by the memslot, so the adjusted fault_ipa still belongs to the 
>> memslot.
>>
> 
> Yeah, it's correct. Besides, the @logging_active is determined
> based on the passed memory slot. It means user_mem_abort() can't
> support memory range which spans multiple memory slot.
> 
>>> would save some CPU cycles. For example, the time used to write 1GB
>>> memory, which is backed by 2MB hugetlb pages and write-protected, is
>>> dropped by 6.8% from 928ms to 864ms.
>>>
>>> Signed-off-by: Gavin Shan 
>>> ---
>>>   arch/arm64/kvm/mmu.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index a5a8ade9fde4..4a4abcccfafb 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -846,7 +846,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>>> phys_addr_t fault_ipa,
>>>*/
>>>   smp_rmb();
>>>   -pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>>> +pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
>>> +   write_fault, &writable, NULL);
>> It's better to update the code comments at same time.
>>
> 
> I guess you need some comments here? If so, I would add something
> like below in v2:
> 
> /*
>  * gfn_to_pfn_prot() can be used either with unnecessary overhead
>  * introduced to locate the memory slot because the memory slot is
>  * always fixed even @gfn is adjusted for huge pages.
>  */
Looks good.

See the comments above "smp_rmb();". What I actually mean is that we should
change "gfn_to_pfn_prot" to "__gfn_to_pfn_memslot" in that comment :)

Thanks,
Keqian

> 
>>>   if (pfn == KVM_PFN_ERR_HWPOISON) {
>>>   kvm_send_hwpoison_signal(hva, vma_shift);
>>>   return 0;
>>> @@ -912,7 +913,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>>> phys_addr_t fault_ipa,
>>>   /* Mark the page dirty only if the fault is handled successfully */
>>>   if (writable && !ret) {
>>>   kvm_set_pfn_dirty(pfn);
>>> -mark_page_dirty(kvm, gfn);
>>> +mark_page_dirty_in_slot(kvm, memslot, gfn);
>>>   }
>>> out_unlock:
>>>
> 
> Thanks,
> Gavin
> 
> 
> .
> 


Re: [PATCH 4/4] KVM: arm64: Don't retrieve memory slot again in page fault handler

2021-03-15 Thread Keqian Zhu
Hi Gavin,

On 2021/3/15 12:18, Gavin Shan wrote:
> We needn't retrieve the memory slot again in user_mem_abort() because
> the corresponding memory slot has been passed from the caller. This
I think you are right. Though fault_ipa will be adjusted when we try to use
block mapping, fault_supports_stage2_huge_mapping() makes sure we're not
trying to map anything not covered by the memslot, so the adjusted fault_ipa
still belongs to the memslot.

> would save some CPU cycles. For example, the time used to write 1GB
> memory, which is backed by 2MB hugetlb pages and write-protected, is
> dropped by 6.8% from 928ms to 864ms.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/kvm/mmu.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index a5a8ade9fde4..4a4abcccfafb 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -846,7 +846,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>*/
>   smp_rmb();
>  
> - pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> + pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
> +write_fault, &writable, NULL);
It's better to update the code comments at the same time.

>   if (pfn == KVM_PFN_ERR_HWPOISON) {
>   kvm_send_hwpoison_signal(hva, vma_shift);
>   return 0;
> @@ -912,7 +913,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   /* Mark the page dirty only if the fault is handled successfully */
>   if (writable && !ret) {
>   kvm_set_pfn_dirty(pfn);
> - mark_page_dirty(kvm, gfn);
> + mark_page_dirty_in_slot(kvm, memslot, gfn);
>   }
>  
>  out_unlock:
> 

Thanks,
Keqian.


Re: [PATCH 2/4] KVM: arm64: Use find_vma_intersection()

2021-03-15 Thread Keqian Zhu
Hi Gavin,

On 2021/3/15 12:18, Gavin Shan wrote:
> find_vma_intersection() has been existing to search the intersected
> vma. This uses the function where it's applicable, to simplify the
> code.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/kvm/mmu.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 84e70f953de6..286b603ed0d3 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -421,10 +421,11 @@ static void stage2_unmap_memslot(struct kvm *kvm,
>* ++
>*/
>   do {
> - struct vm_area_struct *vma = find_vma(current->mm, hva);
> + struct vm_area_struct *vma;
>   hva_t vm_start, vm_end;
>  
> - if (!vma || vma->vm_start >= reg_end)
> + vma = find_vma_intersection(current->mm, hva, reg_end);
Nit: keeping the same style may be better (assign vma when declaring it).
Otherwise this looks good to me.

Thanks,
Keqian


> + if (!vma)
>   break;
>  
>   /*
> @@ -1330,10 +1331,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>* ++
>*/
>   do {
> - struct vm_area_struct *vma = find_vma(current->mm, hva);
> + struct vm_area_struct *vma;
>   hva_t vm_start, vm_end;
>  
> - if (!vma || vma->vm_start >= reg_end)
> + vma = find_vma_intersection(current->mm, hva, reg_end);
> + if (!vma)
>   break;
>  
>   /*
> 


Re: [PATCH 1/4] KVM: arm64: Hide kvm_mmu_wp_memory_region()

2021-03-15 Thread Keqian Zhu
Hi Gavin,

This function has only ever been used by mmu.c since its introduction in
commit c64735554c0a, so please feel free to add:

Reviewed-by: Keqian Zhu 


Thanks,
Keqian

On 2021/3/15 12:18, Gavin Shan wrote:
> We needn't expose the function as it's only used by mmu.c.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/include/asm/kvm_host.h | 1 -
>  arch/arm64/kvm/mmu.c  | 2 +-
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 3d10e6527f7d..688f2df1957b 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -632,7 +632,6 @@ void kvm_arm_resume_guest(struct kvm *kvm);
>   })
>  
>  void force_vm_exit(const cpumask_t *mask);
> -void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
>  
>  int handle_exit(struct kvm_vcpu *vcpu, int exception_index);
>  void handle_exit_early(struct kvm_vcpu *vcpu, int exception_index);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..84e70f953de6 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -555,7 +555,7 @@ static void stage2_wp_range(struct kvm_s2_mmu *mmu, 
> phys_addr_t addr, phys_addr_
>   * Acquires kvm_mmu_lock. Called with kvm->slots_lock mutex acquired,
>   * serializing operations for VM memory regions.
>   */
> -void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
> +static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
>  {
>   struct kvm_memslots *slots = kvm_memslots(kvm);
>   struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
> 


Re: [PATCH 3/4] KVM: arm64: Fix address check for memory slot

2021-03-15 Thread Keqian Zhu
Hi Gavin,

FYI, this has been fixed by Marc in commit 262b003d059c.

Thanks,
Keqian

On 2021/3/15 12:18, Gavin Shan wrote:
> The last (IPA) page can't be specified when a new memory slot is
> added. The error -EFAULT is returned when the memory slot is added
> with the following parameters for the VM, which has 40-bits IPA
> limit. The host has 4KB base page size. It's not correct because
> the last (IPA) page is still usable.
> 
>struct kvm_userspace_memory_region {
>   __u32 slot;   /* 1*/
>   __u32 flags;  /* 0*/
>   __u64 guest_phys_addr;/* 0xfff000 */
>   __u64 memory_size;/* 0x1000   */
>   __u64 userspace_addr;
>};
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/kvm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 286b603ed0d3..a5a8ade9fde4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1313,7 +1313,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>* Prevent userspace from creating a memory region outside of the IPA
>* space addressable by the KVM guest IPA space.
>*/
> - if (memslot->base_gfn + memslot->npages >=
> + if (memslot->base_gfn + memslot->npages >
>   (kvm_phys_size(kvm) >> PAGE_SHIFT))
>   return -EFAULT;
>  
> 


Re: [PATCH] KVM: clean up the unused argument

2021-03-14 Thread Keqian Zhu


This looks OK. The use of the vcpu argument was removed in commit d383b3146d80
(KVM: x86: Fix NULL dereference at kvm_msr_ignored_check()).

Reviewed-by: Keqian Zhu 

On 2021/3/13 13:10, lihaiwei.ker...@gmail.com wrote:
> From: Haiwei Li 
> 
> kvm_msr_ignored_check function never uses vcpu argument. Clean up the
> function and invokers.
> 
> Signed-off-by: Haiwei Li 
> ---
>  arch/x86/kvm/x86.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 012d5df..27e9ee8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -271,8 +271,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>   * When called, it means the previous get/set msr reached an invalid msr.
>   * Return true if we want to ignore/silent this failed msr access.
>   */
> -static bool kvm_msr_ignored_check(struct kvm_vcpu *vcpu, u32 msr,
> -   u64 data, bool write)
> +static bool kvm_msr_ignored_check(u32 msr, u64 data, bool write)
>  {
>   const char *op = write ? "wrmsr" : "rdmsr";
>  
> @@ -1447,7 +1446,7 @@ static int do_get_msr_feature(struct kvm_vcpu *vcpu, 
> unsigned index, u64 *data)
>   if (r == KVM_MSR_RET_INVALID) {
>   /* Unconditionally clear the output for simplicity */
>   *data = 0;
> - if (kvm_msr_ignored_check(vcpu, index, 0, false))
> + if (kvm_msr_ignored_check(index, 0, false))
>   r = 0;
>   }
>  
> @@ -1613,7 +1612,7 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu 
> *vcpu,
>   int ret = __kvm_set_msr(vcpu, index, data, host_initiated);
>  
>   if (ret == KVM_MSR_RET_INVALID)
> - if (kvm_msr_ignored_check(vcpu, index, data, true))
> + if (kvm_msr_ignored_check(index, data, true))
>   ret = 0;
>  
>   return ret;
> @@ -1651,7 +1650,7 @@ static int kvm_get_msr_ignored_check(struct kvm_vcpu 
> *vcpu,
>   if (ret == KVM_MSR_RET_INVALID) {
>   /* Unconditionally clear *data for simplicity */
>   *data = 0;
> - if (kvm_msr_ignored_check(vcpu, index, 0, false))
> + if (kvm_msr_ignored_check(index, 0, false))
>   ret = 0;
>   }
>  
> 


Re: [PATCH] vfio/type1: fix vaddr_get_pfns() return in vfio_pin_page_external()

2021-03-14 Thread Keqian Zhu


Hi Daniel,

[+Cc iommu mail list]

This patch looks good to me (but I haven't tested it either).

Thanks,
Keqian

On 2021/3/9 1:24, Daniel Jordan wrote:
> vaddr_get_pfns() now returns the positive number of pfns successfully
> gotten instead of zero.  vfio_pin_page_external() might return 1 to
> vfio_iommu_type1_pin_pages(), which will treat it as an error, if
> vaddr_get_pfns() is successful but vfio_pin_page_external() doesn't
> reach vfio_lock_acct().
> 
> Fix it up in vfio_pin_page_external().  Found by inspection.
> 
> Fixes: be16c1fd99f4 ("vfio/type1: Change success value of vaddr_get_pfn()")
> Signed-off-by: Daniel Jordan 
> ---
> 
> I couldn't test this due to lack of hardware.
> 
>  drivers/vfio/vfio_iommu_type1.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 4bb162c1d649..2a0e3b3ce206 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -785,7 +785,12 @@ static int vfio_pin_page_external(struct vfio_dma *dma, 
> unsigned long vaddr,
>   return -ENODEV;
>  
>   ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
> - if (ret == 1 && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
> + if (ret != 1)
> + goto out;
> +
> + ret = 0;
> +
> + if (do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
>   ret = vfio_lock_acct(dma, 1, true);
>   if (ret) {
>   put_pfn(*pfn_base, dma->prot);
> @@ -797,6 +802,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, 
> unsigned long vaddr,
>   }
>   }
>  
> +out:
>   mmput(mm);
>   return ret;
>  }
> 
> base-commit: 144c79ef33536b4ecb4951e07dbc1f2b7fa99d32
> 


Re: [RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-03-12 Thread Keqian Zhu
Hi Marc,

On 2021/3/12 16:52, Marc Zyngier wrote:
> On Thu, 11 Mar 2021 14:28:17 +,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/3/11 16:43, Marc Zyngier wrote:
>>> Digging this patch back from my Inbox...
>> Yeah, thanks ;-)
>>
>>>
>>> On Fri, 22 Jan 2021 08:36:50 +,
>>> Keqian Zhu  wrote:
>>>>
>>>> The MMIO region of a device maybe huge (GB level), try to use block
>>>> mapping in stage2 to speedup both map and unmap.
[...]

>>>>break;
>>>>  
>>>> -  pa += PAGE_SIZE;
>>>> +  pa += pgsize;
>>>>}
>>>>  
>>>>kvm_mmu_free_memory_cache(&cache);
>>>
>>> There is one issue with this patch, which is that it only does half
>>> the job. A VM_PFNMAP VMA can definitely be faulted in dynamically, and
>>> in that case we force this to be a page mapping. This conflicts with
>>> what you are doing here.
>> Oh yes, these two paths should keep a same mapping logic.
>>
>> I try to search the "force_pte" and find out some discussion [1]
>> between you and Christoffer.  And I failed to get a reason about
>> forcing pte mapping for device MMIO region (expect that we want to
>> keep a same logic with the eager mapping path). So if you don't
>> object to it, I will try to implement block mapping for device MMIO
>> in user_mem_abort().
>>
>>>
>>> There is also the fact that if we can map things on demand, why are we
>>> still mapping these MMIO regions ahead of time?
>>
>> Indeed. Though this provides good *startup* performance for guest
>> accessing MMIO, it's hard to keep the two paths in sync. We can keep
>> this minor optimization or delete it to avoid hard maintenance,
>> which one do you prefer?
> 
> I think we should be able to get rid of the startup path. If we can do
> it for memory, I see no reason not to do it for MMIO.
OK, I will do.

> 
>> BTW, could you please have a look at my another patch series[2]
>> about HW/SW combined dirty log? ;)
> 
> I will eventually, but while I really appreciate your contributions in
> terms of features and bug fixes, I would really *love* it if you were
> a bit more active on the list when it comes to reviewing other
> people's code.
> 
> There is no shortage of patches that really need reviewing, and just
> pointing me in the direction of your favourite series doesn't really
> help. I have something like 200+ patches that need careful reviewing
> in my inbox, and they all deserve the same level of attention.
> 
> To make it short, help me to help you!
My apologies, and I can't agree more.

I have noticed this, and have reviewed several patches from the IOMMU
community. Some of them require a lot of background knowledge, so they are
hard to review. I will dig into them in the future.

Thanks for your valuable advice. :)

Thanks,
Keqian


> 
> Thanks,
> 
>   M.
> 


Re: [RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-03-11 Thread Keqian Zhu
Hi Marc,

On 2021/3/11 16:43, Marc Zyngier wrote:
> Digging this patch back from my Inbox...
Yeah, thanks ;-)

> 
> On Fri, 22 Jan 2021 08:36:50 +,
> Keqian Zhu  wrote:
>>
>> The MMIO region of a device maybe huge (GB level), try to use block
>> mapping in stage2 to speedup both map and unmap.
>>
>> Especially for unmap, it performs TLBI right after each invalidation
>> of PTE. If all mapping is of PAGE_SIZE, it takes much time to handle
>> GB level range.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  arch/arm64/include/asm/kvm_pgtable.h | 11 +++
>>  arch/arm64/kvm/hyp/pgtable.c | 15 +++
>>  arch/arm64/kvm/mmu.c | 12 
>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
>> b/arch/arm64/include/asm/kvm_pgtable.h
>> index 52ab38db04c7..2266ac45f10c 100644
>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>> @@ -82,6 +82,17 @@ struct kvm_pgtable_walker {
>>  const enum kvm_pgtable_walk_flags   flags;
>>  };
>>  
>> +/**
>> + * kvm_supported_pgsize() - Get the max supported page size of a mapping.
>> + * @pgt:Initialised page-table structure.
>> + * @addr:   Virtual address at which to place the mapping.
>> + * @end:End virtual address of the mapping.
>> + * @phys:   Physical address of the memory to map.
>> + *
>> + * The smallest return value is PAGE_SIZE.
>> + */
>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>> phys);
>> +
>>  /**
>>   * kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
>>   * @pgt:Uninitialised page-table structure to initialise.
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index bdf8e55ed308..ab11609b9b13 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -81,6 +81,21 @@ static bool kvm_block_mapping_supported(u64 addr, u64 
>> end, u64 phys, u32 level)
>>  return IS_ALIGNED(addr, granule) && IS_ALIGNED(phys, granule);
>>  }
>>  
>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>> phys)
>> +{
>> +u32 lvl;
>> +u64 pgsize = PAGE_SIZE;
>> +
>> +for (lvl = pgt->start_level; lvl < KVM_PGTABLE_MAX_LEVELS; lvl++) {
>> +if (kvm_block_mapping_supported(addr, end, phys, lvl)) {
>> +pgsize = kvm_granule_size(lvl);
>> +break;
>> +}
>> +}
>> +
>> +return pgsize;
>> +}
>> +
>>  static u32 kvm_pgtable_idx(struct kvm_pgtable_walk_data *data, u32 level)
>>  {
>>  u64 shift = kvm_granule_shift(level);
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 7d2257cc5438..80b403fc8e64 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -499,7 +499,8 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>phys_addr_t pa, unsigned long size, bool writable)
>>  {
>> -phys_addr_t addr;
>> +phys_addr_t addr, end;
>> +unsigned long pgsize;
>>  int ret = 0;
>>  struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
>>  struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>> @@ -509,21 +510,24 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
>> guest_ipa,
>>  
>>  size += offset_in_page(guest_ipa);
>>  guest_ipa &= PAGE_MASK;
>> +end = guest_ipa + size;
>>  
>> -for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
>> +for (addr = guest_ipa; addr < end; addr += pgsize) {
>> +ret = kvm_mmu_topup_memory_cache(&cache,
>>   kvm_mmu_cache_min_pages(kvm));
>>  if (ret)
>>  break;
>>  
>> +pgsize = kvm_supported_pgsize(pgt, addr, end, pa);
>> +
>>  spin_lock(&kvm->mmu_lock);
>> -ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
>> +ret = kvm_pgtable_stage2_map(pgt, addr, pgsize, pa, prot,
>>   &cache);
>>  spin_unlock(&kvm->mmu_lock);
>>  if (ret)
>>  break;
>>  
>> -pa += PAGE_SIZE;
>> +pa += pgsize

[PATCH v2 11/11] vfio/iommu_type1: Add support for manual dirty log clear

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

In the past, we cleared the dirty log immediately after syncing it
to userspace. This may cause redundant dirty handling if userspace
handles the dirty log iteratively:

After vfio clears the dirty log, new dirty log entries start to be
generated. These new entries will be reported to userspace even if
they were generated before userspace handled the same dirty page.

That is to say, we should minimize the time gap between dirty log
clearing and dirty log handling. We can give userspace an interface
to clear the dirty log.
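
To illustrate the intended usage, the userspace loop becomes roughly the
following (the clear step stands for the new uapi added below; only the
ordering matters here):

	/*
	 * 1. sync:  fetch the dirty bitmap from VFIO (no implicit clear)
	 * 2. for each dirty chunk:
	 *      a. clear the corresponding bits via the manual-clear uapi
	 *      b. copy the chunk to the destination
	 *
	 * With the old clear-at-sync behaviour, a write landing between the
	 * sync and the copy was re-reported in the next round even though the
	 * copy had already transferred the new data. Clearing right before
	 * the copy shrinks that window to almost nothing.
	 */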

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Rebase to newest code, so change VFIO_DIRTY_LOG_MANUAL_CLEAR from 9 to 11.

---
 drivers/vfio/vfio_iommu_type1.c | 104 ++--
 include/uapi/linux/vfio.h   |  28 -
 2 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a7ab0279eda0..94306f567894 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -77,6 +77,7 @@ struct vfio_iommu {
boolv2;
boolnesting;
booldirty_page_tracking;
+   booldirty_log_manual_clear;
boolpinned_page_dirty_scope;
boolcontainer_open;
uint64_tnum_non_hwdbm_groups;
@@ -1226,6 +1227,78 @@ static int vfio_iommu_dirty_log_clear(struct vfio_iommu 
*iommu,
return 0;
 }
 
+static int vfio_iova_dirty_log_clear(u64 __user *bitmap,
+struct vfio_iommu *iommu,
+dma_addr_t iova, size_t size,
+size_t pgsize)
+{
+   struct vfio_dma *dma;
+   struct rb_node *n;
+   dma_addr_t start_iova, end_iova, riova;
+   unsigned long pgshift = __ffs(pgsize);
+   unsigned long bitmap_size;
+   unsigned long *bitmap_buffer = NULL;
+   bool clear_valid;
+   int rs, re, start, end, dma_offset;
+   int ret = 0;
+
+   bitmap_size = DIRTY_BITMAP_BYTES(size >> pgshift);
+   bitmap_buffer = kvmalloc(bitmap_size, GFP_KERNEL);
+   if (!bitmap_buffer) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   if (copy_from_user(bitmap_buffer, bitmap, bitmap_size)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+   dma = rb_entry(n, struct vfio_dma, node);
+   if (!dma->iommu_mapped)
+   continue;
+   if ((dma->iova + dma->size - 1) < iova)
+   continue;
+   if (dma->iova > iova + size - 1)
+   break;
+
+   start_iova = max(iova, dma->iova);
+   end_iova = min(iova + size, dma->iova + dma->size);
+
+   /* Similar logic as the tail of vfio_iova_dirty_bitmap */
+
+   clear_valid = false;
+   start = (start_iova - iova) >> pgshift;
+   end = (end_iova - iova) >> pgshift;
+   bitmap_for_each_set_region(bitmap_buffer, rs, re, start, end) {
+   clear_valid = true;
+   riova = iova + (rs << pgshift);
+   dma_offset = (riova - dma->iova) >> pgshift;
+   bitmap_clear(dma->bitmap, dma_offset, re - rs);
+   }
+
+   if (clear_valid)
+   vfio_dma_populate_bitmap(dma, pgsize);
+
+   if (clear_valid && !iommu->pinned_page_dirty_scope &&
+   dma->iommu_mapped && !iommu->num_non_hwdbm_groups) {
+   ret = vfio_iommu_dirty_log_clear(iommu, start_iova,
+   end_iova - start_iova,  bitmap_buffer,
+   iova, pgsize);
+   if (ret) {
+   pr_warn("dma dirty log clear failed!\n");
+   goto out;
+   }
+   }
+
+   }
+
+out:
+   kfree(bitmap_buffer);
+   return ret;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
  struct vfio_dma *dma, dma_addr_t base_iova,
  size_t pgsize)
@@ -1275,6 +1348,11 @@ static int update_user_bitmap(u64 __user *bitmap, struct 
vfio_iommu *iommu,
 DIRTY_BITMAP_BYTES(nbits + shift)))
return -EFAULT;
 
+   /* Recover the bitmap under manual clear */
+   if (shift && iommu->dirty_log_manual_clear)
+   bitmap_shift_right(dma->bitmap, dma->bitmap, shift,
+  nbits + shift);
+
return 0;
 }
 

[PATCH v2 08/11] iommu/arm-smmu-v3: Add HWDBM device feature reporting

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

We have implemented the interfaces required to support iommu
dirty log tracking. The last step is reporting this feature to
the upper layer user, so the user can apply a higher-level policy
based on it.

This adds a new dev feature named IOMMU_DEV_FEAT_HWDBM in the iommu
layer. For arm smmuv3, it is equivalent to ARM_SMMU_FEAT_HD and is
enabled by default if supported. Other types of IOMMU can enable
it by default or when dev_enable_feature() is called.
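
For the consumer side, checking and enabling the feature goes through the
existing iommu_dev_feature_enabled()/iommu_dev_enable_feature() helpers; a
minimal sketch (this mirrors what the VFIO patch later in this series does for
each device of a group):

static bool dev_hwdbm_usable(struct device *dev)
{
	/* Usable if already enabled, or if we can enable it now. */
	if (iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_HWDBM))
		return true;
	return !iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_HWDBM);
}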

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - As dev_has_feature() has been removed from iommu layer, IOMMU_DEV_FEAT_HWDBM
   is designed to be used through "enable" interface.

---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 4 
 include/linux/iommu.h   | 1 +
 2 files changed, 5 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 696df51a3282..cd1627123e80 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2722,6 +2722,8 @@ static bool arm_smmu_dev_has_feature(struct device *dev,
switch (feat) {
case IOMMU_DEV_FEAT_SVA:
return arm_smmu_master_sva_supported(master);
+   case IOMMU_DEV_FEAT_HWDBM:
+   return !!(master->smmu->features & ARM_SMMU_FEAT_HD);
default:
return false;
}
@@ -2738,6 +2740,8 @@ static bool arm_smmu_dev_feature_enabled(struct device 
*dev,
switch (feat) {
case IOMMU_DEV_FEAT_SVA:
return arm_smmu_master_sva_enabled(master);
+   case IOMMU_DEV_FEAT_HWDBM:
+   return arm_smmu_dev_has_feature(dev, feat);
default:
return false;
}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4f7db5d23b23..88584a2d027c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -160,6 +160,7 @@ struct iommu_resv_region {
 enum iommu_dev_features {
IOMMU_DEV_FEAT_AUX, /* Aux-domain feature */
IOMMU_DEV_FEAT_SVA, /* Shared Virtual Addresses */
+   IOMMU_DEV_FEAT_HWDBM,   /* Hardware Dirty Bit Management */
 };
 
 #define IOMMU_PASID_INVALID(-1U)
-- 
2.19.1



[PATCH v2 01/11] iommu/arm-smmu-v3: Add support for Hardware Translation Table Update

2021-03-10 Thread Keqian Zhu
From: Jean-Philippe Brucker 

If the SMMU supports it and the kernel was built with HTTU support,
enable hardware update of access and dirty flags. This is essential for
shared page tables, to reduce the number of access faults on the fault
queue. Normal DMA with io-pgtables doesn't currently use the access or
dirty flags.

We can enable HTTU even if CPUs don't support it, because the kernel
always checks for HW dirty bit and updates the PTE flags atomically.

Signed-off-by: Jean-Philippe Brucker 
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  2 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 41 ++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  8 
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index bb251cab61f3..ae075e675892 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -121,10 +121,12 @@ static struct arm_smmu_ctx_desc 
*arm_smmu_alloc_shared_cd(struct mm_struct *mm)
if (err)
goto out_free_asid;
 
+   /* HA and HD will be filtered out later if not supported by the SMMU */
tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - vabits_actual) |
  FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) |
  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) |
  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) |
+ CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
 
switch (PAGE_SIZE) {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8594b4a83043..b6d965504f44 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1012,10 +1012,17 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
*smmu_domain, int ssid,
 * this substream's traffic
 */
} else { /* (1) and (2) */
+   u64 tcr = cd->tcr;
+
cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
cdptr[2] = 0;
cdptr[3] = cpu_to_le64(cd->mair);
 
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   tcr &= ~CTXDESC_CD_0_TCR_HD;
+   if (!(smmu->features & ARM_SMMU_FEAT_HA))
+   tcr &= ~CTXDESC_CD_0_TCR_HA;
+
/*
 * STE is live, and the SMMU might read dwords of this CD in any
 * order. Ensure that it observes valid values before reading
@@ -1023,7 +1030,7 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain 
*smmu_domain, int ssid,
 */
arm_smmu_sync_cd(smmu_domain, ssid, true);
 
-   val = cd->tcr |
+   val = tcr |
 #ifdef __BIG_ENDIAN
CTXDESC_CD_0_ENDI |
 #endif
@@ -3196,6 +3203,28 @@ static int arm_smmu_device_reset(struct arm_smmu_device 
*smmu, bool bypass)
return 0;
 }
 
+static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
+{
+   u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | 
ARM_SMMU_FEAT_HD);
+   u32 features = 0;
+
+   switch (FIELD_GET(IDR0_HTTU, reg)) {
+   case IDR0_HTTU_ACCESS_DIRTY:
+   features |= ARM_SMMU_FEAT_HD;
+   fallthrough;
+   case IDR0_HTTU_ACCESS:
+   features |= ARM_SMMU_FEAT_HA;
+   }
+
+   if (smmu->dev->of_node)
+   smmu->features |= features;
+   else if (features != fw_features)
+   /* ACPI IORT sets the HTTU bits */
+   dev_warn(smmu->dev,
+"IDR0.HTTU overridden by FW configuration (0x%x)\n",
+fw_features);
+}
+
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
u32 reg;
@@ -3256,6 +3285,8 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
smmu->features |= ARM_SMMU_FEAT_E2H;
}
 
+   arm_smmu_get_httu(smmu, reg);
+
/*
 * The coherency feature as set by FW is used in preference to the ID
 * register, but warn on mismatch.
@@ -3441,6 +3472,14 @@ static int arm_smmu_device_acpi_probe(struct 
platform_device *pdev,
if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
smmu->features |= ARM_SMMU_FEAT_COHERENCY;
 
+   switch (FIELD_GET(ACPI_IORT_SMMU_V3_HTTU_OVERRIDE, iort_smmu->flags)) {
+   case IDR0_HTTU_ACCESS_DIRTY:
+   smmu->features |= ARM_SMMU_FEAT_HD;
+   fallthrough;
+   case IDR0_HTTU_ACCESS:
+   smmu->features |= ARM_SMMU_FEAT_HA;
+   }
+
return 0;
 }
 #else
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 

[PATCH v2 04/11] iommu/arm-smmu-v3: Split block descriptor when start dirty log

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

Block descriptor is not a proper granule for dirty log tracking.
Take an extreme example, if DMA writes one byte, under 1G mapping,
the dirty amount reported to userspace is 1G, but under 4K mapping,
the dirty amount is just 4K.

This adds a new interface named start_dirty_log in the iommu layer
and arm smmuv3 implements it, which splits block descriptors into a
span of page descriptors. Other types of IOMMU will perform
architecture-specific actions to start dirty log.

To allow code reuse, the split_block operation is realized as an
iommu_ops too. We flush all iotlbs after the whole procedure is
completed to ease the pressure on the iommu, as we will handle a huge
range of mappings in general.

Splitting blocks does not work simultaneously with other pgtable ops,
as the only designed user is vfio, which always holds a lock, so race
conditions are not considered in the pgtable ops.
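
For readers following the layering, the call flow added by this patch is
roughly as below (SMMUv3 callback names shown; other IOMMUs would plug in
their own, and the single IOTLB flush mentioned above lives in the
iommu-layer helper):

/*
 * vfio (start dirty log tracking)
 *   -> iommu_start_dirty_log(domain, iova, size)
 *        -> domain->ops->start_dirty_log()         // arm_smmu_start_dirty_log
 *             -> iommu_split_block(domain, iova, size)
 *                  -> domain->ops->split_block()   // arm_smmu_split_block
 *                       -> io-pgtable ops->split_block()
 *                  -> one IOTLB flush covering the whole range
 */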

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Change the return type of split_block(). size_t -> int.
 - Change commit message to properly describe race condition. (Robin)
 - Change commit message to properly describe the need of split block.
 - Add a new interface named start_dirty_log(). (Sun Yi)
 - Change commit message to explain the relationship of split_block() and 
start_dirty_log().

---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  52 +
 drivers/iommu/io-pgtable-arm.c  | 122 
 drivers/iommu/iommu.c   |  48 
 include/linux/io-pgtable.h  |   2 +
 include/linux/iommu.h   |  24 
 5 files changed, 248 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 443ac19c6da9..5d2fb926a08e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2537,6 +2537,56 @@ static int arm_smmu_domain_set_attr(struct iommu_domain 
*domain,
return ret;
 }
 
+static int arm_smmu_split_block(struct iommu_domain *domain,
+   unsigned long iova, size_t size)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   size_t handled_size;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2, can't split 
block\n");
+   return -ENODEV;
+   }
+   if (!ops || !ops->split_block) {
+   pr_err("io-pgtable don't realize split block\n");
+   return -ENODEV;
+   }
+
+   handled_size = ops->split_block(ops, iova, size);
+   if (handled_size != size) {
+   pr_err("split block failed\n");
+   return -EFAULT;
+   }
+
+   return 0;
+}
+
+/*
+ * For SMMU, the action to start dirty log is spliting block mapping. The
+ * hardware dirty management is always enabled if hardware supports HTTU HD.
+ */
+static int arm_smmu_start_dirty_log(struct iommu_domain *domain,
+   unsigned long iova, size_t size)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   return -ENODEV;
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   /*
+* Even if the split operation fail, we can still track dirty at block
+* granule, which is still a much better choice compared to full dirty
+* policy.
+*/
+   iommu_split_block(domain, iova, size);
+   return 0;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2636,6 +2686,8 @@ static struct iommu_ops arm_smmu_ops = {
.device_group   = arm_smmu_device_group,
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
+   .split_block= arm_smmu_split_block,
+   .start_dirty_log= arm_smmu_start_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 94d790b8ed27..4c4eec3c0698 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -79,6 +79,8 @@
 #define ARM_LPAE_PTE_SH_IS (((arm_lpae_iopte)3) << 8)
 #define ARM_LPAE_PTE_NS(((arm_lpae_iopte)1) << 5)
 #define ARM_LPAE_PTE_VALID  

[PATCH v2 05/11] iommu/arm-smmu-v3: Merge a span of page when stop dirty log

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

When we stop dirty log tracking, we need to recover all block descriptors
that were split when dirty log tracking started.

This adds a new interface named stop_dirty_log in the iommu layer and
arm smmuv3 implements it, which reinstalls block mappings and unmaps
the span of page mappings. Other types of IOMMU perform
architecture-specific actions to stop dirty log.

To allow code reuse, the merge_page operation is realized as an
iommu_ops too. We flush all iotlbs after the whole procedure is
completed to ease the pressure on the iommu, as we will handle a huge
range of mappings in general.

Merging pages does not work simultaneously with other pgtable ops,
as the only designed user is vfio, which always holds a lock, so race
conditions are not considered in the pgtable ops.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Change the return type of merge_page(). size_t -> int.
 - Change commit message to properly describe race condition. (Robin)
 - Add a new interface named stop_dirty_log(). (Sun Yi)
 - Change commit message to explain the relationship of merge_page() and 
stop_dirty_log().
 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 52 +
 drivers/iommu/io-pgtable-arm.c  | 78 
 drivers/iommu/iommu.c   | 82 +
 include/linux/io-pgtable.h  |  2 +
 include/linux/iommu.h   | 24 ++
 5 files changed, 238 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5d2fb926a08e..ac0d881c77b8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2587,6 +2587,56 @@ static int arm_smmu_start_dirty_log(struct iommu_domain 
*domain,
return 0;
 }
 
+static int arm_smmu_merge_page(struct iommu_domain *domain,
+  unsigned long iova, phys_addr_t paddr,
+  size_t size, int prot)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   size_t handled_size;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2, can't merge page\n");
+   return -ENODEV;
+   }
+   if (!ops || !ops->merge_page) {
+   pr_err("io-pgtable don't realize merge page\n");
+   return -ENODEV;
+   }
+
+   handled_size = ops->merge_page(ops, iova, paddr, size, prot);
+   if (handled_size != size) {
+   pr_err("merge page failed\n");
+   return -EFAULT;
+   }
+
+   return 0;
+}
+
+/*
+ * For SMMU, the action to stop dirty log is merge page mapping. The hardware
+ * dirty management is always enabled if hardware supports HTTU HD.
+ */
+static int arm_smmu_stop_dirty_log(struct iommu_domain *domain,
+  unsigned long iova, size_t size, int prot)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   return -ENODEV;
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   /*
+* Even if the merge operation fail, it just effects performace of DMA
+* transaction.
+*/
+   iommu_merge_page(domain, iova, size, prot);
+   return 0;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2688,6 +2738,8 @@ static struct iommu_ops arm_smmu_ops = {
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
.start_dirty_log= arm_smmu_start_dirty_log,
+   .merge_page = arm_smmu_merge_page,
+   .stop_dirty_log = arm_smmu_stop_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 4c4eec3c0698..9028328b99b0 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -800,6 +800,83 @@ static size_t arm_lpae_split_block(struct io_pgtable_ops 
*ops,
return __arm_lpae_split_block(data, iova, size, lvl, ptep);
 }
 
+static size_t __arm_lpae_merge_page(struct arm_lpae_io_pgtable *data,
+   unsigned long iova, phys_addr_t paddr,
+   size_t size, int lvl, arm_lpae_iopte *ptep,
+   

[PATCH v2 09/11] vfio/iommu_type1: Add HWDBM status maintenance

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

We are going to optimize dirty log tracking based on the iommu
HWDBM feature, but the dirty log from the iommu is useful only
when all iommu-backed groups are attached to iommus that have the
HWDBM feature. This maintains a counter to track that condition.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Simplify vfio_group_supports_hwdbm().
 - AS feature report of HWDBM has been changed, so change 
vfio_dev_has_feature() to
   vfio_dev_enable_feature().

---
 drivers/vfio/vfio_iommu_type1.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4bb162c1d649..876351c061e4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -79,6 +79,7 @@ struct vfio_iommu {
booldirty_page_tracking;
boolpinned_page_dirty_scope;
boolcontainer_open;
+   uint64_tnum_non_hwdbm_groups;
 };
 
 struct vfio_domain {
@@ -116,6 +117,7 @@ struct vfio_group {
struct list_headnext;
boolmdev_group; /* An mdev group */
boolpinned_page_dirty_scope;
+   booliommu_hwdbm;/* For iommu-backed group */
 };
 
 struct vfio_iova {
@@ -1187,6 +1189,24 @@ static void vfio_update_pgsize_bitmap(struct vfio_iommu 
*iommu)
}
 }
 
+static int vfio_dev_enable_feature(struct device *dev, void *data)
+{
+   enum iommu_dev_features *feat = data;
+
+   if (iommu_dev_feature_enabled(dev, *feat))
+   return 0;
+
+   return iommu_dev_enable_feature(dev, *feat);
+}
+
+static bool vfio_group_supports_hwdbm(struct vfio_group *group)
+{
+   enum iommu_dev_features feat = IOMMU_DEV_FEAT_HWDBM;
+
+   return !iommu_group_for_each_dev(group->iommu_group, &feat,
+vfio_dev_enable_feature);
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
  struct vfio_dma *dma, dma_addr_t base_iova,
  size_t pgsize)
@@ -2435,6 +2455,12 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
 * capable via the page pinning interface.
 */
iommu->num_non_pinned_groups++;
+
+   /* Update the hwdbm status of group and iommu */
+   group->iommu_hwdbm = vfio_group_supports_hwdbm(group);
+   if (!group->iommu_hwdbm)
+   iommu->num_non_hwdbm_groups++;
+
mutex_unlock(&iommu->lock);
vfio_iommu_resv_free(_resv_regions);
 
@@ -2571,6 +2597,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
struct vfio_domain *domain;
struct vfio_group *group;
bool update_dirty_scope = false;
+   bool update_iommu_hwdbm = false;
LIST_HEAD(iova_copy);
 
mutex_lock(&iommu->lock);
@@ -2609,6 +2636,7 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
 
vfio_iommu_detach_group(domain, group);
update_dirty_scope = !group->pinned_page_dirty_scope;
+   update_iommu_hwdbm = !group->iommu_hwdbm;
	list_del(&group->next);
kfree(group);
/*
@@ -2651,6 +2679,8 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
if (iommu->dirty_page_tracking)
vfio_iommu_populate_bitmap_full(iommu);
}
+   if (update_iommu_hwdbm)
+   iommu->num_non_hwdbm_groups--;
	mutex_unlock(&iommu->lock);
 }
 
-- 
2.19.1



[PATCH v2 03/11] iommu/arm-smmu-v3: Add feature detection for BBML

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

When altering a translation table descriptor for certain reasons, we
require the break-before-make procedure. But it might cause problems when
the TTD is live, as the I/O streams might not tolerate translation faults.

If the SMMU supports BBM level 1 or BBM level 2, we can change the block
size without using break-before-make sequence.

This adds feature detection for BBML; no functional change expected.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Use two new quirk flags named IO_PGTABLE_QUIRK_ARM_BBML1/2 to transfer
   SMMU BBML feature to io-pgtable. (Robin)
   
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++
 include/linux/io-pgtable.h  |  8 
 3 files changed, 33 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 369c0ea7a104..443ac19c6da9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2030,6 +2030,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
if (smmu->features & ARM_SMMU_FEAT_HD)
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
 
+   if (smmu->features & ARM_SMMU_FEAT_BBML1)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
+   else if (smmu->features & ARM_SMMU_FEAT_BBML2)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML2;
+
	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
if (!pgtbl_ops)
return -ENOMEM;
@@ -3373,6 +3378,20 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
 
/* IDR3 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+   switch (FIELD_GET(IDR3_BBML, reg)) {
+   case IDR3_BBML0:
+   break;
+   case IDR3_BBML1:
+   smmu->features |= ARM_SMMU_FEAT_BBML1;
+   break;
+   case IDR3_BBML2:
+   smmu->features |= ARM_SMMU_FEAT_BBML2;
+   break;
+   default:
+   dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
+   return -ENXIO;
+   }
+
if (FIELD_GET(IDR3_RIL, reg))
smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 26d6b935b383..a74125675544 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -54,6 +54,10 @@
 #define IDR1_SIDSIZE   GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3  0xc
+#define IDR3_BBML  GENMASK(12, 11)
+#define IDR3_BBML0 0
+#define IDR3_BBML1 1
+#define IDR3_BBML2 2
 #define IDR3_RIL   (1 << 10)
 
 #define ARM_SMMU_IDR5  0x14
@@ -615,6 +619,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_E2H  (1 << 18)
 #define ARM_SMMU_FEAT_HA   (1 << 19)
 #define ARM_SMMU_FEAT_HD   (1 << 20)
+#define ARM_SMMU_FEAT_BBML1(1 << 21)
+#define ARM_SMMU_FEAT_BBML2(1 << 22)
u32 features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 64cee6831c97..857932357f1d 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -84,6 +84,12 @@ struct io_pgtable_cfg {
 *  attributes set in the TCR for a non-coherent page-table walker.
 *
 * IO_PGTABLE_QUIRK_ARM_HD: Support hardware management of dirty status.
+*
+* IO_PGTABLE_QUIRK_ARM_BBML1: ARM SMMU supports BBM Level 1 behavior
+*  when changing block size.
+*
+* IO_PGTABLE_QUIRK_ARM_BBML2: ARM SMMU supports BBM Level 2 behavior
+* when changing block size.
 */
#define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
#define IO_PGTABLE_QUIRK_NO_PERMS   BIT(1)
@@ -92,6 +98,8 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_ARM_TTBR1  BIT(5)
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
#define IO_PGTABLE_QUIRK_ARM_HD BIT(7)
+   #define IO_PGTABLE_QUIRK_ARM_BBML1  BIT(8)
+   #define IO_PGTABLE_QUIRK_ARM_BBML2  BIT(9)
unsigned long   quirks;
unsigned long   pgsize_bitmap;
unsigned intias;
-- 
2.19.1



[PATCH v2 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

In the past, if the vfio_iommu is not of pinned_page_dirty_scope and
the vfio_dma is iommu_mapped, we populate the full dirty bitmap for
this vfio_dma. Now we can try to get the dirty log from the iommu
before making that coarse decision.

In detail, if all vfio_groups are of pinned_page_dirty_scope, dirty
bitmap population is not affected. If there are vfio_groups not of
pinned_page_dirty_scope and all their domains support HWDBM, then we
try to get the dirty log from the IOMMU. Otherwise, we fall back to
the full dirty bitmap.

We should also start dirty log tracking for newly added dma ranges and domains.
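
Condensed, the decision described above (implemented in the update_user_bitmap()
hunk below) looks roughly like the following helper. This is an illustrative
sketch only, not part of the patch; the field names follow the diffs in this
series.

/*
 * Illustrative sketch (not part of this patch): the decision that
 * update_user_bitmap() below encodes for a single vfio_dma.
 */
static bool use_iommu_dirty_log(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
	/* Pinned-page reporting already yields an exact bitmap here. */
	if (!iommu->num_non_pinned_groups || !dma->iommu_mapped)
		return false;

	/* Only trust the IOMMU dirty log when every group supports HWDBM. */
	return iommu->num_non_hwdbm_groups == 0;
}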

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Use the new interfaces to start|stop dirty log, as split_block|merge_page
   are ARM SMMU specific. (Sun Yi)
 - Bugfix: Start dirty log for newly added dma range and domain.
 
---
 drivers/vfio/vfio_iommu_type1.c | 136 +++-
 1 file changed, 132 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 876351c061e4..a7ab0279eda0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1207,6 +1207,25 @@ static bool vfio_group_supports_hwdbm(struct vfio_group 
*group)
 vfio_dev_enable_feature);
 }
 
+static int vfio_iommu_dirty_log_clear(struct vfio_iommu *iommu,
+ dma_addr_t start_iova, size_t size,
+ unsigned long *bitmap_buffer,
+ dma_addr_t base_iova, size_t pgsize)
+{
+   struct vfio_domain *d;
+   unsigned long pgshift = __ffs(pgsize);
+   int ret;
+
	list_for_each_entry(d, &iommu->domain_list, next) {
+   ret = iommu_clear_dirty_log(d->domain, start_iova, size,
+   bitmap_buffer, base_iova, pgshift);
+   if (ret)
+   return ret;
+   }
+
+   return 0;
+}
+
 static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
  struct vfio_dma *dma, dma_addr_t base_iova,
  size_t pgsize)
@@ -1218,13 +1237,28 @@ static int update_user_bitmap(u64 __user *bitmap, 
struct vfio_iommu *iommu,
unsigned long shift = bit_offset % BITS_PER_LONG;
unsigned long leftover;
 
+   if (!iommu->num_non_pinned_groups || !dma->iommu_mapped)
+   goto bitmap_done;
+
+   /* try to get dirty log from IOMMU */
+   if (!iommu->num_non_hwdbm_groups) {
+   struct vfio_domain *d;
+
		list_for_each_entry(d, &iommu->domain_list, next) {
+   if (iommu_sync_dirty_log(d->domain, dma->iova, 
dma->size,
+   dma->bitmap, dma->iova, 
pgshift))
+   return -EFAULT;
+   }
+   goto bitmap_done;
+   }
+
/*
 * mark all pages dirty if any IOMMU capable device is not able
 * to report dirty pages and all pages are pinned and mapped.
 */
-   if (iommu->num_non_pinned_groups && dma->iommu_mapped)
-   bitmap_set(dma->bitmap, 0, nbits);
+   bitmap_set(dma->bitmap, 0, nbits);
 
+bitmap_done:
if (shift) {
bitmap_shift_left(dma->bitmap, dma->bitmap, shift,
  nbits + shift);
@@ -1286,6 +1320,18 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, 
struct vfio_iommu *iommu,
 */
bitmap_clear(dma->bitmap, 0, dma->size >> pgshift);
vfio_dma_populate_bitmap(dma, pgsize);
+
+   /* Clear iommu dirty log to re-enable dirty log tracking */
+   if (!iommu->pinned_page_dirty_scope &&
+   dma->iommu_mapped && !iommu->num_non_hwdbm_groups) {
+   ret = vfio_iommu_dirty_log_clear(iommu, dma->iova,
+   dma->size, dma->bitmap, dma->iova,
+   pgsize);
+   if (ret) {
+   pr_warn("dma dirty log clear failed!\n");
+   return ret;
+   }
+   }
}
return 0;
 }
@@ -1561,6 +1607,9 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu 
*iommu,
return list_empty(iova);
 }
 
+static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
+struct vfio_dma *dma);
+
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
   struct vfio_iommu_type1_dma_map *map)
 {
@@ -1684,8 +1733,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 
if (!ret && iommu->dirty_page_tracking) {
ret = vfio_dma_

[PATCH v2 07/11] iommu/arm-smmu-v3: Clear dirty log according to bitmap

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

After the dirty log is retrieved, the user should clear it to re-enable
dirty log tracking for the dirtied pages.

This adds a new interface named clear_dirty_log in the iommu layer and
implements it for arm smmuv3: it clears the dirty state (as we only
enable HTTU for stage 1, this means setting the AP[2] bit) of the TTDs
specified by the user-provided bitmap.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Add new sanity check in arm_smmu_sync_dirty_log(). (smmu_domain->stage != 
ARM_SMMU_DOMAIN_S1)
 - Remove extra flush_iotlb in __iommu_clear_dirty_log().
 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 ++
 drivers/iommu/io-pgtable-arm.c  | 95 +
 drivers/iommu/iommu.c   | 68 +++
 include/linux/io-pgtable.h  |  4 +
 include/linux/iommu.h   | 17 
 5 files changed, 209 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7407896a710e..696df51a3282 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2666,6 +2666,30 @@ static int arm_smmu_sync_dirty_log(struct iommu_domain 
*domain,
   bitmap_pgshift);
 }
 
+static int arm_smmu_clear_dirty_log(struct iommu_domain *domain,
+   unsigned long iova, size_t size,
+   unsigned long *bitmap,
+   unsigned long base_iova,
+   unsigned long bitmap_pgshift)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   return -ENODEV;
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (!ops || !ops->clear_dirty_log) {
+   pr_err("io-pgtable does not implement clear_dirty_log\n");
+   return -ENODEV;
+   }
+
+   return ops->clear_dirty_log(ops, iova, size, bitmap, base_iova,
+   bitmap_pgshift);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2770,6 +2794,7 @@ static struct iommu_ops arm_smmu_ops = {
.merge_page = arm_smmu_merge_page,
.stop_dirty_log = arm_smmu_stop_dirty_log,
.sync_dirty_log = arm_smmu_sync_dirty_log,
+   .clear_dirty_log= arm_smmu_clear_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 67a208a05ab2..e3ef0f50611c 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -966,6 +966,100 @@ static int arm_lpae_sync_dirty_log(struct io_pgtable_ops 
*ops,
 bitmap, base_iova, bitmap_pgshift);
 }
 
+static int __arm_lpae_clear_dirty_log(struct arm_lpae_io_pgtable *data,
+ unsigned long iova, size_t size,
+ int lvl, arm_lpae_iopte *ptep,
+ unsigned long *bitmap,
+ unsigned long base_iova,
+ unsigned long bitmap_pgshift)
+{
+   arm_lpae_iopte pte;
+   struct io_pgtable *iop = &data->iop;
+   unsigned long offset;
+   size_t base, next_size;
+   int nbits, ret, i;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return -EINVAL;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt)) {
+   if (pte & ARM_LPAE_PTE_AP_RDONLY)
+   return 0;
+
+   /* Ensure all corresponding bits are set */
+   nbits = size >> bitmap_pgshift;
+   offset = (iova - base_iova) >> bitmap_pgshift;
+   for (i = offset; i < offset + nbits; i++) {
+   if (!test_bit(i, bitmap))
+   return 0;
+   }
+
+   /* Race does not exist */
+   pte |= ARM_LPAE_PTE_AP_RDONLY;
+   __arm_lpae_set_pte(ptep, pte, &iop->cfg);
+   return 0;
+   

[PATCH v2 02/11] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

If HTTU is supported, we enable the HA/HD bits in the SMMU CD (stage 1
mapping), and set the DBM bit for writable TTDs.

The dirty state is encoded using the access permission bits AP[2]
(stage 1) or S2AP[1] (stage 2) in conjunction with the DBM (Dirty Bit
Modifier) bit: DBM set means the page is writable, and hardware marks it
dirty by clearing AP[2] (stage 1) or setting S2AP[1] (stage 2).
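
For stage 1, the resulting states can be sketched as the small classifier
below. This is illustrative only, not part of the patch; the bit positions
follow io-pgtable-arm.c (DBM at bit 51 as added by this patch, AP[2] as the
read-only bit at bit 7, i.e. ARM_LPAE_PTE_AP_RDONLY).

/*
 * Illustrative sketch: classify a stage 1 leaf TTD under the DBM/AP[2]
 * encoding described above. Not part of the patch.
 */
typedef unsigned long long arm_lpae_iopte;

#define PTE_DBM		((arm_lpae_iopte)1 << 51)	/* bit 51, per this patch */
#define PTE_AP_RDONLY	((arm_lpae_iopte)2 << 6)	/* AP[2], read-only bit */

enum ttd_state { TTD_READONLY, TTD_WRITABLE_CLEAN, TTD_WRITABLE_DIRTY };

static enum ttd_state classify_ttd(arm_lpae_iopte pte)
{
	if (!(pte & PTE_DBM))
		return TTD_READONLY;		/* DBM == 0, AP[2] == 1 */
	if (pte & PTE_AP_RDONLY)
		return TTD_WRITABLE_CLEAN;	/* DBM == 1, AP[2] == 1 */
	return TTD_WRITABLE_DIRTY;		/* DBM == 1, AP[2] == 0 */
}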

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Use a new quirk flag named IO_PGTABLE_QUIRK_ARM_HD to transfer
   SMMU HD feature to io-pgtable. (Robin)

 - Rebase on Jean's HTTU patch(#1).

---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
 drivers/iommu/io-pgtable-arm.c  | 7 ++-
 include/linux/io-pgtable.h  | 3 +++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b6d965504f44..369c0ea7a104 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1921,6 +1921,7 @@ static int arm_smmu_domain_finalise_s1(struct 
arm_smmu_domain *smmu_domain,
  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
+ CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
cfg->cd.mair= pgtbl_cfg->arm_lpae_s1_cfg.mair;
 
@@ -2026,6 +2027,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
 
if (smmu_domain->non_strict)
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+   if (smmu->features & ARM_SMMU_FEAT_HD)
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
 
	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
if (!pgtbl_ops)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..94d790b8ed27 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -72,6 +72,7 @@
 
 #define ARM_LPAE_PTE_NSTABLE   (((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM   (((arm_lpae_iopte)1) << 51)
 #define ARM_LPAE_PTE_AF(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
@@ -81,7 +82,7 @@
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK  (((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_HI_MASK  (((arm_lpae_iopte)13) << 51)
 #define ARM_LPAE_PTE_ATTR_MASK (ARM_LPAE_PTE_ATTR_LO_MASK |\
 ARM_LPAE_PTE_ATTR_HI_MASK)
 /* Software bit for solving coherency races */
@@ -379,6 +380,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, 
unsigned long iova,
 static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
   int prot)
 {
+   struct io_pgtable_cfg *cfg = &data->iop.cfg;
arm_lpae_iopte pte;
 
if (data->iop.fmt == ARM_64_LPAE_S1 ||
@@ -386,6 +388,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
arm_lpae_io_pgtable *data,
pte = ARM_LPAE_PTE_nG;
if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
pte |= ARM_LPAE_PTE_AP_RDONLY;
+   else if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_HD)
+   pte |= ARM_LPAE_PTE_DBM;
+
if (!(prot & IOMMU_PRIV))
pte |= ARM_LPAE_PTE_AP_UNPRIV;
} else {
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index a4c9ca2c31f1..64cee6831c97 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -82,6 +82,8 @@ struct io_pgtable_cfg {
 *
 * IO_PGTABLE_QUIRK_ARM_OUTER_WBWA: Override the outer-cacheability
 *  attributes set in the TCR for a non-coherent page-table walker.
+*
+* IO_PGTABLE_QUIRK_ARM_HD: Support hardware management of dirty status.
 */
#define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
#define IO_PGTABLE_QUIRK_NO_PERMS   BIT(1)
@@ -89,6 +91,7 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_NON_STRICT BIT(4)
#define IO_PGTABLE_QUIRK_ARM_TTBR1  BIT(5)
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
+   #define IO_PGTABLE_QUIRK_ARM_HD BIT(7)
unsigned long   quirks;
unsigned long   pgsize_bitmap;
unsigned intias;
-- 
2.19.1



[PATCH v2 00/11] vfio/iommu_type1: Implement dirty log tracking based on smmuv3 HTTU

2021-03-10 Thread Keqian Zhu
Hi all,

This patch series implement vfio dma dirty log tracking based on smmuv3 HTTU.

changelog:

v2:
 - Address all comments of RFC version, thanks for all of you ;-)
 - Add a bugfix that start dirty log for newly added dma ranges and domain.

Intention:

As we know, vfio live migration is an important and valuable feature, but there
are still many hurdles to solve, including migration of interrupts, device
state, DMA dirty log tracking, etc.

For now, the only dirty log tracking interface is pinning. It has some drawbacks:
1. Only smart vendor drivers are aware of this.
2. It's coarse-grained: the pinned scope is generally bigger than what the
   device actually accesses.
3. It can't track dirty pages continuously and precisely; vfio populates the
   whole pinned scope as dirty, so it doesn't work well with iterative dirty
   log handling.

About SMMU HTTU:

HTTU (Hardware Translation Table Update) is a feature of ARM SMMUv3 that can
update the access flag and/or the dirty state of the TTD (Translation Table
Descriptor) by hardware. With HTTU, a stage 1 TTD is classified into 3 types:

                      DBM bit    AP[2] (readonly bit)
1. writable_clean        1               1
2. writable_dirty        1               0
3. readonly              0               1

If HTTU_HD (hardware management of dirty state) is enabled, the smmu changes a
TTD from writable_clean to writable_dirty on write. Software can then scan the
TTDs to sync the dirty state into a dirty bitmap. With this feature, we can
track the dirty log of DMA continuously and precisely.
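
The bookkeeping behind "sync dirty state into dirty bitmap" is simple index
arithmetic; a minimal sketch follows, mirroring the logic added to
io-pgtable-arm.c later in this series. The function name here is illustrative.

#include <linux/bitmap.h>

/*
 * Sketch only: record the pages covered by one writable_dirty leaf TTD in a
 * caller-provided bitmap. Mirrors the arithmetic used by
 * __arm_lpae_sync_dirty_log() in this series.
 */
static void mark_dirty_range(unsigned long *bitmap, unsigned long base_iova,
			     unsigned long bitmap_pgshift,
			     unsigned long iova, size_t size)
{
	unsigned long offset = (iova - base_iova) >> bitmap_pgshift;
	unsigned long nbits = size >> bitmap_pgshift;

	/* One bit per (1 << bitmap_pgshift) bytes of IOVA space. */
	bitmap_set(bitmap, offset, nbits);
}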

About this series:

Patch 1-3: Add feature detection for smmu HTTU and enable HTTU for smmu stage1 
mapping.
   And add feature detection for smmu BBML. We need to split block 
mapping when
   start dirty log tracking and merge page mapping when stop dirty log 
tracking,
   which requires break-before-make procedure. But it might 
cause problems when the
   TTD is alive. The I/O streams might not tolerate translation 
faults. So BBML
   should be used.

Patch 4-7: Add four interfaces (start_dirty_log, stop_dirty_log, sync_dirty_log
           and clear_dirty_log) in the IOMMU layer; they are essential to
           implement DMA dirty log tracking for vfio. We implement these
           interfaces for arm smmuv3 (see the usage sketch after this list).

Patch   8: Add HWDBM (Hardware Dirty Bit Management) device feature reporting 
in IOMMU layer.

Patch9-11: Implement a new dirty log tracking method for vfio based on iommu 
hwdbm. A new
   ioctl operation named VFIO_DIRTY_LOG_MANUAL_CLEAR is added, which 
can eliminate
   some redundant dirty handling of userspace.
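
Roughly, the intended calling sequence from the vfio side is sketched below.
iommu_sync_dirty_log() and iommu_clear_dirty_log() match the interfaces added
by patches 6-7; iommu_start_dirty_log() and iommu_stop_dirty_log() are assumed
wrappers around the start_dirty_log/stop_dirty_log ops added by patches 4-5 and
may differ in their exact signatures.

/*
 * Sketch of the intended life cycle for one IOVA range (not literal code
 * from this series). The start/stop wrappers are assumptions, see above.
 */
static int dirty_track_one_round(struct iommu_domain *domain,
				 unsigned long iova, size_t size,
				 unsigned long *bitmap, unsigned long pgshift)
{
	int ret;

	/* 1. Split block mappings and arm HTTU-based dirty tracking. */
	ret = iommu_start_dirty_log(domain, iova, size);
	if (ret)
		return ret;

	/* 2. Each migration iteration: pull dirty bits from the page table... */
	ret = iommu_sync_dirty_log(domain, iova, size, bitmap, iova, pgshift);
	if (ret)
		return ret;

	/* 3. ...and write-protect the reported pages to re-arm tracking. */
	ret = iommu_clear_dirty_log(domain, iova, size, bitmap, iova, pgshift);
	if (ret)
		return ret;

	/* 4. When tracking ends: merge pages back and disable HTTU tracking. */
	return iommu_stop_dirty_log(domain, iova, size);
}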

Optimizations To Do:

1. We recognized that each smmu_domain (a vfio_container may have several
   smmu_domains) has its own stage 1 mapping, and we must scan all these
   mappings to sync dirty state. We plan to refactor smmu_domain to support
   more than one smmu in one smmu_domain, so these smmus can share the same
   stage 1 mapping.
2. We also recognized that scanning TTDs is a performance hotspot. Recently, I
   have implemented a SW/HW combined dirty log tracking at the MMU side [1],
   which can effectively solve this problem. This idea can be applied to the
   smmu side too.

Thanks,
Keqian


[1] 
https://lore.kernel.org/linux-arm-kernel/2021012612.27136-1-zhukeqi...@huawei.com/

Jean-Philippe Brucker (1):
  iommu/arm-smmu-v3: Add support for Hardware Translation Table Update

jiangkunkun (10):
  iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  iommu/arm-smmu-v3: Add feature detection for BBML
  iommu/arm-smmu-v3: Split block descriptor when start dirty log
  iommu/arm-smmu-v3: Merge a span of page when stop dirty log
  iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log
  iommu/arm-smmu-v3: Clear dirty log according to bitmap
  iommu/arm-smmu-v3: Add HWDBM device feature reporting
  vfio/iommu_type1: Add HWDBM status maintenance
  vfio/iommu_type1: Optimize dirty bitmap population based on iommu
HWDBM
  vfio/iommu_type1: Add support for manual dirty log clear

 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |   2 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 226 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  14 +
 drivers/iommu/io-pgtable-arm.c| 392 +-
 drivers/iommu/iommu.c | 236 +++
 drivers/vfio/vfio_iommu_type1.c   | 270 +++-
 include/linux/io-pgtable.h|  23 +
 include/linux/iommu.h |  84 
 include/uapi/linux/vfio.h |  28 +-
 9 files changed, 1264 insertions(+), 11 deletions(-)

-- 
2.19.1



[PATCH v2 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-03-10 Thread Keqian Zhu
From: jiangkunkun 

During dirty log tracking, the user will try to retrieve the dirty log from
the iommu if it supports hardware dirty logging.

This adds a new interface named sync_dirty_log in the iommu layer and
implements it for arm smmuv3: it scans leaf TTDs and treats a TTD as dirty
if it is writable (as we only enable HTTU for stage 1, this means checking
that AP[2] is not set).

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---

changelog:

v2:
 - Add new sanity check in arm_smmu_sync_dirty_log(). (smmu_domain->stage != 
ARM_SMMU_DOMAIN_S1)
 - Document the purpose of flush_iotlb in arm_smmu_sync_dirty_log(). (Robin)
 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 30 +++
 drivers/iommu/io-pgtable-arm.c  | 90 +
 drivers/iommu/iommu.c   | 38 +
 include/linux/io-pgtable.h  |  4 +
 include/linux/iommu.h   | 18 +
 5 files changed, 180 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index ac0d881c77b8..7407896a710e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2637,6 +2637,35 @@ static int arm_smmu_stop_dirty_log(struct iommu_domain 
*domain,
return 0;
 }
 
+static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
+  unsigned long iova, size_t size,
+  unsigned long *bitmap,
+  unsigned long base_iova,
+  unsigned long bitmap_pgshift)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HD))
+   return -ENODEV;
+   if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+   return -EINVAL;
+
+   if (!ops || !ops->sync_dirty_log) {
+   pr_err("io-pgtable does not implement sync_dirty_log\n");
+   return -ENODEV;
+   }
+
+   /*
+* Flush iotlb to ensure all inflight transactions are completed.
+* See doc IHI0070Da 3.13.4 "HTTU behavior summary".
+*/
+   arm_smmu_flush_iotlb_all(domain);
+   return ops->sync_dirty_log(ops, iova, size, bitmap, base_iova,
+  bitmap_pgshift);
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2740,6 +2769,7 @@ static struct iommu_ops arm_smmu_ops = {
.start_dirty_log= arm_smmu_start_dirty_log,
.merge_page = arm_smmu_merge_page,
.stop_dirty_log = arm_smmu_stop_dirty_log,
+   .sync_dirty_log = arm_smmu_sync_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 9028328b99b0..67a208a05ab2 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -877,6 +877,95 @@ static size_t arm_lpae_merge_page(struct io_pgtable_ops 
*ops, unsigned long iova
return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
 }
 
+static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,
+unsigned long iova, size_t size,
+int lvl, arm_lpae_iopte *ptep,
+unsigned long *bitmap,
+unsigned long base_iova,
+unsigned long bitmap_pgshift)
+{
+   arm_lpae_iopte pte;
+   struct io_pgtable *iop = &data->iop;
+   size_t base, next_size;
+   unsigned long offset;
+   int nbits, ret;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return -EINVAL;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt)) {
+   if (pte & ARM_LPAE_PTE_AP_RDONLY)
+   return 0;
+
+   /* It is writable, set the bitmap */
+   nbits = size >> bitmap_pgshift;
+   offset = (iova - base_iova) >> bitmap_pgshift;
+   bitmap_set(bitmap, offset, nbits);
+   return 0;
+   } else {
+   /* To traverse next level */
+   next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+   ptep = iopt

Re: [RFC PATCH] kvm: arm64: Try stage2 block mapping for host device MMIO

2021-03-02 Thread Keqian Zhu
Hi Marc,

Do you have further suggestions on this? Block mapping does bring obvious benefits.

Thanks,
Keqian

On 2021/1/25 19:25, Keqian Zhu wrote:
> Hi Marc,
> 
> On 2021/1/22 17:45, Marc Zyngier wrote:
>> On 2021-01-22 08:36, Keqian Zhu wrote:
>>> The MMIO region of a device maybe huge (GB level), try to use block
>>> mapping in stage2 to speedup both map and unmap.
>>>
>>> Especially for unmap, it performs TLBI right after each invalidation
>>> of PTE. If all mapping is of PAGE_SIZE, it takes much time to handle
>>> GB level range.
>>
>> This is only on VM teardown, right? Or do you unmap the device more ofet?
>> Can you please quantify the speedup and the conditions this occurs in?
> 
> Yes, and there are some other paths (includes what your patch series handles) 
> will do the unmap action:
> 
> 1. guest reboot without S2FWB: stage2_unmap_vm(), which only unmaps guest
> regular RAM.
> 2. userspace deletes a memslot: kvm_arch_flush_shadow_memslot().
> 3. rollback of device MMIO mapping: kvm_arch_prepare_memory_region().
> 4. rollback of dirty log tracking: if we enable hugepages for guest RAM, after
> dirty log is stopped, the newly created block mappings will unmap all page
> mappings.
> 5. mmu notifier: kvm_unmap_hva_range(). AFAICS, we use this path on VM
> teardown or when the guest resets pass-through devices. The bugfix [1] gives
> the reason for unmapping the MMIO region when the guest resets pass-through
> devices.
> 
> For unmap related to the MMIO region, which this patch addresses:
> point 1 does not apply.
> point 2 occurs when userspace unplugs pass-through devices.
> point 3 can occur, but rarely.
> point 4 does not apply.
> point 5 occurs on VM teardown or when the guest resets pass-through devices.
> 
> And I had a look at your patch series; it can solve:
> For VM teardown, elide CMO and perform a VM-wide invalidation instead of
> individual ones (but the current kernel does not go through this path on VM
> teardown).
> For rollback of dirty log tracking, elide CMO.
> For kvm_unmap_hva_range, if the event is MMU_NOTIFY_UNMAP, elide CMO.
> 
> (But I doubt the CMOs in unmap. As we perform CMOs in user_mem_abort when
> installing a new stage 2 mapping for the VM, maybe the CMO in unmap is
> unnecessary under all conditions :-) ?)
> 
> So it shows that we are solving different parts of unmap, so they are not
> conflicting. At least this patch can still speed up map of the device MMIO
> region, and speed up unmap of the device MMIO region even if we do not need
> to perform CMO and TLBI ;-).
> 
> speedup: unmap 8GB MMIO on FPGA.
> 
>            before          after opt
> cost       30+ minutes     949ms
> 
> Thanks,
> Keqian
> 
>>
>> I have the feeling that we are just circling around another problem,
>> which is that we could rely on a VM-wide TLBI when tearing down the
>> guest. I worked on something like that[1] a long while ago, and parked
>> it for some reason. Maybe it is worth reviving.
>>
>> [1] 
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/elide-cmo-tlbi
>>
>>>
>>> Signed-off-by: Keqian Zhu 
>>> ---
>>>  arch/arm64/include/asm/kvm_pgtable.h | 11 +++
>>>  arch/arm64/kvm/hyp/pgtable.c | 15 +++
>>>  arch/arm64/kvm/mmu.c | 12 
>>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h
>>> b/arch/arm64/include/asm/kvm_pgtable.h
>>> index 52ab38db04c7..2266ac45f10c 100644
>>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>>> @@ -82,6 +82,17 @@ struct kvm_pgtable_walker {
>>>  const enum kvm_pgtable_walk_flagsflags;
>>>  };
>>>
>>> +/**
>>> + * kvm_supported_pgsize() - Get the max supported page size of a mapping.
>>> + * @pgt:Initialised page-table structure.
>>> + * @addr:Virtual address at which to place the mapping.
>>> + * @end:End virtual address of the mapping.
>>> + * @phys:Physical address of the memory to map.
>>> + *
>>> + * The smallest return value is PAGE_SIZE.
>>> + */
>>> +u64 kvm_supported_pgsize(struct kvm_pgtable *pgt, u64 addr, u64 end, u64 
>>> phys);
>>> +
>>>  /**
>>>   * kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
>>>   * @pgt:Uninitialised page-table structure to initialise.
>>> diff --git a/arch/arm64/kvm

Re: [RFC PATCH 0/7] kvm: arm64: Implement SW/HW combined dirty log

2021-03-02 Thread Keqian Zhu
Hi everyone,

Any comments are welcome :).

Thanks,
Keqian

On 2021/1/26 20:44, Keqian Zhu wrote:
> The intention:
> 
> On the arm64 platform, we track the dirty log of vCPUs through guest memory
> aborts. KVM occupies some guest vCPU time to change the stage 2 mapping and
> mark pages dirty. This has a heavy side effect on the VM, especially when
> multiple vCPUs race and some of them block on the kvm mmu_lock.
> 
> DBM is a HW auxiliary approach to log dirty pages. The MMU changes a PTE to be
> writable if its DBM bit is set. Then KVM doesn't occupy vCPU time to log dirty
> pages.
> 
> About this patch series:
> 
> The biggest problem of applying DBM to stage 2 is that software must scan PTs
> to collect the dirty state, which may cost much time and affect the downtime
> of migration.
> 
> This series realizes a SW/HW combined dirty log that can effectively solve
> this problem (the smmu side can also use this approach to solve DMA dirty log
> tracking).
> 
> The core idea is that we do not enable hardware dirty tracking at start (we do
> not add the DBM bit). When an arbitrary PT takes a fault, we execute soft
> tracking for this PT and enable hardware tracking for its *nearby* PTs (e.g.
> add the DBM bit for the nearby 16 PTs). Then when syncing the dirty log, we
> already know all PTs with hardware dirty tracking enabled, so we do not need
> to scan all PTs.
> 
>        mem abort point                      mem abort point
>              ↓                                    ↓
> -----------------------------------------------------------------------
> |          |          |          |          |          |          |
> -----------------------------------------------------------------------
>        ↑                                    ↑
>   set DBM bit of                       set DBM bit of
>   this PT section (64 PTEs)            this PT section (64 PTEs)
> 
> We may worry that when the dirty rate is very high we still need to scan too
> many PTs. We mainly care about the VM stop time. With QEMU dirty rate
> throttling, the dirty memory is close to the VM stop threshold, so there are
> only a few PTs to scan after the VM stops.
> 
> It has the advantage of hardware tracking, which minimizes the side effect on
> vCPUs, and also the advantage of software tracking, which controls the vCPU
> dirty rate. Moreover, software tracking helps us scan PTs at some fixed
> points, which greatly reduces scanning time. And the biggest benefit is that
> we can apply this solution to DMA dirty tracking.
> 
> Test:
> 
> Host: Kunpeng 920 with 128 CPUs and 512G RAM. Transparent Hugepage is
>       disabled (to ensure the test result is not affected by the dissolving
>       of block page tables at the early stage of migration).
> VM:   16 CPUs, 16GB RAM. Run 4 pairs of (redis_benchmark + redis_server).
> 
> Each configuration is run 5 times for software dirty log and for SW/HW
> combined dirty log.
> 
> Test result:
> 
> Gained a 5%~7% improvement in redis QPS during VM migration.
> VM downtime is not fundamentally affected.
> About 56.7% of DBM is effectively used.
> 
> Keqian Zhu (7):
>   arm64: cpufeature: Add API to report system support of HWDBM
>   kvm: arm64: Use atomic operation when update PTE
>   kvm: arm64: Add level_apply parameter for stage2_attr_walker
>   kvm: arm64: Add some HW_DBM related pgtable interfaces
>   kvm: arm64: Add some HW_DBM related mmu interfaces
>   kvm: arm64: Only write protect selected PTE
>   kvm: arm64: Start up SW/HW combined dirty log
> 
>  arch/arm64/include/asm/cpufeature.h  |  12 +++
>  arch/arm64/include/asm/kvm_host.h|   6 ++
>  arch/arm64/include/asm/kvm_mmu.h |   7 ++
>  arch/arm64/include/asm/kvm_pgtable.h |  45 ++
>  arch/arm64/kvm/arm.c | 125 ++
>  arch/arm64/kvm/hyp/pgtable.c | 130 ++-
>  arch/arm64/kvm/mmu.c |  47 +-
>  arch/arm64/kvm/reset.c   |   8 +-
>  8 files changed, 351 insertions(+), 29 deletions(-)
> 


Re: [PATCH v14 05/13] iommu/smmuv3: Implement attach/detach_pasid_table

2021-03-02 Thread Keqian Zhu
Hi Eric,

On 2021/2/24 4:56, Eric Auger wrote:
> On attach_pasid_table() we program STE S1 related info set
> by the guest into the actual physical STEs. At minimum
> we need to program the context descriptor GPA and compute
> whether the stage1 is translated/bypassed or aborted.
> 
> On detach, the stage 1 config is unset and the abort flag is
> unset.
> 
> Signed-off-by: Eric Auger 
> 
[...]

> +
> + /*
> +  * we currently support a single CD so s1fmt and s1dss
> +  * fields are also ignored
> +  */
> + if (cfg->pasid_bits)
> + goto out;
> +
> + smmu_domain->s1_cfg.cdcfg.cdtab_dma = cfg->base_ptr;
only the "cdtab_dma" field of "cdcfg" is set, we are not able to locate a 
specific cd using arm_smmu_get_cd_ptr().

Maybe we'd better use a specialized function to fill other fields of "cdcfg" or 
add a sanity check in arm_smmu_get_cd_ptr()
to prevent calling it under nested mode?

As now we just call arm_smmu_get_cd_ptr() during finalise_s1(), no problem 
found. Just a suggestion ;-)

Thanks,
Keqian


> + smmu_domain->s1_cfg.set = true;
> + smmu_domain->abort = false;
> + break;
> + default:
> + goto out;
> + }
> + spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> + list_for_each_entry(master, &smmu_domain->devices, domain_head)
> + arm_smmu_install_ste_for_dev(master);
> + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> + ret = 0;
> +out:
> + mutex_unlock(&smmu_domain->init_mutex);
> + return ret;
> +}
> +
> +static void arm_smmu_detach_pasid_table(struct iommu_domain *domain)
> +{
> + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> + struct arm_smmu_master *master;
> + unsigned long flags;
> +
> + mutex_lock(&smmu_domain->init_mutex);
> +
> + if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> + goto unlock;
> +
> + smmu_domain->s1_cfg.set = false;
> + smmu_domain->abort = false;
> +
> + spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> + list_for_each_entry(master, &smmu_domain->devices, domain_head)
> + arm_smmu_install_ste_for_dev(master);
> + spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> +
> +unlock:
> + mutex_unlock(&smmu_domain->init_mutex);
> +}
> +
>  static bool arm_smmu_dev_has_feature(struct device *dev,
>enum iommu_dev_features feat)
>  {
> @@ -2939,6 +3026,8 @@ static struct iommu_ops arm_smmu_ops = {
>   .of_xlate   = arm_smmu_of_xlate,
>   .get_resv_regions   = arm_smmu_get_resv_regions,
>   .put_resv_regions   = generic_iommu_put_resv_regions,
> + .attach_pasid_table = arm_smmu_attach_pasid_table,
> + .detach_pasid_table = arm_smmu_detach_pasid_table,
>   .dev_has_feat   = arm_smmu_dev_has_feature,
>   .dev_feat_enabled   = arm_smmu_dev_feature_enabled,
>   .dev_enable_feat= arm_smmu_dev_enable_feature,
> 


Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-03-02 Thread Keqian Zhu
Hi Robin,

I am going to send v2 next week to address the issues you reported.
Many thanks!
And do you have any further comments on patches #4, #5 and #6?

Thanks,
Keqian

On 2021/2/5 3:50, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>> update the access flag and the dirty state of TTD by hardware. It is
>> essential to track dirty pages of DMA.
>>
>> This adds feature detection, none functional change.
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
>>   include/linux/io-pgtable.h  |  1 +
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 8ca7415d785d..0f0fe71cc10d 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>> iommu_domain *domain,
>>   .pgsize_bitmap= smmu->pgsize_bitmap,
>>   .ias= ias,
>>   .oas= oas,
>> +.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>   .coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>   .tlb= &arm_smmu_flush_ops,
>>   .iommu_dev= smmu->dev,
>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>> arm_smmu_device *smmu)
>>   if (reg & IDR0_HYP)
>>   smmu->features |= ARM_SMMU_FEAT_HYP;
>>   +switch (FIELD_GET(IDR0_HTTU, reg)) {
> 
> We need to accommodate the firmware override as well if we need this to be 
> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
> stack[1].
> 
>> +case IDR0_HTTU_NONE:
>> +break;
>> +case IDR0_HTTU_HA:
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +break;
>> +case IDR0_HTTU_HAD:
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>> +break;
>> +default:
>> +dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>> +return -ENXIO;
>> +}
>> +
>>   /*
>>* The coherency feature as set by FW is used in preference to the ID
>>* register, but warn on mismatch.
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index 96c2e9565e00..e91bea44519e 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -33,6 +33,10 @@
>>   #define IDR0_ASID16(1 << 12)
>>   #define IDR0_ATS(1 << 10)
>>   #define IDR0_HYP(1 << 9)
>> +#define IDR0_HTTUGENMASK(7, 6)
>> +#define IDR0_HTTU_NONE0
>> +#define IDR0_HTTU_HA1
>> +#define IDR0_HTTU_HAD2
>>   #define IDR0_COHACC(1 << 4)
>>   #define IDR0_TTFGENMASK(3, 2)
>>   #define IDR0_TTF_AARCH642
>> @@ -286,6 +290,8 @@
>>   #define CTXDESC_CD_0_TCR_TBI0(1ULL << 38)
>> #define CTXDESC_CD_0_AA64(1UL << 41)
>> +#define CTXDESC_CD_0_HD(1UL << 42)
>> +#define CTXDESC_CD_0_HA(1UL << 43)
>>   #define CTXDESC_CD_0_S(1UL << 44)
>>   #define CTXDESC_CD_0_R(1UL << 45)
>>   #define CTXDESC_CD_0_A(1UL << 46)
>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>   #define ARM_SMMU_FEAT_RANGE_INV(1 << 15)
>>   #define ARM_SMMU_FEAT_BTM(1 << 16)
>>   #define ARM_SMMU_FEAT_SVA(1 << 17)
>> +#define ARM_SMMU_FEAT_HTTU_HA(1 << 18)
>> +#define ARM_SMMU_FEAT_HTTU_HD(1 << 19)
>>   u32features;
>> #define ARM_SMMU_OPT_SKIP_PREFETCH(1 << 0)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index ea727eb1a1a9..1a00ea8562c7 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -97,6 +97,7 @@ struct io_pgtable_cfg {
>>   unsigned longpgsize_bitmap;
>>   unsigned intias;
>>   unsigned intoas;
>> +boolhttu_hd;
> 
> This is very specific to the AArch64 stage 1 format, not a generic capability 
> - I think it should be a quirk flag rather than a common field.
> 
> Robin.
> 
> [1] 
> https://jpbrucker.net/git/linux/commit/?h=sva/current=1ef7d512fb9082450dfe0d22ca4f7e35625a097b
> 
>>   boolcoherent_walk;
>>   const struct iommu_flush_ops*tlb;
>>   struct device*iommu_dev;
>>
> .
> 


Re: [PATCH v11 01/13] vfio: VFIO_IOMMU_SET_PASID_TABLE

2021-02-22 Thread Keqian Zhu
Hi Eric,

On 2021/2/22 18:53, Auger Eric wrote:
> Hi Keqian,
> 
> On 2/2/21 1:34 PM, Keqian Zhu wrote:
>> Hi Eric,
>>
>> On 2020/11/16 19:00, Eric Auger wrote:
>>> From: "Liu, Yi L" 
>>>
>>> This patch adds an VFIO_IOMMU_SET_PASID_TABLE ioctl
>>> which aims to pass the virtual iommu guest configuration
>>> to the host. This latter takes the form of the so-called
>>> PASID table.
>>>
>>> Signed-off-by: Jacob Pan 
>>> Signed-off-by: Liu, Yi L 
>>> Signed-off-by: Eric Auger 
>>>
>>> ---
>>> v11 -> v12:
>>> - use iommu_uapi_set_pasid_table
>>> - check SET and UNSET are not set simultaneously (Zenghui)
>>>
>>> v8 -> v9:
>>> - Merge VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE into a single
>>>   VFIO_IOMMU_SET_PASID_TABLE ioctl.
>>>
>>> v6 -> v7:
>>> - add a comment related to VFIO_IOMMU_DETACH_PASID_TABLE
>>>
>>> v3 -> v4:
>>> - restore ATTACH/DETACH
>>> - add unwind on failure
>>>
>>> v2 -> v3:
>>> - s/BIND_PASID_TABLE/SET_PASID_TABLE
>>>
>>> v1 -> v2:
>>> - s/BIND_GUEST_STAGE/BIND_PASID_TABLE
>>> - remove the struct device arg
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c | 65 +
>>>  include/uapi/linux/vfio.h   | 19 ++
>>>  2 files changed, 84 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>>> b/drivers/vfio/vfio_iommu_type1.c
>>> index 67e827638995..87ddd9e882dc 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -2587,6 +2587,41 @@ static int vfio_iommu_iova_build_caps(struct 
>>> vfio_iommu *iommu,
>>> return ret;
>>>  }
>>>  
>>> +static void
>>> +vfio_detach_pasid_table(struct vfio_iommu *iommu)
>>> +{
>>> +   struct vfio_domain *d;
>>> +
>>> +   mutex_lock(&iommu->lock);
>>> +   list_for_each_entry(d, &iommu->domain_list, next)
>>> +   iommu_detach_pasid_table(d->domain);
>>> +
>>> +   mutex_unlock(&iommu->lock);
>>> +}
>>> +
>>> +static int
>>> +vfio_attach_pasid_table(struct vfio_iommu *iommu, unsigned long arg)
>>> +{
>>> +   struct vfio_domain *d;
>>> +   int ret = 0;
>>> +
>>> +   mutex_lock(&iommu->lock);
>>> +
>>> +   list_for_each_entry(d, &iommu->domain_list, next) {
>>> +   ret = iommu_uapi_attach_pasid_table(d->domain, (void __user 
>>> *)arg);
>> This design is not very clear to me. This assumes all iommu_domains share 
>> the same pasid table.
>>
>> As I understand, it's reasonable when there is only one group in the domain, 
>> and only one domain in the vfio_iommu.
>> If more than one group in the vfio_iommu, the guest may put them into 
>> different guest iommu_domain, then they have different pasid table.
>>
>> Is this the use scenario?
> 
> the vfio_iommu is attached to a container. all the groups within a
> container share the same set of page tables (linux
> Documentation/driver-api/vfio.rst). So to me if you want to use
> different pasid tables, the groups need to be attached to different
> containers. Does that make sense to you?
OK, so this is what I understand about the design. A small question: when we
perform attach_pasid_table on a container, maybe we ought to do a sanity check
to make sure that only one group is in this container, instead of iterating
over all domains?
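
Something along these lines, for example (untested sketch; it assumes
vfio_domain still keeps its per-domain group_list as in current
vfio_iommu_type1.c, and that the caller holds iommu->lock):

/*
 * Untested sketch: refuse attach_pasid_table when the container holds
 * more than one group.
 */
static bool vfio_iommu_has_single_group(struct vfio_iommu *iommu)
{
	struct vfio_domain *d;
	struct vfio_group *g;
	unsigned int ngroups = 0;

	list_for_each_entry(d, &iommu->domain_list, next)
		list_for_each_entry(g, &d->group_list, next)
			ngroups++;

	return ngroups == 1;
}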

To be frank, my main concern is that if we put each group into a different
container under nested mode, then we give up the possibility of sharing stage 2
page tables, which saves host memory and reduces the time needed to prepare the
environment for a VM.

To me, I'd like "the container shares page tables" to mean:
1) share the stage 2 page table under nested mode.
2) share the stage 1 page table under non-nested mode.

Because when we perform "map" on a container:
1) under nested mode, we set up the stage 2 mapping.
2) under non-nested mode, we set up the stage 1 mapping.

Indeed, to realize stage 2 mapping sharing, we would need to do much more work
to refactor SMMU_DOMAIN...

Hope you can consider this. :)

Thanks,
Keqian

> 
> Thanks
> 
> Eric
>>
>> Thanks,
>> Keqian
>>
>>> +   if (ret)
>>> +   goto unwind;
>>> +   }
>>> +   goto unlock;
>>> +unwind:
>>> +   list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) {
>

Re: [PATCH v13 02/15] iommu: Introduce bind/unbind_guest_msi

2021-02-18 Thread Keqian Zhu
Hi Eric,

On 2021/2/12 16:55, Auger Eric wrote:
> Hi Keqian,
> 
> On 2/1/21 12:52 PM, Keqian Zhu wrote:
>> Hi Eric,
>>
>> On 2020/11/18 19:21, Eric Auger wrote:
>>> On ARM, MSI are translated by the SMMU. An IOVA is allocated
>>> for each MSI doorbell. If both the host and the guest are exposed
>>> with SMMUs, we end up with 2 different IOVAs allocated by each.
>>> guest allocates an IOVA (gIOVA) to map onto the guest MSI
>>> doorbell (gDB). The Host allocates another IOVA (hIOVA) to map
>>> onto the physical doorbell (hDB).
>>>
>>> So we end up with 2 untied mappings:
>>>  S1S2
>>> gIOVA->gDB
>>>   hIOVA->hDB
>>>
>>> Currently the PCI device is programmed by the host with hIOVA
>>> as MSI doorbell. So this does not work.
>>>
>>> This patch introduces an API to pass gIOVA/gDB to the host so
>>> that gIOVA can be reused by the host instead of re-allocating
>>> a new IOVA. So the goal is to create the following nested mapping:
>> Can the gDB be reused under non-nested mode?
> 
> Under non nested mode the hIOVA is allocated within the MSI reserved
> region exposed by the SMMU driver, [0x800, 80f]. see
> iommu_dma_prepare_msi/iommu_dma_get_msi_page in dma_iommu.c. this hIOVA
> is programmed in the physical device so that the physical SMMU
> translates it into the physical doorbell (hDB = host physical ITS
So, AFAIU, under non-nested mode, the SMMU side reuses the workflow of the
non-virtualization scenario.

> doorbell). The gDB is not used at pIOMMU programming level. It is only
> used when setting up the KVM irq route.
> 
> Hope this answers your question.
Thanks for your explanation!
> 

Thanks,
Keqian

>>
>>>
>>>  S1S2
>>> gIOVA->gDB ->hDB
>>>
>>> and program the PCI device with gIOVA MSI doorbell.
>>>
>>> In case we have several devices attached to this nested domain
>>> (devices belonging to the same group), they cannot be isolated
>>> on guest side either. So they should also end up in the same domain
>>> on guest side. We will enforce that all the devices attached to
>>> the host iommu domain use the same physical doorbell and similarly
>>> a single virtual doorbell mapping gets registered (1 single
>>> virtual doorbell is used on guest as well).
>>>
>> [...]
>>
>>> + *
>>> + * The associated IOVA can be reused by the host to create a nested
>>> + * stage2 binding mapping translating into the physical doorbell used
>>> + * by the devices attached to the domain.
>>> + *
>>> + * All devices within the domain must share the same physical doorbell.
>>> + * A single MSI GIOVA/GPA mapping can be attached to an iommu_domain.
>>> + */
>>> +
>>> +int iommu_bind_guest_msi(struct iommu_domain *domain,
>>> +dma_addr_t giova, phys_addr_t gpa, size_t size)
>>> +{
>>> +   if (unlikely(!domain->ops->bind_guest_msi))
>>> +   return -ENODEV;
>>> +
>>> +   return domain->ops->bind_guest_msi(domain, giova, gpa, size);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
>>> +
>>> +void iommu_unbind_guest_msi(struct iommu_domain *domain,
>>> +   dma_addr_t iova)
>> nit: s/iova/giova
> sure
>>
>>> +{
>>> +   if (unlikely(!domain->ops->unbind_guest_msi))
>>> +   return;
>>> +
>>> +   domain->ops->unbind_guest_msi(domain, iova);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_unbind_guest_msi);
>>> +
>> [...]
>>
>> Thanks,
>> Keqian
>>
> 
> Thanks
> 
> Eric
> 
> .
> 


Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-02-17 Thread Keqian Zhu
Hi Yi,

On 2021/2/9 19:57, Yi Sun wrote:
> On 21-02-07 18:40:36, Keqian Zhu wrote:
>> Hi Yi,
>>
>> On 2021/2/7 17:56, Yi Sun wrote:
>>> Hi,
>>>
>>> On 21-01-28 23:17:41, Keqian Zhu wrote:
>>>
>>> [...]
>>>
>>>> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
>>>> +   struct vfio_dma *dma)
>>>> +{
>>>> +  struct vfio_domain *d;
>>>> +
>>>> +  list_for_each_entry(d, >domain_list, next) {
>>>> +  /* Go through all domain anyway even if we fail */
>>>> +  iommu_split_block(d->domain, dma->iova, dma->size);
>>>> +  }
>>>> +}
>>>
>>> This should be a switch to prepare for dirty log start. Per Intel
>>> Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
>>> It enables Accessed/Dirty Flags in second-level paging entries.
>>> So, a generic iommu interface here is better. For Intel iommu, it
>>> enables SLADE. For ARM, it splits block.
>> Indeed, a generic interface name is better.
>>
>> The vendor iommu driver performs vendor-specific actions to start dirty log,
>> and Intel iommu and ARM smmu may differ. Besides, we may add more actions in
>> the ARM smmu driver in the future.
>>
>> One question: though I am not familiar with Intel iommu, I think it also
>> should split block mappings besides enabling SLADE. Right?
>>
> I am not familiar with ARM smmu. :) So I want to clarify if the block
> in smmu is big page, e.g. 2M page? Intel Vtd manages the memory per
Yes, for ARM, the "block" is a big page :).

> page, 4KB/2MB/1GB. There are two ways to manage dirty pages.
> 1. Keep default granularity. Just set SLADE to enable the dirty track.
> 2. Split big page to 4KB to get finer granularity.
According to your statement, I see that VT-d's SLADE behaves like smmu HTTU.
They are both based on page tables.

Right, we should give more freedom to the iommu vendor driver, so a generic
interface is better.
1) As you said, set SLADE when enabling dirty log.
2) IOMMUs of other architectures may have completely different dirty tracking
mechanisms.

> 
> But question about the second solution is if it can benefit the user
> space, e.g. live migration. If my understanding about smmu block (i.e.
> the big page) is correct, have you collected some performance data to
> prove that the split can improve performance? Thanks!
The purpose of splitting block mappings is to reduce the amount of dirty bytes,
which depends on the actual DMA transactions. Take an extreme example: if DMA
writes one byte, under a 1G mapping the dirty amount reported to userspace is
1G, but under a 4K mapping the dirty amount is just 4K.

I will detail the commit message in v2.

Thanks,
Keqian


Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-02-07 Thread Keqian Zhu



On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> During dirty log tracking, user will try to retrieve dirty log from
>> iommu if it supports hardware dirty log. This adds a new interface
[...]

>>   static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>>   {
>>   unsigned long granule, page_sizes;
>> @@ -957,6 +1046,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>>   .iova_to_phys= arm_lpae_iova_to_phys,
>>   .split_block= arm_lpae_split_block,
>>   .merge_page= arm_lpae_merge_page,
>> +.sync_dirty_log= arm_lpae_sync_dirty_log,
>>   };
>> return data;
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index f1261da11ea8..69f268069282 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2822,6 +2822,47 @@ size_t iommu_merge_page(struct iommu_domain *domain, 
>> unsigned long iova,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_merge_page);
>>   +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
>> + size_t size, unsigned long *bitmap,
>> + unsigned long base_iova, unsigned long bitmap_pgshift)
>> +{
>> +const struct iommu_ops *ops = domain->ops;
>> +unsigned int min_pagesz;
>> +size_t pgsize;
>> +int ret;
>> +
>> +min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
>> +
>> +if (!IS_ALIGNED(iova | size, min_pagesz)) {
>> +pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
>> +   iova, size, min_pagesz);
>> +return -EINVAL;
>> +}
>> +
>> +if (!ops || !ops->sync_dirty_log) {
>> +pr_err("don't support sync dirty log\n");
>> +return -ENODEV;
>> +}
>> +
>> +while (size) {
>> +pgsize = iommu_pgsize(domain, iova, size);
>> +
>> +ret = ops->sync_dirty_log(domain, iova, pgsize,
>> +  bitmap, base_iova, bitmap_pgshift);
> 
> Once again, we have a worst-of-both-worlds iteration that doesn't make much 
> sense. iommu_pgsize() essentially tells you the best supported size that an 
> IOVA range *can* be mapped with, but we're iterating a range that's already 
> mapped, so we don't know if it's relevant, and either way it may not bear any 
> relation to the granularity of the bitmap, which is presumably what actually 
> matters.
> 
> Logically, either we should iterate at the bitmap granularity here, and the 
> driver just says whether the given iova chunk contains any dirty pages or 
> not, or we just pass everything through to the driver and let it do the whole 
> job itself. Doing a little bit of both is just an overcomplicated mess.
> 
> I'm skimming patch #7 and pretty much the same comments apply, so I can't be 
> bothered to repeat them there...
> 
> Robin.
Sorry that I missed these comments...

As I clarified in #4, due to an unsuitable variable name, @pgsize actually is
the max size that meets the alignment requirement and fits into the
pgsize_bitmap.

All iommu interfaces require that @size fit into the pgsize_bitmap to simplify
their implementation. And the logic is very similar to "unmap" here.
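
For reference, the computation is essentially what iommu_pgsize() in
drivers/iommu/iommu.c does; a standalone sketch of that logic follows
(illustrative name, simplified to ignore the physical address):

/*
 * Sketch of the iommu_pgsize()-style computation mentioned above: pick the
 * largest power-of-two size that fits into @size, respects the alignment of
 * @iova, and is present in @pgsize_bitmap.
 */
static size_t max_supported_pgsize(unsigned long pgsize_bitmap,
				   unsigned long iova, size_t size)
{
	unsigned int idx = __fls(size);		/* largest size that fits */
	unsigned long candidates;

	if (iova)				/* honour address alignment */
		idx = min_t(unsigned int, idx, __ffs(iova));

	candidates = ((1UL << (idx + 1)) - 1) & pgsize_bitmap;

	return candidates ? 1UL << __fls(candidates) : 0;
}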

Thanks,
Keqian

> 
>> +if (ret)
>> +break;
>> +
>> +pr_debug("dirty_log_sync: iova 0x%lx pagesz 0x%zx\n", iova,
>> + pgsize);
>> +
>> +iova += pgsize;
>> +size -= pgsize;
>> +}
>> +
>> +return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
>> +
>>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>>   {
>>   const struct iommu_ops *ops = dev->bus->iommu_ops;
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 754b62a1bbaf..f44551e4a454 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -166,6 +166,10 @@ struct io_pgtable_ops {
>> size_t size);
>>   size_t (*merge_page)(struct io_pgtable_ops *ops, unsigned long iova,
>>phys_addr_t phys, size_t size, int prot);
>> +int (*sync_dirty_log)(struct io_pgtable_ops *ops,
>> +  unsigned long iova, size_t size,
>> +  unsigned long *bitmap, unsigned long base_iova,
>> +  unsigned long bitmap_pgshift);
>>   };
>> /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> in

Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-02-07 Thread Keqian Zhu
Hi Robin,

On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> During dirty log tracking, user will try to retrieve dirty log from
>> iommu if it supports hardware dirty log. This adds a new interface
>> named sync_dirty_log in iommu layer and arm smmuv3 implements it,
>> which scans leaf TTD and treats it's dirty if it's writable (As we
>> just enable HTTU for stage1, so check AP[2] is not set).
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 +++
>>   drivers/iommu/io-pgtable-arm.c  | 90 +
>>   drivers/iommu/iommu.c   | 41 ++
>>   include/linux/io-pgtable.h  |  4 +
>>   include/linux/iommu.h   | 17 
>>   5 files changed, 179 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 2434519e4bb6..43d0536b429a 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2548,6 +2548,32 @@ static size_t arm_smmu_merge_page(struct iommu_domain 
>> *domain, unsigned long iov
>>   return ops->merge_page(ops, iova, paddr, size, prot);
>>   }
>>   +static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,
>> +   unsigned long iova, size_t size,
>> +   unsigned long *bitmap,
>> +   unsigned long base_iova,
>> +   unsigned long bitmap_pgshift)
>> +{
>> +struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
>> +struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
>> +
>> +if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
>> +dev_err(smmu->dev, "don't support HTTU_HD and sync dirty log\n");
>> +return -EPERM;
>> +}
>> +
>> +if (!ops || !ops->sync_dirty_log) {
>> +pr_err("don't support sync dirty log\n");
>> +return -ENODEV;
>> +}
>> +
>> +/* To ensure all inflight transactions are completed */
>> +arm_smmu_flush_iotlb_all(domain);
> 
> What about transactions that arrive between the point that this completes, 
> and the point - potentially much later - that we actually access any given 
> PTE during the walk? I don't see what this is supposed to be synchronising 
> against, even if it were just a CMD_SYNC (I especially don't see why we'd 
> want to knock out the TLBs).
The idea is that pgtable may be updated by HTTU *before* or *after* actual DMA 
access.

1) For PCI ATS. As SMMU spec (3.13.6.1 Hardware flag update for ATS & PRI) 
states:

"In addition to the behavior that is described earlier in this section, if 
hardware-management of Dirty state is enabled
and an ATS request for write access (with NW == 0) is made to a page that is 
marked Writable Clean, the SMMU
assumes a write will be made to that page and marks the page as Writable Dirty 
before returning the ATS response
that grants write access. When this happens, the modification to the page data 
by a device is not visible before
the page state is visible as Writable Dirty."

The problem is that guest memory may be dirtied *after* we actually handle it.

2) For inflight DMA. As SMMU spec (3.13.4 HTTU behavior summary) states:

"In addition, the completion of a TLB invalidation operation makes TTD updates 
that were caused by
transactions that are themselves completed by the completion of the TLB 
invalidation visible. Both
broadcast and explicit CMD_TLBI_* invalidations have this property."

The problem is that we must flush all DMA transactions after the guest stops.



The key to solving these problems is to invalidate the related TLB entries.
1) A TLBI flushes in-flight DMA translations (done before dirty_log_sync()).
2) If a DMA translation uses the ATC and occurs after we have handled the dirty
memory, the ATC has already been invalidated, so the page will be re-marked as
dirty (in dirty_log_clear()).
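
To make the intended ordering concrete, here is a minimal, self-contained
sketch (hypothetical names and a simplified bitmap layout, not the actual SMMU
driver code) of the sync path described above:

#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Hypothetical stand-in for AP[2] in a stage-1 leaf TTD (read-only bit). */
#define TTD_AP_RDONLY (1ULL << 7)

/* No-op stand-in for arm_smmu_flush_iotlb_all(domain). */
static void flush_iotlb(void) { }

/*
 * Minimal sketch of the sync path: first invalidate the TLB so that TTD
 * updates caused by already-completed transactions become visible, then
 * scan the leaf TTDs and report every writable (i.e. dirty) page.
 */
void sync_dirty_log(const uint64_t *ttds, size_t nr_pages,
                    unsigned long *bitmap)
{
    flush_iotlb();                      /* step 1): flush in-flight DMA translations */

    for (size_t i = 0; i < nr_pages; i++) {
        if (!(ttds[i] & TTD_AP_RDONLY)) /* writable => HTTU marked it dirty */
            bitmap[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
    }
    /*
     * step 2): the clear path would then write-protect these pages and
     * invalidate TLB/ATC again, so a later ATS write request re-marks
     * the page as dirty.
     */
}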

Thanks,
Keqian

> 
>> +
>> +return ops->sync_dirty_log(ops, iova, size, bitmap,
>> +base_iova, bitmap_pgshift);
>> +}
>> +
>>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args 
>> *args)
>>   {
>>   return iommu_fwspec_add_ids(dev, args->args, 1);
>> @@ -2649,6 +2675,7 @@ static struct iommu_ops arm_smmu_ops = {
>>   .domain_set_attr= arm_smmu_domain_set_attr,
>>   .split_block= arm_smmu_split_block,
>>   .merge_page= arm_smmu_m

Re: [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor

2021-02-07 Thread Keqian Zhu
Hi Robin,

On 2021/2/5 3:52, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> When stop dirty log tracking, we need to recover all block descriptors
>> which are splited when start dirty log tracking. This adds a new
>> interface named merge_page in iommu layer and arm smmuv3 implements it,
>> which reinstall block mappings and unmap the span of page mappings.
>>
>> It's caller's duty to find contiuous physical memory.
>>
>> During merging page, other interfaces are not expected to be working,
>> so race condition does not exist. And we flush all iotlbs after the merge
>> procedure is completed to ease the pressure of iommu, as we will merge a
>> huge range of page mappings in general.
> 
> Again, I think we need better reasoning than "race conditions don't exist 
> because we don't expect them to exist".
Sure, because they can't. ;-)

> 
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 ++
>>   drivers/iommu/io-pgtable-arm.c  | 78 +
>>   drivers/iommu/iommu.c   | 75 
>>   include/linux/io-pgtable.h  |  2 +
>>   include/linux/iommu.h   | 10 +++
>>   5 files changed, 185 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 5469f4fca820..2434519e4bb6 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2529,6 +2529,25 @@ static size_t arm_smmu_split_block(struct 
>> iommu_domain *domain,
>>   return ops->split_block(ops, iova, size);
>>   }
[...]

>> +
>> +size_t iommu_merge_page(struct iommu_domain *domain, unsigned long iova,
>> +size_t size, int prot)
>> +{
>> +phys_addr_t phys;
>> +dma_addr_t p, i;
>> +size_t cont_size, merged_size;
>> +size_t merged = 0;
>> +
>> +while (size) {
>> +phys = iommu_iova_to_phys(domain, iova);
>> +cont_size = PAGE_SIZE;
>> +p = phys + cont_size;
>> +i = iova + cont_size;
>> +
>> +while (cont_size < size && p == iommu_iova_to_phys(domain, i)) {
>> +p += PAGE_SIZE;
>> +i += PAGE_SIZE;
>> +cont_size += PAGE_SIZE;
>> +}
>> +
>> +merged_size = __iommu_merge_page(domain, iova, phys, cont_size,
>> +prot);
> 
> This is incredibly silly. The amount of time you'll spend just on walking the 
> tables in all those iova_to_phys() calls is probably significantly more than 
> it would take the low-level pagetable code to do the entire operation for 
> itself. At this level, any knowledge of how mappings are actually constructed 
> is lost once __iommu_map() returns, so we just don't know, and for this 
> operation in particular there seems little point in trying to guess - the 
> driver backend still has to figure out whether something we *think* might me 
> mergeable actually is, so it's better off doing the entire operation in a 
> single pass by itself.
>
> There's especially little point in starting all this work *before* checking 
> that it's even possible...
>
> Robin.

Well, this does look silly indeed. But the iova->phys information is only
stored in the pgtable, and there seems to be no other way to find contiguous
physical addresses :-( (vfio_iommu_replay() actually has similar logic).

We put the search for contiguous physical addresses in the common iommu layer
because the logic is common to all types of iommu driver.

If a vendor iommu driver decides that (iova, phys, cont_size) is not
merge-able, it can still map the range however it sees fit. This mirrors
iommu_map(), which hands (iova, paddr, pgsize) to the vendor driver and lets
it make its own mapping decision.

Do I understand your idea correctly?
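
To illustrate the analogy with iommu_map(): below is a small sketch
(hypothetical names, not the real io-pgtable code) of how a vendor back end
could decide for itself whether a (iova, phys, cont_size) span found by the
common layer is actually merge-able:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_2M 0x200000UL

/*
 * Hypothetical vendor-side check: the (iova, phys, cont_size) span handed
 * down by the common layer only becomes a 2M block descriptor when both
 * addresses are block aligned and the span covers a whole block.
 */
static bool span_mergeable(uint64_t iova, uint64_t phys, size_t cont_size)
{
    return !(iova & (SZ_2M - 1)) && !(phys & (SZ_2M - 1)) &&
           cont_size >= SZ_2M;
}

/* Returns how many bytes of the span the back end chose to merge. */
size_t vendor_merge_page(uint64_t iova, uint64_t phys, size_t cont_size)
{
    size_t merged = 0;

    while (span_mergeable(iova + merged, phys + merged, cont_size - merged)) {
        /* install_block_desc(iova + merged, phys + merged, SZ_2M); */
        merged += SZ_2M;
    }
    /* Anything left over simply keeps its existing page mappings. */
    return merged;
}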

Thanks,
Keqian
> 
>> +iova += merged_size;
>> +size -= merged_size;
>> +merged += merged_size;
>> +
>> +if (merged_size != cont_size)
>> +break;
>> +}
>> +iommu_flush_iotlb_all(domain);
>> +
>> +return merged;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_merge_page);
>> +
>>   void iommu_get_resv_regions(struct device *dev, struct list_head *list)
>>   {
>>   const struct iommu_ops *ops = dev->bus->iommu_ops;
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
&g

Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-02-07 Thread Keqian Zhu
Hi Yi,

On 2021/2/7 17:56, Yi Sun wrote:
> Hi,
> 
> On 21-01-28 23:17:41, Keqian Zhu wrote:
> 
> [...]
> 
>> +static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
>> + struct vfio_dma *dma)
>> +{
>> +struct vfio_domain *d;
>> +
>> +list_for_each_entry(d, >domain_list, next) {
>> +/* Go through all domain anyway even if we fail */
>> +iommu_split_block(d->domain, dma->iova, dma->size);
>> +}
>> +}
> 
> This should be a switch to prepare for dirty log start. Per Intel
> Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
> It enables Accessed/Dirty Flags in second-level paging entries.
> So, a generic iommu interface here is better. For Intel iommu, it
> enables SLADE. For ARM, it splits block.
Indeed, a generic interface name is better.

The vendor iommu driver performs its vendor-specific actions to start dirty
logging, and Intel iommu and ARM smmu may differ. Besides, we may add more
actions to the ARM smmu driver in the future.

One question: though I am not familiar with the Intel iommu, I think it should
also split block mappings besides enabling SLADE. Right?
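
A rough sketch of what such a generic hook could look like (purely
illustrative, with hypothetical names; the real thing would be an iommu_ops
callback): each vendor driver performs its own preparation, while the caller
stays vendor-agnostic.

#include <errno.h>
#include <stddef.h>

struct dirty_log_domain;      /* opaque, hypothetical */

/* Hypothetical generic "switch on dirty logging" operation. */
struct dirty_log_ops {
    int (*start_dirty_log)(struct dirty_log_domain *d,
                           unsigned long iova, size_t size);
};

/* ARM SMMU flavour: split block mappings so tracking is page granular. */
int arm_flavour_start(struct dirty_log_domain *d,
                      unsigned long iova, size_t size)
{
    /* split_block(d, iova, size); */
    return 0;
}

/* Intel VT-d flavour: enable A/D bits (SLADE) in second-level entries. */
int intel_flavour_start(struct dirty_log_domain *d,
                        unsigned long iova, size_t size)
{
    /* enable_slade(d); -- and possibly split large pages as well */
    return 0;
}

/* The common layer (e.g. vfio) only ever calls the generic hook. */
int iommu_start_dirty_log(const struct dirty_log_ops *ops,
                          struct dirty_log_domain *d,
                          unsigned long iova, size_t size)
{
    if (!ops->start_dirty_log)
        return -ENODEV;
    return ops->start_dirty_log(d, iova, size);
}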

Thanks,
Keqian


Re: [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page

2021-02-07 Thread Keqian Zhu
Hi Robin,

On 2021/2/5 3:51, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> Block descriptor is not a proper granule for dirty log tracking. This
>> adds a new interface named split_block in iommu layer and arm smmuv3
>> implements it, which splits block descriptor to an equivalent span of
>> page descriptors.
>>
>> During spliting block, other interfaces are not expected to be working,
>> so race condition does not exist. And we flush all iotlbs after the split
>> procedure is completed to ease the pressure of iommu, as we will split a
>> huge range of block mappings in general.
> 
> "Not expected to be" is not the same thing as "can not". Presumably the whole 
> point of dirty log tracking is that it can be run speculatively in the 
> background, so is there any actual guarantee that the guest can't, say, issue 
> a hotplug event that would cause some memory to be released back to the host 
> and unmapped while a scan might be in progress? Saying effectively "there is 
> no race condition as long as you assume there is no race condition" isn't all 
> that reassuring...
Sorry for my inaccurate expression. "Not expected to be" is inappropriate here;
the actual meaning is "can not".

The only user of these newly added interfaces is vfio_iommu_type1 for now, and
vfio_iommu_type1 always acquires "iommu->lock" before invoking them.

> 
> That said, it's not very clear why patches #4 and #5 are here at all, given 
> that patches #6 and #7 appear quite happy to handle block entries.
Splitting blocks into pages is very important for dirty page tracking. Page
mappings greatly reduce the amount of dirty memory that has to be handled. The
KVM stage-2 MMU side has the same logic.

Yes, #6 (log_sync) and #7 (log_clear) are designed to handle both block and
page mappings. The "split" operation may fail (e.g. without BBML1/2, or on
ENOMEM), but we can still track dirty state at block granularity (see the
sketch below), which is still a much better choice than the full-dirty policy.
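
For illustration, a minimal sketch (hypothetical names, simplified bitmap
layout) of that block-granule fallback: when a split has failed, a dirty block
entry is simply reported as a run of dirty pages.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT    12
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Set one bit per 4K page in [iova, iova + size) of the dirty bitmap. */
static void mark_range_dirty(unsigned long *bitmap, uint64_t base_iova,
                             uint64_t iova, size_t size)
{
    uint64_t first = (iova - base_iova) >> PAGE_SHIFT;
    uint64_t nr = (uint64_t)size >> PAGE_SHIFT;

    for (uint64_t i = first; i < first + nr; i++)
        bitmap[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
}

/*
 * Sketch of the fallback: if log_sync meets a (still unsplit) writable
 * block descriptor, report every page the block covers as dirty. Coarser
 * than page-granule tracking, but far better than "everything is dirty".
 */
void report_dirty_block(unsigned long *bitmap, uint64_t base_iova,
                        uint64_t block_iova, size_t block_size)
{
    mark_range_dirty(bitmap, base_iova, block_iova, block_size);
}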

> 
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  20 
>>   drivers/iommu/io-pgtable-arm.c  | 122 
>>   drivers/iommu/iommu.c   |  40 +++
>>   include/linux/io-pgtable.h  |   2 +
>>   include/linux/iommu.h   |  10 ++
>>   5 files changed, 194 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 9208881a571c..5469f4fca820 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2510,6 +2510,25 @@ static int arm_smmu_domain_set_attr(struct 
>> iommu_domain *domain,
>>   return ret;
>>   }
>>   +static size_t arm_smmu_split_block(struct iommu_domain *domain,
>> +   unsigned long iova, size_t size)
>> +{
>> +struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
>> +struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
>> +
>> +if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
>> +dev_err(smmu->dev, "don't support BBML1/2 and split block\n");
>> +return 0;
>> +}
>> +
>> +if (!ops || !ops->split_block) {
>> +pr_err("don't support split block\n");
>> +return 0;
>> +}
>> +
>> +return ops->split_block(ops, iova, size);
>> +}
>> +
>>   static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args 
>> *args)
>>   {
>>   return iommu_fwspec_add_ids(dev, args->args, 1);
>> @@ -2609,6 +2628,7 @@ static struct iommu_ops arm_smmu_ops = {
>>   .device_group= arm_smmu_device_group,
>>   .domain_get_attr= arm_smmu_domain_get_attr,
>>   .domain_set_attr= arm_smmu_domain_set_attr,
>> +.split_block= arm_smmu_split_block,
>>   .of_xlate= arm_smmu_of_xlate,
>>   .get_resv_regions= arm_smmu_get_resv_regions,
>>   .put_resv_regions= generic_iommu_put_resv_regions,
>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>> index e299a44808ae..f3b7f7115e38 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -79,6 +79,8 @@
>>   #define ARM_LPAE_PTE_SH_IS(((arm_lpae_iopte)3) << 8)
>>   #define ARM_LPAE_P

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-06 Thread Keqian Zhu
Hi Robin,

On 2021/2/5 19:48, Robin Murphy wrote:
> On 2021-02-05 09:13, Keqian Zhu wrote:
>> Hi Robin and Jean,
>>
>> On 2021/2/5 3:50, Robin Murphy wrote:
>>> On 2021-01-28 15:17, Keqian Zhu wrote:
>>>> From: jiangkunkun 
>>>>
>>>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>>>> update the access flag and the dirty state of TTD by hardware. It is
>>>> essential to track dirty pages of DMA.
>>>>
>>>> This adds feature detection, none functional change.
>>>>
>>>> Co-developed-by: Keqian Zhu 
>>>> Signed-off-by: Kunkun Jiang 
>>>> ---
>>>>drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
>>>>drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
>>>>include/linux/io-pgtable.h  |  1 +
>>>>3 files changed, 25 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>>>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index 8ca7415d785d..0f0fe71cc10d 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>>>> iommu_domain *domain,
>>>>.pgsize_bitmap= smmu->pgsize_bitmap,
>>>>.ias= ias,
>>>>.oas= oas,
>>>> +.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>>>.coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>>>.tlb= _smmu_flush_ops,
>>>>.iommu_dev= smmu->dev,
>>>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>>>> arm_smmu_device *smmu)
>>>>if (reg & IDR0_HYP)
>>>>smmu->features |= ARM_SMMU_FEAT_HYP;
>>>>+switch (FIELD_GET(IDR0_HTTU, reg)) {
>>>
>>> We need to accommodate the firmware override as well if we need this to be 
>>> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
>>> stack[1].
>> Robin, Thanks for pointing it out.
>>
>> Jean, I see that the IORT HTTU flag overrides the hardware register info 
>> unconditionally. I have some concern about it:
>>
>> If the override flag has HTTU but hardware doesn't support it, then driver 
>> will use this feature but receive access fault or permission fault from SMMU 
>> unexpectedly.
>> 1) If IOPF is not supported, then kernel can not work normally.
>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
>> based dma dirty tracking (this series).
> 
> Yes, if the IORT describes the SMMU incorrectly, things will not work well. 
> Just like if it describes the wrong base address or the wrong interrupt 
> numbers, things will also not work well. The point is that incorrect firmware 
> can be updated in the field fairly easily; incorrect hardware can not.
Agree.

> 
> Say the SMMU designer hard-codes the ID register field to 0x2 because the 
> SMMU itself is capable of HTTU, and they assume it's always going to be wired 
> up coherently, but then a customer integrates it to a non-coherent 
> interconnect. Firmware needs to override that value to prevent an OS thinking 
> that the claimed HTTU capability is ever going to work.
> 
> Or say the SMMU *is* integrated correctly, but due to an erratum discovered 
> later in the interconnect or SMMU itself, it turns out DBM doesn't always 
> work reliably, but AF is still OK. Firmware needs to downgrade the indicated 
> level of support from that which was intended to that which works reliably.
> 
> Or say someone forgets to set an integration tieoff so their SMMU reports 0x0 
> even though it and the interconnect *can* happily support HTTU. In that case, 
> firmware may want to upgrade the value to *allow* an OS to use HTTU despite 
> the ID register being wrong.
Fair enough. A mask can implement "downgrade", but not "upgrade". You make a
reasonable point for the upgrade case.

BTW, my original intention was that a mask could provide some convenience for
BIOS makers, as the override flag could then stay the same for all SMMUs
regardless of whether they support HTTU. But as you show, a mask cannot cover
every scenario.
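
To make the difference concrete, here is a toy sketch (hypothetical flag
values, not driver code): a mask can only clear capabilities reported by IDR0,
whereas the override defined by IORT replaces the field and can therefore also
upgrade it.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical feature flags, not the real IDR0 encoding. */
#define HTTU_HA 0x1u    /* hardware Access flag update */
#define HTTU_HD 0x2u    /* hardware Dirty state update */

/* Mask semantics: firmware can only remove capabilities ("downgrade"). */
static uint32_t httu_with_mask(uint32_t idr0_httu, uint32_t fw_mask)
{
    return idr0_httu & fw_mask;
}

/* Override semantics (what IORT specifies): the firmware value wins. */
static uint32_t httu_with_override(uint32_t idr0_httu, uint32_t fw_override)
{
    (void)idr0_httu;    /* the ID register value is disregarded entirely */
    return fw_override;
}

int main(void)
{
    /* Tie-off bug: IDR0 claims "no HTTU" although the integration supports it. */
    uint32_t idr0 = 0;

    /* A mask cannot repair this ... */
    printf("mask:     %#x\n", httu_with_mask(idr0, HTTU_HA | HTTU_HD));
    /* ... but an override can ("upgrade"). */
    printf("override: %#x\n", httu_with_override(idr0, HTTU_HA | HTTU_HD));
    return 0;
}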

> 
>> As the IORT spec doesn't give an explicit explanation for HTTU override, can 
>> we comprehend it as a mask for HTTU related hardware register?
>> So the logic becomes: smmu->feature = HTTU override &

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-06 Thread Keqian Zhu
Hi Robin,

On 2021/2/6 0:11, Robin Murphy wrote:
> On 2021-02-05 11:48, Robin Murphy wrote:
>> On 2021-02-05 09:13, Keqian Zhu wrote:
>>> Hi Robin and Jean,
>>>
>>> On 2021/2/5 3:50, Robin Murphy wrote:
>>>> On 2021-01-28 15:17, Keqian Zhu wrote:
>>>>> From: jiangkunkun 
>>>>>
>>>>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>>>>> update the access flag and the dirty state of TTD by hardware. It is
>>>>> essential to track dirty pages of DMA.
>>>>>
>>>>> This adds feature detection, none functional change.
>>>>>
>>>>> Co-developed-by: Keqian Zhu 
>>>>> Signed-off-by: Kunkun Jiang 
>>>>> ---
>>>>>drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
>>>>>drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
>>>>>include/linux/io-pgtable.h  |  1 +
>>>>>3 files changed, 25 insertions(+)
>>>>>
>>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>>>>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> index 8ca7415d785d..0f0fe71cc10d 100644
>>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>>>>> iommu_domain *domain,
>>>>>.pgsize_bitmap= smmu->pgsize_bitmap,
>>>>>.ias= ias,
>>>>>.oas= oas,
>>>>> +.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>>>>.coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>>>>.tlb= _smmu_flush_ops,
>>>>>.iommu_dev= smmu->dev,
>>>>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>>>>> arm_smmu_device *smmu)
>>>>>if (reg & IDR0_HYP)
>>>>>smmu->features |= ARM_SMMU_FEAT_HYP;
>>>>>+switch (FIELD_GET(IDR0_HTTU, reg)) {
>>>>
>>>> We need to accommodate the firmware override as well if we need this to be 
>>>> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
>>>> stack[1].
>>> Robin, Thanks for pointing it out.
>>>
>>> Jean, I see that the IORT HTTU flag overrides the hardware register info 
>>> unconditionally. I have some concern about it:
>>>
>>> If the override flag has HTTU but hardware doesn't support it, then driver 
>>> will use this feature but receive access fault or permission fault from 
>>> SMMU unexpectedly.
>>> 1) If IOPF is not supported, then kernel can not work normally.
>>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
>>> based dma dirty tracking (this series).
>>
>> Yes, if the IORT describes the SMMU incorrectly, things will not work well. 
>> Just like if it describes the wrong base address or the wrong interrupt 
>> numbers, things will also not work well. The point is that incorrect 
>> firmware can be updated in the field fairly easily; incorrect hardware can 
>> not.
>>
>> Say the SMMU designer hard-codes the ID register field to 0x2 because the 
>> SMMU itself is capable of HTTU, and they assume it's always going to be 
>> wired up coherently, but then a customer integrates it to a non-coherent 
>> interconnect. Firmware needs to override that value to prevent an OS 
>> thinking that the claimed HTTU capability is ever going to work.
>>
>> Or say the SMMU *is* integrated correctly, but due to an erratum discovered 
>> later in the interconnect or SMMU itself, it turns out DBM doesn't always 
>> work reliably, but AF is still OK. Firmware needs to downgrade the indicated 
>> level of support from that which was intended to that which works reliably.
>>
>> Or say someone forgets to set an integration tieoff so their SMMU reports 
>> 0x0 even though it and the interconnect *can* happily support HTTU. In that 
>> case, firmware may want to upgrade the value to *allow* an OS to use HTTU 
>> despite the ID register being wrong.
>>
>>> As the IORT spec doesn't give an explicit explanation for HTTU override, 
>>> can we comprehend it as a mask for HTTU related hardware register?
>>> So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;
>>
>> No, it literally states that the OS must use the value of the firmware field 
>> *instead* of the value from the hardware field.
> 
> Oops, apologies for an oversight there - I've been reviewing IORT spec 
> updates lately so naturally had the newest version open already. Turns out 
> these descriptions were only clarified in the most recent release, so if you 
> were looking at an older document they *were* horribly vague.
Yep, my local version is E, which was released in July 2020. I downloaded
version E.a just now, thanks. ;-)

Thanks,
Keqian


Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-06 Thread Keqian Zhu
Hi Jean,

On 2021/2/5 17:51, Jean-Philippe Brucker wrote:
> Hi Keqian,
> 
> On Fri, Feb 05, 2021 at 05:13:50PM +0800, Keqian Zhu wrote:
>>> We need to accommodate the firmware override as well if we need this to be 
>>> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
>>> stack[1].
>> Robin, Thanks for pointing it out.
>>
>> Jean, I see that the IORT HTTU flag overrides the hardware register info 
>> unconditionally. I have some concern about it:
>>
>> If the override flag has HTTU but hardware doesn't support it, then driver 
>> will use this feature but receive access fault or permission fault from SMMU 
>> unexpectedly.
>> 1) If IOPF is not supported, then kernel can not work normally.
>> 2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
>> based dma dirty tracking (this series).
>>
>> As the IORT spec doesn't give an explicit explanation for HTTU override, can 
>> we comprehend it as a mask for HTTU related hardware register?
> 
> To me "Overrides the value of SMMU_IDR0.HTTU" is clear enough: disregard
> the value of SMMU_IDR0.HTTU and use the one specified by IORT instead. And
> that's both ways, since there is no validity mask for the IORT value: if
> there is an IORT table, always ignore SMMU_IDR0.HTTU.
> 
> That's how the SMMU driver implements the COHACC bit, which has the same
> wording in IORT. So I think we should implement HTTU the same way.
OK, and Robin said that the latest IORT spec literally states it.

> 
> One complication is that there is no equivalent override for device tree.
> I think it can be added later if necessary, because unlike IORT it can be
> tri state (property not present, overriden positive, overridden negative).
Yeah, that would be more flexible. ;-)

> 
> Thanks,
> Jean
> 
> .
> 
Thanks,
Keqian


Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-05 Thread Keqian Zhu
Hi Robin and Jean,

On 2021/2/5 3:50, Robin Murphy wrote:
> On 2021-01-28 15:17, Keqian Zhu wrote:
>> From: jiangkunkun 
>>
>> The SMMU which supports HTTU (Hardware Translation Table Update) can
>> update the access flag and the dirty state of TTD by hardware. It is
>> essential to track dirty pages of DMA.
>>
>> This adds feature detection, none functional change.
>>
>> Co-developed-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
>>   include/linux/io-pgtable.h  |  1 +
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 8ca7415d785d..0f0fe71cc10d 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
>> iommu_domain *domain,
>>   .pgsize_bitmap= smmu->pgsize_bitmap,
>>   .ias= ias,
>>   .oas= oas,
>> +.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
>>   .coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
>>   .tlb= _smmu_flush_ops,
>>   .iommu_dev= smmu->dev,
>> @@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
>> arm_smmu_device *smmu)
>>   if (reg & IDR0_HYP)
>>   smmu->features |= ARM_SMMU_FEAT_HYP;
>>   +switch (FIELD_GET(IDR0_HTTU, reg)) {
> 
> We need to accommodate the firmware override as well if we need this to be 
> meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
> stack[1].
Robin, Thanks for pointing it out.

Jean, I see that the IORT HTTU flag overrides the hardware register info 
unconditionally. I have some concern about it:

If the override flag has HTTU but hardware doesn't support it, then driver will 
use this feature but receive access fault or permission fault from SMMU 
unexpectedly.
1) If IOPF is not supported, then kernel can not work normally.
2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
based dma dirty tracking (this series).

As the IORT spec doesn't give an explicit explanation for HTTU override, can we 
comprehend it as a mask for HTTU related hardware register?
So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;

> 
>> +case IDR0_HTTU_NONE:
>> +break;
>> +case IDR0_HTTU_HA:
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +break;
>> +case IDR0_HTTU_HAD:
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
>> +smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
>> +break;
>> +default:
>> +dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
>> +return -ENXIO;
>> +}
>> +
>>   /*
>>* The coherency feature as set by FW is used in preference to the ID
>>* register, but warn on mismatch.
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index 96c2e9565e00..e91bea44519e 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -33,6 +33,10 @@
>>   #define IDR0_ASID16(1 << 12)
>>   #define IDR0_ATS(1 << 10)
>>   #define IDR0_HYP(1 << 9)
>> +#define IDR0_HTTUGENMASK(7, 6)
>> +#define IDR0_HTTU_NONE0
>> +#define IDR0_HTTU_HA1
>> +#define IDR0_HTTU_HAD2
>>   #define IDR0_COHACC(1 << 4)
>>   #define IDR0_TTFGENMASK(3, 2)
>>   #define IDR0_TTF_AARCH642
>> @@ -286,6 +290,8 @@
>>   #define CTXDESC_CD_0_TCR_TBI0(1ULL << 38)
>> #define CTXDESC_CD_0_AA64(1UL << 41)
>> +#define CTXDESC_CD_0_HD(1UL << 42)
>> +#define CTXDESC_CD_0_HA(1UL << 43)
>>   #define CTXDESC_CD_0_S(1UL << 44)
>>   #define CTXDESC_CD_0_R(1UL << 45)
>>   #define CTXDESC_CD_0_A(1UL << 46)
>> @@ -604,6 +610,8 @@ struct arm_smmu_device {
>>   #define ARM_SMMU_FEAT_RANGE_INV(1 << 15)
>>   #define ARM_SMMU_FEAT_BTM(1 << 16)
>>   #define ARM_SMMU_FEAT_SVA(1 << 17)
>> +#define ARM_SMMU_FEAT_

Re: [PATCH] iommu: Update the document of IOMMU_DOMAIN_UNMANAGED

2021-02-02 Thread Keqian Zhu



On 2021/2/2 20:58, Robin Murphy wrote:
> On 2021-02-02 08:53, Keqian Zhu wrote:
>> Signed-off-by: Keqian Zhu 
>> ---
>>   include/linux/iommu.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 77e561ed57fd..e8f2efae212b 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -68,7 +68,7 @@ struct iommu_domain_geometry {
>>*  devices
>>*IOMMU_DOMAIN_IDENTITY- DMA addresses are system physical 
>> addresses
>>*IOMMU_DOMAIN_UNMANAGED- DMA mappings managed by IOMMU-API user, 
>> used
>> - *  for VMs
>> + *  for VMs or userspace driver frameworks
> 
> Given that "VMs" effectively has to mean VFIO, doesn't it effectively already 
> imply other uses of VFIO anyway? Unmanaged domains are also used in other 
> subsystems/drivers inside the kernel and we're not naming those, so I don't 
> see that it's particularly helpful to specifically call out one more VFIO 
> use-case.
> 
> Perhaps the current wording could be generalised a little more, but we 
> certainly don't want to start trying to maintain an exhaustive list of users 
> here...
Yep, a more generalised description is better. I will revise it after I have
had a look at all the use cases...

Thanks,
Keqian

> 
> Robin.
> 
>>*IOMMU_DOMAIN_DMA- Internally used for DMA-API implementations.
>>*  This flag allows IOMMU drivers to implement
>>*  certain optimizations for these domains
>>
> .
> 


  1   2   3   >