Re: [PATCH 4/7] vfio: iommu_type1: Fix missing dirty page when promote pinned_scope

2020-12-15 Thread zhukeqian
Hi Alex,

On 2020/12/15 8:04, Alex Williamson wrote:
> On Thu, 10 Dec 2020 15:34:22 +0800
> Keqian Zhu  wrote:
> 
>> When we pin or detach a group which is not dirty tracking capable,
>> we will try to promote the pinned_scope of the vfio_iommu.
>>
>> If we succeed, vfio will only report the pinned_scope as dirty to
>> userspace next time, so any memory written before the pin or detach
>> is missed.
>>
>> The solution is to mark all dma ranges as dirty before promoting
>> the pinned_scope of the vfio_iommu.
> 
> Please don't bury fixes patches into a series with other optimizations
> and semantic changes.  Send it separately.
> 
OK, I will.

>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 18 ++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index bd9a94590ebc..00684597b098 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1633,6 +1633,20 @@ static struct vfio_group 
>> *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
>>  return group;
>>  }
>>  
>> +static void vfio_populate_bitmap_all(struct vfio_iommu *iommu)
>> +{
>> +struct rb_node *n;
>> +unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
>> +
>> +for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
>> +struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +unsigned long nbits = dma->size >> pgshift;
>> +
>> +if (dma->iommu_mapped)
>> +bitmap_set(dma->bitmap, 0, nbits);
>> +}
>> +}
> 
> 
> If we detach a group which results in only non-IOMMU backed mdevs,
> don't we also clear dma->iommu_mapped as part of vfio_unmap_unpin()
> such that this test is invalid?  Thanks,

Good spot :-). The code will skip bitmap_set in this situation.

We should set the bitmap unconditionally when the vfio_iommu is promoted,
as we must have had an IOMMU-backed domain before promoting the vfio_iommu.

Besides, I think we should also mark pages dirty in vfio_remove_dma if
dirty tracking is active. Right?

Thanks,
Keqian

> 
> Alex
> 
>> +
>>  static void promote_pinned_page_dirty_scope(struct vfio_iommu *iommu)
>>  {
>>  struct vfio_domain *domain;
>> @@ -1657,6 +1671,10 @@ static void promote_pinned_page_dirty_scope(struct 
>> vfio_iommu *iommu)
>>  }
>>  
>>  iommu->pinned_page_dirty_scope = true;
>> +
>> +/* Set the full bitmap to avoid missing dirty pages */
>> +if (iommu->dirty_page_tracking)
>> +vfio_populate_bitmap_all(iommu);
>>  }
>>  
>>  static bool vfio_iommu_has_sw_msi(struct list_head *group_resv_regions,
> 
> .
> 


Re: [PATCH 1/7] vfio: iommu_type1: Clear added dirty bit when unwind pin

2020-12-10 Thread zhukeqian



On 2020/12/11 3:16, Alex Williamson wrote:
> On Thu, 10 Dec 2020 15:34:19 +0800
> Keqian Zhu  wrote:
> 
>> Currently we do not clear the newly added dirty bits in the bitmap
>> when we unwind a pin, so if pinning fails halfway, we leave unnecessary
>> dirty bits set. By clearing the added dirty bits on unwind, userspace
>> will see fewer dirty pages, which can save much time handling them.
>>
>> Note that we should distinguish the bits added by this pin from the
>> bits already set before it, so introduce bitmap_added to record this.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 33 ++---
>>  1 file changed, 22 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 67e827638995..f129d24a6ec3 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -637,7 +637,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  struct vfio_iommu *iommu = iommu_data;
>>  struct vfio_group *group;
>>  int i, j, ret;
>> +unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
>>  unsigned long remote_vaddr;
>> +unsigned long bitmap_offset;
>> +unsigned long *bitmap_added;
>> +dma_addr_t iova;
>>  struct vfio_dma *dma;
>>  bool do_accounting;
>>  
>> @@ -650,6 +654,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  
>>  mutex_lock(&iommu->lock);
>>  
>> +bitmap_added = bitmap_zalloc(npage, GFP_KERNEL);
>> +if (!bitmap_added) {
>> +ret = -ENOMEM;
>> +goto pin_done;
>> +}
> 
> 
> This is allocated regardless of whether dirty tracking is enabled, so
> this adds overhead to the common case in order to reduce user overhead
> (not correctness) in the rare condition that dirty tracking is enabled,
> and the even rarer condition that there's a fault during that case.
> This is not a good trade-off.  Thanks,

Hi Alex,

We can allocate the bitmap only when dirty tracking is active; is that
acceptable? Alternatively, we can set the dirty bitmap after all pins
succeed, but that costs CPU time to locate each vfio_dma by iova.

Thanks,
Keqian

> 
> Alex
> 
> 
>> +
>>  /* Fail if notifier list is empty */
>>  if (!iommu->notifier.head) {
>>  ret = -EINVAL;
>> @@ -664,7 +674,6 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
>>  
>>  for (i = 0; i < npage; i++) {
>> -dma_addr_t iova;
>>  struct vfio_pfn *vpfn;
>>  
>>  iova = user_pfn[i] << PAGE_SHIFT;
>> @@ -699,14 +708,10 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  }
>>  
>>  if (iommu->dirty_page_tracking) {
>> -unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
>> -
>> -/*
>> - * Bitmap populated with the smallest supported page
>> - * size
>> - */
>> -bitmap_set(dma->bitmap,
>> -   (iova - dma->iova) >> pgshift, 1);
>> +/* Populated with the smallest supported page size */
>> +bitmap_offset = (iova - dma->iova) >> pgshift;
>> +if (!test_and_set_bit(bitmap_offset, dma->bitmap))
>> +set_bit(i, bitmap_added);
>>  }
>>  }
>>  ret = i;
>> @@ -722,14 +727,20 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  pin_unwind:
>>  phys_pfn[i] = 0;
>>  for (j = 0; j < i; j++) {
>> -dma_addr_t iova;
>> -
>>  iova = user_pfn[j] << PAGE_SHIFT;
>>  dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>>  vfio_unpin_page_external(dma, iova, do_accounting);
>>  phys_pfn[j] = 0;
>> +
>> +if (test_bit(j, bitmap_added)) {
>> +bitmap_offset = (iova - dma->iova) >> pgshift;
>> +clear_bit(bitmap_offset, dma->bitmap);
>> +}
>>  }
>>  pin_done:
>> +if (bitmap_added)
>> +bitmap_free(bitmap_added);
>> +
>>  mutex_unlock(&iommu->lock);
>>  return ret;
>>  }
> 
> .
> 


Re: [PATCH] iommu: Up front sanity check in the arm_lpae_map

2020-12-07 Thread zhukeqian
Hi Robin,

On 2020/12/7 20:46, Robin Murphy wrote:
> On 2020-12-07 12:15, zhukeqian wrote:
>> Hi,
>>
>> On 2020/12/7 20:05, Will Deacon wrote:
>>> On Mon, Dec 07, 2020 at 12:01:09PM +, Robin Murphy wrote:
>>>> On 2020-12-05 08:29, Keqian Zhu wrote:
>>>>> ... then we have more chance to detect wrong code logic.
>>>>
>>>> I don't follow that justification - it's still the same check with the same
>>>> outcome, so how does moving it have any effect on the chance to detect
>>>> errors?
>>
>>>>
>>>> AFAICS the only difference it would make is to make some errors *less*
>>>> obvious - if a sufficiently broken caller passes an empty prot value
>>>> alongside an invalid size or already-mapped address, this will now quietly
>>>> hide the warnings from the more serious condition(s).
>>>>
>>>> Yes, it will bail out a bit faster in the specific case where the prot 
>>>> value
>>>> is the only thing wrong, but since when do we optimise for fundamentally
>>>> incorrect API usage?
>>>
>>> I thought it was the other way round -- doesn't this patch move the "empty
>>> prot" check later, so we have a chance to check the size and addresses
>>> first?
>>
>> Yes, this is my original idea.
>> Since we treat an iommu_prot with no permissions as success right at
>> the start, deferring this early return can expose hidden errors.
> 
> ...oh dear, my apologies. I've just had a week off and apparently in that 
> time I lost the ability to read :(
> 
> I was somehow convinced that the existing check happened at the point where 
> we go to install the PTE, and this patch was moving it earlier. Looking at 
> the whole code in context now I see I got it completely backwards. Sorry for 
> being an idiot.
> 
I see :-) I should make a more descriptive commit message.

> I guess that only goes to show that a more descriptive commit message would 
> definitely be a good thing :)
> 
I have adopted Will's suggestion and sent a v2; please check whether it
is OK with you.

Cheers,
Keqian

[...]
> .
> 


Re: [PATCH] iommu: Up front sanity check in the arm_lpae_map

2020-12-07 Thread zhukeqian
Hi,

On 2020/12/7 20:05, Will Deacon wrote:
> On Mon, Dec 07, 2020 at 12:01:09PM +, Robin Murphy wrote:
>> On 2020-12-05 08:29, Keqian Zhu wrote:
>>> ... then we have more chance to detect wrong code logic.
>>
>> I don't follow that justification - it's still the same check with the same
>> outcome, so how does moving it have any effect on the chance to detect
>> errors?

>>
>> AFAICS the only difference it would make is to make some errors *less*
>> obvious - if a sufficiently broken caller passes an empty prot value
>> alongside an invalid size or already-mapped address, this will now quietly
>> hide the warnings from the more serious condition(s).
>>
>> Yes, it will bail out a bit faster in the specific case where the prot value
>> is the only thing wrong, but since when do we optimise for fundamentally
>> incorrect API usage?
> 
> I thought it was the other way round -- doesn't this patch move the "empty
> prot" check later, so we have a chance to check the size and addresses
> first?

Yes, this is my original idea.
Since we treat an iommu_prot with no permissions as success right at the
start, deferring this early return can expose hidden errors.

Thanks,
Keqian
> 
> Will
> 
>>> Signed-off-by: Keqian Zhu 
>>> ---
>>>   drivers/iommu/io-pgtable-arm.c | 8 
>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>>> index a7a9bc08dcd1..8ade72adab31 100644
>>> --- a/drivers/iommu/io-pgtable-arm.c
>>> +++ b/drivers/iommu/io-pgtable-arm.c
>>> @@ -444,10 +444,6 @@ static int arm_lpae_map(struct io_pgtable_ops *ops, 
>>> unsigned long iova,
>>> arm_lpae_iopte prot;
>>> long iaext = (s64)iova >> cfg->ias;
>>> -   /* If no access, then nothing to do */
>>> -   if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
>>> -   return 0;
>>> -
>>> if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
>>> return -EINVAL;
>>> @@ -456,6 +452,10 @@ static int arm_lpae_map(struct io_pgtable_ops *ops, 
>>> unsigned long iova,
>>> if (WARN_ON(iaext || paddr >> cfg->oas))
>>> return -ERANGE;
>>> +   /* If no access, then nothing to do */
>>> +   if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
>>> +   return 0;
>>> +
>>> prot = arm_lpae_prot_to_pte(data, iommu_prot);
>>> ret = __arm_lpae_map(data, iova, paddr, size, prot, lvl, ptep, gfp);
>>> /*
>>>
> .
> 


Re: [PATCH] iommu: Up front sanity check in the arm_lpae_map

2020-12-07 Thread zhukeqian
Hi Will,

On 2020/12/7 18:59, Will Deacon wrote:
> On Sat, Dec 05, 2020 at 04:29:57PM +0800, Keqian Zhu wrote:
>> ... then we have more chance to detect wrong code logic.
> 
> This could do with being a bit more explicit. Something like:
> 
>   Although handling a mapping request with no permissions is a
>   trivial no-op, defer the early return until after the size/range
>   checks so that we are consistent with other mapping requests.
This looks good, thanks.

> 
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/iommu/io-pgtable-arm.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>> index a7a9bc08dcd1..8ade72adab31 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -444,10 +444,6 @@ static int arm_lpae_map(struct io_pgtable_ops *ops, 
>> unsigned long iova,
>>  arm_lpae_iopte prot;
>>  long iaext = (s64)iova >> cfg->ias;
>>  
>> -/* If no access, then nothing to do */
>> -if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
>> -return 0;
>> -
>>  if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
>>  return -EINVAL;
>>  
>> @@ -456,6 +452,10 @@ static int arm_lpae_map(struct io_pgtable_ops *ops, 
>> unsigned long iova,
>>  if (WARN_ON(iaext || paddr >> cfg->oas))
>>  return -ERANGE;
>>  
>> +/* If no access, then nothing to do */
>> +if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
>> +return 0;
> 
> This looks sensible to me, but please can you make the same change for
> io-pgtable-arm-v7s.c so that the behaviour is consistent across the two
> formats?
> 
OK. I can do it right now.


Thanks,
Keqian
> Thanks,
> 
> Will
> .
> 


Re: [PATCH v2 1/2] clocksource: arm_arch_timer: Use stable count reader in erratum sne

2020-12-03 Thread zhukeqian
Hi Marc,

On 2020/12/3 22:58, Marc Zyngier wrote:
> On 2020-08-18 04:28, Keqian Zhu wrote:
>> In commit 0ea415390cd3 ("clocksource/arm_arch_timer: Use
>> arch_timer_read_counter to access stable counters"), we separated the
>> stable and normal count readers to avoid unnecessary overhead on systems
>> that have no timer erratum.
>>
>> However, erratum_set_next_event_tval_generic() still uses the normal
>> reader. Convert it to the stable reader.
>>
>> Fixes: 0ea415390cd3 ("clocksource/arm_arch_timer: Use
>>arch_timer_read_counter to access stable counters")
> 
> On a single line.
Addressed. Thanks.

> 
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/clocksource/arm_arch_timer.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/clocksource/arm_arch_timer.c
>> b/drivers/clocksource/arm_arch_timer.c
>> index 6c3e841..777d38c 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -396,10 +396,10 @@ static void
>> erratum_set_next_event_tval_generic(const int access, unsigned long
>>  ctrl &= ~ARCH_TIMER_CTRL_IT_MASK;
>>
>>  if (access == ARCH_TIMER_PHYS_ACCESS) {
>> -cval = evt + arch_counter_get_cntpct();
>> +cval = evt + arch_counter_get_cntpct_stable();
>>  write_sysreg(cval, cntp_cval_el0);
>>  } else {
>> -cval = evt + arch_counter_get_cntvct();
>> +cval = evt + arch_counter_get_cntvct_stable();
>>  write_sysreg(cval, cntv_cval_el0);
>>  }
> 
> With that fixed:
> 
> Acked-by: Marc Zyngier 
> 
> This should go via the clocksource tree.
Added Cc to its maintainers, thanks.

> 
> Thanks,
> 
> M.
Cheers,
Keqian


Re: [PATCH v2 2/2] clocksource: arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI

2020-12-03 Thread zhukeqian
Hi Marc,

On 2020/12/3 22:57, Marc Zyngier wrote:
> On 2020-08-18 04:28, Keqian Zhu wrote:
>> The ARM virtual counter supports an event stream: it can only trigger an
>> event when the trigger bit (selected by the value of CNTKCTL_EL1.EVNTI)
>> of CNTVCT_EL0 changes, so the actual period of the event stream is
>> 2^(cntkctl_evnti + 1) counter cycles. For example, when the trigger bit
>> is 0, the virtual counter triggers an event every two cycles.
>>
>> Fixes: 037f637767a8 ("drivers: clocksource: add support for
>>ARM architected timer event stream")
> 
> Fixes: tags should on a single line.
Will do.

> 
>> Suggested-by: Marc Zyngier 
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/clocksource/arm_arch_timer.c | 10 +++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/clocksource/arm_arch_timer.c
>> b/drivers/clocksource/arm_arch_timer.c
>> index 777d38c..e3b2ee0 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -824,10 +824,14 @@ static void arch_timer_configure_evtstream(void)
>>  {
>>  int evt_stream_div, pos;
>>
>> -/* Find the closest power of two to the divisor */
>> -evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ;
>> +/*
>> + * Find the closest power of two to the divisor. As the event
>> + * stream can at most be generated at half the frequency of the
>> + * counter, use half the frequency when computing the divider.
>> + */
>> +evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ / 2;
>>  pos = fls(evt_stream_div);
>> -if (pos > 1 && !(evt_stream_div & (1 << (pos - 2))))
>> +if ((pos == 1) || (pos > 1 && !(evt_stream_div & (1 << (pos - 2)))))
>>  pos--;
> 
> You don't explain why you are special-casing pos == 1.
The logic is not clear here; I will try to rework it.

> 
>>  /* enable event stream */
>>  arch_timer_evtstrm_enable(min(pos, 15));
> 
> Also, please Cc the subsystem maintainers:
> 
> CLOCKSOURCE, CLOCKEVENT DRIVERS
> M:  Daniel Lezcano 
> M:  Thomas Gleixner 
> L:  linux-kernel@vger.kernel.org
> S:  Supported
> T:  git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
> timers/core
> F:  Documentation/devicetree/bindings/timer/
> F:  drivers/clocksource/
> 
Will do.

> Thanks,
> 
> M.

Thanks,
Keqian


Re: [PATCH v2 1/2] KVM: arm64: Some fixes of PV-time interface document

2020-12-03 Thread zhukeqian



On 2020/12/3 23:04, Marc Zyngier wrote:
> On 2020-08-17 12:07, Keqian Zhu wrote:
>> Rename PV_FEATURES to PV_TIME_FEATURES.
>>
>> Signed-off-by: Keqian Zhu 
>> Reviewed-by: Andrew Jones 
>> Reviewed-by: Steven Price 
>> ---
>>  Documentation/virt/kvm/arm/pvtime.rst | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/arm/pvtime.rst
>> b/Documentation/virt/kvm/arm/pvtime.rst
>> index 687b60d..94bffe2 100644
>> --- a/Documentation/virt/kvm/arm/pvtime.rst
>> +++ b/Documentation/virt/kvm/arm/pvtime.rst
>> @@ -3,7 +3,7 @@
>>  Paravirtualized time support for arm64
>>  ======================================
>>
>> -Arm specification DEN0057/A defines a standard for paravirtualised time
>> +Arm specification DEN0057/A defines a standard for paravirtualized time
>>  support for AArch64 guests:
> 
> nit: I do object to this change (some of us are British! ;-).
Oh, I will pay attention to this. Thanks!

Keqian
> 
> M.


Re: [PATCH v2 0/2] clocksource: arm_arch_timer: Some fixes

2020-12-03 Thread zhukeqian
Hi Marc,

I found that this bugfix series has not been applied yet.
Does it need any modification? I hope you can pick it up :-)

Thanks,
Keqian

On 2020/8/18 11:28, Keqian Zhu wrote:
> change log:
> 
> v2:
>  - Do not revert commit 0ea415390cd3, fix it instead.
>  - Correct the tags of second patch.
> 
> Keqian Zhu (2):
>   clocksource: arm_arch_timer: Use stable count reader in erratum sne
>   clocksource: arm_arch_timer: Correct fault programming of
> CNTKCTL_EL1.EVNTI
> 
>  drivers/clocksource/arm_arch_timer.c | 14 +-
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 


Re: [PATCH v2 0/2] KVM: arm64: Some fixes and code adjustments for pvtime ST

2020-12-03 Thread zhukeqian
Hi Marc,

I found that this series has not been applied yet.
Does it need any modification? I hope you can pick it up :-)

Thanks,
Keqian

On 2020/8/17 19:07, Keqian Zhu wrote:
> While picking up pvtime LPT support for arm64, I made some trivial fixes
> for pvtime ST.
> 
> change log:
> 
> v2:
>  - Add Andrew's and Steven's R-b.
>  - Correct commit message of the first patch.
>  - Drop the second patch.
> 
> Keqian Zhu (2):
>   KVM: arm64: Some fixes of PV-time interface document
>   KVM: arm64: Use kvm_write_guest_lock when init stolen time
> 
>  Documentation/virt/kvm/arm/pvtime.rst | 6 +++---
>  arch/arm64/kvm/pvtime.c   | 6 +-
>  2 files changed, 4 insertions(+), 8 deletions(-)
> 


Re: [RFC PATCH 0/5] KVM: arm64: Add pvtime LPT support

2020-09-06 Thread zhukeqian
Hi Marc and Steven,

On 2020/9/2 18:09, Steven Price wrote:
> Hi Marc,
> 
> Sorry for the slow response, I've been on holiday.
> 
> On 22/08/2020 11:31, Marc Zyngier wrote:
>> Hi Steven,
>>
>> On Wed, 19 Aug 2020 09:54:40 +0100,
>> Steven Price  wrote:
>>>
>>> On 18/08/2020 15:41, Marc Zyngier wrote:
>>>> On 2020-08-17 09:41, Keqian Zhu wrote:
>>>>> [...]
>>>>>
>>>>> Things that need concern:
>>>>> 1. https://developer.arm.com/docs/den0057/a needs update.
>>>>
>>>> LPT was explicitly removed from the spec because it doesn't really
>>>> solve the problem, specially for the firmware: EFI knows
>>>> nothing about this, for example. How is it going to work?
>>>> Also, nobody was ever able to explain how this would work for
>>>> nested virt.
>>>>
>>>> ARMv8.4 and ARMv8.6 have the feature set that is required to solve
>>>> this problem without adding more PV to the kernel.
>>>
>>> Hi Marc,
>>>
>>> These are good points, however we do still have the situation that
>>> CPUs that don't have ARMv8.4/8.6 clearly cannot implement this. I
>>> presume the use-case Keqian is looking at predates the necessary
>>> support in the CPU - Keqian if you can provide more details on the
>>> architecture(s) involved that would be helpful.
>>
>> My take on this is that it is a fictional use case. In my experience,
>> migration happens across *identical* systems, and *any* difference
>> visible to guests will cause things to go wrong. Errata management
>> gets in the way, as usual (name *one* integration that isn't broken
>> one way or another!).
> 
> Keqian appears to have a use case - but obviously I don't know the details. I 
> guess Keqian needs to convince you of that.
Sure, there is a use case, but I'm very sorry that it's inconvenient to
share the details. Cross-chip migration may be supported on arm64
eventually, so I don't think the use case is the key problem.

> 
>> Allowing migration across heterogeneous hosts requires a solution to
>> the errata management problem, which everyone (including me) has
>> decided to ignore so far (and I claim that not having a constant timer
>> frequency exposed to guests is an architecture bug).
> 
> I agree - errata management needs to be solved before LPT. Between restricted 
> subsets of hosts this doesn't seem impossible, but I guess we should stall 
> LPT until a credible solution is proposed. I'm certainly not proposing one at 
> the moment.
> 
>>> Nested virt is indeed more of an issue - we did have some ideas around
>>> using SDEI that never made it to the spec.
>>
>> SDEI? Sigh... Why would SDEI be useful for NV and not for !NV?
> 
> SDEI provides a way of injecting a synchronous exception on migration - 
> although that certainly isn't the only possible mechanism. For NV we have the 
> problem that a guest-guest may be running at the point of migration. However 
> it's not practical for the host hypervisor to provide the necessary table 
> directly to the guest-guest which means the guest-hypervisor must update the 
> tables before the guest-guest is allowed to run on the new host. The only 
> plausible route I could see for this is injecting a synchronous exception 
> into the guest (per VCPU) to ensure any guest-guests running are exited at 
> migration time.
> 
> !NV is easier because we don't have to worry about multiple levels of 
> para-virtualisation.
> 
>>> However I would argue that the most pragmatic approach would be to
>>> not support the combination of nested virt and LPT. Hopefully that
>>> can wait until the counter scaling support is available and not
>>> require PV.
>>
>> And have yet another set of band aids that paper over the fact that we
>> can't get a consistent story on virtualization? No, thank you.
>>
>> NV is (IMHO) much more important than LPT as it has a chance of
>> getting used. LPT is just another tick box, and the fact that ARM is
>> ready to ignore sideline a decent portion of the architecture is a
>> clear sign that it hasn't been thought out.
> 
> Different people have different priorities. NV is definitely important for 
> many people. LPT may also be important if you've already got a bunch of VMs 
> running on machines and you want to be able to (gradually) replace them with 
> newer hosts which happen to have a different clock frequency. Those VMs 
> running now clearly aren't using NV.
> 
> However, I have to admit it's not me that has the use-case, so I'll leave it 
> for others who might actually know the specifics to explain the details.
> 
>>> We are discussing (re-)releasing the spec with the LPT parts added. If
>>> you have fundamental objections then please let me know.
>>
>> I do, see above. I'm stating that the use case doesn't really exist
>> given the state of the available HW and the fragmentation of the
>> architecture, and that ignoring the most important innovation in the
>> virtualization architecture since ARMv7 is at best short-sighted.
>>
>> Time scaling is just an instance of the errata management problem, and
>> that is the issue that 

Re: [RFC PATCH 0/5] KVM: arm64: Add pvtime LPT support

2020-08-20 Thread zhukeqian



On 2020/8/19 16:54, Steven Price wrote:
> On 18/08/2020 15:41, Marc Zyngier wrote:
>> On 2020-08-17 09:41, Keqian Zhu wrote:
>>> Hi all,
>>>
>>> This patch series picks up the LPT pvtime feature originally developed
>>> by Steven Price: https://patchwork.kernel.org/cover/10726499/
>>>
>>> Background:
>>>
>>> There is demand for cross-platform migration, which means we have to
>>> handle differing CPU features and arch counter frequencies between
>>> hosts. This patch series solves the latter problem.
>>>
>>> About LPT:
>>>
>>> This implements support for Live Physical Time (LPT) which provides the
>>> guest with a method to derive a stable counter of time during which the
>>> guest is executing even when the guest is being migrated between hosts
>>> with different physical counter frequencies.
>>>
>>> Changes on Steven Price's work:
>>> 1. LPT structure: use symmetrical semantics for the scale multiplier,
>>>    and use fraction bits instead of "shift" to make everything clear.
>>> 2. Structure allocation: the host kernel does not allocate the LPT
>>>    structure; instead it is allocated by userspace through VM
>>>    attributes. The save/restore functionality can be removed.
>>> 3. Since the LPT structure only needs updating once for each guest run,
>>>    add a flag to indicate the update status. This has two benefits:
>>>    1) avoid multiple updates by each vCPU; 2) if the update flag is not
>>>    set, return NOT SUPPORTED for the corresponding guest HVC call.
>>> 4. Add VM device attributes interface for userspace configuration.
>>> 5. Add a base LPT read/write layer to reduce code.
>>> 6. Support ptimer scaling.
>>> 7. Support timer event stream translation.
>>>
>>> Things that need concern:
>>> 1. https://developer.arm.com/docs/den0057/a needs update.
>>
>> LPT was explicitly removed from the spec because it doesn't really
>> solve the problem, specially for the firmware: EFI knows
>> nothing about this, for example. How is it going to work?
>> Also, nobody was ever able to explain how this would work for
>> nested virt.
>>
>> ARMv8.4 and ARMv8.6 have the feature set that is required to solve
>> this problem without adding more PV to the kernel.
> 
> Hi Marc,
> 
> These are good points, however we do still have the situation that CPUs that 
> don't have ARMv8.4/8.6 clearly cannot implement this. I presume the use-case 
> Keqian is looking at predates the necessary support in the CPU - Keqian if 
> you can provide more details on the architecture(s) involved that would be 
> helpful.
> 
> Nested virt is indeed more of an issue - we did have some ideas around using 
> SDEI that never made it to the spec. However I would argue that the most 
> pragmatic approach would be to not support the combination of nested virt and 
> LPT. Hopefully that can wait until the counter scaling support is available 
> and not require PV.
> 
> We are discussing (re-)releasing the spec with the LPT parts added. If you 
> have fundamental objections then please let me know.
> 
> Thanks,
> 
> Steve
> .
> 
Hi Marc and Steven,

In fact, I have implemented a demo that utilizes v8.6-ECV to present a
constant timer frequency to the guest. It seems to work well, but this
approach has some shortcomings:

1. Guest accesses to cntvct, cntv_ctl, cntv_tval and cntv_cval must trap
   to EL2. Every trap takes hundreds of nanoseconds, and each timer
   interrupt involves about 5~6 traps, so it spends several microseconds
   (this does not seem like a serious problem :-) ). But the traps cause
   large deviations for nanosleep.
2. We have to make cntfrq part of the guest context. However, only the
   highest exception level has the right to modify cntfrq, which means
   we have to add a new SMC call.
3. cntkctl controls the event stream frequency, so KVM should also
   translate guest accesses of cntkctl. However, we cannot trap guest
   accesses to it. Any solution for this problem?

I think LPT, as a software solution, can solve these problems. However,
as Marc said, UEFI knows nothing about LPT and will access the
vtimer/counter directly. The key point is how serious the impact on
UEFI is.

I can see that some UEFI runtime services and drivers/applications will
access the timer/counter. For runtime services, it is OK, because we can
translate the results returned from UEFI for Linux. For
drivers/applications, they will feel time go faster or slower after
migration. This is a problem indeed :-)

Thanks,
Keqian


Re: [PATCH 2/2] clocksource: arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI

2020-08-17 Thread zhukeqian
Hi Marc,

On 2020/8/17 20:56, Marc Zyngier wrote:
> On 2020-08-17 13:24, Keqian Zhu wrote:
>> The ARM virtual counter supports an event stream: it can only trigger an
>> event when the trigger bit (selected by the value of CNTKCTL_EL1.EVNTI)
>> of CNTVCT_EL0 changes, so the actual period of the event stream is
>> 2^(cntkctl_evnti + 1) counter cycles. For example, when the trigger bit
>> is 0, the virtual counter triggers an event every two cycles.
>>
>> Signed-off-by: Marc Zyngier 
> 
> I have never given you this tag, you are making it up. Please read
> Documentation/process/submitting-patches.rst to understand what
> tag you can put by yourself.
Sorry about my mistake.
> 
> At best, put "Suggested-by" tag, as this is different from what
> I posted anyway.
OK, I will use this tag.

Thanks,
Keqian
> 
> Thanks,
> 
> M.
> 
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/clocksource/arm_arch_timer.c | 10 +++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/clocksource/arm_arch_timer.c
>> b/drivers/clocksource/arm_arch_timer.c
>> index 6e11c60..4140a37 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -794,10 +794,14 @@ static void arch_timer_configure_evtstream(void)
>>  {
>>  int evt_stream_div, pos;
>>
>> -/* Find the closest power of two to the divisor */
>> -evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ;
>> +/*
>> + * Find the closest power of two to the divisor. As the event
>> + * stream can at most be generated at half the frequency of the
>> + * counter, use half the frequency when computing the divider.
>> + */
>> +evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ / 2;
>>  pos = fls(evt_stream_div);
>> -if (pos > 1 && !(evt_stream_div & (1 << (pos - 2))))
>> +if ((pos == 1) || (pos > 1 && !(evt_stream_div & (1 << (pos - 2)))))
>>  pos--;
>>  /* enable event stream */
>>  arch_timer_evtstrm_enable(min(pos, 15));
> 


Re: [PATCH 1/2] clocksource: arm_arch_timer: Simplify and fix count reader code logic

2020-08-17 Thread zhukeqian
Hi Marc,

On 2020/8/17 20:52, Marc Zyngier wrote:
> On 2020-08-17 13:24, Keqian Zhu wrote:
>> In commit 0ea415390cd3 ("clocksource/arm_arch_timer: Use
>> arch_timer_read_counter to access stable counters"), we separated the
>> stable and normal count readers. Actually, the stable reader correctly
>> falls back to the normal reader if we have no workaround.
> 
> Resulting in an unnecessary overhead on non-broken systems that can run
> without CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND. Not happening.
OK, so I got the purpose of that patch wrong.
> 
>> Besides, erratum_set_next_event_tval_generic() uses the normal reader,
>> which is obviously wrong, so just revert this commit to solve that
>> problem as well.
> 
> If you want to fix something, post a patch that does exactly that.
> 
I will.

Thanks,
Keqian
> M.


Re: [PATCH 2/3] KVM: uapi: Remove KVM_DEV_TYPE_ARM_PV_TIME in kvm_device_type

2020-08-17 Thread zhukeqian
Hi Steven,

On 2020/8/17 17:49, Steven Price wrote:
> On 17/08/2020 09:43, zhukeqian wrote:
>> Hi Marc,
>>
[...]
>>>
>>> It is pretty unfortunate that PV time has turned into such a train wreck,
>>> but that's what we have now, and it has to stay.
>> Well, I see. It is a sad thing indeed.
> 
> Sorry about that, this got refactored so many times I guess I lost track of 
> what was actually needed and this hunk remained when it should have been 
> removed.
> 
It's fine :-) , not a serious problem.
> I would hope that I'm the only one who has any userspace code which uses 
> this, but I guess we should still be cautious since this has been in several 
> releases now.
> 
OK. To be safe, we ought to drop this patch to avoid breaking any 
user-space program.
> Steve
> .
Thanks,
Keqian
> 


Re: [PATCH 1/3] KVM: arm64: Some fixes of PV-time interface document

2020-08-17 Thread zhukeqian
Hi Andrew,

On 2020/8/17 16:47, Andrew Jones wrote:
> On Mon, Aug 17, 2020 at 11:37:27AM +0800, Keqian Zhu wrote:
>> Rename PV_FEATURES to PV_TIME_FEATURES
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  Documentation/virt/kvm/arm/pvtime.rst | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/arm/pvtime.rst 
>> b/Documentation/virt/kvm/arm/pvtime.rst
>> index 687b60d..94bffe2 100644
>> --- a/Documentation/virt/kvm/arm/pvtime.rst
>> +++ b/Documentation/virt/kvm/arm/pvtime.rst
>> @@ -3,7 +3,7 @@
>>  Paravirtualized time support for arm64
>>  ======================================
>>  
>> -Arm specification DEN0057/A defines a standard for paravirtualised time
>> +Arm specification DEN0057/A defines a standard for paravirtualized time
>>  support for AArch64 guests:
>>  
>>  https://developer.arm.com/docs/den0057/a
>> @@ -19,8 +19,8 @@ Two new SMCCC compatible hypercalls are defined:
>>  
>>  These are only available in the SMC64/HVC64 calling convention as
>>  paravirtualized time is not available to 32 bit Arm guests. The existence of
>> -the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES
>> -mechanism before calling it.
>> +the PV_TIME_FEATURES hypercall should be probed using the SMCCC 1.1
>> +ARCH_FEATURES mechanism before calling it.
>>  
>>  PV_TIME_FEATURES
>>  = ==
>> -- 
>> 1.8.3.1
>>
> 
> Reviewed-by: Andrew Jones 
Thanks for your review.

It would also be very nice if you have time to review the patch series
supporting pvtime LPT.

Thanks,
Keqian
> 
> .
> 


Re: [PATCH 2/3] KVM: uapi: Remove KVM_DEV_TYPE_ARM_PV_TIME in kvm_device_type

2020-08-17 Thread zhukeqian
Hi Marc,

On 2020/8/17 15:39, Marc Zyngier wrote:
> On 2020-08-17 04:37, Keqian Zhu wrote:
>> ARM64 PV-time ST is configured by userspace through vCPU attribute,
>> and KVM_DEV_TYPE_ARM_PV_TIME is unused.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  include/uapi/linux/kvm.h   | 2 --
>>  tools/include/uapi/linux/kvm.h | 2 --
>>  2 files changed, 4 deletions(-)
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 4fdf303..9a6b97e 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1258,8 +1258,6 @@ enum kvm_device_type {
>>  #define KVM_DEV_TYPE_ARM_VGIC_ITS KVM_DEV_TYPE_ARM_VGIC_ITS
>>  KVM_DEV_TYPE_XIVE,
>>  #define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE
>> -KVM_DEV_TYPE_ARM_PV_TIME,
>> -#define KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_ARM_PV_TIME
>>  KVM_DEV_TYPE_MAX,
>>  };
>>
>> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
>> index 4fdf303..9a6b97e 100644
>> --- a/tools/include/uapi/linux/kvm.h
>> +++ b/tools/include/uapi/linux/kvm.h
>> @@ -1258,8 +1258,6 @@ enum kvm_device_type {
>>  #define KVM_DEV_TYPE_ARM_VGIC_ITS KVM_DEV_TYPE_ARM_VGIC_ITS
>>  KVM_DEV_TYPE_XIVE,
>>  #define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE
>> -KVM_DEV_TYPE_ARM_PV_TIME,
>> -#define KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_ARM_PV_TIME
>>  KVM_DEV_TYPE_MAX,
>>  };
> 
> No. You can't drop anything from UAPI, used or not. Doing so will
> break the compilation of any userspace that, for any reason, references
> this value. We cannot reuse this value in the future either, as it would
> create a situation where we wouldn't know which device to create.
> 
> It is pretty unfortunate that PV time has turned into such a train wreck,
> but that's what we have now, and it has to stay.
Well, I see. It is a sad thing indeed.

Thanks,
Keqian


Re: [PATCH 0/9] arm64: Stolen time support

2020-07-28 Thread zhukeqian
Hi Steven,

On 2020/7/27 18:48, Steven Price wrote:
> On 21/07/2020 04:26, zhukeqian wrote:
>> Hi Steven,
> 
> Hi Keqian,
> 
>> On 2019/8/2 22:50, Steven Price wrote:
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>> Do you plan to pick up LPT support? There is demand for cross-frequency
>> migration (from older platforms to newer platforms).
> 
> I don't have any plans to pick up the LPT support at the moment - feel free 
> to pick it up! ;)
> 
>> I am not clear about the overhead and approach problems here; could you
>> please give some detailed information? Maybe we can work together to
>> resolve these concerns. :-)
> 
> Fundamentally the issue here is that LPT only solves one small part of 
> migration between different hosts. To successfully migrate between hosts with 
> different CPU implementations it is also necessary to be able to virtualise 
> various ID registers (e.g. MIDR_EL1, REVIDR_EL1, AIDR_EL1) which we have no 
> support for currently.
> 
Yeah, currently we are trying to do both timer freq virtualization and CPU 
feature virtualization.

> The problem with just virtualising the registers is how you handle errata. 
> The guest will currently use those (and other) ID registers to decide whether 
> to enable specific errata workarounds. But what errata should be enabled for 
> a guest which might migrate to another host?
> 
Thanks for pointing this out.

I think the most important thing is to introduce a concept called a CPU
baseline, which represents a standard platform.
If we bring up a guest with a specific CPU baseline, that guest can only
run on platforms that are compatible with this baseline.
So "baseline" and "compatible" are the key points for guaranteeing successful
cross-platform migration.


> What we ideally need is a mechanism to communicate to the guest what 
> workarounds are required to successfully run on any of the hosts that the 
> guest may be migrated to. You may also have the situation where the 
> workarounds required for two hosts are mutually incompatible - something 
> needs to understand this and do the "right thing" (most likely just reject 
> this situation, i.e. prevent the migration).
> 
> There are various options here: e.g. a para-virtualised interface to describe 
> the workarounds (but this is hard to do in an OS-agnostic way), or virtual-ID 
> registers describing an idealised environment where no workarounds are 
> required (and only hosts that have no errata affecting a guest would be able 
> to provide this).
> 
My idea is similar to the "idealised environment", but errata workarounds
still exist.
We do not provide a para-virtualised interface, and migration is restricted
to platforms that are compatible with the baseline.

A baseline should have two aspects: CPU features and errata. Platforms that
are compatible with a specific baseline must provide the corresponding CPU
features and errata.

> Given the above complexity and the fact that Armv8.6-A standardises the 
> frequency to 1GHz this didn't seem worth continuing with. So LPT was dropped 
> from the spec and patches to avoid holding up the stolen time support.
> 
> However, if you have a use case which doesn't require such a generic 
> migration (e.g. perhaps old and new platforms are based on the same IP) then 
> it might be worth looking at bring this back. But to make the problem 
> solvable it either needs to be restricted to platforms which are 
> substantially the same (so the errata list will be identical), or there's 
> work to be done in preparation to deal with migrating a guest successfully 
> between hosts with potentially different errata requirements.
> 
> Can you share more details about the hosts that you are interested in 
> migrating between?
Here we have a new platform with a 1GHz timer while the old platform runs at
100MHz, so we want to solve cross-platform migration first.

Thanks,
Keqian
> 
> Thanks,
> 
> Steve
> .
> 


Re: [RESEND PATCH] drivers: arm arch timer: Correct fault programming of CNTKCTL_EL1.EVNTI

2020-07-28 Thread zhukeqian
Hi Marc,

On 2020/7/28 18:16, Marc Zyngier wrote:
> On 2020-07-17 10:21, Keqian Zhu wrote:
>> ARM virtual counter supports event stream. It can only trigger an event
>> when the trigger bit of CNTVCT_EL0 changes from 0 to 1 (or from 1 to 0),
>> so the actual period of event stream is 2 ^ (cntkctl_evnti + 1). For
>> example, when the trigger bit is 0, then it triggers an event for every
>> two cycles.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/clocksource/arm_arch_timer.c | 17 ++---
>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/clocksource/arm_arch_timer.c
>> b/drivers/clocksource/arm_arch_timer.c
>> index ecf7b7db2d05..06d99a4b1b9b 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -799,10 +799,20 @@ static void __arch_timer_setup(unsigned type,
>>  static void arch_timer_evtstrm_enable(int divider)
>>  {
>>  u32 cntkctl = arch_timer_get_cntkctl();
>> +int cntkctl_evnti;
>> +
>> +/*
>> + * Note that it can only trigger an event when the trigger bit
>> + * of CNTVCT_EL0 changes from 1 to 0 (or from 0 to 1), so the
>> + * actual period of event stream is 2 ^ (cntkctl_evnti + 1).
>> + */
>> +cntkctl_evnti = divider - 1;
>> +cntkctl_evnti = min(cntkctl_evnti, 15);
>> +cntkctl_evnti = max(cntkctl_evnti, 0);
>>
>>  cntkctl &= ~ARCH_TIMER_EVT_TRIGGER_MASK;
>>  /* Set the divider and enable virtual event stream */
>> -cntkctl |= (divider << ARCH_TIMER_EVT_TRIGGER_SHIFT)
>> +cntkctl |= (cntkctl_evnti << ARCH_TIMER_EVT_TRIGGER_SHIFT)
>>  | ARCH_TIMER_VIRT_EVT_EN;
>>  arch_timer_set_cntkctl(cntkctl);
>>  arch_timer_set_evtstrm_feature();
>> @@ -816,10 +826,11 @@ static void arch_timer_configure_evtstream(void)
>>  /* Find the closest power of two to the divisor */
>>  evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ;
>>  pos = fls(evt_stream_div);
>> -if (pos > 1 && !(evt_stream_div & (1 << (pos - 2))))
>> +if ((pos == 1) || (pos > 1 && !(evt_stream_div & (1 << (pos - 2)))))
>>  pos--;
>> +
>>  /* enable event stream */
>> -arch_timer_evtstrm_enable(min(pos, 15));
>> +arch_timer_evtstrm_enable(pos);
>>  }
>>
>>  static void arch_counter_set_user_access(void)
> 
> This looks like a very convoluted fix. If the problem you are
> trying to fix is that the event frequency is at most half of
> that of the counter, why isn't the patch below the most
> obvious fix?
> 
> Thanks,
> 
> M.
> 
> diff --git a/drivers/clocksource/arm_arch_timer.c 
> b/drivers/clocksource/arm_arch_timer.c
> index 6c3e84180146..0a65414b781f 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -824,8 +824,12 @@ static void arch_timer_configure_evtstream(void)
>  {
>  int evt_stream_div, pos;
> 
> -/* Find the closest power of two to the divisor */
> -evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ;
> +/*
> + * Find the closest power of two to the divisor. As the event
> + * stream can at most be generated at half the frequency of the
> + * counter, use half the frequency when computing the divider.
> + */
> +evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ / 2;
>  pos = fls(evt_stream_div);
>  if (pos > 1 && !(evt_stream_div & (1 << (pos - 2))))
I think this does not consider the case of pos == 1 (though it should not occur...)
>  pos--;
> 

It looks good to me.

Thanks,
Keqian


Re: [RESEND PATCH] drivers: arm arch timer: Correct fault programming of CNTKCTL_EL1.EVNTI

2020-07-28 Thread zhukeqian
Friendly ping.
Is this a valid bugfix?

On 2020/7/17 17:21, Keqian Zhu wrote:
> ARM virtual counter supports event stream. It can only trigger an event
> when the trigger bit of CNTVCT_EL0 changes from 0 to 1 (or from 1 to 0),
> so the actual period of event stream is 2 ^ (cntkctl_evnti + 1). For
> example, when the trigger bit is 0, then it triggers an event for every
> two cycles.
> 
> Signed-off-by: Keqian Zhu 
> ---
>  drivers/clocksource/arm_arch_timer.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/clocksource/arm_arch_timer.c 
> b/drivers/clocksource/arm_arch_timer.c
> index ecf7b7db2d05..06d99a4b1b9b 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -799,10 +799,20 @@ static void __arch_timer_setup(unsigned type,
>  static void arch_timer_evtstrm_enable(int divider)
>  {
>   u32 cntkctl = arch_timer_get_cntkctl();
> + int cntkctl_evnti;
> +
> + /*
> +  * Note that it can only trigger an event when the trigger bit
> +  * of CNTVCT_EL0 changes from 1 to 0 (or from 0 to 1), so the
> +  * actual period of event stream is 2 ^ (cntkctl_evnti + 1).
> +  */
> + cntkctl_evnti = divider - 1;
> + cntkctl_evnti = min(cntkctl_evnti, 15);
> + cntkctl_evnti = max(cntkctl_evnti, 0);
>  
>   cntkctl &= ~ARCH_TIMER_EVT_TRIGGER_MASK;
>   /* Set the divider and enable virtual event stream */
> - cntkctl |= (divider << ARCH_TIMER_EVT_TRIGGER_SHIFT)
> + cntkctl |= (cntkctl_evnti << ARCH_TIMER_EVT_TRIGGER_SHIFT)
>   | ARCH_TIMER_VIRT_EVT_EN;
>   arch_timer_set_cntkctl(cntkctl);
>   arch_timer_set_evtstrm_feature();
> @@ -816,10 +826,11 @@ static void arch_timer_configure_evtstream(void)
>   /* Find the closest power of two to the divisor */
>   evt_stream_div = arch_timer_rate / ARCH_TIMER_EVT_STREAM_FREQ;
>   pos = fls(evt_stream_div);
> - if (pos > 1 && !(evt_stream_div & (1 << (pos - 2))))
> + if ((pos == 1) || (pos > 1 && !(evt_stream_div & (1 << (pos - 2)))))
>   pos--;
> +
>   /* enable event stream */
> - arch_timer_evtstrm_enable(min(pos, 15));
> + arch_timer_evtstrm_enable(pos);
>  }
>  
>  static void arch_counter_set_user_access(void)
> 


Re: [PATCH 0/9] arm64: Stolen time support

2020-07-20 Thread zhukeqian
Hi Steven,

On 2019/8/2 22:50, Steven Price wrote:
> This series add support for paravirtualized time for arm64 guests and
> KVM hosts following the specification in Arm's document DEN 0057A:
> 
> https://developer.arm.com/docs/den0057/a
> 
> It implements support for stolen time, allowing the guest to
> identify time when it is forcibly not executing.
> 
> It doesn't implement support for Live Physical Time (LPT) as there are
> some concerns about the overheads and approach in the above
Do you plan to pick up LPT support? There is demand for cross-frequency
migration (from older platforms to newer platforms).

I am not clear about the overhead and approach problems here; could you please
give some detailed information? Maybe we can work together to resolve these
concerns. :-)

Thanks,
Keqian
> specification, and I expect an updated version of the specification to
> be released soon with just the stolen time parts.
> 
> I previously posted a series including LPT (as well as stolen time):
> https://lore.kernel.org/kvmarm/20181212150226.38051-1-steven.pr...@arm.com/
> 
> Patches 2, 5, 7 and 8 are cleanup patches and could be taken separately.
> 
> Christoffer Dall (1):
>   KVM: arm/arm64: Factor out hypercall handling from PSCI code
> 
> Steven Price (8):
>   KVM: arm64: Document PV-time interface
>   KVM: arm64: Implement PV_FEATURES call
>   KVM: arm64: Support stolen time reporting via shared structure
>   KVM: Allow kvm_device_ops to be const
>   KVM: arm64: Provide a PV_TIME device to user space
>   arm/arm64: Provide a wrapper for SMCCC 1.1 calls
>   arm/arm64: Make use of the SMCCC 1.1 wrapper
>   arm64: Retrieve stolen time as paravirtualized guest
> 
>  Documentation/virtual/kvm/arm/pvtime.txt | 107 +
>  arch/arm/kvm/Makefile|   2 +-
>  arch/arm/kvm/handle_exit.c   |   2 +-
>  arch/arm/mm/proc-v7-bugs.c   |  13 +-
>  arch/arm64/include/asm/kvm_host.h|  13 +-
>  arch/arm64/include/asm/kvm_mmu.h |   2 +
>  arch/arm64/include/asm/pvclock-abi.h |  20 +++
>  arch/arm64/include/uapi/asm/kvm.h|   6 +
>  arch/arm64/kernel/Makefile   |   1 +
>  arch/arm64/kernel/cpu_errata.c   |  80 --
>  arch/arm64/kernel/kvm.c  | 155 ++
>  arch/arm64/kvm/Kconfig   |   1 +
>  arch/arm64/kvm/Makefile  |   2 +
>  arch/arm64/kvm/handle_exit.c |   4 +-
>  include/kvm/arm_hypercalls.h |  44 ++
>  include/kvm/arm_psci.h   |   2 +-
>  include/linux/arm-smccc.h|  58 +++
>  include/linux/cpuhotplug.h   |   1 +
>  include/linux/kvm_host.h |   4 +-
>  include/linux/kvm_types.h|   2 +
>  include/uapi/linux/kvm.h |   2 +
>  virt/kvm/arm/arm.c   |  18 +++
>  virt/kvm/arm/hypercalls.c| 138 
>  virt/kvm/arm/mmu.c   |  44 ++
>  virt/kvm/arm/psci.c  |  84 +-
>  virt/kvm/arm/pvtime.c| 190 +++
>  virt/kvm/kvm_main.c  |   6 +-
>  27 files changed, 848 insertions(+), 153 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>  create mode 100644 arch/arm64/include/asm/pvclock-abi.h
>  create mode 100644 arch/arm64/kernel/kvm.c
>  create mode 100644 include/kvm/arm_hypercalls.h
>  create mode 100644 virt/kvm/arm/hypercalls.c
>  create mode 100644 virt/kvm/arm/pvtime.c
> 


Re: [PATCH v2 6/8] KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability

2020-07-05 Thread zhukeqian
Hi,

On 2020/7/2 21:55, Keqian Zhu wrote:
> For that using arm64 DBM to log dirty pages has the side effect
> of long time dirty log sync, we should give userspace opportunity
> to enable or disable this feature, to realize some policy.
> 
> This feature is disabled by default.
> 
> Signed-off-by: Keqian Zhu 
> Signed-off-by: Peng Liang 
> ---
>  arch/arm64/include/asm/kvm_host.h |  1 +
>  arch/arm64/kvm/arm.c  | 35 +++
>  arch/arm64/kvm/mmu.c  | 22 +++
>  arch/arm64/kvm/reset.c|  5 +
>  include/uapi/linux/kvm.h  |  1 +
>  tools/include/uapi/linux/kvm.h|  1 +
>  6 files changed, 65 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 86b9c210ba43..69a5317c7049 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -481,6 +481,7 @@ u64 __kvm_call_hyp(void *hypfn, ...);
>  void force_vm_exit(const cpumask_t *mask);
>  void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
>  void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot 
> *memslot);
> +void kvm_mmu_sync_dirty_log_all(struct kvm *kvm);
>  
>  int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>   int exception_index);
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index fefa5406e037..9e3f765d5467 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -78,6 +78,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   struct kvm_enable_cap *cap)
>  {
>   int r;
> +#ifdef CONFIG_ARM64_HW_AFDBM
> + int i;
> + struct kvm_vcpu *vcpu;
> + bool enable_hw_dirty_log;
> +#endif
>  
>   if (cap->flags)
>   return -EINVAL;
> @@ -87,6 +92,36 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   r = 0;
>   kvm->arch.return_nisv_io_abort_to_user = true;
>   break;
> +#ifdef CONFIG_ARM64_HW_AFDBM
> + case KVM_CAP_ARM_HW_DIRTY_LOG:
> + if (!arm_mmu_hw_dbm_supported() || (cap->args[0] & ~0x1))
> + r = -EINVAL;
A "break" is missing here :-(. The code was converted from "return -EINVAL".
> +
> + enable_hw_dirty_log = !!(cap->args[0] & 0x1);
> + if (!!(kvm->arch.vtcr & VTCR_EL2_HD) != enable_hw_dirty_log) {
> + if (enable_hw_dirty_log)
> + kvm->arch.vtcr |= VTCR_EL2_HD;
> + else
> + kvm->arch.vtcr &= ~VTCR_EL2_HD;
> +
> + /*
> +  * We should kick vcpus out of guest mode here to
> +  * load new vtcr value to vtcr_el2 register when
> +  * re-enter guest mode.
> +  */
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + kvm_vcpu_kick(vcpu);
> +
> + if (!enable_hw_dirty_log) {
> + mutex_lock(&kvm->slots_lock);
> + kvm_mmu_sync_dirty_log_all(kvm);
> + mutex_unlock(&kvm->slots_lock);
> + }
> + }
> +
> + r = 0;
> + break;
> +#endif
>   default:
>   r = -EINVAL;
>   break;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index adfa62f1fced..1a48554accb0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -2590,6 +2590,28 @@ void kvm_mmu_sync_dirty_log(struct kvm *kvm, struct 
> kvm_memory_slot *memslot)
>   spin_unlock(&kvm->mmu_lock);
>   srcu_read_unlock(&kvm->srcu, srcu_idx);
>  }
> +
> +/**
> + * kvm_mmu_sync_dirty_log_all() - synchronize dirty log from PTEs for whole 
> VM
> + * @kvm: The KVM pointer
> + *
> + * Called with kvm->slots_lock mutex acquired
> + */
> +void kvm_mmu_sync_dirty_log_all(struct kvm *kvm)
> +{
> + struct kvm_memslots *slots = kvm_memslots(kvm);
> + struct kvm_memory_slot *memslots = slots->memslots;
> + struct kvm_memory_slot *memslot;
> + int slot;
> +
> + if (unlikely(!slots->used_slots))
> + return;
> +
> + for (slot = 0; slot < slots->used_slots; slot++) {
> + memslot = &memslots[slot];
> + kvm_mmu_sync_dirty_log(kvm, memslot);
> + }
> +}
>  #endif /* CONFIG_ARM64_HW_AFDBM */
>  
>  void kvm_arch_commit_memory_region(struct kvm *kvm,
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index d3b209023727..a3be703dd54b 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -83,6 +83,11 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, 
> long ext)
>   r = has_vhe() && system_supports_address_auth() &&
>system_supports_generic_auth();
>   break;
> +#ifdef CONFIG_ARM64_HW_AFDBM
> + case KVM_CAP_ARM_HW_DIRTY_LOG:
> + r = arm_mmu_hw_dbm_supported();
> +

Re: [PATCH 03/12] KVM: arm64: Report hardware dirty status of stage2 PTE if coverred

2020-07-02 Thread zhukeqian
Hi Steven,

On 2020/7/1 19:28, Steven Price wrote:
> Hi,
> 
> On 16/06/2020 10:35, Keqian Zhu wrote:
>> kvm_set_pte is called to replace a target PTE with a desired one.
>> We always do this without changing the desired one, but if dirty
>> status set by hardware is coverred, let caller know it.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>   arch/arm64/kvm/mmu.c | 36 +++-
>>   1 file changed, 35 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 5ad87bce23c0..27407153121b 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -194,11 +194,45 @@ static void clear_stage2_pmd_entry(struct kvm *kvm, 
>> pmd_t *pmd, phys_addr_t addr
>>   put_page(virt_to_page(pmd));
>>   }
>>   -static inline void kvm_set_pte(pte_t *ptep, pte_t new_pte)
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +/**
>> + * @ret: true if dirty status set by hardware is coverred.
> 
> NIT: s/coverred/covered/, this is in several places.
> 
OK.
>> + */
>> +static bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
>> +{
>> +pteval_t old_pteval, new_pteval, pteval;
>> +bool old_logging, new_no_write;
>> +
>> +old_logging = kvm_hw_dbm_enabled() && !pte_none(*ptep) &&
>> +  kvm_s2pte_dbm(ptep);
>> +new_no_write = pte_none(new_pte) || kvm_s2pte_readonly(&new_pte);
>> +
>> +if (!old_logging || !new_no_write) {
>> +WRITE_ONCE(*ptep, new_pte);
>> +dsb(ishst);
>> +return false;
>> +}
>> +
>> +new_pteval = pte_val(new_pte);
>> +pteval = READ_ONCE(pte_val(*ptep));
> 
> This usage of *ptep looks wrong - it's read twice using READ_ONCE (once in 
> kvm_s2pte_dbm()) and once without any decoration (in the pte_none() call). 
> Which looks a bit dodgy and at the very least needs some justification. 
> AFAICT you would be better taking a local copy and using that rather than 
> reading from memory repeatedly.
> 
Oh yes. We can use a local copy here for better performance.

I am not sure there is a correctness problem, though. I *think* we always
acquire the corresponding lock before manipulating PTs, so no other agent
will update the PTs during our several accesses (except the MMU, but it
only modifies AF and [S2]AP).
Am I missing something?

>> +do {
>> +old_pteval = pteval;
>> +pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, new_pteval);
>> +} while (pteval != old_pteval);
> This loop appears to be reinventing xchg_relaxed(). Any reason we can't just 
> use xchg_relaxed()? Also we had a dsb() after the WRITE_ONCE but you are 
> using the _relaxed variant here. What is the justification for not having a 
> barrier?
> 
Aha, I have changed the code several times and it is now equivalent to
xchg_relaxed(), thanks.

I used _relaxed here because I am not clear why we use _relaxed in
kvm_set_s2XXX_XXX but a dsb() in kvm_set_pte.
I will correct this in patch v2.
>> +
>> +return !kvm_s2pte_readonly(&__pte(pteval));
>> +}
>> +#else
>> +/**
>> + * @ret: true if dirty status set by hardware is coverred.
>> + */
>> +static inline bool kvm_set_pte(pte_t *ptep, pte_t new_pte)
>>   {
>>   WRITE_ONCE(*ptep, new_pte);
>>   dsb(ishst);
>> +return false;
>>   }
>> +#endif /* CONFIG_ARM64_HW_AFDBM */
> 
> You might be able to avoid this #ifdef by redefining old_logging as:
> 
>   old_logging = IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && ...
> 
> I *think* the compiler should be able to kill the dead code and leave you 
> with just the above when the config symbol is off.
> 
After my test, you are right ;-) .


Thanks,
Keqian
> Steve
> 
>> static inline void kvm_set_pmd(pmd_t *pmdp, pmd_t new_pmd)
>>   {
>>
> 
> .
> 


Re: [PATCH 00/12] KVM: arm64: Support stage2 hardware DBM

2020-06-17 Thread zhukeqian
Hi,

On 2020/6/16 17:35, Keqian Zhu wrote:
> This patch series add support for stage2 hardware DBM, and it is only
> used for dirty log for now.
> 
> It works well under some migration test cases, including VM with 4K
> pages or 2M THP. I checked the SHA256 hash digest of all memory and
> they keep same for source VM and destination VM, which means no dirty
> pages is missed under hardware DBM.
> 
> Some key points:
> 
> 1. Only support hardware updates of dirty status for PTEs. PMDs and PUDs
>are not involved for now.
> 
> 2. About *performance*: In RFC patch, I have mentioned that for every 64GB
>memory, KVM consumes about 40ms to scan all PTEs to collect dirty log.
>
>Initially, I plan to solve this problem using parallel CPUs. However
>I faced two problems.
> 
>The first is the bottleneck of memory bandwidth. A single thread will occupy
>bandwidth of about 500GB/s, and we can support about 4 parallel threads at
>most, so the ideal speedup ratio is low.
Aha, I got this wrong. I tested again and found that the speedup ratio can
be about 23x when using 32 CPUs to scan PTs (about 5ms to scan the PTs
of 200GB of RAM).

> 
>The second is the huge impact on other CPUs. To scan PTs quickly, I use
>smp_call_function_many, which is based on IPI, to dispatch the workload
>to other CPUs. Though it can complete the work in time, interrupts are
>disabled while scanning PTs, which has a huge impact on other CPUs.
I think we can divide the scanning workload into smaller pieces, disabling
and re-enabling interrupts periodically.

> 
>Now, I have made the hardware dirty log dynamically switchable.
>Userspace can enable it before VM migration and disable it when
>few dirty pages remain. Thus VM downtime is not affected. 
BTW, we can keep this interface for userspace in case CPU computing resources
are not sufficient.

Thanks,
Keqian
> 
> 
> 3. About correctness: Only add DBM bit when PTE is already writable, so
>we still have readonly PTE and some mechanisms which rely on readonly
>PTs are not broken.
> 
> 4. About PTs modification races: There are two kinds of PTs modification.
>
>The first is adding or clearing specific bit, such as AF or RW. All
>these operations have been converted to be atomic, avoid covering
>dirty status set by hardware.
>
>The second is replacement, such as PTEs unmapping or changement. All
>these operations will invoke kvm_set_pte finally. kvm_set_pte have
>been converted to be atomic and we save the dirty status to underlying
>bitmap if dirty status is coverred.
> 
> 
> Keqian Zhu (12):
>   KVM: arm64: Add some basic functions to support hw DBM
>   KVM: arm64: Modify stage2 young mechanism to support hw DBM
>   KVM: arm64: Report hardware dirty status of stage2 PTE if coverred
>   KVM: arm64: Support clear DBM bit for PTEs
>   KVM: arm64: Add KVM_CAP_ARM_HW_DIRTY_LOG capability
>   KVM: arm64: Set DBM bit of PTEs during write protecting
>   KVM: arm64: Scan PTEs to sync dirty log
>   KVM: Omit dirty log sync in log clear if initially all set
>   KVM: arm64: Steply write protect page table by mask bit
>   KVM: arm64: Save stage2 PTE dirty status if it is coverred
>   KVM: arm64: Support disable hw dirty log after enable
>   KVM: arm64: Enable stage2 hardware DBM
> 
>  arch/arm64/include/asm/kvm_host.h |  11 +
>  arch/arm64/include/asm/kvm_mmu.h  |  56 +++-
>  arch/arm64/include/asm/sysreg.h   |   2 +
>  arch/arm64/kvm/arm.c  |  22 +-
>  arch/arm64/kvm/mmu.c  | 411 --
>  arch/arm64/kvm/reset.c|  14 +-
>  include/uapi/linux/kvm.h  |   1 +
>  tools/include/uapi/linux/kvm.h|   1 +
>  virt/kvm/kvm_main.c   |   7 +-
>  9 files changed, 499 insertions(+), 26 deletions(-)
> 


Re: [PATCH RFC] KVM: arm64: Sidestep stage2_unmap_vm() on vcpu reset when S2FWB is supported

2020-06-01 Thread zhukeqian
Hi Marc,

On 2020/5/31 0:31, Marc Zyngier wrote:
> Hi Alex,
> 
> On 2020-05-30 11:46, Alexandru Elisei wrote:
>> Hi,
> 
> [...]
> 
 diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
 index 48d0ec44ad77..e6378162cdef 100644
 --- a/virt/kvm/arm/arm.c
 +++ b/virt/kvm/arm/arm.c
 @@ -983,8 +983,11 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct 
 kvm_vcpu *vcpu,
  /*
   * Ensure a rebooted VM will fault in RAM pages and detect if the
   * guest MMU is turned off and flush the caches as needed.
 + *
 + * S2FWB enforces all memory accesses to RAM being cacheable, we
 + * ensure that the cache is always coherent.
   */
 -if (vcpu->arch.has_run_once)
 +if (vcpu->arch.has_run_once && 
 !cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> I think userspace does not invalidate the icache when loading a new kernel 
>>> image,
>>> and if the guest patched instructions, they could potentially still be in 
>>> the
>>> icache. Should the icache be invalidated if FWB is present?
>>
>> I noticed that this was included in the current pull request and I
>> remembered that
>> I wasn't sure about this part. Did some more digging and it turns out that 
>> FWB
>> implies no cache maintenance needed for *data to instruction*
>> coherence. From ARM
>> DDI 0487F.b, page D5-2635:
>>
>> "When ARMv8.4-S2FWB is implemented, the architecture requires that
>> CLIDR_EL1.{LOUU, LOIUS} are zero so that no levels of data cache need to be
>> cleaned in order to manage coherency with instruction fetches".
>>
>> However, there's no mention that I found for instruction to data coherence,
>> meaning that the icache would still need to be invalidated on each vcpu in 
>> order
>> to prevent fetching of patched instructions from the icache. Am I
>> missing something?
> 
> I think you are right, and this definitely matches the way we deal with
> the icache on the fault path. For some bizarre reason, I always assume
> that FWB implies DIC, which isn't true at all.
> 
> I'm planning to address it as follows. Please let me know what you think.
> 
> Thanks,
> 
> M.
> 
> From f7860d1d284f41afea176cc17e5c9d895ae665e9 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier 
> Date: Sat, 30 May 2020 17:22:19 +0100
> Subject: [PATCH] KVM: arm64: Flush the instruction cache if not unmapping the
>  VM on reboot
> 
> On a system with FWB, we don't need to unmap Stage-2 on reboot,
> as even if userspace takes this opportunity to repaint the whole
> of memory, FWB ensures that the data side stays consistent even
> if the guest uses non-cacheable mappings.
> 
> However, the I-side is not necessarily coherent with the D-side
> if CTR_EL0.DIC is 0. In this case, invalidate the i-cache to
> preserve coherency.
> 
> Reported-by: Alexandru Elisei 
> Fixes: 892713e97ca1 ("KVM: arm64: Sidestep stage2_unmap_vm() on vcpu reset 
> when S2FWB is supported")
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/arm.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index b0b569f2cdd0..d6988401c22a 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -989,11 +989,17 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct 
> kvm_vcpu *vcpu,
>   * Ensure a rebooted VM will fault in RAM pages and detect if the
>   * guest MMU is turned off and flush the caches as needed.
>   *
> - * S2FWB enforces all memory accesses to RAM being cacheable, we
> - * ensure that the cache is always coherent.
> + * S2FWB enforces all memory accesses to RAM being cacheable,
> + * ensuring that the data side is always coherent. We still
> + * need to invalidate the I-cache though, as FWB does *not*
> + * imply CTR_EL0.DIC.
>   */
> -if (vcpu->arch.has_run_once && 
> !cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> -stage2_unmap_vm(vcpu->kvm);
> +if (vcpu->arch.has_run_once) {
> +if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> +stage2_unmap_vm(vcpu->kvm);
> +else
> +__flush_icache_all();
After looking into this function, I think it's OK here. Please ignore
my question :-).
> +}
> 
>  vcpu_reset_hcr(vcpu);
> 
Thanks,
Keqian


Re: [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled

2020-05-27 Thread zhukeqian
Hi Catalin,

On 2020/5/26 19:49, Catalin Marinas wrote:
> On Mon, May 25, 2020 at 07:24:01PM +0800, Keqian Zhu wrote:
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h 
>> b/arch/arm64/include/asm/pgtable-prot.h
>> index 1305e28225fc..f9910ba2afd8 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -79,6 +79,7 @@ extern bool arm64_use_ng_mappings;
>>  })
>>  
>>  #define PAGE_S2 __pgprot(_PROT_DEFAULT | 
>> PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN)
>> +#define PAGE_S2_DBM __pgprot(_PROT_DEFAULT | 
>> PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN | PTE_DBM)
> 
> You don't need a new page permission (see below).
> 
>>  #define PAGE_S2_DEVICE  __pgprot(_PROT_DEFAULT | 
>> PAGE_S2_MEMATTR(DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
>>  
>>  #define PAGE_NONE   __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | 
>> PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index e3b9ee268823..dc97988eb2e0 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1426,6 +1426,10 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t 
>> addr, phys_addr_t end)
>>  pte = pte_offset_kernel(pmd, addr);
>>  do {
>>  if (!pte_none(*pte)) {
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +if (kvm_hw_dbm_enabled() && !kvm_s2pte_dbm(pte))
>> +kvm_set_s2pte_dbm(pte);
>> +#endif
>>  if (!kvm_s2pte_readonly(pte))
>>  kvm_set_s2pte_readonly(pte);
>>  }
> 
> Setting the DBM bit is equivalent to marking the page writable. The
> actual writable pte bit (S2AP[1] or HAP[2] as we call them in Linux for
> legacy reasons) tells you whether the page has been dirtied but it is
> still writable if you set DBM. Doing this in stage2_wp_ptes()
> practically means that you no longer have read-only pages at S2. There
> are several good reasons why you don't want to break this. For example,
> the S2 pte may already be read-only for other reasons (CoW).
> 
Thanks, your comments help to solve the first problem in the cover letter.

> I think you should only set the DBM bit if the pte was previously
> writable. In addition, any permission change to the S2 pte must take
> into account the DBM bit and clear it while transferring the dirty
> status to the underlying page. I'm not deeply familiar with all these
> callbacks into KVM but two such paths are kvm_unmap_hva_range() and the
> kvm_mmu_notifier_change_pte().
Yes, I agree.
> 
> 
>> @@ -1827,7 +1831,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  
>>  ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, _pmd);
>>  } else {
>> -pte_t new_pte = kvm_pfn_pte(pfn, mem_type);
>> +pte_t new_pte;
>> +
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +if (kvm_hw_dbm_enabled() &&
>> +pgprot_val(mem_type) == pgprot_val(PAGE_S2)) {
>> +mem_type = PAGE_S2_DBM;
>> +}
>> +#endif
>> +new_pte = kvm_pfn_pte(pfn, mem_type);
>>  
>>  if (writable) {
>>  new_pte = kvm_s2pte_mkwrite(new_pte);
> 
> That's wrong here. Basically for any fault you get, you just turn the S2
> page writable. The point of DBM is that you don't get write faults at
> all if you have a writable page. So, as I said above, only set the DBM
> bit if you stored a writable S2 pte (kvm_s2pte_mkwrite()).
Yeah, you are right. I will correct it in Patch v1.
> 

Thanks,
Keqian



Re: [RFC PATCH 0/7] kvm: arm64: Support stage2 hardware DBM

2020-05-25 Thread zhukeqian
Hi Marc,

On 2020/5/25 23:44, Marc Zyngier wrote:
> On 2020-05-25 12:23, Keqian Zhu wrote:
>> This patch series add support for stage2 hardware DBM, and it is only
>> used for dirty log for now.
>>
>> It works well under some migration test cases, including VM with 4K
>> pages or 2M THP. I checked the SHA256 hash digest of all memory and
>> they keep same for source VM and destination VM, which means no dirty
>> pages is missed under hardware DBM.
>>
>> However, there are some known issues not solved.
>>
>> 1. Some mechanisms that rely on "write permission fault" become invalid,
>>such as kvm_set_pfn_dirty and "mmap page sharing".
>>
>>kvm_set_pfn_dirty is called in user_mem_abort when guest issues write
>>fault. This guarantees physical page will not be dropped directly when
>>host kernel recycle memory. After using hardware dirty management, we
>>have no chance to call kvm_set_pfn_dirty.
> 
> Then you will end-up with memory corruption under memory pressure.
> This also breaks things like CoW, which we depend on.
>
Yes, these problems look knotty. But I think x86 PML support faces these
problems too, so I believe there must be methods to solve them.
>>
>>For "mmap page sharing" mechanism, host kernel will allocate a new
>>physical page when guest writes a page that is shared with other page
>>table entries. After using hardware dirty management, we have no chance
>>to do this too.
>>
>>I need to do some survey on how stage1 hardware DBM solve these problems.
>>It helps if anyone can figure it out.
>>
>> 2. Page Table Modification Races: Though I have found and solved some data
>>races when kernel changes page table entries, I still doubt that there
>>are data races I am not aware of. It's great if anyone can figure them 
>> out.
>>
>> 3. Performance: Under Kunpeng 920 platform, for every 64GB memory, KVM
>>consumes about 40ms to traverse all PTEs to collect dirty log. It will
>>cause unbearable downtime for migration if memory size is too big. I will
>>try to solve this problem in Patch v1.
> 
> This, in my opinion, is why Stage-2 DBM is fairly useless.
> From a performance perspective, this is the worse possible
> situation. You end up continuously scanning page tables, at
> an arbitrary rate, without a way to evaluate the fault rate.
> 
> One thing S2-DBM would be useful for is SVA, where a device
> write would mark the S2 PTs dirty as they are shared between
> CPU and SMMU. Another thing is SPE, which is essentially a DMA
> agent using the CPU's PTs.
> 
> But on its own, and just to log the dirty pages, S2-DBM is
> pretty rubbish. I wish arm64 had something like Intel's PML,
> which looks far more interesting for the purpose of tracking
> accesses.

Sure, PML is a better solution for hardware management of dirty state.
However, compared to optimizing hardware, optimizing software has a much
shorter cycle time.

Here I have an optimization in mind to solve it. Scanning the page tables
can be done in parallel, which can greatly reduce the time consumed. Since
there is no communication needed between the parallel CPUs, we can achieve
a high speedup ratio.


> 
> Thanks,
> 
> M.
Thanks,
Keqian