Re: ARM/gic-v4: deadlock occurred

2019-07-15 Thread Marc Zyngier
On 15/07/2019 07:32, Guoheyi wrote:
> Hi Marc,
> 
> The issue only occurs after applying the vlpi_map_rework patches, and we 
> can see the patches only affect VM; it changes its_create_device() a 
> little so it may affect host booting in some ways, so I took the lazy 
> way to send it out for some insights.
> 
> I am suspecting below code; if alloc_lpis == false, what will happen? 

If !alloc_lpis, then we don't allocate the lpi_map, which is the
intended effect.

> Anyway, I will investigate more on this.
> 
> 
> 	if (alloc_lpis) {
> 		lpi_map = its_lpi_alloc(nvecs, &lpi_base, &nr_lpis);
> 		if (lpi_map)
> 			col_map = kcalloc(nr_lpis, sizeof(*col_map),
> 					  GFP_KERNEL);
> 	} else {
> 		col_map = kcalloc(nr_ites, sizeof(*col_map), GFP_KERNEL);
> 		nr_lpis = 0;
> 		lpi_base = 0;
> 	}
> 	if (its->is_v4)
> 		vlpi_map = kcalloc(nr_lpis, sizeof(*vlpi_map), GFP_KERNEL);
>
> 	if (!dev || !itt || !col_map || (!lpi_map && alloc_lpis) ||
> 	    (!vlpi_map && its->is_v4)) {
> 		kfree(dev);
> 		kfree(itt);
> 		kfree(lpi_map);
> 		kfree(col_map);
> 		kfree(vlpi_map);
> 		return NULL;
> 	}

How does this relate to the patch posted in this discussion? The
proposed changes turn the locking from a mutex into a raw_spinlock.

That's not to say there is no bug in the GICv4 code, but I'd expect a
bit more analysis before you start pointing at random pieces of code
without establishing any link between effects and possible causes.

Thanks,

M.
> 
> 
> Thanks,
> 
> Heyi
> 
> 
> On 2019/7/13 19:37, Marc Zyngier wrote:
>> On Sat, 13 Jul 2019 19:08:57 +0800
>> Guoheyi  wrote:
>>
>> Hi Heyi,
>>
>>> Hi Marc,
>>>
>>> Really sorry for the delay of testing the rework patches. I picked up
>>> the work these days and applied the patches to our 4.19.36 stable
>>> branch. However, I got below panic during the boot process of host
>>> (not yet to boot guests).
>>>
>>> I supposed the result was not related with my testing kernel version,
>>> for we don't have many differences in ITS driver; I can test against
>>> mainline if you think it is necessary.
>> In general, please report bugs against mainline. There isn't much I can
>> do about your private tree...
>>
>> That being said, a couple of comments below.
>>
>>> Thanks,
>>>
>>> Heyi
>>>
>>>
>>> [   16.990413] iommu: Adding device :00:00.0 to group 6
>>> [   17.000691] pcieport :00:00.0: Signaling PME with IRQ 133
>>> [   17.006456] pcieport :00:00.0: AER enabled with IRQ 134
>>> [   17.012151] iommu: Adding device :00:08.0 to group 7
>>> [   17.018575] Unable to handle kernel paging request at virtual address 00686361635f746f
>>> [   17.026467] Mem abort info:
>>> [   17.029251]   ESR = 0x9604
>>> [   17.032313]   Exception class = DABT (current EL), IL = 32 bits
>>> [   17.038207]   SET = 0, FnV = 0
>>> [   17.041258]   EA = 0, S1PTW = 0
>>> [   17.044391] Data abort info:
>>> [   17.047260]   ISV = 0, ISS = 0x0004
>>> [   17.051081]   CM = 0, WnR = 0
>>> [   17.054035] [00686361635f746f] address between user and kernel address ranges
>>> [   17.061140] Internal error: Oops: 9604 [#1] SMP
>>> [   17.065997] Process kworker/0:4 (pid: 889, stack limit = 0x(ptrval))
>>> [   17.073013] CPU: 0 PID: 889 Comm: kworker/0:4 Not tainted 4.19.36+ #8
>>> [   17.079422] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 0.52 06/20/2019
>>> [   17.086788] Workqueue: events work_for_cpu_fn
>>> [   17.091126] pstate: 20c9 (nzCv daif +PAN +UAO)
>>> [   17.095895] pc : __kmalloc_track_caller+0xb0/0x2a0
>>> [   17.100662] lr : __kmalloc_track_caller+0x64/0x2a0
>>> [   17.105429] sp : 2920ba00
>>> [   17.108728] x29: 2920ba00 x28: 802cb6792780
>>> [   17.114015] x27: 006080c0 x26: 006000c0
>>> [   17.119302] x25: 084c8a00 x24: 802cbfc0fc00
>>> [   17.124588] x23: 802cbfc0fc00 x22: 084c8a00
>>> [   17.129875] x21: 0004 x20: 006000c0
>>> [   17.135161] x19: 65686361635f746f x18: 
>>> [   17.140448] x17: 000e x16: 0007
>>> [   17.145734] x15: 09119708 x14: 
>>> [   17.151021] x13: 0003 x12: 
>>> [   17.156307] x11: 05f5e0ff x10: 2920bb80
>>> [   17.161594] x9 : ffd0 x8 : 0098
>>> [   17.166880] x7 : 2920bb80 x6 : 08a8cb98
>>> [   17.172167] x5 : a705 x4 : 803f802d22e0
>>> [   17.177453] x3 : 2920b990 x2 : 7e00b2dafd00
>>> [   17.182740] x1 : 803f77476000 x0 : 
>>> [   17.188027] Call trace:
>>> [   17.190461]  __kmalloc_track_caller+0xb0/0x2a0
>>> [   17.194886]  kvasprintf+0x7c/0x108
>>> [   17.198271] 

Re: ARM/gic-v4: deadlock occurred

2019-07-15 Thread Guoheyi




On 2019/7/15 17:07, Marc Zyngier wrote:
> On 15/07/2019 07:32, Guoheyi wrote:
>> Hi Marc,
>>
>> The issue only occurs after applying the vlpi_map_rework patches, and we
>> can see the patches only affect VM; it changes its_create_device() a
>> little so it may affect host booting in some ways, so I took the lazy
>> way to send it out for some insights.
>>
>> I am suspecting below code; if alloc_lpis == false, what will happen?
>
> If !alloc_lpis, then we don't allocate the lpi_map, which is the
> intended effect.
>
>> Anyway, I will investigate more on this.
>>
>> [...]
>
> How does this relate to the patch posted in this discussion? The
> proposed changes turn the locking from a mutex into a raw_spinlock.

I'm testing the patchset in
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/vlpi-map-rework,
not only the patch posted in the mail directly. The first patch
"irqchip/gic-v3-its: Make vlpi_map allocations atomic" works well in
our internal tree, and my new testing is against the other 3 patches in
your vlpi-map-rework branch, as I promised. I'm sorry if I didn't state
this clearly.

Heyi

> That's not to say there is no bug in the GICv4 code, but I'd expect a
> bit more analysis before you start pointing at random pieces of code
> without establishing any link between effects and possible causes.
>
> Thanks,
>
> M.
>
>> [...]

Re: [RFC v2 11/14] arm64: Move the ASID allocator code in a separate file

2019-07-15 Thread Julien Grall

On 04/07/2019 15:56, James Morse wrote:
> Hi Julien,

Hi James,

Thank you for the review.

> On 20/06/2019 14:06, Julien Grall wrote:
>> We will want to re-use the ASID allocator in a separate context (e.g
>> allocating VMID). So move the code in a new file.
>>
>> The function asid_check_context has been moved in the header as a static
>> inline function because we want to avoid add a branch when checking if the
>> ASID is still valid.
>>
>> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
>> index 3df63a28856c..b745cf356fe1 100644
>> --- a/arch/arm64/mm/context.c
>> +++ b/arch/arm64/mm/context.c
>> @@ -23,46 +23,21 @@
>>
>> -#define ASID_FIRST_VERSION(info)	NUM_ASIDS(info)
>>
>> diff --git a/arch/arm64/lib/asid.c b/arch/arm64/lib/asid.c
>> new file mode 100644
>> index ..7252e4fdd5e9
>> --- /dev/null
>> +++ b/arch/arm64/lib/asid.c
>> @@ -0,0 +1,185 @@
>>
>> +#define ASID_FIRST_VERSION(info)	(1UL << ((info)->bits))
>
> (oops!)

Good catch, I will fix it in the next version.

>> @@ -344,7 +115,7 @@ static int asids_init(void)
>>  	if (!asid_allocator_init(&asid_info, bits, ASID_PER_CONTEXT,
>>  				 asid_flush_cpu_ctxt))
>>  		panic("Unable to initialize ASID allocator for %lu ASIDs\n",
>> -		      1UL << bits);
>> +		      NUM_ASIDS(&asid_info));
>
> Could this go in the patch that adds NUM_ASIDS()?

Actually this change is potentially wrong. This relies on asid_allocator_init()
to set asid_info.bits even if the function fails.

So I think it would be best to keep 1UL << bits here.

Cheers,

--
Julien Grall
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: ARM/gic-v4: deadlock occurred

2019-07-15 Thread Marc Zyngier
On 15/07/2019 11:43, Guoheyi wrote:
> 
> 
> On 2019/7/15 17:07, Marc Zyngier wrote:
>> On 15/07/2019 07:32, Guoheyi wrote:
>>> Hi Marc,
>>>
>>> The issue only occurs after applying the vlpi_map_rework patches, and we
>>> can see the patches only affect VM; it changes its_create_device() a
>>> little so it may affect host booting in some ways, so I took the lazy
>>> way to send it out for some insights.
>>>
>>> I am suspecting below code; if alloc_lpis == false, what will happen?
>> If !alloc_lpis, then we don't allocate the lpi_map, which is the
>> intended effect.
>>
>>> Anyway, I will investigate more on this.
>>>
>>>
>>> 	if (alloc_lpis) {
>>> 		lpi_map = its_lpi_alloc(nvecs, &lpi_base, &nr_lpis);
>>> 		if (lpi_map)
>>> 			col_map = kcalloc(nr_lpis, sizeof(*col_map),
>>> 					  GFP_KERNEL);
>>> 	} else {
>>> 		col_map = kcalloc(nr_ites, sizeof(*col_map), GFP_KERNEL);
>>> 		nr_lpis = 0;
>>> 		lpi_base = 0;
>>> 	}
>>> 	if (its->is_v4)
>>> 		vlpi_map = kcalloc(nr_lpis, sizeof(*vlpi_map), GFP_KERNEL);
>>>
>>> 	if (!dev || !itt || !col_map || (!lpi_map && alloc_lpis) ||
>>> 	    (!vlpi_map && its->is_v4)) {
>>> 		kfree(dev);
>>> 		kfree(itt);
>>> 		kfree(lpi_map);
>>> 		kfree(col_map);
>>> 		kfree(vlpi_map);
>>> 		return NULL;
>>> 	}
>> How does this relate to the patch posted in this discussion? The
>> proposed changes turn the locking from a mutex into a raw_spinlock.
> 
> I'm testing the patchset in 
> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/vlpi-map-rework,
> not only the patch posted in the mail directly. The first patch 
> *"**irqchip/gic-v3-its: Make vlpi_map allocations atomic" works well in 
> our internal tree, and my new testing is against the other 3 patches in 
> your vlpi-map-rework branch, as I promised. I'm sorry if I didn't state 
> this clearly.

Ah, I had completely forgotten about this branch. As I said, it is
completely untested. I'll see if I can get some brain bandwidth in the
next couple of weeks to get back to it...

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: ARM/gic-v4: deadlock occurred

2019-07-15 Thread Guoheyi



On 2019/7/15 19:13, Marc Zyngier wrote:
> On 15/07/2019 11:43, Guoheyi wrote:
>> On 2019/7/15 17:07, Marc Zyngier wrote:
>>> [...]
>>>
>>> How does this relate to the patch posted in this discussion? The
>>> proposed changes turn the locking from a mutex into a raw_spinlock.
>>
>> I'm testing the patchset in
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/vlpi-map-rework,
>> not only the patch posted in the mail directly. The first patch
>> "irqchip/gic-v3-its: Make vlpi_map allocations atomic" works well in
>> our internal tree, and my new testing is against the other 3 patches in
>> your vlpi-map-rework branch, as I promised. I'm sorry if I didn't state
>> this clearly.
>
> Ah, I had completely forgotten about this branch. As I said, it is
> completely untested. I'll see if I can get some brain bandwidth in the
> next couple of weeks to get back to it...

Yes, a bit too long ago...

I finally found that the panic is caused by this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=irq/vlpi-map-rework&id=fe3dd7e06ee0e82bade4f2a107ef6422e5c9021e

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 18aa04b..6f55886 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2458,6 +2458,8 @@ static void its_free_device(struct its_device *its_dev)

 	list_del(&its_dev->entry);
 	raw_spin_unlock_irqrestore(&its_dev->its->lock, flags);
 	kfree(its_dev->itt);
+	kfree(its_dev->event_map.lpi_map);
+	kfree(its_dev->event_map.col_map);
 	kfree(its_dev);
 }

This patch causes a double free of both lpi_map and col_map in
its_irq_domain_free():

	if (!its_dev->shared &&
	    bitmap_empty(its_dev->event_map.lpi_map,
			 its_dev->event_map.nr_lpis)) {
		its_lpi_free(its_dev->event_map.lpi_map,  /* <-- lpi_map is freed */
			     its_dev->event_map.lpi_base,
			     its_dev->event_map.nr_lpis);
		kfree(its_dev->event_map.col_map);        /* <-- col_map is freed */

		/* Unmap device/itt */
		its_send_mapd(its_dev, 0);
		its_free_device(its_dev);  /* <-- lpi_map and col_map are freed again */
	}
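One way out (an untested sketch, not necessarily the fix that was eventually
applied) is to clear the pointers at the point of the first free, so that the
kfree() calls added to its_free_device() become harmless no-ops:

	its_lpi_free(its_dev->event_map.lpi_map,
		     its_dev->event_map.lpi_base,
		     its_dev->event_map.nr_lpis);
	its_dev->event_map.lpi_map = NULL;	/* kfree(NULL) is a no-op */

	kfree(its_dev->event_map.col_map);
	its_dev->event_map.col_map = NULL;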

Thanks,

Heyi



> Thanks,
>
> M.



___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC] Add virtual SDEI support in qemu

2019-07-15 Thread Dave Martin
On Sat, Jul 13, 2019 at 05:53:57PM +0800, Guoheyi wrote:
> Hi folks,
> 
> Does it make sense to implement virtual SDEI in qemu? So that we can have the
> standard way for guest to handle NMI watchdog, RAS events and something else
> which involves SDEI in a physical ARM64 machine.
> 
> My basic idea is like below:
> 
> 1. Change a few lines of code in kvm to allow unhandled SMC invocations
> (like SDEI) to be sent to qemu, with exit reason of KVM_EXIT_HYPERCALL, so
> we don't need to add new API.

So long as KVM_EXIT_HYPERCALL reports sufficient information so that
userspace can identify the cause as an SMC and retrieve the SMC
immediate field, this seems feasible.

For its own SMCCC APIs, KVM exclusively uses HVC, so rerouting SMC to
userspace shouldn't conflict.

This bouncing of SMCs to userspace would need to be opt-in, otherwise
old userspace would see exits that it doesn't know what to do with.
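As a rough illustration of the userspace side (hypothetical code: it assumes
the KVM_EXIT_HYPERCALL extension sketched above, with the SMCCC function ID
from x0 in run->hypercall.nr and the return value written back through
run->hypercall.ret; the SDEI constants follow the SDEI 1.0 spec):

	#define SDEI_1_0_FN(n)		(0xC4000020ULL + (n))
	#define SDEI_VERSION		SDEI_1_0_FN(0x00)
	#define SDEI_NOT_SUPPORTED	(-1LL)

	static void handle_sdei_call(struct kvm_run *run)
	{
		switch (run->hypercall.nr) {
		case SDEI_VERSION:
			/* major 1, minor 0, vendor 0 */
			run->hypercall.ret = 1ULL << 48;
			break;
		default:
			run->hypercall.ret = SDEI_NOT_SUPPORTED;
			break;
		}
	}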

> 2. qemu handles supported SDEI calls just as the spec says for what a
> hypervisor should do for a guest OS.
> 
> 3. For interrupts bound to hypervisor, qemu should stop injecting the IRQ to
> guest through KVM, but jump to the registered event handler directly,
> including context saving and restoring. Some interrupts like virtual timer
> are handled by kvm directly, so we may refuse to bind such interrupts to
> SDEI events.

Something like that.

Interactions between SDEI and PSCI would need some thought: for example,
after PSCI_CPU_ON, the newly online cpu needs to have SDEs masked.

One option (suggested to me by James Morse) would be to allow userspace
to disable the in-kernel PSCI implementation and provide its own
PSCI to the guest via SMC -- in which case userspace that wants to
implement SDEI would have to implement PSCI as well.

There may be reasons why this wouldn't work ... I haven't thought about
it in depth.

Cheers
---Dave
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC PATCH v2 0/3] Support CPU hotplug for ARM64

2019-07-15 Thread James Morse
Hi Maran,

On 10/07/2019 17:05, Maran Wilson wrote:
> On 7/10/2019 2:15 AM, Marc Zyngier wrote:
>> On 09/07/2019 20:06, Maran Wilson wrote:
>>> On 7/5/2019 3:12 AM, James Morse wrote:
>>>> On 29/06/2019 03:42, Xiongfeng Wang wrote:
>>>>> This patchset mark all the GICC node in MADT as possible CPUs even
>>>>> though it is disabled. But only those enabled GICC node are marked as
>>>>> present CPUs. So that kernel will initialize some CPU related data
>>>>> structure in advance before the CPU is actually hot added into the
>>>>> system. This patchset also implement 'acpi_(un)map_cpu()' and
>>>>> 'arch_(un)register_cpu()' for ARM64. These functions are needed to
>>>>> enable CPU hotplug.
>>>>>
>>>>> To support CPU hotplug, we need to add all the possible GICC node in
>>>>> MADT including those CPUs that are not present but may be hot added
>>>>> later. Those CPUs are marked as disabled in GICC nodes.
>>>>
>>>> ... what do you need this for?
>>>>
>>>> (The term cpu-hotplug in the arm world almost never means hot-adding a
>>>> new package/die to the platform, we usually mean taking CPUs
>>>> online/offline for power management. e.g. cpuhp_offline_cpu_device())
>>>>
>>>> It looks like you're adding support for hot-adding a new package/die
>>>> to the platform ... but only for virtualisation.
>>>>
>>>> I don't see why this is needed for virtualisation. The in-kernel
>>>> irqchip needs to know these vcpu exist before you can enter the guest
>>>> for the first time. You can't create them late. At best you're saving
>>>> the host scheduling a vcpu that is offline. Is this really a problem?
>>>>
>>>> If we moved PSCI support to user-space, you could avoid creating host
>>>> vcpu threads until the guest brings the vcpu online, which would solve
>>>> that problem, and save the host resources for the thread too. (and its
>>>> acpi/dt agnostic)
>>>>
>>>> I don't see the difference here between booting the guest with
>>>> 'maxcpus=1', and bringing the vcpu online later. The only real
>>>> difference seems to be moving the can-be-online policy into the
>>>> hypervisor/VMM...

>>> Isn't that an important distinction from a cloud service provider's
>>> perspective?

Host cpu-time is. Describing this as guest vcpus is a bit weird.

I'd expect the statement to be something like "you're paying for 50% of one
Xeon v-whatever". It shouldn't make a difference if I run 8 vcpus or 2, the
amount of cpu-time would still be constrained by the cloud provider.


>>> As far as I understand it, you also need CPU hotplug capabilities to
>>> support things like Kata runtime under Kubernetes. i.e. when
>>> implementing your containers in the form of light weight VMs for the
>>> additional security ... and the orchestration layer cannot determine
>>> ahead of time how much CPU/memory resources are going to be needed to
>>> run the pod(s).

>> Why would it be any different? You can pre-allocate your vcpus, leave
>> them parked until some external agent decides to signal the container
>> that it it can use another bunch of CPUs. At that point, the container
>> must actively boot these vcpus (they aren't going to come up by magic).
>>
>> Given that you must have sized your virtual platform to deal with the
>> maximum set of resources you anticipate (think of the GIC
>> redistributors, for example), I really wonder what you gain here.

> Maybe I'm not following the alternative proposal completely, but wouldn't a 
> guest VM (who
> happens to be in control of its OS) be able to add/online vCPU resources 
> without approval
> from the VMM this way?

The in-kernel PSCI implementation will allow all CPUs to be online/offline.
If we moved that support to the VMM, it could apply some policy as to
whether a cpu-online call succeeds or fails.
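As an illustration, the VMM-side policy hook could be as simple as this
(hypothetical code: it assumes PSCI has been delegated to userspace, and all
names apart from the PSCI DENIED return code are invented):

	#define PSCI_RET_DENIED		(-3)	/* per the PSCI spec */

	static int64_t vmm_psci_cpu_on(struct vm *vm, uint64_t target_mpidr)
	{
		/* policy: don't let the guest online more vcpus than paid for */
		if (vm->nr_online_vcpus >= vm->allowed_vcpus)
			return PSCI_RET_DENIED;

		return vcpu_power_on(vm, target_mpidr);	/* 0 on success */
	}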


Thanks,

James
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC] Add virtual SDEI support in qemu

2019-07-15 Thread Mark Rutland
On Mon, Jul 15, 2019 at 02:41:00PM +0100, Dave Martin wrote:
> On Sat, Jul 13, 2019 at 05:53:57PM +0800, Guoheyi wrote:
> > Hi folks,
> > 
> > Does it make sense to implement virtual SDEI in qemu? So that we can have the
> > standard way for guest to handle NMI watchdog, RAS events and something else
> > which involves SDEI in a physical ARM64 machine.
> > 
> > My basic idea is like below:
> > 
> > 1. Change a few lines of code in kvm to allow unhandled SMC invocations
> > (like SDEI) to be sent to qemu, with exit reason of KVM_EXIT_HYPERCALL, so
> > we don't need to add new API.
> 
> So long as KVM_EXIT_HYPERCALL reports sufficient information so that
> userspace can identify the cause as an SMC and retrieve the SMC
> immediate field, this seems feasible.
> 
> For its own SMCCC APIs, KVM exclusively uses HVC, so rerouting SMC to
> userspace shouldn't conflict.

Be _very_ careful here! In systems without EL3 (and without NV), SMC
UNDEFs rather than trapping to EL2. Given that, we shouldn't build a
hypervisor ABI that depends on SMC.

I am strongly of the opinion that (for !NV) we should always use HVC
here and have KVM appropriately forward calls to userspace, rather than
trying to use HVC/SMC to distinguish handled-by-kernel and
handled-by-userspace events.

For NV, the first guest hypervisor would use SMC to talk to KVM, all
else being the same.

> This bouncing of SMCs to userspace would need to be opt-in, otherwise
> old userspace would see exits that it doesn't know what to do with.
> 
> > 2. qemu handles supported SDEI calls just as the spec says for what a
> > hypervisor should do for a guest OS.
> > 
> > 3. For interrupts bound to hypervisor, qemu should stop injecting the IRQ to
> > guest through KVM, but jump to the registered event handler directly,
> > including context saving and restoring. Some interrupts like virtual timer
> > are handled by kvm directly, so we may refuse to bind such interrupts to
> > SDEI events.
> 
> Something like that.
> 
> Interactions between SDEI and PSCI would need some thought: for example,
> after PSCI_CPU_ON, the newly online cpu needs to have SDEs masked.
> 
> One option (suggested to me by James Morse) would be to allow userspace
> to disable the in-kernel PSCI implementation and provide its own
> PSCI to the guest via SMC -- in which case userspace that wants to
> implement SDEI would have to implement PSCI as well.

I think this would be the best approach, since it puts userspace in
charge of everything.

However, this interacts poorly with FW-based mitigations that we
implement in hyp. I suspect we'd probably need a mechanism to delegate
that responsibility back to the kernel, and figure out if that has any
interaction with things that got punted to userspace...

Thanks,
Mark.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC] Add virtual SDEI support in qemu

2019-07-15 Thread James Morse
Hi guys,

On 15/07/2019 14:48, Mark Rutland wrote:
> On Mon, Jul 15, 2019 at 02:41:00PM +0100, Dave Martin wrote:
>> On Sat, Jul 13, 2019 at 05:53:57PM +0800, Guoheyi wrote:
>>> Does it make sense to implement virtual SDEI in qemu? So that we can have the
>>> standard way for guest to handle NMI watchdog, RAS events and something else
>>> which involves SDEI in a physical ARM64 machine.

I think so!


>>> My basic idea is like below:
>>>
>>> 1. Change a few lines of code in kvm to allow unhandled SMC invocations
>>> (like SDEI) to be sent to qemu, with exit reason of KVM_EXIT_HYPERCALL, so
>>> we don't need to add new API.
>>
>> So long as KVM_EXIT_HYPERCALL reports sufficient information so that
>> userspace can identify the cause as an SMC and retrieve the SMC
>> immediate field, this seems feasible.
>>
>> For its own SMCCC APIs, KVM exclusively uses HVC, so rerouting SMC to
>> userspace shouldn't conflict.
> 
> Be _very_ careful here! In systems without EL3 (and without NV), SMC
> UNDEFs rather than trapping to EL2. Given that, we shouldn't build a
> hypervisor ABI that depends on SMC.
> 
> I am strongly of the opinion that (for !NV) we should always use HVC
> here and have KVM appropriately forward calls to userspace, rather than
> trying to use HVC/SMC to distinguish handled-by-kernel and
> handled-by-userspace events.
> 
> For NV, the first guest hypervisor would use SMC to talk to KVM, all
> else being the same.
> 
>> This bouncing of SMCs to userspace would need to be opt-in, otherwise
>> old userspace would see exits that it doesn't know what to do with.
>>
>>> 2. qemu handles supported SDEI calls just as the spec says for what a
>>> hypervisor should do for a guest OS.

[pulled point 3 out to discuss separately]

>> Something like that.
>>
>> Interactions between SDEI and PSCI would need some thought: for example,
>> after PSCI_CPU_ON, the newly online cpu needs to have SDEs masked.
>>
>> One option (suggested to me by James Morse) would be to allow userspace
>> to disable the in-kernel PSCI implementation and provide its own
>> PSCI to the guest via SMC -- in which case userspace that wants to
>> implement SDEI would have to implement PSCI as well.
> 
> I think this would be the best approach, since it puts userspace in
> charge of everything.
> 
> However, this interacts poorly with FW-based mitigations that we
> implement in hyp. I suspect we'd probably need a mechanism to delegate
> that responsibility back to the kernel, and figure out if that has any
> interaction with things that got punted to userspace...

This has come up before:
https://lore.kernel.org/r/59c139d0.3040...@arm.com

I agree Qemu should opt-in to this, it needs to be a feature that is enabled.

I had an early version of something like this for testing SDEI before there
was firmware available. The review feedback from Christoffer was that it
should include HVC and SMC, their immediates, and shouldn't be tied to
SMC-CC ranges.

I think this should be a catch-all as Heyi describes to deliver 'unhandled
SMC/HVC' to user-space as hypercall exits. We should include the immediate
in the struct.
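Something like this, as an illustration only (an invented layout, not an
existing KVM ABI):

	struct kvm_arm_smccc_exit {
		__u8  conduit;		/* 0 == HVC, 1 == SMC */
		__u16 imm;		/* HVC/SMC immediate value */
		__u64 regs[6];		/* x0..x5 at the time of the call */
	};

x0 carries the SMCCC function ID so user-space can dispatch on it; the
immediate lets it tell 'HVC #0' conventions apart from anything else.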

We can allow Qemu to disable the in-kernel PSCI implementation, which would
let it be done in user-space via this catch-all mechanism. (PSCI in
user-space has come up on another thread recently). The in-kernel PSCI
needs to be default-on for backwards compatibility.

As Mark points out, the piece that's left is the 'arch workaround' stuff.
We always need to handle these in the kernel. I don't think these should be
routed-back, they should be un-obtainable by user-space.
Ideally there would be a way for user-space to discover the immediate/x0
values that the kernel will not deliver via the catch-all. This could be
done via some co-processor register... thing...



>>> 3. For interrupts bound to hypervisor, qemu should stop injecting the IRQ to
>>> guest through KVM, but jump to the registered event handler directly,
>>> including context saving and restoring. Some interrupts like virtual timer
>>> are handled by kvm directly, so we may refuse to bind such interrupts to
>>> SDEI events.

I don't think we'd ever need a physical interrupt to be delivered via Qemu
as an SDEI event. The use-cases for this stuff mean it can be done
'higher-up'. For example, your timer is probably used as a watchdog. On a
real system this may well be a device with an interrupt, but Qemu could
happily emulate a watchdog using some other linux API. (e.g. SIGALRM).



Thanks,

James
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC v2 12/14] arm64/lib: asid: Allow user to update the context under the lock

2019-07-15 Thread Julien Grall




On 03/07/2019 18:35, James Morse wrote:
> Hi Julien,

Hi James,

> On 20/06/2019 14:06, Julien Grall wrote:
>> Some users of the ASID allocator (e.g VMID) will require to update the
>> context when a new ASID is generated. This has to be protected by a lock
>> to prevent concurrent modification.
>>
>> Rather than introducing yet another lock, it is possible to re-use the
>> allocator lock for that purpose. This patch introduces a new callback
>> that will be call when updating the context.
>
> You're using this later in the series to mask out the generation from the
> atomic64 to leave just the vmid.

You are right.

> Where does this concurrent modification happen? The value is only written
> if we have a rollover, and while its active the only bits that could
> change are the generation. (subsequent vCPUs that take the slow path for
> the same VM will see the updated generation and skip the new_context call)
>
> If we did the generation filtering in update_vmid() after the call to
> asid_check_context(), what would go wrong?
> It happens more often than is necessary and would need a WRITE_ONCE(), but
> the vmid can't change until we become preemptible and another vCPU gets a
> chance to make its vmid active.

I think I was over cautious. Pre-filtering after asid_check_context() is
equally fine as long as update_vttbr() is called from preemptible context.
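A sketch of that pre-filtering variant (untested; the names follow this
series, with the update callback dropped from asid_check_context()):

	static void update_vmid(struct kvm_vmid *vmid)
	{
		int cpu = get_cpu();

		asid_check_context(&vmid_info, &vmid->asid, cpu);

		/*
		 * This may store more often than strictly necessary, but
		 * the vmid cannot change again until we are preemptible,
		 * so a plain WRITE_ONCE() is enough.
		 */
		WRITE_ONCE(vmid->vmid,
			   atomic64_read(&vmid->asid) &
			   ((1ULL << kvm_get_vmid_bits()) - 1));

		put_cpu();
	}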


Cheers,

--
Julien Grall
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC] Add virtual SDEI support in qemu

2019-07-15 Thread Mark Rutland
On Mon, Jul 15, 2019 at 03:26:39PM +0100, James Morse wrote:
> On 15/07/2019 14:48, Mark Rutland wrote:
> > On Mon, Jul 15, 2019 at 02:41:00PM +0100, Dave Martin wrote:
> >> One option (suggested to me by James Morse) would be to allow userspace
> >> to disable the in-kernel PSCI implementation and provide its own
> >> PSCI to the guest via SMC -- in which case userspace that wants to
> >> implement SDEI would have to implement PSCI as well.
> > 
> > I think this would be the best approach, since it puts userspace in
> > charge of everything.
> > 
> > However, this interacts poorly with FW-based mitigations that we
> > implement in hyp. I suspect we'd probably need a mechanism to delegate
> > that responsibility back to the kernel, and figure out if that has any
> > interaction with things that got punted to userspace...
> 
> This has come up before:
> https://lore.kernel.org/r/59c139d0.3040...@arm.com
> 
> I agree Qemu should opt-in to this, it needs to be a feature that is enabled.
> 
> I had an early version of something like this for testing SDEI before
> there was firmware available. The review feedback from Christoffer was
> that it should include HVC and SMC, their immediates, and shouldn't be
> tied to SMC-CC ranges.
> 
> I think this should be a catch-all as Heyi describes to deliver
> 'unhandled SMC/HVC' to user-space as hypercall exits. We should
> include the immediate in the struct.
> 
> We can allow Qemu to disable the in-kernel PSCI implementation, which
> would let it be done in user-space via this catch-all mechanism. (PSCI
> in user-space has come up on another thread recently). The in-kernel
> PSCI needs to be default-on for backwards compatibility.
> 
> As Mark points out, the piece that's left is the 'arch workaround'
> stuff. We always need to handle these in the kernel. I don't think
> these should be routed-back, they should be un-obtainable by
> user-space.

Sure; I meant that those should be handled in the kernel rather than
going to host userspace and back.

I was suggesting was that userspace would opt into taking ownership of
all HVC calls, then explicitly opt-in to the kernel handling specific
(sets of) calls.

There are probably issues with that, but I suspect defining "all
unhandled calls" will be problematic otherwise.
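For illustration, such an opt-in could take a shape like this (purely
hypothetical, not an existing KVM ABI):

	/*
	 * Userspace first claims all HVC/SMC calls, then hands specific
	 * ranges of SMCCC function IDs back to the kernel.
	 */
	struct kvm_arm_hypercall_range {
		__u64 first_fn;		/* first SMCCC function ID in range */
		__u64 last_fn;		/* last SMCCC function ID in range */
		__u32 action;		/* handle in kernel / forward to user */
	};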

Thanks,
Mark.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH kvm-unit-tests v2] arm: Add PL031 test

2019-07-15 Thread Andrew Jones
On Fri, Jul 12, 2019 at 11:19:38AM +0200, Alexander Graf wrote:
> This patch adds a unit test for the PL031 RTC that is used in the virt 
> machine.
> It just pokes basic functionality. I've mostly written it to familiarize 
> myself
> with the device, but I suppose having the test around does not hurt, as it 
> also
> exercises the GIC SPI interrupt path.
> 
> Signed-off-by: Alexander Graf 
> 
> ---
> 
> v1 -> v2:
> 
>   - Use FDT to find base, irq and existence
>   - Put isb after timer read
>   - Use dist_base for gicv3
> ---
>  arm/Makefile.common |   1 +
>  arm/pl031.c | 265 
>  lib/arm/asm/gic.h   |   1 +
>  3 files changed, 267 insertions(+)
>  create mode 100644 arm/pl031.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index f0c4b5d..b8988f2 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -11,6 +11,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/pl031.flat

Makefile.common is for both arm32 and arm64, but this code is only
compilable on arm64. As there's no reason for it to be arm64 only,
then ideally we'd modify the code to allow compiling and running
on both, but otherwise we need to move this to Makefile.arm64.
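For reference, the move would just be something like this (a sketch,
assuming the usual kvm-unit-tests Makefile layout):

	# arm/Makefile.arm64
	tests += $(TEST_DIR)/pl031.flat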

>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/pl031.c b/arm/pl031.c
> new file mode 100644
> index 000..d975937
> --- /dev/null
> +++ b/arm/pl031.c
> @@ -0,0 +1,265 @@
> +/*
> + * Verify PL031 functionality
> + *
> + * This test verifies whether the emulated PL031 behaves correctly.
> + *
> + * Copyright 2019 Amazon.com, Inc. or its affiliates.
> + * Author: Alexander Graf 
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct pl031_regs {
> + uint32_t dr;/* Data Register */
> + uint32_t mr;/* Match Register */
> + uint32_t lr;/* Load Register */
> + union {
> + uint8_t cr; /* Control Register */
> + uint32_t cr32;
> + };
> + union {
> + uint8_t imsc;   /* Interrupt Mask Set or Clear register */
> + uint32_t imsc32;
> + };
> + union {
> + uint8_t ris;/* Raw Interrupt Status */
> + uint32_t ris32;
> + };
> + union {
> + uint8_t mis;/* Masked Interrupt Status */
> + uint32_t mis32;
> + };
> + union {
> + uint8_t icr;/* Interrupt Clear Register */
> + uint32_t icr32;
> + };
> + uint32_t reserved[1008];
> + uint32_t periph_id[4];
> + uint32_t pcell_id[4];
> +};
> +
> +static u32 cntfrq;
> +static struct pl031_regs *pl031;
> +static int pl031_irq;
> +static void *gic_ispendr;
> +static void *gic_isenabler;
> +static bool irq_triggered;
> +
> +static uint64_t read_timer(void)
> +{
> + uint64_t r = read_sysreg(cntpct_el0);
> + isb();
> +
> + return r;
> +}
> +
> +static int check_id(void)
> +{
> + uint32_t id[] = { 0x31, 0x10, 0x14, 0x00, 0x0d, 0xf0, 0x05, 0xb1 };
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(id); i++)
> + if (id[i] != readl(&pl031->periph_id[i]))
> + return 1;
> +
> + return 0;
> +}
> +
> +static int check_ro(void)
> +{
> + uint32_t offs[] = { offsetof(struct pl031_regs, ris),
> + offsetof(struct pl031_regs, mis),
> + offsetof(struct pl031_regs, periph_id[0]),
> + offsetof(struct pl031_regs, periph_id[1]),
> + offsetof(struct pl031_regs, periph_id[2]),
> + offsetof(struct pl031_regs, periph_id[3]),
> + offsetof(struct pl031_regs, pcell_id[0]),
> + offsetof(struct pl031_regs, pcell_id[1]),
> + offsetof(struct pl031_regs, pcell_id[2]),
> + offsetof(struct pl031_regs, pcell_id[3]) };
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(offs); i++) {
> + uint32_t before32;
> + uint16_t before16;
> + uint8_t before8;
> + void *addr = (void*)pl031 + offs[i];
> + uint32_t poison = 0xdeadbeefULL;
> +
> + before8 = readb(addr);
> + before16 = readw(addr);
> + before32 = readl(addr);
> +
> + writeb(poison, addr);
> + writew(poison, addr);
> + writel(poison, addr);
> +
> + if (before8 != readb(addr))
> + return 1;
> + if (before16 != readw(addr))
> + return 1;
> + if (before32 != readl(addr))
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> +static int check_rtc_freq(void)
> +{
> 

Re: [RFC v2 14/14] kvm/arm: Align the VMID allocation with the arm64 ASID one

2019-07-15 Thread Julien Grall

On 03/07/2019 18:36, James Morse wrote:
> Hi Julien,

Hi James,

> On 20/06/2019 14:06, Julien Grall wrote:
>> At the moment, the VMID algorithm will send an SGI to all the CPUs to
>> force an exit and then broadcast a full TLB flush and I-Cache
>> invalidation.
>>
>> This patch re-use the new ASID allocator. The benefits are:
>>  - CPUs are not forced to exit at roll-over. Instead the VMID will be
>>    marked reserved and the context will be flushed at next exit. This
>>    will reduce the IPIs traffic.
>>  - Context invalidation is now per-CPU rather than broadcasted.
>
> + Catalin has a model of the asid-allocator.

That's a good point :).

>> With the new algo, the code is now adapted:
>>  - The function __kvm_flush_vm_context() has been renamed to
>>    __kvm_flush_cpu_vmid_context and now only flushing the current CPU context.
>>  - The call to update_vttbr() will be done with preemption disabled
>>    as the new algo requires to store information per-CPU.
>>  - The TLBs associated to EL1 will be flushed when booting a CPU to
>>    deal with stale information. This was previously done on the
>>    allocation of the first VMID of a new generation.
>>
>> The measurement was made on a Seattle based SoC (8 CPUs), with the
>> number of VMID limited to 4-bit. The test involves running concurrently 40
>> guests with 2 vCPUs. Each guest will then execute hackbench 5 times
>> before exiting.
>>
>> diff --git a/arch/arm64/include/asm/kvm_asid.h b/arch/arm64/include/asm/kvm_asid.h
>> new file mode 100644
>> index ..8b586e43c094
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/kvm_asid.h
>> @@ -0,0 +1,8 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef __ARM64_KVM_ASID_H__
>> +#define __ARM64_KVM_ASID_H__
>> +
>> +#include 
>> +
>> +#endif /* __ARM64_KVM_ASID_H__ */
>> +
>> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
>> index ff73f5462aca..06821f548c0f 100644
>> --- a/arch/arm64/include/asm/kvm_asm.h
>> +++ b/arch/arm64/include/asm/kvm_asm.h
>> @@ -62,7 +62,7 @@ extern char __kvm_hyp_init_end[];
>>
>>  extern char __kvm_hyp_vector[];
>>
>> -extern void __kvm_flush_vm_context(void);
>> +extern void __kvm_flush_cpu_vmid_context(void);
>>  extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
>
> As we've got a __kvm_tlb_flush_local_vmid(), would __kvm_tlb_flush_local_all()
> fit in better? (This mirrors local_flush_tlb_all() too)

I am happy with the renaming here.

>>  extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
>>  extern void __kvm_tlb_flush_local_vmid(struct kvm_vcpu *vcpu);
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 4bcd9c1291d5..7ef45b7da4eb 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -68,8 +68,8 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext);
>>  void __extended_idmap_trampoline(phys_addr_t boot_pgd, phys_addr_t idmap_start);
>>
>>  struct kvm_vmid {
>> -	/* The VMID generation used for the virt. memory system */
>> -	u64    vmid_gen;
>> +	/* The ASID used for the ASID allocator */
>> +	atomic64_t asid;
>
> Can we call this 'id' as happens in mm_context_t? (calling it asid is
> confusing)

I am fine with this suggestion.

>>  	u32    vmid;
>
> Can we filter out the generation bits in kvm_get_vttbr() in the same way
> the arch code does in cpu_do_switch_mm().
>
> I think this saves writing back a cached pre-filtered version every time,
> or needing special hooks to know when the value changed. (so we can remove
> this variable)
>
> [...]
>
>> +static void vmid_update_ctxt(void *ctxt)
>>  {
>> +	struct kvm_vmid *vmid = ctxt;
>> +	u64 asid = atomic64_read(&vmid->asid);
>>
>> +	vmid->vmid = asid & ((1ULL << kvm_get_vmid_bits()) - 1);
>
> I don't like having to poke this through the asid-allocator as a
> kvm-specific hack. Can we do it in kvm_get_vttbr()?

I will have a look.
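Something along these lines, perhaps (an untested sketch modelled on the
5.2-era kvm_get_vttbr(); the .asid field follows this series, and the CNP
bit handling is left out):

	static __always_inline u64 kvm_get_vttbr(struct kvm *kvm)
	{
		u64 baddr = kvm->arch.pgd_phys;
		u64 vmid  = atomic64_read(&kvm->arch.vmid.asid) &
			    GENMASK_ULL(kvm_get_vmid_bits() - 1, 0);

		return kvm_phys_to_vttbr(baddr) | (vmid << VTTBR_VMID_SHIFT);
	}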





>>  }
>>
>> @@ -487,48 +467,11 @@ static bool need_new_vmid_gen(struct kvm_vmid *vmid)
>
> (git made a mess of the diff here... squashed to just the new code:)
>
>>  static void update_vmid(struct kvm_vmid *vmid)
>>  {
>> +	int cpu = get_cpu();
>>
>> +	asid_check_context(&vmid_info, &vmid->asid, cpu, vmid);
>>
>> +	put_cpu();
>
> If we're calling update_vmid() in a pre-emptible context, aren't we
> already doomed?

Yes we are. This made me realize that Linux-RT replaced the
preempt_disable() in the caller by migrate_disable(). The latter will
prevent the task from moving to another CPU but allows preemption.

This patch will likely make things awfully broken for Linux-RT. I will
have a look to see if we can call this from a preempt notifier.

> Could we use smp_processor_id() instead.
>
>>  }
>>
>> @@ -1322,6 +1271,8 @@ static void cpu_init_hyp_mode(void *dummy)
>>
>>  	__cpu_init_hyp_mode(pgd_ptr, hyp_stack_ptr, vector_ptr);
>>  	__cpu_init_stage2();
>>
>> +	kvm_call_hyp(__kvm_flush_cpu_vmid_context);
>
> I think we only need to do this for VHE systems too. cpu_hyp_reinit()
> only does the call to cpu_init_hyp_mode() if !is_kernel_in_hyp_mode(

Re: [PATCH RFC 11/14] arm64: Move the ASID allocator code in a separate file

2019-07-15 Thread Guo Ren
Hello Catalin,

Thanks for sharing the CnP ASID experience. See my comment below.

On Mon, Jul 1, 2019 at 5:17 PM Catalin Marinas wrote:
> From the ASID reservation/allocation perspective, the mechanism is the
> same between multi-threaded with a shared TLB and multi-core. On arm64,
> a local_flush_tlb_all() on a thread invalidates the TLB for the other
> threads of the same core.
>
> The actual problem with multi-threaded CPUs is a lot more subtle.
> Digging some internal email from 1.5 years ago and pasting it below
> (where "current ASID algorithm" refers to the one prior to the fix and
> CnP - Common Not Private - means shared TLBs on a multi-threaded CPU):
>
>
> The current ASID roll-over algorithm allows for a small window where
> active_asids for a CPU (P1) is different from the actual ASID in TTBR0.
> This can lead to a roll-over on a different CPU (P2) allocating an ASID
> (for a different task) which is still hardware-active on P1.
>
> A TLBI on a CPU (or a peer CPU with CnP) does not guarantee that all the
> entries corresponding to a valid TTBRx are removed as they can still be
> speculatively loaded immediately after TLBI.
>
> While having two different page tables with the same ASID on different
> CPUs should be fine without CnP, it becomes problematic when CnP is
> enabled:
>
> P1  P2
> --  --
> TTBR0.BADDR = T1
> TTBR0.ASID = A1
> check_and_switch_context(T2,A2)
>   asid_maps[P1] = A2
>   goto fastpath
> check_and_switch_context(T3,A0)
>   new_context
> ASID roll-over allocates A1
>   since it is not active
>   TLBI ALL
> speculate TTBR0.ASID = A1 entry
>   TTBR0.BADDR = T3
>   TTBR0.ASID = A1
>   TTBR0.BADDR = T2
>   TTBR0.ASID = A2
>
> After this, the common TLB on P1 and P2 (CnP) contains entries
> corresponding to the old T1 and A1. Task T3 using the same ASID A1 can
> hit such entries. (T1,A1) will eventually be removed from the TLB on the
> next context switch on P1 since tlb_flush_pending was set but this is
> not guaranteed to happen.
>
>
> The fix on arm64 (as part of 5ffdfaedfa0a - "arm64: mm: Support Common
> Not Private translations") was to set the reserved TTBR0 in
> check_and_switch_context(), preventing speculative loads into the TLB
> being tagged with the wrong ASID. So this is specific to the ARM CPUs
> behaviour w.r.t. speculative TLB loads, it may not be the case (yet) for
> your architecture.

The most important thing is that the TLBI ALL occurs between
"asid_maps[P1] = A2" and "TTBR0.BADDR = T2": speculative execution after
the TLBI that accesses user-space code/data can then re-fill the TLB, via
the page-table walker, with entries tagged by an ASID that is still valid.

A similar problem should exist if the C-SKY ISA supports SMT. Although
C-SKY prohibits the kernel from speculating on user-space code directly,
ld/st can access user-space memory in C-SKY kernel mode, so a similar
problem occurs when the CPU speculatively executes the copy_from/to_user
code in that window.

The RISC-V ISA has a SUM setting bit that prevents the kernel from
speculatively accessing user space, so this problem is avoided by design.

I saw that arm64 prevents speculation by temporarily setting TTBR0_EL1 to
a zero page table. Is that used to prevent speculative execution of user
space code, or just to prevent ld/st in copy_*_user?
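For reference, the shape of the arm64 fix looks roughly like this
(paraphrased from commit 5ffdfaedfa0a, "arm64: mm: Support Common Not
Private translations"; see arch/arm64/mm/context.c for the authoritative
version, several paths are elided here):

	void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
	{
		u64 asid;

		if (system_supports_cnp())
			cpu_set_reserved_ttbr0();  /* park TTBR0 on the
						    * reserved page table */

		asid = atomic64_read(&mm->context.id);
		/* ... fast/slow path ASID check and rollover elided ... */

		cpu_switch_mm(mm->pgd, mm);  /* install the real TTBR0 last */
	}

While TTBR0 points at the reserved (empty) table, a speculative walk,
whether triggered by user-space instruction fetch or by kernel ld/st to
user addresses, finds nothing to fill into the TLB under a stale ASID,
which should cover both of the cases above.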

-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm