Re: [PATCH v6 2/4] arm64: KVM: add accessors to track guest/host only counters

2018-12-10 Thread Christoffer Dall
On Mon, Dec 10, 2018 at 11:46:56PM +, Andrew Murray wrote:
> On Mon, Dec 10, 2018 at 11:26:34AM +0100, Christoffer Dall wrote:
> > On Mon, Dec 10, 2018 at 09:45:57AM +, Andrew Murray wrote:
> > > In order to efficiently enable/disable guest/host only perf counters
> > > at guest entry/exit we add bitfields to kvm_cpu_context for guest and
> > > host events as well as accessors for updating them.
> > > 
> > > Signed-off-by: Andrew Murray 
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h | 24 
> > >  1 file changed, 24 insertions(+)
> > > 
> > > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > > b/arch/arm64/include/asm/kvm_host.h
> > > index 1550192..800c87b 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -203,6 +203,8 @@ struct kvm_cpu_context {
> > >   };
> > >  
> > >   struct kvm_vcpu *__hyp_running_vcpu;
> > > + u32 events_host;
> > > + u32 events_guest;
> > 
> > This is confusing to me.
> > 
> > These values appear only to be used for the host instance, which makes
> > me wonder why we add them to kvm_cpu_context, which is also used for the
> > guest state?  Should we not instead consider moving them to their own
> 
> Indeed they are only used for the host instance (i.e. not arch.ctxt). I
> hadn't realised the structure was used in other contexts. So it isn't
> optimal to have additional fields that aren't always used.
> 
> 
> > data structure and add a per-cpu data structure or something more fancy
> > like having a new data structure, kvm_percpu_host_data, which contains
> > the kvm_cpu_context and the events flags?
> 
> Using a percpu for the guest/host events was an approach I took prior to
> sharing on the list - an additional hypervisor mapping is needed (such
> that the percpu can be accessed from the hyp switch code) and I assume
> there to be a very marginal additional amount of work resulting from it
> switching between host/guest.
> 
> I also considered using the unused ctxt.sys_regs[PMCNTENSET_EL0] register
> though this feels like a hack and would probably involve a system register
> read in the switch code to read the current PMU state.

Yeah, that doesn't sound overly appealing.

> 
> I attempted the fancy approach (see below) - the struct and variable
> naming isn't ideal (virt/arm/arm.c defines kvm_cpu_context_t and of course
> is shared with arm32). Also I updated asm-offsets.c to reflect the now
> nested struct (for use with get_vcpu_ptr) - it's not strictly necessary
> but adds robustness.
> 
> What are your thoughts on the best approach?

The naming and typedef is the biggest issue with the patch below, but
that can be easily fixed.

The fact that you keep the context pointer on the vcpu structure makes
this much less painful than it could have been, so I think this is
acceptable.

I can also live with mapping the additional data structure if necessary.

> 
> > 
> > I don't know much about perf, but doesn't this design also imply that
> > you can only set these modifiers at a per-cpu level, and not attach
> > the modifiers to a task/vcpu or vm ?  Is that by design?
> 
> You can set these modifiers on a task and the perf core will take care of
> disabling the perf_event when the task is scheduled in/out.

I now remember how this works, thanks for the nudge in the right
direction.

> 
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 800c87b..1f4a78a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -203,11 +203,19 @@ struct kvm_cpu_context {
> };
> 
> struct kvm_vcpu *__hyp_running_vcpu;
> +};
> +
> +struct kvm_pmu_events {
> u32 events_host;
> u32 events_guest;
>  };
> 
> -typedef struct kvm_cpu_context kvm_cpu_context_t;
> +struct kvm_host_data {
> +   struct kvm_cpu_context __kvm_cpu_state;
> +   struct kvm_pmu_events pmu_events;
> +};
> +
> +typedef struct kvm_host_data kvm_cpu_context_t;
> 

This looks really strange.  I think we can just keep the old typedef, and
if you like you can introduce a kvm_host_data_t...
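
For concreteness, a minimal sketch of the layout being suggested here (keeping
the existing kvm_cpu_context typedef untouched and only adding a typedef for
the new wrapper); this is an assumption about the eventual shape, not the
submitted patch:

    /* Sketch only: the per-CPU PMU event masks live next to, not inside,
     * the host context. */
    struct kvm_pmu_events {
            u32 events_host;
            u32 events_guest;
    };

    struct kvm_host_data {
            struct kvm_cpu_context host_ctxt;
            struct kvm_pmu_events pmu_events;
    };

    typedef struct kvm_host_data kvm_host_data_t;

    /* existing typedef stays as-is */
    typedef struct kvm_cpu_context kvm_cpu_context_t;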

>  struct kvm_vcpu_arch {
> struct kvm_cpu_context ctxt;
> @@ -243,7 +251,7 @@ struct kvm_vcpu_arch {
> struct kvm_guest_debug_arch external_debug_state;
> 
> /* Pointer to host CPU context */
> -   kvm_cpu_context_t *host_cpu_context;
> +   struct kvm_cpu_context *host_cpu_context;
> 
> struct thread_info *host_thread_info;   /* hyp VA */
> struct user_fpsimd_state *host_fpsimd_state;/* hyp VA */
> @@ -482,16 +490,16 @@ static inline void kvm_set_pmu_events(u32 set, int 
> flags)
> kvm_cpu_context_t *ctx = this_cpu_ptr(&kvm_host_cpu_state);
> 
> if (flags & KVM_PMU_EVENTS_HOST)
> -   ctx->events_host |= set;
> +   ctx->pmu_events.events_host |= set;
> if (flags & KVM_PMU_EVENTS_GUEST)
> -   ctx->events_guest |= set;
> + 

Re: [PATCH v6 0/4] arm64: Support perf event modifiers :G and :H

2018-12-10 Thread Andrew Murray
On Mon, Dec 10, 2018 at 06:19:33PM +, Will Deacon wrote:
> Hi Andrew,
> 
> On Mon, Dec 10, 2018 at 09:45:55AM +, Andrew Murray wrote:
> > This patchset provides support for perf event modifiers :G and :H which
> > allows for filtering of PMU events between host and guests when used
> > with KVM.
> > 
> > As the underlying hardware cannot distinguish between guest and host
> > context, the performance counters must be stopped and started upon
> > entry/exit to the guest. This is performed at EL2 in a way that
> > minimizes overhead and improves accuracy of recording events that only
> > occur in the requested context.
> > 
> > This has been tested with VHE and non-VHE kernels with a KVM guest.
> > 
> > Changes from v5:
> > 
> >  - Tweak logic in use of kvm_set_pmu_events
> 
> This looks good to me, but I'll need Acks from the KVM folks on patches 2
> and 4 before I can take it (and Christoffer has some questions on patch 2
> as well).
> 
> Also, please be aware that I'm planning to close the arm64 tree later this
> week so we don't have to deal with late issues going into the holidays. Last
> Christmas was ruined by specdown and meltre, so I think many of us are due a
> proper break this year :)

Thanks for the heads up; that sounds completely reasonable to me.

Thanks,

Andrew Murray

> 
> Cheers,
> 
> Will


Re: [PATCH v6 2/4] arm64: KVM: add accessors to track guest/host only counters

2018-12-10 Thread Andrew Murray
On Mon, Dec 10, 2018 at 11:26:34AM +0100, Christoffer Dall wrote:
> On Mon, Dec 10, 2018 at 09:45:57AM +, Andrew Murray wrote:
> > In order to efficiently enable/disable guest/host only perf counters
> > at guest entry/exit we add bitfields to kvm_cpu_context for guest and
> > host events as well as accessors for updating them.
> > 
> > Signed-off-by: Andrew Murray 
> > ---
> >  arch/arm64/include/asm/kvm_host.h | 24 
> >  1 file changed, 24 insertions(+)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > b/arch/arm64/include/asm/kvm_host.h
> > index 1550192..800c87b 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -203,6 +203,8 @@ struct kvm_cpu_context {
> > };
> >  
> > struct kvm_vcpu *__hyp_running_vcpu;
> > +   u32 events_host;
> > +   u32 events_guest;
> 
> This is confusing to me.
> 
> These values appear only to be used for the host instance, which makes
> me wonder why we add them to kvm_cpu_context, which is also used for the
> guest state?  Should we not instead consider moving them to their own

Indeed they are only used for the host instance (i.e. not arch.ctxt). I
hadn't realised the structure was used in other contexts. So it isn't
optimal to have additional fields that aren't always used.


> data structure and add a per-cpu data structure or something more fancy
> like having a new data structure, kvm_percpu_host_data, which contains
> the kvm_cpu_context and the events flags?

Using a percpu for the guest/host events was an approach I took prior to
sharing on the list - an additional hypervisor mapping is needed (such
that the percpu can be accessed from the hyp switch code) and I assume
there to be a very marginal additional amount of work resulting from it
switching between host/guest.

I also considered using the unused ctxt.sys_regs[PMCNTENSET_EL0] register
though this feels like a hack and would probably involve a system register
read in the switch code to read the current PMU state.

I attempted the fancy approach (see below) - the struct and variable
naming isn't ideal (virt/arm/arm.c defines kvm_cpu_context_t and of course
is shared with arm32). Also I updated asm-offsets.c to reflect the now
nested struct (for use with get_vcpu_ptr) - it's not strictly necessary
but adds robustness.

What are your thoughts on the best approach?

> 
> I don't know much about perf, but doesn't this design also imply that
> you can only set these modifiers at a per-cpu level, and not attach
> the modifiers to a task/vcpu or vm ?  Is that by design?

You can set these modifiers on a task and the perf core will take care of
disabling the perf_event when the task is scheduled in/out.
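
For illustration, the :G/:H modifiers boil down to the exclude_host/exclude_guest
bits in struct perf_event_attr, so a host-only ("cycles:H") event opened on a
task looks roughly like the minimal sketch below (the helper name is
hypothetical):

    /* Open a host-only cycle counter on the current task ("cycles:H"). */
    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int open_host_only_cycles(void)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;
            attr.exclude_guest = 1;  /* ":H": don't count while a guest runs */

            /* pid = 0, cpu = -1: follow this task on whichever CPU it runs */
            return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }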


Here it is..


diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 800c87b..1f4a78a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -203,11 +203,19 @@ struct kvm_cpu_context {
};

struct kvm_vcpu *__hyp_running_vcpu;
+};
+
+struct kvm_pmu_events {
u32 events_host;
u32 events_guest;
 };

-typedef struct kvm_cpu_context kvm_cpu_context_t;
+struct kvm_host_data {
+   struct kvm_cpu_context __kvm_cpu_state;
+   struct kvm_pmu_events pmu_events;
+};
+
+typedef struct kvm_host_data kvm_cpu_context_t;

 struct kvm_vcpu_arch {
struct kvm_cpu_context ctxt;
@@ -243,7 +251,7 @@ struct kvm_vcpu_arch {
struct kvm_guest_debug_arch external_debug_state;

/* Pointer to host CPU context */
-   kvm_cpu_context_t *host_cpu_context;
+   struct kvm_cpu_context *host_cpu_context;

struct thread_info *host_thread_info;   /* hyp VA */
struct user_fpsimd_state *host_fpsimd_state;/* hyp VA */
@@ -482,16 +490,16 @@ static inline void kvm_set_pmu_events(u32 set, int flags)
kvm_cpu_context_t *ctx = this_cpu_ptr(&kvm_host_cpu_state);

if (flags & KVM_PMU_EVENTS_HOST)
-   ctx->events_host |= set;
+   ctx->pmu_events.events_host |= set;
if (flags & KVM_PMU_EVENTS_GUEST)
-   ctx->events_guest |= set;
+   ctx->pmu_events.events_guest |= set;
 }
 static inline void kvm_clr_pmu_events(u32 clr)
 {
kvm_cpu_context_t *ctx = this_cpu_ptr(&kvm_host_cpu_state);

-   ctx->events_host &= ~clr;
-   ctx->events_guest &= ~clr;
+   ctx->pmu_events.events_host &= ~clr;
+   ctx->pmu_events.events_guest &= ~clr;
 }
 #else
 static inline void kvm_set_pmu_events(u32 set, int flags) {}
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 323aeb5..da34022 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -142,7 +142,8 @@ int main(void)
   DEFINE(CPU_FP_REGS,  offsetof(struct kvm_regs, fp_regs));
   DEFINE(VCPU_FPEXC32_EL2, offsetof(struct kvm_vcpu, 
arch.ctxt.sys_regs[FPEXC32_EL2]));
   DEFINE(VCPU_HOST_CONTEXT,offsetof(struct kvm_

Re: [PATCH v5 5/7] iommu: Add virtio-iommu driver

2018-12-10 Thread Michael S. Tsirkin
On Mon, Dec 10, 2018 at 03:06:47PM +, Jean-Philippe Brucker wrote:
> On 27/11/2018 18:53, Michael S. Tsirkin wrote:
> > On Tue, Nov 27, 2018 at 06:10:46PM +, Jean-Philippe Brucker wrote:
> >> On 27/11/2018 18:04, Michael S. Tsirkin wrote:
> >>> On Tue, Nov 27, 2018 at 05:50:50PM +, Jean-Philippe Brucker wrote:
>  On 23/11/2018 22:02, Michael S. Tsirkin wrote:
> >> +/*
> >> + * __viommu_sync_req - Complete all in-flight requests
> >> + *
> >> + * Wait for all added requests to complete. When this function 
> >> returns, all
> >> + * requests that were in-flight at the time of the call have 
> >> completed.
> >> + */
> >> +static int __viommu_sync_req(struct viommu_dev *viommu)
> >> +{
> >> +  int ret = 0;
> >> +  unsigned int len;
> >> +  size_t write_len;
> >> +  struct viommu_request *req;
> >> +  struct virtqueue *vq = viommu->vqs[VIOMMU_REQUEST_VQ];
> >> +
> >> +  assert_spin_locked(&viommu->request_lock);
> >> +
> >> +  virtqueue_kick(vq);
> >> +
> >> +  while (!list_empty(&viommu->requests)) {
> >> +  len = 0;
> >> +  req = virtqueue_get_buf(vq, &len);
> >> +  if (!req)
> >> +  continue;
> >> +
> >> +  if (!len)
> >> +  viommu_set_req_status(req->buf, req->len,
> >> +VIRTIO_IOMMU_S_IOERR);
> >> +
> >> +  write_len = req->len - req->write_offset;
> >> +  if (req->writeback && len == write_len)
> >> +  memcpy(req->writeback, req->buf + 
> >> req->write_offset,
> >> + write_len);
> >> +
> >> +  list_del(&req->list);
> >> +  kfree(req);
> >> +  }
> >
> > I didn't notice this in the past but it seems this will spin
> > with interrupts disabled until host handles the request.
> > Please do not do this - host execution can be another
> > task that needs the same host CPU. This will then disable
> > interrupts for a very very long time.
> 
>  In the guest yes, but that doesn't prevent the host from running another
>  task right?
> >>>
> >>> Doesn't prevent it but it will delay it significantly
> >>> until scheduler decides to kick the VCPU task out.
> >>>
>  My tests run fine when QEMU is bound to a single CPU, even
>  though vcpu and viommu run in different threads
> 
> > What to do then? Queue in software and wake up task.
> 
>  Unfortunately I can't do anything here, because IOMMU drivers can't
>  sleep in the iommu_map() or iommu_unmap() path.
> 
>  The problem is the same
>  for all IOMMU drivers. That's because the DMA API allows drivers to call
>  some functions with interrupts disabled. For example
>  Documentation/DMA-API-HOWTO.txt allows dma_alloc_coherent() and
>  dma_unmap_single() to be called in interrupt context.
> >>>
> >>> In fact I don't really understand how it's supposed to
> >>> work at all: you only sync when ring is full.
> >>> So host may not have seen your map request if ring
> >>> is not full.
> >>> Why is it safe to use the address with a device then?
> >>
> >> viommu_map() calls viommu_send_req_sync(), which does the sync
> >> immediately after adding the MAP request.
> >>
> >> Thanks,
> >> Jean
> > 
> > I see. So it happens on every request. Maybe you should clear
> > event index then. This way if exits are disabled you know that
> > host is processing the ring. Event index is good for when
> > you don't care when it will be processed, you just want
> > to reduce number of exits as much as possible.
> > 
> 
> I think that's already the case: since we don't attach a callback to the
> request queue, VRING_AVAIL_F_NO_INTERRUPT is set in avail_flags_shadow,
> which causes the used event index to stay clear.
> 
> Thanks,
> Jean

VRING_AVAIL_F_NO_INTERRUPT has no effect when the event index
feature has been negotiated. In any case, it also does not
affect kick notifications from guest - it affects
device interrupts.


-- 
MST


Re: [PATCH v6 04/13] arm64/kvm: hide ptrauth from guests

2018-12-10 Thread Kristina Martsenko
On 10/12/2018 20:22, Richard Henderson wrote:
> On 12/10/18 2:12 PM, Kristina Martsenko wrote:
>> The plan was to disable trapping, yes. However, after that thread there
>> was a retrospective change applied to the architecture, such that the
>> XPACLRI (and XPACD/XPACI) instructions are no longer trapped by
>> HCR_EL2.API. (The public documentation on this has not been updated
>> yet.) This means that no HINT-space instructions should trap anymore.
> 
> Ah, thanks for the update.  I'll update my QEMU patch set.
> 
>>> It seems like the header comment here, and
>> Sorry, which header comment?
> 
> Sorry, the patch commit message.

Ah ok. Still seems correct.

Kristina


Re: [PATCH v6 04/13] arm64/kvm: hide ptrauth from guests

2018-12-10 Thread Richard Henderson
On 12/10/18 2:12 PM, Kristina Martsenko wrote:
> The plan was to disable trapping, yes. However, after that thread there
> was a retrospective change applied to the architecture, such that the
> XPACLRI (and XPACD/XPACI) instructions are no longer trapped by
> HCR_EL2.API. (The public documentation on this has not been updated
> yet.) This means that no HINT-space instructions should trap anymore.

Ah, thanks for the update.  I'll update my QEMU patch set.

>> It seems like the header comment here, and
> Sorry, which header comment?

Sorry, the patch commit message.


r~


Re: [PATCH v6 04/13] arm64/kvm: hide ptrauth from guests

2018-12-10 Thread Kristina Martsenko
On 09/12/2018 14:53, Richard Henderson wrote:
> On 12/7/18 12:39 PM, Kristina Martsenko wrote:
>> From: Mark Rutland 
>>
>> In subsequent patches we're going to expose ptrauth to the host kernel
>> and userspace, but things are a bit trickier for guest kernels. For the
>> time being, let's hide ptrauth from KVM guests.
>>
>> Regardless of how well-behaved the guest kernel is, guest userspace
>> could attempt to use ptrauth instructions, triggering a trap to EL2,
>> resulting in noise from kvm_handle_unknown_ec(). So let's write up a
>> handler for the PAC trap, which silently injects an UNDEF into the
>> guest, as if the feature were really missing.
> 
> Reviewing the long thread that accompanied v5, I thought we were *not* going 
> to
> trap PAuth instructions from the guest.
> 
> In particular, the OS distribution may legitimately be built to include
> hint-space nops.  This includes XPACLRI, which is used by the C++ exception
> unwinder and not controlled by SCTLR_EL1.EnI{A,B}.

The plan was to disable trapping, yes. However, after that thread there
was a retrospective change applied to the architecture, such that the
XPACLRI (and XPACD/XPACI) instructions are no longer trapped by
HCR_EL2.API. (The public documentation on this has not been updated
yet.) This means that no HINT-space instructions should trap anymore.
(The guest is expected to not set SCTLR_EL1.EnI{A,B} since
ID_AA64ISAR1_EL1.{APA,API} read as 0.)

> It seems like the header comment here, and
Sorry, which header comment?

>> +/*
>> + * Guest usage of a ptrauth instruction (which the guest EL1 did not turn 
>> into
>> + * a NOP).
>> + */
>> +static int kvm_handle_ptrauth(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> +
> 
> here, need updating.

Changed it to "a trapped ptrauth instruction".
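
For reference, a minimal sketch of a handler with the behaviour described in
the commit message (silently injecting an UNDEF so the feature appears to be
missing); this is an illustration based on that description, not the actual
patch hunk:

    /* A trapped ptrauth instruction: pretend the feature isn't implemented. */
    static int kvm_handle_ptrauth(struct kvm_vcpu *vcpu, struct kvm_run *run)
    {
            kvm_inject_undefined(vcpu);
            return 1;               /* resume the guest */
    }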

Kristina


Re: [PATCH v6 02/13] arm64: add pointer authentication register bits

2018-12-10 Thread Kristina Martsenko
On 09/12/2018 14:24, Richard Henderson wrote:
> On 12/7/18 12:39 PM, Kristina Martsenko wrote:
>>  #define SCTLR_ELx_DSSBS (1UL << 44)
>> +#define SCTLR_ELx_ENIA  (1 << 31)
> 
> 1U or 1UL lest you produce signed -0x80000000.

Thanks, this was setting all SCTLR bits above 31 as well... Now fixed.
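
A small stand-alone demonstration of the pitfall (assuming a typical LP64
toolchain): (1 << 31) is evaluated as a signed int, so widening it to a 64-bit
value sign-extends and sets bits 63:31 as well, which is exactly why the define
was also setting the SCTLR bits above 31; 1UL << 31 sets only bit 31.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* (1 << 31) overflows signed int; on common ABIs it becomes
             * INT_MIN and sign-extends when converted to 64 bits. */
            uint64_t bad  = (1 << 31);
            uint64_t good = 1UL << 31;

            printf("bad  = %#llx\n", (unsigned long long)bad);  /* 0xffffffff80000000 */
            printf("good = %#llx\n", (unsigned long long)good); /* 0x0000000080000000 */
            return 0;
    }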

> Otherwise,
> Reviewed-by: Richard Henderson 

Thanks for all the review!

Kristina


Re: [PATCH v7 22/25] ACPI / APEI: Kick the memory_failure() queue for synchronous errors

2018-12-10 Thread James Morse
Hi Xie XiuQi,

On 05/12/2018 02:02, Xie XiuQi wrote:
> On 2018/12/4 2:06, James Morse wrote:
>> memory_failure() offlines or repairs pages of memory that have been
>> discovered to be corrupt. These may be detected by an external
>> component, (e.g. the memory controller), and notified via an IRQ.
>> In this case the work is queued as not all of memory_failure()s work
>> can happen in IRQ context.
>>
>> If the error was detected as a result of user-space accessing a
>> corrupt memory location the CPU may take an abort instead. On arm64
>> this is a 'synchronous external abort', and on a firmware first
>> system it is replayed using NOTIFY_SEA.
>>
>> This notification has NMI like properties, (it can interrupt
>> IRQ-masked code), so the memory_failure() work is queued. If we
>> return to user-space before the queued memory_failure() work is
>> processed, we will take the fault again. This loop may cause platform
>> firmware to exceed some threshold and reboot when Linux could have
>> recovered from this error.
>>
>> If a ghes notification type indicates that it may be triggered again
>> when we return to user-space, use the task-work and notify-resume
>> hooks to kick the relevant memory_failure() queue before returning


>> @@ -407,7 +447,22 @@ static void ghes_handle_memory_failure(struct 
>> acpi_hest_generic_data *gdata, int
>>  
>>  if (flags != -1)
>>  memory_failure_queue(pfn, flags);
> 
> We may need to take MF_ACTION_REQUIRED flags for memory_failure() in SEA 
> condition.

Hmmm, I'd forgotten about the extra flags. They're only used by x86's
do_machine_check(), which knows more about what is going on. I agree we do know
it should be a 'MF_ACTION_REQUIRED' for Synchronous-external-abort, but I'd
really like all the notifications to behave in the same way as we can't change
which notification firmware uses.
(This ghes_is_synchronous() affects when memory_failure() runs, not what it 
does.)

What happens if we miss MF_ACTION_REQUIRED? Surely the page still gets unmapped
as it's PG_Poisoned, an AO signal may be pending, but if user-space touches the
page it will get an AR signal. Is this just about removing an extra AO signal to
user-space?

If we do need this, I'd like to pick it up from the CPER records, as x86's
NOTIFY_NMI looks like it covers both AO/AR cases. (as does NOTIFY_SDEI). The
Master/Target abort or Invalid-address types in the memory-error-section CPER
records look like the best bet.


> And there is no return value check for memory_failure() in 
> memory_failure_work_func(),
> I'm not sure whether we need to check the return value.

What would we do if it fails? The reasons look fairly broad, -EBUSY can mean
"(page) still referenced by [..] users", 'thp split failed' or 'page already
poisoned'. I don't think the behaviour or return-codes are consistent enough to 
use.


> If the recovery fails here, we need to take other actions, such as forcing a
> SIGBUS signal to be sent.

We don't do this today. If it fixes some mis-behaviour, and we can key it from
something in the CPER records then I'm all ears!


Thanks,

James


Re: [PATCH v6 0/4] arm64: Support perf event modifiers :G and :H

2018-12-10 Thread Will Deacon
Hi Andrew,

On Mon, Dec 10, 2018 at 09:45:55AM +, Andrew Murray wrote:
> This patchset provides support for perf event modifiers :G and :H which
> allows for filtering of PMU events between host and guests when used
> with KVM.
> 
> As the underlying hardware cannot distinguish between guest and host
> context, the performance counters must be stopped and started upon
> entry/exit to the guest. This is performed at EL2 in a way that
> minimizes overhead and improves accuracy of recording events that only
> occur in the requested context.
> 
> This has been tested with VHE and non-VHE kernels with a KVM guest.
> 
> Changes from v5:
> 
>  - Tweak logic in use of kvm_set_pmu_events

This looks good to me, but I'll need Acks from the KVM folks on patches 2
and 4 before I can take it (and Christoffer has some questions on patch 2
as well).

Also, please be aware that I'm planning to close the arm64 tree later this
week so we don't have to deal with late issues going into the holidays. Last
Christmas was ruined by specdown and meltre, so I think many of us are due a
proper break this year :)

Cheers,

Will


Re: [PATCH v6 08/13] arm64: expose user PAC bit positions via ptrace

2018-12-10 Thread Catalin Marinas
On Mon, Dec 10, 2018 at 02:29:45PM +, Will Deacon wrote:
> On Mon, Dec 10, 2018 at 08:22:06AM -0600, Richard Henderson wrote:
> > On 12/10/18 6:03 AM, Catalin Marinas wrote:
> > >> However, it won't be too long before someone implements support for
> > >> ARMv8.2-LVA, at which point, without changes to mandatory pointer 
> > >> tagging, we
> > >> will only have 3 authentication bits: [54:52].  This seems useless and 
> > >> easily
> > >> brute-force-able.
[...]
> > Perhaps the opt-in should be at exec time, with ELF flags (or equivalent) on
> > the application.  Because, as you say, changing the shape of the PAC in the
> > middle of execution is in general not possible.
> 
> I think we'd still have a potential performance problem with that approach,
> since we'd end up having to context-switch TCR.T0SZ, which is permitted to
> be cached in a TLB and would therefore force us to introduce TLB
> invalidation when context-switching between tasks using 52-bit VAs and tasks
> using 48-bit VAs.
> 
> There's a chance we could get the architecture tightened here, but it's
> not something we've pushed for so far and it depends on what's already been
> built.

Just a quick summary of our internal discussion:

ARMv8.3 also comes with a new bit, TCR_EL1.TBIDx, which practically
disables TBI for code pointers. This bit allows us to use 11 bits for
code PtrAuth with 52-bit VA.

Now, the problem is that TBI for code pointers is user ABI, so we can't
simply disable it. We may be able to do this with memory tagging since
that's an opt-in feature (prctl) where the user is aware that the top
byte of a pointer is no longer ignored. However, that's probably for a
future discussion.

-- 
Catalin


Re: [PATCH 00/12] arm64: Paravirtualized time support

2018-12-10 Thread Steven Price
On 10/12/2018 11:40, Mark Rutland wrote:
> On Wed, Nov 28, 2018 at 02:45:15PM +, Steven Price wrote:
>> This series add support for paravirtualized time for Arm64 guests and
>> KVM hosts following the specification in Arm's document DEN 0057A:
>>
>> https://developer.arm.com/docs/den0057/a
>>
>> It implements support for Live Physical Time (LPT) which provides the
>> guest with a method to derive a stable counter of time during which the
>> guest is executing even when the guest is being migrated between hosts
>> with different physical counter frequencies.
>>
>> It also implements support for stolen time, allowing the guest to
>> identify time when it is forcibly not executing.
> 
> I know that stolen time reporting is important, and I think that we
> definitely want to pick up that part of the spec (once it is published
> in some non-draft form).
> 
> However, I am very concerned with the pv-freq part of LPT, and I'd like
> to avoid that if at all possible. I say that because:
> 
> * By design, it breaks architectural guarantees from the PoV of SW in
>   the guest.
> 
>   A VM may host multiple SW agents serially (e.g. when booting, or
>   across kexec), or concurrently (e.g. Linux w/ EFI runtime services),
>   and the host has no way to tell whether all software in the guest will
>   function correctly. Due to this, it's not possible to have a guest
>   opt-in to the architecturally-broken timekeeping.
> 
>   Existing guests will not work correctly once pv-freq is in use, and if
>   configured without pv-freq (or if the guest fails to discover pv-freq
>   for any reason), the administrator may encounter anything between
>   subtle breakage or fatally incorrect timekeeping.
> 
>   There's plenty of SW agents other than Linux which runs in a guest,
>   which would need to be updated to handle pv-freq, e.g. GRUB, *BSD,
>   iPXE.
> 
>   Given this, I think that this is going to lead to subtle breakage in
>   real-world scenarios. 

LPT only changes things on migration. Up until migration the
(architectural) clocks still behave perfectly normally. A guest which
opts in to LPT can derive a clock with a different frequency, but the
underlying clock doesn't change.

When migration happens it's a different story.

If the frequency of the new host matches the old host then again the
clocks behave 'normally': CNTVOFF is used to hide the change in offset
such that the guest at worst sees time pause during the actual migration.

But the whole point of LPT is to deal with the situation if the clock
frequency has changed. A guest (or SW agent) which doesn't know about PV
will experience one of two things:

* Without LPT: the clock frequency will suddenly change without warning,
but the virtual counter is monotonically increasing.

* With LPT: the clock frequency will suddenly change and the virtual
counter will jump (it won't be monotonically increasing).

So I agree the situation with LPT is worse (we lose the monotonicity),
but any guest/agent which didn't understand about the migration is in
trouble if it cares about time.

> * It is (necessarily) invasive to the low-level arch timer code. This is
>   unfortunate, and I strongly suspect this is going to be an area with
>   long-term subtle breakage.

I can't argue against that - I've tried to limit how invasive the code
changes are, but ultimately we're changing the interpretation of
low-level timers.

> * It's not clear to me how strongly people need this. My understanding
>   is that datacenters would run largely homogeneous platforms. I suspect
>   large datacenters which would use migration are in a position to
>   mandate a standard timer frequency from their OEMs or SiPs.
> 
>   I strongly believe that an architectural fix (e.g. in-hw scaling)
>   would be the better solution.

An architectural fix in hardware is clearly the best solution. The
question is whether we want to support the use-case with today's
hardware. While mandating a particular 'standard' timer frequency is a
good idea, there's currently no standard. Large datacenters might be
able to mandate that, and maybe there'll be sufficient consensus that
this doesn't matter. But I seem to have misplaced my crystal ball...

> I understand that LPT is supposed to account for time lost during the
> migration. Can we account for this without pv-freq? e.g. is it possible
> to account for this in the same way as stolen time?

LPT isn't really about accounting for the time lost (to some extent this
is already done by saving/restoring the "KVM_REG_ARM_TIMER_CNT"
register) but about ensuring that the guest can derive a monotonically
increasing counter which maintains a stable frequency when migrated.

I'm going to respin the series with the LPT parts split out to the end,
that way we can (hopefully) agree on the stolen time parts and can defer
the LPT part if necessary.

Thanks,

Steve

> Thanks,
> Mark.

Re: [PATCH 06/12] KVM: arm64: Support Live Physical Time reporting

2018-12-10 Thread Steven Price
On 10/12/2018 10:56, Mark Rutland wrote:
> On Wed, Nov 28, 2018 at 02:45:21PM +, Steven Price wrote:
>> Provide a method for a guest to derive a paravirtualized counter/timer
>> which isn't dependent on the host's counter frequency. This allows a
>> guest to be migrated onto a new host which doesn't have the same
>> frequency without the virtual counter being disturbed.
> 
> I have a number of concerns about paravirtualizing the timer frequency,
> but I'll bring that up in reply to the cover letter.
> 
> I have some orthogonal comments below.
> 
>> The host provides a shared page which contains coefficients that can be
>> used to map the real counter from the host (the Arm "virtual counter")
>> to a paravirtualized view of time. On migration the new host updates the
>> coefficients to ensure that the guests view of time (after using the
>> coefficients) doesn't change and that the derived counter progresses at
>> the same real frequency.
> 
> Can we please avoid using the term 'page' here?
> 
> There is a data structure in shared memory, but it is not page-sized,
> and referring to it as a page here and elsewhere is confusing. The spec
> never uses the term 'page'
> 
> Could we please say something like:
> 
>   The host provides a data structure in shared memory which ...
> 
> ... to avoid the implication this is page sized/aligned etc.
> 
> [...]

Sure, I'll update to avoid referring to it as a page. Although note that
when mapping it into the guest we can obviously only map in page sized
granules, so in practice the LPT structure is contained within an entire
page given to the guest...
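
As a rough illustration of how a guest might consume those coefficients (a
hedged sketch using the field names from the pvclock_vm_time_info structure in
patch 05; the precise arithmetic is defined by DEN 0057A, so treat the scaling
below as an approximation rather than the specified algorithm):

    /* Derive the paravirtualized counter from the architectural virtual
     * counter, retrying if the host updates the coefficients mid-read. */
    static u64 lpt_read_pv_counter(const struct pvclock_vm_time_info *lpt)
    {
            u64 seq, cnt, pv_cnt;

            do {
                    seq = le64_to_cpu(READ_ONCE(lpt->sequence_number));
                    cnt = arch_counter_get_cntvct();
                    pv_cnt = mul_u64_u64_shr(cnt, le64_to_cpu(lpt->scale_mult),
                                             le32_to_cpu(lpt->shift));
            } while (seq != le64_to_cpu(READ_ONCE(lpt->sequence_number)));

            return pv_cnt;
    }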

>> +struct kvm_arch_pvtime {
>> +void *pv_page;
>> +
>> +gpa_t lpt_page;
>> +u32 lpt_fpv;
>> +} pvtime;
> 
> To remove the page terminology, perhaps something like:
> 
>   struct kvm_arch_pvtime {
>   struct lpt  *lpt;
>   gpa_t   lpt_gpa;
>   u32 lpt_fpv;
>   };
> 
> [...]
> 
>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>> index 8bf259dae9f6..fd3a2caabeb2 100644
>> --- a/include/linux/kvm_types.h
>> +++ b/include/linux/kvm_types.h
>> @@ -49,6 +49,8 @@ typedef unsigned long  gva_t;
>>  typedef u64gpa_t;
>>  typedef u64gfn_t;
>>  
>> +#define GPA_INVALID-1
> 
> To avoid any fun with signed/unsigned comparison, can we please make
> this:
> 
> #define GPA_INVALID   ((gpa_t)-1)
> 
> ... or:
> 
> #define GPA_INVALID (~(gpa_t)0)
> 
> [...]

I'll go with the latter as I think that's clearer.

>> +static void update_vtimer_cval(struct kvm *kvm, u32 previous_rate)
>> +{
>> +u32 current_rate = arch_timer_get_rate();
>> +u64 current_time = kvm_phys_timer_read();
>> +int i;
>> +struct kvm_vcpu *vcpu;
>> +u64 rel_cval;
>> +
>> +/* Early out if there's nothing to do */
>> +if (likely(previous_rate == current_rate))
>> +return;
> 
> Given this only happens on migration, I don't think we need to care
> about likely/unlikely here, and can drop that from the condition.

Fair enough

> [...]
> 
>> +int kvm_arm_update_lpt_sequence(struct kvm *kvm)
>> +{
>> +struct pvclock_vm_time_info *pvclock;
>> +u64 lpt_ipa = kvm->arch.pvtime.lpt_page;
>> +u64 native_freq, pv_freq, scale_mult, div_by_pv_freq_mult;
>> +u64 shift = 0;
>> +u64 sequence_number = 0;
>> +
>> +if (lpt_ipa == GPA_INVALID)
>> +return -EINVAL;
>> +
>> +/* Page address must be 64 byte aligned */
>> +if (lpt_ipa & 63)
>> +return -EINVAL;
> 
> Please use IS_ALIGNED(), e.g.
> 
>   if (!IS_ALIGNED(lpt_ipa, 64))
>   return -EINVAL;

Yes, much clearer - no need for the comment :)

Thanks,

Steve

> Thanks,
> Mark.


Re: [PATCH v5 5/7] iommu: Add virtio-iommu driver

2018-12-10 Thread Jean-Philippe Brucker
On 27/11/2018 18:53, Michael S. Tsirkin wrote:
> On Tue, Nov 27, 2018 at 06:10:46PM +, Jean-Philippe Brucker wrote:
>> On 27/11/2018 18:04, Michael S. Tsirkin wrote:
>>> On Tue, Nov 27, 2018 at 05:50:50PM +, Jean-Philippe Brucker wrote:
 On 23/11/2018 22:02, Michael S. Tsirkin wrote:
>> +/*
>> + * __viommu_sync_req - Complete all in-flight requests
>> + *
>> + * Wait for all added requests to complete. When this function returns, 
>> all
>> + * requests that were in-flight at the time of the call have completed.
>> + */
>> +static int __viommu_sync_req(struct viommu_dev *viommu)
>> +{
>> +int ret = 0;
>> +unsigned int len;
>> +size_t write_len;
>> +struct viommu_request *req;
>> +struct virtqueue *vq = viommu->vqs[VIOMMU_REQUEST_VQ];
>> +
>> +assert_spin_locked(&viommu->request_lock);
>> +
>> +virtqueue_kick(vq);
>> +
>> +while (!list_empty(&viommu->requests)) {
>> +len = 0;
>> +req = virtqueue_get_buf(vq, &len);
>> +if (!req)
>> +continue;
>> +
>> +if (!len)
>> +viommu_set_req_status(req->buf, req->len,
>> +  VIRTIO_IOMMU_S_IOERR);
>> +
>> +write_len = req->len - req->write_offset;
>> +if (req->writeback && len == write_len)
>> +memcpy(req->writeback, req->buf + 
>> req->write_offset,
>> +   write_len);
>> +
>> +list_del(&req->list);
>> +kfree(req);
>> +}
>
> I didn't notice this in the past but it seems this will spin
> with interrupts disabled until host handles the request.
> Please do not do this - host execution can be another
> task that needs the same host CPU. This will then disable
> interrupts for a very very long time.

 In the guest yes, but that doesn't prevent the host from running another
 task right?
>>>
>>> Doesn't prevent it but it will delay it significantly
>>> until scheduler decides to kick the VCPU task out.
>>>
 My tests run fine when QEMU is bound to a single CPU, even
 though vcpu and viommu run in different threads

> What to do then? Queue in software and wake up task.

 Unfortunately I can't do anything here, because IOMMU drivers can't
 sleep in the iommu_map() or iommu_unmap() path.

 The problem is the same
 for all IOMMU drivers. That's because the DMA API allows drivers to call
 some functions with interrupts disabled. For example
 Documentation/DMA-API-HOWTO.txt allows dma_alloc_coherent() and
 dma_unmap_single() to be called in interrupt context.
>>>
>>> In fact I don't really understand how it's supposed to
>>> work at all: you only sync when ring is full.
>>> So host may not have seen your map request if ring
>>> is not full.
>>> Why is it safe to use the address with a device then?
>>
>> viommu_map() calls viommu_send_req_sync(), which does the sync
>> immediately after adding the MAP request.
>>
>> Thanks,
>> Jean
> 
> I see. So it happens on every request. Maybe you should clear
> event index then. This way if exits are disabled you know that
> host is processing the ring. Event index is good for when
> you don't care when it will be processed, you just want
> to reduce number of exits as much as possible.
> 

I think that's already the case: since we don't attach a callback to the
request queue, VRING_AVAIL_F_NO_INTERRUPT is set in avail_flags_shadow,
which causes the used event index to stay clear.

Thanks,
Jean


Re: [PATCH v6 08/13] arm64: expose user PAC bit positions via ptrace

2018-12-10 Thread Will Deacon
On Mon, Dec 10, 2018 at 08:22:06AM -0600, Richard Henderson wrote:
> On 12/10/18 6:03 AM, Catalin Marinas wrote:
> >> However, it won't be too long before someone implements support for
> >> ARMv8.2-LVA, at which point, without changes to mandatory pointer tagging, 
> >> we
> >> will only have 3 authentication bits: [54:52].  This seems useless and 
> >> easily
> >> brute-force-able.
> > 
> > Such support is already here (about to be queued):
> > 
> > https://lore.kernel.org/linux-arm-kernel/20181206225042.11548-1-steve.cap...@arm.com/
> 
> Thanks for the pointer.
> 
> >> Unfortunately, there is no obvious path to making this optional that does 
> >> not
> >> break compatibility with Documentation/arm64/tagged-pointers.txt.
> > 
> > There is also the ARMv8.5 MTE (memory tagging) which relies on tagged
> > pointers.
> 
> So it does.  I hadn't read through that extension completely before.
> 
> > An alternative would be to allow the opt-in to 52-bit VA, leaving it at
> > 48-bit by default. However, it has the problem of changing the PAC size
> > and not being able to return.
> 
> Perhaps the opt-in should be at exec time, with ELF flags (or equivalent) on
> the application.  Because, as you say, changing the shape of the PAC in the
> middle of execution is in general not possible.

I think we'd still have a potential performance problem with that approach,
since we'd end up having to context-switch TCR.T0SZ, which is permitted to
be cached in a TLB and would therefore force us to introduce TLB
invalidation when context-switching between tasks using 52-bit VAs and tasks
using 48-bit VAs.

There's a chance we could get the architecture tightened here, but it's
not something we've pushed for so far and it depends on what's already been
built.

Will


Re: [PATCH v6 08/13] arm64: expose user PAC bit positions via ptrace

2018-12-10 Thread Richard Henderson
On 12/10/18 6:03 AM, Catalin Marinas wrote:
>> However, it won't be too long before someone implements support for
>> ARMv8.2-LVA, at which point, without changes to mandatory pointer tagging, we
>> will only have 3 authentication bits: [54:52].  This seems useless and easily
>> brute-force-able.
> 
> Such support is already here (about to be queued):
> 
> https://lore.kernel.org/linux-arm-kernel/20181206225042.11548-1-steve.cap...@arm.com/

Thanks for the pointer.

>> Unfortunately, there is no obvious path to making this optional that does not
>> break compatibility with Documentation/arm64/tagged-pointers.txt.
> 
> There is also the ARMv8.5 MTE (memory tagging) which relies on tagged
> pointers.

So it does.  I hadn't read through that extension completely before.

> An alternative would be to allow the opt-in to 52-bit VA, leaving it at
> 48-bit by default. However, it has the problem of changing the PAC size
> and not being able to return.

Perhaps the opt-in should be at exec time, with ELF flags (or equivalent) on
the application.  Because, as you say, changing the shape of the PAC in the
middle of execution is in general not possible.

It isn't perfect, since old kernels won't fail to exec an application setting
flags that can't be supported.  And it requires tooling changes.


r~


Re: [PATCH 05/12] KVM: arm64: Implement PV_FEATURES call

2018-12-10 Thread Steven Price
On 10/12/2018 10:39, Mark Rutland wrote:
> On Wed, Nov 28, 2018 at 02:45:20PM +, Steven Price wrote:
>> This provides a mechanism for querying which paravirtualized features
>> are available in this hypervisor.
>>
>> Also add the header file which defines the ABI for the paravirtualized
>> clock features we're about to add.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/include/asm/pvclock-abi.h | 32 
>>  include/kvm/arm_pv.h | 28 
>>  include/linux/arm-smccc.h|  1 +
>>  virt/kvm/arm/hypercalls.c|  9 
>>  4 files changed, 70 insertions(+)
>>  create mode 100644 arch/arm64/include/asm/pvclock-abi.h
>>  create mode 100644 include/kvm/arm_pv.h
>>
>> diff --git a/arch/arm64/include/asm/pvclock-abi.h 
>> b/arch/arm64/include/asm/pvclock-abi.h
>> new file mode 100644
>> index ..64ce041c8922
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/pvclock-abi.h
>> @@ -0,0 +1,32 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (C) 2018 Arm Ltd. */
>> +
>> +#ifndef __ASM_PVCLOCK_ABI_H
>> +#define __ASM_PVCLOCK_ABI_H
>> +
>> +#include 
>> +
>> +struct pvclock_vm_time_info {
>> +__le32 revision;
>> +__le32 attributes;
>> +__le64 sequence_number;
>> +__le64 scale_mult;
>> +__le32 shift;
>> +__le32 reserved;
>> +__le64 native_freq;
>> +__le64 pv_freq;
>> +__le64 div_by_pv_freq_mult;
>> +} __packed;
>> +
>> +struct pvclock_vcpu_stolen_time_info {
>> +__le32 revision;
>> +__le32 attributes;
>> +__le64 stolen_time;
>> +/* Structure must be 64 byte aligned, pad to that size */
>> +u8 padding[48];
>> +} __packed;
>> +
>> +#define PV_VM_TIME_NOT_SUPPORTED-1
>> +#define PV_VM_TIME_INVALID_PARAMETERS   -2
> 
> Could you please add a comment describing that these are defined in ARM
> DEN0057A?

No problem.

>> +
>> +#endif
>> diff --git a/include/kvm/arm_pv.h b/include/kvm/arm_pv.h
>> new file mode 100644
>> index ..19d2dafff31a
>> --- /dev/null
>> +++ b/include/kvm/arm_pv.h
>> @@ -0,0 +1,28 @@
>> +/* SPDX-License-Identifier: GPL-2.0
>> + * Copyright (C) 2018 Arm Ltd.
>> + */
>> +
>> +#ifndef __KVM_ARM_PV_H
>> +#define __KVM_ARM_PV_H
>> +
>> +#include 
>> +
>> +#define ARM_SMCCC_HV_PV_FEATURES\
>> +ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
>> +   ARM_SMCCC_SMC_64,\
>> +   ARM_SMCCC_OWNER_HYP_STANDARD,\
>> +   0x20)
>> +
>> +#define ARM_SMCCC_HV_PV_TIME_LPT\
>> +ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
>> +   ARM_SMCCC_SMC_64,\
>> +   ARM_SMCCC_OWNER_HYP_STANDARD,\
>> +   0x21)
>> +
>> +#define ARM_SMCCC_HV_PV_TIME_ST \
>> +ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
>> +   ARM_SMCCC_SMC_64,\
>> +   ARM_SMCCC_OWNER_HYP_STANDARD,\
>> +   0x22)
>> +
>> +#endif /* __KVM_ARM_PV_H */
> 
> Do these need to live in a separate header, away from the struct
> definitions?
> 
> I'd be happy for these to live in , given they're
> standard calls.

I'll move them to linux/arm-smccc.h - I didn't want to place them in
pvclock-abi.h as it seemed wrong to pull that into the generic SMCCC
code which doesn't care about these structures.

> As before, a comment referring to ARM DEN0057A would be nice.

Will add

>> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
>> index b047009e7a0a..4e0866cc48c0 100644
>> --- a/include/linux/arm-smccc.h
>> +++ b/include/linux/arm-smccc.h
>> @@ -54,6 +54,7 @@
>>  #define ARM_SMCCC_OWNER_SIP 2
>>  #define ARM_SMCCC_OWNER_OEM 3
>>  #define ARM_SMCCC_OWNER_STANDARD4
>> +#define ARM_SMCCC_OWNER_HYP_STANDARD5
> 
> Minor nit, but could we make that STANDARD_HYP?

Sure

>>  #define ARM_SMCCC_OWNER_TRUSTED_APP 48
>>  #define ARM_SMCCC_OWNER_TRUSTED_APP_END 49
>>  #define ARM_SMCCC_OWNER_TRUSTED_OS  50
>> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
>> index 153aa7642100..ba13b798f0f8 100644
>> --- a/virt/kvm/arm/hypercalls.c
>> +++ b/virt/kvm/arm/hypercalls.c
>> @@ -5,6 +5,7 @@
>>  #include 
>>  
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -40,6 +41,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>>  break;
>>  }
>>  break;
>> +case ARM_SMCCC_HV_PV_FEATURES:
>> +val = SMCCC_RET_SUCCESS;
>> +break;
>> +}
>> +break;
>> +case ARM_SMCCC_HV_PV_FEATURES:
>> +feature = smccc_get_arg1(vcpu);
>> +switch (feature) {
>>  }
> 

Re: [PATCH 03/12] arm/arm64: Provide a wrapper for SMCCC 1.1 calls

2018-12-10 Thread Steven Price
On 10/12/2018 10:27, Mark Rutland wrote:
> On Wed, Nov 28, 2018 at 02:45:18PM +, Steven Price wrote:
>> SMCCC 1.1 calls may use either HVC or SMC depending on the PSCI
>> conduit. Rather than coding this in every call site provide a macro
>> which uses the correct instruction. The macro also handles the case
>> where no PSCI conduit is configured returning a not supported error
>> in res, along with returning the conduit used for the call.
>>
>> This allow us to remove some duplicated code and will be useful later
>> when adding paravirtualized time hypervisor calls.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  include/linux/arm-smccc.h | 44 +++
>>  1 file changed, 44 insertions(+)
>>
>> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
>> index 18863d56273c..b047009e7a0a 100644
>> --- a/include/linux/arm-smccc.h
>> +++ b/include/linux/arm-smccc.h
>> @@ -311,5 +311,49 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, 
>> unsigned long a1,
>>  #define SMCCC_RET_NOT_SUPPORTED -1
>>  #define SMCCC_RET_NOT_REQUIRED  -2
>>  
>> +/* Like arm_smccc_1_1* but always returns SMCCC_RET_NOT_SUPPORTED.
>> + * Used when the PSCI conduit is not defined. The empty asm statement
>> + * avoids compiler warnings about unused variables.
>> + */
>> +#define __fail_smccc_1_1(...)   
>> \
>> +do {\
>> +__declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \
>> +asm ("" __constraints(__count_args(__VA_ARGS__)));  \
>> +if (___res) \
>> +___res->a0 = SMCCC_RET_NOT_SUPPORTED;   \
>> +} while (0)
>> +
>> +/*
>> + * arm_smccc_1_1() - make an SMCCC v1.1 compliant call
>> + *
>> + * This is a variadic macro taking one to eight source arguments, and
>> + * an optional return structure.
>> + *
>> + * @a0-a7: arguments passed in registers 0 to 7
>> + * @res: result values from registers 0 to 3
>> + *
>> + * This macro will make either an HVC call or an SMC call depending on the
>> + * current PSCI conduit. If no valid conduit is available then -1
>> + * (SMCCC_RET_NOT_SUPPORTED) is returned in @res.a0 (if supplied).
>> + *
>> + * The return value also provides the conduit that was used.
>> + */
>> +#define arm_smccc_1_1(...) ({   
>> \
> 
> As a minor nit, could we please give call this something like
> arm_smccc_1_1_call() or arm_smccc_1_1_invoke(), to make it clear what
> action this performs?

Sure, arm_smccc_1_1_call() works for me.
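
A hedged sketch of what a call site then looks like, using names from this
series (the helper below and its use of the stolen-time feature ID are made up
for illustration):

    static bool has_pv_stolen_time(void)
    {
            struct arm_smccc_res res;

            /* The macro returns the conduit used; res.a0 carries the result. */
            if (arm_smccc_1_1_call(ARM_SMCCC_HV_PV_FEATURES,
                                   ARM_SMCCC_HV_PV_TIME_ST,
                                   &res) == PSCI_CONDUIT_NONE)
                    return false;           /* no SMCCC conduit available */

            return res.a0 == SMCCC_RET_SUCCESS;
    }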

> I had some patches [1] cleaning up the SMCCC API, it would be nice if we
> could pick up some of those as preparatory bits, before we spread some
> of the current design warts (e.g. SMCCC depending on PSCI definitions).

Looks like a similar approach, just extended to include
s/PSCI_CONDUIT/SMCCC_CONDUIT/ and making arm_smccc_1_1_* return res.
Would you like me to drop this (and the next) patch in favour of yours?

Thanks,

Steve

> Thanks,
> Mark.
> 
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git 
> arm64/smccc-cleanup
> 
>> +int method = psci_ops.conduit;  \
>> +switch (method) {   \
>> +case PSCI_CONDUIT_HVC:  \
>> +arm_smccc_1_1_hvc(__VA_ARGS__); \
>> +break;  \
>> +case PSCI_CONDUIT_SMC:  \
>> +arm_smccc_1_1_smc(__VA_ARGS__); \
>> +break;  \
>> +default:\
>> +__fail_smccc_1_1(__VA_ARGS__);  \
>> +method = PSCI_CONDUIT_NONE; \
>> +break;  \
>> +}   \
>> +method; \
>> +})
>> +
>>  #endif /*__ASSEMBLY__*/
>>  #endif /*__LINUX_ARM_SMCCC_H*/
>> -- 
>> 2.19.2
>>


Re: KVM arm realtime performance optimization

2018-12-10 Thread Christoffer Dall
On Mon, Dec 10, 2018 at 05:36:09AM +, Steven Miao (Arm Technology China) 
wrote:
> 
> From: kvmarm-boun...@lists.cs.columbia.edu 
>  On Behalf Of Steven Miao (Arm 
> Technology China)
> Sent: Thursday, December 6, 2018 3:05 PM
> To: kvmarm@lists.cs.columbia.edu
> Subject: KVM arm realtime performance optimization
> 
> Hi Everyone,
> 
> I' currently testing KVM arm realtime performance on a hikey960 board. My 
> test benchmark is cyclictest to measure thread wake up latency both on Host 
> linux OS and KVM Guest linux OS.
> 
> Host OS:
> 
> hikey960:/mnt/debian/usr/src/linux#  cyclictest -p 99 -t 4 -m -n -a 0-3 -l 10
> # /dev/cpu_dma_latency set to 0us
> WARN: Running on unknown kernel version...YMMV
> policy: fifo: loadavg: 0.00 0.00 0.00 1/165 3270
> 
> T: 0 ( 3266) P:99 I:1000 C: 10 Min:  4 Act:   15 Avg:   15 Max: 139
> T: 1 ( 3267) P:99 I:1500 C:  66736 Min:  4 Act:   15 Avg:   15 Max: 239
> T: 2 ( 3268) P:99 I:2000 C:  50051 Min:  4 Act:   19 Avg:   15 Max:  43
> T: 3 ( 3269) P:99 I:2500 C:  40039 Min:  5 Act:   15 Avg:   16 Max:  74
> 
> Guest OS:
> root@genericarmv8:~# cyclictest -p 99 -t 4 -m -n -a 0-3 -l 10
> # /dev/cpu_dma_latency set to 0us
> WARN: Running on unknown kernel version...YMMV
> policy: fifo: loadavg: 0.13 0.05 0.01 1/70 293
> 
> T: 0 (  290) P:99 I:1000 C: 10 Min:  7 Act:   44 Avg:   85 Max:   16111
> T: 1 (  291) P:99 I:1500 C:  5 Min:  7 Act:   81 Avg:   90 Max:   15306
> T: 2 (  292) P:99 I:2000 C:  49995 Min:  7 Act:   88 Avg:   87 Max:   16703
> T: 3 (  293) P:99 I:2500 C:  39992 Min:  8 Act:   72 Avg:   97 Max:   14976
> 
> 
> RT performance on KVM guest OS is poor compared to that on host OS. The 
> average wake up latency is about 6 - 7 times on Guest OS vs on Host OS.
> I've tried some configurations to improve RT in KVM, like:
> 1 Can be combined with CPU isolation
> 2 Host OS and Guest OS use RT preempt kernel
> 3 Host CPU avoid frequency change
> 4 Configure NO_HZ_FULL for Guest OS
> 
> There could be a little improvement after apply above configuration, but the 
> RT performance is still very poor.
> 
> 5 Guest OS use idle poll instead of WFI to avoid trap and switch out
> 
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 2dc0f84..53aef78 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -83,7 +83,7 @@ void arch_cpu_idle(void)
>  * tricks
>  */
> trace_cpu_idle_rcuidle(1, smp_processor_id());
> -   cpu_do_idle();
> +   cpu_relax();
> local_irq_enable();
> trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>  }
> 
> root@genericarmv8:~# cyclictest -p 99 -t 4 -m -n  -l 10
> # /dev/cpu_dma_latency set to 0us
> WARN: Running on unknown kernel version...YMMV
> policy: fifo: loadavg: 0.07 0.03 0.00 1/99 328
> 
> T: 0 (  325) P:99 I:1000 C: 10 Min:  3 Act:6 Avg:   13 Max: 4999
> T: 1 (  326) P:99 I:1500 C:  66659 Min:  5 Act:7 Avg:   14 Max: 3449
> T: 2 (  327) P:99 I:2000 C:  49989 Min:  4 Act:7 Avg:9 Max:   11471
> T: 3 (  328) P:99 I:2500 C:  39986 Min:  4 Act:   14 Avg:   14 Max:   11253
> 
> Method 5 improves Guest OS RT performance a lot: the average thread wake-up
> latency on the Guest OS is almost the same as on the Host OS, but the Max
> wake-up latency is still very poor.
> 
> Does anyone have any ideas for improving RT performance on a KVM Guest OS?
> Although method 5 improves RT performance on the Guest OS a lot, I don't
> think it is a good approach.
> 
This is a known problem and there have been presentations about similar
problems on x86 in past KVM Forums.

The first thing to do is analyze the critical path that adds latency to
a wakeup.  One way to do that is to instrument the path by adding time
counter reads to the path and figuring out what takes time.
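
As a hedged example of that kind of instrumentation, timestamps can be taken
straight from the architectural counter at interesting points in the wake-up
path (the helper and the suggested call sites below are purely illustrative):

    /* Cheap timestamping for the vcpu wake-up path on arm64. */
    static inline unsigned long long read_cntvct(void)
    {
            unsigned long long val;

            asm volatile("mrs %0, cntvct_el0" : "=r" (val));
            return val;
    }

    /* e.g. around kvm_vcpu_block() / the run loop:
     *     t0 = read_cntvct();
     *     ...
     *     trace_printk("segment took %llu ticks\n", read_cntvct() - t0);
     */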

One thing you can look at is having a configurable grace period in KVM's
block function before the process actually goes to sleep (and calls
kvm_vcpu_put) and the host scheduler, and see if that helps anything.

At the end of the day, virtualization is going to add a lot of latency
when you have to switch the entire state of your CPU, and in terms of
virtual RT, you end up with a very high minimal latency.


Thanks,

Christoffer


Re: [PATCH] kvm/arm: return 0 when the number of objects is not lessthan min

2018-12-10 Thread Christoffer Dall
On Thu, Dec 06, 2018 at 09:56:30AM +0800, peng.h...@zte.com.cn wrote:
> >On Wed, Dec 05, 2018 at 09:15:51AM +0800, Peng Hao wrote:
> >> Return 0 when there is enough kvm_mmu_memory_cache object.
> >>
> >> Signed-off-by: Peng Hao 
> >> ---
> >>  virt/kvm/arm/mmu.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >> index ed162a6..fcda0ce 100644
> >> --- a/virt/kvm/arm/mmu.c
> >> +++ b/virt/kvm/arm/mmu.c
> >> @@ -127,7 +127,7 @@ static int mmu_topup_memory_cache(struct 
> >> kvm_mmu_memory_cache *cache,
> >>  while (cache->nobjs < max) {
> >>  page = (void *)__get_free_page(PGALLOC_GFP);
> >>  if (!page)
> >> -return -ENOMEM;
> >> +return cache->nobjs >= min ? 0 : -ENOMEM;
> >
> >This condition will never be true here, as the exact same condition is
> >already checked above, and if it had been true, then we wouldn't be here.
> >
> >What problem are you attempting to solve?
> >
> if (cache->nobjs >= min)  --here cache->nobjs can continue downward 
>  return 0;
> while (cache->nobjs < max) {
> page = (void *)__get_free_page(PGALLOC_GFP);
> if (!page)
>return -ENOMEM; ---here it is possible that
> (cache->nobjs >= min) and (cache->nobjs < max)
> cache->objects[cache->nobjs++] = page; ---here cache->nobjs increasing
>   }
> 
> I just think the logic of this function is to return 0 as long as 
> (cache->nobjs >= min).
> thanks.

That's not the intention, nor is it on any of the other architectures
implementing the same thing (this one goes on the list of stuff we
should be sharing between architectures).

The idea is that you fill up the cache when it goes below min, and you
are always able to fill up to max.

If you're not able to fill up to max, then your system is seriously low
on memory and continuing to run this VM is not likely to be a good idea,
so you might as well tell user space to do something now instead of
waiting until the situation is even worse.
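
To spell that out, a rough sketch of the intended logic (abbreviated from the
existing mmu_topup_memory_cache(), not a proposed change):

    static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
                                      int min, int max)
    {
            void *page;

            if (cache->nobjs >= min)
                    return 0;               /* already enough cached objects */
            while (cache->nobjs < max) {
                    page = (void *)__get_free_page(PGALLOC_GFP);
                    if (!page)
                            /* deliberately fail even if nobjs >= min: the
                             * system is critically short of memory */
                            return -ENOMEM;
                    cache->objects[cache->nobjs++] = page;
            }
            return 0;
    }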


Thanks,

Christoffer


Re: [PATCH v3 2/8] KVM: arm64: Rework detection of SVE, !VHE systems

2018-12-10 Thread Will Deacon
On Mon, Dec 10, 2018 at 10:28:05AM +, Marc Zyngier wrote:
> On 10/12/2018 10:13, Christoffer Dall wrote:
> > On Thu, Dec 06, 2018 at 05:31:20PM +, Marc Zyngier wrote:
> >> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> >> index 23774970c9df..1db4c15edcdd 100644
> >> --- a/virt/kvm/arm/arm.c
> >> +++ b/virt/kvm/arm/arm.c
> >> @@ -1640,8 +1640,10 @@ int kvm_arch_init(void *opaque)
> >>return -ENODEV;
> >>}
> >>  
> >> -  if (!kvm_arch_check_sve_has_vhe()) {
> >> -  kvm_pr_unimpl("SVE system without VHE unsupported.  Broken 
> >> cpu?");
> >> +  in_hyp_mode = is_kernel_in_hyp_mode();
> >> +
> >> +  if (!in_hyp_mode && kvm_arch_requires_vhe()) {
> >> +  kvm_pr_unimpl("CPU requiring VHE was booted in non-VHE mode");
> > 
> > nit: The error message feels weird to me (are we reporting CPU bugs?)
> > and I'm not sure about the unimpl and I think there's a line space
> > missing.
> > 
> > How about:
> > 
> > kvm_err("Cannot support this CPU in non-VHE mode, not initializing\n");
> 
> Yup, works for me. Will, do you mind changing this message for me?

I pushed out an updated branch here:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=kvm/cortex-a76-erratum-1165522

which should address all of Christoffer's comments and add his tags.

I plan to merge that into for-next/core later today.

Will


Re: [PATCH v6 08/13] arm64: expose user PAC bit positions via ptrace

2018-12-10 Thread Catalin Marinas
On Sun, Dec 09, 2018 at 09:41:31AM -0600, Richard Henderson wrote:
> On 12/7/18 12:39 PM, Kristina Martsenko wrote:
> > When pointer authentication is in use, data/instruction pointers have a
> > number of PAC bits inserted into them. The number and position of these
> > bits depends on the configured TCR_ELx.TxSZ and whether tagging is
> > enabled. ARMv8.3 allows tagging to differ for instruction and data
> > pointers.
> 
> At this point I think it's worth starting a discussion about pointer tagging,
> and how we can make it controllable and not mandatory.
> 
> With this patch set, we are enabling 7 authentication bits: [54:48].
> 
> However, it won't be too long before someone implements support for
> ARMv8.2-LVA, at which point, without changes to mandatory pointer tagging, we
> will only have 3 authentication bits: [54:52].  This seems useless and easily
> brute-force-able.

Such support is already here (about to be queued):

https://lore.kernel.org/linux-arm-kernel/20181206225042.11548-1-steve.cap...@arm.com/

> I assume that pointer tagging is primarily used by Android, since I'm not 
> aware
> of anything else that uses it at all.

I would expect it to be enabled more widely (Linux distros), though only
the support for instructions currently in the NOP space.

> Unfortunately, there is no obvious path to making this optional that does not
> break compatibility with Documentation/arm64/tagged-pointers.txt.

There is also the ARMv8.5 MTE (memory tagging) which relies on tagged
pointers.

> I've been thinking that there ought to be some sort of global setting, akin to
> /proc/sys/kernel/randomize_va_space, as well as a prctl which an application
> could use to selectively enable TBI/TBID for an application that actually uses
> tagging.

An alternative would be to allow the opt-in to 52-bit VA, leaving it at
48-bit by default. However, it has the problem of changing the PAC size
and not being able to return.

> The global /proc setting allows the default to remain 1, which would let any
> application using tagging to continue working.  If there are none, the 
> sysadmin
> can set the default to 0.  Going forward, applications could be updated to use
> the prctl, allowing more systems to set the default to 0.
> 
> FWIW, pointer authentication continues to work when enabling TBI, but not the
> other way around.  Thus the prctl could be used to enable TBI at any point, 
> but
> if libc is built with PAuth, there's no way to turn it back off again.

This may work but, as you said, TBI is user ABI at this point, so we can't
take it away now (at the time we didn't foresee pauth).

Talking briefly with Will/Kristina/Mark, I think the best option is to
make 52-bit VA default off in the kernel config. Whoever needs it
enabled (enterprise systems) should be aware of the reduced PAC bits. I
don't really think we have a better solution.
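
For reference, the numbers being traded off here follow directly from
where the PAC sits (assuming TBI stays enabled, so bits [63:56] are the
tag and bit 55 selects the TTBR):

	PAC width = 55 - VA bits
	48-bit VA (TxSZ = 16): PAC in bits [54:48] -> 7 bits
	52-bit VA (TxSZ = 12): PAC in bits [54:52] -> 3 bits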

-- 
Catalin


Re: [PATCH v3 7/8] arm64: KVM: Handle ARM erratum 1165522 in TLB invalidation

2018-12-10 Thread Christoffer Dall
On Mon, Dec 10, 2018 at 11:15:00AM +, James Morse wrote:
> Hi Marc, Christoffer,
> 
> On 10/12/2018 10:46, Marc Zyngier wrote:
> > On 10/12/2018 10:19, Christoffer Dall wrote:
> >> On Thu, Dec 06, 2018 at 05:31:25PM +, Marc Zyngier wrote:
> >>> In order to avoid TLB corruption whilst invalidating TLBs on CPUs
> >>> affected by erratum 1165522, we need to prevent S1 page tables
> >>> from being usable.
> >>>
> >>> For this, we set the EL1 S1 MMU on, and also disable the page table
> >>> walker (by setting the TCR_EL1.EPD* bits to 1).
> >>>
> >>> This ensures that once we switch to the EL1/EL0 translation regime,
> >>> speculated AT instructions won't be able to parse the page tables.
> 
> >>> @@ -64,11 +93,18 @@ static void __hyp_text 
> >>> __tlb_switch_to_host_vhe(struct kvm *kvm,
> >>>   write_sysreg(0, vttbr_el2);
> >>>   write_sysreg(HCR_HOST_VHE_FLAGS, hcr_el2);
> >>>   isb();
> >>> - local_irq_restore(flags);
> >>> +
> >>> + if (cpus_have_const_cap(ARM64_WORKAROUND_1165522)) {
> >>> + /* Restore the guest's registers to what they were */
> >>
> >> host's ?
> > 
> > Hum... Yes, silly thinko.
> 
> I thought these were the guests registers because they are EL1 registers and
> this is a VHE-only path.
> 'interrupted guest' was how I read this. This stuff can get called if memory 
> is
> allocated for guest-A while a vcpu is loaded, and reclaims memory from guest-B
> causing an mmu-notifier call for stage2. This is why we have to put guest-A's
> registers back as we weren't pre-empted, and we expect EL1 to be untouched.
> 
> I agree they could belong to no-guest if a vcpu isn't loaded at all... is host
> the term used here?
> 

Ah, you're right.  Host is not the right term either.

I haven't done the call path analysis, so not sure about all the
possible contexts where all this can be called, but if it's really truly
only in guest context, then we don't need to save the values to a
temporary struct at all, but can save them on the vcpu.

We can also just side-step the whole thing and just say "Restore the
registers to what they were".


Thanks,

Christoffer


Re: [PATCH 00/12] arm64: Paravirtualized time support

2018-12-10 Thread Mark Rutland
On Wed, Nov 28, 2018 at 02:45:15PM +, Steven Price wrote:
> This series add support for paravirtualized time for Arm64 guests and
> KVM hosts following the specification in Arm's document DEN 0057A:
> 
> https://developer.arm.com/docs/den0057/a
> 
> It implements support for Live Physical Time (LPT) which provides the
> guest with a method to derive a stable counter of time during which the
> guest is executing even when the guest is being migrated between hosts
> with different physical counter frequencies.
> 
> It also implements support for stolen time, allowing the guest to
> identify time when it is forcibly not executing.

I know that stolen time reporting is important, and I think that we
definitely want to pick up that part of the spec (once it is published
in some non-draft form).

However, I am very concerned with the pv-freq part of LPT, and I'd like
to avoid that if at all possible. I say that because:

* By design, it breaks architectural guarantees from the PoV of SW in
  the guest.

  A VM may host multiple SW agents serially (e.g. when booting, or
  across kexec), or concurrently (e.g. Linux w/ EFI runtime services),
  and the host has no way to tell whether all software in the guest will
  function correctly. Due to this, it's not possible to have a guest
  opt-in to the architecturally-broken timekeeping.

  Existing guests will not work correctly once pv-freq is in use, and if
  configured without pv-freq (or if the guest fails to discover pv-freq
  for any reason), the administrator may encounter anything from
  subtle breakage to fatally incorrect timekeeping.

  There's plenty of SW agents other than Linux which runs in a guest,
  which would need to be updated to handle pv-freq, e.g. GRUB, *BSD,
  iPXE.

  Given this, I think that this is going to lead to subtle breakage in
  real-world scenarios. 

* It is (necessarily) invasive to the low-level arch timer code. This is
  unfortunate, and I strongly suspect this is going to be an area with
  long-term subtle breakage.

* It's not clear to me how strongly people need this. My understanding
  is that datacenters would run largely homogeneous platforms. I suspect
  large datacenters which would use migration are in a position to
  mandate a standard timer frequency from their OEMs or SiPs.

  I strongly believe that an architectural fix (e.g. in-hw scaling)
  would be the better solution.

I understand that LPT is supposed to account for time lost during the
migration. Can we account for this without pv-freq? e.g. is it possible
to account for this in the same way as stolen time?

Thanks,
Mark.


Re: [PATCH v3 7/8] arm64: KVM: Handle ARM erratum 1165522 in TLB invalidation

2018-12-10 Thread James Morse
Hi Marc, Christoffer,

On 10/12/2018 10:46, Marc Zyngier wrote:
> On 10/12/2018 10:19, Christoffer Dall wrote:
>> On Thu, Dec 06, 2018 at 05:31:25PM +, Marc Zyngier wrote:
>>> In order to avoid TLB corruption whilst invalidating TLBs on CPUs
>>> affected by erratum 1165522, we need to prevent S1 page tables
>>> from being usable.
>>>
>>> For this, we set the EL1 S1 MMU on, and also disable the page table
>>> walker (by setting the TCR_EL1.EPD* bits to 1).
>>>
>>> This ensures that once we switch to the EL1/EL0 translation regime,
>>> speculated AT instructions won't be able to parse the page tables.

>>> @@ -64,11 +93,18 @@ static void __hyp_text __tlb_switch_to_host_vhe(struct 
>>> kvm *kvm,
>>> write_sysreg(0, vttbr_el2);
>>> write_sysreg(HCR_HOST_VHE_FLAGS, hcr_el2);
>>> isb();
>>> -   local_irq_restore(flags);
>>> +
>>> +   if (cpus_have_const_cap(ARM64_WORKAROUND_1165522)) {
>>> +   /* Restore the guest's registers to what they were */
>>
>> host's ?
> 
> Hum... Yes, silly thinko.

I thought these were the guests registers because they are EL1 registers and
this is a VHE-only path.
'interrupted guest' was how I read this. This stuff can get called if memory is
allocated for guest-A while a vcpu is loaded, and reclaims memory from guest-B
causing an mmu-notifier call for stage2. This is why we have to put guest-A's
registers back as we weren't pre-empted, and we expect EL1 to be untouched.

I agree they could belong to no-guest if a vcpu isn't loaded at all... is host
the term used here?


Thanks,

James


Re: [PATCH v9 1/8] KVM: arm/arm64: Share common code in user_mem_abort()

2018-12-10 Thread Christoffer Dall
On Mon, Dec 10, 2018 at 10:47:42AM +, Suzuki K Poulose wrote:
> 
> 
> On 10/12/2018 08:56, Christoffer Dall wrote:
> >On Mon, Dec 03, 2018 at 01:37:37PM +, Suzuki K Poulose wrote:
> >>Hi Anshuman,
> >>
> >>On 03/12/2018 12:11, Anshuman Khandual wrote:
> >>>
> >>>
> >>>On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> The code for operations such as marking the pfn as dirty, and
> dcache/icache maintenance during stage 2 fault handling is duplicated
> between normal pages and PMD hugepages.
> 
> Instead of creating another copy of the operations when we introduce
> PUD hugepages, let's share them across the different pagesizes.
> 
> Signed-off-by: Punit Agrawal 
> Reviewed-by: Suzuki K Poulose 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> ---
>   virt/kvm/arm/mmu.c | 49 --
>   1 file changed, 30 insertions(+), 19 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 5eca48bdb1a6..59595207c5e1 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1475,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
> unsigned long fault_status)
>   {
>   int ret;
> - bool write_fault, exec_fault, writable, hugetlb = false, force_pte = 
> false;
> + bool write_fault, exec_fault, writable, force_pte = false;
>   unsigned long mmu_seq;
>   gfn_t gfn = fault_ipa >> PAGE_SHIFT;
>   struct kvm *kvm = vcpu->kvm;
> @@ -1484,7 +1484,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   kvm_pfn_t pfn;
>   pgprot_t mem_type = PAGE_S2;
>   bool logging_active = memslot_is_logging(memslot);
> - unsigned long flags = 0;
> + unsigned long vma_pagesize, flags = 0;
> >>>
> >>>A small nit s/vma_pagesize/pagesize. Why call it VMA? It's implicit.
> >>
> >>Maybe we could call it mapsize. pagesize is confusing.
> >>
> >
> >I'm ok with mapsize.  I see the vma_pagesize name coming from the fact
> >that this is initially set to the return value from vma_kernel_pagesize.
> >
> >I have no problems with either vma_pagesize or mapsize.
> >
> >>>
>   write_fault = kvm_is_write_fault(vcpu);
>   exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
> @@ -1504,10 +1504,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
> - if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> - hugetlb = true;
> + vma_pagesize = vma_kernel_pagesize(vma);
> + if (vma_pagesize == PMD_SIZE && !logging_active) {
>   gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>   } else {
> + /*
> +  * Fallback to PTE if it's not one of the Stage 2
> +  * supported hugepage sizes
> +  */
> + vma_pagesize = PAGE_SIZE;
> >>>
> >>>This seems redundant and should be dropped. vma_kernel_pagesize() here 
> >>>either
> >>>calls hugetlb_vm_op_pagesize (via hugetlb_vm_ops->pagesize) or simply 
> >>>returns
> >>>PAGE_SIZE. The vm_ops path is taken if the QEMU VMA covering any given HVA 
> >>>is
> >>>backed either by HugeTLB pages or simply normal pages. vma_pagesize would
> >>>either have a value of PMD_SIZE (HugeTLB hstate based) or PAGE_SIZE. Hence if
> >>>it's not PMD_SIZE it must be PAGE_SIZE, which should not be assigned again.
> >>
> >>We may want to force using the PTE mappings when logging_active (e.g.,
> >>migration?) to avoid having to keep track of huge pages. So the check is
> >>still valid.
> >>
> >>
> >
> >Agreed, and let's not try to additionally change the logic and flow with
> >this patch.
> >
> >>>
> +
>   /*
>    * Pages belonging to memslots that don't have the same
>    * alignment for userspace and IPA cannot be mapped 
>  using
> @@ -1573,23 +1579,33 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   if (mmu_notifier_retry(kvm, mmu_seq))
>   goto out_unlock;
> - if (!hugetlb && !force_pte)
> - hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> + if (vma_pagesize == PAGE_SIZE && !force_pte) {
> + /*
> +  * Only PMD_SIZE transparent hugepages(THP) are
> +  * currently supported. This code will need to be
> +  * updated to support other THP sizes.
> +  */
> >>>
> >>>This comment belongs to transparent_hugepage_adjust() but not here.
> >>
> >>I think this is more relevant here than in thp_adjust, unless we rename
> >>the function below to something generic, handle_hugepage_adjust().
> >>
> >
> >Agreed.
> >
> + if (transparent_hugepage_adjust(&pf

Re: [PATCH 06/12] KVM: arm64: Support Live Physical Time reporting

2018-12-10 Thread Mark Rutland
On Wed, Nov 28, 2018 at 02:45:21PM +, Steven Price wrote:
> Provide a method for a guest to derive a paravirtualized counter/timer
> which isn't dependent on the host's counter frequency. This allows a
> guest to be migrated onto a new host which doesn't have the same
> frequency without the virtual counter being disturbed.

I have a number of concerns about paravirtualizing the timer frequency,
but I'll bring that up in reply to the cover letter.

I have some orthogonal comments below.

> The host provides a shared page which contains coefficients that can be
> used to map the real counter from the host (the Arm "virtual counter")
> to a paravirtualized view of time. On migration the new host updates the
> coefficients to ensure that the guests view of time (after using the
> coefficients) doesn't change and that the derived counter progresses at
> the same real frequency.

Can we please avoid using the term 'page' here?

There is a data structure in shared memory, but it is not page-sized,
and referring to it as a page here and elsewhere is confusing. The spec
never uses the term 'page'.

Could we please say something like:

  The host provides a data structure in shared memory which ...

... to avoid the implication this is page sized/aligned etc.

[...]

> + struct kvm_arch_pvtime {
> + void *pv_page;
> +
> + gpa_t lpt_page;
> + u32 lpt_fpv;
> + } pvtime;

To remove the page terminology, perhaps something like:

struct kvm_arch_pvtime {
struct lpt  *lpt;
gpa_t   lpt_gpa;
u32 lpt_fpv;
};

[...]

> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 8bf259dae9f6..fd3a2caabeb2 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -49,6 +49,8 @@ typedef unsigned long  gva_t;
>  typedef u64gpa_t;
>  typedef u64gfn_t;
>  
> +#define GPA_INVALID-1

To avoid any fun with signed/unsigned comparison, can we please make
this:

#define GPA_INVALID ((gpa_t)-1)

... or:

#define GPA_INVALID (~(gpa_t)0)

[...]

> +static void update_vtimer_cval(struct kvm *kvm, u32 previous_rate)
> +{
> + u32 current_rate = arch_timer_get_rate();
> + u64 current_time = kvm_phys_timer_read();
> + int i;
> + struct kvm_vcpu *vcpu;
> + u64 rel_cval;
> +
> + /* Early out if there's nothing to do */
> + if (likely(previous_rate == current_rate))
> + return;

Given this only happens on migration, I don't think we need to care
about likely/unlikely here, and can drop that from the condition.

[...]

> +int kvm_arm_update_lpt_sequence(struct kvm *kvm)
> +{
> + struct pvclock_vm_time_info *pvclock;
> + u64 lpt_ipa = kvm->arch.pvtime.lpt_page;
> + u64 native_freq, pv_freq, scale_mult, div_by_pv_freq_mult;
> + u64 shift = 0;
> + u64 sequence_number = 0;
> +
> + if (lpt_ipa == GPA_INVALID)
> + return -EINVAL;
> +
> + /* Page address must be 64 byte aligned */
> + if (lpt_ipa & 63)
> + return -EINVAL;

Please use IS_ALIGNED(), e.g.

if (!IS_ALIGNED(lpt_ipa, 64))
return -EINVAL;

Thanks,
Mark.


Re: [PATCH v3 1/8] arm64: KVM: Make VHE Stage-2 TLB invalidation operations non-interruptible

2018-12-10 Thread Christoffer Dall
On Mon, Dec 10, 2018 at 10:24:31AM +, Marc Zyngier wrote:
> Hi Christoffer,
> 
> On 10/12/2018 10:03, Christoffer Dall wrote:
> > On Thu, Dec 06, 2018 at 05:31:19PM +, Marc Zyngier wrote:
> >> Contrary to the non-VHE version of the TLB invalidation helpers, the VHE
> >> code  has interrupts enabled, meaning that we can take an interrupt in
> >> the middle of such a sequence, and start running something else with
> >> HCR_EL2.TGE cleared.
> > 
> > Do we have to clear TGE to perform the TLB invalidation, or is that just
> > a side-effect of re-using code?
> 
> We really do need to clear TGE. From the description of TLBI VMALLE1IS:
> 
> 
> When EL2 is implemented and enabled in the current Security state:
> — If HCR_EL2.{E2H, TGE} is not {1, 1}, the entry would be used with the
> current VMID and would be required to translate the specified VA using
> the EL1&0 translation regime.
> — If HCR_EL2.{E2H, TGE} is {1, 1}, the entry would be required to
> translate the specified VA using the EL2&0 translation regime.
> 
> 
> > Also, do we generally perform TLB invalidations in the kernel with
> > interrupts disabled, or is this just a result of clearing TGE?
> 
> That's definitely a result of clearing TGE. We could be taking an
> interrupt here, and execute a user access on the back of it (perf will
> happily walk a user-space stack in that context, for example). Having
> TGE clear in that context would be a problem. An alternative solution would be to
> save/restore TGE on interrupt entry/exit, but that's a bit overkill when
> you consider how rarely we issue such TLB invalidation.
> 
> > Somehow I feel like this should look more like just another TLB
> > invalidation in the kernel, but if there's a good reason why it can't
> > then this is fine.
> 
> The rest of the TLB invalidation in the kernel doesn't need to
> save/restore any context. They apply to a set of parameters that are
> already loaded on the CPU. What we have here is substantially different.
> 

Thanks for the explanation and Arm ARM quote.  I failed to find that on
my own this particular Monday morning.

Acked-by: Christoffer Dall 


Re: [PATCH v9 1/8] KVM: arm/arm64: Share common code in user_mem_abort()

2018-12-10 Thread Suzuki K Poulose




On 10/12/2018 08:56, Christoffer Dall wrote:

On Mon, Dec 03, 2018 at 01:37:37PM +, Suzuki K Poulose wrote:

Hi Anshuman,

On 03/12/2018 12:11, Anshuman Khandual wrote:



On 10/31/2018 11:27 PM, Punit Agrawal wrote:

The code for operations such as marking the pfn as dirty, and
dcache/icache maintenance during stage 2 fault handling is duplicated
between normal pages and PMD hugepages.

Instead of creating another copy of the operations when we introduce
PUD hugepages, let's share them across the different pagesizes.

Signed-off-by: Punit Agrawal 
Reviewed-by: Suzuki K Poulose 
Cc: Christoffer Dall 
Cc: Marc Zyngier 
---
  virt/kvm/arm/mmu.c | 49 --
  1 file changed, 30 insertions(+), 19 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 5eca48bdb1a6..59595207c5e1 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1475,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
  unsigned long fault_status)
  {
int ret;
-   bool write_fault, exec_fault, writable, hugetlb = false, force_pte = 
false;
+   bool write_fault, exec_fault, writable, force_pte = false;
unsigned long mmu_seq;
gfn_t gfn = fault_ipa >> PAGE_SHIFT;
struct kvm *kvm = vcpu->kvm;
@@ -1484,7 +1484,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
kvm_pfn_t pfn;
pgprot_t mem_type = PAGE_S2;
bool logging_active = memslot_is_logging(memslot);
-   unsigned long flags = 0;
+   unsigned long vma_pagesize, flags = 0;


A small nit s/vma_pagesize/pagesize. Why call it VMA? It's implicit.


Maybe we could call it mapsize. pagesize is confusing.



I'm ok with mapsize.  I see the vma_pagesize name coming from the fact
that this is initially set to the return value from vma_kernel_pagesize.

I have no problems with either vma_pagesize or mapsize.




write_fault = kvm_is_write_fault(vcpu);
exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
@@ -1504,10 +1504,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
-   if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
-   hugetlb = true;
+   vma_pagesize = vma_kernel_pagesize(vma);
+   if (vma_pagesize == PMD_SIZE && !logging_active) {
gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
} else {
+   /*
+* Fallback to PTE if it's not one of the Stage 2
+* supported hugepage sizes
+*/
+   vma_pagesize = PAGE_SIZE;


This seems redundant and should be dropped. vma_kernel_pagesize() here either
calls hugetlb_vm_op_pagesize (via hugetlb_vm_ops->pagesize) or simply returns
PAGE_SIZE. The vm_ops path is taken if the QEMU VMA covering any given HVA is
backed either by HugeTLB pages or simply normal pages. vma_pagesize would
either have a value of PMD_SIZE (HugeTLB hstate based) or PAGE_SIZE. Hence if
it's not PMD_SIZE it must be PAGE_SIZE, which should not be assigned again.


We may want to force using the PTE mappings when logging_active (e.g., migration?)
to avoid having to keep track of huge pages. So the check is still valid.




Agreed, and let's not try to additionally change the logic and flow with
this patch.




+
/*
 * Pages belonging to memslots that don't have the same
 * alignment for userspace and IPA cannot be mapped using
@@ -1573,23 +1579,33 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
-   if (!hugetlb && !force_pte)
-   hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
+   if (vma_pagesize == PAGE_SIZE && !force_pte) {
+   /*
+* Only PMD_SIZE transparent hugepages(THP) are
+* currently supported. This code will need to be
+* updated to support other THP sizes.
+*/


This comment belongs to transparent_hugepage_adjust() but not here.


I think this is more relevant here than in thp_adjust, unless we rename
the function below to something generic, handle_hugepage_adjust().



Agreed.


+   if (transparent_hugepage_adjust(&pfn, &fault_ipa))
+   vma_pagesize = PMD_SIZE;


IIUC transparent_hugepage_adjust() is only getting called here. Instead of
returning 'true' when it is able to detect a huge page backing and doing
an adjustment thereafter, it should rather return the THP size (PMD_SIZE) to
accommodate probable multi-size THP support in the future.


That makes sense.



That's fine.



Btw, after further thought, since we don't have any THP support for anything
other than PMD_SIZE, I am dropping the above suggestion. We need to make changes
in our stage2 page table manipulation code anyway to suppor

Re: [PATCH v3 7/8] arm64: KVM: Handle ARM erratum 1165522 in TLB invalidation

2018-12-10 Thread Marc Zyngier
On 10/12/2018 10:19, Christoffer Dall wrote:
> On Thu, Dec 06, 2018 at 05:31:25PM +, Marc Zyngier wrote:
>> In order to avoid TLB corruption whilst invalidating TLBs on CPUs
>> affected by erratum 1165522, we need to prevent S1 page tables
>> from being usable.
>>
>> For this, we set the EL1 S1 MMU on, and also disable the page table
>> walker (by setting the TCR_EL1.EPD* bits to 1).
>>
>> This ensures that once we switch to the EL1/EL0 translation regime,
>> speculated AT instructions won't be able to parse the page tables.
>>
>> Reviewed-by: James Morse 
>> Signed-off-by: Marc Zyngier 
>> ---
>>  arch/arm64/kvm/hyp/tlb.c | 66 +++-
>>  1 file changed, 51 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/hyp/tlb.c b/arch/arm64/kvm/hyp/tlb.c
>> index 7fcc9c1a5f45..ec157543d5a9 100644
>> --- a/arch/arm64/kvm/hyp/tlb.c
>> +++ b/arch/arm64/kvm/hyp/tlb.c
>> @@ -21,12 +21,36 @@
>>  #include 
>>  #include 
>>  
>> +struct tlb_inv_context {
>> +unsigned long   flags;
>> +u64 tcr;
>> +u64 sctlr;
>> +};
>> +
>>  static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm,
>> - unsigned long *flags)
>> + struct tlb_inv_context *cxt)
>>  {
>>  u64 val;
>>  
>> -local_irq_save(*flags);
>> +local_irq_save(cxt->flags);
>> +
>> +if (cpus_have_const_cap(ARM64_WORKAROUND_1165522)) {
>> +/*
>> + * For CPUs that are affected by ARM erratum 1165522, we
>> + * cannot trust stage-1 to be in a correct state at that
>> + * point. Since we do not want to force a full load of the
>> + * vcpu state, we prevent the EL1 page-table walker to
>> + * allocate new TLBs. This is done by setting the EPD bits
>> + * in the TCR_EL1 register. We also need to prevent it to
>> + * allocate IPA->PA walks, so we enable the S1 MMU...
>> + */
>> +val = cxt->tcr = read_sysreg_el1(tcr);
>> +val |= TCR_EPD1_MASK | TCR_EPD0_MASK;
>> +write_sysreg_el1(val, tcr);
>> +val = cxt->sctlr = read_sysreg_el1(sctlr);
>> +val |= SCTLR_ELx_M;
>> +write_sysreg_el1(val, sctlr);
>> +}
>>  
>>  /*
>>   * With VHE enabled, we have HCR_EL2.{E2H,TGE} = {1,1}, and
>> @@ -34,6 +58,11 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct 
>> kvm *kvm,
>>   * guest TLBs (EL1/EL0), we need to change one of these two
>>   * bits. Changing E2H is impossible (goodbye TTBR1_EL2), so
>>   * let's flip TGE before executing the TLB operation.
>> + *
>> + * ARM erratum 1165522 requires some special handling (again),
>> + * as we need to make sure both stages of translation are in
>> + * place before clearing TGE. __load_guest_stage2() already
>> + * has an ISB in order to deal with this.
>>   */
>>  __load_guest_stage2(kvm);
>>  val = read_sysreg(hcr_el2);
>> @@ -43,7 +72,7 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct 
>> kvm *kvm,
>>  }
>>  
>>  static void __hyp_text __tlb_switch_to_guest_nvhe(struct kvm *kvm,
>> -  unsigned long *flags)
>> +  struct tlb_inv_context *cxt)
>>  {
>>  __load_guest_stage2(kvm);
>>  isb();
>> @@ -55,7 +84,7 @@ static hyp_alternate_select(__tlb_switch_to_guest,
>>  ARM64_HAS_VIRT_HOST_EXTN);
>>  
>>  static void __hyp_text __tlb_switch_to_host_vhe(struct kvm *kvm,
>> -unsigned long flags)
>> +struct tlb_inv_context *cxt)
>>  {
>>  /*
>>   * We're done with the TLB operation, let's restore the host's
>> @@ -64,11 +93,18 @@ static void __hyp_text __tlb_switch_to_host_vhe(struct 
>> kvm *kvm,
>>  write_sysreg(0, vttbr_el2);
>>  write_sysreg(HCR_HOST_VHE_FLAGS, hcr_el2);
>>  isb();
>> -local_irq_restore(flags);
>> +
>> +if (cpus_have_const_cap(ARM64_WORKAROUND_1165522)) {
>> +/* Restore the guest's registers to what they were */
> 
> host's ?

Hum... Yes, silly thinko.

[...]

> 
> Otherwise:
> 
> Acked-by: Christoffer Dall 
> 

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH 05/12] KVM: arm64: Implement PV_FEATURES call

2018-12-10 Thread Mark Rutland
On Wed, Nov 28, 2018 at 02:45:20PM +, Steven Price wrote:
> This provides a mechanism for querying which paravirtualized features
> are available in this hypervisor.
> 
> Also add the header file which defines the ABI for the paravirtualized
> clock features we're about to add.
> 
> Signed-off-by: Steven Price 
> ---
>  arch/arm64/include/asm/pvclock-abi.h | 32 
>  include/kvm/arm_pv.h | 28 
>  include/linux/arm-smccc.h|  1 +
>  virt/kvm/arm/hypercalls.c|  9 
>  4 files changed, 70 insertions(+)
>  create mode 100644 arch/arm64/include/asm/pvclock-abi.h
>  create mode 100644 include/kvm/arm_pv.h
> 
> diff --git a/arch/arm64/include/asm/pvclock-abi.h 
> b/arch/arm64/include/asm/pvclock-abi.h
> new file mode 100644
> index ..64ce041c8922
> --- /dev/null
> +++ b/arch/arm64/include/asm/pvclock-abi.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2018 Arm Ltd. */
> +
> +#ifndef __ASM_PVCLOCK_ABI_H
> +#define __ASM_PVCLOCK_ABI_H
> +
> +#include 
> +
> +struct pvclock_vm_time_info {
> + __le32 revision;
> + __le32 attributes;
> + __le64 sequence_number;
> + __le64 scale_mult;
> + __le32 shift;
> + __le32 reserved;
> + __le64 native_freq;
> + __le64 pv_freq;
> + __le64 div_by_pv_freq_mult;
> +} __packed;
> +
> +struct pvclock_vcpu_stolen_time_info {
> + __le32 revision;
> + __le32 attributes;
> + __le64 stolen_time;
> + /* Structure must be 64 byte aligned, pad to that size */
> + u8 padding[48];
> +} __packed;
> +
> +#define PV_VM_TIME_NOT_SUPPORTED -1
> +#define PV_VM_TIME_INVALID_PARAMETERS-2

Could you please add a comment describing that these are defined in ARM
DEN0057A?

> +
> +#endif
> diff --git a/include/kvm/arm_pv.h b/include/kvm/arm_pv.h
> new file mode 100644
> index ..19d2dafff31a
> --- /dev/null
> +++ b/include/kvm/arm_pv.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * Copyright (C) 2018 Arm Ltd.
> + */
> +
> +#ifndef __KVM_ARM_PV_H
> +#define __KVM_ARM_PV_H
> +
> +#include 
> +
> +#define ARM_SMCCC_HV_PV_FEATURES \
> + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
> +ARM_SMCCC_SMC_64,\
> +ARM_SMCCC_OWNER_HYP_STANDARD,\
> +0x20)
> +
> +#define ARM_SMCCC_HV_PV_TIME_LPT \
> + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
> +ARM_SMCCC_SMC_64,\
> +ARM_SMCCC_OWNER_HYP_STANDARD,\
> +0x21)
> +
> +#define ARM_SMCCC_HV_PV_TIME_ST  \
> + ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
> +ARM_SMCCC_SMC_64,\
> +ARM_SMCCC_OWNER_HYP_STANDARD,\
> +0x22)
> +
> +#endif /* __KVM_ARM_PV_H */

Do these need to live in a separate header, away from the struct
definitions?

I'd be happy for these to live in , given they're
standard calls.

As before, a comment referring to ARM DEN0057A would be nice.


> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
> index b047009e7a0a..4e0866cc48c0 100644
> --- a/include/linux/arm-smccc.h
> +++ b/include/linux/arm-smccc.h
> @@ -54,6 +54,7 @@
>  #define ARM_SMCCC_OWNER_SIP  2
>  #define ARM_SMCCC_OWNER_OEM  3
>  #define ARM_SMCCC_OWNER_STANDARD 4
> +#define ARM_SMCCC_OWNER_HYP_STANDARD 5

Minor nit, but could we make that STANDARD_HYP?

>  #define ARM_SMCCC_OWNER_TRUSTED_APP  48
>  #define ARM_SMCCC_OWNER_TRUSTED_APP_END  49
>  #define ARM_SMCCC_OWNER_TRUSTED_OS   50
> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
> index 153aa7642100..ba13b798f0f8 100644
> --- a/virt/kvm/arm/hypercalls.c
> +++ b/virt/kvm/arm/hypercalls.c
> @@ -5,6 +5,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -40,6 +41,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>   break;
>   }
>   break;
> + case ARM_SMCCC_HV_PV_FEATURES:
> + val = SMCCC_RET_SUCCESS;
> + break;
> + }
> + break;
> + case ARM_SMCCC_HV_PV_FEATURES:
> + feature = smccc_get_arg1(vcpu);
> + switch (feature) {
>   }

IIUC, at this point in time, this happens to always return
SMCCC_RET_NOT_SUPPORTED.

If you leave this part out of the patch, and add it as required, this
patch is purely adding definitions, which would be a bit nicer for
review.

Thanks,
Mark.

Re: [PATCH v3 2/8] KVM: arm64: Rework detection of SVE, !VHE systems

2018-12-10 Thread Marc Zyngier
On 10/12/2018 10:13, Christoffer Dall wrote:
> On Thu, Dec 06, 2018 at 05:31:20PM +, Marc Zyngier wrote:
>> An SVE system is so far the only case where we mandate VHE. As we're
>> starting to grow this set of requirements, let's slightly rework the way we
>> deal with that situation, allowing for easy extension of this check.
>>
>> Signed-off-by: Marc Zyngier 
>> ---
>>  arch/arm/include/asm/kvm_host.h   | 2 +-
>>  arch/arm64/include/asm/kvm_host.h | 6 +++---
>>  virt/kvm/arm/arm.c| 8 
>>  3 files changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h 
>> b/arch/arm/include/asm/kvm_host.h
>> index 5ca5d9af0c26..2184d9ddb418 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -285,7 +285,7 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
>>  
>>  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
>>  
>> -static inline bool kvm_arch_check_sve_has_vhe(void) { return true; }
>> +static inline bool kvm_arch_requires_vhe(void) { return false; }
>>  static inline void kvm_arch_hardware_unsetup(void) {}
>>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>> b/arch/arm64/include/asm/kvm_host.h
>> index 52fbc823ff8c..d6d9aa76a943 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -422,7 +422,7 @@ static inline void __cpu_init_hyp_mode(phys_addr_t 
>> pgd_ptr,
>>  }
>>  }
>>  
>> -static inline bool kvm_arch_check_sve_has_vhe(void)
>> +static inline bool kvm_arch_requires_vhe(void)
>>  {
>>  /*
>>   * The Arm architecture specifies that implementation of SVE
>> @@ -430,9 +430,9 @@ static inline bool kvm_arch_check_sve_has_vhe(void)
>>   * relies on this when SVE is present:
>>   */
>>  if (system_supports_sve())
>> -return has_vhe();
>> -else
>>  return true;
>> +
>> +return false;
>>  }
>>  
>>  static inline void kvm_arch_hardware_unsetup(void) {}
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index 23774970c9df..1db4c15edcdd 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -1640,8 +1640,10 @@ int kvm_arch_init(void *opaque)
>>  return -ENODEV;
>>  }
>>  
>> -if (!kvm_arch_check_sve_has_vhe()) {
>> -kvm_pr_unimpl("SVE system without VHE unsupported.  Broken 
>> cpu?");
>> +in_hyp_mode = is_kernel_in_hyp_mode();
>> +
>> +if (!in_hyp_mode && kvm_arch_requires_vhe()) {
>> +kvm_pr_unimpl("CPU requiring VHE was booted in non-VHE mode");
> 
> nit: The error message feels weird to me (are we reporting CPU bugs?)
> and I'm not sure about the unimpl and I think there's a line space
> missing.
> 
> How about:
> 
>   kvm_err("Cannot support this CPU in non-VHE mode, not initializing\n");

Yup, works for me. Will, do you mind changing this message for me?

> 
> If we wanted to be super helpful, we could expand
> kvm_arch_requires_vhe() with a print statement saying:
>   
>   kvm_err("SVE detected, requiring VHE mode\n");
> 
> But that may be overkill.

I started with that, but having kvm_pr*() in an include file has proved
to be challenging, so I decided to spend my time on something more
useful and quickly gave up... :-/

> 
> 
>>  return -ENODEV;
>>  }
>>  
>> @@ -1657,8 +1659,6 @@ int kvm_arch_init(void *opaque)
>>  if (err)
>>  return err;
>>  
>> -in_hyp_mode = is_kernel_in_hyp_mode();
>> -
>>  if (!in_hyp_mode) {
>>  err = init_hyp_mode();
>>  if (err)
>> -- 
>> 2.19.2
>>
> 
> Otherwise:
> 
> Acked-by: Christoffer Dall 
> 

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH 03/12] arm/arm64: Provide a wrapper for SMCCC 1.1 calls

2018-12-10 Thread Mark Rutland
On Wed, Nov 28, 2018 at 02:45:18PM +, Steven Price wrote:
> SMCCC 1.1 calls may use either HVC or SMC depending on the PSCI
> conduit. Rather than coding this in every call site provide a macro
> which uses the correct instruction. The macro also handles the case
> where no PSCI conduit is configured returning a not supported error
> in res, along with returning the conduit used for the call.
> 
> This allows us to remove some duplicated code and will be useful later
> when adding paravirtualized time hypervisor calls.
> 
> Signed-off-by: Steven Price 
> ---
>  include/linux/arm-smccc.h | 44 +++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
> index 18863d56273c..b047009e7a0a 100644
> --- a/include/linux/arm-smccc.h
> +++ b/include/linux/arm-smccc.h
> @@ -311,5 +311,49 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, 
> unsigned long a1,
>  #define SMCCC_RET_NOT_SUPPORTED  -1
>  #define SMCCC_RET_NOT_REQUIRED   -2
>  
> +/* Like arm_smccc_1_1* but always returns SMCCC_RET_NOT_SUPPORTED.
> + * Used when the PSCI conduit is not defined. The empty asm statement
> + * avoids compiler warnings about unused variables.
> + */
> +#define __fail_smccc_1_1(...)
> \
> + do {\
> + __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \
> + asm ("" __constraints(__count_args(__VA_ARGS__)));  \
> + if (___res) \
> + ___res->a0 = SMCCC_RET_NOT_SUPPORTED;   \
> + } while (0)
> +
> +/*
> + * arm_smccc_1_1() - make an SMCCC v1.1 compliant call
> + *
> + * This is a variadic macro taking one to eight source arguments, and
> + * an optional return structure.
> + *
> + * @a0-a7: arguments passed in registers 0 to 7
> + * @res: result values from registers 0 to 3
> + *
> + * This macro will make either an HVC call or an SMC call depending on the
> + * current PSCI conduit. If no valid conduit is available then -1
> + * (SMCCC_RET_NOT_SUPPORTED) is returned in @res.a0 (if supplied).
> + *
> + * The return value also provides the conduit that was used.
> + */
> +#define arm_smccc_1_1(...) ({
> \

As a minor nit, could we please call this something like
arm_smccc_1_1_call() or arm_smccc_1_1_invoke(), to make it clear what
action this performs?
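
For what it's worth, a call site would then read along these lines (the
surrounding error handling and return values here are my assumptions,
not something taken from the patch):

	struct arm_smccc_res res;

	if (arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_FEATURES, &res) ==
	    PSCI_CONDUIT_NONE)
		return -EOPNOTSUPP;	/* no SMCCC conduit available */

	if ((long)res.a0 == SMCCC_RET_NOT_SUPPORTED)
		return -EOPNOTSUPP;	/* hypervisor lacks the feature */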

I had some patches [1] cleaning up the SMCCC API, it would be nice if we
could pick up some of those as preparatory bits, before we spread some
of the current design warts (e.g. SMCCC depending on PSCI definitions).

Thanks,
Mark.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git 
arm64/smccc-cleanup

> + int method = psci_ops.conduit;  \
> + switch (method) {   \
> + case PSCI_CONDUIT_HVC:  \
> + arm_smccc_1_1_hvc(__VA_ARGS__); \
> + break;  \
> + case PSCI_CONDUIT_SMC:  \
> + arm_smccc_1_1_smc(__VA_ARGS__); \
> + break;  \
> + default:\
> + __fail_smccc_1_1(__VA_ARGS__);  \
> + method = PSCI_CONDUIT_NONE; \
> + break;  \
> + }   \
> + method; \
> + })
> +
>  #endif /*__ASSEMBLY__*/
>  #endif /*__LINUX_ARM_SMCCC_H*/
> -- 
> 2.19.2
> 


Re: [PATCH v9 1/8] KVM: arm/arm64: Share common code in user_mem_abort()

2018-12-10 Thread Suzuki K Poulose




On 10/12/2018 08:56, Christoffer Dall wrote:

On Mon, Dec 03, 2018 at 01:37:37PM +, Suzuki K Poulose wrote:

Hi Anshuman,

On 03/12/2018 12:11, Anshuman Khandual wrote:



On 10/31/2018 11:27 PM, Punit Agrawal wrote:

The code for operations such as marking the pfn as dirty, and
dcache/icache maintenance during stage 2 fault handling is duplicated
between normal pages and PMD hugepages.

Instead of creating another copy of the operations when we introduce
PUD hugepages, let's share them across the different pagesizes.

Signed-off-by: Punit Agrawal 
Reviewed-by: Suzuki K Poulose 
Cc: Christoffer Dall 
Cc: Marc Zyngier 
---
  virt/kvm/arm/mmu.c | 49 --
  1 file changed, 30 insertions(+), 19 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 5eca48bdb1a6..59595207c5e1 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1475,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
  unsigned long fault_status)
  {
int ret;
-   bool write_fault, exec_fault, writable, hugetlb = false, force_pte = 
false;
+   bool write_fault, exec_fault, writable, force_pte = false;
unsigned long mmu_seq;
gfn_t gfn = fault_ipa >> PAGE_SHIFT;
struct kvm *kvm = vcpu->kvm;
@@ -1484,7 +1484,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
kvm_pfn_t pfn;
pgprot_t mem_type = PAGE_S2;
bool logging_active = memslot_is_logging(memslot);
-   unsigned long flags = 0;
+   unsigned long vma_pagesize, flags = 0;


A small nit s/vma_pagesize/pagesize. Why call it VMA? It's implicit.


Maybe we could call it mapsize. pagesize is confusing.



I'm ok with mapsize.  I see the vma_pagesize name coming from the fact
that this is initially set to the return value from vma_kernel_pagesize.

I have no problems with either vma_pagesize or mapsize.


Ok, I will retain the vma_pagesize to avoid unnecessary changes to the patch.

Thanks
Suzuki


Re: [PATCH v6 2/4] arm64: KVM: add accessors to track guest/host only counters

2018-12-10 Thread Christoffer Dall
On Mon, Dec 10, 2018 at 09:45:57AM +, Andrew Murray wrote:
> In order to efficiently enable/disable guest/host only perf counters
> at guest entry/exit we add bitfields to kvm_cpu_context for guest and
> host events as well as accessors for updating them.
> 
> Signed-off-by: Andrew Murray 
> ---
>  arch/arm64/include/asm/kvm_host.h | 24 
>  1 file changed, 24 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 1550192..800c87b 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -203,6 +203,8 @@ struct kvm_cpu_context {
>   };
>  
>   struct kvm_vcpu *__hyp_running_vcpu;
> + u32 events_host;
> + u32 events_guest;

This is confusing to me.

These values appear only to be used for the host instance, which makes
me wonder why we add them to kvm_cpu_context, which is also used for the
guest state?  Should we not instead consider moving them to their own
data structure and add a per-cpu data structure or something more fancy
like having a new data structure, kvm_percpu_host_data, which contains
the kvm_cpu_context and the events flags?
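
Something along these lines, purely as a sketch of the idea (all names
made up here):

	/* per-CPU host-only data: the context we already keep for the host,
	 * plus the event masks that never apply to a guest context */
	struct kvm_percpu_host_data {
		struct kvm_cpu_context host_ctxt;
		u32 events_host;
		u32 events_guest;
	};

	DECLARE_PER_CPU(struct kvm_percpu_host_data, kvm_host_data);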

I don't know much about perf, but doesn't this design also imply that
you can only set these modifiers at a per-cpu level, and not attach
the modifiers to a task/vcpu or vm ?  Is that by design?


Thanks,

Christoffer

>  };
>  
>  typedef struct kvm_cpu_context kvm_cpu_context_t;
> @@ -467,11 +469,33 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu);
>  
> +#define KVM_PMU_EVENTS_HOST  1
> +#define KVM_PMU_EVENTS_GUEST 2
> +
>  #ifdef CONFIG_KVM /* Avoid conflicts with core headers if CONFIG_KVM=n */
>  static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
>  {
>   return kvm_arch_vcpu_run_map_fp(vcpu);
>  }
> +static inline void kvm_set_pmu_events(u32 set, int flags)
> +{
> + kvm_cpu_context_t *ctx = this_cpu_ptr(&kvm_host_cpu_state);
> +
> + if (flags & KVM_PMU_EVENTS_HOST)
> + ctx->events_host |= set;
> + if (flags & KVM_PMU_EVENTS_GUEST)
> + ctx->events_guest |= set;
> +}
> +static inline void kvm_clr_pmu_events(u32 clr)
> +{
> + kvm_cpu_context_t *ctx = this_cpu_ptr(&kvm_host_cpu_state);
> +
> + ctx->events_host &= ~clr;
> + ctx->events_guest &= ~clr;
> +}
> +#else
> +static inline void kvm_set_pmu_events(u32 set, int flags) {}
> +static inline void kvm_clr_pmu_events(u32 clr) {}
>  #endif
>  
>  static inline void kvm_arm_vhe_guest_enter(void)
> -- 
> 2.7.4
> 


Re: [PATCH v3 1/8] arm64: KVM: Make VHE Stage-2 TLB invalidation operations non-interruptible

2018-12-10 Thread Marc Zyngier
Hi Christoffer,

On 10/12/2018 10:03, Christoffer Dall wrote:
> On Thu, Dec 06, 2018 at 05:31:19PM +, Marc Zyngier wrote:
>> Contrary to the non-VHE version of the TLB invalidation helpers, the VHE
>> code  has interrupts enabled, meaning that we can take an interrupt in
>> the middle of such a sequence, and start running something else with
>> HCR_EL2.TGE cleared.
> 
> Do we have to clear TGE to perform the TLB invalidation, or is that just
> a side-effect of re-using code?

We really do need to clear TGE. From the description of TLBI VMALLE1IS:


When EL2 is implemented and enabled in the current Security state:
— If HCR_EL2.{E2H, TGE} is not {1, 1}, the entry would be used with the
current VMID and would be required to translate the specified VA using
the EL1&0 translation regime.
— If HCR_EL2.{E2H, TGE} is {1, 1}, the entry would be required to
translate the specified VA using the EL2&0 translation regime.


> Also, do we generally perform TLB invalidations in the kernel with
> interrupts disabled, or is this just a result of clearing TGE?

That's definitely a result of clearing TGE. We could be taking an
interrupt here, and execute a user access on the back of it (perf will
happily walk a user-space stack in that context, for example). Having
TGE clear in that context would be a problem. An alternative solution would be to
save/restore TGE on interrupt entry/exit, but that's a bit overkill when
you consider how rarely we issue such TLB invalidation.

> Somehow I feel like this should look more like just another TLB
> invalidation in the kernel, but if there's a good reason why it can't
> then this is fine.

The rest of the TLB invalidation in the kernel doesn't need to
save/restore any context. They apply to a set of parameters that are
already loaded on the CPU. What we have here is substantially different.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH v3 7/8] arm64: KVM: Handle ARM erratum 1165522 in TLB invalidation

2018-12-10 Thread Christoffer Dall
On Thu, Dec 06, 2018 at 05:31:25PM +, Marc Zyngier wrote:
> In order to avoid TLB corruption whilst invalidating TLBs on CPUs
> affected by erratum 1165522, we need to prevent S1 page tables
> from being usable.
> 
> For this, we set the EL1 S1 MMU on, and also disable the page table
> walker (by setting the TCR_EL1.EPD* bits to 1).
> 
> This ensures that once we switch to the EL1/EL0 translation regime,
> speculated AT instructions won't be able to parse the page tables.
> 
> Reviewed-by: James Morse 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/tlb.c | 66 +++-
>  1 file changed, 51 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/tlb.c b/arch/arm64/kvm/hyp/tlb.c
> index 7fcc9c1a5f45..ec157543d5a9 100644
> --- a/arch/arm64/kvm/hyp/tlb.c
> +++ b/arch/arm64/kvm/hyp/tlb.c
> @@ -21,12 +21,36 @@
>  #include 
>  #include 
>  
> +struct tlb_inv_context {
> + unsigned long   flags;
> + u64 tcr;
> + u64 sctlr;
> +};
> +
>  static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm,
> -  unsigned long *flags)
> +  struct tlb_inv_context *cxt)
>  {
>   u64 val;
>  
> - local_irq_save(*flags);
> + local_irq_save(cxt->flags);
> +
> + if (cpus_have_const_cap(ARM64_WORKAROUND_1165522)) {
> + /*
> +  * For CPUs that are affected by ARM erratum 1165522, we
> +  * cannot trust stage-1 to be in a correct state at that
> +  * point. Since we do not want to force a full load of the
> +  * vcpu state, we prevent the EL1 page-table walker to
> +  * allocate new TLBs. This is done by setting the EPD bits
> +  * in the TCR_EL1 register. We also need to prevent it to
> +  * allocate IPA->PA walks, so we enable the S1 MMU...
> +  */
> + val = cxt->tcr = read_sysreg_el1(tcr);
> + val |= TCR_EPD1_MASK | TCR_EPD0_MASK;
> + write_sysreg_el1(val, tcr);
> + val = cxt->sctlr = read_sysreg_el1(sctlr);
> + val |= SCTLR_ELx_M;
> + write_sysreg_el1(val, sctlr);
> + }
>  
>   /*
>* With VHE enabled, we have HCR_EL2.{E2H,TGE} = {1,1}, and
> @@ -34,6 +58,11 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct 
> kvm *kvm,
>* guest TLBs (EL1/EL0), we need to change one of these two
>* bits. Changing E2H is impossible (goodbye TTBR1_EL2), so
>* let's flip TGE before executing the TLB operation.
> +  *
> +  * ARM erratum 1165522 requires some special handling (again),
> +  * as we need to make sure both stages of translation are in
> +  * place before clearing TGE. __load_guest_stage2() already
> +  * has an ISB in order to deal with this.
>*/
>   __load_guest_stage2(kvm);
>   val = read_sysreg(hcr_el2);
> @@ -43,7 +72,7 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm 
> *kvm,
>  }
>  
>  static void __hyp_text __tlb_switch_to_guest_nvhe(struct kvm *kvm,
> -   unsigned long *flags)
> +   struct tlb_inv_context *cxt)
>  {
>   __load_guest_stage2(kvm);
>   isb();
> @@ -55,7 +84,7 @@ static hyp_alternate_select(__tlb_switch_to_guest,
>   ARM64_HAS_VIRT_HOST_EXTN);
>  
>  static void __hyp_text __tlb_switch_to_host_vhe(struct kvm *kvm,
> - unsigned long flags)
> + struct tlb_inv_context *cxt)
>  {
>   /*
>* We're done with the TLB operation, let's restore the host's
> @@ -64,11 +93,18 @@ static void __hyp_text __tlb_switch_to_host_vhe(struct 
> kvm *kvm,
>   write_sysreg(0, vttbr_el2);
>   write_sysreg(HCR_HOST_VHE_FLAGS, hcr_el2);
>   isb();
> - local_irq_restore(flags);
> +
> + if (cpus_have_const_cap(ARM64_WORKAROUND_1165522)) {
> + /* Restore the guest's registers to what they were */

host's ?

> + write_sysreg_el1(cxt->tcr, tcr);
> + write_sysreg_el1(cxt->sctlr, sctlr);
> + }
> +
> + local_irq_restore(cxt->flags);
>  }
>  
>  static void __hyp_text __tlb_switch_to_host_nvhe(struct kvm *kvm,
> -  unsigned long flags)
> +  struct tlb_inv_context *cxt)
>  {
>   write_sysreg(0, vttbr_el2);
>  }
> @@ -80,13 +116,13 @@ static hyp_alternate_select(__tlb_switch_to_host,
>  
>  void __hyp_text __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>  {
> - unsigned long flags;
> + struct tlb_inv_context cxt;
>  
>   dsb(ishst);
>  
>   /* Switch to requested VMID */
>   kvm = kern_hyp_va(kvm);
> - __tlb_switch_to_guest()(kvm, &flags);
> + __tlb_switch_to_guest()(kvm, &

Re: [PATCH v3 6/8] arm64: KVM: Add synchronization on translation regime change for erratum 1165522

2018-12-10 Thread Christoffer Dall
On Thu, Dec 06, 2018 at 05:31:24PM +, Marc Zyngier wrote:
> In order to ensure that flipping HCR_EL2.TGE is done at the right
> time when switching translation regime, let's insert the required ISBs
> that will be patched in when erratum 1165522 is detected.
> 
> Take this opportunity to add the missing include of asm/alternative.h
> which was getting there by pure luck.
> 
> Reviewed-by: James Morse 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/include/asm/kvm_hyp.h |  8 
>  arch/arm64/kvm/hyp/switch.c  | 19 +++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_hyp.h 
> b/arch/arm64/include/asm/kvm_hyp.h
> index 23aca66767f9..a80a7ef57325 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -20,6 +20,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #define __hyp_text __section(.hyp.text) notrace
> @@ -163,6 +164,13 @@ static __always_inline void __hyp_text 
> __load_guest_stage2(struct kvm *kvm)
>  {
>   write_sysreg(kvm->arch.vtcr, vtcr_el2);
>   write_sysreg(kvm->arch.vttbr, vttbr_el2);
> +
> + /*
> +  * ARM erratum 1165522 requires the actual execution of the above
> +  * before we can switch to the EL1/EL0 translation regime used by
> +  * the guest.
> +  */
> + asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_1165522));
>  }
>  
>  #endif /* __ARM64_KVM_HYP_H__ */
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index a8fa61c68c32..31ee0bfc432f 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -143,6 +143,14 @@ static void deactivate_traps_vhe(void)
>  {
>   extern char vectors[];  /* kernel exception vectors */
>   write_sysreg(HCR_HOST_VHE_FLAGS, hcr_el2);
> +
> + /*
> +  * ARM erratum 1165522 requires the actual execution of the above
> +  * before we can switch to the EL2/EL0 translation regime used by
> +  * the host.
> +  */
> + asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_1165522));
> +
>   write_sysreg(CPACR_EL1_DEFAULT, cpacr_el1);
>   write_sysreg(vectors, vbar_el1);
>  }
> @@ -499,6 +507,17 @@ int kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
>  
>   sysreg_save_host_state_vhe(host_ctxt);
>  
> + /*
> +  * ARM erratum 1165522 requires us to configure both stage 1 and
> +  * stage 2 translation for the guest context before we clear
> +  * HCR_EL2.TGE.
> +  *
> +  * We have already configured the guest's stage 1 translation in
> +  * kvm_vcpu_load_sysregs above.  We must now call __activate_vm
> +  * before __activate_traps, because __activate_vm configures
> +  * stage 2 translation, and __activate_traps clear HCR_EL2.TGE
> +  * (among other things).
> +  */
>   __activate_vm(vcpu->kvm);
>   __activate_traps(vcpu);
>  
> -- 
> 2.19.2
> 

Acked-by: Christoffer Dall 


Re: [PATCH v3 3/8] arm64: KVM: Install stage-2 translation before enabling traps

2018-12-10 Thread Christoffer Dall
On Thu, Dec 06, 2018 at 05:31:21PM +, Marc Zyngier wrote:
> It is a bit odd that we only install stage-2 translation after having
> cleared HCR_EL2.TGE, which means that there is a window during which
> AT requests could fail as stage-2 is not configured yet.
> 
> Let's move stage-2 configuration before we clear TGE, making the
> guest entry sequence clearer: we first configure all the guest stuff,
> then only switch to the guest translation regime.
> 
> While we're at it, do the same thing for !VHE. It doesn't hurt,
> and keeps things symmetric.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/switch.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 7cc175c88a37..a8fa61c68c32 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -499,8 +499,8 @@ int kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
>  
>   sysreg_save_host_state_vhe(host_ctxt);
>  
> - __activate_traps(vcpu);
>   __activate_vm(vcpu->kvm);
> + __activate_traps(vcpu);
>  
>   sysreg_restore_guest_state_vhe(guest_ctxt);
>   __debug_switch_to_guest(vcpu);
> @@ -545,8 +545,8 @@ int __hyp_text __kvm_vcpu_run_nvhe(struct kvm_vcpu *vcpu)
>  
>   __sysreg_save_state_nvhe(host_ctxt);
>  
> - __activate_traps(vcpu);
>   __activate_vm(kern_hyp_va(vcpu->kvm));
> + __activate_traps(vcpu);
>  
>   __hyp_vgic_restore_state(vcpu);
>   __timer_enable_traps(vcpu);
> -- 
> 2.19.2
> 

Acked-by: Christoffer Dall 


Re: [PATCH v3 2/8] KVM: arm64: Rework detection of SVE, !VHE systems

2018-12-10 Thread Christoffer Dall
On Thu, Dec 06, 2018 at 05:31:20PM +, Marc Zyngier wrote:
> An SVE system is so far the only case where we mandate VHE. As we're
> starting to grow these requirements, let's slightly rework the way we
> deal with that situation, allowing for easy extension of this check.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm/include/asm/kvm_host.h   | 2 +-
>  arch/arm64/include/asm/kvm_host.h | 6 +++---
>  virt/kvm/arm/arm.c| 8 
>  3 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 5ca5d9af0c26..2184d9ddb418 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -285,7 +285,7 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
>  
>  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
>  
> -static inline bool kvm_arch_check_sve_has_vhe(void) { return true; }
> +static inline bool kvm_arch_requires_vhe(void) { return false; }
>  static inline void kvm_arch_hardware_unsetup(void) {}
>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 52fbc823ff8c..d6d9aa76a943 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -422,7 +422,7 @@ static inline void __cpu_init_hyp_mode(phys_addr_t 
> pgd_ptr,
>   }
>  }
>  
> -static inline bool kvm_arch_check_sve_has_vhe(void)
> +static inline bool kvm_arch_requires_vhe(void)
>  {
>   /*
>* The Arm architecture specifies that implementation of SVE
> @@ -430,9 +430,9 @@ static inline bool kvm_arch_check_sve_has_vhe(void)
>* relies on this when SVE is present:
>*/
>   if (system_supports_sve())
> - return has_vhe();
> - else
>   return true;
> +
> + return false;
>  }
>  
>  static inline void kvm_arch_hardware_unsetup(void) {}
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 23774970c9df..1db4c15edcdd 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -1640,8 +1640,10 @@ int kvm_arch_init(void *opaque)
>   return -ENODEV;
>   }
>  
> - if (!kvm_arch_check_sve_has_vhe()) {
> - kvm_pr_unimpl("SVE system without VHE unsupported.  Broken 
> cpu?");
> + in_hyp_mode = is_kernel_in_hyp_mode();
> +
> + if (!in_hyp_mode && kvm_arch_requires_vhe()) {
> + kvm_pr_unimpl("CPU requiring VHE was booted in non-VHE mode");

nit: The error message feels weird to me (are we reporting CPU bugs?),
I'm not sure about using kvm_pr_unimpl here, and I think there's a line
break missing.

How about:

kvm_err("Cannot support this CPU in non-VHE mode, not initializing\n");

If we wanted to be super helpful, we could expand
kvm_arch_requires_vhe() with a print statement saying:

kvm_err("SVE detected, requiring VHE mode\n");

But that may be overkill.
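
Something like this, perhaps (untested sketch only, and note it would
print whenever SVE is present, not just when we end up failing):

	static inline bool kvm_arch_requires_vhe(void)
	{
		/*
		 * The Arm architecture specifies that implementation of SVE
		 * requires VHE also to be implemented:
		 */
		if (system_supports_sve()) {
			kvm_err("SVE detected, requiring VHE mode\n");
			return true;
		}

		return false;
	}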


>   return -ENODEV;
>   }
>  
> @@ -1657,8 +1659,6 @@ int kvm_arch_init(void *opaque)
>   if (err)
>   return err;
>  
> - in_hyp_mode = is_kernel_in_hyp_mode();
> -
>   if (!in_hyp_mode) {
>   err = init_hyp_mode();
>   if (err)
> -- 
> 2.19.2
> 

Otherwise:

Acked-by: Christoffer Dall 


Re: [PATCH v3 1/8] arm64: KVM: Make VHE Stage-2 TLB invalidation operations non-interruptible

2018-12-10 Thread Christoffer Dall
On Thu, Dec 06, 2018 at 05:31:19PM +, Marc Zyngier wrote:
> Contrary to the non-VHE version of the TLB invalidation helpers, the VHE
> code  has interrupts enabled, meaning that we can take an interrupt in
> the middle of such a sequence, and start running something else with
> HCR_EL2.TGE cleared.

Do we have to clear TGE to perform the TLB invalidation, or is that just
a side-effect of re-using code?

Also, do we generally perform TLB invalidations in the kernel with
interrupts disabled, or is this just a result of clearing TGE?

Somehow I feel like this should look more like just another TLB
invalidation in the kernel, but if there's a good reason why it can't
then this is fine.

Thanks,

Christoffer

> 
> That's really not a good idea.
> 
> Take the heavy-handed option and disable interrupts in
> __tlb_switch_to_guest_vhe, restoring them in __tlb_switch_to_host_vhe.
> The latter also gain an ISB in order to make sure that TGE really has
> taken effect.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/tlb.c | 35 +--
>  1 file changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/tlb.c b/arch/arm64/kvm/hyp/tlb.c
> index 4dbd9c69a96d..7fcc9c1a5f45 100644
> --- a/arch/arm64/kvm/hyp/tlb.c
> +++ b/arch/arm64/kvm/hyp/tlb.c
> @@ -15,14 +15,19 @@
>   * along with this program.  If not, see .
>   */
>  
> +#include 
> +
>  #include 
>  #include 
>  #include 
>  
> -static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm)
> +static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm,
> +  unsigned long *flags)
>  {
>   u64 val;
>  
> + local_irq_save(*flags);
> +
>   /*
>* With VHE enabled, we have HCR_EL2.{E2H,TGE} = {1,1}, and
>* most TLB operations target EL2/EL0. In order to affect the
> @@ -37,7 +42,8 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm 
> *kvm)
>   isb();
>  }
>  
> -static void __hyp_text __tlb_switch_to_guest_nvhe(struct kvm *kvm)
> +static void __hyp_text __tlb_switch_to_guest_nvhe(struct kvm *kvm,
> +   unsigned long *flags)
>  {
>   __load_guest_stage2(kvm);
>   isb();
> @@ -48,7 +54,8 @@ static hyp_alternate_select(__tlb_switch_to_guest,
>   __tlb_switch_to_guest_vhe,
>   ARM64_HAS_VIRT_HOST_EXTN);
>  
> -static void __hyp_text __tlb_switch_to_host_vhe(struct kvm *kvm)
> +static void __hyp_text __tlb_switch_to_host_vhe(struct kvm *kvm,
> + unsigned long flags)
>  {
>   /*
>* We're done with the TLB operation, let's restore the host's
> @@ -56,9 +63,12 @@ static void __hyp_text __tlb_switch_to_host_vhe(struct kvm 
> *kvm)
>*/
>   write_sysreg(0, vttbr_el2);
>   write_sysreg(HCR_HOST_VHE_FLAGS, hcr_el2);
> + isb();
> + local_irq_restore(flags);
>  }
>  
> -static void __hyp_text __tlb_switch_to_host_nvhe(struct kvm *kvm)
> +static void __hyp_text __tlb_switch_to_host_nvhe(struct kvm *kvm,
> +  unsigned long flags)
>  {
>   write_sysreg(0, vttbr_el2);
>  }
> @@ -70,11 +80,13 @@ static hyp_alternate_select(__tlb_switch_to_host,
>  
>  void __hyp_text __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>  {
> + unsigned long flags;
> +
>   dsb(ishst);
>  
>   /* Switch to requested VMID */
>   kvm = kern_hyp_va(kvm);
> - __tlb_switch_to_guest()(kvm);
> + __tlb_switch_to_guest()(kvm, &flags);
>  
>   /*
>* We could do so much better if we had the VA as well.
> @@ -117,36 +129,39 @@ void __hyp_text __kvm_tlb_flush_vmid_ipa(struct kvm 
> *kvm, phys_addr_t ipa)
>   if (!has_vhe() && icache_is_vpipt())
>   __flush_icache_all();
>  
> - __tlb_switch_to_host()(kvm);
> + __tlb_switch_to_host()(kvm, flags);
>  }
>  
>  void __hyp_text __kvm_tlb_flush_vmid(struct kvm *kvm)
>  {
> + unsigned long flags;
> +
>   dsb(ishst);
>  
>   /* Switch to requested VMID */
>   kvm = kern_hyp_va(kvm);
> - __tlb_switch_to_guest()(kvm);
> + __tlb_switch_to_guest()(kvm, &flags);
>  
>   __tlbi(vmalls12e1is);
>   dsb(ish);
>   isb();
>  
> - __tlb_switch_to_host()(kvm);
> + __tlb_switch_to_host()(kvm, flags);
>  }
>  
>  void __hyp_text __kvm_tlb_flush_local_vmid(struct kvm_vcpu *vcpu)
>  {
>   struct kvm *kvm = kern_hyp_va(kern_hyp_va(vcpu)->kvm);
> + unsigned long flags;
>  
>   /* Switch to requested VMID */
> - __tlb_switch_to_guest()(kvm);
> + __tlb_switch_to_guest()(kvm, &flags);
>  
>   __tlbi(vmalle1);
>   dsb(nsh);
>   isb();
>  
> - __tlb_switch_to_host()(kvm);
> + __tlb_switch_to_host()(kvm, flags);
>  }
>  
>  void __hyp_text __kvm_flush_vm_context(void)
> -- 
> 2.19.2
> 

[PATCH v6 0/4] arm64: Support perf event modifiers :G and :H

2018-12-10 Thread Andrew Murray
This patchset provides support for the perf event modifiers :G and :H,
which allow filtering of PMU events between host and guest when used
with KVM.

As the underlying hardware cannot distinguish between guest and host
context, the performance counters must be stopped and started upon
entry/exit to the guest. This is performed at EL2 in a way that
minimizes overhead and improves accuracy of recording events that only
occur in the requested context.
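
For reference, the modifiers map onto the exclude_host/exclude_guest bits
of struct perf_event_attr; a minimal userspace illustration (sketch only,
error handling omitted) of requesting guest-only cycle counting via the
raw syscall would be:

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Count CPU cycles only while a guest is running (the :G case). */
	static int open_guest_only_cycles(int cpu)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;
		attr.exclude_host = 1;	/* the :G modifier */

		/* pid = -1: count on this CPU regardless of which task runs */
		return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
	}

From the perf tool this is simply e.g. "perf stat -e cycles:G -e cycles:H ...".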

This has been tested with VHE and non-VHE kernels with a KVM guest.

Changes from v5:

 - Tweak logic in use of kvm_set_pmu_events

Changes from v4:

 - Prevent unnecessary write_sysreg calls by improving
   __pmu_switch_to_xxx logic.

Changes from v3:

 - Remove confusing _only suffix from bitfields in kvm_cpu_context
 - Remove unnecessary condition when clearing event bits in disable
 - Simplify API of KVM accessors
 - Prevent unnecessary setting of pmcnten when guest/host events are
   the same.

Changes from v2:

 - Ensured that exclude_kernel works for guest
 - Removed unnecessary exclusion of EL2 with exclude_host on !VHE
 - Renamed kvm_clr_set_host_pmu_events to reflect args order
 - Added additional information to isb patch

Changes from v1:

 - Removed unnecessary exclusion of EL1 with exclude_guest on VHE
 - Removed unnecessary isb from existing perf_event.c driver
 - Folded perf_event.c patches together
 - Added additional information to last patch commit message

Andrew Murray (4):
  arm64: arm_pmu: remove unnecessary isb instruction
  arm64: KVM: add accessors to track guest/host only counters
  arm64: arm_pmu: Add support for exclude_host/exclude_guest attributes
  arm64: KVM: Enable support for :G/:H perf event modifiers

 arch/arm64/include/asm/kvm_host.h | 24 ++
 arch/arm64/kernel/perf_event.c| 52 +--
 arch/arm64/kvm/hyp/switch.c   | 38 
 3 files changed, 106 insertions(+), 8 deletions(-)

-- 
2.7.4



[PATCH v6 3/4] arm64: arm_pmu: Add support for exclude_host/exclude_guest attributes

2018-12-10 Thread Andrew Murray
Add support for the :G and :H attributes in perf by handling the
exclude_host/exclude_guest event attributes.

We notify KVM of counters that we wish to be enabled or disabled on
guest entry/exit and thus defer starting or stopping :G events according
to the event's exclude_host attribute.

With both VHE and non-VHE we switch the counters between host/guest
at EL2. We are able to eliminate counters counting host events on
the boundaries of guest entry/exit when using :G by filtering out
EL2 for exclude_host. However, when using :H on non-VHE, unless
exclude_hv is set there is a small blackout window at guest entry/exit
where host events are not captured.
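
(Worked illustration, not part of the patch: on a VHE host a cycles:G
event has exclude_host = 1 and exclude_guest = 0, so the driver passes
only KVM_PMU_EVENTS_GUEST to kvm_set_pmu_events(), leaves the counter
disabled at enable time, and does not set ARMV8_PMU_INCLUDE_EL2; KVM
then enables the counter only across guest entry/exit.  A cycles:H event
is the mirror image: its counter is enabled immediately and KVM disables
it while the guest runs.)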

Signed-off-by: Andrew Murray 
---
 arch/arm64/kernel/perf_event.c | 51 --
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index de564ae..4a3c73d 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -647,11 +648,26 @@ static inline int armv8pmu_enable_counter(int idx)
 
 static inline void armv8pmu_enable_event_counter(struct perf_event *event)
 {
+   struct perf_event_attr *attr = &event->attr;
int idx = event->hw.idx;
+   int flags = 0;
+   u32 counter_bits = BIT(ARMV8_IDX_TO_COUNTER(idx));
 
-   armv8pmu_enable_counter(idx);
if (armv8pmu_event_is_chained(event))
-   armv8pmu_enable_counter(idx - 1);
+   counter_bits |= BIT(ARMV8_IDX_TO_COUNTER(idx - 1));
+
+   if (!attr->exclude_host)
+   flags |= KVM_PMU_EVENTS_HOST;
+   if (!attr->exclude_guest)
+   flags |= KVM_PMU_EVENTS_GUEST;
+
+   kvm_set_pmu_events(counter_bits, flags);
+
+   if (!attr->exclude_host) {
+   armv8pmu_enable_counter(idx);
+   if (armv8pmu_event_is_chained(event))
+   armv8pmu_enable_counter(idx - 1);
+   }
 }
 
 static inline int armv8pmu_disable_counter(int idx)
@@ -664,11 +680,20 @@ static inline int armv8pmu_disable_counter(int idx)
 static inline void armv8pmu_disable_event_counter(struct perf_event *event)
 {
struct hw_perf_event *hwc = &event->hw;
+   struct perf_event_attr *attr = &event->attr;
int idx = hwc->idx;
+   u32 counter_bits = BIT(ARMV8_IDX_TO_COUNTER(idx));
 
if (armv8pmu_event_is_chained(event))
-   armv8pmu_disable_counter(idx - 1);
-   armv8pmu_disable_counter(idx);
+   counter_bits |= BIT(ARMV8_IDX_TO_COUNTER(idx - 1));
+
+   kvm_clr_pmu_events(counter_bits);
+
+   if (!attr->exclude_host) {
+   if (armv8pmu_event_is_chained(event))
+   armv8pmu_disable_counter(idx - 1);
+   armv8pmu_disable_counter(idx);
+   }
 }
 
 static inline int armv8pmu_enable_intens(int idx)
@@ -943,16 +968,25 @@ static int armv8pmu_set_event_filter(struct hw_perf_event 
*event,
 * Therefore we ignore exclude_hv in this configuration, since
 * there's no hypervisor to sample anyway. This is consistent
 * with other architectures (x86 and Power).
+*
+* To eliminate counting host events on the boundaries of
+* guest entry/exit we ensure EL2 is not included in hyp mode
+* with !exclude_host.
 */
if (is_kernel_in_hyp_mode()) {
-   if (!attr->exclude_kernel)
+   if (!attr->exclude_kernel && !attr->exclude_host)
config_base |= ARMV8_PMU_INCLUDE_EL2;
} else {
-   if (attr->exclude_kernel)
-   config_base |= ARMV8_PMU_EXCLUDE_EL1;
if (!attr->exclude_hv)
config_base |= ARMV8_PMU_INCLUDE_EL2;
}
+
+   /*
+* Filter out !VHE kernels and guest kernels
+*/
+   if (attr->exclude_kernel)
+   config_base |= ARMV8_PMU_EXCLUDE_EL1;
+
if (attr->exclude_user)
config_base |= ARMV8_PMU_EXCLUDE_EL0;
 
@@ -976,6 +1010,9 @@ static void armv8pmu_reset(void *info)
armv8pmu_disable_intens(idx);
}
 
+   /* Clear the counters we flip at guest entry/exit */
+   kvm_clr_pmu_events(U32_MAX);
+
/*
 * Initialize & Reset PMNC. Request overflow interrupt for
 * 64 bit cycle counter but cheat in armv8pmu_write_counter().
-- 
2.7.4



[PATCH v6 4/4] arm64: KVM: Enable support for :G/:H perf event modifiers

2018-12-10 Thread Andrew Murray
Enable/disable event counters as appropriate when entering and exiting
the guest to enable support for guest or host only event counting.

For both VHE and non-VHE we switch the counters between host/guest at
EL2. EL2 is filtered out by the PMU when we are using the :G modifier.

The PMU may be on when we change which counters are enabled; however,
we avoid adding an isb as we instead rely on existing context
synchronisation events: the isb in kvm_arm_vhe_guest_exit for VHE and
the eret from the hvc in kvm_call_hyp.

Signed-off-by: Andrew Murray 
Reviewed-by: Suzuki K Poulose 
---
 arch/arm64/kvm/hyp/switch.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index d496ef5..e505cad 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -373,6 +373,32 @@ static bool __hyp_text __hyp_switch_fpsimd(struct kvm_vcpu 
*vcpu)
return true;
 }
 
+static bool __hyp_text __pmu_switch_to_guest(struct kvm_cpu_context *host_ctxt)
+{
+   u32 clr = host_ctxt->events_host & ~host_ctxt->events_guest;
+   u32 set = host_ctxt->events_guest & ~host_ctxt->events_host;
+
+   if (clr)
+   write_sysreg(clr, pmcntenclr_el0);
+
+   if (set)
+   write_sysreg(set, pmcntenset_el0);
+
+   return (clr || set);
+}
+
+static void __hyp_text __pmu_switch_to_host(struct kvm_cpu_context *host_ctxt)
+{
+   u32 clr = host_ctxt->events_guest & ~host_ctxt->events_host;
+   u32 set = host_ctxt->events_host & ~host_ctxt->events_guest;
+
+   if (clr)
+   write_sysreg(clr, pmcntenclr_el0);
+
+   if (set)
+   write_sysreg(set, pmcntenset_el0);
+}
+
 /*
  * Return true when we were able to fixup the guest exit and should return to
  * the guest, false when we should restore the host state and return to the
@@ -488,12 +514,15 @@ int kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 {
struct kvm_cpu_context *host_ctxt;
struct kvm_cpu_context *guest_ctxt;
+   bool pmu_switch_needed;
u64 exit_code;
 
host_ctxt = vcpu->arch.host_cpu_context;
host_ctxt->__hyp_running_vcpu = vcpu;
guest_ctxt = &vcpu->arch.ctxt;
 
+   pmu_switch_needed = __pmu_switch_to_guest(host_ctxt);
+
sysreg_save_host_state_vhe(host_ctxt);
 
__activate_traps(vcpu);
@@ -524,6 +553,9 @@ int kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 
__debug_switch_to_host(vcpu);
 
+   if (pmu_switch_needed)
+   __pmu_switch_to_host(host_ctxt);
+
return exit_code;
 }
 
@@ -532,6 +564,7 @@ int __hyp_text __kvm_vcpu_run_nvhe(struct kvm_vcpu *vcpu)
 {
struct kvm_cpu_context *host_ctxt;
struct kvm_cpu_context *guest_ctxt;
+   bool pmu_switch_needed;
u64 exit_code;
 
vcpu = kern_hyp_va(vcpu);
@@ -540,6 +573,8 @@ int __hyp_text __kvm_vcpu_run_nvhe(struct kvm_vcpu *vcpu)
host_ctxt->__hyp_running_vcpu = vcpu;
guest_ctxt = &vcpu->arch.ctxt;
 
+   pmu_switch_needed = __pmu_switch_to_guest(host_ctxt);
+
__sysreg_save_state_nvhe(host_ctxt);
 
__activate_traps(vcpu);
@@ -586,6 +621,9 @@ int __hyp_text __kvm_vcpu_run_nvhe(struct kvm_vcpu *vcpu)
 */
__debug_switch_to_host(vcpu);
 
+   if (pmu_switch_needed)
+   __pmu_switch_to_host(host_ctxt);
+
return exit_code;
 }
 
-- 
2.7.4
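
A worked example of the switch logic above (hypothetical counter masks,
not taken from the patch):

	u32 events_host  = 0x3;	/* host wants counters 0 and 1  */
	u32 events_guest = 0x6;	/* guest wants counters 1 and 2 */

	/* On guest entry (__pmu_switch_to_guest): */
	u32 clr = events_host & ~events_guest;	/* 0x1: disable counter 0 */
	u32 set = events_guest & ~events_host;	/* 0x4: enable counter 2  */
	/* Counter 1 is wanted in both contexts, so it is left untouched. */

On the return path (__pmu_switch_to_host) the two masks simply swap roles,
so only the counters that actually differ between contexts are touched.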



[PATCH v6 1/4] arm64: arm_pmu: remove unnecessary isb instruction

2018-12-10 Thread Andrew Murray
The armv8pmu_enable_event_counter function issues an isb instruction
after enabling a pair of counters - this doesn't provide any value
and is inconsistent with armv8pmu_disable_event_counter.

In any case armv8pmu_enable_event_counter is always called with the
PMU stopped. Starting the PMU with armv8pmu_start results in an isb
instruction being issued prior to writing to PMCR_EL0.

Let's remove the unnecessary isb instruction.

Signed-off-by: Andrew Murray 
Reviewed-by: Suzuki K Poulose 
Acked-by: Mark Rutland 
---
 arch/arm64/kernel/perf_event.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 8e38d52..de564ae 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -652,7 +652,6 @@ static inline void armv8pmu_enable_event_counter(struct 
perf_event *event)
armv8pmu_enable_counter(idx);
if (armv8pmu_event_is_chained(event))
armv8pmu_enable_counter(idx - 1);
-   isb();
 }
 
 static inline int armv8pmu_disable_counter(int idx)
-- 
2.7.4



[PATCH v6 2/4] arm64: KVM: add accessors to track guest/host only counters

2018-12-10 Thread Andrew Murray
In order to efficiently enable/disable guest/host only perf counters
at guest entry/exit we add bitfields to kvm_cpu_context for guest and
host events as well as accessors for updating them.

Signed-off-by: Andrew Murray 
---
 arch/arm64/include/asm/kvm_host.h | 24 
 1 file changed, 24 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 1550192..800c87b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -203,6 +203,8 @@ struct kvm_cpu_context {
};
 
struct kvm_vcpu *__hyp_running_vcpu;
+   u32 events_host;
+   u32 events_guest;
 };
 
 typedef struct kvm_cpu_context kvm_cpu_context_t;
@@ -467,11 +469,33 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu);
 
+#define KVM_PMU_EVENTS_HOST1
+#define KVM_PMU_EVENTS_GUEST   2
+
 #ifdef CONFIG_KVM /* Avoid conflicts with core headers if CONFIG_KVM=n */
 static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
 {
return kvm_arch_vcpu_run_map_fp(vcpu);
 }
+static inline void kvm_set_pmu_events(u32 set, int flags)
+{
+   kvm_cpu_context_t *ctx = this_cpu_ptr(&kvm_host_cpu_state);
+
+   if (flags & KVM_PMU_EVENTS_HOST)
+   ctx->events_host |= set;
+   if (flags & KVM_PMU_EVENTS_GUEST)
+   ctx->events_guest |= set;
+}
+static inline void kvm_clr_pmu_events(u32 clr)
+{
+   kvm_cpu_context_t *ctx = this_cpu_ptr(&kvm_host_cpu_state);
+
+   ctx->events_host &= ~clr;
+   ctx->events_guest &= ~clr;
+}
+#else
+static inline void kvm_set_pmu_events(u32 set, int flags) {}
+static inline void kvm_clr_pmu_events(u32 clr) {}
 #endif
 
 static inline void kvm_arm_vhe_guest_enter(void)
-- 
2.7.4



Re: [PATCH v9 5/8] KVM: arm64: Support PUD hugepage in stage2_is_exec()

2018-12-10 Thread Christoffer Dall
On Wed, Dec 05, 2018 at 05:57:51PM +, Suzuki K Poulose wrote:
> 
> 
> On 01/11/2018 13:38, Christoffer Dall wrote:
> >On Wed, Oct 31, 2018 at 05:57:42PM +, Punit Agrawal wrote:
> >>In preparation for creating PUD hugepages at stage 2, add support for
> >>detecting execute permissions on PUD page table entries. Faults due to
> >>lack of execute permissions on page table entries are used to perform
> >>i-cache invalidation on first execute.
> >>
> >>Provide trivial implementations of arm32 helpers to allow sharing of
> >>code.
> >>
> >>Signed-off-by: Punit Agrawal 
> >>Reviewed-by: Suzuki K Poulose 
> >>Cc: Christoffer Dall 
> >>Cc: Marc Zyngier 
> >>Cc: Russell King 
> >>Cc: Catalin Marinas 
> >>Cc: Will Deacon 
> >>---
> >>  arch/arm/include/asm/kvm_mmu.h |  6 +++
> >>  arch/arm64/include/asm/kvm_mmu.h   |  5 +++
> >>  arch/arm64/include/asm/pgtable-hwdef.h |  2 +
> >>  virt/kvm/arm/mmu.c | 53 +++---
> >>  4 files changed, 61 insertions(+), 5 deletions(-)
> >>
> >>diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> >>index 37bf85d39607..839a619873d3 100644
> >>--- a/arch/arm/include/asm/kvm_mmu.h
> >>+++ b/arch/arm/include/asm/kvm_mmu.h
> >>@@ -102,6 +102,12 @@ static inline bool kvm_s2pud_readonly(pud_t *pud)
> >>return false;
> >>  }
> >>+static inline bool kvm_s2pud_exec(pud_t *pud)
> >>+{
> >>+   BUG();
> >
> >nit: I think this should be WARN() now :)
> >
> >>+   return false;
> >>+}
> >>+
> >>  static inline pte_t kvm_s2pte_mkwrite(pte_t pte)
> >>  {
> >>pte_val(pte) |= L_PTE_S2_RDWR;
> >>diff --git a/arch/arm64/include/asm/kvm_mmu.h 
> >>b/arch/arm64/include/asm/kvm_mmu.h
> >>index 8da6d1b2a196..c755b37b3f92 100644
> >>--- a/arch/arm64/include/asm/kvm_mmu.h
> >>+++ b/arch/arm64/include/asm/kvm_mmu.h
> >>@@ -261,6 +261,11 @@ static inline bool kvm_s2pud_readonly(pud_t *pudp)
> >>return kvm_s2pte_readonly((pte_t *)pudp);
> >>  }
> >>+static inline bool kvm_s2pud_exec(pud_t *pudp)
> >>+{
> >>+   return !(READ_ONCE(pud_val(*pudp)) & PUD_S2_XN);
> >>+}
> >>+
> >>  #define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
> >>  #ifdef __PAGETABLE_PMD_FOLDED
> >>diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> >>b/arch/arm64/include/asm/pgtable-hwdef.h
> >>index 1d7d8da2ef9b..336e24cddc87 100644
> >>--- a/arch/arm64/include/asm/pgtable-hwdef.h
> >>+++ b/arch/arm64/include/asm/pgtable-hwdef.h
> >>@@ -193,6 +193,8 @@
> >>  #define PMD_S2_RDWR   (_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
> >>  #define PMD_S2_XN (_AT(pmdval_t, 2) << 53)  /* XN[1:0] */
> >>+#define PUD_S2_XN  (_AT(pudval_t, 2) << 53)  /* XN[1:0] */
> >>+
> >>  /*
> >>   * Memory Attribute override for Stage-2 (MemAttr[3:0])
> >>   */
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 1c669c3c1208..8e44dccd1b47 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -1083,23 +1083,66 @@ static int stage2_set_pmd_huge(struct kvm *kvm, 
> >>struct kvm_mmu_memory_cache
> >>return 0;
> >>  }
> >>-static bool stage2_is_exec(struct kvm *kvm, phys_addr_t addr)
> >>+/*
> >>+ * stage2_get_leaf_entry - walk the stage2 VM page tables and return
> >>+ * true if a valid and present leaf-entry is found. A pointer to the
> >>+ * leaf-entry is returned in the appropriate level variable - pudpp,
> >>+ * pmdpp, ptepp.
> >>+ */
> >>+static bool stage2_get_leaf_entry(struct kvm *kvm, phys_addr_t addr,
> >>+ pud_t **pudpp, pmd_t **pmdpp, pte_t **ptepp)
> >
> >Do we need this type madness or could this just return a u64 pointer
> >(NULL if nothing is found) and pass that to kvm_s2pte_exec (because we
> >know it's the same bit we need to check regardless of the pgtable level
> >on both arm and arm64)?
> >
> >Or do we consider that bad for some reason?
> 
> Practically, yes, the bit positions are the same and thus we should be able
> to do this, assuming that it is just a pte. When we get to an independent
> stage2 pgtable implementation which treats all page table entries as a
> single type with level information, we should be able to get rid of these.
> But for now we have followed the Linux way of page-table manipulation, where
> we have "level"-specific accessors. The other option is to open-code the
> walking sequence from the pgd to the leaf entry everywhere.
> 
> I am fine with changing this code, if you like.
> 

Meh, it just looked a bit over-engineered to me when I originally looked
at the patches, but you're right, they align with the rest of the
implementation.

Thanks,

Christoffer


Re: [PATCH v9 3/8] KVM: arm/arm64: Introduce helpers to manipulate page table entries

2018-12-10 Thread Christoffer Dall
On Mon, Dec 03, 2018 at 07:20:08PM +0530, Anshuman Khandual wrote:
> 
> 
> On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> > Introduce helpers to abstract architectural handling of the conversion
> > of pfn to page table entries and marking a PMD page table entry as a
> > block entry.
> 
> Why is this necessary? We would still need to access PMD, PUD as is
> without any conversion. IOW KVM knows the breakdown of the page table
> at various levels. Is this something required from generic KVM code?
>   
> > 
> > The helpers are introduced in preparation for supporting PUD hugepages
> > at stage 2 - which are supported on arm64 but do not exist on arm.
> 
> Some of these patches (including the earlier two) are good on their
> own. Do we still need to mention the incoming PUD enablement in the
> commit message as the reason for these cleanup patches?
> 

Does it hurt?  What is your concern here?


Thanks,

Christoffer


Re: [PATCH v9 2/8] KVM: arm/arm64: Re-factor setting the Stage 2 entry to exec on fault

2018-12-10 Thread Christoffer Dall
On Wed, Dec 05, 2018 at 10:47:10AM +, Suzuki K Poulose wrote:
> 
> 
> On 03/12/2018 13:32, Anshuman Khandual wrote:
> >
> >
> >On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> >>Stage 2 fault handler marks a page as executable if it is handling an
> >>execution fault or if it was a permission fault in which case the
> >>executable bit needs to be preserved.
> >>
> >>The logic to decide if the page should be marked executable is
> >>duplicated for PMD and PTE entries. To avoid creating another copy
> >>when support for PUD hugepages is introduced refactor the code to
> >>share the checks needed to mark a page table entry as executable.
> >>
> >>Signed-off-by: Punit Agrawal 
> >>Reviewed-by: Suzuki K Poulose 
> >>Cc: Christoffer Dall 
> >>Cc: Marc Zyngier 
> >>---
> >>  virt/kvm/arm/mmu.c | 28 +++-
> >>  1 file changed, 15 insertions(+), 13 deletions(-)
> >>
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 59595207c5e1..6912529946fb 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -1475,7 +1475,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>  unsigned long fault_status)
> >>  {
> >>int ret;
> >>-   bool write_fault, exec_fault, writable, force_pte = false;
> >>+   bool write_fault, writable, force_pte = false;
> >>+   bool exec_fault, needs_exec;
> >
> >New line not required, still within 80 characters.
> >
> >>unsigned long mmu_seq;
> >>gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> >>struct kvm *kvm = vcpu->kvm;
> >>@@ -1598,19 +1599,25 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>if (exec_fault)
> >>invalidate_icache_guest_page(pfn, vma_pagesize);
> >>+   /*
> >>+* If we took an execution fault we have made the
> >>+* icache/dcache coherent above and should now let the s2
> >
> >Coherent or invalidated with invalidate_icache_guest_page ?
> 
> We also do clean_dcache above if needed. So that makes sure
> the data is coherent. Am I missing something here ?
> 

I think you've got it right.  We have made the icache coherent with the
data/instructions in the page by invalidating the icache.  I think the
comment is ok either way.

Thanks,

Christoffer


Re: [PATCH v9 2/8] KVM: arm/arm64: Re-factor setting the Stage 2 entry to exec on fault

2018-12-10 Thread Christoffer Dall
On Mon, Dec 03, 2018 at 07:02:23PM +0530, Anshuman Khandual wrote:
> 
> 
> On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> > Stage 2 fault handler marks a page as executable if it is handling an
> > execution fault or if it was a permission fault in which case the
> > executable bit needs to be preserved.
> > 
> > The logic to decide if the page should be marked executable is
> > duplicated for PMD and PTE entries. To avoid creating another copy
> > when support for PUD hugepages is introduced refactor the code to
> > share the checks needed to mark a page table entry as executable.
> > 
> > Signed-off-by: Punit Agrawal 
> > Reviewed-by: Suzuki K Poulose 
> > Cc: Christoffer Dall 
> > Cc: Marc Zyngier 
> > ---
> >  virt/kvm/arm/mmu.c | 28 +++-
> >  1 file changed, 15 insertions(+), 13 deletions(-)
> > 
> > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> > index 59595207c5e1..6912529946fb 100644
> > --- a/virt/kvm/arm/mmu.c
> > +++ b/virt/kvm/arm/mmu.c
> > @@ -1475,7 +1475,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> > phys_addr_t fault_ipa,
> >   unsigned long fault_status)
> >  {
> > int ret;
> > -   bool write_fault, exec_fault, writable, force_pte = false;
> > +   bool write_fault, writable, force_pte = false;
> > +   bool exec_fault, needs_exec;
> 
> New line not required, still within 80 characters.
> 

He's trying to logically group the two variables.  I don't see a problem
with that.


Thanks,

Christoffer


Re: [PATCH v9 1/8] KVM: arm/arm64: Share common code in user_mem_abort()

2018-12-10 Thread Christoffer Dall
On Mon, Dec 03, 2018 at 01:37:37PM +, Suzuki K Poulose wrote:
> Hi Anshuman,
> 
> On 03/12/2018 12:11, Anshuman Khandual wrote:
> >
> >
> >On 10/31/2018 11:27 PM, Punit Agrawal wrote:
> >>The code for operations such as marking the pfn as dirty, and
> >>dcache/icache maintenance during stage 2 fault handling is duplicated
> >>between normal pages and PMD hugepages.
> >>
> >>Instead of creating another copy of the operations when we introduce
> >>PUD hugepages, let's share them across the different pagesizes.
> >>
> >>Signed-off-by: Punit Agrawal 
> >>Reviewed-by: Suzuki K Poulose 
> >>Cc: Christoffer Dall 
> >>Cc: Marc Zyngier 
> >>---
> >>  virt/kvm/arm/mmu.c | 49 --
> >>  1 file changed, 30 insertions(+), 19 deletions(-)
> >>
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 5eca48bdb1a6..59595207c5e1 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -1475,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>  unsigned long fault_status)
> >>  {
> >>int ret;
> >>-   bool write_fault, exec_fault, writable, hugetlb = false, force_pte = 
> >>false;
> >>+   bool write_fault, exec_fault, writable, force_pte = false;
> >>unsigned long mmu_seq;
> >>gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> >>struct kvm *kvm = vcpu->kvm;
> >>@@ -1484,7 +1484,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>kvm_pfn_t pfn;
> >>pgprot_t mem_type = PAGE_S2;
> >>bool logging_active = memslot_is_logging(memslot);
> >>-   unsigned long flags = 0;
> >>+   unsigned long vma_pagesize, flags = 0;
> >
> >A small nit: s/vma_pagesize/pagesize. Why call it VMA? It's implicit.
> 
> Maybe we could call it mapsize; pagesize is confusing.
> 

I'm ok with mapsize.  I see the vma_pagesize name coming from the fact
that this is initially set to the return value from vma_kernel_pagesize.

I have no problems with either vma_pagesize or mapsize.

> >
> >>write_fault = kvm_is_write_fault(vcpu);
> >>exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
> >>@@ -1504,10 +1504,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>return -EFAULT;
> >>}
> >>-   if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> >>-   hugetlb = true;
> >>+   vma_pagesize = vma_kernel_pagesize(vma);
> >>+   if (vma_pagesize == PMD_SIZE && !logging_active) {
> >>gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> >>} else {
> >>+   /*
> >>+* Fallback to PTE if it's not one of the Stage 2
> >>+* supported hugepage sizes
> >>+*/
> >>+   vma_pagesize = PAGE_SIZE;
> >
> >This seems redundant and should be dropped. vma_kernel_pagesize() here either
> >calls hugetlb_vm_op_pagesize (via hugetlb_vm_ops->pagesize) or simply returns
> >PAGE_SIZE. The vm_ops path is taken if the QEMU VMA covering any given HVA is
> >backed either by HugeTLB pages or simply normal pages. vma_pagesize would
> >either have a value of PMD_SIZE (HugeTLB hstate based) or PAGE_SIZE. Hence if
> >it's not PMD_SIZE it must be PAGE_SIZE, which should not be assigned again.
> 
> We may want to force using PTE mappings when logging_active (e.g.,
> migration?) to avoid having to track huge pages. So the check is still
> valid.
> 
> 

Agreed, and let's not try to additionally change the logic and flow with
this patch.

> >
> >>+
> >>/*
> >> * Pages belonging to memslots that don't have the same
> >> * alignment for userspace and IPA cannot be mapped using
> >>@@ -1573,23 +1579,33 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> >>phys_addr_t fault_ipa,
> >>if (mmu_notifier_retry(kvm, mmu_seq))
> >>goto out_unlock;
> >>-   if (!hugetlb && !force_pte)
> >>-   hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> >>+   if (vma_pagesize == PAGE_SIZE && !force_pte) {
> >>+   /*
> >>+* Only PMD_SIZE transparent hugepages(THP) are
> >>+* currently supported. This code will need to be
> >>+* updated to support other THP sizes.
> >>+*/
> >
> >This comment belongs to transparent_hugepage_adjust() but not here.
> 
> I think this is more relevant here than in thp_adjust, unless we rename
> the function below to something generic, handle_hugepage_adjust().
> 

Agreed.

> >>+   if (transparent_hugepage_adjust(&pfn, &fault_ipa))
> >>+   vma_pagesize = PMD_SIZE;
> >
> >IIUC transparent_hugepage_adjust() is only getting called here. Instead of
> >returning 'true' when it is able to detect a huge page backing and doing
> >an adjustment thereafter, it should rather return the THP size (PMD_SIZE) to
> >accommodate probable multi-size THP support in the future.
> 
> That makes sense.
> 

That's fine.
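
E.g. something along these lines (sketch of the interface change only,
untested):

	/*
	 * transparent_hugepage_adjust() returns the resulting mapping size:
	 * PMD_SIZE if it could adjust to a PMD-backed THP, or PAGE_SIZE if
	 * no adjustment was possible.  The caller then becomes:
	 */
	if (vma_pagesize == PAGE_SIZE && !force_pte)
		vma_pagesize = transparent_hugepage_adjust(&pfn, &fault_ipa);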

> >
> >>+   }
> >>+
> >>+   if (writable)
> >>+   kvm_set_pfn