Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()

2022-11-17 Thread Ricardo Koller
On Tue, Nov 15, 2022 at 11:54:27PM +, Oliver Upton wrote:
> On Tue, Nov 15, 2022 at 03:27:18PM -0800, Ricardo Koller wrote:
> > On Tue, Nov 15, 2022 at 03:03:42PM -0800, Ricardo Koller wrote:
> > > On Mon, Nov 14, 2022 at 08:54:52PM +, Oliver Upton wrote:
> 
> [...]
> 
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index d1f309128118..9c42eff6d42e 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
> > > > >   return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
> > > > >  }
> > > > >  
> > > > > +struct stage2_split_data {
> > > > > + struct kvm_s2_mmu   *mmu;
> > > > > + void*memcache;
> > > > > + struct kvm_pgtable_mm_ops   *mm_ops;
> > > > 
> > > > You can also get at mm_ops through kvm_pgtable_visit_ctx
> > > > 
> > > > > +};
> > > > > +
> > > > > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > +enum kvm_pgtable_walk_flags visit)
> > > > > +{
> > > > > + struct stage2_split_data *data = ctx->arg;
> > > > > + struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> > > > > + kvm_pte_t pte = ctx->old, attr, new;
> > > > > + enum kvm_pgtable_prot prot;
> > > > > + void *mc = data->memcache;
> > > > > + u32 level = ctx->level;
> > > > > + u64 phys;
> > > > > +
> > > > > + if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> > > > > + return -EINVAL;
> > > > > +
> > > > > + /* Nothing to split at the last level */
> > > > > + if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > > > > + return 0;
> > > > > +
> > > > > + /* We only split valid block mappings */
> > > > > + if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > > > > + return 0;
> > > > > +
> > > > > + phys = kvm_pte_to_phys(pte);
> > > > > + prot = kvm_pgtable_stage2_pte_prot(pte);
> > > > > + stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> > > > > +
> > > > > + /*
> > > > > +  * Eager page splitting is best-effort, so we can ignore the error.
> > > > > +  * The returned PTE (new) will be valid even if this call returns
> > > > > +  * error: new will be a single (big) block PTE.  The only issue is
> > > > > +  * that it will affect dirty logging performance, as the huge-pages
> > > > > +  * will have to be split on fault, and so we WARN.
> > > > > +  */
> > > > > + WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
> > > > 
> > > > I don't believe we should warn in this case, at least not
> > > > unconditionally. ENOMEM is an expected outcome, for example.
> > > 
> > > Given that "eager page splitting" is best-effort, the error must be
> > > ignored somewhere: either here or by the caller (in mmu.c). It seems
> > > that ignoring the error here is not a very good idea.
> > 
> > Actually, ignoring the error here simplifies the error handling.
> > stage2_create_removed() is best-effort; here's an example.  If
> > stage2_create_removed() was called to split a 1G block PTE, and it
> > wasn't able to split all 2MB blocks, it would return ENOMEM and a valid
> > PTE pointing to a tree like this:
> > 
> > [-1GB-]
> > : :
> > [--2MB--][--2MB--][--2MB--]
> > :   :
> > [ ][ ][ ]
> > 
> > If we returned ENOMEM instead of ignoring the error, we would have to
> > clean all the intermediate state.  But stage2_create_removed() is
> > designed to always return a valid PTE, even if the tree is not fully
> > split (as above).  So, there's really no need to clean it: it's a valid
> > tree. Moreover, this valid tree would result in better dirty logging
> > performance as it already has some 2M blocks split into 4K pages.
> 
> I have no issue with installing a partially-populated table, but
> unconditionally ignoring the return code and marching onwards seems
> dangerous. If you document the behavior of -ENOMEM on
> stage2_create_removed() and return early for anything else it may read a
> bit better.

This got me thinking. These partially-populated tables are complicating
things too much for no good reason. They should be very rare, as the
caller will ensure there's enough memory in the memcache. So what do you
think of this other option?

https://github.com/ricarkol/linux/commit/54ba44f7d00d93cbaecebd148c102ca9d7c0e711

used here:

https://github.com/ricarkol/linux/commit/ff63a8744c18d5e4589911831e20ccb41712bda2#

It reuses the stage2 mapper to implement create_removed(), and makes
things simpler by only returning success when building a fully populated
tree. The caller doesn't need to clean anything in case of error.
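
For illustration, a sketch of how the walker in this patch could consume
such an all-or-nothing stage2_create_removed(). This is inferred from the
description above, not taken from the linked commits; the
smp_store_release() install mirrors what stage2_make_pte() is described
as doing later in this thread:

	int ret;

	/*
	 * Sketch only: with an all-or-nothing stage2_create_removed(), a
	 * failure means nothing was built, so the walker can propagate
	 * the error with no partially populated tree to unwind.
	 */
	ret = stage2_create_removed(&new, phys, level, attr, mc, mm_ops);
	if (ret)
		return ret;

	/* Drop the old block mapping and publish the fully split tree. */
	stage2_put_pte(ctx, data->mmu, mm_ops);
	smp_store_release(ctx->ptep, new);
	return 0;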

Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()

2022-11-15 Thread Oliver Upton
On Tue, Nov 15, 2022 at 03:27:18PM -0800, Ricardo Koller wrote:
> On Tue, Nov 15, 2022 at 03:03:42PM -0800, Ricardo Koller wrote:
> > On Mon, Nov 14, 2022 at 08:54:52PM +, Oliver Upton wrote:

[...]

> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index d1f309128118..9c42eff6d42e 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
> > > > return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
> > > >  }
> > > >  
> > > > +struct stage2_split_data {
> > > > +   struct kvm_s2_mmu   *mmu;
> > > > +   void*memcache;
> > > > +   struct kvm_pgtable_mm_ops   *mm_ops;
> > > 
> > > You can also get at mm_ops through kvm_pgtable_visit_ctx
> > > 
> > > > +};
> > > > +
> > > > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > +  enum kvm_pgtable_walk_flags visit)
> > > > +{
> > > > +   struct stage2_split_data *data = ctx->arg;
> > > > +   struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> > > > +   kvm_pte_t pte = ctx->old, attr, new;
> > > > +   enum kvm_pgtable_prot prot;
> > > > +   void *mc = data->memcache;
> > > > +   u32 level = ctx->level;
> > > > +   u64 phys;
> > > > +
> > > > +   if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> > > > +   return -EINVAL;
> > > > +
> > > > +   /* Nothing to split at the last level */
> > > > +   if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > > > +   return 0;
> > > > +
> > > > +   /* We only split valid block mappings */
> > > > +   if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > > > +   return 0;
> > > > +
> > > > +   phys = kvm_pte_to_phys(pte);
> > > > +   prot = kvm_pgtable_stage2_pte_prot(pte);
> > > > +   stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> > > > +
> > > > +   /*
> > > > +* Eager page splitting is best-effort, so we can ignore the error.
> > > > +* The returned PTE (new) will be valid even if this call returns
> > > > +* error: new will be a single (big) block PTE.  The only issue is
> > > > +* that it will affect dirty logging performance, as the huge-pages
> > > > +* will have to be split on fault, and so we WARN.
> > > > +*/
> > > > +   WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
> > > 
> > > I don't believe we should warn in this case, at least not
> > > unconditionally. ENOMEM is an expected outcome, for example.
> > 
> > Given that "eager page splitting" is best-effort, the error must be
> > ignored somewhere: either here or by the caller (in mmu.c). It seems
> > that ignoring the error here is not a very good idea.
> 
> Actually, ignoring the error here simplifies the error handling.
> stage2_create_removed() is best-effort; here's an example.  If
> stage2_create_removed() was called to split a 1G block PTE, and it
> wasn't able to split all 2MB blocks, it would return ENOMEM and a valid
> PTE pointing to a tree like this:
> 
>   [-1GB-]
>   : :
>   [--2MB--][--2MB--][--2MB--]
>   :   :
>   [ ][ ][ ]
> 
> If we returned ENOMEM instead of ignoring the error, we would have to
> clean all the intermediate state.  But stage2_create_removed() is
> designed to always return a valid PTE, even if the tree is not fully
> split (as above).  So, there's really no need to clean it: it's a valid
> tree. Moreover, this valid tree would result in better dirty logging
> performance as it already has some 2M blocks split into 4K pages.

I have no issue with installing a partially-populated table, but
unconditionally ignoring the return code and marching onwards seems
dangerous. If you document the behavior of -ENOMEM on
stage2_create_removed() and return early for anything else it may read a
bit better.
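
A minimal sketch of that suggestion (assuming stage2_create_removed()
documents -ENOMEM as the one tolerable failure; illustrative, not the
posted patch):

	int ret;

	ret = stage2_create_removed(&new, phys, level, attr, mc, mm_ops);
	/*
	 * -ENOMEM just means the memcache ran dry and left a partially
	 * split (but still valid) tree behind; anything else is
	 * unexpected, so bail out before installing a suspect PTE.
	 */
	if (ret && ret != -ENOMEM)
		return ret;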

--
Thanks,
Oliver


Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()

2022-11-15 Thread Ricardo Koller
On Tue, Nov 15, 2022 at 03:03:42PM -0800, Ricardo Koller wrote:
> On Mon, Nov 14, 2022 at 08:54:52PM +, Oliver Upton wrote:
> > Hi Ricardo,
> > 
> > On Sat, Nov 12, 2022 at 08:17:06AM +, Ricardo Koller wrote:
> > 
> > [...]
> > 
> > > +/**
> > > + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> > > + *   to PAGE_SIZE guest pages.
> > > + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> > > + * @addr:Intermediate physical address from which to split.
> > > + * @size:Size of the range.
> > > + * @mc:  Cache of pre-allocated and zeroed memory from which to allocate
> > > + *   page-table pages.
> > > + *
> > > + * @addr and the end (@addr + @size) are effectively aligned down and up to
> > > + * the top level huge-page block size. This is an example using 1GB
> > > + * huge-pages and 4KB granules.
> > > + *
> > > + *  [---input range---]
> > > + *  : :
> > > + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> > > + *  : :
> > > + *   [--2MB--][--2MB--][--2MB--][--2MB--]
> > > + *  : :
> > > + *   [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> > > + *  : :
> > > + *
> > > + * Return: 0 on success, negative error code on failure. Note that
> > > + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> > > + * blocks in the input range as allowed by the size of the memcache. It
> > > + * will fail if it wasn't able to break any block.
> > > + */
> > > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> > > +
> > >  /**
> > >   * kvm_pgtable_walk() - Walk a page-table.
> > >   * @pgt: Page-table structure initialised by kvm_pgtable_*_init().
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index d1f309128118..9c42eff6d42e 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
> > >   return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
> > >  }
> > >  
> > > +struct stage2_split_data {
> > > + struct kvm_s2_mmu   *mmu;
> > > + void*memcache;
> > > + struct kvm_pgtable_mm_ops   *mm_ops;
> > 
> > You can also get at mm_ops through kvm_pgtable_visit_ctx
> > 
> > > +};
> > > +
> > > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > +enum kvm_pgtable_walk_flags visit)
> > > +{
> > > + struct stage2_split_data *data = ctx->arg;
> > > + struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> > > + kvm_pte_t pte = ctx->old, attr, new;
> > > + enum kvm_pgtable_prot prot;
> > > + void *mc = data->memcache;
> > > + u32 level = ctx->level;
> > > + u64 phys;
> > > +
> > > + if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> > > + return -EINVAL;
> > > +
> > > + /* Nothing to split at the last level */
> > > + if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > > + return 0;
> > > +
> > > + /* We only split valid block mappings */
> > > + if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > > + return 0;
> > > +
> > > + phys = kvm_pte_to_phys(pte);
> > > + prot = kvm_pgtable_stage2_pte_prot(pte);
> > > + stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> > > +
> > > + /*
> > > +  * Eager page splitting is best-effort, so we can ignore the error.
> > > +  * The returned PTE (new) will be valid even if this call returns
> > > +  * error: new will be a single (big) block PTE.  The only issue is
> > > +  * that it will affect dirty logging performance, as the huge-pages
> > > +  * will have to be split on fault, and so we WARN.
> > > +  */
> > > + WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
> > 
> > I don't believe we should warn in this case, at least not
> > unconditionally. ENOMEM is an expected outcome, for example.
> 
> Given that "eager page splitting" is best-effort, the error must be
> ignored somewhere: either here or by the caller (in mmu.c). It seems
> that ignoring the error here is not a very good idea.

Actually, ignoring the error here simplifies the error handling.
stage2_create_removed() is best-effort; here's an example.  If
stage2_create_removed() was called to split a 1G block PTE, and it
wasn't able to split all 2MB blocks, it would return ENOMEM and a valid
PTE pointing to a tree like this:

[-1GB-]
: :
[--2MB--][--2MB--][--2MB--]
:   :
[ ][ ][ ]

If we returned ENOMEM instead of ignoring the error, we would have to
clean all the intermediate state.  But stage2_create_removed() is
designed to always return a valid PTE, even if the tree is not fully
split (as above).  So, there's really no need to clean it: it's a valid
tree. Moreover, this valid tree would result in better dirty logging
performance as it already has some 2M blocks split into 4K pages.

Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()

2022-11-15 Thread Ricardo Koller
On Mon, Nov 14, 2022 at 08:54:52PM +, Oliver Upton wrote:
> Hi Ricardo,
> 
> On Sat, Nov 12, 2022 at 08:17:06AM +, Ricardo Koller wrote:
> 
> [...]
> 
> > +/**
> > + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> > + * to PAGE_SIZE guest pages.
> > + * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
> > + * @addr:  Intermediate physical address from which to split.
> > + * @size:  Size of the range.
> > + * @mc:  Cache of pre-allocated and zeroed memory from which to allocate
> > + * page-table pages.
> > + *
> > + * @addr and the end (@addr + @size) are effectively aligned down and up to
> > + * the top level huge-page block size. This is an example using 1GB
> > + * huge-pages and 4KB granules.
> > + *
> > + *  [---input range---]
> > + *  : :
> > + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> > + *  : :
> > + *   [--2MB--][--2MB--][--2MB--][--2MB--]
> > + *  : :
> > + *   [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> > + *  : :
> > + *
> > + * Return: 0 on success, negative error code on failure. Note that
> > + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> > + * blocks in the input range as allowed by the size of the memcache. It
> > + * will fail if it wasn't able to break any block.
> > + */
> > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> > +
> >  /**
> >   * kvm_pgtable_walk() - Walk a page-table.
> >   * @pgt:   Page-table structure initialised by kvm_pgtable_*_init().
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index d1f309128118..9c42eff6d42e 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
> > return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
> >  }
> >  
> > +struct stage2_split_data {
> > +   struct kvm_s2_mmu   *mmu;
> > +   void*memcache;
> > +   struct kvm_pgtable_mm_ops   *mm_ops;
> 
> You can also get at mm_ops through kvm_pgtable_visit_ctx
> 
> > +};
> > +
> > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > +  enum kvm_pgtable_walk_flags visit)
> > +{
> > +   struct stage2_split_data *data = ctx->arg;
> > +   struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> > +   kvm_pte_t pte = ctx->old, attr, new;
> > +   enum kvm_pgtable_prot prot;
> > +   void *mc = data->memcache;
> > +   u32 level = ctx->level;
> > +   u64 phys;
> > +
> > +   if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> > +   return -EINVAL;
> > +
> > +   /* Nothing to split at the last level */
> > +   if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > +   return 0;
> > +
> > +   /* We only split valid block mappings */
> > +   if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > +   return 0;
> > +
> > +   phys = kvm_pte_to_phys(pte);
> > +   prot = kvm_pgtable_stage2_pte_prot(pte);
> > +   stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> > +
> > +   /*
> > +* Eager page splitting is best-effort, so we can ignore the error.
> > +* The returned PTE (new) will be valid even if this call returns
> > +* error: new will be a single (big) block PTE.  The only issue is
> > +* that it will affect dirty logging performance, as the huge-pages
> > +* will have to be split on fault, and so we WARN.
> > +*/
> > +   WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));
> 
> I don't believe we should warn in this case, at least not
> unconditionally. ENOMEM is an expected outcome, for example.

Given that "eager page splitting" is best-effort, the error must be
ignored somewhere: either here or by the caller (in mmu.c). It seems
that ignoring the error here is not a very good idea.
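
For context, a hypothetical sketch of what ignoring the error in the
caller could look like. kvm_mmu_topup_memory_cache() and kvm->arch.mmu.pgt
exist in KVM; the function name, the split_page_cache field, and the
SPLIT_CACHE_MIN_PAGES constant below are made up for illustration:

	static void kvm_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
					 phys_addr_t size)
	{
		/* Hypothetical per-VM cache dedicated to eager splitting. */
		struct kvm_mmu_memory_cache *mc = &kvm->arch.mmu.split_page_cache;

		/* Pre-fill the cache so running out of pages is rare. */
		if (kvm_mmu_topup_memory_cache(mc, SPLIT_CACHE_MIN_PAGES))
			return;

		/* Best effort: any block left unsplit is split on fault. */
		kvm_pgtable_stage2_split(kvm->arch.mmu.pgt, addr, size, mc);
	}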

> 
> Additionally, I believe you'll want to bail out at this point to avoid
> installing a potentially garbage PTE as well.

It should be fine as stage2_create_removed() is also best-effort. The
returned PTE is valid even when it fails; it just returns a big block
PTE.
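
A sketch of that contract as described here (inferred, not quoted from
the patch; kvm_init_valid_leaf_pte() is the existing pgtable.c helper for
building leaf PTEs, while stage2_try_split_into_table() is a hypothetical
stand-in for the recursive walk):

	static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
					 kvm_pte_t attr, void *mc,
					 struct kvm_pgtable_mm_ops *mm_ops)
	{
		/* Start from a valid single (big) block PTE as the fallback. */
		*ptep = kvm_init_valid_leaf_pte(phys, attr, level);

		/*
		 * Best effort: refine it into a (possibly partial) table.
		 * Even on -ENOMEM, *ptep remains a valid, installable PTE.
		 */
		return stage2_try_split_into_table(ptep, phys, level, attr, mc, mm_ops);
	}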

> 
> > +   stage2_put_pte(ctx, data->mmu, mm_ops);
> 
> Ah, I see why you've relaxed the WARN in patch 1 now.
> 
> I would recommend you follow the break-before-make pattern and use the
> helpers here as well. stage2_try_break_pte() will demote the store to
> WRITE_ONCE() if called from a non-shared context.
> 

ACK, I can do that. The only reason why I didn't is because I would have
to handle the potential error from stage2_try_break_pte(). It would feel
wrong not to, even if it's !shared. On the other hand, I would like to
easily 

Re: [RFC PATCH 04/12] KVM: arm64: Add kvm_pgtable_stage2_split()

2022-11-14 Thread Oliver Upton
Hi Ricardo,

On Sat, Nov 12, 2022 at 08:17:06AM +, Ricardo Koller wrote:

[...]

> +/**
> + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> + *   to PAGE_SIZE guest pages.
> + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> + * @addr:Intermediate physical address from which to split.
> + * @size:Size of the range.
> + * @mc:  Cache of pre-allocated and zeroed memory from which to allocate
> + *   page-table pages.
> + *
> + * @addr and the end (@addr + @size) are effectively aligned down and up to
> + * the top level huge-page block size. This is an example using 1GB
> + * huge-pages and 4KB granules.
> + *
> + *  [---input range---]
> + *  : :
> + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> + *  : :
> + *   [--2MB--][--2MB--][--2MB--][--2MB--]
> + *  : :
> + *   [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> + *  : :
> + *
> + * Return: 0 on success, negative error code on failure. Note that
> + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> + * blocks in the input range as allowed by the size of the memcache. It
> + * will fail if it wasn't able to break any block.
> + */
> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> +
>  /**
>   * kvm_pgtable_walk() - Walk a page-table.
>   * @pgt: Page-table structure initialised by kvm_pgtable_*_init().
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index d1f309128118..9c42eff6d42e 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1267,6 +1267,80 @@ static int stage2_create_removed(kvm_pte_t *ptep, u64 phys, u32 level,
>   return __kvm_pgtable_visit(&data, mm_ops, ptep, level);
>  }
>  
> +struct stage2_split_data {
> + struct kvm_s2_mmu   *mmu;
> + void*memcache;
> + struct kvm_pgtable_mm_ops   *mm_ops;

You can also get at mm_ops through kvm_pgtable_visit_ctx

> +};
> +
> +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> +enum kvm_pgtable_walk_flags visit)
> +{
> + struct stage2_split_data *data = ctx->arg;
> + struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> + kvm_pte_t pte = ctx->old, attr, new;
> + enum kvm_pgtable_prot prot;
> + void *mc = data->memcache;
> + u32 level = ctx->level;
> + u64 phys;
> +
> + if (WARN_ON_ONCE(kvm_pgtable_walk_shared(ctx)))
> + return -EINVAL;
> +
> + /* Nothing to split at the last level */
> + if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> + return 0;
> +
> + /* We only split valid block mappings */
> + if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> + return 0;
> +
> + phys = kvm_pte_to_phys(pte);
> + prot = kvm_pgtable_stage2_pte_prot(pte);
> + stage2_set_prot_attr(data->mmu->pgt, prot, &attr);
> +
> + /*
> +  * Eager page splitting is best-effort, so we can ignore the error.
> +  * The returned PTE (new) will be valid even if this call returns
> +  * error: new will be a single (big) block PTE.  The only issue is
> +  * that it will affect dirty logging performance, as the huge-pages
> +  * will have to be split on fault, and so we WARN.
> +  */
> + WARN_ON(stage2_create_removed(&new, phys, level, attr, mc, mm_ops));

I don't believe we should warn in this case, at least not
unconditionally. ENOMEM is an expected outcome, for example.

Additionally, I believe you'll want to bail out at this point to avoid
installing a potentially garbage PTE as well.

> + stage2_put_pte(ctx, data->mmu, mm_ops);

Ah, I see why you've relaxed the WARN in patch 1 now.

I would recommend you follow the break-before-make pattern and use the
helpers here as well. stage2_try_break_pte() will demote the store to
WRITE_ONCE() if called from a non-shared context.

Then the WARN will behave as expected in stage2_make_pte().
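
A sketch of that break-before-make flow (stage2_try_break_pte() and
stage2_make_pte() are the helpers from the parallel-faults series; their
use here is assumed, not taken from a posted patch):

	/* Break: invalidate the old block PTE, with TLBI as required. */
	if (!stage2_try_break_pte(ctx, data->mmu))
		return -EAGAIN;	/* cannot fail on a non-shared walk */

	/* ... build the replacement table in 'new', as above ... */

	/*
	 * Make: publishes 'new' (WRITE_ONCE()/smp_store_release()) and
	 * carries the WARN for unexpected intermediate states.
	 */
	stage2_make_pte(ctx, new);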

> + /*
> +  * Note, the contents of the page table are guaranteed to be made
> +  * visible before the new PTE is assigned because
> +  * stage2_make__pte() writes the PTE using smp_store_release().

typo: stage2_make_pte()

--
Thanks,
Oliver