Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Conor Dooley
On Tue, Jun 04, 2024 at 01:44:15PM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley  wrote:
> >
> > On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti  
> > > wrote:
> > > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  
> > > > wrote:
> > > > >
> > > > > As for the current status of the patch, there are two points that
> > > > > could be optimized:
> > > > > 1. Some hardware implementations may not cache invalid TLB entries,
> > > > > so for them it doesn't matter whether Svvptc is available or not.
> > > > > Can we consider adding a CONFIG_RISCV_SVVPTC option to control this?
> > >
> > > That would produce a non-portable kernel. But I'm not opposed to that
> > > at all; let me check how we handle other extensions. Maybe @Conor
> > > Dooley has some feedback here?
> >
> > To be honest, I'm not really sure what to give feedback on. Could you
> > elaborate on exactly what the option is going to do? Given the
> > portability concern, I guess you were proposing that the option would
> > remove the preventative fences, rather than your current patch, which
> > removes them via an alternative?
> 
> No no, I won't do that: we need a generic kernel for distros, so that's
> not even a question. What Yunhui was asking (as I understand it) is: can
> we introduce a Kconfig option to always remove the preventive fences,
> bypassing the use of alternatives altogether?
> 
> To me, it won't make a difference in terms of performance. But if we
> already offer such a possibility for other extensions, then I'll do it.
> Otherwise, the question is: should we start doing that?

We don't do that for other extensions yet, because currently all the
extensions we have options for are additive. There are only about three
alternative patch sites here, and each is just a single nop, so I don't
see the point of having a Kconfig knob for that.
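
For context, the "alternative patch sites" discussed here are
runtime-patched code locations. A minimal sketch of the pattern,
assuming the asm-goto ALTERNATIVE form used elsewhere in arch/riscv and
an extension ID named RISCV_ISA_EXT_SVVPTC (both assumptions for
illustration, not lifted from this patch):

#include <asm/alternative-macros.h>
#include <asm/tlbflush.h>

/*
 * Sketch: on hardware that reports Svvptc, the "nop" below is patched
 * at boot into a jump over the preventive fence, so the cost on legacy
 * hardware is a single nop per patch site.
 */
static inline void preventive_fence_sketch(unsigned long start,
					   unsigned long end)
{
	asm goto(ALTERNATIVE("nop", "j %l[svvptc]", 0,
			     RISCV_ISA_EXT_SVVPTC, 1)
		 : : : : svvptc);

	flush_tlb_kernel_range(start, end);	/* legacy uarchs */
svvptc:
	return;					/* Svvptc: no fence needed */
}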




Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Alexandre Ghiti
On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley  wrote:
>
> On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti  
> > wrote:
> > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  wrote:
> > > >
> > > > As for the current status of the patch, there are two points that
> > > > could be optimized:
> > > > 1. Some hardware implementations may not cache invalid TLB entries,
> > > > so for them it doesn't matter whether Svvptc is available or not.
> > > > Can we consider adding a CONFIG_RISCV_SVVPTC option to control this?
> >
> > That would produce a non-portable kernel. But I'm not opposed to that
> > at all; let me check how we handle other extensions. Maybe @Conor
> > Dooley has some feedback here?
>
> To be honest, I'm not really sure what to give feedback on. Could you
> elaborate on exactly what the option is going to do? Given the
> portability concern, I guess you were proposing that the option would
> remove the preventative fences, rather than your current patch, which
> removes them via an alternative?

No no, I won't do that: we need a generic kernel for distros, so that's
not even a question. What Yunhui was asking (as I understand it) is: can
we introduce a Kconfig option to always remove the preventive fences,
bypassing the use of alternatives altogether?

To me, it won't make a difference in terms of performance. But if we
already offer such a possibility for other extensions, then I'll do it.
Otherwise, the question is: should we start doing that?

> I don't think we have any extension-related options that work like
> that at the moment, and making that an option will just mean that
> distros that look to cater for multiple platforms won't be able to
> turn it on.
>
> Thanks,
> Conor.
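
For concreteness, the compile-time knob being discussed might look like
the following hypothetical sketch (CONFIG_RISCV_SVVPTC is not defined
anywhere in this series; the symbol and the helper name are assumptions):

#include <linux/kconfig.h>
#include <asm/tlbflush.h>

/*
 * Hypothetical: with CONFIG_RISCV_SVVPTC=y the preventive fence would
 * be dropped at build time, bypassing alternatives entirely, but the
 * resulting kernel would assume Svvptc behaviour everywhere and so
 * would not be portable.
 */
static inline void flush_cache_vmap_sketch(unsigned long start,
					   unsigned long end)
{
	if (!IS_ENABLED(CONFIG_RISCV_SVVPTC))
		flush_tlb_kernel_range(start, end);
}

As Conor points out above, no extension-related option works like that
today, and a distro kernel catering for multiple platforms could never
enable it.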


Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Conor Dooley
On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti  wrote:
> > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  wrote:
> > >
> > > As for the current status of the patch, there are two points that
> > > could be optimized:
> > > 1. Some hardware implementations may not cache invalid TLB entries,
> > > so for them it doesn't matter whether Svvptc is available or not.
> > > Can we consider adding a CONFIG_RISCV_SVVPTC option to control this?
> 
> That would produce a non-portable kernel. But I'm not opposed to that
> at all; let me check how we handle other extensions. Maybe @Conor
> Dooley has some feedback here?

To be honest, I'm not really sure what to give feedback on. Could you
elaborate on exactly what the option is going to do? Given the
portability concern, I guess you were proposing that the option would
remove the preventative fences, rather than your current patch, which
removes them via an alternative? I don't think we have any
extension-related options that work like that at the moment, and making
that an option will just mean that distros that look to cater for
multiple platforms won't be able to turn it on.

Thanks,
Conor.




Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Alexandre Ghiti
On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti  wrote:
>
> Hi Yunhui,
>
> On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  wrote:
> >
> > Hi Alexandre,
> >
> > On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti  
> > wrote:
> > >
> > > Hi Yunhui,
> > >
> > > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui  wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti 
> > > >  wrote:
> > > > >
> > > > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > > > [1] [2]). Then in order to make sure that new page table entries were
> > > > > seen by the page table walker, we had to preventively emit a 
> > > > > sfence.vma
> > > > > on all harts [3] but this solution is very costly since it relies on 
> > > > > IPI.
> > > > >
> > > > > And even there, we could end up in a loop of vmalloc faults if a 
> > > > > vmalloc
> > > > > allocation is done in the IPI path (for example if it is traced, see
> > > > > [4]), which could result in a kernel stack overflow.
> > > > >
> > > > > Those preventive sfence.vma needed to be emitted because:
> > > > >
> > > > > - if the uarch caches invalid entries, the new mapping may not be
> > > > >   observed by the page table walker and an invalidation may be needed.
> > > > > - if the uarch does not cache invalid entries, a reordered access
> > > > >   could "miss" the new mapping and traps: in that case, we would 
> > > > > actually
> > > > >   only need to retry the access, no sfence.vma is required.
> > > > >
> > > > > So this patch removes those preventive sfence.vma and actually handles
> > > > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > > > mappings lie in the vmalloc area, this handling must be done very 
> > > > > early
> > > > > when the trap is taken, at the very beginning of handle_exception: 
> > > > > this
> > > > > also rules out the vmalloc allocations in the fault path.
> > > > >
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/
> > > > >  [1]
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com
> > > > >  [2]
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/
> > > > >  [3]
> > > > > Link: 
> > > > > https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ 
> > > > > [4]
> > > > > Signed-off-by: Alexandre Ghiti 
> > > > > ---
> > > > >  arch/riscv/include/asm/cacheflush.h  | 18 +-
> > > > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > > > >  arch/riscv/kernel/asm-offsets.c  |  5 ++
> > > > >  arch/riscv/kernel/entry.S| 84 
> > > > > 
> > > > >  arch/riscv/mm/init.c |  2 +
> > > > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/cacheflush.h 
> > > > > b/arch/riscv/include/asm/cacheflush.h
> > > > > index a129dac4521d..b0d631701757 100644
> > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page 
> > > > > *page)
> > > > > flush_icache_mm(vma->vm_mm, 0)
> > > > >
> > > > >  #ifdef CONFIG_64BIT
> > > > > -#define flush_cache_vmap(start, end)   
> > > > > flush_tlb_kernel_range(start, end)
> > > > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > > +extern char _end[];
> > > > > +#define flush_cache_vmap flush_cache_vmap
> > > > > +static inline void flush_cache_vmap(unsigned long start, unsigned 
> > > > > long end)
> > > > > +{
> > > > > +   if (is_vmalloc_or_module_addr((void *)start)) {
> > > > > +   int i;
> > > > > +
> > > > > +   /*
> > > > > +* We don't care if concurrently a cpu resets this 
> > > > > value since
> > > > > +* the only place this can happen is in 
> > > > > handle_exception() where
> > > > > +* an sfence.vma is emitted.
> > > > > +*/
> > > > > +   for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > > > +   new_vmalloc[i] = -1ULL;
> > > > > +   }
> > > > > +}
> > > > >  #define flush_cache_vmap_early(start, end) 
> > > > > local_flush_tlb_kernel_range(start, end)
> > > > >  #endif
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/thread_info.h 
> > > > > b/arch/riscv/include/asm/thread_info.h
> > > > > index 5d473343634b..32631acdcdd4 100644
> > > > > --- a/arch/riscv/include/asm/thread_info.h
> > > > > +++ b/arch/riscv/include/asm/thread_info.h
> > > > > @@ -60,6 +60,11 @@ struct thread_info {
> > > > > void*scs_base;
> > > > > void*scs_sp;
> > > > >  #endif
> > > > > +   /*
> > > > > +* Used in handle_exception() to save a0, a1 and a2 before 
> > > > > knowing if we
> > > > > can access the kernel stack.

Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-04 Thread Alexandre Ghiti
Hi Yunhui,

On Tue, Jun 4, 2024 at 8:21 AM yunhui cui  wrote:
>
> Hi Alexandre,
>
> On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti  wrote:
> >
> > Hi Yunhui,
> >
> > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui  wrote:
> > >
> > > Hi Alexandre,
> > >
> > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti  
> > > wrote:
> > > >
> > > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > > [1] [2]). Then in order to make sure that new page table entries were
> > > > seen by the page table walker, we had to preventively emit a sfence.vma
> > > > on all harts [3] but this solution is very costly since it relies on 
> > > > IPI.
> > > >
> > > > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > > > allocation is done in the IPI path (for example if it is traced, see
> > > > [4]), which could result in a kernel stack overflow.
> > > >
> > > > Those preventive sfence.vma needed to be emitted because:
> > > >
> > > > - if the uarch caches invalid entries, the new mapping may not be
> > > >   observed by the page table walker and an invalidation may be needed.
> > > > - if the uarch does not cache invalid entries, a reordered access
> > > >   could "miss" the new mapping and traps: in that case, we would 
> > > > actually
> > > >   only need to retry the access, no sfence.vma is required.
> > > >
> > > > So this patch removes those preventive sfence.vma and actually handles
> > > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > > mappings lie in the vmalloc area, this handling must be done very early
> > > > when the trap is taken, at the very beginning of handle_exception: this
> > > > also rules out the vmalloc allocations in the fault path.
> > > >
> > > > Link: 
> > > > https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/
> > > >  [1]
> > > > Link: 
> > > > https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com
> > > >  [2]
> > > > Link: 
> > > > https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/
> > > >  [3]
> > > > Link: 
> > > > https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ [4]
> > > > Signed-off-by: Alexandre Ghiti 
> > > > ---
> > > >  arch/riscv/include/asm/cacheflush.h  | 18 +-
> > > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > > >  arch/riscv/kernel/asm-offsets.c  |  5 ++
> > > >  arch/riscv/kernel/entry.S| 84 
> > > >  arch/riscv/mm/init.c |  2 +
> > > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/riscv/include/asm/cacheflush.h 
> > > > b/arch/riscv/include/asm/cacheflush.h
> > > > index a129dac4521d..b0d631701757 100644
> > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page 
> > > > *page)
> > > > flush_icache_mm(vma->vm_mm, 0)
> > > >
> > > >  #ifdef CONFIG_64BIT
> > > > -#define flush_cache_vmap(start, end)   
> > > > flush_tlb_kernel_range(start, end)
> > > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > +extern char _end[];
> > > > +#define flush_cache_vmap flush_cache_vmap
> > > > +static inline void flush_cache_vmap(unsigned long start, unsigned long 
> > > > end)
> > > > +{
> > > > +   if (is_vmalloc_or_module_addr((void *)start)) {
> > > > +   int i;
> > > > +
> > > > +   /*
> > > > +* We don't care if concurrently a cpu resets this 
> > > > value since
> > > > +* the only place this can happen is in 
> > > > handle_exception() where
> > > > +* an sfence.vma is emitted.
> > > > +*/
> > > > +   for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > > +   new_vmalloc[i] = -1ULL;
> > > > +   }
> > > > +}
> > > >  #define flush_cache_vmap_early(start, end) 
> > > > local_flush_tlb_kernel_range(start, end)
> > > >  #endif
> > > >
> > > > diff --git a/arch/riscv/include/asm/thread_info.h 
> > > > b/arch/riscv/include/asm/thread_info.h
> > > > index 5d473343634b..32631acdcdd4 100644
> > > > --- a/arch/riscv/include/asm/thread_info.h
> > > > +++ b/arch/riscv/include/asm/thread_info.h
> > > > @@ -60,6 +60,11 @@ struct thread_info {
> > > > void*scs_base;
> > > > void*scs_sp;
> > > >  #endif
> > > > +   /*
> > > > +* Used in handle_exception() to save a0, a1 and a2 before 
> > > > knowing if we
> > > > +* can access the kernel stack.
> > > > +*/
> > > > +   unsigned long   a0, a1, a2;
> > > >  };
> > > >
> > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > > diff --git a/arch/riscv/kernel/asm-offsets.c 
> > > > b/arch/riscv/kernel/asm-offsets.c
> > > > index a03129f40c46..939ddc0e3c6e 100644
> > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > +++ b/arch/riscv/kernel/asm-offsets.c

Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-03 Thread yunhui cui
Hi Alexandre,

On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti  wrote:
>
> Hi Yunhui,
>
> On Mon, Jun 3, 2024 at 4:26 AM yunhui cui  wrote:
> >
> > Hi Alexandre,
> >
> > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti  
> > wrote:
> > >
> > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > [1] [2]). Then in order to make sure that new page table entries were
> > > seen by the page table walker, we had to preventively emit a sfence.vma
> > > on all harts [3] but this solution is very costly since it relies on IPI.
> > >
> > > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > > allocation is done in the IPI path (for example if it is traced, see
> > > [4]), which could result in a kernel stack overflow.
> > >
> > > Those preventive sfence.vma needed to be emitted because:
> > >
> > > - if the uarch caches invalid entries, the new mapping may not be
> > >   observed by the page table walker and an invalidation may be needed.
> > > - if the uarch does not cache invalid entries, a reordered access
> > >   could "miss" the new mapping and traps: in that case, we would actually
> > >   only need to retry the access, no sfence.vma is required.
> > >
> > > So this patch removes those preventive sfence.vma and actually handles
> > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > mappings lie in the vmalloc area, this handling must be done very early
> > > when the trap is taken, at the very beginning of handle_exception: this
> > > also rules out the vmalloc allocations in the fault path.
> > >
> > > Link: 
> > > https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/
> > >  [1]
> > > Link: 
> > > https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com
> > >  [2]
> > > Link: 
> > > https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/
> > >  [3]
> > > Link: 
> > > https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ [4]
> > > Signed-off-by: Alexandre Ghiti 
> > > ---
> > >  arch/riscv/include/asm/cacheflush.h  | 18 +-
> > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > >  arch/riscv/kernel/asm-offsets.c  |  5 ++
> > >  arch/riscv/kernel/entry.S| 84 
> > >  arch/riscv/mm/init.c |  2 +
> > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/riscv/include/asm/cacheflush.h 
> > > b/arch/riscv/include/asm/cacheflush.h
> > > index a129dac4521d..b0d631701757 100644
> > > --- a/arch/riscv/include/asm/cacheflush.h
> > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> > > flush_icache_mm(vma->vm_mm, 0)
> > >
> > >  #ifdef CONFIG_64BIT
> > > -#define flush_cache_vmap(start, end)   
> > > flush_tlb_kernel_range(start, end)
> > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > +extern char _end[];
> > > +#define flush_cache_vmap flush_cache_vmap
> > > +static inline void flush_cache_vmap(unsigned long start, unsigned long 
> > > end)
> > > +{
> > > +   if (is_vmalloc_or_module_addr((void *)start)) {
> > > +   int i;
> > > +
> > > +   /*
> > > +* We don't care if concurrently a cpu resets this value 
> > > since
> > > +* the only place this can happen is in 
> > > handle_exception() where
> > > +* an sfence.vma is emitted.
> > > +*/
> > > +   for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > +   new_vmalloc[i] = -1ULL;
> > > +   }
> > > +}
> > >  #define flush_cache_vmap_early(start, end) 
> > > local_flush_tlb_kernel_range(start, end)
> > >  #endif
> > >
> > > diff --git a/arch/riscv/include/asm/thread_info.h 
> > > b/arch/riscv/include/asm/thread_info.h
> > > index 5d473343634b..32631acdcdd4 100644
> > > --- a/arch/riscv/include/asm/thread_info.h
> > > +++ b/arch/riscv/include/asm/thread_info.h
> > > @@ -60,6 +60,11 @@ struct thread_info {
> > > void*scs_base;
> > > void*scs_sp;
> > >  #endif
> > > +   /*
> > > +* Used in handle_exception() to save a0, a1 and a2 before 
> > > knowing if we
> > > +* can access the kernel stack.
> > > +*/
> > > +   unsigned long   a0, a1, a2;
> > >  };
> > >
> > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > diff --git a/arch/riscv/kernel/asm-offsets.c 
> > > b/arch/riscv/kernel/asm-offsets.c
> > > index a03129f40c46..939ddc0e3c6e 100644
> > > --- a/arch/riscv/kernel/asm-offsets.c
> > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > @@ -35,6 +35,8 @@ void asm_offsets(void)
> > > OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> > > OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> > > OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > > +
> > > +   OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);

Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-03 Thread Alexandre Ghiti
Hi Yunhui,

On Mon, Jun 3, 2024 at 4:26 AM yunhui cui  wrote:
>
> Hi Alexandre,
>
> On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti  
> wrote:
> >
> > In 6.5, we removed the vmalloc fault path because that can't work (see
> > [1] [2]). Then in order to make sure that new page table entries were
> > seen by the page table walker, we had to preventively emit a sfence.vma
> > on all harts [3] but this solution is very costly since it relies on IPI.
> >
> > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > allocation is done in the IPI path (for example if it is traced, see
> > [4]), which could result in a kernel stack overflow.
> >
> > Those preventive sfence.vma needed to be emitted because:
> >
> > - if the uarch caches invalid entries, the new mapping may not be
> >   observed by the page table walker and an invalidation may be needed.
> > - if the uarch does not cache invalid entries, a reordered access
> >   could "miss" the new mapping and traps: in that case, we would actually
> >   only need to retry the access, no sfence.vma is required.
> >
> > So this patch removes those preventive sfence.vma and actually handles
> > the possible (and unlikely) exceptions. And since the kernel stacks
> > mappings lie in the vmalloc area, this handling must be done very early
> > when the trap is taken, at the very beginning of handle_exception: this
> > also rules out the vmalloc allocations in the fault path.
> >
> > Link: 
> > https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/
> >  [1]
> > Link: 
> > https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com
> >  [2]
> > Link: 
> > https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/
> >  [3]
> > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ 
> > [4]
> > Signed-off-by: Alexandre Ghiti 
> > ---
> >  arch/riscv/include/asm/cacheflush.h  | 18 +-
> >  arch/riscv/include/asm/thread_info.h |  5 ++
> >  arch/riscv/kernel/asm-offsets.c  |  5 ++
> >  arch/riscv/kernel/entry.S| 84 
> >  arch/riscv/mm/init.c |  2 +
> >  5 files changed, 113 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/riscv/include/asm/cacheflush.h 
> > b/arch/riscv/include/asm/cacheflush.h
> > index a129dac4521d..b0d631701757 100644
> > --- a/arch/riscv/include/asm/cacheflush.h
> > +++ b/arch/riscv/include/asm/cacheflush.h
> > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> > flush_icache_mm(vma->vm_mm, 0)
> >
> >  #ifdef CONFIG_64BIT
> > -#define flush_cache_vmap(start, end)   
> > flush_tlb_kernel_range(start, end)
> > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > +extern char _end[];
> > +#define flush_cache_vmap flush_cache_vmap
> > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > +{
> > +   if (is_vmalloc_or_module_addr((void *)start)) {
> > +   int i;
> > +
> > +   /*
> > +* We don't care if concurrently a cpu resets this value 
> > since
> > +* the only place this can happen is in handle_exception() 
> > where
> > +* an sfence.vma is emitted.
> > +*/
> > +   for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > +   new_vmalloc[i] = -1ULL;
> > +   }
> > +}
> >  #define flush_cache_vmap_early(start, end) 
> > local_flush_tlb_kernel_range(start, end)
> >  #endif
> >
> > diff --git a/arch/riscv/include/asm/thread_info.h 
> > b/arch/riscv/include/asm/thread_info.h
> > index 5d473343634b..32631acdcdd4 100644
> > --- a/arch/riscv/include/asm/thread_info.h
> > +++ b/arch/riscv/include/asm/thread_info.h
> > @@ -60,6 +60,11 @@ struct thread_info {
> > void*scs_base;
> > void*scs_sp;
> >  #endif
> > +   /*
> > +* Used in handle_exception() to save a0, a1 and a2 before knowing 
> > if we
> > +* can access the kernel stack.
> > +*/
> > +   unsigned long   a0, a1, a2;
> >  };
> >
> >  #ifdef CONFIG_SHADOW_CALL_STACK
> > diff --git a/arch/riscv/kernel/asm-offsets.c 
> > b/arch/riscv/kernel/asm-offsets.c
> > index a03129f40c46..939ddc0e3c6e 100644
> > --- a/arch/riscv/kernel/asm-offsets.c
> > +++ b/arch/riscv/kernel/asm-offsets.c
> > @@ -35,6 +35,8 @@ void asm_offsets(void)
> > OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> > OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> > OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > +
> > +   OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> > OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> > OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, 
> > thread_info.preempt_count);
> > OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> > @@ -42,6 +44,9 @@ void asm_offsets(void)
> >  

Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-06-02 Thread yunhui cui
Hi Alexandre,

On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti  wrote:
>
> In 6.5, we removed the vmalloc fault path because that can't work (see
> [1] [2]). Then in order to make sure that new page table entries were
> seen by the page table walker, we had to preventively emit a sfence.vma
> on all harts [3] but this solution is very costly since it relies on IPI.
>
> And even there, we could end up in a loop of vmalloc faults if a vmalloc
> allocation is done in the IPI path (for example if it is traced, see
> [4]), which could result in a kernel stack overflow.
>
> Those preventive sfence.vma needed to be emitted because:
>
> - if the uarch caches invalid entries, the new mapping may not be
>   observed by the page table walker and an invalidation may be needed.
> - if the uarch does not cache invalid entries, a reordered access
>   could "miss" the new mapping and traps: in that case, we would actually
>   only need to retry the access, no sfence.vma is required.
>
> So this patch removes those preventive sfence.vma and actually handles
> the possible (and unlikely) exceptions. And since the kernel stacks
> mappings lie in the vmalloc area, this handling must be done very early
> when the trap is taken, at the very beginning of handle_exception: this
> also rules out the vmalloc allocations in the fault path.
>
> Link: 
> https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/ 
> [1]
> Link: 
> https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com
>  [2]
> Link: 
> https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/
>  [3]
> Link: https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ [4]
> Signed-off-by: Alexandre Ghiti 
> ---
>  arch/riscv/include/asm/cacheflush.h  | 18 +-
>  arch/riscv/include/asm/thread_info.h |  5 ++
>  arch/riscv/kernel/asm-offsets.c  |  5 ++
>  arch/riscv/kernel/entry.S| 84 
>  arch/riscv/mm/init.c |  2 +
>  5 files changed, 113 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/include/asm/cacheflush.h 
> b/arch/riscv/include/asm/cacheflush.h
> index a129dac4521d..b0d631701757 100644
> --- a/arch/riscv/include/asm/cacheflush.h
> +++ b/arch/riscv/include/asm/cacheflush.h
> @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> flush_icache_mm(vma->vm_mm, 0)
>
>  #ifdef CONFIG_64BIT
> -#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, 
> end)
> +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> +extern char _end[];
> +#define flush_cache_vmap flush_cache_vmap
> +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> +{
> +   if (is_vmalloc_or_module_addr((void *)start)) {
> +   int i;
> +
> +   /*
> +* We don't care if concurrently a cpu resets this value since
> +* the only place this can happen is in handle_exception() 
> where
> +* an sfence.vma is emitted.
> +*/
> +   for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> +   new_vmalloc[i] = -1ULL;
> +   }
> +}
>  #define flush_cache_vmap_early(start, end) 
> local_flush_tlb_kernel_range(start, end)
>  #endif
>
> diff --git a/arch/riscv/include/asm/thread_info.h 
> b/arch/riscv/include/asm/thread_info.h
> index 5d473343634b..32631acdcdd4 100644
> --- a/arch/riscv/include/asm/thread_info.h
> +++ b/arch/riscv/include/asm/thread_info.h
> @@ -60,6 +60,11 @@ struct thread_info {
> void*scs_base;
> void*scs_sp;
>  #endif
> +   /*
> +* Used in handle_exception() to save a0, a1 and a2 before knowing if 
> we
> +* can access the kernel stack.
> +*/
> +   unsigned long   a0, a1, a2;
>  };
>
>  #ifdef CONFIG_SHADOW_CALL_STACK
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index a03129f40c46..939ddc0e3c6e 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -35,6 +35,8 @@ void asm_offsets(void)
> OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> +
> +   OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> @@ -42,6 +44,9 @@ void asm_offsets(void)
>  #ifdef CONFIG_SHADOW_CALL_STACK
> OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
>  #endif
> +   OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> +   OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> +   OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
>
> OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);

[PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-01-31 Thread Alexandre Ghiti
In 6.5, we removed the vmalloc fault path because it can't work (see
[1] [2]). Then, in order to make sure that new page table entries were
seen by the page table walker, we had to preventively emit an sfence.vma
on all harts [3], but this solution is very costly since it relies on IPIs.

And even then, we could end up in a loop of vmalloc faults if a vmalloc
allocation is done in the IPI path (for example if it is traced, see
[4]), which could result in a kernel stack overflow.

Those preventive sfence.vma needed to be emitted because:

- if the uarch caches invalid entries, the new mapping may not be
  observed by the page table walker and an invalidation may be needed.
- if the uarch does not cache invalid entries, a reordered access
  could "miss" the new mapping and trap: in that case, we would actually
  only need to retry the access; no sfence.vma is required.

So this patch removes those preventive sfence.vma and actually handles
the possible (and unlikely) exceptions. And since the kernel stack
mappings lie in the vmalloc area, this handling must be done very early,
when the trap is taken, at the very beginning of handle_exception: this
also rules out vmalloc allocations in the fault path.
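
In rough C terms, the handshake between flush_cache_vmap() and the trap
entry looks like this. This is a readability sketch only: the real
consumer side is assembly at the very top of handle_exception,
check_new_vmalloc() is an invented name, and the bit layout (one bit per
hart in a u64 array) is my reading of the new_vmalloc sizing below.

#include <linux/threads.h>
#include <linux/types.h>
#include <asm/tlbflush.h>

extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];

/*
 * new_vmalloc acts as a per-hart flag set: the producer
 * (flush_cache_vmap() below) sets every bit whenever a new vmalloc
 * mapping is created, and each hart clears its own bit the next time
 * it traps. Races are benign, see the comment in flush_cache_vmap().
 */
static bool check_new_vmalloc(unsigned int cpu)
{
	u64 mask = 1ULL << (cpu % 64);

	if (!(new_vmalloc[cpu / 64] & mask))
		return false;

	new_vmalloc[cpu / 64] &= ~mask;
	local_flush_tlb_all();	/* sfence.vma on this hart only */
	return true;		/* just retry the faulting access */
}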

Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/ [1]
Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com [2]
Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/ [3]
Link: https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ [4]
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/cacheflush.h  | 18 +-
 arch/riscv/include/asm/thread_info.h |  5 ++
 arch/riscv/kernel/asm-offsets.c      |  5 ++
 arch/riscv/kernel/entry.S            | 84 
 arch/riscv/mm/init.c                 |  2 +
 5 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/cacheflush.h 
b/arch/riscv/include/asm/cacheflush.h
index a129dac4521d..b0d631701757 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
 	flush_icache_mm(vma->vm_mm, 0)
 
 #ifdef CONFIG_64BIT
-#define flush_cache_vmap(start, end)	flush_tlb_kernel_range(start, end)
+extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
+extern char _end[];
+#define flush_cache_vmap flush_cache_vmap
+static inline void flush_cache_vmap(unsigned long start, unsigned long end)
+{
+	if (is_vmalloc_or_module_addr((void *)start)) {
+		int i;
+
+		/*
+		 * We don't care if concurrently a cpu resets this value since
+		 * the only place this can happen is in handle_exception() where
+		 * an sfence.vma is emitted.
+		 */
+		for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
+			new_vmalloc[i] = -1ULL;
+	}
+}
 #define flush_cache_vmap_early(start, end)	local_flush_tlb_kernel_range(start, end)
 #endif
 
diff --git a/arch/riscv/include/asm/thread_info.h 
b/arch/riscv/include/asm/thread_info.h
index 5d473343634b..32631acdcdd4 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -60,6 +60,11 @@ struct thread_info {
 	void			*scs_base;
 	void			*scs_sp;
 #endif
+	/*
+	 * Used in handle_exception() to save a0, a1 and a2 before knowing if we
+	 * can access the kernel stack.
+	 */
+	unsigned long		a0, a1, a2;
 };
 
 #ifdef CONFIG_SHADOW_CALL_STACK
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a03129f40c46..939ddc0e3c6e 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -35,6 +35,8 @@ void asm_offsets(void)
 	OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
 	OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
 	OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
+
+	OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
 	OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
 	OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
 	OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
@@ -42,6 +44,9 @@ void asm_offsets(void)
 #ifdef CONFIG_SHADOW_CALL_STACK
 	OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
 #endif
+	OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
+	OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
+	OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
 
 	OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
 	OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 9d1a305d5508..c1ffaeaba7aa 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -19,6 +1