Re: [PATCH 11/12] mm: Remove pXX_devmap callers
On Tue, Sep 10, 2024 at 02:14:36PM +1000, Alistair Popple wrote:

Hi Alistair,

> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 5a4a753..4537a29 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -193,7 +192,7 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
> 	pmd_t pmd;
> 	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
> 	VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> -		   !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
> +		   || !pmd_present(*pmdp));

That looks broken.

> 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
> 	/*
> 	 * if it not a fullmm flush, then we can possibly end up converting

Thanks!
Re: [PATCH v3] mm/mm_init: use node's number of cpus in deferred_page_init_max_threads
On Tue, May 28, 2024 at 02:54:58PM -0400, Eric Chanudet wrote:
> When DEFERRED_STRUCT_PAGE_INIT=y, use a node's cpu count as maximum
> thread count for the deferred initialization of struct pages via padata.
> This should result in shorter boot times for these configurations by
> going through page_alloc_init_late() faster as systems tend not to be
> under heavy load that early in the bootstrap.
>
> Only x86_64 does that now. Make it archs agnostic when
> DEFERRED_STRUCT_PAGE_INIT is set. With the default defconfigs, that
> includes powerpc and s390.
>
> It used to be so before offering archs to override the function for
> tuning with commit ecd096506922 ("mm: make deferred init's max threads
> arch-specific").
>
> Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms
> shows faster deferred_init_memmap completions:
>
> |         | x13s        | SA8775p-ride | Ampere R137-P31 | Ampere HR330 |
> |         | Metal, 32GB | VM, 36GB     | VM, 58GB        | Metal, 128GB |
> |         | 8cpus       | 8cpus        | 8cpus           | 32cpus       |
> |---------|-------------|--------------|-----------------|--------------|
> | threads | ms (%)      | ms (%)       | ms (%)          | ms (%)       |
> |---------|-------------|--------------|-----------------|--------------|
> | 1       | 108 (0%)    | 72 (0%)      | 224 (0%)        | 324 (0%)    |
> | cpus    | 24 (-77%)   | 36 (-50%)    | 40 (-82%)       | 56 (-82%)   |
>
> Michael Ellerman on a powerpc machine (1TB, 40 cores, 4KB pages) reports
> faster deferred_init_memmap from 210-240ms to 90-110ms between nodes.
>
> Signed-off-by: Eric Chanudet
> Tested-by: Michael Ellerman (powerpc)
>
> ---
> - v1: https://lore.kernel.org/linux-arm-kernel/20240520231555.395979-5-echan...@redhat.com
> - Changes since v1:
>   - Make the generic function return the number of cpus of the node as
>     max threads limit instead overriding it for arm64.
>   - Drop Baoquan He's R-b on v1 since the logic changed.
>   - Add CCs according to patch changes (ppc and s390 set
>     DEFERRED_STRUCT_PAGE_INIT by default).
>
> - v2: https://lore.kernel.org/linux-arm-kernel/20240522203758.626932-4-echan...@redhat.com/
> - Changes since v2:
>   - deferred_page_init_max_threads returns unsigned and use max instead
>     of max_t.
>   - Make deferred_page_init_max_threads static since there are no more
>     override.
>   - Rephrase description.
>   - Add T-b and report from Michael Ellerman.
>
>  arch/x86/mm/init_64.c    | 12 ------------
>  include/linux/memblock.h |  2 --
>  mm/mm_init.c             |  5 ++---
>  3 files changed, 2 insertions(+), 17 deletions(-)

It does speed up. For s390:

Acked-by: Alexander Gordeev
[PATCH v2 RESEND 0/5] sched/vtime: vtime.h headers cleanup
Hi All,

There are no changes since the last post, just a re-send.

v2:
- patch 4: commit message reworded (Heiko)
- patch 5: vtime.h is removed from Kbuild scripts (PowerPC only) (Heiko)

v1:
Please find a small cleanup to vtime_task_switch() wiring. I split it
into smaller patches to allow separate PowerPC vs s390 reviews.
Otherwise patches 2+3 and 4+5 could have been merged.

I tested it on s390 and compile-tested it on 32- and 64-bit PowerPC and
a few other major architectures only, but it is only of concern for
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE-capable ones (AFAICT).

Thanks!

Alexander Gordeev (5):
  sched/vtime: remove confusing arch_vtime_task_switch() declaration
  sched/vtime: get rid of generic vtime_task_switch() implementation
  s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  s390/irq,nmi: include header directly
  sched/vtime: do not include header

 arch/powerpc/include/asm/Kbuild    |  1 -
 arch/powerpc/include/asm/cputime.h | 13 -------------
 arch/powerpc/kernel/time.c         | 22 ++++++++++++++++++++++
 arch/s390/include/asm/vtime.h      |  2 --
 arch/s390/kernel/irq.c             |  1 +
 arch/s390/kernel/nmi.c             |  1 +
 include/asm-generic/vtime.h        |  1 -
 include/linux/vtime.h              |  5 -----
 kernel/sched/cputime.c             | 13 -------------
 9 files changed, 24 insertions(+), 35 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

-- 
2.40.1
[PATCH v2 RESEND 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.

Reviewed-by: Frederic Weisbecker
Acked-by: Heiko Carstens
Acked-by: Nicholas Piggin
Signed-off-by: Alexander Gordeev
---
 arch/s390/include/asm/vtime.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h
index fe17e448c0c5..561c91c1a87c 100644
--- a/arch/s390/include/asm/vtime.h
+++ b/arch/s390/include/asm/vtime.h
@@ -2,8 +2,6 @@
 #ifndef _S390_VTIME_H
 #define _S390_VTIME_H
 
-#define __ARCH_HAS_VTIME_TASK_SWITCH
-
 static inline void update_timer_sys(void)
 {
 	S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
-- 
2.40.1
[PATCH v2 RESEND 1/5] sched/vtime: remove confusing arch_vtime_task_switch() declaration
Callback arch_vtime_task_switch() is only defined when
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the function
prototype forward declaration is present for
CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it.

Reviewed-by: Frederic Weisbecker
Reviewed-by: Nicholas Piggin
Signed-off-by: Alexander Gordeev
---
 include/linux/vtime.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3684487d01e1..593466ceebed 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -18,7 +18,6 @@ extern void vtime_account_idle(struct task_struct *tsk);
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void arch_vtime_task_switch(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
 extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
-- 
2.40.1
[PATCH v2 RESEND 4/5] s390/irq,nmi: include header directly
update_timer_sys() and update_timer_mcck() are inlines used for CPU time
accounting from the interrupt and machine-check handlers. These routines
are specific to s390 architecture, but included via header implicitly.

Avoid the extra loop and include header directly.

Acked-by: Heiko Carstens
Signed-off-by: Alexander Gordeev
---
 arch/s390/kernel/irq.c | 1 +
 arch/s390/kernel/nmi.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index 6f71b0ce1068..259496fe0ef9 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index c77382a67325..230d010bac9b 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 
 struct mcck_struct {
 	unsigned int kill_task : 1;
-- 
2.40.1
[PATCH v2 RESEND 5/5] sched/vtime: do not include header
There is no architecture-specific code or data left that generic
needs to know about. Thus, avoid the inclusion of header.

Reviewed-by: Frederic Weisbecker
Acked-by: Nicholas Piggin
Signed-off-by: Alexander Gordeev
---
 arch/powerpc/include/asm/Kbuild | 1 -
 include/asm-generic/vtime.h     | 1 -
 include/linux/vtime.h           | 4 ----
 3 files changed, 6 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 61a8dcd7..e5fdc336c9b2 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,4 @@ generic-y += agp.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
-generic-y += vtime.h
 generic-y += early_ioremap.h
diff --git a/include/asm-generic/vtime.h b/include/asm-generic/vtime.h
deleted file mode 100644
index b1a49677fe25..
--- a/include/asm-generic/vtime.h
+++ /dev/null
@@ -1 +0,0 @@
-/* no content, but patch(1) dislikes empty files */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 593466ceebed..29dd5b91dd7d 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -5,10 +5,6 @@
 #include
 #include
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-#include
-#endif
-
 /*
  * Common vtime APIs
  */
-- 
2.40.1
[PATCH v2 RESEND 2/5] sched/vtime: get rid of generic vtime_task_switch() implementation
The generic vtime_task_switch() implementation gets built only if
__ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an
architecture to implement arch_vtime_task_switch() callback at the same
time, which is confusing.

Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC
architecture only and vtime_task_switch() generic variant is rather
superfluous.

Simplify the whole vtime_task_switch() wiring by moving the existing
generic implementation to PowerPC.

Reviewed-by: Frederic Weisbecker
Reviewed-by: Nicholas Piggin
Acked-by: Michael Ellerman
Signed-off-by: Alexander Gordeev
---
 arch/powerpc/include/asm/cputime.h | 13 -------------
 arch/powerpc/kernel/time.c         | 22 ++++++++++++++++++++++
 kernel/sched/cputime.c             | 13 -------------
 3 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 4961fb38e438..aff858ca99c0 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -32,23 +32,10 @@
 #ifdef CONFIG_PPC64
 #define get_accounting(tsk)	(&get_paca()->accounting)
 #define raw_get_accounting(tsk)	(&local_paca->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 #else
 #define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
 #define raw_get_accounting(tsk)	get_accounting(tsk)
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
-{
-	struct cpu_accounting_data *acct = get_accounting(current);
-	struct cpu_accounting_data *acct0 = get_accounting(prev);
-
-	acct->starttime = acct0->starttime;
-}
 #endif
 
 /*
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df20cf201f74..c0fdc6d94fee 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -354,6 +354,28 @@ void vtime_flush(struct task_struct *tsk)
 	acct->hardirq_time = 0;
 	acct->softirq_time = 0;
 }
+
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+void vtime_task_switch(struct task_struct *prev)
+{
+	if (is_idle_task(prev))
+		vtime_account_idle(prev);
+	else
+		vtime_account_kernel(prev);
+
+	vtime_flush(prev);
+
+	if (!IS_ENABLED(CONFIG_PPC64)) {
+		struct cpu_accounting_data *acct = get_accounting(current);
+		struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+		acct->starttime = acct0->starttime;
+	}
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 void __no_kcsan __delay(unsigned long loops)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..aa48b2ec879d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -424,19 +424,6 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
-# ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_task_switch(struct task_struct *prev)
-{
-	if (is_idle_task(prev))
-		vtime_account_idle(prev);
-	else
-		vtime_account_kernel(prev);
-
-	vtime_flush(prev);
-	arch_vtime_task_switch(prev);
-}
-# endif
-
 void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
 {
 	unsigned int pc = irq_count() - offset;
-- 
2.40.1
[PATCH v2 1/5] sched/vtime: remove confusing arch_vtime_task_switch() declaration
Callback arch_vtime_task_switch() is only defined when
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the function
prototype forward declaration is present for
CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it.

Reviewed-by: Frederic Weisbecker
Signed-off-by: Alexander Gordeev
---
 include/linux/vtime.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3684487d01e1..593466ceebed 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -18,7 +18,6 @@ extern void vtime_account_idle(struct task_struct *tsk);
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void arch_vtime_task_switch(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
 extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
-- 
2.40.1
[PATCH v2 5/5] sched/vtime: do not include header
There is no architecture-specific code or data left that generic
needs to know about. Thus, avoid the inclusion of header.

Reviewed-by: Frederic Weisbecker
Signed-off-by: Alexander Gordeev
---
 arch/powerpc/include/asm/Kbuild | 1 -
 include/asm-generic/vtime.h     | 1 -
 include/linux/vtime.h           | 4 ----
 3 files changed, 6 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 61a8dcd7..e5fdc336c9b2 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,4 @@ generic-y += agp.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
-generic-y += vtime.h
 generic-y += early_ioremap.h
diff --git a/include/asm-generic/vtime.h b/include/asm-generic/vtime.h
deleted file mode 100644
index b1a49677fe25..
--- a/include/asm-generic/vtime.h
+++ /dev/null
@@ -1 +0,0 @@
-/* no content, but patch(1) dislikes empty files */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 593466ceebed..29dd5b91dd7d 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -5,10 +5,6 @@
 #include
 #include
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-#include
-#endif
-
 /*
  * Common vtime APIs
  */
-- 
2.40.1
[PATCH v2 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.

Reviewed-by: Frederic Weisbecker
Acked-by: Heiko Carstens
Signed-off-by: Alexander Gordeev
---
 arch/s390/include/asm/vtime.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h
index fe17e448c0c5..561c91c1a87c 100644
--- a/arch/s390/include/asm/vtime.h
+++ b/arch/s390/include/asm/vtime.h
@@ -2,8 +2,6 @@
 #ifndef _S390_VTIME_H
 #define _S390_VTIME_H
 
-#define __ARCH_HAS_VTIME_TASK_SWITCH
-
 static inline void update_timer_sys(void)
 {
 	S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
-- 
2.40.1
[PATCH v2 0/5] sched/vtime: vtime.h headers cleanup
Hi All,

I kept all tags on reviewed patches.

v2:
- patch 4: commit message reworded (Heiko)
- patch 5: vtime.h is removed from Kbuild scripts (PowerPC only) (Heiko)

v1:
Please find a small cleanup to vtime_task_switch() wiring. I split it
into smaller patches to allow separate PowerPC vs s390 reviews.
Otherwise patches 2+3 and 4+5 could have been merged.

I tested it on s390 and compile-tested it on 32- and 64-bit PowerPC and
a few other major architectures only, but it is only of concern for
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE-capable ones (AFAICT).

Thanks!

Alexander Gordeev (5):
  sched/vtime: remove confusing arch_vtime_task_switch() declaration
  sched/vtime: get rid of generic vtime_task_switch() implementation
  s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  s390/irq,nmi: include header directly
  sched/vtime: do not include header

 arch/powerpc/include/asm/Kbuild    |  1 -
 arch/powerpc/include/asm/cputime.h | 13 -------------
 arch/powerpc/kernel/time.c         | 22 ++++++++++++++++++++++
 arch/s390/include/asm/vtime.h      |  2 --
 arch/s390/kernel/irq.c             |  1 +
 arch/s390/kernel/nmi.c             |  1 +
 include/asm-generic/vtime.h        |  1 -
 include/linux/vtime.h              |  5 -----
 kernel/sched/cputime.c             | 13 -------------
 9 files changed, 24 insertions(+), 35 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

-- 
2.40.1
[PATCH v2 4/5] s390/irq,nmi: include header directly
update_timer_sys() and update_timer_mcck() are inlines used for CPU time
accounting from the interrupt and machine-check handlers. These routines
are specific to s390 architecture, but included via header implicitly.

Avoid the extra loop and include header directly.

Acked-by: Heiko Carstens
Signed-off-by: Alexander Gordeev
---
 arch/s390/kernel/irq.c | 1 +
 arch/s390/kernel/nmi.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index 6f71b0ce1068..259496fe0ef9 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 9ad44c26d1a2..4422a27faace 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 #include
 
 struct mcck_struct {
-- 
2.40.1
[PATCH v2 2/5] sched/vtime: get rid of generic vtime_task_switch() implementation
The generic vtime_task_switch() implementation gets built only if
__ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an
architecture to implement arch_vtime_task_switch() callback at the same
time, which is confusing.

Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC
architecture only and vtime_task_switch() generic variant is rather
superfluous.

Simplify the whole vtime_task_switch() wiring by moving the existing
generic implementation to PowerPC.

Reviewed-by: Frederic Weisbecker
Signed-off-by: Alexander Gordeev
---
 arch/powerpc/include/asm/cputime.h | 13 -------------
 arch/powerpc/kernel/time.c         | 22 ++++++++++++++++++++++
 kernel/sched/cputime.c             | 13 -------------
 3 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 4961fb38e438..aff858ca99c0 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -32,23 +32,10 @@
 #ifdef CONFIG_PPC64
 #define get_accounting(tsk)	(&get_paca()->accounting)
 #define raw_get_accounting(tsk)	(&local_paca->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 #else
 #define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
 #define raw_get_accounting(tsk)	get_accounting(tsk)
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
-{
-	struct cpu_accounting_data *acct = get_accounting(current);
-	struct cpu_accounting_data *acct0 = get_accounting(prev);
-
-	acct->starttime = acct0->starttime;
-}
 #endif
 
 /*
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df20cf201f74..c0fdc6d94fee 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -354,6 +354,28 @@ void vtime_flush(struct task_struct *tsk)
 	acct->hardirq_time = 0;
 	acct->softirq_time = 0;
 }
+
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+void vtime_task_switch(struct task_struct *prev)
+{
+	if (is_idle_task(prev))
+		vtime_account_idle(prev);
+	else
+		vtime_account_kernel(prev);
+
+	vtime_flush(prev);
+
+	if (!IS_ENABLED(CONFIG_PPC64)) {
+		struct cpu_accounting_data *acct = get_accounting(current);
+		struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+		acct->starttime = acct0->starttime;
+	}
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 void __no_kcsan __delay(unsigned long loops)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..aa48b2ec879d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -424,19 +424,6 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
-# ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_task_switch(struct task_struct *prev)
-{
-	if (is_idle_task(prev))
-		vtime_account_idle(prev);
-	else
-		vtime_account_kernel(prev);
-
-	vtime_flush(prev);
-	arch_vtime_task_switch(prev);
-}
-# endif
-
 void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
 {
 	unsigned int pc = irq_count() - offset;
-- 
2.40.1
Re: [PATCH 4/5] s390/irq,nmi: do not include header
On Mon, Jan 29, 2024 at 10:51:44AM +0100, Heiko Carstens wrote:
> It is confusing when the patch subject is "do not include.." and all
> what this patch is doing is to add two includes. I see what this is
> doing: getting rid of the implicit include of asm/vtime.h most likely
> via linux/hardirq.h, but that's not very obvious.
>
> Anyway:
> Acked-by: Heiko Carstens

Thank you, Heiko!

Does this wording sound better?

  s390/irq,nmi: include header directly

  update_timer_sys() and update_timer_mcck() are inlines used for CPU time
  accounting from the interrupt and machine-check handlers. These routines
  are specific to s390 architecture, but included via header implicitly.

  Avoid the extra loop and include header directly.
Re: [PATCH 5/5] sched/vtime: do not include header
On Wed, Feb 07, 2024 at 12:30:08AM +0100, Frederic Weisbecker wrote:
> Reviewed-by: Frederic Weisbecker

Thank you for the review, Frederic!

Heiko's comment is valid and I would add this chunk in v2:

--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,4 @@ generic-y += agp.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
-generic-y += vtime.h
 generic-y += early_ioremap.h

Would you keep your Reviewed-by?
[PATCH 4/5] s390/irq,nmi: do not include header
update_timer_sys() and update_timer_mcck() are inlines used for CPU time
accounting from the interrupt and machine-check handlers. These routines
are specific to s390 architecture, but declared via header, which in
turn includes. Avoid the extra loop and include header directly.

Signed-off-by: Alexander Gordeev
---
 arch/s390/kernel/irq.c | 1 +
 arch/s390/kernel/nmi.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index 6f71b0ce1068..259496fe0ef9 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 9ad44c26d1a2..4422a27faace 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 #include
 
 struct mcck_struct {
-- 
2.40.1
[PATCH 0/5] sched/vtime: vtime.h headers cleanup
Hi all, Please find a small cleanup to vtime_task_switch() wiring. I split it into smaller patches to allow separate PowerPC vs s390 reviews. Otherwise patches 2+3 and 4+5 could have been merged. I tested it on s390 and compile-tested it on 32- and 64-bit PowerPC and a few other major architectures only, but it is only of concern for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE-capable ones (AFAICT). Thanks! Alexander Gordeev (5): sched/vtime: remove confusing arch_vtime_task_switch() declaration sched/vtime: get rid of generic vtime_task_switch() implementation s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover s390/irq,nmi: do not include <asm/vtime.h> header sched/vtime: do not include <asm/vtime.h> header arch/powerpc/include/asm/cputime.h | 13 - arch/powerpc/kernel/time.c | 22 ++ arch/s390/include/asm/vtime.h | 2 -- arch/s390/kernel/irq.c | 1 + arch/s390/kernel/nmi.c | 1 + include/asm-generic/vtime.h| 1 - include/linux/vtime.h | 5 - kernel/sched/cputime.c | 13 - 8 files changed, 24 insertions(+), 34 deletions(-) delete mode 100644 include/asm-generic/vtime.h -- 2.40.1
[PATCH 2/5] sched/vtime: get rid of generic vtime_task_switch() implementation
The generic vtime_task_switch() implementation gets built only if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an architecture to implement arch_vtime_task_switch() callback at the same time, which is confusing. Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC architecture only and vtime_task_switch() generic variant is rather superfluous. Simplify the whole vtime_task_switch() wiring by moving the existing generic implementation to PowerPC. Signed-off-by: Alexander Gordeev --- arch/powerpc/include/asm/cputime.h | 13 - arch/powerpc/kernel/time.c | 22 ++ kernel/sched/cputime.c | 13 - 3 files changed, 22 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h index 4961fb38e438..aff858ca99c0 100644 --- a/arch/powerpc/include/asm/cputime.h +++ b/arch/powerpc/include/asm/cputime.h @@ -32,23 +32,10 @@ #ifdef CONFIG_PPC64 #define get_accounting(tsk)(&get_paca()->accounting) #define raw_get_accounting(tsk)(&local_paca->accounting) -static inline void arch_vtime_task_switch(struct task_struct *tsk) { } #else #define get_accounting(tsk)(&task_thread_info(tsk)->accounting) #define raw_get_accounting(tsk)get_accounting(tsk) -/* - * Called from the context switch with interrupts disabled, to charge all - * accumulated times to the current process, and to prepare accounting on - * the next process. 
- */ -static inline void arch_vtime_task_switch(struct task_struct *prev) -{ - struct cpu_accounting_data *acct = get_accounting(current); - struct cpu_accounting_data *acct0 = get_accounting(prev); - - acct->starttime = acct0->starttime; -} #endif /* diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index df20cf201f74..c0fdc6d94fee 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -354,6 +354,28 @@ void vtime_flush(struct task_struct *tsk) acct->hardirq_time = 0; acct->softirq_time = 0; } + +/* + * Called from the context switch with interrupts disabled, to charge all + * accumulated times to the current process, and to prepare accounting on + * the next process. + */ +void vtime_task_switch(struct task_struct *prev) +{ + if (is_idle_task(prev)) + vtime_account_idle(prev); + else + vtime_account_kernel(prev); + + vtime_flush(prev); + + if (!IS_ENABLED(CONFIG_PPC64)) { + struct cpu_accounting_data *acct = get_accounting(current); + struct cpu_accounting_data *acct0 = get_accounting(prev); + + acct->starttime = acct0->starttime; + } +} #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */ void __no_kcsan __delay(unsigned long loops) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index af7952f12e6c..aa48b2ec879d 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -424,19 +424,6 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_ */ #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE -# ifndef __ARCH_HAS_VTIME_TASK_SWITCH -void vtime_task_switch(struct task_struct *prev) -{ - if (is_idle_task(prev)) - vtime_account_idle(prev); - else - vtime_account_kernel(prev); - - vtime_flush(prev); - arch_vtime_task_switch(prev); -} -# endif - void vtime_account_irq(struct task_struct *tsk, unsigned int offset) { unsigned int pc = irq_count() - offset; -- 2.40.1
[PATCH 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore. Signed-off-by: Alexander Gordeev --- arch/s390/include/asm/vtime.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h index fe17e448c0c5..561c91c1a87c 100644 --- a/arch/s390/include/asm/vtime.h +++ b/arch/s390/include/asm/vtime.h @@ -2,8 +2,6 @@ #ifndef _S390_VTIME_H #define _S390_VTIME_H -#define __ARCH_HAS_VTIME_TASK_SWITCH - static inline void update_timer_sys(void) { S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer; -- 2.40.1
[PATCH 5/5] sched/vtime: do not include <asm/vtime.h> header
There is no architecture-specific code or data left that generic <linux/vtime.h> needs to know about. Thus, avoid the inclusion of the <asm/vtime.h> header. Signed-off-by: Alexander Gordeev --- include/asm-generic/vtime.h | 1 - include/linux/vtime.h | 4 2 files changed, 5 deletions(-) delete mode 100644 include/asm-generic/vtime.h diff --git a/include/asm-generic/vtime.h b/include/asm-generic/vtime.h deleted file mode 100644 index b1a49677fe25.. --- a/include/asm-generic/vtime.h +++ /dev/null @@ -1 +0,0 @@ -/* no content, but patch(1) dislikes empty files */ diff --git a/include/linux/vtime.h b/include/linux/vtime.h index 593466ceebed..29dd5b91dd7d 100644 --- a/include/linux/vtime.h +++ b/include/linux/vtime.h @@ -5,10 +5,6 @@ #include #include -#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE -#include <asm/vtime.h> -#endif - /* * Common vtime APIs */ -- 2.40.1
[PATCH 1/5] sched/vtime: remove confusing arch_vtime_task_switch() declaration
Callback arch_vtime_task_switch() is only defined when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the function prototype forward declaration is present for CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it. Signed-off-by: Alexander Gordeev --- include/linux/vtime.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/vtime.h b/include/linux/vtime.h index 3684487d01e1..593466ceebed 100644 --- a/include/linux/vtime.h +++ b/include/linux/vtime.h @@ -18,7 +18,6 @@ extern void vtime_account_idle(struct task_struct *tsk); #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */ #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN -extern void arch_vtime_task_switch(struct task_struct *tsk); extern void vtime_user_enter(struct task_struct *tsk); extern void vtime_user_exit(struct task_struct *tsk); extern void vtime_guest_enter(struct task_struct *tsk); -- 2.40.1
Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
On Tue, Jan 17, 2023 at 10:45:25PM +0100, Jann Horn wrote: Hi Jann, > On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan wrote: > > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn wrote: > > > > > > +locking maintainers > > > > Thanks! I'll CC the locking maintainers in the next posting. > > > > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan > > > wrote: > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling > > > > instead of mmap_lock. Because there are cases when multiple VMAs need > > > > to be exclusively locked during VMA tree modifications, instead of the > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When > > > > mmap_write_lock holder is done with all modifications and drops > > > > mmap_lock, > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as > > > > locked. > > > [...] > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma) > > > > +{ > > > > + up_read(&vma->lock); > > > > +} > > > > > > One thing that might be gnarly here is that I think you might not be > > > allowed to use up_read() to fully release ownership of an object - > > > from what I remember, I think that up_read() (unlike something like > > > spin_unlock()) can access the lock object after it's already been > > > acquired by someone else. So if you want to protect against concurrent > > > deletion, this might have to be something like: > > > > > > rcu_read_lock(); /* keeps vma alive */ > > > up_read(&vma->lock); > > > rcu_read_unlock(); > > > > But for deleting VMA one would need to write-lock the vma->lock first, > > which I assume can't happen until this up_read() is complete. Is that > > assumption wrong? 
> > __up_read() does: > > rwsem_clear_reader_owned(sem); > tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count); > DEBUG_RWSEMS_WARN_ON(tmp < 0, sem); > if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) == > RWSEM_FLAG_WAITERS)) { > clear_nonspinnable(sem); > rwsem_wake(sem); > } This sequence is covered by preempt_disable()/preempt_enable(). Wouldn't it preserve the RCU grace period until after __up_read() has exited? > The atomic_long_add_return_release() is the point where we are doing > the main lock-releasing. > > So if a reader dropped the read-lock while someone else was waiting on > the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the > lock together with it, the reader also does clear_nonspinnable() and > rwsem_wake() afterwards. > But in rwsem_down_write_slowpath(), after we've set > RWSEM_FLAG_WAITERS, we can return successfully immediately once > rwsem_try_write_lock() sees that there are no active readers or > writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg > succeeds). We're not necessarily waiting for the "nonspinnable" bit or > the wake. > > So yeah, I think down_write() can return successfully before up_read() > is done with its memory accesses. Thanks!
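The lock_seq scheme described in the quoted patch (a VMA counts as write-locked while its lock_seq matches mm->lock_seq, and dropping mmap_lock bumps mm->lock_seq to release every marked VMA at once) can be sketched as a toy model. The struct and function names below are simplified stand-ins, not the kernel's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for mm_struct and vm_area_struct. */
struct toy_mm  { int lock_seq; };
struct toy_vma { int lock_seq; };

/* Write-lock a VMA: tag it with the current mm-wide sequence number. */
static void toy_vma_write_lock(struct toy_mm *mm, struct toy_vma *vma)
{
	vma->lock_seq = mm->lock_seq;
}

/* A VMA counts as locked while its tag matches mm->lock_seq. */
static bool toy_vma_is_locked(struct toy_mm *mm, struct toy_vma *vma)
{
	return vma->lock_seq == mm->lock_seq;
}

/* Dropping mmap_lock bumps mm->lock_seq, unlocking all tagged VMAs at once. */
static void toy_mmap_write_unlock(struct toy_mm *mm)
{
	mm->lock_seq++;
}
```

A single increment of mm->lock_seq is what "unlocks" every VMA marked during that write section, which is the batching property the patch description relies on.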
Re: [PATCH 1/8] S390: Remove sentinel elem from ctl_table arrays
On Wed, Sep 06, 2023 at 12:03:22PM +0200, Joel Granados via B4 Relay wrote: > From: Joel Granados > > This commit comes at the tail end of a greater effort to remove the > empty elements at the end of the ctl_table arrays (sentinels) which > will reduce the overall build time size of the kernel and run time > memory bloat by ~64 bytes per sentinel (further information Link : > https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/) > > Remove the sentinel element from appldata_table, s390dbf_table, > topology_ctl_table, cmm_table and page_table_sysctl. Reduced the > memory allocation in appldata_register_ops by 1 effectively removing the > sentinel from ops->ctl_table. > > Signed-off-by: Joel Granados > --- > arch/s390/appldata/appldata_base.c | 6 ++ > arch/s390/kernel/debug.c | 3 +-- > arch/s390/kernel/topology.c| 3 +-- > arch/s390/mm/cmm.c | 3 +-- > arch/s390/mm/pgalloc.c | 3 +-- > 5 files changed, 6 insertions(+), 12 deletions(-) Tested-by: Alexander Gordeev
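The sentinel removal described in the quoted patch swaps "iterate until an empty terminator element" for an explicitly sized array, saving one element per table. A generic sketch of the two conventions (toy entry type, not the real ctl_table):

```c
#include <assert.h>
#include <stddef.h>

struct toy_entry { const char *procname; };

/* Old convention: walk until the empty sentinel entry. */
static size_t count_with_sentinel(const struct toy_entry *t)
{
	size_t n = 0;

	while (t[n].procname)
		n++;
	return n;
}

static const struct toy_entry with_sentinel[] = {
	{ "a" }, { "b" }, { 0 }	/* sentinel costs one extra element */
};

/* New convention: the size is carried explicitly, no terminator needed. */
static const struct toy_entry no_sentinel[] = { { "a" }, { "b" } };
#define NO_SENTINEL_LEN (sizeof(no_sentinel) / sizeof(no_sentinel[0]))
```

The ~64 bytes per sentinel quoted in the patch corresponds to one full ctl_table element; in this toy model the saving shows up as one `struct toy_entry`.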
Re: [PATCH rfc v2 04/10] s390: mm: use try_vma_locked_page_fault()
On Mon, Aug 21, 2023 at 08:30:50PM +0800, Kefeng Wang wrote: > Use new try_vma_locked_page_fault() helper to simplify code. > No functional change intended. > > Signed-off-by: Kefeng Wang > --- > arch/s390/mm/fault.c | 66 ++-- > 1 file changed, 27 insertions(+), 39 deletions(-) > > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c > index 099c4824dd8a..fbbdebde6ea7 100644 > --- a/arch/s390/mm/fault.c > +++ b/arch/s390/mm/fault.c > @@ -357,16 +357,18 @@ static noinline void do_fault_error(struct pt_regs > *regs, vm_fault_t fault) > static inline vm_fault_t do_exception(struct pt_regs *regs, int access) > { > struct gmap *gmap; > - struct task_struct *tsk; > - struct mm_struct *mm; > struct vm_area_struct *vma; > enum fault_type type; > - unsigned long address; > - unsigned int flags; > + struct mm_struct *mm = current->mm; > + unsigned long address = get_fault_address(regs); > vm_fault_t fault; > bool is_write; > + struct vm_fault vmf = { > + .real_address = address, > + .flags = FAULT_FLAG_DEFAULT, > + .vm_flags = access, > + }; > > - tsk = current; > /* >* The instruction that caused the program check has >* been nullified. Don't signal single step via SIGTRAP. 
> @@ -376,8 +378,6 @@ static inline vm_fault_t do_exception(struct pt_regs > *regs, int access) > if (kprobe_page_fault(regs, 14)) > return 0; > > - mm = tsk->mm; > - address = get_fault_address(regs); > is_write = fault_is_write(regs); > > /* > @@ -398,45 +398,33 @@ static inline vm_fault_t do_exception(struct pt_regs > *regs, int access) > } > > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); > - flags = FAULT_FLAG_DEFAULT; > if (user_mode(regs)) > - flags |= FAULT_FLAG_USER; > + vmf.flags |= FAULT_FLAG_USER; > if (is_write) > - access = VM_WRITE; > - if (access == VM_WRITE) > - flags |= FAULT_FLAG_WRITE; > - if (!(flags & FAULT_FLAG_USER)) > - goto lock_mmap; > - vma = lock_vma_under_rcu(mm, address); > - if (!vma) > - goto lock_mmap; > - if (!(vma->vm_flags & access)) { > - vma_end_read(vma); > - goto lock_mmap; > - } > - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, > regs); > - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > - vma_end_read(vma); > - if (!(fault & VM_FAULT_RETRY)) { > - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > - if (likely(!(fault & VM_FAULT_ERROR))) > - fault = 0; This fault fixup is removed in the new version. > + vmf.vm_flags = VM_WRITE; > + if (vmf.vm_flags == VM_WRITE) > + vmf.flags |= FAULT_FLAG_WRITE; > + > + fault = try_vma_locked_page_fault(&vmf); > + if (fault == VM_FAULT_NONE) > + goto lock_mm; Because VM_FAULT_NONE is set to 0 it gets confused with the success code of 0 returned by a fault handler. In the former case we want to continue, while in the latter - successfully return. I think it applies to all archs. 
> + if (!(fault & VM_FAULT_RETRY)) > goto out; > - } > - count_vm_vma_lock_event(VMA_LOCK_RETRY); > + > /* Quick path to respond to signals */ > if (fault_signal_pending(fault, regs)) { > fault = VM_FAULT_SIGNAL; > goto out; > } > -lock_mmap: > + > +lock_mm: > mmap_read_lock(mm); > > gmap = NULL; > if (IS_ENABLED(CONFIG_PGSTE) && type == GMAP_FAULT) { > gmap = (struct gmap *) S390_lowcore.gmap; > current->thread.gmap_addr = address; > - current->thread.gmap_write_flag = !!(flags & FAULT_FLAG_WRITE); > + current->thread.gmap_write_flag = !!(vmf.flags & > FAULT_FLAG_WRITE); > current->thread.gmap_int_code = regs->int_code & 0x; > address = __gmap_translate(gmap, address); > if (address == -EFAULT) { > @@ -444,7 +432,7 @@ static inline vm_fault_t do_exception(struct pt_regs > *regs, int access) > goto out_up; > } > if (gmap->pfault_enabled) > - flags |= FAULT_FLAG_RETRY_NOWAIT; > + vmf.flags |= FAULT_FLAG_RETRY_NOWAIT; > } > > retry: > @@ -466,7 +454,7 @@ static inline vm_fault_t do_exception(struct pt_regs > *regs, int access) >* we can handle it.. >*/ > fault = VM_FAULT_BADACCESS; > - if (unlikely(!(vma->vm_flags & access))) > + if (unlikely(!(vma->vm_flags & vmf.vm_flags))) > goto out_up; > > /* > @@ -474,10 +462,10 @@ static inline vm_fault_t do_exception(struct pt_regs > *regs, int access) >* make sure we exit gracefully rathe
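The VM_FAULT_NONE remark above is easy to reproduce in a standalone model: with the "did not handle" code defined as 0, a caller cannot tell it apart from the 0 a fault handler returns on success. The enum below is an illustrative subset, not the kernel's definition:

```c
#include <assert.h>

/* Illustrative subset of fault codes, mirroring the quoted patch. */
enum toy_fault {
	TOY_FAULT_NONE  = 0x00,		/* VMA-locked path did not handle the fault */
	TOY_FAULT_OOM   = 0x01,
	TOY_FAULT_RETRY = 0x400,
};

/* A handler that fully succeeds returns 0 -- the same value as
 * TOY_FAULT_NONE, so "handled successfully" and "not handled at all"
 * collapse into one return value. */
static enum toy_fault handle_fault_success(void)
{
	return 0;
}
```

Giving the "not handled" case a dedicated nonzero bit would remove the ambiguity, which is the gist of the review comment.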
Re: [PATCH rfc v2 01/10] mm: add a generic VMA lock-based page fault handler
On Mon, Aug 21, 2023 at 08:30:47PM +0800, Kefeng Wang wrote: Hi Kefeng, > The ARCH_SUPPORTS_PER_VMA_LOCK are enabled by more and more architectures, > eg, x86, arm64, powerpc and s390, and riscv, those implementation are very > similar which results in some duplicated codes, let's add a generic VMA > lock-based page fault handler try_to_vma_locked_page_fault() to eliminate > them, and which also make us easy to support this on new architectures. > > Since different architectures use different way to check vma whether is > accessable or not, the struct pt_regs, page fault error code and vma flags > are added into struct vm_fault, then, the architecture's page fault code > could re-use struct vm_fault to record and check vma accessable by each > own implementation. > > Signed-off-by: Kefeng Wang > --- > include/linux/mm.h | 17 + > include/linux/mm_types.h | 2 ++ > mm/memory.c | 39 +++ > 3 files changed, 58 insertions(+) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 3f764e84e567..22a6f4c56ff3 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -512,9 +512,12 @@ struct vm_fault { > pgoff_t pgoff; /* Logical page offset based on > vma */ > unsigned long address; /* Faulting virtual address - > masked */ > unsigned long real_address; /* Faulting virtual address - > unmasked */ > + unsigned long fault_code; /* Faulting error code during > page fault */ > + struct pt_regs *regs; /* The registers stored during > page fault */ > }; > enum fault_flag flags; /* FAULT_FLAG_xxx flags >* XXX: should really be 'const' */ > + vm_flags_t vm_flags;/* VMA flags to be used for access > checking */ > pmd_t *pmd; /* Pointer to pmd entry matching >* the 'address' */ > pud_t *pud; /* Pointer to pud entry matching > @@ -774,6 +777,9 @@ static inline void assert_fault_locked(struct vm_fault > *vmf) > struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, > unsigned long address); > > +bool arch_vma_access_error(struct vm_area_struct *vma, struct 
vm_fault *vmf); > +vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf); > + > #else /* CONFIG_PER_VMA_LOCK */ > > static inline bool vma_start_read(struct vm_area_struct *vma) > @@ -801,6 +807,17 @@ static inline void assert_fault_locked(struct vm_fault > *vmf) > mmap_assert_locked(vmf->vma->vm_mm); > } > > +static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, > + unsigned long address) > +{ > + return NULL; > +} > + > +static inline vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf) > +{ > + return VM_FAULT_NONE; > +} > + > #endif /* CONFIG_PER_VMA_LOCK */ > > extern const struct vm_operations_struct vma_dummy_vm_ops; > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index f5ba5b0bc836..702820cea3f9 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1119,6 +1119,7 @@ typedef __bitwise unsigned int vm_fault_t; > * fault. Used to decide whether a process gets delivered SIGBUS or > * just gets major/minor fault counters bumped up. 
> * > + * @VM_FAULT_NONE: Special case, not starting to handle fault > * @VM_FAULT_OOM:Out Of Memory > * @VM_FAULT_SIGBUS: Bad access > * @VM_FAULT_MAJOR: Page read from storage > @@ -1139,6 +1140,7 @@ typedef __bitwise unsigned int vm_fault_t; > * > */ > enum vm_fault_reason { > + VM_FAULT_NONE = (__force vm_fault_t)0x00, > VM_FAULT_OOM= (__force vm_fault_t)0x01, > VM_FAULT_SIGBUS = (__force vm_fault_t)0x02, > VM_FAULT_MAJOR = (__force vm_fault_t)0x04, > diff --git a/mm/memory.c b/mm/memory.c > index 3b4aaa0d2fff..60fe35db5134 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -5510,6 +5510,45 @@ struct vm_area_struct *lock_vma_under_rcu(struct > mm_struct *mm, > count_vm_vma_lock_event(VMA_LOCK_ABORT); > return NULL; > } > + > +#ifdef CONFIG_PER_VMA_LOCK > +bool __weak arch_vma_access_error(struct vm_area_struct *vma, struct > vm_fault *vmf) > +{ > + return (vma->vm_flags & vmf->vm_flags) == 0; > +} > +#endif > + > +vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf) > +{ > + vm_fault_t fault = VM_FAULT_NONE; > + struct vm_area_struct *vma; > + > + if (!(vmf->flags & FAULT_FLAG_USER)) > + return fault; > + > + vma = lock_vma_under_rcu(current->mm, vmf->real_address); > + if (!vma) > + return fault; > + > + if (arch_vma_access_error(vma, vmf)) { > + vma_end_read(vma); > +
Re: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page
VM)) > + return; > + if (!mask && list_empty(&page->lru)) > return; > snprintf(msg, sizeof(msg), >"Invalid pgtable %p release half 0x%02x mask 0x%02x", > @@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, > void *table, > dump_page(page, msg); > } > > +static void pte_free_now(struct rcu_head *head) > +{ > + struct page *page; > + > + page = container_of(head, struct page, rcu_head); > + pgtable_pte_page_dtor(page); > + __free_page(page); > +} > + > void page_table_free(struct mm_struct *mm, unsigned long *table) > { > unsigned int mask, bit, half; > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned > long *table) >*/ > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24)); > mask >>= 24; > - if (mask & 0x03U) > + if ((mask & 0x03U) && !PageActive(page)) { > + /* > + * Other half is allocated, and neither half has had > + * its free deferred: add page to head of list, to make > + * this freed half available for immediate reuse. > + */ > list_add(&page->lru, &mm->context.pgtable_list); > - else > - list_del(&page->lru); > + } else { > + /* If page is on list, now remove it. 
*/ > + list_del_init(&page->lru); > + } > spin_unlock_bh(&mm->context.lock); > mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24)); > mask >>= 24; > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long > *table) > } > > page_table_release_check(page, table, half, mask); > - pgtable_pte_page_dtor(page); > - __free_page(page); > + if (TestClearPageActive(page)) > + call_rcu(&page->rcu_head, pte_free_now); > + else > + pte_free_now(&page->rcu_head); > } > > void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table, > @@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, > unsigned long *table, >*/ > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24)); > mask >>= 24; > - if (mask & 0x03U) > + if ((mask & 0x03U) && !PageActive(page)) { > + /* > + * Other half is allocated, and neither half has had > + * its free deferred: add page to end of list, to make > + * this freed half available for reuse once its pending > + * bit has been cleared by __tlb_remove_table(). > + */ > list_add_tail(&page->lru, &mm->context.pgtable_list); > - else > - list_del(&page->lru); > + } else { > + /* If page is on list, now remove it. 
*/ > + list_del_init(&page->lru); > + } > spin_unlock_bh(&mm->context.lock); > table = (unsigned long *) ((unsigned long) table | (0x01U << bit)); > tlb_remove_table(tlb, table); > @@ -403,10 +441,28 @@ void __tlb_remove_table(void *_table) > } > > page_table_release_check(page, table, half, mask); > - pgtable_pte_page_dtor(page); > - __free_page(page); > + if (TestClearPageActive(page)) > + call_rcu(&page->rcu_head, pte_free_now); > + else > + pte_free_now(&page->rcu_head); > } > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) > +{ > + struct page *page; > + > + page = virt_to_page(pgtable); > + SetPageActive(page); > + page_table_free(mm, (unsigned long *)pgtable); > + /* > + * page_table_free() does not do the pgste gmap_unlink() which > + * page_table_free_rcu() does: warn us if pgste ever reaches here. > + */ > + WARN_ON_ONCE(mm_alloc_pgste(mm)); > +} > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + > /* > * Base infrastructure required to generate basic asces, region, segment, > * and page tables that do not make use of enhanced features like EDAT1. Tested-by: Alexander Gordeev Acked-by: Alexander Gordeev
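The deferral handshake in the quoted s390 patch reduces to a flag protocol: pte_free_defer() sets PageActive on the page, and the free paths test-and-clear it to pick call_rcu() over an immediate free. A toy model of just that decision (the names and the freed_via bookkeeping are illustrative, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool active;	/* stands in for the PageActive flag */
	int freed_via;	/* 0 = not freed, 1 = immediate, 2 = RCU-deferred */
};

static void toy_set_active(struct toy_page *p)
{
	p->active = true;
}

/* Mirrors TestClearPageActive(): return the old value, clear the flag. */
static bool toy_test_clear_active(struct toy_page *p)
{
	bool was = p->active;

	p->active = false;
	return was;
}

/* Mirrors the tail of page_table_free(): halves marked by
 * pte_free_defer() are freed via call_rcu(), others immediately. */
static void toy_free(struct toy_page *p)
{
	if (toy_test_clear_active(p))
		p->freed_via = 2;	/* call_rcu(&page->rcu_head, pte_free_now) */
	else
		p->freed_via = 1;	/* pte_free_now(&page->rcu_head) */
}
```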
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
pe == KEXEC_TYPE_CRASH && > !kdump_csum_valid(image)) > | ^~ > arch/s390/kernel/machine_kexec.c:280:28: error: 'KEXEC_TYPE_CRASH' > undeclared (first use in this function); did you mean 'KEXEC_ON_CRASH'? > 280 | if (image->type == KEXEC_TYPE_CRASH && > !kdump_csum_valid(image)) > |^~~~ > |KEXEC_ON_CRASH > arch/s390/kernel/machine_kexec.c:280:66: error: passing argument 1 of > 'kdump_csum_valid' from incompatible pointer type > [-Werror=incompatible-pointer-types] > 280 | if (image->type == KEXEC_TYPE_CRASH && > !kdump_csum_valid(image)) > | > ^ > | | > | > struct kimage * > arch/s390/kernel/machine_kexec.c:120:45: note: expected 'struct kimage *' > but argument is of type 'struct kimage *' > 120 | static bool kdump_csum_valid(struct kimage *image) > | ~~~^ > cc1: some warnings being treated as errors > > I don't think this change is equivalent for s390, which had > > config KEXEC > def_bool y > select KEXEC_CORE > > but it is now the equivalent of > > config KEXEC > bool "Enable kexec system call" > default y > > which enables KEXEC by default but it also allows KEXEC to be disabled > for s390 now, because it is a user-visible symbol, not one that is > unconditionally enabled no matter what. 
If s390 can tolerate KEXEC being > user selectable, then I assume the fix is just adjusting > arch/s390/kernel/Makefile to only build the machine_kexec files when > CONFIG_KEXEC_CORE is set: > > diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile > index 6b2a051e1f8a..a06b39da95f0 100644 > --- a/arch/s390/kernel/Makefile > +++ b/arch/s390/kernel/Makefile > @@ -37,10 +37,10 @@ CFLAGS_unwind_bc.o+= -fno-optimize-sibling-calls > obj-y:= head64.o traps.o time.o process.o earlypgm.o early.o setup.o > idle.o vtime.o > obj-y+= processor.o syscall.o ptrace.o signal.o cpcmd.o ebcdic.o > nmi.o > obj-y+= debug.o irq.o ipl.o dis.o diag.o vdso.o cpufeature.o > -obj-y+= sysinfo.o lgr.o os_info.o machine_kexec.o > +obj-y+= sysinfo.o lgr.o os_info.o > obj-y+= runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o > sthyi.o > obj-y+= entry.o reipl.o relocate_kernel.o kdebugfs.o alternative.o > -obj-y+= nospec-branch.o ipl_vmparm.o machine_kexec_reloc.o > unwind_bc.o > +obj-y+= nospec-branch.o ipl_vmparm.o unwind_bc.o > obj-y+= smp.o text_amode31.o stacktrace.o abs_lowcore.o > > extra-y += vmlinux.lds > @@ -66,6 +66,7 @@ obj-$(CONFIG_CRASH_DUMP)+= crash_dump.o > obj-$(CONFIG_UPROBES)+= uprobes.o > obj-$(CONFIG_JUMP_LABEL) += jump_label.o > > +obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o machine_kexec_reloc.o > obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o kexec_image.o > obj-$(CONFIG_KEXEC_FILE) += kexec_elf.o > > > Otherwise, the prompt for KEXEC could be made conditional on some ARCH > symbol so that architectures can opt out of it. Hi Nathan, Thanks a lot for looking into it! With a few modifications the fix would look like below. It probably needs to be a prerequisite for this series: [PATCH] s390/kexec: make machine_kexec depend on CONFIG_KEXEC_CORE Make machine_kexec.o and relocate_kernel.o depend on the CONFIG_KEXEC_CORE option as other architectures do.
Still generate machine_kexec_reloc.o unconditionally, since the arch_kexec_do_relocs() function is needed by the decompressor. Probably, the #include could be removed from the machine_kexec_reloc.c source as well, but that would revert commit 155e6c706125 ("s390/kexec: add missing include to machine_kexec_reloc.c"). Suggested-by: Nathan Chancellor Reported-by: Nathan Chancellor Reported-by: Linux Kernel Functional Testing Signed-off-by: Alexander Gordeev --- arch/s390/kernel/Makefile | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile index 8d7514c72bb8..0df2b88cc0da 100644 --- a/arch/s390/kernel/Makefile +++ b/arch/s390/kernel/Makefile @@ -37,9 +37,9 @@ CFLAGS_unwind_bc.o+= -fno-optimize-sibling-calls obj-y := head64.o traps.o time.o process.o earlypgm.o early.o setup.o idle.o vtime.o obj-y += processor.o syscall.o ptrace.o signal.o cpcmd.o ebcdic.o nmi.o obj-y += debug.o irq.o ipl.o dis.o diag.o vdso.o cpufeature.o -obj-y += sysinfo.o lgr.o os_info.o machine_kexec.o +obj-y += sysinfo.o lgr.o os_info.o obj-y += runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o sthyi.o -obj-y += entry.o reipl.o relocate_kernel.o kdebugfs.o alternative.o +obj-y += entry.o reipl.o kdebugfs.o alternative.o obj-y += nospec-branch.o ipl_vmparm.o machine_kexec_reloc.o unwind_bc.o obj-y += smp.o text_amode31.o stacktrace.o abs_lowcore.o @@ -63,6 +63,7 @@ obj-$(CONFIG_RETHOOK) += rethook.o obj-$(CONFIG_FUNCTION_TRACER) += ftrace.o obj-$(CONFIG_FUNCTION_TRACER) += mcount.o obj-$(CONFIG_CRASH_DUMP) += crash_dump.o +obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o obj-$(CONFIG_UPROBES) += uprobes.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o > Cheers, > Nathan Thanks!
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote: > On Thu, 29 Jun 2023, Hugh Dickins wrote: Hi Hugh, ... > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) > +{ > + struct page *page; If I got your and Claudio's conversation right, you were going to add a WARN_ON_ONCE() here in case of mm_alloc_pgste(mm)? > + page = virt_to_page(pgtable); > + SetPageActive(page); > + page_table_free(mm, (unsigned long *)pgtable); > +} > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + > /* > * Base infrastructure required to generate basic asces, region, segment, > * and page tables that do not make use of enhanced features like EDAT1. Thanks!
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote: > On Thu, 29 Jun 2023, Hugh Dickins wrote: Hi Hugh, ... > No, not quite the same rules as before: I came to realize that using > list_add_tail() for the HH pages would be liable to put a page on the > list which forever blocked reuse of PP list_add_tail() pages after it > (could be solved by a list_move() somewhere, but we have agreed to > prefer simplicity). Just to make things clearer for me: do I understand correctly that this was an attempt to add HH fragments to pgtable_list from pte_free_defer()? Thanks!
Re: [PATCH v3 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On Mon, Jun 26, 2023 at 12:13:31PM -0400, Eric DeVolder wrote: > The kexec and crash kernel options are provided in the common > kernel/Kconfig.kexec. Utilize the common options and provide > the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the > equivalent set of KEXEC and CRASH options. > > NOTE: The original Kconfig has a KEXEC_SIG which depends on > MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT > dependency (using the strategy outlined in this series, and other > techniques) results in 'error: recursive dependency detected' > on CRYPTO. > > Per Alexander Gordeev : "the MODULE_SIG_FORMAT > dependency was introduced with [git commit below] and in fact was not > necessary, since s390 did/does not use mod_check_sig() anyway. > > commit c8424e776b09 ("MODSIGN: Export module signature definitions") > > MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But > SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping > MODULE_SIG_FORMAT does not hurt." > > Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency > from KEXEC_SIG. Still results in equivalent .config files for s390. > > Signed-off-by: Eric DeVolder > --- > arch/s390/Kconfig | 65 ++- > 1 file changed, 19 insertions(+), 46 deletions(-) > > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index 6dab9c1be508..58dc124433ca 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -243,6 +243,25 @@ config PGTABLE_LEVELS > > source "kernel/livepatch/Kconfig" > > +config ARCH_DEFAULT_KEXEC > + def_bool y > + > +config ARCH_SUPPORTS_KEXEC > + def_bool y > + > +config ARCH_SUPPORTS_KEXEC_FILE > + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 > + > +config ARCH_HAS_KEXEC_PURGATORY > + def_bool KEXEC_FILE > + > +config ARCH_SUPPORTS_CRASH_DUMP > + def_bool y > + help > + Refer to for more details on > this. > + This option also enables s390 zfcpdump. 
> + See also > + > menu "Processor type and features" > > config HAVE_MARCH_Z10_FEATURES > @@ -481,36 +500,6 @@ config SCHED_TOPOLOGY > > source "kernel/Kconfig.hz" > > -config KEXEC > - def_bool y > - select KEXEC_CORE > - > -config KEXEC_FILE > - bool "kexec file based system call" > - select KEXEC_CORE > - depends on CRYPTO > - depends on CRYPTO_SHA256 > - depends on CRYPTO_SHA256_S390 > - help > - Enable the kexec file based system call. In contrast to the normal > - kexec system call this system call takes file descriptors for the > - kernel and initramfs as arguments. > - > -config ARCH_HAS_KEXEC_PURGATORY > - def_bool y > - depends on KEXEC_FILE > - > -config KEXEC_SIG > - bool "Verify kernel signature during kexec_file_load() syscall" > - depends on KEXEC_FILE && MODULE_SIG_FORMAT > - help > - This option makes kernel signature verification mandatory for > - the kexec_file_load() syscall. > - > - In addition to that option, you need to enable signature > - verification for the corresponding kernel image type being > - loaded in order for this to work. > - > config KERNEL_NOBP > def_bool n > prompt "Enable modified branch prediction for the kernel by default" > @@ -732,22 +721,6 @@ config VFIO_AP > > endmenu > > -menu "Dump support" > - > -config CRASH_DUMP > - bool "kernel crash dumps" > - select KEXEC > - help > - Generate crash dump after being started by kexec. > - Crash dump kernels are loaded in the main kernel with kexec-tools > - into a specially reserved region and then later executed after > - a crash by kdump/kexec. > - Refer to for more details on > this. > - This option also enables s390 zfcpdump. > - See also > - > -endmenu > - > config CCW > def_bool y Acked-by: Alexander Gordeev
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote: > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT) > Hugh Dickins wrote: Hi Gerald, Hugh! ... > @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table) > __free_page(page); > } > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +static void pte_free_now0(struct rcu_head *head); > +static void pte_free_now1(struct rcu_head *head); What about pte_free_lower() / pte_free_upper()? ... > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) > +{ > + unsigned int bit, mask; > + struct page *page; > + > + page = virt_to_page(pgtable); > + if (mm_alloc_pgste(mm)) { > + /* > + * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in > + * page_table_free_rcu()? > + * If yes -> need addr parameter here, like in pte_free_tlb(). > + */ > + call_rcu(&page->rcu_head, pte_free_pgste); > + return; > +} > + bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * > sizeof(pte_t)); > + > + spin_lock_bh(&mm->context.lock); > + mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24)); This makes the bit logic increasingly complicated to me. What if instead we set the rule "one bit at a time only"? That means an atomic group bit flip is only allowed between pairs of bits, namely: P <- A, initiated from page_table_free() and page_table_free_rcu(); H <- A, initiated from pte_free_defer(); P <- H, initiated from pte_free_half(). In the current model the P bit could be set together with the H bit simultaneously. That actually brings nothing to the equation. Besides, this check in page_table_alloc() (while still correct) makes one (well, me) wonder "what about HH bits?": mask = (mask | (mask >> 4)) & 0x03U; if (mask != 0x03U) { ... } By contrast, with the "one bit at a time only" policy each of the three bits effectively indicates which state a page half is currently in. Thanks!
Re: [PATCH v2 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On Wed, Jun 21, 2023 at 12:10:49PM -0500, Eric DeVolder wrote: Hi Eric, ... > > > NOTE: The original Kconfig has a KEXEC_SIG which depends on > > > MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT > > > dependency (using the strategy outlined in this series, and other > > > techniques) results in 'error: recursive dependency detected' > > > on CRYPTO. This occurs due to any path through KEXEC_SIG > > > attempting to select CRYPTO is ultimately dependent upon CRYPTO: > > > > > > CRYPTO > > ><- ARCH_SUPPORTS_KEXEC_FILE > > > <- KEXEC_FILE > > > <- KEXEC_SIG > > > > > > Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency > > > for KEXEC_SIG. In practice, however, MODULE_SIG_FORMAT is still > > > configured-in as the use of KEXEC_SIG is in step with the use of > > > SYSTEM_DATA_VERIFICATION, which does select MODULE_SIG_FORMAT. > > > > No, it is actually the other way around. > > Could you please provide the correct explanation? > > > > AFAICT the MODULE_SIG_FORMAT dependency was introduced with commit > > c8424e776b09 ("MODSIGN: Export module signature definitions") and > > in fact was not necessary, since s390 did/does not use mod_check_sig() > > anyway. So the SYSTEM_DATA_VERIFICATION could have left intact. > > Thomas, would the correct explanation be simply indicating that > MODULE_SIG_FORMAT isn't needed as it is not used by s390 (crediting your > summary above)? I guess, you asked me? Anyway, I will try to answer as if I were Thomas :) MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping MODULE_SIG_FORMAT does not hurt. Thanks!
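The recursive dependency under discussion can be illustrated with a deliberately reduced, hypothetical Kconfig sketch. This is not the actual s390 Kconfig: the real options carry more dependencies, and the select of CRYPTO here stands in for the MODULE_SIG_FORMAT -> SYSTEM_DATA_VERIFICATION chain:

```kconfig
# Hypothetical, heavily reduced sketch -- not the actual s390 Kconfig.
config CRYPTO
	bool

config ARCH_SUPPORTS_KEXEC_FILE
	def_bool CRYPTO

config KEXEC_FILE
	bool "kexec file based system call"
	depends on ARCH_SUPPORTS_KEXEC_FILE

config KEXEC_SIG
	bool "Verify kernel signature during kexec_file_load() syscall"
	depends on KEXEC_FILE
	# Any select of CRYPTO from under KEXEC_SIG closes the loop
	# CRYPTO <- ARCH_SUPPORTS_KEXEC_FILE <- KEXEC_FILE <- KEXEC_SIG.
	select CRYPTO
```

On a fragment shaped like this, kconfig would report 'error: recursive dependency detected', which is why the series drops the dependency rather than keeping it.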
Re: [PATCH v2 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On Mon, Jun 19, 2023 at 10:58:00AM -0400, Eric DeVolder wrote: Hi Eric, > The kexec and crash kernel options are provided in the common > kernel/Kconfig.kexec. Utilize the common options and provide > the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the > equivalent set of KEXEC and CRASH options. > > NOTE: The original Kconfig has a KEXEC_SIG which depends on > MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT > dependency (using the strategy outlined in this series, and other > techniques) results in 'error: recursive dependency detected' > on CRYPTO. This occurs due to any path through KEXEC_SIG > attempting to select CRYPTO is ultimately dependent upon CRYPTO: > > CRYPTO > <- ARCH_SUPPORTS_KEXEC_FILE > <- KEXEC_FILE > <- KEXEC_SIG > > Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency > for KEXEC_SIG. In practice, however, MODULE_SIG_FORMAT is still > configured-in as the use of KEXEC_SIG is in step with the use of > SYSTEM_DATA_VERIFICATION, which does select MODULE_SIG_FORMAT. No, it is actually the other way around. Could you please provide the correct explanation? AFAICT the MODULE_SIG_FORMAT dependency was introduced with commit c8424e776b09 ("MODSIGN: Export module signature definitions") and in fact was not necessary, since s390 did/does not use mod_check_sig() anyway. So SYSTEM_DATA_VERIFICATION could have been left intact. However, the original SYSTEM_DATA_VERIFICATION seems sane, and I do not understand why other architectures do not have it as well. Maybe Mimi Zohar (putting on CC) could explain that? It looks like such a dependency actually exists in implicit form (which you picked from x86): In addition to this option, you need to enable signature verification for the corresponding kernel image type being loaded in order for this to work. Does it mean that if an architecture did not enable the signature verification type explicitly the linker could fail - both before and after your series? Thanks!
> Not ideal, but results in equivalent .config files for s390. > > Signed-off-by: Eric DeVolder > --- > arch/s390/Kconfig | 65 ++- > 1 file changed, 19 insertions(+), 46 deletions(-) > > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index 6dab9c1be508..58dc124433ca 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -243,6 +243,25 @@ config PGTABLE_LEVELS > > source "kernel/livepatch/Kconfig" > > +config ARCH_DEFAULT_KEXEC > + def_bool y > + > +config ARCH_SUPPORTS_KEXEC > + def_bool y > + > +config ARCH_SUPPORTS_KEXEC_FILE > + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 > + > +config ARCH_HAS_KEXEC_PURGATORY > + def_bool KEXEC_FILE > + > +config ARCH_SUPPORTS_CRASH_DUMP > + def_bool y > + help > + Refer to for more details on > this. > + This option also enables s390 zfcpdump. > + See also > + > menu "Processor type and features" > > config HAVE_MARCH_Z10_FEATURES > @@ -481,36 +500,6 @@ config SCHED_TOPOLOGY > > source "kernel/Kconfig.hz" > > -config KEXEC > - def_bool y > - select KEXEC_CORE > - > -config KEXEC_FILE > - bool "kexec file based system call" > - select KEXEC_CORE > - depends on CRYPTO > - depends on CRYPTO_SHA256 > - depends on CRYPTO_SHA256_S390 > - help > - Enable the kexec file based system call. In contrast to the normal > - kexec system call this system call takes file descriptors for the > - kernel and initramfs as arguments. > - > -config ARCH_HAS_KEXEC_PURGATORY > - def_bool y > - depends on KEXEC_FILE > - > -config KEXEC_SIG > - bool "Verify kernel signature during kexec_file_load() syscall" > - depends on KEXEC_FILE && MODULE_SIG_FORMAT > - help > - This option makes kernel signature verification mandatory for > - the kexec_file_load() syscall. > - > - In addition to that option, you need to enable signature > - verification for the corresponding kernel image type being > - loaded in order for this to work. 
> - > config KERNEL_NOBP > def_bool n > prompt "Enable modified branch prediction for the kernel by default" > @@ -732,22 +721,6 @@ config VFIO_AP > > endmenu > > -menu "Dump support" > - > -config CRASH_DUMP > - bool "kernel crash dumps" > - select KEXEC > - help > - Generate crash dump after being started by kexec. > - Crash dump kernels are loaded in the main kernel with kexec-tools > - into a specially reserved region and then later executed after > - a crash by kdump/kexec. > - Refer to for more details on > this. > - This option also enables s390 zfcpdump. > - See also > - > -endmenu > - > config CCW > def_bool y > > -- > 2.31.1 >
Re: [PATCH v1 01/21] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
On Mon, Jun 12, 2023 at 01:27:53PM -0400, Eric DeVolder wrote: ... > +config KEXEC_FILE > + bool "Enable kexec file based system call" > + depends on ARCH_HAS_KEXEC_FILE > + select KEXEC_CORE > + help > + This is new version of kexec system call. This system call is > + file based and takes file descriptors as system call argument > + for kernel and initramfs as opposed to list of segments as > + accepted by previous system call. Which "previous"? I guess "by the kexec system call" would sound clearer. Thanks!
Re: [PATCH] irq_work: consolidate arch_irq_work_raise prototypes
On Tue, May 16, 2023 at 10:02:31PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > The prototype was hidden on x86, which causes a warning: > > kernel/irq_work.c:72:13: error: no previous prototype for > 'arch_irq_work_raise' [-Werror=missing-prototypes] > > Fix this by providing it in only one place that is always visible. > > Signed-off-by: Arnd Bergmann > --- > arch/arm/include/asm/irq_work.h | 2 -- > arch/arm64/include/asm/irq_work.h | 2 -- > arch/csky/include/asm/irq_work.h| 2 +- > arch/powerpc/include/asm/irq_work.h | 1 - > arch/riscv/include/asm/irq_work.h | 2 +- > arch/s390/include/asm/irq_work.h| 2 -- > arch/x86/include/asm/irq_work.h | 1 - > include/linux/irq_work.h| 3 +++ > 8 files changed, 5 insertions(+), 10 deletions(-) ... > diff --git a/arch/s390/include/asm/irq_work.h > b/arch/s390/include/asm/irq_work.h > index 603783766d0a..f00c9f610d5a 100644 > --- a/arch/s390/include/asm/irq_work.h > +++ b/arch/s390/include/asm/irq_work.h > @@ -7,6 +7,4 @@ static inline bool arch_irq_work_has_interrupt(void) > return true; > } > > -void arch_irq_work_raise(void); > - > #endif /* _ASM_S390_IRQ_WORK_H */ ... > diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h > index 8cd11a223260..136f2980cba3 100644 > --- a/include/linux/irq_work.h > +++ b/include/linux/irq_work.h > @@ -66,6 +66,9 @@ void irq_work_sync(struct irq_work *work); > void irq_work_run(void); > bool irq_work_needs_cpu(void); > void irq_work_single(void *arg); > + > +void arch_irq_work_raise(void); > + > #else > static inline bool irq_work_needs_cpu(void) { return false; } > static inline void irq_work_run(void) { } For s390: Reviewed-by: Alexander Gordeev
Re: [PATCH] procfs: consolidate arch_report_meminfo declaration
On Tue, May 16, 2023 at 09:57:29PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > The arch_report_meminfo() function is provided by four architectures, > with a __weak fallback in procfs itself. On architectures that don't > have a custom version, the __weak version causes a warning because > of the missing prototype. > > Remove the architecture specific prototypes and instead add one > in linux/proc_fs.h. > > Signed-off-by: Arnd Bergmann > --- > arch/parisc/include/asm/pgtable.h| 3 --- > arch/powerpc/include/asm/pgtable.h | 3 --- > arch/s390/include/asm/pgtable.h | 3 --- > arch/s390/mm/pageattr.c | 1 + > arch/x86/include/asm/pgtable.h | 1 + > arch/x86/include/asm/pgtable_types.h | 3 --- > arch/x86/mm/pat/set_memory.c | 1 + > include/linux/proc_fs.h | 2 ++ > 8 files changed, 5 insertions(+), 12 deletions(-) ... > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h > index 6822a11c2c8a..c55f3c3365af 100644 > --- a/arch/s390/include/asm/pgtable.h > +++ b/arch/s390/include/asm/pgtable.h > @@ -42,9 +42,6 @@ static inline void update_page_count(int level, long count) > atomic_long_add(count, &direct_pages_count[level]); > } > > -struct seq_file; > -void arch_report_meminfo(struct seq_file *m); > - > /* > * The S390 doesn't have any external MMU info: the kernel page > * tables contain all the necessary information. > diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c > index 5ba3bd8a7b12..ca5a418c58a8 100644 > --- a/arch/s390/mm/pageattr.c > +++ b/arch/s390/mm/pageattr.c > @@ -4,6 +4,7 @@ > * Author(s): Jan Glauber > */ > #include > +#include > #include > #include > #include For s390: Reviewed-by: Alexander Gordeev
Re: [PATCH 16/23] s390: gmap use pte_unmap_unlock() not spin_unlock()
On Tue, May 09, 2023 at 10:02:32PM -0700, Hugh Dickins wrote: > pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to > keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(), > and use pte_unmap_unlock() instead of direct spin_unlock() (even though > ptep ends up unused inside the macro). > > Signed-off-by: Hugh Dickins > --- > arch/s390/mm/gmap.c | 22 +++--- > 1 file changed, 11 insertions(+), 11 deletions(-) Acked-by: Alexander Gordeev
Re: [PATCH] cputime: remove cputime_to_nsecs fallback
On Tue, Dec 20, 2022 at 05:07:05PM +1000, Nicholas Piggin wrote: > The archs that use cputime_to_nsecs() internally provide their own > definition and don't need the fallback. cputime_to_usecs() unused except > in this fallback, and is not defined anywhere. > > This removes the final remnant of the cputime_t code from the kernel. > > Cc: Peter Zijlstra > Cc: Rik van Riel > Cc: Thomas Gleixner > Cc: Frederic Weisbecker > Cc: Sven Schnelle > Cc: Arnd Bergmann > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-s...@vger.kernel.org > Signed-off-by: Nicholas Piggin > --- > This required a couple of tweaks to s390 includes so we're not pulling > asm/cputime.h into the core header unnecessarily. In that case maybe > this can go via s390 tree because the patch should be otherwise quite > trivial. Could it get an ack or two from a core maintainer to support > that? > > Thanks, > Nick > > arch/s390/kernel/idle.c | 2 +- > arch/s390/kernel/vtime.c | 2 +- > include/linux/sched/cputime.h | 9 - > kernel/sched/cputime.c| 4 > 4 files changed, 6 insertions(+), 11 deletions(-) > > diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c > index 4bf1ee293f2b..a6bbceaf7616 100644 > --- a/arch/s390/kernel/idle.c > +++ b/arch/s390/kernel/idle.c > @@ -12,9 +12,9 @@ > #include > #include > #include > -#include > #include > #include > +#include > #include > #include > #include "entry.h" > diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c > index 9436f3053b88..e0a88dcaf5cb 100644 > --- a/arch/s390/kernel/vtime.c > +++ b/arch/s390/kernel/vtime.c > @@ -7,13 +7,13 @@ > */ > > #include > -#include > #include > #include > #include > #include > #include > #include > +#include > #include > #include > #include > diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h > index ce3c58286062..5f8fd5b24a2e 100644 > --- a/include/linux/sched/cputime.h > +++ b/include/linux/sched/cputime.h > @@ -8,15 +8,6 @@ > * cputime accounting APIs: > */ > > -#ifdef 
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE > -#include > - > -#ifndef cputime_to_nsecs > -# define cputime_to_nsecs(__ct) \ > - (cputime_to_usecs(__ct) * NSEC_PER_USEC) > -#endif > -#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */ > - > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > extern bool task_cputime(struct task_struct *t, >u64 *utime, u64 *stime); > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index 95fc77853743..af7952f12e6c 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -3,6 +3,10 @@ > * Simple CPU accounting cgroup controller > */ > > +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE > + #include > +#endif > + > #ifdef CONFIG_IRQ_TIME_ACCOUNTING > > /* For s390: Acked-by: Alexander Gordeev
Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
On Thu, Dec 03, 2020 at 09:14:22AM -0800, Andy Lutomirski wrote: > > > > On Dec 3, 2020, at 9:09 AM, Alexander Gordeev > > wrote: > > > > On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote: > >> other arch folk: there's some background here: > > > > >> > >> power: Ridiculously complicated, seems to vary by system and kernel config. > >> > >> So, Nick, your unconditional IPI scheme is apparently a big > >> improvement for power, and it should be an improvement and have low > >> cost for x86. On arm64 and s390x it will add more IPIs on process > >> exit but reduce contention on context switching depending on how lazy > > > > s390 does not invalidate TLBs per-CPU explicitly - we have special > > instructions for that. Those in turn initiate signalling to other > > CPUs, completely transparent to OS. > > Just to make sure I understand: this means that you broadcast flushes to all > CPUs, not just a subset? Correct. If mm has one CPU attached we flush the TLB only for that CPU. If mm has more than one CPU attached we flush all CPUs' TLBs. In fact, the details are a bit more complicated, since the hardware is able to flush subsets of TLB entries depending on provided parameters (e.g. page tables used to create those entries). But we cannot select a CPU subset. > > Apart from mm_count, I am struggling to realize how the suggested > > scheme could change the the contention on s390 in connection with > > TLB. Could you clarify a bit here, please? > > I’m just talking about mm_count. Maintaining mm_count is quite expensive on > some workloads. > > > > >> TLB works. I suppose we could try it for all architectures without > >> any further optimizations. Or we could try one of the perhaps > >> excessively clever improvements I linked above. arm64, s390x people, > >> what do you think? > > > > I do not immediately see anything in the series that would harm > > performance on s390. > > > > We however use mm_cpumask to distinguish between local and global TLB > > flushes.
With this series it looks like mm_cpumask is *required* to > > be consistent with lazy users. And that is something quite diffucult > > for us to adhere (at least in the foreseeable future). > > You don’t actually need to maintain mm_cpumask — we could scan all CPUs > instead. > > > > > But actually keeping track of lazy users in a cpumask is something > > the generic code would rather do AFAICT. > > The problem is that arches don’t agree on what the contents of mm_cpumask > should be. Tracking a mask of exactly what the arch wants in generic code is > a nontrivial operation. It could be yet another cpumask or the CPU scan you mentioned. Just wanted to make sure there is no new requirement for an arch to maintain mm_cpumask ;) Thanks, Andy!
Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote: > other arch folk: there's some background here: > > https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com > > On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski wrote: > > > > On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski wrote: > > > > > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin wrote: > > > > > > > > On big systems, the mm refcount can become highly contented when doing > > > > a lot of context switching with threaded applications (particularly > > > > switching between the idle thread and an application thread). > > > > > > > > Abandoning lazy tlb slows switching down quite a bit in the important > > > > user->idle->user cases, so so instead implement a non-refcounted scheme > > > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down > > > > any remaining lazy ones. > > > > > > > > Shootdown IPIs are some concern, but they have not been observed to be > > > > a big problem with this scheme (the powerpc implementation generated > > > > 314 additional interrupts on a 144 CPU system during a kernel compile). > > > > There are a number of strategies that could be employed to reduce IPIs > > > > if they turn out to be a problem for some workload. > > > > > > I'm still wondering whether we can do even better. > > > > > > > Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes > > the TLB. On x86, this will shoot down all lazies as long as even a > > single pagetable was freed. (Or at least it will if we don't have a > > serious bug, but the code seems okay. We'll hit pmd_free_tlb, which > > sets tlb->freed_tables, which will trigger the IPI.) So, on > > architectures like x86, the shootdown approach should be free. The > > only way it ought to have any excess IPIs is if we have CPUs in > > mm_cpumask() that don't need IPI to free pagetables, which could > > happen on paravirt. 
> > Indeed, on x86, we do this: > > [ 11.558844] flush_tlb_mm_range.cold+0x18/0x1d > [ 11.559905] tlb_finish_mmu+0x10e/0x1a0 > [ 11.561068] exit_mmap+0xc8/0x1a0 > [ 11.561932] mmput+0x29/0xd0 > [ 11.562688] do_exit+0x316/0xa90 > [ 11.563588] do_group_exit+0x34/0xb0 > [ 11.564476] __x64_sys_exit_group+0xf/0x10 > [ 11.565512] do_syscall_64+0x34/0x50 > > and we have info->freed_tables set. > > What are the architectures that have large systems like? > > x86: we already zap lazies, so it should cost basically nothing to do > a little loop at the end of __mmput() to make sure that no lazies are > left. If we care about paravirt performance, we could implement one > of the optimizations I mentioned above to fix up the refcounts instead > of sending an IPI to any remaining lazies. > > arm64: AFAICT arm64's flush uses magic arm64 hardware support for > remote flushes, so any lazy mm references will still exist after > exit_mmap(). (arm64 uses lazy TLB, right?) So this is kind of like > the x86 paravirt case. Are there large enough arm64 systems that any > of this matters? > > s390x: The code has too many acronyms for me to understand it fully, > but I think it's more or less the same situation as arm64. How big do > s390x systems come? > > power: Ridiculously complicated, seems to vary by system and kernel config. > > So, Nick, your unconditional IPI scheme is apparently a big > improvement for power, and it should be an improvement and have low > cost for x86. On arm64 and s390x it will add more IPIs on process > exit but reduce contention on context switching depending on how lazy s390 does not invalidate TLBs per-CPU explicitly - we have special instructions for that. Those in turn initiate signalling to other CPUs, completely transparent to the OS. Apart from mm_count, I am struggling to realize how the suggested scheme could change the contention on s390 in connection with TLB. Could you clarify a bit here, please? > TLB works. 
I suppose we could try it for all architectures without > any further optimizations. Or we could try one of the perhaps > excessively clever improvements I linked above. arm64, s390x people, > what do you think? I do not immediately see anything in the series that would harm performance on s390. We however use mm_cpumask to distinguish between local and global TLB flushes. With this series it looks like mm_cpumask is *required* to be consistent with lazy users. And that is something quite difficult for us to adhere to (at least in the foreseeable future). But actually keeping track of lazy users in a cpumask is something the generic code would rather do AFAICT. Thanks!
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 07:11:16PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote: > > > Or am I way off here, and it really is possible (aside from the current > > s390 situation) to observe something that "is no longer a page table"? > > Yes, that is the issue. Remember there is no locking for GUP > fast. While a page table cannot be freed there is nothing preventing > the page table entry from being concurrently modified. > > Without the stack variable it looks like this: > >pud_t pud = READ_ONCE(*pudp); >if (!pud_present(pud)) > return >pmd_offset(pudp, address); > > And pmd_offset() expands to > > return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); > > Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value > of *pud can change, eg to !pud_present. > > Then pud_page_vaddr(*pud) will crash. It is not use after free, it > is using data that has not been validated. One thing I ask myself, and this is probably a good moment to wonder: what if the entry is still pud_present, but got remapped after READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere? > Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Wed, Sep 09, 2020 at 03:03:24PM -0300, Jason Gunthorpe wrote: > On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote: > > I actually had to draw myself a picture to get some hold of > > this, or rather a walk-through with a certain pud-crossing > > range in a folded 3-level scenario. Not sure if I would have > > understood my explanation above w/o that, but I hope you can > > make some sense out of it. Or draw yourself a picture :-) > > What I don't understand is how does anything work with S390 today? > > If the fix is only to change pxx_addr_end() then than generic code > like mm/pagewalk.c will iterate over a *different list* of page table > entries. > > It's choice of entries to look at is entirely driven by pxx_addr_end(). > > Which suggest to me that mm/pagewalk.c also doesn't work properly > today on S390 and this issue is not really about stack variables? > > Fundamentally if pXX_offset() and pXX_addr_end() must be consistent > together, if pXX_offset() is folded then pXX_addr_end() must cause a > single iteration of that level. Your observation is correct. Another way to describe the problem is that existing pXd_addr_end helpers could be applied to mismatching levels on s390 (e.g. p4d_addr_end applied to a pud, or pgd_addr_end applied to a p4d). As you noticed, all *_pXd_range iterators could be called with address ranges that exceed a single pXd table. However, when that happens with pointers to real page tables (passed to *_pXd_range iterators) we still operate on valid tables, which just (luckily for us) happened to be folded. Thus we still reference correct table entries. It is only the gup_fast case that exposes the issue. It hits because pointers to stack copies are passed to gup_pXd_range iterators, not pointers to the real page tables themselves. As Gerald mentioned, it is very difficult to explain in a clear way. Hopefully, one could make sense out of it. > Jason
Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions
On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote: [...] > >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >index 67ebc22cf83d..d9e7d16c2263 100644 > >--- a/include/linux/pgtable.h > >+++ b/include/linux/pgtable.h > >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm, > > */ > > #ifndef pgd_addr_end > >-#define pgd_addr_end(pgd, addr, end) > >\ > >-({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \ > >-(__boundary - 1 < (end) - 1)? __boundary: (end);\ > >-}) > >+#define pgd_addr_end pgd_addr_end > > I think that #define is pointless, usually there is no such #define > for the default case. The default pgd_addr_end() gets overridden on s390 (arch/s390/include/asm/pgtable.h): #define pgd_addr_end pgd_addr_end static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end) { return rste_addr_end_folded(pgd_val(pgd), addr, end); } > >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, > >unsigned long end) > >+{ unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK; > >+return (__boundary - 1 < end - 1) ? __boundary : end; > >+}
Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote: [...] > You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems. Would this one be okay? diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c index 60c6ea16..3690d22 100644 --- a/arch/powerpc/mm/book3s64/subpage_prot.c +++ b/arch/powerpc/mm/book3s64/subpage_prot.c @@ -88,6 +88,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr, static void subpage_prot_clear(unsigned long addr, unsigned long len) { struct mm_struct *mm = current->mm; + pmd_t *pmd = pmd_off(mm, addr); struct subpage_prot_table *spt; u32 **spm, *spp; unsigned long i; @@ -103,8 +104,8 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len) limit = addr + len; if (limit > spt->maxaddr) limit = spt->maxaddr; - for (; addr < limit; addr = next) { - next = pmd_addr_end(addr, limit); + for (; addr < limit; addr = next, pmd++) { + next = pmd_addr_end(*pmd, addr, limit); if (addr < 0x1UL) { spm = spt->low_prot; } else { @@ -191,6 +192,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, unsigned long, len, u32 __user *, map) { struct mm_struct *mm = current->mm; + pmd_t *pmd = pmd_off(mm, addr); struct subpage_prot_table *spt; u32 **spm, *spp; unsigned long i; @@ -236,8 +238,8 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, } subpage_mark_vma_nohuge(mm, addr, len); - for (limit = addr + len; addr < limit; addr = next) { - next = pmd_addr_end(addr, limit); + for (limit = addr + len; addr < limit; addr = next, pmd++) { + next = pmd_addr_end(*pmd, addr, limit); err = -ENOMEM; if (addr < 0x1UL) { spm = spt->low_prot; Thanks! > Christophe
Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote: > >Yes, and also two more sources :/ > > arch/powerpc/mm/kasan/8xx.c > > arch/powerpc/mm/kasan/kasan_init_32.c > > > >But these two are not quite obvious wrt pgd_addr_end() used > >while traversing pmds. Could you please clarify a bit? > > > > > >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c > >index 2784224..89c5053 100644 > >--- a/arch/powerpc/mm/kasan/8xx.c > >+++ b/arch/powerpc/mm/kasan/8xx.c > >@@ -15,8 +15,8 @@ > > for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block > > += SZ_8M) { > > pte_basic_t *new; > >-k_next = pgd_addr_end(k_cur, k_end); > >-k_next = pgd_addr_end(k_next, k_end); > >+k_next = pmd_addr_end(k_cur, k_end); > >+k_next = pmd_addr_end(k_next, k_end); > > No, I don't think so. > On powerpc32 we have only two levels, so pgd and pmd are more or > less the same. > But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h > is a no-op, so I don't think it will work. > > It is likely that this function should iterate on pgd, then you get > pmd = pmd_offset(pud_offset(p4d_offset(pgd))); It looks like the code iterates over a single pmd table while using pgd_addr_end() only to skip all the middle levels and bail out from the loop. I would be wary of switching from pmds to pgds, since we are trying to minimize impact (especially functional) and the rework does not seem that obvious. Assuming pmd and pgd are the same, would such an approach actually work for now? 
diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c index 2784224..94466cc 100644 --- a/arch/powerpc/mm/kasan/8xx.c +++ b/arch/powerpc/mm/kasan/8xx.c @@ -15,8 +15,8 @@ for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { pte_basic_t *new; - k_next = pgd_addr_end(k_cur, k_end); - k_next = pgd_addr_end(k_next, k_end); + k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end); + k_next = pgd_addr_end(__pgd(pmd_val(*(pmd + 1))), k_next, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index fb29404..c0bcd64 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_ for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) { pte_t *new; - k_next = pgd_addr_end(k_cur, k_end); + k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; @@ -196,7 +196,7 @@ void __init kasan_early_init(void) kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL); do { - next = pgd_addr_end(addr, end); + next = pgd_addr_end(__pgd(pmd_val(*pmd)), addr, end); pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte); } while (pmd++, addr = next, addr != end); Alternatively we could pass invalid pgd to keep the code structure intact, but that of course is less nice. Thanks! > Christophe
Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote: > You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems. Yes, and also two more sources :/ arch/powerpc/mm/kasan/8xx.c arch/powerpc/mm/kasan/kasan_init_32.c But these two are not quite obvious wrt pgd_addr_end() used while traversing pmds. Could you please clarify a bit? diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c index 2784224..89c5053 100644 --- a/arch/powerpc/mm/kasan/8xx.c +++ b/arch/powerpc/mm/kasan/8xx.c @@ -15,8 +15,8 @@ for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { pte_basic_t *new; - k_next = pgd_addr_end(k_cur, k_end); - k_next = pgd_addr_end(k_next, k_end); + k_next = pmd_addr_end(k_cur, k_end); + k_next = pmd_addr_end(k_next, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index fb29404..3f7d6dc6 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_ for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) { pte_t *new; - k_next = pgd_addr_end(k_cur, k_end); + k_next = pmd_addr_end(k_cur, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; @@ -196,7 +196,7 @@ void __init kasan_early_init(void) kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL); do { - next = pgd_addr_end(addr, end); + next = pmd_addr_end(addr, end); pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte); } while (pmd++, addr = next, addr != end); > Christophe
Re: [PATCH V4 0/4] mm/debug_vm_pgtable: Add some more tests
On Mon, Jul 06, 2020 at 06:18:32AM +0530, Anshuman Khandual wrote: [...] > Tested on arm64, x86 platforms but only build tested on all other enabled > platforms through ARCH_HAS_DEBUG_VM_PGTABLE i.e powerpc, arc, s390. The Sorry for not testing this earlier. Works for me on s390. Also, tried with a few relevant config options set/unset. Thanks!
Re: FSL P5020/P5040: DPAA Ethernet issue with the latest Git kernel
On Thu, Jun 25, 2020 at 01:01:52AM +0200, Christian Zigotzky wrote: [...] > I compiled a test kernel with the option "CONFIG_TEST_BITMAP=y" > yesterday. After that Skateman and I booted it and looked for the > bitmap tests with "dmesg | grep -i bitmap". > > Results: > > FSL P5020: > > [    0.297756] test_bitmap: loaded. > [    0.298113] test_bitmap: parselist: 14: input is '0-2047:128/256' > OK, Time: 562 > [    0.298142] test_bitmap: parselist_user: 14: input is > '0-2047:128/256' OK, Time: 761 > [    0.301553] test_bitmap: all 1663 tests passed > > FSL P5040: > > [    0.296563] test_bitmap: loaded. > [    0.296894] test_bitmap: parselist: 14: input is '0-2047:128/256' > OK, Time: 540 > [    0.296920] test_bitmap: parselist_user: 14: input is > '0-2047:128/256' OK, Time: 680 > [    0.24] test_bitmap: all 1663 tests passed Thanks for the test! So it works as expected. I would suggest comparing what is going on during device probing with and without the bisected commit. There seems to be a MAC and PHY mode initialization issue that might have resulted from the bitmap format change. I put Madalin and Sascha on CC as they have done some work on this part recently. Thanks! > Thanks, > Christian
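For readers unfamiliar with the region syntax the test exercises: "start-end:used/group" sets the first 'used' bits of every 'group'-sized block inside [start, end]. The sketch below is a simplified userspace model of that expansion, not the kernel's bitmap_parselist(); the array-of-bytes representation and the function name are illustrative only.

```c
#include <string.h>

#define NBITS 2048

static unsigned char bits[NBITS];	/* one byte per bit, for clarity */

/*
 * Expand a "start-end:used/group" region into the bitmap and return
 * the number of bits set.  For '0-2047:128/256' this sets the first
 * 128 bits of each 256-bit block, i.e. 8 blocks * 128 = 1024 bits.
 */
static int expand_region(int start, int end, int used, int group)
{
	int i, weight = 0;

	memset(bits, 0, sizeof(bits));
	for (i = start; i <= end && i < NBITS; i++)
		if ((i - start) % group < used)
			bits[i] = 1;
	for (i = 0; i < NBITS; i++)
		weight += bits[i];
	return weight;
}
```

The regression under discussion was in how such patterns were materialized on 64-bit big-endian machines, which is why a passing CONFIG_TEST_BITMAP run narrows the search to the consumers of the parsed bitmap rather than the parser itself.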
Re: FSL P5020/P5040: DPAA Ethernet issue with the latest Git kernel
On Sun, Jun 21, 2020 at 08:40:14AM +0200, Christian Zigotzky wrote: > Hello Alexander, > > The DPAA Ethernet doesn’t work anymore on our FSL P5020/P5040 boards [1] > since the RC1 of kernel 5.8 [2]. > We bisected last days [3] and found the problematic commit [4]. I was able to > revert it [5]. After that the DPAA Ethernet works again. I created a patch > for reverting the commit [5]. This patch works and I will use it for the RC2. > Could you please check your commit? [4] Hi Christian, Could you please check if the kernel passes CONFIG_TEST_BITMAP self-test? Thanks! > Thanks, > Christian > > [1] http://wiki.amiga.org/index.php?title=X5000 > [2] https://forum.hyperion-entertainment.com/viewtopic.php?p=50885#p50885 > [3] https://forum.hyperion-entertainment.com/viewtopic.php?p=50892#p50892 > [4] lib: fix bitmap_parse() on 64-bit big endian archs: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=81c4f4d924d5d009b5ed785a3e22b18d0f7b831f > [5] https://forum.hyperion-entertainment.com/viewtopic.php?p=50982#p50982 > >
Re: [PATCH v2 1/3] PCI/MSI/PPC: Remove arch_msi_check_device()
On Mon, Sep 15, 2014 at 12:42:50PM +1000, Michael Ellerman wrote: > On Sun, 2014-09-07 at 20:57 +0200, Alexander Gordeev wrote: > > Moving MSI checks from arch_msi_check_device() function to > > arch_setup_msi_irqs() function makes code more compact and > > allows removing unnecessary hook arch_msi_check_device() > > from generic MSI code. > > > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-...@vger.kernel.org > > Cc: Michael Ellerman > > I already acked the previous version, but if you didn't want it that's fine :) I do want it, but I thought you may not agree with the new changelog ;) > cheers > > -- Regards, Alexander Gordeev agord...@redhat.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 0/3] PCI/MSI: Remove arch_msi_check_device()
Hello, This is a cleanup effort to get rid of arch_msi_check_device() function. I am sending v2 series, since kbuild for v1 reports compile errors on ppc4xx and Armada 370. Still, I have not checked the fixes on these platforms. Changes since v1: - patch 1: 'pdev' undeclared compile error fixed on ppc4xx. I updated changelog and removed ACK from this patch in case there are any objections; - patch 2: 'struct msi_chip' has no 'check_device' error fixed on Armada 370 - armada_370_xp_check_msi_device() hook removed; - patch 3: msi_check_device() renamed to pci_msi_supported(). This is the very same patch I sent earlier; Thanks! Cc: Thomas Gleixner Cc: Jason Cooper Cc: Michael Ellerman Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-...@vger.kernel.org Alexander Gordeev (3): patch 1 - PCI/MSI/PPC: Remove arch_msi_check_device() patch 2 - PCI/MSI/Armada-370-xp: Remove arch_msi_check_device() patch 3 - PCI/MSI: Remove arch_msi_check_device() arch/powerpc/include/asm/machdep.h | 2 -- arch/powerpc/kernel/msi.c | 12 + arch/powerpc/platforms/cell/axon_msi.c | 9 --- arch/powerpc/platforms/powernv/pci.c | 19 - arch/powerpc/platforms/pseries/msi.c | 42 +++-- arch/powerpc/sysdev/fsl_msi.c | 12 +++-- arch/powerpc/sysdev/mpic_pasemi_msi.c | 11 ++-- arch/powerpc/sysdev/mpic_u3msi.c | 28 --- arch/powerpc/sysdev/ppc4xx_hsta_msi.c | 18 + arch/powerpc/sysdev/ppc4xx_msi.c | 19 + drivers/irqchip/irq-armada-370-xp.c| 14 +++--- drivers/pci/msi.c | 49 +- include/linux/msi.h| 3 --- 13 files changed, 67 insertions(+), 171 deletions(-) -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 1/3] PCI/MSI/PPC: Remove arch_msi_check_device()
Moving MSI checks from arch_msi_check_device() function to arch_setup_msi_irqs() function makes code more compact and allows removing unnecessary hook arch_msi_check_device() from generic MSI code. Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-...@vger.kernel.org Cc: Michael Ellerman Signed-off-by: Alexander Gordeev --- arch/powerpc/include/asm/machdep.h | 2 -- arch/powerpc/kernel/msi.c | 12 +- arch/powerpc/platforms/cell/axon_msi.c | 9 arch/powerpc/platforms/powernv/pci.c | 19 --- arch/powerpc/platforms/pseries/msi.c | 42 +- arch/powerpc/sysdev/fsl_msi.c | 12 +++--- arch/powerpc/sysdev/mpic_pasemi_msi.c | 11 ++--- arch/powerpc/sysdev/mpic_u3msi.c | 28 +-- arch/powerpc/sysdev/ppc4xx_hsta_msi.c | 18 +-- arch/powerpc/sysdev/ppc4xx_msi.c | 19 +-- 10 files changed, 50 insertions(+), 122 deletions(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index b125cea..3af7216 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -136,8 +136,6 @@ struct machdep_calls { int (*pci_setup_phb)(struct pci_controller *host); #ifdef CONFIG_PCI_MSI - int (*msi_check_device)(struct pci_dev* dev, - int nvec, int type); int (*setup_msi_irqs)(struct pci_dev *dev, int nvec, int type); void(*teardown_msi_irqs)(struct pci_dev *dev); diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c index 8bbc12d..71bd161 100644 --- a/arch/powerpc/kernel/msi.c +++ b/arch/powerpc/kernel/msi.c @@ -13,7 +13,7 @@ #include -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type) +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) { if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) { pr_debug("msi: Platform doesn't provide MSI callbacks.\n"); @@ -24,16 +24,6 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type) if (type == PCI_CAP_ID_MSI && nvec > 1) return 1; - if (ppc_md.msi_check_device) { - pr_debug("msi: Using platform check routine.\n"); - return 
ppc_md.msi_check_device(dev, nvec, type); - } - -return 0; -} - -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) -{ return ppc_md.setup_msi_irqs(dev, nvec, type); } diff --git a/arch/powerpc/platforms/cell/axon_msi.c b/arch/powerpc/platforms/cell/axon_msi.c index 85825b5..862b327 100644 --- a/arch/powerpc/platforms/cell/axon_msi.c +++ b/arch/powerpc/platforms/cell/axon_msi.c @@ -199,14 +199,6 @@ out_error: return msic; } -static int axon_msi_check_device(struct pci_dev *dev, int nvec, int type) -{ - if (!find_msi_translator(dev)) - return -ENODEV; - - return 0; -} - static int setup_msi_msg_address(struct pci_dev *dev, struct msi_msg *msg) { struct device_node *dn; @@ -416,7 +408,6 @@ static int axon_msi_probe(struct platform_device *device) ppc_md.setup_msi_irqs = axon_msi_setup_msi_irqs; ppc_md.teardown_msi_irqs = axon_msi_teardown_msi_irqs; - ppc_md.msi_check_device = axon_msi_check_device; axon_msi_debug_setup(dn, msic); diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index b854b57..b45c492 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -46,29 +46,21 @@ //#define cfg_dbg(fmt...) printk(fmt) #ifdef CONFIG_PCI_MSI -static int pnv_msi_check_device(struct pci_dev* pdev, int nvec, int type) -{ - struct pci_controller *hose = pci_bus_to_host(pdev->bus); - struct pnv_phb *phb = hose->private_data; - struct pci_dn *pdn = pci_get_pdn(pdev); - - if (pdn && pdn->force_32bit_msi && !phb->msi32_support) - return -ENODEV; - - return (phb && phb->msi_bmp.bitmap) ? 
0 : -ENODEV; -} - static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type) { struct pci_controller *hose = pci_bus_to_host(pdev->bus); struct pnv_phb *phb = hose->private_data; + struct pci_dn *pdn = pci_get_pdn(pdev); struct msi_desc *entry; struct msi_msg msg; int hwirq; unsigned int virq; int rc; - if (WARN_ON(!phb)) + if (WARN_ON(!phb) || !phb->msi_bmp.bitmap) + return -ENODEV; + + if (pdn && pdn->force_32bit_msi && !phb->msi32_support) return -ENODEV; list_for_each_entry(entry, &pdev->msi_list, list) { @@ -860,7 +852,6 @@ void __init pnv_pci_init(void) /* Configure MSI
[PATCH v2 3/3] PCI/MSI: Remove arch_msi_check_device()
There are no archs that override arch_msi_check_device() hook. Remove it as it is completely redundant. If an arch would need to check MSI/MSI-X possibility for a device it should make it within arch_setup_msi_irqs() hook. Cc: Michael Ellerman Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-...@vger.kernel.org Signed-off-by: Alexander Gordeev --- drivers/pci/msi.c | 49 + include/linux/msi.h | 3 --- 2 files changed, 13 insertions(+), 39 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 5a40516..6c2cc41 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -56,16 +56,6 @@ void __weak arch_teardown_msi_irq(unsigned int irq) chip->teardown_irq(chip, irq); } -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type) -{ - struct msi_chip *chip = dev->bus->msi; - - if (!chip || !chip->check_device) - return 0; - - return chip->check_device(chip, dev, nvec, type); -} - int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) { struct msi_desc *entry; @@ -806,22 +796,23 @@ out_free: } /** - * pci_msi_check_device - check whether MSI may be enabled on a device + * pci_msi_supported - check whether MSI may be enabled on a device * @dev: pointer to the pci_dev data structure of MSI device function * @nvec: how many MSIs have been requested ? - * @type: are we checking for MSI or MSI-X ? * * Look at global flags, the device itself, and its parent buses * to determine if MSI/-X are supported for the device. If MSI/-X is * supported return 0, else return an error code. 
**/ -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) +static int pci_msi_supported(struct pci_dev *dev, int nvec) { struct pci_bus *bus; - int ret; /* MSI must be globally enabled and supported by the device */ - if (!pci_msi_enable || !dev || dev->no_msi) + if (!pci_msi_enable) + return -EINVAL; + + if (!dev || dev->no_msi || dev->current_state != PCI_D0) return -EINVAL; /* @@ -843,10 +834,6 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI) return -EINVAL; - ret = arch_msi_check_device(dev, nvec, type); - if (ret) - return ret; - return 0; } @@ -949,13 +936,13 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec) int status, nr_entries; int i, j; - if (!entries || !dev->msix_cap || dev->current_state != PCI_D0) - return -EINVAL; - - status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX); + status = pci_msi_supported(dev, nvec); if (status) return status; + if (!entries) + return -EINVAL; + nr_entries = pci_msix_vec_count(dev); if (nr_entries < 0) return nr_entries; @@ -1062,8 +1049,9 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec) int nvec; int rc; - if (dev->current_state != PCI_D0) - return -EINVAL; + rc = pci_msi_supported(dev, minvec); + if (rc) + return rc; WARN_ON(!!dev->msi_enabled); @@ -1086,17 +1074,6 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec) nvec = maxvec; do { - rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI); - if (rc < 0) { - return rc; - } else if (rc > 0) { - if (rc < minvec) - return -ENOSPC; - nvec = rc; - } - } while (rc); - - do { rc = msi_capability_init(dev, nvec); if (rc < 0) { return rc; diff --git a/include/linux/msi.h b/include/linux/msi.h index 8103f32..dbf7cc9 100644 --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -60,7 +60,6 @@ int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc); void arch_teardown_msi_irq(unsigned int irq); int 
arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); void arch_teardown_msi_irqs(struct pci_dev *dev); -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type); void arch_restore_msi_irqs(struct pci_dev *dev); void default_teardown_msi_irqs(struct pci_dev *dev); @@ -77,8 +76,6 @@ struct msi_chip { int (*setup_irq)(struct msi_chip *chip, struct pci_dev *dev, struct msi_desc *desc); void (*teardown_irq)(struct msi_chip *chip, unsigned int irq); - int (*check_device)(struct msi_chip *chip, struct pci_dev *dev, - int
Re: [PATCH 0/2] PCI/MSI: Remove arch_msi_check_device()
On Fri, Sep 05, 2014 at 03:27:49PM -0600, Bjorn Helgaas wrote: > > I applied these (with Michael's ack on the first, and v2 of the second) to > > pci/msi for v3.18, thanks! Hi Bjorn, I resent a series with updates that fix kbuild robot errors. Hopefully, the rebase for pci/msi would not cause trouble for anyone. > Oh, I forgot -- if you'd rather take the first one through the PPC > tree, you can do that and I can merge the second one later. Let me > know if you want to do that. Nah, your tree is just fine. Thanks! > Bjorn -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH 1/2] PCI/MSI/PPC: Remove arch_msi_check_device()
On Thu, Jul 31, 2014 at 03:53:33PM +0200, Alexander Gordeev wrote: > Michael, Ben, > > Any feedback? Michael, Ben? > Thanks! -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH v2 2/2] PCI/MSI: Remove arch_msi_check_device()
On Mon, Aug 11, 2014 at 02:33:25PM +, bharat.bhus...@freescale.com wrote: > > > > -Original Message- > > From: linux-pci-ow...@vger.kernel.org > > [mailto:linux-pci-ow...@vger.kernel.org] > > On Behalf Of Alexander Gordeev > > Sent: Monday, August 11, 2014 5:16 PM > > To: Bjorn Helgaas > > Cc: linux-ker...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux- > > p...@vger.kernel.org > > Subject: [PATCH v2 2/2] PCI/MSI: Remove arch_msi_check_device() > > > > There are no archs that override arch_msi_check_device() hook. Remove it as > > it > > is completely redundant. > > Is not we are still overriding this in powerpc (not sure I missed some patch > , if so please point to that). Patch 1/2 ;) > $ grep -r arch_msi_check_device arch/powerpc/ > Binary file arch/powerpc/kernel/msi.o matches > arch/powerpc/kernel/msi.c:int arch_msi_check_device(struct pci_dev* dev, int > nvec, int type) > Binary file arch/powerpc/kernel/built-in.o matches > > Thanks > -Bharat > > > > > If an arch would need to check MSI/MSI-X possibility for a device it should > > make > > it within arch_setup_msi_irqs() hook. 
> > > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-...@vger.kernel.org > > Signed-off-by: Alexander Gordeev > > --- > > drivers/pci/msi.c | 49 + > > include/linux/msi.h | 3 --- > > 2 files changed, 13 insertions(+), 39 deletions(-) > > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 5a40516..6c2cc41 > > 100644 > > --- a/drivers/pci/msi.c > > +++ b/drivers/pci/msi.c > > @@ -56,16 +56,6 @@ void __weak arch_teardown_msi_irq(unsigned int irq) > > chip->teardown_irq(chip, irq); > > } > > > > -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type) > > -{ > > - struct msi_chip *chip = dev->bus->msi; > > - > > - if (!chip || !chip->check_device) > > - return 0; > > - > > - return chip->check_device(chip, dev, nvec, type); > > -} > > - > > int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) { > > struct msi_desc *entry; > > @@ -806,22 +796,23 @@ out_free: > > } > > > > /** > > - * pci_msi_check_device - check whether MSI may be enabled on a device > > + * pci_msi_supported - check whether MSI may be enabled on a device > > * @dev: pointer to the pci_dev data structure of MSI device function > > * @nvec: how many MSIs have been requested ? > > - * @type: are we checking for MSI or MSI-X ? > > * > > * Look at global flags, the device itself, and its parent buses > > * to determine if MSI/-X are supported for the device. If MSI/-X is > > * supported return 0, else return an error code. 
> > **/ > > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) > > +static int pci_msi_supported(struct pci_dev *dev, int nvec) > > { > > struct pci_bus *bus; > > - int ret; > > > > /* MSI must be globally enabled and supported by the device */ > > - if (!pci_msi_enable || !dev || dev->no_msi) > > + if (!pci_msi_enable) > > + return -EINVAL; > > + > > + if (!dev || dev->no_msi || dev->current_state != PCI_D0) > > return -EINVAL; > > > > /* > > @@ -843,10 +834,6 @@ static int pci_msi_check_device(struct pci_dev *dev, > > int > > nvec, int type) > > if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI) > > return -EINVAL; > > > > - ret = arch_msi_check_device(dev, nvec, type); > > - if (ret) > > - return ret; > > - > > return 0; > > } > > > > @@ -949,13 +936,13 @@ int pci_enable_msix(struct pci_dev *dev, struct > > msix_entry > > *entries, int nvec) > > int status, nr_entries; > > int i, j; > > > > - if (!entries || !dev->msix_cap || dev->current_state != PCI_D0) > > - return -EINVAL; > > - > > - status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX); > > + status = pci_msi_supported(dev, nvec); > > if (status) > > return status; > > > > + if (!entries) > > + return -EINVAL; > > + > > nr_entries = pci_msix_vec_count(dev); > > if (nr_entries < 0) > > return nr_entries; > > @@ -1062,8 +1049,9 @@ int pci_enable_msi_range(struct pci_dev
[PATCH v2 2/2] PCI/MSI: Remove arch_msi_check_device()
There are no archs that override arch_msi_check_device() hook. Remove it as it is completely redundant. If an arch would need to check MSI/MSI-X possibility for a device it should make it within arch_setup_msi_irqs() hook. Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-...@vger.kernel.org Signed-off-by: Alexander Gordeev --- drivers/pci/msi.c | 49 + include/linux/msi.h | 3 --- 2 files changed, 13 insertions(+), 39 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 5a40516..6c2cc41 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -56,16 +56,6 @@ void __weak arch_teardown_msi_irq(unsigned int irq) chip->teardown_irq(chip, irq); } -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type) -{ - struct msi_chip *chip = dev->bus->msi; - - if (!chip || !chip->check_device) - return 0; - - return chip->check_device(chip, dev, nvec, type); -} - int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) { struct msi_desc *entry; @@ -806,22 +796,23 @@ out_free: } /** - * pci_msi_check_device - check whether MSI may be enabled on a device + * pci_msi_supported - check whether MSI may be enabled on a device * @dev: pointer to the pci_dev data structure of MSI device function * @nvec: how many MSIs have been requested ? - * @type: are we checking for MSI or MSI-X ? * * Look at global flags, the device itself, and its parent buses * to determine if MSI/-X are supported for the device. If MSI/-X is * supported return 0, else return an error code. 
**/ -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) +static int pci_msi_supported(struct pci_dev *dev, int nvec) { struct pci_bus *bus; - int ret; /* MSI must be globally enabled and supported by the device */ - if (!pci_msi_enable || !dev || dev->no_msi) + if (!pci_msi_enable) + return -EINVAL; + + if (!dev || dev->no_msi || dev->current_state != PCI_D0) return -EINVAL; /* @@ -843,10 +834,6 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI) return -EINVAL; - ret = arch_msi_check_device(dev, nvec, type); - if (ret) - return ret; - return 0; } @@ -949,13 +936,13 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec) int status, nr_entries; int i, j; - if (!entries || !dev->msix_cap || dev->current_state != PCI_D0) - return -EINVAL; - - status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX); + status = pci_msi_supported(dev, nvec); if (status) return status; + if (!entries) + return -EINVAL; + nr_entries = pci_msix_vec_count(dev); if (nr_entries < 0) return nr_entries; @@ -1062,8 +1049,9 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec) int nvec; int rc; - if (dev->current_state != PCI_D0) - return -EINVAL; + rc = pci_msi_supported(dev, minvec); + if (rc) + return rc; WARN_ON(!!dev->msi_enabled); @@ -1086,17 +1074,6 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec) nvec = maxvec; do { - rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI); - if (rc < 0) { - return rc; - } else if (rc > 0) { - if (rc < minvec) - return -ENOSPC; - nvec = rc; - } - } while (rc); - - do { rc = msi_capability_init(dev, nvec); if (rc < 0) { return rc; diff --git a/include/linux/msi.h b/include/linux/msi.h index 8103f32..dbf7cc9 100644 --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -60,7 +60,6 @@ int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc); void arch_teardown_msi_irq(unsigned int irq); int 
arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); void arch_teardown_msi_irqs(struct pci_dev *dev); -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type); void arch_restore_msi_irqs(struct pci_dev *dev); void default_teardown_msi_irqs(struct pci_dev *dev); @@ -77,8 +76,6 @@ struct msi_chip { int (*setup_irq)(struct msi_chip *chip, struct pci_dev *dev, struct msi_desc *desc); void (*teardown_irq)(struct msi_chip *chip, unsigned int irq); - int (*check_device)(struct msi_chip *chip, struct pci_dev *dev, - int nvec, int type); }; #endif /* LINUX_MSI_H */ -- 1.9.3 __
Re: [PATCH 1/2] PCI/MSI/PPC: Remove arch_msi_check_device()
On Sat, Jul 12, 2014 at 01:21:07PM +0200, Alexander Gordeev wrote: > PowerPC is the only architecture that makes use of hook > arch_msi_check_device() and does perform some checks to > figure out if MSI/MSI-X could be enabled for a device. > However, there are no reasons why those checks could not > be done within arch_setup_msi_irqs() hook. > > Moving MSI checks into arch_setup_msi_irqs() makes code > more readable and allows getting rid of unnecessary hook > arch_msi_check_device(). Michael, Ben, Any feedback? Thanks! -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH 2/2] PCI/MSI: Remove arch_msi_check_device()
On Wed, Jul 16, 2014 at 04:20:24PM -0600, Bjorn Helgaas wrote: > > @@ -809,22 +799,23 @@ out_free: > > } > > > > /** > > - * pci_msi_check_device - check whether MSI may be enabled on a device > > + * msi_check_device - check whether MSI may be enabled on a device > > * @dev: pointer to the pci_dev data structure of MSI device function > > * @nvec: how many MSIs have been requested ? > > - * @type: are we checking for MSI or MSI-X ? > > * > > * Look at global flags, the device itself, and its parent buses > > * to determine if MSI/-X are supported for the device. If MSI/-X is > > * supported return 0, else return an error code. > > **/ > > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) > > +static int msi_check_device(struct pci_dev *dev, int nvec) > > I think "check_device" is a terrible name because it really doesn't > give a clue about what it's doing or what the return value means. > Since you're removing the external usage (arch_msi_check_device) and > this one is static, this would be a good time to fix it. Maybe > "pci_msi_supported()" or something? What about pci_can_enable_msi() or pci_msi_can_enable() or msi_can_enable()? > I *love* the idea of getting rid of this much code! > > Bjorn -- Regards, Alexander Gordeev agord...@redhat.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] PCI/MSI: Remove arch_msi_check_device()
On Mon, Jul 14, 2014 at 10:11:57AM +0800, Yijing Wang wrote: > > /** > > - * pci_msi_check_device - check whether MSI may be enabled on a device > > + * msi_check_device - check whether MSI may be enabled on a device > > * @dev: pointer to the pci_dev data structure of MSI device function > > * @nvec: how many MSIs have been requested ? > > - * @type: are we checking for MSI or MSI-X ? > > * > > * Look at global flags, the device itself, and its parent buses > > * to determine if MSI/-X are supported for the device. If MSI/-X is > > * supported return 0, else return an error code. > > **/ > > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) > > +static int msi_check_device(struct pci_dev *dev, int nvec) > > { > > struct pci_bus *bus; > > - int ret; > > > > /* MSI must be globally enabled and supported by the device */ > > - if (!pci_msi_enable || !dev || dev->no_msi) > > + if (!pci_msi_enable) > > + return -EINVAL; > > + > > + if (!dev || dev->no_msi || dev->current_state != PCI_D0) > > return -EINVAL; > > > > /* > > @@ -846,10 +837,6 @@ static int pci_msi_check_device(struct pci_dev *dev, > > int nvec, int type) > > if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI) > > return -EINVAL; > > > > - ret = arch_msi_check_device(dev, nvec, type); > > - if (ret) > > - return ret; > > - > > Move the arch_msi_check_device() into arch_msi_setup_irq(), make we can not > detect whether the device in this platform > supports MSI or MSI-X aeap. If we delay this, maybe we will do a lot > unnecessary working for MSI/MSI-X setup. A traditional approach for a function is first to make sanity check and then allocate resources. I do not see a reason to keep these two steps in separate functions: arch_msi_check_device() and arch_setup_msi_irq(). Just make checks within arch_setup_msi_irq() and bail out early would be as cheap as it is now, but more natural and would deflate the interface. 
Moreover, some platforms duplicate checks in arch_msi_check_device() and arch_setup_msi_irq(), which does not add to readability. > Thanks! > Yijing. > > > return 0; > > } > > > > @@ -954,13 +941,13 @@ int pci_enable_msix(struct pci_dev *dev, struct > > msix_entry *entries, int nvec) > > int status, nr_entries; > > int i, j; > > > > - if (!entries || !dev->msix_cap || dev->current_state != PCI_D0) > > - return -EINVAL; > > - > > - status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX); > > + status = msi_check_device(dev, nvec); > > if (status) > > return status; > > > > + if (!entries) > > + return -EINVAL; > > + > > nr_entries = pci_msix_vec_count(dev); > > if (nr_entries < 0) > > return nr_entries; > > @@ -1085,8 +1072,9 @@ int pci_enable_msi_range(struct pci_dev *dev, int > > minvec, int maxvec) > > int nvec; > > int rc; > > > > - if (dev->current_state != PCI_D0) > > - return -EINVAL; > > + rc = msi_check_device(dev, minvec); > > + if (rc) > > + return rc; > > > > WARN_ON(!!dev->msi_enabled); > > > > @@ -1109,17 +1097,6 @@ int pci_enable_msi_range(struct pci_dev *dev, int > > minvec, int maxvec) > > nvec = maxvec; > > > > do { > > - rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI); > > - if (rc < 0) { > > - return rc; > > - } else if (rc > 0) { > > - if (rc < minvec) > > - return -ENOSPC; > > - nvec = rc; > > - } > > - } while (rc); > > - > > - do { > > rc = msi_capability_init(dev, nvec); > > if (rc < 0) { > > return rc; > > diff --git a/include/linux/msi.h b/include/linux/msi.h > > index 92a2f99..3b873bc 100644 > > --- a/include/linux/msi.h > > +++ b/include/linux/msi.h > > @@ -59,7 +59,6 @@ int arch_setup_msi_irq(struct pci_dev *dev, struct > > msi_desc *desc); > > void arch_teardown_msi_irq(unsigned int irq); > > int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); > > void arch_teardown_msi_irqs(struct pci_dev *dev); > > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type); > > void arch_res
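The consolidation argued for here follows a simple pattern: fold the standalone check hook into the top of the setup path and fail early. A minimal sketch of that pattern, with made-up names (struct fake_dev and setup_irqs are illustrative stand-ins, not kernel API):

```c
#include <stddef.h>

/* Toy device; the fields stand in for dev->no_msi and similar state. */
struct fake_dev {
	int no_msi;
	int max_vectors;
};

/*
 * Single entry point after the refactoring: sanity checks come first
 * and bail out early, then resources are allocated.  This costs the
 * same as a separate check hook, but keeps the interface smaller and
 * removes the duplicated checks some platforms carried in both hooks.
 */
static int setup_irqs(struct fake_dev *dev, int nvec)
{
	if (dev == NULL || dev->no_msi)
		return -1;		/* was the check_device() hook */
	if (nvec < 1 || nvec > dev->max_vectors)
		return -1;
	/* ... allocate nvec interrupt vectors here ... */
	return nvec;
}
```

A caller that previously needed check-then-setup now makes one call and handles a single failure path.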
[PATCH 0/2] PCI/MSI: Remove arch_msi_check_device()
Hello, This is a cleanup effort to get rid of useless arch_msi_check_device(). I am not sure what were the reasons for its existence in the first place, but at the moment it appears totally unnecessary. Thanks! Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-...@vger.kernel.org Alexander Gordeev (2): PCI/MSI/PPC: Remove arch_msi_check_device() PCI/MSI: Remove arch_msi_check_device() arch/powerpc/include/asm/machdep.h |2 - arch/powerpc/kernel/msi.c | 12 +--- arch/powerpc/platforms/cell/axon_msi.c |9 -- arch/powerpc/platforms/powernv/pci.c | 19 +++- arch/powerpc/platforms/pseries/msi.c | 42 ++- arch/powerpc/sysdev/fsl_msi.c | 12 ++-- arch/powerpc/sysdev/mpic_pasemi_msi.c | 11 +-- arch/powerpc/sysdev/mpic_u3msi.c | 28 +++--- arch/powerpc/sysdev/ppc4xx_hsta_msi.c | 18 arch/powerpc/sysdev/ppc4xx_msi.c | 19 drivers/pci/msi.c | 49 --- include/linux/msi.h|3 -- 12 files changed, 63 insertions(+), 161 deletions(-) -- 1.7.7.6 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/2] PCI/MSI: Remove arch_msi_check_device()
There are no archs that override arch_msi_check_device() hook. Remove it as it is completely redundant. If an arch would need to check MSI/MSI-X possibility for a device it should make it within arch_setup_msi_irqs() hook. Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-...@vger.kernel.org Signed-off-by: Alexander Gordeev --- drivers/pci/msi.c | 49 + include/linux/msi.h |3 --- 2 files changed, 13 insertions(+), 39 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 13f3d30..19ac058 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -56,16 +56,6 @@ void __weak arch_teardown_msi_irq(unsigned int irq) chip->teardown_irq(chip, irq); } -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type) -{ - struct msi_chip *chip = dev->bus->msi; - - if (!chip || !chip->check_device) - return 0; - - return chip->check_device(chip, dev, nvec, type); -} - int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) { struct msi_desc *entry; @@ -809,22 +799,23 @@ out_free: } /** - * pci_msi_check_device - check whether MSI may be enabled on a device + * msi_check_device - check whether MSI may be enabled on a device * @dev: pointer to the pci_dev data structure of MSI device function * @nvec: how many MSIs have been requested ? - * @type: are we checking for MSI or MSI-X ? * * Look at global flags, the device itself, and its parent buses * to determine if MSI/-X are supported for the device. If MSI/-X is * supported return 0, else return an error code. 
**/ -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) +static int msi_check_device(struct pci_dev *dev, int nvec) { struct pci_bus *bus; - int ret; /* MSI must be globally enabled and supported by the device */ - if (!pci_msi_enable || !dev || dev->no_msi) + if (!pci_msi_enable) + return -EINVAL; + + if (!dev || dev->no_msi || dev->current_state != PCI_D0) return -EINVAL; /* @@ -846,10 +837,6 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type) if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI) return -EINVAL; - ret = arch_msi_check_device(dev, nvec, type); - if (ret) - return ret; - return 0; } @@ -954,13 +941,13 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec) int status, nr_entries; int i, j; - if (!entries || !dev->msix_cap || dev->current_state != PCI_D0) - return -EINVAL; - - status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX); + status = msi_check_device(dev, nvec); if (status) return status; + if (!entries) + return -EINVAL; + nr_entries = pci_msix_vec_count(dev); if (nr_entries < 0) return nr_entries; @@ -1085,8 +1072,9 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec) int nvec; int rc; - if (dev->current_state != PCI_D0) - return -EINVAL; + rc = msi_check_device(dev, minvec); + if (rc) + return rc; WARN_ON(!!dev->msi_enabled); @@ -1109,17 +1097,6 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec) nvec = maxvec; do { - rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI); - if (rc < 0) { - return rc; - } else if (rc > 0) { - if (rc < minvec) - return -ENOSPC; - nvec = rc; - } - } while (rc); - - do { rc = msi_capability_init(dev, nvec); if (rc < 0) { return rc; diff --git a/include/linux/msi.h b/include/linux/msi.h index 92a2f99..3b873bc 100644 --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -59,7 +59,6 @@ int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc); void arch_teardown_msi_irq(unsigned int irq); int 
arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); void arch_teardown_msi_irqs(struct pci_dev *dev); -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type); void arch_restore_msi_irqs(struct pci_dev *dev); void default_teardown_msi_irqs(struct pci_dev *dev); @@ -76,8 +75,6 @@ struct msi_chip { int (*setup_irq)(struct msi_chip *chip, struct pci_dev *dev, struct msi_desc *desc); void (*teardown_irq)(struct msi_chip *chip, unsigned int irq); - int (*check_device)(struct msi_chip *chip, struct pci_dev *dev, - int nvec, int type); }; #endif /* LINUX_MSI_H */ -- 1.7.7.6 __
[PATCH 1/2] PCI/MSI/PPC: Remove arch_msi_check_device()
PowerPC is the only architecture that makes use of hook arch_msi_check_device() and does perform some checks to figure out if MSI/MSI-X could be enabled for a device. However, there are no reasons why those checks could not be done within arch_setup_msi_irqs() hook. Moving MSI checks into arch_setup_msi_irqs() makes code more readable and allows getting rid of unnecessary hook arch_msi_check_device(). Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-...@vger.kernel.org Signed-off-by: Alexander Gordeev --- arch/powerpc/include/asm/machdep.h |2 - arch/powerpc/kernel/msi.c | 12 + arch/powerpc/platforms/cell/axon_msi.c |9 --- arch/powerpc/platforms/powernv/pci.c | 19 -- arch/powerpc/platforms/pseries/msi.c | 42 --- arch/powerpc/sysdev/fsl_msi.c | 12 ++--- arch/powerpc/sysdev/mpic_pasemi_msi.c | 11 +--- arch/powerpc/sysdev/mpic_u3msi.c | 28 - arch/powerpc/sysdev/ppc4xx_hsta_msi.c | 18 - arch/powerpc/sysdev/ppc4xx_msi.c | 19 -- 10 files changed, 50 insertions(+), 122 deletions(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index f92b0b5..d05aa76 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -136,8 +136,6 @@ struct machdep_calls { int (*pci_setup_phb)(struct pci_controller *host); #ifdef CONFIG_PCI_MSI - int (*msi_check_device)(struct pci_dev* dev, - int nvec, int type); int (*setup_msi_irqs)(struct pci_dev *dev, int nvec, int type); void(*teardown_msi_irqs)(struct pci_dev *dev); diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c index 8bbc12d..71bd161 100644 --- a/arch/powerpc/kernel/msi.c +++ b/arch/powerpc/kernel/msi.c @@ -13,7 +13,7 @@ #include -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type) +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) { if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) { pr_debug("msi: Platform doesn't provide MSI callbacks.\n"); @@ -24,16 +24,6 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, 
int type) if (type == PCI_CAP_ID_MSI && nvec > 1) return 1; - if (ppc_md.msi_check_device) { - pr_debug("msi: Using platform check routine.\n"); - return ppc_md.msi_check_device(dev, nvec, type); - } - -return 0; -} - -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) -{ return ppc_md.setup_msi_irqs(dev, nvec, type); } diff --git a/arch/powerpc/platforms/cell/axon_msi.c b/arch/powerpc/platforms/cell/axon_msi.c index 85825b5..862b327 100644 --- a/arch/powerpc/platforms/cell/axon_msi.c +++ b/arch/powerpc/platforms/cell/axon_msi.c @@ -199,14 +199,6 @@ out_error: return msic; } -static int axon_msi_check_device(struct pci_dev *dev, int nvec, int type) -{ - if (!find_msi_translator(dev)) - return -ENODEV; - - return 0; -} - static int setup_msi_msg_address(struct pci_dev *dev, struct msi_msg *msg) { struct device_node *dn; @@ -416,7 +408,6 @@ static int axon_msi_probe(struct platform_device *device) ppc_md.setup_msi_irqs = axon_msi_setup_msi_irqs; ppc_md.teardown_msi_irqs = axon_msi_teardown_msi_irqs; - ppc_md.msi_check_device = axon_msi_check_device; axon_msi_debug_setup(dn, msic); diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index f91a4e5..987b677 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -46,29 +46,21 @@ //#define cfg_dbg(fmt...) printk(fmt) #ifdef CONFIG_PCI_MSI -static int pnv_msi_check_device(struct pci_dev* pdev, int nvec, int type) -{ - struct pci_controller *hose = pci_bus_to_host(pdev->bus); - struct pnv_phb *phb = hose->private_data; - struct pci_dn *pdn = pci_get_pdn(pdev); - - if (pdn && pdn->force_32bit_msi && !phb->msi32_support) - return -ENODEV; - - return (phb && phb->msi_bmp.bitmap) ? 
0 : -ENODEV; -} - static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type) { struct pci_controller *hose = pci_bus_to_host(pdev->bus); struct pnv_phb *phb = hose->private_data; + struct pci_dn *pdn = pci_get_pdn(pdev); struct msi_desc *entry; struct msi_msg msg; int hwirq; unsigned int virq; int rc; - if (WARN_ON(!phb)) + if (WARN_ON(!phb) || !phb->msi_bmp.bitmap) + return -ENODEV; + + if (pdn && pdn->force_32bit_msi && !phb->
Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
On Wed, Jul 09, 2014 at 10:06:48AM -0600, Bjorn Helgaas wrote:
> Out of curiosity, do you have a pointer to this? It looks like it

I.e. ICH8 chapter 12.1.30 or ICH10 chapter 14.1.27

> uses one vector per port, and I'm wondering if the reason it requests
> 16 is because there's some possibility of a part with more than 8
> ports.

I doubt that is the reason. The only allowed MME values (powers of two) are 0b000, 0b001, 0b010 and 0b100. As you can see, only one bit is used - I would speculate it suits some hardware logic nicely.

BTW, apart from AHCI, it seems the reason MSI is not going to disappear (in a decade at least) is that it is way cheaper to implement than MSI-X.

> > No, this is not an erratum. The value of 8 vectors is reserved and could
> > cause undefined results if used.
>
> As I read the spec (PCI 3.0, sec 6.8.1.3), if MMC contains 0b100
> (requesting 16 vectors), the OS is allowed to allocate 1, 2, 4, 8, or
> 16 vectors. If allocating 8 vectors and writing 0b011 to MME causes
> undefined results, I'd say that's a chipset defect.

Well, the PCI spec does not prevent devices from having their own specs on top of it. Undefined results are meant on the device side here. On the MSI side these results are likely perfectly within the PCI spec. I feel like I'm speaking as a lawyer here ;)

> Interrupt vector space is the issue I would worry about, but I think
> I'm going to put this on the back burner until it actually becomes a
> problem.

I plan to try to get rid of the arch_msi_check_device() hook. Should I repost this series afterwards?

Thanks!

--
Regards,
Alexander Gordeev
agord...@redhat.com
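As an aside, the MME encoding being discussed can be sketched in plain C. This is a toy userspace model, not kernel code; `msi_mme_encode` is a made-up name (the kernel derives the same value via `roundup_pow_of_two()` and friends):

```c
/* Encode a requested vector count into the 3-bit MME field as the PCI
 * spec defines it: round 'nvec' up to a power of two and take log2,
 * capped at 5 (2^5 == 32, the MSI maximum). */
static unsigned int msi_mme_encode(unsigned int nvec)
{
	unsigned int mme = 0;
	unsigned int n = 1;

	while (n < nvec && mme < 5) {
		n <<= 1;
		mme++;
	}
	return mme;
}
```

So a device that actually signals 6 vectors would get MME = 0b011 (8) under the existing interface - precisely the value the ICH parts reportedly cannot cope with.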
Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote: > >> Can you quantify the benefit of this? Can't a device already use > >> MSI-X to request exactly the number of vectors it can use? (I know > > > > A Intel AHCI chipset requires 16 vectors written to MME while advertises > > (via AHCI registers) and uses only 6. Even attempt to init 8 vectors results > > in device's fallback to 1 (!). > > Is the fact that it uses only 6 vectors documented in the public spec? Yes, it is documented in ICH specs. > Is this a chipset erratum? Are there newer versions of the chipset > that fix this, e.g., by requesting 8 vectors and using 6, or by also > supporting MSI-X? No, this is not an erratum. The value of 8 vectors is reserved and could cause undefined results if used. > I know this conserves vector numbers. What does that mean in real > user-visible terms? Are there systems that won't boot because of this > issue, and this patch fixes them? Does it enable bigger > configurations, e.g., more I/O devices, than before? Visibly, it ceases logging messages ('ahci :00:1f.2: irq 107 for MSI/MSI-X') for IRQs that are not shown in /proc/interrupts later. No, it does not enable/fix any existing hardware issue I am aware of. It just saves a couple of interrupt vectors, as Michael put it (10/16 to be precise). However, interrupt vectors space is pretty much scarce resource on x86 and a risk of exhausting the vectors (and introducing quota i.e) has already been raised AFAIR. > Do you know how Windows handles this? Does it have a similar interface? Have no clue, TBH. Can try to investigate if you see it helpful. > As you can tell, I'm a little skeptical about this. It's a fairly big > change, it affects the arch interface, it seems to be targeted for > only a single chipset (though it's widely used), and we already > support a standard solution (MSI-X, reducing the number of vectors > requested, or even operating with 1 vector). 
I also do not like the fact that the arch interface is getting complicated, so I happily leave it to your judgement ;) Well, it is low-level and hidden from drivers at least.

Thanks!

> Bjorn

--
Regards,
Alexander Gordeev
agord...@redhat.com
Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
> As you can tell, I'm a little skeptical about this. It's a fairly big
> change, it affects the arch interface, it seems to be targeted for
> only a single chipset (though it's widely used), and we already
> support a standard solution (MSI-X, reducing the number of vectors
> requested, or even operating with 1 vector).

Bjorn,

I surely understand your concerns. I am answering this "summary" question right away.

Even though an extra parameter is introduced, functionally this update is rather small. It is only the new pci_enable_msi_partial() function that could exploit a custom 'nvec_mme' parameter. By contrast, the existing pci_enable_msi_range() function (and therefore all device drivers) is unaffected - it just rounds up 'nvec' to the nearest power of two and continues exactly as it has been. All archs besides x86 just ignore it. And the x86 change is fairly small as well - all the necessary functionality is already in.

Thus, at the moment it is only AHCI of concern. And no, AHCI cannot do MSI-X.

Thanks!

> Bjorn

--
Regards,
Alexander Gordeev
agord...@redhat.com
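The "rounds up 'nvec' to the nearest power of two" behaviour mentioned above can be illustrated with the usual bit trick. This is a userspace sketch; `roundup_pow_of_two_u32` is a hypothetical stand-in for the kernel's `roundup_pow_of_two()`:

```c
/* Smear the highest set bit of (x - 1) rightwards, then add one:
 * the result is the smallest power of two >= x (valid for x >= 1). */
static unsigned int roundup_pow_of_two_u32(unsigned int x)
{
	x--;
	x |= x >> 1;
	x |= x >> 2;
	x |= x >> 4;
	x |= x >> 8;
	x |= x >> 16;
	return x + 1;
}
```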
Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
On Fri, Jul 04, 2014 at 09:11:50AM +, David Laight wrote:
> > I might be missing something, but we are talking of MSI address space
> > here, aren't we? I am not getting how we could end up with a 'write'
> > to a random kernel location when a unclaimed MSI vector sent. We could
> > only expect a spurious interrupt at worst, which is handled and reported.
> >
> > Anyway, as I described in my reply to Bjorn, this is not a concern IMO.
>
> I'm thinking of the following - which might be MSI-X ?
> 1) Hardware requests some interrupts and tells the host the BAR (and offset)
>    where the 'vectors' should be written.
> 2) To raise an interrupt the hardware uses the 'vector' as the address
>    of a normal PCIe write cycle.
>
> So if the hardware requests 4 interrupts, but the driver (believing it
> will only use 3) only write 3 vectors, and then the hardware uses the
> 4th vector it can write to a random location.
>
> Debugging that would be hard!

The MSI base address is kind of hardcoded for a platform. A combination of MSI base address, PCI function number and MSI vector makes the PCI host raise an interrupt on a CPU. I might be inaccurate in the details, but the scenario you described is impossible AFAICT.

> David

--
Regards,
Alexander Gordeev
agord...@redhat.com
Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
On Thu, Jul 03, 2014 at 09:20:52AM +, David Laight wrote: > From: Bjorn Helgaas > > On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote: > > > There are PCI devices that require a particular value written > > > to the Multiple Message Enable (MME) register while aligned on > > > power of 2 boundary value of actually used MSI vectors 'nvec' > > > is a lesser of that MME value: > > > > > > roundup_pow_of_two(nvec) < 'Multiple Message Enable' > > > > > > However the existing pci_enable_msi_block() interface is not > > > able to configure such devices, since the value written to the > > > MME register is calculated from the number of requested MSIs > > > 'nvec': > > > > > > 'Multiple Message Enable' = roundup_pow_of_two(nvec) > > > > For MSI, software learns how many vectors a device requests by reading > > the Multiple Message Capable (MMC) field. This field is encoded, so a > > device can only request 1, 2, 4, 8, etc., vectors. It's impossible > > for a device to request 3 vectors; it would have to round up that up > > to a power of two and request 4 vectors. > > > > Software writes similarly encoded values to MME to tell the device how > > many vectors have been allocated for its use. For example, it's > > impossible to tell the device that it can use 3 vectors; the OS has to > > round that up and tell the device it can use 4 vectors. > > > > So if I understand correctly, the point of this series is to take > > advantage of device-specific knowledge, e.g., the device requests 4 > > vectors via MMC, but we "know" the device is only capable of using 3. > > Moreover, we tell the device via MME that 4 vectors are available, but > > we've only actually set up 3 of them. > ... > > Even if you do that, you ought to write valid interrupt information > into the 4th slot (maybe replicating one of the earlier interrupts). > Then, if the device does raise the 'unexpected' interrupt you don't > get a write to a random kernel location. 
I might be missing something, but we are talking of MSI address space here, aren't we? I am not getting how we could end up with a 'write' to a random kernel location when an unclaimed MSI vector is sent. We could only expect a spurious interrupt at worst, which is handled and reported.

Anyway, as I described in my reply to Bjorn, this is not a concern IMO.

> Plausibly something similar should be done when a smaller number of
> interrupts is assigned.
>
> David

--
Regards,
Alexander Gordeev
agord...@redhat.com
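For what it's worth, the "MSI is just a memory write" point can be made concrete. The sketch below uses the x86 layout (message address 0xFEE00000 plus an APIC destination ID in bits 12-19) purely as an illustration; the constants are platform facts, not anything defined by this patch series:

```c
/* Compose a (simplified) x86 MSI message address: a device write of the
 * MSI data word to this address makes the interrupt controller raise the
 * corresponding vector on the target CPU - hence no access to arbitrary
 * kernel memory is involved. */
static unsigned int msi_compose_addr(unsigned int apic_dest_id)
{
	return 0xFEE00000u | ((apic_dest_id & 0xffu) << 12);
}
```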
Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote: > On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote: > > There are PCI devices that require a particular value written > > to the Multiple Message Enable (MME) register while aligned on > > power of 2 boundary value of actually used MSI vectors 'nvec' > > is a lesser of that MME value: > > > > roundup_pow_of_two(nvec) < 'Multiple Message Enable' > > > > However the existing pci_enable_msi_block() interface is not > > able to configure such devices, since the value written to the > > MME register is calculated from the number of requested MSIs > > 'nvec': > > > > 'Multiple Message Enable' = roundup_pow_of_two(nvec) > > For MSI, software learns how many vectors a device requests by reading > the Multiple Message Capable (MMC) field. This field is encoded, so a > device can only request 1, 2, 4, 8, etc., vectors. It's impossible > for a device to request 3 vectors; it would have to round up that up > to a power of two and request 4 vectors. > > Software writes similarly encoded values to MME to tell the device how > many vectors have been allocated for its use. For example, it's > impossible to tell the device that it can use 3 vectors; the OS has to > round that up and tell the device it can use 4 vectors. Nod. > So if I understand correctly, the point of this series is to take > advantage of device-specific knowledge, e.g., the device requests 4 > vectors via MMC, but we "know" the device is only capable of using 3. > Moreover, we tell the device via MME that 4 vectors are available, but > we've only actually set up 3 of them. Exactly. > This makes me uneasy because we're lying to the device, and the device > is perfectly within spec to use all 4 of those vectors. If anything > changes the number of vectors the device uses (new device revision, > firmware upgrade, etc.), this is liable to break. 
If a device committed, via non-MSI-specific means, to send only 3 vectors out of 4 available, why should we expect it to send 4? The probability of a firmware sending 4/4 vectors in this case is equal to the probability of sending 5/4 or 16/4, with the very same reason - a bug in the firmware. Moreover, even vector 4/4 would be unexpected by the device driver, though it is perfectly within the spec.

As for a new device revision or firmware update etc. - it is just yet another case of device driver vs firmware match/mismatch. Not including this change does not help here at all IMHO.

> Can you quantify the benefit of this? Can't a device already use
> MSI-X to request exactly the number of vectors it can use? (I know

An Intel AHCI chipset requires 16 vectors written to MME while it advertises (via AHCI registers) and uses only 6. Even an attempt to init 8 vectors results in the device's fallback to 1 (!).

> not all devices support MSI-X, but maybe we should just accept MSI for
> what it is and encourage the HW guys to use MSI-X if MSI isn't good
> enough.)
>
> > In this case the result written to the MME register may not
> > satisfy the aforementioned PCI devices requirement and therefore
> > the PCI functions will not operate in a desired mode.
>
> I'm not sure what you mean by "will not operate in a desired mode."
> I thought this was an optimization to save vectors and that these
> changes would be completely invisible to the hardware.

Yes, this should be invisible to the hardware. The above is an attempt to describe the Intel AHCI weirdness in general terms :) I think it could be omitted.

> Bjorn

> > This update introduces pci_enable_msi_partial() extension to
> > pci_enable_msi_block() interface that accepts extra 'nvec_mme'
> > argument which is then written to MME register while the value
> > of 'nvec' is still used to setup as many interrupts as requested.
> > > > As result of this change, architecture-specific callbacks > > arch_msi_check_device() and arch_setup_msi_irqs() get an extra > > 'nvec_mme' parameter as well, but it is ignored for now. > > Therefore, this update is a placeholder for architectures that > > wish to support pci_enable_msi_partial() function in the future. > > > > Cc: linux-...@vger.kernel.org > > Cc: linux-m...@linux-mips.org > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-s...@vger.kernel.org > > Cc: x...@kernel.org > > Cc: xen-de...@lists.xenproject.org > > Cc: io...@lists.linux-foundation.org > > Cc: linux-...@vger.kernel.org > > Cc: linux-...@vger.kernel.org > > Signed-off-by: Alexander Gordeev > > --- > > Doc
Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
Hi Bjorn,

Any feedback?

Thanks!

--
Regards,
Alexander Gordeev
agord...@redhat.com
[PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
There are PCI devices that require a particular value written to the Multiple Message Enable (MME) register while the number of actually used MSI vectors 'nvec', rounded up to a power of 2 boundary, is less than that MME value:

	roundup_pow_of_two(nvec) < 'Multiple Message Enable'

However, the existing pci_enable_msi_block() interface is not able to configure such devices, since the value written to the MME register is calculated from the number of requested MSIs 'nvec':

	'Multiple Message Enable' = roundup_pow_of_two(nvec)

In this case the result written to the MME register may not satisfy the aforementioned PCI devices' requirement and therefore the PCI functions will not operate in the desired mode.

This update introduces the pci_enable_msi_partial() extension to the pci_enable_msi_block() interface that accepts an extra 'nvec_mme' argument, which is then written to the MME register, while the value of 'nvec' is still used to set up as many interrupts as requested.

As a result of this change, the architecture-specific callbacks arch_msi_check_device() and arch_setup_msi_irqs() get an extra 'nvec_mme' parameter as well, but it is ignored for now. Therefore, this update is a placeholder for architectures that wish to support the pci_enable_msi_partial() function in the future.
Cc: linux-...@vger.kernel.org Cc: linux-m...@linux-mips.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: x...@kernel.org Cc: xen-de...@lists.xenproject.org Cc: io...@lists.linux-foundation.org Cc: linux-...@vger.kernel.org Cc: linux-...@vger.kernel.org Signed-off-by: Alexander Gordeev --- Documentation/PCI/MSI-HOWTO.txt | 36 ++-- arch/mips/pci/msi-octeon.c |2 +- arch/powerpc/kernel/msi.c |4 +- arch/s390/pci/pci.c |2 +- arch/x86/kernel/x86_init.c |2 +- drivers/pci/msi.c | 83 ++- include/linux/msi.h |5 +- include/linux/pci.h |3 + 8 files changed, 115 insertions(+), 22 deletions(-) diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt index 10a9369..c8a8503 100644 --- a/Documentation/PCI/MSI-HOWTO.txt +++ b/Documentation/PCI/MSI-HOWTO.txt @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact() returns zero in case of success, which indicates MSI interrupts have been successfully allocated. -4.2.4 pci_disable_msi +4.2.4 pci_enable_msi_partial + +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme) + +This variation on pci_enable_msi_exact() call allows a device driver to +setup 'nvec_mme' number of multiple MSIs with the PCI function, while +setup only 'nvec' (which could be a lesser of 'nvec_mme') number of MSIs +in operating system. The MSI specification only allows 'nvec_mme' to be +allocated in powers of two, up to a maximum of 2^5 (32). + +This function could be used when a PCI function is known to send 'nvec' +MSIs, but still requires a particular number of MSIs 'nvec_mme' to be +initialized with. As result, 'nvec_mme' - 'nvec' number of unused MSIs +do not waste system resources. + +If this function returns 0, it has succeeded in allocating 'nvec_mme' +interrupts and setting up 'nvec' interrupts. In this case, the function +enables MSI on this device and updates dev->irq to be the lowest of the +new interrupts assigned to it. 
The other interrupts assigned to the +device are in the range dev->irq to dev->irq + nvec - 1. + +If this function returns a negative number, it indicates an error and +the driver should not attempt to request any more MSI interrupts for +this device. + +4.2.5 pci_disable_msi void pci_disable_msi(struct pci_dev *dev) -This function should be used to undo the effect of pci_enable_msi_range(). -Calling it restores dev->irq to the pin-based interrupt number and frees -the previously allocated MSIs. The interrupts may subsequently be assigned -to another device, so drivers should not cache the value of dev->irq. +This function should be used to undo the effect of pci_enable_msi_range() +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based +interrupt number and frees the previously allocated MSIs. The interrupts +may subsequently be assigned to another device, so drivers should not cache +the value of dev->irq. Before calling this function, a device driver must always call free_irq() on any interrupt for which it previously called request_irq(). diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c index 2b91b0e..2be7979 100644 --- a/arch/mips/pci/msi-octeon.c +++ b/arch/mips/pci/msi-octeon.c @@ -178,7 +178,7 @@ msi_irq_allocated: return 0; } -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) +int arch_
[PATCH 0/3] Add pci_enable_msi_partial() to conserve MSI-related resources
Add the new pci_enable_msi_partial() interface and use it to conserve otherwise wasted interrupt resources. The AHCI driver is the first user; it saves the 10 unused out of 16 requested MSI vectors on some Intel chipsets.

Cc: linux-...@vger.kernel.org
Cc: linux-m...@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: io...@lists.linux-foundation.org
Cc: linux-...@vger.kernel.org
Cc: linux-...@vger.kernel.org

Alexander Gordeev (3):
  PCI/MSI: Add pci_enable_msi_partial()
  PCI/MSI/x86: Support pci_enable_msi_partial()
  AHCI: Use pci_enable_msi_partial() to conserve on 10/16 MSIs

 Documentation/PCI/MSI-HOWTO.txt | 36 ++--
 arch/mips/pci/msi-octeon.c      |  2 +-
 arch/powerpc/kernel/msi.c       |  4 +-
 arch/s390/pci/pci.c             |  2 +-
 arch/x86/include/asm/pci.h      |  3 +-
 arch/x86/include/asm/x86_init.h |  3 +-
 arch/x86/kernel/apic/io_apic.c  |  2 +-
 arch/x86/kernel/x86_init.c      |  4 +-
 arch/x86/pci/xen.c              |  9 +++-
 drivers/ata/ahci.c              |  4 +-
 drivers/iommu/irq_remapping.c   | 10 ++--
 drivers/pci/msi.c               | 83 ++-
 include/linux/msi.h             |  5 +-
 include/linux/pci.h             |  3 +
 14 files changed, 134 insertions(+), 36 deletions(-)

--
1.7.7.6
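The arithmetic behind the "10 out of 16" figure is simply the gap between what MME forces the OS to claim and what the device actually uses. A toy model (`vectors_saved` is a made-up helper, not part of this series):

```c
/* Return how many CPU vectors pci_enable_msi_partial() would save over
 * a full allocation: 'mme_vectors' is the power-of-two count written to
 * MME, 'used' the count the device actually signals.  -1 flags a request
 * MME could not encode in the first place. */
static int vectors_saved(int mme_vectors, int used)
{
	if (mme_vectors < used)
		return -1;			/* nonsensical request */
	if (mme_vectors & (mme_vectors - 1))
		return -1;			/* MME only encodes powers of two */
	return mme_vectors - used;
}
```

For the Intel AHCI case discussed in the thread, 16 vectors written to MME minus 6 vectors in use leaves 10 vectors that plain pci_enable_msi_exact() would waste.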
Re: [PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()
On Wed, Apr 30, 2014 at 05:49:33PM -0600, Bjorn Helgaas wrote: > I mistakenly assumed this would have to wait because I thought there were > other pci_enable_msi_block() users that wouldn't be removed until the v3.16 > merge window. But I think I was wrong: I put your GenWQE patch in my tree, > and I think that was the last use, so I can just add this patch on top. That is right, GenWQE is the last one. > But I need a little help understanding the changelog: > > > Up until now, when enabling MSI mode for a device a single > > successful call to arch_msi_check_device() was followed by > > a single call to arch_setup_msi_irqs() function. > > I understand this part; the following two paths call > arch_msi_check_device() once and then arch_setup_msi_irqs() once: > > pci_enable_msi_block > pci_msi_check_device > arch_msi_check_device > msi_capability_init > arch_setup_msi_irqs > > pci_enable_msix > pci_msi_check_device > arch_msi_check_device > msix_capability_init > arch_setup_msi_irqs > > > Yet, if arch_msi_check_device() returned success we should be > > able to call arch_setup_msi_irqs() multiple times - while it > > returns a number of MSI vectors that could have been allocated > > (a third state). > > I don't know what you mean by "a third state." That is "a number of MSI vectors that could have been allocated", which is neither success nor failure. In previous conversations someone branded it this as "third state". > Previously we only called arch_msi_check_device() once. After your patch, > pci_enable_msi_range() can call it several times. The only non-trivial > implementation of arch_msi_check_device() is in powerpc, and all the > ppc_md.msi_check_device() possibilities look safe to call multiple times. Yep, I see it the same way. > After your patch, the pci_enable_msi_range() path can also call > arch_setup_msi_irqs() several times. 
> I don't see a problem with that --
> even if the first call succeeds and allocates something, then a subsequent
> call fails, I assume the allocations will be cleaned up when
> msi_capability_init() calls free_msi_irqs().

Well, the potential problem is related to the fact that arch_msi_check_device() could be called with 'nvec1' while arch_setup_msi_irqs() could be called with 'nvec2', where 'nvec1' > 'nvec2'. While it is not a problem with current implementations, in theory it is possible that free_msi_irqs() could be called with 'nvec2' and fail to clean up 'nvec1' - 'nvec2' number of resources. The only assumption that makes the above scenario impossible is that arch_msi_check_device() is stateless. Again, that is purely theoretical with no particular architecture in mind.

> > This update makes use of the assumption described above. It
> > could have broke things had the architectures done any pre-
> > allocations or switch to some state in a call to function
> > arch_msi_check_device(). But because arch_msi_check_device()
> > is expected stateless and MSI resources are allocated in a
> > follow-up call to arch_setup_msi_irqs() we should be fine.
>
> I guess you mean that your patch assumes arch_msi_check_device() is
> stateless. That looks like a safe assumption to me.

Moreover, if arch_msi_check_device() were not stateless then it would be superfluous, since all state switches and allocations are done in arch_setup_msi_irqs() anyway. In fact, I think arch_msi_check_device() could be eliminated, but I do not want to engage with PPC folks at this stage ;)

> arch_setup_msi_irqs() is clearly not stateless, but I assume
> free_msi_irqs() is enough to clean up any state if we fail.
>
> Bjorn

--
Regards,
Alexander Gordeev
agord...@redhat.com
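The negotiation loop under discussion - success, hard error, or the "third state" where the arch reports a smaller feasible count - can be modelled in userspace. `fake_setup` stands in for arch_setup_msi_irqs(); everything here is a sketch of the control flow, not the kernel code itself:

```c
/* Mimic arch_setup_msi_irqs(): 0 on success, a smaller positive count
 * when only that many vectors are available, negative on hard error. */
static int fake_setup(int nvec, int avail)
{
	return nvec <= avail ? 0 : avail;
}

/* The pci_enable_msi_range()-style retry loop: shrink the request to
 * whatever the arch says is feasible, bail out below 'minvec'. */
static int enable_range(int minvec, int maxvec, int avail)
{
	int nvec = maxvec;
	int rc;

	do {
		rc = fake_setup(nvec, avail);
		if (rc < 0)
			return rc;
		if (rc > 0) {
			if (rc < minvec)
				return -1;	/* -ENOSPC in the kernel */
			nvec = rc;
		}
	} while (rc);

	return nvec;			/* vectors actually set up */
}
```

Note that the loop may call the setup hook several times with shrinking 'nvec' - which is exactly why the check hook being stateless matters.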
Re: [PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()
On Mon, Apr 14, 2014 at 03:28:34PM +0200, Alexander Gordeev wrote:
> There are no users of pci_enable_msi_block() function have
> left. Obsolete it in favor of pci_enable_msi_range() and
> pci_enable_msi_exact() functions.

Hi Bjorn,

How about this one?

Thanks!

--
Regards,
Alexander Gordeev
agord...@redhat.com
[PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()
There are no users of the pci_enable_msi_block() function left. Obsolete it in favor of the pci_enable_msi_range() and pci_enable_msi_exact() functions.

Up until now, when enabling MSI mode for a device a single successful call to arch_msi_check_device() was followed by a single call to the arch_setup_msi_irqs() function. Yet, if arch_msi_check_device() returned success we should be able to call arch_setup_msi_irqs() multiple times - while it returns the number of MSI vectors that could have been allocated (a third state).

This update makes use of the assumption described above. It could have broken things had the architectures done any pre-allocations or switched to some state in a call to arch_msi_check_device(). But because arch_msi_check_device() is expected to be stateless and MSI resources are allocated in a follow-up call to arch_setup_msi_irqs(), we should be fine.

Signed-off-by: Alexander Gordeev
Cc: linux-m...@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-...@vger.kernel.org
---
drivers/pci/msi.c | 79 +- include/linux/pci.h |5 +-- 2 files changed, 34 insertions(+), 50 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 955ab79..fc669ab 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -883,50 +883,6 @@ int pci_msi_vec_count(struct pci_dev *dev) } EXPORT_SYMBOL(pci_msi_vec_count); -/** - * pci_enable_msi_block - configure device's MSI capability structure - * @dev: device to configure - * @nvec: number of interrupts to configure - * - * Allocate IRQs for a device with the MSI capability. - * This function returns a negative errno if an error occurs. If it - * is unable to allocate the number of interrupts requested, it returns - * the number of interrupts it might be able to allocate.
If it successfully - * allocates at least the number of interrupts requested, it returns 0 and - * updates the @dev's irq member to the lowest new interrupt number; the - * other interrupt numbers allocated to this device are consecutive. - */ -int pci_enable_msi_block(struct pci_dev *dev, int nvec) -{ - int status, maxvec; - - if (dev->current_state != PCI_D0) - return -EINVAL; - - maxvec = pci_msi_vec_count(dev); - if (maxvec < 0) - return maxvec; - if (nvec > maxvec) - return maxvec; - - status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI); - if (status) - return status; - - WARN_ON(!!dev->msi_enabled); - - /* Check whether driver already requested MSI-X irqs */ - if (dev->msix_enabled) { - dev_info(&dev->dev, "can't enable MSI " -"(MSI-X already enabled)\n"); - return -EINVAL; - } - - status = msi_capability_init(dev, nvec); - return status; -} -EXPORT_SYMBOL(pci_enable_msi_block); - void pci_msi_shutdown(struct pci_dev *dev) { struct msi_desc *desc; @@ -1132,14 +1088,45 @@ void pci_msi_init_pci_dev(struct pci_dev *dev) **/ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec) { - int nvec = maxvec; + int nvec; int rc; + if (dev->current_state != PCI_D0) + return -EINVAL; + + WARN_ON(!!dev->msi_enabled); + + /* Check whether driver already requested MSI-X irqs */ + if (dev->msix_enabled) { + dev_info(&dev->dev, +"can't enable MSI (MSI-X already enabled)\n"); + return -EINVAL; + } + if (maxvec < minvec) return -ERANGE; + nvec = pci_msi_vec_count(dev); + if (nvec < 0) + return nvec; + else if (nvec < minvec) + return -EINVAL; + else if (nvec > maxvec) + nvec = maxvec; + + do { + rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI); + if (rc < 0) { + return rc; + } else if (rc > 0) { + if (rc < minvec) + return -ENOSPC; + nvec = rc; + } + } while (rc); + do { - rc = pci_enable_msi_block(dev, nvec); + rc = msi_capability_init(dev, nvec); if (rc < 0) { return rc; } else if (rc > 0) { diff --git a/include/linux/pci.h b/include/linux/pci.h index 
aab57b4..499755e 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1158,7 +1158,6 @@ struct msix_entry { #ifdef CONFIG_PCI_MSI int pci_msi_vec_count(struct pci_dev *dev); -int pci_enable_msi_block(struct pci_dev *dev, int nvec); void pci_msi_shutdown(struct pci_dev *dev); void pci_disable_msi(struct pci_dev *dev); int pc
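The reworked pci_enable_msi_range() in the patch above folds the old pci_enable_msi_block() checks into one place: validate the range, query the capability, clamp, then loop over check/init. Below is a minimal userspace sketch of just the clamping step, not the kernel code itself; `cap` stands in for what pci_msi_vec_count() would report, and the errno values mirror the kernel's usage:

```c
#include <assert.h>
#include <errno.h>

/* Sketch of the argument clamping performed by the reworked
 * pci_enable_msi_range() before msi_capability_init() is called.
 * 'cap' models pci_msi_vec_count(): the vector count the device's
 * MSI capability advertises, or a negative errno. Returns the
 * vector count that would be allocated, or a negative errno. */
static int msi_range_clamp(int cap, int minvec, int maxvec)
{
	if (maxvec < minvec)
		return -ERANGE;		/* nonsensical range from the caller */
	if (cap < 0)
		return cap;		/* no MSI capability: propagate error */
	if (cap < minvec)
		return -EINVAL;		/* device cannot satisfy the minimum */
	if (cap > maxvec)
		return maxvec;		/* clamp to the caller's maximum */
	return cap;
}
```

This matches the ordering in the patch: the range sanity check comes first, then the capability query, then the clamp against minvec/maxvec.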
Re: [PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()
On Mon, Apr 14, 2014 at 09:14:07AM +0200, Alexander Gordeev wrote: > @@ -1244,7 +1241,7 @@ static inline void pcie_set_ecrc_checking(struct > pci_dev *dev) { } > static inline void pcie_ecrc_get_policy(char *str) { } > #endif > > -#define pci_enable_msi(pdev) pci_enable_msi_block(pdev, 1) > +#define pci_enable_msi(pdev) pci_enable_msi_range(pdev, 1, 1) Self-Nack, it should have been pci_enable_msi_exact(pdev, 1) ^^^ > #ifdef CONFIG_HT_IRQ > /* The functions a driver should call */ -- Regards, Alexander Gordeev agord...@redhat.com
[PATCH 0/2] Phase out pci_enable_msi_block()
Hello, This series is against 3.15-rc1. This update obsoletes the pci_enable_msi_block() function in favor of pci_enable_msi_range() and pci_enable_msi_exact(). Cc: Frank Haverkamp Cc: Greg Kroah-Hartman Cc: linux-m...@linux-mips.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: x...@kernel.org Cc: linux-...@vger.kernel.org Alexander Gordeev (2): GenWQE: Use pci_enable_msi_exact() instead of pci_enable_msi_block() PCI/MSI: Phase out pci_enable_msi_block() drivers/misc/genwqe/card_utils.c |2 +- drivers/pci/msi.c| 79 -- include/linux/pci.h |5 +-- 3 files changed, 35 insertions(+), 51 deletions(-) -- 1.7.7.6
Re: [PATCH v3 00/11] Introduce pcim_enable_msi*() family helpers
On Tue, Nov 26, 2013 at 10:09:49AM +0100, Alexander Gordeev wrote: > Patches 3,4 - fixes for PowerPC pSeries platform [...] > PCI/MSI/pSeries: Fix wrong error code reporting > PCI/MSI/pSeries: Make quota traversing and requesting race-safe Hi Bjorn, The two unreviewed PowerPC patches are not a blocker for the rest of the series, as they are completely independent of the rest of the changes. I can rework them later if needed. -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Fri, Oct 11, 2013 at 04:29:39PM -0400, Mark Lord wrote: > > static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec) > > { > > nvec = roundup_pow_of_two(nvec);/* assume 0 < nvec <= 16 */ > > > > xx_disable_all_irqs(dev); > > > > pci_lock_msi(dev->pdev); > > > > rc = pci_get_msix_limit(dev->pdev, nvec); > > if (rc < 0) > > goto err; > > > > nvec = min(nvec, rc); /* if limit is more than requested */ > > nvec = rounddown_pow_of_two(nvec); /* (a) */ > > > > xx_prep_for_msix_vectors(dev, nvec); > > > > rc = pci_enable_msix(dev->pdev, dev->irqs, nvec); /* (b) */ > > if (rc < 0) > > goto err; > > > > pci_unlock_msi(dev->pdev); > > > > dev->num_vectors = nvec;/* (b) */ > > return 0; > > > > err: > > pci_unlock_msi(dev->pdev); > > > > kerr(dev->name, "pci_enable_msix() failed, err=%d", rc); > > dev->num_vectors = 0; > > return rc; > > } > > That would still need a loop, to handle the natural race between > the calls to pci_get_msix_limit() and pci_enable_msix() -- the driver and > device > can and should fall back to a smaller number of vectors when > pci_enable_msix() fails. Could you please explain why the value returned by pci_get_msix_limit() might change before pci_enable_msix() returns, while both are protected by pci_lock_msi()? Anyway, although the loop-free code (IMHO) reads better, pci_lock_msi() is not part of the original proposal, and the more I think about it the less I like it. -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Thu, Oct 10, 2013 at 07:17:18PM -0400, Mark Lord wrote: > Just to help us all understand "the loop" issue.. > > Here's an example of driver code which uses the existing MSI-X interfaces, > for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt. > This is from a new driver I'm working on right now: > > > static int xx_alloc_msix_irqs (struct xx_dev *dev, int nvec) > { > xx_disable_all_irqs(dev); > do { > if (nvec < 2) > xx_prep_for_1_msix_vector(dev); > else if (nvec < 4) > xx_prep_for_2_msix_vectors(dev); > else if (nvec < 8) > xx_prep_for_4_msix_vectors(dev); > else if (nvec < 16) > xx_prep_for_8_msix_vectors(dev); > else > xx_prep_for_16_msix_vectors(dev); > nvec = pci_enable_msix(dev->pdev, dev->irqs, > dev->num_vectors); > } while (nvec > 0); > > if (nvec) { > kerr(dev->name, "pci_enable_msix() failed, err=%d", nvec); > dev->num_vectors = 0; > return nvec; > } > return 0; /* success */ > } Yeah, that is a very good example. I would move all the xx_prep_for_*_msix_vectors() functions into a single helper, i.e. xx_prep_msix_vectors(dev, nvec).
Considering also that (a) we do not want to waste unused platform resources associated with MSI-Xs by pulling more quota than needed, and (b) fixing a couple of bugs, this function could look like this: static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec_in) { int nvec = roundup_pow_of_two(nvec_in); /* assume 0 < nvec_in <= 16 */ int rc; xx_disable_all_irqs(dev); retry: xx_prep_for_msix_vectors(dev, nvec); rc = pci_enable_msix(dev->pdev, dev->irqs, nvec); /* (b) */ if (rc > 0) { nvec = rounddown_pow_of_two(rc); /* (a) */ goto retry; } if (rc) { kerr(dev->name, "pci_enable_msix() failed, err=%d", rc); dev->num_vectors = 0; return rc; } dev->num_vectors = nvec;/* (b) */ return 0; /* success */ } Now, this is a loop-free alternative: static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec) { nvec = roundup_pow_of_two(nvec);/* assume 0 < nvec <= 16 */ xx_disable_all_irqs(dev); pci_lock_msi(dev->pdev); rc = pci_get_msix_limit(dev->pdev, nvec); if (rc < 0) goto err; nvec = min(nvec, rc); /* if limit is more than requested */ nvec = rounddown_pow_of_two(nvec); /* (a) */ xx_prep_for_msix_vectors(dev, nvec); rc = pci_enable_msix(dev->pdev, dev->irqs, nvec); /* (b) */ if (rc < 0) goto err; pci_unlock_msi(dev->pdev); dev->num_vectors = nvec; /* (b) */ return 0; err: pci_unlock_msi(dev->pdev); kerr(dev->name, "pci_enable_msix() failed, err=%d", rc); dev->num_vectors = 0; return rc; } -- Regards, Alexander Gordeev agord...@redhat.com
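The retry descent in the looped variant can be modeled in plain userspace C. This is a sketch under the assumption that pci_enable_msix() returns 0 on success or the positive number of vectors the platform could grant (assumed to be at least 1 here), and that each retry steps down to a power of two not exceeding that number; the helper names mirror the kernel's roundup/rounddown_pow_of_two():

```c
#include <assert.h>

/* Userspace stand-ins for the kernel's power-of-two helpers. */
static unsigned int rounddown_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

static unsigned int roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p *= 2;
	return p;
}

/* Model of pci_enable_msix(): 0 on success, else the available count. */
static int model_enable_msix(unsigned int nvec, unsigned int avail)
{
	return nvec <= avail ? 0 : (int)avail;
}

/* The descent loop: retry with the largest power of two that does not
 * exceed what the platform reported back. Terminates because each
 * reported count is strictly smaller than the failed request. */
static unsigned int alloc_msix(unsigned int wanted, unsigned int avail)
{
	unsigned int nvec = roundup_pow_of_two(wanted);
	int rc;

	while ((rc = model_enable_msix(nvec, avail)) > 0)
		nvec = rounddown_pow_of_two((unsigned int)rc);
	return nvec;
}
```

For example, a request for 16 vectors against a quota of 11 settles on 8 after one retry, which is the behavior the "(a)" annotations above argue for.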
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Thu, Oct 10, 2013 at 09:28:27AM -0700, H. Peter Anvin wrote: > On 10/10/2013 03:17 AM, Alexander Gordeev wrote: > > On Wed, Oct 09, 2013 at 03:24:08PM +1100, Benjamin Herrenschmidt wrote: > > > > Ok, this suggestion has been sounded in one form or another by several people. > > What about naming it pcim_enable_msix_range() and wrapping it in a couple more > > helpers to complete the API: > > > > int pcim_enable_msix_range(pdev, msix_entries, nvec, minvec); > > <0 - error code > > >0 - number of MSIs allocated, where minvec <= result <= nvec > > > > int pcim_enable_msix(pdev, msix_entries, nvec); > > <0 - error code > > >0 - number of MSIs allocated, where 1 <= result <= nvec > > > > int pcim_enable_msix_exact(pdev, msix_entries, nvec); > > <0 - error code > > >0 - number of MSIs allocated, where result == nvec > > > > The latter's return value seems odd, but I could not help making > > it consistent with the first two. > > > > Is there a reason for the wrappers, as opposed to just specifying either > 1 or nvec as the minimum? The wrappers are more handy IMO. I.e. I can imagine people starting to struggle to figure out what minvec to provide: 1 or 0? Why 1? Oh.. okay.. Or should we tolerate 0 (as opposed to -ERANGE)? Well, do not know.. pcim_enable_msix_range(pdev, msix_entries, nvec, nvec) is less readable for me than just pcim_enable_msix_exact(pdev, msix_entries, nvec). > -hpa -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Wed, Oct 09, 2013 at 03:24:08PM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2013-10-08 at 20:55 -0700, H. Peter Anvin wrote: > > Why not add a minimum number to pci_enable_msix(), i.e.: > > > > pci_enable_msix(pdev, msix_entries, nvec, minvec) > > > > ... which means "nvec" is the number of interrupts *requested*, and > > "minvec" is the minimum acceptable number (otherwise fail). > > Which is exactly what Ben (the other Ben :-) suggested and that I > supports... Ok, this suggestion has been sounded in one form or another by several people. What about naming it pcim_enable_msix_range() and wrapping it in a couple more helpers to complete the API: int pcim_enable_msix_range(pdev, msix_entries, nvec, minvec); <0 - error code >0 - number of MSIs allocated, where minvec <= result <= nvec int pcim_enable_msix(pdev, msix_entries, nvec); <0 - error code >0 - number of MSIs allocated, where 1 <= result <= nvec int pcim_enable_msix_exact(pdev, msix_entries, nvec); <0 - error code >0 - number of MSIs allocated, where result == nvec The latter's return value seems odd, but I could not help making it consistent with the first two. (Sorry if you see this message twice - my MUA seems to struggle with one of the CCs.) > Cheers, > Ben. > > -- Regards, Alexander Gordeev agord...@redhat.com
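The contract above can be pinned down with a small model: both wrappers reduce to the range version, with minvec fixed at 1 or at nvec. None of these are real kernel functions; `quota` merely stands in for however many vectors the platform would actually grant:

```c
#include <assert.h>
#include <errno.h>

/* Model of the proposed pcim_enable_msix_range() semantics:
 * returns the number of vectors allocated (minvec <= result <= nvec),
 * or -ENOSPC when the platform cannot meet the caller's minimum. */
static int model_msix_range(int quota, int nvec, int minvec)
{
	if (quota < minvec)
		return -ENOSPC;			/* cannot meet the minimum */
	return quota < nvec ? quota : nvec;	/* grant what fits */
}

/* The plain variant: any positive count is acceptable. */
static int model_msix(int quota, int nvec)
{
	return model_msix_range(quota, nvec, 1);
}

/* The _exact variant: all-or-nothing, hence minvec == nvec. */
static int model_msix_exact(int quota, int nvec)
{
	return model_msix_range(quota, nvec, nvec);
}
```

Expressing `_exact` as `range(nvec, nvec)` is exactly why its positive return value (always equal to nvec) stays consistent with the other two.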
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Mon, Oct 07, 2013 at 02:01:11PM -0400, Tejun Heo wrote: > Hmmm... yeah, the race condition could be an issue as multiple msi > allocation might fail even if the driver can and explicitly handle > multiple allocation if the quota gets reduced in between. BTW, should we care about the quota getting increased in between? That would entail... kind of a pci_get_msi_limit() :), but IMHO it is not worth it. > tejun -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Mon, Oct 07, 2013 at 02:01:11PM -0400, Tejun Heo wrote: > > What about introducing pci_lock_msi() and pci_unlock_msi() and let device > > drivers care about their ranges and specifics in race-safe manner? > > I do not call to introduce it right now (since it appears pSeries has not > > been hitting the race for years) just as a possible alternative to Ben's > > proposal. > > I don't think the same race condition would happen with the loop. We need to distinguish the contexts we're discussing. If we talk about the pSeries quota, then the current pSeries pci_enable_msix() implementation is racy internally and could fail if the quota went down *while* pci_enable_msix() is executing. In this case the loop will have to exit rather than retry with a lower number (what number?). In this regard the new scheme does not bring anything new and relies on the fact that this race does not hit, and therefore does not worry about it. If we talk about the quota as it ought to be, then yes - the loop scheme seems preferable. Overall, it looks like we just need to fix the pSeries implementation, if the guys want it, he-he :) > The problem case is where multiple msi(x) allocation fails completely > because the global limit went down before inquiry and allocation. In > the loop based interface, it'd retry with the lower number. I am probably missing something here. If the global limit went down before inquiry then the inquiry will get what is available and try to allocate with that number. -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Mon, Oct 07, 2013 at 02:21:17PM -0400, Tejun Heo wrote: > Whee that's a lot more than I expected. I was just scanning > multiple msi users. Maybe we can stage the work in more manageable > steps so that you don't have to go through massive conversion only to > do it all over again afterwards and likewise people don't get > bombarded on each iteration? Maybe we can first update pci / msi code > proper, msi and then msix? Multiple MSIs affect just a handful of drivers, really. The MSI-X impact will still be huge. But if we opt for a different name for the new pci_enable_msix(), then we could first update pci/msi, then drivers (possibly in a few stages) and finally remove the old implementation. > tejun -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 05/77] PCI/MSI: Convert pci_msix_table_size() to a public interface
On Mon, Oct 07, 2013 at 02:10:43PM -0400, Tejun Heo wrote: > On Wed, Oct 02, 2013 at 12:48:21PM +0200, Alexander Gordeev wrote: > > Make pci_msix_table_size() return an error code if the device > > does not support MSI-X. This update is needed to facilitate a > > forthcoming re-design of the MSI/MSI-X interrupts enabling pattern. > > > > Device drivers will use this interface to obtain the maximum number > > of MSI-X interrupts the device supports and use that value in > > the following call to the pci_enable_msix() interface. > > Hmmm... I probably missed something but why is this necessary? To > discern between -EINVAL and -ENOSPC? If so, does that really matter? pci_msix_table_size() is a kind of helper and returns 0 if a device does not have an MSI-X table. If we want drivers to use it we need to return -EINVAL for MSI-X incapable/disabled devices. Nothing about -ENOSPC. > tejun -- Regards, Alexander Gordeev agord...@redhat.com
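A toy model of the return-value change being discussed: the helper reporting -EINVAL for an MSI-X-incapable device instead of 0, so drivers can feed its result straight into the following enable call. The two plain arguments are stand-ins for real struct pci_dev state, not actual kernel fields:

```c
#include <assert.h>
#include <errno.h>

/* Model of pci_msix_table_size() after the proposed change:
 * 'has_msix_cap' models whether the device's MSI-X capability is
 * present and enabled; 'table_len' models the MSI-X table length.
 * Returns the table size, or -EINVAL where the old code returned 0. */
static int model_msix_table_size(int has_msix_cap, int table_len)
{
	if (!has_msix_cap)
		return -EINVAL;	/* previously: return 0 */
	return table_len;
}
```

With this shape, a driver can write `nvec = table_size(); if (nvec < 0) return nvec;` and pass nvec on, with no special case for the incapable device.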
Re: [PATCH RFC 07/77] PCI/MSI: Re-design MSI/MSI-X interrupts enablement pattern
On Mon, Oct 07, 2013 at 02:17:49PM -0400, Tejun Heo wrote: > Hello, > > On Wed, Oct 02, 2013 at 12:48:23PM +0200, Alexander Gordeev wrote: > > +static int foo_driver_enable_msi(struct foo_adapter *adapter, int nvec) > > +{ > > + rc = pci_get_msi_cap(adapter->pdev); > > + if (rc < 0) > > + return rc; > > + > > + nvec = min(nvec, rc); > > + if (nvec < FOO_DRIVER_MINIMUM_NVEC) > > + return -ENOSPC; > > + > > + rc = pci_enable_msi_block(adapter->pdev, nvec); > > + return rc; > > +} > > If there are many which duplicate the above pattern, it'd probably be > worthwhile to provide a helper? It's usually a good idea to reduce > the amount of boilerplate code in drivers. I wanted to limit discussion in v1 to as few changes as possible. I 'planned' those helper(s) for a separate effort if/when the most important change is accepted and soaked a bit. > > @@ -975,7 +951,7 @@ int pci_enable_msix(struct pci_dev *dev, struct > > msix_entry *entries, int nvec) > > if (nr_entries < 0) > > return nr_entries; > > if (nvec > nr_entries) > > - return nr_entries; > > + return -EINVAL; > > > > /* Check for any invalid entries */ > > for (i = 0; i < nvec; i++) { > > If we do things this way, it breaks all drivers using this interface > until they're converted, right? Right. And the rest of the series does it. > Also, it probably isn't the best idea > to flip the behavior like this as this can go completely unnoticed (no > compiler warning or anything, the same function just behaves > differently). Maybe it'd be a better idea to introduce a simpler > interface that most can be converted to? Well, an *other* interface is a good idea. What do you mean by simpler here? > tejun -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Tue, Oct 08, 2013 at 03:33:30PM +1100, Michael Ellerman wrote: > On Wed, Oct 02, 2013 at 12:29:04PM +0200, Alexander Gordeev wrote: > > This technique proved to be confusing and error-prone. Vast share > > of device drivers simply fail to follow the described guidelines. > > To clarify "Vast share of device drivers": > > - 58 drivers call pci_enable_msix() > - 24 try a single allocation and then fallback to MSI/LSI > - 19 use the loop style allocation as above > - 14 try an allocation, and if it fails retry once > - 1 incorrectly continues when pci_enable_msix() returns > 0 > > So 33 drivers (> 50%) successfully make use of the "confusing and > error-prone" return value. Ok, you caught me - 'vast share' is incorrect and subject to rewording. But out of those 19 of 58, how many drivers tested their fallbacks on real hardware? IOW, which drivers are affected by the pSeries quota? > And yes, one is buggy, so obviously the interface is too complex. Thanks > drivers/ntb/ntb_hw.c vmxnet3 would be the best example. > cheers -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 54/77] ntb: Ensure number of MSIs on SNB is enough for the link interrupt
On Mon, Oct 07, 2013 at 09:50:57AM -0700, Jon Mason wrote: > On Sat, Oct 05, 2013 at 11:43:04PM +0200, Alexander Gordeev wrote: > > On Wed, Oct 02, 2013 at 05:48:05PM -0700, Jon Mason wrote: > > > On Wed, Oct 02, 2013 at 12:49:10PM +0200, Alexander Gordeev wrote: > > > > Signed-off-by: Alexander Gordeev > > > > --- > > > > drivers/ntb/ntb_hw.c |2 +- > > > > 1 files changed, 1 insertions(+), 1 deletions(-) > > > > > > > > diff --git a/drivers/ntb/ntb_hw.c b/drivers/ntb/ntb_hw.c > > > > index de2062c..eccd5e5 100644 > > > > --- a/drivers/ntb/ntb_hw.c > > > > +++ b/drivers/ntb/ntb_hw.c > > > > @@ -1066,7 +1066,7 @@ static int ntb_setup_msix(struct ntb_device *ndev) > > > > /* On SNB, the link interrupt is always tied to 4th > > > > vector. If > > > > * we can't get all 4, then we can't use MSI-X. > > > > */ > > > > - if (ndev->hw_type != BWD_HW) { > > > > + if ((rc < SNB_MSIX_CNT) && (ndev->hw_type != BWD_HW)) { > > > > > > Nack, this check is unnecessary. > > > > If SNB can do more than SNB_MSIX_CNT MSI-Xs then this check is needed > > to enable less than maximum MSI-Xs in case the maximum was not allocated. > > Otherwise SNB will fallback to single MSI instead of multiple MSI-Xs. > > Per the comment in the code snippet above, "If we can't get all 4, > then we can't use MSI-X". There is already a check to see if more > than 4 were acquired. So it's not possible to hit this. Even if it > was, don't use SNB_MSIX_CNT here (limits.msix_cnt is the preferred > variable). Also, the "()" are unnecessary. The changelog is definitely bogus. I meant here an improvement to the existing scheme, not a conversion to the new one: msix_entries = msix_table_size(val); Getting e.g. 16 vectors here. if (msix_entries > ndev->limits.msix_cnt) { rc = -EINVAL; goto err; } The upper limit check succeeds here. [...] rc = pci_enable_msix(pdev, ndev->msix_entries, msix_entries); pci_enable_msix() does not succeed and returns e.g. 8 here; we should retry.
if (rc < 0) goto err1; if (rc > 0) { /* On SNB, the link interrupt is always tied to 4th vector. If * we can't get all 4, then we can't use MSI-X. */ if (ndev->hw_type != BWD_HW) { On SNB we bail out here, although we could have continued with 8 vectors. We can only use SNB_MSIX_CNT here, since limits.msix_cnt is the upper limit. rc = -EIO; goto err1; } [...] } -- Regards, Alexander Gordeev agord...@redhat.com
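The retry policy argued for here can be sketched as a small decision helper. This is a model, not the ntb driver code: `granted` stands for the positive value pci_enable_msix() reports on a partial failure, the BWD/SNB distinction is collapsed into a flag, and -1 means fall back to MSI:

```c
#include <assert.h>

/* On SNB the link interrupt is always tied to the 4th vector, so a
 * partial MSI-X grant is only usable if it still covers SNB_MSIX_CNT
 * vectors; on BWD any positive grant is fine. Returns the vector
 * count to retry with, or -1 to fall back to a single MSI. */
#define SNB_MSIX_CNT 4

static int msix_retry_count(int granted, int is_snb)
{
	if (is_snb && granted < SNB_MSIX_CNT)
		return -1;	/* cannot honor the link vector: give up MSI-X */
	return granted;		/* retry with what the platform offered */
}
```

So a SNB grant of 8 is retried with 8 (instead of falling back to MSI), while a grant of 3 on SNB still bails out, which is the improvement described above.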
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Sun, Oct 06, 2013 at 05:19:46PM +1100, Benjamin Herrenschmidt wrote: > On Sun, 2013-10-06 at 08:02 +0200, Alexander Gordeev wrote: > > In fact, in the current design to address the quota race decently the > > drivers would have to protect the *loop* to prevent the quota change > > between a pci_enable_msix() returned a positive number and the next > > call to pci_enable_msix() with that number. Is it doable? > > I am not advocating for the current design, simply saying that your > proposal doesn't address this issue while Ben's does. There is one major flaw in the min-max approach - the generic MSI layer will have to take decisions on the exact number of MSIs to request, not device drivers. This will never work for all devices, because there might be specific requirements which are not covered by the min-max. That is what Ben described as "...say, any even number within a certain range". Ben suggests leaving the existing loop scheme to cover such devices, which I think is not right. What about introducing pci_lock_msi() and pci_unlock_msi() and letting device drivers handle their ranges and specifics in a race-safe manner? I am not calling for introducing it right now (since it appears pSeries has not been hitting the race for years), just as a possible alternative to Ben's proposal. -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Sun, Oct 06, 2013 at 08:46:26AM +1100, Benjamin Herrenschmidt wrote: > On Sat, 2013-10-05 at 16:20 +0200, Alexander Gordeev wrote: > > So my point is - drivers should first obtain a number of MSIs they *can* > > get, then *derive* a number of MSIs the device is fine with and only then > > request that number. Not terribly different from memory or any other type > > of resource allocation ;) > > What if the limit is for a group of devices ? Your interface is racy in > that case, another driver could have eaten into the limit in between the > calls. Well, the other driver has had better karma ;) But seriously, the current scheme with a loop is not race-safe wrt any other type of resource which might be exhausted. What makes the quota so special that we should care about it but not care about, e.g., a lack of msi_desc's? Yeah, I know the quota is more likely to hit. But why is it not addressed right now then? Not a single function in the chains... rtas_msi_check_device() -> msi_quota_for_device() -> traverse_pci_devices() rtas_setup_msi_irqs() -> msi_quota_for_device() -> traverse_pci_devices() ...is race-safe. So if it has not been bothering anyone until now then there is no reason to start worrying now :) In fact, in the current design to address the quota race decently the drivers would have to protect the *loop* to prevent the quota changing between pci_enable_msix() returning a positive number and the next call to pci_enable_msix() with that number. Is it doable? > Ben. > > -- Regards, Alexander Gordeev agord...@redhat.com
Re: [PATCH RFC 54/77] ntb: Ensure number of MSIs on SNB is enough for the link interrupt
On Wed, Oct 02, 2013 at 05:48:05PM -0700, Jon Mason wrote: > On Wed, Oct 02, 2013 at 12:49:10PM +0200, Alexander Gordeev wrote: > > Signed-off-by: Alexander Gordeev > > --- > > drivers/ntb/ntb_hw.c |2 +- > > 1 files changed, 1 insertions(+), 1 deletions(-) > > > > diff --git a/drivers/ntb/ntb_hw.c b/drivers/ntb/ntb_hw.c > > index de2062c..eccd5e5 100644 > > --- a/drivers/ntb/ntb_hw.c > > +++ b/drivers/ntb/ntb_hw.c > > @@ -1066,7 +1066,7 @@ static int ntb_setup_msix(struct ntb_device *ndev) > > /* On SNB, the link interrupt is always tied to 4th vector. If > > * we can't get all 4, then we can't use MSI-X. > > */ > > - if (ndev->hw_type != BWD_HW) { > > + if ((rc < SNB_MSIX_CNT) && (ndev->hw_type != BWD_HW)) { > > Nack, this check is unnecessary. If SNB can do more than SNB_MSIX_CNT MSI-Xs, then this check is needed to enable fewer than the maximum MSI-Xs in case the maximum was not allocated. Otherwise SNB will fall back to a single MSI instead of multiple MSI-Xs. -- Regards, Alexander Gordeev agord...@redhat.com