Re: [PATCH 4/4] x86, fpu: irq_fpu_usable: kill all checks except !in_kernel_fpu
On Fri, Aug 29, 2014 at 11:17 AM, Oleg Nesterov wrote: > ONCE AGAIN, THIS IS MORE A QUESTION THAN A PATCH. this patch I think needs more thought for sure. please see below. > > interrupted_kernel_fpu_idle() does: > > if (use_eager_fpu()) > return true; > > return !__thread_has_fpu(current) && > (read_cr0() & X86_CR0_TS); > > and it is absolutely not clear why these 2 cases differ so much. > > To remind, the use_eager_fpu() case is buggy; __save_init_fpu() in > __kernel_fpu_begin() can race with math_state_restore() which does > __thread_fpu_begin() + restore_fpu_checking(). So we should fix this > race anyway, and we can't require __thread_has_fpu() == F like the > !use_eager_fpu() case does; in that case kernel_fpu_begin() will not > work if it interrupts the idle thread (this would reintroduce the > performance regression fixed by 5187b28f). > > Probably math_state_restore() can use kernel_fpu_disable/end() which > sets/clears in_kernel_fpu, or it can disable irqs. Doesn't matter, we > should fix this bug anyway. > > And if we fix this bug, why does the !use_eager_fpu() case still need the much > stricter check? Why can't we handle the __thread_has_fpu(current) > case the same way? > > The comment deleted by this change says: > > and TS must be set so that the clts/stts pair does nothing > > and can explain the difference, but I cannot understand this (again, > assuming that we fix the race(s) mentioned above). > > Say, user_fpu_begin(). Yes, kernel_fpu_begin/end() can restore X86_CR0_TS. > But this should be fine? No. The reason is that the has_fpu state and cr0.TS can get out of sync. Let's say you get an interrupt after the clts() in __thread_fpu_begin(), called as part of user_fpu_begin(). Because of this proposed change, irq_fpu_usable() returns true, so the interrupt can end up using the FPU, and after the return from the interrupt we can have a state where cr0.TS is set but we resume execution at __thread_set_has_fpu().
Now after this point has_fpu is set, but cr0.TS is also set. Any schedule() in this state (say, immediately after preempt_enable() at the end of user_fpu_begin()) is dangerous: we can get a DNA (device-not-available) fault in the middle of __switch_to(), which can lead to subtle bugs. > A context switch before restore_user_xstate() > can equally set it back? > And device_not_available() should be fine even > in kernel context? not in some critical places like __switch_to(). other than this patch, the rest of the changes look ok to me. Can you please resend this patchset with the math_state_restore() race addressed as well? thanks, suresh > > I'll appreciate any comment. > --- > arch/x86/kernel/i387.c | 44 +--- > 1 files changed, 1 insertions(+), 43 deletions(-) > > diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c > index 9fb2899..ef60f33 100644 > --- a/arch/x86/kernel/i387.c > +++ b/arch/x86/kernel/i387.c > @@ -22,54 +22,12 @@ > static DEFINE_PER_CPU(bool, in_kernel_fpu); > > /* > - * Were we in an interrupt that interrupted kernel mode? > - * > - * On others, we can do a kernel_fpu_begin/end() pair *ONLY* if that > - * pair does nothing at all: the thread must not have fpu (so > - * that we don't try to save the FPU state), and TS must > - * be set (so that the clts/stts pair does nothing that is > - * visible in the interrupted kernel thread). > - * > - * Except for the eagerfpu case when we return 1. > - */ > -static inline bool interrupted_kernel_fpu_idle(void) > -{ > - if (this_cpu_read(in_kernel_fpu)) > - return false; > - > - if (use_eager_fpu()) > - return true; > - > - return !__thread_has_fpu(current) && > - (read_cr0() & X86_CR0_TS); > -} > - > -/* > - * Were we in user mode (or vm86 mode) when we were > - * interrupted? > - * > - * Doing kernel_fpu_begin/end() is ok if we are running > - * in an interrupt context from user mode - we'll just > - * save the FPU state as required.
> - */ > -static inline bool interrupted_user_mode(void) > -{ > - struct pt_regs *regs = get_irq_regs(); > - return regs && user_mode_vm(regs); > -} > - > -/* > * Can we use the FPU in kernel mode with the > * whole "kernel_fpu_begin/end()" sequence? > - * > - * It's always ok in process context (ie "not interrupt") > - * but it is sometimes ok even from an irq. > */ > bool irq_fpu_usable(void) > { > - return !in_interrupt() || > - interrupted_user_mode() || > - interrupted_kernel_fpu_idle(); > + return !this_cpu_read(in_kernel_fpu); > } > EXPORT_SYMBOL(irq_fpu_usable); > > -- > 1.5.5.1 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/4] x86, fpu: introduce per-cpu "bool in_kernel_fpu"
On Fri, Aug 29, 2014 at 11:16 AM, Oleg Nesterov wrote: > interrupted_kernel_fpu_idle() tries to detect if kernel_fpu_begin() > is safe or not. In particular it should obviously deny the nested > kernel_fpu_begin(), and this logic doesn't look clean. > > If use_eager_fpu() == T we rely on a) the __thread_has_fpu() check in > interrupted_kernel_fpu_idle(), and b) on the fact that _begin() does > __thread_clear_has_fpu(). > > Otherwise we demand that the interrupted task has no FPU if it is in > kernel mode; this works because __kernel_fpu_begin() does clts(). > > Add the per-cpu "bool in_kernel_fpu" variable, and change this code > to check/set/clear it. This allows some cleanups (see the next > changes) and fixes. > > Note that the current code looks racy. Say, kernel_fpu_begin() right > after math_state_restore()->__thread_fpu_begin() will overwrite the > regs we are going to restore. This patch doesn't even try to fix this, yes indeed, the explicit calls to math_state_restore() in the eager_fpu case have this race. I guess this has been present since commit 5187b28f. thanks, suresh > it just adds the comment, but "in_kernel_fpu" can also be used to > implement kernel_fpu_disable() / kernel_fpu_enable(). 
> > Signed-off-by: Oleg Nesterov > --- > arch/x86/include/asm/i387.h |2 +- > arch/x86/kernel/i387.c | 10 ++ > 2 files changed, 11 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h > index ed8089d..5e275d3 100644 > --- a/arch/x86/include/asm/i387.h > +++ b/arch/x86/include/asm/i387.h > @@ -40,8 +40,8 @@ extern void __kernel_fpu_end(void); > > static inline void kernel_fpu_begin(void) > { > - WARN_ON_ONCE(!irq_fpu_usable()); > preempt_disable(); > + WARN_ON_ONCE(!irq_fpu_usable()); > __kernel_fpu_begin(); > } > > diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c > index d5dd808..8fb8868 100644 > --- a/arch/x86/kernel/i387.c > +++ b/arch/x86/kernel/i387.c > @@ -19,6 +19,8 @@ > #include > #include > > +static DEFINE_PER_CPU(bool, in_kernel_fpu); > + > /* > * Were we in an interrupt that interrupted kernel mode? > * > @@ -33,6 +35,9 @@ > */ > static inline bool interrupted_kernel_fpu_idle(void) > { > + if (this_cpu_read(in_kernel_fpu)) > + return false; > + > if (use_eager_fpu()) > return __thread_has_fpu(current); > > @@ -73,6 +78,9 @@ void __kernel_fpu_begin(void) > { > struct task_struct *me = current; > > + this_cpu_write(in_kernel_fpu, true); > + > + /* FIXME: race with math_state_restore()-like code */ > if (__thread_has_fpu(me)) { > __thread_clear_has_fpu(me); > __save_init_fpu(me); > @@ -99,6 +107,8 @@ void __kernel_fpu_end(void) > } else { > stts(); > } > + > + this_cpu_write(in_kernel_fpu, false); > } > EXPORT_SYMBOL(__kernel_fpu_end); > > -- > 1.5.5.1
Re: [PATCH 0/4] x86, fpu: copy_process's FPU paths cleanups
On Wed, Aug 27, 2014 at 11:51 AM, Oleg Nesterov wrote: > Hello, > > Who can review this? And where should I send FPU changes? > > And it seems that nobody cares about 2 fixes I sent before. > Linus, I understand that you won't take them into v3.17, but > perhaps you can ack/nack them explicitly? It seems that nobody > can do this. > > Oleg. > > arch/x86/include/asm/fpu-internal.h |2 +- > arch/x86/kernel/process.c | 16 +--- > arch/x86/kernel/process_32.c|2 -- > arch/x86/kernel/process_64.c|1 - > 4 files changed, 10 insertions(+), 11 deletions(-) These 4 patches also look good to me. Reviewed-by: Suresh Siddha
Re: [PATCH] x86, fpu: __restore_xstate_sig()->math_state_restore() needs preempt_disable()
On Mon, Aug 25, 2014 at 11:08 AM, Oleg Nesterov wrote: > > Add preempt_disable() + preempt_enable() around math_state_restore() in > __restore_xstate_sig(). Otherwise __switch_to() after __thread_fpu_begin() > can overwrite fpu->state we are going to restore. > > Signed-off-by: Oleg Nesterov > Cc: sta...@vger.kernel.org > --- > arch/x86/kernel/xsave.c |5 - > 1 files changed, 4 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c > index 453343c..c52eb9c 100644 > --- a/arch/x86/kernel/xsave.c > +++ b/arch/x86/kernel/xsave.c > @@ -397,8 +397,11 @@ int __restore_xstate_sig(void __user *buf, void __user > *buf_fx, int size) > set_used_math(); > } > > - if (use_eager_fpu()) > + if (use_eager_fpu()) { > + preempt_disable(); > math_state_restore(); > + preempt_enable(); > + } > > return err; > } else { > oops. looks good to me. Reviewed-by: Suresh Siddha
[tip:x86/urgent] x86, fpu: Check tsk_used_math() in kernel_fpu_end() for eager FPU
Commit-ID: 731bd6a93a6e9172094a2322bd0ee964bb1f4d63 Gitweb: http://git.kernel.org/tip/731bd6a93a6e9172094a2322bd0ee964bb1f4d63 Author: Suresh Siddha AuthorDate: Sun, 2 Feb 2014 22:56:23 -0800 Committer: H. Peter Anvin CommitDate: Tue, 11 Mar 2014 12:32:52 -0700 x86, fpu: Check tsk_used_math() in kernel_fpu_end() for eager FPU For non-eager fpu mode, the thread's fpu state is allocated during the first fpu usage (in the context of the device-not-available exception). This (math_state_restore()) can be a blocking call, and hence we enable interrupts (which were originally disabled when the exception happened), allocate memory, disable interrupts again, etc. But the eager-fpu mode calls the same math_state_restore() from kernel_fpu_end(). The assumption is that tsk_used_math() is always set in eager-fpu mode, which avoids the code path of enabling interrupts, allocating fpu state with a blocking call, disabling interrupts, etc. But the below issue was noticed by Maarten Baert, Nate Eldredge and a few others: If a user process dumps core on an ecryptfs filesystem while aesni-intel is loaded, we get a BUG() in __find_get_block() complaining that it was called with interrupts disabled; then all further accesses to our ecryptfs filesystem hang and we have to reboot. The aesni-intel code (encrypting the core file that we are writing) needs the FPU and quite properly wraps its code in kernel_fpu_{begin,end}(), the latter of which calls math_state_restore(). So after kernel_fpu_end(), interrupts may be disabled, which nobody seems to expect, and they stay that way until we eventually get to __find_get_block() which barfs. For eager fpu, most of the time tsk_used_math() is true. At a few instances during thread exit, signal return handling, etc., tsk_used_math() might be false. In kernel_fpu_end(), for eager-fpu, call math_state_restore() only if tsk_used_math() is set. Otherwise, don't bother. The kernel code path which cleared tsk_used_math() knows what needs to be done with the fpu state. 
Reported-by: Maarten Baert Reported-by: Nate Eldredge Suggested-by: Linus Torvalds Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1391410583.3801.6.camel@europa Cc: George Spelvin Signed-off-by: H. Peter Anvin --- arch/x86/kernel/i387.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index e8368c6..d5dd808 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -86,10 +86,19 @@ EXPORT_SYMBOL(__kernel_fpu_begin); void __kernel_fpu_end(void) { - if (use_eager_fpu()) - math_state_restore(); - else + if (use_eager_fpu()) { + /* +* For eager fpu, most the time, tsk_used_math() is true. +* Restore the user math as we are done with the kernel usage. +* At few instances during thread exit, signal handling etc, +* tsk_used_math() is false. Those few places will take proper +* actions, so we don't need to restore the math here. +*/ + if (likely(tsk_used_math(current))) + math_state_restore(); + } else { stts(); + } } EXPORT_SYMBOL(__kernel_fpu_end);
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Fri, Mar 7, 2014 at 3:18 PM, H. Peter Anvin wrote: > > Hi Suresh, > > Any thoughts on this? hi Peter, Can you please pick up the second short patch (https://lkml.org/lkml/2014/2/3/21), which actually fixes the reported problem at hand? It has been tested and acked by all the problem reporters. I will respond shortly about the first patch (which is more of a cleanup). thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Mon, 2014-02-03 at 10:20 -0800, Linus Torvalds wrote: > Thinking about it some more, this patch is *almost* not needed at all. > > I'm wondering if you should just change the first patch to just always > initialize the fpu when it is allocated, and at execve() time (ie in > flush_thread()). > We already do this for the eager-fpu case, in eager_fpu_init() during boot and in drop_init_fpu() during flush_thread(). > If we do that, then this: > > + if (!tsk_used_math(tsk)) > + init_fpu(tsk); > > can be dropped entirely from math_state_restore(). yeah, probably for eager-fpu, but: > And quite frankly, > at that point, I think all the changes to __kernel_fpu_end() can go > away, because at that point math_state_restore() really does the right > thing - all the allocations are gone, and all the async task state > games are gone, only the "restore state" remains. > > Hmm? So the only thing needed would be to add that "init_fpu()" to the > initial bootmem allocation path and to change flush_thread() (it > currently does "drop_init_fpu()", let's just make it initialize the > FPU state using fpu_finit()), and then we could remove the whole > "used_math" bit entirely, and just say that the FPU is always > initialized. > > What do you guys think? No. As I mentioned in the changelog, there is one more path which does drop_fpu(), and we still depend on this used_math bit for eager-fpu: in the signal restore path for a 32-bit app, where we copy the sig-context state from the user stack to the kernel manually (because of legacy reasons, where the fsave state is followed by the fxsave state etc. in the 32-bit signal handler context, and we have to go through convert_to_fxsr() etc.). From __restore_xstate_sig(): /* * Drop the current fpu which clears used_math(). This ensures * that any context-switch during the copy of the new state, * avoids the intermediate state from getting restored/saved. * Thus avoiding the new restored state from getting corrupted. 
* We will be ready to restore/save the state only after * set_used_math() is again set. */ drop_fpu(tsk); thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sun, 2014-02-02 at 11:15 -0800, Linus Torvalds wrote: > On Sat, Feb 1, 2014 at 11:19 PM, Suresh Siddha wrote: > > > > The real fix for Nate's problem will be coming from Linus, with a > > slightly modified option-b that Linus proposed. Linus, please let me > > know if you want me to spin it. I can do it sunday night. > > Please do it, since clearly I wasn't aware enough about the whole > non-TS-checking FPU state details. > > Also, since this issue doesn't seem to be a recent regression, I'm not > going to take this patch directly (even though I'm planning on doing > -rc1 in a few hours), and expect that I'll get it through the normal > channels (presumably together with the __kernel_fpu_end cleanups). Ok > with everybody? Here is the second patch, which should fix the issue reported in this thread. Maarten, Nate, George, please give this patch a try as is and see if it helps address the issue you ran into. And please ack/review with your test results. The other patch, which cleans up the irq_enable/disable logic in math_state_restore(), was sent yesterday. You can run your experiments with both these patches if you want, but your issue should get fixed with just the appended patch here. Peter, please push both these patches through the normal channels depending on the results. thanks, suresh --- From: Suresh Siddha Subject: x86, fpu: check tsk_used_math() in kernel_fpu_end() for eager fpu For non-eager fpu mode, the thread's fpu state is allocated during the first fpu usage (in the context of the device-not-available exception). This (math_state_restore()) can be a blocking call and hence we enable interrupts (which were originally disabled when the exception happened), allocate memory, disable interrupts etc. But the eager-fpu mode calls the same math_state_restore() from kernel_fpu_end().
The assumption being that tsk_used_math() is always set for the eager-fpu mode and thus avoids the code path of enabling interrupts, allocating fpu state using a blocking call and disabling interrupts etc. But the below issue was noticed by Maarten Baert, Nate Eldredge and a few others: If a user process dumps core on an ecryptfs while aesni-intel is loaded, we get a BUG() in __find_get_block() complaining that it was called with interrupts disabled; then all further accesses to our ecryptfs hang and we have to reboot. The aesni-intel code (encrypting the core file that we are writing) needs the FPU and quite properly wraps its code in kernel_fpu_{begin,end}(), the latter of which calls math_state_restore(). So after kernel_fpu_end(), interrupts may be disabled, which nobody seems to expect, and they stay that way until we eventually get to __find_get_block() which barfs. For eager fpu, most of the time, tsk_used_math() is true. In a few instances during thread exit, signal return handling etc, tsk_used_math() might be false. In kernel_fpu_end(), for eager-fpu, call math_state_restore() only if tsk_used_math() is set. Otherwise, don't bother. The kernel code path which cleared tsk_used_math() knows what needs to be done with the fpu state. Reported-by: Maarten Baert Reported-by: Nate Eldredge Suggested-by: Linus Torvalds Signed-off-by: Suresh Siddha Cc: George Spelvin --- arch/x86/kernel/i387.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 4e5f770..670bba1 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -87,10 +87,19 @@ EXPORT_SYMBOL(__kernel_fpu_begin); void __kernel_fpu_end(void) { - if (use_eager_fpu()) - math_state_restore(); - else + if (use_eager_fpu()) { + /* +* For eager fpu, most of the time, tsk_used_math() is true. +* Restore the user math as we are done with the kernel usage. +* In a few instances during thread exit, signal handling etc, +* tsk_used_math() is false.
Those few places will take proper +* actions, so we don't need to restore the math here. +*/ + if (likely(tsk_used_math(current))) + math_state_restore(); + } else { stts(); + } } EXPORT_SYMBOL(__kernel_fpu_end);
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, 2014-02-01 at 17:06 -0800, Suresh Siddha wrote: > Meanwhile I have the patch removing the delayed dynamic allocation for > non-eager fpu. will post it after some testing. Appended the patch for this. Tested for the last 4-5 hours on my laptop. The real fix for Nate's problem will be coming from Linus, with a slightly modified option-b that Linus proposed. Linus, please let me know if you want me to spin it. I can do it sunday night. thanks, suresh --- From: Suresh Siddha Subject: x86, fpu: remove the logic of non-eager fpu mem allocation at the first usage For non-eager fpu mode, the thread's fpu state is allocated during the first fpu usage (in the context of the device-not-available exception). This can be a blocking call and hence we enable interrupts (which were originally disabled when the exception happened), allocate memory, disable interrupts etc. While this saves 512 bytes or so per-thread, there are some issues in general. a. Most of the future cases will anyway be using eager FPU (because of processor features like xsaveopt, LWP, MPX etc) and they do the allocation at thread creation itself. Nice to have one common mechanism, as all the state save/restore code is shared. Avoids the confusion and minimizes the subtle bugs in the core piece involved with context-switch. b. If a parent thread uses the FPU, during fork() we allocate the FPU state in the child and copy the state etc. Shortly after this, during exec() we free it up, so that we can later allocate during the first usage of the FPU. So this free/allocate might be slower for some workloads. c. math_state_restore() is called from multiple places and it is error prone if the caller expects interrupts to be disabled throughout the execution of math_state_restore(). Can lead to subtle bugs like Ubuntu bug #1265841. Memory savings will be small anyways and the code complexity introducing subtle bugs is not worth it. So remove the logic of non-eager fpu mem allocation at the first usage.
Signed-off-by: Suresh Siddha --- arch/x86/kernel/i387.c | 14 +- arch/x86/kernel/process.c | 6 -- arch/x86/kernel/traps.c | 16 ++-- arch/x86/kernel/xsave.c | 2 -- 4 files changed, 7 insertions(+), 31 deletions(-) diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index e8368c6..4e5f770 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -5,6 +5,7 @@ * General FPU state handling cleanups * Gareth Hughes <gar...@valinux.com>, May 2000 */ +#include <linux/bootmem.h> #include <linux/module.h> #include <linux/regset.h> #include <linux/sched.h> @@ -186,6 +187,10 @@ void fpu_init(void) if (xstate_size == 0) init_thread_xstate(); + if (!current->thread.fpu.state) + current->thread.fpu.state = + alloc_bootmem_align(xstate_size, __alignof__(struct xsave_struct)); + mxcsr_feature_mask_init(); xsave_init(); eager_fpu_init(); @@ -219,8 +224,6 @@ EXPORT_SYMBOL_GPL(fpu_finit); */ int init_fpu(struct task_struct *tsk) { - int ret; - if (tsk_used_math(tsk)) { if (cpu_has_fpu && tsk == current) unlazy_fpu(tsk); @@ -228,13 +231,6 @@ int init_fpu(struct task_struct *tsk) return 0; } - /* -* Memory allocation at the first usage of the FPU and other state. -*/ - ret = fpu_alloc(&tsk->thread.fpu); - if (ret) - return ret; - fpu_finit(&tsk->thread.fpu); set_stopped_child_used_math(tsk); diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 3fb8d95..cd9c190 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -128,12 +128,6 @@ void flush_thread(void) flush_ptrace_hw_breakpoint(tsk); memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array)); drop_init_fpu(tsk); - /* -* Free the FPU state for non xsave platforms. They get reallocated -* lazily at the first use.
-*/ - if (!use_eager_fpu()) - free_thread_xstate(tsk); } static void hard_disable_TSC(void) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 57409f6..3265429 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -623,20 +623,8 @@ void math_state_restore(void) { struct task_struct *tsk = current; - if (!tsk_used_math(tsk)) { - local_irq_enable(); - /* -* does a slab alloc which can sleep -*/ - if (init_fpu(tsk)) { - /* -* ran out of memory! -*/ - do_group_exit(SIGKILL); - return; - } - local_irq_disable(); - } + if (!tsk_used_math(tsk)) + init_fpu(tsk); __thread_fpu_begin(tsk); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, Feb 1, 2014 at 5:51 PM, Linus Torvalds wrote: > On Sat, Feb 1, 2014 at 5:47 PM, Suresh Siddha wrote: >> >> So if the restore failed, we should do something like drop_init_fpu(), >> which will restore init-state to the registers. >> >> for eager-fpu() paths we don't use clts() stts() etc. > > Uhhuh. Ok. > > Why do we do that, btw? I think it would make much more sense to just > do what I *thought* we did, and just make it a context-switch-time > optimization ("let's always switch FP state"), not make it a huge > semantic difference. clts/stts is more costly, and not all the state under xsave adheres to the cr0.TS/DNA rules. Did I answer your question? thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, Feb 1, 2014 at 5:38 PM, Linus Torvalds wrote: > It definitely does not want an else, I think. > > If tsk_used_math() is false, or if the FPU restore failed, we > *definitely* need that stts(). Otherwise we'd return to user mode with > random contents in the FP state, and let user mode muck around with > it. > > No? So if the restore failed, we should do something like drop_init_fpu(), which will restore the init-state to the registers. For eager-fpu paths we don't use clts()/stts() etc. thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, Feb 1, 2014 at 5:26 PM, H. Peter Anvin wrote: > Even "b" does that, no? oh right. It needs an else; only for the non-eager fpu case should we do stts():

void __kernel_fpu_end(void)
{
	if (use_eager_fpu()) {
		struct task_struct *me = current;

		if (tsk_used_math(me) && likely(!restore_fpu_checking(me)))
			return;
	} else
		stts();
}

thanks, suresh

> "a" should be fine as long as we don't ever use
> those features in the kernel, even under kernel_fpu_begin/end().
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, Feb 1, 2014 at 11:27 AM, Linus Torvalds wrote: > That said, regardless of the allocation issue, I do think that it's > stupid for kernel_fpu_{begin,end} to save the math state if > "used_math" was not set. So I do think __kernel_fpu_end() as-is is > buggy and stupid. For the eager_fpu case, the assumption was that every task should always have 'used_math' set. But I think there is a race, where we drop the fpu explicitly by doing drop_fpu() and meanwhile we get an interrupt etc that ends up using the fpu. So I will ack option "b", as option "a" breaks the features which don't take cr0.TS into account. Meanwhile I have the patch removing the delayed dynamic allocation for non-eager fpu; will post it after some testing. thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
hi, On Thu, Jan 30, 2014 at 2:24 PM, Linus Torvalds wrote: > I'm adding in some people here, because I think in the end this bug > was introduced by commit 304bceda6a18 ("x86, fpu: use non-lazy fpu > restore for processors supporting xsave") that introduced that > math_state_restore() in kernel_fpu_end(), but we have other commits > (like 5187b28ff08: "x86: Allow FPU to be used at interrupt time even > with eagerfpu") that seem tangential too and might be part of why it > actually *triggers* now. > > Comments? I haven't been following the recent changes closely, so before I get a chance to review the current bug and the relevant commits, wanted to add that: a. delayed dynamic allocation of the FPU state area was not a good idea (from me). Given that most of the future cases will anyway be using eager FPU (because of processor features like xsaveopt etc, and applications implicitly using the FPU because of optimizations in commonly used libraries etc), we should probably go back to allocation of the FPU state area during thread creation for everyone (including non-eager cases). Memory savings will be small anyways, and the code complexity introducing subtle bugs like this is not worth it. b. with the above change, kernel_fpu_begin() will just save any live user math state and be ready for kernel math operations. And kernel_fpu_end() will drop the kernel math state and, for the eager-fpu case, restore the user math state. We will avoid worrying about any memory allocations in math_state_restore() with interrupts disabled etc. If there are no objections, I will see if I can come up with a quick patch, or will ask HPA to help fill me in. thanks, suresh
[tip:x86/cleanups] x86, apic: Cleanup cfg->domain setup for legacy interrupts
Commit-ID: 29c574c0aba8dc0736e19eb9b24aad28cc5c9098 Gitweb: http://git.kernel.org/tip/29c574c0aba8dc0736e19eb9b24aad28cc5c9098 Author: Suresh Siddha AuthorDate: Mon, 26 Nov 2012 14:49:36 -0800 Committer: H. Peter Anvin CommitDate: Mon, 26 Nov 2012 15:43:25 -0800 x86, apic: Cleanup cfg->domain setup for legacy interrupts Issues that need to be handled: * Handle PIC interrupts on any CPU irrespective of the apic mode * In the apic lowest priority logical flat delivery mode, be prepared to handle the interrupt on any CPU irrespective of what the IO-APIC RTE says. * Because of above, when the IO-APIC starts handling the legacy PIC interrupt, use the same vector that is being used by the PIC while programming the corresponding IO-APIC RTE. Start with all the cpu's in the legacy PIC interrupts cfg->domain. By the time IO-APIC starts taking over the PIC interrupts, apic driver model is finalized. So depend on the assign_irq_vector() to update the cfg->domain and retain the same vector that was used by PIC before. For the logical apic flat mode, cfg->domain is updated (during the first call to assign_irq_vector()) to contain all the possible online cpu's (0xff). Vector used for the legacy PIC interrupt doesn't change when the IO-APIC starts handling the interrupt. Any interrupt migration after that doesn't change the cfg->domain or the vector used. For other apic modes like physical mode, cfg->domain is updated (during the first call to assign_irq_vector()) to the boot cpu (cpu-0), with the same vector that is being used by the PIC. When that interrupt is migrated to a different cpu, cfg->domain and the vector assigned will change accordingly. Tested-by: Borislav Petkov Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1353970176.21070.51.ca...@sbsiddha-desk.sc.intel.com Signed-off-by: H.
Peter Anvin --- arch/x86/kernel/apic/io_apic.c | 26 ++++++-------------------- 1 file changed, 6 insertions(+), 20 deletions(-) diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index c265593..0c1f366 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -234,11 +234,11 @@ int __init arch_early_irq_init(void) zalloc_cpumask_var_node(&cfg[i].old_domain, GFP_KERNEL, node); /* * For legacy IRQ's, start with assigning irq0 to irq15 to -* IRQ0_VECTOR to IRQ15_VECTOR on cpu 0. +* IRQ0_VECTOR to IRQ15_VECTOR for all cpu's. */ if (i < legacy_pic->nr_legacy_irqs) { cfg[i].vector = IRQ0_VECTOR + i; - cpumask_set_cpu(0, cfg[i].domain); + cpumask_setall(cfg[i].domain); } } @@ -1141,7 +1141,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask) * allocation for the members that are not used anymore. */ cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask); - cfg->move_in_progress = 1; + cfg->move_in_progress = + cpumask_intersects(cfg->old_domain, cpu_online_mask); cpumask_and(cfg->domain, cfg->domain, tmp_mask); break; } @@ -1172,8 +1173,9 @@ next: current_vector = vector; current_offset = offset; if (cfg->vector) { - cfg->move_in_progress = 1; cpumask_copy(cfg->old_domain, cfg->domain); + cfg->move_in_progress = + cpumask_intersects(cfg->old_domain, cpu_online_mask); } for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask) per_cpu(vector_irq, new_cpu)[vector] = irq; @@ -1241,12 +1243,6 @@ void __setup_vector_irq(int cpu) cfg = irq_get_chip_data(irq); if (!cfg) continue; - /* -* If it is a legacy IRQ handled by the legacy PIC, this cpu -* will be part of the irq_cfg's domain. -*/ - if (irq < legacy_pic->nr_legacy_irqs && !IO_APIC_IRQ(irq)) - cpumask_set_cpu(cpu, cfg->domain); if (!cpumask_test_cpu(cpu, cfg->domain)) continue; @@ -1356,16 +1352,6 @@ static void setup_ioapic_irq(unsigned int irq, struct irq_cfg *cfg, if (!IO_APIC_IRQ(irq)) return; - /* -* For legacy irqs, cfg->domain starts with cpu 0.
Now that IO-APIC -* can handle this irq and the apic driver is finialized at this point, -* update the cfg->domain. -*/ - if (irq < legacy_pic->nr_legacy_irqs &&
[patch] x86, apic: cleanup cfg->domain setup for legacy interrupts
Had this cleanup patch (tested before by me and Boris as well) for a while.
Forgot to post this earlier. Thanks.

---8<---
From: Suresh Siddha
Subject: x86, apic: cleanup cfg->domain setup for legacy interrupts

Issues that need to be handled:
* Handle PIC interrupts on any CPU irrespective of the apic mode
* In the apic lowest priority logical flat delivery mode, be prepared to
  handle the interrupt on any CPU irrespective of what the IO-APIC RTE says.
* Because of the above, when the IO-APIC starts handling the legacy PIC
  interrupt, use the same vector that is being used by the PIC while
  programming the corresponding IO-APIC RTE.

Start with all the cpu's in the legacy PIC interrupts cfg->domain. By the
time the IO-APIC starts taking over the PIC interrupts, the apic driver
model is finalized. So depend on assign_irq_vector() to update the
cfg->domain and retain the same vector that was used by the PIC before.

For the logical apic flat mode, cfg->domain is updated (during the first
call to assign_irq_vector()) to contain all the possible online cpu's
(0xff). The vector used for the legacy PIC interrupt doesn't change when
the IO-APIC starts handling the interrupt. Any interrupt migration after
that doesn't change the cfg->domain or the vector used.

For other apic modes like physical mode, cfg->domain is updated (during
the first call to assign_irq_vector()) to the boot cpu (cpu-0), with the
same vector that is being used by the PIC. When that interrupt is migrated
to a different cpu, cfg->domain and the vector assigned will change
accordingly.

Tested-by: Borislav Petkov
Signed-off-by: Suresh Siddha
---
 arch/x86/kernel/apic/io_apic.c | 26 ++++++--------------------
 1 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index c265593..0c1f366 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -234,11 +234,11 @@ int __init arch_early_irq_init(void)
 		zalloc_cpumask_var_node(&cfg[i].old_domain, GFP_KERNEL, node);
 		/*
 		 * For legacy IRQ's, start with assigning irq0 to irq15 to
-		 * IRQ0_VECTOR to IRQ15_VECTOR on cpu 0.
+		 * IRQ0_VECTOR to IRQ15_VECTOR for all cpu's.
 		 */
 		if (i < legacy_pic->nr_legacy_irqs) {
 			cfg[i].vector = IRQ0_VECTOR + i;
-			cpumask_set_cpu(0, cfg[i].domain);
+			cpumask_setall(cfg[i].domain);
 		}
 	}
 
@@ -1141,7 +1141,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			 * allocation for the members that are not used anymore.
 			 */
 			cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask);
-			cfg->move_in_progress = 1;
+			cfg->move_in_progress =
+			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
 			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
 			break;
 		}
@@ -1172,8 +1173,9 @@ next:
 		current_vector = vector;
 		current_offset = offset;
 		if (cfg->vector) {
-			cfg->move_in_progress = 1;
 			cpumask_copy(cfg->old_domain, cfg->domain);
+			cfg->move_in_progress =
+			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
 		}
 		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq;
@@ -1241,12 +1243,6 @@ void __setup_vector_irq(int cpu)
 		cfg = irq_get_chip_data(irq);
 		if (!cfg)
 			continue;
-		/*
-		 * If it is a legacy IRQ handled by the legacy PIC, this cpu
-		 * will be part of the irq_cfg's domain.
-		 */
-		if (irq < legacy_pic->nr_legacy_irqs && !IO_APIC_IRQ(irq))
-			cpumask_set_cpu(cpu, cfg->domain);
 		if (!cpumask_test_cpu(cpu, cfg->domain))
 			continue;
 
@@ -1356,16 +1352,6 @@ static void setup_ioapic_irq(unsigned int irq, struct irq_cfg *cfg,
 	if (!IO_APIC_IRQ(irq))
 		return;
 
-	/*
-	 * For legacy irqs, cfg->domain starts with cpu 0. Now that IO-APIC
-	 * can handle this irq and the apic driver is finialized at this point,
-	 * update the cfg->domain.
-	 */
-	if (irq < legacy_pic->nr_legacy_irqs &&
-	    cpumask_equal(cfg->domain, cpumask_of(0)))
-		apic->vector_allocation_domain(0, cfg->domain,
-					       apic->target_cpus());
-
 	if (assign_irq_vector(irq, cfg, apic->target_cpus()))
 		return;
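The old_domain/move_in_progress logic in the __assign_irq_vector() hunks above can be sketched in plain userspace C (a model only, not the kernel code: a 64-bit word stands in for struct cpumask, and the function names below are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch, not kernel code: a 64-bit word stands in for cpumask. */
typedef uint64_t cpumask_t;

static int cpumask_intersects(cpumask_t a, cpumask_t b)
{
	return (a & b) != 0;
}

/*
 * Mimics the hunks above: old_domain collects the cpus dropped from
 * cfg->domain, and move_in_progress is now set only if one of those
 * dropped cpus is actually online (i.e. a real migration happened).
 */
static int assign_sets_move_in_progress(cpumask_t domain, cpumask_t tmp_mask,
					cpumask_t online)
{
	cpumask_t old_domain = domain & ~tmp_mask;	/* cpumask_andnot() */

	return cpumask_intersects(old_domain, online);
}

/*
 * A legacy PIC irq starts with cpumask_setall(): narrowing that down to
 * the online target cpus drops only cpus that were never online, so no
 * vector-cleanup pass is triggered for the initial IO-APIC takeover.
 */
static const cpumask_t all_cpus = ~0ULL;	/* cpumask_setall() */
```

With two cpus online (mask 0x3), narrowing the setall domain down to those cpus yields move_in_progress == 0, while migrating a live irq from cpu0 to cpu1 still yields 1 — which is the commit's point.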
[tip:x86/timers] x86: apic: Use tsc deadline for oneshot when available
Commit-ID:  279f1461432ccdec0b98c0bcbe0a8e2c0f6fdda5
Gitweb:     http://git.kernel.org/tip/279f1461432ccdec0b98c0bcbe0a8e2c0f6fdda5
Author:     Suresh Siddha
AuthorDate: Mon, 22 Oct 2012 14:37:58 -0700
Committer:  Thomas Gleixner
CommitDate: Fri, 2 Nov 2012 11:23:37 +0100

x86: apic: Use tsc deadline for oneshot when available

If the TSC deadline mode is supported, the LAPIC timer one-shot mode can be
implemented using the IA32_TSC_DEADLINE MSR. An interrupt will be generated
when the TSC value equals or exceeds the value in the IA32_TSC_DEADLINE MSR.

This enables us to skip the APIC calibration during boot. Also, in xapic
mode, this enables us to skip the uncached apic access to re-arm the APIC
timer.

As this timer ticks at the high frequency TSC rate, we use the TSC_DIVISOR
(32) to work with the 32-bit restrictions in the clockevent API's to avoid
64-bit divides etc (frequency is u32 and "unsigned long" in
set_next_event(), max_delta limits the next event to 32-bit for 32-bit
kernels).

Signed-off-by: Suresh Siddha
Cc: ve...@google.com
Cc: len.br...@intel.com
Link: http://lkml.kernel.org/r/1350941878.6017.31.ca...@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner
---
 Documentation/kernel-parameters.txt |    4 ++
 arch/x86/include/asm/msr-index.h    |    2 +
 arch/x86/kernel/apic/apic.c         |   73 +++++++++++++++++++++----------
 3 files changed, 59 insertions(+), 20 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..4aa9ca0 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1304,6 +1304,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	lapic		[X86-32,APIC] Enable the local APIC even if BIOS
 			disabled it.
 
+	lapic=		[x86,APIC] "notscdeadline" Do not use TSC deadline
+			value for LAPIC timer one-shot implementation. Default
+			back to the programmable timer unit in the LAPIC.
+
 	lapic_timer_c2_ok	[X86,APIC] trust the local apic timer
 			in C2 power state.
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7f0edce..e400cdb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -337,6 +337,8 @@
 #define MSR_IA32_MISC_ENABLE_TURBO_DISABLE	(1ULL << 38)
 #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE	(1ULL << 39)
 
+#define MSR_IA32_TSC_DEADLINE	0x06E0
+
 /* P4/Xeon+ specific */
 #define MSR_IA32_MCG_EAX	0x0180
 #define MSR_IA32_MCG_EBX	0x0181
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b17416e..b994cc8 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -90,21 +90,6 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
  */
 DEFINE_EARLY_PER_CPU_READ_MOSTLY(int, x86_cpu_to_logical_apicid, BAD_APICID);
 
-/*
- * Knob to control our willingness to enable the local APIC.
- *
- * +1=force-enable
- */
-static int force_enable_local_apic __initdata;
-/*
- * APIC command line parameters
- */
-static int __init parse_lapic(char *arg)
-{
-	force_enable_local_apic = 1;
-	return 0;
-}
-early_param("lapic", parse_lapic);
 
 /* Local APIC was disabled by the BIOS and enabled by the kernel */
 static int enabled_via_apicbase;
 
@@ -133,6 +118,25 @@ static inline void imcr_apic_to_pic(void)
 }
 #endif
 
+/*
+ * Knob to control our willingness to enable the local APIC.
+ *
+ * +1=force-enable
+ */
+static int force_enable_local_apic __initdata;
+/*
+ * APIC command line parameters
+ */
+static int __init parse_lapic(char *arg)
+{
+	if (config_enabled(CONFIG_X86_32) && !arg)
+		force_enable_local_apic = 1;
+	else if (!strncmp(arg, "notscdeadline", 13))
+		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+	return 0;
+}
+early_param("lapic", parse_lapic);
+
 #ifdef CONFIG_X86_64
 static int apic_calibrate_pmtmr __initdata;
 static __init int setup_apicpmtimer(char *s)
@@ -315,6 +319,7 @@ int lapic_get_maxlvt(void)
 
 /* Clock divisor */
 #define APIC_DIVISOR 16
+#define TSC_DIVISOR  32
 
 /*
  * This function sets up the local APIC timer, with a timeout of
@@ -333,6 +338,9 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	lvtt_value = LOCAL_TIMER_VECTOR;
 	if (!oneshot)
 		lvtt_value |= APIC_LVT_TIMER_PERIODIC;
+	else if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+		lvtt_value |= APIC_LVT_TIMER_TSCDEADLINE;
+
 	if (!lapic_is_integrated())
 		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
 
@@ -341,6 +349,11 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	apic_write(APIC_LVTT, lvtt_value);
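The TSC_DIVISOR arithmetic described in the changelog can be sketched in plain userspace C. This is a simplified model of what the commit's set_next_event() path does, not the kernel code itself (the real handler arms the deadline with wrmsrl(); the helper names here are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define TSC_DIVISOR 32ULL

/*
 * Userspace model: the clockevent core hands set_next_event() a delta
 * limited to 32 bits, expressed in units of (TSC rate / TSC_DIVISOR).
 * The handler scales it back up and arms the deadline at
 * now + delta * TSC_DIVISOR; the hardware then fires the timer
 * interrupt once TSC >= IA32_TSC_DEADLINE.
 */
static uint64_t tsc_deadline(uint64_t tsc_now, uint32_t delta)
{
	return tsc_now + (uint64_t)delta * TSC_DIVISOR;
}

/*
 * Registering the clockevent device at tsc_khz / TSC_DIVISOR keeps the
 * advertised frequency inside a u32 even for multi-GHz TSCs.
 */
static uint32_t clockevent_khz(uint64_t tsc_khz)
{
	return (uint32_t)(tsc_khz / TSC_DIVISOR);
}
```

A full 32-bit delta then spans 2^32 * 32 TSC cycles, so 32-bit kernels keep their 32-bit clockevent limits without needing 64-bit divides in the hot path.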
Re: [PATCH] x86/ioapic: Fix the vector_irq[] is corrupted randomly
On Tue, 2012-10-30 at 00:15 +0800, Chuansheng Liu wrote:
> Not all irq chips are the IO-APIC chip.
>
> In our system, there are many demux GPIO interrupts in addition to the
> io-apic chip interrupts, and these GPIO interrupts belong to other irq
> chips; their chip data is not of type struct irq_cfg either.
>
> But the function __setup_vector_irq() lists all allocated irqs and
> presumes every irq chip is ioapic_chip and that the chip data is of
> type struct irq_cfg, which can possibly corrupt vector_irq randomly.
>
> For example, if one irq 258 is not an io-apic chip irq, then in
> __setup_vector_irq() the chip data is forced to be used as struct
> irq_cfg, and the values cfg->domain and cfg->vector are wrongly used
> to write vector_irq:
> 	vector = cfg->vector;
> 	per_cpu(vector_irq, cpu)[vector] = irq;
>
> This patch uses the .flags to identify if the irq chip is io-apic.

I have a feeling that your gpio driver is abusing the 'chip_data' in the
struct irq_data. Shouldn't the driver be using 'handler_data' instead?

From include/linux/irq.h:

 * @handler_data:	per-IRQ data for the irq_chip methods
 * @chip_data:		platform-specific per-chip private data for the chip
 *			methods, to allow shared chip implementations

Also, how are these routed to the processors, and what is the mechanism
of the vector assignment for these irq's? I presume irq_cfg is needed for
the setup and the interrupt migration from one cpu to another. What am I
missing?

thanks,
suresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
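The chip_data vs. handler_data split being suggested can be sketched like this — a userspace model with hypothetical struct layouts and helper names, not the real <linux/irq.h> definitions:

```c
#include <assert.h>

/* Userspace model with hypothetical names -- not the real <linux/irq.h>. */
struct irq_cfg {			/* io-apic's per-irq chip state */
	int vector;
};

struct gpio_demux_state {		/* a demux gpio driver's own state */
	int bank, pin;
};

struct irq_data {
	void *chip_data;	/* owned by the irq_chip implementation */
	void *handler_data;	/* free for per-IRQ use by the driver */
};

/* Core code like __setup_vector_irq() assumes chip_data is an irq_cfg. */
static int chip_vector(const struct irq_data *d)
{
	return ((const struct irq_cfg *)d->chip_data)->vector;
}

/* A demux driver keeps its private state in handler_data instead. */
static int demux_pin(const struct irq_data *d)
{
	return ((const struct gpio_demux_state *)d->handler_data)->pin;
}

static struct irq_cfg cfg258 = { .vector = 0x31 };
static struct gpio_demux_state gpio258 = { .bank = 2, .pin = 5 };
static struct irq_data irq258 = { .chip_data = &cfg258,
				  .handler_data = &gpio258 };
```

The point of the reply: if the gpio driver stuffed its own state into chip_data, the chip_vector() cast above would misread it, which is exactly the vector_irq corruption the patch was trying to paper over.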
Re: [RFC/PATCH 2.6.32.y 0/3] Re: [stable 2.6.32..2.6.34] x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()
On Wed, 2012-10-24 at 12:41 -0700, Jonathan Nieder wrote:
> Suresh Siddha wrote:
>> On Wed, 2012-10-24 at 11:25 -0700, Jonathan Nieder wrote:
>
>>> Why not cherry-pick 7716a5c4ff5 in full?
>>
>> As that depends on the other commits like:
>> commit 4b6b19a1c7302477653d799a53d48063dd53d555
>
> More importantly, if I understand correctly it might depend on
>
> commit cf7500c0ea13
> Author: Eric W. Biederman
> Date:   Tue Mar 30 01:07:11 2010 -0700
>
>     x86, ioapic: In mpparse use mp_register_ioapic
>
> Here's a series, completely untested, that is closer to what I
> expected. But the approach you took seems reasonable, too, as long
> as the commit message is tweaked to explain it.
>
> Thanks again,
> Jonathan
>
> Eric W. Biederman (3):
>   x86, ioapic: Teach mp_register_ioapic to compute a global gsi_end
>   x86, ioapic: In mpparse use mp_register_ioapic
>   x86, ioapic: Move nr_ioapic_registers calculation to
>     mp_register_ioapic.
>
>  arch/x86/include/asm/io_apic.h |  1 +
>  arch/x86/kernel/apic/io_apic.c | 28 ++--
>  arch/x86/kernel/mpparse.c      | 25 +
>  arch/x86/kernel/sfi.c          |  4 +---
>  4 files changed, 17 insertions(+), 41 deletions(-)

hmm, NO. I am not sure it is worth spending time validating all these
changes for the stable series, and I can't do it on my own, as I don't
have all the relevant HW.

For example, another commit, a4384df3e24579d6292a1b3b41d500349948f30b
(which you haven't picked up in your series), fixes some of the issues
introduced by the commits you have picked:

commit a4384df3e24579d6292a1b3b41d500349948f30b
Author: Eric W. Biederman
Date:   Tue Jun 8 11:44:32 2010 -0700

    x86, irq: Rename gsi_end gsi_top, and fix off by one errors

So I did think about all these things and really wanted to pursue the
smallest and simplest change. Here is the updated patch with just some
more text added to the changelog. Greg, does this look ok to you? Thanks.

-- 8< --
From: Suresh Siddha
Subject: x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()

Lin Bao reported that one of the HP platforms failed to boot the 2.6.32
kernel when the BIOS enabled interrupt-remapping and x2apic before
handing over control to the Linux kernel.

During boot, the Linux kernel masks all the interrupt sources (8259,
IO-APIC RTE's), sets up the interrupt-remapping hardware with the OS
controlled table, and unmasks the 8259 interrupts but not the IO-APIC
RTE's (as the newly set up interrupt-remapping table and the IO-APIC
RTE's are not yet programmed by the kernel). Shortly after this, the
IO-APIC RTE's and the interrupt-remapping table entries are programmed
based on the ACPI tables etc. So the expectation is that any interrupt
during this window will be dropped and will not see the intermediate
configuration.

In the reported problematic case, the BIOS has configured the IO-APIC in
virtual wire-B mode. In the window between the kernel setting up the new
interrupt-remapping table and the IO-APIC RTE's being properly
configured, an interrupt gets routed by the IO-APIC RTE (set up by the
virtual wire-B configuration) and sees the empty interrupt-remapping
table entry, resulting in a vt-d fault that causes the platform to
generate an NMI. And the OS panics on this unexpected NMI.

This problem doesn't happen with more recent kernels, and a closer look
at the 2.6.32 kernel shows that the code which masks the IO-APIC RTE's
is not working as expected, as the nr_ioapic_registers for each IO-APIC
is not yet initialized at this point. In the later kernels we initialize
nr_ioapic_registers much earlier and everything works as expected.

For 2.6.[32..34] kernels, fix this issue by initializing
nr_ioapic_registers early in mp_register_ioapic().

[ Relevant upstream commit info:

  commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af
  Author: Eric W. Biederman
  Date:   Tue Mar 30 01:07:12 2010 -0700

      x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.

  As the upstream commit depends on quite a few prior commits and some
  followup fixes in the mainline, we just picked the smallest relevant
  hunk for fixing the issue at hand. The problematic platform uses ACPI
  for IO-APIC, VT-d enumeration etc and this hunk only touches the ACPI
  based platforms. The nr_ioapic_registers initialization in
  enable_IO_APIC() is still retained, so that other configurations like
  legacy MPS table based enumeration etc work with no change. ]

Reported-and-tested-by: Zhang, Lin-Bao
Signed-off-by: Suresh Siddha
Cc: sta...@vger.kernel.org [v2.6.32..v2.6.34]
---
 arch/x86/kernel/apic/io_apic.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 8928d97..d256bc3 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -4262,6 +4262,7 @@ static int bad_ioapic(unsigned long address)
 void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 {
 	int idx = 0;
+	int entries;
 
 	if (bad_ioapic(address))
 		return;
@@ -4280,10 +4281,14 @@ void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 	 * Build basic GSI lookup table to facilitate gsi->io_apic lookups
 	 * and to prevent reprogramming of IOAPIC pins (PCI GSIs).
 	 */
+	entries = io_apic_get_redir_entries(idx);
 	mp_gsi_routing[idx].gsi_base = gsi_base;
-	mp_gsi_routing[idx].gsi_end = gsi_base +
-	    io_apic_get_redir_entries(idx);
+	mp_gsi_routing[idx].gsi_end = gsi_base + entries;
 
+	/*
+	 * The number of IO-APIC IRQ registers (== #pins):
+	 */
+	nr_ioapic_registers[idx] = entries + 1;
 	printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%x, "
 	       "GSI %d-%d\n", idx, mp_ioapics[idx].apicid,
 	       mp_ioapics[idx].apicver, mp_ioapics[idx].apicaddr,
 	       mp_gsi_routing[idx].gsi_base, mp_gsi_routing[idx].gsi_end);
Re: [stable 2.6.32..2.6.34] x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()
On Wed, 2012-10-24 at 11:25 -0700, Jonathan Nieder wrote:
> Hi Suresh,
>
> Suresh Siddha wrote:
>
> [...]
>> This problem doesn't happen with more recent kernels and closer
>> look at the 2.6.32 kernel shows that the code which masks
>> the IO-APIC RTE's is not working as expected as the nr_ioapic_registers
>> for each IO-APIC is not yet initialized at this point. In the later
>> kernels we initialize nr_ioapic_registers much before and
>> everything works as expected.
>>
>> For 2.6.[32..34] kernels, fix this issue by initializing
>> nr_ioapic_registers early in mp_register_ioapic()
>>
>> Relevant upstream commit info:
>>
>> commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af
>
> Why not cherry-pick 7716a5c4ff5 in full?

As that depends on the other commits like:

commit 4b6b19a1c7302477653d799a53d48063dd53d555
Author: Eric W. Biederman
Date:   Tue Mar 30 01:07:08 2010 -0700

Wanted to keep the changes as minimal as possible.

thanks,
suresh
[stable 2.6.32..2.6.34] x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()
Lin Bao reported that one of the HP platforms failed to boot 2.6.32 kernel, when the BIOS enabled interrupt-remapping and x2apic before handing over the control to the Linux kernel. During boot, Linux kernel masks all the interrupt sources (8259, IO-APIC RTE's), setup the interrupt-remapping hardware with the OS controlled table and unmasks the 8259 interrupts but not the IO-APIC RTE's (as the newly setup interrupt-remapping table and the IO-APIC RTE's are not yet programmed by the kernel). Shortly after this, IO-APIC RTE's and the interrupt-remapping table entries are programmed based on the ACPI tables etc. So the expectation is that any interrupt during this window will be dropped and not see the intermediate configuration. In the reported problematic case, BIOS has configured the IO-APIC in virtual wire-B mode. Between the window of the kernel setting up new interrupt-remapping table and the IO-APIC RTE's are properly configured, an interrupt gets routed by the IO-APIC RTE (setup by the virtual wire-B configuration) and sees the empty interrupt-remapping table entry, resulting in vt-d fault causing the platform to generate NMI. And the OS panics on this unexpected NMI. This problem doesn't happen with more recent kernels and closer look at the 2.6.32 kernel shows that the code which masks the IO-APIC RTE's is not working as expected as the nr_ioapic_registers for each IO-APIC is not yet initialized at this point. In the later kernels we initialize nr_ioapic_registers much before and everything works as expected. For 2.6.[32..34] kernels, fix this issue by initializing nr_ioapic_registers early in mp_register_ioapic() Relevant upstream commit info: commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af Author: Eric W. Biederman Date: Tue Mar 30 01:07:12 2010 -0700 x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic. 
Reported-and-tested-by: Zhang, Lin-Bao
Signed-off-by: Suresh Siddha
Cc: sta...@vger.kernel.org [v2.6.32..v2.6.34]
---
 arch/x86/kernel/apic/io_apic.c |    9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 8928d97..d256bc3 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -4262,6 +4262,7 @@ static int bad_ioapic(unsigned long address)
 void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 {
 	int idx = 0;
+	int entries;
 
 	if (bad_ioapic(address))
 		return;
@@ -4280,10 +4281,14 @@ void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 	 * Build basic GSI lookup table to facilitate gsi->io_apic lookups
 	 * and to prevent reprogramming of IOAPIC pins (PCI GSIs).
 	 */
+	entries = io_apic_get_redir_entries(idx);
 	mp_gsi_routing[idx].gsi_base = gsi_base;
-	mp_gsi_routing[idx].gsi_end = gsi_base +
-	    io_apic_get_redir_entries(idx);
+	mp_gsi_routing[idx].gsi_end = gsi_base + entries;
+	/*
+	 * The number of IO-APIC IRQ registers (== #pins):
+	 */
+	nr_ioapic_registers[idx] = entries + 1;
 
 	printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%x, "
 	       "GSI %d-%d\n", idx, mp_ioapics[idx].apicid,
 	       mp_ioapics[idx].apicver, mp_ioapics[idx].apicaddr,
Re: [RFC/PATCH 2.6.32.y 0/3] Re: [stable 2.6.32..2.6.34] x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()
On Wed, 2012-10-24 at 12:41 -0700, Jonathan Nieder wrote:
> Suresh Siddha wrote:
> > On Wed, 2012-10-24 at 11:25 -0700, Jonathan Nieder wrote:
> >
> > > Why not cherry-pick 7716a5c4ff5 in full?
> >
> > As that depends on the other commits like:
> >
> > commit 4b6b19a1c7302477653d799a53d48063dd53d555
>
> More importantly, if I understand correctly it might depend on
>
> commit cf7500c0ea13
> Author: Eric W. Biederman ebied...@xmission.com
> Date:   Tue Mar 30 01:07:11 2010 -0700
>
>     x86, ioapic: In mpparse use mp_register_ioapic
>
> Here's a series, completely untested, that is closer to what I
> expected. But the approach you took seems reasonable, too, as long
> as the commit message is tweaked to explain it.
>
> Thanks again,
> Jonathan
>
> Eric W. Biederman (3):
>   x86, ioapic: Teach mp_register_ioapic to compute a global gsi_end
>   x86, ioapic: In mpparse use mp_register_ioapic
>   x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.
>
>  arch/x86/include/asm/io_apic.h |  1 +
>  arch/x86/kernel/apic/io_apic.c | 28 ++--
>  arch/x86/kernel/mpparse.c      | 25 +
>  arch/x86/kernel/sfi.c          |  4 +---
>  4 files changed, 17 insertions(+), 41 deletions(-)

hmm, NO. I am not sure if it is worth spending time validating all these
changes for the stable series and I can't do it on my own, as I don't
have all the relevant HW.

For example, another commit a4384df3e24579d6292a1b3b41d500349948f30b
(which you haven't picked up in your series) fixes some of these issues
introduced by the commits you have picked.

commit a4384df3e24579d6292a1b3b41d500349948f30b
Author: Eric W. Biederman ebied...@xmission.com
Date:   Tue Jun 8 11:44:32 2010 -0700

    x86, irq: Rename gsi_end gsi_top, and fix off by one errors

So I did think about all these things and wanted to really pursue the
smallest and simplest change. Here is the updated patch with just some
more text added to the changelog.

Greg, does this look ok to you? Thanks.
-- 8< --
From: Suresh Siddha suresh.b.sid...@intel.com
Subject: x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()

Lin Bao reported that one of the HP platforms failed to boot 2.6.32
kernel, when the BIOS enabled interrupt-remapping and x2apic before
handing over the control to the Linux kernel.

During boot, Linux kernel masks all the interrupt sources (8259, IO-APIC
RTE's), setup the interrupt-remapping hardware with the OS controlled
table and unmasks the 8259 interrupts but not the IO-APIC RTE's (as the
newly setup interrupt-remapping table and the IO-APIC RTE's are not yet
programmed by the kernel). Shortly after this, IO-APIC RTE's and the
interrupt-remapping table entries are programmed based on the ACPI
tables etc. So the expectation is that any interrupt during this window
will be dropped and not see the intermediate configuration.

In the reported problematic case, BIOS has configured the IO-APIC in
virtual wire-B mode. Between the window of the kernel setting up new
interrupt-remapping table and the IO-APIC RTE's are properly configured,
an interrupt gets routed by the IO-APIC RTE (setup by the virtual wire-B
configuration) and sees the empty interrupt-remapping table entry,
resulting in vt-d fault causing the platform to generate NMI. And the
OS panics on this unexpected NMI.

This problem doesn't happen with more recent kernels and closer look at
the 2.6.32 kernel shows that the code which masks the IO-APIC RTE's is
not working as expected as the nr_ioapic_registers for each IO-APIC is
not yet initialized at this point. In the later kernels we initialize
nr_ioapic_registers much before and everything works as expected.

For 2.6.[32..34] kernels, fix this issue by initializing
nr_ioapic_registers early in mp_register_ioapic()

[ Relevant upstream commit info:

  commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af
  Author: Eric W. Biederman ebied...@xmission.com
  Date:   Tue Mar 30 01:07:12 2010 -0700

      x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.

  As the upstream commit depends on quite a few prior commits and some
  followup fixes in the mainline, we just picked the smallest relevant
  hunk for fixing the issue at hand. Problematic platform uses ACPI for
  IO-APIC, VT-d enumeration etc and this hunk only touches the ACPI
  based platforms. nr_ioapic_registers initialization in
  enable_IO_APIC() is still retained, so that other configurations like
  legacy MPS table based enumeration etc works with no change. ]

Reported-and-tested-by: Zhang, Lin-Bao linbao.zh...@hp.com
Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com
Cc: sta...@vger.kernel.org [v2.6.32..v2.6.34]
---
 arch/x86/kernel/apic/io_apic.c |    9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 8928d97..d256bc3 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -4262,6 +4262,7
[patch] x86, apic: use tsc deadline for oneshot when available
Thomas,

You wanted to run some tests with this, right? Please give it a try and
see if this is ok to be pushed to the -tip.

thanks,
suresh
--8<--
From: Suresh Siddha
Subject: x86, apic: use tsc deadline for oneshot when available

If the TSC deadline mode is supported, LAPIC timer one-shot mode can be
implemented using IA32_TSC_DEADLINE MSR. An interrupt will be generated
when the TSC value equals or exceeds the value in the IA32_TSC_DEADLINE
MSR.

This enables us to skip the APIC calibration during boot. Also, in xapic
mode, this enables us to skip the uncached apic access to re-arm the
APIC timer.

As this timer ticks at the high frequency TSC rate, we use the
TSC_DIVISOR (32) to work with the 32-bit restrictions in the clockevent
API's to avoid 64-bit divides etc (frequency is u32 and "unsigned long"
in the set_next_event(), max_delta limits the next event to 32-bit for
32-bit kernel).

Signed-off-by: Suresh Siddha
---
 Documentation/kernel-parameters.txt |    4 ++
 arch/x86/include/asm/msr-index.h    |    2 +
 arch/x86/kernel/apic/apic.c         |   66 ++-
 3 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..4aa9ca0 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1304,6 +1304,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	lapic		[X86-32,APIC] Enable the local APIC even if BIOS
 			disabled it.
 
+	lapic=		[x86,APIC] "notscdeadline" Do not use TSC deadline
+			value for LAPIC timer one-shot implementation. Default
+			back to the programmable timer unit in the LAPIC.
+
 	lapic_timer_c2_ok	[X86,APIC] trust the local apic timer
 			in C2 power state.
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7f0edce..e400cdb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -337,6 +337,8 @@
 #define MSR_IA32_MISC_ENABLE_TURBO_DISABLE	(1ULL << 38)
 #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE	(1ULL << 39)
 
+#define MSR_IA32_TSC_DEADLINE		0x06E0
+
 /* P4/Xeon+ specific */
 #define MSR_IA32_MCG_EAX		0x0180
 #define MSR_IA32_MCG_EBX		0x0181
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b17416e..b0c49b1 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -90,21 +90,6 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
  */
 DEFINE_EARLY_PER_CPU_READ_MOSTLY(int, x86_cpu_to_logical_apicid, BAD_APICID);
 
-/*
- * Knob to control our willingness to enable the local APIC.
- *
- * +1=force-enable
- */
-static int force_enable_local_apic __initdata;
-/*
- * APIC command line parameters
- */
-static int __init parse_lapic(char *arg)
-{
-	force_enable_local_apic = 1;
-	return 0;
-}
-early_param("lapic", parse_lapic);
 
 /* Local APIC was disabled by the BIOS and enabled by the kernel */
 static int enabled_via_apicbase;
@@ -133,6 +118,25 @@ static inline void imcr_apic_to_pic(void)
 }
 #endif
 
+/*
+ * Knob to control our willingness to enable the local APIC.
+ *
+ * +1=force-enable
+ */
+static int force_enable_local_apic __initdata;
+/*
+ * APIC command line parameters
+ */
+static int __init parse_lapic(char *arg)
+{
+	if (config_enabled(CONFIG_X86_32) && !arg)
+		force_enable_local_apic = 1;
+	else if (!strncmp(arg, "notscdeadline", 13))
+		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+	return 0;
+}
+early_param("lapic", parse_lapic);
+
 #ifdef CONFIG_X86_64
 static int apic_calibrate_pmtmr __initdata;
 static __init int setup_apicpmtimer(char *s)
@@ -315,6 +319,7 @@ int lapic_get_maxlvt(void)
 
 /* Clock divisor */
 #define APIC_DIVISOR		16
+#define TSC_DIVISOR		32
 
 /*
  * This function sets up the local APIC timer, with a timeout of
@@ -333,6 +338,9 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	lvtt_value = LOCAL_TIMER_VECTOR;
 	if (!oneshot)
 		lvtt_value |= APIC_LVT_TIMER_PERIODIC;
+	else if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+		lvtt_value |= APIC_LVT_TIMER_TSCDEADLINE;
+
 	if (!lapic_is_integrated())
 		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
 
@@ -341,6 +349,11 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 
 	apic_write(APIC_LVTT, lvtt_value);
 
+	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
+		printk_once(KERN_DEBUG "TSC deadline timer enabled\n");
+		return;
+	}
+
 	/*
 	 * Divide PICLK by 16
 	 */
@@ -453,6 +466,15 @@ static int lapic_next_event(unsigned long delta,
 	return 0
Re: x2apic boot failure on recent sandy bridge system
On Fri, 2012-10-19 at 16:36 -0700, H. Peter Anvin wrote:
> On 10/19/2012 04:32 PM, Yinghai Lu wrote:
> > On Fri, Oct 19, 2012 at 4:03 PM, Suresh Siddha wrote:
> >> On Fri, 2012-10-19 at 13:42 -0700, rrl...@gmail.com wrote:
> >>> Any update? The messages just seem to have stopped months ago. A
> >>> fallback would be nice, I have been booting the kernel with noa2xpic
> >>> for since kernel 3.2, and currently I am working with 3.6.2.
> >>>
> >>> If needed I can try to attempt modifying the patch to include
> >>> fallback, but I am probably not the best person to do it.
> >>>
> >>
> >> Are you referring to this commit that made into the mainline tree
> >> already?
> >>
> >> commit fb209bd891645bb87b9618b724f0b4928e0df3de
> >> Author: Yinghai Lu
> >> Date:   Wed Dec 21 17:45:17 2011 -0800
> >>
> >>     x86, x2apic: Fallback to xapic when BIOS doesn't setup
> >>     interrupt-remapping
> >
> > I think his system has DMAR table and cpu support x2apic, So kernel
> > will switch to x2apic,
> >
> > but somehow BIOS SMI handler has problem with x2apic. should be
> > thinkpad W520?
>
> Right, StinkPad W520 needs a quirk.

yes. Yinghai, if you remember you had a T420 that didn't show the
problem. And someone in the bugzilla with T420 had the problem. And
their dmidecode is
https://launchpadlibrarian.net/109393850/dmidecode.txt

What is the difference with your system? Bios I think is the same. Can
you see what we should check for in the dmi tables to black list these
systems? Can you post your T420's dmidecode to see the difference.

Bugs I have on this are:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/776999
https://bugzilla.kernel.org/show_bug.cgi?id=43054

thanks,
suresh
Re: x2apic boot failure on recent sandy bridge system
On Fri, 2012-10-19 at 13:42 -0700, rrl...@gmail.com wrote:
> Any update? The messages just seem to have stopped months ago. A
> fallback would be nice, I have been booting the kernel with noa2xpic
> for since kernel 3.2, and currently I am working with 3.6.2.
>
> If needed I can try to attempt modifying the patch to include
> fallback, but I am probably not the best person to do it.
>

Are you referring to this commit that made into the mainline tree
already?

commit fb209bd891645bb87b9618b724f0b4928e0df3de
Author: Yinghai Lu
Date:   Wed Dec 21 17:45:17 2011 -0800

    x86, x2apic: Fallback to xapic when BIOS doesn't setup
    interrupt-remapping

    On some of the recent Intel SNB platforms, by default bios is
    pre-enabling x2apic mode in the cpu with out setting up
    interrupt-remapping. This case was resulting in the kernel to panic
    as the cpu is already in x2apic mode but the OS was not able to
    enable interrupt-remapping (which is a pre-req for using x2apic
    capability). On these platforms all the apic-ids are < 255 and the
    kernel can fallback to xapic mode if the bios has not enabled
    interrupt-remapping (which is mostly the case if the bios has not
    exported interrupt-remapping tables to the OS).

    Reported-by: Berck E. Nash
    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/20111222014632.600418...@sbsiddha-desk.sc.intel.com
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin
RE: [PATCH] fix x2apic defect that Linux kernel doesn't mask 8259A interrupt during the time window between changing VT-d table base address and initializing these VT-d entries(smpboot.c and apic.c )
On Wed, 2012-10-10 at 00:26 +, Zhang, Lin-Bao (Linux Kernel R&D) wrote:
> So , we can think ,as your patch , during the window , IO-apic is
> useless or we can think IO-APIC doesn't exist ?
> Could you mind please sharing your design details ? thanks very much!

Between the window of interrupt-remapping enabled and the masked
IO-APIC RTE's are configured properly, linux kernel doesn't wait/depend
on any external interrupts.

> > Can you please apply the appended patch to 2.6.32 kernel and see if
> > the issue you mentioned gets fixed? If so, we can ask the -stable
> > and OSV's teams to pick up this fix.
>
> Yes , it can resolve current issue.

Thanks for testing it out. I will add the appropriate changelog and
send the patch out (to 2.6.32 stable and OSV kernels) with your
"Tested-by:" if you are ok.

thanks,
suresh
RE: [PATCH] fix x2apic defect that Linux kernel doesn't mask 8259A interrupt during the time window between changing VT-d table base address and initializing these VT-d entries(smpboot.c and apic.c )
On Wed, 2012-10-10 at 16:02 -0700, Zhang, Lin-Bao (Linux Kernel R&D) wrote:
> > As I mentioned earlier, the current design already ensures that all
> > the IO-APIC RTE's are masked between the time we enable
> > interrupt-remapping to the time when the IO-APIC RTE's are
> > configured correctly.
> >
> > So I looked at why you are seeing the problem with v2.6.32 but not
> > with the recent kernels. And I think I found out the reason.
>
> I want to know what masking IO-APIC means?

As the platform is configured to use virtual-wire B and the
corresponding IO-APIC RTE is masked, that interrupt will be dropped.

thanks,
suresh
RE: [PATCH] fix x2apic defect that Linux kernel doesn't mask 8259A interrupt during the time window between changing VT-d table base address and initializing these VT-d entries(smpboot.c and apic.c )
On Sun, 2012-10-07 at 21:53 -0700, Zhang, Lin-Bao (Linux Kernel RD) wrote:
> Hi Suresh,
> Could you please update the current status of these 2 files and the patch?
> I am not sure if I have answered your questions; if not, feel free to let me know.
> This is my first time submitting a patch to LKML, so what should I do as the next step?

As I mentioned earlier, the current design already ensures that all the IO-APIC RTE's are masked between the time we enable interrupt-remapping and the time when the IO-APIC RTE's are configured correctly. So I looked at why you are seeing the problem with v2.6.32 but not with the recent kernels. And I think I found out the reason.

The 2.6.32 kernel is missing this fix, http://marc.info/?l=linux-acpi&m=126993666715081&w=2

commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af
Author: Eric W. Biederman
Date: Tue Mar 30 01:07:12 2010 -0700

    x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.

    Now that all ioapic registration happens in mp_register_ioapic we can
    move the calculation of nr_ioapic_registers there from enable_IO_APIC.
    The number of ioapic registers is already calculated in
    mp_register_ioapic, so all that really needs to be done is to save the
    calculated value in nr_ioapic_registers.

    Signed-off-by: Eric W. Biederman
    LKML-Reference: <1269936436-7039-11-git-send-email-ebied...@xmission.com>
    Signed-off-by: H. Peter Anvin

Because of this, in v2.6.32, mask_IO_APIC_setup() is not working as expected, as nr_ioapic_registers[] is not yet initialized and thus the IO-APIC RTE's are not masked as expected. We just need the last hunk of that patch, I think.

Can you please apply the appended patch to the 2.6.32 kernel and see if the issue you mentioned gets fixed? If so, we can ask the -stable and OSV teams to pick up this fix.
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index f807255..dae9240 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -4293,6 +4281,7 @@ static int bad_ioapic(unsigned long address)
 void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 {
 	int idx = 0;
+	int entries;
 
 	if (bad_ioapic(address))
 		return;
@@ -4311,9 +4300,14 @@ void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 	 * Build basic GSI lookup table to facilitate gsi->io_apic lookups
 	 * and to prevent reprogramming of IOAPIC pins (PCI GSIs).
 	 */
+	entries = io_apic_get_redir_entries(idx);
 	mp_gsi_routing[idx].gsi_base = gsi_base;
-	mp_gsi_routing[idx].gsi_end = gsi_base +
-				      io_apic_get_redir_entries(idx) - 1;
+	mp_gsi_routing[idx].gsi_end = gsi_base + entries - 1;
+
+	/*
+	 * The number of IO-APIC IRQ registers (== #pins):
+	 */
+	nr_ioapic_registers[idx] = entries;
 
 	if (mp_gsi_routing[idx].gsi_end > gsi_end)
 		gsi_end = mp_gsi_routing[idx].gsi_end;
Re: [PATCH RESEND] x86/fixup_irq: Clean the offlining CPU from the irq affinity mask
On Fri, 2012-09-28 at 00:12 +0530, Srivatsa S. Bhat wrote:
> On 09/27/2012 04:16 AM, Suresh Siddha wrote:
> >
> > No. irq_set_affinity()
>
> Um? That takes the updated/changed affinity and sets data->affinity to that value no? You mentioned that probably the intention of the original code was to preserve the user-set affinity mask, but still change the underlying interrupt routing. Sorry, but I still didn't quite understand what is that part of the code that achieves that.

For the HW routing to be changed, we AND it with cpu_online_map and use that for programming the interrupt entries etc. The user-specified affinity still has the cpu that is offlined. And when the cpu comes online, and if it is part of the user-specified affinity, then the HW routing can again be modified to include the new cpu.

hope this clears it!

thanks.
Re: [PATCH RESEND] x86/fixup_irq: Clean the offlining CPU from the irq affinity mask
On Fri, 2012-09-28 at 00:12 +0530, Srivatsa S. Bhat wrote: On 09/27/2012 04:16 AM, Suresh Siddha wrote: No. irq_set_affinity() Um? That takes the updated/changed affinity and sets data-affinity to that value no? You mentioned that probably the intention of the original code was to preserve the user-set affinity mask, but still change the underlying interrupt routing. Sorry, but I still didn't quite understand what is that part of the code that achieves that. For the HW routing to be changed we AND it with cpu_online_map and use that for programming the interrupt entries etc. The user-specified affinity still has the cpu that is offlined. And when the cpu comes online and if it is part of the user-specified affinity, then the HW routing can be again modified to include the new cpu. hope this clears it! thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] x86/fixup_irq: Clean the offlining CPU from the irq affinity mask
On Wed, 2012-09-26 at 23:00 +0530, Srivatsa S. Bhat wrote:
> On 09/26/2012 10:36 PM, Suresh Siddha wrote:
> > On Wed, 2012-09-26 at 21:33 +0530, Srivatsa S. Bhat wrote:
> >> I have some fundamental questions here:
> >> 1. Why was the CPU never removed from the affinity masks in the original code? I find it hard to believe that it was just an oversight, because the whole point of fixup_irqs() is to affine the interrupts to other CPUs, IIUC. So, is that really a bug or is the existing code correct for some reason which I don't know of?
> >
> > I am not aware of the history but my guess is that the affinity mask which is coming from the user-space wants to be preserved. And fixup_irqs() is fixing the underlying interrupt routing when the cpu goes down
>
> and the code that corresponds to that is: irq_force_complete_move(irq); is it?

No. irq_set_affinity()

> > with a hope that things will be corrected when the cpu comes back online. But as Liu noted, we are not correcting the underlying routing when the cpu comes back online. I think we should fix that rather than modifying the user-specified affinity.
>
> Hmm, I didn't entirely get your suggestion. Are you saying that we should change data->affinity (by calling ->irq_set_affinity()) during offline but maintain a copy of the original affinity mask somewhere, so that we can try to match it when possible (i.e., when the CPU comes back online)?

Don't change data->affinity in fixup_irqs(), and shortly after a cpu is online, call the irq_chip's irq_set_affinity() for those irq's whose affinity included this cpu (now that the cpu is back online, irq_set_affinity() will set up the HW routing tables correctly). This presumes that across suspend/resume and cpu offline/online operations, we don't want to break the irq affinity set up by a user-level entity like irqbalance etc...

> > That happens only if the irq chip doesn't have the irq_set_affinity() setup.
>
> That is my other point of concern: setting irq affinity can fail even if we have ->irq_set_affinity(). (If __ioapic_set_affinity() fails, for example.) Why don't we complain in that case? I think we should... and if it's serious enough, abort the hotplug operation or at least indicate that offline failed..

Yes, if there is a failure then we are in trouble, as the cpu has already disappeared from the online-masks etc. For platforms with interrupt-remapping, interrupts can be migrated from process context, and as such this can all be done much earlier. And for legacy platforms we have made quite a few changes in the recent past, like using eoi_ioapic_irq() for level-triggered interrupts etc., that make it as safe as it can be. Perhaps we can move most of the fixup_irqs() code much earlier, and the last section of the current fixup_irqs() (which checks the IRR bits and uses the retrigger function to trigger the interrupt on another cpu) can still be done late, just like now.

thanks,
suresh
Re: [PATCH RESEND] x86/fixup_irq: Clean the offlining CPU from the irq affinity mask
On Wed, 2012-09-26 at 21:33 +0530, Srivatsa S. Bhat wrote:
> I have some fundamental questions here:
> 1. Why was the CPU never removed from the affinity masks in the original code? I find it hard to believe that it was just an oversight, because the whole point of fixup_irqs() is to affine the interrupts to other CPUs, IIUC. So, is that really a bug or is the existing code correct for some reason which I don't know of?

I am not aware of the history but my guess is that the affinity mask which is coming from user-space wants to be preserved. And fixup_irqs() is fixing the underlying interrupt routing when the cpu goes down with a hope that things will be corrected when the cpu comes back online. But as Liu noted, we are not correcting the underlying routing when the cpu comes back online. I think we should fix that rather than modifying the user-specified affinity.

> 2. In case this is indeed a bug, why are the warnings ratelimited when the interrupts can't be affined to other CPUs? Are they not serious enough to report? Put more strongly, why do we even silently return with a warning instead of reporting that the CPU offline operation failed?? Is that because we have come way too far in the hotplug sequence and we can't easily roll back? Or are we still actually OK in that situation?

Are you referring to the "cannot set affinity for irq" messages? That happens only if the irq chip doesn't have the irq_set_affinity() setup. But that is not common.

> Suresh, I'd be grateful if you could kindly throw some light on these issues... I'm actually debugging an issue where an offline CPU gets apic timer interrupts (and in one case, I even saw a device interrupt), which I have reported in another thread at: https://lkml.org/lkml/2012/9/26/119
> But this issue in fixup_irqs() that Liu brought to light looks even more surprising to me..

These issues look different to me, will look into that.

thanks,
suresh
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Mon, 2012-09-24 at 12:12 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith wrote:
> >
> > Aside from the cache pollution I recall having been mentioned, on my E5620, cross core is a tbench win over affine, cross thread is not.
>
> Oh, I agree with trying to avoid HT threads, the resource contention easily gets too bad.
>
> It's more a question of "if we have real cores with separate L1's but shared L2's, go with those first, before we start distributing it out to separate L2's".

There is one issue though. If the tasks continue to run in this state and the periodic balance notices an idle L2, it will force-migrate (using active migration) one of the tasks to the idle L2, as the periodic balance tries to spread the load as far as possible to take maximum advantage of the available resources (and the perf advantage of this really depends on the workload, cache usage/memory bw, the upside of turbo etc). But I am not sure if this was the reason why we chose to spread it out to separate L2's during wakeup.

Anyways, this is one of the places where Paul Turner's task load average tracking patches will be useful. Depending on how long a task typically runs, we can probably even choose an SMT sibling or a separate L2 to run on.

thanks,
suresh
[tip:x86/fpu] x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
Commit-ID: b1a74bf8212367be2b1d6685c11a84e056eaaaf1
Gitweb: http://git.kernel.org/tip/b1a74bf8212367be2b1d6685c11a84e056eaaaf1
Author: Suresh Siddha
AuthorDate: Thu, 20 Sep 2012 11:01:49 -0700
Committer: H. Peter Anvin
CommitDate: Fri, 21 Sep 2012 16:59:04 -0700

x86, kvm: fix kvm's usage of kernel_fpu_begin/end()

Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu(), which can be very far apart.

kvm_load/put_guest_fpu() routines are already called with preemption disabled, and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). So introduce __kernel_fpu_begin/end() routines which don't touch preemption, and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state.

Also with this change (and with the eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For the eagerFPU case, host cr0.TS is always clear, so no need to worry about it. For the traditional lazyFPU restore case, change the cr0.TS bit for the host state during vm-exit to be always clear, and the cr0.TS bit is set in __vmx_load_host_state() when the FPU (guest FPU or the host task's FPU) state is not active. This ensures that the host/guest FPU state is properly saved and restored during context-switch and with interrupts (using irq_fpu_usable()) not stomping on the active FPU state.

Signed-off-by: Suresh Siddha
Link: http://lkml.kernel.org/r/1348164109.26695.338.ca...@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity
Signed-off-by: H. Peter Anvin

---
 arch/x86/include/asm/i387.h |   28 ++--
 arch/x86/kernel/i387.c      |   13 +
 arch/x86/kvm/vmx.c          |   10 +++---
 arch/x86/kvm/x86.c          |    4 ++--
 4 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 6c3bd37..ed8089d 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -24,8 +24,32 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);
 extern void math_state_restore(void);
 extern bool irq_fpu_usable(void);
-extern void kernel_fpu_begin(void);
-extern void kernel_fpu_end(void);
+
+/*
+ * Careful: __kernel_fpu_begin/end() must be called with preempt disabled
+ * and they don't touch the preempt state on their own.
+ * If you enable preemption after __kernel_fpu_begin(), preempt notifier
+ * should call the __kernel_fpu_end() to prevent the kernel/user FPU
+ * state from getting corrupted. KVM for example uses this model.
+ *
+ * All other cases use kernel_fpu_begin/end() which disable preemption
+ * during kernel FPU usage.
+ */
+extern void __kernel_fpu_begin(void);
+extern void __kernel_fpu_end(void);
+
+static inline void kernel_fpu_begin(void)
+{
+	WARN_ON_ONCE(!irq_fpu_usable());
+	preempt_disable();
+	__kernel_fpu_begin();
+}
+
+static inline void kernel_fpu_end(void)
+{
+	__kernel_fpu_end();
+	preempt_enable();
+}
 
 /*
  * Some instructions like VIA's padlock instructions generate a spurious
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 6782e39..675a050 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -73,32 +73,29 @@ bool irq_fpu_usable(void)
 }
 EXPORT_SYMBOL(irq_fpu_usable);
 
-void kernel_fpu_begin(void)
+void __kernel_fpu_begin(void)
 {
 	struct task_struct *me = current;
 
-	WARN_ON_ONCE(!irq_fpu_usable());
-	preempt_disable();
 	if (__thread_has_fpu(me)) {
 		__save_init_fpu(me);
 		__thread_clear_has_fpu(me);
-		/* We do 'stts()' in kernel_fpu_end() */
+		/* We do 'stts()' in __kernel_fpu_end() */
 	} else if (!use_eager_fpu()) {
 		this_cpu_write(fpu_owner_task, NULL);
 		clts();
 	}
 }
-EXPORT_SYMBOL(kernel_fpu_begin);
+EXPORT_SYMBOL(__kernel_fpu_begin);
 
-void kernel_fpu_end(void)
+void __kernel_fpu_end(void)
 {
 	if (use_eager_fpu())
 		math_state_restore();
 	else
 		stts();
-	preempt_enable();
 }
-EXPORT_SYMBOL(kernel_fpu_end);
+EXPORT_SYMBOL(__kernel_fpu_end);
 
 void unlazy_fpu(struct task_struct *tsk)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c00f03d..70dfcec 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1493,8 +1493,12 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 #ifdef CONFIG_X86_64
 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
 #endif
-	if (user_has_fpu())
-		clts();
+	/*
+	 * If the FPU is not active (through the host task or
+	 * the guest vcpu), then restore the cr0.TS bit.
+	 */
+	if (!user_has_fpu() && !vmx->vcpu.guest_fpu_loaded)
+		stts();
 	load_gdt(&__get_cpu_var(host_gdt));
 }
 
@@ -3730,7 +3734,7 @@ static void vmx_set_consta
Re: [PATCH] fix x2apic defect that Linux kernel doesn't mask 8259A interrupt during the time window between changing VT-d table base address and initializing these VT-d entries
On Wed, 2012-09-12 at 07:02 +, Zhang, Lin-Bao (ESSN-MCXS-Linux Kernel R&D) wrote: > Hi all, > This defect can be observed when the x2apic setting in BIOS is set to > "auto" and the BIOS has virtual wire mode enabled on a power up. This > defect was found on a 2.6.32 based kernel. I assume you are able to reproduce the issue with the latest kernel as well? Which virtual wire mode is it? Virtual wire mode-A (where the PIC output is connected to LINT0 of the Local APIC) doesn't go through interrupt-remapping, and virtual wire mode-B (where the PIC output is routed through the IO-APIC RTE) will be completely disabled, as all the BIOS setup IO-APIC RTE's are masked by the Linux kernel from the time we enable interrupt-remapping to the time the IO-APIC RTE's are properly re-configured by the Linux kernel again. So I am at a loss to understand what is causing this. > > The kernel code (smpboot.c, apic.c) does not mask 8259A interrupts > before changing and initializing the new VT-d table when x2apic > virtual wire mode is enabled on power up. The Linux Kernel expects > virtual wire mode to be disabled when booting and enables it when > interrupts are masked. > > The BIOS code builds a simple VT-d table on power up. While the Linux > Kernel boots, it first builds an empty VT-d table and uses it. After > some time, the Linux Kernel then initializes the IO-APIC redirect > table, and then initializes the VT-d entries. In the window between > initializing the redirect table and the VT-d entries, the 8259A > interrupts are not masked. If an interrupt occurs in this window, the > Linux Kernel will not find a valid entry for this interrupt. The > kernel treats it as a fatal error and panics. If the error never > gets cleared, the Linux kernel continuously prints this error: > "NMI: IOCK error (debug interrupt?) for reason" Not sure why we get an NMI instead of a vt-d fault? Perhaps the vt-d fault is also getting reported via NMI on this platform? Does your tested kernel have this fix? 
commit 254e42006c893f45bca48f313536fcba12206418 Author: Suresh Siddha Date: Mon Dec 6 12:26:30 2010 -0800 x86, vt-d: Quirk for masking vtd spec errors to platform error handling logic Will you be able to provide the failing kernel log so that I can better understand the issue? thanks, suresh > The fix to this defect, the code change is to mask 8259A interrupts > before changing the VT-d table and initializing the VT-d entries. Then unmask > interrupts after completing the redirect table entries. > > > Signed-off-by: Zhang, Lin-Bao > Tested-by: Nigel Croxon > > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index > 24deb30..299172c 100644 > --- a/arch/x86/kernel/apic/apic.c > +++ b/arch/x86/kernel/apic/apic.c > @@ -1556,7 +1556,6 @@ void __init enable_IR_x2apic(void) > } > > local_irq_save(flags); > - legacy_pic->mask_all(); > mask_ioapic_entries(); > > if (x2apic_preenabled && nox2apic) @@ -1603,7 +1602,6 @@ void __init > enable_IR_x2apic(void) > skip_x2apic: > if (ret < 0) /* IR enabling failed */ > restore_ioapic_entries(); > - legacy_pic->restore_mask(); > local_irq_restore(flags); > } > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index > 7c5a8c3..95fee01 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -1000,7 +1000,7 @@ void __init native_smp_prepare_cpus(unsigned int > max_cpus) > zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), > GFP_KERNEL); > } > set_cpu_sibling_map(0); > - > + mask_8259A(); > > if (smp_sanity_check(max_cpus) < 0) { > pr_info("SMP disabled\n"); @@ -1037,6 +1037,8 @@ void __init > native_smp_prepare_cpus(unsigned int max_cpus) > apic->setup_portio_remap(); > > smpboot_setup_io_apic(); > + unmask_8259A(); > + > /* > * Set up local APIC timer on boot CPU. > */ > > > > -- Bob(Zhang LinBao) > Confucius said: "Worry not that others do not know you; worry that you do not know others." > "If not us, who ? if not now, when ?" 
> ESSN-MCBS linux kernel engineer -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Thu, 2012-09-20 at 12:50 +0300, Avi Kivity wrote: > On 09/20/2012 03:10 AM, Suresh Siddha wrote: > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > > index b06737d..8ff328b 100644 > > --- a/arch/x86/kvm/vmx.c > > +++ b/arch/x86/kvm/vmx.c > > @@ -1493,7 +1493,8 @@ static void __vmx_load_host_state(struct vcpu_vmx > > *vmx) > > #ifdef CONFIG_X86_64 > > wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base); > > #endif > > - if (user_has_fpu()) > > + /* Did the host task or the guest vcpu has FPU restored lazily? */ > > + if (!use_eager_fpu() && (user_has_fpu() || vmx->vcpu.guest_fpu_loaded)) > > clts(); > > Why do the clts() if guest_fpu_loaded()? > > An interrupt might arrive after this, look at TS > (interrupted_kernel_fpu_idle()), and stomp on the guest's fpu. Actually clts() is harmless, as the (read_cr0() & X86_CR0_TS) condition in interrupted_kernel_fpu_idle() will return false. But you raise a good point: any interrupt between the vmexit and the __vmx_load_host_state() can stomp on the guest FPU, as the vmexit was unconditionally setting the host's cr0.TS bit and, with kvm using kernel_fpu_begin/end(), !__thread_has_fpu(current) in interrupted_kernel_fpu_idle() will always be true. So the right thing to do here is to always have the cr0.TS bit clear during vmexit and set that bit back in __vmx_load_host_state() if the FPU state is not active. Appended the modified patch. thanks, suresh --8<-- From: Suresh Siddha Subject: x86, kvm: fix kvm's usage of kernel_fpu_begin/end() Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu() which can be very far apart. kvm_load/put_guest_fpu() routines are already called with preemption disabled and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). 
So introduce __kernel_fpu_begin/end() routines which don't touch preemption and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state. Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For eagerFPU case, host cr0.TS is always clear. So no need to worry about it. For the traditional lazyFPU restore case, change the cr0.TS bit for the host state during vm-exit to be always clear and cr0.TS bit is set in the __vmx_load_host_state() when the FPU (guest FPU or the host task's FPU) state is not active. This ensures that the host/guest FPU state is properly saved, restored during context-switch and with interrupts (using irq_fpu_usable()) not stomping on the active FPU state. Signed-off-by: Suresh Siddha --- arch/x86/include/asm/i387.h | 28 ++-- arch/x86/kernel/i387.c | 13 + arch/x86/kvm/vmx.c | 10 +++--- arch/x86/kvm/x86.c |4 ++-- 4 files changed, 40 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..ed8089d 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,32 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); + +/* + * Careful: __kernel_fpu_begin/end() must be called with preempt disabled + * and they don't touch the preempt state on their own. + * If you enable preemption after __kernel_fpu_begin(), preempt notifier + * should call the __kernel_fpu_end() to prevent the kernel/user FPU + * state from getting corrupted. KVM for example uses this model. + * + * All other cases use kernel_fpu_begin/end() which disable preemption + * during kernel FPU usage. 
+ */ +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end()
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 10:18 -0700, Suresh Siddha wrote: > These routines (kvm_load/put_guest_fpu()) are already called with > preemption disabled but as you mentioned, we don't want the preemption > to be disabled completely between the kvm_load_guest_fpu() and > kvm_put_guest_fpu(). > > Also KVM already has the preempt notifier which is doing the > kvm_put_guest_fpu(), so something like the appended should address this. > I will test this shortly. Appended the tested fix (one more VMX-based change was needed, as it fiddles with the host cr0.TS bit). Thanks. --8<-- From: Suresh Siddha Subject: x86, kvm: fix kvm's usage of kernel_fpu_begin/end() Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu() which can be very far apart. kvm_load/put_guest_fpu() routines are already called with preemption disabled and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). So introduce __kernel_fpu_begin/end() routines which don't touch preemption and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state. Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For eagerFPU case, host cr0.TS is always clear. So no need to worry about it. For the traditional lazyFPU restore case, cr0.TS bit is always set during vm-exit and depending on the guest FPU state and the host task's FPU state, cr0.TS bit is cleared when needed. 
Signed-off-by: Suresh Siddha --- arch/x86/include/asm/fpu-internal.h |5 - arch/x86/include/asm/i387.h | 28 ++-- arch/x86/include/asm/processor.h|5 + arch/x86/kernel/i387.c | 13 + arch/x86/kvm/vmx.c | 11 +-- arch/x86/kvm/x86.c |4 ++-- 6 files changed, 47 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 92f3c6e..a6b60c7 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -85,11 +85,6 @@ static inline int is_x32_frame(void) #define X87_FSW_ES (1 << 7)/* Exception Summary */ -static __always_inline __pure bool use_eager_fpu(void) -{ - return static_cpu_has(X86_FEATURE_EAGER_FPU); -} - static __always_inline __pure bool use_xsaveopt(void) { return static_cpu_has(X86_FEATURE_XSAVEOPT); diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..ed8089d 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,32 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); + +/* + * Careful: __kernel_fpu_begin/end() must be called with preempt disabled + * and they don't touch the preempt state on their own. + * If you enable preemption after __kernel_fpu_begin(), preempt notifier + * should call the __kernel_fpu_end() to prevent the kernel/user FPU + * state from getting corrupted. KVM for example uses this model. + * + * All other cases use kernel_fpu_begin/end() which disable preemption + * during kernel FPU usage. 
+ */ +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index b98c0d9..d0e9adb 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -402,6 +402,11 @@ struct fpu { union thread_xstate *state; }; +static __always_inline __pure bool use_eager_fpu(void) +{ + return static_cpu_has(X86_FEATURE_EAGER_FPU); +} + #ifdef CONFIG_X86_64 DECLARE_PER_CPU(struct orig_ist, orig_ist); diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end() */ } else if (!use_eager_fpu()) { this_cpu_write(fpu_owner_
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 20:22 +0300, Avi Kivity wrote: > On 09/19/2012 08:18 PM, Suresh Siddha wrote: > > > These routines (kvm_load/put_guest_fpu()) are already called with > > preemption disabled but as you mentioned, we don't want the preemption > > to be disabled completely between the kvm_load_guest_fpu() and > > kvm_put_guest_fpu(). > > > > Also KVM already has the preempt notifier which is doing the > > kvm_put_guest_fpu(), so something like the appended should address this. > > I will test this shortly. > > > > Note, we could also go in a different direction and make > kernel_fpu_begin() use preempt notifiers and thus make its users > preemptible. But that's for a separate patchset. yep, but we need the fpu buffer to save/restore the kernel fpu state. KVM already has those buffers allocated in the guest cpu state and hence it all works out ok. But yes, we can revisit this in future. thanks, suresh
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 13:13 +0300, Avi Kivity wrote: > On 08/25/2012 12:12 AM, Suresh Siddha wrote: > > kvm's guest fpu save/restore should be wrapped around > > kernel_fpu_begin/end(). This will avoid for example taking a DNA > > in kvm_load_guest_fpu() when it tries to load the fpu immediately > > after doing unlazy_fpu() on the host side. > > > > More importantly this will prevent the host process fpu from being > > corrupted. > > > > Signed-off-by: Suresh Siddha > > Cc: Avi Kivity > > --- > > arch/x86/kvm/x86.c |3 ++- > > 1 files changed, 2 insertions(+), 1 deletions(-) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 42bce48..67e773c 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -5969,7 +5969,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) > > */ > > kvm_put_guest_xcr0(vcpu); > > vcpu->guest_fpu_loaded = 1; > > - unlazy_fpu(current); > > + kernel_fpu_begin(); > > fpu_restore_checking(&vcpu->arch.guest_fpu); > > trace_kvm_fpu(1); > > This breaks kvm, since it disables preemption. What we want here is to > save the user fpu state if it was loaded, and do nothing if it wasn't. > Don't know what's the new API for that. These routines (kvm_load/put_guest_fpu()) are already called with preemption disabled but as you mentioned, we don't want the preemption to be disabled completely between the kvm_load_guest_fpu() and kvm_put_guest_fpu(). Also KVM already has the preempt notifier which is doing the kvm_put_guest_fpu(), so something like the appended should address this. I will test this shortly. 
Signed-off-by: Suresh Siddha --- arch/x86/include/asm/i387.h | 17 +++-- arch/x86/kernel/i387.c | 13 + arch/x86/kvm/x86.c |4 ++-- 3 files changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..29429b1 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,21 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end() */ } else if (!use_eager_fpu()) { this_cpu_write(fpu_owner_task, NULL); clts(); } } -EXPORT_SYMBOL(kernel_fpu_begin); +EXPORT_SYMBOL(__kernel_fpu_begin); -void kernel_fpu_end(void) +void __kernel_fpu_end(void) { if (use_eager_fpu()) math_state_restore(); else stts(); - preempt_enable(); } -EXPORT_SYMBOL(kernel_fpu_end); +EXPORT_SYMBOL(__kernel_fpu_end); void unlazy_fpu(struct task_struct *tsk) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3ddefb4..1f09552 100644 --- a/arch/x86/kvm/x86.c 
+++ b/arch/x86/kvm/x86.c @@ -5979,7 +5979,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu->guest_fpu_loaded = 1; - kernel_fpu_begin(); + __kernel_fpu_begin(); fpu_restore_checking(&vcpu->arch.guest_fpu); trace_kvm_fpu(1); } @@ -5993,7 +5993,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); - kernel_fpu_end(); + __kernel_fpu_end(); ++vcpu->stat.fpu_reload; kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0);
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 13:13 +0300, Avi Kivity wrote: On 08/25/2012 12:12 AM, Suresh Siddha wrote: kvm's guest fpu save/restore should be wrapped around kernel_fpu_begin/end(). This will avoid for example taking a DNA in kvm_load_guest_fpu() when it tries to load the fpu immediately after doing unlazy_fpu() on the host side. More importantly this will prevent the host process fpu from being corrupted. Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com Cc: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 42bce48..67e773c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5969,7 +5969,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu-guest_fpu_loaded = 1; - unlazy_fpu(current); + kernel_fpu_begin(); fpu_restore_checking(vcpu-arch.guest_fpu); trace_kvm_fpu(1); This breaks kvm, since it disables preemption. What we want here is to save the user fpu state if it was loaded, and do nothing if wasn't. Don't know what's the new API for that. These routines (kvm_load/put_guest_fpu()) are already called with preemption disabled but as you mentioned, we don't want the preemption to be disabled completely between the kvm_load_guest_fpu() and kvm_put_guest_fpu(). Also KVM already has the preempt notifier which is doing the kvm_put_guest_fpu(), so something like the appended should address this. I will test this shortly. 
Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com --- arch/x86/include/asm/i387.h | 17 +++-- arch/x86/kernel/i387.c | 13 + arch/x86/kvm/x86.c |4 ++-- 3 files changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..29429b1 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,21 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end() */ } else if (!use_eager_fpu()) { this_cpu_write(fpu_owner_task, NULL); clts(); } } -EXPORT_SYMBOL(kernel_fpu_begin); +EXPORT_SYMBOL(__kernel_fpu_begin); -void kernel_fpu_end(void) +void __kernel_fpu_end(void) { if (use_eager_fpu()) math_state_restore(); else stts(); - preempt_enable(); } -EXPORT_SYMBOL(kernel_fpu_end); +EXPORT_SYMBOL(__kernel_fpu_end); void unlazy_fpu(struct task_struct *tsk) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3ddefb4..1f09552 
100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5979,7 +5979,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu->guest_fpu_loaded = 1; - kernel_fpu_begin(); + __kernel_fpu_begin(); fpu_restore_checking(&vcpu->arch.guest_fpu); trace_kvm_fpu(1); } @@ -5993,7 +5993,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); - kernel_fpu_end(); + __kernel_fpu_end(); ++vcpu->stat.fpu_reload; kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 20:22 +0300, Avi Kivity wrote: On 09/19/2012 08:18 PM, Suresh Siddha wrote: These routines (kvm_load/put_guest_fpu()) are already called with preemption disabled but as you mentioned, we don't want the preemption to be disabled completely between the kvm_load_guest_fpu() and kvm_put_guest_fpu(). Also KVM already has the preempt notifier which is doing the kvm_put_guest_fpu(), so something like the appended should address this. I will test this shortly. Note, we could also go in a different direction and make kernel_fpu_begin() use preempt notifiers and thus make its users preemptible. But that's for a separate patchset. yep, but we need the fpu buffer to save/restore the kernel fpu state. KVM already has those buffers allocated in the guest cpu state and hence it all works out ok. But yes, we can revisit this in future. thanks, suresh
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 10:18 -0700, Suresh Siddha wrote: These routines (kvm_load/put_guest_fpu()) are already called with preemption disabled but as you mentioned, we don't want the preemption to be disabled completely between the kvm_load_guest_fpu() and kvm_put_guest_fpu(). Also KVM already has the preempt notifier which is doing the kvm_put_guest_fpu(), so something like the appended should address this. I will test this shortly. Appended the tested fix (one more VMX based change needed as it fiddles with cr0.TS host bit). Thanks. --8-- From: Suresh Siddha suresh.b.sid...@intel.com Subject: x86, kvm: fix kvm's usage of kernel_fpu_begin/end() Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu() which can be very far apart. kvm_load/put_guest_fpu() routines are already called with preemption disabled and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). So introduce __kernel_fpu_begin/end() routines which don't touch preemption and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state. Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For eagerFPU case, host cr0.TS is always clear. So no need to worry about it. For the traditional lazyFPU restore case, cr0.TS bit is always set during vm-exit and depending on the guest FPU state and the host task's FPU state, cr0.TS bit is cleared when needed. 
Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com --- arch/x86/include/asm/fpu-internal.h |5 - arch/x86/include/asm/i387.h | 28 ++-- arch/x86/include/asm/processor.h|5 + arch/x86/kernel/i387.c | 13 + arch/x86/kvm/vmx.c | 11 +-- arch/x86/kvm/x86.c |4 ++-- 6 files changed, 47 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 92f3c6e..a6b60c7 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -85,11 +85,6 @@ static inline int is_x32_frame(void) #define X87_FSW_ES (1 << 7)/* Exception Summary */ -static __always_inline __pure bool use_eager_fpu(void) -{ - return static_cpu_has(X86_FEATURE_EAGER_FPU); -} - static __always_inline __pure bool use_xsaveopt(void) { return static_cpu_has(X86_FEATURE_XSAVEOPT); diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..ed8089d 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,32 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); + +/* + * Careful: __kernel_fpu_begin/end() must be called with preempt disabled + * and they don't touch the preempt state on their own. + * If you enable preemption after __kernel_fpu_begin(), preempt notifier + * should call the __kernel_fpu_end() to prevent the kernel/user FPU + * state from getting corrupted. KVM for example uses this model. + * + * All other cases use kernel_fpu_begin/end() which disable preemption + * during kernel FPU usage. 
+ */ +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index b98c0d9..d0e9adb 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -402,6 +402,11 @@ struct fpu { union thread_xstate *state; }; +static __always_inline __pure bool use_eager_fpu(void) +{ + return static_cpu_has(X86_FEATURE_EAGER_FPU); +} + #ifdef CONFIG_X86_64 DECLARE_PER_CPU(struct orig_ist, orig_ist); diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end() */ } else if (!use_eager_fpu()) { this_cpu_write(fpu_owner_task
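The patch above splits the API so that the plain kernel_fpu_begin/end() pair pins the task with preempt_disable/enable(), while the __-prefixed variants assume the caller is already non-preemptible (as KVM is, via its preempt notifier). As a minimal user-space sketch of that pairing contract — with hypothetical names, modelling the preempt counter and the "kernel owns the FPU" state as plain variables rather than the real per-CPU machinery:

```c
#include <assert.h>
#include <stdbool.h>

static int preempt_count;   /* models the task's preemption-disable nesting */
static bool in_kernel_fpu;  /* models "kernel is currently using the FPU"    */

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }

/* The __ variants only transfer FPU ownership; the caller must already
 * be non-preemptible (KVM guarantees this around vcpu entry/exit). */
static void __kernel_fpu_begin(void)
{
    assert(preempt_count > 0);  /* contract: caller disabled preemption */
    in_kernel_fpu = true;       /* stand-in for saving the user FPU state */
}

static void __kernel_fpu_end(void)
{
    in_kernel_fpu = false;      /* stand-in for restoring state / stts() */
}

/* The plain wrappers bundle the preemption handling for everyone else. */
static void kernel_fpu_begin(void)
{
    preempt_disable();
    __kernel_fpu_begin();
}

static void kernel_fpu_end(void)
{
    __kernel_fpu_end();
    preempt_enable();
}
```

The point of the split is visible in the model: a KVM-style user can keep FPU ownership across a preemptible region as long as its preempt notifier calls __kernel_fpu_end() before the task is switched out.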
[tip:x86/fpu] x86, fpu: remove cpu_has_xmm check in the fx_finit()
Commit-ID: a8615af4bc3621cb01096541dafa6f68352ec2d9 Gitweb: http://git.kernel.org/tip/a8615af4bc3621cb01096541dafa6f68352ec2d9 Author: Suresh Siddha AuthorDate: Mon, 10 Sep 2012 10:40:08 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:24 -0700 x86, fpu: remove cpu_has_xmm check in the fx_finit() No CPUs with FXSAVE but no XMM/MXCSR (Pentium II from Intel, Crusoe/TM-3xxx/5xxx from Transmeta, and presumably some of the K6 generation from AMD) ever looked at the mxcsr field during fxrstor/fxsave. So remove the cpu_has_xmm check in the fx_finit(). Reported-by: Al Viro Acked-by: H. Peter Anvin Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1347300665-6209-6-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 0ca72f0..92f3c6e 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -109,8 +109,7 @@ static inline void fx_finit(struct i387_fxsave_struct *fx) { memset(fx, 0, xstate_size); fx->cwd = 0x37f; - if (cpu_has_xmm) - fx->mxcsr = MXCSR_DEFAULT; + fx->mxcsr = MXCSR_DEFAULT; } extern void __sanitize_i387_state(struct task_struct *);
[tip:x86/fpu] x86, fpu: make eagerfpu= boot param tri-state
Commit-ID: e00229819f306b1f86134095347e9187dc346bd1 Gitweb: http://git.kernel.org/tip/e00229819f306b1f86134095347e9187dc346bd1 Author: Suresh Siddha AuthorDate: Mon, 10 Sep 2012 10:32:32 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:24 -0700 x86, fpu: make eagerfpu= boot param tri-state Add the "eagerfpu=auto" (that selects the default scheme in enabling eagerfpu) which can override compiled-in boot parameters like "eagerfpu=on/off" (that force enable/disable eagerfpu). Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1347300665-6209-5-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- Documentation/kernel-parameters.txt |4 +++- arch/x86/kernel/xsave.c | 17 - 2 files changed, 15 insertions(+), 6 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e8f7faa..46a6a82 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1834,8 +1834,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. enabling legacy floating-point and sse state. eagerfpu= [X86] - on enable eager fpu restore (default for xsaveopt) + on enable eager fpu restore off disable eager fpu restore + auto selects the default scheme, which automatically + enables eagerfpu restore for xsaveopt. 
nohlt [BUGS=ARM,SH] Tells the kernel that the sleep(SH) or wfi(ARM) instruction doesn't work correctly and not to diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index e99f754..4e89b3d 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -508,13 +508,15 @@ static void __init setup_init_fpu_buf(void) xsave_state(init_xstate_buf, -1); } -static int disable_eagerfpu; +static enum { AUTO, ENABLE, DISABLE } eagerfpu = AUTO; static int __init eager_fpu_setup(char *s) { if (!strcmp(s, "on")) - setup_force_cpu_cap(X86_FEATURE_EAGER_FPU); + eagerfpu = ENABLE; else if (!strcmp(s, "off")) - disable_eagerfpu = 1; + eagerfpu = DISABLE; + else if (!strcmp(s, "auto")) + eagerfpu = AUTO; return 1; } __setup("eagerfpu=", eager_fpu_setup); @@ -557,8 +559,9 @@ static void __init xstate_enable_boot_cpu(void) prepare_fx_sw_frame(); setup_init_fpu_buf(); - if (cpu_has_xsaveopt && !disable_eagerfpu) - setup_force_cpu_cap(X86_FEATURE_EAGER_FPU); + /* Auto enable eagerfpu for xsaveopt */ + if (cpu_has_xsaveopt && eagerfpu != DISABLE) + eagerfpu = ENABLE; pr_info("enabled xstate_bv 0x%llx, cntxt size 0x%x\n", pcntxt_mask, xstate_size); @@ -598,6 +601,10 @@ void __cpuinit eager_fpu_init(void) clear_used_math(); current_thread_info()->status = 0; + + if (eagerfpu == ENABLE) + setup_force_cpu_cap(X86_FEATURE_EAGER_FPU); + if (!cpu_has_eager_fpu) { stts(); return;
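The tri-state logic in the patch above is small enough to model stand-alone. A sketch of the two stages — parsing the eagerfpu= boot parameter, then resolving AUTO at boot based on xsaveopt support — using approximate enum names, not the kernel's own symbols:

```c
#include <string.h>

/* Hypothetical stand-alone version of eager_fpu_setup(): "on"/"off"
 * force the policy, "auto" (also the default) defers the decision. */
enum eagerfpu_policy { EAGERFPU_AUTO, EAGERFPU_ENABLE, EAGERFPU_DISABLE };

static enum eagerfpu_policy parse_eagerfpu(const char *s)
{
    if (!strcmp(s, "on"))
        return EAGERFPU_ENABLE;
    if (!strcmp(s, "off"))
        return EAGERFPU_DISABLE;
    return EAGERFPU_AUTO;   /* "auto" and anything unrecognized */
}

/* Mirrors the xstate_enable_boot_cpu() step: AUTO resolves to ENABLE
 * only when the CPU supports xsaveopt; an explicit "off" always wins. */
static enum eagerfpu_policy resolve_eagerfpu(enum eagerfpu_policy p,
                                             int cpu_has_xsaveopt)
{
    if (p == EAGERFPU_AUTO && cpu_has_xsaveopt)
        return EAGERFPU_ENABLE;
    return p;
}
```

The design point is that the command line no longer force-sets the CPU capability directly; the synthetic X86_FEATURE_EAGER_FPU flag is set in one place (eager_fpu_init()) only after the policy has been fully resolved.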
[tip:x86/fpu] x86, fpu: decouple non-lazy/ eager fpu restore from xsave
Commit-ID: 5d2bd7009f306c82afddd1ca4d9763ad8473c216 Gitweb: http://git.kernel.org/tip/5d2bd7009f306c82afddd1ca4d9763ad8473c216 Author: Suresh Siddha AuthorDate: Thu, 6 Sep 2012 14:58:52 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:22 -0700 x86, fpu: decouple non-lazy/eager fpu restore from xsave Decouple non-lazy/eager fpu restore policy from the existence of the xsave feature. Introduce a synthetic CPUID flag to represent the eagerfpu policy. "eagerfpu=on" boot paramter will enable the policy. Requested-by: H. Peter Anvin Requested-by: Linus Torvalds Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1347300665-6209-2-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- Documentation/kernel-parameters.txt |4 ++ arch/x86/include/asm/cpufeature.h |2 + arch/x86/include/asm/fpu-internal.h | 54 -- arch/x86/kernel/cpu/common.c|2 - arch/x86/kernel/i387.c | 25 +++--- arch/x86/kernel/process.c |2 +- arch/x86/kernel/traps.c |2 +- arch/x86/kernel/xsave.c | 87 +++ 8 files changed, 112 insertions(+), 66 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index ad7e2e5..741d064 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1833,6 +1833,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. and restore using xsave. The kernel will fallback to enabling legacy floating-point and sse state. + eagerfpu= [X86] + on enable eager fpu restore + off disable eager fpu restore + nohlt [BUGS=ARM,SH] Tells the kernel that the sleep(SH) or wfi(ARM) instruction doesn't work correctly and not to use it. This is also useful when using JTAG debugger. 
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 6b7ee5f..5dd2b47 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -97,6 +97,7 @@ #define X86_FEATURE_EXTD_APICID(3*32+26) /* has extended APICID (8 bits) */ #define X86_FEATURE_AMD_DCM (3*32+27) /* multi-node processor */ #define X86_FEATURE_APERFMPERF (3*32+28) /* APERFMPERF */ +#define X86_FEATURE_EAGER_FPU (3*32+29) /* "eagerfpu" Non lazy FPU restore */ /* Intel-defined CPU features, CPUID level 0x0001 (ecx), word 4 */ #define X86_FEATURE_XMM3 (4*32+ 0) /* "pni" SSE-3 */ @@ -305,6 +306,7 @@ extern const char * const x86_power_flags[32]; #define cpu_has_perfctr_core boot_cpu_has(X86_FEATURE_PERFCTR_CORE) #define cpu_has_cx8boot_cpu_has(X86_FEATURE_CX8) #define cpu_has_cx16 boot_cpu_has(X86_FEATURE_CX16) +#define cpu_has_eager_fpu boot_cpu_has(X86_FEATURE_EAGER_FPU) #if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64) # define cpu_has_invlpg1 diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 8ca0f9f..0ca72f0 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -38,6 +38,7 @@ int ia32_setup_frame(int sig, struct k_sigaction *ka, extern unsigned int mxcsr_feature_mask; extern void fpu_init(void); +extern void eager_fpu_init(void); DECLARE_PER_CPU(struct task_struct *, fpu_owner_task); @@ -84,6 +85,11 @@ static inline int is_x32_frame(void) #define X87_FSW_ES (1 << 7)/* Exception Summary */ +static __always_inline __pure bool use_eager_fpu(void) +{ + return static_cpu_has(X86_FEATURE_EAGER_FPU); +} + static __always_inline __pure bool use_xsaveopt(void) { return static_cpu_has(X86_FEATURE_XSAVEOPT); @@ -99,6 +105,14 @@ static __always_inline __pure bool use_fxsr(void) return static_cpu_has(X86_FEATURE_FXSR); } +static inline void fx_finit(struct i387_fxsave_struct *fx) +{ + memset(fx, 0, xstate_size); + fx->cwd = 0x37f; + if (cpu_has_xmm) + fx->mxcsr = 
MXCSR_DEFAULT; +} + extern void __sanitize_i387_state(struct task_struct *); static inline void sanitize_i387_state(struct task_struct *tsk) @@ -291,13 +305,13 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk) static inline void __thread_fpu_end(struct task_struct *tsk) { __thread_clear_has_fpu(tsk); - if (!use_xsave()) + if (!use_eager_fpu()) stts(); } static inline void __thread_fpu_begin(struct task_struct *tsk) { - if (!use_xsave()) + if (!use_eager_fpu()) clts(); __thread_set_has_fpu(tsk); } @@ -327,10 +341,14 @@ static inline void drop_fpu(struct task_struct *tsk) static inline void drop_init_fpu(struct t
[tip:x86/fpu] x86, fpu: use non-lazy fpu restore for processors supporting xsave
Commit-ID: 304bceda6a18ae0b0240b8aac9a6bdf8ce2d2469 Gitweb: http://git.kernel.org/tip/304bceda6a18ae0b0240b8aac9a6bdf8ce2d2469 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:13:02 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:11 -0700 x86, fpu: use non-lazy fpu restore for processors supporting xsave Fundamental model of the current Linux kernel is to lazily init and restore FPU instead of restoring the task state during context switch. This changes that fundamental lazy model to the non-lazy model for the processors supporting xsave feature. Reasons driving this model change are: i. Newer processors support optimized state save/restore using xsaveopt and xrstor by tracking the INIT state and MODIFIED state during context-switch. This is faster than modifying the cr0.TS bit which has serializing semantics. ii. Newer glibc versions use SSE for some of the optimized copy/clear routines. With certain workloads (like boot, kernel-compilation etc), application completes its work with in the first 5 task switches, thus taking upto 5 #DNA traps with the kernel not getting a chance to apply the above mentioned pre-load heuristic. iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit and thus will not work correctly in the presence of lazy restore. Non-lazy state restore is needed for enabling such features. Some data on a two socket SNB system: * Saved 20K DNA exceptions during boot on a two socket SNB system. * Saved 50K DNA exceptions during kernel-compilation workload. * Improved throughput of the AVX based checksumming function inside the kernel by ~15% as xsave/xrstor is faster than the serializing clts/stts pair. Also now kernel_fpu_begin/end() relies on the patched alternative instructions. So move check_fpu() which uses the kernel_fpu_begin/end() after alternative_instructions(). 
Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-7-git-send-email-suresh.b.sid...@intel.com Merge 32-bit boot fix from, Link: http://lkml.kernel.org/r/1347300665-6209-4-git-send-email-suresh.b.sid...@intel.com Cc: Jim Kukunas Cc: NeilBrown Cc: Avi Kivity Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h | 96 +++ arch/x86/include/asm/i387.h |1 + arch/x86/include/asm/xsave.h|1 + arch/x86/kernel/cpu/bugs.c |7 ++- arch/x86/kernel/i387.c | 20 ++- arch/x86/kernel/process.c | 12 +++-- arch/x86/kernel/process_32.c|4 -- arch/x86/kernel/process_64.c|4 -- arch/x86/kernel/traps.c |5 ++- arch/x86/kernel/xsave.c | 57 + 10 files changed, 146 insertions(+), 61 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 52202a6..8ca0f9f 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -291,15 +291,48 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk) static inline void __thread_fpu_end(struct task_struct *tsk) { __thread_clear_has_fpu(tsk); - stts(); + if (!use_xsave()) + stts(); } static inline void __thread_fpu_begin(struct task_struct *tsk) { - clts(); + if (!use_xsave()) + clts(); __thread_set_has_fpu(tsk); } +static inline void __drop_fpu(struct task_struct *tsk) +{ + if (__thread_has_fpu(tsk)) { + /* Ignore delayed exceptions from user space */ + asm volatile("1: fwait\n" +"2:\n" +_ASM_EXTABLE(1b, 2b)); + __thread_fpu_end(tsk); + } +} + +static inline void drop_fpu(struct task_struct *tsk) +{ + /* +* Forget coprocessor state.. +*/ + preempt_disable(); + tsk->fpu_counter = 0; + __drop_fpu(tsk); + clear_used_math(); + preempt_enable(); +} + +static inline void drop_init_fpu(struct task_struct *tsk) +{ + if (!use_xsave()) + drop_fpu(tsk); + else + xrstor_state(init_xstate_buf, -1); +} + /* * FPU state switching for scheduling. 
* @@ -333,7 +366,12 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta { fpu_switch_t fpu; - fpu.preload = tsk_used_math(new) && new->fpu_counter > 5; + /* +* If the task has used the math, pre-load the FPU on xsave processors +* or if the past 5 consecutive context-switches used math. +*/ + fpu.preload = tsk_used_math(new) && (use_xsave() || +new->fpu_counter > 5); if (__thread_has_fpu(old)) { if (!__save_init_fpu(old)) cpu = ~0; @@ -345,14 +383,14 @@ static inline fpu_switch_t swi
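The switch_fpu_prepare() hunk above changes the preload heuristic: under the eager/xsave model the FPU is always preloaded for any task that has used math, while the lazy model keeps the old "5 consecutive math-using context switches" threshold. A one-function sketch of that decision (hypothetical name, booleans in place of the real task/CPU state):

```c
#include <stdbool.h>

/* Hypothetical model of the fpu.preload decision in switch_fpu_prepare():
 * preload if the incoming task has used math AND either the eager/xsave
 * model is in effect or the task used math in >5 consecutive switches. */
static bool fpu_preload(bool tsk_used_math, bool use_xsave, int fpu_counter)
{
    return tsk_used_math && (use_xsave || fpu_counter > 5);
}
```

This is why the commit's benchmark numbers work out: on xsave hardware the restore happens unconditionally at context switch (cheap via xsaveopt's init/modified tracking), instead of deferring to a #NM trap that short-lived tasks would otherwise keep taking.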
[tip:x86/fpu] lguest, x86: handle guest TS bit for lazy/ non-lazy fpu host models
Commit-ID: 9c6ff8bbb69a4e7b47ac40bfa44509296e89c5c0 Gitweb: http://git.kernel.org/tip/9c6ff8bbb69a4e7b47ac40bfa44509296e89c5c0 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:13:01 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:09 -0700 lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models Instead of using unlazy_fpu() check if user_has_fpu() and set/clear the host TS bits so that the lguest works fine with both the lazy/non-lazy FPU host models with minimal changes. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-6-git-send-email-suresh.b.sid...@intel.com Cc: Rusty Russell Signed-off-by: H. Peter Anvin --- drivers/lguest/x86/core.c | 10 +++--- 1 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/lguest/x86/core.c b/drivers/lguest/x86/core.c index 39809035..4af12e1 100644 --- a/drivers/lguest/x86/core.c +++ b/drivers/lguest/x86/core.c @@ -203,8 +203,8 @@ void lguest_arch_run_guest(struct lg_cpu *cpu) * we set it now, so we can trap and pass that trap to the Guest if it * uses the FPU. */ - if (cpu->ts) - unlazy_fpu(current); + if (cpu->ts && user_has_fpu()) + stts(); /* * SYSENTER is an optimized way of doing system calls. We can't allow @@ -234,6 +234,10 @@ void lguest_arch_run_guest(struct lg_cpu *cpu) if (boot_cpu_has(X86_FEATURE_SEP)) wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0); + /* Clear the host TS bit if it was set above. */ + if (cpu->ts && user_has_fpu()) + clts(); + /* * If the Guest page faulted, then the cr2 register will tell us the * bad virtual address. We have to grab this now, because once we @@ -249,7 +253,7 @@ void lguest_arch_run_guest(struct lg_cpu *cpu) * a different CPU. So all the critical stuff should be done * before this. 
*/ - else if (cpu->regs->trapnum == 7) + else if (cpu->regs->trapnum == 7 && !user_has_fpu()) math_state_restore(); }
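The lguest change above reduces to two small predicates: when to set the host's cr0.TS around the guest run, and when a guest #NM (trap 7) should be handled by the host. A sketch with hypothetical function names standing in for the real lguest state:

```c
#include <stdbool.h>

/* Hypothetical model of lguest_arch_run_guest()'s TS handling: host
 * cr0.TS is set across the guest run only when the guest's virtual TS
 * is set AND the host task's FPU state is live in the registers
 * (user_has_fpu()); otherwise there is nothing to protect. */
static bool host_should_set_ts(bool guest_ts, bool user_has_fpu)
{
    return guest_ts && user_has_fpu;
}

/* A guest #NM (trap 7) is forwarded to math_state_restore() only when
 * the host FPU is not already live; if it is live, TS was set purely
 * for the guest's benefit and the trap belongs to the guest. */
static bool host_handles_dna(int trapnum, bool user_has_fpu)
{
    return trapnum == 7 && !user_has_fpu;
}
```

Keying both decisions on user_has_fpu() rather than unlazy_fpu() is what lets the same code work under both the lazy and the eager host FPU model, since the eager model never leaves TS set on the host.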
[tip:x86/fpu] x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
Commit-ID: 841e3604d35aa70d399146abdc526d8c89a2c2f5 Gitweb: http://git.kernel.org/tip/841e3604d35aa70d399146abdc526d8c89a2c2f5 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:13:00 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:08 -0700 x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and saving/restoring just the few used xmm/ymm registers. This has some advantages like: * If the task's FPU state is already active, then kernel_fpu_begin() will just save the user-state and avoiding the read/write of cr0. In general, cr0 accesses are much slower. * Manual save/restore of xmm/ymm registers will affect the 'modified' and the 'init' optimizations brought in the by xsaveopt/xrstor infrastructure. * Foward compatibility with future vector register extensions will be a problem if the xmm/ymm registers are manually saved and restored (corrupting the extended state of those vector registers). With this patch, there was no significant difference in the xor throughput using AVX, measured during boot. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-5-git-send-email-suresh.b.sid...@intel.com Cc: Jim Kukunas Cc: NeilBrown Signed-off-by: H. 
Peter Anvin --- arch/x86/include/asm/xor_32.h | 56 +--- arch/x86/include/asm/xor_64.h | 61 ++-- arch/x86/include/asm/xor_avx.h | 54 --- 3 files changed, 29 insertions(+), 142 deletions(-) diff --git a/arch/x86/include/asm/xor_32.h b/arch/x86/include/asm/xor_32.h index 4545708..aabd585 100644 --- a/arch/x86/include/asm/xor_32.h +++ b/arch/x86/include/asm/xor_32.h @@ -534,38 +534,6 @@ static struct xor_block_template xor_block_p5_mmx = { * Copyright (C) 1999 Zach Brown (with obvious credit due Ingo) */ -#define XMMS_SAVE \ -do { \ - preempt_disable(); \ - cr0 = read_cr0(); \ - clts(); \ - asm volatile( \ - "movups %%xmm0,(%0) ;\n\t" \ - "movups %%xmm1,0x10(%0) ;\n\t" \ - "movups %%xmm2,0x20(%0) ;\n\t" \ - "movups %%xmm3,0x30(%0) ;\n\t" \ - : \ - : "r" (xmm_save)\ - : "memory");\ -} while (0) - -#define XMMS_RESTORE \ -do { \ - asm volatile( \ - "sfence ;\n\t" \ - "movups (%0),%%xmm0 ;\n\t" \ - "movups 0x10(%0),%%xmm1 ;\n\t" \ - "movups 0x20(%0),%%xmm2 ;\n\t" \ - "movups 0x30(%0),%%xmm3 ;\n\t" \ - : \ - : "r" (xmm_save)\ - : "memory");\ - write_cr0(cr0); \ - preempt_enable(); \ -} while (0) - -#define ALIGN16 __attribute__((aligned(16))) - #define OFFS(x)"16*("#x")" #define PF_OFFS(x) "256+16*("#x")" #definePF0(x) " prefetchnta "PF_OFFS(x)"(%1) ;\n" @@ -587,10 +555,8 @@ static void xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) { unsigned long lines = bytes >> 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -633,7 +599,7 @@ xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) : : "memory"); - XMMS_RESTORE; + kernel_fpu_end(); } static void @@ -641,10 +607,8 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2, unsigned long *p3) { unsigned long lines = bytes >> 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -694,7 +658,7 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long 
*p2, : : "memory" ); - XMMS_RESTORE; + kernel_fpu_end(); } static void @@ -702,10 +666,8 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2, unsigned long *p3, unsigned long *p4) { unsigned long lines = bytes >> 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin();
[tip:x86/fpu] x86, kvm: use kernel_fpu_begin/end() in kvm_load/ put_guest_fpu()
Commit-ID: 9c1c3fac53378c9782c18f80107965578d7b7167 Gitweb: http://git.kernel.org/tip/9c1c3fac53378c9782c18f80107965578d7b7167 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:12:59 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:07 -0700 x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu() kvm's guest fpu save/restore should be wrapped around kernel_fpu_begin/end(). This will avoid for example taking a DNA in kvm_load_guest_fpu() when it tries to load the fpu immediately after doing unlazy_fpu() on the host side. More importantly this will prevent the host process fpu from being corrupted. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-4-git-send-email-suresh.b.sid...@intel.com Cc: Avi Kivity Signed-off-by: H. Peter Anvin --- arch/x86/kvm/x86.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 148ed66..cf637f5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5972,7 +5972,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu->guest_fpu_loaded = 1; - unlazy_fpu(current); + kernel_fpu_begin(); fpu_restore_checking(&vcpu->arch.guest_fpu); trace_kvm_fpu(1); } @@ -5986,6 +5986,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); + kernel_fpu_end(); ++vcpu->stat.fpu_reload; kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0);
[tip:x86/fpu] x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
Commit-ID: 377ffbcc536a5adc077395163ab149c02610 Gitweb: http://git.kernel.org/tip/377ffbcc536a5adc077395163ab149c02610 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:12:58 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:06 -0700 x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig() Few lines below we do drop_fpu() which is more safer. Remove the unnecessary user_fpu_end() in save_xstate_sig(), which allows the drop_fpu() to ignore any pending exceptions from the user-space and drop the current fpu. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-3-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h | 17 +++-- arch/x86/kernel/xsave.c |1 - 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 78169d1..52202a6 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -412,22 +412,11 @@ static inline void __drop_fpu(struct task_struct *tsk) } /* - * The actual user_fpu_begin/end() functions - * need to be preemption-safe. + * Need to be preemption-safe. * - * NOTE! user_fpu_end() must be used only after you - * have saved the FP state, and user_fpu_begin() must - * be used only immediately before restoring it. - * These functions do not do any save/restore on - * their own. + * NOTE! user_fpu_begin() must be used only immediately before restoring + * it. This function does not do any save/restore on their own. 
*/ -static inline void user_fpu_end(void) -{ - preempt_disable(); - __thread_fpu_end(current); - preempt_enable(); -} - static inline void user_fpu_begin(void) { preempt_disable(); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 07ddc87..4ac5f2e 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -255,7 +255,6 @@ int save_xstate_sig(void __user *buf, void __user *buf_fx, int size) /* Update the thread's fxstate to save the fsave header. */ if (ia32_fxstate) fpu_fxsave(&tsk->thread.fpu); - user_fpu_end(); } else { sanitize_i387_state(tsk); if (__copy_to_user(buf_fx, xsave, xstate_size))
[tip:x86/fpu] x86, fpu: drop_fpu() before restoring new state from sigframe
Commit-ID: e962591749dfd4df9fea2c530ed7a3cfed50e5aa Gitweb: http://git.kernel.org/tip/e962591749dfd4df9fea2c530ed7a3cfed50e5aa Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:12:57 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:05 -0700 x86, fpu: drop_fpu() before restoring new state from sigframe No need to save the state with unlazy_fpu(), that is about to get overwritten by the state from the signal frame. Instead use drop_fpu() and continue to restore the new state. Also fold the stop_fpu_preload() into drop_fpu(). Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-2-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h |7 +-- arch/x86/kernel/xsave.c |8 +++- 2 files changed, 4 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 4fbb419..78169d1 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -448,17 +448,12 @@ static inline void save_init_fpu(struct task_struct *tsk) preempt_enable(); } -static inline void stop_fpu_preload(struct task_struct *tsk) -{ - tsk->fpu_counter = 0; -} - static inline void drop_fpu(struct task_struct *tsk) { /* * Forget coprocessor state.. 
*/ - stop_fpu_preload(tsk); + tsk->fpu_counter = 0; preempt_disable(); __drop_fpu(tsk); preempt_enable(); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 0923d27..07ddc87 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -382,16 +382,14 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size) struct xsave_struct *xsave = &tsk->thread.fpu.state->xsave; struct user_i387_ia32_struct env; - stop_fpu_preload(tsk); - unlazy_fpu(tsk); + drop_fpu(tsk); if (__copy_from_user(xsave, buf_fx, state_size) || - __copy_from_user(&env, buf, sizeof(env))) { - drop_fpu(tsk); + __copy_from_user(&env, buf, sizeof(env))) return -1; - } sanitize_restored_xstate(tsk, &env, xstate_bv, fx_only); + set_used_math(); } else { /* * For 64-bit frames and 32-bit fsave frames, restore the user
[tip:x86/fpu] x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
Commit-ID: 72a671ced66db6d1c2bfff1c930a101ac8d08204 Gitweb: http://git.kernel.org/tip/72a671ced66db6d1c2bfff1c930a101ac8d08204 Author: Suresh Siddha AuthorDate: Tue, 24 Jul 2012 16:05:29 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:51:48 -0700 x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels Currently for x86 and x86_32 binaries, fpstate in the user sigframe is copied to/from the fpstate in the task struct. And in the case of signal delivery for x86_64 binaries, if the fpstate is live in the CPU registers, then the live state is copied directly to the user sigframe. Otherwise fpstate in the task struct is copied to the user sigframe. During restore, fpstate in the user sigframe is restored directly to the live CPU registers. Historically, different code paths led to different bugs. For example, the x86_64 code path was not preemption safe until recently. Also there is a lot of code duplication for support of new features like xsave etc. Unify signal handling code paths for x86 and x86_64 kernels. The new strategy is as follows: Signal delivery: Both for 32/64-bit frames, align the core math frame area to 64 bytes as needed by xsave (this is where the main fpu/extended state gets copied to; it excludes the legacy compatibility fsave header for the 32-bit [f]xsave frames). If the state is live, copy the register state directly to the user frame. If not live, copy the state in the thread struct to the user frame. And for 32-bit [f]xsave frames, construct the fsave header separately before the actual [f]xsave area. Signal return: As the 32-bit frames with [f]xstate have an additional 'fsave' header, copy everything back from the user sigframe to the fpstate in the task structure and reconstruct the fxstate from the 'fsave' header (also, user-passed pointers may not be correctly aligned for any attempt to directly restore any partial state). At the next fpstate usage, everything will be restored to the live CPU registers. 
For all the 64-bit frames and the 32-bit fsave frame, restore the state from the user sigframe directly to the live CPU registers. 64-bit signals always restored the math frame directly, so we can expect the math frame pointer to be correctly aligned. For 32-bit fsave frames, there are no alignment requirements, so we can restore the state directly. "lat_sig catch" microbenchmark numbers (for x86, x86_64, x86_32 binaries) are within the noise range with this change. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1343171129-2747-4-git-send-email-suresh.b.sid...@intel.com [ Merged in compilation fix ] Link: http://lkml.kernel.org/r/1344544736.8326.17.ca...@sbsiddha-desk.sc.intel.com Signed-off-by: H. Peter Anvin --- arch/x86/ia32/ia32_signal.c |9 +- arch/x86/include/asm/fpu-internal.h | 111 ++ arch/x86/include/asm/xsave.h|6 +- arch/x86/kernel/i387.c | 246 + arch/x86/kernel/process.c | 10 - arch/x86/kernel/ptrace.c|3 - arch/x86/kernel/signal.c| 15 +- arch/x86/kernel/xsave.c | 432 +-- 8 files changed, 348 insertions(+), 484 deletions(-) diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 452d4dd..8c77c64 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -251,7 +251,7 @@ static int ia32_restore_sigcontext(struct pt_regs *regs, get_user_ex(tmp, &sc->fpstate); buf = compat_ptr(tmp); - err |= restore_i387_xstate_ia32(buf); + err |= restore_xstate_sig(buf, 1); get_user_ex(*pax, &sc->ax); } get_user_catch(err); @@ -382,9 +382,12 @@ static void __user *get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, sp = (unsigned long) ka->sa.sa_restorer; if (used_math()) { - sp = sp - sig_xstate_ia32_size; + unsigned long fx_aligned, math_size; + + sp = alloc_mathframe(sp, 1, &fx_aligned, &math_size); *fpstate = (struct _fpstate_ia32 __user *) sp; - if (save_i387_xstate_ia32(*fpstate) < 0) + if (save_xstate_sig(*fpstate, (void __user *)fx_aligned, + math_size) < 0) return (void __user *) -1L; } diff --git 
a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 016acb3..4fbb419 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -22,11 +22,30 @@ #include <asm/uaccess.h> #include <asm/xsave.h> -extern unsigned int sig_xstate_size; +#ifdef CONFIG_X86_64 +# include <asm/sigcontext32.h> +# include <asm/user32.h> +int ia32_setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, + compat_sigset_t *set, struct pt_regs *regs); +int ia32_setup_frame(int sig, struct k_sigaction *ka, +compat_sigset_t *
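The "align the core math frame area to 64 bytes" step that alloc_mathframe() performs during signal delivery boils down to carving the frame out of the user stack pointer and rounding down. A hypothetical userspace sketch of just that arithmetic (align_mathframe() is an invented stand-in; the real helper also computes the fsave-header split for 32-bit frames):

```c
#include <assert.h>
#include <stdint.h>

/* Carve 'frame_size' bytes for the xsave area out of the user stack
 * pointer and round down to the 64-byte boundary xsave requires.
 * Illustrative stand-in for part of the kernel's alloc_mathframe(). */
static uint64_t align_mathframe(uint64_t sp, uint64_t frame_size)
{
	sp -= frame_size;
	return sp & ~(uint64_t)63;   /* 64-byte aligned, never above sp */
}
```

Rounding down (never up) matters here: the aligned frame must not overlap stack data above the original stack pointer.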
[tip:x86/fpu] x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
Commit-ID: 0ca5bd0d886578ad0afeceaa83458c0f35cb3c6b Gitweb: http://git.kernel.org/tip/0ca5bd0d886578ad0afeceaa83458c0f35cb3c6b Author: Suresh Siddha AuthorDate: Tue, 24 Jul 2012 16:05:28 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:51:26 -0700 x86, fpu: Consolidate inline asm routines for saving/restoring fpu state Consolidate x86, x86_64 inline asm routines saving/restoring fpu state using config_enabled(). Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1343171129-2747-3-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h | 182 +++ arch/x86/include/asm/xsave.h|6 +- arch/x86/kernel/xsave.c |4 +- 3 files changed, 80 insertions(+), 112 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 6f59543..016acb3 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -97,34 +97,24 @@ static inline void sanitize_i387_state(struct task_struct *tsk) __sanitize_i387_state(tsk); } -#ifdef CONFIG_X86_64 -static inline int fxrstor_checking(struct i387_fxsave_struct *fx) -{ - int err; - - /* See comment in fxsave() below. */ -#ifdef CONFIG_AS_FXSAVEQ - asm volatile("1: fxrstorq %[fx]\n\t" -"2:\n" -".section .fixup,\"ax\"\n" -"3: movl $-1,%[err]\n" -"jmp 2b\n" -".previous\n" -_ASM_EXTABLE(1b, 3b) -: [err] "=r" (err) -: [fx] "m" (*fx), "0" (0)); -#else - asm volatile("1: rex64/fxrstor (%[fx])\n\t" -"2:\n" -".section .fixup,\"ax\"\n" -"3: movl $-1,%[err]\n" -"jmp 2b\n" -".previous\n" -_ASM_EXTABLE(1b, 3b) -: [err] "=r" (err) -: [fx] "R" (fx), "m" (*fx), "0" (0)); -#endif - return err; +#define check_insn(insn, output, input...) 
\ +({ \ + int err;\ + asm volatile("1:" #insn "\n\t" \ +"2:\n" \ +".section .fixup,\"ax\"\n" \ +"3: movl $-1,%[err]\n"\ +"jmp 2b\n"\ +".previous\n" \ +_ASM_EXTABLE(1b, 3b) \ +: [err] "=r" (err), output \ +: "0"(0), input); \ + err;\ +}) + +static inline int fsave_user(struct i387_fsave_struct __user *fx) +{ + return check_insn(fnsave %[fx]; fwait, [fx] "=m" (*fx), "m" (*fx)); } static inline int fxsave_user(struct i387_fxsave_struct __user *fx) @@ -140,90 +130,73 @@ static inline int fxsave_user(struct i387_fxsave_struct __user *fx) if (unlikely(err)) return -EFAULT; - /* See comment in fxsave() below. */ -#ifdef CONFIG_AS_FXSAVEQ - asm volatile("1: fxsaveq %[fx]\n\t" -"2:\n" -".section .fixup,\"ax\"\n" -"3: movl $-1,%[err]\n" -"jmp 2b\n" -".previous\n" -_ASM_EXTABLE(1b, 3b) -: [err] "=r" (err), [fx] "=m" (*fx) -: "0" (0)); -#else - asm volatile("1: rex64/fxsave (%[fx])\n\t" -"2:\n" -".section .fixup,\"ax\"\n" -"3: movl $-1,%[err]\n" -"jmp 2b\n" -".previous\n" -_ASM_EXTABLE(1b, 3b) -: [err] "=r" (err), "=m" (*fx) -: [fx] "R" (fx), "0" (0)); -#endif - if (unlikely(err) && - __clear_user(fx, sizeof(struct i387_fxsave_struct))) - err = -EFAULT; - /
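The same commit replaces the #ifdef'd save variants with one body that branches on config_enabled(CONFIG_...), a compile-time 0/1 constant the optimizer folds so the dead variants disappear entirely. A userspace stand-in for that selection pattern (the enum constants are invented; the kernel's config_enabled() inspects Kconfig-generated macros):

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for Kconfig results; in the kernel these come from
 * config_enabled(CONFIG_X86_32) and config_enabled(CONFIG_AS_FXSAVEQ). */
enum { MODEL_CONFIG_X86_32 = 0, MODEL_CONFIG_AS_FXSAVEQ = 1 };

/* One function body replaces three #ifdef'd variants; the compiler
 * folds the constant conditions and keeps a single path. */
static const char *pick_save_insn(void)
{
	if (MODEL_CONFIG_X86_32)
		return "fxsave";
	else if (MODEL_CONFIG_AS_FXSAVEQ)
		return "fxsaveq";
	return "rex64/fxsave";   /* 64-bit without assembler fxsaveq support */
}
```

Because the conditions are constants rather than preprocessor guards, every variant is still type-checked in every configuration, which is the maintenance win the patch is after.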
[tip:x86/fpu] x86, signal: Cleanup ifdefs and is_ia32, is_x32
Commit-ID: 050902c011712ad4703038fa4489ec4edd87d396 Gitweb: http://git.kernel.org/tip/050902c011712ad4703038fa4489ec4edd87d396 Author: Suresh Siddha AuthorDate: Tue, 24 Jul 2012 16:05:27 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:51:26 -0700 x86, signal: Cleanup ifdefs and is_ia32, is_x32 Use config_enabled() to cleanup the definitions of is_ia32/is_x32. Move the function prototypes to the header file to cleanup ifdefs, and move the x32_setup_rt_frame() code around. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1343171129-2747-2-git-send-email-suresh.b.sid...@intel.com Merged in compilation fix from, Link: http://lkml.kernel.org/r/1344544736.8326.17.ca...@sbsiddha-desk.sc.intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h | 26 +- arch/x86/include/asm/signal.h |4 + arch/x86/kernel/signal.c| 196 ++ 3 files changed, 110 insertions(+), 116 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 75f4c6d..6f59543 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -12,6 +12,7 @@ #include <linux/kernel_stat.h> #include <linux/regset.h> +#include <linux/compat.h> #include <linux/slab.h> #include <asm/asm.h> #include <asm/cpufeature.h> @@ -32,7 +33,6 @@ extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get, extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set, xstateregs_set; - /* * xstateregs_active == fpregs_active. Please refer to the comment * at the definition of fpregs_active. 
@@ -55,6 +55,22 @@ extern void finit_soft_fpu(struct i387_soft_struct *soft); static inline void finit_soft_fpu(struct i387_soft_struct *soft) {} #endif +static inline int is_ia32_compat_frame(void) +{ + return config_enabled(CONFIG_IA32_EMULATION) && + test_thread_flag(TIF_IA32); +} + +static inline int is_ia32_frame(void) +{ + return config_enabled(CONFIG_X86_32) || is_ia32_compat_frame(); +} + +static inline int is_x32_frame(void) +{ + return config_enabled(CONFIG_X86_X32_ABI) && test_thread_flag(TIF_X32); +} + #define X87_FSW_ES (1 << 7)/* Exception Summary */ static __always_inline __pure bool use_xsaveopt(void) @@ -180,6 +196,11 @@ static inline void fpu_fxsave(struct fpu *fpu) #endif } +int ia32_setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, + compat_sigset_t *set, struct pt_regs *regs); +int ia32_setup_frame(int sig, struct k_sigaction *ka, +compat_sigset_t *set, struct pt_regs *regs); + #else /* CONFIG_X86_32 */ /* perform fxrstor iff the processor has extended states, otherwise frstor */ @@ -204,6 +225,9 @@ static inline void fpu_fxsave(struct fpu *fpu) : [fx] "=m" (fpu->state->fxsave)); } +#define ia32_setup_frame __setup_frame +#define ia32_setup_rt_frame__setup_rt_frame + #endif /* CONFIG_X86_64 */ /* diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h index 598457c..323973f 100644 --- a/arch/x86/include/asm/signal.h +++ b/arch/x86/include/asm/signal.h @@ -31,6 +31,10 @@ typedef struct { unsigned long sig[_NSIG_WORDS]; } sigset_t; +#ifndef CONFIG_COMPAT +typedef sigset_t compat_sigset_t; +#endif + #else /* Here we must cater to libcs that poke about in kernel headers. 
*/ diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index b280908..bed431a 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -209,24 +209,21 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size, unsigned long sp = regs->sp; int onsigstack = on_sig_stack(sp); -#ifdef CONFIG_X86_64 /* redzone */ - sp -= 128; -#endif /* CONFIG_X86_64 */ + if (config_enabled(CONFIG_X86_64)) + sp -= 128; if (!onsigstack) { /* This is the X/Open sanctioned signal stack switching. */ if (ka->sa.sa_flags & SA_ONSTACK) { if (current->sas_ss_size) sp = current->sas_ss_sp + current->sas_ss_size; - } else { -#ifdef CONFIG_X86_32 - /* This is the legacy signal stack switching. */ - if ((regs->ss & 0xffff) != __USER_DS && - !(ka->sa.sa_flags & SA_RESTORER) && - ka->sa.sa_restorer) + } else if (config_enabled(CONFIG_X86_32) && + (regs->ss & 0xffff) != __USER_DS && + !(ka->sa.sa_flags & SA_RESTORER) && + ka->sa.sa_restorer) { + /* This is the legacy signal stack switching. */
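The rewritten get_sigframe() logic above (skip the red zone, then switch to the alternate signal stack) can be modelled in a few lines. Every name below is an illustrative stand-in, and only the two branches touched by this hunk are modelled:

```c
#include <assert.h>

#define MODEL_SA_ONSTACK 0x08000000UL  /* stand-in for SA_ONSTACK */

struct model_ctx {
	unsigned long sp;            /* regs->sp at signal delivery */
	unsigned long sas_ss_sp;     /* alternate stack base */
	unsigned long sas_ss_size;   /* alternate stack size (0 = none) */
	unsigned long sa_flags;
	int on_sigstack;             /* already running on the alt stack? */
	int is_64bit;                /* models config_enabled(CONFIG_X86_64) */
};

static unsigned long pick_sigframe_sp(const struct model_ctx *c)
{
	unsigned long sp = c->sp;

	if (c->is_64bit)
		sp -= 128;   /* skip the x86-64 ABI red zone */

	/* X/Open sanctioned signal stack switching. */
	if (!c->on_sigstack && (c->sa_flags & MODEL_SA_ONSTACK) &&
	    c->sas_ss_size)
		sp = c->sas_ss_sp + c->sas_ss_size;

	return sp;
}
```

The point of the patch is that this shape needs no #ifdefs: the 64-bit red-zone adjustment is an ordinary branch on a compile-time constant.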
[tip:x86/fpu] x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
Commit-ID: 377ffbcc536a5adc077395163ab149c02610 Gitweb: http://git.kernel.org/tip/377ffbcc536a5adc077395163ab149c02610 Author: Suresh Siddha suresh.b.sid...@intel.com AuthorDate: Fri, 24 Aug 2012 14:12:58 -0700 Committer: H. Peter Anvin h...@linux.intel.com CommitDate: Tue, 18 Sep 2012 15:52:06 -0700 x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig() A few lines below we do drop_fpu(), which is safer. Remove the unnecessary user_fpu_end() in save_xstate_sig(), which allows the drop_fpu() to ignore any pending exceptions from the user-space and drop the current fpu. Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com Link: http://lkml.kernel.org/r/1345842782-24175-3-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin h...@linux.intel.com --- arch/x86/include/asm/fpu-internal.h | 17 +++-- arch/x86/kernel/xsave.c |1 - 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 78169d1..52202a6 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -412,22 +412,11 @@ static inline void __drop_fpu(struct task_struct *tsk) } /* - * The actual user_fpu_begin/end() functions - * need to be preemption-safe. + * Need to be preemption-safe. * - * NOTE! user_fpu_end() must be used only after you - * have saved the FP state, and user_fpu_begin() must - * be used only immediately before restoring it. - * These functions do not do any save/restore on - * their own. + * NOTE! user_fpu_begin() must be used only immediately before restoring + * it. This function does not do any save/restore on their own. 
*/ -static inline void user_fpu_end(void) -{ - preempt_disable(); - __thread_fpu_end(current); - preempt_enable(); -} - static inline void user_fpu_begin(void) { preempt_disable(); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 07ddc87..4ac5f2e 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -255,7 +255,6 @@ int save_xstate_sig(void __user *buf, void __user *buf_fx, int size) /* Update the thread's fxstate to save the fsave header. */ if (ia32_fxstate) fpu_fxsave(&tsk->thread.fpu); - user_fpu_end(); } else { sanitize_i387_state(tsk); if (__copy_to_user(buf_fx, xsave, xstate_size))
[tip:x86/fpu] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
Commit-ID: 9c1c3fac53378c9782c18f80107965578d7b7167 Gitweb: http://git.kernel.org/tip/9c1c3fac53378c9782c18f80107965578d7b7167 Author: Suresh Siddha suresh.b.sid...@intel.com AuthorDate: Fri, 24 Aug 2012 14:12:59 -0700 Committer: H. Peter Anvin h...@linux.intel.com CommitDate: Tue, 18 Sep 2012 15:52:07 -0700 x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu() kvm's guest fpu save/restore should be wrapped around kernel_fpu_begin/end(). This will avoid, for example, taking a DNA in kvm_load_guest_fpu() when it tries to load the fpu immediately after doing unlazy_fpu() on the host side. More importantly this will prevent the host process fpu from being corrupted. Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com Link: http://lkml.kernel.org/r/1345842782-24175-4-git-send-email-suresh.b.sid...@intel.com Cc: Avi Kivity a...@redhat.com Signed-off-by: H. Peter Anvin h...@linux.intel.com --- arch/x86/kvm/x86.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 148ed66..cf637f5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5972,7 +5972,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu->guest_fpu_loaded = 1; - unlazy_fpu(current); + kernel_fpu_begin(); fpu_restore_checking(&vcpu->arch.guest_fpu); trace_kvm_fpu(1); } @@ -5986,6 +5986,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); + kernel_fpu_end(); ++vcpu->stat.fpu_reload; kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0);
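The invariant this kvm patch enforces, save the host's FPU state before the guest state is loaded and bring it back when the guest is done, can be shown with a loose userspace model. All names below are invented, and the model restores eagerly in kernel_fpu_end_model() whereas the real kernel_fpu_end() of this era defers the restore under lazy FPU:

```c
#include <assert.h>
#include <string.h>

/* One shared "register file" standing in for the CPU's FPU state. */
static char fpu_regs[16];
static char host_saved[16];
static int  in_kernel_fpu;

static void kernel_fpu_begin_model(void)
{
	memcpy(host_saved, fpu_regs, sizeof fpu_regs);  /* save host state */
	in_kernel_fpu = 1;
}

static void kernel_fpu_end_model(void)
{
	memcpy(fpu_regs, host_saved, sizeof fpu_regs);  /* bring host state back */
	in_kernel_fpu = 0;
}

/* Guest FPU use bracketed the way the patch brackets
 * kvm_load_guest_fpu()/kvm_put_guest_fpu(). */
static void run_guest_model(const char *guest_state)
{
	kernel_fpu_begin_model();
	memcpy(fpu_regs, guest_state, sizeof fpu_regs); /* guest clobbers regs */
	kernel_fpu_end_model();
}
```

Without the begin/end bracket, whatever the guest left in the registers would leak into the host process's FPU state, which is the corruption the commit message describes.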
[tip:x86/fpu] x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
Commit-ID: 841e3604d35aa70d399146abdc526d8c89a2c2f5 Gitweb: http://git.kernel.org/tip/841e3604d35aa70d399146abdc526d8c89a2c2f5 Author: Suresh Siddha suresh.b.sid...@intel.com AuthorDate: Fri, 24 Aug 2012 14:13:00 -0700 Committer: H. Peter Anvin h...@linux.intel.com CommitDate: Tue, 18 Sep 2012 15:52:08 -0700 x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage Use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and saving/restoring just the few used xmm/ymm registers. This has some advantages: * If the task's FPU state is already active, then kernel_fpu_begin() will just save the user state, avoiding the read/write of cr0. In general, cr0 accesses are much slower. * Manual save/restore of xmm/ymm registers will affect the 'modified' and the 'init' optimizations brought in by the xsaveopt/xrstor infrastructure. * Forward compatibility with future vector register extensions will be a problem if the xmm/ymm registers are manually saved and restored (corrupting the extended state of those vector registers). With this patch, there was no significant difference in the xor throughput using AVX, measured during boot. Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com Link: http://lkml.kernel.org/r/1345842782-24175-5-git-send-email-suresh.b.sid...@intel.com Cc: Jim Kukunas james.t.kuku...@linux.intel.com Cc: NeilBrown ne...@suse.de Signed-off-by: H. 
Peter Anvin h...@linux.intel.com --- arch/x86/include/asm/xor_32.h | 56 +--- arch/x86/include/asm/xor_64.h | 61 ++-- arch/x86/include/asm/xor_avx.h | 54 --- 3 files changed, 29 insertions(+), 142 deletions(-) diff --git a/arch/x86/include/asm/xor_32.h b/arch/x86/include/asm/xor_32.h index 4545708..aabd585 100644 --- a/arch/x86/include/asm/xor_32.h +++ b/arch/x86/include/asm/xor_32.h @@ -534,38 +534,6 @@ static struct xor_block_template xor_block_p5_mmx = { * Copyright (C) 1999 Zach Brown (with obvious credit due Ingo) */ -#define XMMS_SAVE \ -do { \ - preempt_disable(); \ - cr0 = read_cr0(); \ - clts(); \ - asm volatile( \ - movups %%xmm0,(%0) ;\n\t \ - movups %%xmm1,0x10(%0) ;\n\t \ - movups %%xmm2,0x20(%0) ;\n\t \ - movups %%xmm3,0x30(%0) ;\n\t \ - : \ - : r (xmm_save)\ - : memory);\ -} while (0) - -#define XMMS_RESTORE \ -do { \ - asm volatile( \ - sfence ;\n\t \ - movups (%0),%%xmm0 ;\n\t \ - movups 0x10(%0),%%xmm1 ;\n\t \ - movups 0x20(%0),%%xmm2 ;\n\t \ - movups 0x30(%0),%%xmm3 ;\n\t \ - : \ - : r (xmm_save)\ - : memory);\ - write_cr0(cr0); \ - preempt_enable(); \ -} while (0) - -#define ALIGN16 __attribute__((aligned(16))) - #define OFFS(x)16*(#x) #define PF_OFFS(x) 256+16*(#x) #definePF0(x) prefetchnta PF_OFFS(x)(%1) ;\n @@ -587,10 +555,8 @@ static void xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) { unsigned long lines = bytes 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -633,7 +599,7 @@ xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) : : memory); - XMMS_RESTORE; + kernel_fpu_end(); } static void @@ -641,10 +607,8 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2, unsigned long *p3) { unsigned long lines = bytes 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -694,7 +658,7 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2, : : memory ); - 
XMMS_RESTORE; + kernel_fpu_end(); } static void @@ -702,10 +666,8 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2, unsigned long *p3, unsigned long *p4) { unsigned long lines = bytes 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -762,7 +724,7 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2, : : memory
[tip:x86/fpu] lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
Commit-ID:  9c6ff8bbb69a4e7b47ac40bfa44509296e89c5c0
Gitweb:     http://git.kernel.org/tip/9c6ff8bbb69a4e7b47ac40bfa44509296e89c5c0
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Fri, 24 Aug 2012 14:13:01 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:09 -0700

lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models

Instead of using unlazy_fpu(), check user_has_fpu() and set/clear the
host TS bits so that lguest works fine with both the lazy/non-lazy FPU
host models with minimal changes.

Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-6-git-send-email-suresh.b.sid...@intel.com
Cc: Rusty Russell <ru...@rustcorp.com.au>
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 drivers/lguest/x86/core.c | 10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/lguest/x86/core.c b/drivers/lguest/x86/core.c
index 39809035..4af12e1 100644
--- a/drivers/lguest/x86/core.c
+++ b/drivers/lguest/x86/core.c
@@ -203,8 +203,8 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
 	 * we set it now, so we can trap and pass that trap to the Guest if it
 	 * uses the FPU.
 	 */
-	if (cpu->ts)
-		unlazy_fpu(current);
+	if (cpu->ts && user_has_fpu())
+		stts();
 
 	/*
 	 * SYSENTER is an optimized way of doing system calls. We can't allow
@@ -234,6 +234,10 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
 	if (boot_cpu_has(X86_FEATURE_SEP))
 		wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
 
+	/* Clear the host TS bit if it was set above. */
+	if (cpu->ts && user_has_fpu())
+		clts();
+
 	/*
 	 * If the Guest page faulted, then the cr2 register will tell us the
 	 * bad virtual address. We have to grab this now, because once we
@@ -249,7 +253,7 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
 	 * a different CPU. So all the critical stuff should be done
 	 * before this.
 	 */
-	else if (cpu->regs->trapnum == 7)
+	else if (cpu->regs->trapnum == 7 && !user_has_fpu())
 		math_state_restore();
 }
[tip:x86/fpu] x86, fpu: use non-lazy fpu restore for processors supporting xsave
Commit-ID:  304bceda6a18ae0b0240b8aac9a6bdf8ce2d2469
Gitweb:     http://git.kernel.org/tip/304bceda6a18ae0b0240b8aac9a6bdf8ce2d2469
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Fri, 24 Aug 2012 14:13:02 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:11 -0700

x86, fpu: use non-lazy fpu restore for processors supporting xsave

The fundamental model of the current Linux kernel is to lazily init and
restore FPU instead of restoring the task state during context switch.
This changes that fundamental lazy model to the non-lazy model for the
processors supporting the xsave feature.

Reasons driving this model change are:

i. Newer processors support optimized state save/restore using xsaveopt
   and xrstor by tracking the INIT state and MODIFIED state during
   context-switch. This is faster than modifying the cr0.TS bit, which
   has serializing semantics.

ii. Newer glibc versions use SSE for some of the optimized copy/clear
    routines. With certain workloads (like boot, kernel-compilation
    etc.), the application completes its work within the first 5 task
    switches, thus taking up to 5 #DNA traps with the kernel not getting
    a chance to apply the above mentioned pre-load heuristic.

iii. Some xstate features (like AMD's LWP feature) don't honor the
     cr0.TS bit and thus will not work correctly in the presence of
     lazy restore. Non-lazy state restore is needed for enabling such
     features.

Some data on a two socket SNB system:

* Saved 20K DNA exceptions during boot on a two socket SNB system.
* Saved 50K DNA exceptions during kernel-compilation workload.
* Improved throughput of the AVX based checksumming function inside the
  kernel by ~15%, as xsave/xrstor is faster than the serializing
  clts/stts pair.

Also, kernel_fpu_begin/end() now relies on the patched alternative
instructions. So move check_fpu(), which uses kernel_fpu_begin/end(),
after alternative_instructions().

Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-7-git-send-email-suresh.b.sid...@intel.com
Merge 32-bit boot fix from,
Link: http://lkml.kernel.org/r/1347300665-6209-4-git-send-email-suresh.b.sid...@intel.com
Cc: Jim Kukunas <james.t.kuku...@linux.intel.com>
Cc: NeilBrown <ne...@suse.de>
Cc: Avi Kivity <a...@redhat.com>
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 arch/x86/include/asm/fpu-internal.h | 96 +++
 arch/x86/include/asm/i387.h         |  1 +
 arch/x86/include/asm/xsave.h        |  1 +
 arch/x86/kernel/cpu/bugs.c          |  7 ++-
 arch/x86/kernel/i387.c              | 20 ++-
 arch/x86/kernel/process.c           | 12 +++--
 arch/x86/kernel/process_32.c        |  4 --
 arch/x86/kernel/process_64.c        |  4 --
 arch/x86/kernel/traps.c             |  5 ++-
 arch/x86/kernel/xsave.c             | 57 +
 10 files changed, 146 insertions(+), 61 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 52202a6..8ca0f9f 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -291,15 +291,48 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk)
 static inline void __thread_fpu_end(struct task_struct *tsk)
 {
 	__thread_clear_has_fpu(tsk);
-	stts();
+	if (!use_xsave())
+		stts();
 }
 
 static inline void __thread_fpu_begin(struct task_struct *tsk)
 {
-	clts();
+	if (!use_xsave())
+		clts();
 	__thread_set_has_fpu(tsk);
 }
 
+static inline void __drop_fpu(struct task_struct *tsk)
+{
+	if (__thread_has_fpu(tsk)) {
+		/* Ignore delayed exceptions from user space */
+		asm volatile("1: fwait\n"
+			     "2:\n"
+			     _ASM_EXTABLE(1b, 2b));
+		__thread_fpu_end(tsk);
+	}
+}
+
+static inline void drop_fpu(struct task_struct *tsk)
+{
+	/*
+	 * Forget coprocessor state..
+	 */
+	preempt_disable();
+	tsk->fpu_counter = 0;
+	__drop_fpu(tsk);
+	clear_used_math();
+	preempt_enable();
+}
+
+static inline void drop_init_fpu(struct task_struct *tsk)
+{
+	if (!use_xsave())
+		drop_fpu(tsk);
+	else
+		xrstor_state(init_xstate_buf, -1);
+}
+
 /*
  * FPU state switching for scheduling.
  *
@@ -333,7 +366,12 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
 {
 	fpu_switch_t fpu;
 
-	fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
+	/*
+	 * If the task has used the math, pre-load the FPU on xsave processors
+	 * or if the past 5 consecutive context-switches used math.
+	 */
+	fpu.preload = tsk_used_math(new) && (use_xsave() ||
+					     new->fpu_counter > 5);
 	if (__thread_has_fpu(old)) {
 		if (!__save_init_fpu(old
[tip:x86/fpu] x86, fpu: decouple non-lazy/eager fpu restore from xsave
Commit-ID:  5d2bd7009f306c82afddd1ca4d9763ad8473c216
Gitweb:     http://git.kernel.org/tip/5d2bd7009f306c82afddd1ca4d9763ad8473c216
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Thu, 6 Sep 2012 14:58:52 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:22 -0700

x86, fpu: decouple non-lazy/eager fpu restore from xsave

Decouple the non-lazy/eager fpu restore policy from the existence of
the xsave feature. Introduce a synthetic CPUID flag to represent the
eagerfpu policy. The "eagerfpu=on" boot parameter will enable the
policy.

Requested-by: H. Peter Anvin <h...@zytor.com>
Requested-by: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-2-git-send-email-suresh.b.sid...@intel.com
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 Documentation/kernel-parameters.txt |  4 ++
 arch/x86/include/asm/cpufeature.h   |  2 +
 arch/x86/include/asm/fpu-internal.h | 54 --
 arch/x86/kernel/cpu/common.c        |  2 -
 arch/x86/kernel/i387.c              | 25 +++---
 arch/x86/kernel/process.c           |  2 +-
 arch/x86/kernel/traps.c             |  2 +-
 arch/x86/kernel/xsave.c             | 87 +++
 8 files changed, 112 insertions(+), 66 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index ad7e2e5..741d064 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1833,6 +1833,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			and restore using xsave. The kernel will fallback to
 			enabling legacy floating-point and sse state.
 
+	eagerfpu=	[X86]
+			on	enable eager fpu restore
+			off	disable eager fpu restore
+
 	nohlt		[BUGS=ARM,SH] Tells the kernel that the sleep(SH) or
 			wfi(ARM) instruction doesn't work correctly and not to
 			use it. This is also useful when using JTAG debugger.

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 6b7ee5f..5dd2b47 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -97,6 +97,7 @@
 #define X86_FEATURE_EXTD_APICID	(3*32+26) /* has extended APICID (8 bits) */
 #define X86_FEATURE_AMD_DCM	(3*32+27) /* multi-node processor */
 #define X86_FEATURE_APERFMPERF	(3*32+28) /* APERFMPERF */
+#define X86_FEATURE_EAGER_FPU	(3*32+29) /* "eagerfpu" Non lazy FPU restore */
 
 /* Intel-defined CPU features, CPUID level 0x0001 (ecx), word 4 */
 #define X86_FEATURE_XMM3	(4*32+ 0) /* "pni" SSE-3 */
@@ -305,6 +306,7 @@ extern const char * const x86_power_flags[32];
 #define cpu_has_perfctr_core	boot_cpu_has(X86_FEATURE_PERFCTR_CORE)
 #define cpu_has_cx8		boot_cpu_has(X86_FEATURE_CX8)
 #define cpu_has_cx16		boot_cpu_has(X86_FEATURE_CX16)
+#define cpu_has_eager_fpu	boot_cpu_has(X86_FEATURE_EAGER_FPU)
 
 #if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
 # define cpu_has_invlpg		1

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 8ca0f9f..0ca72f0 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -38,6 +38,7 @@ int ia32_setup_frame(int sig, struct k_sigaction *ka,
 
 extern unsigned int mxcsr_feature_mask;
 extern void fpu_init(void);
+extern void eager_fpu_init(void);
 
 DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);
 
@@ -84,6 +85,11 @@ static inline int is_x32_frame(void)
 
 #define X87_FSW_ES	(1 << 7)	/* Exception Summary */
 
+static __always_inline __pure bool use_eager_fpu(void)
+{
+	return static_cpu_has(X86_FEATURE_EAGER_FPU);
+}
+
 static __always_inline __pure bool use_xsaveopt(void)
 {
 	return static_cpu_has(X86_FEATURE_XSAVEOPT);
@@ -99,6 +105,14 @@ static __always_inline __pure bool use_fxsr(void)
 	return static_cpu_has(X86_FEATURE_FXSR);
 }
 
+static inline void fx_finit(struct i387_fxsave_struct *fx)
+{
+	memset(fx, 0, xstate_size);
+	fx->cwd = 0x37f;
+	if (cpu_has_xmm)
+		fx->mxcsr = MXCSR_DEFAULT;
+}
+
 extern void __sanitize_i387_state(struct task_struct *);
 
 static inline void sanitize_i387_state(struct task_struct *tsk)
@@ -291,13 +305,13 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk)
 static inline void __thread_fpu_end(struct task_struct *tsk)
 {
 	__thread_clear_has_fpu(tsk);
-	if (!use_xsave())
+	if (!use_eager_fpu())
 		stts();
 }
 
 static inline void __thread_fpu_begin(struct task_struct *tsk)
 {
-	if (!use_xsave())
+	if (!use_eager_fpu())
 		clts();
 	__thread_set_has_fpu(tsk);
 }
@@ -327,10 +341,14 @@ static inline
[tip:x86/fpu] x86, fpu: make eagerfpu= boot param tri-state
Commit-ID:  e00229819f306b1f86134095347e9187dc346bd1
Gitweb:     http://git.kernel.org/tip/e00229819f306b1f86134095347e9187dc346bd1
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Mon, 10 Sep 2012 10:32:32 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:24 -0700

x86, fpu: make eagerfpu= boot param tri-state

Add "eagerfpu=auto" (that selects the default scheme in enabling
eagerfpu), which can override compiled-in boot parameters like
"eagerfpu=on/off" (that force enable/disable eagerfpu).

Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-5-git-send-email-suresh.b.sid...@intel.com
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 Documentation/kernel-parameters.txt |  4 +++-
 arch/x86/kernel/xsave.c             | 17 -
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e8f7faa..46a6a82 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1834,8 +1834,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			enabling legacy floating-point and sse state.
 
 	eagerfpu=	[X86]
-			on	enable eager fpu restore (default for xsaveopt)
+			on	enable eager fpu restore
 			off	disable eager fpu restore
+			auto	selects the default scheme, which automatically
+				enables eagerfpu restore for xsaveopt.
 
 	nohlt		[BUGS=ARM,SH] Tells the kernel that the sleep(SH) or
 			wfi(ARM) instruction doesn't work correctly and not to

diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index e99f754..4e89b3d 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -508,13 +508,15 @@ static void __init setup_init_fpu_buf(void)
 	xsave_state(init_xstate_buf, -1);
 }
 
-static int disable_eagerfpu;
+static enum { AUTO, ENABLE, DISABLE } eagerfpu = AUTO;
 static int __init eager_fpu_setup(char *s)
 {
 	if (!strcmp(s, "on"))
-		setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+		eagerfpu = ENABLE;
 	else if (!strcmp(s, "off"))
-		disable_eagerfpu = 1;
+		eagerfpu = DISABLE;
+	else if (!strcmp(s, "auto"))
+		eagerfpu = AUTO;
 	return 1;
 }
 __setup("eagerfpu=", eager_fpu_setup);
@@ -557,8 +559,9 @@ static void __init xstate_enable_boot_cpu(void)
 	prepare_fx_sw_frame();
 	setup_init_fpu_buf();
 
-	if (cpu_has_xsaveopt && !disable_eagerfpu)
-		setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+	/* Auto enable eagerfpu for xsaveopt */
+	if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+		eagerfpu = ENABLE;
 
 	pr_info("enabled xstate_bv 0x%llx, cntxt size 0x%x\n",
 		pcntxt_mask, xstate_size);
@@ -598,6 +601,10 @@ void __cpuinit eager_fpu_init(void)
 	clear_used_math();
 	current_thread_info()->status = 0;
+
+	if (eagerfpu == ENABLE)
+		setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+
 	if (!cpu_has_eager_fpu) {
 		stts();
 		return;
[tip:x86/fpu] x86, fpu: remove cpu_has_xmm check in the fx_finit()
Commit-ID:  a8615af4bc3621cb01096541dafa6f68352ec2d9
Gitweb:     http://git.kernel.org/tip/a8615af4bc3621cb01096541dafa6f68352ec2d9
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Mon, 10 Sep 2012 10:40:08 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:24 -0700

x86, fpu: remove cpu_has_xmm check in the fx_finit()

No CPUs with FXSAVE but no XMM/MXCSR (Pentium II from Intel,
Crusoe/TM-3xxx/5xxx from Transmeta, and presumably some of the K6
generation from AMD) ever looked at the mxcsr field during
fxrstor/fxsave. So remove the cpu_has_xmm check in fx_finit().

Reported-by: Al Viro <v...@zeniv.linux.org.uk>
Acked-by: H. Peter Anvin <h...@zytor.com>
Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-6-git-send-email-suresh.b.sid...@intel.com
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 arch/x86/include/asm/fpu-internal.h | 3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 0ca72f0..92f3c6e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -109,8 +109,7 @@ static inline void fx_finit(struct i387_fxsave_struct *fx)
 {
 	memset(fx, 0, xstate_size);
 	fx->cwd = 0x37f;
-	if (cpu_has_xmm)
-		fx->mxcsr = MXCSR_DEFAULT;
+	fx->mxcsr = MXCSR_DEFAULT;
 }
 
 extern void __sanitize_i387_state(struct task_struct *);
[patch] crypto, tcrypt: remove local_bh_disable/enable() around local_irq_disable/enable()
Ran into this while looking at some new crypto code using the FPU
hitting a WARN_ON_ONCE(!irq_fpu_usable()) in kernel_fpu_begin() on an
x86 kernel that uses the new eagerfpu model.

In short, the current eagerfpu changes return 0 from
interrupted_kernel_fpu_idle(), and in_interrupt() thinks it is in
interrupt context because of the local_bh_disable(). Thus resulting in
the WARN_ON().

Remove the local_bh_disable/enable() calls around the existing
local_irq_disable/enable() calls. local_irq_disable/enable() already
disables BH.

[ If there are any other legitimate users calling kernel_fpu_begin()
  from process context but with BH disabled, then we can look into
  fixing irq_fpu_usable() in the future. ]

Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Cc: Tim Chen <tim.c.c...@linux.intel.com>
---
 crypto/tcrypt.c | 6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 5cf2ccb..de8c5d3 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -97,7 +97,6 @@ static int test_cipher_cycles(struct blkcipher_desc *desc, int enc,
 	int ret = 0;
 	int i;
 
-	local_bh_disable();
 	local_irq_disable();
 
 	/* Warm-up run. */
@@ -130,7 +129,6 @@ static int test_cipher_cycles(struct blkcipher_desc *desc, int enc,
 
 out:
 	local_irq_enable();
-	local_bh_enable();
 
 	if (ret == 0)
 		printk("1 operation in %lu cycles (%d bytes)\n",
@@ -300,7 +298,6 @@ static int test_hash_cycles_digest(struct hash_desc *desc,
 	int i;
 	int ret;
 
-	local_bh_disable();
 	local_irq_disable();
 
 	/* Warm-up run. */
@@ -327,7 +324,6 @@ static int test_hash_cycles_digest(struct hash_desc *desc,
 
 out:
 	local_irq_enable();
-	local_bh_enable();
 
 	if (ret)
 		return ret;
@@ -348,7 +344,6 @@ static int test_hash_cycles(struct hash_desc *desc, struct scatterlist *sg,
 	if (plen == blen)
 		return test_hash_cycles_digest(desc, sg, blen, out);
 
-	local_bh_disable();
 	local_irq_disable();
 
 	/* Warm-up run. */
@@ -391,7 +386,6 @@ static int test_hash_cycles(struct hash_desc *desc, struct scatterlist *sg,
 
 out:
 	local_irq_enable();
-	local_bh_enable();
 
 	if (ret)
 		return ret;