Re: [PATCH 4/4] x86, fpu: irq_fpu_usable: kill all checks except !in_kernel_fpu
On Fri, Aug 29, 2014 at 11:17 AM, Oleg Nesterov wrote: > ONCE AGAIN, THIS IS MORE A QUESTION THAN A PATCH. this patch I think needs more thought for sure. please see below. > > interrupted_kernel_fpu_idle() does: > > if (use_eager_fpu()) > return true; > > return !__thread_has_fpu(current) && > (read_cr0() & X86_CR0_TS); > > and it is absolutely not clear why these 2 cases differ so much. > > To remind, the use_eager_fpu() case is buggy; __save_init_fpu() in > __kernel_fpu_begin() can race with math_state_restore() which does > __thread_fpu_begin() + restore_fpu_checking(). So we should fix this > race anyway, and we can't require __thread_has_fpu() == F like the > !use_eager_fpu() case does; in that case kernel_fpu_begin() will not > work if it interrupts the idle thread (this would reintroduce the > performance regression fixed by 5187b28f). > > Probably math_state_restore() can use kernel_fpu_disable/end() which > sets/clears in_kernel_fpu, or it can disable irqs. Doesn't matter, we > should fix this bug anyway. > > And if we fix this bug, why does the !use_eager_fpu() case still need the much > stricter check? Why can't we handle the __thread_has_fpu(current) > case the same way? > > The comment deleted by this change says: > > and TS must be set so that the clts/stts pair does nothing > > and can explain the difference, but I cannot understand this (again, > assuming that we fix the race(s) mentioned above). > > Say, user_fpu_begin(). Yes, kernel_fpu_begin/end() can restore X86_CR0_TS. > But this should be fine? No. The reason is that the has_fpu state and cr0.TS can get out of sync. Let's say you get an interrupt after the clts() in __thread_fpu_begin(), called as part of user_fpu_begin(). Because of this proposed change, irq_fpu_usable() returns true, so the interrupt can end up using the FPU, and after the return from the interrupt we can have a state where cr0.TS is set but we resume execution at __thread_set_has_fpu().
Now after this point has_fpu is set, but cr0.TS is also set. Any schedule() in this state (say, immediately after preempt_enable() at the end of user_fpu_begin()) is dangerous: we can get a DNA (device-not-available) fault in the middle of __switch_to(), which can lead to subtle bugs. > A context switch before restore_user_xstate() > can equally set it back? > And device_not_available() should be fine even > in kernel context? not in some critical places like __switch_to(). other than this patch, the rest of the changes look ok to me. Can you please resend this patchset with the math_state_restore() race addressed as well? thanks, suresh > > I'll appreciate any comment. > --- > arch/x86/kernel/i387.c | 44 +--- > 1 files changed, 1 insertions(+), 43 deletions(-) > > diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c > index 9fb2899..ef60f33 100644 > --- a/arch/x86/kernel/i387.c > +++ b/arch/x86/kernel/i387.c > @@ -22,54 +22,12 @@ > static DEFINE_PER_CPU(bool, in_kernel_fpu); > > /* > - * Were we in an interrupt that interrupted kernel mode? > - * > - * On others, we can do a kernel_fpu_begin/end() pair *ONLY* if that > - * pair does nothing at all: the thread must not have fpu (so > - * that we don't try to save the FPU state), and TS must > - * be set (so that the clts/stts pair does nothing that is > - * visible in the interrupted kernel thread). > - * > - * Except for the eagerfpu case when we return 1. > - */ > -static inline bool interrupted_kernel_fpu_idle(void) > -{ > - if (this_cpu_read(in_kernel_fpu)) > - return false; > - > - if (use_eager_fpu()) > - return true; > - > - return !__thread_has_fpu(current) && > - (read_cr0() & X86_CR0_TS); > -} > - > -/* > - * Were we in user mode (or vm86 mode) when we were > - * interrupted? > - * > - * Doing kernel_fpu_begin/end() is ok if we are running > - * in an interrupt context from user mode - we'll just > - * save the FPU state as required.
> - */ > -static inline bool interrupted_user_mode(void) > -{ > - struct pt_regs *regs = get_irq_regs(); > - return regs && user_mode_vm(regs); > -} > - > -/* > * Can we use the FPU in kernel mode with the > * whole "kernel_fpu_begin/end()" sequence? > - * > - * It's always ok in process context (ie "not interrupt") > - * but it is sometimes ok even from an irq. > */ > bool irq_fpu_usable(void) > { > - return !in_interrupt() || > - interrupted_user_mode() || > - interrupted_kernel_fpu_idle(); > + return !this_cpu_read(in_kernel_fpu); > } > EXPORT_SYMBOL(irq_fpu_usable); > > -- > 1.5.5.1 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/4] x86, fpu: introduce per-cpu "bool in_kernel_fpu"
On Fri, Aug 29, 2014 at 11:16 AM, Oleg Nesterov wrote: > interrupted_kernel_fpu_idle() tries to detect if kernel_fpu_begin() > is safe or not. In particular it should obviously deny the nested > kernel_fpu_begin(), and this logic doesn't look clean. > > If use_eager_fpu() == T we rely on a) the __thread_has_fpu() check in > interrupted_kernel_fpu_idle(), and b) on the fact that _begin() does > __thread_clear_has_fpu(). > > Otherwise we demand that the interrupted task has no FPU if it is in > kernel mode; this works because __kernel_fpu_begin() does clts(). > > Add the per-cpu "bool in_kernel_fpu" variable, and change this code > to check/set/clear it. This allows some cleanups (see the next > changes) and fixes. > > Note that the current code looks racy. Say, kernel_fpu_begin() right > after math_state_restore()->__thread_fpu_begin() will overwrite the > regs we are going to restore. This patch doesn't even try to fix this, yes indeed, the explicit calls to math_state_restore() in the eager_fpu case have this race. I guess this has been present since commit 5187b28f. thanks, suresh > it just adds the comment, but "in_kernel_fpu" can also be used to > implement kernel_fpu_disable() / kernel_fpu_enable(). 
> > Signed-off-by: Oleg Nesterov > --- > arch/x86/include/asm/i387.h |2 +- > arch/x86/kernel/i387.c | 10 ++ > 2 files changed, 11 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h > index ed8089d..5e275d3 100644 > --- a/arch/x86/include/asm/i387.h > +++ b/arch/x86/include/asm/i387.h > @@ -40,8 +40,8 @@ extern void __kernel_fpu_end(void); > > static inline void kernel_fpu_begin(void) > { > - WARN_ON_ONCE(!irq_fpu_usable()); > preempt_disable(); > + WARN_ON_ONCE(!irq_fpu_usable()); > __kernel_fpu_begin(); > } > > diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c > index d5dd808..8fb8868 100644 > --- a/arch/x86/kernel/i387.c > +++ b/arch/x86/kernel/i387.c > @@ -19,6 +19,8 @@ > #include > #include > > +static DEFINE_PER_CPU(bool, in_kernel_fpu); > + > /* > * Were we in an interrupt that interrupted kernel mode? > * > @@ -33,6 +35,9 @@ > */ > static inline bool interrupted_kernel_fpu_idle(void) > { > + if (this_cpu_read(in_kernel_fpu)) > + return false; > + > if (use_eager_fpu()) > return __thread_has_fpu(current); > > @@ -73,6 +78,9 @@ void __kernel_fpu_begin(void) > { > struct task_struct *me = current; > > + this_cpu_write(in_kernel_fpu, true); > + > + /* FIXME: race with math_state_restore()-like code */ > if (__thread_has_fpu(me)) { > __thread_clear_has_fpu(me); > __save_init_fpu(me); > @@ -99,6 +107,8 @@ void __kernel_fpu_end(void) > } else { > stts(); > } > + > + this_cpu_write(in_kernel_fpu, false); > } > EXPORT_SYMBOL(__kernel_fpu_end); > > -- > 1.5.5.1
Re: [PATCH 0/4] x86, fpu: copy_process's FPU paths cleanups
On Wed, Aug 27, 2014 at 11:51 AM, Oleg Nesterov wrote: > Hello, > > Who can review this? And where should I send FPU changes? > > And it seems that nobody cares about 2 fixes I sent before. > Linus, I understand that you won't take them into v3.17, but > perhaps you can ack/nack them explicitly? It seems that nobody > can do this. > > Oleg. > > arch/x86/include/asm/fpu-internal.h |2 +- > arch/x86/kernel/process.c | 16 +--- > arch/x86/kernel/process_32.c|2 -- > arch/x86/kernel/process_64.c|1 - > 4 files changed, 10 insertions(+), 11 deletions(-) These 4 patches also look good to me. Reviewed-by: Suresh Siddha
Re: [PATCH] x86, fpu: __restore_xstate_sig()->math_state_restore() needs preempt_disable()
On Mon, Aug 25, 2014 at 11:08 AM, Oleg Nesterov wrote: > > Add preempt_disable() + preempt_enable() around math_state_restore() in > __restore_xstate_sig(). Otherwise __switch_to() after __thread_fpu_begin() > can overwrite fpu->state we are going to restore. > > Signed-off-by: Oleg Nesterov > Cc: sta...@vger.kernel.org > --- > arch/x86/kernel/xsave.c |5 - > 1 files changed, 4 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c > index 453343c..c52eb9c 100644 > --- a/arch/x86/kernel/xsave.c > +++ b/arch/x86/kernel/xsave.c > @@ -397,8 +397,11 @@ int __restore_xstate_sig(void __user *buf, void __user > *buf_fx, int size) > set_used_math(); > } > > - if (use_eager_fpu()) > + if (use_eager_fpu()) { > + preempt_disable(); > math_state_restore(); > + preempt_enable(); > + } > > return err; > } else { > oops. looks good to me. Reviewed-by: Suresh Siddha
[tip:x86/urgent] x86, fpu: Check tsk_used_math() in kernel_fpu_end() for eager FPU
Commit-ID: 731bd6a93a6e9172094a2322bd0ee964bb1f4d63 Gitweb: http://git.kernel.org/tip/731bd6a93a6e9172094a2322bd0ee964bb1f4d63 Author: Suresh Siddha AuthorDate: Sun, 2 Feb 2014 22:56:23 -0800 Committer: H. Peter Anvin CommitDate: Tue, 11 Mar 2014 12:32:52 -0700 x86, fpu: Check tsk_used_math() in kernel_fpu_end() for eager FPU For non-eager fpu mode, the thread's fpu state is allocated during the first fpu usage (in the context of the device-not-available exception). This (math_state_restore()) can be a blocking call, and hence we enable interrupts (which were originally disabled when the exception happened), allocate memory, disable interrupts again, etc. But the eager-fpu mode calls the same math_state_restore() from kernel_fpu_end(). The assumption is that tsk_used_math() is always set in eager-fpu mode, which avoids the code path of enabling interrupts, allocating fpu state with a blocking call, disabling interrupts, etc. But the below issue was noticed by Maarten Baert, Nate Eldredge and a few others: If a user process dumps core on an ecryptfs filesystem while aesni-intel is loaded, we get a BUG() in __find_get_block() complaining that it was called with interrupts disabled; then all further accesses to our ecryptfs filesystem hang and we have to reboot. The aesni-intel code (encrypting the core file that we are writing) needs the FPU and quite properly wraps its code in kernel_fpu_{begin,end}(), the latter of which calls math_state_restore(). So after kernel_fpu_end(), interrupts may be disabled, which nobody seems to expect, and they stay that way until we eventually get to __find_get_block() which barfs. For eager fpu, most of the time tsk_used_math() is true. At a few instances during thread exit, signal return handling, etc., tsk_used_math() might be false. In kernel_fpu_end(), for eager-fpu, call math_state_restore() only if tsk_used_math() is set. Otherwise, don't bother. The kernel code path which cleared tsk_used_math() knows what needs to be done with the fpu state. 
Reported-by: Maarten Baert Reported-by: Nate Eldredge Suggested-by: Linus Torvalds Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1391410583.3801.6.camel@europa Cc: George Spelvin Signed-off-by: H. Peter Anvin --- arch/x86/kernel/i387.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index e8368c6..d5dd808 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -86,10 +86,19 @@ EXPORT_SYMBOL(__kernel_fpu_begin); void __kernel_fpu_end(void) { - if (use_eager_fpu()) - math_state_restore(); - else + if (use_eager_fpu()) { + /* +* For eager fpu, most the time, tsk_used_math() is true. +* Restore the user math as we are done with the kernel usage. +* At few instances during thread exit, signal handling etc, +* tsk_used_math() is false. Those few places will take proper +* actions, so we don't need to restore the math here. +*/ + if (likely(tsk_used_math(current))) + math_state_restore(); + } else { stts(); + } } EXPORT_SYMBOL(__kernel_fpu_end);
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Fri, Mar 7, 2014 at 3:18 PM, H. Peter Anvin wrote: > > Hi Suresh, > > Any thoughts on this? hi Peter, Can you please pick up the second short patch (https://lkml.org/lkml/2014/2/3/21), which actually fixes the reported problem at hand? It has been tested and acked by all the problem reporters. I will respond shortly about the first patch (which is more of a cleanup). thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Mon, 2014-02-03 at 10:20 -0800, Linus Torvalds wrote: > Thinking about it some more, this patch is *almost* not needed at all. > > I'm wondering if you should just change the first patch to just always > initialize the fpu when it is allocated, and at execve() time (ie in > flush_thread()). > We already do this for the eager-fpu case, in eager_fpu_init() during boot and in drop_init_fpu() during flush_thread(). > If we do that, then this: > > + if (!tsk_used_math(tsk)) > + init_fpu(tsk); > > can be dropped entirely from math_state_restore(). yeah, probably for eager-fpu, but: > And quite frankly, > at that point, I think all the changes to __kernel_fpu_end() can go > away, because at that point math_state_restore() really does the right > thing - all the allocations are gone, and all the async task state > games are gone, only the "restore state" remains. > > Hmm? So the only thing needed would be to add that "init_fpu()" to the > initial bootmem allocation path and to change flush_thread() (it > currently does "drop_init_fpu()", let's just make it initialize the > FPU state using fpu_finit()), and then we could remove the whole > "used_math" bit entirely, and just say that the FPU is always > initialized. > > What do you guys think? No. As I mentioned in the changelog, there is one more path which does drop_fpu(), and we still depend on this used_math bit for eager-fpu: in the signal restore path for a 32-bit app, where we copy the sig-context state from the user stack to the kernel manually (because of legacy reasons, where the fsave state is followed by the fxsave state etc. in the 32-bit signal handler context, and we have to go through convert_to_fxsr() etc.). From __restore_xstate_sig(): /* * Drop the current fpu which clears used_math(). This ensures * that any context-switch during the copy of the new state, * avoids the intermediate state from getting restored/saved. * Thus avoiding the new restored state from getting corrupted. 
* We will be ready to restore/save the state only after * set_used_math() is again set. */ drop_fpu(tsk); thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sun, 2014-02-02 at 11:15 -0800, Linus Torvalds wrote: > On Sat, Feb 1, 2014 at 11:19 PM, Suresh Siddha wrote: > > > > The real fix for Nate's problem will be coming from Linus, with a > > slightly modified option-b that Linus proposed. Linus, please let me > > know if you want me to spin it. I can do it sunday night. > > Please do it, since clearly I wasn't aware enough about the whole > non-TS-checking FPU state details. > > Also, since this issue doesn't seem to be a recent regression, I'm not > going to take this patch directly (even though I'm planning on doing > -rc1 in a few hours), and expect that I'll get it through the normal > channels (presumably together with the __kernel_fpu_end cleanups). Ok > with everybody? Here is the second patch, which should fix the issue reported in this thread. Maarten, Nate, George, please give this patch a try as is and see if it helps address the issue you ran into. And please ack/review with your test results. The other patch, which cleans up the irq_enable/disable logic in math_state_restore(), was sent yesterday. You can run your experiments with both these patches if you want, but your issue should get fixed with just the appended patch here. Peter, please push both these patches through the normal channels depending on the results. thanks, suresh --- From: Suresh Siddha Subject: x86, fpu: check tsk_used_math() in kernel_fpu_end() for eager fpu For non-eager fpu mode, the thread's fpu state is allocated during the first fpu usage (in the context of the device-not-available exception). This (math_state_restore()) can be a blocking call and hence we enable interrupts (which were originally disabled when the exception happened), allocate memory, disable interrupts etc. But the eager-fpu mode calls the same math_state_restore() from kernel_fpu_end().
The assumption being that tsk_used_math() is always set for the eager-fpu mode and thus avoids the code path of enabling interrupts, allocating fpu state using a blocking call and disabling interrupts etc. But the below issue was noticed by Maarten Baert, Nate Eldredge and a few others: If a user process dumps core on an ecryptfs while aesni-intel is loaded, we get a BUG() in __find_get_block() complaining that it was called with interrupts disabled; then all further accesses to our ecryptfs hang and we have to reboot. The aesni-intel code (encrypting the core file that we are writing) needs the FPU and quite properly wraps its code in kernel_fpu_{begin,end}(), the latter of which calls math_state_restore(). So after kernel_fpu_end(), interrupts may be disabled, which nobody seems to expect, and they stay that way until we eventually get to __find_get_block() which barfs. For eager fpu, most of the time, tsk_used_math() is true. In a few instances during thread exit, signal return handling etc, tsk_used_math() might be false. In kernel_fpu_end(), for eager-fpu, call math_state_restore() only if tsk_used_math() is set. Otherwise, don't bother. The kernel code path which cleared tsk_used_math() knows what needs to be done with the fpu state. Reported-by: Maarten Baert Reported-by: Nate Eldredge Suggested-by: Linus Torvalds Signed-off-by: Suresh Siddha Cc: George Spelvin --- arch/x86/kernel/i387.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 4e5f770..670bba1 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -87,10 +87,19 @@ EXPORT_SYMBOL(__kernel_fpu_begin); void __kernel_fpu_end(void) { - if (use_eager_fpu()) - math_state_restore(); - else + if (use_eager_fpu()) { + /* +* For eager fpu, most of the time, tsk_used_math() is true. +* Restore the user math as we are done with the kernel usage. +* In a few instances during thread exit, signal handling etc, +* tsk_used_math() is false.
Those few places will take proper +* actions, so we don't need to restore the math here. +*/ + if (likely(tsk_used_math(current))) + math_state_restore(); + } else { stts(); + } } EXPORT_SYMBOL(__kernel_fpu_end);
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, 2014-02-01 at 17:06 -0800, Suresh Siddha wrote: > Meanwhile I have the patch removing the delayed dynamic allocation for > non-eager fpu. will post it after some testing. Appended the patch for this. Tested for the last 4-5 hours on my laptop. The real fix for Nate's problem will be coming from Linus, with a slightly modified option-b that Linus proposed. Linus, please let me know if you want me to spin it. I can do it sunday night. thanks, suresh --- From: Suresh Siddha Subject: x86, fpu: remove the logic of non-eager fpu mem allocation at the first usage For non-eager fpu mode, the thread's fpu state is allocated during the first fpu usage (in the context of the device-not-available exception). This can be a blocking call and hence we enable interrupts (which were originally disabled when the exception happened), allocate memory, disable interrupts etc. While this saves 512 bytes or so per-thread, there are some issues in general. a. Most of the future cases will anyway be using eager FPU (because of processor features like xsaveopt, LWP, MPX etc) and they do the allocation at thread creation itself. Nice to have one common mechanism, as all the state save/restore code is shared. Avoids the confusion and minimizes the subtle bugs in the core piece involved with context-switch. b. If a parent thread uses the FPU, during fork() we allocate the FPU state in the child and copy the state etc. Shortly after this, during exec() we free it up, so that we can later allocate during the first usage of the FPU. So this free/allocate might be slower for some workloads. c. math_state_restore() is called from multiple places and it is error prone if the caller expects interrupts to be disabled throughout the execution of math_state_restore(). Can lead to subtle bugs like Ubuntu bug #1265841. Memory savings will be small anyways and the code complexity introducing subtle bugs is not worth it. So remove the logic of non-eager fpu mem allocation at the first usage.
Signed-off-by: Suresh Siddha --- arch/x86/kernel/i387.c | 14 +- arch/x86/kernel/process.c | 6 -- arch/x86/kernel/traps.c | 16 ++-- arch/x86/kernel/xsave.c | 2 -- 4 files changed, 7 insertions(+), 31 deletions(-) diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index e8368c6..4e5f770 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -5,6 +5,7 @@ * General FPU state handling cleanups * Gareth Hughes <gar...@valinux.com>, May 2000 */ +#include <linux/bootmem.h> #include <linux/module.h> #include <linux/regset.h> #include <linux/sched.h> @@ -186,6 +187,10 @@ void fpu_init(void) if (xstate_size == 0) init_thread_xstate(); + if (!current->thread.fpu.state) + current->thread.fpu.state = + alloc_bootmem_align(xstate_size, __alignof__(struct xsave_struct)); + mxcsr_feature_mask_init(); xsave_init(); eager_fpu_init(); @@ -219,8 +224,6 @@ EXPORT_SYMBOL_GPL(fpu_finit); */ int init_fpu(struct task_struct *tsk) { - int ret; - if (tsk_used_math(tsk)) { if (cpu_has_fpu && tsk == current) unlazy_fpu(tsk); @@ -228,13 +231,6 @@ int init_fpu(struct task_struct *tsk) return 0; } - /* -* Memory allocation at the first usage of the FPU and other state. -*/ - ret = fpu_alloc(&tsk->thread.fpu); - if (ret) - return ret; - fpu_finit(&tsk->thread.fpu); set_stopped_child_used_math(tsk); diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 3fb8d95..cd9c190 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -128,12 +128,6 @@ void flush_thread(void) flush_ptrace_hw_breakpoint(tsk); memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array)); drop_init_fpu(tsk); - /* -* Free the FPU state for non xsave platforms. They get reallocated -* lazily at the first use.
-*/ - if (!use_eager_fpu()) - free_thread_xstate(tsk); } static void hard_disable_TSC(void) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 57409f6..3265429 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -623,20 +623,8 @@ void math_state_restore(void) { struct task_struct *tsk = current; - if (!tsk_used_math(tsk)) { - local_irq_enable(); - /* -* does a slab alloc which can sleep -*/ - if (init_fpu(tsk)) { - /* -* ran out of memory! -*/ - do_group_exit(SIGKILL); - return; - } - local_irq_disable(); - } + if (!tsk_used_math(tsk)) + init_fpu(tsk); __thread_fpu_begin(tsk); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, Feb 1, 2014 at 5:51 PM, Linus Torvalds wrote: > On Sat, Feb 1, 2014 at 5:47 PM, Suresh Siddha wrote: >> >> So if the restore failed, we should do something like drop_init_fpu(), >> which will restore init-state to the registers. >> >> for eager-fpu() paths we don't use clts() stts() etc. > > Uhhuh. Ok. > > Why do we do that, btw? I think it would make much more sense to just > do what I *thought* we did, and just make it a context-switch-time > optimization ("let's always switch FP state"), not make it a huge > semantic difference. clts/stts is more costly, and not all the state under xsave adheres to the cr0.TS/DNA rules. Did I answer your question? thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, Feb 1, 2014 at 5:38 PM, Linus Torvalds wrote: > It definitely does not want an else, I think. > > If tsk_used_math() is false, or if the FPU restore failed, we > *definitely* need that stts(). Otherwise we'd return to user mode with > random contents in the FP state, and let user mode muck around with > it. > > No? So if the restore failed, we should do something like drop_init_fpu(), which will restore the init-state to the registers. For eager-fpu paths we don't use clts()/stts() etc. thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, Feb 1, 2014 at 5:26 PM, H. Peter Anvin wrote: > Even "b" does that, no? oh right. It needs an else; only for the non-eager fpu case should we do stts():

void __kernel_fpu_end(void)
{
	if (use_eager_fpu()) {
		struct task_struct *me = current;

		if (tsk_used_math(me) && likely(!restore_fpu_checking(me)))
			return;
	} else
		stts();
}

thanks, suresh

> "a" should be fine as long as we don't ever use
> those features in the kernel, even under kernel_fpu_begin/end().
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
On Sat, Feb 1, 2014 at 11:27 AM, Linus Torvalds wrote: > That said, regardless of the allocation issue, I do think that it's > stupid for kernel_fpu_{begin,end} to save the math state if > "used_math" was not set. So I do think __kernel_fpu_end() as-is is > buggy and stupid. For the eager_fpu case, the assumption was that every task should always have 'used_math' set. But I think there is a race, where we drop the fpu explicitly by doing drop_fpu() and meanwhile we get an interrupt etc that ends up using the fpu. So I will ack option "b", as option "a" breaks the features which don't take cr0.TS into account. Meanwhile I have the patch removing the delayed dynamic allocation for non-eager fpu; will post it after some testing. thanks, suresh
Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
hi, On Thu, Jan 30, 2014 at 2:24 PM, Linus Torvalds wrote: > I'm adding in some people here, because I think in the end this bug > was introduced by commit 304bceda6a18 ("x86, fpu: use non-lazy fpu > restore for processors supporting xsave") that introduced that > math_state_restore() in kernel_fpu_end(), but we have other commits > (like 5187b28ff08: "x86: Allow FPU to be used at interrupt time even > with eagerfpu") that seem tangential too and might be part of why it > actually *triggers* now. > > Comments? I haven't been following the recent changes closely, so before I get a chance to review the current bug and the relevant commits, wanted to add that: a. delayed dynamic allocation of the FPU state area was not a good idea (from me). Given that most of the future cases will anyway be using eager FPU (because of processor features like xsaveopt etc, and applications implicitly using the FPU because of optimizations in commonly used libraries etc), we should probably go back to allocation of the FPU state area during thread creation for everyone (including non-eager cases). Memory savings will be small anyways, and the code complexity introducing subtle bugs like this is not worth it. b. with the above change, kernel_fpu_begin() will just save any live user math state and be ready for kernel math operations. And kernel_fpu_end() will drop the kernel math state and, for the eager-fpu case, restore the user math state. We will avoid worrying about any memory allocations in math_state_restore() with interrupts disabled etc. If there are no objections, I will see if I can come up with a quick patch, or will ask HPA to help fill me in. thanks, suresh
[tip:x86/cleanups] x86, apic: Cleanup cfg->domain setup for legacy interrupts
Commit-ID: 29c574c0aba8dc0736e19eb9b24aad28cc5c9098 Gitweb: http://git.kernel.org/tip/29c574c0aba8dc0736e19eb9b24aad28cc5c9098 Author: Suresh Siddha AuthorDate: Mon, 26 Nov 2012 14:49:36 -0800 Committer: H. Peter Anvin CommitDate: Mon, 26 Nov 2012 15:43:25 -0800 x86, apic: Cleanup cfg->domain setup for legacy interrupts Issues that need to be handled: * Handle PIC interrupts on any CPU irrespective of the apic mode * In the apic lowest priority logical flat delivery mode, be prepared to handle the interrupt on any CPU irrespective of what the IO-APIC RTE says. * Because of above, when the IO-APIC starts handling the legacy PIC interrupt, use the same vector that is being used by the PIC while programming the corresponding IO-APIC RTE. Start with all the cpu's in the legacy PIC interrupts cfg->domain. By the time IO-APIC starts taking over the PIC interrupts, apic driver model is finalized. So depend on the assign_irq_vector() to update the cfg->domain and retain the same vector that was used by PIC before. For the logical apic flat mode, cfg->domain is updated (during the first call to assign_irq_vector()) to contain all the possible online cpu's (0xff). Vector used for the legacy PIC interrupt doesn't change when the IO-APIC starts handling the interrupt. Any interrupt migration after that doesn't change the cfg->domain or the vector used. For other apic modes like physical mode, cfg->domain is updated (during the first call to assign_irq_vector()) to the boot cpu (cpu-0), with the same vector that is being used by the PIC. When that interrupt is migrated to a different cpu, cfg->domain and the vector assigned will change accordingly. Tested-by: Borislav Petkov Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1353970176.21070.51.ca...@sbsiddha-desk.sc.intel.com Signed-off-by: H.
Peter Anvin --- arch/x86/kernel/apic/io_apic.c | 26 ++++++-------------------- 1 file changed, 6 insertions(+), 20 deletions(-) diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index c265593..0c1f366 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -234,11 +234,11 @@ int __init arch_early_irq_init(void) zalloc_cpumask_var_node(&cfg[i].old_domain, GFP_KERNEL, node); /* * For legacy IRQ's, start with assigning irq0 to irq15 to -* IRQ0_VECTOR to IRQ15_VECTOR on cpu 0. +* IRQ0_VECTOR to IRQ15_VECTOR for all cpu's. */ if (i < legacy_pic->nr_legacy_irqs) { cfg[i].vector = IRQ0_VECTOR + i; - cpumask_set_cpu(0, cfg[i].domain); + cpumask_setall(cfg[i].domain); } } @@ -1141,7 +1141,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask) * allocation for the members that are not used anymore. */ cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask); - cfg->move_in_progress = 1; + cfg->move_in_progress = + cpumask_intersects(cfg->old_domain, cpu_online_mask); cpumask_and(cfg->domain, cfg->domain, tmp_mask); break; } @@ -1172,8 +1173,9 @@ next: current_vector = vector; current_offset = offset; if (cfg->vector) { - cfg->move_in_progress = 1; cpumask_copy(cfg->old_domain, cfg->domain); + cfg->move_in_progress = + cpumask_intersects(cfg->old_domain, cpu_online_mask); } for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask) per_cpu(vector_irq, new_cpu)[vector] = irq; @@ -1241,12 +1243,6 @@ void __setup_vector_irq(int cpu) cfg = irq_get_chip_data(irq); if (!cfg) continue; - /* -* If it is a legacy IRQ handled by the legacy PIC, this cpu -* will be part of the irq_cfg's domain. -*/ - if (irq < legacy_pic->nr_legacy_irqs && !IO_APIC_IRQ(irq)) - cpumask_set_cpu(cpu, cfg->domain); if (!cpumask_test_cpu(cpu, cfg->domain)) continue; @@ -1356,16 +1352,6 @@ static void setup_ioapic_irq(unsigned int irq, struct irq_cfg *cfg, if (!IO_APIC_IRQ(irq)) return; - /* -* For legacy irqs, cfg->domain starts with cpu 0.
Now that IO-APIC -* can handle this irq and the apic driver is finialized at this point, -* update the cfg->domain. -*/ - if (irq < legacy_pic->nr_legacy_irqs &&
[patch] x86, apic: cleanup cfg->domain setup for legacy interrupts
Had this cleanup patch (tested before by me and Boris as well) for a while.
Forgot to post this earlier. Thanks.

---8<---
From: Suresh Siddha
Subject: x86, apic: cleanup cfg->domain setup for legacy interrupts

Issues that need to be handled:
* Handle PIC interrupts on any CPU irrespective of the apic mode
* In the apic lowest priority logical flat delivery mode, be prepared to
  handle the interrupt on any CPU irrespective of what the IO-APIC RTE says.
* Because of the above, when the IO-APIC starts handling the legacy PIC
  interrupt, use the same vector that is being used by the PIC while
  programming the corresponding IO-APIC RTE.

Start with all the cpu's in the legacy PIC interrupts cfg->domain. By the
time the IO-APIC starts taking over the PIC interrupts, the apic driver
model is finalized. So depend on assign_irq_vector() to update the
cfg->domain and retain the same vector that was used by the PIC before.

For the logical apic flat mode, cfg->domain is updated (during the first
call to assign_irq_vector()) to contain all the possible online cpu's
(0xff). The vector used for the legacy PIC interrupt doesn't change when
the IO-APIC starts handling the interrupt. Any interrupt migration after
that doesn't change the cfg->domain or the vector used.

For other apic modes like physical mode, cfg->domain is updated (during
the first call to assign_irq_vector()) to the boot cpu (cpu-0), with the
same vector that is being used by the PIC. When that interrupt is migrated
to a different cpu, cfg->domain and the vector assigned will change
accordingly.

Tested-by: Borislav Petkov
Signed-off-by: Suresh Siddha
---
 arch/x86/kernel/apic/io_apic.c | 26 ++++++--------------------
 1 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index c265593..0c1f366 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -234,11 +234,11 @@ int __init arch_early_irq_init(void)
 		zalloc_cpumask_var_node(&cfg[i].old_domain, GFP_KERNEL, node);
 		/*
 		 * For legacy IRQ's, start with assigning irq0 to irq15 to
-		 * IRQ0_VECTOR to IRQ15_VECTOR on cpu 0.
+		 * IRQ0_VECTOR to IRQ15_VECTOR for all cpu's.
 		 */
 		if (i < legacy_pic->nr_legacy_irqs) {
 			cfg[i].vector = IRQ0_VECTOR + i;
-			cpumask_set_cpu(0, cfg[i].domain);
+			cpumask_setall(cfg[i].domain);
 		}
 	}
 
@@ -1141,7 +1141,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			 * allocation for the members that are not used anymore.
 			 */
 			cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask);
-			cfg->move_in_progress = 1;
+			cfg->move_in_progress =
+			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
 			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
 			break;
 		}
@@ -1172,8 +1173,9 @@ next:
 		current_vector = vector;
 		current_offset = offset;
 		if (cfg->vector) {
-			cfg->move_in_progress = 1;
 			cpumask_copy(cfg->old_domain, cfg->domain);
+			cfg->move_in_progress =
+			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
 		}
 		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq;
@@ -1241,12 +1243,6 @@ void __setup_vector_irq(int cpu)
 		cfg = irq_get_chip_data(irq);
 		if (!cfg)
 			continue;
-		/*
-		 * If it is a legacy IRQ handled by the legacy PIC, this cpu
-		 * will be part of the irq_cfg's domain.
-		 */
-		if (irq < legacy_pic->nr_legacy_irqs && !IO_APIC_IRQ(irq))
-			cpumask_set_cpu(cpu, cfg->domain);
 		if (!cpumask_test_cpu(cpu, cfg->domain))
 			continue;
 
@@ -1356,16 +1352,6 @@ static void setup_ioapic_irq(unsigned int irq, struct irq_cfg *cfg,
 	if (!IO_APIC_IRQ(irq))
 		return;
 
-	/*
-	 * For legacy irqs, cfg->domain starts with cpu 0. Now that IO-APIC
-	 * can handle this irq and the apic driver is finialized at this point,
-	 * update the cfg->domain.
-	 */
-	if (irq < legacy_pic->nr_legacy_irqs &&
-	    cpumask_equal(cfg->domain, cpumask_of(0)))
-		apic->vector_allocation_domain(0, cfg->domain,
-					       apic->target_cpus());
-
 	if (assign_irq_vector(irq, cfg, apic->target_cpus()))
 		return;
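The old_domain/move_in_progress logic in the __assign_irq_vector() hunks above can be sketched in plain userspace C (a model only, not the kernel code: a 64-bit word stands in for struct cpumask, and the function names below are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch, not kernel code: a 64-bit word stands in for cpumask. */
typedef uint64_t cpumask_t;

static int cpumask_intersects(cpumask_t a, cpumask_t b)
{
	return (a & b) != 0;
}

/*
 * Mimics the hunks above: old_domain collects the cpus dropped from
 * cfg->domain, and move_in_progress is now set only if one of those
 * dropped cpus is actually online (i.e. a real migration happened).
 */
static int assign_sets_move_in_progress(cpumask_t domain, cpumask_t tmp_mask,
					cpumask_t online)
{
	cpumask_t old_domain = domain & ~tmp_mask;	/* cpumask_andnot() */

	return cpumask_intersects(old_domain, online);
}

/*
 * A legacy PIC irq starts with cpumask_setall(): narrowing that down to
 * the online target cpus drops only cpus that were never online, so no
 * vector-cleanup pass is triggered for the initial IO-APIC takeover.
 */
static const cpumask_t all_cpus = ~0ULL;	/* cpumask_setall() */
```

With two cpus online (mask 0x3), narrowing the setall domain down to those cpus yields move_in_progress == 0, while migrating a live irq from cpu0 to cpu1 still yields 1 — which is the commit's point.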
[tip:x86/timers] x86: apic: Use tsc deadline for oneshot when available
Commit-ID:  279f1461432ccdec0b98c0bcbe0a8e2c0f6fdda5
Gitweb:     http://git.kernel.org/tip/279f1461432ccdec0b98c0bcbe0a8e2c0f6fdda5
Author:     Suresh Siddha
AuthorDate: Mon, 22 Oct 2012 14:37:58 -0700
Committer:  Thomas Gleixner
CommitDate: Fri, 2 Nov 2012 11:23:37 +0100

x86: apic: Use tsc deadline for oneshot when available

If the TSC deadline mode is supported, the LAPIC timer one-shot mode can be
implemented using the IA32_TSC_DEADLINE MSR. An interrupt will be generated
when the TSC value equals or exceeds the value in the IA32_TSC_DEADLINE MSR.

This enables us to skip the APIC calibration during boot. Also, in xapic
mode, this enables us to skip the uncached apic access to re-arm the APIC
timer.

As this timer ticks at the high frequency TSC rate, we use the TSC_DIVISOR
(32) to work with the 32-bit restrictions in the clockevent API's to avoid
64-bit divides etc (frequency is u32 and "unsigned long" in
set_next_event(), max_delta limits the next event to 32-bit for 32-bit
kernels).

Signed-off-by: Suresh Siddha
Cc: ve...@google.com
Cc: len.br...@intel.com
Link: http://lkml.kernel.org/r/1350941878.6017.31.ca...@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner
---
 Documentation/kernel-parameters.txt |    4 ++
 arch/x86/include/asm/msr-index.h    |    2 +
 arch/x86/kernel/apic/apic.c         |   73 +++++++++++++++++++++----------
 3 files changed, 59 insertions(+), 20 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..4aa9ca0 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1304,6 +1304,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	lapic		[X86-32,APIC] Enable the local APIC even if BIOS
 			disabled it.
 
+	lapic=		[x86,APIC] "notscdeadline" Do not use TSC deadline
+			value for LAPIC timer one-shot implementation. Default
+			back to the programmable timer unit in the LAPIC.
+
 	lapic_timer_c2_ok	[X86,APIC] trust the local apic timer
 			in C2 power state.
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7f0edce..e400cdb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -337,6 +337,8 @@
 #define MSR_IA32_MISC_ENABLE_TURBO_DISABLE	(1ULL << 38)
 #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE	(1ULL << 39)
 
+#define MSR_IA32_TSC_DEADLINE	0x06E0
+
 /* P4/Xeon+ specific */
 #define MSR_IA32_MCG_EAX	0x0180
 #define MSR_IA32_MCG_EBX	0x0181
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b17416e..b994cc8 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -90,21 +90,6 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
  */
 DEFINE_EARLY_PER_CPU_READ_MOSTLY(int, x86_cpu_to_logical_apicid, BAD_APICID);
 
-/*
- * Knob to control our willingness to enable the local APIC.
- *
- * +1=force-enable
- */
-static int force_enable_local_apic __initdata;
-/*
- * APIC command line parameters
- */
-static int __init parse_lapic(char *arg)
-{
-	force_enable_local_apic = 1;
-	return 0;
-}
-early_param("lapic", parse_lapic);
 
 /* Local APIC was disabled by the BIOS and enabled by the kernel */
 static int enabled_via_apicbase;
 
@@ -133,6 +118,25 @@ static inline void imcr_apic_to_pic(void)
 }
 #endif
 
+/*
+ * Knob to control our willingness to enable the local APIC.
+ *
+ * +1=force-enable
+ */
+static int force_enable_local_apic __initdata;
+/*
+ * APIC command line parameters
+ */
+static int __init parse_lapic(char *arg)
+{
+	if (config_enabled(CONFIG_X86_32) && !arg)
+		force_enable_local_apic = 1;
+	else if (!strncmp(arg, "notscdeadline", 13))
+		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+	return 0;
+}
+early_param("lapic", parse_lapic);
+
 #ifdef CONFIG_X86_64
 static int apic_calibrate_pmtmr __initdata;
 static __init int setup_apicpmtimer(char *s)
@@ -315,6 +319,7 @@ int lapic_get_maxlvt(void)
 
 /* Clock divisor */
 #define APIC_DIVISOR 16
+#define TSC_DIVISOR  32
 
 /*
  * This function sets up the local APIC timer, with a timeout of
@@ -333,6 +338,9 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	lvtt_value = LOCAL_TIMER_VECTOR;
 	if (!oneshot)
 		lvtt_value |= APIC_LVT_TIMER_PERIODIC;
+	else if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+		lvtt_value |= APIC_LVT_TIMER_TSCDEADLINE;
+
 	if (!lapic_is_integrated())
 		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
 
@@ -341,6 +349,11 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	apic_write(APIC_LVTT, lvtt_value);
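The TSC_DIVISOR arithmetic described in the changelog can be sketched in plain userspace C. This is a simplified model of what the commit's set_next_event() path does, not the kernel code itself (the real handler arms the deadline with wrmsrl(); the helper names here are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define TSC_DIVISOR 32ULL

/*
 * Userspace model: the clockevent core hands set_next_event() a delta
 * limited to 32 bits, expressed in units of (TSC rate / TSC_DIVISOR).
 * The handler scales it back up and arms the deadline at
 * now + delta * TSC_DIVISOR; the hardware then fires the timer
 * interrupt once TSC >= IA32_TSC_DEADLINE.
 */
static uint64_t tsc_deadline(uint64_t tsc_now, uint32_t delta)
{
	return tsc_now + (uint64_t)delta * TSC_DIVISOR;
}

/*
 * Registering the clockevent device at tsc_khz / TSC_DIVISOR keeps the
 * advertised frequency inside a u32 even for multi-GHz TSCs.
 */
static uint32_t clockevent_khz(uint64_t tsc_khz)
{
	return (uint32_t)(tsc_khz / TSC_DIVISOR);
}
```

A full 32-bit delta then spans 2^32 * 32 TSC cycles, so 32-bit kernels keep their 32-bit clockevent limits without needing 64-bit divides in the hot path.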
Re: [PATCH] x86/ioapic: Fix the vector_irq[] is corrupted randomly
On Tue, 2012-10-30 at 00:15 +0800, Chuansheng Liu wrote:
> Not all irq chips are the IO-APIC chip.
>
> In our system, there are many demux GPIO interrupts in addition to the
> io-apic chip interrupts, and these GPIO interrupts belong to other irq
> chips; their chip data is not of type struct irq_cfg either.
>
> But the function __setup_vector_irq() lists all allocated irqs and
> presumes every irq chip is ioapic_chip and that the chip data is of
> type struct irq_cfg, which can possibly corrupt vector_irq randomly.
>
> For example, if one irq 258 is not an io-apic chip irq, then in
> __setup_vector_irq() the chip data is forced to be used as struct
> irq_cfg, and the values cfg->domain and cfg->vector are wrongly used
> to write vector_irq:
> 	vector = cfg->vector;
> 	per_cpu(vector_irq, cpu)[vector] = irq;
>
> This patch uses the .flags to identify if the irq chip is io-apic.

I have a feeling that your gpio driver is abusing the 'chip_data' in the
struct irq_data. Shouldn't the driver be using 'handler_data' instead?

From include/linux/irq.h:

 * @handler_data:	per-IRQ data for the irq_chip methods
 * @chip_data:		platform-specific per-chip private data for the chip
 *			methods, to allow shared chip implementations

Also, how are these routed to the processors, and what is the mechanism
of the vector assignment for these irq's? I presume irq_cfg is needed for
the setup and the interrupt migration from one cpu to another. What am I
missing?

thanks,
suresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
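The chip_data vs. handler_data split being suggested can be sketched like this — a userspace model with hypothetical struct layouts and helper names, not the real <linux/irq.h> definitions:

```c
#include <assert.h>

/* Userspace model with hypothetical names -- not the real <linux/irq.h>. */
struct irq_cfg {			/* io-apic's per-irq chip state */
	int vector;
};

struct gpio_demux_state {		/* a demux gpio driver's own state */
	int bank, pin;
};

struct irq_data {
	void *chip_data;	/* owned by the irq_chip implementation */
	void *handler_data;	/* free for per-IRQ use by the driver */
};

/* Core code like __setup_vector_irq() assumes chip_data is an irq_cfg. */
static int chip_vector(const struct irq_data *d)
{
	return ((const struct irq_cfg *)d->chip_data)->vector;
}

/* A demux driver keeps its private state in handler_data instead. */
static int demux_pin(const struct irq_data *d)
{
	return ((const struct gpio_demux_state *)d->handler_data)->pin;
}

static struct irq_cfg cfg258 = { .vector = 0x31 };
static struct gpio_demux_state gpio258 = { .bank = 2, .pin = 5 };
static struct irq_data irq258 = { .chip_data = &cfg258,
				  .handler_data = &gpio258 };
```

The point of the reply: if the gpio driver stuffed its own state into chip_data, the chip_vector() cast above would misread it, which is exactly the vector_irq corruption the patch was trying to paper over.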
Re: [RFC/PATCH 2.6.32.y 0/3] Re: [stable 2.6.32..2.6.34] x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()
On Wed, 2012-10-24 at 12:41 -0700, Jonathan Nieder wrote:
> Suresh Siddha wrote:
>> On Wed, 2012-10-24 at 11:25 -0700, Jonathan Nieder wrote:
>
>>> Why not cherry-pick 7716a5c4ff5 in full?
>>
>> As that depends on the other commits like:
>> commit 4b6b19a1c7302477653d799a53d48063dd53d555
>
> More importantly, if I understand correctly it might depend on
>
> commit cf7500c0ea13
> Author: Eric W. Biederman
> Date:   Tue Mar 30 01:07:11 2010 -0700
>
>     x86, ioapic: In mpparse use mp_register_ioapic
>
> Here's a series, completely untested, that is closer to what I
> expected. But the approach you took seems reasonable, too, as long
> as the commit message is tweaked to explain it.
>
> Thanks again,
> Jonathan
>
> Eric W. Biederman (3):
>   x86, ioapic: Teach mp_register_ioapic to compute a global gsi_end
>   x86, ioapic: In mpparse use mp_register_ioapic
>   x86, ioapic: Move nr_ioapic_registers calculation to
>     mp_register_ioapic.
>
>  arch/x86/include/asm/io_apic.h |  1 +
>  arch/x86/kernel/apic/io_apic.c | 28 ++--
>  arch/x86/kernel/mpparse.c      | 25 +
>  arch/x86/kernel/sfi.c          |  4 +---
>  4 files changed, 17 insertions(+), 41 deletions(-)

hmm, NO. I am not sure it is worth spending time validating all these
changes for the stable series, and I can't do it on my own, as I don't
have all the relevant HW.

For example, another commit, a4384df3e24579d6292a1b3b41d500349948f30b
(which you haven't picked up in your series), fixes some of the issues
introduced by the commits you have picked:

commit a4384df3e24579d6292a1b3b41d500349948f30b
Author: Eric W. Biederman
Date:   Tue Jun 8 11:44:32 2010 -0700

    x86, irq: Rename gsi_end gsi_top, and fix off by one errors

So I did think about all these things and really wanted to pursue the
smallest and simplest change. Here is the updated patch with just some
more text added to the changelog. Greg, does this look ok to you? Thanks.

-- 8< --
From: Suresh Siddha
Subject: x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()

Lin Bao reported that one of the HP platforms failed to boot the 2.6.32
kernel when the BIOS enabled interrupt-remapping and x2apic before
handing over control to the Linux kernel.

During boot, the Linux kernel masks all the interrupt sources (8259,
IO-APIC RTE's), sets up the interrupt-remapping hardware with the OS
controlled table, and unmasks the 8259 interrupts but not the IO-APIC
RTE's (as the newly set up interrupt-remapping table and the IO-APIC
RTE's are not yet programmed by the kernel). Shortly after this, the
IO-APIC RTE's and the interrupt-remapping table entries are programmed
based on the ACPI tables etc. So the expectation is that any interrupt
during this window will be dropped and will not see the intermediate
configuration.

In the reported problematic case, the BIOS has configured the IO-APIC in
virtual wire-B mode. In the window between the kernel setting up the new
interrupt-remapping table and the IO-APIC RTE's being properly
configured, an interrupt gets routed by the IO-APIC RTE (set up by the
virtual wire-B configuration) and sees the empty interrupt-remapping
table entry, resulting in a vt-d fault that causes the platform to
generate an NMI. And the OS panics on this unexpected NMI.

This problem doesn't happen with more recent kernels, and a closer look
at the 2.6.32 kernel shows that the code which masks the IO-APIC RTE's
is not working as expected, as the nr_ioapic_registers for each IO-APIC
is not yet initialized at this point. In the later kernels we initialize
nr_ioapic_registers much earlier and everything works as expected.

For 2.6.[32..34] kernels, fix this issue by initializing
nr_ioapic_registers early in mp_register_ioapic().

[ Relevant upstream commit info:

  commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af
  Author: Eric W. Biederman
  Date:   Tue Mar 30 01:07:12 2010 -0700

      x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.

  As the upstream commit depends on quite a few prior commits and some
  followup fixes in the mainline, we just picked the smallest relevant
  hunk for fixing the issue at hand. The problematic platform uses ACPI
  for IO-APIC, VT-d enumeration etc and this hunk only touches the ACPI
  based platforms. The nr_ioapic_registers initialization in
  enable_IO_APIC() is still retained, so that other configurations like
  legacy MPS table based enumeration etc work with no change. ]

Reported-and-tested-by: Zhang, Lin-Bao
Signed-off-by: Suresh Siddha
Cc: sta...@vger.kernel.org [v2.6.32..v2.6.34]
---
 arch/x86/kernel/apic/io_apic.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 8928d97..d256bc3 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -4262,6 +4262,7 @@ static int bad_ioapic(unsigned long address)
 void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 {
 	int idx = 0;
+	int entries;
 
 	if (bad_ioapic(address))
 		return;
@@ -4280,10 +4281,14 @@ void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 	 * Build basic GSI lookup table to facilitate gsi->io_apic lookups
 	 * and to prevent reprogramming of IOAPIC pins (PCI GSIs).
 	 */
+	entries = io_apic_get_redir_entries(idx);
 	mp_gsi_routing[idx].gsi_base = gsi_base;
-	mp_gsi_routing[idx].gsi_end = gsi_base +
-	    io_apic_get_redir_entries(idx);
+	mp_gsi_routing[idx].gsi_end = gsi_base + entries;
 
+	/*
+	 * The number of IO-APIC IRQ registers (== #pins):
+	 */
+	nr_ioapic_registers[idx] = entries + 1;
 	printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%x, "
 	       "GSI %d-%d\n", idx, mp_ioapics[idx].apicid,
 	       mp_ioapics[idx].apicver, mp_ioapics[idx].apicaddr,
 	       mp_gsi_routing[idx].gsi_base, mp_gsi_routing[idx].gsi_end);
Re: [stable 2.6.32..2.6.34] x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()
On Wed, 2012-10-24 at 11:25 -0700, Jonathan Nieder wrote:
> Hi Suresh,
>
> Suresh Siddha wrote:
>
> [...]
>> This problem doesn't happen with more recent kernels and closer
>> look at the 2.6.32 kernel shows that the code which masks
>> the IO-APIC RTE's is not working as expected as the nr_ioapic_registers
>> for each IO-APIC is not yet initialized at this point. In the later
>> kernels we initialize nr_ioapic_registers much before and
>> everything works as expected.
>>
>> For 2.6.[32..34] kernels, fix this issue by initializing
>> nr_ioapic_registers early in mp_register_ioapic()
>>
>> Relevant upstream commit info:
>>
>> commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af
>
> Why not cherry-pick 7716a5c4ff5 in full?

As that depends on the other commits like:

commit 4b6b19a1c7302477653d799a53d48063dd53d555
Author: Eric W. Biederman
Date:   Tue Mar 30 01:07:08 2010 -0700

Wanted to keep the changes as minimal as possible.

thanks,
suresh
[stable 2.6.32..2.6.34] x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()
Lin Bao reported that one of the HP platforms failed to boot 2.6.32 kernel, when the BIOS enabled interrupt-remapping and x2apic before handing over the control to the Linux kernel. During boot, Linux kernel masks all the interrupt sources (8259, IO-APIC RTE's), setup the interrupt-remapping hardware with the OS controlled table and unmasks the 8259 interrupts but not the IO-APIC RTE's (as the newly setup interrupt-remapping table and the IO-APIC RTE's are not yet programmed by the kernel). Shortly after this, IO-APIC RTE's and the interrupt-remapping table entries are programmed based on the ACPI tables etc. So the expectation is that any interrupt during this window will be dropped and not see the intermediate configuration. In the reported problematic case, BIOS has configured the IO-APIC in virtual wire-B mode. Between the window of the kernel setting up new interrupt-remapping table and the IO-APIC RTE's are properly configured, an interrupt gets routed by the IO-APIC RTE (setup by the virtual wire-B configuration) and sees the empty interrupt-remapping table entry, resulting in vt-d fault causing the platform to generate NMI. And the OS panics on this unexpected NMI. This problem doesn't happen with more recent kernels and closer look at the 2.6.32 kernel shows that the code which masks the IO-APIC RTE's is not working as expected as the nr_ioapic_registers for each IO-APIC is not yet initialized at this point. In the later kernels we initialize nr_ioapic_registers much before and everything works as expected. For 2.6.[32..34] kernels, fix this issue by initializing nr_ioapic_registers early in mp_register_ioapic() Relevant upstream commit info: commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af Author: Eric W. Biederman Date: Tue Mar 30 01:07:12 2010 -0700 x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic. 
Reported-and-tested-by: Zhang, Lin-Bao
Signed-off-by: Suresh Siddha
Cc: sta...@vger.kernel.org [v2.6.32..v2.6.34]
---
 arch/x86/kernel/apic/io_apic.c |    9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 8928d97..d256bc3 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -4262,6 +4262,7 @@ static int bad_ioapic(unsigned long address)
 void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 {
 	int idx = 0;
+	int entries;
 
 	if (bad_ioapic(address))
 		return;
@@ -4280,10 +4281,14 @@ void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 	 * Build basic GSI lookup table to facilitate gsi->io_apic lookups
 	 * and to prevent reprogramming of IOAPIC pins (PCI GSIs).
 	 */
+	entries = io_apic_get_redir_entries(idx);
 	mp_gsi_routing[idx].gsi_base = gsi_base;
-	mp_gsi_routing[idx].gsi_end = gsi_base +
-	    io_apic_get_redir_entries(idx);
+	mp_gsi_routing[idx].gsi_end = gsi_base + entries;
+	/*
+	 * The number of IO-APIC IRQ registers (== #pins):
+	 */
+	nr_ioapic_registers[idx] = entries + 1;
 
 	printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%x, "
 	       "GSI %d-%d\n", idx, mp_ioapics[idx].apicid,
 	       mp_ioapics[idx].apicver, mp_ioapics[idx].apicaddr,
Re: [RFC/PATCH 2.6.32.y 0/3] Re: [stable 2.6.32..2.6.34] x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()
On Wed, 2012-10-24 at 12:41 -0700, Jonathan Nieder wrote:
> Suresh Siddha wrote:
> > On Wed, 2012-10-24 at 11:25 -0700, Jonathan Nieder wrote:
> >
> > > Why not cherry-pick 7716a5c4ff5 in full?
> >
> > As that depends on the other commits like:
> >
> > commit 4b6b19a1c7302477653d799a53d48063dd53d555
>
> More importantly, if I understand correctly it might depend on
>
> commit cf7500c0ea13
> Author: Eric W. Biederman ebied...@xmission.com
> Date:   Tue Mar 30 01:07:11 2010 -0700
>
>     x86, ioapic: In mpparse use mp_register_ioapic
>
> Here's a series, completely untested, that is closer to what I
> expected. But the approach you took seems reasonable, too, as long
> as the commit message is tweaked to explain it.
>
> Thanks again,
> Jonathan
>
> Eric W. Biederman (3):
>   x86, ioapic: Teach mp_register_ioapic to compute a global gsi_end
>   x86, ioapic: In mpparse use mp_register_ioapic
>   x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.
>
>  arch/x86/include/asm/io_apic.h |  1 +
>  arch/x86/kernel/apic/io_apic.c | 28 ++--
>  arch/x86/kernel/mpparse.c      | 25 +
>  arch/x86/kernel/sfi.c          |  4 +---
>  4 files changed, 17 insertions(+), 41 deletions(-)

hmm, NO. I am not sure if it is worth spending time validating all these
changes for the stable series and I can't do it on my own, as I don't
have all the relevant HW.

For example, another commit a4384df3e24579d6292a1b3b41d500349948f30b
(which you haven't picked up in your series) fixes some of these issues
introduced by the commits you have picked.

commit a4384df3e24579d6292a1b3b41d500349948f30b
Author: Eric W. Biederman ebied...@xmission.com
Date:   Tue Jun 8 11:44:32 2010 -0700

    x86, irq: Rename gsi_end gsi_top, and fix off by one errors

So I did think about all these things and wanted to really pursue the
smallest and simplest change. Here is the updated patch with just some
more text added to the changelog.

Greg, does this look ok to you? Thanks.
-- 8< --
From: Suresh Siddha suresh.b.sid...@intel.com
Subject: x86, ioapic: initialize nr_ioapic_registers early in mp_register_ioapic()

Lin Bao reported that one of the HP platforms failed to boot 2.6.32
kernel, when the BIOS enabled interrupt-remapping and x2apic before
handing over the control to the Linux kernel.

During boot, Linux kernel masks all the interrupt sources (8259, IO-APIC
RTE's), setup the interrupt-remapping hardware with the OS controlled
table and unmasks the 8259 interrupts but not the IO-APIC RTE's (as the
newly setup interrupt-remapping table and the IO-APIC RTE's are not yet
programmed by the kernel). Shortly after this, IO-APIC RTE's and the
interrupt-remapping table entries are programmed based on the ACPI
tables etc. So the expectation is that any interrupt during this window
will be dropped and not see the intermediate configuration.

In the reported problematic case, BIOS has configured the IO-APIC in
virtual wire-B mode. Between the window of the kernel setting up new
interrupt-remapping table and the IO-APIC RTE's are properly configured,
an interrupt gets routed by the IO-APIC RTE (setup by the virtual wire-B
configuration) and sees the empty interrupt-remapping table entry,
resulting in vt-d fault causing the platform to generate NMI. And the
OS panics on this unexpected NMI.

This problem doesn't happen with more recent kernels and closer look at
the 2.6.32 kernel shows that the code which masks the IO-APIC RTE's is
not working as expected as the nr_ioapic_registers for each IO-APIC is
not yet initialized at this point. In the later kernels we initialize
nr_ioapic_registers much before and everything works as expected.

For 2.6.[32..34] kernels, fix this issue by initializing
nr_ioapic_registers early in mp_register_ioapic()

[ Relevant upstream commit info:

  commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af
  Author: Eric W. Biederman ebied...@xmission.com
  Date:   Tue Mar 30 01:07:12 2010 -0700

      x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.

  As the upstream commit depends on quite a few prior commits and some
  followup fixes in the mainline, we just picked the smallest relevant
  hunk for fixing the issue at hand. Problematic platform uses ACPI for
  IO-APIC, VT-d enumeration etc and this hunk only touches the ACPI
  based platforms. nr_ioapic_registers initialization in
  enable_IO_APIC() is still retained, so that other configurations like
  legacy MPS table based enumeration etc works with no change. ]

Reported-and-tested-by: Zhang, Lin-Bao linbao.zh...@hp.com
Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com
Cc: sta...@vger.kernel.org [v2.6.32..v2.6.34]
---
 arch/x86/kernel/apic/io_apic.c |    9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 8928d97..d256bc3 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -4262,6 +4262,7
[patch] x86, apic: use tsc deadline for oneshot when available
Thomas,

You wanted to run some tests with this, right? Please give it a try and
see if this is ok to be pushed to the -tip.

thanks,
suresh
--8<--
From: Suresh Siddha
Subject: x86, apic: use tsc deadline for oneshot when available

If the TSC deadline mode is supported, LAPIC timer one-shot mode can be
implemented using IA32_TSC_DEADLINE MSR. An interrupt will be generated
when the TSC value equals or exceeds the value in the IA32_TSC_DEADLINE
MSR.

This enables us to skip the APIC calibration during boot. Also, in xapic
mode, this enables us to skip the uncached apic access to re-arm the
APIC timer.

As this timer ticks at the high frequency TSC rate, we use the
TSC_DIVISOR (32) to work with the 32-bit restrictions in the clockevent
API's to avoid 64-bit divides etc (frequency is u32 and "unsigned long"
in the set_next_event(), max_delta limits the next event to 32-bit for
32-bit kernel).

Signed-off-by: Suresh Siddha
---
 Documentation/kernel-parameters.txt |    4 ++
 arch/x86/include/asm/msr-index.h    |    2 +
 arch/x86/kernel/apic/apic.c         |   66 ++-
 3 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..4aa9ca0 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1304,6 +1304,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	lapic		[X86-32,APIC] Enable the local APIC even if BIOS
 			disabled it.
 
+	lapic=		[x86,APIC] "notscdeadline" Do not use TSC deadline
+			value for LAPIC timer one-shot implementation. Default
+			back to the programmable timer unit in the LAPIC.
+
 	lapic_timer_c2_ok	[X86,APIC] trust the local apic timer
 			in C2 power state.
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7f0edce..e400cdb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -337,6 +337,8 @@
 #define MSR_IA32_MISC_ENABLE_TURBO_DISABLE	(1ULL << 38)
 #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE	(1ULL << 39)
 
+#define MSR_IA32_TSC_DEADLINE		0x06E0
+
 /* P4/Xeon+ specific */
 #define MSR_IA32_MCG_EAX		0x0180
 #define MSR_IA32_MCG_EBX		0x0181
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b17416e..b0c49b1 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -90,21 +90,6 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
  */
 DEFINE_EARLY_PER_CPU_READ_MOSTLY(int, x86_cpu_to_logical_apicid, BAD_APICID);
 
-/*
- * Knob to control our willingness to enable the local APIC.
- *
- * +1=force-enable
- */
-static int force_enable_local_apic __initdata;
-/*
- * APIC command line parameters
- */
-static int __init parse_lapic(char *arg)
-{
-	force_enable_local_apic = 1;
-	return 0;
-}
-early_param("lapic", parse_lapic);
 
 /* Local APIC was disabled by the BIOS and enabled by the kernel */
 static int enabled_via_apicbase;
@@ -133,6 +118,25 @@ static inline void imcr_apic_to_pic(void)
 }
 #endif
 
+/*
+ * Knob to control our willingness to enable the local APIC.
+ *
+ * +1=force-enable
+ */
+static int force_enable_local_apic __initdata;
+/*
+ * APIC command line parameters
+ */
+static int __init parse_lapic(char *arg)
+{
+	if (config_enabled(CONFIG_X86_32) && !arg)
+		force_enable_local_apic = 1;
+	else if (!strncmp(arg, "notscdeadline", 13))
+		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+	return 0;
+}
+early_param("lapic", parse_lapic);
+
 #ifdef CONFIG_X86_64
 static int apic_calibrate_pmtmr __initdata;
 static __init int setup_apicpmtimer(char *s)
@@ -315,6 +319,7 @@ int lapic_get_maxlvt(void)
 
 /* Clock divisor */
 #define APIC_DIVISOR		16
+#define TSC_DIVISOR		32
 
 /*
  * This function sets up the local APIC timer, with a timeout of
@@ -333,6 +338,9 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	lvtt_value = LOCAL_TIMER_VECTOR;
 	if (!oneshot)
 		lvtt_value |= APIC_LVT_TIMER_PERIODIC;
+	else if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+		lvtt_value |= APIC_LVT_TIMER_TSCDEADLINE;
+
 	if (!lapic_is_integrated())
 		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
 
@@ -341,6 +349,11 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 
 	apic_write(APIC_LVTT, lvtt_value);
 
+	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
+		printk_once(KERN_DEBUG "TSC deadline timer enabled\n");
+		return;
+	}
+
 	/*
 	 * Divide PICLK by 16
 	 */
@@ -453,6 +466,15 @@ static int lapic_next_event(unsigned long delta,
 	return 0
Re: x2apic boot failure on recent sandy bridge system
On Fri, 2012-10-19 at 16:36 -0700, H. Peter Anvin wrote:
> On 10/19/2012 04:32 PM, Yinghai Lu wrote:
> > On Fri, Oct 19, 2012 at 4:03 PM, Suresh Siddha wrote:
> >> On Fri, 2012-10-19 at 13:42 -0700, rrl...@gmail.com wrote:
> >>> Any update? The messages just seem to have stopped months ago. A
> >>> fallback would be nice, I have been booting the kernel with noa2xpic
> >>> for since kernel 3.2, and currently I am working with 3.6.2.
> >>>
> >>> If needed I can try to attempt modifying the patch to include
> >>> fallback, but I am probably not the best person to do it.
> >>>
> >>
> >> Are you referring to this commit that made into the mainline tree
> >> already?
> >>
> >> commit fb209bd891645bb87b9618b724f0b4928e0df3de
> >> Author: Yinghai Lu
> >> Date:   Wed Dec 21 17:45:17 2011 -0800
> >>
> >>     x86, x2apic: Fallback to xapic when BIOS doesn't setup
> >>     interrupt-remapping
> >
> > I think his system has DMAR table and cpu support x2apic, So kernel
> > will switch to x2apic,
> >
> > but somehow BIOS SMI handler has problem with x2apic. should be
> > thinkpad W520?
>
> Right, StinkPad W520 needs a quirk.

yes. Yinghai, if you remember you had a T420 that didn't show the
problem. And someone in the bugzilla with T420 had the problem. And
their dmidecode is
https://launchpadlibrarian.net/109393850/dmidecode.txt

What is the difference with your system? Bios I think is the same. Can
you see what we should check for in the dmi tables to black list these
systems? Can you post your T420's dmidecode to see the difference.

Bugs I have on this are:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/776999
https://bugzilla.kernel.org/show_bug.cgi?id=43054

thanks,
suresh
Re: x2apic boot failure on recent sandy bridge system
On Fri, 2012-10-19 at 13:42 -0700, rrl...@gmail.com wrote:
> Any update? The messages just seem to have stopped months ago. A
> fallback would be nice, I have been booting the kernel with noa2xpic
> for since kernel 3.2, and currently I am working with 3.6.2.
>
> If needed I can try to attempt modifying the patch to include
> fallback, but I am probably not the best person to do it.
>

Are you referring to this commit that made into the mainline tree
already?

commit fb209bd891645bb87b9618b724f0b4928e0df3de
Author: Yinghai Lu
Date:   Wed Dec 21 17:45:17 2011 -0800

    x86, x2apic: Fallback to xapic when BIOS doesn't setup
    interrupt-remapping

    On some of the recent Intel SNB platforms, by default bios is
    pre-enabling x2apic mode in the cpu with out setting up
    interrupt-remapping. This case was resulting in the kernel to panic
    as the cpu is already in x2apic mode but the OS was not able to
    enable interrupt-remapping (which is a pre-req for using x2apic
    capability). On these platforms all the apic-ids are < 255 and the
    kernel can fallback to xapic mode if the bios has not enabled
    interrupt-remapping (which is mostly the case if the bios has not
    exported interrupt-remapping tables to the OS).

    Reported-by: Berck E. Nash
    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/20111222014632.600418...@sbsiddha-desk.sc.intel.com
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin
RE: [PATCH] fix x2apic defect that Linux kernel doesn't mask 8259A interrupt during the time window between changing VT-d table base address and initializing these VT-d entries(smpboot.c and apic.c )
On Wed, 2012-10-10 at 00:26 +, Zhang, Lin-Bao (Linux Kernel R&D) wrote:
> So , we can think ,as your patch , during the window , IO-apic is
> useless or we can think IO-APIC doesn't exist ?
> Could you mind please sharing your design details ? thanks very much!

Between the window of interrupt-remapping enabled and the masked
IO-APIC RTE's are configured properly, linux kernel doesn't wait/depend
on any external interrupts.

> > Can you please apply the appended patch to 2.6.32 kernel and see if
> > the issue you mentioned gets fixed? If so, we can ask the -stable
> > and OSV's teams to pick up this fix.
>
> Yes , it can resolve current issue.

Thanks for testing it out. I will add the appropriate changelog and
send the patch out (to 2.6.32 stable and OSV kernels) with your
"Tested-by:" if you are ok.

thanks,
suresh
RE: [PATCH] fix x2apic defect that Linux kernel doesn't mask 8259A interrupt during the time window between changing VT-d table base address and initializing these VT-d entries(smpboot.c and apic.c )
On Wed, 2012-10-10 at 16:02 -0700, Zhang, Lin-Bao (Linux Kernel R&D) wrote:
> > As I mentioned earlier, the current design already ensures that all
> > the IO-APIC RTE's are masked between the time we enable
> > interrupt-remapping to the time when the IO-APIC RTE's are
> > configured correctly.
> >
> > So I looked at why you are seeing the problem with v2.6.32 but not
> > with the recent kernels. And I think I found out the reason.
>
> I want to know what masking IO-APIC means?

As the platform is configured to use virtual-wire B and the
corresponding IO-APIC RTE is masked, that interrupt will be dropped.

thanks,
suresh
RE: [PATCH] fix x2apic defect that Linux kernel doesn't mask 8259A interrupt during the time window between changing VT-d table base address and initializing these VT-d entries(smpboot.c and apic.c )
On Sun, 2012-10-07 at 21:53 -0700, Zhang, Lin-Bao (Linux Kernel RD) wrote:
> Hi Suresh,
> Could you please update the current status of these 2 files and the patch?
> I am not sure if I have answered your questions; if not, feel free to let me know.
> This is my first time submitting a patch to LKML, so what should I do as the next step?

As I mentioned earlier, the current design already ensures that all the IO-APIC RTE's are masked between the time we enable interrupt-remapping and the time when the IO-APIC RTE's are configured correctly. So I looked at why you are seeing the problem with v2.6.32 but not with the recent kernels. And I think I found out the reason.

The 2.6.32 kernel is missing this fix, http://marc.info/?l=linux-acpi&m=126993666715081&w=2

commit 7716a5c4ff5f1f3dc5e9edcab125cbf7fceef0af
Author: Eric W. Biederman
Date: Tue Mar 30 01:07:12 2010 -0700

    x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.

    Now that all ioapic registration happens in mp_register_ioapic we can
    move the calculation of nr_ioapic_registers there from enable_IO_APIC.
    The number of ioapic registers is already calculated in
    mp_register_ioapic, so all that really needs to be done is to save the
    calculated value in nr_ioapic_registers.

    Signed-off-by: Eric W. Biederman
    LKML-Reference: <1269936436-7039-11-git-send-email-ebied...@xmission.com>
    Signed-off-by: H. Peter Anvin

Because of this, in v2.6.32, mask_IO_APIC_setup() is not working as expected, as nr_ioapic_registers[] is not yet initialized and thus the IO-APIC RTE's are not masked as expected. We just need the last hunk of that patch, I think.

Can you please apply the appended patch to the 2.6.32 kernel and see if the issue you mentioned gets fixed? If so, we can ask the -stable and OSV teams to pick up this fix.
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index f807255..dae9240 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -4293,6 +4281,7 @@ static int bad_ioapic(unsigned long address)
 void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 {
 	int idx = 0;
+	int entries;
 
 	if (bad_ioapic(address))
 		return;
@@ -4311,9 +4300,14 @@ void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
 	 * Build basic GSI lookup table to facilitate gsi->io_apic lookups
 	 * and to prevent reprogramming of IOAPIC pins (PCI GSIs).
 	 */
+	entries = io_apic_get_redir_entries(idx);
 	mp_gsi_routing[idx].gsi_base = gsi_base;
-	mp_gsi_routing[idx].gsi_end = gsi_base +
-				      io_apic_get_redir_entries(idx) - 1;
+	mp_gsi_routing[idx].gsi_end = gsi_base + entries - 1;
+
+	/*
+	 * The number of IO-APIC IRQ registers (== #pins):
+	 */
+	nr_ioapic_registers[idx] = entries;
 
 	if (mp_gsi_routing[idx].gsi_end > gsi_end)
 		gsi_end = mp_gsi_routing[idx].gsi_end;
Re: [PATCH RESEND] x86/fixup_irq: Clean the offlining CPU from the irq affinity mask
On Fri, 2012-09-28 at 00:12 +0530, Srivatsa S. Bhat wrote:
> On 09/27/2012 04:16 AM, Suresh Siddha wrote:
> >
> > No. irq_set_affinity()
>
> Um? That takes the updated/changed affinity and sets data->affinity to that value no? You mentioned that probably the intention of the original code was to preserve the user-set affinity mask, but still change the underlying interrupt routing. Sorry, but I still didn't quite understand what is that part of the code that achieves that.

For the HW routing to be changed, we AND it with cpu_online_map and use that for programming the interrupt entries etc. The user-specified affinity still has the cpu that is offlined. And when the cpu comes online, and if it is part of the user-specified affinity, then the HW routing can again be modified to include the new cpu.

hope this clears it!

thanks.
Re: [PATCH RESEND] x86/fixup_irq: Clean the offlining CPU from the irq affinity mask
On Fri, 2012-09-28 at 00:12 +0530, Srivatsa S. Bhat wrote: On 09/27/2012 04:16 AM, Suresh Siddha wrote: No. irq_set_affinity() Um? That takes the updated/changed affinity and sets data-affinity to that value no? You mentioned that probably the intention of the original code was to preserve the user-set affinity mask, but still change the underlying interrupt routing. Sorry, but I still didn't quite understand what is that part of the code that achieves that. For the HW routing to be changed we AND it with cpu_online_map and use that for programming the interrupt entries etc. The user-specified affinity still has the cpu that is offlined. And when the cpu comes online and if it is part of the user-specified affinity, then the HW routing can be again modified to include the new cpu. hope this clears it! thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] x86/fixup_irq: Clean the offlining CPU from the irq affinity mask
On Wed, 2012-09-26 at 23:00 +0530, Srivatsa S. Bhat wrote:
> On 09/26/2012 10:36 PM, Suresh Siddha wrote:
> > On Wed, 2012-09-26 at 21:33 +0530, Srivatsa S. Bhat wrote:
> >> I have some fundamental questions here:
> >> 1. Why was the CPU never removed from the affinity masks in the original code? I find it hard to believe that it was just an oversight, because the whole point of fixup_irqs() is to affine the interrupts to other CPUs, IIUC. So, is that really a bug or is the existing code correct for some reason which I don't know of?
> >
> > I am not aware of the history but my guess is that the affinity mask which is coming from the user-space wants to be preserved. And fixup_irqs() is fixing the underlying interrupt routing when the cpu goes down
>
> and the code that corresponds to that is: irq_force_complete_move(irq); is it?

No. irq_set_affinity()

> > with a hope that things will be corrected when the cpu comes back online. But as Liu noted, we are not correcting the underlying routing when the cpu comes back online. I think we should fix that rather than modifying the user-specified affinity.
>
> Hmm, I didn't entirely get your suggestion. Are you saying that we should change data->affinity (by calling ->irq_set_affinity()) during offline but maintain a copy of the original affinity mask somewhere, so that we can try to match it when possible (i.e., when the CPU comes back online)?

Don't change data->affinity in fixup_irqs(), and shortly after a cpu is online, call the irq_chip's irq_set_affinity() for those irq's whose affinity included this cpu (now that the cpu is back online, irq_set_affinity() will set up the HW routing tables correctly). This presumes that across suspend/resume and cpu offline/online operations, we don't want to break the irq affinity set up by a user-level entity like irqbalance etc...

> > That happens only if the irq chip doesn't have the irq_set_affinity() setup.
>
> That is my other point of concern: setting irq affinity can fail even if we have ->irq_set_affinity(). (If __ioapic_set_affinity() fails, for example.) Why don't we complain in that case? I think we should... and if it's serious enough, abort the hotplug operation or at least indicate that offline failed..

Yes, if there is a failure then we are in trouble, as the cpu has already disappeared from the online-masks etc. For platforms with interrupt-remapping, interrupts can be migrated from process context, and as such this can all be done much earlier. And for legacy platforms we have made quite a few changes in the recent past, like using eoi_ioapic_irq() for level-triggered interrupts etc., that make it as safe as it can be. Perhaps we can move most of the fixup_irqs() code much earlier, and the last section of the current fixup_irqs() (which checks the IRR bits and uses the retrigger function to trigger the interrupt on another cpu) can still be done late, just like now.

thanks,
suresh
Re: [PATCH RESEND] x86/fixup_irq: Clean the offlining CPU from the irq affinity mask
On Wed, 2012-09-26 at 21:33 +0530, Srivatsa S. Bhat wrote:
> I have some fundamental questions here:
> 1. Why was the CPU never removed from the affinity masks in the original code? I find it hard to believe that it was just an oversight, because the whole point of fixup_irqs() is to affine the interrupts to other CPUs, IIUC. So, is that really a bug or is the existing code correct for some reason which I don't know of?

I am not aware of the history but my guess is that the affinity mask which is coming from user-space wants to be preserved. And fixup_irqs() is fixing the underlying interrupt routing when the cpu goes down with a hope that things will be corrected when the cpu comes back online. But as Liu noted, we are not correcting the underlying routing when the cpu comes back online. I think we should fix that rather than modifying the user-specified affinity.

> 2. In case this is indeed a bug, why are the warnings ratelimited when the interrupts can't be affined to other CPUs? Are they not serious enough to report? Put more strongly, why do we even silently return with a warning instead of reporting that the CPU offline operation failed?? Is that because we have come way too far in the hotplug sequence and we can't easily roll back? Or are we still actually OK in that situation?

Are you referring to the "cannot set affinity for irq" messages? That happens only if the irq chip doesn't have the irq_set_affinity() setup. But that is not common.

> Suresh, I'd be grateful if you could kindly throw some light on these issues... I'm actually debugging an issue where an offline CPU gets apic timer interrupts (and in one case, I even saw a device interrupt), which I have reported in another thread at: https://lkml.org/lkml/2012/9/26/119
> But this issue in fixup_irqs() that Liu brought to light looks even more surprising to me..

These issues look different to me, will look into that.

thanks,
suresh
Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
On Mon, 2012-09-24 at 12:12 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith wrote:
> >
> > Aside from the cache pollution I recall having been mentioned, on my E5620, cross core is a tbench win over affine, cross thread is not.
>
> Oh, I agree with trying to avoid HT threads, the resource contention easily gets too bad.
>
> It's more a question of "if we have real cores with separate L1's but shared L2's, go with those first, before we start distributing it out to separate L2's".

There is one issue though. If the tasks continue to run in this state and the periodic balance notices an idle L2, it will force-migrate (using active migration) one of the tasks to the idle L2, as the periodic balance tries to spread the load as far as possible to take maximum advantage of the available resources (and the perf advantage of this really depends on the workload, cache usage/memory bw, the upside of turbo etc). But I am not sure if this was the reason why we chose to spread it out to separate L2's during wakeup.

Anyways, this is one of the places where Paul Turner's task load average tracking patches will be useful. Depending on how long a task typically runs, we can probably even choose an SMT sibling or a separate L2 to run on.

thanks,
suresh
[tip:x86/fpu] x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
Commit-ID: b1a74bf8212367be2b1d6685c11a84e056eaaaf1
Gitweb: http://git.kernel.org/tip/b1a74bf8212367be2b1d6685c11a84e056eaaaf1
Author: Suresh Siddha
AuthorDate: Thu, 20 Sep 2012 11:01:49 -0700
Committer: H. Peter Anvin
CommitDate: Fri, 21 Sep 2012 16:59:04 -0700

x86, kvm: fix kvm's usage of kernel_fpu_begin/end()

Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu(), which can be very far apart.

kvm_load/put_guest_fpu() routines are already called with preemption disabled, and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). So introduce __kernel_fpu_begin/end() routines which don't touch preemption, and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state.

Also with this change (and with the eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For the eagerFPU case, host cr0.TS is always clear, so no need to worry about it. For the traditional lazyFPU restore case, change the cr0.TS bit for the host state during vm-exit to be always clear, and the cr0.TS bit is set in __vmx_load_host_state() when the FPU (guest FPU or the host task's FPU) state is not active. This ensures that the host/guest FPU state is properly saved and restored during context-switch and with interrupts (using irq_fpu_usable()) not stomping on the active FPU state.

Signed-off-by: Suresh Siddha
Link: http://lkml.kernel.org/r/1348164109.26695.338.ca...@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity
Signed-off-by: H. Peter Anvin

---
 arch/x86/include/asm/i387.h |   28 ++--
 arch/x86/kernel/i387.c      |   13 +
 arch/x86/kvm/vmx.c          |   10 +++---
 arch/x86/kvm/x86.c          |    4 ++--
 4 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 6c3bd37..ed8089d 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -24,8 +24,32 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);
 extern void math_state_restore(void);
 extern bool irq_fpu_usable(void);
-extern void kernel_fpu_begin(void);
-extern void kernel_fpu_end(void);
+
+/*
+ * Careful: __kernel_fpu_begin/end() must be called with preempt disabled
+ * and they don't touch the preempt state on their own.
+ * If you enable preemption after __kernel_fpu_begin(), preempt notifier
+ * should call the __kernel_fpu_end() to prevent the kernel/user FPU
+ * state from getting corrupted. KVM for example uses this model.
+ *
+ * All other cases use kernel_fpu_begin/end() which disable preemption
+ * during kernel FPU usage.
+ */
+extern void __kernel_fpu_begin(void);
+extern void __kernel_fpu_end(void);
+
+static inline void kernel_fpu_begin(void)
+{
+	WARN_ON_ONCE(!irq_fpu_usable());
+	preempt_disable();
+	__kernel_fpu_begin();
+}
+
+static inline void kernel_fpu_end(void)
+{
+	__kernel_fpu_end();
+	preempt_enable();
+}
 
 /*
  * Some instructions like VIA's padlock instructions generate a spurious
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 6782e39..675a050 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -73,32 +73,29 @@ bool irq_fpu_usable(void)
 }
 EXPORT_SYMBOL(irq_fpu_usable);
 
-void kernel_fpu_begin(void)
+void __kernel_fpu_begin(void)
 {
 	struct task_struct *me = current;
 
-	WARN_ON_ONCE(!irq_fpu_usable());
-	preempt_disable();
 	if (__thread_has_fpu(me)) {
 		__save_init_fpu(me);
 		__thread_clear_has_fpu(me);
-		/* We do 'stts()' in kernel_fpu_end() */
+		/* We do 'stts()' in __kernel_fpu_end() */
 	} else if (!use_eager_fpu()) {
 		this_cpu_write(fpu_owner_task, NULL);
 		clts();
 	}
 }
-EXPORT_SYMBOL(kernel_fpu_begin);
+EXPORT_SYMBOL(__kernel_fpu_begin);
 
-void kernel_fpu_end(void)
+void __kernel_fpu_end(void)
 {
 	if (use_eager_fpu())
 		math_state_restore();
 	else
 		stts();
-	preempt_enable();
 }
-EXPORT_SYMBOL(kernel_fpu_end);
+EXPORT_SYMBOL(__kernel_fpu_end);
 
 void unlazy_fpu(struct task_struct *tsk)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c00f03d..70dfcec 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1493,8 +1493,12 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 #ifdef CONFIG_X86_64
 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
 #endif
-	if (user_has_fpu())
-		clts();
+	/*
+	 * If the FPU is not active (through the host task or
+	 * the guest vcpu), then restore the cr0.TS bit.
+	 */
+	if (!user_has_fpu() && !vmx->vcpu.guest_fpu_loaded)
+		stts();
 	load_gdt(&__get_cpu_var(host_gdt));
 }
 
@@ -3730,7 +3734,7 @@ static void vmx_set_consta
Re: [PATCH] fix x2apic defect that Linux kernel doesn't mask 8259A interrupt during the time window between changing VT-d table base address and initializing these VT-d entries
On Wed, 2012-09-12 at 07:02 +, Zhang, Lin-Bao (ESSN-MCXS-Linux Kernel R&D) wrote: > Hi all, > This defect can be observed when the x2apic setting in BIOS is set to > "auto" and the BIOS has virtual wire mode enabled on a power up. This > defect was found on a 2.6.32 based kernel. I assume you are able to reproduce the issue with the latest kernel as well? Which virtual wire mode is it? Virtual wire mode-A (where the PIC output is connected to LINT0 of the Local APIC) doesn't go through interrupt-remapping, and virtual wire mode-B (where the PIC output is routed through the IO-APIC RTE) will be completely disabled, as all the BIOS setup IO-APIC RTE's are masked by the Linux kernel from the time we enable interrupt-remapping to the time the IO-APIC RTE's are properly re-configured by the Linux kernel again. So I am at a loss to understand what is causing this. > > The kernel code (smpboot.c, apic.c) does not mask 8259A interrupts > before changing and initializing the new VT-d table when x2apic > virtual wire mode is enabled on power up. The Linux Kernel expects > virtual wire mode to be disabled when booting and enables it when > interrupts are masked. > > The BIOS code builds a simple VT-d table on power up. While the Linux > Kernel boots, it first builds an empty VT-d table and uses it. After > some time, the Linux Kernel then initializes the IO-APIC redirect > table, and then initializes the VT-d entries. In the window between > initializing the redirect table and the VT-d entries, the 8259A > interrupts are not masked. If an interrupt occurs in this window, the > Linux Kernel will not find a valid entry for this interrupt. The > kernel treats it as a fatal error and panics. If the error never > gets cleared, the Linux kernel continuously prints this error: > "NMI: IOCK error (debug interrupt?) for reason" Not sure why we get an NMI instead of a vt-d fault? Perhaps the vt-d fault is also getting reported via NMI on this platform? Does your tested kernel have this fix? 
commit 254e42006c893f45bca48f313536fcba12206418 Author: Suresh Siddha Date: Mon Dec 6 12:26:30 2010 -0800 x86, vt-d: Quirk for masking vtd spec errors to platform error handling logic Will you be able to provide the failing kernel log so that I can better understand the issue? thanks, suresh > The fix to this defect, the code change is to mask 8259A interrupts > before changing the VT-d table and initializing the VT-d entries. Then unmask > interrupts after completing the redirect table entries. > > > Signed-off-by: Zhang, Lin-Bao > Tested-by: Nigel Croxon > > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index > 24deb30..299172c 100644 > --- a/arch/x86/kernel/apic/apic.c > +++ b/arch/x86/kernel/apic/apic.c > @@ -1556,7 +1556,6 @@ void __init enable_IR_x2apic(void) > } > > local_irq_save(flags); > - legacy_pic->mask_all(); > mask_ioapic_entries(); > > if (x2apic_preenabled && nox2apic) @@ -1603,7 +1602,6 @@ void __init > enable_IR_x2apic(void) > skip_x2apic: > if (ret < 0) /* IR enabling failed */ > restore_ioapic_entries(); > - legacy_pic->restore_mask(); > local_irq_restore(flags); > } > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index > 7c5a8c3..95fee01 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -1000,7 +1000,7 @@ void __init native_smp_prepare_cpus(unsigned int > max_cpus) > zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), > GFP_KERNEL); > } > set_cpu_sibling_map(0); > - > + mask_8259A(); > > if (smp_sanity_check(max_cpus) < 0) { > pr_info("SMP disabled\n"); @@ -1037,6 +1037,8 @@ void __init > native_smp_prepare_cpus(unsigned int max_cpus) > apic->setup_portio_remap(); > > smpboot_setup_io_apic(); > + unmask_8259A(); > + > /* > * Set up local APIC timer on boot CPU. > */ > > > > -- Bob(Zhang LinBao) > Confucius said: "Worry not that others do not know you; worry that you do not know others." > "If not us, who ? if not now, when ?" 
> ESSN-MCBS linux kernel engineer -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Thu, 2012-09-20 at 12:50 +0300, Avi Kivity wrote: > On 09/20/2012 03:10 AM, Suresh Siddha wrote: > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > > index b06737d..8ff328b 100644 > > --- a/arch/x86/kvm/vmx.c > > +++ b/arch/x86/kvm/vmx.c > > @@ -1493,7 +1493,8 @@ static void __vmx_load_host_state(struct vcpu_vmx > > *vmx) > > #ifdef CONFIG_X86_64 > > wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base); > > #endif > > - if (user_has_fpu()) > > + /* Did the host task or the guest vcpu has FPU restored lazily? */ > > + if (!use_eager_fpu() && (user_has_fpu() || vmx->vcpu.guest_fpu_loaded)) > > clts(); > > Why do the clts() if guest_fpu_loaded()? > > An interrupt might arrive after this, look at TS > (interrupted_kernel_fpu_idle()), and stomp on the guest's fpu. Actually clts() is harmless, as the (read_cr0() & X86_CR0_TS) condition in interrupted_kernel_fpu_idle() will return false. But you raise a good point: any interrupt between the vmexit and the __vmx_load_host_state() can stomp on the guest FPU, as the vmexit was unconditionally setting the host's cr0.TS bit and, with kvm using kernel_fpu_begin/end(), !__thread_has_fpu(current) in interrupted_kernel_fpu_idle() will always be true. So the right thing to do here is to always have the cr0.TS bit clear during vmexit and set that bit back in __vmx_load_host_state() if the FPU state is not active. Appended the modified patch. thanks, suresh --8<-- From: Suresh Siddha Subject: x86, kvm: fix kvm's usage of kernel_fpu_begin/end() Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu() which can be very far apart. kvm_load/put_guest_fpu() routines are already called with preemption disabled and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). 
So introduce __kernel_fpu_begin/end() routines which don't touch preemption and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state. Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For eagerFPU case, host cr0.TS is always clear. So no need to worry about it. For the traditional lazyFPU restore case, change the cr0.TS bit for the host state during vm-exit to be always clear and cr0.TS bit is set in the __vmx_load_host_state() when the FPU (guest FPU or the host task's FPU) state is not active. This ensures that the host/guest FPU state is properly saved, restored during context-switch and with interrupts (using irq_fpu_usable()) not stomping on the active FPU state. Signed-off-by: Suresh Siddha --- arch/x86/include/asm/i387.h | 28 ++-- arch/x86/kernel/i387.c | 13 + arch/x86/kvm/vmx.c | 10 +++--- arch/x86/kvm/x86.c |4 ++-- 4 files changed, 40 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..ed8089d 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,32 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); + +/* + * Careful: __kernel_fpu_begin/end() must be called with preempt disabled + * and they don't touch the preempt state on their own. + * If you enable preemption after __kernel_fpu_begin(), preempt notifier + * should call the __kernel_fpu_end() to prevent the kernel/user FPU + * state from getting corrupted. KVM for example uses this model. + * + * All other cases use kernel_fpu_begin/end() which disable preemption + * during kernel FPU usage. 
+ */ +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end()
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 10:18 -0700, Suresh Siddha wrote: > These routines (kvm_load/put_guest_fpu()) are already called with > preemption disabled but as you mentioned, we don't want the preemption > to be disabled completely between the kvm_load_guest_fpu() and > kvm_put_guest_fpu(). > > Also KVM already has the preempt notifier which is doing the > kvm_put_guest_fpu(), so something like the appended should address this. > I will test this shortly. Appended the tested fix (one more VMX-based change was needed, as it fiddles with the host cr0.TS bit). Thanks. --8<-- From: Suresh Siddha Subject: x86, kvm: fix kvm's usage of kernel_fpu_begin/end() Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu() which can be very far apart. kvm_load/put_guest_fpu() routines are already called with preemption disabled and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). So introduce __kernel_fpu_begin/end() routines which don't touch preemption and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state. Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For eagerFPU case, host cr0.TS is always clear. So no need to worry about it. For the traditional lazyFPU restore case, cr0.TS bit is always set during vm-exit and depending on the guest FPU state and the host task's FPU state, cr0.TS bit is cleared when needed. 
Signed-off-by: Suresh Siddha --- arch/x86/include/asm/fpu-internal.h |5 - arch/x86/include/asm/i387.h | 28 ++-- arch/x86/include/asm/processor.h|5 + arch/x86/kernel/i387.c | 13 + arch/x86/kvm/vmx.c | 11 +-- arch/x86/kvm/x86.c |4 ++-- 6 files changed, 47 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 92f3c6e..a6b60c7 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -85,11 +85,6 @@ static inline int is_x32_frame(void) #define X87_FSW_ES (1 << 7)/* Exception Summary */ -static __always_inline __pure bool use_eager_fpu(void) -{ - return static_cpu_has(X86_FEATURE_EAGER_FPU); -} - static __always_inline __pure bool use_xsaveopt(void) { return static_cpu_has(X86_FEATURE_XSAVEOPT); diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..ed8089d 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,32 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); + +/* + * Careful: __kernel_fpu_begin/end() must be called with preempt disabled + * and they don't touch the preempt state on their own. + * If you enable preemption after __kernel_fpu_begin(), preempt notifier + * should call the __kernel_fpu_end() to prevent the kernel/user FPU + * state from getting corrupted. KVM for example uses this model. + * + * All other cases use kernel_fpu_begin/end() which disable preemption + * during kernel FPU usage. 
+ */ +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index b98c0d9..d0e9adb 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -402,6 +402,11 @@ struct fpu { union thread_xstate *state; }; +static __always_inline __pure bool use_eager_fpu(void) +{ + return static_cpu_has(X86_FEATURE_EAGER_FPU); +} + #ifdef CONFIG_X86_64 DECLARE_PER_CPU(struct orig_ist, orig_ist); diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end() */ } else if (!use_eager_fpu()) { this_cpu_write(fpu_owner_
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 20:22 +0300, Avi Kivity wrote: > On 09/19/2012 08:18 PM, Suresh Siddha wrote: > > > These routines (kvm_load/put_guest_fpu()) are already called with > > preemption disabled but as you mentioned, we don't want the preemption > > to be disabled completely between the kvm_load_guest_fpu() and > > kvm_put_guest_fpu(). > > > > Also KVM already has the preempt notifier which is doing the > > kvm_put_guest_fpu(), so something like the appended should address this. > > I will test this shortly. > > > > Note, we could also go in a different direction and make > kernel_fpu_begin() use preempt notifiers and thus make its users > preemptible. But that's for a separate patchset. yep, but we need the fpu buffer to save/restore the kernel fpu state. KVM already has those buffers allocated in the guest cpu state and hence it all works out ok. But yes, we can revisit this in future. thanks, suresh
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 13:13 +0300, Avi Kivity wrote: > On 08/25/2012 12:12 AM, Suresh Siddha wrote: > > kvm's guest fpu save/restore should be wrapped around > > kernel_fpu_begin/end(). This will avoid for example taking a DNA > > in kvm_load_guest_fpu() when it tries to load the fpu immediately > > after doing unlazy_fpu() on the host side. > > > > More importantly this will prevent the host process fpu from being > > corrupted. > > > > Signed-off-by: Suresh Siddha > > Cc: Avi Kivity > > --- > > arch/x86/kvm/x86.c |3 ++- > > 1 files changed, 2 insertions(+), 1 deletions(-) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 42bce48..67e773c 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -5969,7 +5969,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) > > */ > > kvm_put_guest_xcr0(vcpu); > > vcpu->guest_fpu_loaded = 1; > > - unlazy_fpu(current); > > + kernel_fpu_begin(); > > fpu_restore_checking(&vcpu->arch.guest_fpu); > > trace_kvm_fpu(1); > > This breaks kvm, since it disables preemption. What we want here is to > save the user fpu state if it was loaded, and do nothing if it wasn't. > Don't know what's the new API for that. These routines (kvm_load/put_guest_fpu()) are already called with preemption disabled but as you mentioned, we don't want the preemption to be disabled completely between the kvm_load_guest_fpu() and kvm_put_guest_fpu(). Also KVM already has the preempt notifier which is doing the kvm_put_guest_fpu(), so something like the appended should address this. I will test this shortly. 
Signed-off-by: Suresh Siddha --- arch/x86/include/asm/i387.h | 17 +++-- arch/x86/kernel/i387.c | 13 + arch/x86/kvm/x86.c |4 ++-- 3 files changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..29429b1 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,21 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end() */ } else if (!use_eager_fpu()) { this_cpu_write(fpu_owner_task, NULL); clts(); } } -EXPORT_SYMBOL(kernel_fpu_begin); +EXPORT_SYMBOL(__kernel_fpu_begin); -void kernel_fpu_end(void) +void __kernel_fpu_end(void) { if (use_eager_fpu()) math_state_restore(); else stts(); - preempt_enable(); } -EXPORT_SYMBOL(kernel_fpu_end); +EXPORT_SYMBOL(__kernel_fpu_end); void unlazy_fpu(struct task_struct *tsk) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3ddefb4..1f09552 100644 --- a/arch/x86/kvm/x86.c 
+++ b/arch/x86/kvm/x86.c @@ -5979,7 +5979,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu->guest_fpu_loaded = 1; - kernel_fpu_begin(); + __kernel_fpu_begin(); fpu_restore_checking(&vcpu->arch.guest_fpu); trace_kvm_fpu(1); } @@ -5993,7 +5993,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); - kernel_fpu_end(); + __kernel_fpu_end(); ++vcpu->stat.fpu_reload; kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0);
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 13:13 +0300, Avi Kivity wrote: On 08/25/2012 12:12 AM, Suresh Siddha wrote: kvm's guest fpu save/restore should be wrapped around kernel_fpu_begin/end(). This will avoid for example taking a DNA in kvm_load_guest_fpu() when it tries to load the fpu immediately after doing unlazy_fpu() on the host side. More importantly this will prevent the host process fpu from being corrupted. Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com Cc: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 42bce48..67e773c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5969,7 +5969,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu-guest_fpu_loaded = 1; - unlazy_fpu(current); + kernel_fpu_begin(); fpu_restore_checking(vcpu-arch.guest_fpu); trace_kvm_fpu(1); This breaks kvm, since it disables preemption. What we want here is to save the user fpu state if it was loaded, and do nothing if wasn't. Don't know what's the new API for that. These routines (kvm_load/put_guest_fpu()) are already called with preemption disabled but as you mentioned, we don't want the preemption to be disabled completely between the kvm_load_guest_fpu() and kvm_put_guest_fpu(). Also KVM already has the preempt notifier which is doing the kvm_put_guest_fpu(), so something like the appended should address this. I will test this shortly. 
Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com --- arch/x86/include/asm/i387.h | 17 +++-- arch/x86/kernel/i387.c | 13 + arch/x86/kvm/x86.c |4 ++-- 3 files changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..29429b1 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,21 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end() */ } else if (!use_eager_fpu()) { this_cpu_write(fpu_owner_task, NULL); clts(); } } -EXPORT_SYMBOL(kernel_fpu_begin); +EXPORT_SYMBOL(__kernel_fpu_begin); -void kernel_fpu_end(void) +void __kernel_fpu_end(void) { if (use_eager_fpu()) math_state_restore(); else stts(); - preempt_enable(); } -EXPORT_SYMBOL(kernel_fpu_end); +EXPORT_SYMBOL(__kernel_fpu_end); void unlazy_fpu(struct task_struct *tsk) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3ddefb4..1f09552 
100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5979,7 +5979,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu->guest_fpu_loaded = 1; - kernel_fpu_begin(); + __kernel_fpu_begin(); fpu_restore_checking(&vcpu->arch.guest_fpu); trace_kvm_fpu(1); } @@ -5993,7 +5993,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); - kernel_fpu_end(); + __kernel_fpu_end(); ++vcpu->stat.fpu_reload; kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 20:22 +0300, Avi Kivity wrote: On 09/19/2012 08:18 PM, Suresh Siddha wrote: These routines (kvm_load/put_guest_fpu()) are already called with preemption disabled but as you mentioned, we don't want the preemption to be disabled completely between the kvm_load_guest_fpu() and kvm_put_guest_fpu(). Also KVM already has the preempt notifier which is doing the kvm_put_guest_fpu(), so something like the appended should address this. I will test this shortly. Note, we could also go in a different direction and make kernel_fpu_begin() use preempt notifiers and thus make its users preemptible. But that's for a separate patchset. yep, but we need the fpu buffer to save/restore the kernel fpu state. KVM already has those buffers allocated in the guest cpu state and hence it all works out ok. But yes, we can revisit this in future. thanks, suresh
Re: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
On Wed, 2012-09-19 at 10:18 -0700, Suresh Siddha wrote: These routines (kvm_load/put_guest_fpu()) are already called with preemption disabled but as you mentioned, we don't want the preemption to be disabled completely between the kvm_load_guest_fpu() and kvm_put_guest_fpu(). Also KVM already has the preempt notifier which is doing the kvm_put_guest_fpu(), so something like the appended should address this. I will test this shortly. Appended the tested fix (one more VMX based change needed as it fiddles with cr0.TS host bit). Thanks. --8-- From: Suresh Siddha suresh.b.sid...@intel.com Subject: x86, kvm: fix kvm's usage of kernel_fpu_begin/end() Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu() which can be very far apart. kvm_load/put_guest_fpu() routines are already called with preemption disabled and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). So introduce __kernel_fpu_begin/end() routines which don't touch preemption and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state. Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For eagerFPU case, host cr0.TS is always clear. So no need to worry about it. For the traditional lazyFPU restore case, cr0.TS bit is always set during vm-exit and depending on the guest FPU state and the host task's FPU state, cr0.TS bit is cleared when needed. 
Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com --- arch/x86/include/asm/fpu-internal.h |5 - arch/x86/include/asm/i387.h | 28 ++-- arch/x86/include/asm/processor.h|5 + arch/x86/kernel/i387.c | 13 + arch/x86/kvm/vmx.c | 11 +-- arch/x86/kvm/x86.c |4 ++-- 6 files changed, 47 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 92f3c6e..a6b60c7 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -85,11 +85,6 @@ static inline int is_x32_frame(void) #define X87_FSW_ES (1 << 7)/* Exception Summary */ -static __always_inline __pure bool use_eager_fpu(void) -{ - return static_cpu_has(X86_FEATURE_EAGER_FPU); -} - static __always_inline __pure bool use_xsaveopt(void) { return static_cpu_has(X86_FEATURE_XSAVEOPT); diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h index 6c3bd37..ed8089d 100644 --- a/arch/x86/include/asm/i387.h +++ b/arch/x86/include/asm/i387.h @@ -24,8 +24,32 @@ extern int dump_fpu(struct pt_regs *, struct user_i387_struct *); extern void math_state_restore(void); extern bool irq_fpu_usable(void); -extern void kernel_fpu_begin(void); -extern void kernel_fpu_end(void); + +/* + * Careful: __kernel_fpu_begin/end() must be called with preempt disabled + * and they don't touch the preempt state on their own. + * If you enable preemption after __kernel_fpu_begin(), preempt notifier + * should call the __kernel_fpu_end() to prevent the kernel/user FPU + * state from getting corrupted. KVM for example uses this model. + * + * All other cases use kernel_fpu_begin/end() which disable preemption + * during kernel FPU usage. 
+ */ +extern void __kernel_fpu_begin(void); +extern void __kernel_fpu_end(void); + +static inline void kernel_fpu_begin(void) +{ + WARN_ON_ONCE(!irq_fpu_usable()); + preempt_disable(); + __kernel_fpu_begin(); +} + +static inline void kernel_fpu_end(void) +{ + __kernel_fpu_end(); + preempt_enable(); +} /* * Some instructions like VIA's padlock instructions generate a spurious diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index b98c0d9..d0e9adb 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -402,6 +402,11 @@ struct fpu { union thread_xstate *state; }; +static __always_inline __pure bool use_eager_fpu(void) +{ + return static_cpu_has(X86_FEATURE_EAGER_FPU); +} + #ifdef CONFIG_X86_64 DECLARE_PER_CPU(struct orig_ist, orig_ist); diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c index 6782e39..675a050 100644 --- a/arch/x86/kernel/i387.c +++ b/arch/x86/kernel/i387.c @@ -73,32 +73,29 @@ bool irq_fpu_usable(void) } EXPORT_SYMBOL(irq_fpu_usable); -void kernel_fpu_begin(void) +void __kernel_fpu_begin(void) { struct task_struct *me = current; - WARN_ON_ONCE(!irq_fpu_usable()); - preempt_disable(); if (__thread_has_fpu(me)) { __save_init_fpu(me); __thread_clear_has_fpu(me); - /* We do 'stts()' in kernel_fpu_end() */ + /* We do 'stts()' in __kernel_fpu_end() */ } else if (!use_eager_fpu()) { this_cpu_write(fpu_owner_task
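The patch above splits the API so that the plain kernel_fpu_begin/end() pair pins the task with preempt_disable/enable(), while the __-prefixed variants assume the caller is already non-preemptible (as KVM is, via its preempt notifier). As a minimal user-space sketch of that pairing contract — with hypothetical names, modelling the preempt counter and the "kernel owns the FPU" state as plain variables rather than the real per-CPU machinery:

```c
#include <assert.h>
#include <stdbool.h>

static int preempt_count;   /* models the task's preemption-disable nesting */
static bool in_kernel_fpu;  /* models "kernel is currently using the FPU"    */

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }

/* The __ variants only transfer FPU ownership; the caller must already
 * be non-preemptible (KVM guarantees this around vcpu entry/exit). */
static void __kernel_fpu_begin(void)
{
    assert(preempt_count > 0);  /* contract: caller disabled preemption */
    in_kernel_fpu = true;       /* stand-in for saving the user FPU state */
}

static void __kernel_fpu_end(void)
{
    in_kernel_fpu = false;      /* stand-in for restoring state / stts() */
}

/* The plain wrappers bundle the preemption handling for everyone else. */
static void kernel_fpu_begin(void)
{
    preempt_disable();
    __kernel_fpu_begin();
}

static void kernel_fpu_end(void)
{
    __kernel_fpu_end();
    preempt_enable();
}
```

The point of the split is visible in the model: a KVM-style user can keep FPU ownership across a preemptible region as long as its preempt notifier calls __kernel_fpu_end() before the task is switched out.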
[tip:x86/fpu] x86, fpu: remove cpu_has_xmm check in the fx_finit()
Commit-ID: a8615af4bc3621cb01096541dafa6f68352ec2d9 Gitweb: http://git.kernel.org/tip/a8615af4bc3621cb01096541dafa6f68352ec2d9 Author: Suresh Siddha AuthorDate: Mon, 10 Sep 2012 10:40:08 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:24 -0700 x86, fpu: remove cpu_has_xmm check in the fx_finit() No CPUs with FXSAVE but no XMM/MXCSR (Pentium II from Intel, Crusoe/TM-3xxx/5xxx from Transmeta, and presumably some of the K6 generation from AMD) ever looked at the mxcsr field during fxrstor/fxsave. So remove the cpu_has_xmm check in the fx_finit(). Reported-by: Al Viro Acked-by: H. Peter Anvin Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1347300665-6209-6-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 0ca72f0..92f3c6e 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -109,8 +109,7 @@ static inline void fx_finit(struct i387_fxsave_struct *fx) { memset(fx, 0, xstate_size); fx->cwd = 0x37f; - if (cpu_has_xmm) - fx->mxcsr = MXCSR_DEFAULT; + fx->mxcsr = MXCSR_DEFAULT; } extern void __sanitize_i387_state(struct task_struct *);
[tip:x86/fpu] x86, fpu: make eagerfpu= boot param tri-state
Commit-ID: e00229819f306b1f86134095347e9187dc346bd1 Gitweb: http://git.kernel.org/tip/e00229819f306b1f86134095347e9187dc346bd1 Author: Suresh Siddha AuthorDate: Mon, 10 Sep 2012 10:32:32 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:24 -0700 x86, fpu: make eagerfpu= boot param tri-state Add the "eagerfpu=auto" (that selects the default scheme in enabling eagerfpu) which can override compiled-in boot parameters like "eagerfpu=on/off" (that force enable/disable eagerfpu). Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1347300665-6209-5-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- Documentation/kernel-parameters.txt |4 +++- arch/x86/kernel/xsave.c | 17 - 2 files changed, 15 insertions(+), 6 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e8f7faa..46a6a82 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1834,8 +1834,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. enabling legacy floating-point and sse state. eagerfpu= [X86] - on enable eager fpu restore (default for xsaveopt) + on enable eager fpu restore off disable eager fpu restore + auto selects the default scheme, which automatically + enables eagerfpu restore for xsaveopt. 
nohlt [BUGS=ARM,SH] Tells the kernel that the sleep(SH) or wfi(ARM) instruction doesn't work correctly and not to diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index e99f754..4e89b3d 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -508,13 +508,15 @@ static void __init setup_init_fpu_buf(void) xsave_state(init_xstate_buf, -1); } -static int disable_eagerfpu; +static enum { AUTO, ENABLE, DISABLE } eagerfpu = AUTO; static int __init eager_fpu_setup(char *s) { if (!strcmp(s, "on")) - setup_force_cpu_cap(X86_FEATURE_EAGER_FPU); + eagerfpu = ENABLE; else if (!strcmp(s, "off")) - disable_eagerfpu = 1; + eagerfpu = DISABLE; + else if (!strcmp(s, "auto")) + eagerfpu = AUTO; return 1; } __setup("eagerfpu=", eager_fpu_setup); @@ -557,8 +559,9 @@ static void __init xstate_enable_boot_cpu(void) prepare_fx_sw_frame(); setup_init_fpu_buf(); - if (cpu_has_xsaveopt && !disable_eagerfpu) - setup_force_cpu_cap(X86_FEATURE_EAGER_FPU); + /* Auto enable eagerfpu for xsaveopt */ + if (cpu_has_xsaveopt && eagerfpu != DISABLE) + eagerfpu = ENABLE; pr_info("enabled xstate_bv 0x%llx, cntxt size 0x%x\n", pcntxt_mask, xstate_size); @@ -598,6 +601,10 @@ void __cpuinit eager_fpu_init(void) clear_used_math(); current_thread_info()->status = 0; + + if (eagerfpu == ENABLE) + setup_force_cpu_cap(X86_FEATURE_EAGER_FPU); + if (!cpu_has_eager_fpu) { stts(); return;
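The tri-state logic in the patch above is small enough to model stand-alone. A sketch of the two stages — parsing the eagerfpu= boot parameter, then resolving AUTO at boot based on xsaveopt support — using approximate enum names, not the kernel's own symbols:

```c
#include <string.h>

/* Hypothetical stand-alone version of eager_fpu_setup(): "on"/"off"
 * force the policy, "auto" (also the default) defers the decision. */
enum eagerfpu_policy { EAGERFPU_AUTO, EAGERFPU_ENABLE, EAGERFPU_DISABLE };

static enum eagerfpu_policy parse_eagerfpu(const char *s)
{
    if (!strcmp(s, "on"))
        return EAGERFPU_ENABLE;
    if (!strcmp(s, "off"))
        return EAGERFPU_DISABLE;
    return EAGERFPU_AUTO;   /* "auto" and anything unrecognized */
}

/* Mirrors the xstate_enable_boot_cpu() step: AUTO resolves to ENABLE
 * only when the CPU supports xsaveopt; an explicit "off" always wins. */
static enum eagerfpu_policy resolve_eagerfpu(enum eagerfpu_policy p,
                                             int cpu_has_xsaveopt)
{
    if (p == EAGERFPU_AUTO && cpu_has_xsaveopt)
        return EAGERFPU_ENABLE;
    return p;
}
```

The design point is that the command line no longer force-sets the CPU capability directly; the synthetic X86_FEATURE_EAGER_FPU flag is set in one place (eager_fpu_init()) only after the policy has been fully resolved.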
[tip:x86/fpu] x86, fpu: decouple non-lazy/ eager fpu restore from xsave
Commit-ID: 5d2bd7009f306c82afddd1ca4d9763ad8473c216 Gitweb: http://git.kernel.org/tip/5d2bd7009f306c82afddd1ca4d9763ad8473c216 Author: Suresh Siddha AuthorDate: Thu, 6 Sep 2012 14:58:52 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:22 -0700 x86, fpu: decouple non-lazy/eager fpu restore from xsave Decouple non-lazy/eager fpu restore policy from the existence of the xsave feature. Introduce a synthetic CPUID flag to represent the eagerfpu policy. "eagerfpu=on" boot paramter will enable the policy. Requested-by: H. Peter Anvin Requested-by: Linus Torvalds Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1347300665-6209-2-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- Documentation/kernel-parameters.txt |4 ++ arch/x86/include/asm/cpufeature.h |2 + arch/x86/include/asm/fpu-internal.h | 54 -- arch/x86/kernel/cpu/common.c|2 - arch/x86/kernel/i387.c | 25 +++--- arch/x86/kernel/process.c |2 +- arch/x86/kernel/traps.c |2 +- arch/x86/kernel/xsave.c | 87 +++ 8 files changed, 112 insertions(+), 66 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index ad7e2e5..741d064 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1833,6 +1833,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. and restore using xsave. The kernel will fallback to enabling legacy floating-point and sse state. + eagerfpu= [X86] + on enable eager fpu restore + off disable eager fpu restore + nohlt [BUGS=ARM,SH] Tells the kernel that the sleep(SH) or wfi(ARM) instruction doesn't work correctly and not to use it. This is also useful when using JTAG debugger. 
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 6b7ee5f..5dd2b47 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -97,6 +97,7 @@ #define X86_FEATURE_EXTD_APICID(3*32+26) /* has extended APICID (8 bits) */ #define X86_FEATURE_AMD_DCM (3*32+27) /* multi-node processor */ #define X86_FEATURE_APERFMPERF (3*32+28) /* APERFMPERF */ +#define X86_FEATURE_EAGER_FPU (3*32+29) /* "eagerfpu" Non lazy FPU restore */ /* Intel-defined CPU features, CPUID level 0x0001 (ecx), word 4 */ #define X86_FEATURE_XMM3 (4*32+ 0) /* "pni" SSE-3 */ @@ -305,6 +306,7 @@ extern const char * const x86_power_flags[32]; #define cpu_has_perfctr_core boot_cpu_has(X86_FEATURE_PERFCTR_CORE) #define cpu_has_cx8boot_cpu_has(X86_FEATURE_CX8) #define cpu_has_cx16 boot_cpu_has(X86_FEATURE_CX16) +#define cpu_has_eager_fpu boot_cpu_has(X86_FEATURE_EAGER_FPU) #if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64) # define cpu_has_invlpg1 diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 8ca0f9f..0ca72f0 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -38,6 +38,7 @@ int ia32_setup_frame(int sig, struct k_sigaction *ka, extern unsigned int mxcsr_feature_mask; extern void fpu_init(void); +extern void eager_fpu_init(void); DECLARE_PER_CPU(struct task_struct *, fpu_owner_task); @@ -84,6 +85,11 @@ static inline int is_x32_frame(void) #define X87_FSW_ES (1 << 7)/* Exception Summary */ +static __always_inline __pure bool use_eager_fpu(void) +{ + return static_cpu_has(X86_FEATURE_EAGER_FPU); +} + static __always_inline __pure bool use_xsaveopt(void) { return static_cpu_has(X86_FEATURE_XSAVEOPT); @@ -99,6 +105,14 @@ static __always_inline __pure bool use_fxsr(void) return static_cpu_has(X86_FEATURE_FXSR); } +static inline void fx_finit(struct i387_fxsave_struct *fx) +{ + memset(fx, 0, xstate_size); + fx->cwd = 0x37f; + if (cpu_has_xmm) + fx->mxcsr = 
MXCSR_DEFAULT; +} + extern void __sanitize_i387_state(struct task_struct *); static inline void sanitize_i387_state(struct task_struct *tsk) @@ -291,13 +305,13 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk) static inline void __thread_fpu_end(struct task_struct *tsk) { __thread_clear_has_fpu(tsk); - if (!use_xsave()) + if (!use_eager_fpu()) stts(); } static inline void __thread_fpu_begin(struct task_struct *tsk) { - if (!use_xsave()) + if (!use_eager_fpu()) clts(); __thread_set_has_fpu(tsk); } @@ -327,10 +341,14 @@ static inline void drop_fpu(struct task_struct *tsk) static inline void drop_init_fpu(struct t
[tip:x86/fpu] x86, fpu: use non-lazy fpu restore for processors supporting xsave
Commit-ID: 304bceda6a18ae0b0240b8aac9a6bdf8ce2d2469 Gitweb: http://git.kernel.org/tip/304bceda6a18ae0b0240b8aac9a6bdf8ce2d2469 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:13:02 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:11 -0700 x86, fpu: use non-lazy fpu restore for processors supporting xsave Fundamental model of the current Linux kernel is to lazily init and restore FPU instead of restoring the task state during context switch. This changes that fundamental lazy model to the non-lazy model for the processors supporting xsave feature. Reasons driving this model change are: i. Newer processors support optimized state save/restore using xsaveopt and xrstor by tracking the INIT state and MODIFIED state during context-switch. This is faster than modifying the cr0.TS bit which has serializing semantics. ii. Newer glibc versions use SSE for some of the optimized copy/clear routines. With certain workloads (like boot, kernel-compilation etc), application completes its work with in the first 5 task switches, thus taking upto 5 #DNA traps with the kernel not getting a chance to apply the above mentioned pre-load heuristic. iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit and thus will not work correctly in the presence of lazy restore. Non-lazy state restore is needed for enabling such features. Some data on a two socket SNB system: * Saved 20K DNA exceptions during boot on a two socket SNB system. * Saved 50K DNA exceptions during kernel-compilation workload. * Improved throughput of the AVX based checksumming function inside the kernel by ~15% as xsave/xrstor is faster than the serializing clts/stts pair. Also now kernel_fpu_begin/end() relies on the patched alternative instructions. So move check_fpu() which uses the kernel_fpu_begin/end() after alternative_instructions(). 
Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-7-git-send-email-suresh.b.sid...@intel.com Merge 32-bit boot fix from, Link: http://lkml.kernel.org/r/1347300665-6209-4-git-send-email-suresh.b.sid...@intel.com Cc: Jim Kukunas Cc: NeilBrown Cc: Avi Kivity Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h | 96 +++ arch/x86/include/asm/i387.h |1 + arch/x86/include/asm/xsave.h|1 + arch/x86/kernel/cpu/bugs.c |7 ++- arch/x86/kernel/i387.c | 20 ++- arch/x86/kernel/process.c | 12 +++-- arch/x86/kernel/process_32.c|4 -- arch/x86/kernel/process_64.c|4 -- arch/x86/kernel/traps.c |5 ++- arch/x86/kernel/xsave.c | 57 + 10 files changed, 146 insertions(+), 61 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 52202a6..8ca0f9f 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -291,15 +291,48 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk) static inline void __thread_fpu_end(struct task_struct *tsk) { __thread_clear_has_fpu(tsk); - stts(); + if (!use_xsave()) + stts(); } static inline void __thread_fpu_begin(struct task_struct *tsk) { - clts(); + if (!use_xsave()) + clts(); __thread_set_has_fpu(tsk); } +static inline void __drop_fpu(struct task_struct *tsk) +{ + if (__thread_has_fpu(tsk)) { + /* Ignore delayed exceptions from user space */ + asm volatile("1: fwait\n" +"2:\n" +_ASM_EXTABLE(1b, 2b)); + __thread_fpu_end(tsk); + } +} + +static inline void drop_fpu(struct task_struct *tsk) +{ + /* +* Forget coprocessor state.. +*/ + preempt_disable(); + tsk->fpu_counter = 0; + __drop_fpu(tsk); + clear_used_math(); + preempt_enable(); +} + +static inline void drop_init_fpu(struct task_struct *tsk) +{ + if (!use_xsave()) + drop_fpu(tsk); + else + xrstor_state(init_xstate_buf, -1); +} + /* * FPU state switching for scheduling. 
* @@ -333,7 +366,12 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta { fpu_switch_t fpu; - fpu.preload = tsk_used_math(new) && new->fpu_counter > 5; + /* +* If the task has used the math, pre-load the FPU on xsave processors +* or if the past 5 consecutive context-switches used math. +*/ + fpu.preload = tsk_used_math(new) && (use_xsave() || +new->fpu_counter > 5); if (__thread_has_fpu(old)) { if (!__save_init_fpu(old)) cpu = ~0; @@ -345,14 +383,14 @@ static inline fpu_switch_t swi
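The switch_fpu_prepare() hunk above changes the preload heuristic: under the eager/xsave model the FPU is always preloaded for any task that has used math, while the lazy model keeps the old "5 consecutive math-using context switches" threshold. A one-function sketch of that decision (hypothetical name, booleans in place of the real task/CPU state):

```c
#include <stdbool.h>

/* Hypothetical model of the fpu.preload decision in switch_fpu_prepare():
 * preload if the incoming task has used math AND either the eager/xsave
 * model is in effect or the task used math in >5 consecutive switches. */
static bool fpu_preload(bool tsk_used_math, bool use_xsave, int fpu_counter)
{
    return tsk_used_math && (use_xsave || fpu_counter > 5);
}
```

This is why the commit's benchmark numbers work out: on xsave hardware the restore happens unconditionally at context switch (cheap via xsaveopt's init/modified tracking), instead of deferring to a #NM trap that short-lived tasks would otherwise keep taking.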
[tip:x86/fpu] lguest, x86: handle guest TS bit for lazy/ non-lazy fpu host models
Commit-ID: 9c6ff8bbb69a4e7b47ac40bfa44509296e89c5c0 Gitweb: http://git.kernel.org/tip/9c6ff8bbb69a4e7b47ac40bfa44509296e89c5c0 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:13:01 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:09 -0700 lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models Instead of using unlazy_fpu() check if user_has_fpu() and set/clear the host TS bits so that the lguest works fine with both the lazy/non-lazy FPU host models with minimal changes. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-6-git-send-email-suresh.b.sid...@intel.com Cc: Rusty Russell Signed-off-by: H. Peter Anvin --- drivers/lguest/x86/core.c | 10 +++--- 1 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/lguest/x86/core.c b/drivers/lguest/x86/core.c index 39809035..4af12e1 100644 --- a/drivers/lguest/x86/core.c +++ b/drivers/lguest/x86/core.c @@ -203,8 +203,8 @@ void lguest_arch_run_guest(struct lg_cpu *cpu) * we set it now, so we can trap and pass that trap to the Guest if it * uses the FPU. */ - if (cpu->ts) - unlazy_fpu(current); + if (cpu->ts && user_has_fpu()) + stts(); /* * SYSENTER is an optimized way of doing system calls. We can't allow @@ -234,6 +234,10 @@ void lguest_arch_run_guest(struct lg_cpu *cpu) if (boot_cpu_has(X86_FEATURE_SEP)) wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0); + /* Clear the host TS bit if it was set above. */ + if (cpu->ts && user_has_fpu()) + clts(); + /* * If the Guest page faulted, then the cr2 register will tell us the * bad virtual address. We have to grab this now, because once we @@ -249,7 +253,7 @@ void lguest_arch_run_guest(struct lg_cpu *cpu) * a different CPU. So all the critical stuff should be done * before this. 
*/ - else if (cpu->regs->trapnum == 7) + else if (cpu->regs->trapnum == 7 && !user_has_fpu()) math_state_restore(); }
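The lguest change above reduces to two small predicates: when to set the host's cr0.TS around the guest run, and when a guest #NM (trap 7) should be handled by the host. A sketch with hypothetical function names standing in for the real lguest state:

```c
#include <stdbool.h>

/* Hypothetical model of lguest_arch_run_guest()'s TS handling: host
 * cr0.TS is set across the guest run only when the guest's virtual TS
 * is set AND the host task's FPU state is live in the registers
 * (user_has_fpu()); otherwise there is nothing to protect. */
static bool host_should_set_ts(bool guest_ts, bool user_has_fpu)
{
    return guest_ts && user_has_fpu;
}

/* A guest #NM (trap 7) is forwarded to math_state_restore() only when
 * the host FPU is not already live; if it is live, TS was set purely
 * for the guest's benefit and the trap belongs to the guest. */
static bool host_handles_dna(int trapnum, bool user_has_fpu)
{
    return trapnum == 7 && !user_has_fpu;
}
```

Keying both decisions on user_has_fpu() rather than unlazy_fpu() is what lets the same code work under both the lazy and the eager host FPU model, since the eager model never leaves TS set on the host.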
[tip:x86/fpu] x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
Commit-ID: 841e3604d35aa70d399146abdc526d8c89a2c2f5 Gitweb: http://git.kernel.org/tip/841e3604d35aa70d399146abdc526d8c89a2c2f5 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:13:00 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:08 -0700 x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and saving/restoring just the few used xmm/ymm registers. This has some advantages like: * If the task's FPU state is already active, then kernel_fpu_begin() will just save the user-state and avoiding the read/write of cr0. In general, cr0 accesses are much slower. * Manual save/restore of xmm/ymm registers will affect the 'modified' and the 'init' optimizations brought in the by xsaveopt/xrstor infrastructure. * Foward compatibility with future vector register extensions will be a problem if the xmm/ymm registers are manually saved and restored (corrupting the extended state of those vector registers). With this patch, there was no significant difference in the xor throughput using AVX, measured during boot. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-5-git-send-email-suresh.b.sid...@intel.com Cc: Jim Kukunas Cc: NeilBrown Signed-off-by: H. 
Peter Anvin --- arch/x86/include/asm/xor_32.h | 56 +--- arch/x86/include/asm/xor_64.h | 61 ++-- arch/x86/include/asm/xor_avx.h | 54 --- 3 files changed, 29 insertions(+), 142 deletions(-) diff --git a/arch/x86/include/asm/xor_32.h b/arch/x86/include/asm/xor_32.h index 4545708..aabd585 100644 --- a/arch/x86/include/asm/xor_32.h +++ b/arch/x86/include/asm/xor_32.h @@ -534,38 +534,6 @@ static struct xor_block_template xor_block_p5_mmx = { * Copyright (C) 1999 Zach Brown (with obvious credit due Ingo) */ -#define XMMS_SAVE \ -do { \ - preempt_disable(); \ - cr0 = read_cr0(); \ - clts(); \ - asm volatile( \ - "movups %%xmm0,(%0) ;\n\t" \ - "movups %%xmm1,0x10(%0) ;\n\t" \ - "movups %%xmm2,0x20(%0) ;\n\t" \ - "movups %%xmm3,0x30(%0) ;\n\t" \ - : \ - : "r" (xmm_save)\ - : "memory");\ -} while (0) - -#define XMMS_RESTORE \ -do { \ - asm volatile( \ - "sfence ;\n\t" \ - "movups (%0),%%xmm0 ;\n\t" \ - "movups 0x10(%0),%%xmm1 ;\n\t" \ - "movups 0x20(%0),%%xmm2 ;\n\t" \ - "movups 0x30(%0),%%xmm3 ;\n\t" \ - : \ - : "r" (xmm_save)\ - : "memory");\ - write_cr0(cr0); \ - preempt_enable(); \ -} while (0) - -#define ALIGN16 __attribute__((aligned(16))) - #define OFFS(x)"16*("#x")" #define PF_OFFS(x) "256+16*("#x")" #definePF0(x) " prefetchnta "PF_OFFS(x)"(%1) ;\n" @@ -587,10 +555,8 @@ static void xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) { unsigned long lines = bytes >> 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -633,7 +599,7 @@ xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) : : "memory"); - XMMS_RESTORE; + kernel_fpu_end(); } static void @@ -641,10 +607,8 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2, unsigned long *p3) { unsigned long lines = bytes >> 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -694,7 +658,7 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long 
*p2, : : "memory" ); - XMMS_RESTORE; + kernel_fpu_end(); } static void @@ -702,10 +666,8 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2, unsigned long *p3, unsigned long *p4) { unsigned long lines = bytes >> 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin();
[tip:x86/fpu] x86, kvm: use kernel_fpu_begin/end() in kvm_load/ put_guest_fpu()
Commit-ID: 9c1c3fac53378c9782c18f80107965578d7b7167 Gitweb: http://git.kernel.org/tip/9c1c3fac53378c9782c18f80107965578d7b7167 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:12:59 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:07 -0700 x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu() kvm's guest fpu save/restore should be wrapped around kernel_fpu_begin/end(). This will avoid for example taking a DNA in kvm_load_guest_fpu() when it tries to load the fpu immediately after doing unlazy_fpu() on the host side. More importantly this will prevent the host process fpu from being corrupted. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-4-git-send-email-suresh.b.sid...@intel.com Cc: Avi Kivity Signed-off-by: H. Peter Anvin --- arch/x86/kvm/x86.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 148ed66..cf637f5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5972,7 +5972,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu->guest_fpu_loaded = 1; - unlazy_fpu(current); + kernel_fpu_begin(); fpu_restore_checking(&vcpu->arch.guest_fpu); trace_kvm_fpu(1); } @@ -5986,6 +5986,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); + kernel_fpu_end(); ++vcpu->stat.fpu_reload; kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0);
[tip:x86/fpu] x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
Commit-ID: 377ffbcc536a5adc077395163ab149c02610 Gitweb: http://git.kernel.org/tip/377ffbcc536a5adc077395163ab149c02610 Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:12:58 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:06 -0700 x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig() Few lines below we do drop_fpu() which is more safer. Remove the unnecessary user_fpu_end() in save_xstate_sig(), which allows the drop_fpu() to ignore any pending exceptions from the user-space and drop the current fpu. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-3-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h | 17 +++-- arch/x86/kernel/xsave.c |1 - 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 78169d1..52202a6 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -412,22 +412,11 @@ static inline void __drop_fpu(struct task_struct *tsk) } /* - * The actual user_fpu_begin/end() functions - * need to be preemption-safe. + * Need to be preemption-safe. * - * NOTE! user_fpu_end() must be used only after you - * have saved the FP state, and user_fpu_begin() must - * be used only immediately before restoring it. - * These functions do not do any save/restore on - * their own. + * NOTE! user_fpu_begin() must be used only immediately before restoring + * it. This function does not do any save/restore on their own. 
*/ -static inline void user_fpu_end(void) -{ - preempt_disable(); - __thread_fpu_end(current); - preempt_enable(); -} - static inline void user_fpu_begin(void) { preempt_disable(); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 07ddc87..4ac5f2e 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -255,7 +255,6 @@ int save_xstate_sig(void __user *buf, void __user *buf_fx, int size) /* Update the thread's fxstate to save the fsave header. */ if (ia32_fxstate) fpu_fxsave(&tsk->thread.fpu); - user_fpu_end(); } else { sanitize_i387_state(tsk); if (__copy_to_user(buf_fx, xsave, xstate_size))
[tip:x86/fpu] x86, fpu: drop_fpu() before restoring new state from sigframe
Commit-ID: e962591749dfd4df9fea2c530ed7a3cfed50e5aa Gitweb: http://git.kernel.org/tip/e962591749dfd4df9fea2c530ed7a3cfed50e5aa Author: Suresh Siddha AuthorDate: Fri, 24 Aug 2012 14:12:57 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:52:05 -0700 x86, fpu: drop_fpu() before restoring new state from sigframe No need to save the state with unlazy_fpu(), that is about to get overwritten by the state from the signal frame. Instead use drop_fpu() and continue to restore the new state. Also fold the stop_fpu_preload() into drop_fpu(). Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1345842782-24175-2-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h |7 +-- arch/x86/kernel/xsave.c |8 +++- 2 files changed, 4 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 4fbb419..78169d1 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -448,17 +448,12 @@ static inline void save_init_fpu(struct task_struct *tsk) preempt_enable(); } -static inline void stop_fpu_preload(struct task_struct *tsk) -{ - tsk->fpu_counter = 0; -} - static inline void drop_fpu(struct task_struct *tsk) { /* * Forget coprocessor state.. 
*/ - stop_fpu_preload(tsk); + tsk->fpu_counter = 0; preempt_disable(); __drop_fpu(tsk); preempt_enable(); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 0923d27..07ddc87 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -382,16 +382,14 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size) struct xsave_struct *xsave = &tsk->thread.fpu.state->xsave; struct user_i387_ia32_struct env; - stop_fpu_preload(tsk); - unlazy_fpu(tsk); + drop_fpu(tsk); if (__copy_from_user(xsave, buf_fx, state_size) || - __copy_from_user(&env, buf, sizeof(env))) { - drop_fpu(tsk); + __copy_from_user(&env, buf, sizeof(env))) return -1; - } sanitize_restored_xstate(tsk, &env, xstate_bv, fx_only); + set_used_math(); } else { /* * For 64-bit frames and 32-bit fsave frames, restore the user
[tip:x86/fpu] x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
Commit-ID: 72a671ced66db6d1c2bfff1c930a101ac8d08204 Gitweb: http://git.kernel.org/tip/72a671ced66db6d1c2bfff1c930a101ac8d08204 Author: Suresh Siddha AuthorDate: Tue, 24 Jul 2012 16:05:29 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:51:48 -0700 x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels Currently for x86 and x86_32 binaries, fpstate in the user sigframe is copied to/from the fpstate in the task struct. And in the case of signal delivery for x86_64 binaries, if the fpstate is live in the CPU registers, then the live state is copied directly to the user sigframe. Otherwise fpstate in the task struct is copied to the user sigframe. During restore, fpstate in the user sigframe is restored directly to the live CPU registers. Historically, different code paths led to different bugs. For example, the x86_64 code path was not preemption safe until recently. Also there is a lot of code duplication for support of new features like xsave etc. Unify signal handling code paths for x86 and x86_64 kernels. The new strategy is as follows: Signal delivery: Both for 32/64-bit frames, align the core math frame area to 64 bytes as needed by xsave (this is where the main fpu/extended state gets copied to; it excludes the legacy compatibility fsave header for the 32-bit [f]xsave frames). If the state is live, copy the register state directly to the user frame. If not live, copy the state in the thread struct to the user frame. And for 32-bit [f]xsave frames, construct the fsave header separately before the actual [f]xsave area. Signal return: As the 32-bit frames with [f]xstate have an additional 'fsave' header, copy everything back from the user sigframe to the fpstate in the task structure and reconstruct the fxstate from the 'fsave' header (also, user-passed pointers may not be correctly aligned for any attempt to directly restore any partial state). At the next fpstate usage, everything will be restored to the live CPU registers. 
For all the 64-bit frames and the 32-bit fsave frame, restore the state from the user sigframe directly to the live CPU registers. 64-bit signals always restored the math frame directly, so we can expect the math frame pointer to be correctly aligned. For 32-bit fsave frames, there are no alignment requirements, so we can restore the state directly. "lat_sig catch" microbenchmark numbers (for x86, x86_64, x86_32 binaries) are within the noise range with this change. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1343171129-2747-4-git-send-email-suresh.b.sid...@intel.com [ Merged in compilation fix ] Link: http://lkml.kernel.org/r/1344544736.8326.17.ca...@sbsiddha-desk.sc.intel.com Signed-off-by: H. Peter Anvin --- arch/x86/ia32/ia32_signal.c |9 +- arch/x86/include/asm/fpu-internal.h | 111 ++ arch/x86/include/asm/xsave.h|6 +- arch/x86/kernel/i387.c | 246 + arch/x86/kernel/process.c | 10 - arch/x86/kernel/ptrace.c|3 - arch/x86/kernel/signal.c| 15 +- arch/x86/kernel/xsave.c | 432 +-- 8 files changed, 348 insertions(+), 484 deletions(-) diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 452d4dd..8c77c64 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -251,7 +251,7 @@ static int ia32_restore_sigcontext(struct pt_regs *regs, get_user_ex(tmp, &sc->fpstate); buf = compat_ptr(tmp); - err |= restore_i387_xstate_ia32(buf); + err |= restore_xstate_sig(buf, 1); get_user_ex(*pax, &sc->ax); } get_user_catch(err); @@ -382,9 +382,12 @@ static void __user *get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, sp = (unsigned long) ka->sa.sa_restorer; if (used_math()) { - sp = sp - sig_xstate_ia32_size; + unsigned long fx_aligned, math_size; + + sp = alloc_mathframe(sp, 1, &fx_aligned, &math_size); *fpstate = (struct _fpstate_ia32 __user *) sp; - if (save_i387_xstate_ia32(*fpstate) < 0) + if (save_xstate_sig(*fpstate, (void __user *)fx_aligned, + math_size) < 0) return (void __user *) -1L; } diff --git 
a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 016acb3..4fbb419 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -22,11 +22,30 @@ #include <asm/uaccess.h> #include <asm/xsave.h> -extern unsigned int sig_xstate_size; +#ifdef CONFIG_X86_64 +# include <asm/sigcontext32.h> +# include <asm/user32.h> +int ia32_setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, + compat_sigset_t *set, struct pt_regs *regs); +int ia32_setup_frame(int sig, struct k_sigaction *ka, +compat_sigset_t *
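The "align the core math frame area to 64 bytes" step that alloc_mathframe() performs during signal delivery boils down to carving the frame out of the user stack pointer and rounding down. A hypothetical userspace sketch of just that arithmetic (align_mathframe() is an invented stand-in; the real helper also computes the fsave-header split for 32-bit frames):

```c
#include <assert.h>
#include <stdint.h>

/* Carve 'frame_size' bytes for the xsave area out of the user stack
 * pointer and round down to the 64-byte boundary xsave requires.
 * Illustrative stand-in for part of the kernel's alloc_mathframe(). */
static uint64_t align_mathframe(uint64_t sp, uint64_t frame_size)
{
	sp -= frame_size;
	return sp & ~(uint64_t)63;   /* 64-byte aligned, never above sp */
}
```

Rounding down (never up) matters here: the aligned frame must not overlap stack data above the original stack pointer.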
[tip:x86/fpu] x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
Commit-ID: 0ca5bd0d886578ad0afeceaa83458c0f35cb3c6b Gitweb: http://git.kernel.org/tip/0ca5bd0d886578ad0afeceaa83458c0f35cb3c6b Author: Suresh Siddha AuthorDate: Tue, 24 Jul 2012 16:05:28 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:51:26 -0700 x86, fpu: Consolidate inline asm routines for saving/restoring fpu state Consolidate x86, x86_64 inline asm routines saving/restoring fpu state using config_enabled(). Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1343171129-2747-3-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h | 182 +++ arch/x86/include/asm/xsave.h|6 +- arch/x86/kernel/xsave.c |4 +- 3 files changed, 80 insertions(+), 112 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 6f59543..016acb3 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -97,34 +97,24 @@ static inline void sanitize_i387_state(struct task_struct *tsk) __sanitize_i387_state(tsk); } -#ifdef CONFIG_X86_64 -static inline int fxrstor_checking(struct i387_fxsave_struct *fx) -{ - int err; - - /* See comment in fxsave() below. */ -#ifdef CONFIG_AS_FXSAVEQ - asm volatile("1: fxrstorq %[fx]\n\t" -"2:\n" -".section .fixup,\"ax\"\n" -"3: movl $-1,%[err]\n" -"jmp 2b\n" -".previous\n" -_ASM_EXTABLE(1b, 3b) -: [err] "=r" (err) -: [fx] "m" (*fx), "0" (0)); -#else - asm volatile("1: rex64/fxrstor (%[fx])\n\t" -"2:\n" -".section .fixup,\"ax\"\n" -"3: movl $-1,%[err]\n" -"jmp 2b\n" -".previous\n" -_ASM_EXTABLE(1b, 3b) -: [err] "=r" (err) -: [fx] "R" (fx), "m" (*fx), "0" (0)); -#endif - return err; +#define check_insn(insn, output, input...) 
\ +({ \ + int err;\ + asm volatile("1:" #insn "\n\t" \ +"2:\n" \ +".section .fixup,\"ax\"\n" \ +"3: movl $-1,%[err]\n"\ +"jmp 2b\n"\ +".previous\n" \ +_ASM_EXTABLE(1b, 3b) \ +: [err] "=r" (err), output \ +: "0"(0), input); \ + err;\ +}) + +static inline int fsave_user(struct i387_fsave_struct __user *fx) +{ + return check_insn(fnsave %[fx]; fwait, [fx] "=m" (*fx), "m" (*fx)); } static inline int fxsave_user(struct i387_fxsave_struct __user *fx) @@ -140,90 +130,73 @@ static inline int fxsave_user(struct i387_fxsave_struct __user *fx) if (unlikely(err)) return -EFAULT; - /* See comment in fxsave() below. */ -#ifdef CONFIG_AS_FXSAVEQ - asm volatile("1: fxsaveq %[fx]\n\t" -"2:\n" -".section .fixup,\"ax\"\n" -"3: movl $-1,%[err]\n" -"jmp 2b\n" -".previous\n" -_ASM_EXTABLE(1b, 3b) -: [err] "=r" (err), [fx] "=m" (*fx) -: "0" (0)); -#else - asm volatile("1: rex64/fxsave (%[fx])\n\t" -"2:\n" -".section .fixup,\"ax\"\n" -"3: movl $-1,%[err]\n" -"jmp 2b\n" -".previous\n" -_ASM_EXTABLE(1b, 3b) -: [err] "=r" (err), "=m" (*fx) -: [fx] "R" (fx), "0" (0)); -#endif - if (unlikely(err) && - __clear_user(fx, sizeof(struct i387_fxsave_struct))) - err = -EFAULT; - /
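The same commit replaces the #ifdef'd save variants with one body that branches on config_enabled(CONFIG_...), a compile-time 0/1 constant the optimizer folds so the dead variants disappear entirely. A userspace stand-in for that selection pattern (the enum constants are invented; the kernel's config_enabled() inspects Kconfig-generated macros):

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for Kconfig results; in the kernel these come from
 * config_enabled(CONFIG_X86_32) and config_enabled(CONFIG_AS_FXSAVEQ). */
enum { MODEL_CONFIG_X86_32 = 0, MODEL_CONFIG_AS_FXSAVEQ = 1 };

/* One function body replaces three #ifdef'd variants; the compiler
 * folds the constant conditions and keeps a single path. */
static const char *pick_save_insn(void)
{
	if (MODEL_CONFIG_X86_32)
		return "fxsave";
	else if (MODEL_CONFIG_AS_FXSAVEQ)
		return "fxsaveq";
	return "rex64/fxsave";   /* 64-bit without assembler fxsaveq support */
}
```

Because the conditions are constants rather than preprocessor guards, every variant is still type-checked in every configuration, which is the maintenance win the patch is after.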
[tip:x86/fpu] x86, signal: Cleanup ifdefs and is_ia32, is_x32
Commit-ID: 050902c011712ad4703038fa4489ec4edd87d396 Gitweb: http://git.kernel.org/tip/050902c011712ad4703038fa4489ec4edd87d396 Author: Suresh Siddha AuthorDate: Tue, 24 Jul 2012 16:05:27 -0700 Committer: H. Peter Anvin CommitDate: Tue, 18 Sep 2012 15:51:26 -0700 x86, signal: Cleanup ifdefs and is_ia32, is_x32 Use config_enabled() to cleanup the definitions of is_ia32/is_x32. Move the function prototypes to the header file to cleanup ifdefs, and move the x32_setup_rt_frame() code around. Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1343171129-2747-2-git-send-email-suresh.b.sid...@intel.com Merged in compilation fix from, Link: http://lkml.kernel.org/r/1344544736.8326.17.ca...@sbsiddha-desk.sc.intel.com Signed-off-by: H. Peter Anvin --- arch/x86/include/asm/fpu-internal.h | 26 +- arch/x86/include/asm/signal.h |4 + arch/x86/kernel/signal.c| 196 ++ 3 files changed, 110 insertions(+), 116 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 75f4c6d..6f59543 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -12,6 +12,7 @@ #include <linux/kernel_stat.h> #include <linux/regset.h> +#include <linux/compat.h> #include <linux/slab.h> #include <asm/asm.h> #include <asm/cpufeature.h> @@ -32,7 +33,6 @@ extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get, extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set, xstateregs_set; - /* * xstateregs_active == fpregs_active. Please refer to the comment * at the definition of fpregs_active. 
@@ -55,6 +55,22 @@ extern void finit_soft_fpu(struct i387_soft_struct *soft); static inline void finit_soft_fpu(struct i387_soft_struct *soft) {} #endif +static inline int is_ia32_compat_frame(void) +{ + return config_enabled(CONFIG_IA32_EMULATION) && + test_thread_flag(TIF_IA32); +} + +static inline int is_ia32_frame(void) +{ + return config_enabled(CONFIG_X86_32) || is_ia32_compat_frame(); +} + +static inline int is_x32_frame(void) +{ + return config_enabled(CONFIG_X86_X32_ABI) && test_thread_flag(TIF_X32); +} + #define X87_FSW_ES (1 << 7)/* Exception Summary */ static __always_inline __pure bool use_xsaveopt(void) @@ -180,6 +196,11 @@ static inline void fpu_fxsave(struct fpu *fpu) #endif } +int ia32_setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, + compat_sigset_t *set, struct pt_regs *regs); +int ia32_setup_frame(int sig, struct k_sigaction *ka, +compat_sigset_t *set, struct pt_regs *regs); + #else /* CONFIG_X86_32 */ /* perform fxrstor iff the processor has extended states, otherwise frstor */ @@ -204,6 +225,9 @@ static inline void fpu_fxsave(struct fpu *fpu) : [fx] "=m" (fpu->state->fxsave)); } +#define ia32_setup_frame __setup_frame +#define ia32_setup_rt_frame__setup_rt_frame + #endif /* CONFIG_X86_64 */ /* diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h index 598457c..323973f 100644 --- a/arch/x86/include/asm/signal.h +++ b/arch/x86/include/asm/signal.h @@ -31,6 +31,10 @@ typedef struct { unsigned long sig[_NSIG_WORDS]; } sigset_t; +#ifndef CONFIG_COMPAT +typedef sigset_t compat_sigset_t; +#endif + #else /* Here we must cater to libcs that poke about in kernel headers. 
*/ diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index b280908..bed431a 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -209,24 +209,21 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size, unsigned long sp = regs->sp; int onsigstack = on_sig_stack(sp); -#ifdef CONFIG_X86_64 /* redzone */ - sp -= 128; -#endif /* CONFIG_X86_64 */ + if (config_enabled(CONFIG_X86_64)) + sp -= 128; if (!onsigstack) { /* This is the X/Open sanctioned signal stack switching. */ if (ka->sa.sa_flags & SA_ONSTACK) { if (current->sas_ss_size) sp = current->sas_ss_sp + current->sas_ss_size; - } else { -#ifdef CONFIG_X86_32 - /* This is the legacy signal stack switching. */ - if ((regs->ss & 0xffff) != __USER_DS && - !(ka->sa.sa_flags & SA_RESTORER) && - ka->sa.sa_restorer) + } else if (config_enabled(CONFIG_X86_32) && + (regs->ss & 0xffff) != __USER_DS && + !(ka->sa.sa_flags & SA_RESTORER) && + ka->sa.sa_restorer) { + /* This is the legacy signal stack switching. */
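The rewritten get_sigframe() logic above (skip the red zone, then switch to the alternate signal stack) can be modelled in a few lines. Every name below is an illustrative stand-in, and only the two branches touched by this hunk are modelled:

```c
#include <assert.h>

#define MODEL_SA_ONSTACK 0x08000000UL  /* stand-in for SA_ONSTACK */

struct model_ctx {
	unsigned long sp;            /* regs->sp at signal delivery */
	unsigned long sas_ss_sp;     /* alternate stack base */
	unsigned long sas_ss_size;   /* alternate stack size (0 = none) */
	unsigned long sa_flags;
	int on_sigstack;             /* already running on the alt stack? */
	int is_64bit;                /* models config_enabled(CONFIG_X86_64) */
};

static unsigned long pick_sigframe_sp(const struct model_ctx *c)
{
	unsigned long sp = c->sp;

	if (c->is_64bit)
		sp -= 128;   /* skip the x86-64 ABI red zone */

	/* X/Open sanctioned signal stack switching. */
	if (!c->on_sigstack && (c->sa_flags & MODEL_SA_ONSTACK) &&
	    c->sas_ss_size)
		sp = c->sas_ss_sp + c->sas_ss_size;

	return sp;
}
```

The point of the patch is that this shape needs no #ifdefs: the 64-bit red-zone adjustment is an ordinary branch on a compile-time constant.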
[tip:x86/fpu] x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
Commit-ID: 377ffbcc536a5adc077395163ab149c02610 Gitweb: http://git.kernel.org/tip/377ffbcc536a5adc077395163ab149c02610 Author: Suresh Siddha suresh.b.sid...@intel.com AuthorDate: Fri, 24 Aug 2012 14:12:58 -0700 Committer: H. Peter Anvin h...@linux.intel.com CommitDate: Tue, 18 Sep 2012 15:52:06 -0700 x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig() A few lines below we do drop_fpu(), which is safer. Remove the unnecessary user_fpu_end() in save_xstate_sig(), which allows the drop_fpu() to ignore any pending exceptions from the user-space and drop the current fpu. Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com Link: http://lkml.kernel.org/r/1345842782-24175-3-git-send-email-suresh.b.sid...@intel.com Signed-off-by: H. Peter Anvin h...@linux.intel.com --- arch/x86/include/asm/fpu-internal.h | 17 +++-- arch/x86/kernel/xsave.c |1 - 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h index 78169d1..52202a6 100644 --- a/arch/x86/include/asm/fpu-internal.h +++ b/arch/x86/include/asm/fpu-internal.h @@ -412,22 +412,11 @@ static inline void __drop_fpu(struct task_struct *tsk) } /* - * The actual user_fpu_begin/end() functions - * need to be preemption-safe. + * Need to be preemption-safe. * - * NOTE! user_fpu_end() must be used only after you - * have saved the FP state, and user_fpu_begin() must - * be used only immediately before restoring it. - * These functions do not do any save/restore on - * their own. + * NOTE! user_fpu_begin() must be used only immediately before restoring + * it. This function does not do any save/restore on their own. 
*/ -static inline void user_fpu_end(void) -{ - preempt_disable(); - __thread_fpu_end(current); - preempt_enable(); -} - static inline void user_fpu_begin(void) { preempt_disable(); diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c index 07ddc87..4ac5f2e 100644 --- a/arch/x86/kernel/xsave.c +++ b/arch/x86/kernel/xsave.c @@ -255,7 +255,6 @@ int save_xstate_sig(void __user *buf, void __user *buf_fx, int size) /* Update the thread's fxstate to save the fsave header. */ if (ia32_fxstate) fpu_fxsave(&tsk->thread.fpu); - user_fpu_end(); } else { sanitize_i387_state(tsk); if (__copy_to_user(buf_fx, xsave, xstate_size))
[tip:x86/fpu] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
Commit-ID: 9c1c3fac53378c9782c18f80107965578d7b7167 Gitweb: http://git.kernel.org/tip/9c1c3fac53378c9782c18f80107965578d7b7167 Author: Suresh Siddha suresh.b.sid...@intel.com AuthorDate: Fri, 24 Aug 2012 14:12:59 -0700 Committer: H. Peter Anvin h...@linux.intel.com CommitDate: Tue, 18 Sep 2012 15:52:07 -0700 x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu() kvm's guest fpu save/restore should be wrapped around kernel_fpu_begin/end(). This will avoid, for example, taking a DNA in kvm_load_guest_fpu() when it tries to load the fpu immediately after doing unlazy_fpu() on the host side. More importantly this will prevent the host process fpu from being corrupted. Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com Link: http://lkml.kernel.org/r/1345842782-24175-4-git-send-email-suresh.b.sid...@intel.com Cc: Avi Kivity a...@redhat.com Signed-off-by: H. Peter Anvin h...@linux.intel.com --- arch/x86/kvm/x86.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 148ed66..cf637f5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5972,7 +5972,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) */ kvm_put_guest_xcr0(vcpu); vcpu->guest_fpu_loaded = 1; - unlazy_fpu(current); + kernel_fpu_begin(); fpu_restore_checking(&vcpu->arch.guest_fpu); trace_kvm_fpu(1); } @@ -5986,6 +5986,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); + kernel_fpu_end(); ++vcpu->stat.fpu_reload; kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0);
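The invariant this kvm patch enforces, save the host's FPU state before the guest state is loaded and bring it back when the guest is done, can be shown with a loose userspace model. All names below are invented, and the model restores eagerly in kernel_fpu_end_model() whereas the real kernel_fpu_end() of this era defers the restore under lazy FPU:

```c
#include <assert.h>
#include <string.h>

/* One shared "register file" standing in for the CPU's FPU state. */
static char fpu_regs[16];
static char host_saved[16];
static int  in_kernel_fpu;

static void kernel_fpu_begin_model(void)
{
	memcpy(host_saved, fpu_regs, sizeof fpu_regs);  /* save host state */
	in_kernel_fpu = 1;
}

static void kernel_fpu_end_model(void)
{
	memcpy(fpu_regs, host_saved, sizeof fpu_regs);  /* bring host state back */
	in_kernel_fpu = 0;
}

/* Guest FPU use bracketed the way the patch brackets
 * kvm_load_guest_fpu()/kvm_put_guest_fpu(). */
static void run_guest_model(const char *guest_state)
{
	kernel_fpu_begin_model();
	memcpy(fpu_regs, guest_state, sizeof fpu_regs); /* guest clobbers regs */
	kernel_fpu_end_model();
}
```

Without the begin/end bracket, whatever the guest left in the registers would leak into the host process's FPU state, which is the corruption the commit message describes.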
[tip:x86/fpu] x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
Commit-ID: 841e3604d35aa70d399146abdc526d8c89a2c2f5 Gitweb: http://git.kernel.org/tip/841e3604d35aa70d399146abdc526d8c89a2c2f5 Author: Suresh Siddha suresh.b.sid...@intel.com AuthorDate: Fri, 24 Aug 2012 14:13:00 -0700 Committer: H. Peter Anvin h...@linux.intel.com CommitDate: Tue, 18 Sep 2012 15:52:08 -0700 x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage Use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and saving/restoring just the few used xmm/ymm registers. This has some advantages: * If the task's FPU state is already active, then kernel_fpu_begin() will just save the user state, avoiding the read/write of cr0. In general, cr0 accesses are much slower. * Manual save/restore of xmm/ymm registers will affect the 'modified' and the 'init' optimizations brought in by the xsaveopt/xrstor infrastructure. * Forward compatibility with future vector register extensions will be a problem if the xmm/ymm registers are manually saved and restored (corrupting the extended state of those vector registers). With this patch, there was no significant difference in the xor throughput using AVX, measured during boot. Signed-off-by: Suresh Siddha suresh.b.sid...@intel.com Link: http://lkml.kernel.org/r/1345842782-24175-5-git-send-email-suresh.b.sid...@intel.com Cc: Jim Kukunas james.t.kuku...@linux.intel.com Cc: NeilBrown ne...@suse.de Signed-off-by: H. 
Peter Anvin h...@linux.intel.com --- arch/x86/include/asm/xor_32.h | 56 +--- arch/x86/include/asm/xor_64.h | 61 ++-- arch/x86/include/asm/xor_avx.h | 54 --- 3 files changed, 29 insertions(+), 142 deletions(-) diff --git a/arch/x86/include/asm/xor_32.h b/arch/x86/include/asm/xor_32.h index 4545708..aabd585 100644 --- a/arch/x86/include/asm/xor_32.h +++ b/arch/x86/include/asm/xor_32.h @@ -534,38 +534,6 @@ static struct xor_block_template xor_block_p5_mmx = { * Copyright (C) 1999 Zach Brown (with obvious credit due Ingo) */ -#define XMMS_SAVE \ -do { \ - preempt_disable(); \ - cr0 = read_cr0(); \ - clts(); \ - asm volatile( \ - movups %%xmm0,(%0) ;\n\t \ - movups %%xmm1,0x10(%0) ;\n\t \ - movups %%xmm2,0x20(%0) ;\n\t \ - movups %%xmm3,0x30(%0) ;\n\t \ - : \ - : r (xmm_save)\ - : memory);\ -} while (0) - -#define XMMS_RESTORE \ -do { \ - asm volatile( \ - sfence ;\n\t \ - movups (%0),%%xmm0 ;\n\t \ - movups 0x10(%0),%%xmm1 ;\n\t \ - movups 0x20(%0),%%xmm2 ;\n\t \ - movups 0x30(%0),%%xmm3 ;\n\t \ - : \ - : r (xmm_save)\ - : memory);\ - write_cr0(cr0); \ - preempt_enable(); \ -} while (0) - -#define ALIGN16 __attribute__((aligned(16))) - #define OFFS(x)16*(#x) #define PF_OFFS(x) 256+16*(#x) #definePF0(x) prefetchnta PF_OFFS(x)(%1) ;\n @@ -587,10 +555,8 @@ static void xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) { unsigned long lines = bytes 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -633,7 +599,7 @@ xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) : : memory); - XMMS_RESTORE; + kernel_fpu_end(); } static void @@ -641,10 +607,8 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2, unsigned long *p3) { unsigned long lines = bytes 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -694,7 +658,7 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2, : : memory ); - 
XMMS_RESTORE; + kernel_fpu_end(); } static void @@ -702,10 +666,8 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2, unsigned long *p3, unsigned long *p4) { unsigned long lines = bytes 8; - char xmm_save[16*4] ALIGN16; - int cr0; - XMMS_SAVE; + kernel_fpu_begin(); asm volatile( #undef BLOCK @@ -762,7 +724,7 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2, : : memory
[tip:x86/fpu] lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
Commit-ID:  9c6ff8bbb69a4e7b47ac40bfa44509296e89c5c0
Gitweb:     http://git.kernel.org/tip/9c6ff8bbb69a4e7b47ac40bfa44509296e89c5c0
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Fri, 24 Aug 2012 14:13:01 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:09 -0700

lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models

Instead of using unlazy_fpu(), check user_has_fpu() and set/clear the
host TS bits so that lguest works fine with both the lazy/non-lazy FPU
host models with minimal changes.

Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-6-git-send-email-suresh.b.sid...@intel.com
Cc: Rusty Russell <ru...@rustcorp.com.au>
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 drivers/lguest/x86/core.c | 10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/lguest/x86/core.c b/drivers/lguest/x86/core.c
index 39809035..4af12e1 100644
--- a/drivers/lguest/x86/core.c
+++ b/drivers/lguest/x86/core.c
@@ -203,8 +203,8 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
 	 * we set it now, so we can trap and pass that trap to the Guest if it
 	 * uses the FPU.
 	 */
-	if (cpu->ts)
-		unlazy_fpu(current);
+	if (cpu->ts && user_has_fpu())
+		stts();
 
 	/*
 	 * SYSENTER is an optimized way of doing system calls. We can't allow
@@ -234,6 +234,10 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
 	if (boot_cpu_has(X86_FEATURE_SEP))
 		wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
 
+	/* Clear the host TS bit if it was set above. */
+	if (cpu->ts && user_has_fpu())
+		clts();
+
 	/*
 	 * If the Guest page faulted, then the cr2 register will tell us the
 	 * bad virtual address. We have to grab this now, because once we
@@ -249,7 +253,7 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
 	 * a different CPU. So all the critical stuff should be done
 	 * before this.
 	 */
-	else if (cpu->regs->trapnum == 7)
+	else if (cpu->regs->trapnum == 7 && !user_has_fpu())
 		math_state_restore();
 }
[tip:x86/fpu] x86, fpu: use non-lazy fpu restore for processors supporting xsave
Commit-ID:  304bceda6a18ae0b0240b8aac9a6bdf8ce2d2469
Gitweb:     http://git.kernel.org/tip/304bceda6a18ae0b0240b8aac9a6bdf8ce2d2469
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Fri, 24 Aug 2012 14:13:02 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:11 -0700

x86, fpu: use non-lazy fpu restore for processors supporting xsave

The fundamental model of the current Linux kernel is to lazily init and
restore FPU instead of restoring the task state during context switch.
This changes that fundamental lazy model to the non-lazy model for the
processors supporting the xsave feature.

Reasons driving this model change are:

i. Newer processors support optimized state save/restore using xsaveopt
   and xrstor by tracking the INIT state and MODIFIED state during
   context-switch. This is faster than modifying the cr0.TS bit, which
   has serializing semantics.

ii. Newer glibc versions use SSE for some of the optimized copy/clear
    routines. With certain workloads (like boot, kernel-compilation
    etc.), the application completes its work within the first 5 task
    switches, thus taking up to 5 #DNA traps with the kernel not getting
    a chance to apply the above mentioned pre-load heuristic.

iii. Some xstate features (like AMD's LWP feature) don't honor the
     cr0.TS bit and thus will not work correctly in the presence of
     lazy restore. Non-lazy state restore is needed for enabling such
     features.

Some data on a two socket SNB system:

* Saved 20K DNA exceptions during boot on a two socket SNB system.
* Saved 50K DNA exceptions during kernel-compilation workload.
* Improved throughput of the AVX based checksumming function inside the
  kernel by ~15%, as xsave/xrstor is faster than the serializing
  clts/stts pair.

Also, kernel_fpu_begin/end() now relies on the patched alternative
instructions. So move check_fpu(), which uses kernel_fpu_begin/end(),
after alternative_instructions().

Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-7-git-send-email-suresh.b.sid...@intel.com
Merge 32-bit boot fix from,
Link: http://lkml.kernel.org/r/1347300665-6209-4-git-send-email-suresh.b.sid...@intel.com
Cc: Jim Kukunas <james.t.kuku...@linux.intel.com>
Cc: NeilBrown <ne...@suse.de>
Cc: Avi Kivity <a...@redhat.com>
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 arch/x86/include/asm/fpu-internal.h | 96 +++
 arch/x86/include/asm/i387.h         |  1 +
 arch/x86/include/asm/xsave.h        |  1 +
 arch/x86/kernel/cpu/bugs.c          |  7 ++-
 arch/x86/kernel/i387.c              | 20 ++-
 arch/x86/kernel/process.c           | 12 +++--
 arch/x86/kernel/process_32.c        |  4 --
 arch/x86/kernel/process_64.c        |  4 --
 arch/x86/kernel/traps.c             |  5 ++-
 arch/x86/kernel/xsave.c             | 57 +
 10 files changed, 146 insertions(+), 61 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 52202a6..8ca0f9f 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -291,15 +291,48 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk)
 static inline void __thread_fpu_end(struct task_struct *tsk)
 {
 	__thread_clear_has_fpu(tsk);
-	stts();
+	if (!use_xsave())
+		stts();
 }
 
 static inline void __thread_fpu_begin(struct task_struct *tsk)
 {
-	clts();
+	if (!use_xsave())
+		clts();
 	__thread_set_has_fpu(tsk);
 }
 
+static inline void __drop_fpu(struct task_struct *tsk)
+{
+	if (__thread_has_fpu(tsk)) {
+		/* Ignore delayed exceptions from user space */
+		asm volatile("1: fwait\n"
+			     "2:\n"
+			     _ASM_EXTABLE(1b, 2b));
+		__thread_fpu_end(tsk);
+	}
+}
+
+static inline void drop_fpu(struct task_struct *tsk)
+{
+	/*
+	 * Forget coprocessor state..
+	 */
+	preempt_disable();
+	tsk->fpu_counter = 0;
+	__drop_fpu(tsk);
+	clear_used_math();
+	preempt_enable();
+}
+
+static inline void drop_init_fpu(struct task_struct *tsk)
+{
+	if (!use_xsave())
+		drop_fpu(tsk);
+	else
+		xrstor_state(init_xstate_buf, -1);
+}
+
 /*
  * FPU state switching for scheduling.
  *
@@ -333,7 +366,12 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
 {
 	fpu_switch_t fpu;
 
-	fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
+	/*
+	 * If the task has used the math, pre-load the FPU on xsave processors
+	 * or if the past 5 consecutive context-switches used math.
+	 */
+	fpu.preload = tsk_used_math(new) && (use_xsave() ||
+					     new->fpu_counter > 5);
 	if (__thread_has_fpu(old)) {
 		if (!__save_init_fpu(old
[tip:x86/fpu] x86, fpu: decouple non-lazy/eager fpu restore from xsave
Commit-ID:  5d2bd7009f306c82afddd1ca4d9763ad8473c216
Gitweb:     http://git.kernel.org/tip/5d2bd7009f306c82afddd1ca4d9763ad8473c216
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Thu, 6 Sep 2012 14:58:52 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:22 -0700

x86, fpu: decouple non-lazy/eager fpu restore from xsave

Decouple the non-lazy/eager fpu restore policy from the existence of
the xsave feature. Introduce a synthetic CPUID flag to represent the
eagerfpu policy. The "eagerfpu=on" boot parameter will enable the
policy.

Requested-by: H. Peter Anvin <h...@zytor.com>
Requested-by: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-2-git-send-email-suresh.b.sid...@intel.com
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 Documentation/kernel-parameters.txt |  4 ++
 arch/x86/include/asm/cpufeature.h   |  2 +
 arch/x86/include/asm/fpu-internal.h | 54 --
 arch/x86/kernel/cpu/common.c        |  2 -
 arch/x86/kernel/i387.c              | 25 +++---
 arch/x86/kernel/process.c           |  2 +-
 arch/x86/kernel/traps.c             |  2 +-
 arch/x86/kernel/xsave.c             | 87 +++
 8 files changed, 112 insertions(+), 66 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index ad7e2e5..741d064 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1833,6 +1833,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			and restore using xsave. The kernel will fallback to
 			enabling legacy floating-point and sse state.
 
+	eagerfpu=	[X86]
+			on	enable eager fpu restore
+			off	disable eager fpu restore
+
 	nohlt		[BUGS=ARM,SH] Tells the kernel that the sleep(SH) or
 			wfi(ARM) instruction doesn't work correctly and not to
 			use it. This is also useful when using JTAG debugger.

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 6b7ee5f..5dd2b47 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -97,6 +97,7 @@
 #define X86_FEATURE_EXTD_APICID	(3*32+26) /* has extended APICID (8 bits) */
 #define X86_FEATURE_AMD_DCM	(3*32+27) /* multi-node processor */
 #define X86_FEATURE_APERFMPERF	(3*32+28) /* APERFMPERF */
+#define X86_FEATURE_EAGER_FPU	(3*32+29) /* "eagerfpu" Non lazy FPU restore */
 
 /* Intel-defined CPU features, CPUID level 0x0001 (ecx), word 4 */
 #define X86_FEATURE_XMM3	(4*32+ 0) /* "pni" SSE-3 */
@@ -305,6 +306,7 @@ extern const char * const x86_power_flags[32];
 #define cpu_has_perfctr_core	boot_cpu_has(X86_FEATURE_PERFCTR_CORE)
 #define cpu_has_cx8		boot_cpu_has(X86_FEATURE_CX8)
 #define cpu_has_cx16		boot_cpu_has(X86_FEATURE_CX16)
+#define cpu_has_eager_fpu	boot_cpu_has(X86_FEATURE_EAGER_FPU)
 
 #if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
 # define cpu_has_invlpg		1

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 8ca0f9f..0ca72f0 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -38,6 +38,7 @@ int ia32_setup_frame(int sig, struct k_sigaction *ka,
 
 extern unsigned int mxcsr_feature_mask;
 extern void fpu_init(void);
+extern void eager_fpu_init(void);
 
 DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);
 
@@ -84,6 +85,11 @@ static inline int is_x32_frame(void)
 
 #define X87_FSW_ES	(1 << 7)	/* Exception Summary */
 
+static __always_inline __pure bool use_eager_fpu(void)
+{
+	return static_cpu_has(X86_FEATURE_EAGER_FPU);
+}
+
 static __always_inline __pure bool use_xsaveopt(void)
 {
 	return static_cpu_has(X86_FEATURE_XSAVEOPT);
@@ -99,6 +105,14 @@ static __always_inline __pure bool use_fxsr(void)
 	return static_cpu_has(X86_FEATURE_FXSR);
 }
 
+static inline void fx_finit(struct i387_fxsave_struct *fx)
+{
+	memset(fx, 0, xstate_size);
+	fx->cwd = 0x37f;
+	if (cpu_has_xmm)
+		fx->mxcsr = MXCSR_DEFAULT;
+}
+
 extern void __sanitize_i387_state(struct task_struct *);
 
 static inline void sanitize_i387_state(struct task_struct *tsk)
@@ -291,13 +305,13 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk)
 static inline void __thread_fpu_end(struct task_struct *tsk)
 {
 	__thread_clear_has_fpu(tsk);
-	if (!use_xsave())
+	if (!use_eager_fpu())
 		stts();
 }
 
 static inline void __thread_fpu_begin(struct task_struct *tsk)
 {
-	if (!use_xsave())
+	if (!use_eager_fpu())
 		clts();
 	__thread_set_has_fpu(tsk);
 }
@@ -327,10 +341,14 @@ static inline
[tip:x86/fpu] x86, fpu: make eagerfpu= boot param tri-state
Commit-ID:  e00229819f306b1f86134095347e9187dc346bd1
Gitweb:     http://git.kernel.org/tip/e00229819f306b1f86134095347e9187dc346bd1
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Mon, 10 Sep 2012 10:32:32 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:24 -0700

x86, fpu: make eagerfpu= boot param tri-state

Add "eagerfpu=auto" (that selects the default scheme in enabling
eagerfpu), which can override compiled-in boot parameters like
"eagerfpu=on/off" (that force enable/disable eagerfpu).

Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-5-git-send-email-suresh.b.sid...@intel.com
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 Documentation/kernel-parameters.txt |  4 +++-
 arch/x86/kernel/xsave.c             | 17 -
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e8f7faa..46a6a82 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1834,8 +1834,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			enabling legacy floating-point and sse state.
 
 	eagerfpu=	[X86]
-			on	enable eager fpu restore (default for xsaveopt)
+			on	enable eager fpu restore
 			off	disable eager fpu restore
+			auto	selects the default scheme, which automatically
+				enables eagerfpu restore for xsaveopt.
 
 	nohlt		[BUGS=ARM,SH] Tells the kernel that the sleep(SH) or
 			wfi(ARM) instruction doesn't work correctly and not to

diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index e99f754..4e89b3d 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -508,13 +508,15 @@ static void __init setup_init_fpu_buf(void)
 	xsave_state(init_xstate_buf, -1);
 }
 
-static int disable_eagerfpu;
+static enum { AUTO, ENABLE, DISABLE } eagerfpu = AUTO;
 static int __init eager_fpu_setup(char *s)
 {
 	if (!strcmp(s, "on"))
-		setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+		eagerfpu = ENABLE;
 	else if (!strcmp(s, "off"))
-		disable_eagerfpu = 1;
+		eagerfpu = DISABLE;
+	else if (!strcmp(s, "auto"))
+		eagerfpu = AUTO;
 	return 1;
 }
 __setup("eagerfpu=", eager_fpu_setup);
@@ -557,8 +559,9 @@ static void __init xstate_enable_boot_cpu(void)
 	prepare_fx_sw_frame();
 	setup_init_fpu_buf();
 
-	if (cpu_has_xsaveopt && !disable_eagerfpu)
-		setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+	/* Auto enable eagerfpu for xsaveopt */
+	if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+		eagerfpu = ENABLE;
 
 	pr_info("enabled xstate_bv 0x%llx, cntxt size 0x%x\n",
 		pcntxt_mask, xstate_size);
@@ -598,6 +601,10 @@ void __cpuinit eager_fpu_init(void)
 	clear_used_math();
 	current_thread_info()->status = 0;
+
+	if (eagerfpu == ENABLE)
+		setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+
 	if (!cpu_has_eager_fpu) {
 		stts();
 		return;
[tip:x86/fpu] x86, fpu: remove cpu_has_xmm check in the fx_finit()
Commit-ID:  a8615af4bc3621cb01096541dafa6f68352ec2d9
Gitweb:     http://git.kernel.org/tip/a8615af4bc3621cb01096541dafa6f68352ec2d9
Author:     Suresh Siddha <suresh.b.sid...@intel.com>
AuthorDate: Mon, 10 Sep 2012 10:40:08 -0700
Committer:  H. Peter Anvin <h...@linux.intel.com>
CommitDate: Tue, 18 Sep 2012 15:52:24 -0700

x86, fpu: remove cpu_has_xmm check in the fx_finit()

No CPUs with FXSAVE but no XMM/MXCSR (Pentium II from Intel,
Crusoe/TM-3xxx/5xxx from Transmeta, and presumably some of the K6
generation from AMD) ever looked at the mxcsr field during
fxrstor/fxsave. So remove the cpu_has_xmm check in fx_finit().

Reported-by: Al Viro <v...@zeniv.linux.org.uk>
Acked-by: H. Peter Anvin <h...@zytor.com>
Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-6-git-send-email-suresh.b.sid...@intel.com
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
 arch/x86/include/asm/fpu-internal.h | 3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 0ca72f0..92f3c6e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -109,8 +109,7 @@ static inline void fx_finit(struct i387_fxsave_struct *fx)
 {
 	memset(fx, 0, xstate_size);
 	fx->cwd = 0x37f;
-	if (cpu_has_xmm)
-		fx->mxcsr = MXCSR_DEFAULT;
+	fx->mxcsr = MXCSR_DEFAULT;
 }
 
 extern void __sanitize_i387_state(struct task_struct *);
[patch] crypto, tcrypt: remove local_bh_disable/enable() around local_irq_disable/enable()
Ran into this while looking at some new crypto code using the FPU
hitting a WARN_ON_ONCE(!irq_fpu_usable()) in kernel_fpu_begin() on an
x86 kernel that uses the new eagerfpu model.

In short, the current eagerfpu changes return 0 from
interrupted_kernel_fpu_idle(), and in_interrupt() thinks it is in
interrupt context because of the local_bh_disable(). Thus resulting in
the WARN_ON().

Remove the local_bh_disable/enable() calls around the existing
local_irq_disable/enable() calls. local_irq_disable/enable() already
disables BH.

[ If there are any other legitimate users calling kernel_fpu_begin()
  from process context but with BH disabled, then we can look into
  fixing irq_fpu_usable() in the future. ]

Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
Cc: Tim Chen <tim.c.c...@linux.intel.com>
---
 crypto/tcrypt.c | 6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 5cf2ccb..de8c5d3 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -97,7 +97,6 @@ static int test_cipher_cycles(struct blkcipher_desc *desc, int enc,
 	int ret = 0;
 	int i;
 
-	local_bh_disable();
 	local_irq_disable();
 
 	/* Warm-up run. */
@@ -130,7 +129,6 @@ static int test_cipher_cycles(struct blkcipher_desc *desc, int enc,
 
 out:
 	local_irq_enable();
-	local_bh_enable();
 
 	if (ret == 0)
 		printk("1 operation in %lu cycles (%d bytes)\n",
@@ -300,7 +298,6 @@ static int test_hash_cycles_digest(struct hash_desc *desc,
 	int i;
 	int ret;
 
-	local_bh_disable();
 	local_irq_disable();
 
 	/* Warm-up run. */
@@ -327,7 +324,6 @@ static int test_hash_cycles_digest(struct hash_desc *desc,
 
 out:
 	local_irq_enable();
-	local_bh_enable();
 
 	if (ret)
 		return ret;
@@ -348,7 +344,6 @@ static int test_hash_cycles(struct hash_desc *desc, struct scatterlist *sg,
 	if (plen == blen)
 		return test_hash_cycles_digest(desc, sg, blen, out);
 
-	local_bh_disable();
 	local_irq_disable();
 
 	/* Warm-up run. */
@@ -391,7 +386,6 @@ static int test_hash_cycles(struct hash_desc *desc, struct scatterlist *sg,
 
 out:
 	local_irq_enable();
-	local_bh_enable();
 
 	if (ret)
 		return ret;