Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-05-01 Thread H. Peter Anvin
On 05/01/2014 03:18 PM, Andi Kleen wrote:
>> I haven't looked through the flows (I'm at LCE so I have limited screen
>> bandwidth) to see how that would be handled in this case, but in the
>> general paranoid case it comes down to the fact that in this particular
>> subcase we don't necessarily know exactly how many SWAPGS are between us
>> and userspace after we IRET.
> 
> There is none as far as I know. Certainly wasn't any when the code
> was originally written.
> 

This applies for an asynchronous entry from kernel space.  Obviously in
the case where we actually come directly from user space (the stack
frame CS.RPL == 3) then that doesn't apply.
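
A minimal sketch of the CS.RPL test mentioned above, written as a toy Python model rather than kernel code. The selector values are the usual Linux x86-64 ones and are used here only for illustration; the helper name is hypothetical. Note that the assembly quoted elsewhere in this thread tests for a nonzero RPL, which is equivalent on Linux because only rings 0 and 3 are used.

```python
# Toy model: the low two bits of a saved CS selector are its RPL.
# For the CS saved in an interrupt/exception frame, this matches the
# privilege level of the code that was interrupted.

KERNEL_CS = 0x10  # illustrative: ring-0 code selector (RPL 0)
USER_CS = 0x33    # illustrative: ring-3 code selector (RPL 3)


def came_from_user(saved_cs: int) -> bool:
    """Return True if the interrupted context was running at CPL 3."""
    return (saved_cs & 3) == 3
```

With these values, came_from_user(USER_CS) is true and came_from_user(KERNEL_CS) is false, which is the distinction between "directly from user space" and "asynchronous entry from kernel space".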

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-05-01 Thread Andi Kleen
> I haven't looked through the flows (I'm at LCE so I have limited screen
> bandwidth) to see how that would be handled in this case, but in the
> general paranoid case it comes down to the fact that in this particular
> subcase we don't necessarily know exactly how many SWAPGS are between us
> and userspace after we IRET.

There is none as far as I know. Certainly wasn't any when the code
was originally written.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-05-01 Thread Andy Lutomirski
On Thu, May 1, 2014 at 2:58 PM, H. Peter Anvin  wrote:
> On 05/01/2014 02:15 PM, Andi Kleen wrote:
>>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>>> up in paranoid_userspace.
>>
>> You may miss a reschedule in this obscure case. It shouldn't really
>> happen because loading a kernel pointer is not useful for user space.
>>
>> Doesn't seem like a real issue to me.
>>
>> We only happen to need to handle it to avoid crashing.
>>
>
> No, it would be a rootable security hole, not just a crash.
>
>>> Alternatively, what if the paranoid entry checked whether we're coming
>>> from userspace at the very beginning and, if so, just jumped to the
>>> non-paranoid entry?
>>
>> That would work, but I doubt it would be worth it.
>
> If that would solve the problem it is simple enough, but the tricky part
> is when we end up in a "crack" where we are in kernel mode with the user GS.
>
> I haven't looked through the flows (I'm at LCE so I have limited screen
> bandwidth) to see how that would be handled in this case, but in the
> general paranoid case it comes down to the fact that in this particular
> subcase we don't necessarily know exactly how many SWAPGS are between us
> and userspace after we IRET.

The current code looks like it will never try to reschedule on
paranoid exit unless it came from user *CS*, in which case there
shouldn't be any weird gs issues.  Given that the current code won't
reschedule even on a paranoid entry that hits during interruptible
kernel code, I find it unlikely that this code is important.  You
probably know more about its history and significance than I do.

What happens when ftrace or perf tries to wake a task from a debug
interrupt or NMI?

--Andy


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-05-01 Thread H. Peter Anvin
On 05/01/2014 02:15 PM, Andi Kleen wrote:
>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>> up in paranoid_userspace.
> 
> You may miss a reschedule in this obscure case. It shouldn't really
> happen because loading a kernel pointer is not useful for user space.
> 
> Doesn't seem like a real issue to me.
> 
> We only happen to need to handle it to avoid crashing.
> 

No, it would be a rootable security hole, not just a crash.

>> Alternatively, what if the paranoid entry checked whether we're coming
>> from userspace at the very beginning and, if so, just jumped to the
>> non-paranoid entry?
> 
> That would work, but I doubt it would be worth it.

If that would solve the problem it is simple enough, but the tricky part
is when we end up in a "crack" where we are in kernel mode with the user GS.

I haven't looked through the flows (I'm at LCE so I have limited screen
bandwidth) to see how that would be handled in this case, but in the
general paranoid case it comes down to the fact that in this particular
subcase we don't necessarily know exactly how many SWAPGS are between us
and userspace after we IRET.

-hpa




Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-05-01 Thread Andy Lutomirski
On Thu, May 1, 2014 at 2:51 PM, Andi Kleen  wrote:
>> Allowing userspace to prevent itself from being rescheduled by loading
>> something strange into gsbase seems unfortunate.
>
> The timer tick will eventually catch it, so any delay is tightly
> bounded.
>

What about NO_HZ_FULL?


> Also still gets rescheduled most of the time, just not when a paranoid
> exception handler is running.

If rescheduling on exit from a paranoid exception handler isn't
important, then let's just remove it.  Otherwise let's keep it
working.

--Andy


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-05-01 Thread Andi Kleen
> Allowing userspace to prevent itself from being rescheduled by loading
> something strange into gsbase seems unfortunate.

The timer tick will eventually catch it, so any delay is tightly
bounded.

Also still gets rescheduled most of the time, just not when a paranoid 
exception handler is running.

-Andi


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-05-01 Thread Andy Lutomirski
On Thu, May 1, 2014 at 2:15 PM, Andi Kleen  wrote:
>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>> up in paranoid_userspace.
>
> You may miss a reschedule in this obscure case. It shouldn't really
> happen because loading a kernel pointer is not useful for user space.
>
> Doesn't seem like a real issue to me.
>
> We only happen to need to handle it to avoid crashing.

Allowing userspace to prevent itself from being rescheduled by loading
something strange into gsbase seems unfortunate.

--Andy

>
>> Alternatively, what if the paranoid entry checked whether we're coming
>> from userspace at the very beginning and, if so, just jumped to the
>> non-paranoid entry?
>
> That would work, but I doubt it would be worth it.
>


> -Andi



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-05-01 Thread Andi Kleen
> If usergs == kernelgs, then ebx will always be 1 and we'll never end
> up in paranoid_userspace.

You may miss a reschedule in this obscure case. It shouldn't really
happen because loading a kernel pointer is not useful for user space.

Doesn't seem like a real issue to me.

We only happen to need to handle it to avoid crashing.

> Alternatively, what if the paranoid entry checked whether we're coming
> from userspace at the very beginning and, if so, just jumped to the
> non-paranoid entry?

That would work, but I doubt it would be worth it.

-Andi


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-04-30 Thread Andy Lutomirski
On Wed, Apr 30, 2014 at 4:44 PM, Andy Lutomirski  wrote:
> On Tue, Apr 29, 2014 at 9:52 PM, H. Peter Anvin  wrote:
>> On 04/29/2014 04:39 PM, Andi Kleen wrote:
>>>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>>>> everything is okay because the user gs base and the kernel gs bases are
>>>> equal.  But if something does try to change the user gs base, then it
>>>> will accidentally change the kernel gs base instead.
>>>
>>> It doesn't really matter, as they are the same.
>>> They would just switch identities.
>>>
>>> Besides I don't think anyone does that.
>>>
>>
>> It matters -- greatly -- if (and only if) we can enter the kernel with
>> usergs == kernelgs and then want to change usergs inside a paranoid
>> routine.  At that point we risk being upside down, which basically means
>> we're rooted.
>>
>> However, I believe this patchset also means only IST entries can be
>> paranoid, which in turn means we can't sleep inside them.  To the very
>> best of my knowledge the only times we change usergs is on context
>> switch or inside a system call.  We need to make sure that is actually
>> the case, though.
>>
>> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
>> time, but it shouldn't be too hard to convince ourselves of... mostly a
>> matter of making sure something like ptrace can't do stupid crap.
>
> The only things that look relevant are the context switch paths and
> the kvm stuff.  I don't know what happens if an IST exception happens
> while running a guest, though.  TBH I have no idea what the VMX and
> SVM interfaces look like.
>
> paranoid_schedule looks scary.  If I'm understanding it correctly, it
> expects to be executed with gs == usergs.  I think it's okay, since it
> will only be invoked if we trapped from userspace, in which case the
> state is well-defined.  But this bit could be wrong:
>
>     testl %ebx,%ebx         /* swapgs needed? */
>     jnz   paranoid_restore
>     testl $3,CS(%rsp)
>     jnz   paranoid_userspace
>
> If usergs == kernelgs, then ebx will always be 1 and we'll never end
> up in paranoid_userspace.
>
> This could be fixed in two ways.  We could just switch the order of
> the tests, since the only way to have ebx == 1 and CS with CPL == 3
> should be if we're coming from userspace with usergs==kernelgs.  Or we
> could get rid of the paranoid schedule code entirely.  Is it actually
> needed for anything?  Timer and rescheduling interrupts shouldn't be
> paranoid, and if there's any paranoid code that will trigger a
> reschedule, couldn't it do it much more sanely by sending an IPI to
> self and thus deferring the reschedule until interrupts are enabled?

Having just asked this, isn't the current code already broken if
something like an NMI or MCE tried to reschedule the current cpu?  It
could hit just before running hlt in the non-polling idle loop or it
could happen during execution of a kernel thread.  In either case, I
don't see why anything is guaranteed to notice the resched flag being
set.
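
A rough sketch of the idle-loop window described above, as a toy Python model. It assumes the NMI or MCE handler only sets the resched flag and does not send an IPI or otherwise wake the CPU; the function name is hypothetical and this is not how the kernel is structured.

```python
def halts_with_pending_resched(nmi_before_check: bool) -> bool:
    """Model one pass of a non-polling idle loop.

    Returns True if the CPU executes hlt while the resched flag is set,
    i.e. the request goes unnoticed until some later interrupt arrives.
    """
    need_resched = False

    if nmi_before_check:
        need_resched = True   # NMI sets the flag before the idle check

    if need_resched:          # idle loop's "should I reschedule?" check
        return False          # the loop would schedule instead of halting

    if not nmi_before_check:
        need_resched = True   # NMI lands after the check, just before hlt

    # hlt runs here; in this model nothing looks at the flag again.
    return need_resched
```

In this model only the late NMI (nmi_before_check=False) leaves the flag set across the halt, which is the case being asked about.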

--Andy


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-04-30 Thread Andy Lutomirski
On Tue, Apr 29, 2014 at 9:52 PM, H. Peter Anvin  wrote:
> On 04/29/2014 04:39 PM, Andi Kleen wrote:
>>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>>> everything is okay because the user gs base and the kernel gs bases are
>>> equal.  But if something does try to change the user gs base, then it
>>> will accidentally change the kernel gs base instead.
>>
>> It doesn't really matter, as they are the same.
>> They would just switch identities.
>>
>> Besides I don't think anyone does that.
>>
>
> It matters -- greatly -- if (and only if) we can enter the kernel with
> usergs == kernelgs and then want to change usergs inside a paranoid
> routine.  At that point we risk being upside down, which basically means
> we're rooted.
>
> However, I believe this patchset also means only IST entries can be
> paranoid, which in turn means we can't sleep inside them.  To the very
> best of my knowledge the only times we change usergs is on context
> switch or inside a system call.  We need to make sure that is actually
> the case, though.
>
> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
> time, but it shouldn't be too hard to convince ourselves of... mostly a
> matter of making sure something like ptrace can't do stupid crap.

The only things that look relevant are the context switch paths and
the kvm stuff.  I don't know what happens if an IST exception happens
while running a guest, though.  TBH I have no idea what the VMX and
SVM interfaces look like.

paranoid_schedule looks scary.  If I'm understanding it correctly, it
expects to be executed with gs == usergs.  I think it's okay, since it
will only be invoked if we trapped from userspace, in which case the
state is well-defined.  But this bit could be wrong:

    testl %ebx,%ebx         /* swapgs needed? */
    jnz   paranoid_restore
    testl $3,CS(%rsp)
    jnz   paranoid_userspace

If usergs == kernelgs, then ebx will always be 1 and we'll never end
up in paranoid_userspace.

This could be fixed in two ways.  We could just switch the order of
the tests, since the only way to have ebx == 1 and CS with CPL == 3
should be if we're coming from userspace with usergs==kernelgs.  Or we
could get rid of the paranoid schedule code entirely.  Is it actually
needed for anything?  Timer and rescheduling interrupts shouldn't be
paranoid, and if there's any paranoid code that will trigger a
reschedule, couldn't it do it much more sanely by sending an IPI to
self and thus deferring the reschedule until interrupts are enabled?
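
To make the ordering question concrete, here is a toy Python model of the two test orders. It only models which branch is taken, not what each path then does with GS. The labels paranoid_restore and paranoid_userspace come from the snippet above; "swapgs_and_restore" is a hypothetical name for the fall-through path.

```python
def exit_original(ebx: int, saved_cs: int) -> str:
    """Order quoted above: test ebx first, then the saved CS."""
    if ebx != 0:              # testl %ebx,%ebx ; jnz paranoid_restore
        return "paranoid_restore"
    if saved_cs & 3:          # testl $3,CS(%rsp) ; jnz paranoid_userspace
        return "paranoid_userspace"
    return "swapgs_and_restore"   # hypothetical label for the fall-through


def exit_swapped(ebx: int, saved_cs: int) -> str:
    """Proposed order: test the saved CS first, then ebx."""
    if saved_cs & 3:
        return "paranoid_userspace"
    if ebx != 0:
        return "paranoid_restore"
    return "swapgs_and_restore"   # hypothetical label for the fall-through
```

With ebx == 1 and a user CS (for example 0x33), exit_original returns "paranoid_restore" while exit_swapped returns "paranoid_userspace". For a kernel CS the two orders agree.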

Alternatively, what if the paranoid entry checked whether we're coming
from userspace at the very beginning and, if so, just jumped to the
non-paranoid entry?

--Andy


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-04-29 Thread H. Peter Anvin
On 04/29/2014 09:52 PM, H. Peter Anvin wrote:
> 
> It matters -- greatly -- if (and only if) we can enter the kernel with
> usergs == kernelgs and then want to change usergs inside a paranoid
> routine.  At that point we risk being upside down, which basically means
> we're rooted.
> 
> However, I believe this patchset also means only IST entries can be
> paranoid, which in turn means we can't sleep inside them.  To the very
> best of my knowledge the only times we change usergs is on context
> switch or inside a system call.  We need to make sure that is actually
> the case, though.
> 
> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
> time, but it shouldn't be too hard to convince ourselves of... mostly a
> matter of making sure something like ptrace can't do stupid crap.
> 

Just in case anyone is getting the wrong impression: this is a
discussion about details.  I'm glad to see this work getting done.

-hpa




Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-04-29 Thread H. Peter Anvin
On 04/29/2014 04:39 PM, Andi Kleen wrote:
>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>> everything is okay because the user gs base and the kernel gs bases are
>> equal.  But if something does try to change the user gs base, then it
>> will accidentally change the kernel gs base instead.
> 
> It doesn't really matter, as they are the same.
> They would just switch identities.
> 
> Besides I don't think anyone does that.
> 

It matters -- greatly -- if (and only if) we can enter the kernel with
usergs == kernelgs and then want to change usergs inside a paranoid
routine.  At that point we risk being upside down, which basically means
we're rooted.

However, I believe this patchset also means only IST entries can be
paranoid, which in turn means we can't sleep inside them.  To the very
best of my knowledge the only times we change usergs is on context
switch or inside a system call.  We need to make sure that is actually
the case, though.

I'm at ELC for a few days, so I'll have limited decent-sized-monitor
time, but it shouldn't be too hard to convince ourselves of... mostly a
matter of making sure something like ptrace can't do stupid crap.

-hpa




Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-04-29 Thread Andi Kleen
> Case 3 is annoying.  If nothing tries to change the user gs base, then
> everything is okay because the user gs base and the kernel gs bases are
> equal.  But if something does try to change the user gs base, then it
> will accidentally change the kernel gs base instead.

It doesn't really matter, as they are the same.
They would just switch identities.

Besides I don't think anyone does that.

> 
> For the IST entries, this should be fine -- cpu migration, scheduling,
> and such are impossible anyway.  For the non-IST entries, I'm less
> convinced.  The entry_64.S code suggests that the problematic entries are:
> 
> double_fault
> stack_segment
> machine_check

I don't think any of them can schedule.

> 
> Of course, all of those entries really do use IST, so I wonder why they
> are paranoid*entry instead of paranoid*entry_ist.  Is it because they're
> supposedly non-recursive?

Yes, only the DEBUG stack is big enough to recurse.

> 
> In any case, wouldn't this all be much simpler and less magical if the
> paranoid entries just saved the old gsbase to rbx and loaded the new
> one?  The exits could do the inverse.  This should be really fast:

I had it originally in a similar scheme, but it was significantly
more complicated, with a changed exit path. So I switched to this "only a
single hook needed" variant, which mirrors the existing code
closely.

> I don't know the actual latencies, but I suspect that this would be
> faster, too -- it removes some branches, and wrgsbase and rdgsbase
> deserve to be faster than swapgs.  It's probably no good for
> non-rd/wrgsbase-capable cpus, though, since I suspect that three MSR
> accesses are much worse than one MSR access and two swapgs calls.

Probably doesn't matter much, it's MUCH faster than the old
code in any case.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base

2014-04-29 Thread Andy Lutomirski
On 04/28/2014 03:12 PM, Andi Kleen wrote:
> From: Andi Kleen 
> 
> IvyBridge added new instructions to directly write the fs and gs
> 64bit base registers. Previously this had to be done with a system
> call to write to MSRs. The main use case is fast user space threading
> and switching the fs/gs registers quickly there.
> 
> The instructions are opt-in and have to be explicitly enabled
> by the OS.
> 
> Previously Linux couldn't support this because the paranoid
> entry code relied on the gs base never being negative outside
> the kernel to decide when to use swapgs. It would check the gs MSR
> value and assume it was already running in kernel if the value
> was already negative.
> 
> This patch changes the paranoid entry code to use rdgsbase
> if available.  Then we check the GS value against the expected GS value
> stored at the bottom of the IST stack. If the value is the expected
> value we skip swapgs.
> 
> This is also significantly faster than an MSR read, so it will speed up
> NMIs (critical for profiling).
> 
> An alternative would have been to save/restore the GS value
> unconditionally, but this approach needs fewer changes.
> 
> Then after these changes we need to also use the new instructions
> to save/restore fs and gs, so that the new values set by the
> users won't disappear.  This is also significantly
> faster for the case when the 64bit base has to be switched
> (that is when GS is larger than 4GB), as we can replace
> the slow MSR write with a faster wr[fg]sbase execution.
> 
> The instructions do not context switch
> the segment index, so the old invariant that fs or gs index
> have to be 0 for a different 64bit value to stick is still
> true. Previously it was enforced by arch_prctl, now the user
> program has to make sure it keeps the segment indexes zero.
> If it doesn't the changes may not stick.
> 
> This in turn enables fast switching when there are
> enough threads that their TLS segment does not fit below 4GB;
> alternatively, programs that use fs as an additional base
> register will not get a significant context switch penalty.
> 
> It is all done in a single patch to avoid bisect crash
> holes.
> 


> +paranoid_save_gs:
> +	.byte 0xf3,0x48,0x0f,0xae,0xc9	# rdgsbaseq %rcx
> +	movq $-EXCEPTION_STKSZ,%rax	# non debug stack size
> +	cmpq $DEBUG_STACK,ORIG_RAX+8(%rsp)
> +	movq $-1,ORIG_RAX+8(%rsp)	# no syscall to restart
> +	jne  1f
> +	movq $-DEBUG_STKSZ,%rax		# debug stack size
> +1:
> +	andq %rsp,%rax			# bottom of stack
> +	movq (%rax),%rdi		# get expected GS
> +	cmpq %rdi,%rcx			# is it the kernel gs?

I don't like this part.  There are now three cases:

1. User gs, gsbase != kernel gs base.  This works the same as before.

2. Kernel gs.  This also works the same as before.

3. User gs, but gsbase == kernel gs base.  This will cause C code to
execute on the *user* gs base.

Case 3 is annoying.  If nothing tries to change the user gs base, then
everything is okay because the user gs base and the kernel gs bases are
equal.  But if something does try to change the user gs base, then it
will accidentally change the kernel gs base instead.
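
To make case 3 concrete, here is a rough user-space model in C, not
kernel code.  The struct and function names are made up for
illustration, and it assumes that "changing the user gs base" from
kernel context means writing the inactive slot (the one swapgs
exchanges with), which is how I understand the existing code to
behave.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one CPU's GS state (hypothetical names, not kernel code). */
struct gs_state {
	uint64_t gs_active;	/* base used by %gs-relative accesses right now */
	uint64_t gs_inactive;	/* the other slot, exchanged by swapgs */
};

/* swapgs exchanges the active and inactive bases. */
static void swapgs(struct gs_state *c)
{
	uint64_t tmp = c->gs_active;

	c->gs_active = c->gs_inactive;
	c->gs_inactive = tmp;
}

/*
 * Compare-based paranoid entry as in the patch: skip swapgs when the
 * active base already equals the expected kernel base.  Returns true
 * if it swapped, so that the exit path knows to swap back.
 */
static bool paranoid_entry(struct gs_state *c, uint64_t expected_kernel_gs)
{
	if (c->gs_active == expected_kernel_gs)
		return false;	/* case 2, and also case 3 */
	swapgs(c);		/* case 1 */
	return true;
}

static void paranoid_exit(struct gs_state *c, bool swapped)
{
	if (swapped)
		swapgs(c);
}

/*
 * Assumption: kernel code that updates the user base writes the
 * inactive slot, trusting that the kernel base is the active one.
 */
static void set_user_gs(struct gs_state *c, uint64_t new_base)
{
	c->gs_inactive = new_base;
}
```

Starting from case 1 (active = user value, inactive = kernel value),
entry, set_user_gs() and exit leave the new user value active and the
kernel value in the inactive slot, as intended.  Starting from case 3
(both slots hold the kernel value), entry skips swapgs, set_user_gs()
overwrites the slot that really holds the kernel base, and exit does
not swap back.  The next ordinary entry's swapgs then makes the
user-chosen value the active base in kernel mode.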

For the IST entries, this should be fine -- cpu migration, scheduling,
and such are impossible anyway.  For the non-IST entries, I'm less
convinced.  The entry_64.S code suggests that the problematic entries are:

double_fault
stack_segment
machine_check

Of course, all of those entries really do use IST, so I wonder why they
are paranoid*entry instead of paranoid*entry_ist.  Is it because they're
supposedly non-recursive?

In any case, wouldn't this all be much simpler and less magical if the
paranoid entries just saved the old gsbase to rbx and loaded the new
one?  The exits could do the inverse.  This should be really fast:

rdgsbaseq %rbx
wrgsbaseq {the correct value}

...

wrgsbaseq %rbx

This still doesn't support changing the usergs value inside a paranoid
entry, but at least it will fail consistently instead of only failing if
the user gs has a particular special value.
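
The same idea as a small C model, again only a sketch with made-up
names: the entry saves whatever base was active and installs the
kernel base unconditionally, and the exit writes the saved value back.
The per-cpu lookup of "the correct value" is assumed to happen
elsewhere and is passed in as a parameter here.

```c
#include <stdint.h>

/* Toy model: only the active GS base matters for this scheme. */
struct cpu_gs {
	uint64_t gs_active;
};

/* Entry: rdgsbaseq %rbx; wrgsbaseq {the correct value} */
static uint64_t paranoid_entry_save(struct cpu_gs *c, uint64_t kernel_gs)
{
	uint64_t saved = c->gs_active;	/* kept in rbx across the handler */

	c->gs_active = kernel_gs;
	return saved;
}

/* Exit: wrgsbaseq %rbx */
static void paranoid_exit_restore(struct cpu_gs *c, uint64_t saved)
{
	c->gs_active = saved;
}
```

There is no comparison against the kernel value, so an entry from user
mode behaves the same whether or not the user base happens to equal
the kernel base.  Anything written to the active base inside the
handler is simply overwritten at exit, which is the consistent failure
described above.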

I don't know the actual latencies, but I suspect that this would be
faster, too -- it removes some branches, and wrgsbase and rdgsbase
deserve to be faster than swapgs.  It's probably no good for
non-rd/wrgsbase-capable cpus, though, since I suspect that three MSR
accesses are much worse than one MSR access and two swapgs calls.

--Andy






