Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On 05/01/2014 03:18 PM, Andi Kleen wrote:
>> I haven't looked through the flows (I'm at LCE so I have limited screen
>> bandwidth) to see how that would be handled in this case, but in the
>> general paranoid case it comes down to the fact that in this particular
>> subcase we don't necessarily know exactly how many SWAPGS are between us
>> and userspace after we IRET.
>
> There is none as far as I know. Certainly wasn't any when the code
> was originally written.
>

This applies for an asynchronous entry from kernel space.  Obviously in
the case where we actually come directly from user space (the stack
frame CS.RPL == 3) then that doesn't apply.

	-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
> I haven't looked through the flows (I'm at LCE so I have limited screen
> bandwidth) to see how that would be handled in this case, but in the
> general paranoid case it comes down to the fact that in this particular
> subcase we don't necessarily know exactly how many SWAPGS are between us
> and userspace after we IRET.

There is none as far as I know. Certainly wasn't any when the code
was originally written.

-Andi

--
a...@linux.intel.com -- Speaking for myself only

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On Thu, May 1, 2014 at 2:58 PM, H. Peter Anvin wrote:
> On 05/01/2014 02:15 PM, Andi Kleen wrote:
>>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>>> up in paranoid_userspace.
>>
>> You may miss a reschedule in this obscure case. It shouldn't really
>> happen because loading a kernel pointer is not useful for user space.
>>
>> Doesn't seem like a real issue to me.
>>
>> We only happen to need to handle it to avoid crashing.
>>
>
> No, it would be a rootable security hole, not just a crash.
>
>>> Alternatively, what if the paranoid entry checked whether we're coming
>>> from userspace at the very beginning and, if so, just jumped to the
>>> non-paranoid entry?
>>
>> That would work, but I doubt it would be worth it.
>
> If that would solve the problem it is simple enough, but the tricky part
> is when we end up in a "crack" where we are in kernel mode with the user GS.
>
> I haven't looked through the flows (I'm at LCE so I have limited screen
> bandwidth) to see how that would be handled in this case, but in the
> general paranoid case it comes down to the fact that in this particular
> subcase we don't necessarily know exactly how many SWAPGS are between us
> and userspace after we IRET.

The current code looks like it will never try to reschedule on
paranoid exit unless it came from user *CS*, in which case there
shouldn't be any weird gs issues.

Given that the current code won't reschedule even on a paranoid entry
that hits during interruptible kernel code, I find it unlikely that
this code is important.  You probably know more about its history and
significance than I do.  What happens when ftrace or perf tries to
wake a task from a debug interrupt or NMI?

--Andy

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On 05/01/2014 02:15 PM, Andi Kleen wrote:
>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>> up in paranoid_userspace.
>
> You may miss a reschedule in this obscure case. It shouldn't really
> happen because loading a kernel pointer is not useful for user space.
>
> Doesn't seem like a real issue to me.
>
> We only happen to need to handle it to avoid crashing.
>

No, it would be a rootable security hole, not just a crash.

>> Alternatively, what if the paranoid entry checked whether we're coming
>> from userspace at the very beginning and, if so, just jumped to the
>> non-paranoid entry?
>
> That would work, but I doubt it would be worth it.

If that would solve the problem it is simple enough, but the tricky part
is when we end up in a "crack" where we are in kernel mode with the user GS.

I haven't looked through the flows (I'm at LCE so I have limited screen
bandwidth) to see how that would be handled in this case, but in the
general paranoid case it comes down to the fact that in this particular
subcase we don't necessarily know exactly how many SWAPGS are between us
and userspace after we IRET.

	-hpa

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On Thu, May 1, 2014 at 2:51 PM, Andi Kleen wrote:
>> Allowing userspace to prevent itself from being rescheduled by loading
>> something strange into gsbase seems unfortunate.
>
> The timer tick will eventually catch it, so any delay is tightly
> bounded.
>

What about NO_HZ_FULL?

> Also still gets rescheduled most of the time, just not when a paranoid
> exception handler is running.

If rescheduling on exit from a paranoid exception handler isn't
important, then let's just remove it.  Otherwise let's keep it
working.

--Andy

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
> Allowing userspace to prevent itself from being rescheduled by loading
> something strange into gsbase seems unfortunate.

The timer tick will eventually catch it, so any delay is tightly
bounded.

Also still gets rescheduled most of the time, just not when a paranoid
exception handler is running.

-Andi

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On Thu, May 1, 2014 at 2:15 PM, Andi Kleen wrote:
>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>> up in paranoid_userspace.
>
> You may miss a reschedule in this obscure case. It shouldn't really
> happen because loading a kernel pointer is not useful for user space.
>
> Doesn't seem like a real issue to me.
>
> We only happen to need to handle it to avoid crashing.

Allowing userspace to prevent itself from being rescheduled by loading
something strange into gsbase seems unfortunate.

--Andy

>
>> Alternatively, what if the paranoid entry checked whether we're coming
>> from userspace at the very beginning and, if so, just jumped to the
>> non-paranoid entry?
>
> That would work, but I doubt it would be worth it.
>
> -Andi

--
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
> If usergs == kernelgs, then ebx will always be 1 and we'll never end
> up in paranoid_userspace.

You may miss a reschedule in this obscure case. It shouldn't really
happen because loading a kernel pointer is not useful for user space.

Doesn't seem like a real issue to me.

We only happen to need to handle it to avoid crashing.

> Alternatively, what if the paranoid entry checked whether we're coming
> from userspace at the very beginning and, if so, just jumped to the
> non-paranoid entry?

That would work, but I doubt it would be worth it.

-Andi

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On Wed, Apr 30, 2014 at 4:44 PM, Andy Lutomirski wrote:
> On Tue, Apr 29, 2014 at 9:52 PM, H. Peter Anvin wrote:
>> On 04/29/2014 04:39 PM, Andi Kleen wrote:
>>>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>>>> everything is okay because the user gs base and the kernel gs bases are
>>>> equal.  But if something does try to change the user gs base, then it
>>>> will accidentally change the kernel gs base instead.
>>>
>>> It doesn't really matter, as they are the same.
>>> They would just switch identities.
>>>
>>> Besides I don't think anyone does that.
>>>
>>
>> It matters -- greatly -- if (and only if) we can enter the kernel with
>> usergs == kernelgs and then want to change usergs inside a paranoid
>> routine.  At that point we risk being upside down, which basically means
>> we're rooted.
>>
>> However, I believe this patchset also means only IST entries can be
>> paranoid, which in turn means we can't sleep inside them.  To the very
>> best of my knowledge the only times we change usergs is on context
>> switch or inside a system call.  We need to make sure that is actually
>> the case, though.
>>
>> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
>> time, but it shouldn't be too hard to convince ourselves of... mostly a
>> matter of making sure something like ptrace can't do stupid crap.
>
> The only things that look relevant are the context switch paths and
> the kvm stuff.  I don't know what happens if an IST exception happens
> while running a guest, though.  TBH I have no idea what the VMX and
> SVM interfaces look like.
>
> paranoid_schedule looks scary.  If I'm understanding it correctly, it
> expects to be executed with gs == usergs.  I think it's okay, since it
> will only be invoked if we trapped from userspace, in which case the
> state is well-defined.  But this bit could be wrong:
>
>         testl %ebx,%ebx         /* swapgs needed? */
>         jnz paranoid_restore
>         testl $3,CS(%rsp)
>         jnz paranoid_userspace
>
> If usergs == kernelgs, then ebx will always be 1 and we'll never end
> up in paranoid_userspace.
>
> This could be fixed in two ways.  We could just switch the order of
> the tests, since the only way to have ebx == 1 and CS with CPL == 3
> should be if we're coming from userspace with usergs==kernelgs.  Or we
> could get rid of the paranoid schedule code entirely.  Is it actually
> needed for anything?  Timer and rescheduling interrupts shouldn't be
> paranoid, and if there's any paranoid code that will trigger a
> reschedule, couldn't it do it much more sanely by sending an IPI to
> self and thus deferring the reschedule until interrupts are enabled?

Having just asked this, isn't the current code already broken if
something like an NMI or MCE tried to reschedule the current cpu?  It
could hit just before running hlt in the non-polling idle loop or it
could happen during execution of a kernel thread.  In either case, I
don't see why anything is guaranteed to notice the resched flag being
set.

--Andy

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On Tue, Apr 29, 2014 at 9:52 PM, H. Peter Anvin wrote:
> On 04/29/2014 04:39 PM, Andi Kleen wrote:
>>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>>> everything is okay because the user gs base and the kernel gs bases are
>>> equal.  But if something does try to change the user gs base, then it
>>> will accidentally change the kernel gs base instead.
>>
>> It doesn't really matter, as they are the same.
>> They would just switch identities.
>>
>> Besides I don't think anyone does that.
>>
>
> It matters -- greatly -- if (and only if) we can enter the kernel with
> usergs == kernelgs and then want to change usergs inside a paranoid
> routine.  At that point we risk being upside down, which basically means
> we're rooted.
>
> However, I believe this patchset also means only IST entries can be
> paranoid, which in turn means we can't sleep inside them.  To the very
> best of my knowledge the only times we change usergs is on context
> switch or inside a system call.  We need to make sure that is actually
> the case, though.
>
> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
> time, but it shouldn't be too hard to convince ourselves of... mostly a
> matter of making sure something like ptrace can't do stupid crap.

The only things that look relevant are the context switch paths and
the kvm stuff.  I don't know what happens if an IST exception happens
while running a guest, though.  TBH I have no idea what the VMX and
SVM interfaces look like.

paranoid_schedule looks scary.  If I'm understanding it correctly, it
expects to be executed with gs == usergs.  I think it's okay, since it
will only be invoked if we trapped from userspace, in which case the
state is well-defined.  But this bit could be wrong:

        testl %ebx,%ebx         /* swapgs needed? */
        jnz paranoid_restore
        testl $3,CS(%rsp)
        jnz paranoid_userspace

If usergs == kernelgs, then ebx will always be 1 and we'll never end
up in paranoid_userspace.

This could be fixed in two ways.  We could just switch the order of
the tests, since the only way to have ebx == 1 and CS with CPL == 3
should be if we're coming from userspace with usergs==kernelgs.  Or we
could get rid of the paranoid schedule code entirely.  Is it actually
needed for anything?  Timer and rescheduling interrupts shouldn't be
paranoid, and if there's any paranoid code that will trigger a
reschedule, couldn't it do it much more sanely by sending an IPI to
self and thus deferring the reschedule until interrupts are enabled?

Alternatively, what if the paranoid entry checked whether we're coming
from userspace at the very beginning and, if so, just jumped to the
non-paranoid entry?

--Andy

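The ordering question in the message above can be sketched as a small
C model.  This is only an illustration, not the entry_64.S code: the
enum and function names are made up, EXIT_FALLTHROUGH stands in for
whatever follows the two tests in the real file, and the only things
taken from the thread are the two tests and the labels
paranoid_restore and paranoid_userspace.

```c
#include <assert.h>

/* Hypothetical names for the three places the exit path can go. */
enum exit_path {
    EXIT_PARANOID_RESTORE,   /* label paranoid_restore in the snippet */
    EXIT_PARANOID_USERSPACE, /* label paranoid_userspace in the snippet */
    EXIT_FALLTHROUGH         /* neither jnz taken (assumed: swapgs and return) */
};

/* Order as quoted: test ebx first, then the saved CS. */
enum exit_path exit_path_as_quoted(int ebx, unsigned saved_cs)
{
    assert(ebx == 0 || ebx == 1);      /* ebx is treated as a flag here */
    if (ebx != 0)                      /* testl %ebx,%ebx ; jnz paranoid_restore */
        return EXIT_PARANOID_RESTORE;
    if (saved_cs & 3)                  /* testl $3,CS(%rsp) ; jnz paranoid_userspace */
        return EXIT_PARANOID_USERSPACE;
    return EXIT_FALLTHROUGH;
}

/* Proposed alternative: test the saved CS first, then ebx. */
enum exit_path exit_path_swapped(int ebx, unsigned saved_cs)
{
    assert(ebx == 0 || ebx == 1);
    if (saved_cs & 3)
        return EXIT_PARANOID_USERSPACE;
    if (ebx != 0)
        return EXIT_PARANOID_RESTORE;
    return EXIT_FALLTHROUGH;
}
```

With ebx == 1 and a user-mode CS (the usergs == kernelgs case), the
quoted order returns EXIT_PARANOID_RESTORE and the swapped order
returns EXIT_PARANOID_USERSPACE.  The model only shows which label is
reached; it says nothing about what paranoid_userspace would then have
to do about swapgs when ebx == 1.
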
Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On 04/29/2014 09:52 PM, H. Peter Anvin wrote:
>
> It matters -- greatly -- if (and only if) we can enter the kernel with
> usergs == kernelgs and then want to change usergs inside a paranoid
> routine.  At that point we risk being upside down, which basically means
> we're rooted.
>
> However, I believe this patchset also means only IST entries can be
> paranoid, which in turn means we can't sleep inside them.  To the very
> best of my knowledge the only times we change usergs is on context
> switch or inside a system call.  We need to make sure that is actually
> the case, though.
>
> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
> time, but it shouldn't be too hard to convince ourselves of... mostly a
> matter of making sure something like ptrace can't do stupid crap.
>

Just in case anyone is getting the wrong impression: this is a
discussion about details.  I'm glad to see this work getting done.

	-hpa

Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On 04/29/2014 04:39 PM, Andi Kleen wrote:
>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>> everything is okay because the user gs base and the kernel gs bases are
>> equal.  But if something does try to change the user gs base, then it
>> will accidentally change the kernel gs base instead.
>
> It doesn't really matter, as they are the same.
> They would just switch identities.
>
> Besides I don't think anyone does that.
>

It matters -- greatly -- if (and only if) we can enter the kernel with
usergs == kernelgs and then want to change usergs inside a paranoid
routine.  At that point we risk being upside down, which basically means
we're rooted.

However, I believe this patchset also means only IST entries can be
paranoid, which in turn means we can't sleep inside them.  To the very
best of my knowledge the only times we change usergs is on context
switch or inside a system call.  We need to make sure that is actually
the case, though.

I'm at ELC for a few days, so I'll have limited decent-sized-monitor
time, but it shouldn't be too hard to convince ourselves of... mostly a
matter of making sure something like ptrace can't do stupid crap.

	-hpa

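The "upside down" scenario described above can be traced with a toy C
model of the two gs bases.  Everything here is an assumption made for
illustration: the struct, the function names, and in particular the
idea that the paranoid entry decides whether to SWAPGS by comparing
the active base with the expected kernel value (which is what the
"usergs == kernelgs means ebx == 1" remarks in this thread suggest).
It is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: the base in use and the one SWAPGS would bring in. */
struct cpu_gs {
    uint64_t active;
    uint64_t inactive;
};

static void swapgs(struct cpu_gs *c)
{
    uint64_t tmp = c->active;
    c->active = c->inactive;
    c->inactive = tmp;
}

/* Assumed entry logic: returns 1 (the "ebx" flag) if no swap was needed. */
static int paranoid_entry(struct cpu_gs *c, uint64_t kernel_base)
{
    if (c->active == kernel_base)
        return 1;
    swapgs(c);
    return 0;
}

static void paranoid_exit(struct cpu_gs *c, int no_swap_needed)
{
    if (!no_swap_needed)
        swapgs(c);
}

/* Kernel code that "changes the user gs base" writes the inactive slot. */
static void set_user_gs_base(struct cpu_gs *c, uint64_t value)
{
    c->inactive = value;
}

/*
 * User space has loaded kernel_base as its own gs base, a paranoid
 * handler changes the user base to new_user_base, and later an ordinary
 * (non-paranoid) entry from user space does its unconditional SWAPGS.
 * Returns the base the kernel would then be running on.
 */
uint64_t kernel_base_after_case3(uint64_t kernel_base, uint64_t new_user_base)
{
    struct cpu_gs c = { .active = kernel_base, .inactive = kernel_base };
    int no_swap_needed = paranoid_entry(&c, kernel_base);

    assert(no_swap_needed == 1);       /* bases are equal, so no swap */
    set_user_gs_base(&c, new_user_base);
    paranoid_exit(&c, no_swap_needed); /* still no swap on the way out */
    swapgs(&c);                        /* next ordinary entry from user space */
    return c.active;
}
```

In this model the function returns new_user_base, i.e. the kernel ends
up running on a value chosen through the "user" slot, which is the
outcome the message above calls being rooted.  Whether any real path
can change usergs inside a paranoid routine is exactly the open
question in the thread.
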
Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
> Case 3 is annoying. If nothing tries to change the user gs base, then
> everything is okay because the user gs base and the kernel gs base are
> equal. But if something does try to change the user gs base, then it
> will accidentally change the kernel gs base instead.

It doesn't really matter, as they are the same.
They would just switch identities.

Besides I don't think anyone does that.

>
> For the IST entries, this should be fine -- cpu migration, scheduling,
> and such are impossible anyway. For the non-IST entries, I'm less
> convinced. The entry_64.S code suggests that the problematic entries are:
>
> double_fault
> stack_segment
> machine_check

I don't think any of them can schedule.

>
> Of course, all of those entries really do use IST, so I wonder why they
> are paranoid*entry instead of paranoid*entry_ist. Is it because they're
> supposedly non-recursive?

Yes, only the DEBUG stack is big enough to recurse.

>
> In any case, wouldn't this all be much simpler and less magical if the
> paranoid entries just saved the old gsbase to rbx and loaded the new
> one? The exits could do the inverse. This should be really fast:

I had it originally in a similar scheme, but it was significantly
more complicated, with a changed exit path. So I switched to this
"only a single hook needed" variant, which mirrors the existing
code closely.

> I don't know the actual latencies, but I suspect that this would be
> faster, too -- it removes some branches, and wrgsbase and rdgsbase
> deserve to be faster than swapgs. It's probably no good for
> non-rd/wrgsbase-capable cpus, though, since I suspect that three MSR
> accesses are much worse than one MSR access and two swapgs calls.

Probably doesn't matter much, it's MUCH faster than the old code
in any case.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
On 04/28/2014 03:12 PM, Andi Kleen wrote:
> From: Andi Kleen <a...@linux.intel.com>
>
> IvyBridge added new instructions to directly write the fs and gs
> 64bit base registers. Previously this had to be done with a system
> call to write to MSRs. The main use case is fast user space threading
> and switching the fs/gs registers quickly there.
>
> The instructions are opt-in and have to be explicitly enabled
> by the OS.
>
> Previously Linux couldn't support this because the paranoid
> entry code relied on the gs base never being negative outside
> the kernel to decide when to use swapgs. It would check the gs MSR
> value and assume it was already running in kernel if the value
> was already negative.
>
> This patch changes the paranoid entry code to use rdgsbase
> if available. Then we check the GS value against the expected GS value
> stored at the bottom of the IST stack. If the value is the expected
> value we skip swapgs.
>
> This is also significantly faster than an MSR read, so will speed
> NMIs (critical for profiling)
>
> An alternative would have been to save/restore the GS value
> unconditionally, but this approach needs fewer changes.
>
> Then after these changes we need to also use the new instructions
> to save/restore fs and gs, so that the new values set by the
> users won't disappear. This is also significantly
> faster for the case when the 64bit base has to be switched
> (that is when GS is larger than 4GB), as we can replace
> the slow MSR write with a faster wr[fg]sbase execution.
>
> The instructions do not context switch
> the segment index, so the old invariant that fs or gs index
> have to be 0 for a different 64bit value to stick is still
> true. Previously it was enforced by arch_prctl, now the user
> program has to make sure it keeps the segment indexes zero.
> If it doesn't the changes may not stick.
>
> This in turn enables fast switching when there are
> enough threads that their TLS segment does not fit below 4GB,
> or alternatively programs that use fs as an additional base
> register will not get a significant context switch penalty.
>
> It is all done in a single patch to avoid bisect crash
> holes.
>
> +paranoid_save_gs:
> +	.byte 0xf3,0x48,0x0f,0xae,0xc9	# rdgsbaseq %rcx
> +	movq $-EXCEPTION_STKSZ,%rax	# non debug stack size
> +	cmpq $DEBUG_STACK,ORIG_RAX+8(%rsp)
> +	movq $-1,ORIG_RAX+8(%rsp)	# no syscall to restart
> +	jne  1f
> +	movq $-DEBUG_STKSZ,%rax		# debug stack size
> +1:
> +	andq %rsp,%rax			# bottom of stack
> +	movq (%rax),%rdi		# get expected GS
> +	cmpq %rdi,%rcx			# is it the kernel gs?

I don't like this part. There are now three cases:

1. User gs, gsbase != kernel gs base. This works the same as before.

2. Kernel gs. This also works the same as before.

3. User gs, but gsbase == kernel gs base. This will cause C code to
execute on the *user* gs base.

Case 3 is annoying. If nothing tries to change the user gs base, then
everything is okay because the user gs base and the kernel gs base are
equal. But if something does try to change the user gs base, then it
will accidentally change the kernel gs base instead.

For the IST entries, this should be fine -- cpu migration, scheduling,
and such are impossible anyway. For the non-IST entries, I'm less
convinced. The entry_64.S code suggests that the problematic entries are:

double_fault
stack_segment
machine_check

Of course, all of those entries really do use IST, so I wonder why they
are paranoid*entry instead of paranoid*entry_ist. Is it because they're
supposedly non-recursive?

In any case, wouldn't this all be much simpler and less magical if the
paranoid entries just saved the old gsbase to rbx and loaded the new
one? The exits could do the inverse. This should be really fast:

	rdgsbaseq %rbx
	wrgsbaseq {the correct value}

	...

	wrgsbaseq %rbx

This still doesn't support changing the usergs value inside a paranoid
entry, but at least it will fail consistently instead of only failing if
the user gs has a particular special value.

I don't know the actual latencies, but I suspect that this would be
faster, too -- it removes some branches, and wrgsbase and rdgsbase
deserve to be faster than swapgs. It's probably no good for
non-rd/wrgsbase-capable cpus, though, since I suspect that three MSR
accesses are much worse than one MSR access and two swapgs calls.

--Andy
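To make the two pieces above concrete, here is a rough user-space sketch in C. It is not taken from the patch, and stack_bottom(), paranoid_enter_save(), paranoid_exit_restore() and save_restore_round_trip() are hypothetical names. The first function mirrors the "andq %rsp,%rax" step under the assumption that the stack size is a power of two and the stack is aligned to its size. The rest models the unconditional save/restore idea with a plain variable standing in for the active GS base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bottom of a size-aligned stack: rsp & -size, as in the patch's
 * "andq %rsp,%rax" with %rax preloaded with -STKSZ. Assumes
 * stack_size is a power of two. */
uint64_t stack_bottom(uint64_t rsp, uint64_t stack_size)
{
	return rsp & -stack_size;
}

/* Unconditional scheme: remember whatever base was active
 * (rdgsbaseq %rbx) and load the kernel's (wrgsbaseq ...). */
uint64_t paranoid_enter_save(uint64_t *active_gs, uint64_t kernel_gs)
{
	uint64_t saved = *active_gs;

	*active_gs = kernel_gs;
	return saved;
}

/* Exit does the inverse (wrgsbaseq %rbx). */
void paranoid_exit_restore(uint64_t *active_gs, uint64_t saved)
{
	*active_gs = saved;
}

/* Returns true if the interrupted context gets back exactly the base
 * it had on entry, whatever that value was. */
bool save_restore_round_trip(uint64_t interrupted_gs, uint64_t kernel_gs)
{
	uint64_t active = interrupted_gs;
	uint64_t saved = paranoid_enter_save(&active, kernel_gs);

	/* The handler always runs on the kernel base; no comparison
	 * against a special value is involved. */
	assert(active == kernel_gs);

	paranoid_exit_restore(&active, saved);
	return active == interrupted_gs;
}
```

The point of the second half is that entry and exit behave the same for every interrupted base, including one that happens to equal the kernel's. It does not model a write to the user gs base from inside the handler, which this scheme does not support either.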