On Sun, Mar 8, 2015 at 3:13 PM, Denys Vlasenko <vda.li...@googlemail.com> wrote: > On Sat, Mar 7, 2015 at 1:33 AM, Andy Lutomirski <l...@amacapital.net> wrote: >> On Fri, Mar 6, 2015 at 3:30 PM, Fengguang Wu <fengguang...@intel.com> wrote: >>> Greetings, >>> >>> 0day kernel testing robot got the below dmesg and the first bad commit is >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/asm >>> >>> commit 75182b1632a89f12540baa1806a7c5c180db620c >>> Author: Andy Lutomirski <l...@amacapital.net> >>> AuthorDate: Thu Mar 5 19:19:03 2015 -0800 >>> Commit: Ingo Molnar <mi...@kernel.org> >>> CommitDate: Fri Mar 6 08:32:57 2015 +0100 >>> >>> x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0() >> >> This problem only happens on 32-bit kernels, and the culprit is, sort of: >> >> /* >> * The below -8 is to reserve 8 bytes on top of the ring0 stack. >> * This is necessary to guarantee that the entire "struct pt_regs" >> * is accessible even if the CPU haven't stored the SS/ESP registers >> * on the stack (interrupt gate does not save these registers >> * when switching to the same priv ring). >> * Therefore beware: accessing the ss/esp fields of the >> * "struct pt_regs" is possible, but they may contain the >> * completely wrong values. >> */ >> #define task_pt_regs(task) \ >> ({ \ >> struct pt_regs *__regs__; \ >> __regs__ = (struct pt_regs *)(KSTK_TOP(task_stack_page(task))-8); \ >> __regs__ - 1; \ >> }) >> >> I'm confused about multiple things: >> >> 1. I don't understand this comment. > > Comment says that in 32-bit x86, interrupts and exceptions > in ring 0 do not push SS,ESP - they only save EFLAGS,CS,EIP > in iret frame. (This happens because CPL doesn't > change, not beacuse ot is zero). > > IRET insn likewise does not restore SS,ESP if it detects > that RPL(stack_CS) = RPL(CS). > > This was changed for 64-bit more for two reasons: > (1) it's simpler, and > (2) it makes it possible for CPU to automatically align stack > to 16-byte boundary (you need to undo that on iretq, > therefore you need to always save old RSP). > > From AMD docs: > > "8.9.3 Interrupt Stack Frame > In long mode, the return-program stack pointer (SS:RSP) is always > pushed onto the interrupt-handler > stack, regardless of whether or not a privilege change occurs. > Although the SS register is not used in > 64-bit mode, SS is pushed to allow returns into compatibility mode. > Pushing SS:RSP unconditionally > presents operating systems with a consistent interrupt-stack-frame > size for all interrupts, except for > error codes. Interrupt service-routine entry points that handle > interrupts generated by non-error-code > interrupts can push an error code on the stack for consistency. > In long mode, when a control transfer to an interrupt handler occurs, > the processor performs the > following: > 1. Aligns the new interrupt-stack frame by masking RSP with > FFFF_FFFF_FFFF_FFF0h. > ..."
I believe kernel threads used to run with the stack pointer starting right at the top of the stack. So if a kernel thread was interrupted then regs->ss and regs->esp might be above the stack in the next page (which could possibly fault). However, any code that looks at those two values on 32-bit without checking regs->cs first is buggy. Even if they can be read they are not valid if there was no CPL change. It appears that kernel threads now have an empty pt_regs struct on the top of the stack, which is used for execve. That would mean there is sufficient buffer for accessing regs->esp and regs->ss without crossing the top of the stack. The 8 byte offset is not needed anymore. >> 2. copy_thread thinks that sp0 should be task_pt_regs + 1, which is 8 >> bytes below the top of the stack. cpu_tss, formerly INIT_TSS, thinks >> that sp0 should be the top of the stack. This is inconsistent. What >> gives? > > I ran a small test: made "umask" syscall print where pt_regs is, > and where it's end is: > > SYSCALL_DEFINE1(umask, int, mask) > { > struct pt_regs *regs = task_pt_regs(current); > unsigned long ptregs = (unsigned long)regs; > > pr_err("[%d] current:%lx pt_regs:%lx pt_regs_end:%lx\n", > current->pid, (long)current, ptregs, ptregs + sizeof(*regs)); > .... > > Result: > > / # umask > [ 8.393450] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8 > [ 8.396601] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8 > > It seems to be true that 32-bit x86 always has two unused words > beyond "struct pt_regs" on stack during syscalls. > > IOW: tss.sp0 is *not* set at the very end of the stack on 32-bit. > >> 3. vm86 does something terrible to sp0, which might mean that we can't >> use sp0 to find the top of the stack in general on 32-bit kernels, >> which is sad. We might need a partial revert or to introduce a >> per-cpu variable "real_sp0" on 32-bit. Ugh. Could someone who >> understands vm86 mode better than me give me a hint? vm86 mode alters sp0 in order to save the 32-bit pt_regs on the top of the stack. The iret frame for vm86 mode includes ds/es/fs/gs on the stack before the normal iret frame. This is so that the segments can be set for real mode semantics. So the normal pt_regs would not be large enough to include those extra segments in the ire frame. I had started work on a patch to clean this up, by saving the 32-bit regs and other data off the stack, but never finished it. I may look at it again soon. >> 4. Not directly related, but the 32-bit tss_struct contains this gem: >> >> /* >> * .. and then another 0x100 bytes for the emergency kernel stack: >> */ >> unsigned long stack[64]; >> >> Last I checked, 0x100 != 64. Also, wow, this is kind of disgusting. :) > > > Seems to be unused: I commented it out on "defconfig" build > and got no build errors. This is only for the sysenter code on a 32-bit kernel. It should be commented out for a 64-bit kernel. -- Brian Gerst -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/