On Sat, Mar 7, 2015 at 1:33 AM, Andy Lutomirski <l...@amacapital.net> wrote: > On Fri, Mar 6, 2015 at 3:30 PM, Fengguang Wu <fengguang...@intel.com> wrote: >> Greetings, >> >> 0day kernel testing robot got the below dmesg and the first bad commit is >> >> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/asm >> >> commit 75182b1632a89f12540baa1806a7c5c180db620c >> Author: Andy Lutomirski <l...@amacapital.net> >> AuthorDate: Thu Mar 5 19:19:03 2015 -0800 >> Commit: Ingo Molnar <mi...@kernel.org> >> CommitDate: Fri Mar 6 08:32:57 2015 +0100 >> >> x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0() > > This problem only happens on 32-bit kernels, and the culprit is, sort of: > > /* > * The below -8 is to reserve 8 bytes on top of the ring0 stack. > * This is necessary to guarantee that the entire "struct pt_regs" > * is accessible even if the CPU haven't stored the SS/ESP registers > * on the stack (interrupt gate does not save these registers > * when switching to the same priv ring). > * Therefore beware: accessing the ss/esp fields of the > * "struct pt_regs" is possible, but they may contain the > * completely wrong values. > */ > #define task_pt_regs(task) \ > ({ \ > struct pt_regs *__regs__; \ > __regs__ = (struct pt_regs *)(KSTK_TOP(task_stack_page(task))-8); \ > __regs__ - 1; \ > }) > > I'm confused about multiple things: > > 1. I don't understand this comment.
Comment says that in 32-bit x86, interrupts and exceptions in ring 0 do not push SS,ESP - they only save EFLAGS,CS,EIP in iret frame. (This happens because CPL doesn't change, not beacuse ot is zero). IRET insn likewise does not restore SS,ESP if it detects that RPL(stack_CS) = RPL(CS). This was changed for 64-bit more for two reasons: (1) it's simpler, and (2) it makes it possible for CPU to automatically align stack to 16-byte boundary (you need to undo that on iretq, therefore you need to always save old RSP). >From AMD docs: "8.9.3 Interrupt Stack Frame In long mode, the return-program stack pointer (SS:RSP) is always pushed onto the interrupt-handler stack, regardless of whether or not a privilege change occurs. Although the SS register is not used in 64-bit mode, SS is pushed to allow returns into compatibility mode. Pushing SS:RSP unconditionally presents operating systems with a consistent interrupt-stack-frame size for all interrupts, except for error codes. Interrupt service-routine entry points that handle interrupts generated by non-error-code interrupts can push an error code on the stack for consistency. In long mode, when a control transfer to an interrupt handler occurs, the processor performs the following: 1. Aligns the new interrupt-stack frame by masking RSP with FFFF_FFFF_FFFF_FFF0h. ..." > 2. copy_thread thinks that sp0 should be task_pt_regs + 1, which is 8 > bytes below the top of the stack. cpu_tss, formerly INIT_TSS, thinks > that sp0 should be the top of the stack. This is inconsistent. What > gives? I ran a small test: made "umask" syscall print where pt_regs is, and where it's end is: SYSCALL_DEFINE1(umask, int, mask) { struct pt_regs *regs = task_pt_regs(current); unsigned long ptregs = (unsigned long)regs; pr_err("[%d] current:%lx pt_regs:%lx pt_regs_end:%lx\n", current->pid, (long)current, ptregs, ptregs + sizeof(*regs)); .... Result: / # umask [ 8.393450] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8 [ 8.396601] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8 It seems to be true that 32-bit x86 always has two unused words beyond "struct pt_regs" on stack during syscalls. IOW: tss.sp0 is *not* set at the very end of the stack on 32-bit. > 3. vm86 does something terrible to sp0, which might mean that we can't > use sp0 to find the top of the stack in general on 32-bit kernels, > which is sad. We might need a partial revert or to introduce a > per-cpu variable "real_sp0" on 32-bit. Ugh. Could someone who > understands vm86 mode better than me give me a hint? > > 4. Not directly related, but the 32-bit tss_struct contains this gem: > > /* > * .. and then another 0x100 bytes for the emergency kernel stack: > */ > unsigned long stack[64]; > > Last I checked, 0x100 != 64. Also, wow, this is kind of disgusting. :) Seems to be unused: I commented it out on "defconfig" build and got no build errors. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/