David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add ptep_test_and_clear_{dirty,young}" has introduced an SMP race which affects the Xen pv-ops backend.
In Xen, pagetables are normally kept RO so that the hypervisor can mediate all updates to them. If Xen sees a write to an active (currently pointed to by cr3) or pinned (a currently inactive but registered pagetable), it will trap the write fault and emulate the instruction making the update; this means that most pagetable-modifying code doesn't need to know or care that pagetables are RO. When a pagetable is first created (either in execve or fork), the the Xen paravirt backend pins the pagetable, and conversely, on exit it is unpinned; this is done via the arch_dup_mmap() and activate_mm() hooks. Pinning is done in two phases: first the pagetable pages are marked RO, and then the pagetable is registered with Xen; unpinning is the opposite. This works assuming that the pagetable is not in use, and not yet visible to other cpus. The race on pagetable creation is this: in kernel/fork.c:dup_mmap(), it copies the old pagetable into the new one, and registers each vma with the rmap prio tree. Once everything is copied, it calls arch_dup_mmap(), which ends up doing the Xen pagetable pin. However, because the pagetable is visible to other cpus via the prio tree, pagetable modifications (specifically, clearing the access bit) can race with pinning. If it hits between making the pagetable pages RO but before they're registered with Xen, modifications to the flags will fault, and Xen won't know to do the fixup. The converse is also true in exit_mmap(): arch_exit_mmap is called before removing the vmas from the prio tree, so it can race with unpinning. The specific oops I'm seeing is this: BUG: unable to handle kernel paging request at virtual address c5b023e8 printing eip: c016d3f2 *pdpt = 000000004bc1a001 Oops: 0003 [#1] PREEMPT SMP Modules linked in: CPU: 1 EIP: 0061:[<c016d3f2>] Not tainted VLI EFLAGS: 00010202 (2.6.23-rc9-paravirt #1656) EIP is at page_referenced_one+0xb8/0x12a eax: c0401b17 ebx: c5b023e8 ecx: c2398000 edx: c044ceca esi: 0087d000 edi: c5660688 ebp: c2399af4 esp: c2399acc ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0069 Process cc1 (pid: 31474, ti=c2398000 task=c2dc9000 task.ti=c2398000) Stack: c04014a7 c040f47a 0000011e c03697fe c2399b1c c5eb4500 c113e87c c116b1c8 c5660688 c13aa890 c2399b2c c016d4d8 00000000 c7917340 00000008 00000000 00000000 c13aa8c4 00000000 00000000 00000005 c116b1c8 00000001 c0473940 Call Trace: [<c010927e>] show_trace_log_lvl+0x1a/0x2f [<c0109330>] show_stack_log_lvl+0x9d/0xa5 [<c010952f>] show_registers+0x1f7/0x336 [<c0109789>] die+0x11b/0x23b [<c035e811>] do_page_fault+0x758/0x838 [<c035ccca>] error_code+0x72/0x78 [<c016d4d8>] page_referenced_file+0x74/0xa0 [<c016d732>] page_referenced+0xbd/0xd0 [<c01611b4>] shrink_active_list+0x170/0x3a3 [<c0161e56>] shrink_zone+0xb9/0xf8 [<c0162808>] try_to_free_pages+0x13c/0x208 [<c015e269>] __alloc_pages+0x197/0x290 [<c015fa95>] __do_page_cache_readahead+0xd4/0x1d7 [<c015fe86>] do_page_cache_readahead+0x4b/0x56 [<c015be8d>] filemap_fault+0x1b7/0x3de [<c0164649>] __do_fault+0x79/0x407 [<c0167760>] handle_mm_fault+0x27e/0xca0 [<c035e44a>] do_page_fault+0x391/0x838 [<c035ccca>] error_code+0x72/0x78 ======================= Code: 0c fe 97 36 c0 c7 44 24 08 1e 01 00 00 c7 44 24 04 7a f4 40 c0 c7 04 24 a7 14 40 c0 e8 d4 e5 fb ff e8 29 c9 f9 ff f6 03 20 74 27 <f0> 0f ba 33 05 19 c0 85 c0 74 1c 8b 07 89 f2 89 d9 8d b6 00 00 EIP: [<c016d3f2>] page_referenced_one+0xb8/0x12a SS:ESP 0069:c2399acc It all worked OK before David's change, because asm-generic/pgtable.h uses set_pte_at(), which ends up making a hypercall to update the pagetable, which always works regardless of the state of the pagetable pages. It seems to me that there are a few ways to fix this: 1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This will clearly work, but is pretty blunt. 2. Make test_and_clear_pte_flags a new paravirt-op, which can be implemented in Xen as a hypercall, and as a raw test_and_clear_bit for everyone else. The downside is adding yet another pv-op. 3. Restructure the pagetable setup code so that the mm is not added to the prio tree until after arch_dup_mmap has been called (and the converse for exit_mmap). This is arguably cleaner, but I haven't looked to see how much trouble this would be. Thoughts anyone? Does making the pagetables visible "early" cause problems for anyone else? Thanks, J - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/