Re: Need help in bug in isolate_migratepages_range
On Mon, 3 Feb 2014, David Rientjes wrote:

> On Mon, 3 Feb 2014, Vlastimil Babka wrote:
>
>> It seems to come from balloon_page_movable() and its test
>> page_count(page) == 1.
>
> Hmm, I think it might be because compound_head() == NULL here. Holger,
> this looks like a race condition when allocating a compound page, did you
> only see it once or is it actually reproducible?

No, this only happened once. It is not reproducible. The system was
running for four days without problems. And before this kernel, five
years without any problems.

Thanks,
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Need help in bug in isolate_migratepages_range
On Mon, 3 Feb 2014, Vlastimil Babka wrote:

> It seems to come from balloon_page_movable() and its test
> page_count(page) == 1.

Hmm, I think it might be because compound_head() == NULL here. Holger,
this looks like a race condition when allocating a compound page, did you
only see it once or is it actually reproducible?

I think this happens when a new compound page is allocated and PageBuddy
is cleared before prep_compound_page() and then we see PageTail(p) set but
p->first_page is not yet initialized. Is there any way to avoid memory
barriers in compound_page()?
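The window described above can be modelled in user space. What follows is only a minimal sketch under stated assumptions, not kernel code: `struct mock_page`, `MOCK_PG_TAIL` and all `mock_*` helpers are hypothetical names invented for illustration, and C11 acquire/release atomics stand in for whatever barriers the kernel would use. It shows that a reader which sees the tail flag before `first_page` is written gets NULL back, and that one conventional remedy is to write `first_page` first and publish the flag afterwards.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct page. Only the two fields
 * discussed in the thread are modelled; this is not the kernel layout. */
#define MOCK_PG_TAIL (1UL << 15) /* bit position is an assumption of this sketch */

struct mock_page {
	atomic_ulong flags;
	struct mock_page *first_page;
};

/* Suspected racy order: the tail flag becomes visible while first_page
 * is still NULL. */
static void mock_set_tail_only(struct mock_page *tail)
{
	atomic_fetch_or_explicit(&tail->flags, MOCK_PG_TAIL, memory_order_relaxed);
}

/* Ordered alternative: write first_page, then publish the flag with
 * release semantics. */
static void mock_publish_tail(struct mock_page *tail, struct mock_page *head)
{
	tail->first_page = head;
	atomic_fetch_or_explicit(&tail->flags, MOCK_PG_TAIL, memory_order_release);
}

/* Reader: the acquire load pairs with the release above, so a visible
 * tail flag implies a visible first_page when mock_publish_tail() was used. */
static struct mock_page *mock_compound_head(struct mock_page *page)
{
	unsigned long flags = atomic_load_explicit(&page->flags, memory_order_acquire);

	return (flags & MOCK_PG_TAIL) ? page->first_page : page;
}

/* Single-threaded walk through both orders. */
static void mock_demo(void)
{
	struct mock_page head, racy, ordered;

	atomic_init(&head.flags, 0);
	head.first_page = NULL;
	atomic_init(&racy.flags, 0);
	racy.first_page = NULL;
	atomic_init(&ordered.flags, 0);
	ordered.first_page = NULL;

	/* Flag set, first_page not yet written: the reader is handed NULL. */
	mock_set_tail_only(&racy);
	assert(mock_compound_head(&racy) == NULL);

	/* Pointer written before the flag: the reader gets the head page. */
	mock_publish_tail(&ordered, &head);
	assert(mock_compound_head(&ordered) == &head);
}
```

Calling `mock_demo()` runs both cases. The sketch does not avoid the barrier cost being asked about; it only makes explicit where an ordering would have to be enforced.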
Re: Need help in bug in isolate_migratepages_range
On Mon, 3 Feb 2014, Michal Hocko wrote:

> On Mon 03-02-14 14:29:22, Holger Kiehl wrote:
>> I have attached it. Please, tell me if you do not get the attachment.
>
> I hoped it would help me to get a closer compiled code to yours but I am
> probably using too different gcc.
>
I have an old gcc, it is 4.4.1-2.

> Anyway I've tried to check whether I can hook on something and it seems
> that this is a race with thp merge/split or something like that.
>
> [...]
> > >> Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c
> > >> Jan 31 13:07:43 asterix kernel: IP: [810af0ac] isolate_migratepages_range+0x32d/0x653
> > >> Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
> > >> Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP
> > >> Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode]
> > >> Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1
> > >> Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008
> > >> Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000
> > >> Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac] [810af0ac] isolate_migratepages_range+0x32d/0x653
> > >> Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286
> > >> Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002
> > >> Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c
> > >> Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001
> > >> Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58
> > >> Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80
> > >> Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS:
> > >> Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b
> > >> Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0
> > >> Jan 31 13:07:43 asterix kernel: Stack:
> > >> Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998
> > >> Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000
> > >> Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68
> > >> Jan 31 13:07:43 asterix kernel: Call Trace:
> > >> Jan 31 13:07:43 asterix kernel: [810a161f] ? lru_add_drain_cpu+0x25/0x97
> > >> Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319
> > >> Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c
> > >> Jan 31 13:07:43 asterix kernel: [810afa4d] compact_zone_order+0xad/0xc4
> > >> Jan 31 13:07:43 asterix kernel: [810afaf5] try_to_compact_pages+0x91/0xe8
> > >> Jan 31 13:07:43 asterix kernel: [8109b92d] ? page_alloc_cpu_notify+0x3e/0x3e
> > >> Jan 31 13:07:43 asterix kernel: [8109da34] __alloc_pages_direct_compact+0xae/0x195
> > >> Jan 31 13:07:43 asterix kernel: [8109e45d] __alloc_pages_nodemask+0x772/0x7b5
> > >> Jan 31 13:07:43 asterix kernel: [810c85a3] alloc_pages_vma+0xd6/0x101
> > >> Jan 31 13:07:43 asterix kernel: [810d47e3] do_huge_pmd_anonymous_page+0x199/0x2ee
> > >> Jan 31 13:07:43 asterix kernel: [810b3884] handle_mm_fault+0x1b7/0xceb
> > >> Jan 31 13:07:43 asterix kernel: [8105dedc] ? __dequeue_entity+0x2e/0x33
> > >> Jan 31 13:07:43 asterix kernel: [8102d8c3] __do_page_fault+0x3bd/0x3e4
> > >> Jan 31 13:07:43 asterix kernel: [810bbe1a] ? mprotect_fixup+0x1c9/0x1fb
> > >> Jan 31 13:07:43 asterix kernel: [810aa0f0] ? vm_mmap_pgoff+0x6d/0x8f
> > >> Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d
> > >> Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb
> > >> Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30
> > >> Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
> > >> Jan 31 13:07:43 asterix kernel: RIP [810af0ac] isolate_migratepages_range+0x32d/0x653
> > >> Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928
> > >> Jan 31 13:07:43 asterix kernel: CR2: 001c
> > >> Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---
>
> This seems to match:
>    17027:  49 8b 17       mov    (%r15),%rdx       # page->flags
>    1702a:  4c 89 f8       mov    %r15,%rax
>    1702d:  80 e6 80       and    $0x80,%dh         # PageTail test
>    17030:  74 04          je     17036
>    17032:  49 8b 47 30    mov    0x30(%r15),%rax   # page = page->first_page
>    17036:  8b 40 1c       mov    0x1c(%rax),%eax   <<< page->_count
>    17039:  ff c8          dec    %eax
>
> Which seems to be inlined compound_head. DH is 0x80 so this
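The arithmetic behind the annotated disassembly can be checked with a small stand-alone sketch. This is an illustration under assumptions, not the kernel's struct page: `struct mock_page_layout` and the `mock_*` helpers are hypothetical, and the padding arrays only pin the offsets that the annotations name (flags at 0x00, _count at 0x1c, first_page at 0x30). With a NULL head pointer, the load of _count touches address 0x1c, which is the fault address printed as CR2 in the log, and `and $0x80,%dh` amounts to testing bit 15 of the flags word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the three annotated offsets are meaningful,
 * the pad arrays are fillers and do not name real kernel fields. */
struct mock_page_layout {
	uint64_t flags;              /* 0x00: mov (%r15),%rdx     */
	uint8_t  pad0[0x1c - 0x08];
	int32_t  count;              /* 0x1c: mov 0x1c(%rax),%eax */
	uint8_t  pad1[0x30 - 0x20];
	uint64_t first_page;         /* 0x30: mov 0x30(%r15),%rax */
};

/* Address touched when _count is read through the given head pointer. */
static uint64_t mock_count_address(uint64_t head)
{
	return head + offsetof(struct mock_page_layout, count);
}

/* `and $0x80,%dh` looks at bit 7 of the second byte, i.e. bit 15. */
static int mock_flags_have_tail(uint64_t flags)
{
	return (flags & (UINT64_C(0x80) << 8)) != 0;
}

/* Checks the offsets and the NULL-head case in one place. */
static void mock_layout_demo(void)
{
	assert(offsetof(struct mock_page_layout, count) == 0x1c);
	assert(offsetof(struct mock_page_layout, first_page) == 0x30);
	assert(mock_count_address(0) == 0x1c);  /* NULL head -> fault at 0x1c */
	assert(mock_flags_have_tail(0x8000));
	assert(!mock_flags_have_tail(0x0080));
}
```

Calling `mock_layout_demo()` confirms that a tail page whose first_page is still NULL would fault exactly where the oops reports it, which fits the compound_head() explanation given in the thread without proving it.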
Re: Need help in bug in isolate_migratepages_range
On 02/03/2014 05:20 PM, Michal Hocko wrote:
> On Mon 03-02-14 14:29:22, Holger Kiehl wrote:
>> I have attached it. Please, tell me if you do not get the attachment.
>
> I hoped it would help me to get a closer compiled code to yours but I am
> probably using too different gcc.
>
> Anyway I've tried to check whether I can hook on something and it seems
> that this is a race with thp merge/split or something like that.
>
> [...]
> > >> Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c
> > >> Jan 31 13:07:43 asterix kernel: IP: [810af0ac] isolate_migratepages_range+0x32d/0x653
> > >> Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
> > >> Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP
> > >> Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode]
> > >> Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1
> > >> Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008
> > >> Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000
> > >> Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac] [810af0ac] isolate_migratepages_range+0x32d/0x653
> > >> Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286
> > >> Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002
> > >> Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c
> > >> Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001
> > >> Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58
> > >> Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80
> > >> Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS:
> > >> Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b
> > >> Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0
> > >> Jan 31 13:07:43 asterix kernel: Stack:
> > >> Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998
> > >> Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000
> > >> Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68
> > >> Jan 31 13:07:43 asterix kernel: Call Trace:
> > >> Jan 31 13:07:43 asterix kernel: [810a161f] ? lru_add_drain_cpu+0x25/0x97
> > >> Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319
> > >> Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c
> > >> Jan 31 13:07:43 asterix kernel: [810afa4d] compact_zone_order+0xad/0xc4
> > >> Jan 31 13:07:43 asterix kernel: [810afaf5] try_to_compact_pages+0x91/0xe8
> > >> Jan 31 13:07:43 asterix kernel: [8109b92d] ? page_alloc_cpu_notify+0x3e/0x3e
> > >> Jan 31 13:07:43 asterix kernel: [8109da34] __alloc_pages_direct_compact+0xae/0x195
> > >> Jan 31 13:07:43 asterix kernel: [8109e45d] __alloc_pages_nodemask+0x772/0x7b5
> > >> Jan 31 13:07:43 asterix kernel: [810c85a3] alloc_pages_vma+0xd6/0x101
> > >> Jan 31 13:07:43 asterix kernel: [810d47e3] do_huge_pmd_anonymous_page+0x199/0x2ee
> > >> Jan 31 13:07:43 asterix kernel: [810b3884] handle_mm_fault+0x1b7/0xceb
> > >> Jan 31 13:07:43 asterix kernel: [8105dedc] ? __dequeue_entity+0x2e/0x33
> > >> Jan 31 13:07:43 asterix kernel: [8102d8c3] __do_page_fault+0x3bd/0x3e4
> > >> Jan 31 13:07:43 asterix kernel: [810bbe1a] ? mprotect_fixup+0x1c9/0x1fb
> > >> Jan 31 13:07:43 asterix kernel: [810aa0f0] ? vm_mmap_pgoff+0x6d/0x8f
> > >> Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d
> > >> Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb
> > >> Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30
> > >> Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
> > >> Jan 31 13:07:43 asterix kernel: RIP [810af0ac] isolate_migratepages_range+0x32d/0x653
> > >> Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928
> > >> Jan 31 13:07:43 asterix kernel: CR2: 001c
> > >> Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---
>
> This seems to match:
>    17027:  49 8b 17       mov    (%r15),%rdx       # page->flags
>    1702a:  4c 89 f8       mov    %r15,%rax
>    1702d:  80 e6 80       and    $0x80,%dh         # PageTail test
>    17030:  74 04          je     17036
>    17032:  49 8b 47 30    mov    0x30(%r15),%rax   # page = page->first_page
>    17036:  8b 40 1c       mov    0x1c(%rax),%eax   <<< page->_count
>    17039:  ff c8          dec    %eax
>
> Which seems to be inlined
Re: Need help in bug in isolate_migratepages_range
On Mon 03-02-14 14:29:22, Holger Kiehl wrote:
> I have attached it. Please, tell me if you do not get the attachment.

I hoped it would help me to get a closer compiled code to yours but I am
probably using too different gcc.

Anyway I've tried to check whether I can hook on something and it seems
that this is a race with thp merge/split or something like that.

[...]
> >> Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c
> >> Jan 31 13:07:43 asterix kernel: IP: [810af0ac] isolate_migratepages_range+0x32d/0x653
> >> Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
> >> Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP
> >> Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode]
> >> Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1
> >> Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008
> >> Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000
> >> Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac] [810af0ac] isolate_migratepages_range+0x32d/0x653
> >> Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286
> >> Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002
> >> Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c
> >> Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001
> >> Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58
> >> Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80
> >> Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS:
> >> Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b
> >> Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0
> >> Jan 31 13:07:43 asterix kernel: Stack:
> >> Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998
> >> Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000
> >> Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68
> >> Jan 31 13:07:43 asterix kernel: Call Trace:
> >> Jan 31 13:07:43 asterix kernel: [810a161f] ? lru_add_drain_cpu+0x25/0x97
> >> Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319
> >> Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c
> >> Jan 31 13:07:43 asterix kernel: [810afa4d] compact_zone_order+0xad/0xc4
> >> Jan 31 13:07:43 asterix kernel: [810afaf5] try_to_compact_pages+0x91/0xe8
> >> Jan 31 13:07:43 asterix kernel: [8109b92d] ? page_alloc_cpu_notify+0x3e/0x3e
> >> Jan 31 13:07:43 asterix kernel: [8109da34] __alloc_pages_direct_compact+0xae/0x195
> >> Jan 31 13:07:43 asterix kernel: [8109e45d] __alloc_pages_nodemask+0x772/0x7b5
> >> Jan 31 13:07:43 asterix kernel: [810c85a3] alloc_pages_vma+0xd6/0x101
> >> Jan 31 13:07:43 asterix kernel: [810d47e3] do_huge_pmd_anonymous_page+0x199/0x2ee
> >> Jan 31 13:07:43 asterix kernel: [810b3884] handle_mm_fault+0x1b7/0xceb
> >> Jan 31 13:07:43 asterix kernel: [8105dedc] ? __dequeue_entity+0x2e/0x33
> >> Jan 31 13:07:43 asterix kernel: [8102d8c3] __do_page_fault+0x3bd/0x3e4
> >> Jan 31 13:07:43 asterix kernel: [810bbe1a] ? mprotect_fixup+0x1c9/0x1fb
> >> Jan 31 13:07:43 asterix kernel: [810aa0f0] ? vm_mmap_pgoff+0x6d/0x8f
> >> Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d
> >> Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb
> >> Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30
> >> Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
> >> Jan 31 13:07:43 asterix kernel: RIP [810af0ac] isolate_migratepages_range+0x32d/0x653
> >> Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928
> >> Jan 31 13:07:43 asterix kernel: CR2: 001c
> >> Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---

This seems to match:
   17027:  49 8b 17       mov    (%r15),%rdx       # page->flags
   1702a:  4c 89 f8       mov    %r15,%rax
   1702d:  80 e6 80
Re: Need help in bug in isolate_migratepages_range
[CCing linux-mm]

Does this ring any bells? I haven't checked very deeply but it doesn't
seem to have been fixed since 3.12.

Holger, could you post your config, please?

On Fri 31-01-14 21:12:27, Holger Kiehl wrote:
> Hello,
>
> today one of our systems got a kernel bug message. It kept on running
> but more and more processes began to get stuck in D state (e.g. a simple
> w command would never return) and I eventually had to reboot. Here is
> the full message:
>
> Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c
> Jan 31 13:07:43 asterix kernel: IP: [810af0ac] isolate_migratepages_range+0x32d/0x653
> Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
> Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP
> Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode]
> Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1
> Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008
> Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000
> Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac] [810af0ac] isolate_migratepages_range+0x32d/0x653
> Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286
> Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002
> Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c
> Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001
> Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58
> Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80
> Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS:
> Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b
> Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0
> Jan 31 13:07:43 asterix kernel: Stack:
> Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998
> Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000
> Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68
> Jan 31 13:07:43 asterix kernel: Call Trace:
> Jan 31 13:07:43 asterix kernel: [810a161f] ? lru_add_drain_cpu+0x25/0x97
> Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319
> Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c
> Jan 31 13:07:43 asterix kernel: [810afa4d] compact_zone_order+0xad/0xc4
> Jan 31 13:07:43 asterix kernel: [810afaf5] try_to_compact_pages+0x91/0xe8
> Jan 31 13:07:43 asterix kernel: [8109b92d] ? page_alloc_cpu_notify+0x3e/0x3e
> Jan 31 13:07:43 asterix kernel: [8109da34] __alloc_pages_direct_compact+0xae/0x195
> Jan 31 13:07:43 asterix kernel: [8109e45d] __alloc_pages_nodemask+0x772/0x7b5
> Jan 31 13:07:43 asterix kernel: [810c85a3] alloc_pages_vma+0xd6/0x101
> Jan 31 13:07:43 asterix kernel: [810d47e3] do_huge_pmd_anonymous_page+0x199/0x2ee
> Jan 31 13:07:43 asterix kernel: [810b3884] handle_mm_fault+0x1b7/0xceb
> Jan 31 13:07:43 asterix kernel: [8105dedc] ? __dequeue_entity+0x2e/0x33
> Jan 31 13:07:43 asterix kernel: [8102d8c3] __do_page_fault+0x3bd/0x3e4
> Jan 31 13:07:43 asterix kernel: [810bbe1a] ? mprotect_fixup+0x1c9/0x1fb
> Jan 31 13:07:43 asterix kernel: [810aa0f0] ? vm_mmap_pgoff+0x6d/0x8f
> Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d
> Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb
> Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30
> Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
> Jan 31 13:07:43 asterix kernel: RIP [810af0ac] isolate_migratepages_range+0x32d/0x653
> Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928
> Jan 31 13:07:43 asterix kernel: CR2: 001c
> Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---
>
> Kernel is a plain kernel.org kernel 3.12.9 and it uses drbd to replicate
> data to another host. Any idea what the cause of this bug is? Could it be
> hardware? The system has been running now for five years without any
> problems.
>
> Please CC me since I am not on the list.
>
> Many thanks in advance.
>
> Regards,
> Holger
Re: Need help in bug in isolate_migratepages_range
[CCing linux-mm] Does this ring bells? I haven't checked very deeply but it doesn't seem to be fixed since 3.12. Hoolger, could you post your config, please? On Fri 31-01-14 21:12:27, Holger Kiehl wrote: Hello, today one of our system got a kernel bug message. It kept on running but more and more process begin to be stuck in D state (eg. a simple w command would never return) and I eventually had to reboot. Here the full message: Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c Jan 31 13:07:43 asterix kernel: IP: [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0 Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode] Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1 Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 
1.04.2519 07/30/2008 Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000 Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac] [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286 Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002 Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001 Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58 Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80 Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS: Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0 Jan 31 13:07:43 asterix kernel: Stack: Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998 Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000 Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68 Jan 31 13:07:43 asterix kernel: Call Trace: Jan 31 13:07:43 asterix kernel: [810a161f] ? lru_add_drain_cpu+0x25/0x97 Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319 Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c Jan 31 13:07:43 asterix kernel: [810afa4d] compact_zone_order+0xad/0xc4 Jan 31 13:07:43 asterix kernel: [810afaf5] try_to_compact_pages+0x91/0xe8 Jan 31 13:07:43 asterix kernel: [8109b92d] ? page_alloc_cpu_notify+0x3e/0x3e Jan 31 13:07:43 asterix kernel: [8109da34] __alloc_pages_direct_compact+0xae/0x195 Jan 31 13:07:43 asterix kernel: [8109e45d] __alloc_pages_nodemask+0x772/0x7b5 Jan 31 13:07:43 asterix kernel: [810c85a3] alloc_pages_vma+0xd6/0x101 Jan 31 13:07:43 asterix kernel: [810d47e3] do_huge_pmd_anonymous_page+0x199/0x2ee Jan 31 13:07:43 asterix kernel: [810b3884] handle_mm_fault+0x1b7/0xceb Jan 31 13:07:43 asterix kernel: [8105dedc] ? 
__dequeue_entity+0x2e/0x33 Jan 31 13:07:43 asterix kernel: [8102d8c3] __do_page_fault+0x3bd/0x3e4 Jan 31 13:07:43 asterix kernel: [810bbe1a] ? mprotect_fixup+0x1c9/0x1fb Jan 31 13:07:43 asterix kernel: [810aa0f0] ? vm_mmap_pgoff+0x6d/0x8f Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30 Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 8b 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2 Jan 31 13:07:43 asterix kernel: RIP [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928 Jan 31 13:07:43 asterix kernel: CR2: 001c Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]--- Kernel is a plain kernel.org kernel 3.12.9 and it uses drbd to
Re: Need help in bug in isolate_migratepages_range
On Mon 03-02-14 14:29:22, Holger Kiehl wrote: I have attached it. Please, tell me if you do not get the attachment. I hoped it would help me to get a closer compiled code to yours but I am probably using too different gcc. Anyway I've tried to check whether I can hook on something and it seems that this is a race with thp merge/split or something like that. [...] Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c Jan 31 13:07:43 asterix kernel: IP: [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0 Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode] Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1 Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 
1.04.2519 07/30/2008 Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000 Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac] [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286 Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002 Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001 Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58 Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80 Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS: Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0 Jan 31 13:07:43 asterix kernel: Stack: Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998 Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000 Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68 Jan 31 13:07:43 asterix kernel: Call Trace: Jan 31 13:07:43 asterix kernel: [810a161f] ? lru_add_drain_cpu+0x25/0x97 Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319 Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c Jan 31 13:07:43 asterix kernel: [810afa4d] compact_zone_order+0xad/0xc4 Jan 31 13:07:43 asterix kernel: [810afaf5] try_to_compact_pages+0x91/0xe8 Jan 31 13:07:43 asterix kernel: [8109b92d] ? page_alloc_cpu_notify+0x3e/0x3e Jan 31 13:07:43 asterix kernel: [8109da34] __alloc_pages_direct_compact+0xae/0x195 Jan 31 13:07:43 asterix kernel: [8109e45d] __alloc_pages_nodemask+0x772/0x7b5 Jan 31 13:07:43 asterix kernel: [810c85a3] alloc_pages_vma+0xd6/0x101 Jan 31 13:07:43 asterix kernel: [810d47e3] do_huge_pmd_anonymous_page+0x199/0x2ee Jan 31 13:07:43 asterix kernel: [810b3884] handle_mm_fault+0x1b7/0xceb Jan 31 13:07:43 asterix kernel: [8105dedc] ? 
__dequeue_entity+0x2e/0x33 Jan 31 13:07:43 asterix kernel: [8102d8c3] __do_page_fault+0x3bd/0x3e4 Jan 31 13:07:43 asterix kernel: [810bbe1a] ? mprotect_fixup+0x1c9/0x1fb Jan 31 13:07:43 asterix kernel: [810aa0f0] ? vm_mmap_pgoff+0x6d/0x8f Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30 Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 8b 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2 Jan 31 13:07:43 asterix kernel: RIP [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928 Jan 31 13:07:43 asterix kernel: CR2: 001c Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]--- This seems to match: 17027: 49 8b 17mov(%r15),%rdx #
Re: Need help in bug in isolate_migratepages_range
On 02/03/2014 05:20 PM, Michal Hocko wrote: On Mon 03-02-14 14:29:22, Holger Kiehl wrote: I have attached it. Please, tell me if you do not get the attachment. I hoped it would help me to get a closer compiled code to yours but I am probably using too different gcc. Anyway I've tried to check whether I can hook on something and it seems that this is a race with thp merge/split or something like that. [...] Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c Jan 31 13:07:43 asterix kernel: IP: [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0 Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode] Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1 Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 
1.04.2519 07/30/2008 Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000 Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac] [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286 Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002 Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001 Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58 Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80 Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS: Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0 Jan 31 13:07:43 asterix kernel: Stack: Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998 Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000 Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68 Jan 31 13:07:43 asterix kernel: Call Trace: Jan 31 13:07:43 asterix kernel: [810a161f] ? lru_add_drain_cpu+0x25/0x97 Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319 Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c Jan 31 13:07:43 asterix kernel: [810afa4d] compact_zone_order+0xad/0xc4 Jan 31 13:07:43 asterix kernel: [810afaf5] try_to_compact_pages+0x91/0xe8 Jan 31 13:07:43 asterix kernel: [8109b92d] ? page_alloc_cpu_notify+0x3e/0x3e Jan 31 13:07:43 asterix kernel: [8109da34] __alloc_pages_direct_compact+0xae/0x195 Jan 31 13:07:43 asterix kernel: [8109e45d] __alloc_pages_nodemask+0x772/0x7b5 Jan 31 13:07:43 asterix kernel: [810c85a3] alloc_pages_vma+0xd6/0x101 Jan 31 13:07:43 asterix kernel: [810d47e3] do_huge_pmd_anonymous_page+0x199/0x2ee Jan 31 13:07:43 asterix kernel: [810b3884] handle_mm_fault+0x1b7/0xceb Jan 31 13:07:43 asterix kernel: [8105dedc] ? 
__dequeue_entity+0x2e/0x33 Jan 31 13:07:43 asterix kernel: [8102d8c3] __do_page_fault+0x3bd/0x3e4 Jan 31 13:07:43 asterix kernel: [810bbe1a] ? mprotect_fixup+0x1c9/0x1fb Jan 31 13:07:43 asterix kernel: [810aa0f0] ? vm_mmap_pgoff+0x6d/0x8f Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30 Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 8b 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2 Jan 31 13:07:43 asterix kernel: RIP [810af0ac] isolate_migratepages_range+0x32d/0x653 Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928 Jan 31 13:07:43 asterix kernel: CR2: 001c Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]--- This seems to match: 17027: 49 8b 17mov(%r15),%rdx # page-flags 1702a: 4c 89 f8mov%r15,%rax 1702d: 80 e6 80
Need help in bug in isolate_migratepages_range

Hello,

today one of our systems got a kernel bug message. It kept on running, but more and more processes began to get stuck in D state (e.g. a simple w command would never return) and I eventually had to reboot. Here is the full message:

Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 001c
Jan 31 13:07:43 asterix kernel: IP: [810af0ac] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
Jan 31 13:07:43 asterix kernel: Oops: [#1] SMP
Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode]
Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1
Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008
Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 task.ti: 8807d30b2000
Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac] [810af0ac] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928 EFLAGS: 00010286
Jan 31 13:07:43 asterix kernel: RAX: RBX: 0020ec09 RCX: 0002
Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 RDI: 006c
Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 R09: 0001
Jan 31 13:07:43 asterix kernel: R10: R11: ea000733a000 R12: 8807d30b3a58
Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: R15: 88083ffe1d80
Jan 31 13:07:43 asterix kernel: FS: 7f9d9e72f910() GS:88083fd4() knlGS:
Jan 31 13:07:43 asterix kernel: CS: 0010 DS: ES: CR0: 8005003b
Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 CR4: 000407e0
Jan 31 13:07:43 asterix kernel: Stack:
Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 ea2e6af0 8807d30b3998
Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 8807d30b08c0 0020f000
Jan 31 13:07:43 asterix kernel: 083b 000a 8807d30b3a68
Jan 31 13:07:43 asterix kernel: Call Trace:
Jan 31 13:07:43 asterix kernel: [810a161f] ? lru_add_drain_cpu+0x25/0x97
Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319
Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c
Jan 31 13:07:43 asterix kernel: [810afa4d] compact_zone_order+0xad/0xc4
Jan 31 13:07:43 asterix kernel: [810afaf5] try_to_compact_pages+0x91/0xe8
Jan 31 13:07:43 asterix kernel: [8109b92d] ? page_alloc_cpu_notify+0x3e/0x3e
Jan 31 13:07:43 asterix kernel: [8109da34] __alloc_pages_direct_compact+0xae/0x195
Jan 31 13:07:43 asterix kernel: [8109e45d] __alloc_pages_nodemask+0x772/0x7b5
Jan 31 13:07:43 asterix kernel: [810c85a3] alloc_pages_vma+0xd6/0x101
Jan 31 13:07:43 asterix kernel: [810d47e3] do_huge_pmd_anonymous_page+0x199/0x2ee
Jan 31 13:07:43 asterix kernel: [810b3884] handle_mm_fault+0x1b7/0xceb
Jan 31 13:07:43 asterix kernel: [8105dedc] ? __dequeue_entity+0x2e/0x33
Jan 31 13:07:43 asterix kernel: [8102d8c3] __do_page_fault+0x3bd/0x3e4
Jan 31 13:07:43 asterix kernel: [810bbe1a] ? mprotect_fixup+0x1c9/0x1fb
Jan 31 13:07:43 asterix kernel: [810aa0f0] ? vm_mmap_pgoff+0x6d/0x8f
Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d
Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb
Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30
Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
Jan 31 13:07:43 asterix kernel: RIP [810af0ac] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928
Jan 31 13:07:43 asterix kernel: CR2: 001c
Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---

The kernel is a plain kernel.org kernel 3.12.9 and it uses drbd to replicate data to another host.

Any idea what the cause of this bug is? Could it be hardware? The system has been running for five years now without any problems.

Please CC me since I am not on the list.

Many thanks in advance.

Regards,
Holger
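For readers decoding the oops above: the Code: bytes just before the fault, 49 8b 45 30, load a pointer from offset 0x30 of the page in R13, and the marked instruction <8b> 40 1c then reads a 32-bit value at offset 0x1c from that pointer. A base of zero would give exactly the reported CR2 of 001c, which fits the compound_head() == NULL theory raised in the replies. The following is only a minimal sketch of that arithmetic, not kernel code: struct fake_page, FAKE_PG_TAIL and both helper functions are hypothetical stand-ins, and the field offsets and the tail bit (bit 15) are assumptions inferred from the Code: bytes rather than taken from the 3.12.9 sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct page. The offsets are
 * assumptions inferred from the Code: bytes of the oops. */
struct fake_page {
    uint64_t flags;               /* 0x00 */
    uint64_t mapping;             /* 0x08 */
    uint64_t index;               /* 0x10 */
    int32_t  mapcount;            /* 0x18 */
    int32_t  count;               /* 0x1c: field read by the faulting load */
    uint64_t lru[2];              /* 0x20 */
    struct fake_page *first_page; /* 0x30: head pointer of a tail page */
};

/* Compile-time check of the two offsets this sketch relies on. */
_Static_assert(offsetof(struct fake_page, count) == 0x1c, "count offset");
_Static_assert(offsetof(struct fake_page, first_page) == 0x30, "first_page offset");

/* Assumed tail flag: the Code: bytes test the sign bit of the low 16 bits. */
#define FAKE_PG_TAIL ((uint64_t)1 << 15)

/* Same shape as the compound-head lookup discussed in the thread:
 * a tail page is redirected to its head through first_page. */
static struct fake_page *fake_compound_head(struct fake_page *page)
{
    if (page->flags & FAKE_PG_TAIL)
        return page->first_page;
    return page;
}

/* Address that reading head->count would touch. It is only computed,
 * never dereferenced, so a NULL head is harmless here. */
static uintptr_t fake_count_address(struct fake_page *page)
{
    assert(page != NULL);
    struct fake_page *head = fake_compound_head(page);
    return (uintptr_t)head + offsetof(struct fake_page, count);
}
```

With a page that has the tail bit set and first_page still NULL, fake_count_address() yields 0x1c, the same value as the faulting address in the report. This shows only that the numbers are consistent with that explanation, not that it is the actual cause.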