Re: Need help in bug in isolate_migratepages_range

2014-02-03 Thread Holger Kiehl

On Mon, 3 Feb 2014, David Rientjes wrote:


On Mon, 3 Feb 2014, Vlastimil Babka wrote:


It seems to come from balloon_page_movable() and its test page_count(page) ==
1.



Hmm, I think it might be because compound_head() == NULL here.  Holger,
this looks like a race condition when allocating a compound page, did you
only see it once or is it actually reproducible?


No, this only happened once. It is not reproducible; the system was running
for four days without problems. And before this kernel, five years without
any problems.

Thanks,
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Need help in bug in isolate_migratepages_range

2014-02-03 Thread David Rientjes
On Mon, 3 Feb 2014, Vlastimil Babka wrote:

> It seems to come from balloon_page_movable() and its test page_count(page) ==
> 1.
> 

Hmm, I think it might be because compound_head() == NULL here.  Holger, 
this looks like a race condition when allocating a compound page, did you 
only see it once or is it actually reproducible?

I think this happens when a new compound page is allocated and PageBuddy 
is cleared before prep_compound_page() and then we see PageTail(p) set but 
p->first_page is not yet initialized.  Is there any way to avoid memory 
barriers in compound_page()?


Re: Need help in bug in isolate_migratepages_range

2014-02-03 Thread Holger Kiehl

On Mon, 3 Feb 2014, Michal Hocko wrote:


On Mon 03-02-14 14:29:22, Holger Kiehl wrote:

I have attached it. Please, tell me if you do not get the attachment.


I hoped it would help me to get a closer compiled code to yours but I am
probably using too different gcc.


I have an old gcc, it is 4.4.1-2.


Anyway I've tried to check whether I can hook on something and it seems
that this is a race with thp merge/split or something like that.

[...]

  Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer 
dereference at 001c
  Jan 31 13:07:43 asterix kernel: IP: [] 
isolate_migratepages_range+0x32d/0x653
  Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
  Jan 31 13:07:43 asterix kernel: Oops:  [#1] SMP
  Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp 
ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci 
i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore 
usb_common [last unloaded: microcode]
  Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 
3.12.9 #1
  Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 
S4 /D2519, BIOS 4.06  Rev. 1.04.2519 07/30/2008
  Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 
task.ti: 8807d30b2000
  Jan 31 13:07:43 asterix kernel: RIP: 0010:[]  
[] isolate_migratepages_range+0x32d/0x653
  Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928  EFLAGS: 00010286
  Jan 31 13:07:43 asterix kernel: RAX:  RBX: 0020ec09 
RCX: 0002
  Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 
RDI: 006c
  Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 
R09: 0001
  Jan 31 13:07:43 asterix kernel: R10:  R11: ea000733a000 
R12: 8807d30b3a58
  Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14:  
R15: 88083ffe1d80
  Jan 31 13:07:43 asterix kernel: FS:  7f9d9e72f910() 
GS:88083fd4() knlGS:
  Jan 31 13:07:43 asterix kernel: CS:  0010 DS:  ES:  CR0: 
8005003b
  Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 
CR4: 000407e0
  Jan 31 13:07:43 asterix kernel: Stack:
  Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 
ea2e6af0 8807d30b3998
  Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 
8807d30b08c0 0020f000
  Jan 31 13:07:43 asterix kernel:  083b 
000a 8807d30b3a68
  Jan 31 13:07:43 asterix kernel: Call Trace:
  Jan 31 13:07:43 asterix kernel: [] ? 
lru_add_drain_cpu+0x25/0x97
  Jan 31 13:07:43 asterix kernel: [] compact_zone+0x2b5/0x319
  Jan 31 13:07:43 asterix kernel: [] ? put_super+0x20/0x2c
  Jan 31 13:07:43 asterix kernel: [] 
compact_zone_order+0xad/0xc4
  Jan 31 13:07:43 asterix kernel: [] 
try_to_compact_pages+0x91/0xe8
  Jan 31 13:07:43 asterix kernel: [] ? 
page_alloc_cpu_notify+0x3e/0x3e
  Jan 31 13:07:43 asterix kernel: [] 
__alloc_pages_direct_compact+0xae/0x195
  Jan 31 13:07:43 asterix kernel: [] 
__alloc_pages_nodemask+0x772/0x7b5
  Jan 31 13:07:43 asterix kernel: [] 
alloc_pages_vma+0xd6/0x101
  Jan 31 13:07:43 asterix kernel: [] 
do_huge_pmd_anonymous_page+0x199/0x2ee
  Jan 31 13:07:43 asterix kernel: [] 
handle_mm_fault+0x1b7/0xceb
  Jan 31 13:07:43 asterix kernel: [] ? 
__dequeue_entity+0x2e/0x33
  Jan 31 13:07:43 asterix kernel: [] 
__do_page_fault+0x3bd/0x3e4
  Jan 31 13:07:43 asterix kernel: [] ? 
mprotect_fixup+0x1c9/0x1fb
  Jan 31 13:07:43 asterix kernel: [] ? vm_mmap_pgoff+0x6d/0x8f
  Jan 31 13:07:43 asterix kernel: [] ? SyS_futex+0x103/0x13d
  Jan 31 13:07:43 asterix kernel: [] do_page_fault+0x9/0xb
  Jan 31 13:07:43 asterix kernel: [] page_fault+0x22/0x30
  Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 
41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 
<8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
  Jan 31 13:07:43 asterix kernel: RIP  [] 
isolate_migratepages_range+0x32d/0x653
  Jan 31 13:07:43 asterix kernel: RSP 
  Jan 31 13:07:43 asterix kernel: CR2: 001c
  Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---


This seems to match:
  17027:   49 8b 17       mov    (%r15),%rdx        # page->flags
  1702a:   4c 89 f8       mov    %r15,%rax
  1702d:   80 e6 80       and    $0x80,%dh          # PageTail test
  17030:   74 04          je     17036
  17032:   49 8b 47 30    mov    0x30(%r15),%rax    # page = page->first_page
  17036:   8b 40 1c       mov    0x1c(%rax),%eax    <<< page->_count
  17039:   ff c8          dec    %eax

Which seems to be inlined compound_head. DH is 0x80 so this 

Re: Need help in bug in isolate_migratepages_range

2014-02-03 Thread Vlastimil Babka

On 02/03/2014 05:20 PM, Michal Hocko wrote:

On Mon 03-02-14 14:29:22, Holger Kiehl wrote:

I have attached it. Please, tell me if you do not get the attachment.


I hoped it would help me to get a closer compiled code to yours but I am
probably using too different gcc.
Anyway I've tried to check whether I can hook on something and it seems
that this is a race with thp merge/split or something like that.

[...]

   Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer 
dereference at 001c
   Jan 31 13:07:43 asterix kernel: IP: [] 
isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
   Jan 31 13:07:43 asterix kernel: Oops:  [#1] SMP
   Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp 
ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci 
i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore 
usb_common [last unloaded: microcode]
   Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 
3.12.9 #1
   Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY 
RX300 S4 /D2519, BIOS 4.06  Rev. 1.04.2519 07/30/2008
   Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 
task.ti: 8807d30b2000
   Jan 31 13:07:43 asterix kernel: RIP: 0010:[]  
[] isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928  EFLAGS: 00010286
   Jan 31 13:07:43 asterix kernel: RAX:  RBX: 0020ec09 
RCX: 0002
   Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 
RDI: 006c
   Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 
R09: 0001
   Jan 31 13:07:43 asterix kernel: R10:  R11: ea000733a000 
R12: 8807d30b3a58
   Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14:  
R15: 88083ffe1d80
   Jan 31 13:07:43 asterix kernel: FS:  7f9d9e72f910() 
GS:88083fd4() knlGS:
   Jan 31 13:07:43 asterix kernel: CS:  0010 DS:  ES:  CR0: 
8005003b
   Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 
CR4: 000407e0
   Jan 31 13:07:43 asterix kernel: Stack:
   Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 
ea2e6af0 8807d30b3998
   Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 
8807d30b08c0 0020f000
   Jan 31 13:07:43 asterix kernel:  083b 
000a 8807d30b3a68
   Jan 31 13:07:43 asterix kernel: Call Trace:
   Jan 31 13:07:43 asterix kernel: [] ? 
lru_add_drain_cpu+0x25/0x97
   Jan 31 13:07:43 asterix kernel: [] compact_zone+0x2b5/0x319
   Jan 31 13:07:43 asterix kernel: [] ? put_super+0x20/0x2c
   Jan 31 13:07:43 asterix kernel: [] 
compact_zone_order+0xad/0xc4
   Jan 31 13:07:43 asterix kernel: [] 
try_to_compact_pages+0x91/0xe8
   Jan 31 13:07:43 asterix kernel: [] ? 
page_alloc_cpu_notify+0x3e/0x3e
   Jan 31 13:07:43 asterix kernel: [] 
__alloc_pages_direct_compact+0xae/0x195
   Jan 31 13:07:43 asterix kernel: [] 
__alloc_pages_nodemask+0x772/0x7b5
   Jan 31 13:07:43 asterix kernel: [] 
alloc_pages_vma+0xd6/0x101
   Jan 31 13:07:43 asterix kernel: [] 
do_huge_pmd_anonymous_page+0x199/0x2ee
   Jan 31 13:07:43 asterix kernel: [] 
handle_mm_fault+0x1b7/0xceb
   Jan 31 13:07:43 asterix kernel: [] ? 
__dequeue_entity+0x2e/0x33
   Jan 31 13:07:43 asterix kernel: [] 
__do_page_fault+0x3bd/0x3e4
   Jan 31 13:07:43 asterix kernel: [] ? 
mprotect_fixup+0x1c9/0x1fb
   Jan 31 13:07:43 asterix kernel: [] ? 
vm_mmap_pgoff+0x6d/0x8f
   Jan 31 13:07:43 asterix kernel: [] ? SyS_futex+0x103/0x13d
   Jan 31 13:07:43 asterix kernel: [] do_page_fault+0x9/0xb
   Jan 31 13:07:43 asterix kernel: [] page_fault+0x22/0x30
   Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 
41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 
<8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
   Jan 31 13:07:43 asterix kernel: RIP  [] 
isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: RSP 
   Jan 31 13:07:43 asterix kernel: CR2: 001c
   Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---


This seems to match:
17027:   49 8b 17       mov    (%r15),%rdx        # page->flags
1702a:   4c 89 f8       mov    %r15,%rax
1702d:   80 e6 80       and    $0x80,%dh          # PageTail test
17030:   74 04          je     17036
17032:   49 8b 47 30    mov    0x30(%r15),%rax    # page = page->first_page
17036:   8b 40 1c       mov    0x1c(%rax),%eax    <<< page->_count
17039:   ff c8          dec    %eax

Which seems to be inlined 

Re: Need help in bug in isolate_migratepages_range

2014-02-03 Thread Michal Hocko
On Mon 03-02-14 14:29:22, Holger Kiehl wrote:
> I have attached it. Please, tell me if you do not get the attachment.

I hoped it would help me to get a closer compiled code to yours but I am
probably using too different gcc.
Anyway I've tried to check whether I can hook on something and it seems
that this is a race with thp merge/split or something like that.
 
[...]
> >>   Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL 
> >> pointer dereference at 001c
> >>   Jan 31 13:07:43 asterix kernel: IP: [] 
> >> isolate_migratepages_range+0x32d/0x653
> >>   Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
> >>   Jan 31 13:07:43 asterix kernel: Oops:  [#1] SMP
> >>   Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache 
> >> coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 
> >> sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si 
> >> ipmi_msghandler usbcore usb_common [last unloaded: microcode]
> >>   Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 
> >> 3.12.9 #1
> >>   Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY 
> >> RX300 S4 /D2519, BIOS 4.06  Rev. 1.04.2519 
> >> 07/30/2008
> >>   Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 
> >> 8807d30b2000 task.ti: 8807d30b2000
> >>   Jan 31 13:07:43 asterix kernel: RIP: 0010:[]  
> >> [] isolate_migratepages_range+0x32d/0x653
> >>   Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928  EFLAGS: 
> >> 00010286
> >>   Jan 31 13:07:43 asterix kernel: RAX:  RBX: 
> >> 0020ec09 RCX: 0002
> >>   Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 
> >> 0004 RDI: 006c
> >>   Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 
> >> 88083fbde390 R09: 0001
> >>   Jan 31 13:07:43 asterix kernel: R10:  R11: 
> >> ea000733a000 R12: 8807d30b3a58
> >>   Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: 
> >>  R15: 88083ffe1d80
> >>   Jan 31 13:07:43 asterix kernel: FS:  7f9d9e72f910() 
> >> GS:88083fd4() knlGS:
> >>   Jan 31 13:07:43 asterix kernel: CS:  0010 DS:  ES:  CR0: 
> >> 8005003b
> >>   Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 
> >> 0007d307 CR4: 000407e0
> >>   Jan 31 13:07:43 asterix kernel: Stack:
> >>   Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 
> >> ea2e6af0 8807d30b3998
> >>   Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 
> >> 8807d30b08c0 0020f000
> >>   Jan 31 13:07:43 asterix kernel:  083b 
> >> 000a 8807d30b3a68
> >>   Jan 31 13:07:43 asterix kernel: Call Trace:
> >>   Jan 31 13:07:43 asterix kernel: [] ? 
> >> lru_add_drain_cpu+0x25/0x97
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> compact_zone+0x2b5/0x319
> >>   Jan 31 13:07:43 asterix kernel: [] ? 
> >> put_super+0x20/0x2c
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> compact_zone_order+0xad/0xc4
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> try_to_compact_pages+0x91/0xe8
> >>   Jan 31 13:07:43 asterix kernel: [] ? 
> >> page_alloc_cpu_notify+0x3e/0x3e
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> __alloc_pages_direct_compact+0xae/0x195
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> __alloc_pages_nodemask+0x772/0x7b5
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> alloc_pages_vma+0xd6/0x101
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> do_huge_pmd_anonymous_page+0x199/0x2ee
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> handle_mm_fault+0x1b7/0xceb
> >>   Jan 31 13:07:43 asterix kernel: [] ? 
> >> __dequeue_entity+0x2e/0x33
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> __do_page_fault+0x3bd/0x3e4
> >>   Jan 31 13:07:43 asterix kernel: [] ? 
> >> mprotect_fixup+0x1c9/0x1fb
> >>   Jan 31 13:07:43 asterix kernel: [] ? 
> >> vm_mmap_pgoff+0x6d/0x8f
> >>   Jan 31 13:07:43 asterix kernel: [] ? 
> >> SyS_futex+0x103/0x13d
> >>   Jan 31 13:07:43 asterix kernel: [] 
> >> do_page_fault+0x9/0xb
> >>   Jan 31 13:07:43 asterix kernel: [] page_fault+0x22/0x30
> >>   Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 
> >> 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 
> >> d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 
> >> c0 48 85 d2
> >>   Jan 31 13:07:43 asterix kernel: RIP  [] 
> >> isolate_migratepages_range+0x32d/0x653
> >>   Jan 31 13:07:43 asterix kernel: RSP 
> >>   Jan 31 13:07:43 asterix kernel: CR2: 001c
> >>   Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---

This seems to match:
   17027:   49 8b 17       mov    (%r15),%rdx        # page->flags
   1702a:   4c 89 f8       mov    %r15,%rax
   1702d:   80 e6 80

Re: Need help in bug in isolate_migratepages_range

2014-02-03 Thread Michal Hocko
[CCing linux-mm]

Does this ring any bells? I haven't checked very deeply, but it doesn't seem
to have been fixed since 3.12.

Holger, could you post your config, please?

On Fri 31-01-14 21:12:27, Holger Kiehl wrote:
> Hello,
> 
> today one of our systems got a kernel bug message. It kept on running,
> but more and more processes began to get stuck in D state (e.g. a simple w
> command would never return) and I eventually had to reboot. Here is the
> full message:
> 
>Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer 
> dereference at 001c
>Jan 31 13:07:43 asterix kernel: IP: [] 
> isolate_migratepages_range+0x32d/0x653
>Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
>Jan 31 13:07:43 asterix kernel: Oops:  [#1] SMP
>Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp 
> ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci 
> i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore 
> usb_common [last unloaded: microcode]
>Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 
> 3.12.9 #1
>Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY 
> RX300 S4 /D2519, BIOS 4.06  Rev. 1.04.2519 07/30/2008
>Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 
> 8807d30b2000 task.ti: 8807d30b2000
>Jan 31 13:07:43 asterix kernel: RIP: 0010:[]  
> [] isolate_migratepages_range+0x32d/0x653
>Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928  EFLAGS: 
> 00010286
>Jan 31 13:07:43 asterix kernel: RAX:  RBX: 
> 0020ec09 RCX: 0002
>Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 
> 0004 RDI: 006c
>Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 
> 88083fbde390 R09: 0001
>Jan 31 13:07:43 asterix kernel: R10:  R11: 
> ea000733a000 R12: 8807d30b3a58
>Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14: 
>  R15: 88083ffe1d80
>Jan 31 13:07:43 asterix kernel: FS:  7f9d9e72f910() 
> GS:88083fd4() knlGS:
>Jan 31 13:07:43 asterix kernel: CS:  0010 DS:  ES:  CR0: 
> 8005003b
>Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 
> 0007d307 CR4: 000407e0
>Jan 31 13:07:43 asterix kernel: Stack:
>Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 
> ea2e6af0 8807d30b3998
>Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 
> 8807d30b08c0 0020f000
>Jan 31 13:07:43 asterix kernel:  083b 
> 000a 8807d30b3a68
>Jan 31 13:07:43 asterix kernel: Call Trace:
>Jan 31 13:07:43 asterix kernel: [] ? 
> lru_add_drain_cpu+0x25/0x97
>Jan 31 13:07:43 asterix kernel: [] 
> compact_zone+0x2b5/0x319
>Jan 31 13:07:43 asterix kernel: [] ? put_super+0x20/0x2c
>Jan 31 13:07:43 asterix kernel: [] 
> compact_zone_order+0xad/0xc4
>Jan 31 13:07:43 asterix kernel: [] 
> try_to_compact_pages+0x91/0xe8
>Jan 31 13:07:43 asterix kernel: [] ? 
> page_alloc_cpu_notify+0x3e/0x3e
>Jan 31 13:07:43 asterix kernel: [] 
> __alloc_pages_direct_compact+0xae/0x195
>Jan 31 13:07:43 asterix kernel: [] 
> __alloc_pages_nodemask+0x772/0x7b5
>Jan 31 13:07:43 asterix kernel: [] 
> alloc_pages_vma+0xd6/0x101
>Jan 31 13:07:43 asterix kernel: [] 
> do_huge_pmd_anonymous_page+0x199/0x2ee
>Jan 31 13:07:43 asterix kernel: [] 
> handle_mm_fault+0x1b7/0xceb
>Jan 31 13:07:43 asterix kernel: [] ? 
> __dequeue_entity+0x2e/0x33
>Jan 31 13:07:43 asterix kernel: [] 
> __do_page_fault+0x3bd/0x3e4
>Jan 31 13:07:43 asterix kernel: [] ? 
> mprotect_fixup+0x1c9/0x1fb
>Jan 31 13:07:43 asterix kernel: [] ? 
> vm_mmap_pgoff+0x6d/0x8f
>Jan 31 13:07:43 asterix kernel: [] ? 
> SyS_futex+0x103/0x13d
>Jan 31 13:07:43 asterix kernel: [] do_page_fault+0x9/0xb
>Jan 31 13:07:43 asterix kernel: [] page_fault+0x22/0x30
>Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 
> 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 
> 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 
> d2
>Jan 31 13:07:43 asterix kernel: RIP  [] 
> isolate_migratepages_range+0x32d/0x653
>Jan 31 13:07:43 asterix kernel: RSP 
>Jan 31 13:07:43 asterix kernel: CR2: 001c
>Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---
> 
> Kernel is a plain kernel.org kernel 3.12.9 and it uses drbd to replicate
> data to another host. Any idea what the cause of this bug is? Could it be
> hardware? The system has been running now for five years without any problems.
> 
> Please CC me since I am not on the list.
> 
> Many thanks in advance.
> 
> Regards,
> Holger
> --
> To 

Re: Need help in bug in isolate_migratepages_range

2014-02-03 Thread Vlastimil Babka

On 02/03/2014 05:20 PM, Michal Hocko wrote:

On Mon 03-02-14 14:29:22, Holger Kiehl wrote:

I have attached it. Please, tell me if you do not get the attachment.


I hoped it would help me to get a closer compiled code to yours but I am
probably using too different gcc.
Anyway I've tried to check whether I can hook on something and it seems
that this is a race with thp merge/split or something like that.

[...]

   Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer 
dereference at 001c
   Jan 31 13:07:43 asterix kernel: IP: [810af0ac] 
isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
   Jan 31 13:07:43 asterix kernel: Oops:  [#1] SMP
   Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp 
ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci 
i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore 
usb_common [last unloaded: microcode]
   Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 
3.12.9 #1
   Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY 
RX300 S4 /D2519, BIOS 4.06  Rev. 1.04.2519 07/30/2008
   Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 
task.ti: 8807d30b2000
   Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac]  
[810af0ac] isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928  EFLAGS: 00010286
   Jan 31 13:07:43 asterix kernel: RAX:  RBX: 0020ec09 
RCX: 0002
   Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 
RDI: 006c
   Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 
R09: 0001
   Jan 31 13:07:43 asterix kernel: R10:  R11: ea000733a000 
R12: 8807d30b3a58
   Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14:  
R15: 88083ffe1d80
   Jan 31 13:07:43 asterix kernel: FS:  7f9d9e72f910() 
GS:88083fd4() knlGS:
   Jan 31 13:07:43 asterix kernel: CS:  0010 DS:  ES:  CR0: 
8005003b
   Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 
CR4: 000407e0
   Jan 31 13:07:43 asterix kernel: Stack:
   Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 
ea2e6af0 8807d30b3998
   Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 
8807d30b08c0 0020f000
   Jan 31 13:07:43 asterix kernel:  083b 
000a 8807d30b3a68
   Jan 31 13:07:43 asterix kernel: Call Trace:
   Jan 31 13:07:43 asterix kernel: [810a161f] ? 
lru_add_drain_cpu+0x25/0x97
   Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319
   Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c
   Jan 31 13:07:43 asterix kernel: [810afa4d] 
compact_zone_order+0xad/0xc4
   Jan 31 13:07:43 asterix kernel: [810afaf5] 
try_to_compact_pages+0x91/0xe8
   Jan 31 13:07:43 asterix kernel: [8109b92d] ? 
page_alloc_cpu_notify+0x3e/0x3e
   Jan 31 13:07:43 asterix kernel: [8109da34] 
__alloc_pages_direct_compact+0xae/0x195
   Jan 31 13:07:43 asterix kernel: [8109e45d] 
__alloc_pages_nodemask+0x772/0x7b5
   Jan 31 13:07:43 asterix kernel: [810c85a3] 
alloc_pages_vma+0xd6/0x101
   Jan 31 13:07:43 asterix kernel: [810d47e3] 
do_huge_pmd_anonymous_page+0x199/0x2ee
   Jan 31 13:07:43 asterix kernel: [810b3884] 
handle_mm_fault+0x1b7/0xceb
   Jan 31 13:07:43 asterix kernel: [8105dedc] ? 
__dequeue_entity+0x2e/0x33
   Jan 31 13:07:43 asterix kernel: [8102d8c3] 
__do_page_fault+0x3bd/0x3e4
   Jan 31 13:07:43 asterix kernel: [810bbe1a] ? 
mprotect_fixup+0x1c9/0x1fb
   Jan 31 13:07:43 asterix kernel: [810aa0f0] ? 
vm_mmap_pgoff+0x6d/0x8f
   Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d
   Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb
   Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30
   Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00
41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30
<8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
   Jan 31 13:07:43 asterix kernel: RIP  [810af0ac] 
isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928
   Jan 31 13:07:43 asterix kernel: CR2: 001c
   Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---


This seems to match:
17027:   49 8b 17                mov    (%r15),%rdx     # page-flags
1702a:   4c 89 f8                mov    %r15,%rax
1702d:   80 e6 80                and
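
As a rough illustration of what these instructions test (a sketch, not kernel code): the last quoted instruction is cut off in the archive, but the bytes 80 e6 80 encode and $0x80,%dh, which isolates bit 15 of the flags word loaded into %rdx. The Code: line of the oops does the same check as a 16-bit sign test (66 85 d2 is test %dx,%dx, and 79 04 is a short jns). The Python below only shows that the three forms agree. Treating bit 15 as the tail-page flag is an assumption about this kernel configuration, and the function names are made up for the example.

```python
PG_TAIL_BIT = 15  # assumption: the tail-page flag sits at bit 15 in this build


def tail_by_mask(flags):
    """Plain C-style test: flags & (1 << 15)."""
    return bool(flags & (1 << PG_TAIL_BIT))


def tail_by_dh(flags):
    """and $0x80,%dh: %dh holds bits 8..15 of %rdx."""
    dh = (flags >> 8) & 0xFF
    return bool(dh & 0x80)


def tail_by_sign(flags):
    """test %dx,%dx ; jns: the jump is not taken when bit 15 of %dx is set."""
    dx = flags & 0xFFFF
    return dx >= 0x8000


# RDX as printed in the oops above. The archive seems to have dropped digits
# from register values, so treat this number as illustrative only.
RDX = 0x2c008000

for flags in (0x0, 0x4000, 0x8000, 0xFFFF, RDX):
    # All three encodings of the check must give the same answer.
    assert tail_by_mask(flags) == tail_by_dh(flags) == tail_by_sign(flags)

print(tail_by_mask(RDX))
```

If the printed RDX value can be trusted in its low 16 bits, the flag reads as set, which would fit the path where the head page pointer is loaded from the page before the faulting access.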

Need help in bug in isolate_migratepages_range

2014-01-31 Thread Holger Kiehl

Hello,

Today one of our systems got a kernel bug message. It kept on running,
but more and more processes began to get stuck in D state (e.g. a simple w
command would never return) and I eventually had to reboot. Here is the
full message:

   Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer 
dereference at 001c
   Jan 31 13:07:43 asterix kernel: IP: [810af0ac] 
isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
   Jan 31 13:07:43 asterix kernel: Oops:  [#1] SMP
   Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp 
ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci 
i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore 
usb_common [last unloaded: microcode]
   Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 
3.12.9 #1
   Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY 
RX300 S4 /D2519, BIOS 4.06  Rev. 1.04.2519 07/30/2008
   Jan 31 13:07:43 asterix kernel: task: 8807d30b08c0 ti: 8807d30b2000 
task.ti: 8807d30b2000
   Jan 31 13:07:43 asterix kernel: RIP: 0010:[810af0ac]  
[810af0ac] isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: RSP: :8807d30b3928  EFLAGS: 00010286
   Jan 31 13:07:43 asterix kernel: RAX:  RBX: 0020ec09 
RCX: 0002
   Jan 31 13:07:43 asterix kernel: RDX: 2c008000 RSI: 0004 
RDI: 006c
   Jan 31 13:07:43 asterix kernel: RBP: 8807d30b39f8 R08: 88083fbde390 
R09: 0001
   Jan 31 13:07:43 asterix kernel: R10:  R11: ea000733a000 
R12: 8807d30b3a58
   Jan 31 13:07:43 asterix kernel: R13: ea000733a1f8 R14:  
R15: 88083ffe1d80
   Jan 31 13:07:43 asterix kernel: FS:  7f9d9e72f910() 
GS:88083fd4() knlGS:
   Jan 31 13:07:43 asterix kernel: CS:  0010 DS:  ES:  CR0: 
8005003b
   Jan 31 13:07:43 asterix kernel: CR2: 001c CR3: 0007d307 
CR4: 000407e0
   Jan 31 13:07:43 asterix kernel: Stack:
   Jan 31 13:07:43 asterix kernel: 0009 88083ffe16c0 
ea2e6af0 8807d30b3998
   Jan 31 13:07:43 asterix kernel: 8807d30b2010 00ff8807d30b08c0 
8807d30b08c0 0020f000
   Jan 31 13:07:43 asterix kernel:  083b 
000a 8807d30b3a68
   Jan 31 13:07:43 asterix kernel: Call Trace:
   Jan 31 13:07:43 asterix kernel: [810a161f] ? 
lru_add_drain_cpu+0x25/0x97
   Jan 31 13:07:43 asterix kernel: [810af687] compact_zone+0x2b5/0x319
   Jan 31 13:07:43 asterix kernel: [810da586] ? put_super+0x20/0x2c
   Jan 31 13:07:43 asterix kernel: [810afa4d] 
compact_zone_order+0xad/0xc4
   Jan 31 13:07:43 asterix kernel: [810afaf5] 
try_to_compact_pages+0x91/0xe8
   Jan 31 13:07:43 asterix kernel: [8109b92d] ? 
page_alloc_cpu_notify+0x3e/0x3e
   Jan 31 13:07:43 asterix kernel: [8109da34] 
__alloc_pages_direct_compact+0xae/0x195
   Jan 31 13:07:43 asterix kernel: [8109e45d] 
__alloc_pages_nodemask+0x772/0x7b5
   Jan 31 13:07:43 asterix kernel: [810c85a3] 
alloc_pages_vma+0xd6/0x101
   Jan 31 13:07:43 asterix kernel: [810d47e3] 
do_huge_pmd_anonymous_page+0x199/0x2ee
   Jan 31 13:07:43 asterix kernel: [810b3884] 
handle_mm_fault+0x1b7/0xceb
   Jan 31 13:07:43 asterix kernel: [8105dedc] ? 
__dequeue_entity+0x2e/0x33
   Jan 31 13:07:43 asterix kernel: [8102d8c3] 
__do_page_fault+0x3bd/0x3e4
   Jan 31 13:07:43 asterix kernel: [810bbe1a] ? 
mprotect_fixup+0x1c9/0x1fb
   Jan 31 13:07:43 asterix kernel: [810aa0f0] ? 
vm_mmap_pgoff+0x6d/0x8f
   Jan 31 13:07:43 asterix kernel: [810795f5] ? SyS_futex+0x103/0x13d
   Jan 31 13:07:43 asterix kernel: [8102d8f3] do_page_fault+0x9/0xb
   Jan 31 13:07:43 asterix kernel: [813d3672] page_fault+0x22/0x30
   Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00
41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30
<8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
   Jan 31 13:07:43 asterix kernel: RIP  [810af0ac] 
isolate_migratepages_range+0x32d/0x653
   Jan 31 13:07:43 asterix kernel: RSP 8807d30b3928
   Jan 31 13:07:43 asterix kernel: CR2: 001c
   Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---
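
One way to read the fault address 001c out of this dump, offered as a sketch rather than a diagnosis: in the Code: line, the bytes 49 8b 45 30 decode to mov 0x30(%r13),%rax and the bytes 8b 40 1c decode to mov 0x1c(%rax),%eax. If the first load returned 0, the second one touches address 0 + 0x1c. The Python below checks that arithmetic with ctypes. MockPage is a hypothetical structure whose field names and order are assumptions chosen to mirror the displacements in the dump (0x18, 0x1c, 0x30); it is not a copy of the kernel's struct page. RAX is blank in the log and is assumed to be zero here.

```python
import ctypes


class MockPage(ctypes.Structure):
    """Hypothetical stand-in for the start of a 64-bit struct page."""
    _fields_ = [
        ("flags", ctypes.c_uint64),       # 0x00
        ("mapping", ctypes.c_uint64),     # 0x08
        ("index", ctypes.c_uint64),       # 0x10
        ("mapcount", ctypes.c_int32),     # 0x18
        ("count", ctypes.c_int32),        # 0x1c
        ("lru_next", ctypes.c_uint64),    # 0x20
        ("lru_prev", ctypes.c_uint64),    # 0x28
        ("first_page", ctypes.c_uint64),  # 0x30
    ]


def effective_address(base, disp):
    """Address touched by mov disp(%base),%reg, wrapped to 64 bits."""
    return (base + disp) & 0xFFFFFFFFFFFFFFFF


RAX = 0  # assumption: the blank RAX in the log stands for zero
COUNT_OFFSET = MockPage.count.offset
FIRST_PAGE_OFFSET = MockPage.first_page.offset
FAULT_ADDRESS = effective_address(RAX, COUNT_OFFSET)

print(hex(FIRST_PAGE_OFFSET), hex(COUNT_OFFSET), hex(FAULT_ADDRESS))
```

Under these assumptions the computed address equals the CR2 value in the log, which is what a read of a reference count through a NULL head-page pointer would look like.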

The kernel is a plain kernel.org 3.12.9 kernel, and the system uses drbd to
replicate data to another host. Any idea what the cause of this bug is? Could it
be hardware? The system has been running for five years without any problems.

Please CC me since I am not on the list.

Many thanks in advance.

Regards,
Holger
--
To unsubscribe from this list: send the line unsubscribe