Re: list corruption in deferred_split_scan()

2019-08-05 Thread Qian Cai



> On Aug 5, 2019, at 6:15 PM, Yang Shi  wrote:
> 
> 
> 
> On 7/25/19 2:46 PM, Yang Shi wrote:
>> 
>> 
>> On 7/24/19 2:13 PM, Qian Cai wrote:
>>> On Wed, 2019-07-10 at 17:43 -0400, Qian Cai wrote:
 Running LTP oom01 test case with swap triggers a crash below. Reverting the
 series
 "Make deferred split shrinker memcg aware" [1] seems to fix the issue.
>>> You might want to look harder at this commit, as reverting it alone on top of
>>> 5.2.0-next-20190711 fixed the issue.
>>> 
>>> aefde94195ca mm: thp: make deferred split shrinker memcg aware [1]
>>> 
>>> [1] 
>>> https://lore.kernel.org/linux-mm/1561507361-59349-5-git-send-email-yang.shi@
>>> linux.alibaba.com/
>> 
>> This is the real meat of the patch series; it is the part that actually
>> converts to the memcg-aware deferred split queue.
>> 
>>> 
>>> 
>>> list_del corruption. prev->next should be ea0022b10098, but was
>>> 
>> 
>> Finally I could reproduce the list corruption issue on my machine with THP
>> swap (the swap device is a fast device). I should have checked this with you
>> in the first place. The problem can't be reproduced with a rotating swap
>> device, so I suppose you were using THP swap too.
>> 
>> Actually, I found two issues with THP swap:
>> 1. free_transhuge_page() is called in the reclaim path instead of put_page().
>> mem_cgroup_uncharge() is called before free_transhuge_page() in the reclaim
>> path, which leaves page->mem_cgroup NULL, so the wrong deferred_split_queue
>> is used and the THP is not deleted from the memcg's list at all. The page
>> might then be split or reused later, and page->mapping would be overwritten.
>> 
>> 2. There is a race condition caused by try_to_unmap() with THP swap.
>> try_to_unmap() just calls page_remove_rmap(), which adds the THP to the
>> deferred split queue in the reclaim path. This can cause the race below,
>> which corrupts the list:
>> 
>>          A                                   B
>> deferred_split_scan
>>     list_move
>>                                     try_to_unmap
>>                                         list_add_tail
>>     list_splice  <-- The list might get corrupted here
>>                                     free_transhuge_page
>>                                         list_del  <-- kernel bug triggered
>> 
>> I hope the below patch would solve your problem (tested locally).
> 
> Hi Qian,
> 
> Did the below patch solve your problem? I would like to fold the fix into
> the series and target the 5.4 release.

It is going to take a while before I will be able to access that system again.
Since you can reproduce this and test it yourself now, I'd say go ahead and
post the patch.


> 
> Thanks,
> Yang
> 
>> 
>> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index b7f709d..d6612ec 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2830,6 +2830,19 @@ void deferred_split_huge_page(struct page *page)
>> 
>> VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>> 
>> +   /*
>> +* The try_to_unmap() in page reclaim path might reach here too,
>> +* this may cause a race condition to corrupt deferred split queue.
>> +* And, if page reclaim is already handling the same page, it is
>> +* unnecessary to handle it again in shrinker.
>> +*
>> +* Check PageSwapCache to determine if the page is being
>> +* handled by page reclaim since THP swap would add the page into
>> +* swap cache before reaching try_to_unmap().
>> +*/
>> +   if (PageSwapCache(page))
>> +   return;
>> +
>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> if (list_empty(page_deferred_list(page))) {
>> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index a0301ed..40c684a 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1485,10 +1485,9 @@ static unsigned long shrink_page_list(struct 
>> list_head *page_list,
>>  * Is there need to periodically free_page_list? It would
>>  * appear not as the counts should be low
>>  */
>> -   if (unlikely(PageTransHuge(page))) {
>> -   mem_cgroup_uncharge(page);
>> +   if (unlikely(PageTransHuge(page)))
>> (*get_compound_page_dtor(page))(page);
>> -   } else
>> +   else
>> list_add(&page->lru, &free_pages);
>> continue;
>> 
>> @@ -1909,7 +1908,6 @@ static unsigned noinline_for_stack 
>> move_pages_to_lru(struct lruvec *lruvec,
>> 
>> if (unlikely(PageCompound(page))) {
>> spin_unlock_irq(&pgdat->lru_lock);
>> -   mem_cgroup_uncharge(page);
>> (*get_compound_page_dtor(page))(page);
>> spin_lock_irq(&pgdat->lru_lock);
>> } else
>> 
>>> [  685.284254][ T3456] [ cut here ]
>>> [  685.289616][ T3456] kernel BUG at 

Re: list corruption in deferred_split_scan()

2019-08-05 Thread Yang Shi




On 7/25/19 2:46 PM, Yang Shi wrote:



On 7/24/19 2:13 PM, Qian Cai wrote:

On Wed, 2019-07-10 at 17:43 -0400, Qian Cai wrote:
Running LTP oom01 test case with swap triggers a crash below. Reverting the
series
"Make deferred split shrinker memcg aware" [1] seems to fix the issue.
You might want to look harder at this commit, as reverting it alone on top of
5.2.0-next-20190711 fixed the issue.

aefde94195ca mm: thp: make deferred split shrinker memcg aware [1]

[1] 
https://lore.kernel.org/linux-mm/1561507361-59349-5-git-send-email-yang.shi@

linux.alibaba.com/


This is the real meat of the patch series; it is the part that actually
converts to the memcg-aware deferred split queue.





list_del corruption. prev->next should be ea0022b10098, but was



Finally I could reproduce the list corruption issue on my machine with
THP swap (the swap device is a fast device). I should have checked this with
you in the first place. The problem can't be reproduced with a rotating swap
device, so I suppose you were using THP swap too.


Actually, I found two issues with THP swap:
1. free_transhuge_page() is called in the reclaim path instead of
put_page(). mem_cgroup_uncharge() is called before free_transhuge_page()
in the reclaim path, which leaves page->mem_cgroup NULL, so the wrong
deferred_split_queue is used and the THP is not deleted from the memcg's
list at all. The page might then be split or reused later, and
page->mapping would be overwritten.


2. There is a race condition caused by try_to_unmap() with THP swap.
try_to_unmap() just calls page_remove_rmap(), which adds the THP to the
deferred split queue in the reclaim path. This can cause the race below,
which corrupts the list:


         A                                   B
deferred_split_scan
    list_move
                                    try_to_unmap
                                        list_add_tail
    list_splice  <-- The list might get corrupted here
                                    free_transhuge_page
                                        list_del  <-- kernel bug triggered


I hope the below patch would solve your problem (tested locally).


Hi Qian,

Did the below patch solve your problem? I would like to fold the fix
into the series and target the 5.4 release.


Thanks,
Yang




diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7f709d..d6612ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2830,6 +2830,19 @@ void deferred_split_huge_page(struct page *page)

    VM_BUG_ON_PAGE(!PageTransHuge(page), page);

+   /*
+    * The try_to_unmap() in page reclaim path might reach here too,
+    * this may cause a race condition to corrupt deferred split queue.
+    * And, if page reclaim is already handling the same page, it is
+    * unnecessary to handle it again in shrinker.
+    *
+    * Check PageSwapCache to determine if the page is being
+    * handled by page reclaim since THP swap would add the page into
+    * swap cache before reaching try_to_unmap().
+    */
+   if (PageSwapCache(page))
+   return;
+
    spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
    if (list_empty(page_deferred_list(page))) {
    count_vm_event(THP_DEFERRED_SPLIT_PAGE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..40c684a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1485,10 +1485,9 @@ static unsigned long shrink_page_list(struct 
list_head *page_list,

 * Is there need to periodically free_page_list? It would
 * appear not as the counts should be low
 */
-   if (unlikely(PageTransHuge(page))) {
-   mem_cgroup_uncharge(page);
+   if (unlikely(PageTransHuge(page)))
    (*get_compound_page_dtor(page))(page);
-   } else
+   else
    list_add(&page->lru, &free_pages);
    continue;

@@ -1909,7 +1908,6 @@ static unsigned noinline_for_stack 
move_pages_to_lru(struct lruvec *lruvec,


    if (unlikely(PageCompound(page))) {
spin_unlock_irq(&pgdat->lru_lock);
-   mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
spin_lock_irq(&pgdat->lru_lock);
    } else
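For illustration, here is a minimal userspace model of issue #1 above (simplified structures and a hypothetical helper name, pick_queue(); this is not the kernel code). It shows why calling mem_cgroup_uncharge() before free_transhuge_page() makes the destructor pick the node queue instead of the memcg queue, leaving the THP on the memcg's deferred split list:

#include <stdio.h>

struct queue { const char *name; };

struct page {
	struct queue *memcg_queue;   /* stands in for page->mem_cgroup */
	int on_memcg_list;           /* stands in for the deferred list entry */
};

static struct queue node_queue  = { "node queue" };
static struct queue memcg_queue = { "memcg queue" };

/* models the memcg-aware queue selection: memcg queue if charged, else node queue */
static struct queue *pick_queue(struct page *page)
{
	return page->memcg_queue ? page->memcg_queue : &node_queue;
}

/* models free_transhuge_page(): works on the queue derived from the page */
static void free_transhuge_page_model(struct page *page)
{
	struct queue *q = pick_queue(page);

	printf("destructor takes the %s lock\n", q->name);
	if (q == &memcg_queue)
		page->on_memcg_list = 0;   /* only then is the entry deleted */
}

int main(void)
{
	struct page thp = { .memcg_queue = &memcg_queue, .on_memcg_list = 1 };

	/* reclaim path order before the fix: uncharge first, destructor second */
	thp.memcg_queue = NULL;            /* mem_cgroup_uncharge() */
	free_transhuge_page_model(&thp);   /* now picks the node queue */

	printf("THP still on the memcg deferred list: %d\n", thp.on_memcg_list);
	return 0;
}

The vmscan.c hunks in the patch above drop the early mem_cgroup_uncharge() calls, so the compound page destructor still sees page->mem_cgroup and can operate on the right queue.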


[  685.284254][ T3456] [ cut here ]
[  685.289616][ T3456] kernel BUG at lib/list_debug.c:53!
[  685.294808][ T3456] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC 
KASAN NOPTI

[  685.301998][ T3456] CPU: 5 PID: 3456 Comm: oom01 Tainted:
GW 5.2.0-next-20190711+ #3
[  685.311193][ T3456] Hardware name: HPE ProLiant DL385 
Gen10/ProLiant DL385

Gen10, BIOS A40 06/24/2019
[  685.320485][ T3456] RIP: 0010:__list_del_entry_valid+0x8b/0xb6
[  685.326364][ T3456] Code: f1 e0 ff 49 8b 55 08 4c 39 e2 75 2c 5b 
b8 01 00 00
00 41 5c 41 5d 5d c3 4c 89 e2 48 89 de 48 c7 c7 c0 5a 73 a3 e8 d9 fa 
bc ff <0f>

0b 48 c7 c7 60 a0 e1 a3 e8 13 52 01 00 4c 89 e6 

Re: list corruption in deferred_split_scan()

2019-07-25 Thread Yang Shi




On 7/24/19 2:13 PM, Qian Cai wrote:

On Wed, 2019-07-10 at 17:43 -0400, Qian Cai wrote:

Running LTP oom01 test case with swap triggers a crash below. Reverting the
series
"Make deferred split shrinker memcg aware" [1] seems to fix the issue.

You might want to look harder at this commit, as reverting it alone on top of
5.2.0-next-20190711 fixed the issue.

aefde94195ca mm: thp: make deferred split shrinker memcg aware [1]

[1] https://lore.kernel.org/linux-mm/1561507361-59349-5-git-send-email-yang.shi@
linux.alibaba.com/


This is the real meat of the patch series; it is the part that actually
converts to the memcg-aware deferred split queue.





list_del corruption. prev->next should be ea0022b10098, but was



Finally I could reproduce the list corruption issue on my machine with
THP swap (the swap device is a fast device). I should have checked this with
you in the first place. The problem can't be reproduced with a rotating swap
device, so I suppose you were using THP swap too.


Actually, I found two issues with THP swap:
1. free_transhuge_page() is called in the reclaim path instead of
put_page(). mem_cgroup_uncharge() is called before free_transhuge_page()
in the reclaim path, which leaves page->mem_cgroup NULL, so the wrong
deferred_split_queue is used and the THP is not deleted from the memcg's
list at all. The page might then be split or reused later, and
page->mapping would be overwritten.


2. There is a race condition caused by try_to_unmap() with THP swap.
try_to_unmap() just calls page_remove_rmap(), which adds the THP to the
deferred split queue in the reclaim path. This can cause the race below,
which corrupts the list:


         A                                   B
deferred_split_scan
    list_move
                                    try_to_unmap
                                        list_add_tail
    list_splice  <-- The list might get corrupted here
                                    free_transhuge_page
                                        list_del  <-- kernel bug triggered


I hope the below patch would solve your problem (tested locally).


diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7f709d..d6612ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2830,6 +2830,19 @@ void deferred_split_huge_page(struct page *page)

    VM_BUG_ON_PAGE(!PageTransHuge(page), page);

+   /*
+    * The try_to_unmap() in page reclaim path might reach here too,
+    * this may cause a race condition to corrupt deferred split queue.
+    * And, if page reclaim is already handling the same page, it is
+    * unnecessary to handle it again in shrinker.
+    *
+    * Check PageSwapCache to determine if the page is being
+    * handled by page reclaim since THP swap would add the page into
+    * swap cache before reaching try_to_unmap().
+    */
+   if (PageSwapCache(page))
+   return;
+
    spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
    if (list_empty(page_deferred_list(page))) {
    count_vm_event(THP_DEFERRED_SPLIT_PAGE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..40c684a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1485,10 +1485,9 @@ static unsigned long shrink_page_list(struct 
list_head *page_list,

 * Is there need to periodically free_page_list? It would
 * appear not as the counts should be low
 */
-   if (unlikely(PageTransHuge(page))) {
-   mem_cgroup_uncharge(page);
+   if (unlikely(PageTransHuge(page)))
    (*get_compound_page_dtor(page))(page);
-   } else
+   else
    list_add(&page->lru, &free_pages);
    continue;

@@ -1909,7 +1908,6 @@ static unsigned noinline_for_stack 
move_pages_to_lru(struct lruvec *lruvec,


    if (unlikely(PageCompound(page))) {
spin_unlock_irq(&pgdat->lru_lock);
-   mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
spin_lock_irq(&pgdat->lru_lock);
    } else


[  685.284254][ T3456] [ cut here ]
[  685.289616][ T3456] kernel BUG at lib/list_debug.c:53!
[  685.294808][ T3456] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[  685.301998][ T3456] CPU: 5 PID: 3456 Comm: oom01 Tainted:
GW 5.2.0-next-20190711+ #3
[  685.311193][ T3456] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
Gen10, BIOS A40 06/24/2019
[  685.320485][ T3456] RIP: 0010:__list_del_entry_valid+0x8b/0xb6
[  685.326364][ T3456] Code: f1 e0 ff 49 8b 55 08 4c 39 e2 75 2c 5b b8 01 00 00
00 41 5c 41 5d 5d c3 4c 89 e2 48 89 de 48 c7 c7 c0 5a 73 a3 e8 d9 fa bc ff <0f>
0b 48 c7 c7 60 a0 e1 a3 e8 13 52 01 00 4c 89 e6 48 c7 c7 20 5b
[  685.345956][ T3456] RSP: 0018:888e0c8a73c0 EFLAGS: 00010082
[  685.351920][ T3456] RAX: 0054 RBX: 

Re: list corruption in deferred_split_scan()

2019-07-24 Thread Qian Cai
On Wed, 2019-07-10 at 17:43 -0400, Qian Cai wrote:
> Running LTP oom01 test case with swap triggers a crash below. Reverting the
> series
> "Make deferred split shrinker memcg aware" [1] seems to fix the issue.

You might want to look harder at this commit, as reverting it alone on top of
5.2.0-next-20190711 fixed the issue.

aefde94195ca mm: thp: make deferred split shrinker memcg aware [1]

[1] https://lore.kernel.org/linux-mm/1561507361-59349-5-git-send-email-yang.shi@
linux.alibaba.com/

Here is all the console output from running LTP oom01 before the crash; it might
be useful.

[  656.302886][ T3384] WARNING: CPU: 79 PID: 3384 at mm/page_alloc.c:4608
__alloc_pages_nodemask+0x1a8a/0x1bc0
[  656.304395][ T3409] kmemleak: Cannot allocate a kmemleak_object structure
[  656.312714][ T3384] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat
kvm_amd kvm ses enclosure dax_pmem irqbypass dax_pmem_core efivars ip_tables
x_tables xfs sd_mod smartpqi scsi_transport_sas mlx5_core tg3 libphy
firmware_class dm_mirror dm_region_hash dm_log dm_mod efivarfs
[  656.320916][ T3409] kmemleak: Kernel memory leak detector disabled
[  656.344509][ T3384] CPU: 79 PID: 3384 Comm: oom01 Not tainted 5.2.0-next-
20190711+ #3
[  656.344523][ T3384] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
Gen10, BIOS A40 06/24/2019
[  656.352100][  T829] kmemleak: Automatic memory scanning thread ended
[  656.358648][ T3384] RIP: 0010:__alloc_pages_nodemask+0x1a8a/0x1bc0
[  656.358658][ T3384] Code: 00 85 d2 0f 85 a1 00 00 00 48 c7 c7 e0 29 c3 a3 e8
3b 98 62 00 65 48 8b 1c 25 80 ee 01 00 e9 85 fa ff ff 0f 0b e9 3e fb ff ff <0f>
0b 48 8b b5 00 ff ff ff 8b 8d 84 fe ff ff 48 c7 c2 00 1d 6c a3
[  656.358675][ T3384] RSP: :888efa4a6210 EFLAGS: 00010046
[  656.406140][ T3384] RAX:  RBX:  RCX:
a2b28be2
[  656.414033][ T3384] RDX:  RSI: dc00 RDI:
a4d15d60
[  656.421926][ T3384] RBP: 888efa4a6420 R08: fbfff49a2bad R09:
fbfff49a2bac
[  656.429818][ T3384] R10: fbfff49a2bac R11: 0003 R12:
a4d15d60
[  656.437711][ T3384] R13:  R14: 0800 R15:

[  656.445605][ T3384] FS:  7ff44adfc700() GS:889032f8()
knlGS:
[  656.454459][ T3384] CS:  0010 DS:  ES:  CR0: 80050033
[  656.460952][ T3384] CR2: 7ff2f05e1000 CR3: 001012e44000 CR4:
001406a0
[  656.468843][ T3384] Call Trace:
[  656.472026][ T3384]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[  656.477303][ T3384]  ? stack_depot_save+0x215/0x58b
[  656.482228][ T3384]  ? lock_downgrade+0x390/0x390
[  656.486976][ T3384]  ? stack_depot_save+0x183/0x58b
[  656.491900][ T3384]  ? kasan_check_read+0x11/0x20
[  656.496647][ T3384]  ? do_raw_spin_unlock+0xa8/0x140
[  656.501658][ T3384]  ? stack_depot_save+0x215/0x58b
[  656.506582][ T3384]  alloc_pages_current+0x9c/0x110
[  656.511505][ T3384]  allocate_slab+0x351/0x11f0
[  656.516077][ T3384]  ? kasan_slab_alloc+0x11/0x20
[  656.520824][ T3384]  new_slab+0x46/0x70
[  656.524702][ T3384]  ? pageout.isra.4+0x3e5/0xa00
[  656.529449][ T3384]  ___slab_alloc+0x5d4/0x9c0
[  656.533933][ T3384]  ? try_to_free_pages+0x242/0x4d0
[  656.538941][ T3384]  ? __alloc_pages_nodemask+0x9ce/0x1bc0
[  656.544476][ T3384]  ? alloc_pages_vma+0x89/0x2c0
[  656.549226][ T3384]  ? __do_page_fault+0x25b/0x5d0
[  656.554064][ T3384]  ? create_object+0x3a/0x3e0
[  656.558637][ T3384]  ? init_object+0x7e/0x90
[  656.562947][ T3384]  ? create_object+0x3a/0x3e0
[  656.567520][ T3384]  __slab_alloc+0x12/0x20
[  656.571742][ T3384]  ? __slab_alloc+0x12/0x20
[  656.576142][ T3384]  kmem_cache_alloc+0x32a/0x400
[  656.580890][ T3384]  create_object+0x3a/0x3e0
[  656.585291][ T3384]  ? stack_depot_save+0x183/0x58b
[  656.590215][ T3384]  kmemleak_alloc+0x71/0xa0
[  656.594611][ T3384]  kmem_cache_alloc+0x272/0x400
[  656.599361][ T3384]  ? ___might_sleep+0xab/0xc0
[  656.603934][ T3384]  ? mempool_free+0x170/0x170
[  656.608507][ T3384]  mempool_alloc_slab+0x2d/0x40
[  656.613254][ T3384]  mempool_alloc+0x10a/0x29e
[  656.617739][ T3384]  ? alloc_pages_vma+0x89/0x2c0
[  656.622485][ T3384]  ? mempool_resize+0x390/0x390
[  656.627233][ T3384]  ? __read_once_size_nocheck.constprop.2+0x10/0x10
[  656.633730][ T3384]  bio_alloc_bioset+0x150/0x330
[  656.638477][ T3384]  ? bvec_alloc+0x1b0/0x1b0
[  656.642892][ T3384]  alloc_io+0x2f/0x230 [dm_mod]
[  656.647654][ T3384]  __split_and_process_bio+0x99/0x630 [dm_mod]
[  656.653714][ T3384]  ? blk_rq_map_sg+0x9f0/0x9f0
[  656.658388][ T3384]  ? __send_empty_flush.constprop.11+0x1f0/0x1f0 [dm_mod]
[  656.665407][ T3384]  ? check_chain_key+0x1df/0x2e0
[  656.670244][ T3384]  ? kasan_check_read+0x11/0x20
[  656.674992][ T3384]  ? blk_queue_split+0x60/0x90
[  656.679654][ T3384]  ? __blk_queue_split+0x970/0x970
[  656.684679][ T3384]  dm_process_bio+0x33f/0x520 [dm_mod]
[  656.690054][ T3384]  ? __process_bio+0x230/0x230 [dm_mod]
[  

Re: list corruption in deferred_split_scan()

2019-07-24 Thread Qian Cai
On Thu, 2019-07-18 at 17:59 -0700, Yang Shi wrote:
> 
> On 7/18/19 5:54 PM, Qian Cai wrote:
> > 
> > > On Jul 12, 2019, at 3:12 PM, Yang Shi  wrote:
> > > 
> > > 
> > > 
> > > On 7/11/19 2:07 PM, Qian Cai wrote:
> > > > On Wed, 2019-07-10 at 17:16 -0700, Yang Shi wrote:
> > > > > Hi Qian,
> > > > > 
> > > > > 
> > > > > Thanks for reporting the issue. But, I can't reproduce it on my
> > > > > machine.
> > > > > Could you please share more details about your test? How often did you
> > > > > run into this problem?
> > > > 
> > > > I can almost reproduce it every time on a HPE ProLiant DL385 Gen10
> > > > server. Here
> > > > is some more information.
> > > > 
> > > > # cat .config
> > > > 
> > > > https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> > > 
> > > I tried your kernel config, but I still can't reproduce it. My compiler
> > > doesn't have retpoline support, so CONFIG_RETPOLINE is disabled in my
> > > test, but I don't think this would make any difference for this case.
> > > 
> > > According to the bug call trace in the earlier email, it looks like
> > > deferred_split_scan() lost a race with put_compound_page().
> > > put_compound_page() calls free_transhuge_page(), which deletes the page
> > > from the deferred split queue, but the page may still appear on the
> > > deferred list for some reason.
> > > 
> > > Would you please try the below patch?
> > > 
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index b7f709d..66bd9db 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -2765,7 +2765,7 @@ int split_huge_page_to_list(struct page *page,
> > > struct list_head *list)
> > >  if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
> > >  if (!list_empty(page_deferred_list(head))) {
> > >  ds_queue->split_queue_len--;
> > > -   list_del(page_deferred_list(head));
> > > +   list_del_init(page_deferred_list(head));
> > >  }
> > >  if (mapping)
> > >  __dec_node_page_state(page, NR_SHMEM_THPS);
> > > @@ -2814,7 +2814,7 @@ void free_transhuge_page(struct page *page)
> > >  spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > >  if (!list_empty(page_deferred_list(page))) {
> > >  ds_queue->split_queue_len--;
> > > -   list_del(page_deferred_list(page));
> > > +   list_del_init(page_deferred_list(page));
> > >  }
> > >  spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > >  free_compound_page(page);
> > 
> > Unfortunately, I am no longer able to reproduce the original list
> > corruption with today's linux-next.
> 
> I guess that is because the patches have been dropped from the -mm tree by
> Andrew due to this problem. You have to use next-20190711, or apply the
> patches on top of today's linux-next.
> 

The patch you have here does not help. I only applied the part for
free_transhuge_page(), as you requested.

[  375.006307][ T3580] list_del corruption. next->prev should be
ea0030e10098, but was 888ea8d0cdb8
[  375.015928][ T3580] [ cut here ]
[  375.021296][ T3580] kernel BUG at lib/list_debug.c:56!
[  375.026491][ T3580] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[  375.033680][ T3580] CPU: 84 PID: 3580 Comm: oom01 Tainted:
GW 5.2.0-next-20190711+ #2
[  375.042964][ T3580] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
Gen10, BIOS A40 06/24/2019
[  375.052256][ T3580] RIP: 0010:__list_del_entry_valid+0xa8/0xb6
[  375.058135][ T3580] Code: de 48 c7 c7 c0 5a b3 b0 e8 b9 fa bc ff 0f 0b 48 c7
c7 60 a0 21 b1 e8 13 52 01 00 4c 89 e6 48 c7 c7 20 5b b3 b0 e8 9c fa bc ff <0f>
0b 48 c7 c7 20 a0 21 b1 e8 f6 51 01 00 4c 89 ea 48 89 de 48 c7
[  375.077722][ T3580] RSP: 0018:888ebc4b73c0 EFLAGS: 00010082
[  375.083684][ T3580] RAX: 0054 RBX: ea0030e10098 RCX:
b015d728
[  375.091566][ T3580] RDX:  RSI: 0008 RDI:
88903263d380
[  375.099448][ T3580] RBP: 888ebc4b73d8 R08: ed12064c7a71 R09:
ed12064c7a70
[  375.107330][ T3580] R10: ed12064c7a70 R11: 88903263d387 R12:
ea0030e10098
[  375.115212][ T3580] R13: ea0031d40098 R14: ea0030e10034 R15:
ea0031d40098
[  375.123095][ T3580] FS:  7fc3dc851700() GS:88903260()
knlGS:
[  375.131937][ T3580] CS:  0010 DS:  ES:  CR0: 80050033
[  375.138421][ T3580] CR2: 7fc25fa39000 CR3: 000884762000 CR4:
001406a0
[  375.146301][ T3580] Call Trace:
[  375.149472][ T3580]  deferred_split_scan+0x337/0x740
[  375.154475][ T3580]  ? split_huge_page_to_list+0xe30/0xe30
[  375.160002][ T3580]  ? __sched_text_start+0x8/0x8
[  375.164743][ T3580]  ? __radix_tree_lookup+0x12d/0x1e0
[  375.169923][ T3580]  do_shrink_slab+0x244/0x5a0
[  375.174490][ T3580]  shrink_slab+0x253/0x440
[  375.178794][ T3580]  

Re: list corruption in deferred_split_scan()

2019-07-18 Thread Yang Shi




On 7/18/19 5:54 PM, Qian Cai wrote:



On Jul 12, 2019, at 3:12 PM, Yang Shi  wrote:



On 7/11/19 2:07 PM, Qian Cai wrote:

On Wed, 2019-07-10 at 17:16 -0700, Yang Shi wrote:

Hi Qian,


Thanks for reporting the issue. But, I can't reproduce it on my machine.
Could you please share more details about your test? How often did you
run into this problem?

I can almost reproduce it every time on a HPE ProLiant DL385 Gen10 server. Here
is some more information.

# cat .config

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

I tried your kernel config, but I still can't reproduce it. My compiler doesn't 
have retpoline support, so CONFIG_RETPOLINE is disabled in my test, but I don't 
think this would make any difference for this case.

According to the bug call trace in the earlier email, it looks like
deferred_split_scan() lost a race with put_compound_page().
put_compound_page() calls free_transhuge_page(), which deletes the page from
the deferred split queue, but the page may still appear on the deferred list
for some reason.

Would you please try the below patch?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7f709d..66bd9db 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2765,7 +2765,7 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
 if (!list_empty(page_deferred_list(head))) {
 ds_queue->split_queue_len--;
-   list_del(page_deferred_list(head));
+   list_del_init(page_deferred_list(head));
 }
 if (mapping)
 __dec_node_page_state(page, NR_SHMEM_THPS);
@@ -2814,7 +2814,7 @@ void free_transhuge_page(struct page *page)
 spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 if (!list_empty(page_deferred_list(page))) {
 ds_queue->split_queue_len--;
-   list_del(page_deferred_list(page));
+   list_del_init(page_deferred_list(page));
 }
 spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 free_compound_page(page);
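For context, a minimal userspace sketch of the list_del() vs. list_del_init() difference the patch above relies on (a toy list with toy poison values, not the kernel's include/linux/list.h). Plain list_del() poisons the entry's next/prev, so a later list_empty() check on that entry does not see it as removed and a second deletion trips the list-debug check (the "next is LIST_POISON1" report seen in this thread); list_del_init() re-initializes the entry, so the "already deleted" case is caught by list_empty():

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* toy poison values standing in for the kernel's LIST_POISON1/2 */
#define LIST_POISON1 ((struct list_head *)0x100)
#define LIST_POISON2 ((struct list_head *)0x122)

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *e, struct list_head *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

static void __list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void list_del(struct list_head *e)       /* poisons, like the kernel */
{
	__list_del(e);
	e->next = LIST_POISON1;
	e->prev = LIST_POISON2;
}

static void list_del_init(struct list_head *e)  /* leaves a valid empty node */
{
	__list_del(e);
	INIT_LIST_HEAD(e);
}

int main(void)
{
	struct list_head queue, entry;

	INIT_LIST_HEAD(&queue);
	list_add(&entry, &queue);
	list_del(&entry);               /* first deletion path */
	/* the second path checks list_empty() before deleting again */
	printf("after list_del:      list_empty=%d -> would be deleted twice\n",
	       list_empty(&entry));

	INIT_LIST_HEAD(&queue);
	list_add(&entry, &queue);
	list_del_init(&entry);
	printf("after list_del_init: list_empty=%d -> second deletion skipped\n",
	       list_empty(&entry));
	return 0;
}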

Unfortunately, I am no longer able to reproduce the original list corruption
with today's linux-next.


I guess that is because the patches have been dropped from the -mm tree by
Andrew due to this problem. You have to use next-20190711, or apply the
patches on top of today's linux-next.





Re: list corruption in deferred_split_scan()

2019-07-18 Thread Qian Cai



> On Jul 12, 2019, at 3:12 PM, Yang Shi  wrote:
> 
> 
> 
> On 7/11/19 2:07 PM, Qian Cai wrote:
>> On Wed, 2019-07-10 at 17:16 -0700, Yang Shi wrote:
>>> Hi Qian,
>>> 
>>> 
>>> Thanks for reporting the issue. But, I can't reproduce it on my machine.
>>> Could you please share more details about your test? How often did you
>>> run into this problem?
>> I can almost reproduce it every time on a HPE ProLiant DL385 Gen10 server. 
>> Here
>> is some more information.
>> 
>> # cat .config
>> 
>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> 
> I tried your kernel config, but I still can't reproduce it. My compiler 
> doesn't have retpoline support, so CONFIG_RETPOLINE is disabled in my test, 
> but I don't think this would make any difference for this case.
> 
> According to the bug call trace in the earlier email, it looks like
> deferred_split_scan() lost a race with put_compound_page().
> put_compound_page() calls free_transhuge_page(), which deletes the page
> from the deferred split queue, but the page may still appear on the
> deferred list for some reason.
> 
> Would you please try the below patch?
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b7f709d..66bd9db 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2765,7 +2765,7 @@ int split_huge_page_to_list(struct page *page, struct 
> list_head *list)
> if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
> if (!list_empty(page_deferred_list(head))) {
> ds_queue->split_queue_len--;
> -   list_del(page_deferred_list(head));
> +   list_del_init(page_deferred_list(head));
> }
> if (mapping)
> __dec_node_page_state(page, NR_SHMEM_THPS);
> @@ -2814,7 +2814,7 @@ void free_transhuge_page(struct page *page)
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> if (!list_empty(page_deferred_list(page))) {
> ds_queue->split_queue_len--;
> -   list_del(page_deferred_list(page));
> +   list_del_init(page_deferred_list(page));
> }
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> free_compound_page(page);

Unfortunately, I am no longer able to reproduce the original list corruption
with today's linux-next.

Re: list corruption in deferred_split_scan()

2019-07-17 Thread Yang Shi




On 7/17/19 10:02 AM, Shakeel Butt wrote:

On Tue, Jul 16, 2019 at 5:12 PM Yang Shi  wrote:



On 7/16/19 4:36 PM, Shakeel Butt wrote:

Adding related people.

The thread starts at:
http://lkml.kernel.org/r/1562795006.8510.19.ca...@lca.pw

On Mon, Jul 15, 2019 at 8:01 PM Yang Shi  wrote:


On 7/15/19 6:36 PM, Qian Cai wrote:

On Jul 15, 2019, at 8:22 PM, Yang Shi  wrote:



On 7/15/19 2:23 PM, Qian Cai wrote:

On Fri, 2019-07-12 at 12:12 -0700, Yang Shi wrote:

Another possible lead is that without reverting those commits below,
kdump
kernel would always also crash in shrink_slab_memcg() at this line,

map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);

This looks a little bit weird. It seems nodeinfo[nid] is NULL? I can't
think of a case where nodeinfo is freed while the memcg is still online.
Maybe a check is needed:

Actually, "memcg" is NULL.

It sounds weird. shrink_slab() is called in mem_cgroup_iter which does pin the 
memcg. So, the memcg should not go away.

Well, the commit “mm: shrinker: make shrinker not depend on memcg kmem” changed 
this line in shrink_slab_memcg(),

- if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
+ if (!mem_cgroup_online(memcg))
return 0;

Since the kdump kernel has the parameter “cgroup_disable=memory”, 
shrink_slab_memcg() will no longer be able to handle NULL memcg from 
mem_cgroup_iter() as,

if (mem_cgroup_disabled())
return NULL;

Aha, yes. memcg_kmem_enabled() implicitly checks !mem_cgroup_disabled().
Thanks for figuring this out. I think we need to add a mem_cgroup_disabled()
check before calling shrink_slab_memcg(), as below:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..2f03c61 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -701,7 +701,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int
nid,
   unsigned long ret, freed = 0;
   struct shrinker *shrinker;

-   if (!mem_cgroup_is_root(memcg))
+   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

   if (!down_read_trylock(&shrinker_rwsem))
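As a side note, a minimal userspace model of why the extra check matters with cgroup_disable=memory (simplified stand-ins, not the kernel code): mem_cgroup_iter() returns NULL when the controller is disabled, mem_cgroup_is_root(NULL) is false, so the old check sends the NULL memcg into shrink_slab_memcg(), which then dereferences it; the added !mem_cgroup_disabled() test routes that case to the tree-wide shrinker walk instead:

#include <stdbool.h>
#include <stdio.h>

struct mem_cgroup { int online; };

static struct mem_cgroup root_memcg;
static bool memcg_disabled = true;                /* cgroup_disable=memory */

static bool mem_cgroup_disabled(void) { return memcg_disabled; }
static bool mem_cgroup_is_root(struct mem_cgroup *m) { return m == &root_memcg; }

static unsigned long shrink_slab_memcg(struct mem_cgroup *memcg)
{
	/* the real function soon dereferences memcg (e.g. memcg->nodeinfo[nid]),
	 * so reaching here with memcg == NULL is the reported kdump crash */
	return memcg->online ? 1 : 0;
}

static unsigned long shrink_slab(struct mem_cgroup *memcg, bool patched)
{
	if (patched) {
		if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
			return shrink_slab_memcg(memcg);
	} else {
		if (!mem_cgroup_is_root(memcg))   /* also true for memcg == NULL */
			return shrink_slab_memcg(memcg);
	}
	return 0;                                 /* tree-wide shrinker walk */
}

int main(void)
{
	/* what mem_cgroup_iter() yields when the memory controller is disabled */
	struct mem_cgroup *memcg = NULL;

	printf("patched path freed: %lu\n", shrink_slab(memcg, true));
	/* shrink_slab(memcg, false) would dereference NULL, as in the report */
	return 0;
}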


We were seeing unneeded oom-kills on kernels with
"cgroup_disabled=memory", and Yang's patch series basically exposes the
bug as a crash. I think the commit aeed1d325d42 ("mm/vmscan.c:
generalize shrink_slab() calls in shrink_node()") missed the case for
"cgroup_disabled=memory". However I am surprised that root_mem_cgroup
is allocated even for "cgroup_disabled=memory" and it seems like
css_alloc() is called even before checking if the corresponding
controller is disabled.

I'm surprised too. A quick test with drgn shows root memcg is definitely
allocated:

  >>> prog['root_mem_cgroup']
*(struct mem_cgroup *)0x8902cf058000 = {
[snip]

But, isn't this a bug?

It can be treated as a bug as this is not expected but we can discuss
and take care of it later. I think we need your patch urgently as
memory reclaim and /proc/sys/vm/drop_caches is broken for
"cgroup_disabled=memory" kernel. So, please send your patch asap.


Sure. I'm going to post the patch soon.



thanks,
Shakeel




Re: list corruption in deferred_split_scan()

2019-07-17 Thread Shakeel Butt
On Tue, Jul 16, 2019 at 5:12 PM Yang Shi  wrote:
>
>
>
> On 7/16/19 4:36 PM, Shakeel Butt wrote:
> > Adding related people.
> >
> > The thread starts at:
> > http://lkml.kernel.org/r/1562795006.8510.19.ca...@lca.pw
> >
> > On Mon, Jul 15, 2019 at 8:01 PM Yang Shi  wrote:
> >>
> >>
> >> On 7/15/19 6:36 PM, Qian Cai wrote:
>  On Jul 15, 2019, at 8:22 PM, Yang Shi  wrote:
> 
> 
> 
>  On 7/15/19 2:23 PM, Qian Cai wrote:
> > On Fri, 2019-07-12 at 12:12 -0700, Yang Shi wrote:
> >>> Another possible lead is that without reverting those commits below,
> >>> kdump
> >>> kernel would always also crash in shrink_slab_memcg() at this line,
> >>>
> >>> map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, 
> >>> true);
> >> This looks a little bit weird. It seems nodeinfo[nid] is NULL? I didn't
> >> think of where nodeinfo was freed but memcg was still online. Maybe a
> >> check is needed:
> > Actually, "memcg" is NULL.
>  It sounds weird. shrink_slab() is called in mem_cgroup_iter which does 
>  pin the memcg. So, the memcg should not go away.
> >>> Well, the commit “mm: shrinker: make shrinker not depend on memcg kmem” 
> >>> changed this line in shrink_slab_memcg(),
> >>>
> >>> - if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
> >>> + if (!mem_cgroup_online(memcg))
> >>>return 0;
> >>>
> >>> Since the kdump kernel has the parameter “cgroup_disable=memory”, 
> >>> shrink_slab_memcg() will no longer be able to handle NULL memcg from 
> >>> mem_cgroup_iter() as,
> >>>
> >>> if (mem_cgroup_disabled())
> >>>return NULL;
> >> Aha, yes. memcg_kmem_enabled() implicitly checks !mem_cgroup_disabled().
> >> Thanks for figuring this out. I think we need to add a mem_cgroup_disabled()
> >> check before calling shrink_slab_memcg(), as below:
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index a0301ed..2f03c61 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -701,7 +701,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int
> >> nid,
> >>   unsigned long ret, freed = 0;
> >>   struct shrinker *shrinker;
> >>
> >> -   if (!mem_cgroup_is_root(memcg))
> >> +   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> >>   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> >>
> >>   if (!down_read_trylock(&shrinker_rwsem))
> >>
> > We were seeing unneeded oom-kills on kernels with
> > "cgroup_disabled=memory", and Yang's patch series basically exposes the
> > bug as a crash. I think the commit aeed1d325d42 ("mm/vmscan.c:
> > generalize shrink_slab() calls in shrink_node()") missed the case for
> > "cgroup_disabled=memory". However I am surprised that root_mem_cgroup
> > is allocated even for "cgroup_disabled=memory" and it seems like
> > css_alloc() is called even before checking if the corresponding
> > controller is disabled.
>
> I'm surprised too. A quick test with drgn shows root memcg is definitely
> allocated:
>
>  >>> prog['root_mem_cgroup']
> *(struct mem_cgroup *)0x8902cf058000 = {
> [snip]
>
> But, isn't this a bug?

It can be treated as a bug as this is not expected but we can discuss
and take care of it later. I think we need your patch urgently as
memory reclaim and /proc/sys/vm/drop_caches is broken for
"cgroup_disabled=memory" kernel. So, please send your patch asap.

thanks,
Shakeel


Re: list corruption in deferred_split_scan()

2019-07-16 Thread Yang Shi




On 7/16/19 4:36 PM, Shakeel Butt wrote:

Adding related people.

The thread starts at:
http://lkml.kernel.org/r/1562795006.8510.19.ca...@lca.pw

On Mon, Jul 15, 2019 at 8:01 PM Yang Shi  wrote:



On 7/15/19 6:36 PM, Qian Cai wrote:

On Jul 15, 2019, at 8:22 PM, Yang Shi  wrote:



On 7/15/19 2:23 PM, Qian Cai wrote:

On Fri, 2019-07-12 at 12:12 -0700, Yang Shi wrote:

Another possible lead is that without reverting those commits below,
kdump
kernel would always also crash in shrink_slab_memcg() at this line,

map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);

This looks a little bit weird. It seems nodeinfo[nid] is NULL? I can't
think of a case where nodeinfo is freed while the memcg is still online.
Maybe a check is needed:

Actually, "memcg" is NULL.

It sounds weird. shrink_slab() is called in mem_cgroup_iter which does pin the 
memcg. So, the memcg should not go away.

Well, the commit “mm: shrinker: make shrinker not depend on memcg kmem” changed 
this line in shrink_slab_memcg(),

- if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
+ if (!mem_cgroup_online(memcg))
   return 0;

Since the kdump kernel has the parameter “cgroup_disable=memory”, 
shrink_slab_memcg() will no longer be able to handle NULL memcg from 
mem_cgroup_iter() as,

if (mem_cgroup_disabled())
   return NULL;

Aha, yes. memcg_kmem_enabled() implicitly checks !mem_cgroup_disabled().
Thanks for figuring this out. I think we need to add a mem_cgroup_disabled()
check before calling shrink_slab_memcg(), as below:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..2f03c61 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -701,7 +701,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int
nid,
  unsigned long ret, freed = 0;
  struct shrinker *shrinker;

-   if (!mem_cgroup_is_root(memcg))
+   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
  return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

  if (!down_read_trylock(&shrinker_rwsem))


We were seeing unneeded oom-kills on kernels with
"cgroup_disabled=memory", and Yang's patch series basically exposes the
bug as a crash. I think the commit aeed1d325d42 ("mm/vmscan.c:
generalize shrink_slab() calls in shrink_node()") missed the case for
"cgroup_disabled=memory". However I am surprised that root_mem_cgroup
is allocated even for "cgroup_disabled=memory" and it seems like
css_alloc() is called even before checking if the corresponding
controller is disabled.


I'm surprised too. A quick test with drgn shows root memcg is definitely 
allocated:


>>> prog['root_mem_cgroup']
*(struct mem_cgroup *)0x8902cf058000 = {
[snip]

But, isn't this a bug?

Thanks,
Yang



Yang, can you please send the above change with signed-off and CC to
stable as well?

thanks,
Shakeel




Re: list corruption in deferred_split_scan()

2019-07-16 Thread Shakeel Butt
Adding related people.

The thread starts at:
http://lkml.kernel.org/r/1562795006.8510.19.ca...@lca.pw

On Mon, Jul 15, 2019 at 8:01 PM Yang Shi  wrote:
>
>
>
> On 7/15/19 6:36 PM, Qian Cai wrote:
> >
> >> On Jul 15, 2019, at 8:22 PM, Yang Shi  wrote:
> >>
> >>
> >>
> >> On 7/15/19 2:23 PM, Qian Cai wrote:
> >>> On Fri, 2019-07-12 at 12:12 -0700, Yang Shi wrote:
> > Another possible lead is that without reverting those commits below,
> > kdump
> > kernel would always also crash in shrink_slab_memcg() at this line,
> >
> > map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, 
> > true);
>  This looks a little bit weird. It seems nodeinfo[nid] is NULL? I didn't
>  think of where nodeinfo was freed but memcg was still online. Maybe a
>  check is needed:
> >>> Actually, "memcg" is NULL.
> >> It sounds weird. shrink_slab() is called in mem_cgroup_iter which does pin 
> >> the memcg. So, the memcg should not go away.
> > Well, the commit “mm: shrinker: make shrinker not depend on memcg kmem” 
> > changed this line in shrink_slab_memcg(),
> >
> > - if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
> > + if (!mem_cgroup_online(memcg))
> >   return 0;
> >
> > Since the kdump kernel has the parameter “cgroup_disable=memory”, 
> > shrink_slab_memcg() will no longer be able to handle NULL memcg from 
> > mem_cgroup_iter() as,
> >
> > if (mem_cgroup_disabled())
> >   return NULL;
>
> Aha, yes. memcg_kmem_enabled() implicitly checks !mem_cgroup_disabled().
> Thanks for figuring this out. I think we need to add a mem_cgroup_disabled()
> check before calling shrink_slab_memcg(), as below:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a0301ed..2f03c61 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -701,7 +701,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int
> nid,
>  unsigned long ret, freed = 0;
>  struct shrinker *shrinker;
>
> -   if (!mem_cgroup_is_root(memcg))
> +   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
>  if (!down_read_trylock(&shrinker_rwsem))
>

We were seeing unneeded oom-kills on kernels with
"cgroup_disabled=memory", and Yang's patch series basically exposes the
bug as a crash. I think the commit aeed1d325d42 ("mm/vmscan.c:
generalize shrink_slab() calls in shrink_node()") missed the case for
"cgroup_disabled=memory". However I am surprised that root_mem_cgroup
is allocated even for "cgroup_disabled=memory" and it seems like
css_alloc() is called even before checking if the corresponding
controller is disabled.

Yang, can you please send the above change with signed-off and CC to
stable as well?

thanks,
Shakeel


Re: list corruption in deferred_split_scan()

2019-07-15 Thread Yang Shi




On 7/15/19 6:36 PM, Qian Cai wrote:



On Jul 15, 2019, at 8:22 PM, Yang Shi  wrote:



On 7/15/19 2:23 PM, Qian Cai wrote:

On Fri, 2019-07-12 at 12:12 -0700, Yang Shi wrote:

Another possible lead is that without reverting those commits below,
kdump
kernel would always also crash in shrink_slab_memcg() at this line,

map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);

This looks a little bit weird. It seems nodeinfo[nid] is NULL? I can't
think of a case where nodeinfo is freed while the memcg is still online.
Maybe a check is needed:

Actually, "memcg" is NULL.

It sounds weird. shrink_slab() is called in mem_cgroup_iter which does pin the 
memcg. So, the memcg should not go away.

Well, the commit “mm: shrinker: make shrinker not depend on memcg kmem” changed 
this line in shrink_slab_memcg(),

-   if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
+   if (!mem_cgroup_online(memcg))
return 0;

Since the kdump kernel has the parameter “cgroup_disable=memory”, 
shrink_slab_memcg() will no longer be able to handle NULL memcg from 
mem_cgroup_iter() as,

if (mem_cgroup_disabled())  
return NULL;


Aha, yes. memcg_kmem_enabled() implicitly checks !mem_cgroup_disabled().
Thanks for figuring this out. I think we need to add a mem_cgroup_disabled()
check before calling shrink_slab_memcg(), as below:


diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..2f03c61 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -701,7 +701,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int 
nid,

    unsigned long ret, freed = 0;
    struct shrinker *shrinker;

-   if (!mem_cgroup_is_root(memcg))
+   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
    return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

    if (!down_read_trylock(&shrinker_rwsem))




diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..bacda49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -602,6 +602,9 @@ static unsigned long shrink_slab_memcg(gfp_t
gfp_mask, int nid,
  if (!mem_cgroup_online(memcg))
  return 0;

+   if (!memcg->nodeinfo[nid])
+   return 0;
+
  if (!down_read_trylock(&shrinker_rwsem))
  return 0;


[9.072036][T1] BUG: KASAN: null-ptr-deref in shrink_slab+0x111/0x440
[9.072036][T1] Read of size 8 at addr 0dc8 by task
swapper/0/1
[9.072036][T1]
[9.072036][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-next-
20190711+ #10
[9.072036][T1] Hardware name: HPE ProLiant DL385 Gen10/ProLiant
DL385
Gen10, BIOS A40 01/25/2019
[9.072036][T1] Call Trace:
[9.072036][T1]  dump_stack+0x62/0x9a
[9.072036][T1]  __kasan_report.cold.4+0xb0/0xb4
[9.072036][T1]  ? unwind_get_return_address+0x40/0x50
[9.072036][T1]  ? shrink_slab+0x111/0x440
[9.072036][T1]  kasan_report+0xc/0xe
[9.072036][T1]  __asan_load8+0x71/0xa0
[9.072036][T1]  shrink_slab+0x111/0x440
[9.072036][T1]  ? mem_cgroup_iter+0x98/0x840
[9.072036][T1]  ? unregister_shrinker+0x110/0x110
[9.072036][T1]  ? kasan_check_read+0x11/0x20
[9.072036][T1]  ? mem_cgroup_protected+0x39/0x260
[9.072036][T1]  shrink_node+0x31e/0xa30
[9.072036][T1]  ? shrink_node_memcg+0x1560/0x1560
[9.072036][T1]  ? ktime_get+0x93/0x110
[9.072036][T1]  do_try_to_free_pages+0x22f/0x820
[9.072036][T1]  ? shrink_node+0xa30/0xa30
[9.072036][T1]  ? kasan_check_read+0x11/0x20
[9.072036][T1]  ? check_chain_key+0x1df/0x2e0
[9.072036][T1]  try_to_free_pages+0x242/0x4d0
[9.072036][T1]  ? do_try_to_free_pages+0x820/0x820
[9.072036][T1]  __alloc_pages_nodemask+0x9ce/0x1bc0
[9.072036][T1]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[9.072036][T1]  ? unwind_dump+0x260/0x260
[9.072036][T1]  ? kernel_text_address+0x33/0xc0
[9.072036][T1]  ? arch_stack_walk+0x8f/0xf0
[9.072036][T1]  ? ret_from_fork+0x22/0x40
[9.072036][T1]  alloc_page_interleave+0x18/0x130
[9.072036][T1]  alloc_pages_current+0xf6/0x110
[9.072036][T1]  allocate_slab+0x600/0x11f0
[9.072036][T1]  new_slab+0x46/0x70
[9.072036][T1]  ___slab_alloc+0x5d4/0x9c0
[9.072036][T1]  ? create_object+0x3a/0x3e0
[9.072036][T1]  ? fs_reclaim_acquire.part.15+0x5/0x30
[9.072036][T1]  ? ___might_sleep+0xab/0xc0
[9.072036][T1]  ? create_object+0x3a/0x3e0
[9.072036][T1]  __slab_alloc+0x12/0x20
[9.072036][T1]  ? __slab_alloc+0x12/0x20
[9.072036][T1]  kmem_cache_alloc+0x32a/0x400
[9.072036][T1]  create_object+0x3a/0x3e0
[9.072036][T1]  kmemleak_alloc+0x71/0xa0
[9.072036][T1]  kmem_cache_alloc+0x272/0x400
[9.072036][T1]  ? kasan_check_read+0x11/0x20
[9.072036][T1]  ? do_raw_spin_unlock+0xa8/0x140
[9.072036][T1]  acpi_ps_alloc_op+0x76/0x122
[

Re: list corruption in deferred_split_scan()

2019-07-15 Thread Qian Cai



> On Jul 15, 2019, at 8:22 PM, Yang Shi  wrote:
> 
> 
> 
> On 7/15/19 2:23 PM, Qian Cai wrote:
>> On Fri, 2019-07-12 at 12:12 -0700, Yang Shi wrote:
 Another possible lead is that without reverting those commits below,
 kdump
 kernel would always also crash in shrink_slab_memcg() at this line,
 
 map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);
>>> This looks a little bit weird. It seems nodeinfo[nid] is NULL? I didn't
>>> think of where nodeinfo was freed but memcg was still online. Maybe a
>>> check is needed:
>> Actually, "memcg" is NULL.
> 
> It sounds weird. shrink_slab() is called in mem_cgroup_iter which does pin 
> the memcg. So, the memcg should not go away.

Well, the commit “mm: shrinker: make shrinker not depend on memcg kmem” changed 
this line in shrink_slab_memcg(),

-   if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
+   if (!mem_cgroup_online(memcg))
return 0;

Since the kdump kernel has the parameter “cgroup_disable=memory”, 
shrink_slab_memcg() will no longer be able to handle NULL memcg from 
mem_cgroup_iter() as,

if (mem_cgroup_disabled())  
return NULL;

> 
>> 
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index a0301ed..bacda49 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -602,6 +602,9 @@ static unsigned long shrink_slab_memcg(gfp_t
>>> gfp_mask, int nid,
>>>  if (!mem_cgroup_online(memcg))
>>>  return 0;
>>> 
>>> +   if (!memcg->nodeinfo[nid])
>>> +   return 0;
>>> +
>>>  if (!down_read_trylock(&shrinker_rwsem))
>>>  return 0;
>>> 
 [9.072036][T1] BUG: KASAN: null-ptr-deref in 
 shrink_slab+0x111/0x440
 [9.072036][T1] Read of size 8 at addr 0dc8 by task
 swapper/0/1
 [9.072036][T1]
 [9.072036][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
 5.2.0-next-
 20190711+ #10
 [9.072036][T1] Hardware name: HPE ProLiant DL385 Gen10/ProLiant
 DL385
 Gen10, BIOS A40 01/25/2019
 [9.072036][T1] Call Trace:
 [9.072036][T1]  dump_stack+0x62/0x9a
 [9.072036][T1]  __kasan_report.cold.4+0xb0/0xb4
 [9.072036][T1]  ? unwind_get_return_address+0x40/0x50
 [9.072036][T1]  ? shrink_slab+0x111/0x440
 [9.072036][T1]  kasan_report+0xc/0xe
 [9.072036][T1]  __asan_load8+0x71/0xa0
 [9.072036][T1]  shrink_slab+0x111/0x440
 [9.072036][T1]  ? mem_cgroup_iter+0x98/0x840
 [9.072036][T1]  ? unregister_shrinker+0x110/0x110
 [9.072036][T1]  ? kasan_check_read+0x11/0x20
 [9.072036][T1]  ? mem_cgroup_protected+0x39/0x260
 [9.072036][T1]  shrink_node+0x31e/0xa30
 [9.072036][T1]  ? shrink_node_memcg+0x1560/0x1560
 [9.072036][T1]  ? ktime_get+0x93/0x110
 [9.072036][T1]  do_try_to_free_pages+0x22f/0x820
 [9.072036][T1]  ? shrink_node+0xa30/0xa30
 [9.072036][T1]  ? kasan_check_read+0x11/0x20
 [9.072036][T1]  ? check_chain_key+0x1df/0x2e0
 [9.072036][T1]  try_to_free_pages+0x242/0x4d0
 [9.072036][T1]  ? do_try_to_free_pages+0x820/0x820
 [9.072036][T1]  __alloc_pages_nodemask+0x9ce/0x1bc0
 [9.072036][T1]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
 [9.072036][T1]  ? unwind_dump+0x260/0x260
 [9.072036][T1]  ? kernel_text_address+0x33/0xc0
 [9.072036][T1]  ? arch_stack_walk+0x8f/0xf0
 [9.072036][T1]  ? ret_from_fork+0x22/0x40
 [9.072036][T1]  alloc_page_interleave+0x18/0x130
 [9.072036][T1]  alloc_pages_current+0xf6/0x110
 [9.072036][T1]  allocate_slab+0x600/0x11f0
 [9.072036][T1]  new_slab+0x46/0x70
 [9.072036][T1]  ___slab_alloc+0x5d4/0x9c0
 [9.072036][T1]  ? create_object+0x3a/0x3e0
 [9.072036][T1]  ? fs_reclaim_acquire.part.15+0x5/0x30
 [9.072036][T1]  ? ___might_sleep+0xab/0xc0
 [9.072036][T1]  ? create_object+0x3a/0x3e0
 [9.072036][T1]  __slab_alloc+0x12/0x20
 [9.072036][T1]  ? __slab_alloc+0x12/0x20
 [9.072036][T1]  kmem_cache_alloc+0x32a/0x400
 [9.072036][T1]  create_object+0x3a/0x3e0
 [9.072036][T1]  kmemleak_alloc+0x71/0xa0
 [9.072036][T1]  kmem_cache_alloc+0x272/0x400
 [9.072036][T1]  ? kasan_check_read+0x11/0x20
 [9.072036][T1]  ? do_raw_spin_unlock+0xa8/0x140
 [9.072036][T1]  acpi_ps_alloc_op+0x76/0x122
 [9.072036][T1]  acpi_ds_execute_arguments+0x2f/0x18d
 [9.072036][T1]  acpi_ds_get_package_arguments+0x7d/0x84
 [9.072036][T1]  acpi_ns_init_one_package+0x33/0x61
 [9.072036][T1]  acpi_ns_init_one_object+0xfc/0x189
 [9.072036][T1]  acpi_ns_walk_namespace+0x114/0x1f2
 [ 

Re: list corruption in deferred_split_scan()

2019-07-15 Thread Yang Shi




On 7/15/19 2:23 PM, Qian Cai wrote:

On Fri, 2019-07-12 at 12:12 -0700, Yang Shi wrote:

Another possible lead is that without reverting those commits below,
kdump
kernel would always also crash in shrink_slab_memcg() at this line,

map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);

This looks a little bit weird. It seems nodeinfo[nid] is NULL? I can't
think of a case where nodeinfo is freed while the memcg is still online.
Maybe a check is needed:

Actually, "memcg" is NULL.


It sounds weird. shrink_slab() is called in mem_cgroup_iter which does 
pin the memcg. So, the memcg should not go away.





diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..bacda49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -602,6 +602,9 @@ static unsigned long shrink_slab_memcg(gfp_t
gfp_mask, int nid,
  if (!mem_cgroup_online(memcg))
  return 0;

+   if (!memcg->nodeinfo[nid])
+   return 0;
+
  if (!down_read_trylock(&shrinker_rwsem))
  return 0;


[9.072036][T1] BUG: KASAN: null-ptr-deref in shrink_slab+0x111/0x440
[9.072036][T1] Read of size 8 at addr 0dc8 by task
swapper/0/1
[9.072036][T1]
[9.072036][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-next-
20190711+ #10
[9.072036][T1] Hardware name: HPE ProLiant DL385 Gen10/ProLiant
DL385
Gen10, BIOS A40 01/25/2019
[9.072036][T1] Call Trace:
[9.072036][T1]  dump_stack+0x62/0x9a
[9.072036][T1]  __kasan_report.cold.4+0xb0/0xb4
[9.072036][T1]  ? unwind_get_return_address+0x40/0x50
[9.072036][T1]  ? shrink_slab+0x111/0x440
[9.072036][T1]  kasan_report+0xc/0xe
[9.072036][T1]  __asan_load8+0x71/0xa0
[9.072036][T1]  shrink_slab+0x111/0x440
[9.072036][T1]  ? mem_cgroup_iter+0x98/0x840
[9.072036][T1]  ? unregister_shrinker+0x110/0x110
[9.072036][T1]  ? kasan_check_read+0x11/0x20
[9.072036][T1]  ? mem_cgroup_protected+0x39/0x260
[9.072036][T1]  shrink_node+0x31e/0xa30
[9.072036][T1]  ? shrink_node_memcg+0x1560/0x1560
[9.072036][T1]  ? ktime_get+0x93/0x110
[9.072036][T1]  do_try_to_free_pages+0x22f/0x820
[9.072036][T1]  ? shrink_node+0xa30/0xa30
[9.072036][T1]  ? kasan_check_read+0x11/0x20
[9.072036][T1]  ? check_chain_key+0x1df/0x2e0
[9.072036][T1]  try_to_free_pages+0x242/0x4d0
[9.072036][T1]  ? do_try_to_free_pages+0x820/0x820
[9.072036][T1]  __alloc_pages_nodemask+0x9ce/0x1bc0
[9.072036][T1]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[9.072036][T1]  ? unwind_dump+0x260/0x260
[9.072036][T1]  ? kernel_text_address+0x33/0xc0
[9.072036][T1]  ? arch_stack_walk+0x8f/0xf0
[9.072036][T1]  ? ret_from_fork+0x22/0x40
[9.072036][T1]  alloc_page_interleave+0x18/0x130
[9.072036][T1]  alloc_pages_current+0xf6/0x110
[9.072036][T1]  allocate_slab+0x600/0x11f0
[9.072036][T1]  new_slab+0x46/0x70
[9.072036][T1]  ___slab_alloc+0x5d4/0x9c0
[9.072036][T1]  ? create_object+0x3a/0x3e0
[9.072036][T1]  ? fs_reclaim_acquire.part.15+0x5/0x30
[9.072036][T1]  ? ___might_sleep+0xab/0xc0
[9.072036][T1]  ? create_object+0x3a/0x3e0
[9.072036][T1]  __slab_alloc+0x12/0x20
[9.072036][T1]  ? __slab_alloc+0x12/0x20
[9.072036][T1]  kmem_cache_alloc+0x32a/0x400
[9.072036][T1]  create_object+0x3a/0x3e0
[9.072036][T1]  kmemleak_alloc+0x71/0xa0
[9.072036][T1]  kmem_cache_alloc+0x272/0x400
[9.072036][T1]  ? kasan_check_read+0x11/0x20
[9.072036][T1]  ? do_raw_spin_unlock+0xa8/0x140
[9.072036][T1]  acpi_ps_alloc_op+0x76/0x122
[9.072036][T1]  acpi_ds_execute_arguments+0x2f/0x18d
[9.072036][T1]  acpi_ds_get_package_arguments+0x7d/0x84
[9.072036][T1]  acpi_ns_init_one_package+0x33/0x61
[9.072036][T1]  acpi_ns_init_one_object+0xfc/0x189
[9.072036][T1]  acpi_ns_walk_namespace+0x114/0x1f2
[9.072036][T1]  ? acpi_ns_init_one_package+0x61/0x61
[9.072036][T1]  ? acpi_ns_init_one_package+0x61/0x61
[9.072036][T1]  acpi_walk_namespace+0x9e/0xcb
[9.072036][T1]  ? acpi_sleep_proc_init+0x36/0x36
[9.072036][T1]  acpi_ns_initialize_objects+0x99/0xed
[9.072036][T1]  ? acpi_ns_find_ini_methods+0xa2/0xa2
[9.072036][T1]  ? acpi_tb_load_namespace+0x2dc/0x2eb
[9.072036][T1]  acpi_load_tables+0x61/0x80
[9.072036][T1]  acpi_init+0x10d/0x44b
[9.072036][T1]  ? acpi_sleep_proc_init+0x36/0x36
[9.072036][T1]  ? bus_uevent_filter+0x16/0x30
[9.072036][T1]  ? kobject_uevent_env+0x109/0x980
[9.072036][T1]  ? kernfs_get+0x13/0x20
[9.072036][T1]  ? kobject_uevent+0xb/0x10
[9.072036][T1]  ? kset_register+0x31/0x50
[9.072036][T1]  ? kset_create_and_add+0x9f/0xd0
[9.072036][T1]  ? acpi_sleep_proc_init+0x36/0x36
[ 

Re: list corruption in deferred_split_scan()

2019-07-15 Thread Qian Cai
On Fri, 2019-07-12 at 12:12 -0700, Yang Shi wrote:
> > Another possible lead is that without reverting those commits below,
> > kdump
> > kernel would always also crash in shrink_slab_memcg() at this line,
> > 
> > map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);
> 
> This looks a little bit weird. It seems nodeinfo[nid] is NULL? I didn't 
> think of where nodeinfo was freed but memcg was still online. Maybe a 
> check is needed:

Actually, "memcg" is NULL.

> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a0301ed..bacda49 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -602,6 +602,9 @@ static unsigned long shrink_slab_memcg(gfp_t 
> gfp_mask, int nid,
>  if (!mem_cgroup_online(memcg))
>  return 0;
> 
> +   if (!memcg->nodeinfo[nid])
> +   return 0;
> +
>  if (!down_read_trylock(&shrinker_rwsem))
>  return 0;
> 
> > 
> > [9.072036][T1] BUG: KASAN: null-ptr-deref in shrink_slab+0x111/0x440
> > [9.072036][T1] Read of size 8 at addr 0dc8 by task
> > swapper/0/1
> > [9.072036][T1]
> > [9.072036][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-next-
> > 20190711+ #10
> > [9.072036][T1] Hardware name: HPE ProLiant DL385 Gen10/ProLiant
> > DL385
> > Gen10, BIOS A40 01/25/2019
> > [9.072036][T1] Call Trace:
> > [9.072036][T1]  dump_stack+0x62/0x9a
> > [9.072036][T1]  __kasan_report.cold.4+0xb0/0xb4
> > [9.072036][T1]  ? unwind_get_return_address+0x40/0x50
> > [9.072036][T1]  ? shrink_slab+0x111/0x440
> > [9.072036][T1]  kasan_report+0xc/0xe
> > [9.072036][T1]  __asan_load8+0x71/0xa0
> > [9.072036][T1]  shrink_slab+0x111/0x440
> > [9.072036][T1]  ? mem_cgroup_iter+0x98/0x840
> > [9.072036][T1]  ? unregister_shrinker+0x110/0x110
> > [9.072036][T1]  ? kasan_check_read+0x11/0x20
> > [9.072036][T1]  ? mem_cgroup_protected+0x39/0x260
> > [9.072036][T1]  shrink_node+0x31e/0xa30
> > [9.072036][T1]  ? shrink_node_memcg+0x1560/0x1560
> > [9.072036][T1]  ? ktime_get+0x93/0x110
> > [9.072036][T1]  do_try_to_free_pages+0x22f/0x820
> > [9.072036][T1]  ? shrink_node+0xa30/0xa30
> > [9.072036][T1]  ? kasan_check_read+0x11/0x20
> > [9.072036][T1]  ? check_chain_key+0x1df/0x2e0
> > [9.072036][T1]  try_to_free_pages+0x242/0x4d0
> > [9.072036][T1]  ? do_try_to_free_pages+0x820/0x820
> > [9.072036][T1]  __alloc_pages_nodemask+0x9ce/0x1bc0
> > [9.072036][T1]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
> > [9.072036][T1]  ? unwind_dump+0x260/0x260
> > [9.072036][T1]  ? kernel_text_address+0x33/0xc0
> > [9.072036][T1]  ? arch_stack_walk+0x8f/0xf0
> > [9.072036][T1]  ? ret_from_fork+0x22/0x40
> > [9.072036][T1]  alloc_page_interleave+0x18/0x130
> > [9.072036][T1]  alloc_pages_current+0xf6/0x110
> > [9.072036][T1]  allocate_slab+0x600/0x11f0
> > [9.072036][T1]  new_slab+0x46/0x70
> > [9.072036][T1]  ___slab_alloc+0x5d4/0x9c0
> > [9.072036][T1]  ? create_object+0x3a/0x3e0
> > [9.072036][T1]  ? fs_reclaim_acquire.part.15+0x5/0x30
> > [9.072036][T1]  ? ___might_sleep+0xab/0xc0
> > [9.072036][T1]  ? create_object+0x3a/0x3e0
> > [9.072036][T1]  __slab_alloc+0x12/0x20
> > [9.072036][T1]  ? __slab_alloc+0x12/0x20
> > [9.072036][T1]  kmem_cache_alloc+0x32a/0x400
> > [9.072036][T1]  create_object+0x3a/0x3e0
> > [9.072036][T1]  kmemleak_alloc+0x71/0xa0
> > [9.072036][T1]  kmem_cache_alloc+0x272/0x400
> > [9.072036][T1]  ? kasan_check_read+0x11/0x20
> > [9.072036][T1]  ? do_raw_spin_unlock+0xa8/0x140
> > [9.072036][T1]  acpi_ps_alloc_op+0x76/0x122
> > [9.072036][T1]  acpi_ds_execute_arguments+0x2f/0x18d
> > [9.072036][T1]  acpi_ds_get_package_arguments+0x7d/0x84
> > [9.072036][T1]  acpi_ns_init_one_package+0x33/0x61
> > [9.072036][T1]  acpi_ns_init_one_object+0xfc/0x189
> > [9.072036][T1]  acpi_ns_walk_namespace+0x114/0x1f2
> > [9.072036][T1]  ? acpi_ns_init_one_package+0x61/0x61
> > [9.072036][T1]  ? acpi_ns_init_one_package+0x61/0x61
> > [9.072036][T1]  acpi_walk_namespace+0x9e/0xcb
> > [9.072036][T1]  ? acpi_sleep_proc_init+0x36/0x36
> > [9.072036][T1]  acpi_ns_initialize_objects+0x99/0xed
> > [9.072036][T1]  ? acpi_ns_find_ini_methods+0xa2/0xa2
> > [9.072036][T1]  ? acpi_tb_load_namespace+0x2dc/0x2eb
> > [9.072036][T1]  acpi_load_tables+0x61/0x80
> > [9.072036][T1]  acpi_init+0x10d/0x44b
> > [9.072036][T1]  ? acpi_sleep_proc_init+0x36/0x36
> > [9.072036][T1]  ? bus_uevent_filter+0x16/0x30
> > [9.072036][T1]  ? kobject_uevent_env+0x109/0x980
> > [9.072036][T1]  ? kernfs_get+0x13/0x20
> > [9.072036][T1]  ? 

Re: list corruption in deferred_split_scan()

2019-07-14 Thread Yang Shi




On 7/13/19 8:53 PM, Hillf Danton wrote:

On Wed, 10 Jul 2019 14:43:28 -0700 (PDT) Qian Cai wrote:

Running LTP oom01 test case with swap triggers a crash below. Reverting the series
"Make deferred split shrinker memcg aware" [1] seems to fix the issue.

aefde94195ca mm: thp: make deferred split shrinker memcg aware
cf402211cacc mm-shrinker-make-shrinker-not-depend-on-memcg-kmem-fix-2-fix
ca37e9e5f18d mm-shrinker-make-shrinker-not-depend-on-memcg-kmem-fix-2
5f419d89cab4 mm-shrinker-make-shrinker-not-depend-on-memcg-kmem-fix
c9d49e69e887 mm: shrinker: make shrinker not depend on memcg kmem
1c0af4b86bcf mm: move mem_cgroup_uncharge out of __page_cache_release()
4e050f2df876 mm: thp: extract split_queue_* into a struct

[1] 
https://lore.kernel.org/linux-mm/1561507361-59349-1-git-send-email-yang.shi@linux.alibaba.com/

[ 1145.730682][ T5764] list_del corruption, ea00251c8098->next is 
LIST_POISON1 (dead0100)
[ 1145.739763][ T5764] [ cut here ]
[ 1145.745126][ T5764] kernel BUG at lib/list_debug.c:47!
[ 1145.750320][ T5764] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 1145.757513][ T5764] CPU: 1 PID: 5764 Comm: oom01 Tainted: GW 
5.2.0-next-20190710+ #7
[ 1145.766709][ T5764] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 
Gen10, BIOS A40 01/25/2019
[ 1145.776000][ T5764] RIP: 0010:__list_del_entry_valid.cold.0+0x12/0x4a
[ 1145.782491][ T5764] Code: c7 40 5a 33 af e8 ac fe bc ff 0f 0b 48 c7 c7 80 9e
a1 af e8 f6 4c 01 00 4c 89 ea 48 89 de 48 c7 c7 20 59 33 af e8 8c fe bc ff <0f>
0b 48 c7 c7 40 9f a1 af e8 d6 4c 01 00 4c 89 e2 48 89 de 48 c7
[ 1145.802078][ T5764] RSP: 0018:888514d773c0 EFLAGS: 00010082
[ 1145.808042][ T5764] RAX: 004e RBX: ea00251c8098 RCX: 
ae95d318
[ 1145.815923][ T5764] RDX:  RSI: 0008 RDI: 
440bd380
[ 1145.823806][ T5764] RBP: 888514d773d8 R08: ed1108817a71 R09: 
ed1108817a70
[ 1145.831689][ T5764] R10: ed1108817a70 R11: 440bd387 R12: 
dead0122
[ 1145.839571][ T5764] R13: dead0100 R14: ea00251c8034 R15: 
dead0100
[ 1145.847455][ T5764] FS:  7f765ad4d700() GS:4408() 
knlGS:
[ 1145.856299][ T5764] CS:  0010 DS:  ES:  CR0: 80050033
[ 1145.862784][ T5764] CR2: 7f8cebec7000 CR3: 000459338000 CR4: 
001406a0
[ 1145.870664][ T5764] Call Trace:
[ 1145.873835][ T5764]  deferred_split_scan+0x337/0x740
[ 1145.878835][ T5764]  ? split_huge_page_to_list+0xe30/0xe30
[ 1145.884364][ T5764]  ? __radix_tree_lookup+0x12d/0x1e0
[ 1145.889539][ T5764]  ? node_tag_get.part.0.constprop.6+0x40/0x40
[ 1145.895592][ T5764]  do_shrink_slab+0x244/0x5a0
[ 1145.900159][ T5764]  shrink_slab+0x253/0x440
[ 1145.904462][ T5764]  ? unregister_shrinker+0x110/0x110
[ 1145.909641][ T5764]  ? kasan_check_read+0x11/0x20
[ 1145.914383][ T5764]  ? mem_cgroup_protected+0x20f/0x260
[ 1145.919645][ T5764]  shrink_node+0x31e/0xa30
[ 1145.923949][ T5764]  ? shrink_node_memcg+0x1560/0x1560
[ 1145.929126][ T5764]  ? ktime_get+0x93/0x110
[ 1145.933340][ T5764]  do_try_to_free_pages+0x22f/0x820
[ 1145.938429][ T5764]  ? shrink_node+0xa30/0xa30
[ 1145.942906][ T5764]  ? kasan_check_read+0x11/0x20
[ 1145.947647][ T5764]  ? check_chain_key+0x1df/0x2e0
[ 1145.952474][ T5764]  try_to_free_pages+0x242/0x4d0
[ 1145.957299][ T5764]  ? do_try_to_free_pages+0x820/0x820
[ 1145.962566][ T5764]  __alloc_pages_nodemask+0x9ce/0x1bc0
[ 1145.967917][ T5764]  ? kasan_check_read+0x11/0x20
[ 1145.972657][ T5764]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[ 1145.977920][ T5764]  ? kasan_check_read+0x11/0x20
[ 1145.982659][ T5764]  ? check_chain_key+0x1df/0x2e0
[ 1145.987487][ T5764]  ? do_anonymous_page+0x343/0xe30
[ 1145.992489][ T5764]  ? lock_downgrade+0x390/0x390
[ 1145.997230][ T5764]  ? __count_memcg_events+0x8b/0x1c0
[ 1146.002404][ T5764]  ? kasan_check_read+0x11/0x20
[ 1146.007145][ T5764]  ? __lru_cache_add+0x122/0x160
[ 1146.011974][ T5764]  alloc_pages_vma+0x89/0x2c0
[ 1146.016538][ T5764]  do_anonymous_page+0x3e1/0xe30
[ 1146.021367][ T5764]  ? __update_load_avg_cfs_rq+0x2c/0x490
[ 1146.026893][ T5764]  ? finish_fault+0x120/0x120
[ 1146.031461][ T5764]  ? call_function_interrupt+0xa/0x20
[ 1146.036724][ T5764]  handle_pte_fault+0x457/0x12c0
[ 1146.041552][ T5764]  __handle_mm_fault+0x79a/0xa50
[ 1146.046378][ T5764]  ? vmf_insert_mixed_mkwrite+0x20/0x20
[ 1146.051817][ T5764]  ? kasan_check_read+0x11/0x20
[ 1146.056557][ T5764]  ? __count_memcg_events+0x8b/0x1c0
[ 1146.061732][ T5764]  handle_mm_fault+0x17f/0x370
[ 1146.066386][ T5764]  __do_page_fault+0x25b/0x5d0
[ 1146.071037][ T5764]  do_page_fault+0x4c/0x2cf
[ 1146.075426][ T5764]  ? page_fault+0x5/0x20
[ 1146.079553][ T5764]  page_fault+0x1b/0x20
[ 1146.083594][ T5764] RIP: 0033:0x410be0
[ 1146.087373][ T5764] Code: 89 de e8 e3 23 ff ff 48 83 f8 ff 0f 84 86 00 00 00
48 89 c5 41 83 fc 02 74 28 41 83 fc 03 74 62 e8 95 29 ff ff 31 d2 48 98 90 
44 

Re: list corruption in deferred_split_scan()

2019-07-12 Thread Yang Shi




On 7/12/19 12:12 PM, Yang Shi wrote:



On 7/11/19 2:07 PM, Qian Cai wrote:

On Wed, 2019-07-10 at 17:16 -0700, Yang Shi wrote:

Hi Qian,


Thanks for reporting the issue. But, I can't reproduce it on my 
machine.

Could you please share more details about your test? How often did you
run into this problem?
I can reproduce it almost every time on an HPE ProLiant DL385 Gen10 
server. Here is some more information.

# cat .config

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config


I tried your kernel config, but I still can't reproduce it. My 
compiler doesn't have retpoline support, so CONFIG_RETPOLINE is 
disabled in my test, but I don't think this would make any difference 
for this case.


According to the bug call trace in the earlier email, it looks like 
deferred_split_scan() lost a race with put_compound_page(). 
put_compound_page() calls free_transhuge_page(), which deletes the 
page from the deferred split queue, but for some reason the page may 
still appear on the deferred list.


Would you please try the below patch?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7f709d..66bd9db 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2765,7 +2765,7 @@ int split_huge_page_to_list(struct page *page, 
struct list_head *list)

    if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
    if (!list_empty(page_deferred_list(head))) {
    ds_queue->split_queue_len--;
-   list_del(page_deferred_list(head));
+   list_del_init(page_deferred_list(head));


This line should not be changed. Please just apply the below part.


}
    if (mapping)
    __dec_node_page_state(page, NR_SHMEM_THPS);
@@ -2814,7 +2814,7 @@ void free_transhuge_page(struct page *page)
    spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
    if (!list_empty(page_deferred_list(page))) {
    ds_queue->split_queue_len--;
-   list_del(page_deferred_list(page));
+   list_del_init(page_deferred_list(page));
    }
    spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
    free_compound_page(page);
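
To spell out the list semantics the fix relies on, here is a minimal
user-space sketch. The helpers are re-implemented from the usual
include/linux/list.h behaviour purely for illustration (assumption:
they match the kernel semantics); this is not the kernel code itself.

/*
 * list_del() poisons the entry, so a later !list_empty() check on the
 * same entry still "passes" and a second list_del() trips the
 * "list_del corruption ... LIST_POISON1" BUG seen above.
 * list_del_init() leaves the entry self-linked, so the emptiness check
 * stays meaningful.
 */
#include <stdio.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000000122ULL)

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void __list_del_entry(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void list_del(struct list_head *e)
{
	__list_del_entry(e);
	e->next = LIST_POISON1;
	e->prev = LIST_POISON2;
}

static void list_del_init(struct list_head *e)
{
	__list_del_entry(e);
	INIT_LIST_HEAD(e);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

int main(void)
{
	struct list_head queue, page;	/* stands in for page_deferred_list(page) */

	INIT_LIST_HEAD(&queue);
	INIT_LIST_HEAD(&page);
	list_add_tail(&page, &queue);

	list_del(&page);
	/* Prints 0: the entry is off the list, yet it no longer looks empty. */
	printf("after list_del:      list_empty = %d\n", list_empty(&page));

	INIT_LIST_HEAD(&page);
	list_add_tail(&page, &queue);

	list_del_init(&page);
	/* Prints 1: safe to test (and re-queue or re-delete) later. */
	printf("after list_del_init: list_empty = %d\n", list_empty(&page));

	return 0;
}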



# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 19984 MB
node 0 free: 7251 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 31524 MB
node 4 free: 25165 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7
   0:  10  16  16  16  32  32  32  32
   1:  16  10  16  16  32  32  32  32
   2:  16  16  10  16  32  32  32  32
   3:  16  16  16  10  32  32  32  32
   4:  32  32  32  32  10  16  16  16
   5:  32  32  32  32  16  10  16  16
   6:  32  32  32  32  16  16  10  16
   7:  32  32  32  32  16  16  16  10

# lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):   2
NUMA node(s):8
Vendor ID:   AuthenticAMD
CPU family:  23
Model:   1
Model name:  AMD EPYC 7601 32-Core Processor
Stepping:2
CPU MHz: 2713.551
BogoMIPS:4391.39
Virtualization:  AMD-V
L1d cache:   32K
L1i cache:   64K
L2 cache:512K
L3 cache:8192K
NUMA node0 CPU(s):   0-7,64-71
NUMA node1 CPU(s):   8-15,72-79
NUMA node2 CPU(s):   16-23,80-87
NUMA node3 CPU(s):   24-31,88-95
NUMA node4 CPU(s):   32-39,96-103
NUMA node5 CPU(s):   40-47,104-111
NUMA node6 CPU(s):   48-55,112-119
NUMA node7 CPU(s):   56-63,120-127

Another possible lead is that, without reverting those commits, the kdump 
kernel would always also crash in shrink_slab_memcg() at this line:

map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);


This looks a little bit weird. It seems nodeinfo[nid] is NULL? I can't 
think of a case where nodeinfo would be freed while the memcg is still 
online. Maybe a check is needed:


diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..bacda49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -602,6 +602,9 @@ static unsigned long shrink_slab_memcg(gfp_t 
gfp_mask, int nid,

    if (!mem_cgroup_online(memcg))
    return 0;

+   if (!memcg->nodeinfo[nid])
+   return 0;
+
    if 

Re: list corruption in deferred_split_scan()

2019-07-12 Thread Yang Shi




On 7/11/19 2:07 PM, Qian Cai wrote:

On Wed, 2019-07-10 at 17:16 -0700, Yang Shi wrote:

Hi Qian,


Thanks for reporting the issue. But, I can't reproduce it on my machine.
Could you please share more details about your test? How often did you
run into this problem?

I can reproduce it almost every time on an HPE ProLiant DL385 Gen10 server.
Here is some more information.

# cat .config

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config


I tried your kernel config, but I still can't reproduce it. My compiler 
doesn't have retpoline support, so CONFIG_RETPOLINE is disabled in my 
test, but I don't think this would make any difference for this case.


According to the bug call trace in the earlier email, it looks like 
deferred_split_scan() lost a race with put_compound_page(). 
put_compound_page() calls free_transhuge_page(), which deletes the page 
from the deferred split queue, but for some reason the page may still 
appear on the deferred list.


Would you please try the below patch?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7f709d..66bd9db 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2765,7 +2765,7 @@ int split_huge_page_to_list(struct page *page, 
struct list_head *list)

    if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
    if (!list_empty(page_deferred_list(head))) {
    ds_queue->split_queue_len--;
-   list_del(page_deferred_list(head));
+   list_del_init(page_deferred_list(head));
    }
    if (mapping)
    __dec_node_page_state(page, NR_SHMEM_THPS);
@@ -2814,7 +2814,7 @@ void free_transhuge_page(struct page *page)
    spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
    if (!list_empty(page_deferred_list(page))) {
    ds_queue->split_queue_len--;
-   list_del(page_deferred_list(page));
+   list_del_init(page_deferred_list(page));
    }
    spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
    free_compound_page(page);



# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 19984 MB
node 0 free: 7251 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 31524 MB
node 4 free: 25165 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7
   0:  10  16  16  16  32  32  32  32
   1:  16  10  16  16  32  32  32  32
   2:  16  16  10  16  32  32  32  32
   3:  16  16  16  10  32  32  32  32
   4:  32  32  32  32  10  16  16  16
   5:  32  32  32  32  16  10  16  16
   6:  32  32  32  32  16  16  10  16
   7:  32  32  32  32  16  16  16  10

# lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):   2
NUMA node(s):8
Vendor ID:   AuthenticAMD
CPU family:  23
Model:   1
Model name:  AMD EPYC 7601 32-Core Processor
Stepping:2
CPU MHz: 2713.551
BogoMIPS:4391.39
Virtualization:  AMD-V
L1d cache:   32K
L1i cache:   64K
L2 cache:512K
L3 cache:8192K
NUMA node0 CPU(s):   0-7,64-71
NUMA node1 CPU(s):   8-15,72-79
NUMA node2 CPU(s):   16-23,80-87
NUMA node3 CPU(s):   24-31,88-95
NUMA node4 CPU(s):   32-39,96-103
NUMA node5 CPU(s):   40-47,104-111
NUMA node6 CPU(s):   48-55,112-119
NUMA node7 CPU(s):   56-63,120-127

Another possible lead is that, without reverting those commits, the kdump
kernel would always also crash in shrink_slab_memcg() at this line:

map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);


This looks a little bit weird. It seems nodeinfo[nid] is NULL? I can't 
think of a case where nodeinfo would be freed while the memcg is still 
online. Maybe a check is needed:


diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301ed..bacda49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -602,6 +602,9 @@ static unsigned long shrink_slab_memcg(gfp_t 
gfp_mask, int nid,

    if (!mem_cgroup_online(memcg))
    return 0;

+   if (!memcg->nodeinfo[nid])
+   return 0;
+
    if (!down_read_trylock(&shrinker_rwsem))
    return 0;
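
For what it's worth, a small user-space sketch of the failure mode and
of the guard above. The struct layout here is a hypothetical stand-in
(not the real mem_cgroup definition); only the shape of the dereference
chain is meant to match: with nodeinfo[nid] == NULL, the unguarded
memcg->nodeinfo[nid]->shrinker_map load reads a small offset off a NULL
pointer, which is consistent with the KASAN null-ptr-deref, and the
early return simply skips such nodes.

#include <stdio.h>

#define MAX_NUMNODES 8

struct memcg_shrinker_map { unsigned long map[1]; };

struct mem_cgroup_per_node {
	struct memcg_shrinker_map *shrinker_map;
};

struct mem_cgroup {
	int online;
	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};

static unsigned long shrink_slab_memcg_sketch(struct mem_cgroup *memcg, int nid)
{
	struct memcg_shrinker_map *map;

	if (!memcg->online)
		return 0;

	/* Proposed guard: bail out when this node has no per-memcg info. */
	if (!memcg->nodeinfo[nid])
		return 0;

	/* Without the guard, this is the faulting dereference. */
	map = memcg->nodeinfo[nid]->shrinker_map;
	if (!map)
		return 0;

	return map->map[0];	/* stand-in for walking the shrinker bitmap */
}

int main(void)
{
	/* An online memcg with no per-node info set up at all. */
	struct mem_cgroup memcg = { .online = 1 };

	printf("reclaimed: %lu\n", shrink_slab_memcg_sketch(&memcg, 1));
	return 0;
}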



[9.072036][T1] BUG: KASAN: 

Re: list corruption in deferred_split_scan()

2019-07-11 Thread Qian Cai
On Wed, 2019-07-10 at 17:16 -0700, Yang Shi wrote:
> Hi Qian,
> 
> 
> Thanks for reporting the issue. But, I can't reproduce it on my machine. 
> Could you please share more details about your test? How often did you 
> run into this problem?

I can reproduce it almost every time on an HPE ProLiant DL385 Gen10 server.
Here is some more information.

# cat .config

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 19984 MB
node 0 free: 7251 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 31524 MB
node 4 free: 25165 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  16  32  32  32  32 
  1:  16  10  16  16  32  32  32  32 
  2:  16  16  10  16  32  32  32  32 
  3:  16  16  16  10  32  32  32  32 
  4:  32  32  32  32  10  16  16  16 
  5:  32  32  32  32  16  10  16  16 
  6:  32  32  32  32  16  16  10  16 
  7:  32  32  32  32  16  16  16  10

# lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):   2
NUMA node(s):8
Vendor ID:   AuthenticAMD
CPU family:  23
Model:   1
Model name:  AMD EPYC 7601 32-Core Processor
Stepping:2
CPU MHz: 2713.551
BogoMIPS:4391.39
Virtualization:  AMD-V
L1d cache:   32K
L1i cache:   64K
L2 cache:512K
L3 cache:8192K
NUMA node0 CPU(s):   0-7,64-71
NUMA node1 CPU(s):   8-15,72-79
NUMA node2 CPU(s):   16-23,80-87
NUMA node3 CPU(s):   24-31,88-95
NUMA node4 CPU(s):   32-39,96-103
NUMA node5 CPU(s):   40-47,104-111
NUMA node6 CPU(s):   48-55,112-119
NUMA node7 CPU(s):   56-63,120-127

Another possible lead is that, without reverting those commits, the kdump
kernel would always also crash in shrink_slab_memcg() at this line:

map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);

[9.072036][T1] BUG: KASAN: null-ptr-deref in shrink_slab+0x111/0x440
[9.072036][T1] Read of size 8 at addr 0dc8 by task
swapper/0/1
[9.072036][T1] 
[9.072036][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-next-
20190711+ #10
[9.072036][T1] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
Gen10, BIOS A40 01/25/2019
[9.072036][T1] Call Trace:
[9.072036][T1]  dump_stack+0x62/0x9a
[9.072036][T1]  __kasan_report.cold.4+0xb0/0xb4
[9.072036][T1]  ? unwind_get_return_address+0x40/0x50
[9.072036][T1]  ? shrink_slab+0x111/0x440
[9.072036][T1]  kasan_report+0xc/0xe
[9.072036][T1]  __asan_load8+0x71/0xa0
[9.072036][T1]  shrink_slab+0x111/0x440
[9.072036][T1]  ? mem_cgroup_iter+0x98/0x840
[9.072036][T1]  ? unregister_shrinker+0x110/0x110
[9.072036][T1]  ? kasan_check_read+0x11/0x20
[9.072036][T1]  ? mem_cgroup_protected+0x39/0x260
[9.072036][T1]  shrink_node+0x31e/0xa30
[9.072036][T1]  ? shrink_node_memcg+0x1560/0x1560
[9.072036][T1]  ? ktime_get+0x93/0x110
[9.072036][T1]  do_try_to_free_pages+0x22f/0x820
[9.072036][T1]  ? shrink_node+0xa30/0xa30
[9.072036][T1]  ? kasan_check_read+0x11/0x20
[9.072036][T1]  ? check_chain_key+0x1df/0x2e0
[9.072036][T1]  try_to_free_pages+0x242/0x4d0
[9.072036][T1]  ? do_try_to_free_pages+0x820/0x820
[9.072036][T1]  __alloc_pages_nodemask+0x9ce/0x1bc0
[9.072036][T1]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[9.072036][T1]  ? unwind_dump+0x260/0x260
[9.072036][T1]  ? kernel_text_address+0x33/0xc0
[9.072036][T1]  ? arch_stack_walk+0x8f/0xf0
[9.072036][T1]  ? ret_from_fork+0x22/0x40
[9.072036][T1]  alloc_page_interleave+0x18/0x130
[9.072036][T1]  alloc_pages_current+0xf6/0x110
[9.072036][T1]  allocate_slab+0x600/0x11f0
[9.072036][T1]  new_slab+0x46/0x70
[9.072036][T1]  ___slab_alloc+0x5d4/0x9c0
[9.072036][T1]  ? create_object+0x3a/0x3e0
[9.072036][T1]  ? fs_reclaim_acquire.part.15+0x5/0x30
[9.072036][T1]  ? ___might_sleep+0xab/0xc0
[9.072036][T1]  ? create_object+0x3a/0x3e0

Re: list corruption in deferred_split_scan()

2019-07-10 Thread Yang Shi

Hi Qian,


Thanks for reporting the issue. But, I can't reproduce it on my machine. 
Could you please share more details about your test? How often did you 
run into this problem?



Regards,

Yang



On 7/10/19 2:43 PM, Qian Cai wrote:

Running LTP oom01 test case with swap triggers a crash below. Reverting the series
"Make deferred split shrinker memcg aware" [1] seems to fix the issue.

aefde94195ca mm: thp: make deferred split shrinker memcg aware
cf402211cacc mm-shrinker-make-shrinker-not-depend-on-memcg-kmem-fix-2-fix
ca37e9e5f18d mm-shrinker-make-shrinker-not-depend-on-memcg-kmem-fix-2
5f419d89cab4 mm-shrinker-make-shrinker-not-depend-on-memcg-kmem-fix
c9d49e69e887 mm: shrinker: make shrinker not depend on memcg kmem
1c0af4b86bcf mm: move mem_cgroup_uncharge out of __page_cache_release()
4e050f2df876 mm: thp: extract split_queue_* into a struct

[1] https://lore.kernel.org/linux-mm/1561507361-59349-1-git-send-email-yang.shi@linux.alibaba.com/

[ 1145.730682][ T5764] list_del corruption, ea00251c8098->next is
LIST_POISON1 (dead0100)
[ 1145.739763][ T5764] [ cut here ]
[ 1145.745126][ T5764] kernel BUG at lib/list_debug.c:47!
[ 1145.750320][ T5764] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 1145.757513][ T5764] CPU: 1 PID: 5764 Comm: oom01 Tainted:
GW 5.2.0-next-20190710+ #7
[ 1145.766709][ T5764] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
Gen10, BIOS A40 01/25/2019
[ 1145.776000][ T5764] RIP: 0010:__list_del_entry_valid.cold.0+0x12/0x4a
[ 1145.782491][ T5764] Code: c7 40 5a 33 af e8 ac fe bc ff 0f 0b 48 c7 c7 80 9e
a1 af e8 f6 4c 01 00 4c 89 ea 48 89 de 48 c7 c7 20 59 33 af e8 8c fe bc ff <0f>
0b 48 c7 c7 40 9f a1 af e8 d6 4c 01 00 4c 89 e2 48 89 de 48 c7
[ 1145.802078][ T5764] RSP: 0018:888514d773c0 EFLAGS: 00010082
[ 1145.808042][ T5764] RAX: 004e RBX: ea00251c8098 RCX:
ae95d318
[ 1145.815923][ T5764] RDX:  RSI: 0008 RDI:
440bd380
[ 1145.823806][ T5764] RBP: 888514d773d8 R08: ed1108817a71 R09:
ed1108817a70
[ 1145.831689][ T5764] R10: ed1108817a70 R11: 440bd387 R12:
dead0122
[ 1145.839571][ T5764] R13: dead0100 R14: ea00251c8034 R15:
dead0100
[ 1145.847455][ T5764] FS:  7f765ad4d700() GS:4408()
knlGS:
[ 1145.856299][ T5764] CS:  0010 DS:  ES:  CR0: 80050033
[ 1145.862784][ T5764] CR2: 7f8cebec7000 CR3: 000459338000 CR4:
001406a0
[ 1145.870664][ T5764] Call Trace:
[ 1145.873835][ T5764]  deferred_split_scan+0x337/0x740
[ 1145.878835][ T5764]  ? split_huge_page_to_list+0xe30/0xe30
[ 1145.884364][ T5764]  ? __radix_tree_lookup+0x12d/0x1e0
[ 1145.889539][ T5764]  ? node_tag_get.part.0.constprop.6+0x40/0x40
[ 1145.895592][ T5764]  do_shrink_slab+0x244/0x5a0
[ 1145.900159][ T5764]  shrink_slab+0x253/0x440
[ 1145.904462][ T5764]  ? unregister_shrinker+0x110/0x110
[ 1145.909641][ T5764]  ? kasan_check_read+0x11/0x20
[ 1145.914383][ T5764]  ? mem_cgroup_protected+0x20f/0x260
[ 1145.919645][ T5764]  shrink_node+0x31e/0xa30
[ 1145.923949][ T5764]  ? shrink_node_memcg+0x1560/0x1560
[ 1145.929126][ T5764]  ? ktime_get+0x93/0x110
[ 1145.933340][ T5764]  do_try_to_free_pages+0x22f/0x820
[ 1145.938429][ T5764]  ? shrink_node+0xa30/0xa30
[ 1145.942906][ T5764]  ? kasan_check_read+0x11/0x20
[ 1145.947647][ T5764]  ? check_chain_key+0x1df/0x2e0
[ 1145.952474][ T5764]  try_to_free_pages+0x242/0x4d0
[ 1145.957299][ T5764]  ? do_try_to_free_pages+0x820/0x820
[ 1145.962566][ T5764]  __alloc_pages_nodemask+0x9ce/0x1bc0
[ 1145.967917][ T5764]  ? kasan_check_read+0x11/0x20
[ 1145.972657][ T5764]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[ 1145.977920][ T5764]  ? kasan_check_read+0x11/0x20
[ 1145.982659][ T5764]  ? check_chain_key+0x1df/0x2e0
[ 1145.987487][ T5764]  ? do_anonymous_page+0x343/0xe30
[ 1145.992489][ T5764]  ? lock_downgrade+0x390/0x390
[ 1145.997230][ T5764]  ? __count_memcg_events+0x8b/0x1c0
[ 1146.002404][ T5764]  ? kasan_check_read+0x11/0x20
[ 1146.007145][ T5764]  ? __lru_cache_add+0x122/0x160
[ 1146.011974][ T5764]  alloc_pages_vma+0x89/0x2c0
[ 1146.016538][ T5764]  do_anonymous_page+0x3e1/0xe30
[ 1146.021367][ T5764]  ? __update_load_avg_cfs_rq+0x2c/0x490
[ 1146.026893][ T5764]  ? finish_fault+0x120/0x120
[ 1146.031461][ T5764]  ? call_function_interrupt+0xa/0x20
[ 1146.036724][ T5764]  handle_pte_fault+0x457/0x12c0
[ 1146.041552][ T5764]  __handle_mm_fault+0x79a/0xa50
[ 1146.046378][ T5764]  ? vmf_insert_mixed_mkwrite+0x20/0x20
[ 1146.051817][ T5764]  ? kasan_check_read+0x11/0x20
[ 1146.056557][ T5764]  ? __count_memcg_events+0x8b/0x1c0
[ 1146.061732][ T5764]  handle_mm_fault+0x17f/0x370
[ 1146.066386][ T5764]  __do_page_fault+0x25b/0x5d0
[ 1146.071037][ T5764]  do_page_fault+0x4c/0x2cf
[ 1146.075426][ T5764]  ? page_fault+0x5/0x20
[ 1146.079553][ T5764]  page_fault+0x1b/0x20
[ 1146.083594][ T5764] RIP: 0033:0x410be0
[ 1146.087373][ T5764] Code: 89