Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
> On 21.10.2020 at 18:58, Mike Kravetz wrote:
>> On 21.10.20 15:11, David Hildenbrand wrote:
>>> On 21.10.20 14:57, Michal Privoznik wrote:
>>>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>>>>> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>>>>
>>>>> It would be good if Mina (at least) would look these over. Would also
>>>>> be interesting to know if these fixes address the bug seen with the qemu
>>>>> use case.
>>>>>
>>>>> I'm still doing more testing and code inspection to look for other issues.
>>>>> ...
>>>>> ...
>>>>
>>>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>>>> [ 6472.719047] [ cut here ]
>>>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57
>>>> ...
>>>> ...
>>>
>>> Agreed, same over here. :(
>>
>> I *think* the following on top makes it fly
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 67fc6383995b..5cf7f6a6c1a6 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -656,6 +656,9 @@ static long region_del(struct resv_map *resv, long f, long t)
>>
>>  			del += t - f;
>>
>> +			hugetlb_cgroup_uncharge_file_region(
>> +					resv, rg, t - f);
>> +
>>  			/* New entry for end of split region */
>>  			nrg->from = t;
>>  			nrg->to = rg->to;
>> @@ -667,9 +670,6 @@ static long region_del(struct resv_map *resv, long f, long t)
>>  			/* Original entry is trimmed */
>>  			rg->to = f;
>>
>> -			hugetlb_cgroup_uncharge_file_region(
>> -					resv, rg, nrg->to - nrg->from);
>> -
>>  			list_add(&nrg->link, &rg->link);
>>  			nrg = NULL;
>>  			break;
>
> Thanks, yes that certainly does look like a bug in that code.
>
> Does that resolve the issue with qemu?

I was not able to reproduce, so I guess we found all issues! Thanks!

> I want to do a little more testing/research before sending a patch later
> today.
> --
> Mike Kravetz
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
> On 21.10.20 15:11, David Hildenbrand wrote:
>> On 21.10.20 14:57, Michal Privoznik wrote:
>>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>>>> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>>>
>>>> It would be good if Mina (at least) would look these over. Would also
>>>> be interesting to know if these fixes address the bug seen with the qemu
>>>> use case.
>>>>
>>>> I'm still doing more testing and code inspection to look for other issues.
>>>> ...
>>>> ...
>>>
>>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>>> [ 6472.719047] [ cut here ]
>>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57
>>> ...
>>> ...
>>
>> Agreed, same over here. :(
>>
>
> I *think* the following on top makes it fly
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67fc6383995b..5cf7f6a6c1a6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -656,6 +656,9 @@ static long region_del(struct resv_map *resv, long f, long t)
>
>  			del += t - f;
>
> +			hugetlb_cgroup_uncharge_file_region(
> +					resv, rg, t - f);
> +
>  			/* New entry for end of split region */
>  			nrg->from = t;
>  			nrg->to = rg->to;
> @@ -667,9 +670,6 @@ static long region_del(struct resv_map *resv, long f, long t)
>  			/* Original entry is trimmed */
>  			rg->to = f;
>
> -			hugetlb_cgroup_uncharge_file_region(
> -					resv, rg, nrg->to - nrg->from);
> -
>  			list_add(&nrg->link, &rg->link);
>  			nrg = NULL;
>  			break;

Thanks, yes that certainly does look like a bug in that code.

Does that resolve the issue with qemu?

I want to do a little more testing/research before sending a patch later
today.
--
Mike Kravetz
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 21.10.20 15:34, David Hildenbrand wrote: > On 21.10.20 15:11, David Hildenbrand wrote: >> On 21.10.20 14:57, Michal Privoznik wrote: >>> On 10/21/20 5:35 AM, Mike Kravetz wrote: On 10/20/20 6:38 AM, David Hildenbrand wrote: > > I'm bisecting the warning right now. Looks like it was introduced in v5.7. I found the following bugs in the cgroup reservation accounting. The ones in region_del are pretty obvious as the number of pages to uncharge would always be zero. The one on alloc_huge_page needs racing code to expose. With these fixes, my testing is showing consistent/correct results for hugetlb reservation cgroup accounting. It would be good if Mina (at least) would look these over. Would also be interesting to know if these fixes address the bug seen with the qemu use case. I'm still doing more testing and code inspection to look for other issues. From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Tue, 20 Oct 2020 20:21:42 -0700 Subject: [PATCH] hugetlb_cgroup: fix reservation accounting Signed-off-by: Mike Kravetz --- mm/hugetlb.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 67fc6383995b..c92366313780 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t) } if (f <= rg->from) {/* Trim beginning of region */ - del += t - rg->from; - rg->from = t; - hugetlb_cgroup_uncharge_file_region(resv, rg, t - rg->from); - } else {/* Trim end of region */ - del += rg->to - f; - rg->to = f; + del += t - rg->from; + rg->from = t; + } else {/* Trim end of region */ hugetlb_cgroup_uncharge_file_region(resv, rg, rg->to - f); + + del += rg->to - f; + rg->to = f; } } @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, rsv_adjust = hugepage_subpool_put_pages(spool, 1); hugetlb_acct_memory(h, -rsv_adjust); + if (deferred_reserve) + hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h), 
+ pages_per_huge_page(h), page); } return page; >>> >>> I've applied, rebuilt and tested, but unfortunately I still hit the problem: >>> [ 6472.719047] [ cut here ] >>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 >>> page_counter_uncharge+0x33/0x40 >>> [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco >>> btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm >>> [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted >>> 5.9.1-gentoo-x86_64 #1 >>> [ 6472.719057] Hardware name: System manufacturer System Product >>> Name/PRIME X570-PRO, BIOS 1005 08/01/2019 >>> [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40 >>> [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 >>> 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc >>> c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41 >>> [ 6472.719061] RSP: 0018:c9b77b40 EFLAGS: 00010286 >>> [ 6472.719061] RAX: fffe9200 RBX: 888fb3b97b40 RCX: >>> fffe9200 >>> [ 6472.719062] RDX: 0221 RSI: fffe9200 RDI: >>> 888fd8451dd0 >>> [ 6472.719062] RBP: 888fb6990420 R08: 00044200 R09: >>> fffbbe00 >>> [ 6472.719062] R10: 888fb3b97b40 R11: 000a R12: >>> 0001 >>> [ 6472.719063] R13: 05df R14: 05de R15: >>> 888fb3b97b40 >>> [ 6472.719063] FS: 7fbd175fe700() GS:888fde98() >>> knlGS: >>> [ 6472.719064] CS: 0010 DS: ES: CR0: 80050033 >>> [ 6472.719064] CR2: 7fbd825101f0 CR3: 000fb5e41000 CR4: >>> 00350ee0 >>> [ 6472.719065] Call Trace: >>> [ 6472.719067] hugetlb_cgroup_uncharge_file_region+0x46/0x70 >>> [ 6472.719069] region_del+0x1ae/0x270 >>> [ 6472.719070] hugetlb_unreserve_pages+0x32/0xa0 >>> [ 6472.719072] remove_inode_hugepages+0x19d/0x3a0 >>> [ 6472.719079] ? writeback_registers+0x45/0x60 [kvm] >>> [ 6472.719080] hugetlbfs_fallocate+0x3f2/0x4a0 >>> [ 6472.719081] ? __mod_lruvec_state+0x1d/0x40 >>> [ 6472.719081] ?
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 21.10.20 15:11, David Hildenbrand wrote: > On 21.10.20 14:57, Michal Privoznik wrote: >> On 10/21/20 5:35 AM, Mike Kravetz wrote: >>> On 10/20/20 6:38 AM, David Hildenbrand wrote: I'm bisecting the warning right now. Looks like it was introduced in v5.7. >>> >>> I found the following bugs in the cgroup reservation accounting. The ones >>> in region_del are pretty obvious as the number of pages to uncharge would >>> always be zero. The one on alloc_huge_page needs racing code to expose. >>> >>> With these fixes, my testing is showing consistent/correct results for >>> hugetlb reservation cgroup accounting. >>> >>> It would be good if Mina (at least) would look these over. Would also >>> be interesting to know if these fixes address the bug seen with the qemu >>> use case. >>> >>> I'm still doing more testing and code inspection to look for other issues. >>> >>> From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001 >>> From: Mike Kravetz >>> Date: Tue, 20 Oct 2020 20:21:42 -0700 >>> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting >>> >>> Signed-off-by: Mike Kravetz >>> --- >>> mm/hugetlb.c | 15 +-- >>> 1 file changed, 9 insertions(+), 6 deletions(-) >>> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>> index 67fc6383995b..c92366313780 100644 >>> --- a/mm/hugetlb.c >>> +++ b/mm/hugetlb.c >>> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, >>> long t) >>> } >>> >>> if (f <= rg->from) {/* Trim beginning of region */ >>> - del += t - rg->from; >>> - rg->from = t; >>> - >>> hugetlb_cgroup_uncharge_file_region(resv, rg, >>> t - rg->from); >>> - } else {/* Trim end of region */ >>> - del += rg->to - f; >>> - rg->to = f; >>> >>> + del += t - rg->from; >>> + rg->from = t; >>> + } else {/* Trim end of region */ >>> hugetlb_cgroup_uncharge_file_region(resv, rg, >>> rg->to - f); >>> + >>> + del += rg->to - f; >>> + rg->to = f; >>> } >>> } >>> >>> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct 
>>> *vma, >>> >>> rsv_adjust = hugepage_subpool_put_pages(spool, 1); >>> hugetlb_acct_memory(h, -rsv_adjust); >>> + if (deferred_reserve) >>> + hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h), >>> + pages_per_huge_page(h), page); >>> } >>> return page; >>> >>> >> >> I've applied, rebuilt and tested, but unfortunately I still hit the problem: >> [ 6472.719047] [ cut here ] >> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 >> page_counter_uncharge+0x33/0x40 >> [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco >> btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm >> [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted >> 5.9.1-gentoo-x86_64 #1 >> [ 6472.719057] Hardware name: System manufacturer System Product >> Name/PRIME X570-PRO, BIOS 1005 08/01/2019 >> [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40 >> [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 >> 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc >> c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41 >> [ 6472.719061] RSP: 0018:c9b77b40 EFLAGS: 00010286 >> [ 6472.719061] RAX: fffe9200 RBX: 888fb3b97b40 RCX: >> fffe9200 >> [ 6472.719062] RDX: 0221 RSI: fffe9200 RDI: >> 888fd8451dd0 >> [ 6472.719062] RBP: 888fb6990420 R08: 00044200 R09: >> fffbbe00 >> [ 6472.719062] R10: 888fb3b97b40 R11: 000a R12: >> 0001 >> [ 6472.719063] R13: 05df R14: 05de R15: >> 888fb3b97b40 >> [ 6472.719063] FS: 7fbd175fe700() GS:888fde98() >> knlGS: >> [ 6472.719064] CS: 0010 DS: ES: CR0: 80050033 >> [ 6472.719064] CR2: 7fbd825101f0 CR3: 000fb5e41000 CR4: >> 00350ee0 >> [ 6472.719065] Call Trace: >> [ 6472.719067] hugetlb_cgroup_uncharge_file_region+0x46/0x70 >> [ 6472.719069] region_del+0x1ae/0x270 >> [ 6472.719070] hugetlb_unreserve_pages+0x32/0xa0 >> [ 6472.719072] remove_inode_hugepages+0x19d/0x3a0 >> [ 6472.719079] ? 
writeback_registers+0x45/0x60 [kvm] >> [ 6472.719080] hugetlbfs_fallocate+0x3f2/0x4a0 >> [ 6472.719081] ? __mod_lruvec_state+0x1d/0x40 >> [ 6472.719081] ? __mod_memcg_lruvec_state+0x1b/0xe0 >> [ 6472.719083] ? __seccomp_filter+0x75/0x6a0 >> [ 6472.719084] vfs_fallocate+0x122/0x260 >> [ 6472.719085]
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 21.10.20 14:57, Michal Privoznik wrote: > On 10/21/20 5:35 AM, Mike Kravetz wrote: >> On 10/20/20 6:38 AM, David Hildenbrand wrote: >>> >>> I'm bisecting the warning right now. Looks like it was introduced in v5.7. >> >> I found the following bugs in the cgroup reservation accounting. The ones >> in region_del are pretty obvious as the number of pages to uncharge would >> always be zero. The one on alloc_huge_page needs racing code to expose. >> >> With these fixes, my testing is showing consistent/correct results for >> hugetlb reservation cgroup accounting. >> >> It would be good if Mina (at least) would look these over. Would also >> be interesting to know if these fixes address the bug seen with the qemu >> use case. >> >> I'm still doing more testing and code inspection to look for other issues. >> >> From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001 >> From: Mike Kravetz >> Date: Tue, 20 Oct 2020 20:21:42 -0700 >> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting >> >> Signed-off-by: Mike Kravetz >> --- >> mm/hugetlb.c | 15 +-- >> 1 file changed, 9 insertions(+), 6 deletions(-) >> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index 67fc6383995b..c92366313780 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, >> long t) >> } >> >> if (f <= rg->from) {/* Trim beginning of region */ >> -del += t - rg->from; >> -rg->from = t; >> - >> hugetlb_cgroup_uncharge_file_region(resv, rg, >> t - rg->from); >> -} else {/* Trim end of region */ >> -del += rg->to - f; >> -rg->to = f; >> >> +del += t - rg->from; >> +rg->from = t; >> +} else {/* Trim end of region */ >> hugetlb_cgroup_uncharge_file_region(resv, rg, >> rg->to - f); >> + >> +del += rg->to - f; >> +rg->to = f; >> } >> } >> >> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct >> *vma, >> >> rsv_adjust = hugepage_subpool_put_pages(spool, 1); >> hugetlb_acct_memory(h, -rsv_adjust); 
>> +if (deferred_reserve) >> +hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h), >> +pages_per_huge_page(h), page); >> } >> return page; >> >> > > I've applied, rebuilt and tested, but unfortunately I still hit the problem: > [ 6472.719047] [ cut here ] > [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 > page_counter_uncharge+0x33/0x40 > [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco > btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm > [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted > 5.9.1-gentoo-x86_64 #1 > [ 6472.719057] Hardware name: System manufacturer System Product > Name/PRIME X570-PRO, BIOS 1005 08/01/2019 > [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40 > [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 > 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc > c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41 > [ 6472.719061] RSP: 0018:c9b77b40 EFLAGS: 00010286 > [ 6472.719061] RAX: fffe9200 RBX: 888fb3b97b40 RCX: > fffe9200 > [ 6472.719062] RDX: 0221 RSI: fffe9200 RDI: > 888fd8451dd0 > [ 6472.719062] RBP: 888fb6990420 R08: 00044200 R09: > fffbbe00 > [ 6472.719062] R10: 888fb3b97b40 R11: 000a R12: > 0001 > [ 6472.719063] R13: 05df R14: 05de R15: > 888fb3b97b40 > [ 6472.719063] FS: 7fbd175fe700() GS:888fde98() > knlGS: > [ 6472.719064] CS: 0010 DS: ES: CR0: 80050033 > [ 6472.719064] CR2: 7fbd825101f0 CR3: 000fb5e41000 CR4: > 00350ee0 > [ 6472.719065] Call Trace: > [ 6472.719067] hugetlb_cgroup_uncharge_file_region+0x46/0x70 > [ 6472.719069] region_del+0x1ae/0x270 > [ 6472.719070] hugetlb_unreserve_pages+0x32/0xa0 > [ 6472.719072] remove_inode_hugepages+0x19d/0x3a0 > [ 6472.719079] ? writeback_registers+0x45/0x60 [kvm] > [ 6472.719080] hugetlbfs_fallocate+0x3f2/0x4a0 > [ 6472.719081] ? __mod_lruvec_state+0x1d/0x40 > [ 6472.719081] ? __mod_memcg_lruvec_state+0x1b/0xe0 > [ 6472.719083] ? 
__seccomp_filter+0x75/0x6a0 > [ 6472.719084] vfs_fallocate+0x122/0x260 > [ 6472.719085] __x64_sys_fallocate+0x39/0x60 > [ 6472.719086] do_syscall_64+0x2d/0x40 > [ 6472.719088] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 10/21/20 5:35 AM, Mike Kravetz wrote: On 10/20/20 6:38 AM, David Hildenbrand wrote: I'm bisecting the warning right now. Looks like it was introduced in v5.7. I found the following bugs in the cgroup reservation accounting. The ones in region_del are pretty obvious as the number of pages to uncharge would always be zero. The one on alloc_huge_page needs racing code to expose. With these fixes, my testing is showing consistent/correct results for hugetlb reservation cgroup accounting. It would be good if Mina (at least) would look these over. Would also be interesting to know if these fixes address the bug seen with the qemu use case. I'm still doing more testing and code inspection to look for other issues. From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Tue, 20 Oct 2020 20:21:42 -0700 Subject: [PATCH] hugetlb_cgroup: fix reservation accounting Signed-off-by: Mike Kravetz --- mm/hugetlb.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 67fc6383995b..c92366313780 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t) } if (f <= rg->from) { /* Trim beginning of region */ - del += t - rg->from; - rg->from = t; - hugetlb_cgroup_uncharge_file_region(resv, rg, t - rg->from); - } else {/* Trim end of region */ - del += rg->to - f; - rg->to = f; + del += t - rg->from; + rg->from = t; + } else {/* Trim end of region */ hugetlb_cgroup_uncharge_file_region(resv, rg, rg->to - f); + + del += rg->to - f; + rg->to = f; } } @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, rsv_adjust = hugepage_subpool_put_pages(spool, 1); hugetlb_acct_memory(h, -rsv_adjust); + if (deferred_reserve) + hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h), + pages_per_huge_page(h), page); } return page; I've applied, rebuilt and tested, but unfortunately I still hit the problem: [ 6472.719047] [ 
cut here ] [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 page_counter_uncharge+0x33/0x40 [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 5.9.1-gentoo-x86_64 #1 [ 6472.719057] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 1005 08/01/2019 [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40 [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41 [ 6472.719061] RSP: 0018:c9b77b40 EFLAGS: 00010286 [ 6472.719061] RAX: fffe9200 RBX: 888fb3b97b40 RCX: fffe9200 [ 6472.719062] RDX: 0221 RSI: fffe9200 RDI: 888fd8451dd0 [ 6472.719062] RBP: 888fb6990420 R08: 00044200 R09: fffbbe00 [ 6472.719062] R10: 888fb3b97b40 R11: 000a R12: 0001 [ 6472.719063] R13: 05df R14: 05de R15: 888fb3b97b40 [ 6472.719063] FS: 7fbd175fe700() GS:888fde98() knlGS: [ 6472.719064] CS: 0010 DS: ES: CR0: 80050033 [ 6472.719064] CR2: 7fbd825101f0 CR3: 000fb5e41000 CR4: 00350ee0 [ 6472.719065] Call Trace: [ 6472.719067] hugetlb_cgroup_uncharge_file_region+0x46/0x70 [ 6472.719069] region_del+0x1ae/0x270 [ 6472.719070] hugetlb_unreserve_pages+0x32/0xa0 [ 6472.719072] remove_inode_hugepages+0x19d/0x3a0 [ 6472.719079] ? writeback_registers+0x45/0x60 [kvm] [ 6472.719080] hugetlbfs_fallocate+0x3f2/0x4a0 [ 6472.719081] ? __mod_lruvec_state+0x1d/0x40 [ 6472.719081] ? __mod_memcg_lruvec_state+0x1b/0xe0 [ 6472.719083] ? 
__seccomp_filter+0x75/0x6a0 [ 6472.719084] vfs_fallocate+0x122/0x260 [ 6472.719085] __x64_sys_fallocate+0x39/0x60 [ 6472.719086] do_syscall_64+0x2d/0x40 [ 6472.719088] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 6472.719089] RIP: 0033:0x7fbe3cefcde7 [ 6472.719089] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44 [ 6472.719090] RSP: 002b:7fbd175fc7a0
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 21.10.20 05:35, Mike Kravetz wrote:
> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>
>> I'm bisecting the warning right now. Looks like it was introduced in v5.7.
>

So bisecting nailed it down to one of

353b2de42e84 mm/hugetlb.c: clean code by removing unnecessary initialization
a9b3f867404b hugetlb: support file_region coalescing again
08cf9faf7558 hugetlb_cgroup: support noreserve mappings
075a61d07a8e hugetlb_cgroup: add accounting for shared mappings
0db9d74ed884 hugetlb: disable region_add file_region coalescing
e9fe92ae0cd2 hugetlb_cgroup: add reservation accounting for private mappings
9808895e1a44 mm/hugetlb_cgroup: fix hugetlb_cgroup migration
1adc4d419aa2 hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
cdc2fcfea79b hugetlb_cgroup: add hugetlb_cgroup reservation counter

So it seems to be broken right from the beginning of
charge/uncharge/reservations. Not a surprise when looking at your fixes :)

> I found the following bugs in the cgroup reservation accounting. The ones
> in region_del are pretty obvious as the number of pages to uncharge would
> always be zero. The one in alloc_huge_page needs racing code to expose.
>
> With these fixes, my testing is showing consistent/correct results for
> hugetlb reservation cgroup accounting.
>
> It would be good if Mina (at least) would look these over. Would also
> be interesting to know if these fixes address the bug seen with the qemu
> use case.

I strongly suspect it will. Compiling, will reply in half an hour or so
with the result.

>
> I'm still doing more testing and code inspection to look for other issues.

When sending, can you make sure to credit Michal P.? Thanks!

Reported-by: Michal Privoznik

> From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz
> Date: Tue, 20 Oct 2020 20:21:42 -0700
> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
>
> Signed-off-by: Mike Kravetz
> ---
>  mm/hugetlb.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67fc6383995b..c92366313780 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
>  		}
>
>  		if (f <= rg->from) {	/* Trim beginning of region */
> -			del += t - rg->from;
> -			rg->from = t;
> -
>  			hugetlb_cgroup_uncharge_file_region(resv, rg,
>  							    t - rg->from);
> -		} else {		/* Trim end of region */
> -			del += rg->to - f;
> -			rg->to = f;
>
> +			del += t - rg->from;
> +			rg->from = t;
> +		} else {		/* Trim end of region */
>  			hugetlb_cgroup_uncharge_file_region(resv, rg,
>  							    rg->to - f);
> +
> +			del += rg->to - f;
> +			rg->to = f;
>  		}
>  	}

Those two look very correct to me. You could keep computing "del" before
the uncharge, similar to the /* Remove entire region */ case.

> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>
>  		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>  		hugetlb_acct_memory(h, -rsv_adjust);
> +		if (deferred_reserve)
> +			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
> +

That looks correct to me as well.

> 	}
> 	return page;

Thanks for debugging!

--
Thanks,

David / dhildenb
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 10/20/20 6:38 AM, David Hildenbrand wrote:
>
> I'm bisecting the warning right now. Looks like it was introduced in v5.7.

I found the following bugs in the cgroup reservation accounting. The ones
in region_del are pretty obvious as the number of pages to uncharge would
always be zero. The one in alloc_huge_page needs racing code to expose.

With these fixes, my testing is showing consistent/correct results for
hugetlb reservation cgroup accounting.

It would be good if Mina (at least) would look these over. Would also
be interesting to know if these fixes address the bug seen with the qemu
use case.

I'm still doing more testing and code inspection to look for other issues.

From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
From: Mike Kravetz
Date: Tue, 20 Oct 2020 20:21:42 -0700
Subject: [PATCH] hugetlb_cgroup: fix reservation accounting

Signed-off-by: Mike Kravetz
---
 mm/hugetlb.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67fc6383995b..c92366313780 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
 		}
 
 		if (f <= rg->from) {	/* Trim beginning of region */
-			del += t - rg->from;
-			rg->from = t;
-
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
 							    t - rg->from);
-		} else {		/* Trim end of region */
-			del += rg->to - f;
-			rg->to = f;
 
+			del += t - rg->from;
+			rg->from = t;
+		} else {		/* Trim end of region */
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
 							    rg->to - f);
+
+			del += rg->to - f;
+			rg->to = f;
 		}
 	}
 
@@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 		hugetlb_acct_memory(h, -rsv_adjust);
+		if (deferred_reserve)
+			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
+					pages_per_huge_page(h), page);
 	}
 	return page;
-- 
2.25.4
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 16.10.20 01:14, Mike Kravetz wrote: > On 10/14/20 11:31 AM, Mike Kravetz wrote: >> On 10/14/20 11:18 AM, David Hildenbrand wrote: >> >> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching >> with (and without) hugetlb controller enabled and did not see this issue. >> > > I took a closer look after running just the fallocate_stress test > in libhugetlbfs. Here are the cgroup counter values: > > hugetlb.2MB.failcnt 0 > hugetlb.2MB.limit_in_bytes 9223372036854771712 > hugetlb.2MB.max_usage_in_bytes 209715200 > hugetlb.2MB.rsvd.failcnt 0 > hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712 > hugetlb.2MB.rsvd.max_usage_in_bytes 601882624 > hugetlb.2MB.rsvd.usage_in_bytes 392167424 > hugetlb.2MB.usage_in_bytes 0 > > We did not hit the WARN_ON_ONCE(), but the 'rsvd.usage_in_bytes' value > is not correct in that it should be zero. No huge page reservations > remain after the test. > > HugePages_Total:1024 > HugePages_Free: 1024 > HugePages_Rsvd:0 > HugePages_Surp:0 > Hugepagesize: 2048 kB > Hugetlb: 2097152 kB > > To try and better understand the reservation cgroup controller, I addded > a few printks to the code. While running fallocate_stress with the > printks, I can consistently hit the WARN_ON_ONCE() due to the counter > going negative. Here are the cgroup counter values after such a run: > > hugetlb.2MB.failcnt 0 > hugetlb.2MB.limit_in_bytes 9223372036854771712 > hugetlb.2MB.max_usage_in_bytes 209715200 > hugetlb.2MB.rsvd.failcnt 3 > hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712 > hugetlb.2MB.rsvd.max_usage_in_bytes 251658240 > hugetlb.2MB.rsvd.usage_in_bytes 18446744073487253504 > hugetlb.2MB.usage_in_bytes 0 > > Again, no reserved pages after the test. > > HugePages_Total:1024 > HugePages_Free: 1024 > HugePages_Rsvd:0 > HugePages_Surp:0 > Hugepagesize: 2048 kB > Hugetlb: 2097152 kB > > I have some basic hugetlb hole punch functionality tests. Running > these on the kernel with added printk's does not cause any issues. 
> In order to reproduce, I need to run fallocate_stress test which > will cause hole punch to race with page fault. Best guess at this > time is that some of the error/race detection reservation back out > code is not properly dealing with cgroup accounting. > > I'll take a look at this as well. > I'm bisecting the warning right now. Looks like it was introduced in v5.7. -- Thanks, David / dhildenb
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 10/14/20 11:31 AM, Mike Kravetz wrote:
> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>
> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
> with (and without) hugetlb controller enabled and did not see this issue.
>

I took a closer look after running just the fallocate_stress test
in libhugetlbfs. Here are the cgroup counter values:

hugetlb.2MB.failcnt 0
hugetlb.2MB.limit_in_bytes 9223372036854771712
hugetlb.2MB.max_usage_in_bytes 209715200
hugetlb.2MB.rsvd.failcnt 0
hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
hugetlb.2MB.rsvd.max_usage_in_bytes 601882624
hugetlb.2MB.rsvd.usage_in_bytes 392167424
hugetlb.2MB.usage_in_bytes 0

We did not hit the WARN_ON_ONCE(), but the 'rsvd.usage_in_bytes' value
is not correct in that it should be zero. No huge page reservations
remain after the test.

HugePages_Total:    1024
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         2097152 kB

To try and better understand the reservation cgroup controller, I added
a few printks to the code. While running fallocate_stress with the
printks, I can consistently hit the WARN_ON_ONCE() due to the counter
going negative. Here are the cgroup counter values after such a run:

hugetlb.2MB.failcnt 0
hugetlb.2MB.limit_in_bytes 9223372036854771712
hugetlb.2MB.max_usage_in_bytes 209715200
hugetlb.2MB.rsvd.failcnt 3
hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
hugetlb.2MB.rsvd.max_usage_in_bytes 251658240
hugetlb.2MB.rsvd.usage_in_bytes 18446744073487253504
hugetlb.2MB.usage_in_bytes 0

Again, no reserved pages after the test.

HugePages_Total:    1024
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         2097152 kB

I have some basic hugetlb hole punch functionality tests. Running
these on the kernel with added printk's does not cause any issues.
In order to reproduce, I need to run fallocate_stress test which
will cause hole punch to race with page fault. Best guess at this
time is that some of the error/race detection reservation back out
code is not properly dealing with cgroup accounting.

I'll take a look at this as well.
--
Mike Kravetz
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 15.10.20 10:57, David Hildenbrand wrote:
> On 15.10.20 09:56, David Hildenbrand wrote:
>> On 14.10.20 20:31, Mike Kravetz wrote:
>>> [...]
>>>
>>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>>> with (and without) hugetlb controller enabled and did not see this issue.
>>>
>>> May need to reproduce via QEMU as below.
>>
>> Not sure if relevant, but QEMU should be using
>> memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.
>>
>> Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of
>> memory of the memfd (e.g., > 90%).
>
> I just tried to reproduce by doing random accesses + random
> fallocate(FALLOC_FL_PUNCH_HOLE) within a file - without success.
>
> So it could be that
> 1. KVM is involved in messing this up
> 2. Multi-threading is involved

Able to reproduce with TCG under QEMU, so not a KVM issue.

--
Thanks,

David / dhildenb
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 15.10.20 09:56, David Hildenbrand wrote:
> On 14.10.20 20:31, Mike Kravetz wrote:
>> [...]
>>
>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>> with (and without) hugetlb controller enabled and did not see this issue.
>>
>> May need to reproduce via QEMU as below.
>
> Not sure if relevant, but QEMU should be using
> memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.
>
> Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of
> memory of the memfd (e.g., > 90%).

I just tried to reproduce by doing random accesses + random
fallocate(FALLOC_FL_PUNCH_HOLE) within a file - without success.

So it could be that
1. KVM is involved in messing this up
2. Multi-threading is involved

However, I am also able to reproduce with only a single VCPU (there is
still the QEMU main thread, but it limits the chance for races). Even KVM
spits fire after a while, which could be a side effect of allocations
failing:

error: kvm run failed Bad address
RAX= RBX=8c12c9c217c0 RCX=8c12fb1b8fc0 RDX=0007
RSI=8c12c9c217c0 RDI=8c12c9c217c8 RBP=000d RSP=b3964040fa68
R8 =0008 R9 =8c12c9c2 R10=8c12fffd5000 R11=000303c0
R12=8c12c9c217c0 R13=0008 R14=0001 R15=f31d44270800
RIP=af33ba0f RFL=0246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES = CS =0010 00a09b00 DPL=0 CS64 [-RA] SS =0018 00c09300 DPL=0 DS [-WA]
DS = FS = 7f8fabc87040 GS = 8c12fbc0
LDT= fe00
TR =0040 fe003000 4087 8b00 DPL=0 TSS64-busy
GDT= fe001000 007f
IDT= fe00 0fff
CR0=80050033 CR2=560e10895398 CR3=0001073b2000 CR4=00350ef0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=0f 0b eb e2 90 0f 1f 44 00 00 53 48 89 fb 31 c0 48 8d 7f 08 <48> c7 47 f8 00 00 00 00 48 89 d9 48 c7 c2 44 d3 52

--
Thanks,

David / dhildenb
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 14.10.20 20:31, Mike Kravetz wrote:
> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>> [...]
>>
>> I assume you at least have to make sure that there is
>> a page populated (MAP_POPULATE, or read/write it). But I am not
>> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
>> sufficient, or if it will require a sequence of
>> populate+discard(punch) (or multi-threading).
>
> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
> with (and without) hugetlb controller enabled and did not see this issue.
>
> May need to reproduce via QEMU as below.

Not sure if relevant, but QEMU should be using
memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.

Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of
memory of the memfd (e.g., > 90%).

--
Thanks,

David / dhildenb
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 10/14/20 11:18 AM, David Hildenbrand wrote:
> On 14.10.20 19:56, Mina Almasry wrote:
>> [...]
>>
>> I may be able to dig in and take a look. How do I reproduce this
>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>> hugetlb region?
>
> Hi Mina,
>
> thanks for having a look. I started poking around myself but,
> being new to cgroup code, I even failed to understand why that code gets
> triggered though the hugetlb controller isn't even enabled.
>
> I assume you at least have to make sure that there is
> a page populated (MAP_POPULATE, or read/write it). But I am not
> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
> sufficient, or if it will require a sequence of
> populate+discard(punch) (or multi-threading).

FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
with (and without) hugetlb controller enabled and did not see this issue.

May need to reproduce via QEMU as below.

--
Mike Kravetz

> What definitely makes it trigger is via QEMU
>
> qemu-system-x86_64 \
> -machine pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
> -cpu host,migratable=on \
> -m 4096 \
> -object memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,size=4294967296 \
> -overcommit mem-lock=off \
> -smp 4,sockets=1,dies=1,cores=2,threads=2 \
> -nodefaults \
> -nographic \
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
> -blockdev '{"driver":"file","filename":"../Fedora-Cloud-Base-32-1.6.x86_64.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
> -blockdev '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
> -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 \
> -chardev stdio,nosignal,id=serial \
> -device isa-serial,chardev=serial \
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on
>
> However, you need a recent QEMU (>= v5.1 IIRC) and a recent kernel
> (>= v5.7) inside your guest image.
>
> Fedora rawhide qcow2 should do:
> https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 14.10.20 19:56, Mina Almasry wrote:
> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand wrote:
>> [...]
>>
>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>> is broken with cgroups. I did *not* try without cgroups yet.
>>
>> Any ideas?
>
> Hi David,
>
> I may be able to dig in and take a look. How do I reproduce this
> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
> hugetlb region?

Hi Mina,

thanks for having a look. I started poking around myself but,
being new to cgroup code, I even failed to understand why that code gets
triggered though the hugetlb controller isn't even enabled.

I assume you at least have to make sure that there is
a page populated (MAP_POPULATE, or read/write it). But I am not
sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
sufficient, or if it will require a sequence of
populate+discard(punch) (or multi-threading).

What definitely makes it trigger is via QEMU:

qemu-system-x86_64 \
 -machine pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
 -cpu host,migratable=on \
 -m 4096 \
 -object memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,size=4294967296 \
 -overcommit mem-lock=off \
 -smp 4,sockets=1,dies=1,cores=2,threads=2 \
 -nodefaults \
 -nographic \
 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
 -blockdev '{"driver":"file","filename":"../Fedora-Cloud-Base-32-1.6.x86_64.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
 -blockdev '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 \
 -chardev stdio,nosignal,id=serial \
 -device isa-serial,chardev=serial \
 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on

However, you need a recent QEMU (>= v5.1 IIRC) and a recent kernel
(>= v5.7) inside your guest image.

Fedora rawhide qcow2 should do:
https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2

--
Thanks,

David / dhildenb
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand wrote:
>
> On 14.10.20 17:22, David Hildenbrand wrote:
> > [...]
> >
> > Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
> > is broken with cgroups. I did *not* try without cgroups yet.
> >
> > Any ideas?

Hi David,

I may be able to dig in and take a look. How do I reproduce this
though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
hugetlb region?

> Just tried without the hugetlb controller, seems to work just fine.
>
> I'd like to note that
> - The controller was not activated
> - I had to compile the hugetlb controller out to make it work.
>
> --
> Thanks,
>
> David / dhildenb
Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
On 14.10.20 17:22, David Hildenbrand wrote:
> Hi everybody,
>
> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
> with hugetlbfs and reported that this results in [1]
>
> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57
> page_counter_uncharge+0x4b/0x5
>
> 2. Any hugetlbfs allocations failing. (I assume because some accounting is
> wrong)
>
> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
> to discard pages that are reported as free by a VM. Reporting
> happens at pageblock granularity. So when the guest reports
> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>
> I was also able to reproduce (also with virtio-mem, which similarly
> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
> (and on v5.7.X from F32).
>
> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
> is broken with cgroups. I did *not* try without cgroups yet.
>
> Any ideas?

Just tried without the hugetlb controller, seems to work just fine.

I'd like to note that
- The controller was not activated
- I had to compile the hugetlb controller out to make it work.

--
Thanks,

David / dhildenb
cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
Hi everybody,

Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
with hugetlbfs and reported that this results in [1]

1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57
   page_counter_uncharge+0x4b/0x5

2. Any hugetlbfs allocations failing. (I assume because some accounting
   is wrong)

QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
to discard pages that are reported as free by a VM. Reporting
happens at pageblock granularity. So when the guest reports
2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.

I was also able to reproduce (also with virtio-mem, which similarly
uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
(and on v5.7.X from F32).

Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
is broken with cgroups. I did *not* try without cgroups yet.

Any ideas?

Here is report #1:

[ 315.251417] [ cut here ]
[ 315.251424] WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
[ 315.251425] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp rfcomm tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep hwmon_vid sunrpc squashfs vfat fat loop snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec edac_mce_amd snd_hda_core btusb btrtl btbcm snd_hwdep snd_seq btintel kvm_amd snd_seq_device bluetooth kvm snd_pcm ecdh_generic sp5100_tco irqbypass rfkill snd_timer rapl ecc pcspkr wmi_bmof joydev i2c_piix4 k10temp snd
[ 315.251454] soundcore acpi_cpufreq ip_tables xfs libcrc32c dm_crypt igb hid_logitech_hidpp video dca amdgpu iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel mxm_wmi drm ghash_clmulni_intel ccp nvme nvme_core wmi pinctrl_amd hid_logitech_dj fuse
[ 315.251466] CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
[ 315.251467] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
[ 315.251469] RIP: 0010:page_counter_uncharge+0x4b/0x50
[ 315.251471] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 0f 1f 44 00 00 48 8b 17 48 39 d6 72 41 41 54 49 89
[ 315.251472] RSP: 0018:b60f01ed3b20 EFLAGS: 00010286
[ 315.251473] RAX: fffd0600 RBX: fffd0600 RCX: 8de8272e3200
[ 315.251473] RDX: 028e RSI: fffd0600 RDI: 8de838452e40
[ 315.251474] RBP: 8de838452e40 R08: 8de838452e40 R09: 8de837f86c80
[ 315.251475] R10: b60f01ed3b58 R11: 0001 R12: 00051c00
[ 315.251475] R13: fffae400 R14: 8de8272e3240 R15: 0571
[ 315.251476] FS: 7f9c2edfd700() GS:8de83ebc() knlGS:
[ 315.251477] CS: 0010 DS: ES: CR0: 80050033
[ 315.251478] CR2: 7f2a76787e78 CR3: 000fcbb1c000 CR4: 00350ee0
[ 315.251479] Call Trace:
[ 315.251485] hugetlb_cgroup_uncharge_file_region+0x4b/0x80
[ 315.251487] region_del+0x1d3/0x300
[ 315.251489] hugetlb_unreserve_pages+0x39/0xb0
[ 315.251492] remove_inode_hugepages+0x1a8/0x3d0
[ 315.251495] ? tlb_finish_mmu+0x7a/0x1d0
[ 315.251497] hugetlbfs_fallocate+0x3c4/0x5c0
[ 315.251519] ? kvm_arch_vcpu_ioctl_run+0x614/0x1700 [kvm]
[ 315.251522] ? file_has_perm+0xa2/0xb0
[ 315.251524] ? inode_security+0xc/0x60
[ 315.251525] ? selinux_file_permission+0x4e/0x120
[ 315.251527] vfs_fallocate+0x146/0x290
[ 315.251529] __x64_sys_fallocate+0x3e/0x70
[ 315.251531] do_syscall_64+0x33/0x40
[ 315.251533] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 315.251535] RIP: 0033:0x7f9d3fb5641f
[ 315.251536] Code: 89 7c 24 08 48 89 4c 24 18 e8 5d fc f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 8b 74 24 0c 8b 7c 24 08 b8 1d 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 89 44 24 08 e8 8d fc f8 ff 8b 44
[ 315.251537] RSP: 002b:7f9c2edfc470 EFLAGS: 0293 ORIG_RAX: 011d
[ 315.251538] RAX: ffda RBX: 1000 RCX: 7f9d3fb5641f
[ 315.251539] RDX: ae20 RSI: 0003 RDI: 000c
[ 315.251539] RBP: 557389d6736c R08: R09: 000c
[ 315.251540] R10: 0020 R11: 0293 R12: 0020
[ 315.251540] R13: R14: ae20 R15: 7f9cde00
[ 315.251542] ---[ end