Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-21 Thread David Hildenbrand


> On 21.10.2020 at 18:58, Mike Kravetz wrote:
> 
>> 
>>> On 21.10.20 15:11, David Hildenbrand wrote:
>>> On 21.10.20 14:57, Michal Privoznik wrote:
 On 10/21/20 5:35 AM, Mike Kravetz wrote:
> On 10/20/20 6:38 AM, David Hildenbrand wrote:
> It would be good if Mina (at least) would look these over.  Would also
> be interesting to know if these fixes address the bug seen with the qemu
> use case.
> I'm still doing more testing and code inspection to look for other issues.
> ...
> ...
 I've applied, rebuilt and tested, but unfortunately I still hit the 
 problem:
 [ 6472.719047] [ cut here ]
 [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57
> ...
> ...
>>> Agreed, same over here. :(
>> 
>> I *think* the following on top makes it fly
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 67fc6383995b..5cf7f6a6c1a6 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -656,6 +656,9 @@ static long region_del(struct resv_map *resv, long
>> f, long t)
>> 
>>del += t - f;
>> 
>> +   hugetlb_cgroup_uncharge_file_region(
>> +   resv, rg, t - f);
>> +
>>/* New entry for end of split region */
>>nrg->from = t;
>>nrg->to = rg->to;
>> @@ -667,9 +670,6 @@ static long region_del(struct resv_map *resv, long
>> f, long t)
>>/* Original entry is trimmed */
>>rg->to = f;
>> 
>> -   hugetlb_cgroup_uncharge_file_region(
>> -   resv, rg, nrg->to - nrg->from);
>> -
>>    list_add(&nrg->link, &rg->link);
>>nrg = NULL;
>>break;
> 
> Thanks, yes that certainly does look like a bug in that code.
> 
> Does that resolve the issue with qemu?

I was not able to reproduce, so I guess we found all issues!

Thanks!

> 
> I want to do a little more testing/research before sending a patch later
> today.
> -- 
> Mike Kravetz
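
For illustration, a small standalone C toy (made-up page counts; the variable
names mirror region_del(), but none of this is kernel code) of what the on-top
diff above changes in the "must split region" case: before the change, the
amount uncharged is the tail that stays reserved rather than the hole that was
actually punched.

#include <stdio.h>

int main(void)
{
	long rg_from = 0, rg_to = 512;	/* existing file_region, in pages */
	long f = 100, t = 200;		/* hole punched; rg_from < f, t < rg_to */

	/* Splitting leaves a new tail entry nrg covering [t, rg_to). */
	long nrg_from = t, nrg_to = rg_to;

	printf("head keeps [%ld, %ld), tail keeps [%ld, %ld)\n",
	       rg_from, f, nrg_from, nrg_to);
	/* The old code uncharged nrg_to - nrg_from, i.e. the pages that
	 * remain reserved, instead of the pages that were punched. */
	printf("uncharged before the fix: %ld pages\n", nrg_to - nrg_from);
	printf("uncharged after the fix:  %ld pages\n", t - f);
	return 0;
}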



Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-21 Thread Mike Kravetz
> On 21.10.20 15:11, David Hildenbrand wrote:
>> On 21.10.20 14:57, Michal Privoznik wrote:
>>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
 On 10/20/20 6:38 AM, David Hildenbrand wrote:

 It would be good if Mina (at least) would look these over.  Would also
 be interesting to know if these fixes address the bug seen with the qemu
 use case.

 I'm still doing more testing and code inspection to look for other issues.

...
...
>>>
>>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>>> [ 6472.719047] [ cut here ]
>>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
...
...
>>
>> Agreed, same over here. :(
>>
> 
> I *think* the following on top makes it fly
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67fc6383995b..5cf7f6a6c1a6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -656,6 +656,9 @@ static long region_del(struct resv_map *resv, long
> f, long t)
> 
> del += t - f;
> 
> +   hugetlb_cgroup_uncharge_file_region(
> +   resv, rg, t - f);
> +
> /* New entry for end of split region */
> nrg->from = t;
> nrg->to = rg->to;
> @@ -667,9 +670,6 @@ static long region_del(struct resv_map *resv, long
> f, long t)
> /* Original entry is trimmed */
> rg->to = f;
> 
> -   hugetlb_cgroup_uncharge_file_region(
> -   resv, rg, nrg->to - nrg->from);
> -
> list_add(&nrg->link, &rg->link);
> nrg = NULL;
> break;
> 
> 

Thanks, yes that certainly does look like a bug in that code.

Does that resolve the issue with qemu?

I want to do a little more testing/research before sending a patch later
today.
-- 
Mike Kravetz


Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-21 Thread David Hildenbrand
On 21.10.20 15:34, David Hildenbrand wrote:
> On 21.10.20 15:11, David Hildenbrand wrote:
>> On 21.10.20 14:57, Michal Privoznik wrote:
>>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
 On 10/20/20 6:38 AM, David Hildenbrand wrote:
>
> I'm bisecting the warning right now. Looks like it was introduced in v5.7.

 I found the following bugs in the cgroup reservation accounting.  The ones
 in region_del are pretty obvious as the number of pages to uncharge would
 always be zero.  The one on alloc_huge_page needs racing code to expose.

 With these fixes, my testing is showing consistent/correct results for
 hugetlb reservation cgroup accounting.

 It would be good if Mina (at least) would look these over.  Would also
 be interesting to know if these fixes address the bug seen with the qemu
 use case.

 I'm still doing more testing and code inspection to look for other issues.

  From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
 From: Mike Kravetz 
 Date: Tue, 20 Oct 2020 20:21:42 -0700
 Subject: [PATCH] hugetlb_cgroup: fix reservation accounting

 Signed-off-by: Mike Kravetz 
 ---
   mm/hugetlb.c | 15 +--
   1 file changed, 9 insertions(+), 6 deletions(-)

 diff --git a/mm/hugetlb.c b/mm/hugetlb.c
 index 67fc6383995b..c92366313780 100644
 --- a/mm/hugetlb.c
 +++ b/mm/hugetlb.c
 @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long 
 f, long t)
}
   
if (f <= rg->from) {/* Trim beginning of region */
 -  del += t - rg->from;
 -  rg->from = t;
 -
hugetlb_cgroup_uncharge_file_region(resv, rg,
t - rg->from);
 -  } else {/* Trim end of region */
 -  del += rg->to - f;
 -  rg->to = f;
   
 +  del += t - rg->from;
 +  rg->from = t;
 +  } else {/* Trim end of region */
hugetlb_cgroup_uncharge_file_region(resv, rg,
rg->to - f);
 +
 +  del += rg->to - f;
 +  rg->to = f;
}
}
   
 @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct 
 *vma,
   
rsv_adjust = hugepage_subpool_put_pages(spool, 1);
hugetlb_acct_memory(h, -rsv_adjust);
 +  if (deferred_reserve)
 +  hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
 +  pages_per_huge_page(h), page);
}
return page;
   

>>>
>>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>>> [ 6472.719047] [ cut here ]
>>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
>>> page_counter_uncharge+0x33/0x40
>>> [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco 
>>> btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm
>>> [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 
>>> 5.9.1-gentoo-x86_64 #1
>>> [ 6472.719057] Hardware name: System manufacturer System Product 
>>> Name/PRIME X570-PRO, BIOS 1005 08/01/2019
>>> [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40
>>> [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 
>>> 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc 
>>> c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41
>>> [ 6472.719061] RSP: 0018:c9b77b40 EFLAGS: 00010286
>>> [ 6472.719061] RAX: fffe9200 RBX: 888fb3b97b40 RCX: 
>>> fffe9200
>>> [ 6472.719062] RDX: 0221 RSI: fffe9200 RDI: 
>>> 888fd8451dd0
>>> [ 6472.719062] RBP: 888fb6990420 R08: 00044200 R09: 
>>> fffbbe00
>>> [ 6472.719062] R10: 888fb3b97b40 R11: 000a R12: 
>>> 0001
>>> [ 6472.719063] R13: 05df R14: 05de R15: 
>>> 888fb3b97b40
>>> [ 6472.719063] FS:  7fbd175fe700() GS:888fde98() 
>>> knlGS:
>>> [ 6472.719064] CS:  0010 DS:  ES:  CR0: 80050033
>>> [ 6472.719064] CR2: 7fbd825101f0 CR3: 000fb5e41000 CR4: 
>>> 00350ee0
>>> [ 6472.719065] Call Trace:
>>> [ 6472.719067]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
>>> [ 6472.719069]  region_del+0x1ae/0x270
>>> [ 6472.719070]  hugetlb_unreserve_pages+0x32/0xa0
>>> [ 6472.719072]  remove_inode_hugepages+0x19d/0x3a0
>>> [ 6472.719079]  ? writeback_registers+0x45/0x60 [kvm]
>>> [ 6472.719080]  hugetlbfs_fallocate+0x3f2/0x4a0
>>> [ 6472.719081]  ? __mod_lruvec_state+0x1d/0x40
>>> [ 6472.719081]  ? 

Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-21 Thread David Hildenbrand
On 21.10.20 15:11, David Hildenbrand wrote:
> On 21.10.20 14:57, Michal Privoznik wrote:
>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>>> On 10/20/20 6:38 AM, David Hildenbrand wrote:

 I'm bisecting the warning right now. Looks like it was introduced in v5.7.
>>>
>>> I found the following bugs in the cgroup reservation accounting.  The ones
>>> in region_del are pretty obvious as the number of pages to uncharge would
>>> always be zero.  The one on alloc_huge_page needs racing code to expose.
>>>
>>> With these fixes, my testing is showing consistent/correct results for
>>> hugetlb reservation cgroup accounting.
>>>
>>> It would be good if Mina (at least) would look these over.  Would also
>>> be interesting to know if these fixes address the bug seen with the qemu
>>> use case.
>>>
>>> I'm still doing more testing and code inspection to look for other issues.
>>>
>>>  From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz 
>>> Date: Tue, 20 Oct 2020 20:21:42 -0700
>>> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
>>>
>>> Signed-off-by: Mike Kravetz 
>>> ---
>>>   mm/hugetlb.c | 15 +--
>>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 67fc6383995b..c92366313780 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, 
>>> long t)
>>> }
>>>   
>>> if (f <= rg->from) {/* Trim beginning of region */
>>> -   del += t - rg->from;
>>> -   rg->from = t;
>>> -
>>> hugetlb_cgroup_uncharge_file_region(resv, rg,
>>> t - rg->from);
>>> -   } else {/* Trim end of region */
>>> -   del += rg->to - f;
>>> -   rg->to = f;
>>>   
>>> +   del += t - rg->from;
>>> +   rg->from = t;
>>> +   } else {/* Trim end of region */
>>> hugetlb_cgroup_uncharge_file_region(resv, rg,
>>> rg->to - f);
>>> +
>>> +   del += rg->to - f;
>>> +   rg->to = f;
>>> }
>>> }
>>>   
>>> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct 
>>> *vma,
>>>   
>>> rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>>> hugetlb_acct_memory(h, -rsv_adjust);
>>> +   if (deferred_reserve)
>>> +   hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
>>> +   pages_per_huge_page(h), page);
>>> }
>>> return page;
>>>   
>>>
>>
>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>> [ 6472.719047] [ cut here ]
>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
>> page_counter_uncharge+0x33/0x40
>> [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco 
>> btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm
>> [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 
>> 5.9.1-gentoo-x86_64 #1
>> [ 6472.719057] Hardware name: System manufacturer System Product 
>> Name/PRIME X570-PRO, BIOS 1005 08/01/2019
>> [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40
>> [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 
>> 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc 
>> c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41
>> [ 6472.719061] RSP: 0018:c9b77b40 EFLAGS: 00010286
>> [ 6472.719061] RAX: fffe9200 RBX: 888fb3b97b40 RCX: 
>> fffe9200
>> [ 6472.719062] RDX: 0221 RSI: fffe9200 RDI: 
>> 888fd8451dd0
>> [ 6472.719062] RBP: 888fb6990420 R08: 00044200 R09: 
>> fffbbe00
>> [ 6472.719062] R10: 888fb3b97b40 R11: 000a R12: 
>> 0001
>> [ 6472.719063] R13: 05df R14: 05de R15: 
>> 888fb3b97b40
>> [ 6472.719063] FS:  7fbd175fe700() GS:888fde98() 
>> knlGS:
>> [ 6472.719064] CS:  0010 DS:  ES:  CR0: 80050033
>> [ 6472.719064] CR2: 7fbd825101f0 CR3: 000fb5e41000 CR4: 
>> 00350ee0
>> [ 6472.719065] Call Trace:
>> [ 6472.719067]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
>> [ 6472.719069]  region_del+0x1ae/0x270
>> [ 6472.719070]  hugetlb_unreserve_pages+0x32/0xa0
>> [ 6472.719072]  remove_inode_hugepages+0x19d/0x3a0
>> [ 6472.719079]  ? writeback_registers+0x45/0x60 [kvm]
>> [ 6472.719080]  hugetlbfs_fallocate+0x3f2/0x4a0
>> [ 6472.719081]  ? __mod_lruvec_state+0x1d/0x40
>> [ 6472.719081]  ? __mod_memcg_lruvec_state+0x1b/0xe0
>> [ 6472.719083]  ? __seccomp_filter+0x75/0x6a0
>> [ 6472.719084]  vfs_fallocate+0x122/0x260
>> [ 6472.719085]  

Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-21 Thread David Hildenbrand
On 21.10.20 14:57, Michal Privoznik wrote:
> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>>
>>> I'm bisecting the warning right now. Looks like it was introduced in v5.7.
>>
>> I found the following bugs in the cgroup reservation accounting.  The ones
>> in region_del are pretty obvious as the number of pages to uncharge would
>> always be zero.  The one on alloc_huge_page needs racing code to expose.
>>
>> With these fixes, my testing is showing consistent/correct results for
>> hugetlb reservation cgroup accounting.
>>
>> It would be good if Mina (at least) would look these over.  Would also
>> be interesting to know if these fixes address the bug seen with the qemu
>> use case.
>>
>> I'm still doing more testing and code inspection to look for other issues.
>>
>>  From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz 
>> Date: Tue, 20 Oct 2020 20:21:42 -0700
>> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
>>
>> Signed-off-by: Mike Kravetz 
>> ---
>>   mm/hugetlb.c | 15 +--
>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 67fc6383995b..c92366313780 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, 
>> long t)
>>  }
>>   
>>  if (f <= rg->from) {/* Trim beginning of region */
>> -del += t - rg->from;
>> -rg->from = t;
>> -
>>  hugetlb_cgroup_uncharge_file_region(resv, rg,
>>  t - rg->from);
>> -} else {/* Trim end of region */
>> -del += rg->to - f;
>> -rg->to = f;
>>   
>> +del += t - rg->from;
>> +rg->from = t;
>> +} else {/* Trim end of region */
>>  hugetlb_cgroup_uncharge_file_region(resv, rg,
>>  rg->to - f);
>> +
>> +del += rg->to - f;
>> +rg->to = f;
>>  }
>>  }
>>   
>> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct 
>> *vma,
>>   
>>  rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>>  hugetlb_acct_memory(h, -rsv_adjust);
>> +if (deferred_reserve)
>> +hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
>> +pages_per_huge_page(h), page);
>>  }
>>  return page;
>>   
>>
> 
> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
> [ 6472.719047] [ cut here ]
> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
> page_counter_uncharge+0x33/0x40
> [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco 
> btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm
> [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 
> 5.9.1-gentoo-x86_64 #1
> [ 6472.719057] Hardware name: System manufacturer System Product 
> Name/PRIME X570-PRO, BIOS 1005 08/01/2019
> [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40
> [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 
> 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc 
> c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41
> [ 6472.719061] RSP: 0018:c9b77b40 EFLAGS: 00010286
> [ 6472.719061] RAX: fffe9200 RBX: 888fb3b97b40 RCX: 
> fffe9200
> [ 6472.719062] RDX: 0221 RSI: fffe9200 RDI: 
> 888fd8451dd0
> [ 6472.719062] RBP: 888fb6990420 R08: 00044200 R09: 
> fffbbe00
> [ 6472.719062] R10: 888fb3b97b40 R11: 000a R12: 
> 0001
> [ 6472.719063] R13: 05df R14: 05de R15: 
> 888fb3b97b40
> [ 6472.719063] FS:  7fbd175fe700() GS:888fde98() 
> knlGS:
> [ 6472.719064] CS:  0010 DS:  ES:  CR0: 80050033
> [ 6472.719064] CR2: 7fbd825101f0 CR3: 000fb5e41000 CR4: 
> 00350ee0
> [ 6472.719065] Call Trace:
> [ 6472.719067]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
> [ 6472.719069]  region_del+0x1ae/0x270
> [ 6472.719070]  hugetlb_unreserve_pages+0x32/0xa0
> [ 6472.719072]  remove_inode_hugepages+0x19d/0x3a0
> [ 6472.719079]  ? writeback_registers+0x45/0x60 [kvm]
> [ 6472.719080]  hugetlbfs_fallocate+0x3f2/0x4a0
> [ 6472.719081]  ? __mod_lruvec_state+0x1d/0x40
> [ 6472.719081]  ? __mod_memcg_lruvec_state+0x1b/0xe0
> [ 6472.719083]  ? __seccomp_filter+0x75/0x6a0
> [ 6472.719084]  vfs_fallocate+0x122/0x260
> [ 6472.719085]  __x64_sys_fallocate+0x39/0x60
> [ 6472.719086]  do_syscall_64+0x2d/0x40
> [ 6472.719088]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 

Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-21 Thread Michal Privoznik

On 10/21/20 5:35 AM, Mike Kravetz wrote:

On 10/20/20 6:38 AM, David Hildenbrand wrote:


I'm bisecting the warning right now. Looks like it was introduced in v5.7.


I found the following bugs in the cgroup reservation accounting.  The ones
in region_del are pretty obvious as the number of pages to uncharge would
always be zero.  The one on alloc_huge_page needs racing code to expose.

With these fixes, my testing is showing consistent/correct results for
hugetlb reservation cgroup accounting.

It would be good if Mina (at least) would look these over.  Would also
be interesting to know if these fixes address the bug seen with the qemu
use case.

I'm still doing more testing and code inspection to look for other issues.

 From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
From: Mike Kravetz 
Date: Tue, 20 Oct 2020 20:21:42 -0700
Subject: [PATCH] hugetlb_cgroup: fix reservation accounting

Signed-off-by: Mike Kravetz 
---
  mm/hugetlb.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67fc6383995b..c92366313780 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
 		}
 
 		if (f <= rg->from) {	/* Trim beginning of region */
-			del += t - rg->from;
-			rg->from = t;
-
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
 							t - rg->from);
-		} else {		/* Trim end of region */
-			del += rg->to - f;
-			rg->to = f;
 
+			del += t - rg->from;
+			rg->from = t;
+		} else {		/* Trim end of region */
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
 							rg->to - f);
+
+			del += rg->to - f;
+			rg->to = f;
 		}
 	}
 
@@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 		hugetlb_acct_memory(h, -rsv_adjust);
+		if (deferred_reserve)
+			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
+					pages_per_huge_page(h), page);
 	}
 	return page;
 



I've applied, rebuilt and tested, but unfortunately I still hit the problem:
[ 6472.719047] [ cut here ]
[ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
page_counter_uncharge+0x33/0x40
[ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco 
btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm
[ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 
5.9.1-gentoo-x86_64 #1
[ 6472.719057] Hardware name: System manufacturer System Product 
Name/PRIME X570-PRO, BIOS 1005 08/01/2019

[ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40
[ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 
89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc 
c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41

[ 6472.719061] RSP: 0018:c9b77b40 EFLAGS: 00010286
[ 6472.719061] RAX: fffe9200 RBX: 888fb3b97b40 RCX: 
fffe9200
[ 6472.719062] RDX: 0221 RSI: fffe9200 RDI: 
888fd8451dd0
[ 6472.719062] RBP: 888fb6990420 R08: 00044200 R09: 
fffbbe00
[ 6472.719062] R10: 888fb3b97b40 R11: 000a R12: 
0001
[ 6472.719063] R13: 05df R14: 05de R15: 
888fb3b97b40
[ 6472.719063] FS:  7fbd175fe700() GS:888fde98() 
knlGS:

[ 6472.719064] CS:  0010 DS:  ES:  CR0: 80050033
[ 6472.719064] CR2: 7fbd825101f0 CR3: 000fb5e41000 CR4: 
00350ee0

[ 6472.719065] Call Trace:
[ 6472.719067]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
[ 6472.719069]  region_del+0x1ae/0x270
[ 6472.719070]  hugetlb_unreserve_pages+0x32/0xa0
[ 6472.719072]  remove_inode_hugepages+0x19d/0x3a0
[ 6472.719079]  ? writeback_registers+0x45/0x60 [kvm]
[ 6472.719080]  hugetlbfs_fallocate+0x3f2/0x4a0
[ 6472.719081]  ? __mod_lruvec_state+0x1d/0x40
[ 6472.719081]  ? __mod_memcg_lruvec_state+0x1b/0xe0
[ 6472.719083]  ? __seccomp_filter+0x75/0x6a0
[ 6472.719084]  vfs_fallocate+0x122/0x260
[ 6472.719085]  __x64_sys_fallocate+0x39/0x60
[ 6472.719086]  do_syscall_64+0x2d/0x40
[ 6472.719088]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6472.719089] RIP: 0033:0x7fbe3cefcde7
[ 6472.719089] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 
4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 
05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44
[ 6472.719090] RSP: 002b:7fbd175fc7a0 

Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-21 Thread David Hildenbrand
On 21.10.20 05:35, Mike Kravetz wrote:
> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>
>> I'm bisecting the warning right now. Looks like it was introduced in v5.7.
> 

So bisecting nailed it down to one of

353b2de42e84 mm/hugetlb.c: clean code by removing unnecessary initialization
a9b3f867404b hugetlb: support file_region coalescing again
08cf9faf7558 hugetlb_cgroup: support noreserve mappings
075a61d07a8e hugetlb_cgroup: add accounting for shared mappings
0db9d74ed884 hugetlb: disable region_add file_region coalescing
e9fe92ae0cd2 hugetlb_cgroup: add reservation accounting for private mappings
9808895e1a44 mm/hugetlb_cgroup: fix hugetlb_cgroup migration
1adc4d419aa2 hugetlb_cgroup: add interface for charge/uncharge hugetlb
reservations
cdc2fcfea79b hugetlb_cgroup: add hugetlb_cgroup reservation counter

So seems to be broken right from the beginning of
charge/uncharge/reservations. Not a surprise when looking at your fixes :)


> I found the following bugs in the cgroup reservation accounting.  The ones
> in region_del are pretty obvious as the number of pages to uncharge would
> always be zero.  The one on alloc_huge_page needs racing code to expose.
> 
> With these fixes, my testing is showing consistent/correct results for
> hugetlb reservation cgroup accounting.
> 
> It would be good if Mina (at least) would look these over.  Would also
> be interesting to know if these fixes address the bug seen with the qemu
> use case.

I strongly suspect it will. Compiling, will reply in half an hour or so
with the result.

> 
> I'm still doing more testing and code inspection to look for other issues.

When sending, can you make sure to credit Michal P.? Thanks!

Reported-by: Michal Privoznik 

> 
> From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz 
> Date: Tue, 20 Oct 2020 20:21:42 -0700
> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
> 
> Signed-off-by: Mike Kravetz 
> ---
>  mm/hugetlb.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67fc6383995b..c92366313780 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, 
> long t)
>   }
>  
>   if (f <= rg->from) {/* Trim beginning of region */
> - del += t - rg->from;
> - rg->from = t;
> -
>   hugetlb_cgroup_uncharge_file_region(resv, rg,
>   t - rg->from);
> - } else {/* Trim end of region */
> - del += rg->to - f;
> - rg->to = f;
>  
> + del += t - rg->from;
> + rg->from = t;
> + } else {/* Trim end of region */
>   hugetlb_cgroup_uncharge_file_region(resv, rg,
>   rg->to - f);
> +
> + del += rg->to - f;
> + rg->to = f;

Those two look very correct to me.

You could keep computing "del" before the uncharge, similar to the /*
Remove entire region */ case.
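
For reference, the reordering suggested here would look roughly like the
following for the "Trim end of region" branch (illustrative fragment only,
untested, not a patch; the "Trim beginning" branch would be rearranged the
same way):

		} else {		/* Trim end of region */
			del += rg->to - f;
			hugetlb_cgroup_uncharge_file_region(resv, rg,
							rg->to - f);
			rg->to = f;
		}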

>   }
>   }
>  
> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  
>   rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>   hugetlb_acct_memory(h, -rsv_adjust);
> + if (deferred_reserve)
> + hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
> + pages_per_huge_page(h), page);

That looks correct to me as well.

>   }
>   return page;
>  
> 

Thanks for debugging!

-- 
Thanks,

David / dhildenb



Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-20 Thread Mike Kravetz
On 10/20/20 6:38 AM, David Hildenbrand wrote:
> 
> I'm bisecting the warning right now. Looks like it was introduced in v5.7.

I found the following bugs in the cgroup reservation accounting.  The ones
in region_del are pretty obvious as the number of pages to uncharge would
always be zero.  The one on alloc_huge_page needs racing code to expose.

With these fixes, my testing is showing consistent/correct results for
hugetlb reservation cgroup accounting.

It would be good if Mina (at least) would look these over.  Would also
be interesting to know if these fixes address the bug seen with the qemu
use case.

I'm still doing more testing and code inspection to look for other issues.

>From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
From: Mike Kravetz 
Date: Tue, 20 Oct 2020 20:21:42 -0700
Subject: [PATCH] hugetlb_cgroup: fix reservation accounting

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67fc6383995b..c92366313780 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, 
long t)
}
 
if (f <= rg->from) {/* Trim beginning of region */
-   del += t - rg->from;
-   rg->from = t;
-
hugetlb_cgroup_uncharge_file_region(resv, rg,
t - rg->from);
-   } else {/* Trim end of region */
-   del += rg->to - f;
-   rg->to = f;
 
+   del += t - rg->from;
+   rg->from = t;
+   } else {/* Trim end of region */
hugetlb_cgroup_uncharge_file_region(resv, rg,
rg->to - f);
+
+   del += rg->to - f;
+   rg->to = f;
}
}
 
@@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 
rsv_adjust = hugepage_subpool_put_pages(spool, 1);
hugetlb_acct_memory(h, -rsv_adjust);
+   if (deferred_reserve)
+   hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
+   pages_per_huge_page(h), page);
}
return page;
 
-- 
2.25.4
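
For illustration, a small standalone C toy (made-up values, not kernel code)
of the ordering problem the two region_del() hunks above fix: the region is
trimmed first, so the uncharge amount computed afterwards is always zero.

#include <stdio.h>

int main(void)
{
	long rg_from = 100;		/* existing file_region starts here (pages) */
	long f = 50, t = 300;		/* hole being punched; f <= rg_from */
	long trimmed = t - rg_from;	/* pages actually removed: 200 */

	/* Old order: trim the region first ... */
	rg_from = t;
	/* ... then uncharge "t - rg->from", which is now always zero. */
	printf("old uncharge amount: %ld pages\n", t - rg_from);
	/* The patch uncharges before rg->from (or rg->to) is updated. */
	printf("new uncharge amount: %ld pages\n", trimmed);
	return 0;
}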



Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-20 Thread David Hildenbrand
On 16.10.20 01:14, Mike Kravetz wrote:
> On 10/14/20 11:31 AM, Mike Kravetz wrote:
>> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>>
>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>> with (and without) hugetlb controller enabled and did not see this issue.
>>
> 
> I took a closer look after running just the fallocate_stress test
> in libhugetlbfs.  Here are the cgroup counter values:
> 
> hugetlb.2MB.failcnt 0
> hugetlb.2MB.limit_in_bytes 9223372036854771712
> hugetlb.2MB.max_usage_in_bytes 209715200
> hugetlb.2MB.rsvd.failcnt 0
> hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
> hugetlb.2MB.rsvd.max_usage_in_bytes 601882624
> hugetlb.2MB.rsvd.usage_in_bytes 392167424
> hugetlb.2MB.usage_in_bytes 0
> 
> We did not hit the WARN_ON_ONCE(), but the 'rsvd.usage_in_bytes' value
> is not correct in that it should be zero.   No huge page reservations
> remain after the test.
> 
> HugePages_Total:1024
> HugePages_Free: 1024
> HugePages_Rsvd:0
> HugePages_Surp:0
> Hugepagesize:   2048 kB
> Hugetlb: 2097152 kB
> 
> To try and better understand the reservation cgroup controller, I added
> a few printks to the code.  While running fallocate_stress with the
> printks, I can consistently hit the WARN_ON_ONCE() due to the counter
> going negative.  Here are the cgroup counter values after such a run:
> 
> hugetlb.2MB.failcnt 0
> hugetlb.2MB.limit_in_bytes 9223372036854771712
> hugetlb.2MB.max_usage_in_bytes 209715200
> hugetlb.2MB.rsvd.failcnt 3
> hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
> hugetlb.2MB.rsvd.max_usage_in_bytes 251658240
> hugetlb.2MB.rsvd.usage_in_bytes 18446744073487253504
> hugetlb.2MB.usage_in_bytes 0
> 
> Again, no reserved pages after the test.
> 
> HugePages_Total:1024
> HugePages_Free: 1024
> HugePages_Rsvd:0
> HugePages_Surp:0
> Hugepagesize:   2048 kB
> Hugetlb: 2097152 kB
> 
> I have some basic hugetlb hole punch functionality tests.  Running
> these on the kernel with added printk's does not cause any issues.
> In order to reproduce, I need to run fallocate_stress test which
> will cause hole punch to race with page fault.  Best guess at this
> time is that some of the error/race detection reservation back out
> code is not properly dealing with cgroup accounting.
> 
> I'll take a look at this as well.
> 

I'm bisecting the warning right now. Looks like it was introduced in v5.7.

-- 
Thanks,

David / dhildenb



Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-15 Thread Mike Kravetz
On 10/14/20 11:31 AM, Mike Kravetz wrote:
> On 10/14/20 11:18 AM, David Hildenbrand wrote:
> 
> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
> with (and without) hugetlb controller enabled and did not see this issue.
> 

I took a closer look after running just the fallocate_stress test
in libhugetlbfs.  Here are the cgroup counter values:

hugetlb.2MB.failcnt 0
hugetlb.2MB.limit_in_bytes 9223372036854771712
hugetlb.2MB.max_usage_in_bytes 209715200
hugetlb.2MB.rsvd.failcnt 0
hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
hugetlb.2MB.rsvd.max_usage_in_bytes 601882624
hugetlb.2MB.rsvd.usage_in_bytes 392167424
hugetlb.2MB.usage_in_bytes 0

We did not hit the WARN_ON_ONCE(), but the 'rsvd.usage_in_bytes' value
is not correct in that it should be zero.   No huge page reservations
remain after the test.

HugePages_Total:1024
HugePages_Free: 1024
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
Hugetlb: 2097152 kB

To try and better understand the reservation cgroup controller, I added
a few printks to the code.  While running fallocate_stress with the
printks, I can consistently hit the WARN_ON_ONCE() due to the counter
going negative.  Here are the cgroup counter values after such a run:

hugetlb.2MB.failcnt 0
hugetlb.2MB.limit_in_bytes 9223372036854771712
hugetlb.2MB.max_usage_in_bytes 209715200
hugetlb.2MB.rsvd.failcnt 3
hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
hugetlb.2MB.rsvd.max_usage_in_bytes 251658240
hugetlb.2MB.rsvd.usage_in_bytes 18446744073487253504
hugetlb.2MB.usage_in_bytes 0

Again, no reserved pages after the test.

HugePages_Total:1024
HugePages_Free: 1024
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
Hugetlb: 2097152 kB

I have some basic hugetlb hole punch functionality tests.  Running
these on the kernel with added printk's does not cause any issues.
In order to reproduce, I need to run fallocate_stress test which
will cause hole punch to race with page fault.  Best guess at this
time is that some of the error/race detection reservation back out
code is not properly dealing with cgroup accounting.

I'll take a look at this as well.
-- 
Mike Kravetz
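
As an aside, the "impossible" rsvd.usage_in_bytes value in the dump above is
consistent with the counter having gone negative; a standalone userspace
sketch of the arithmetic (the only input is the value copied from the dump):

#include <stdio.h>

int main(void)
{
	unsigned long long usage = 18446744073487253504ULL;	/* from the dump */
	unsigned long long deficit = 0ULL - usage;		/* bytes below zero */

	printf("%llu bytes below zero = %llu huge pages of 2 MiB\n",
	       deficit, deficit / (2ULL << 20));
	return 0;
}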


Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-15 Thread David Hildenbrand
On 15.10.20 10:57, David Hildenbrand wrote:
> On 15.10.20 09:56, David Hildenbrand wrote:
>> On 14.10.20 20:31, Mike Kravetz wrote:
>>> On 10/14/20 11:18 AM, David Hildenbrand wrote:
 On 14.10.20 19:56, Mina Almasry wrote:
> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand  
> wrote:
>>
>> On 14.10.20 17:22, David Hildenbrand wrote:
>>> Hi everybody,
>>>
>>> Michal Privoznik played with "free page reporting" in 
>>> QEMU/virtio-balloon
>>> with hugetlbfs and reported that this results in [1]
>>>
>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
>>> page_counter_uncharge+0x4b/0x5
>>>
>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting 
>>> is wrong)
>>>
>>>
>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>> to discard pages that are reported as free by a VM. The reporting
>>> granularity is in pageblock granularity. So when the guest reports
>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>
>>> I was also able to reproduce (also with virtio-mem, which similarly
>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>> (and on v5.7.X from F32).
>>>
>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>
>>> Any ideas?
>
> Hi David,
>
> I may be able to dig in and take a look. How do I reproduce this
> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
> hugetlb region?
>

 Hi Mina,

 thanks for having a look. I started poking around myself but,
 being new to cgroup code, I even failed to understand why that code gets
 triggered though the hugetlb controller isn't even enabled.

 I assume you at least have to make sure that there is
 a page populated (MAP_POPULATE, or read/write it). But I am not
 sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
 sufficient, or if it will require a sequence of
 populate+discard(punch) (or multi-threading).
>>>
>>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>>> with (and without) hugetlb controller enabled and did not see this issue.
>>>
>>> May need to reproduce via QEMU as below.
>>
>> Not sure if relevant, but QEMU should be using
>> memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.
>>
>> Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of the
>> memfd's memory (e.g., > 90%).
>>
> 
> I just tried to reproduce by doing random accesses + random 
> fallocate(FALLOC_FL_PUNCH_HOLE) within a file - without success.
> 
> So could be
> 1. KVM is involved messing this up
> 2. Multi-threading is involved
> 

Able to reproduce with TCG under QEMU, so not a KVM issue.

-- 
Thanks,

David / dhildenb



Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-15 Thread David Hildenbrand
On 15.10.20 09:56, David Hildenbrand wrote:
> On 14.10.20 20:31, Mike Kravetz wrote:
>> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>>> On 14.10.20 19:56, Mina Almasry wrote:
 On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand  wrote:
>
> On 14.10.20 17:22, David Hildenbrand wrote:
>> Hi everybody,
>>
>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>> with hugetlbfs and reported that this results in [1]
>>
>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
>> page_counter_uncharge+0x4b/0x5
>>
>> 2. Any hugetlbfs allocations failing. (I assume because some accounting 
>> is wrong)
>>
>>
>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>> to discard pages that are reported as free by a VM. The reporting
>> granularity is in pageblock granularity. So when the guest reports
>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>
>> I was also able to reproduce (also with virtio-mem, which similarly
>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>> (and on v5.7.X from F32).
>>
>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>> is broken with cgroups. I did *not* try without cgroups yet.
>>
>> Any ideas?

 Hi David,

 I may be able to dig in and take a look. How do I reproduce this
 though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
 hugetlb region?

>>>
>>> Hi Mina,
>>>
>>> thanks for having a look. I started poking around myself but,
>>> being new to cgroup code, I even failed to understand why that code gets
>>> triggered though the hugetlb controller isn't even enabled.
>>>
>>> I assume you at least have to make sure that there is
>>> a page populated (MAP_POPULATE, or read/write it). But I am not
>>> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
>>> sufficient, or if it will require a sequence of
>>> populate+discard(punch) (or multi-threading).
>>
>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>> with (and without) hugetlb controller enabled and did not see this issue.
>>
>> May need to reproduce via QEMU as below.
> 
> Not sure if relevant, but QEMU should be using
> memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.
> 
> Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of the
> memfd's memory (e.g., > 90%).
> 

I just tried to reproduce by doing random accesses + random 
fallocate(FALLOC_FL_PUNCH_HOLE) within a file - without success.

So could be
1. KVM is involved messing this up
2. Multi-threading is involved

However, I am also able to reproduce with only a single VCPU (there is still 
the QEMU main thread, but it limits the chance for races).

Even KVM spits fire after a while, which could be a side effect of allocations 
failing:

error: kvm run failed Bad address
RAX= RBX=8c12c9c217c0 RCX=8c12fb1b8fc0 
RDX=0007
RSI=8c12c9c217c0 RDI=8c12c9c217c8 RBP=000d 
RSP=b3964040fa68
R8 =0008 R9 =8c12c9c2 R10=8c12fffd5000 
R11=000303c0
R12=8c12c9c217c0 R13=0008 R14=0001 
R15=f31d44270800
RIP=af33ba0f RFL=0246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   
CS =0010   00a09b00 DPL=0 CS64 [-RA]
SS =0018   00c09300 DPL=0 DS   [-WA]
DS =   
FS = 7f8fabc87040  
GS = 8c12fbc0  
LDT= fe00  
TR =0040 fe003000 4087 8b00 DPL=0 TSS64-busy
GDT= fe001000 007f
IDT= fe00 0fff
CR0=80050033 CR2=560e10895398 CR3=0001073b2000 CR4=00350ef0
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=0d01
Code=0f 0b eb e2 90 0f 1f 44 00 00 53 48 89 fb 31 c0 48 8d 7f 08 <48> c7 47 f8 
00 00 00 00 48 89 d9 48 c7 c2 44 d3 52

-- 
Thanks,

David / dhildenb



Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-15 Thread David Hildenbrand
On 14.10.20 20:31, Mike Kravetz wrote:
> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>> On 14.10.20 19:56, Mina Almasry wrote:
>>> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand  wrote:

 On 14.10.20 17:22, David Hildenbrand wrote:
> Hi everybody,
>
> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
> with hugetlbfs and reported that this results in [1]
>
> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
> page_counter_uncharge+0x4b/0x5
>
> 2. Any hugetlbfs allocations failing. (I assume because some accounting 
> is wrong)
>
>
> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
> to discard pages that are reported as free by a VM. The reporting
> granularity is in pageblock granularity. So when the guest reports
> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>
> I was also able to reproduce (also with virtio-mem, which similarly
> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
> (and on v5.7.X from F32).
>
> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
> is broken with cgroups. I did *not* try without cgroups yet.
>
> Any ideas?
>>>
>>> Hi David,
>>>
>>> I may be able to dig in and take a look. How do I reproduce this
>>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>>> hugetlb region?
>>>
>>
>> Hi Mina,
>>
>> thanks for having a look. I started poking around myself but,
>> being new to cgroup code, I even failed to understand why that code gets
>> triggered though the hugetlb controller isn't even enabled.
>>
>> I assume you at least have to make sure that there is
>> a page populated (MAP_POPULATE, or read/write it). But I am not
>> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
>> sufficient, or if it will require a sequence of
>> populate+discard(punch) (or multi-threading).
> 
> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
> with (and without) hugetlb controller enabled and did not see this issue.
> 
> May need to reproduce via QEMU as below.

Not sure if relevant, but QEMU should be using
memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.

Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of the
memfd's memory (e.g., > 90%).

-- 
Thanks,

David / dhildenb



Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-14 Thread Mike Kravetz
On 10/14/20 11:18 AM, David Hildenbrand wrote:
> On 14.10.20 19:56, Mina Almasry wrote:
>> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand  wrote:
>>>
>>> On 14.10.20 17:22, David Hildenbrand wrote:
 Hi everybody,

 Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
 with hugetlbfs and reported that this results in [1]

 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
 page_counter_uncharge+0x4b/0x5

 2. Any hugetlbfs allocations failing. (I assume because some accounting is 
 wrong)


 QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
 to discard pages that are reported as free by a VM. The reporting
 granularity is in pageblock granularity. So when the guest reports
 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.

 I was also able to reproduce (also with virtio-mem, which similarly
 uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
 (and on v5.7.X from F32).

 Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
 is broken with cgroups. I did *not* try without cgroups yet.

 Any ideas?
>>
>> Hi David,
>>
>> I may be able to dig in and take a look. How do I reproduce this
>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>> hugetlb region?
>>
> 
> Hi Mina,
> 
> thanks for having a look. I started poking around myself but,
> being new to cgroup code, I even failed to understand why that code gets
> triggered though the hugetlb controller isn't even enabled.
> 
> I assume you at least have to make sure that there is
> a page populated (MAP_POPULATE, or read/write it). But I am not
> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
> sufficient, or if it will require a sequence of
> populate+discard(punch) (or multi-threading).

FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
with (and without) hugetlb controller enabled and did not see this issue.

May need to reproduce via QEMU as below.
-- 
Mike Kravetz

> What definitely makes it trigger is via QEMU
> 
> qemu-system-x86_64 \
> -machine 
> pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
> -cpu host,migratable=on \
> -m 4096 \
> -object 
> memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,size=4294967296
>  \
> -overcommit mem-lock=off \
> -smp 4,sockets=1,dies=1,cores=2,threads=2 \
> -nodefaults \
> -nographic \
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
> -blockdev 
> '{"driver":"file","filename":"../Fedora-Cloud-Base-32-1.6.x86_64.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}'
>  \
> -blockdev 
> '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}'
>  \
> -device 
> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1
>  \
> -chardev stdio,nosignal,id=serial \
> -device isa-serial,chardev=serial \
> -device 
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on
> 
> 
> However, you need a recent QEMU (>= v5.1 IIRC) and a recent kernel
> (>= v5.7) inside your guest image.
> 
> Fedora rawhide qcow2 should do: 
> https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2
> 


Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-14 Thread David Hildenbrand
On 14.10.20 19:56, Mina Almasry wrote:
> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand  wrote:
>>
>> On 14.10.20 17:22, David Hildenbrand wrote:
>>> Hi everybody,
>>>
>>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>>> with hugetlbfs and reported that this results in [1]
>>>
>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
>>> page_counter_uncharge+0x4b/0x5
>>>
>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting is 
>>> wrong)
>>>
>>>
>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>> to discard pages that are reported as free by a VM. The reporting
>>> granularity is in pageblock granularity. So when the guest reports
>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>
>>> I was also able to reproduce (also with virtio-mem, which similarly
>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>> (and on v5.7.X from F32).
>>>
>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>
>>> Any ideas?
> 
> Hi David,
> 
> I may be able to dig in and take a look. How do I reproduce this
> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
> hugetlb region?
> 

Hi Mina,

thanks for having a look. I started poking around myself but,
being new to cgroup code, I even failed to understand why that code gets
triggered though the hugetlb controller isn't even enabled.

I assume you at least have to make sure that there is
a page populated (MAP_POPULATE, or read/write it). But I am not
sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
sufficient, or if it will require a sequence of
populate+discard(punch) (or multi-threading).

What definitely makes it trigger is via QEMU

qemu-system-x86_64 \
-machine 
pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-cpu host,migratable=on \
-m 4096 \
-object 
memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,size=4294967296 \
-overcommit mem-lock=off \
-smp 4,sockets=1,dies=1,cores=2,threads=2 \
-nodefaults \
-nographic \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
-blockdev 
'{"driver":"file","filename":"../Fedora-Cloud-Base-32-1.6.x86_64.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}'
 \
-blockdev 
'{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}'
 \
-device 
scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1
 \
-chardev stdio,nosignal,id=serial \
-device isa-serial,chardev=serial \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on


However, you need a recent QEMU (>= v5.1 IIRC) and a recent kernel
(>= v5.7) inside your guest image.

Fedora rawhide qcow2 should do: 
https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2


-- 
Thanks,

David / dhildenb
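
For anyone who wants to poke at this outside of QEMU, a minimal sketch of the
populate + punch sequence being discussed (hugetlbfs memfd with 2 MiB pages;
it assumes a preallocated hugepage pool and, for the cgroup angle, the hugetlb
controller enabled). Note that, as reported elsewhere in the thread, simple
single-process attempts like this did not reproduce the warning, so treat it
as a starting point rather than a reproducer.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <linux/memfd.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)
#define NR_HPAGES	64UL

int main(void)
{
	size_t len = NR_HPAGES * HPAGE_SIZE;
	int fd = memfd_create("punch-test", MFD_HUGETLB | MFD_HUGE_2MB);
	char *map;

	if (fd < 0 || ftruncate(fd, len))
		return 1;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_POPULATE, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	for (unsigned long i = 0; i < NR_HPAGES; i++) {
		/* populate the huge page, then punch it back out */
		memset(map + i * HPAGE_SIZE, 0xaa, HPAGE_SIZE);
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  i * HPAGE_SIZE, HPAGE_SIZE);
	}

	munmap(map, len);
	close(fd);
	return 0;
}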



Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-14 Thread Mina Almasry
On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand  wrote:
>
> On 14.10.20 17:22, David Hildenbrand wrote:
> > Hi everybody,
> >
> > Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
> > with hugetlbfs and reported that this results in [1]
> >
> > 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
> > page_counter_uncharge+0x4b/0x5
> >
> > 2. Any hugetlbfs allocations failing. (I assume because some accounting is 
> > wrong)
> >
> >
> > QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
> > to discard pages that are reported as free by a VM. The reporting
> > granularity is in pageblock granularity. So when the guest reports
> > 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
> >
> > I was also able to reproduce (also with virtio-mem, which similarly
> > uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
> > (and on v5.7.X from F32).
> >
> > Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
> > is broken with cgroups. I did *not* try without cgroups yet.
> >
> > Any ideas?

Hi David,

I may be able to dig in and take a look. How do I reproduce this
though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
hugetlb region?

>
> Just tried without the hugetlb controller, seems to work just fine.
>
> I'd like to note that
> - The controller was not activated
> - I had to compile the hugetlb controller out to make it work.
>
> --
> Thanks,
>
> David / dhildenb
>


Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-14 Thread David Hildenbrand
On 14.10.20 17:22, David Hildenbrand wrote:
> Hi everybody,
> 
> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
> with hugetlbfs and reported that this results in [1]
> 
> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
> page_counter_uncharge+0x4b/0x5
> 
> 2. Any hugetlbfs allocations failing. (I assume because some accounting is 
> wrong)
> 
> 
> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
> to discard pages that are reported as free by a VM. The reporting
> granularity is in pageblock granularity. So when the guest reports
> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
> 
> I was also able to reproduce (also with virtio-mem, which similarly
> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
> (and on v5.7.X from F32).
> 
> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
> is broken with cgroups. I did *not* try without cgroups yet.
> 
> Any ideas?

Just tried without the hugetlb controller, seems to work just fine.

I'd like to note that
- The controller was not activated
- I had to compile the hugetlb controller out to make it work.

-- 
Thanks,

David / dhildenb



cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-14 Thread David Hildenbrand
Hi everybody,

Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
with hugetlbfs and reported that this results in [1]

1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
page_counter_uncharge+0x4b/0x5

2. Any hugetlbfs allocations failing. (I assume because some accounting is 
wrong)


QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
to discard pages that are reported as free by a VM. The reporting
granularity is in pageblock granularity. So when the guest reports
2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.

I was also able to reproduce (also with virtio-mem, which similarly
uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
(and on v5.7.X from F32).

Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
is broken with cgroups. I did *not* try without cgroups yet.

Any ideas?


Here is report #1:

[  315.251417] [ cut here ]
[  315.251424] WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 
page_counter_uncharge+0x4b/0x50
[  315.251425] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack 
ipt_REJECT nf_nat_tftp nf_conntrack_tftp rfcomm tun bridge stp llc nft_objref 
nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject 
nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat 
ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack 
nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security 
ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter cmac bnep hwmon_vid sunrpc squashfs vfat fat loop 
snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio 
snd_hda_intel snd_intel_dspcfg snd_hda_codec edac_mce_amd snd_hda_core btusb 
btrtl btbcm snd_hwdep snd_seq btintel kvm_amd snd_seq_device bluetooth kvm 
snd_pcm ecdh_generic sp5100_tco irqbypass rfkill snd_timer rapl ecc pcspkr 
wmi_bmof joydev i2c_piix4 k10temp snd
[  315.251454]  soundcore acpi_cpufreq ip_tables xfs libcrc32c dm_crypt igb 
hid_logitech_hidpp video dca amdgpu iommu_v2 gpu_sched i2c_algo_bit ttm 
drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel mxm_wmi drm 
ghash_clmulni_intel ccp nvme nvme_core wmi pinctrl_amd hid_logitech_dj fuse
[  315.251466] CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
[  315.251467] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 
AORUS PRO, BIOS F21 07/31/2020
[  315.251469] RIP: 0010:page_counter_uncharge+0x4b/0x50
[  315.251471] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe 
ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb 
ec 90 0f 1f 44 00 00 48 8b 17 48 39 d6 72 41 41 54 49 89
[  315.251472] RSP: 0018:b60f01ed3b20 EFLAGS: 00010286
[  315.251473] RAX: fffd0600 RBX: fffd0600 RCX: 8de8272e3200
[  315.251473] RDX: 028e RSI: fffd0600 RDI: 8de838452e40
[  315.251474] RBP: 8de838452e40 R08: 8de838452e40 R09: 8de837f86c80
[  315.251475] R10: b60f01ed3b58 R11: 0001 R12: 00051c00
[  315.251475] R13: fffae400 R14: 8de8272e3240 R15: 0571
[  315.251476] FS:  7f9c2edfd700() GS:8de83ebc() 
knlGS:
[  315.251477] CS:  0010 DS:  ES:  CR0: 80050033
[  315.251478] CR2: 7f2a76787e78 CR3: 000fcbb1c000 CR4: 00350ee0
[  315.251479] Call Trace:
[  315.251485]  hugetlb_cgroup_uncharge_file_region+0x4b/0x80
[  315.251487]  region_del+0x1d3/0x300
[  315.251489]  hugetlb_unreserve_pages+0x39/0xb0
[  315.251492]  remove_inode_hugepages+0x1a8/0x3d0
[  315.251495]  ? tlb_finish_mmu+0x7a/0x1d0
[  315.251497]  hugetlbfs_fallocate+0x3c4/0x5c0
[  315.251519]  ? kvm_arch_vcpu_ioctl_run+0x614/0x1700 [kvm]
[  315.251522]  ? file_has_perm+0xa2/0xb0
[  315.251524]  ? inode_security+0xc/0x60
[  315.251525]  ? selinux_file_permission+0x4e/0x120
[  315.251527]  vfs_fallocate+0x146/0x290
[  315.251529]  __x64_sys_fallocate+0x3e/0x70
[  315.251531]  do_syscall_64+0x33/0x40
[  315.251533]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  315.251535] RIP: 0033:0x7f9d3fb5641f
[  315.251536] Code: 89 7c 24 08 48 89 4c 24 18 e8 5d fc f8 ff 4c 8b 54 24 18 
48 8b 54 24 10 41 89 c0 8b 74 24 0c 8b 7c 24 08 b8 1d 01 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 31 44 89 c7 89 44 24 08 e8 8d fc f8 ff 8b 44
[  315.251537] RSP: 002b:7f9c2edfc470 EFLAGS: 0293 ORIG_RAX: 
011d
[  315.251538] RAX: ffda RBX: 1000 RCX: 7f9d3fb5641f
[  315.251539] RDX: ae20 RSI: 0003 RDI: 000c
[  315.251539] RBP: 557389d6736c R08:  R09: 000c
[  315.251540] R10: 0020 R11: 0293 R12: 0020
[  315.251540] R13:  R14: ae20 R15: 7f9cde00
[  315.251542] ---[ end