Re: [PATCH] mm/vmalloc: a slight change of compare target in __insert_vmap_area()

2017-06-01 Thread zhong jiang
On 2017/6/2 9:45, Wei Yang wrote:
> On Fri, May 26, 2017 at 09:55:31AM +0800, zhong jiang wrote:
>> On 2017/5/26 9:36, Wei Yang wrote:
>>> On Thu, May 25, 2017 at 11:04:44AM +0800, zhong jiang wrote:
>>>> I hit the overlap issue, but it is hard to reproduce. If you think the
>>>> check is safe and the situation cannot happen, AFAIC there is no need
>>>> to add the code.
>>>>
>>>> If you insist on the point, maybe VM_WARN_ON is a choice.
>>>>
> Do you have some log to show that the overlap happens?
>> Hi Wei
>>
>> cat /proc/vmallocinfo
>> 0xf1580000-0xf1600000  524288 raw_dump_mem_write+0x10c/0x188 phys=8b901000 ioremap
>> 0xf1638000-0xf163a000    8192 mcss_pou_queue_init+0xa0/0x13c [mcss] phys=fc614000 ioremap
>> 0xf528e000-0xf5292000   16384 n_tty_open+0x10/0xd0 pages=3 vmalloc
>> 0xf5000000-0xf9001000 67112960 devm_ioremap+0x38/0x70 phys=4000 ioremap
> These two ranges overlap.
>
> It is hard to say where the problem is. From the code point of view, I don't
> see any possibility of allocating an overlapped range.
>
> Which version is your kernel?
> Does "hard to reproduce" mean you have seen it just once?
  Yes, just once. I also see no problem from the code. The kernel version is
Linux 4.1, but the overlap did indeed occur.

 Thanks
zhongjiang
>> 0xfe001000-0xfe002000    4096 iotable_init+0x0/0xc phys=20001000 ioremap
>> 0xfe200000-0xfe201000    4096 iotable_init+0x0/0xc phys=1a00 ioremap
>> 0xff100000-0xff101000    4096 iotable_init+0x0/0xc phys=2000a000 ioremap
>>
>> I hit the above issue, but the log has no more useful info; it was just
>> found by accident, and it is hard to reproduce. No more info can be
>> supplied for further investigation, so I have no idea.
>>
>> Thanks
>> zhongjiang
>>




Re: [PATCH] mm/vmalloc: a slight change of compare target in __insert_vmap_area()

2017-05-25 Thread zhong jiang
On 2017/5/26 9:36, Wei Yang wrote:
> On Thu, May 25, 2017 at 11:04:44AM +0800, zhong jiang wrote:
>> I hit the overlap issue, but it is hard to reproduce. If you think the
>> check is safe and the situation cannot happen, AFAIC there is no need
>> to add the code.
>>
>> If you insist on the point, maybe VM_WARN_ON is a choice.
>>
> Do you have some log to show that the overlap happens?
 Hi Wei

cat /proc/vmallocinfo
0xf1580000-0xf1600000  524288 raw_dump_mem_write+0x10c/0x188 phys=8b901000 ioremap
0xf1638000-0xf163a000    8192 mcss_pou_queue_init+0xa0/0x13c [mcss] phys=fc614000 ioremap
0xf528e000-0xf5292000   16384 n_tty_open+0x10/0xd0 pages=3 vmalloc
0xf5000000-0xf9001000 67112960 devm_ioremap+0x38/0x70 phys=4000 ioremap
0xfe001000-0xfe002000    4096 iotable_init+0x0/0xc phys=20001000 ioremap
0xfe200000-0xfe201000    4096 iotable_init+0x0/0xc phys=1a00 ioremap
0xff100000-0xff101000    4096 iotable_init+0x0/0xc phys=2000a000 ioremap

I hit the above issue, but the log has no more useful info; it was just
found by accident, and it is hard to reproduce. No more info can be
supplied for further investigation, so I have no idea.

Thanks
zhongjiang




Re: [PATCH] mm/vmalloc: a slight change of compare target in __insert_vmap_area()

2017-05-24 Thread zhong jiang
I hit the overlap issue, but it is hard to reproduce. If you think the
check is safe and the situation cannot happen, AFAIC there is no need to
add the code.

If you insist on the point, maybe VM_WARN_ON is a choice.

Regards
zhongjiang
On 2017/5/24 18:03, Wei Yang wrote:
> The vmap RB tree stores the elements in order, with no overlap between
> any of them. The comparison in __insert_vmap_area() decides which
> direction the search should follow and makes sure the new vmap_area does
> not overlap with any other.
>
> The current implementation fails to do the overlap check.
>
> When the first "if" is not true, it means
>
> va->va_start >= tmp_va->va_end
>
> And since, for any vmap_area,
>
> xxx->va_end > xxx->va_start
>
> the deduction is
>
> va->va_end > tmp_va->va_start
>
> which is exactly the condition in the second "if". So whenever the first
> test fails the second one holds, the BUG() branch is unreachable, and an
> overlapping vmap_area is silently inserted.
>
> This patch slightly changes the comparison in __insert_vmap_area() to
> make sure it forbids overlapped vmap_areas.
>
> Signed-off-by: Wei Yang 
> ---
>  mm/vmalloc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 0b057628a7ba..8087451cb332 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -360,9 +360,9 @@ static void __insert_vmap_area(struct vmap_area *va)
>  
>  		parent = *p;
>  		tmp_va = rb_entry(parent, struct vmap_area, rb_node);
> -		if (va->va_start < tmp_va->va_end)
> +		if (va->va_end <= tmp_va->va_start)
>  			p = &(*p)->rb_left;
> -		else if (va->va_end > tmp_va->va_start)
> +		else if (va->va_start >= tmp_va->va_end)
>  			p = &(*p)->rb_right;
>  		else
>  			BUG();
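
As a minimal userspace sketch of the deduction above (this is not kernel
code: "struct range" stands in for struct vmap_area, and the values are
invented), an overlapping insertion behaves like this under the two
comparison schemes:

#include <stdio.h>

struct range { unsigned long start, end; };  /* stand-in for vmap_area */

/* Old scheme: because end > start always holds, one of the two tests is
 * always true, so the overlap branch (BUG() in the kernel) is unreachable
 * and an overlapping range is silently sent down the tree. */
static int cmp_old(const struct range *va, const struct range *tmp)
{
        if (va->start < tmp->end)
                return -1;              /* go left */
        else if (va->end > tmp->start)
                return 1;               /* go right */
        return 0;                       /* overlap: never reached */
}

/* New scheme: only strictly disjoint ranges go left or right; anything
 * else falls through and is reported as an overlap. */
static int cmp_new(const struct range *va, const struct range *tmp)
{
        if (va->end <= tmp->start)
                return -1;              /* go left */
        else if (va->start >= tmp->end)
                return 1;               /* go right */
        return 0;                       /* overlap detected */
}

int main(void)
{
        struct range tmp = { 0x10000, 0x20000 };
        struct range va  = { 0x18000, 0x28000 };  /* overlaps tmp */

        printf("old scheme: %d (overlap sent down the tree, undetected)\n",
               cmp_old(&va, &tmp));
        printf("new scheme: %d (overlap caught)\n",
               cmp_new(&va, &tmp));
        return 0;
}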




Re: mm, something wrong in page_lock_anon_vma_read()?

2017-05-23 Thread zhong jiang
On 2017/5/23 17:33, Vlastimil Babka wrote:
> On 05/23/2017 11:21 AM, zhong jiang wrote:
>> On 2017/5/23 0:51, Vlastimil Babka wrote:
>>> On 05/20/2017 05:01 AM, zhong jiang wrote:
>>>> On 2017/5/20 10:40, Hugh Dickins wrote:
>>>>> On Sat, 20 May 2017, Xishi Qiu wrote:
>>>>>> Here is a bug report from Red Hat:
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1305620
>>>>>> I hit the bug too. However it is hard to reproduce, and
>>>>>> 624483f3ea82598 ("mm: rmap: fix use-after-free in __put_anon_vma")
>>>>>> does not help.
>>>>>>
>>>>>> From the vmcore, it seems that the page is still mapped
>>>>>> (_mapcount=0 and _count=2), and the value of mapping is a valid
>>>>>> address (mapping = 0x8801b3e2a101), but anon_vma has been corrupted.
>>>>>>
>>>>>> Any ideas?
>>>>> Sorry, no.  I assume that _mapcount has been misaccounted, for example
>>>>> a pte mapped in on top of another pte; but cannot begin to tell you where
>>>>> in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.
>>>>>
>>>>> Hugh
>>>>>
>>>>> .
>>>>>
>>>> Hi, Hugh
>>>>
>>>> I found the following message in dmesg.
>>>>
>>>> [26068.316592] BUG: Bad rss-counter state mm:8800a7de2d80 idx:1 val:1
>>>>
>>>> I can prove that the _mapcount is misaccounted: when the task exits,
>>>> the rmap still exists.
>>> Check if the kernel in question contains this commit: ad33bb04b2a6 ("mm:
>>> thp: fix SMP race condition between THP page fault and MADV_DONTNEED")
>>   Hi, Vlastimil
>>  
>>   My kernel misses the patch.
> Try applying it then, there's good chance the error and crash will go
> away. Even if your workload doesn't actually run any madvise(MADV_DONTNEED).
 OK, I will try. Thanks.
>> When I read the patch, I found the following issue, but I am not sure it is right.
>>
>>	if (unlikely(pmd_trans_unstable(pmd)))
>>		return 0;
>>	/*
>>	 * A regular pmd is established and it can't morph into a huge pmd
>>	 * from under us anymore at this point because we hold the mmap_sem
>>	 * read mode and khugepaged takes it in write mode. So now it's
>>	 * safe to run pte_offset_map().
>>	 */
>>	pte = pte_offset_map(pmd, address);
>>
>>   After the pmd_trans_unstable() call there is no further protection; by
>> the comment, pte_offset_map() is considered safe. But before the
>> pte_offset_map() call, the pmd may still become unstable. Is that
>> possible?
> IIRC it's "unstable" wrt possible none->huge->none transition. But once
> we've seen it's a regular pmd via pmd_trans_unstable(), we're safe as a
> transition from regular pmd can't happen.
  Thank you for clarifying.
 
  Regards
 zhongjiang
>>   Thanks
>> zhongjiang
>>>> Thanks
>>>> zhongjiang




Re: mm, something wrong in page_lock_anon_vma_read()?

2017-05-23 Thread zhong jiang
On 2017/5/23 0:51, Vlastimil Babka wrote:
> On 05/20/2017 05:01 AM, zhong jiang wrote:
>> On 2017/5/20 10:40, Hugh Dickins wrote:
>>> On Sat, 20 May 2017, Xishi Qiu wrote:
>>>> Here is a bug report from Red Hat:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1305620
>>>> I hit the bug too. However it is hard to reproduce, and
>>>> 624483f3ea82598 ("mm: rmap: fix use-after-free in __put_anon_vma")
>>>> does not help.
>>>>
>>>> From the vmcore, it seems that the page is still mapped
>>>> (_mapcount=0 and _count=2), and the value of mapping is a valid
>>>> address (mapping = 0x8801b3e2a101), but anon_vma has been corrupted.
>>>>
>>>> Any ideas?
>>> Sorry, no.  I assume that _mapcount has been misaccounted, for example
>>> a pte mapped in on top of another pte; but cannot begin to tell you where
>>> in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.
>>>
>>> Hugh
>>>
>>> .
>>>
>> Hi, Hugh
>>
>> I found the following message in dmesg.
>>
>> [26068.316592] BUG: Bad rss-counter state mm:8800a7de2d80 idx:1 val:1
>>
>> I can prove that the _mapcount is misaccounted: when the task exits,
>> the rmap still exists.
> Check if the kernel in question contains this commit: ad33bb04b2a6 ("mm:
> thp: fix SMP race condition between THP page fault and MADV_DONTNEED")
  Hi, Vlastimil
 
  My kernel misses the patch. When I read the patch, I found the following
issue, but I am not sure it is right.

	if (unlikely(pmd_trans_unstable(pmd)))
		return 0;
	/*
	 * A regular pmd is established and it can't morph into a huge pmd
	 * from under us anymore at this point because we hold the mmap_sem
	 * read mode and khugepaged takes it in write mode. So now it's
	 * safe to run pte_offset_map().
	 */
	pte = pte_offset_map(pmd, address);

  After the pmd_trans_unstable() call there is no further protection; by
the comment, pte_offset_map() is considered safe. But before the
pte_offset_map() call, the pmd may still become unstable. Is that possible?

  Thanks
zhongjiang
>> Thanks
>> zhongjiang




Re: mm, something wrong in page_lock_anon_vma_read()?

2017-05-19 Thread zhong jiang
On 2017/5/20 10:40, Hugh Dickins wrote:
> On Sat, 20 May 2017, Xishi Qiu wrote:
>> Here is a bug report from Red Hat:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1305620
>> I hit the bug too. However it is hard to reproduce, and
>> 624483f3ea82598 ("mm: rmap: fix use-after-free in __put_anon_vma")
>> does not help.
>>
>> From the vmcore, it seems that the page is still mapped
>> (_mapcount=0 and _count=2), and the value of mapping is a valid
>> address (mapping = 0x8801b3e2a101), but anon_vma has been corrupted.
>>
>> Any ideas?
> Sorry, no.  I assume that _mapcount has been misaccounted, for example
> a pte mapped in on top of another pte; but cannot begin to tell you where
> in Red Hat's kernel-3.10.0-229.4.2.el7 that might happen.
>
> Hugh
>
> .
>
Hi, Hugh

I found the following message in dmesg.

[26068.316592] BUG: Bad rss-counter state mm:8800a7de2d80 idx:1 val:1

I can prove that the _mapcount is misaccounted: when the task exits, the
rmap still exists.

Thanks
zhongjiang



Re: [Question] vmalloc area overlap with another allocated vmalloc area

2017-05-17 Thread zhong jiang
On 2017/5/17 21:44, Michal Hocko wrote:
> On Wed 17-05-17 20:53:57, zhong jiang wrote:
>> +to linux-mm maintainer for any suggestions
>>
>> Thanks
>> zhongjiang
>> On 2017/5/16 13:03, zhong jiang wrote:
>>> Hi
>>>
>>> I hit the following issue by running /proc/vmallocinfo. The kernel is
>>> 4.1 stable, 32-bit. After I expanded the vmalloc area, the issue did
>>> not occur again, so it seems related to an overflow, but I do not see
>>> any problem in the code so far.
> Is this a clean 4.1 stable kernel without any additional patches on top?
> Are you able to reproduce this? How? Does the same problem happen with
> the current Linus tree?
  It is hard to reproduce; it happened in a special case and only once,
and we cannot construct the case. I have not tested the Linus tree so far,
because I know it is hard to reproduce.

  Just by reading the code, I do not find the issue, so I have no idea.

 Thanks
 zhongjiang 
>>> cat /proc/vmallocinfo
>>> 0xf1580000-0xf1600000  524288 raw_dump_mem_write+0x10c/0x188 phys=8b901000 ioremap
>>> 0xf1638000-0xf163a000    8192 mcss_pou_queue_init+0xa0/0x13c [mcss] phys=fc614000 ioremap
>>> 0xf528e000-0xf5292000   16384 n_tty_open+0x10/0xd0 pages=3 vmalloc
>>> 0xf5000000-0xf9001000 67112960 devm_ioremap+0x38/0x70 phys=4000 ioremap
>>> 0xfe001000-0xfe002000    4096 iotable_init+0x0/0xc phys=20001000 ioremap
>>> 0xfe200000-0xfe201000    4096 iotable_init+0x0/0xc phys=1a00 ioremap
>>> 0xff100000-0xff101000    4096 iotable_init+0x0/0xc phys=2000a000 ioremap
>>>
>>> By the above info, the vmap area allocated by n_tty_open is contained
>>> within the range mapped by devm_ioremap. I also do not see a race in
>>> this path.
>>>
>>> I have no idea about the issue. Any suggestions will be appreciated.
>>> The related config is attached.
>>>
>>> Thanks
>>> zhongjiang




Re: [Question] vmalloc area overlap with another allocated vmalloc area

2017-05-17 Thread zhong jiang
+to linux-mm maintainer for any suggestions

Thanks
zhongjiang
On 2017/5/16 13:03, zhong jiang wrote:
> Hi
>
> I hit the following issue by running /proc/vmallocinfo. The kernel is
> 4.1 stable, 32-bit. After I expanded the vmalloc area, the issue did
> not occur again, so it seems related to an overflow, but I do not see
> any problem in the code so far.
>
> cat /proc/vmallocinfo
> 0xf1580000-0xf1600000  524288 raw_dump_mem_write+0x10c/0x188 phys=8b901000 ioremap
> 0xf1638000-0xf163a000    8192 mcss_pou_queue_init+0xa0/0x13c [mcss] phys=fc614000 ioremap
> 0xf528e000-0xf5292000   16384 n_tty_open+0x10/0xd0 pages=3 vmalloc
> 0xf5000000-0xf9001000 67112960 devm_ioremap+0x38/0x70 phys=4000 ioremap
> 0xfe001000-0xfe002000    4096 iotable_init+0x0/0xc phys=20001000 ioremap
> 0xfe200000-0xfe201000    4096 iotable_init+0x0/0xc phys=1a00 ioremap
> 0xff100000-0xff101000    4096 iotable_init+0x0/0xc phys=2000a000 ioremap
>
> By the above info, the vmap area allocated by n_tty_open is contained
> within the range mapped by devm_ioremap. I also do not see a race in
> this path.
>
> I have no idea about the issue. Any suggestions will be appreciated.
> The related config is attached.
>
> Thanks
> zhongjiang




[Question] vmalloc area overlap with another allocated vmalloc area

2017-05-15 Thread zhong jiang
Hi

I hit the following issue by running /proc/vmallocinfo. The kernel is 4.1
stable, 32-bit. After I expanded the vmalloc area, the issue did not occur
again, so it seems related to an overflow, but I do not see any problem in
the code so far.

cat /proc/vmallocinfo
0xf1580000-0xf1600000  524288 raw_dump_mem_write+0x10c/0x188 phys=8b901000 ioremap
0xf1638000-0xf163a000    8192 mcss_pou_queue_init+0xa0/0x13c [mcss] phys=fc614000 ioremap
0xf528e000-0xf5292000   16384 n_tty_open+0x10/0xd0 pages=3 vmalloc
0xf5000000-0xf9001000 67112960 devm_ioremap+0x38/0x70 phys=4000 ioremap
0xfe001000-0xfe002000    4096 iotable_init+0x0/0xc phys=20001000 ioremap
0xfe200000-0xfe201000    4096 iotable_init+0x0/0xc phys=1a00 ioremap
0xff100000-0xff101000    4096 iotable_init+0x0/0xc phys=2000a000 ioremap

By the above info, the vmap area allocated by n_tty_open is contained
within the range mapped by devm_ioremap. I also do not see a race in
this path.

I have no idea about the issue. Any suggestions will be appreciated.
The related config is attached.
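
For what it's worth, an overlap like the one above can also be spotted
mechanically. Below is a small, hypothetical userspace helper (not part of
this report): it scans /proc/vmallocinfo, whose entries are normally listed
in ascending address order, and flags any entry that starts below the
running maximum end, as the devm_ioremap line above does.

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/vmallocinfo", "r");
        unsigned long start, end, prev_end = 0;
        char rest[512];

        if (!f) {
                perror("vmallocinfo");
                return 1;
        }
        while (fscanf(f, "0x%lx-0x%lx", &start, &end) == 2) {
                if (start < prev_end)
                        printf("overlap: 0x%lx starts before previous end 0x%lx\n",
                               start, prev_end);
                if (end > prev_end)
                        prev_end = end;
                /* skip the rest of the line (size, caller, flags) */
                if (!fgets(rest, sizeof(rest), f))
                        break;
        }
        fclose(f);
        return 0;
}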

Thanks
zhongjiang
#
# Automatically generated file; DO NOT EDIT.
# Linux/arm 4.1.12 Kernel Configuration
#
CONFIG_ARM=y
CONFIG_MIGHT_HAVE_PCI=y
CONFIG_SYS_SUPPORTS_APM_EMULATION=y
CONFIG_HAVE_PROC_CPU=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_IRQ_BYPASS is not set
# CONFIG_LIB_INTERRUPT is not set
# CONFIG_SET_IRQPRIORITY is not set
# CONFIG_SRE_PREHANDLER is not set
# CONFIG_IPI_COMBINE is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIQ=y
CONFIG_VECTORS_BASE=0x
CONFIG_NEED_MACH_MEMORY_H=y
CONFIG_PHYS_OFFSET=0x800
CONFIG_GENERIC_BUG=y
CONFIG_PGTABLE_LEVELS=2
# CONFIG_ARCH_AARCH32_ES_SUPPORT is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
CONFIG_KERNEL_LZMA=y
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_FHANDLE is not set
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_HANDLE_DOMAIN_IRQ=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y

#
# Timers subsystem
#
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
# CONFIG_RCU_USER_QS is not set
CONFIG_RCU_FANOUT=32
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_KTHREAD_PRIO=0
# CONFIG_RCU_NOCB_CPU is not set
# CONFIG_RCU_EXPEDITE_BOOT is not set
# CONFIG_BUILD_BIN2C is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=14
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_GENERIC_SCHED_CLOCK=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
# CONFIG_CGROUP_FREEZER is not set
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_MEMCG is not set
# CONFIG_CGROUP_HUGETLB is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
# CONFIG_NAMESPACES is not set
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_RD_GZIP is not set
# CONFIG_RD_BZIP2 is not set
CONFIG_RD_LZMA=y
# CONFIG_RD_XZ is not set
# CONFIG_RD_LZO is not set
# CONFIG_RD_LZ4 is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_HAVE_UID16=y
CONFIG_BPF=y
CONFIG_EXPERT=y


Re: [RESENT PATCH] x86/mem: fix the offset overflow when read/write mem

2017-05-09 Thread zhong jiang
On 2017/5/9 23:46, Rik van Riel wrote:
> On Thu, 2017-05-04 at 10:28 +0800, zhong jiang wrote:
>> On 2017/5/4 2:46, Rik van Riel wrote:
>>> However, it is not as easy as simply checking the
>>> end against __pa(high_memory). Some systems have
>>> non-contiguous physical memory ranges, with gaps
>>> of invalid addresses in-between.
>>  Does an invalid physical address mean that it is IO-mapped, not in a
>>  System RAM region, and that /dev/mem cannot access it? Is that right?
> Not necessarily. Some systems simply have large
> gaps in physical memory access. Their memory map
> may look like this:
>
> |MM|IO|......|MM|
>
> Where M is memory, IO is IO space, and the
> dots are simply a gap in physical address
> space with no valid accesses at all.
>
>>> At that point, is the complexity so much that it no
>>> longer makes sense to try to protect against root
>>> crashing the system?
>>>
>>  So your suggestion is to leave the issue alone, without any protection;
>>  root users know what they are doing.
> Well, root already has other ways to crash the system.
>
> Implementing validation on /dev/mem may make sense if
> it can be done in a simple way, but may not be worth
> it if it becomes too complex.
>
 I have no simple way to fix it. Do you have any suggestion, or could you
 send a patch?

 Thanks
 zhongjiang



Re: [RESENT PATCH] x86/mem: fix the offset overflow when read/write mem

2017-05-03 Thread zhong jiang
On 2017/5/4 2:46, Rik van Riel wrote:
> On Tue, 2017-05-02 at 13:54 -0700, David Rientjes wrote:
>
>>> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
>>> index 7e4a9d1..3a765e02 100644
>>> --- a/drivers/char/mem.c
>>> +++ b/drivers/char/mem.c
>>> @@ -55,7 +55,7 @@ static inline int valid_phys_addr_range(phys_addr_t addr, size_t count)
>>>  
>>>  static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
>>>  {
>>> -	return 1;
>>> +	return (pfn << PAGE_SHIFT) + size <= __pa(high_memory);
>>>  }
>>>  #endif
>>>  
>> I suppose you are correct that there should be some sanity checking
>> on the 
>> size used for the mmap().
> My apologies for not responding earlier. It may
> indeed make sense to have a sanity check here.
>
> However, it is not as easy as simply checking the
> end against __pa(high_memory). Some systems have
> non-contiguous physical memory ranges, with gaps
> of invalid addresses in-between.
 Does an invalid physical address mean that it is IO-mapped, not in a
 System RAM region, and that /dev/mem cannot access it? Is that right?
> You would have to make sure that both the beginning
> and the end are valid, and that there are no gaps of
> invalid pfns in the middle...
 If it is limited to System RAM, we can walk the resource tree to exclude
 the gaps, or add pfn_valid() checks to go further. Whether other
 situations should be considered, I am not sure.
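
 As a rough sketch of that direction (untested, an illustration only, and
 deliberately leaving the RAM-vs-IO policy question open), such a check
 might look like:

static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
{
        /* Sketch only, not a reviewed patch: reject the mmap() unless
         * every pfn in the range has a valid memmap entry. pfn_valid()
         * catches holes in the physical address space but does not
         * distinguish System RAM from IO; a stricter policy could walk
         * the resource tree (e.g. walk_system_ram_range()) instead. */
        unsigned long nr = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
        unsigned long i;

        /* guard against the offset overflow this thread started with */
        if (pfn + nr < pfn)
                return 0;

        for (i = 0; i < nr; i++)
                if (!pfn_valid(pfn + i))
                        return 0;
        return 1;
}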
> At that point, is the complexity so much that it no
> longer makes sense to try to protect against root
> crashing the system?
>
 So your suggestion is to leave the issue alone, without any protection;
 root users know what they are doing.
 
 Thanks
 zhongjiang



Re: [RESENT PATCH] x86/mem: fix the offset overflow when read/write mem

2017-05-03 Thread zhong jiang
On 2017/5/3 4:54, David Rientjes wrote:
> On Thu, 27 Apr 2017, zhongjiang wrote:
>
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> Recently I found the following issue; it results in a panic.
>>
>> [  168.739152] mmap1: Corrupted page table at address 7f3e6275a002
>> [  168.745039] PGD 61f4a1067
>> [  168.745040] PUD 61ab19067
>> [  168.747730] PMD 61fb8b067
>> [  168.750418] PTE 80001225
>> [  168.753109]
>> [  168.757795] Bad pagetable: 000d [#1] SMP
>> [  168.761696] Modules linked in: intel_powerclamp coretemp kvm_intel kvm 
>> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel 
>> crypto_simd iTCO_wdt glue_helper cryptd sg iTCO_vendor_support i7core_edac 
>> edac_core shpchp lpc_ich i2c_i801 pcspkr mfd_core acpi_cpufreq ip_tables xfs 
>> libcrc32c sd_mod igb ata_generic ptp pata_acpi pps_core mptsas ata_piix 
>> scsi_transport_sas i2c_algo_bit libata mptscsih i2c_core serio_raw 
>> crc32c_intel bnx2 mptbase dca dm_mirror dm_region_hash dm_log dm_mod
>> [  168.805983] CPU: 15 PID: 10369 Comm: mmap1 Not tainted 
>> 4.11.0-rc2-327.28.3.53.x86_64+ #345
>> [  168.814202] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285 
>>  /BC11BTSA  , BIOS CTSAV036 04/27/2011
>> [  168.825704] task: 8806207d5200 task.stack: c9000c34
>> [  168.831592] RIP: 0033:0x7f3e622c5360
>> [  168.835147] RSP: 002b:7ffe2bb7a098 EFLAGS: 00010203
>> [  168.840344] RAX: 7ffe2bb7a0c0 RBX:  RCX: 
>> 7f3e6275a000
>> [  168.847439] RDX: 7f3e622c5360 RSI: 7f3e6275a000 RDI: 
>> 7ffe2bb7a0c0
>> [  168.854535] RBP: 7ffe2bb7a4e0 R08: 7f3e621c3d58 R09: 
>> 002d
>> [  168.861632] R10: 7ffe2bb79e20 R11: 7f3e622fbcb0 R12: 
>> 004005d0
>> [  168.868728] R13: 7ffe2bb7a5c0 R14:  R15: 
>> 
>> [  168.875825] FS:  7f3e62752740() GS:880627bc() 
>> knlGS:
>> [  168.883870] CS:  0010 DS:  ES:  CR0: 80050033
>> [  168.889583] CR2: 7f3e6275a002 CR3: 000622845000 CR4: 
>> 06e0
>> [  168.896680] RIP: 0x7f3e622c5360 RSP: 7ffe2bb7a098
>> [  168.901713] ---[ end trace ef98fa9f2a01cbc6 ]---
>> [  168.90630 arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
>> [  168.935410] Modules linked in: intel_powerclamp coretemp kvm_intel kvm 
>> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel 
>> crypto_simd iTCO_wdt glue_helper cryptd sg iTCO_vendor_support i7core_edac 
>> edac_core shpchp lpc_ich i2c_i801 pcspkr mfd_core acpi_cpufreq ip_tables xfs 
>> libcrc32c sd_mod igb ata_generic ptp pata_acpi pps_core mptsas ata_piix 
>> scsi_transport_sas i2c_algo_bit libata mptscsih i2c_core serio_raw 
>> crc32c_intel bnx2 mptbase dca dm_mirror dm_region_hash dm_log dm_mod
>> [  168.979686] CPU: 15 PID: 10369 Comm: mmap1 Tainted: G  D 
>> 4.11.0-rc2-327.28.3.53.x86_64+ #345
>> [  168.989114] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285 
>>  /BC11BTSA  , BIOS CTSAV036 04/27/2011
>> [  169.000616] Call Trace:
>> [  169.003049]  
>> [  169.005050]  dump_stack+0x63/0x84
>> [  169.008348]  __warn+0xd1/0xf0
>> [  169.011297]  warn_slowpath_null+0x1d/0x20
>> [  169.015282]  native_smp_send_reschedule+0x3f/0x50
>> [  169.019961]  resched_curr+0xa1/0xc0
>> [  169.023428]  check_preempt_curr+0x70/0x90
>> [  169.027415]  ttwu_do_wakeup+0x1e/0x160
>> [  169.031142]  ttwu_do_activate+0x77/0x80
>> [  169.034956]  try_to_wake_up+0x1c3/0x430
>> [  169.038771]  default_wake_function+0x12/0x20
>> [  169.043019]  __wake_up_common+0x55/0x90
>> [  169.046833]  __wake_up_locked+0x13/0x20
>> [  169.050649]  ep_poll_callback+0xbb/0x240
>> [  169.054550]  __wake_up_common+0x55/0x90
>> [  169.058363]  __wake_up+0x39/0x50
>> [  169.061574]  wake_up_klogd_work_func+0x40/0x60
>> [  169.065994]  irq_work_run_list+0x4d/0x70
>> [  169.069895]  irq_work_tick+0x40/0x50
>> [  169.073452]  update_process_times+0x42/0x60
>> [  169.077612]  tick_periodic+0x2b/0x80
>> [  169.081166]  tick_handle_periodic+0x25/0x70
>> [  169.085326]  local_apic_timer_interrupt+0x35/0x60
>> [  169.090004]  smp_apic_timer_interrupt+0x38/0x50
>> [  169.094507]  apic_timer_interrupt+0x93/0xa0
>> [  169.098667] RIP: 0010:panic+0x1f5/0x239
>> [  169.102480] RSP: :c9000c343dd8 EFLAGS: 0246 ORIG_RAX: 
>> ff10
>> [  169.110010] RAX: 0034 RBX: 000

Re: [RESENT PATCH] x86/mem: fix the offset overflow when read/write mem

2017-05-01 Thread zhong jiang
Ping.

Does anyone have any objections?
On 2017/4/27 19:49, zhongjiang wrote:
> From: zhong jiang <zhongji...@huawei.com>
>
> Recently I found the following issue; it results in a panic.
>
> [  168.739152] mmap1: Corrupted page table at address 7f3e6275a002
> [  168.745039] PGD 61f4a1067
> [  168.745040] PUD 61ab19067
> [  168.747730] PMD 61fb8b067
> [  168.750418] PTE 80001225
> [  168.753109]
> [  168.757795] Bad pagetable: 000d [#1] SMP
> [  168.761696] Modules linked in: intel_powerclamp coretemp kvm_intel kvm 
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel 
> crypto_simd iTCO_wdt glue_helper cryptd sg iTCO_vendor_support i7core_edac 
> edac_core shpchp lpc_ich i2c_i801 pcspkr mfd_core acpi_cpufreq ip_tables xfs 
> libcrc32c sd_mod igb ata_generic ptp pata_acpi pps_core mptsas ata_piix 
> scsi_transport_sas i2c_algo_bit libata mptscsih i2c_core serio_raw 
> crc32c_intel bnx2 mptbase dca dm_mirror dm_region_hash dm_log dm_mod
> [  168.805983] CPU: 15 PID: 10369 Comm: mmap1 Not tainted 
> 4.11.0-rc2-327.28.3.53.x86_64+ #345
> [  168.814202] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285  
> /BC11BTSA  , BIOS CTSAV036 04/27/2011
> [  168.825704] task: 8806207d5200 task.stack: c9000c34
> [  168.831592] RIP: 0033:0x7f3e622c5360
> [  168.835147] RSP: 002b:7ffe2bb7a098 EFLAGS: 00010203
> [  168.840344] RAX: 7ffe2bb7a0c0 RBX:  RCX: 
> 7f3e6275a000
> [  168.847439] RDX: 7f3e622c5360 RSI: 7f3e6275a000 RDI: 
> 7ffe2bb7a0c0
> [  168.854535] RBP: 7ffe2bb7a4e0 R08: 7f3e621c3d58 R09: 
> 002d
> [  168.861632] R10: 7ffe2bb79e20 R11: 7f3e622fbcb0 R12: 
> 004005d0
> [  168.868728] R13: 7ffe2bb7a5c0 R14:  R15: 
> 
> [  168.875825] FS:  7f3e62752740() GS:880627bc() 
> knlGS:
> [  168.883870] CS:  0010 DS:  ES:  CR0: 80050033
> [  168.889583] CR2: 7f3e6275a002 CR3: 000622845000 CR4: 
> 06e0
> [  168.896680] RIP: 0x7f3e622c5360 RSP: 7ffe2bb7a098
> [  168.901713] ---[ end trace ef98fa9f2a01cbc6 ]---
> [  168.90630 arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
> [  168.935410] Modules linked in: intel_powerclamp coretemp kvm_intel kvm 
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel 
> crypto_simd iTCO_wdt glue_helper cryptd sg iTCO_vendor_support i7core_edac 
> edac_core shpchp lpc_ich i2c_i801 pcspkr mfd_core acpi_cpufreq ip_tables xfs 
> libcrc32c sd_mod igb ata_generic ptp pata_acpi pps_core mptsas ata_piix 
> scsi_transport_sas i2c_algo_bit libata mptscsih i2c_core serio_raw 
> crc32c_intel bnx2 mptbase dca dm_mirror dm_region_hash dm_log dm_mod
> [  168.979686] CPU: 15 PID: 10369 Comm: mmap1 Tainted: G  D 
> 4.11.0-rc2-327.28.3.53.x86_64+ #345
> [  168.989114] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285  
> /BC11BTSA  , BIOS CTSAV036 04/27/2011
> [  169.000616] Call Trace:
> [  169.003049]  
> [  169.005050]  dump_stack+0x63/0x84
> [  169.008348]  __warn+0xd1/0xf0
> [  169.011297]  warn_slowpath_null+0x1d/0x20
> [  169.015282]  native_smp_send_reschedule+0x3f/0x50
> [  169.019961]  resched_curr+0xa1/0xc0
> [  169.023428]  check_preempt_curr+0x70/0x90
> [  169.027415]  ttwu_do_wakeup+0x1e/0x160
> [  169.031142]  ttwu_do_activate+0x77/0x80
> [  169.034956]  try_to_wake_up+0x1c3/0x430
> [  169.038771]  default_wake_function+0x12/0x20
> [  169.043019]  __wake_up_common+0x55/0x90
> [  169.046833]  __wake_up_locked+0x13/0x20
> [  169.050649]  ep_poll_callback+0xbb/0x240
> [  169.054550]  __wake_up_common+0x55/0x90
> [  169.058363]  __wake_up+0x39/0x50
> [  169.061574]  wake_up_klogd_work_func+0x40/0x60
> [  169.065994]  irq_work_run_list+0x4d/0x70
> [  169.069895]  irq_work_tick+0x40/0x50
> [  169.073452]  update_process_times+0x42/0x60
> [  169.077612]  tick_periodic+0x2b/0x80
> [  169.081166]  tick_handle_periodic+0x25/0x70
> [  169.085326]  local_apic_timer_interrupt+0x35/0x60
> [  169.090004]  smp_apic_timer_interrupt+0x38/0x50
> [  169.094507]  apic_timer_interrupt+0x93/0xa0
> [  169.098667] RIP: 0010:panic+0x1f5/0x239
> [  169.102480] RSP: :c9000c343dd8 EFLAGS: 0246 ORIG_RAX: 
> ff10
> [  169.110010] RAX: 0034 RBX:  RCX: 
> 0006
> [  169.117106] RDX:  RSI: 0086 RDI: 
> 880627bcdfe0
> [  169.124201] RBP: c9000c343e48 R08: fffe R09: 
> 0395
> [  169.131298] R10: 0005 R11: 0394 R12: 
> 81a0c475
> [  169.138395] R13: 000
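
(A note for readers decoding the warning above: the check at
arch/x86/kernel/smp.c:127 is in the reschedule-IPI path. In kernels of that
era it looks roughly like the sketch below; this is paraphrased from memory,
not verbatim code:

static void native_smp_send_reschedule(int cpu)
{
	if (unlikely(cpu_is_offline(cpu))) {
		/* asked to kick a CPU that is already offline */
		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
		return;
	}
	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
}

During panic() the other CPUs are being taken down, so a printk wakeup driven
from the timer tick can race with the offlining and trip the WARN; it is a
symptom of the panic above, not a second independent bug.)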



Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

2017-04-22 Thread zhong jiang
Hi, Dashi

I have hit the same issue every other week. Have you solved it? I would like
to know how it was fixed, and whether the fix exists in mainline.

Thanks
zhongjiang
On 2016/12/23 10:38, Dashi DS1 Cao wrote:
> I'd expected that one or more tasks doing the free were the current task of 
> other cpu cores, but only one of the four dumps has two swapd task that are 
> concurrently at execution on different cpu.
> This is the task leading to the crash:
> PID: 247TASK: 881fcfad8000  CPU: 14  COMMAND: "kswapd1"
>  #0 [881fcfad7978] machine_kexec at 81051e9b
>  #1 [881fcfad79d8] crash_kexec at 810f27e2
>  #2 [881fcfad7aa8] oops_end at 8163f448
>  #3 [881fcfad7ad0] die at 8101859b
>  #4 [881fcfad7b00] do_general_protection at 8163ed3e
>  #5 [881fcfad7b30] general_protection at 8163e5e8
> [exception RIP: down_read_trylock+9]
> RIP: 810aa9f9  RSP: 881fcfad7be0  RFLAGS: 00010286
> RAX:   RBX: 882b47ddadc0  RCX: 
> RDX:   RSI:   RDI: 91550b2b32f5a3e8
> RBP: 881fcfad7be0   R8: ea00ecc28860   R9: 883fcffeae28
> R10: 81a691a0  R11: 0001  R12: 882b47ddadc1
> R13: ea00ecc28840  R14: 91550b2b32f5a3e8  R15: ea00ecc28840
> ORIG_RAX:   CS: 0010  SS: 
>  #6 [881fcfad7be8] page_lock_anon_vma_read at 811a3365
>  #7 [881fcfad7c18] page_referenced at 811a35e7
>  #8 [881fcfad7c90] shrink_active_list at 8117e8cc
>  #9 [881fcfad7d48] balance_pgdat at 81180288
> #10 [881fcfad7e20] kswapd at 81180813
> #11 [881fcfad7ec8] kthread at 810a5b8f
> #12 [881fcfad7f50] ret_from_fork at 81646a98
>
> And this is the one at the same time:
> PID: 246TASK: 881fd27af300  CPU: 20  COMMAND: "kswapd0"
>  #0 [881fffd05e70] crash_nmi_callback at 81045982
>  #1 [881fffd05e80] nmi_handle at 8163f5d9
>  #2 [881fffd05ec8] do_nmi at 8163f6f0
>  #3 [881fffd05ef0] end_repeat_nmi at 8163ea13
> [exception RIP: free_pcppages_bulk+529]
> RIP: 81171ae1  RSP: 881fcfad38f0  RFLAGS: 0087
> RAX: 002f002c  RBX: ea007606ae40  RCX: 
> RDX: ea007606ae00  RSI: 02b9  RDI: 
> RBP: 881fcfad3978   R8:    R9: 0001
> R10: 88207ffda000  R11: 0002  R12: ea007606ae40
> R13: 0002  R14: 88207ffda000  R15: 02b8
> ORIG_RAX:   CS: 0010  SS: 0018
> ---  ---
>  #4 [881fcfad38f0] free_pcppages_bulk at 81171ae1
>  #5 [881fcfad3980] free_hot_cold_page at 81171f08
>  #6 [881fcfad39b8] free_hot_cold_page_list at 81171f76
>  #7 [881fcfad39f0] shrink_page_list at 8117d71e
>  #8 [881fcfad3b28] shrink_inactive_list at 8117e37a
>  #9 [881fcfad3bf0] shrink_lruvec at 8117ee45
> #10 [881fcfad3cf0] shrink_zone at 8117f2a6
> #11 [881fcfad3d48] balance_pgdat at 8118054c
> #12 [881fcfad3e20] kswapd at 81180813
> #13 [881fcfad3ec8] kthread at 810a5b8f
> #14 [881fcfad3f50] ret_from_fork at 81646a98
>
> I hope the information would be useful.
> Dashi Cao
>
> -Original Message-
> From: Hugh Dickins [mailto:hu...@google.com] 
> Sent: Friday, December 23, 2016 6:27 AM
> To: Peter Zijlstra 
> Cc: Michal Hocko ; Dashi DS1 Cao ; 
> linux...@kvack.org; linux-kernel@vger.kernel.org; Hugh Dickins 
> 
> Subject: Re: A small window for a race condition in 
> mm/rmap.c:page_lock_anon_vma_read
>
> On Thu, 22 Dec 2016, Peter Zijlstra wrote:
>> On Wed, Dec 21, 2016 at 03:43:43PM +0100, Michal Hocko wrote:
>>> anon_vma locking is clever^Wsubtle as hell. CC Peter...
>>>
>>> On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
 I've collected four crash dumps with similar backtrace. 

 PID: 247TASK: 881fcfad8000  CPU: 14  COMMAND: "kswapd1"
  #0 [881fcfad7978] machine_kexec at 81051e9b
  #1 [881fcfad79d8] crash_kexec at 810f27e2
  #2 [881fcfad7aa8] oops_end at 8163f448
  #3 [881fcfad7ad0] die at 8101859b
  #4 [881fcfad7b00] do_general_protection at 8163ed3e
  #5 [881fcfad7b30] general_protection at 8163e5e8
 [exception RIP: down_read_trylock+9]
 RIP: 810aa9f9  RSP: 881fcfad7be0  RFLAGS: 00010286
 RAX:   RBX: 882b47ddadc0  RCX: 
 RDX:   RSI:   RDI: 
 91550b2b32f5a3e8
>>> rdi is obviously a mess - smells like a string. So either sombody 
>>> has overwritten root_anon_vma or this is really a use after free...
>> e8 - ???
>> a3 - ???
>> f5 - ???
>> 32 - 2
>> 2b - +
>>  b -
>>
>> 55 
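
For context, the window being discussed is in the lock-free anon_vma lookup on
the reclaim path. A simplified sketch of the pattern (paraphrasing the kernel's
page_get_anon_vma()/page_lock_anon_vma_read(); abbreviated and annotated, not
verbatim kernel code):

struct anon_vma *lookup_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma = NULL;
	unsigned long mapping;

	rcu_read_lock();
	mapping = (unsigned long)READ_ONCE(page->mapping);
	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
	/*
	 * The anon_vma can be freed, and its memory reused, at any moment;
	 * SLAB_DESTROY_BY_RCU only guarantees the memory remains *an*
	 * anon_vma while we are inside rcu_read_lock().  Taking the
	 * refcount is what actually pins the object:
	 */
	if (!atomic_inc_not_zero(&anon_vma->refcount))
		anon_vma = NULL;	/* already dying, back off */
out:
	rcu_read_unlock();
	return anon_vma;
}

The crashing variant additionally tries down_read_trylock(&anon_vma->root->rwsem)
as a fast path before pinning; if the object is freed and reused inside that
window, ->root can be arbitrary bytes, which would explain the string-like
value in RDI that is being decoded above.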

Re: [PATCH 2/5] mm: convert anon_vma.refcount from atomic_t to refcount_t

2017-04-22 Thread zhong jiang
Hi, Elena

Has this use-after-free issue actually occurred in practice? If so, why has
the patch not been merged? Or is such a situation only theoretically possible?

Thanks
zhongjiang
On 2017/2/20 18:49, Elena Reshetova wrote:
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.
>
> Signed-off-by: Elena Reshetova 
> Signed-off-by: Hans Liljestrand 
> Signed-off-by: Kees Cook 
> Signed-off-by: David Windsor 
> ---
>  include/linux/rmap.h |  7 ---
>  mm/rmap.c| 14 +++---
>  2 files changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 8c89e90..a8f4a97 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/refcount.h>
>  
>  /*
>   * The anon_vma heads a list of private "related" vmas, to scan if
> @@ -35,7 +36,7 @@ struct anon_vma {
>* the reference is responsible for clearing up the
>* anon_vma if they are the last user on release
>*/
> - atomic_t refcount;
> + refcount_t refcount;
>  
>   /*
>* Count of child anon_vmas and VMAs which points to this anon_vma.
> @@ -102,14 +103,14 @@ enum ttu_flags {
>  #ifdef CONFIG_MMU
>  static inline void get_anon_vma(struct anon_vma *anon_vma)
>  {
> - atomic_inc(&anon_vma->refcount);
> + refcount_inc(&anon_vma->refcount);
>  }
>  
>  void __put_anon_vma(struct anon_vma *anon_vma);
>  
>  static inline void put_anon_vma(struct anon_vma *anon_vma)
>  {
> - if (atomic_dec_and_test(&anon_vma->refcount))
> + if (refcount_dec_and_test(&anon_vma->refcount))
>   __put_anon_vma(anon_vma);
>  }
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8774791..3321c86 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -77,7 +77,7 @@ static inline struct anon_vma *anon_vma_alloc(void)
>  
>   anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>   if (anon_vma) {
> - atomic_set(&anon_vma->refcount, 1);
> + refcount_set(&anon_vma->refcount, 1);
>   anon_vma->degree = 1;   /* Reference for first vma */
>   anon_vma->parent = anon_vma;
>   /*
> @@ -92,7 +92,7 @@ static inline struct anon_vma *anon_vma_alloc(void)
>  
>  static inline void anon_vma_free(struct anon_vma *anon_vma)
>  {
> - VM_BUG_ON(atomic_read(&anon_vma->refcount));
> + VM_BUG_ON(refcount_read(&anon_vma->refcount));
>  
>   /*
>* Synchronize against page_lock_anon_vma_read() such that
> @@ -421,7 +421,7 @@ static void anon_vma_ctor(void *data)
>   struct anon_vma *anon_vma = data;
>  
>   init_rwsem(&anon_vma->rwsem);
> - atomic_set(&anon_vma->refcount, 0);
> + refcount_set(&anon_vma->refcount, 0);
>   anon_vma->rb_root = RB_ROOT;
>  }
>  
> @@ -470,7 +470,7 @@ struct anon_vma *page_get_anon_vma(struct page *page)
>   goto out;
>  
>   anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> - if (!atomic_inc_not_zero(&anon_vma->refcount)) {
> + if (!refcount_inc_not_zero(&anon_vma->refcount)) {
>   anon_vma = NULL;
>   goto out;
>   }
> @@ -529,7 +529,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page 
> *page)
>   }
>  
>   /* trylock failed, we got to sleep */
> - if (!atomic_inc_not_zero(&anon_vma->refcount)) {
> + if (!refcount_inc_not_zero(&anon_vma->refcount)) {
>   anon_vma = NULL;
>   goto out;
>   }
> @@ -544,7 +544,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page 
> *page)
>   rcu_read_unlock();
>   anon_vma_lock_read(anon_vma);
>  
> - if (atomic_dec_and_test(&anon_vma->refcount)) {
> + if (refcount_dec_and_test(&anon_vma->refcount)) {
>   /*
>* Oops, we held the last refcount, release the lock
>* and bail -- can't simply use put_anon_vma() because
> @@ -1577,7 +1577,7 @@ void __put_anon_vma(struct anon_vma *anon_vma)
>   struct anon_vma *root = anon_vma->root;
>  
>   anon_vma_free(anon_vma);
> - if (root != anon_vma && atomic_dec_and_test(&root->refcount))
> + if (root != anon_vma && refcount_dec_and_test(&root->refcount))
>   anon_vma_free(root);
>  }
>  
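
The failure mode the changelog alludes to can be made concrete with a small
sketch (illustrative only, not part of the patch):

	atomic_t a = ATOMIC_INIT(INT_MAX);

	atomic_inc(&a);		/* silently wraps to INT_MIN; keep leaking
				 * references and the counter eventually
				 * passes 0 again, so a later
				 * atomic_dec_and_test() reports "last user"
				 * and frees an object that still has live
				 * references: a use-after-free */

	refcount_t r;

	refcount_set(&r, UINT_MAX);	/* the saturation value */
	refcount_inc(&r);		/* stays saturated and WARNs instead
					 * of wrapping, so
					 * refcount_dec_and_test() can never
					 * be fooled into a premature free */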



Re: NULL pointer dereference in the kernel 3.10

2017-04-10 Thread zhong jiang
On 2017/4/10 22:13, Willy Tarreau wrote:
> On Mon, Apr 10, 2017 at 10:06:59PM +0800, zhong jiang wrote:
>> On 2017/4/10 20:48, Michal Hocko wrote:
>>> On Mon 10-04-17 20:10:06, zhong jiang wrote:
>>>> On 2017/4/10 16:56, Mel Gorman wrote:
>>>>> On Sat, Apr 08, 2017 at 09:39:42PM +0800, zhong jiang wrote:
>>>>>> when runing the stabile docker cases in the vm.   The following issue 
>>>>>> will come up.
>>>>>>
>>>>>> #40 [8801b57ffb30] async_page_fault at 8165c9f8
>>>>>> [exception RIP: down_read_trylock+5]
>>>>>> RIP: 810aca65  RSP: 8801b57ffbe8  RFLAGS: 00010202
>>>>>> RAX:   RBX: 88018ae858c1  RCX: 
>>>>>> RDX:   RSI:   RDI: 0008
>>>>>> RBP: 8801b57ffc10   R8: ea0006903de0   R9: 8800b3c61810
>>>>>> R10: 22cb  R11:   R12: 88018ae858c0
>>>>>> R13: ea0006903dc0  R14: 0008  R15: ea0006903dc0
>>>>>> ORIG_RAX:   CS: 0010  SS: 
>>>>> Post the full report including the kernel version and state whether any
>>>>> additional patches to 3.10 are applied.
>>>>>
>>>>  Hi, Mel
>>>>
>>>> Our kernel from RHEL 7.2, Addtional patches all from upstream -- 
>>>> include Bugfix and CVE.
>>> I believe you should contact Redhat for the support. This is a) old
>>> kernel and b) with other patches which might or might not be relevant.
>>   Ok, regardless of the kernel version, we just discuss the situation in 
>> theory.  if commit
>>   624483f3ea8  ("mm: rmap: fix use-after-free in __put_anon_vma")  is not 
>> exist. the issue
>>  will trigger . Any thought.
> But this commit was backported into 3.10.43, so stable kernel users are safe.
>
> Regards,
> Willy
>
> .
  Yes. So you are sure that this commit fixes the issue?





Re: NULL pointer dereference in the kernel 3.10

2017-04-10 Thread zhong jiang
On 2017/4/10 22:06, Mel Gorman wrote:
> On Mon, Apr 10, 2017 at 08:10:06PM +0800, zhong jiang wrote:
>> On 2017/4/10 16:56, Mel Gorman wrote:
>>> On Sat, Apr 08, 2017 at 09:39:42PM +0800, zhong jiang wrote:
>>>> when runing the stabile docker cases in the vm.   The following issue will 
>>>> come up.
>>>>
>>>> #40 [8801b57ffb30] async_page_fault at 8165c9f8
>>>> [exception RIP: down_read_trylock+5]
>>>> RIP: 810aca65  RSP: 8801b57ffbe8  RFLAGS: 00010202
>>>> RAX:   RBX: 88018ae858c1  RCX: 
>>>> RDX:   RSI:   RDI: 0008
>>>> RBP: 8801b57ffc10   R8: ea0006903de0   R9: 8800b3c61810
>>>> R10: 22cb  R11:   R12: 88018ae858c0
>>>> R13: ea0006903dc0  R14: 0008  R15: ea0006903dc0
>>>> ORIG_RAX:   CS: 0010  SS: 
>>> Post the full report including the kernel version and state whether any
>>> additional patches to 3.10 are applied.
>>>
>>  Hi, Mel
>>
>> Our kernel from RHEL 7.2, Addtional patches all from upstream -- 
>> include Bugfix and CVE.
>>
>> Commit 624483f3ea8 ("mm: rmap: fix use-after-free in __put_anon_vma") 
>> exclude in
>> the RHEL 7.2. it looks seems to the issue. but I don't know how it triggered.
>> or it is not the correct fix.  Any suggestion? Thanks
>>
> I'm afraid you'll need to bring it up with RHEL support as it contains
> a number of backported patches from them that cannot be meaningfully
> evaluated outside of RedHat and they may have additional questions on the
> patches applied on top.
>
 Thanks




Re: NULL pointer dereference in the kernel 3.10

2017-04-10 Thread zhong jiang
On 2017/4/10 20:48, Michal Hocko wrote:
> On Mon 10-04-17 20:10:06, zhong jiang wrote:
>> On 2017/4/10 16:56, Mel Gorman wrote:
>>> On Sat, Apr 08, 2017 at 09:39:42PM +0800, zhong jiang wrote:
>>>> when runing the stabile docker cases in the vm.   The following issue will 
>>>> come up.
>>>>
>>>> #40 [8801b57ffb30] async_page_fault at 8165c9f8
>>>> [exception RIP: down_read_trylock+5]
>>>> RIP: 810aca65  RSP: 8801b57ffbe8  RFLAGS: 00010202
>>>> RAX:   RBX: 88018ae858c1  RCX: 
>>>> RDX:   RSI:   RDI: 0008
>>>> RBP: 8801b57ffc10   R8: ea0006903de0   R9: 8800b3c61810
>>>> R10: 22cb  R11:   R12: 88018ae858c0
>>>> R13: ea0006903dc0  R14: 0008  R15: ea0006903dc0
>>>> ORIG_RAX:   CS: 0010  SS: 
>>> Post the full report including the kernel version and state whether any
>>> additional patches to 3.10 are applied.
>>>
>>  Hi, Mel
>>
>> Our kernel from RHEL 7.2, Addtional patches all from upstream -- 
>> include Bugfix and CVE.
> I believe you should contact Redhat for the support. This is a) old
> kernel and b) with other patches which might or might not be relevant.
  OK, regardless of the kernel version, let us discuss the situation in
theory: if commit 624483f3ea8 ("mm: rmap: fix use-after-free in
__put_anon_vma") did not exist, would this issue trigger? Any thoughts?

Thanks
zhongjiang 





Re: NULL pointer dereference in the kernel 3.10

2017-04-10 Thread zhong jiang
On 2017/4/10 16:56, Mel Gorman wrote:
> On Sat, Apr 08, 2017 at 09:39:42PM +0800, zhong jiang wrote:
>> when runing the stabile docker cases in the vm.   The following issue will 
>> come up.
>>
>> #40 [8801b57ffb30] async_page_fault at 8165c9f8
>> [exception RIP: down_read_trylock+5]
>> RIP: 810aca65  RSP: 8801b57ffbe8  RFLAGS: 00010202
>> RAX:   RBX: 88018ae858c1  RCX: 
>> RDX:   RSI:   RDI: 0008
>> RBP: 8801b57ffc10   R8: ea0006903de0   R9: 8800b3c61810
>> R10: 22cb  R11:   R12: 88018ae858c0
>> R13: ea0006903dc0  R14: 0008  R15: ea0006903dc0
>> ORIG_RAX:   CS: 0010  SS: 
> Post the full report including the kernel version and state whether any
> additional patches to 3.10 are applied.
>
 Hi, Mel

Our kernel is based on RHEL 7.2; the additional patches are all from
upstream, including bug fixes and CVE fixes.

Commit 624483f3ea8 ("mm: rmap: fix use-after-free in __put_anon_vma") is not
included in RHEL 7.2. It looks related to this issue, but I don't know how
the issue is triggered, or whether that commit is the correct fix. Any
suggestions? Thanks


Part of the dmesg log is shown below.

[59982.162223] EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: 
(null)
[59985.261635] device-mapper: ioctl: remove_all left 8 open device(s)
[59986.492174] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: 
(null)
[59987.445606] device-mapper: ioctl: remove_all left 8 open device(s)
[59987.625887] EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: 
(null)
[59988.174600] device-mapper: ioctl: remove_all left 8 open device(s)
[59988.345667] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: 
(null)
[59990.951713] EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: 
(null)
[59991.025185] device vethd295793 entered promiscuous mode
[59991.025253] IPv6: ADDRCONF(NETDEV_UP): vethd295793: link is not ready
[59991.860817] IPv6: ADDRCONF(NETDEV_CHANGE): vethd295793: link becomes ready
[59991.860836] docker0: port 4(vethd295793) entered forwarding state
[59991.860840] docker0: port 4(vethd295793) entered forwarding state
[59992.704027] docker0: port 4(vethd295793) entered disabled state
[59992.724049] EXT4-fs (dm-9): mounted filesystem with ordered data mode. Opts: 
(null)
[59993.098341] docker0: port 4(vethd295793) entered disabled state
[59993.102583] device vethd295793 left promiscuous mode
[59993.102605] docker0: port 4(vethd295793) entered disabled state
[59995.109048] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: 
(null)
[59995.229390] docker0: port 2(veth2ad76e2) entered disabled state
[59995.523997] docker0: port 2(veth2ad76e2) entered disabled state
[59995.528183] device veth2ad76e2 left promiscuous mode
[59995.528202] docker0: port 2(veth2ad76e2) entered disabled state
[59995.975559] device-mapper: ioctl: remove_all left 8 open device(s)
[59996.084575] EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: 
(null)
[59996.660641] device-mapper: ioctl: remove_all left 7 open device(s)
[59997.109018] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: 
(null)
[59998.360101] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: 
(null)
[60001.721429] EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: 
(null)
[60001.771433] device vethcca3b6a entered promiscuous mode
[60001.771643] IPv6: ADDRCONF(NETDEV_UP): vethcca3b6a: link is not ready
[60002.872102] IPv6: ADDRCONF(NETDEV_CHANGE): vethcca3b6a: link becomes ready
[60002.872124] docker0: port 2(vethcca3b6a) entered forwarding state
[60002.872130] docker0: port 2(vethcca3b6a) entered forwarding state
[60005.041654] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: 
(null)
[60005.597179] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: 
(null)
[60013.731728] [/usr/bin/os_rotate_and_save_log.sh]space of output directory is 
larger than 500M bytes,delete the oldest tar file 
messages-20170321181104-129.tar.bz2
[60016.243601] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: 
(null)
[60016.669594] device-mapper: ioctl: remove_all left 9 open device(s)
[60016.930232] EXT4-fs (dm-9): mounted filesystem with ordered data mode. Opts: 
(null)
[60017.918511] docker0: port 2(vethcca3b6a) entered forwarding state
[60022.197574] device-mapper: ioctl: remove_all left 8 open device(s)
[60022.575774] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: 
(null)
[60023.288744] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: 
(null)
[60024.282579] device-mapper: ioctl: remove_all left 8 open device(s)
[60024.505905] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: 
(null)
[60024.934311] EXT4-fs


NULL pointer dereference in the kernel 3.10

2017-04-08 Thread zhong jiang
When running the Docker stability test cases in a VM, the following issue
comes up.

#40 [8801b57ffb30] async_page_fault at 8165c9f8
[exception RIP: down_read_trylock+5]
RIP: 810aca65  RSP: 8801b57ffbe8  RFLAGS: 00010202
RAX:   RBX: 88018ae858c1  RCX: 
RDX:   RSI:   RDI: 0008
RBP: 8801b57ffc10   R8: ea0006903de0   R9: 8800b3c61810
R10: 22cb  R11:   R12: 88018ae858c0
R13: ea0006903dc0  R14: 0008  R15: ea0006903dc0
ORIG_RAX:   CS: 0010  SS: 
#41 [8801b57ffbe8] page_lock_anon_vma_read at 811b241c
#42 [8801b57ffc18] page_referenced at 811b26a7
#43 [8801b57ffc90] shrink_active_list at 8118d634
#44 [8801b57ffd48] balance_pgdat at 8118f088
#45 [8801b57ffe20] kswapd at 8118f633
#46 [8801b57ffec8] kthread at 810a795f
#47 [8801b57fff50] ret_from_fork at 81665398
crash> struct page.mapping ea0006903dc0
  mapping = 0x88018ae858c1
crash> struct anon_vma 0x88018ae858c0
struct anon_vma {
  root = 0x0,
  rwsem = {
count = 0,
wait_lock = {
  raw_lock = {
{
  head_tail = 1,
  tickets = {
head = 1,
tail = 0
  }
}
  }
},
wait_list = {
  next = 0x0,
  prev = 0x0
}
  },
  refcount = {
counter = 0
  },
  rb_root = {
rb_node = 0x0
  }
}

This makes me wonder whether the anon_vma really came from the slab cache at
all, since its contents are abnormal; IMO, at the very least anon_vma->root
should not be NULL. The issue can be reproduced every other week.
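
Two details of the dump above are worth spelling out. First, crash derived the
anon_vma pointer from page->mapping by stripping the low tag bit; for anonymous
pages the kernel uses the standard encoding (sketch):

	/* page->mapping for anon pages: anon_vma pointer | PAGE_MAPPING_ANON */
	anon_vma = (struct anon_vma *)((unsigned long)page->mapping
					- PAGE_MAPPING_ANON);
	/* ...858c1 - 1 == ...858c0, matching the two values above */

Second, the faulting RDI of 0008 is consistent with root == NULL: root is the
first field of struct anon_vma, so &anon_vma->root->rwsem evaluates to
offsetof(struct anon_vma, rwsem), which is 8 on x86_64, and that is the
address down_read_trylock() dereferenced.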

Any comments will be appreciated.

Thanks
zhongjiang










Re: [PATCH] mm: do not export ioremap_page_range symbol for external module

2017-01-23 Thread zhong jiang
On 2017/1/23 9:30, John Hubbard wrote:
>
>
> On 01/22/2017 05:14 PM, zhong jiang wrote:
>> On 2017/1/22 20:58, zhongjiang wrote:
>>> From: zhong jiang <zhongji...@huawei.com>
>>>
>>> Recently, I find the ioremap_page_range had been abusing. The improper
>>> address mapping is a issue. it will result in the crash. so, remove
>>> the symbol. It can be replaced by the ioremap_cache or others symbol.
>>>
>>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>>> ---
>>>  lib/ioremap.c | 1 -
>>>  1 file changed, 1 deletion(-)
>>>
>>> diff --git a/lib/ioremap.c b/lib/ioremap.c
>>> index 86c8911..a3e14ce 100644
>>> --- a/lib/ioremap.c
>>> +++ b/lib/ioremap.c
>>> @@ -144,4 +144,3 @@ int ioremap_page_range(unsigned long addr,
>>>
>>>  return err;
>>>  }
>>> -EXPORT_SYMBOL_GPL(ioremap_page_range);
>> self nack
>>
>
> heh. What changed your mind?
>
  Very sorry, I had mistakenly assumed our own kernel modules call it
  directly. Thank you for reviewing the patch; I will take your changelog
  and send a v2.

  Thanks
  zhongjiang
> .
>
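
For module authors who were calling ioremap_page_range() directly, the point
of the change above is to use the higher-level mapping helpers, which allocate
a valid chunk of vmalloc space themselves instead of trusting a caller-supplied
virtual address. A minimal sketch (illustrative; CTRL_REG is a made-up register
offset and error handling is trimmed):

	void __iomem *regs;

	regs = ioremap(phys_addr, size);	/* or ioremap_cache() when a
						 * cacheable mapping is wanted */
	if (!regs)
		return -ENOMEM;

	writel(0x1, regs + CTRL_REG);
	/* ... */
	iounmap(regs);

Because the virtual range comes from the allocator rather than from the
caller, a buggy module cannot map its pages over somebody else's vmalloc
area, which is the kind of improper mapping the changelog describes.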







Re: [PATCH] mm: do not export ioremap_page_range symbol for external module

2017-01-22 Thread zhong jiang
On 2017/1/22 20:58, zhongjiang wrote:
> From: zhong jiang <zhongji...@huawei.com>
>
> Recently, I find the ioremap_page_range had been abusing. The improper
> address mapping is a issue. it will result in the crash. so, remove
> the symbol. It can be replaced by the ioremap_cache or others symbol.
>
> Signed-off-by: zhong jiang <zhongji...@huawei.com>
> ---
>  lib/ioremap.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/lib/ioremap.c b/lib/ioremap.c
> index 86c8911..a3e14ce 100644
> --- a/lib/ioremap.c
> +++ b/lib/ioremap.c
> @@ -144,4 +144,3 @@ int ioremap_page_range(unsigned long addr,
>  
>   return err;
>  }
> -EXPORT_SYMBOL_GPL(ioremap_page_range);
self nack






Re: [RESEND PATCH 2/2] arm64: make WANT_HUGE_PMD_SHARE depends on HUGETLB_PAGE

2016-12-16 Thread zhong jiang
On 2016/12/16 20:35, Will Deacon wrote:
> On Fri, Dec 16, 2016 at 05:10:05PM +0800, zhong jiang wrote:
>> On 2016/12/14 22:19, zhongjiang wrote:
>>> From: zhong jiang <zhongji...@huawei.com>
>>>
>>> when HUGETLB_PAGE is disable, WANT_HUGE_PMD_SHARE contains the
>>> fuctions should not be use. therefore, we add the dependency.
>>>
>>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>>> ---
>>>  arch/arm64/Kconfig | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 969ef88..694ca73 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -640,6 +640,7 @@ config SYS_SUPPORTS_HUGETLBFS
>>>  
>>>  config ARCH_WANT_HUGE_PMD_SHARE
>>> def_bool y if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>>> +   depends on HUGETLB_PAGE
>>>  
>>>  config ARCH_HAS_CACHE_LINE_SIZE
>>> def_bool y
>> Hi,
>>   I still think it is a issue. Perhaps above changelog is unclear.  
>> Further explain is as follows.
>>  when hugetlb_pages is disable and arch_want_huge_pmd_share is enable,   we 
>> maybe call
>>  huge_pmd_sahre in hugetlbpage.c, but the function actually is not 
>> definition as it is not
>>  exported.  is it right ??
> The only users of ARCH_WANT_HUGE_PMD_SHARE on arm64 are:
>
> arch/arm64/mm/hugetlbpage.c:if 
> (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
> mm/hugetlb.c:#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> mm/hugetlb.c:#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
> mm/hugetlb.c:#endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>
> and neither of those files are compiled if !HUGETLB_PAGE.
>
> Are you actually seeing a problem here?
>
> Will
>
> .
>
  I got it. Please disregard this patch and my comments.

  But the first patch looks reasonable. Has it been accepted?

 Thanks
 zhongjiang






Re: [RESEND PATCH 2/2] arm64: make WANT_HUGE_PMD_SHARE depends on HUGETLB_PAGE

2016-12-16 Thread zhong jiang
On 2016/12/16 20:35, Will Deacon wrote:
> On Fri, Dec 16, 2016 at 05:10:05PM +0800, zhong jiang wrote:
>> On 2016/12/14 22:19, zhongjiang wrote:
>>> From: zhong jiang <zhongji...@huawei.com>
>>>
>>> when HUGETLB_PAGE is disable, WANT_HUGE_PMD_SHARE contains the
>>> fuctions should not be use. therefore, we add the dependency.
>>>
>>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>>> ---
>>>  arch/arm64/Kconfig | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 969ef88..694ca73 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -640,6 +640,7 @@ config SYS_SUPPORTS_HUGETLBFS
>>>  
>>>  config ARCH_WANT_HUGE_PMD_SHARE
>>> def_bool y if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>>> +   depends on HUGETLB_PAGE
>>>  
>>>  config ARCH_HAS_CACHE_LINE_SIZE
>>> def_bool y
>> Hi,
>>   I still think it is a issue. Perhaps above changelog is unclear.  
>> Further explain is as follows.
>>  when hugetlb_pages is disable and arch_want_huge_pmd_share is enable,   we 
>> maybe call
>>  huge_pmd_sahre in hugetlbpage.c, but the function actually is not 
>> definition as it is not
>>  exported.  is it right ??
> The only users of ARCH_WANT_HUGE_PMD_SHARE on arm64 are:
>
> arch/arm64/mm/hugetlbpage.c:if 
> (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
  yes
> mm/hugetlb.c:#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> mm/hugetlb.c:#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
> mm/hugetlb.c:#endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>
> and neither of those files are compiled if !HUGETLB_PAGE.
  Indeed, but a user could still call it when !HUGETLB_PAGE, because
include/linux/hugetlb.h does not expose the declaration when !HUGETLB_PAGE;
IOW, the function has no definition.

  Is that right, or am I missing the key point?
> Are you actually seeing a problem here?
  I found the issue by reviewing the code.

 Thanks
 zhongjiang
> Will
>
> .
>
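
Will's point, restated as the build wiring (approximate, paraphrased from the
Makefiles of that era, not verbatim):

	# mm/Makefile
	obj-$(CONFIG_HUGETLBFS)    += hugetlb.o
	# arch/arm64/mm/Makefile
	obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o

With HUGETLB_PAGE disabled (and therefore HUGETLBFS as well), neither
translation unit is compiled, so no call to huge_pmd_share() can be emitted
and the missing declaration in include/linux/hugetlb.h is harmless.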





Re: [RESEND PATCH 2/2] arm64: make WANT_HUGE_PMD_SHARE depends on HUGETLB_PAGE

2016-12-16 Thread zhong jiang
On 2016/12/14 22:19, zhongjiang wrote:
> From: zhong jiang <zhongji...@huawei.com>
>
> when HUGETLB_PAGE is disable, WANT_HUGE_PMD_SHARE contains the
> fuctions should not be use. therefore, we add the dependency.
>
> Signed-off-by: zhong jiang <zhongji...@huawei.com>
> ---
>  arch/arm64/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 969ef88..694ca73 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -640,6 +640,7 @@ config SYS_SUPPORTS_HUGETLBFS
>  
>  config ARCH_WANT_HUGE_PMD_SHARE
>   def_bool y if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> + depends on HUGETLB_PAGE
>  
>  config ARCH_HAS_CACHE_LINE_SIZE
>   def_bool y
Hi,

  I still think this is an issue; perhaps the changelog above was unclear, so
a further explanation follows: when HUGETLB_PAGE is disabled and
ARCH_WANT_HUGE_PMD_SHARE is enabled, we may call huge_pmd_share() in
hugetlbpage.c, but the function is not actually defined, since it is not
exported. Is that right?

Thanks
zhongjiang






Re: [RESEND PATCH 1/2] arm64: change from CONT_PMD_SHIFT to CONT_PTE_SHIFT

2016-12-14 Thread zhong jiang
On 2016/12/14 22:45, Ard Biesheuvel wrote:
> On 14 December 2016 at 14:19, zhongjiang <zhongji...@huawei.com> wrote:
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> I think that CONT_PTE_SHIFT is more reasonable even if the two have the same
>> value, and the patch is not any functional change.
>>
> This may be the case for 64k pages, but not for 16k pages, and I
> actually think add_default_hugepagesz() could be made unconditional,
> given that both 64k on 4k kernels and 2 MB on 16k kernels are useful
> hugepage sizes that are not otherwise available by default.
 I agree that we can make add_default_hugepagesz() unconditional,
 but I do not know the historical reason it was done this way. The patch
 is just based on the current kernel.

 By the way, please review the second patch if you have time. Any comments
 are welcome.

 Thanks
 zhongjiang
>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
> Reviewed-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>
>> ---
>>  arch/arm64/mm/hugetlbpage.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 2e49bd2..0a4c97b 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -323,7 +323,7 @@ static __init int setup_hugepagesz(char *opt)
>>  static __init int add_default_hugepagesz(void)
>>  {
>> if (size_to_hstate(CONT_PTES * PAGE_SIZE) == NULL)
>> -   hugetlb_add_hstate(CONT_PMD_SHIFT);
>> +   hugetlb_add_hstate(CONT_PTE_SHIFT);
>> return 0;
>>  }
>>  arch_initcall(add_default_hugepagesz);
>> --
>> 1.8.3.1
>>
> .
>




Re: [PATCH v2] kexec: add cond_resched into kimage_alloc_crash_control_pages

2016-12-08 Thread zhong jiang
On 2016/12/9 13:19, Eric W. Biederman wrote:
> zhong jiang <zhongji...@huawei.com> writes:
>
>> On 2016/12/8 17:41, Xunlei Pang wrote:
>>> On 12/08/2016 at 10:37 AM, zhongjiang wrote:
>>>> From: zhong jiang <zhongji...@huawei.com>
>>>>
> [snip]
>>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>>> index 5616755..bfc9621 100644
>>>> --- a/kernel/kexec_core.c
>>>> +++ b/kernel/kexec_core.c
>>>> @@ -441,6 +441,8 @@ static struct page 
>>>> *kimage_alloc_crash_control_pages(struct kimage *image,
>>>>while (hole_end <= crashk_res.end) {
>>>>unsigned long i;
>>>>  
>>>> +  cond_resched();
>>>> +
>>> I can't see why it would take a long time to loop inside, the job it does 
>>> is simply to find a control area
>>> not overlapped with image->segment[], you can see the loop "for (i = 0; i < 
>>> image->nr_segments; i++)",
>>> @hole_end will be advanced to the end of its next nearby segment once 
>>> overlap was detected each loop,
>>> also there are limited (<=16) segments, so it won't take long to locate the 
>>> right area.
>>>
>>> Am I missing something?
>>>
>>> Regards,
>>> Xunlei
>>   if crashkernel=auto is set on the command line, crashk_res.end can
>> exceed 4G, and the first control-page allocation will
>>   loop millions of times. If we set crashk_res.end to an even higher value
>>   manually, you can imagine the effect.
> Or in short the cond_resched is about keeping things reasonable when the
> loop has worst case behavior.
>
> Eric
>
>
  Yes. Thank you for the reply and comment.

  Regards,
  zhongjiang



Re: [PATCH v2] kexec: add cond_resched into kimage_alloc_crash_control_pages

2016-12-08 Thread zhong jiang
On 2016/12/8 17:41, Xunlei Pang wrote:
> On 12/08/2016 at 10:37 AM, zhongjiang wrote:
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> A soft lockup occurs when I run trinity on the kexec_load syscall;
>> the corresponding stack trace is as follows.
>>
>> [  237.235937] BUG: soft lockup - CPU#6 stuck for 22s! [trinity-c6:13859]
>> [  237.242699] Kernel panic - not syncing: softlockup: hung tasks
>> [  237.248573] CPU: 6 PID: 13859 Comm: trinity-c6 Tainted: G   O L 
>> V---   3.10.0-327.28.3.35.zhongjiang.x86_64 #1
>> [  237.259984] Hardware name: Huawei Technologies Co., Ltd. Tecal BH622 
>> V2/BC01SRSA0, BIOS RMIBV386 06/30/2014
>> [  237.269752]  8187626b 18cfde31 88184c803e18 
>> 81638f16
>> [  237.277471]  88184c803e98 8163278f 0008 
>> 88184c803ea8
>> [  237.285190]  88184c803e48 18cfde31 88184c803e67 
>> 
>> [  237.292909] Call Trace:
>> [  237.295404][] dump_stack+0x19/0x1b
>> [  237.301352]  [] panic+0xd8/0x214
>> [  237.306196]  [] watchdog_timer_fn+0x1cc/0x1e0
>> [  237.312157]  [] ? watchdog_enable+0xc0/0xc0
>> [  237.317955]  [] __hrtimer_run_queues+0xd2/0x260
>> [  237.324087]  [] hrtimer_interrupt+0xb0/0x1e0
>> [  237.329963]  [] ? call_softirq+0x1c/0x30
>> [  237.335500]  [] local_apic_timer_interrupt+0x37/0x60
>> [  237.342228]  [] smp_apic_timer_interrupt+0x3f/0x60
>> [  237.348771]  [] apic_timer_interrupt+0x6d/0x80
>> [  237.354967][] ? 
>> kimage_alloc_control_pages+0x80/0x270
>> [  237.362875]  [] ? kmem_cache_alloc_trace+0x1ce/0x1f0
>> [  237.369592]  [] ? do_kimage_alloc_init+0x1f/0x90
>> [  237.375992]  [] kimage_alloc_init+0x12a/0x180
>> [  237.382103]  [] SyS_kexec_load+0x20a/0x260
>> [  237.387957]  [] system_call_fastpath+0x16/0x1b
>>
>> The first control-page allocation may take too much time because
>> crashk_res.end can be set to a high value. We need to add cond_resched
>> to avoid the issue.
>>
>> The patch has been tested, and the above issue no longer appears.
>>
>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>> ---
>>  kernel/kexec_core.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index 5616755..bfc9621 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -441,6 +441,8 @@ static struct page 
>> *kimage_alloc_crash_control_pages(struct kimage *image,
>>  while (hole_end <= crashk_res.end) {
>>  unsigned long i;
>>  
>> +cond_resched();
>> +
> I can't see why it would take a long time to loop inside, the job it does is 
> simply to find a control area
> not overlapped with image->segment[], you can see the loop "for (i = 0; i < 
> image->nr_segments; i++)",
> @hole_end will be advanced to the end of its next nearby segment once overlap 
> was detected each loop,
> also there are limited (<=16) segments, so it won't take long to locate the 
> right area.
>
> Am I missing something?
>
> Regards,
> Xunlei
  If crashkernel=auto is set on the command line, crashk_res.end can exceed
4G, and the first control-page allocation will loop millions of times. If we
set crashk_res.end to an even higher value manually, you can imagine the
effect.
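
To make the scale concrete, a back-of-the-envelope sketch (the 4G end address
and the page-sized scan step are assumptions taken from the description
above, not measured values):

#include <stdio.h>

int main(void)
{
	/* assumed: crash region ending near 4G, scanned in 4K steps */
	unsigned long long crashk_end = 4ULL << 30;
	unsigned long long page_size  = 4096;

	/* worst case for the first allocation: one candidate hole per page */
	printf("up to ~%llu iterations\n", crashk_end / page_size);  /* ~1M */
	return 0;
}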

  Thanks
  zhongjiang
>>  if (hole_end > KEXEC_CRASH_CONTROL_MEMORY_LIMIT)
>>  break;
>>  /* See if I overlap any of the segments */
>
> .
>




Re: [PATCH] kexec: add cond_resched into kimage_alloc_crash_control_pages

2016-12-07 Thread zhong jiang
On 2016/12/8 9:50, Eric W. Biederman wrote:
> zhongjiang <zhongji...@huawei.com> writes:
>
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> A soft lockup occurs when I run trinity on the kexec_load syscall;
>> the corresponding stack trace is as follows.
> Overall that looks reasonable.  Why only every 256 page and not call
> cond_resched unconditionally?
>
> The function cond_resched won't reschedule unless the process has spent
> it's cpu quota anyway.
  The value was just from testing; I mistakenly thought it would reschedule
  immediately. Calling cond_resched unconditionally is a good choice. If you
  accept the change, I will resend it.
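
For reference, the two patterns being compared, as a kernel-context sketch
(more_candidates() and scan_one_candidate() are hypothetical stand-ins for
the hole-search loop, not real kernel functions):

#include <linux/types.h>
#include <linux/sched.h>	/* cond_resched() */

extern bool more_candidates(void);	/* hypothetical loop condition */
extern void scan_one_candidate(void);	/* hypothetical loop body */

static void scan_throttled(void)	/* the v1 approach */
{
	unsigned long count = 0;

	while (more_candidates()) {
		if (++count % 256 == 0)
			cond_resched();
		scan_one_candidate();
	}
}

static void scan_unconditional(void)	/* Eric's suggestion */
{
	while (more_candidates()) {
		/* cheap: only yields once the task has used up its slice */
		cond_resched();
		scan_one_candidate();
	}
}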
> Eric
>
>> [  237.235937] BUG: soft lockup - CPU#6 stuck for 22s! [trinity-c6:13859]
>> [  237.242699] Kernel panic - not syncing: softlockup: hung tasks
>> [  237.248573] CPU: 6 PID: 13859 Comm: trinity-c6 Tainted: G   O L 
>> V---   3.10.0-327.28.3.35.zhongjiang.x86_64 #1
>> [  237.259984] Hardware name: Huawei Technologies Co., Ltd. Tecal BH622 
>> V2/BC01SRSA0, BIOS RMIBV386 06/30/2014
>> [  237.269752]  8187626b 18cfde31 88184c803e18 
>> 81638f16
>> [  237.277471]  88184c803e98 8163278f 0008 
>> 88184c803ea8
>> [  237.285190]  88184c803e48 18cfde31 88184c803e67 
>> 
>> [  237.292909] Call Trace:
>> [  237.295404][] dump_stack+0x19/0x1b
>> [  237.301352]  [] panic+0xd8/0x214
>> [  237.306196]  [] watchdog_timer_fn+0x1cc/0x1e0
>> [  237.312157]  [] ? watchdog_enable+0xc0/0xc0
>> [  237.317955]  [] __hrtimer_run_queues+0xd2/0x260
>> [  237.324087]  [] hrtimer_interrupt+0xb0/0x1e0
>> [  237.329963]  [] ? call_softirq+0x1c/0x30
>> [  237.335500]  [] local_apic_timer_interrupt+0x37/0x60
>> [  237.342228]  [] smp_apic_timer_interrupt+0x3f/0x60
>> [  237.348771]  [] apic_timer_interrupt+0x6d/0x80
>> [  237.354967][] ? 
>> kimage_alloc_control_pages+0x80/0x270
>> [  237.362875]  [] ? kmem_cache_alloc_trace+0x1ce/0x1f0
>> [  237.369592]  [] ? do_kimage_alloc_init+0x1f/0x90
>> [  237.375992]  [] kimage_alloc_init+0x12a/0x180
>> [  237.382103]  [] SyS_kexec_load+0x20a/0x260
>> [  237.387957]  [] system_call_fastpath+0x16/0x1b
>>
>> The first control-page allocation may take too much time because
>> crashk_res.end can be set to a high value. We need to add cond_resched
>> to avoid the issue.
>>
>> The patch has been tested, and the above issue no longer appears.
>>
>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>> ---
>>  kernel/kexec_core.c | 4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index 5616755..2b43cc5 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -433,6 +433,7 @@ static struct page 
>> *kimage_alloc_crash_control_pages(struct kimage *image,
>>   */
>>  unsigned long hole_start, hole_end, size;
>>  struct page *pages;
>> +unsigned long count = 0;
>>  
>>  pages = NULL;
>>  size = (1 << order) << PAGE_SHIFT;
>> @@ -441,6 +442,9 @@ static struct page 
>> *kimage_alloc_crash_control_pages(struct kimage *image,
>>  while (hole_end <= crashk_res.end) {
>>  unsigned long i;
>>  
>> +if (++count % 256 == 0)
>> +cond_resched();
>> +
>>  if (hole_end > KEXEC_CRASH_CONTROL_MEMORY_LIMIT)
>>  break;
>>  /* See if I overlap any of the segments */
> .
>




Re: [RFC PATCH] hugetlbfs: fix the hugetlbfs can not be mounted

2016-11-03 Thread zhong jiang
On 2016/11/4 3:17, Andrew Morton wrote:
> On Sat, 29 Oct 2016 14:08:31 +0800 zhongjiang <zhongji...@huawei.com> wrote:
>
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> Since commit 3e89e1c5ea84 ("hugetlb: make mm and fs code explicitly
>> non-modular")
>> was brought into the mainline, mounting hugetlbfs results in the following
>> issue:
>>
>> mount: unknown filesystem type 'hugetlbfs'
>>
>> because that patch removed MODULE_ALIAS_FS, so when we mount the fs type,
>> the caller get_fs_type cannot find the filesystem.
>>
>> The patch just restores MODULE_ALIAS_FS to identify hugetlbfs.
> hm, 3e89e1c5ea84 ("hugetlb: make mm and fs code explicitly
> non-modular") was merged almost a year ago.  And you are apparently the
> first person to discover this regression.  Can you think why that is?
  When I pulled the upstream patches at 4.9-rc2, I found that I could not
mount hugetlbfs,
  but when I pulled the remaining upstream patches the next day and tested
again, it worked well.
  So I am replying right away: please ignore the patch. The detailed reason
has not been dug into.

  I am sorry for wasting your time.

  Thank you
  zhongjiang
>> index 4fb7b10..b63e7de 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -35,6 +35,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  
>>  #include 
>> @@ -1209,6 +1210,7 @@ static struct dentry *hugetlbfs_mount(struct 
>> file_system_type *fs_type,
>>  .mount  = hugetlbfs_mount,
>>  .kill_sb= kill_litter_super,
>>  };
>> +MODULE_ALIAS_FS("hugetlbfs");
>>  
>>  static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];
>>  
>
> .
>




Re: [RFC PATCH] hugetlbfs: fix the hugetlbfs can not be mounted

2016-10-29 Thread zhong jiang
On 2016/10/29 14:08, zhongjiang wrote:
> From: zhong jiang <zhongji...@huawei.com>
>
> Since commit 3e89e1c5ea84 ("hugetlb: make mm and fs code explicitly
> non-modular")
> was brought into the mainline, mounting hugetlbfs results in the following
> issue:
>
> mount: unknown filesystem type 'hugetlbfs'
>
> because that patch removed MODULE_ALIAS_FS, so when we mount the fs type,
> the caller get_fs_type cannot find the filesystem.
>
> The patch just restores MODULE_ALIAS_FS to identify hugetlbfs.
>
> Signed-off-by: zhong jiang <zhongji...@huawei.com>
> ---
>  fs/hugetlbfs/inode.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 4fb7b10..b63e7de 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -1209,6 +1210,7 @@ static struct dentry *hugetlbfs_mount(struct 
> file_system_type *fs_type,
>   .mount  = hugetlbfs_mount,
>   .kill_sb= kill_litter_super,
>  };
> +MODULE_ALIAS_FS("hugetlbfs");
>  
>  static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];
>  
 Please ignore the patch; it has been fixed.
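
For context on the mechanism (a sketch reconstructed from memory of the
relevant kernel pieces, so treat the exact bodies as assumptions):
MODULE_ALIAS_FS only matters when a filesystem is built as a module, and
CONFIG_HUGETLBFS is a built-in (bool) option, so get_fs_type() finds it in
the registered-filesystem list without ever consulting the alias. That is
consistent with the failure not being reproducible.

/* include/linux/fs.h: the alias lets "mount -t foo" trigger
 * request_module("fs-foo") for modular filesystems.
 */
#define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)

/* fs/filesystems.c, abridged sketch: the module alias is only a fallback
 * for filesystems that are not already registered at boot.
 */
struct file_system_type *get_fs_type(const char *name)
{
	struct file_system_type *fs = __get_fs_type(name, strlen(name));

	if (!fs && request_module("fs-%s", name) == 0)
		fs = __get_fs_type(name, strlen(name));
	return fs;
}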



Re: [PATCH] net: avoid uninitialized variable

2016-10-26 Thread zhong jiang
On 2016/10/27 12:02, Gao Feng wrote:
> On Thu, Oct 27, 2016 at 11:56 AM, zhongjiang <zhongji...@huawei.com> wrote:
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> When I compile the newest kernel, I hit the following error with
>> -Werror=maybe-uninitialized:
>>
>> net/core/flow_dissector.c: In function '__skb_flow_dissect':
>> include/uapi/linux/swab.h:100:46: error: 'vlan' may be used uninitialized in
>> this function [-Werror=maybe-uninitialized]
>> net/core/flow_dissector.c:248:26: note: 'vlan' was declared here
>>
>> This adds an additional check on proto to explicitly tell the compiler
>> that the vlan pointer has the correct value before it is used.
>>
>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>> ---
>>  net/core/flow_dissector.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> index 1a7b80f..a04d9cf 100644
>> --- a/net/core/flow_dissector.c
>> +++ b/net/core/flow_dissector.c
>> @@ -245,7 +245,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>> }
>> case htons(ETH_P_8021AD):
>> case htons(ETH_P_8021Q): {
>> -   const struct vlan_hdr *vlan;
>> +   const struct vlan_hdr *vlan = NULL;
>>
>> if (skb_vlan_tag_present(skb))
>> proto = skb->protocol;
>> @@ -276,7 +276,8 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>> key_vlan->vlan_id = skb_vlan_tag_get_id(skb);
>> key_vlan->vlan_priority =
>> (skb_vlan_tag_get_prio(skb) >> 
>> VLAN_PRIO_SHIFT);
>> -   } else {
>> +   } else if (proto == cpu_to_be16(ETH_P_8021Q) ||
>> +   proto == 
>> cpu_to_be16(ETH_P_8021AD)) {
>> key_vlan->vlan_id = ntohs(vlan->h_vlan_TCI) &
>> VLAN_VID_MASK;
>> key_vlan->vlan_priority =
>> --
>> 1.8.3.1
>>
> It seems there is one similar patch already.
> You could refer to https://patchwork.kernel.org/patch/9389565/
>
> Regards
> Feng
>
> .
>
 Sorry, I did not notice that patch.

 Thanks
 zhongjiang 



Re: [PATCH v2] z3fold: fix the potential encode bug in encod_handle

2016-10-17 Thread zhong jiang
On 2016/10/17 23:30, Dan Streetman wrote:
> On Mon, Oct 17, 2016 at 8:48 AM, zhong jiang <zhongji...@huawei.com> wrote:
>> On 2016/10/17 20:03, Vitaly Wool wrote:
>>> Hi Zhong Jiang,
>>>
>>> On Mon, Oct 17, 2016 at 3:58 AM, zhong jiang <zhongji...@huawei.com> wrote:
>>>> Hi,  Vitaly
>>>>
>>>> About the following patch,  is it right?
>>>>
>>>> Thanks
>>>> zhongjiang
>>>> On 2016/10/13 12:02, zhongjiang wrote:
>>>>> From: zhong jiang <zhongji...@huawei.com>
>>>>>
>>>>> At present, zhdr->first_num plus bud can exceed BUDDY_MASK
>>>>> in encode_handle, which will lead the caller handle_to_buddy
>>>>> to return the wrong value.
>>>>>
>>>>> The patch fixes the issue by changing BUDDY_MASK to PAGE_MASK,
>>>>> making it consistent with handle_to_z3fold_header. At the same time,
>>>>> changing BUDDY_MASK to PAGE_MASK in handle_to_buddy is better.
>>> are you seeing problems with the existing code? first_num should wrap around
>>> BUDDY_MASK and this should be ok because it is way bigger than the number
>>> of buddies.
>>>
>>> ~vitaly
>>>
>>> .
>>>
>>  first_num plus buddies can exceed the BUDDY_MASK. is it right?
> yes.
>
>>  (first_num + buddies) & BUDDY_MASK may be a smaller value than first_num.
> yes, but that doesn't matter; the value stored in the handle is never
> accessed directly.
>
>>   but (handle - zhdr->first_num) & BUDDY_MASK will return incorrect value
>>   in handle_to_buddy.
> the subtraction and masking will result in the correct buddy number,
> even if (handle & BUDDY_MASK) < zhdr->first_num.
 Yes, I see. It is hard to read.
> However, I agree it's nonobvious, and tying the first_num size to
> NCHUNKS_ORDER is confusing - the number of chunks is completely
> unrelated to the number of buddies.
 Yes, indeed.
> Possibly a better way to handle first_num is to limit it to the order
> of enum buddy to the actual range of possible buddy indexes, which is
> 0x3, i.e.:
>
> #define BUDDY_MASK  (0x3)
>
> and
>
>unsigned short first_num:2;
>
> with that and a small bit of explanation in the encode_handle or
> handle_to_buddy comments, it should be clear how the first_num and
> buddy numbering work, including that overflow/underflow are ok (due to
> the masking)...
 Yes, it is better and clearer. Thanks for your reply and advice. I will
 resend the patch.
>>   Thanks
>>   zhongjiang
>>
>>
> .
>
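
A small worked example of the wraparound Dan describes (stand-alone, using
the proposed 2-bit BUDDY_MASK; the concrete first_num and buddy values are
made up for illustration):

#include <stdio.h>

#define BUDDY_MASK 0x3UL	/* the proposed 2-bit mask */

int main(void)
{
	unsigned long first_num = 3;	/* made-up current first_num */
	unsigned long bud = 2;		/* buddy being encoded */

	/* encode: the low handle bits wrap around BUDDY_MASK */
	unsigned long low = (bud + first_num) & BUDDY_MASK;	/* (2+3)&3 = 1 */

	/* decode: the subtraction wraps the same way and recovers the buddy,
	 * even though low < first_num here
	 */
	unsigned long buddy = (low - first_num) & BUDDY_MASK;	/* (1-3)&3 = 2 */

	printf("encoded low bits = %lu, decoded buddy = %lu\n", low, buddy);
	return 0;
}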




Re: [PATCH v2] z3fold: fix the potential encode bug in encod_handle

2016-10-17 Thread zhong jiang
On 2016/10/17 20:03, Vitaly Wool wrote:
> Hi Zhong Jiang,
>
> On Mon, Oct 17, 2016 at 3:58 AM, zhong jiang <zhongji...@huawei.com> wrote:
>> Hi,  Vitaly
>>
>> About the following patch,  is it right?
>>
>> Thanks
>> zhongjiang
>> On 2016/10/13 12:02, zhongjiang wrote:
>>> From: zhong jiang <zhongji...@huawei.com>
>>>
>>> At present, zhdr->first_num plus bud can exceed BUDDY_MASK
>>> in encode_handle, which will lead the caller handle_to_buddy
>>> to return the wrong value.
>>>
>>> The patch fixes the issue by changing BUDDY_MASK to PAGE_MASK,
>>> making it consistent with handle_to_z3fold_header. At the same time,
>>> changing BUDDY_MASK to PAGE_MASK in handle_to_buddy is better.
> are you seeing problems with the existing code? first_num should wrap around
> BUDDY_MASK and this should be ok because it is way bigger than the number
> of buddies.
>
> ~vitaly
>
> .
>
 I am curious about the z3fold implementation; thus, I am reviewing the code.

 Thanks
 zhongjiang



Re: [PATCH v2] z3fold: fix the potential encode bug in encod_handle

2016-10-17 Thread zhong jiang
On 2016/10/17 20:03, Vitaly Wool wrote:
> Hi Zhong Jiang,
>
> On Mon, Oct 17, 2016 at 3:58 AM, zhong jiang <zhongji...@huawei.com> wrote:
>> Hi,  Vitaly
>>
>> About the following patch,  is it right?
>>
>> Thanks
>> zhongjiang
>> On 2016/10/13 12:02, zhongjiang wrote:
>>> From: zhong jiang <zhongji...@huawei.com>
>>>
>>> At present, zhdr->first_num plus bud can exceed BUDDY_MASK
>>> in encode_handle, which will lead the caller handle_to_buddy
>>> to return the wrong value.
>>>
>>> The patch fixes the issue by changing BUDDY_MASK to PAGE_MASK,
>>> making it consistent with handle_to_z3fold_header. At the same time,
>>> changing BUDDY_MASK to PAGE_MASK in handle_to_buddy is better.
> are you seeing problems with the existing code? first_num should wrap around
> BUDDY_MASK and this should be ok because it is way bigger than the number
> of buddies.
>
> ~vitaly
>
> .
>
 first_num plus buddies can exceed BUDDY_MASK, is that right?
 (first_num + buddies) & BUDDY_MASK may then be a smaller value than first_num,

  but (handle - zhdr->first_num) & BUDDY_MASK would return an incorrect value
  in handle_to_buddy.

  Thanks
  zhongjiang
 



Re: [PATCH v2] z3fold: fix the potential encode bug in encod_handle

2016-10-16 Thread zhong jiang
Hi, Vitaly

Regarding the following patch: is it right?

Thanks
zhongjiang
On 2016/10/13 12:02, zhongjiang wrote:
> From: zhong jiang <zhongji...@huawei.com>
>
> At present, zhdr->first_num plus bud can exceed BUDDY_MASK
> in encode_handle, which will lead the caller handle_to_buddy
> to return the wrong value.
>
> The patch fixes the issue by changing BUDDY_MASK to PAGE_MASK,
> making it consistent with handle_to_z3fold_header. At the same time,
> changing BUDDY_MASK to PAGE_MASK in handle_to_buddy is better.
>
> Signed-off-by: zhong jiang <zhongji...@huawei.com>
> ---
>  mm/z3fold.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/z3fold.c b/mm/z3fold.c
> index 8f9e89c..e8fc216 100644
> --- a/mm/z3fold.c
> +++ b/mm/z3fold.c
> @@ -169,7 +169,7 @@ static unsigned long encode_handle(struct z3fold_header 
> *zhdr, enum buddy bud)
>  
>   handle = (unsigned long)zhdr;
>   if (bud != HEADLESS)
> - handle += (bud + zhdr->first_num) & BUDDY_MASK;
> + handle += (bud + zhdr->first_num) & PAGE_MASK;
>   return handle;
>  }
>  
> @@ -183,7 +183,7 @@ static struct z3fold_header 
> *handle_to_z3fold_header(unsigned long handle)
>  static enum buddy handle_to_buddy(unsigned long handle)
>  {
>   struct z3fold_header *zhdr = handle_to_z3fold_header(handle);
> - return (handle - zhdr->first_num) & BUDDY_MASK;
> + return (handle - zhdr->first_num) & PAGE_MASK;
>  }
>  
>  /*




Re: [PATCH] z3fold: remove the unnecessary limit in z3fold_compact_page

2016-10-16 Thread zhong jiang
On 2016/10/15 3:25, Vitaly Wool wrote:
> On Fri, Oct 14, 2016 at 3:35 PM, zhongjiang <zhongji...@huawei.com> wrote:
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> Compacting a z3fold page has nothing to do with last_chunks; even if
>> the last chunk is not free, compaction can proceed.
>>
>> The patch just removes the limit without functional change.
>>
>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>> ---
>>  mm/z3fold.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/mm/z3fold.c b/mm/z3fold.c
>> index e8fc216..4668e1c 100644
>> --- a/mm/z3fold.c
>> +++ b/mm/z3fold.c
>> @@ -258,8 +258,7 @@ static int z3fold_compact_page(struct z3fold_header 
>> *zhdr)
>>
>>
>> if (!test_bit(MIDDLE_CHUNK_MAPPED, &page->private) &&
>> -   zhdr->middle_chunks != 0 &&
>> -   zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
>> +   zhdr->middle_chunks != 0 && zhdr->first_chunks == 0) {
>> memmove(beg + ZHDR_SIZE_ALIGNED,
>> beg + (zhdr->start_middle << CHUNK_SHIFT),
>> zhdr->middle_chunks << CHUNK_SHIFT);
> This check is actually important because if we move the middle chunk
> to the first and leave the last chunk, handles will become invalid and
> there won't be any easy way to fix that.
>
> Best regards,
>Vitaly
>
> .
>
 Thanks for your reply; you are right. Compacting while leaving the last chunk
 in place would lead first_num to increase, and thus handle_to_buddy would
 become invalid.

 Thanks
 zhongjiang
 



Re: [PATCH] z3fold: fix the potential encode bug in encod_handle

2016-10-12 Thread zhong jiang
On 2016/10/13 11:33, zhongjiang wrote:
> From: zhong jiang <zhongji...@huawei.com>
>
> At present, zhdr->first_num plus bud can exceed BUDDY_MASK
> in encode_handle, which will lead the caller handle_to_buddy
> to return the wrong value.
>
> The patch fixes the issue by changing BUDDY_MASK to PAGE_MASK,
> making it consistent with handle_to_z3fold_header. At the same time,
> the code will be much more comprehensible with BUDDY_MASK changed to
> BUDDIES_MAX in handle_to_buddy.
>
> Signed-off-by: zhong jiang <zhongji...@huawei.com>
> ---
>  mm/z3fold.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/z3fold.c b/mm/z3fold.c
> index 8f9e89c..5884b9e 100644
> --- a/mm/z3fold.c
> +++ b/mm/z3fold.c
> @@ -169,7 +169,7 @@ static unsigned long encode_handle(struct z3fold_header 
> *zhdr, enum buddy bud)
>  
>   handle = (unsigned long)zhdr;
>   if (bud != HEADLESS)
> - handle += (bud + zhdr->first_num) & BUDDY_MASK;
> + handle += (bud + zhdr->first_num) & PAGE_MASK;
>   return handle;
>  }
>  
> @@ -183,7 +183,7 @@ static struct z3fold_header 
> *handle_to_z3fold_header(unsigned long handle)
>  static enum buddy handle_to_buddy(unsigned long handle)
>  {
>   struct z3fold_header *zhdr = handle_to_z3fold_header(handle);
> - return (handle - zhdr->first_num) & BUDDY_MASK;
> + return (handle - zhdr->first_num) & BUDDIES_MAX;
>  }
>  
>  /*
  Oh, an obvious problem; please ignore it. I will resend the patch as v2.
Thanks



Re: [RFC] remove unnecessary condition in remove_inode_hugepages

2016-09-25 Thread zhong jiang
On 2016/9/25 8:06, Mike Kravetz wrote:
> On 09/23/2016 07:56 PM, zhong jiang wrote:
>> On 2016/9/24 1:19, Mike Kravetz wrote:
>>> On 09/22/2016 06:53 PM, zhong jiang wrote:
>>>> At present, we need to call hugetlb_fix_reserve_counts when
>>>> hugetlb_unreserve_pages fails,
>>>> and PagePrivate decides the hugetlb reserve counts.
>>>>
>>>> We obtain the page from the page cache, and the page is used under both
>>>> lock_page and mutex_lock.
>>>> alloc_huge_page, when adding the page to the page cache, always holds the
>>>> page lock, then does ClearPagePrivate
>>>> before unlocking the page.
>>>>
>>>> But I'm not sure it is right, or I may miss the point.
>>> Let me try to explain the code you suggest is unnecessary.
>>>
>>> The PagePrivate flag is used in huge page allocation/deallocation to
>>> indicate that the page was globally reserved.  For example, in
>>> dequeue_huge_page_vma() there is this code:
>>>
>>> if (page) {
>>> if (avoid_reserve)
>>> break;
>>> if (!vma_has_reserves(vma, chg))
>>> break;
>>>
>>> SetPagePrivate(page);
>>> h->resv_huge_pages--;
>>> break;
>>> }
>>>
>>> and in free_huge_page():
>>>
>>> restore_reserve = PagePrivate(page);
>>> ClearPagePrivate(page);
>>> .
>>> 
>>> .
>>> if (restore_reserve)
>>> h->resv_huge_pages++;
>>>
>>> This helps maintains the global huge page reserve count.
>>>
>>> In addition to the global reserve count, there are per VMA reservation
>>> structures.  Unfortunately, these structures have different meanings
>>> depending on the context in which they are used.
>>>
>>> If there is a VMA reservation entry for a page, and the page has not
>>> been instantiated in the VMA this indicates there is a huge page reserved
>>> and the global resv_huge_pages count reflects that reservation.  Even
>>> if a page was not reserved, a VMA reservation entry is added when a page
>>> is instantiated in the VMA.
>>>
>>> With that background, let's look at the existing code/proposed changes.
>>  Clearly. 
>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>> index 4ea71eb..010723b 100644
>>>> --- a/fs/hugetlbfs/inode.c
>>>> +++ b/fs/hugetlbfs/inode.c
>>>> @@ -462,14 +462,12 @@ static void remove_inode_hugepages(struct inode 
>>>> *inode, loff_t lstart,
>>>>  * the page, note PagePrivate which is used in case
>>>>  * of error.
>>>>  */
>>>> -   rsv_on_error = !PagePrivate(page);
>>> This rsv_on_error flag indicates that when the huge page was allocated,
>>yes
>>> it was NOT counted against the global reserve count.  So, when
>>> remove_huge_page eventually calls free_huge_page(), the global count
>>> resv_huge_pages is not incremented.  So far, no problem.
>>  but the page comes from the page cache, and if so, ClearPagePrivate(page)
>>  was already done under the page lock, so this condition is always true.
>>
>>   The key point is why it still needs to check PagePrivate(page) when the
>>   page comes from the page cache and the lock is held.
> You are correct.  My apologies for not seeing your point in the original
> post.
>
> When the huge page is added to the page cache (huge_add_to_page_cache),
> the Page Private flag will be cleared.  Since this code
> (remove_inode_hugepages) will only be called for pages in the page cache,
> PagePrivate(page) will always be false.
>
> The comments in this area should be changed along with the code.
>
 Thanks, I will resend the patch.
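
For reference, the insertion path Mike points at, as an abridged sketch of
mm/hugetlb.c reconstructed from memory (treat the exact body as an
assumption rather than a verbatim quote):

/* huge_add_to_page_cache(), abridged */
int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
			   pgoff_t idx)
{
	struct inode *inode = mapping->host;
	struct hstate *h = hstate_inode(inode);
	int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);

	if (err)
		return err;
	/* cleared here, so remove_inode_hugepages() can never observe
	 * PagePrivate set on a page-cache page
	 */
	ClearPagePrivate(page);

	spin_lock(&inode->i_lock);
	inode->i_blocks += blocks_per_huge_page(h);
	spin_unlock(&inode->i_lock);
	return 0;
}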



Re: [RFC] remove unnecessary condition in remove_inode_hugepages

2016-09-23 Thread zhong jiang
On 2016/9/24 1:19, Mike Kravetz wrote:
> On 09/22/2016 06:53 PM, zhong jiang wrote:
>> At present, we need to call hugetlb_fix_reserve_counts when
>> hugetlb_unreserve_pages fails,
>> and PagePrivate decides the hugetlb reserve counts.
>>
>> We obtain the page from the page cache, and the page is used under both
>> lock_page and mutex_lock.
>> alloc_huge_page, when adding the page to the page cache, always holds the
>> page lock, then does ClearPagePrivate
>> before unlocking the page.
>>
>> But I'm not sure it is right, or I may miss the point.
> Let me try to explain the code you suggest is unnecessary.
>
> The PagePrivate flag is used in huge page allocation/deallocation to
> indicate that the page was globally reserved.  For example, in
> dequeue_huge_page_vma() there is this code:
>
> if (page) {
> if (avoid_reserve)
> break;
> if (!vma_has_reserves(vma, chg))
> break;
>
> SetPagePrivate(page);
> h->resv_huge_pages--;
> break;
> }
>
> and in free_huge_page():
>
> restore_reserve = PagePrivate(page);
> ClearPagePrivate(page);
>   .
>   
>   .
> if (restore_reserve)
> h->resv_huge_pages++;
>
> This helps maintains the global huge page reserve count.
>
> In addition to the global reserve count, there are per VMA reservation
> structures.  Unfortunately, these structures have different meanings
> depending on the context in which they are used.
>
> If there is a VMA reservation entry for a page, and the page has not
> been instantiated in the VMA this indicates there is a huge page reserved
> and the global resv_huge_pages count reflects that reservation.  Even
> if a page was not reserved, a VMA reservation entry is added when a page
> is instantiated in the VMA.
>
> With that background, let's look at the existing code/proposed changes.
 Clearly. 
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 4ea71eb..010723b 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -462,14 +462,12 @@ static void remove_inode_hugepages(struct inode 
>> *inode, loff_t lstart,
>>  * the page, note PagePrivate which is used in case
>>  * of error.
>>  */
>> -   rsv_on_error = !PagePrivate(page);
> This rsv_on_error flag indicates that when the huge page was allocated,
   yes
> it was NOT counted against the global reserve count.  So, when
> remove_huge_page eventually calls free_huge_page(), the global count
> resv_huge_pages is not incremented.  So far, no problem.
 but the page comes from the page cache. If it does, ClearPagePrivate(page)
 has already been done while the page was locked, so this condition is
 always true.

  The key point is: why does it still need to check PagePrivate(page) when
  the page comes from the page cache and is held locked?

  Thank you
 zhongjiang
>> remove_huge_page(page);
>> freed++;
>> if (!truncate_op) {
>> if (unlikely(hugetlb_unreserve_pages(inode,
>> next, next + 1, 1)))
> We now have this VERY unlikely situation that hugetlb_unreserve_pages fails.
> This means that the VMA reservation entry for the page was not removed.
> So, we are in a bit of a mess.  The page has already been removed, but the
> VMA reservation entry can not.  This LOOKS like there is a reservation for
> the page in the VMA reservation structure.  But, the global count
> resv_huge_pages does not reflect this reservation.
>
> If we do nothing, when the VMA is eventually removed the VMA reservation
> structure will be completely removed and the global count resv_huge_pages
> will be decremented for each entry in the structure.  Since, there is a
> VMA reservation entry without a corresponding global count, the global
> count will be one less than it should (will eventually go to -1).
>
> To 'fix' this, hugetlb_fix_reserve_counts is called.  In this case, it will
> increment the global count so that it is consistent with the entries in
> the VMA reservation structure.
>
> This is all quite confusing and really unlikely to happen.  I tried to
> explain in code comments:
>
> Before removing the page:
> /*
>  * We must free the huge page and remove from page
>
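
To make the bookkeeping described above concrete, here is a minimal user-space
sketch in C. It is an illustration only, not the kernel code: the struct, the
'private' flag and the counter are simplified stand-ins for struct page,
PagePrivate() and h->resv_huge_pages.

#include <stdbool.h>
#include <stdio.h>

static long resv_huge_pages;          /* stand-in for the global reserve count */

struct page { bool private; };        /* models the PagePrivate flag */

/* Allocation path: consuming a reservation marks the page, so the free
 * path knows to give the reservation back if the page dies early. */
static void dequeue_reserved_page(struct page *page)
{
        page->private = true;         /* SetPagePrivate() */
        resv_huge_pages--;
}

/* Free path: only a page that carried a reservation restores it. */
static void free_huge_page(struct page *page)
{
        bool restore_reserve = page->private;  /* PagePrivate() */

        page->private = false;                 /* ClearPagePrivate() */
        if (restore_reserve)
                resv_huge_pages++;
}

int main(void)
{
        struct page p = { false };

        resv_huge_pages = 1;          /* one reserved huge page */
        dequeue_reserved_page(&p);    /* reservation consumed: count drops to 0 */
        free_huge_page(&p);           /* error path: reservation restored */
        printf("resv_huge_pages = %ld\n", resv_huge_pages);   /* prints 1 */
        return 0;
}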

[RFC] remove unnecessary condition in remove_inode_hugepages

2016-09-22 Thread zhong jiang

At present, we need to call hugetlb_fix_reserve_counts when
hugetlb_unreserve_pages fails,
and PagePrivate decides the hugetlb reserve counts.

We obtain the page from the page cache, and use the page under both lock_page
and mutex_lock.
alloc_huge_page adds the page to the page cache while holding the page lock,
and does ClearPagePrivate
before unlocking the page.

But I'm not sure whether this is right or I am missing something.


diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 4ea71eb..010723b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -462,14 +462,12 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
 * the page, note PagePrivate which is used in case
 * of error.
 */
-   rsv_on_error = !PagePrivate(page);
remove_huge_page(page);
freed++;
if (!truncate_op) {
if (unlikely(hugetlb_unreserve_pages(inode,
next, next + 1, 1)))
-   hugetlb_fix_reserve_counts(inode,
-   rsv_on_error);
+   hugetlb_fix_reserve_counts(inode);
}

unlock_page(page);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c26d463..d2e0fc5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -90,7 +90,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 void free_huge_page(struct page *page);
-void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
+void hugetlb_fix_reserve_counts(struct inode *inode);
 extern struct mutex *hugetlb_fault_mutex_table;
 u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87e11d8..28a079a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -567,13 +567,13 @@ retry:
  * appear as a "reserved" entry instead of simply dangling with incorrect
  * counts.
  */
-void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
+void hugetlb_fix_reserve_counts(struct inode *inode)
 {
struct hugepage_subpool *spool = subpool_inode(inode);
long rsv_adjust;

rsv_adjust = hugepage_subpool_get_pages(spool, 1);
-   if (restore_reserve && rsv_adjust) {
+   if (rsv_adjust) {
struct hstate *h = hstate_inode(inode);

hugetlb_acct_memory(h, 1);




Re: [PATCH] mm, numa: boot cpu should bound to the node0 when node_off enable

2016-08-23 Thread zhong jiang
On 2016/8/22 22:28, Catalin Marinas wrote:
> On Sat, Aug 20, 2016 at 05:38:59PM +0800, zhong jiang wrote:
>> On 2016/8/19 12:11, Ganapatrao Kulkarni wrote:
>>> On Fri, Aug 19, 2016 at 9:30 AM, Ganapatrao Kulkarni
>>> <gpkulka...@gmail.com> wrote:
>>>> On Fri, Aug 19, 2016 at 7:28 AM, zhong jiang <zhongji...@huawei.com> wrote:
>>>>> On 2016/8/19 1:45, Ganapatrao Kulkarni wrote:
>>>>>> On Thu, Aug 18, 2016 at 9:34 PM, Catalin Marinas
>>>>>> <catalin.mari...@arm.com> wrote:
>>>>>>> On Thu, Aug 18, 2016 at 09:09:26PM +0800, zhongjiang wrote:
>>>>>>>> At present, the boot cpu will be bound to a node from the device tree
>>>>>>>> when node_off is enabled.
>>>>>>>> If the node is not initialized, it will lead to the following problem.
> [...]
>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>> @@ -119,7 +119,7 @@ void numa_store_cpu_info(unsigned int cpu)
>>>>>>>>  void __init early_map_cpu_to_node(unsigned int cpu, int nid)
>>>>>>>>  {
>>>>>>>>   /* fallback to node 0 */
>>>>>>>> - if (nid < 0 || nid >= MAX_NUMNODES)
>>>>>>>> + if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
>>>>>> I did not understand how this one-line change fixes the issue you
>>>>>> have mentioned (I also have not fully understood the issue description).
>>>>>> This array is used to map the node id when the secondary cores come up;
>>>>>> when numa_off is set, cpu_to_node_map[cpu] is not used and the node is
>>>>>> always set to node0 (refer to the function numa_store_cpu_info).
>>>>>> Please provide more details so we can understand the issue you are facing.
>>>>>> /*
>>>>>>  *  Set the cpu to node and mem mapping
>>>>>>  */
>>>>>> void numa_store_cpu_info(unsigned int cpu)
>>>>>> {
>>>>>> map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
>>>>>> }
>>>>> The issue comes up when we test kdump; it leads to a kernel crash.
>>>>> When I debugged it, I found the boot cpu was actually bound to node1,
>>>>> while node1 does not really exist when numa_off is enabled.
>>>> boot cpu is default mapped to node0
>>>> are you running with any other patches?
>>> if you added any patch to change this code
>>>   /* init boot processor */
>>> cpu_to_node_map[0] = 0;
>>> map_cpu_to_node(0, 0);
>>>
>>> then adding code to take care of numa_off here might solve your issue.
>>  but in of_smp_init_cpus, the boot cpu calls early_map_cpu_to_node() to get
>>  its node, and that node comes from the devicetree.
>>
>>  The code you point to will be overwritten with another node; therefore it
>>  is possible that cpu_to_node(cpu) returns an incorrect result, and the
>>  crash comes up.
> I think I get Ganapat's point. The cpu_to_node_map[0] may be incorrectly
> set by early_map_cpu_to_node() when called from smp_init_cpus() ->
> of_parse_and_init_cpus(). However, the cpu_to_node_map[] array is *only*
> read by numa_store_cpu_info(). This latter function calls
> map_cpu_to_node() and, if numa_off, will only ever pass 0 as the nid.
>
> Given that the cpu_to_node_map[] array is static, I don't see how any
> non-zero value could leak outside the arch/arm64/mm/numa.c file.
>
> So please give more details of any additional patches you have on top of
> mainline or whether you reproduced this issue with the vanilla kernel
> (since you mentioned kdump, that's not in mainline yet).
>
Thanks to Catalin and Ganapatrao.
I am sorry for that; mainline has already solved this. The mainline changes
are extensive, and I did not notice.




Re: [PATCH] mm, numa: boot cpu should bound to the node0 when node_off enable

2016-08-20 Thread zhong jiang
On 2016/8/19 12:11, Ganapatrao Kulkarni wrote:
> On Fri, Aug 19, 2016 at 9:30 AM, Ganapatrao Kulkarni
> <gpkulka...@gmail.com> wrote:
>> On Fri, Aug 19, 2016 at 7:28 AM, zhong jiang <zhongji...@huawei.com> wrote:
>>> On 2016/8/19 1:45, Ganapatrao Kulkarni wrote:
>>>> On Thu, Aug 18, 2016 at 9:34 PM, Catalin Marinas
>>>> <catalin.mari...@arm.com> wrote:
>>>>> On Thu, Aug 18, 2016 at 09:09:26PM +0800, zhongjiang wrote:
>>>>>> At present, the boot cpu will be bound to a node from the device tree
>>>>>> when node_off is enabled.
>>>>>> If the node is not initialized, it will lead to the following problem.
>>>>>>
>>>>>>  next_zones_zonelist+0x18/0x80
>>>>>>  __build_all_zonelists+0x1e0/0x288
>>>>>>  build_all_zonelists_init+0x10/0x1c
>>>>>>  build_all_zonelists+0x114/0x128
>>>>>>  start_kernel+0x1a0/0x414
>>>>> I think this "problem" is missing a lot of information. Is this supposed
>>>>> to be a kernel panic?
>>>>>
>>>>>> The patch fixes it by falling back to node 0; therefore, the cpu will
>>>>>> be bound to the node correctly.
>>>>>>
>>>>>> Signed-off-by: zhongjiang <zhongji...@huawei.com>
>>>>>> ---
>>>>>>  arch/arm64/mm/numa.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>> index 4dcd7d6..1f8f5da 100644
>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>> @@ -119,7 +119,7 @@ void numa_store_cpu_info(unsigned int cpu)
>>>>>>  void __init early_map_cpu_to_node(unsigned int cpu, int nid)
>>>>>>  {
>>>>>>   /* fallback to node 0 */
>>>>>> - if (nid < 0 || nid >= MAX_NUMNODES)
>>>>>> + if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
>>>> I did not understand how this one-line change fixes the issue you
>>>> have mentioned (I also have not fully understood the issue description).
>>>> This array is used to map the node id when the secondary cores come up;
>>>> when numa_off is set, cpu_to_node_map[cpu] is not used and the node is
>>>> always set to node0 (refer to the function numa_store_cpu_info).
>>>> Please provide more details so we can understand the issue you are facing.
>>>> /*
>>>>  *  Set the cpu to node and mem mapping
>>>>  */
>>>> void numa_store_cpu_info(unsigned int cpu)
>>>> {
>>>> map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
>>>> }
>>>>
>>>> thanks
>>>> Ganapat
>>> The issue comes up when we test kdump; it leads to a kernel crash.
>>> When I debugged it, I found the boot cpu was actually bound to node1,
>>> while node1 does not really exist when numa_off is enabled.
>> boot cpu is default mapped to node0
>> are you running with any other patches?
> if you added any patch to change this code
>   /* init boot processor */
> cpu_to_node_map[0] = 0;
> map_cpu_to_node(0, 0);
>
> then adding code to take care of numa_off here might solve your issue.
 but in of_smp_init_cpus, the boot cpu calls early_map_cpu_to_node() to get
 its node, and that node comes from the devicetree.

 The code you point to will be overwritten with another node; therefore it is
 possible that cpu_to_node(cpu) returns an incorrect result, and the crash
 will come up.
>>> __build_all_zonelists will call cpu_to_node(cpu), but the corresponding
>>> relation will have been obtained from the devicetree; therefore the issue
>>> comes up.
>> when numa_off, all cpus are mapped to node0 (refer to
>> numa_store_cpu_info) and the device tree mapping is ignored.
>>> The corresponding message when kdump starts is as follows; it is obvious
>>> that the mem range points to node1 in the devicetree.
>>>
>>> Early memory node ranges
>>> node   0: [mem 0x005fe000-0x005f]
>>> Initmem setup node 0 [mem 0x005fe000-0x005f]
>>>
>>> Unable to handle kernel paging request at virtual address 1690
>>> pgd = 81226000
>>> [1690] *pgd=
>>> Internal error: Oops: 9604 [#1] SMP
>>>  Modules linked in:
>>> CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.27-vhulk3.6.5.aarch64 #1
>>> Hardware name: Hisilicon Hi1612 Development Board (DT)
>>>  task: 8102b730 ti: 81018000 task.ti: 81018000
>>> PC is at next_zones_zonelist+0x18/0x80
>>>  LR is at __build_all_zonelists+0x1e0/0x288
>>> next_zones_zonelist+0x18/0x80
>>>  __build_all_zonelists+0x1e0/0x288
>>> build_all_zonelists_init+0x10/0x1c
>>>  build_all_zonelists+0x114/0x128
>>>  start_kernel+0x1a0/0x414
>>>>>>   nid = 0;
>>>>>>
>>>>>>   cpu_to_node_map[cpu] = nid;
>>>>> The patch looks fine (slight inconsistency with the map_cpu_to_node()
>>>>> callers but I guess we don't want to expose numa_off outside this file).
>>>>> I would however like to see an Ack from Ganapat (cc'ed).
>>>>>
>>>>> --
>>>>> Catalin
>>>>>
>>>>> ___
>>>>> linux-arm-kernel mailing list
>>>>> linux-arm-ker...@lists.infradead.org
>>>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>>> .
>>>>
>>>
> .
>




Re: [PATCH] mm, numa: boot cpu should bound to the node0 when node_off enable

2016-08-18 Thread zhong jiang
On 2016/8/19 1:45, Ganapatrao Kulkarni wrote:
> On Thu, Aug 18, 2016 at 9:34 PM, Catalin Marinas
>  wrote:
>> On Thu, Aug 18, 2016 at 09:09:26PM +0800, zhongjiang wrote:
>>> At present, the boot cpu will be bound to a node from the device tree
>>> when node_off is enabled.
>>> If the node is not initialized, it will lead to the following problem.
>>>
>>>  next_zones_zonelist+0x18/0x80
>>>  __build_all_zonelists+0x1e0/0x288
>>>  build_all_zonelists_init+0x10/0x1c
>>>  build_all_zonelists+0x114/0x128
>>>  start_kernel+0x1a0/0x414
>> I think this "problem" is missing a lot of information. Is this supposed
>> to be a kernel panic?
>>
>>> The patch fixes it by falling back to node 0; therefore, the cpu will be
>>> bound to the node correctly.
>>>
>>> Signed-off-by: zhongjiang 
>>> ---
>>>  arch/arm64/mm/numa.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> index 4dcd7d6..1f8f5da 100644
>>> --- a/arch/arm64/mm/numa.c
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -119,7 +119,7 @@ void numa_store_cpu_info(unsigned int cpu)
>>>  void __init early_map_cpu_to_node(unsigned int cpu, int nid)
>>>  {
>>>   /* fallback to node 0 */
>>> - if (nid < 0 || nid >= MAX_NUMNODES)
>>> + if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
> I did not understand how this one-line change fixes the issue you
> have mentioned (I also have not fully understood the issue description).
> This array is used to map the node id when the secondary cores come up;
> when numa_off is set, cpu_to_node_map[cpu] is not used and the node is
> always set to node0 (refer to the function numa_store_cpu_info).
> Please provide more details so we can understand the issue you are facing.
> /*
>  *  Set the cpu to node and mem mapping
>  */
> void numa_store_cpu_info(unsigned int cpu)
> {
> map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
> }
>
> thanks
> Ganapat
The issue comes up when we test kdump; it leads to a kernel crash.
When I debugged it, I found the boot cpu was actually bound to node1,
while node1 does not really exist when numa_off is enabled.

__build_all_zonelists will call cpu_to_node(cpu), but the corresponding relation
will have been obtained from the devicetree; therefore the issue comes up.
The corresponding message when kdump starts is as follows; it is obvious that
the mem range points to node1 in the devicetree.

Early memory node ranges
node   0: [mem 0x005fe000-0x005f]
Initmem setup node 0 [mem 0x005fe000-0x005f]

Unable to handle kernel paging request at virtual address 1690
pgd = 81226000
[1690] *pgd=
Internal error: Oops: 9604 [#1] SMP
 Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.27-vhulk3.6.5.aarch64 #1
Hardware name: Hisilicon Hi1612 Development Board (DT)
 task: 8102b730 ti: 81018000 task.ti: 81018000
PC is at next_zones_zonelist+0x18/0x80
 LR is at __build_all_zonelists+0x1e0/0x288
next_zones_zonelist+0x18/0x80
 __build_all_zonelists+0x1e0/0x288
build_all_zonelists_init+0x10/0x1c
 build_all_zonelists+0x114/0x128
 start_kernel+0x1a0/0x414
>>>   nid = 0;
>>>
>>>   cpu_to_node_map[cpu] = nid;
>> The patch looks fine (slight inconsistency with the map_cpu_to_node()
>> callers but I guess we don't want to expose numa_off outside this file).
>> I would however like to see an Ack from Ganapat (cc'ed).
>>
>> --
>> Catalin
>>
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> .
>
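
The interaction discussed above can be modeled in a few lines of user-space C.
This is a simplified sketch, not the arm64 source: NR_CPUS, the array size and
main() are made up, but the numa_off check mirrors the one-line change under
discussion, and numa_store_cpu_info() mirrors the point that the early mapping
is only ever read through it.

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS      4
#define MAX_NUMNODES 2

static bool numa_off = true;            /* as if numa=off were on the cmdline */
static int cpu_to_node_map[NR_CPUS];

/* Early mapping from the device tree, with the proposed numa_off fallback. */
static void early_map_cpu_to_node(unsigned int cpu, int nid)
{
        /* fallback to node 0 */
        if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
                nid = 0;
        cpu_to_node_map[cpu] = nid;
}

/* Later mapping when a cpu comes up: numa_off also forces node 0 here. */
static int numa_store_cpu_info(unsigned int cpu)
{
        return numa_off ? 0 : cpu_to_node_map[cpu];
}

int main(void)
{
        early_map_cpu_to_node(0, 1);    /* device tree claims node 1 */
        printf("cpu0 -> node %d\n", numa_store_cpu_info(0));   /* node 0 */
        return 0;
}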




Re: [PATCH] mm, numa: boot cpu should bound to the node0 when node_off enable

2016-08-18 Thread zhong jiang
On 2016/8/19 1:45, Ganapatrao Kulkarni wrote:
> On Thu, Aug 18, 2016 at 9:34 PM, Catalin Marinas
>  wrote:
>> On Thu, Aug 18, 2016 at 09:09:26PM +0800, zhongjiang wrote:
>>> At present, the boot cpu will be bound to a node from the device tree
>>> when node_off is enabled.
>>> If the node is not initialized, it will lead to the following problem.
>>>
>>>  next_zones_zonelist+0x18/0x80
>>>  __build_all_zonelists+0x1e0/0x288
>>>  build_all_zonelists_init+0x10/0x1c
>>>  build_all_zonelists+0x114/0x128
>>>  start_kernel+0x1a0/0x414
>> I think this "problem" is missing a lot of information. Is this supposed
>> to be a kernel panic?
Yes, it leads to a kernel crash; the details are as follows.
 
Unable to handle kernel paging request at virtual address 1690
pgd = 81226000
[1690] *pgd=
Internal error: Oops: 9604 [#1] SMP
 Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.27-vhulk3.6.5.aarch64 #1
Hardware name: Hisilicon Hi1612 Development Board (DT)
 task: 8102b730 ti: 81018000 task.ti: 81018000
PC is at next_zones_zonelist+0x18/0x80
 LR is at __build_all_zonelists+0x1e0/0x288
next_zones_zonelist+0x18/0x80
 __build_all_zonelists+0x1e0/0x288
build_all_zonelists_init+0x10/0x1c
 build_all_zonelists+0x114/0x128
 start_kernel+0x1a0/0x414
>>> The patch fixes it by falling back to node 0; therefore, the cpu will be
>>> bound to the node correctly.
>>>
>>> Signed-off-by: zhongjiang 
>>> ---
>>>  arch/arm64/mm/numa.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> index 4dcd7d6..1f8f5da 100644
>>> --- a/arch/arm64/mm/numa.c
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -119,7 +119,7 @@ void numa_store_cpu_info(unsigned int cpu)
>>>  void __init early_map_cpu_to_node(unsigned int cpu, int nid)
>>>  {
>>>   /* fallback to node 0 */
>>> - if (nid < 0 || nid >= MAX_NUMNODES)
>>> + if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
> I did not understand how this one-line change fixes the issue you
> have mentioned (I also have not fully understood the issue description).
> This array is used to map the node id when the secondary cores come up;
> when numa_off is set, cpu_to_node_map[cpu] is not used and the node is
> always set to node0 (refer to the function numa_store_cpu_info).
> Please provide more details so we can understand the issue you are facing.
> /*
>  *  Set the cpu to node and mem mapping
>  */
> void numa_store_cpu_info(unsigned int cpu)
> {
> map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
> }
>
> thanks
> Ganapat
>>>   nid = 0;
>>>
>>>   cpu_to_node_map[cpu] = nid;
>> The patch looks fine (slight inconsistency with the map_cpu_to_node()
>> callers but I guess we don't want to expose numa_off outside this file).
>> I would however like to see an Ack from Ganapat (cc'ed).
>>
>> --
>> Catalin
>>
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> .
>




Re: [PATCH] mm: optimize find_zone_movable_pfns_for_nodes to avoid unnecessary loop.

2016-08-09 Thread zhong jiang
On 2016/8/10 7:29, Andrew Morton wrote:
> On Fri, 5 Aug 2016 22:04:07 +0800 zhongjiang  wrote:
>
>> when required_kernelcore decreases to zero, we should exit the loop promptly,
>> because it wastes time to scan the remaining nodes.
> The patch is rather ugly and it only affects __init code, so the only
> benefit will be to boot time.
   yes
> Do we have any timing measurements which would justify changing this code?
  I am sorry for that.  It is only a theoretical analysis.
>





Re: [PATCH] mm: optimize find_zone_movable_pfns_for_nodes to avoid unnecessary loop.

2016-08-09 Thread zhong jiang
On 2016/8/5 22:04, zhongjiang wrote:
> From: zhong jiang <zhongji...@huawei.com>
>
> when required_kernelcore decreases to zero, we should exit the loop promptly,
> because it wastes time to scan the remaining nodes.
>
> Signed-off-by: zhong jiang <zhongji...@huawei.com>
> ---
>  mm/page_alloc.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ea759b9..be7df17 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6093,7 +6093,7 @@ static unsigned long __init 
> early_calculate_totalpages(void)
>   unsigned long pages = end_pfn - start_pfn;
>  
>   totalpages += pages;
> - if (pages)
> + if (!node_isset(nid, node_states[N_MEMORY]) && pages)
>   node_set_state(nid, N_MEMORY);
>   }
>   return totalpages;
> @@ -6115,6 +6115,7 @@ static void __init 
> find_zone_movable_pfns_for_nodes(void)
>   unsigned long totalpages = early_calculate_totalpages();
>   int usable_nodes = nodes_weight(node_states[N_MEMORY]);
>   struct memblock_region *r;
> + bool avoid_loop = false;
>  
>   /* Need to find movable_zone earlier when movable_node is specified. */
>   find_usable_zone_for_movable();
> @@ -6275,6 +6276,8 @@ restart:
>   required_kernelcore -= min(required_kernelcore,
>   size_pages);
>   kernelcore_remaining -= size_pages;
> + if (!required_kernelcore && avoid_loop)
> + goto out2;
>   if (!kernelcore_remaining)
>   break;
>   }
> @@ -6287,9 +6290,10 @@ restart:
>* satisfied
>*/
>   usable_nodes--;
> - if (usable_nodes && required_kernelcore > usable_nodes)
> + if (usable_nodes && required_kernelcore > usable_nodes) {
> + avoid_loop = true;
>   goto restart;
> -
> + }
>  out2:
>   /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
>   for (nid = 0; nid < MAX_NUMNODES; nid++)
  Does anyone have any objection to the above patch? Please let me know.
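
The change under discussion is the usual early-exit-on-exhausted-budget
pattern. A standalone sketch of the same control flow, with made-up page
counts rather than real zone data:

#include <stdio.h>

int main(void)
{
        /* pages still to be placed in the kernel zone (the budget) */
        unsigned long required_kernelcore = 50;
        unsigned long node_pages[] = { 30, 30, 30, 30 };
        int nid, nr_nodes = 4;

        for (nid = 0; nid < nr_nodes; nid++) {
                unsigned long take = node_pages[nid];

                if (take > required_kernelcore)
                        take = required_kernelcore;
                required_kernelcore -= take;
                printf("node %d: took %lu pages\n", nid, take);

                /* early exit: the remaining nodes need not be scanned */
                if (!required_kernelcore)
                        break;
        }
        return 0;
}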
 
 



Re: [PATCH] mm: fix the incorrect hugepages count

2016-08-09 Thread zhong jiang
On 2016/8/9 1:14, Mike Kravetz wrote:
> On 08/07/2016 07:49 PM, zhongjiang wrote:
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> When memory hotplug is enabled, free hugepages will be freed if a movable
>> node goes offline.
>> Therefore, /proc/sys/vm/nr_hugepages will be incorrect.
>>
>> The patch fixes it by reducing max_huge_pages when the node goes offline.
>>
>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>> ---
>>  mm/hugetlb.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index f904246..3356e3a 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1448,6 +1448,7 @@ static void dissolve_free_huge_page(struct page *page)
>>  list_del(&page->lru);
>>  h->free_huge_pages--;
>>  h->free_huge_pages_node[nid]--;
>> +h->max_huge_pages--;
>>  update_and_free_page(h, page);
>>  }
>>  spin_unlock(&hugetlb_lock);
>>
> Adding Naoya as he was the original author of this code.
>
> From a quick look it appears that the huge page will be migrated (allocated
> on another node).  If my understanding is correct, then max_huge_pages
> should not be adjusted here.
>
  We need to take free hugetlb pages into account. Of course, the allocated
huge pages need no adjustment.
  The patch just reduces the free hugetlb pages count.

  Thanks
 zhongjiang
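
The accounting in question can be illustrated with a user-space sketch using
simplified stand-in counters (not the kernel's struct hstate):

#include <stdio.h>

static long max_huge_pages;     /* what /proc/sys/vm/nr_hugepages reports */
static long free_huge_pages;

/* Dissolving a *free* huge page on node offline: with the proposed change,
 * the pool size shrinks together with the free count, so the two counters
 * stay consistent. */
static void dissolve_free_huge_page(void)
{
        free_huge_pages--;
        max_huge_pages--;       /* the line added by the patch */
}

int main(void)
{
        max_huge_pages = 4;
        free_huge_pages = 4;
        dissolve_free_huge_page();      /* a movable node goes offline */
        printf("nr_hugepages = %ld, free = %ld\n",
               max_huge_pages, free_huge_pages);       /* prints 3, 3 */
        return 0;
}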



A question about transparent hugepages

2016-08-03 Thread zhong jiang
According to total_mapcount, different processes can map arbitrary subpages of a
transparent hugepage. How can that happen?




Re: [PATCH] fs: fix a bug when new_insert_key is not initialization

2016-08-01 Thread zhong jiang
On 2016/8/2 7:05, Andrew Morton wrote:
> On Sat, 30 Jul 2016 11:51:09 +0800 zhongjiang <zhongji...@huawei.com> wrote:
>
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> When compiling the kernel code, I hit the following warning:
>> fs/reiserfs/ibalance.c:1156:2: warning: 'new_insert_key' may be used
>> uninitialized in this function.
>> memcpy(new_insert_key_addr, &new_insert_key, KEY_SIZE);
>>
>> The patch fixes it by checking new_insert_ptr: if new_insert_ptr is not
>> NULL, we know that new_insert_key has been assigned, so the memcpy can
>> safely execute.
>>
>> --- a/fs/reiserfs/ibalance.c
>> +++ b/fs/reiserfs/ibalance.c
>> @@ -1153,8 +1153,10 @@ int balance_internal(struct tree_balance *tb,
>> insert_ptr);
>>  }
>>  
>> -memcpy(new_insert_key_addr, &new_insert_key, KEY_SIZE);
>> -insert_ptr[0] = new_insert_ptr;
>> +if (new_insert_ptr) {
>> +memcpy(new_insert_key_addr, &new_insert_key, KEY_SIZE);
>> +insert_ptr[0] = new_insert_ptr;
>> +}
>>  
>>  return order;
> Jeff has aleady fixed this with an equivalent patch.  It's in -mm at
> present.
>
> From: Jeff Mahoney <je...@suse.com>
> Subject: reiserfs: fix "new_insert_key may be used uninitialized ..."
>
> new_insert_key only makes any sense when it's associated with a
> new_insert_ptr, which is initialized to NULL and changed to a buffer_head
> when we also initialize new_insert_key.  We can key off of that to avoid
> the uninitialized warning.
>
> Link: http://lkml.kernel.org/r/5eca5ffb-2155-8df2-b4a2-f162f105e...@suse.com
> Signed-off-by: Jeff Mahoney <je...@suse.com>
> Cc: Arnd Bergmann <a...@arndb.de>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Signed-off-by: Andrew Morton <a...@linux-foundation.org>
> ---
>
>  fs/reiserfs/ibalance.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff -puN 
> fs/reiserfs/ibalance.c~reiserfs-fix-new_insert_key-may-be-used-uninitialized 
> fs/reiserfs/ibalance.c
> --- 
> a/fs/reiserfs/ibalance.c~reiserfs-fix-new_insert_key-may-be-used-uninitialized
> +++ a/fs/reiserfs/ibalance.c
> @@ -1153,8 +1153,9 @@ int balance_internal(struct tree_balance
>  insert_ptr);
>   }
>  
> - memcpy(new_insert_key_addr, &new_insert_key, KEY_SIZE);
>   insert_ptr[0] = new_insert_ptr;
> + if (new_insert_ptr)
> + memcpy(new_insert_key_addr, &new_insert_key, KEY_SIZE);
>  
>   return order;
>  }
> _
>
>
> .
>
 OK, I did not notice.  Thanks.
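
The shape of both the warning and Jeff's guard can be reproduced in a
standalone toy program (unrelated to reiserfs; every name below is made up):
the key is written only on the branch that also sets the pointer, so gating
the copy on the pointer is what makes it safe.

#include <stdio.h>
#include <string.h>

/* Toy model of the pattern in balance_internal(): 'key' is only written on
 * the branch that also sets 'ptr', so the copy must be gated on 'ptr'. */
static void maybe_copy(int split, char *out)
{
        char key;               /* analogous to new_insert_key */
        char *ptr = NULL;       /* analogous to new_insert_ptr */

        if (split) {
                key = 'K';      /* both are set together ... */
                ptr = &key;
        }

        if (ptr)                /* ... so this guard makes the copy safe */
                memcpy(out, ptr, 1);
}

int main(void)
{
        char out = '?';

        maybe_copy(1, &out);
        printf("out = %c\n", out);      /* prints K */
        return 0;
}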



Re: [PATCH] fs: wipe off the compiler warn

2016-07-29 Thread zhong jiang
On 2016/7/30 7:02, Andrew Morton wrote:
> On Fri, 29 Jul 2016 22:46:39 +0800 zhongjiang <zhongji...@huawei.com> wrote:
>
>> From: zhong jiang <zhongji...@huawei.com>
>>
>> When compiling the kernel code, I hit the following warning:
>> fs/reiserfs/ibalance.c:1156:2: warning: 'new_insert_key' may be used
>> uninitialized in this function.
>> memcpy(new_insert_key_addr, &new_insert_key, KEY_SIZE);
>> ^
>> The patch just fixes it to avoid the warning.
>>
>> Signed-off-by: zhong jiang <zhongji...@huawei.com>
>> ---
>>  fs/reiserfs/ibalance.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/reiserfs/ibalance.c b/fs/reiserfs/ibalance.c
>> index b751eea..512ce95 100644
>> --- a/fs/reiserfs/ibalance.c
>> +++ b/fs/reiserfs/ibalance.c
>> @@ -818,7 +818,7 @@ int balance_internal(struct tree_balance *tb,
>>  int order;
>>  int insert_num, n, k;
>>  struct buffer_head *S_new;
>> -struct item_head new_insert_key;
>> +struct item_head uninitialized_var(new_insert_key);
>>  struct buffer_head *new_insert_ptr = NULL;
>>  struct item_head *new_insert_key_addr = insert_key;
> How do we know this isn't a real bug?  It isn't obvious to me that this
> warning is a false positive.
>
>
> .
>
  Yes, it may be a real bug; I will resend it in v2.
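
For reference, uninitialized_var() only silenced the diagnostic; it never
stored a defined value, which is exactly why it can hide a real bug (and why
the macro was later removed from mainline). A sketch assuming the old GCC
definition of the macro:

#include <stdio.h>

/* The old macro from include/linux/compiler-gcc.h: it tells the compiler
 * "trust me, this is set" without actually initializing anything. */
#define uninitialized_var(x) x = x

int main(void)
{
        int split = 1;
        int uninitialized_var(key);     /* expands to: int key = key; */

        if (split)
                key = 42;               /* the only real assignment */
        printf("key = %d\n", key);      /* fine here; garbage if !split */
        return 0;
}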


