Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-04-11 Thread Wei Yang
On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:

>Find a mostly untested patch which implements this below. I just booted it
>in a 64bit guest and it did not explode.
>
>It removes 55 lines of code instead of adding 35 and reduces the binary
>size by 408 bytes on 64bit and 128 bytes on 32bit.
>
>Note, it's a combo of changes (including your patch 1/6) and needs to be
>split up. It would be nice if you have time to split it up into separate
>patches, add proper changelogs and test the heck out of it on both 32 and
>64 bit. If you don't have time, please let me know.

I tried to test mem hotplug with this applied.

On x86_64, this looks good. It split ranges well. While I found some problem
on testing mem hotplug on x86_32.

I start up a guest memory and trying to plug 1G memory.

The original memory split range looks good:

[0.004260] ywtest: [mem 0x-0x0010]
[0.004261] ywtest: [mem 0x-0x000f] page size 4k
[0.004268] ywtest: [mem 0x3720-0x3740]
[0.004269] ywtest: [mem 0x3720-0x373f] page size 2M
[0.004271] ywtest: [mem 0x2000-0x3720]
[0.004272] ywtest: [mem 0x2000-0x371f] page size 2M
[0.004280] ywtest: [mem 0x0010-0x2000]
[0.004281] ywtest: [mem 0x0010-0x001f] page size 4k
[0.004281] ywtest: [mem 0x0020-0x1fff] page size 2M
[0.004292] ywtest: [mem 0x3740-0x375fe000]
[0.004293] ywtest: [mem 0x3740-0x375fdfff] page size 4k

While I can't online the new added memory device. And the new memory device's
range is out of 4G, which is a little bit strange.

>
>Thanks,
>
>   tglx
>
-- 
Wei Yang
Help you, Help me


Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-28 Thread Wei Yang
On Thu, Mar 28, 2019 at 09:08:43AM +0100, Thomas Gleixner wrote:
>On Thu, 28 Mar 2019, Wei Yang wrote:
>> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>> My question is to the for loop.
>> 
>> For example, we have a range
>> 
>>+--+-+---+
>>^ 128M   1G  2G
>>128M - 4K
>> 
>
>Yes. You misread mr_try_map().

You are right, I misunderstand the functionality of mr_try_map().

I went through the code and this looks nice to me. I have to say you are
genius.

Thanks for your code and I really learned a lot from this.

BTW, for the test cases, I thinks mem-hotplug may be introduce layout
diversity. Since mem-hotplug's range has to be 128M aligned.

>
>Thanks,
>
>   tglx

-- 
Wei Yang
Help you, Help me


Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-28 Thread Thomas Gleixner
On Thu, 28 Mar 2019, Wei Yang wrote:
> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
> My question is to the for loop.
> 
> For example, we have a range
> 
>+--+-+---+
>^ 128M   1G  2G
>128M - 4K
> 
> If my understanding is correct, the original behavior will split this into
> three ranges:
> 
>4K size: [128M - 4K, 128M]
>2M size: [128M, 1G]
>1G size: [1G, 2G]
> 
> While after your change, it will split this into two ranges:
> 
>?? size: [128M - 4K, 1G]
>2M size: [1G, 2G]
>
> The question mark here is because you leave the page_size_mask unchanged in
> this case.
> 
> Is my understanding correct? Or I missed something?

Yes. You misread mr_try_map().

Thanks,

tglx


Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-28 Thread Thomas Gleixner
Wei,

On Thu, 28 Mar 2019, Wei Yang wrote:

please trim your replies. It's annoying if one has to search the content in
the middle of a large useless quote.

> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
> >Wei,
> >-static int __meminit split_mem_range(struct map_range *mr, int nr_range,
> >- unsigned long start,
> >- unsigned long end)
> >-{
> >-unsigned long start_pfn, end_pfn, limit_pfn;
> >-unsigned long pfn;
> >-int i;
> >+if (!IS_ALIGNED(mr->end, mi->size)) {
> >+/* Try to fit as much as possible */
> >+len = round_down(mr->end - mr->start, mi->size);
> >+if (!len)
> >+return false;
> >+mr->end = mr->start + len;
> >+}
> > 
> >-limit_pfn = PFN_DOWN(end);
> >+/* Store the effective page size mask */
> >+mr->page_size_mask = mi->mask;
> 
> I don't get the point here. Why store the effective page size mask just for
> the "middle" range.
> 
> The original behavior will set the "head" and "tail" range with a lower level
> page size mask.

What has this to do with the middle range? Nothing. This is the path where
the current map level (1g, 2m, 4k) is applied from mr->start to
mr->end. That's the effective mapping of this map_range entry. 

Thanks,

tglx


Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-28 Thread Wei Yang
On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
> 
>+static int __meminit split_mem_range(struct map_range *mr, unsigned long 
>start,
>+   unsigned long end)
>+{
>+  static const struct mapinfo mapinfos[] = {
> #ifdef CONFIG_X86_64
>+  { .mask = 1U << PG_LEVEL_1G, .size = PUD_SIZE },
> #endif
>+  { .mask = 1U << PG_LEVEL_2M, .size = PMD_SIZE },
>+  { .mask = 0, .size = PAGE_SIZE },
>+  };
>+  const struct mapinfo *mi;
>+  struct map_range *curmr;
>+  unsigned long addr;
>+  int idx;
>+
>+  for (idx = 0, addr = start, curmr = mr; addr < end; idx++, curmr++) {
>+  BUG_ON(idx == NR_RANGE_MR);
>+  mr_setup(curmr, addr, end);
> 
>+  /* Try map sizes top down. PAGE_SIZE will always succeed. */
>+  for (mi = mapinfos; !mr_try_map(curmr, mi); mi++);
>
>+  /* Get the start address for the next range */
>+  addr = curmr->end;
>   }

I re-arrange the code to make split_mem_range() here easy to read.

My question is to the for loop.

For example, we have a range

   +--+-+---+
   ^ 128M   1G  2G
   128M - 4K

If my understanding is correct, the original behavior will split this into
three ranges:

   4K size: [128M - 4K, 128M]
   2M size: [128M, 1G]
   1G size: [1G, 2G]

While after your change, it will split this into two ranges:

   ?? size: [128M - 4K, 1G]
   2M size: [1G, 2G]

The question mark here is because you leave the page_size_mask unchanged in
this case.

Is my understanding correct? Or I missed something?

-- 
Wei Yang
Help you, Help me


Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-27 Thread Wei Yang
On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>Wei,
>
>On Tue, 12 Feb 2019, Wei Yang wrote:
>> 
>> This patch changes the implementation from the first perception to the
>> second to reduce one different handling on end_pfn. After doing so, the
>> code is easier to read.
>
>It's maybe slightly easier to read, but it's still completely unreadable
>garbage.
>
>  Not your fault, it was garbage before.
>
>But refining garbage still results in garbage. Just the smell is slightly
>different.
>
>Why?
>
> 1) Doing all the calculations PFN based is just a pointless
>indirection. Just look at all the rounding magic and back and forth
>conversions.
>
>All of that can be done purely address/size based which makes the code
>truly readable.
>
> 2) The 5(3) sections are more or less copied code with a few changes of
>constants, except for the first section (A) which has an extra quirk
>for 32bit. Plus all the 64bit vs. 32bit #ifdeffery which is not making
>it any better.
>
>This copied mess can be avoided by using helper functions and proper
>loops.
>
> 3) During the bootmem phase the code tries to preserve large mappings
>_AFTER_ splitting them up and then it tries to merge the resulting
>overlaps.
>
>This is completely backwards because the expansion of the range can be
>tried right away when then mapping of a large page is attempted. Surely
>not with the current mess, but with a proper loop based approach it can
>be done nicely.
>
>Btw, there is a bug in that expansion code which could result in
>undoing the enforced 4K mapping of the lower 2M/4M range on 32bit. It's
>probably not an issue in practice because the low range is usually not
>contiguous. But it works by chance not by design.

Hi, Thomas

I want to confirm with you, this bug is in adjust_range_page_size_mask(),
right?

-- 
Wei Yang
Help you, Help me


Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-27 Thread Wei Yang
On Thu, Mar 28, 2019 at 01:16:08AM +0100, Thomas Gleixner wrote:
>Wei,
>
>On Wed, 27 Mar 2019, Wei Yang wrote:
>> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>> >Note, it's a combo of changes (including your patch 1/6) and needs to be
>> >split up. It would be nice if you have time to split it up into separate
>> >patches, add proper changelogs and test the heck out of it on both 32 and
>> >64 bit. If you don't have time, please let me know.
>> >
>> 
>> Thanks for your suggestions :-)
>> 
>> Just get my head up, will try to understand the code and test on both
>> arch.
>> 
>> BTW, do you have some suggestions in the test? Currently I just use
>> bootup test. Basicly I think this is fine to cover the cases. Maybe you
>> would have some better idea.
>
>This is about bootup in the first place. After that memory hotplug which
>you can emulate with qemu/kvm IIRC.
>

Ok, this is not difficult.

>The important part about testing is to have systems which expose a wide
>variety memory layouts.
>

May be I can add some test with nvdimm.

>Thanks,
>
>   tglx

-- 
Wei Yang
Help you, Help me


Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-27 Thread Thomas Gleixner
Wei,

On Wed, 27 Mar 2019, Wei Yang wrote:
> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
> >Note, it's a combo of changes (including your patch 1/6) and needs to be
> >split up. It would be nice if you have time to split it up into separate
> >patches, add proper changelogs and test the heck out of it on both 32 and
> >64 bit. If you don't have time, please let me know.
> >
> 
> Thanks for your suggestions :-)
> 
> Just get my head up, will try to understand the code and test on both
> arch.
> 
> BTW, do you have some suggestions in the test? Currently I just use
> bootup test. Basicly I think this is fine to cover the cases. Maybe you
> would have some better idea.

This is about bootup in the first place. After that memory hotplug which
you can emulate with qemu/kvm IIRC.

The important part about testing is to have systems which expose a wide
variety memory layouts.

Thanks,

tglx



Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-27 Thread Wei Yang
On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>Wei,
>
>On Tue, 12 Feb 2019, Wei Yang wrote:
>> 
>> This patch changes the implementation from the first perception to the
>> second to reduce one different handling on end_pfn. After doing so, the
>> code is easier to read.
>
>It's maybe slightly easier to read, but it's still completely unreadable
>garbage.
>
>  Not your fault, it was garbage before.
>
>But refining garbage still results in garbage. Just the smell is slightly
>different.
>
>Why?
>
> 1) Doing all the calculations PFN based is just a pointless
>indirection. Just look at all the rounding magic and back and forth
>conversions.
>
>All of that can be done purely address/size based which makes the code
>truly readable.
>
> 2) The 5(3) sections are more or less copied code with a few changes of
>constants, except for the first section (A) which has an extra quirk
>for 32bit. Plus all the 64bit vs. 32bit #ifdeffery which is not making
>it any better.
>
>This copied mess can be avoided by using helper functions and proper
>loops.
>
> 3) During the bootmem phase the code tries to preserve large mappings
>_AFTER_ splitting them up and then it tries to merge the resulting
>overlaps.
>
>This is completely backwards because the expansion of the range can be
>tried right away when then mapping of a large page is attempted. Surely
>not with the current mess, but with a proper loop based approach it can
>be done nicely.
>
>Btw, there is a bug in that expansion code which could result in
>undoing the enforced 4K mapping of the lower 2M/4M range on 32bit. It's
>probably not an issue in practice because the low range is usually not
>contiguous. But it works by chance not by design.
>
> 4) The debug print string retrieval function is just silly.
>
>The logic for rewriting this is pretty obvious:
>
>while (addr < end) {
> setup_mr(mr, addr, end);
> for_each_map_size(1G, 2M, 4K) {
>   if (try_map(mr, size))
>   break;
> }
> addr = mr->end;
>}
>
>setup_mr() takes care of the 32bit 0-2/4M range by limiting the
>range and setting the allowed size mask in mr to 4k only.
>
>try_map() does:
>
>   1) Try to expand the range to preserve large pages during bootmem
>   
>   2) If the range start is not aligned, limit the end to the large
>  size boundary, so the next lower map size will only cover the
> unaligned portion.
>
>   3) If the range end is not aligned, fit as much large
>  size as possible.
>
>   No magic A-E required, because it just falls into place naturally and
>   the expansion is done at the right place and not after the fact.
>
>Find a mostly untested patch which implements this below. I just booted it
>in a 64bit guest and it did not explode.
>
>It removes 55 lines of code instead of adding 35 and reduces the binary
>size by 408 bytes on 64bit and 128 bytes on 32bit.
>
>Note, it's a combo of changes (including your patch 1/6) and needs to be
>split up. It would be nice if you have time to split it up into separate
>patches, add proper changelogs and test the heck out of it on both 32 and
>64 bit. If you don't have time, please let me know.
>

Thanks for your suggestions :-)

Just get my head up, will try to understand the code and test on both
arch.

BTW, do you have some suggestions in the test? Currently I just use
bootup test. Basicly I think this is fine to cover the cases. Maybe you
would have some better idea.

>Thanks,
>
>   tglx
>
>8<---
> arch/x86/mm/init.c |  259 
> -
> 1 file changed, 102 insertions(+), 157 deletions(-)
>
>--- a/arch/x86/mm/init.c
>+++ b/arch/x86/mm/init.c
>@@ -157,17 +157,28 @@ void  __init early_alloc_pgt_buf(void)
>   pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> }
> 
>-int after_bootmem;
>+int after_bootmem __ro_after_init;
> 
> early_param_on_off("gbpages", "nogbpages", direct_gbpages, 
> CONFIG_X86_DIRECT_GBPAGES);
> 
> struct map_range {
>-  unsigned long start;
>-  unsigned long end;
>-  unsigned page_size_mask;
>+  unsigned long   start;
>+  unsigned long   end;
>+  unsigned intpage_size_mask;
> };
> 
>-static int page_size_mask;
>+#ifdef CONFIG_X86_32
>+#define NR_RANGE_MR 3
>+#else
>+#define NR_RANGE_MR 5
>+#endif
>+
>+struct mapinfo {
>+  unsigned intmask;
>+  unsigned intsize;
>+};
>+
>+static unsigned int page_size_mask __ro_after_init;
> 
> static void __init probe_page_size_mask(void)
> {
>@@ -177,7 +188,7 @@ static void __init probe_page_size_mask(
>* large pages into small in interrupt context, etc.
>*/
>   if (boot_cpu_has(X86_FEATURE_PSE) && !debug_pagealloc_enabled())
>-  page_size_mask |= 1 << PG_LEVEL_2M;
>+  page_size_mask |= 1U << PG_LEVEL_2M;
>   else
>   

Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-03-24 Thread Thomas Gleixner
Wei,

On Tue, 12 Feb 2019, Wei Yang wrote:
> 
> This patch changes the implementation from the first perception to the
> second to reduce one different handling on end_pfn. After doing so, the
> code is easier to read.

It's maybe slightly easier to read, but it's still completely unreadable
garbage.

  Not your fault, it was garbage before.

But refining garbage still results in garbage. Just the smell is slightly
different.

Why?

 1) Doing all the calculations PFN based is just a pointless
indirection. Just look at all the rounding magic and back and forth
conversions.

All of that can be done purely address/size based which makes the code
truly readable.

 2) The 5(3) sections are more or less copied code with a few changes of
constants, except for the first section (A) which has an extra quirk
for 32bit. Plus all the 64bit vs. 32bit #ifdeffery which is not making
it any better.

This copied mess can be avoided by using helper functions and proper
loops.

 3) During the bootmem phase the code tries to preserve large mappings
_AFTER_ splitting them up and then it tries to merge the resulting
overlaps.

This is completely backwards because the expansion of the range can be
tried right away when then mapping of a large page is attempted. Surely
not with the current mess, but with a proper loop based approach it can
be done nicely.

Btw, there is a bug in that expansion code which could result in
undoing the enforced 4K mapping of the lower 2M/4M range on 32bit. It's
probably not an issue in practice because the low range is usually not
contiguous. But it works by chance not by design.

 4) The debug print string retrieval function is just silly.

The logic for rewriting this is pretty obvious:

while (addr < end) {
  setup_mr(mr, addr, end);
  for_each_map_size(1G, 2M, 4K) {
if (try_map(mr, size))
break;
  }
  addr = mr->end;
}

setup_mr() takes care of the 32bit 0-2/4M range by limiting the
range and setting the allowed size mask in mr to 4k only.

try_map() does:

   1) Try to expand the range to preserve large pages during bootmem
   
   2) If the range start is not aligned, limit the end to the large
  size boundary, so the next lower map size will only cover the
  unaligned portion.

   3) If the range end is not aligned, fit as much large
  size as possible.

   No magic A-E required, because it just falls into place naturally and
   the expansion is done at the right place and not after the fact.

Find a mostly untested patch which implements this below. I just booted it
in a 64bit guest and it did not explode.

It removes 55 lines of code instead of adding 35 and reduces the binary
size by 408 bytes on 64bit and 128 bytes on 32bit.

Note, it's a combo of changes (including your patch 1/6) and needs to be
split up. It would be nice if you have time to split it up into separate
patches, add proper changelogs and test the heck out of it on both 32 and
64 bit. If you don't have time, please let me know.

Thanks,

tglx

8<---
 arch/x86/mm/init.c |  259 -
 1 file changed, 102 insertions(+), 157 deletions(-)

--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -157,17 +157,28 @@ void  __init early_alloc_pgt_buf(void)
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
 }
 
-int after_bootmem;
+int after_bootmem __ro_after_init;
 
 early_param_on_off("gbpages", "nogbpages", direct_gbpages, 
CONFIG_X86_DIRECT_GBPAGES);
 
 struct map_range {
-   unsigned long start;
-   unsigned long end;
-   unsigned page_size_mask;
+   unsigned long   start;
+   unsigned long   end;
+   unsigned intpage_size_mask;
 };
 
-static int page_size_mask;
+#ifdef CONFIG_X86_32
+#define NR_RANGE_MR 3
+#else
+#define NR_RANGE_MR 5
+#endif
+
+struct mapinfo {
+   unsigned intmask;
+   unsigned intsize;
+};
+
+static unsigned int page_size_mask __ro_after_init;
 
 static void __init probe_page_size_mask(void)
 {
@@ -177,7 +188,7 @@ static void __init probe_page_size_mask(
 * large pages into small in interrupt context, etc.
 */
if (boot_cpu_has(X86_FEATURE_PSE) && !debug_pagealloc_enabled())
-   page_size_mask |= 1 << PG_LEVEL_2M;
+   page_size_mask |= 1U << PG_LEVEL_2M;
else
direct_gbpages = 0;
 
@@ -201,7 +212,7 @@ static void __init probe_page_size_mask(
/* Enable 1 GB linear kernel mappings if available: */
if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {
printk(KERN_INFO "Using GB pages for direct mapping\n");
-   page_size_mask |= 1 << PG_LEVEL_1G;
+   page_size_mask |= 1U << PG_LEVEL_1G;
} else {
direct_gbpages = 0;
}
@@ 

[PATCH 4/6] x86, mm: make split_mem_range() more easy to read

2019-02-11 Thread Wei Yang
As the comment explains, there are at most 5 possible ranges:

  kkkmmmGGGmmmkkk
  (A)(B)(C)(D)(E)

While there are two ways to perceive:

* C & D are extra ranges on X86_64
* B & C are extra ranges on X86_64

Current implementation takes the first way, which leads to handling
end_pfn of B differently:

* align to PMD on X86_32
* align to PUD on X86_64

If we take the second way, we don't need to handle it differently.

* the end_pfn of B only need to align to PUD
* the end_pfn of D only need to align to PMD

This patch changes the implementation from the first perception to the
second to reduce one different handling on end_pfn. After doing so, the
code is easier to read.

Signed-off-by: Wei Yang 
---
 arch/x86/mm/init.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 2b782dcd6d71..87275238dbb0 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -389,25 +389,21 @@ static int __meminit split_mem_range(struct map_range *mr,
pfn = end_pfn;
}
 
+#ifdef CONFIG_X86_64
/*
 * Range (B):
 * big page (2M) range
 */
start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
-#ifdef CONFIG_X86_32
-   end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
-#else /* CONFIG_X86_64 */
end_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
if (end_pfn > round_down(limit_pfn, PFN_DOWN(PMD_SIZE)))
end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
-#endif
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<