Re: uninitialized pmem struct pages

2021-01-05 Thread David Hildenbrand
On 05.01.21 10:56, Dan Williams wrote:
> On Tue, Jan 5, 2021 at 1:37 AM David Hildenbrand  wrote:
>>
 Yeah, obviously the first one. Being able to add+use PMEM is more
 important than using each and every last MB of main memory.

 I wonder if we can just stop adding any system RAM like

 [ Memory Section]
 [ RAM ] [  Hole ]

 When there could be the possibility that the hole might actually be
 PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
 sequence of sections, not just a tiny hole)
>>>
>>> I like the simplicity of it... I worry that the capacity loss
>>> regression is easy to notice by looking at the output of free(1) from
>>> one kernel to the next and someone screams.
>>
>> Well, you can always make it configurable and then simply fail to add
>> PMEM later if impossible (trying to sub-section hot-add into early
>> section). It's in the hands of the sysadmin then ("max out system ram"
>> vs. "support any PMEM device that could eventually be there at
>> runtime"). Distros would go for the second.
>>
>> I agree that it's not optimal, but sometimes simplicity has to win.
> 
> Here's where we left it last time, open to pfn_to_online_page hacks...
> 
> https://lore.kernel.org/linux-mm/CAPcyv4ivq=epuepxix2ercvyf7+dv9yv215oue7x_y2x_jf...@mail.gmail.com
> 

Yeah, I recall. That's why I favor simple approaches right now - less
brain power to waste ;)

> I don't think a slow-path flag in the mem-section is too onerous, but
> I'll withhold judgement until I have the patch I'm thinking of
> in-hand. Let me give it a shot, you can always nack the final result.

Sure!


-- 
Thanks,

David / dhildenb



Re: uninitialized pmem struct pages

2021-01-05 Thread David Hildenbrand
On 05.01.21 08:44, Michal Hocko wrote:
> On Mon 04-01-21 17:30:48, David Hildenbrand wrote:
 Let's assume this is indeed a reserved pfn in the altmap. What's the
 actual address of the memmap?
>>>
>>> Not sure what exactly you are asking for but crash says
>>> crash> kmem -p 606
>>>   PAGE  PHYSICAL  MAPPING   INDEX CNT FLAGS
>>> f8c600181800 60600  0 fc000
>>
>> ^ this looks like it was somewhat initialized. All flags zero, nid/zone
>> set to -1 (wild guess) and thus the crash? weird
> 
> Yes that made me scratch my head as well.

My best guess would be that this is indeed part of the altmap reserve
with a memmap on the altmap. Maybe these values are just leftovers on
the PMEM device (e.g., from older kernels)?

In the kernel, I cannot find anything that would initialize node/zone to
-1 only.


@Dan: Will get_dev_pagemap() return a valid dev_pagemap for pfns falling
into the altmap reserved region? I think we should also initialize the
memmap of the altmap reserved region somehow.

-- 
Thanks,

David / dhildenb



Re: uninitialized pmem struct pages

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 1:37 AM David Hildenbrand  wrote:
>
> >> Yeah, obviously the first one. Being able to add+use PMEM is more
> >> important than using each and every last MB of main memory.
> >>
> >> I wonder if we can just stop adding any system RAM like
> >>
> >> [ Memory Section]
> >> [ RAM ] [  Hole ]
> >>
> >> When there could be the possibility that the hole might actually be
> >> PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
> >> sequence of sections, not just a tiny hole)
> >
> > I like the simplicity of it... I worry that the capacity loss
> > regression is easy to notice by looking at the output of free(1) from
> > one kernel to the next and someone screams.
>
> Well, you can always make it configurable and then simply fail to add
> PMEM later if impossible (trying to sub-section hot-add into early
> section). It's in the hands of the sysadmin then ("max out system ram"
> vs. "support any PMEM device that could eventually be there at
> runtime"). Distros would go for the second.
>
> I agree that it's not optimal, but sometimes simplicity has to win.

Here's where we left it last time, open to pfn_to_online_page hacks...

https://lore.kernel.org/linux-mm/CAPcyv4ivq=epuepxix2ercvyf7+dv9yv215oue7x_y2x_jf...@mail.gmail.com

I don't think a slow-path flag in the mem-section is too onerous, but
I'll withhold judgement until I have the patch I'm thinking of
in-hand. Let me give it a shot, you can always nack the final result.


Re: uninitialized pmem struct pages

2021-01-05 Thread David Hildenbrand
>> Yeah, obviously the first one. Being able to add+use PMEM is more
>> important than using each and every last MB of main memory.
>>
>> I wonder if we can just stop adding any system RAM like
>>
>> [ Memory Section]
>> [ RAM ] [  Hole ]
>>
>> When there could be the possibility that the hole might actually be
>> PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
>> sequence of sections, not just a tiny hole)
> 
> I like the simplicity of it... I worry that the capacity loss
> regression is easy to notice by looking at the output of free(1) from
> one kernel to the next and someone screams.

Well, you can always make it configurable and then simply fail to add
PMEM later if impossible (trying to sub-section hot-add into early
section). It's in the hands of the sysadmin then ("max out system ram"
vs. "support any PMEM device that could eventually be there at
runtime"). Distros would go for the second.

I agree that it's not optimal, but sometimes simplicity has to win.

-- 
Thanks,

David / dhildenb



Re: uninitialized pmem struct pages

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 1:25 AM David Hildenbrand  wrote:
>
> On 05.01.21 06:17, Dan Williams wrote:
> > On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand  wrote:
> >>
> >> On 04.01.21 11:03, Michal Hocko wrote:
> >>> Hi,
> >>> back in March [1] you have recommended 53cdc1cb29e8
> >>> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> >>> backported to stable trees and that has led to a more general discussion
> >>> about the current state of pfn walkers wrt. uninitialized pmem struct
> >>> pages. We haven't concluded any specific solution for that except for a
> >>> general sentiment that pfn_to_online_page should be able to catch those.
> >>> I might have missed any follow ups on that but I do not think we have
> >>> landed on any actual solution in the end. Have I just missed any 
> >>> followups?
> >>
> >> Thanks for raising this issue. It's still on my list of "broken and
> >> non-trivial to fix" things.
> >>
> >> So, as far as I recall, we still have the following two issues remaining:
> >>
> >> 1. pfn_to_online_page() false positives
> >>
> >> The semantics of pfn_to_online_page() were broken with sub-section
> >> hot-add in corner cases.
> >
> > The motivation for that backport was for pre-subsection kernels. When
> > onlining pmem that collides with the same section as System-RAM we may
> > have a situation like:
> >
> > |--   SECTION   -- |
> > |-- System RAM  --PMEM  -- |
> > |-- pfn_valid() -- PMEM metadata-- |
> >
> > So problem 0. is just System RAM + PMEM collisions independent of
> > sub-section support.
>
> IIRC, you were not able to hot-add the PMEM device before sub-section
> support (which was bad and fixed by sub-section hot-add). How was this a
> problem before?

The details are in the subsection changelog but the tl;dr is that I
clumsily hacked around it by throwing away PMEM, but that breaks when
PMEM shifts around in physical address space from one boot to the
next. The moving around is on 64MB granularity, so subsection hotplug
lets those shifts be covered and enables smaller granularity /
alignment support for smaller things like p2pdma mappings.

>
> >
> >>
> >> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
> >> boot memory, this memory section will contain parts ZONE_DEVICE memory
> >> and parts !ZONE_DEVICE memory. This can happen in sub-section
> >> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
> >> memory parts as the whole section is marked as online. Bad.
> >>
> >> One instance where this is still an issue is
> >> mm/memory-failure.c:memory_failure() and
> >> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> >> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> >> actually the right approach.
> >
> > This is complicated by MEMORY_DEVICE_PRIVATE that I believe wants to
> > say "yes" to pfn_to_online_page().
> >
> >>
> >> But worse, before ZONE_DEVICE hot-add
> >> 1. The whole section memmap does already exist (early sections always
> >> have a full memmap for the whole section)
> >> 2. The whole section memmap is initialized (although eventually with
> >> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> >> be accessed by pfn walkers.
> >>
> >> So when hotadding ZONE_DEVICE we are modifying already existing and
> >> visible memmaps. Bad.
> >
> > Where does the rewrite of a dummy page entry matter in practice? It
> > would certainly be exceedingly Bad if in-use 'struct page' instances
> > were rewritten. You're only alleging the former, correct?
>
> Yes, just another piece of the puzzle.
>
> >> 2. Deferred init of ZONE_DEVICE ranges
> >>
> >> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> >> and outside the memhp lock. I did not follow if the use of
> >> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> >> pagemap_range() actually completed. I don't think it does.
> >>
> >>>
> >>> Is anybody working on that?
> >>>
> >>
> >> I believe Dan mentioned somewhere that he wants to see a real instance
> >> of this producing a BUG before actually moving forward with a fix. I
> >> might be wrong.
> >
> > I think I'm missing an argument for the user-visible effects of the
> > "Bad." statements above. I think soft_offline_page() is a candidate
> > for a local fix because mm/memory-failure.c already has a significant
> > amount of page-type specific knowledge. So teaching it "yes" for
> > MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
> > ok to me.
>
> I am not completely against working around the issues we have in the
> code, even though nobody has stumbled over them yet. IMHO it's just
> error-prone code and handling we have here that will bite us in the
> long term. E.g., any pfn walker/code that sticks to the current
> documentation of pfn_to_online_page().
>
> I am not sure it's the right thing to do for
> 

Re: uninitialized pmem struct pages

2021-01-05 Thread David Hildenbrand
On 05.01.21 10:25, Michal Hocko wrote:
> On Tue 05-01-21 10:13:49, David Hildenbrand wrote:
>> On 05.01.21 10:05, Michal Hocko wrote:
>>> On Tue 05-01-21 00:57:43, Dan Williams wrote:
 On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko  wrote:
>
> On Tue 05-01-21 00:27:34, Dan Williams wrote:
>> On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
>>>
>>> On Tue 05-01-21 09:01:00, Michal Hocko wrote:
 On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> On 04.01.21 16:43, David Hildenbrand wrote:
>> On 04.01.21 16:33, Michal Hocko wrote:
>>> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
 On 04.01.21 16:10, Michal Hocko wrote:
>>> [...]
 Do the physical addresses you see fall into the same section as 
 boot
 memory? Or what's around these addresses?
>>>
>>> Yes I am getting a garbage for the first struct page belonging to 
>>> the
>>> pmem section [1]
>>> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 
>>> 0x1-0x603fff]
>>> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
>>> 0x606000-0x11d5fff] non-volatile
>>>
>>> The pfn without the initialized struct page is 0x606. This is a
>>> first pfn in a section.
>>
>> Okay, so we're not dealing with the "early section" mess I described,
>> different story.
>>
>> Due to [1], is_mem_section_removable() called
>> pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as 
>> not
>> initialized.
>>
>> Let's assume this is indeed a reserved pfn in the altmap. What's the
>> actual address of the memmap?
>>
>> I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it 
>> actually
>> part of the actual altmap (i.e. > 0x606) or maybe even 
>> self-hosted?
>>
>> If it's not self-hosted, initializing the relevant memmaps should 
>> work
>> just fine I guess. Otherwise things get more complicated.
>
> Oh, I forgot: pfn_to_online_page() should at least in your example 
> make
> sure other pfn walkers are safe. It was just an issue of
> is_mem_section_removable().

 Hmm, I suspect you are right. I haven't put this together, thanks! The 
 memory
 section is indeed marked offline so pfn_to_online_page would indeed 
 bail
 out:
 crash> p (0x606>>15)
 $3 = 3084
 crash> p mem_section[3084/128][3084 & 127]
 $4 = {
   section_mem_map = 18446736128020054019,
   usage = 0x902dcf956680,
   page_ext = 0x0,
   pad = 0
 }
 crash> p 18446736128020054019 & (1UL<<2)
 $5 = 0

 That makes it considerably less of a problem than I thought!
>>>
>>> Forgot to add that those who are running kernels without 53cdc1cb29e8
>>> ("drivers/base/memory.c: indicate all memory blocks as removable") for
>>> some reason can fix the crash by the following simple patch.
>>>
>>> Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
>>> ===
>>> --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
>>> +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
>>> @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
>>> goto out;
>>>
>>> for (i = 0; i < sections_per_block; i++) {
>>> -   if (!present_section_nr(mem->start_section_nr + i))
>>> +   unsigned long nr = mem->start_section_nr + i;
>>> +   if (!present_section_nr(nr))
>>> continue;
>>> -   pfn = section_nr_to_pfn(mem->start_section_nr + i);
>>> +   if (!online_section_nr()) {
>>
>> I assume that's online_section_nr(nr) in the version that compiles?
>
> Yup.
>
>> This makes sense because the memory block size is larger than the
>> section size. I suspect you have 1GB memory block size on this system,
>> but since the System RAM and PMEM collide at a 512MB alignment in a
>> memory block you end up walking the back end of the last 512MB of the
>> System RAM memory block and run into the offline PMEM section.
>
> Sections are 128MB and memory blocks are 2GB on this system.
>
>> So, I don't think it's pfn_to_online_page that necessarily needs to
>> know how to disambiguate each page, it's things that walk sections and
>> memory blocks and expects them to be consistent over the span.
>
> Well, memory hotplug code is hard wired to sparse memory model so in
> this particular case asking about the section 

Re: uninitialized pmem struct pages

2021-01-05 Thread David Hildenbrand
On 05.01.21 06:17, Dan Williams wrote:
> On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand  wrote:
>>
>> On 04.01.21 11:03, Michal Hocko wrote:
>>> Hi,
>>> back in March [1] you have recommended 53cdc1cb29e8
>>> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
>>> backported to stable trees and that has led to a more general discussion
>>> about the current state of pfn walkers wrt. uninitialized pmem struct
>>> pages. We haven't concluded any specific solution for that except for a
>>> general sentiment that pfn_to_online_page should be able to catch those.
>>> I might have missed any follow ups on that but I do not think we have
>>> landed on any actual solution in the end. Have I just missed any followups?
>>
>> Thanks for raising this issue. It's still on my list of "broken and
>> non-trivial to fix" things.
>>
>> So, as far as I recall, we still have the following two issues remaining:
>>
>> 1. pfn_to_online_page() false positives
>>
>> The semantics of pfn_to_online_page() were broken with sub-section
>> hot-add in corner cases.
> 
> The motivation for that backport was for pre-subsection kernels. When
> onlining pmem that collides with the same section as System-RAM we may
> have a situation like:
> 
> |--   SECTION   -- |
> |-- System RAM  --PMEM  -- |
> |-- pfn_valid() -- PMEM metadata-- |
> 
> So problem 0. is just System RAM + PMEM collisions independent of
> sub-section support.

IIRC, you were not able to hot-add the PMEM device before sub-section
support (which was bad and fixed by sub-section hot-add). How was this a
problem before?

> 
>>
>> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
>> boot memory, this memory section will contain parts ZONE_DEVICE memory
>> and parts !ZONE_DEVICE memory. This can happen in sub-section
>> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
>> memory parts as the whole section is marked as online. Bad.
>>
>> One instance where this is still an issue is
>> mm/memory-failure.c:memory_failure() and
>> mm/memory-failure.c:soft_offline_page(). I thought for a while about
>> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
>> actually the right approach.
> 
> This is complicated by MEMORY_DEVICE_PRIVATE that I believe wants to
> say "yes" to pfn_to_online_page().
> 
>>
>> But worse, before ZONE_DEVICE hot-add
>> 1. The whole section memmap does already exist (early sections always
>> have a full memmap for the whole section)
>> 2. The whole section memmap is initialized (although eventually with
>> dummy node/zone 0/0 for memory holes until that part is fixed) and might
>> be accessed by pfn walkers.
>>
>> So when hotadding ZONE_DEVICE we are modifying already existing and
>> visible memmaps. Bad.
> 
> Where does the rewrite of a dummy page entry matter in practice? It
> would certainly be exceedingly Bad if in-use 'struct page' instances
> were rewritten. You're only alleging the former, correct?

Yes, just another piece of the puzzle.

>> 2. Deferred init of ZONE_DEVICE ranges
>>
>> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
>> and outside the memhp lock. I did not follow if the use of
>> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
>> pagemap_range() actually completed. I don't think it does.
>>
>>>
>>> Is anybody working on that?
>>>
>>
>> I believe Dan mentioned somewhere that he wants to see a real instance
>> of this producing a BUG before actually moving forward with a fix. I
>> might be wrong.
> 
> I think I'm missing an argument for the user-visible effects of the
> "Bad." statements above. I think soft_offline_page() is a candidate
> for a local fix because mm/memory-failure.c already has a significant
> amount of page-type specific knowledge. So teaching it "yes" for
> MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
> ok to me.

I am not completely against working around the issues we have in the
code, even though nobody has stumbled over them yet. IMHO it's just
error-prone code and handling we have here that will bite us in the
long term. E.g., any pfn walker/code that sticks to the current
documentation of pfn_to_online_page().

I am not sure it's the right thing to do for
MEMORY_DEVICE_PRIVATE-ZONE_DEVICE, that requires more discussion.

>> We might tackle 1. by:

[...]
>> d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single
>> section. In the worst case, don't add partially present sections that
>> have big holes in the beginning/end. Like, if there is a 128MB section
>> with 126MB of memory followed by a 2MB hole, don't add that memory to
>> Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but
>> well, it would be the price to pay for simplicity. Being able to hotadd
>> PMEM is more important than using each and every last MB of memory.
> 
> The collisions that are out there in the wild are 64MB System 

Re: uninitialized pmem struct pages

2021-01-05 Thread Michal Hocko
On Tue 05-01-21 10:13:49, David Hildenbrand wrote:
> On 05.01.21 10:05, Michal Hocko wrote:
> > On Tue 05-01-21 00:57:43, Dan Williams wrote:
> >> On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko  wrote:
> >>>
> >>> On Tue 05-01-21 00:27:34, Dan Williams wrote:
>  On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
> >
> > On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> >> On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> >>> On 04.01.21 16:43, David Hildenbrand wrote:
>  On 04.01.21 16:33, Michal Hocko wrote:
> > On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> >> On 04.01.21 16:10, Michal Hocko wrote:
> > [...]
> >> Do the physical addresses you see fall into the same section as 
> >> boot
> >> memory? Or what's around these addresses?
> >
> > Yes I am getting a garbage for the first struct page belonging to 
> > the
> > pmem section [1]
> > [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 
> > 0x1-0x603fff]
> > [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
> > 0x606000-0x11d5fff] non-volatile
> >
> > The pfn without the initialized struct page is 0x606. This is a
> > first pfn in a section.
> 
>  Okay, so we're not dealing with the "early section" mess I described,
>  different story.
> 
>  Due to [1], is_mem_section_removable() called
>  pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as 
>  not
>  initialized.
> 
>  Let's assume this is indeed a reserved pfn in the altmap. What's the
>  actual address of the memmap?
> 
>  I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it 
>  actually
>  part of the actual altmap (i.e. > 0x606) or maybe even 
>  self-hosted?
> 
>  If it's not self-hosted, initializing the relevant memmaps should 
>  work
>  just fine I guess. Otherwise things get more complicated.
> >>>
> >>> Oh, I forgot: pfn_to_online_page() should at least in your example 
> >>> make
> >>> sure other pfn walkers are safe. It was just an issue of
> >>> is_mem_section_removable().
> >>
> >> Hmm, I suspect you are right. I haven't put this together, thanks! The 
> >> memory
> >> section is indeed marked offline so pfn_to_online_page would indeed 
> >> bail
> >> out:
> >> crash> p (0x606>>15)
> >> $3 = 3084
> >> crash> p mem_section[3084/128][3084 & 127]
> >> $4 = {
> >>   section_mem_map = 18446736128020054019,
> >>   usage = 0x902dcf956680,
> >>   page_ext = 0x0,
> >>   pad = 0
> >> }
> >> crash> p 18446736128020054019 & (1UL<<2)
> >> $5 = 0
> >>
> >> That makes it considerably less of a problem than I thought!
> >
> > Forgot to add that those who are running kernels without 53cdc1cb29e8
> > ("drivers/base/memory.c: indicate all memory blocks as removable") for
> > some reason can fix the crash by the following simple patch.
> >
> > Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > ===
> > --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> > +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> > goto out;
> >
> > for (i = 0; i < sections_per_block; i++) {
> > -   if (!present_section_nr(mem->start_section_nr + i))
> > +   unsigned long nr = mem->start_section_nr + i;
> > +   if (!present_section_nr(nr))
> > continue;
> > -   pfn = section_nr_to_pfn(mem->start_section_nr + i);
> > +   if (!online_section_nr()) {
> 
>  I assume that's online_section_nr(nr) in the version that compiles?
> >>>
> >>> Yup.
> >>>
>  This makes sense because the memory block size is larger than the
>  section size. I suspect you have 1GB memory block size on this system,
>  but since the System RAM and PMEM collide at a 512MB alignment in a
>  memory block you end up walking the back end of the last 512MB of the
>  System RAM memory block and run into the offline PMEM section.
> >>>
> >>> Sections are 128MB and memory blocks are 2GB on this system.
> >>>
>  So, I don't think it's pfn_to_online_page that necessarily needs to
>  know how to disambiguate each page, it's things that walk sections and
>  memory blocks and expects them to be consistent over the span.
> >>>
> >>> Well, memory hotplug code is hard wired to sparse memory model so in
> >>> this particular case asking about the section is ok. But pfn walkers
> >>> shouldn't 

Re: uninitialized pmem struct pages

2021-01-05 Thread David Hildenbrand
On 05.01.21 08:50, Michal Hocko wrote:
> On Mon 04-01-21 21:17:43, Dan Williams wrote:
>> On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand  wrote:
> [...]
>>> I believe Dan mentioned somewhere that he wants to see a real instance
>>> of this producing a BUG before actually moving forward with a fix. I
>>> might be wrong.
>>
>> I think I'm missing an argument for the user-visible effects of the
>> "Bad." statements above. I think soft_offline_page() is a candidate
>> for a local fix because mm/memory-failure.c already has a significant
>> amount of page-type specific knowledge. So teaching it "yes" for
>> MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
>> ok to me.
> 
> I believe we do not want to teach _every_ pfn walker about zone device
> pages. This would be quite error prone. Especially when a missing check
> could lead to a silently broken data or BUG_ON with debugging enabled
> (which is not the case for many production users). Or are we talking
> about different bugs here?

I'd like us to stick to the documentation, e.g., include/linux/mmzone.h


"
pfn_valid() is meant to be able to tell if a given PFN has valid memmap
associated with it or not. This means that a struct page exists for this
pfn. The caller cannot assume the page is fully initialized in general.
Hotplugable pages might not have been onlined yet. pfn_to_online_page()
will ensure the struct page is fully online and initialized. Special
pages (e.g. ZONE_DEVICE) are never onlined and should be treated
accordingly.
"

-- 
Thanks,

David / dhildenb



Re: uninitialized pmem struct pages

2021-01-05 Thread David Hildenbrand
On 05.01.21 10:05, Michal Hocko wrote:
> On Tue 05-01-21 00:57:43, Dan Williams wrote:
>> On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko  wrote:
>>>
>>> On Tue 05-01-21 00:27:34, Dan Williams wrote:
 On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
>
> On Tue 05-01-21 09:01:00, Michal Hocko wrote:
>> On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
>>> On 04.01.21 16:43, David Hildenbrand wrote:
 On 04.01.21 16:33, Michal Hocko wrote:
> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
>> On 04.01.21 16:10, Michal Hocko wrote:
> [...]
>> Do the physical addresses you see fall into the same section as boot
>> memory? Or what's around these addresses?
>
> Yes I am getting a garbage for the first struct page belonging to the
> pmem section [1]
> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
> 0x606000-0x11d5fff] non-volatile
>
> The pfn without the initialized struct page is 0x606. This is a
> first pfn in a section.

 Okay, so we're not dealing with the "early section" mess I described,
 different story.

 Due to [1], is_mem_section_removable() called
 pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
 initialized.

 Let's assume this is indeed a reserved pfn in the altmap. What's the
 actual address of the memmap?

 I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it 
 actually
 part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?

 If it's not self-hosted, initializing the relevant memmaps should work
 just fine I guess. Otherwise things get more complicated.
>>>
>>> Oh, I forgot: pfn_to_online_page() should at least in your example make
>>> sure other pfn walkers are safe. It was just an issue of
>>> is_mem_section_removable().
>>
>> Hmm, I suspect you are right. I haven't put this together, thanks! The 
>> memory
>> section is indeed marked offline so pfn_to_online_page would indeed bail
>> out:
>> crash> p (0x606>>15)
>> $3 = 3084
>> crash> p mem_section[3084/128][3084 & 127]
>> $4 = {
>>   section_mem_map = 18446736128020054019,
>>   usage = 0x902dcf956680,
>>   page_ext = 0x0,
>>   pad = 0
>> }
>> crash> p 18446736128020054019 & (1UL<<2)
>> $5 = 0
>>
>> That makes it considerably less of a problem than I thought!
>
> Forgot to add that those who are running kernels without 53cdc1cb29e8
> ("drivers/base/memory.c: indicate all memory blocks as removable") for
> some reason can fix the crash by the following simple patch.
>
> Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> ===
> --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> goto out;
>
> for (i = 0; i < sections_per_block; i++) {
> -   if (!present_section_nr(mem->start_section_nr + i))
> +   unsigned long nr = mem->start_section_nr + i;
> +   if (!present_section_nr(nr))
> continue;
> -   pfn = section_nr_to_pfn(mem->start_section_nr + i);
> +   if (!online_section_nr()) {

 I assume that's online_section_nr(nr) in the version that compiles?
>>>
>>> Yup.
>>>
 This makes sense because the memory block size is larger than the
 section size. I suspect you have 1GB memory block size on this system,
 but since the System RAM and PMEM collide at a 512MB alignment in a
 memory block you end up walking the back end of the last 512MB of the
 System RAM memory block and run into the offline PMEM section.
>>>
>>> Sections are 128MB and memory blocks are 2GB on this system.
>>>
 So, I don't think it's pfn_to_online_page that necessarily needs to
 know how to disambiguate each page, it's things that walk sections and
 memory blocks and expects them to be consistent over the span.
>>>
>>> Well, memory hotplug code is hard wired to sparse memory model so in
>>> this particular case asking about the section is ok. But pfn walkers
>>> shouldn't really care and only rely on pfn_to_online_page. But that will
>>> do the right thing here. So we are good as long as the section is marked
>>> properly. But this would become a problem as soon as the uninitialized
>>> pages where sharing the same memory section as David pointed out.
>>> pfn_to_online_page would then return something containing garbage. So 

Re: uninitialized pmem struct pages

2021-01-05 Thread Michal Hocko
On Tue 05-01-21 00:57:43, Dan Williams wrote:
> On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko  wrote:
> >
> > On Tue 05-01-21 00:27:34, Dan Williams wrote:
> > > On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
> > > >
> > > > On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > > > > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > > > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > > > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > > > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > > > > >> [...]
> > > > > > >>> Do the physical addresses you see fall into the same section as 
> > > > > > >>> boot
> > > > > > >>> memory? Or what's around these addresses?
> > > > > > >>
> > > > > > >> Yes I am getting a garbage for the first struct page belonging 
> > > > > > >> to the
> > > > > > >> pmem section [1]
> > > > > > >> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 
> > > > > > >> 0x1-0x603fff]
> > > > > > >> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
> > > > > > >> 0x606000-0x11d5fff] non-volatile
> > > > > > >>
> > > > > > >> The pfn without the initialized struct page is 0x606. This 
> > > > > > >> is a
> > > > > > >> first pfn in a section.
> > > > > > >
> > > > > > > Okay, so we're not dealing with the "early section" mess I 
> > > > > > > described,
> > > > > > > different story.
> > > > > > >
> > > > > > > Due to [1], is_mem_section_removable() called
> > > > > > > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, 
> > > > > > > as not
> > > > > > > initialized.
> > > > > > >
> > > > > > > Let's assume this is indeed a reserved pfn in the altmap. What's 
> > > > > > > the
> > > > > > > actual address of the memmap?
> > > > > > >
> > > > > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it 
> > > > > > > actually
> > > > > > > part of the actual altmap (i.e. > 0x606) or maybe even 
> > > > > > > self-hosted?
> > > > > > >
> > > > > > > If it's not self-hosted, initializing the relevant memmaps should 
> > > > > > > work
> > > > > > > just fine I guess. Otherwise things get more complicated.
> > > > > >
> > > > > > Oh, I forgot: pfn_to_online_page() should at least in your example 
> > > > > > make
> > > > > > sure other pfn walkers are safe. It was just an issue of
> > > > > > is_mem_section_removable().
> > > > >
> > > > > Hmm, I suspect you are right. I haven't put this together, thanks! 
> > > > > The memory
> > > > > section is indeed marked offline so pfn_to_online_page would indeed 
> > > > > bail
> > > > > out:
> > > > > crash> p (0x606>>15)
> > > > > $3 = 3084
> > > > > crash> p mem_section[3084/128][3084 & 127]
> > > > > $4 = {
> > > > >   section_mem_map = 18446736128020054019,
> > > > >   usage = 0x902dcf956680,
> > > > >   page_ext = 0x0,
> > > > >   pad = 0
> > > > > }
> > > > > crash> p 18446736128020054019 & (1UL<<2)
> > > > > $5 = 0
> > > > >
> > > > > That makes it considerably less of a problem than I thought!
> > > >
> > > > Forgot to add that those who are running kernels without 53cdc1cb29e8
> > > > ("drivers/base/memory.c: indicate all memory blocks as removable") for
> > > > some reason can fix the crash by the following simple patch.
> > > >
> > > > Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > > ===
> > > > --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> > > > +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > > @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> > > > goto out;
> > > >
> > > > for (i = 0; i < sections_per_block; i++) {
> > > > -   if (!present_section_nr(mem->start_section_nr + i))
> > > > +   unsigned long nr = mem->start_section_nr + i;
> > > > +   if (!present_section_nr(nr))
> > > > continue;
> > > > -   pfn = section_nr_to_pfn(mem->start_section_nr + i);
> > > > +   if (!online_section_nr()) {
> > >
> > > I assume that's online_section_nr(nr) in the version that compiles?
> >
> > Yup.
> >
> > > This makes sense because the memory block size is larger than the
> > > section size. I suspect you have 1GB memory block size on this system,
> > > but since the System RAM and PMEM collide at a 512MB alignment in a
> > > memory block you end up walking the back end of the last 512MB of the
> > > System RAM memory block and run into the offline PMEM section.
> >
> > Sections are 128MB and memory blocks are 2GB on this system.
> >
> > > So, I don't think it's pfn_to_online_page that necessarily needs to
> > > know how to disambiguate each page, it's things that walk sections and
> > > memory blocks and expects them to be consistent over the span.
> >
> > Well, memory hotplug code is hard wired to sparse memory model so in
> > this particular case asking 

Re: uninitialized pmem struct pages

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko  wrote:
>
> On Tue 05-01-21 00:27:34, Dan Williams wrote:
> > On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
> > >
> > > On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > > > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > > > >> [...]
> > > > > >>> Do the physical addresses you see fall into the same section as 
> > > > > >>> boot
> > > > > >>> memory? Or what's around these addresses?
> > > > > >>
> > > > > >> Yes I am getting a garbage for the first struct page belonging to 
> > > > > >> the
> > > > > >> pmem section [1]
> > > > > >> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 
> > > > > >> 0x1-0x603fff]
> > > > > >> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
> > > > > >> 0x606000-0x11d5fff] non-volatile
> > > > > >>
> > > > > >> The pfn without the initialized struct page is 0x606. This is a
> > > > > >> first pfn in a section.
> > > > > >
> > > > > > Okay, so we're not dealing with the "early section" mess I 
> > > > > > described,
> > > > > > different story.
> > > > > >
> > > > > > Due to [1], is_mem_section_removable() called
> > > > > > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as 
> > > > > > not
> > > > > > initialized.
> > > > > >
> > > > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > > > actual address of the memmap?
> > > > > >
> > > > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it 
> > > > > > actually
> > > > > > part of the actual altmap (i.e. > 0x606) or maybe even 
> > > > > > self-hosted?
> > > > > >
> > > > > > If it's not self-hosted, initializing the relevant memmaps should 
> > > > > > work
> > > > > > just fine I guess. Otherwise things get more complicated.
> > > > >
> > > > > Oh, I forgot: pfn_to_online_page() should at least in your example 
> > > > > make
> > > > > sure other pfn walkers are safe. It was just an issue of
> > > > > is_mem_section_removable().
> > > >
> > > > Hmm, I suspect you are right. I haven't put this together, thanks! The 
> > > > memory
> > > > section is indeed marked offline so pfn_to_online_page would indeed bail
> > > > out:
> > > > crash> p (0x606>>15)
> > > > $3 = 3084
> > > > crash> p mem_section[3084/128][3084 & 127]
> > > > $4 = {
> > > >   section_mem_map = 18446736128020054019,
> > > >   usage = 0x902dcf956680,
> > > >   page_ext = 0x0,
> > > >   pad = 0
> > > > }
> > > > crash> p 18446736128020054019 & (1UL<<2)
> > > > $5 = 0
> > > >
> > > > That makes it considerably less of a problem than I thought!
> > >
> > > Forgot to add that those who are running kernels without 53cdc1cb29e8
> > > ("drivers/base/memory.c: indicate all memory blocks as removable") for
> > > some reason can fix the crash by the following simple patch.
> > >
> > > Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > ===
> > > --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> > > +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> > > goto out;
> > >
> > > for (i = 0; i < sections_per_block; i++) {
> > > -   if (!present_section_nr(mem->start_section_nr + i))
> > > +   unsigned long nr = mem->start_section_nr + i;
> > > +   if (!present_section_nr(nr))
> > > continue;
> > > -   pfn = section_nr_to_pfn(mem->start_section_nr + i);
> > > +   if (!online_section_nr()) {
> >
> > I assume that's online_section_nr(nr) in the version that compiles?
>
> Yup.
>
> > This makes sense because the memory block size is larger than the
> > section size. I suspect you have 1GB memory block size on this system,
> > but since the System RAM and PMEM collide at a 512MB alignment in a
> > memory block you end up walking the back end of the last 512MB of the
> > System RAM memory block and run into the offline PMEM section.
>
> Sections are 128MB and memory blocks are 2GB on this system.
>
> > So, I don't think it's pfn_to_online_page that necessarily needs to
> > know how to disambiguate each page, it's things that walk sections and
> > memory blocks and expects them to be consistent over the span.
>
> Well, memory hotplug code is hard wired to sparse memory model so in
> this particular case asking about the section is ok. But pfn walkers
> shouldn't really care and only rely on pfn_to_online_page. But that will
> do the right thing here. So we are good as long as the section is marked
> properly. But this would become a problem as soon as the uninitialized
> pages where sharing the same memory 

Re: uninitialized pmem struct pages

2021-01-05 Thread Michal Hocko
On Tue 05-01-21 00:27:34, Dan Williams wrote:
> On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
> >
> > On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > > >> [...]
> > > > >>> Do the physical addresses you see fall into the same section as boot
> > > > >>> memory? Or what's around these addresses?
> > > > >>
> > > > >> Yes I am getting a garbage for the first struct page belonging to the
> > > > >> pmem section [1]
> > > > >> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 
> > > > >> 0x1-0x603fff]
> > > > >> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
> > > > >> 0x606000-0x11d5fff] non-volatile
> > > > >>
> > > > >> The pfn without the initialized struct page is 0x606. This is a
> > > > >> first pfn in a section.
> > > > >
> > > > > Okay, so we're not dealing with the "early section" mess I described,
> > > > > different story.
> > > > >
> > > > > Due to [1], is_mem_section_removable() called
> > > > > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as 
> > > > > not
> > > > > initialized.
> > > > >
> > > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > > actual address of the memmap?
> > > > >
> > > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it 
> > > > > actually
> > > > > part of the actual altmap (i.e. > 0x606) or maybe even 
> > > > > self-hosted?
> > > > >
> > > > > If it's not self-hosted, initializing the relevant memmaps should work
> > > > > just fine I guess. Otherwise things get more complicated.
> > > >
> > > > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > > > sure other pfn walkers are safe. It was just an issue of
> > > > is_mem_section_removable().
> > >
> > > Hmm, I suspect you are right. I haven't put this together, thanks! The 
> > > memory
> > > section is indeed marked offline so pfn_to_online_page would indeed bail
> > > out:
> > > crash> p (0x606>>15)
> > > $3 = 3084
> > > crash> p mem_section[3084/128][3084 & 127]
> > > $4 = {
> > >   section_mem_map = 18446736128020054019,
> > >   usage = 0x902dcf956680,
> > >   page_ext = 0x0,
> > >   pad = 0
> > > }
> > > crash> p 18446736128020054019 & (1UL<<2)
> > > $5 = 0
> > >
> > > That makes it considerably less of a problem than I thought!
> >
> > Forgot to add that those who are running kernels without 53cdc1cb29e8
> > ("drivers/base/memory.c: indicate all memory blocks as removable") for
> > some reason can fix the crash by the following simple patch.
> >
> > Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > ===
> > --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> > +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> > goto out;
> >
> > for (i = 0; i < sections_per_block; i++) {
> > -   if (!present_section_nr(mem->start_section_nr + i))
> > +   unsigned long nr = mem->start_section_nr + i;
> > +   if (!present_section_nr(nr))
> > continue;
> > -   pfn = section_nr_to_pfn(mem->start_section_nr + i);
> > +   if (!online_section_nr()) {
> 
> I assume that's online_section_nr(nr) in the version that compiles?

Yup.

> This makes sense because the memory block size is larger than the
> section size. I suspect you have 1GB memory block size on this system,
> but since the System RAM and PMEM collide at a 512MB alignment in a
> memory block you end up walking the back end of the last 512MB of the
> System RAM memory block and run into the offline PMEM section.

Sections are 128MB and memory blocks are 2GB on this system.

> So, I don't think it's pfn_to_online_page that necessarily needs to
> know how to disambiguate each page, it's things that walk sections and
> memory blocks and expects them to be consistent over the span.

Well, memory hotplug code is hard wired to the sparse memory model, so in
this particular case asking about the section is ok. But pfn walkers
shouldn't really care and should only rely on pfn_to_online_page. And that
will do the right thing here. So we are good as long as the section is
marked properly. But this would become a problem as soon as the
uninitialized pages were sharing the same memory section as David pointed
out. pfn_to_online_page would then return something containing garbage. So
we should still think of a way to either initialize all those pages or make
sure pfn_to_online_page recognizes them. The former is preferred IMHO.
-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
>
> On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > >> [...]
> > > >>> Do the physical addresses you see fall into the same section as boot
> > > >>> memory? Or what's around these addresses?
> > > >>
> > > >> Yes I am getting a garbage for the first struct page belonging to the
> > > >> pmem section [1]
> > > >> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> > > >> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
> > > >> 0x606000-0x11d5fff] non-volatile
> > > >>
> > > >> The pfn without the initialized struct page is 0x606. This is a
> > > >> first pfn in a section.
> > > >
> > > > Okay, so we're not dealing with the "early section" mess I described,
> > > > different story.
> > > >
> > > > Due to [1], is_mem_section_removable() called
> > > > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
> > > > initialized.
> > > >
> > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > actual address of the memmap?
> > > >
> > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> > > > part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
> > > >
> > > > If it's not self-hosted, initializing the relevant memmaps should work
> > > > just fine I guess. Otherwise things get more complicated.
> > >
> > > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > > sure other pfn walkers are safe. It was just an issue of
> > > is_mem_section_removable().
> >
> > Hmm, I suspect you are right. I haven't put this together, thanks! The 
> > memory
> > section is indeed marked offline so pfn_to_online_page would indeed bail
> > out:
> > crash> p (0x606>>15)
> > $3 = 3084
> > crash> p mem_section[3084/128][3084 & 127]
> > $4 = {
> >   section_mem_map = 18446736128020054019,
> >   usage = 0x902dcf956680,
> >   page_ext = 0x0,
> >   pad = 0
> > }
> > crash> p 18446736128020054019 & (1UL<<2)
> > $5 = 0
> >
> > That makes it considerably less of a problem than I thought!
>
> Forgot to add that those who are running kernels without 53cdc1cb29e8
> ("drivers/base/memory.c: indicate all memory blocks as removable") for
> some reason can fix the crash by the following simple patch.
>
> Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> ===
> --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> goto out;
>
> for (i = 0; i < sections_per_block; i++) {
> -   if (!present_section_nr(mem->start_section_nr + i))
> +   unsigned long nr = mem->start_section_nr + i;
> +   if (!present_section_nr(nr))
> continue;
> -   pfn = section_nr_to_pfn(mem->start_section_nr + i);
> +   if (!online_section_nr()) {

I assume that's online_section_nr(nr) in the version that compiles?

This makes sense because the memory block size is larger than the
section size. I suspect you have 1GB memory block size on this system,
but since the System RAM and PMEM collide at a 512MB alignment in a
memory block you end up walking the back end of the last 512MB of the
System RAM memory block and run into the offline PMEM section.

So, I don't think it's pfn_to_online_page that necessarily needs to
know how to disambiguate each page, it's things that walk sections and
memory blocks and expects them to be consistent over the span.


Re: uninitialized pmem struct pages

2021-01-05 Thread Michal Hocko
On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > On 04.01.21 16:43, David Hildenbrand wrote:
> > > On 04.01.21 16:33, Michal Hocko wrote:
> > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > >> [...]
> > >>> Do the physical addresses you see fall into the same section as boot
> > >>> memory? Or what's around these addresses?
> > >>
> > >> Yes I am getting a garbage for the first struct page belonging to the
> > >> pmem section [1]
> > >> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> > >> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x606000-0x11d5fff] 
> > >> non-volatile
> > >>
> > >> The pfn without the initialized struct page is 0x606. This is a
> > >> first pfn in a section.
> > > 
> > > Okay, so we're not dealing with the "early section" mess I described,
> > > different story.
> > > 
> > > Due to [1], is_mem_section_removable() called
> > > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
> > > initialized.
> > > 
> > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > actual address of the memmap?
> > > 
> > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> > > part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
> > > 
> > > If it's not self-hosted, initializing the relevant memmaps should work
> > > just fine I guess. Otherwise things get more complicated.
> > 
> > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > sure other pfn walkers are safe. It was just an issue of
> > is_mem_section_removable().
> 
> Hmm, I suspect you are right. I haven't put this together, thanks! The memory
> section is indeed marked offline so pfn_to_online_page would indeed bail
> out:
> crash> p (0x606>>15)
> $3 = 3084
> crash> p mem_section[3084/128][3084 & 127]
> $4 = {
>   section_mem_map = 18446736128020054019,
>   usage = 0x902dcf956680,
>   page_ext = 0x0,
>   pad = 0
> }
> crash> p 18446736128020054019 & (1UL<<2)
> $5 = 0
> 
> That makes it considerably less of a problem than I thought!

Forgot to add that those who are running kernels without 53cdc1cb29e8
("drivers/base/memory.c: indicate all memory blocks as removable") for
some reason can fix the crash by the following simple patch.

Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
===
--- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
+++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
@@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
goto out;
 
for (i = 0; i < sections_per_block; i++) {
-   if (!present_section_nr(mem->start_section_nr + i))
+   unsigned long nr = mem->start_section_nr + i;
+   if (!present_section_nr(nr))
continue;
-   pfn = section_nr_to_pfn(mem->start_section_nr + i);
+   if (!online_section_nr()) {
+   ret = 0;
+   break;
+   }
+   pfn = section_nr_to_pfn(nr);
ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
}
 

-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-05 Thread Michal Hocko
On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> On 04.01.21 16:43, David Hildenbrand wrote:
> > On 04.01.21 16:33, Michal Hocko wrote:
> >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> >>> On 04.01.21 16:10, Michal Hocko wrote:
> >> [...]
> >>> Do the physical addresses you see fall into the same section as boot
> >>> memory? Or what's around these addresses?
> >>
> >> Yes I am getting a garbage for the first struct page belonging to the
> >> pmem section [1]
> >> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> >> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x606000-0x11d5fff] 
> >> non-volatile
> >>
> >> The pfn without the initialized struct page is 0x606. This is a
> >> first pfn in a section.
> > 
> > Okay, so we're not dealing with the "early section" mess I described,
> > different story.
> > 
> > Due to [1], is_mem_section_removable() called
> > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
> > initialized.
> > 
> > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > actual address of the memmap?
> > 
> > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> > part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
> > 
> > If it's not self-hosted, initializing the relevant memmaps should work
> > just fine I guess. Otherwise things get more complicated.
> 
> Oh, I forgot: pfn_to_online_page() should at least in your example make
> sure other pfn walkers are safe. It was just an issue of
> is_mem_section_removable().

Hmm, I suspect you are right. I haven't put this together, thanks! The memory
section is indeed marked offline so pfn_to_online_page would indeed bail
out:
crash> p (0x606>>15)
$3 = 3084
crash> p mem_section[3084/128][3084 & 127]
$4 = {
  section_mem_map = 18446736128020054019,
  usage = 0x902dcf956680,
  page_ext = 0x0,
  pad = 0
}
crash> p 18446736128020054019 & (1UL<<2)
$5 = 0

That makes it considerably less of a problem than I thought!

Thanks David!
-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-04 Thread Michal Hocko
On Mon 04-01-21 21:17:43, Dan Williams wrote:
> On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand  wrote:
[...]
> > I believe Dan mentioned somewhere that he wants to see a real instance
> > of this producing a BUG before actually moving forward with a fix. I
> > might be wrong.
> 
> I think I'm missing an argument for the user-visible effects of the
> "Bad." statements above. I think soft_offline_page() is a candidate
> for a local fix because mm/memory-failure.c already has a significant
> amount of page-type specific knowledge. So teaching it "yes" for
> MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
> ok to me.

I believe we do not want to teach _every_ pfn walker about zone device
pages. This would be quite error prone, especially when a missing check
could lead to silently broken data or a BUG_ON with debugging enabled
(which is not the case for many production users). Or are we talking
about different bugs here?
-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-04 Thread Michal Hocko
On Mon 04-01-21 17:30:48, David Hildenbrand wrote:
> >> Let's assume this is indeed a reserved pfn in the altmap. What's the
> >> actual address of the memmap?
> > 
> > Not sure what exactly you are asking for but crash says
> > crash> kmem -p 606
> >   PAGE  PHYSICAL  MAPPING   INDEX CNT FLAGS
> > f8c600181800 60600  0 fc000
> 
> ^ this looks like it was somewhat initialized. All flags zero, nid/zone
> set to -1 (wild guess) and thus the crash? weird

Yes that made me scratch my head as well.
  
> >> I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> >> part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
> > 
> > I am not really familiar with the pmem so I would need more assistance
> > here. I've tried this (shot into the dark):
> > crash> struct page.pgmap f8c600181800
> >   pgmap = 0xf8c600181808
> 
> That's weird. If the memmap is at f8c600181800, why should the pgmap
> be at an offset of 8 bytes from there. The "pgmap" field is actually at
> an offset of 8 bytes within the memmap ...

I hadn't noticed this is an offset into struct page. This is indeed
weird because the structure is much larger than struct page. But maybe
it reuses storage of several struct pages in a row. Or maybe it is not
pgmap but something else that shares the offset in the structure.
-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-04 Thread Michal Hocko
On Mon 04-01-21 21:33:06, Dan Williams wrote:
> On Mon, Jan 4, 2021 at 7:59 AM Michal Hocko  wrote:
[...]
> > Not sure what exactly you are asking for but crash says
> > crash> kmem -p 606
> >   PAGE  PHYSICAL  MAPPING   INDEX CNT FLAGS
> > f8c600181800 60600  0 fc000
> >
> > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> > > part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
> >
> > I am not really familiar with the pmem so I would need more assistance
> > here. I've tried this (shot into the dark):
> > crash> struct page.pgmap f8c600181800
> >   pgmap = 0xf8c600181808
> 
> Does /proc/iomem show an active namespace in the range?

Any tips on how to dig that out of the crash dump?

> You should be
> able to skip ahead to the first pfn in that namespace to find the
> first dev_pagemap. I would have expected pfn_to_online_page() to have
> saved you here. This address range is section aligned.

Well, the affected code in this case was 
end_pfn = min(start_pfn + nr_pages,
	      zone_end_pfn(page_zone(pfn_to_page(start_pfn))));
where start_pfn was the first pfn of a memory section. This code was
completely unaware of zone device or dev_pagemap like most others pfn
walkers. It just wanted to get bounds for the zone but it stumbled over
uninitialized zone/node.
-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 7:59 AM Michal Hocko  wrote:
>
> On Mon 04-01-21 16:43:49, David Hildenbrand wrote:
> > On 04.01.21 16:33, Michal Hocko wrote:
> > > On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > >> On 04.01.21 16:10, Michal Hocko wrote:
> > > [...]
> > >> Do the physical addresses you see fall into the same section as boot
> > >> memory? Or what's around these addresses?
> > >
> > > Yes I am getting a garbage for the first struct page belonging to the
> > > pmem section [1]
> > > [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> > > [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x606000-0x11d5fff] 
> > > non-volatile
> > >
> > > The pfn without the initialized struct page is 0x606. This is a
> > > first pfn in a section.
> >
> > Okay, so we're not dealing with the "early section" mess I described,
> > different story.
> >
> > Due to [1], is_mem_section_removable() called
> > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
> > initialized.
>
> Exactly!
>
> > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > actual address of the memmap?
>
> Not sure what exactly you are asking for but crash says
> crash> kmem -p 606
>   PAGE  PHYSICAL  MAPPING   INDEX CNT FLAGS
> f8c600181800 60600  0 fc000
>
> > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> > part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
>
> I am not really familiar with the pmem so I would need more assistance
> here. I've tried this (shot into the dark):
> crash> struct page.pgmap f8c600181800
>   pgmap = 0xf8c600181808

Does /proc/iomem show an active namespace in the range? You should be
able to skip ahead to the first pfn in that namespace to find the
first dev_pagemap. I would have expected pfn_to_online_page() to have
saved you here. This address range is section aligned.


Re: uninitialized pmem struct pages

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand  wrote:
>
> On 04.01.21 11:03, Michal Hocko wrote:
> > Hi,
> > back in March [1] you have recommended 53cdc1cb29e8
> > ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> > backported to stable trees and that has led to a more general discussion
> > about the current state of pfn walkers wrt. uninitialized pmem struct
> > pages. We haven't concluded any specific solution for that except for a
> > general sentiment that pfn_to_online_page should be able to catch those.
> > I might have missed any follow ups on that but I do not think we have
> > landed on any actual solution in the end. Have I just missed any followups?
>
> Thanks for raising this issue. It's still on my list of "broken and
> non-trivial to fix" things.
>
> So, as far as I recall, we still have the following two issues remaining:
>
> 1. pfn_to_online_page() false positives
>
> The semantics of pfn_to_online_page() were broken with sub-section
> hot-add in corner cases.

The motivation for that backport was for pre-subsection kernels. When
onlining pmem that collides with the same section as System-RAM we may
have a situation like:

|--   SECTION   -- |
|-- System RAM  --PMEM  -- |
|-- pfn_valid() -- PMEM metadata-- |

So problem 0. is just System RAM + PMEM collisions independent of
sub-section support.

>
> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
> boot memory, this memory section will contain parts ZONE_DEVICE memory
> and parts !ZONE_DEVICE memory. This can happen in sub-section
> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
> memory parts as the whole section is marked as online. Bad.
>
> One instance where this is still an issue is
> mm/memory-failure.c:memory_failure() and
> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> actually the right approach.

This is complicated by MEMORY_DEVICE_PRIVATE that I believe wants to
say "yes" to pfn_to_online_page().

>
> But worse, before ZONE_DEVICE hot-add
> 1. The whole section memmap does already exist (early sections always
> have a full memmap for the whole section)
> 2. The whole section memmap is initialized (although eventually with
> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> be accessed by pfn walkers.
>
> So when hotadding ZONE_DEVICE we are modifying already existing and
> visible memmaps. Bad.

Where does the rewrite of a dummy page entry matter in practice? It
would certainly be exceedingly Bad if in-use 'struct page' instances
we're rewritten. You're only alleging the former, correct?

>
>
> 2. Deferred init of ZONE_DEVICE ranges
>
> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> and outside the memhp lock. I did not follow if the use of
> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> pagemap_range() actually completed. I don't think it does.
>
> >
> > Is anybody working on that?
> >
>
> I believe Dan mentioned somewhere that he wants to see a real instance
> of this producing a BUG before actually moving forward with a fix. I
> might be wrong.

I think I'm missing an argument for the user-visible effects of the
"Bad." statements above. I think soft_offline_page() is a candidate
for a local fix because mm/memory-failure.c already has a significant
amount of page-type specific knowledge. So teaching it "yes" for
MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
ok to me.

>
>
> We might tackle 1. by:
>
> a) [worked-around] doing get_dev_pagemap() before pfn_to_online_page() -
> just plain ugly.
>
> b) [worked-around] using a sub-section online-map and extending
> pfn_to_online_page(). IMHO ugly to fix this corner case.
>
> c) [worked-around] extending pfn_to_online_page() by zone checks. IMHO racy.
>
> d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single
> section. In the worst case, don't add partially present sections that
> have big holes in the beginning/end. Like, if there is a 128MB section
> with 126MB of memory followed by a 2MB hole, don't add that memory to
> Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but
> well, it would be the price to pay for simplicity. Being able to hotadd
> PMEM is more important than using each and every last MB of memory.

The collisions that are out there in the wild are 64MB System RAM
followed by 64MB of PMEM. If you're suggesting reducing System RAM
that collides with PMEM that's a consideration. It can't go the other
way because there are deployed configurations that have persistent
data there. Reducing System RAM runs into the problem of how early
does the kernel know that it's bordering ZONE_DEVICE. It's not just
PMEM, it's also EFI_MEMORY_SP (Linux Soft Reserved) memory.

> e) decrease the section size and drop 

Re: uninitialized pmem struct pages

2021-01-04 Thread David Hildenbrand
>> Let's assume this is indeed a reserved pfn in the altmap. What's the
>> actual address of the memmap?
> 
> Not sure what exactly you are asking for but crash says
> crash> kmem -p 606
>   PAGE  PHYSICAL  MAPPING   INDEX CNT FLAGS
> f8c600181800 60600  0 fc000

^ this looks like it was somewhat initialized. All flags zero, nid/zone
set to -1 (wild guess) and thus the crash? weird

>  
>> I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
>> part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
> 
> I am not really familiar with the pmem so I would need more assistance
> here. I've tried this (shot into the dark):
> crash> struct page.pgmap f8c600181800
>   pgmap = 0xf8c600181808

That's weird. If the memmap is at f8c600181800, why should the pgmap
be at an offset of 8 bytes from there? The "pgmap" field is actually at
an offset of 8 bytes within the memmap ...

Assuming the memmap is not actually ZONE_DEVICE, f8c600181800 really
only contains garbage, including the pgmap pointer :(

> crash> struct -x dev_pagemap 0xf8c600181808
> struct dev_pagemap {
>   altmap = {
> base_pfn = 0xf8c600181808, 
> end_pfn = 0xf8c600181808,

^ because this is very weird

> reserve = 0x0, 

^ and this indicates nothing was actually reserved.


-- 
Thanks,

David / dhildenb



Re: uninitialized pmem struct pages

2021-01-04 Thread Michal Hocko
On Mon 04-01-21 16:43:49, David Hildenbrand wrote:
> On 04.01.21 16:33, Michal Hocko wrote:
> > On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> >> On 04.01.21 16:10, Michal Hocko wrote:
> > [...]
> >> Do the physical addresses you see fall into the same section as boot
> >> memory? Or what's around these addresses?
> > 
> > Yes I am getting garbage for the first struct page belonging to the
> > pmem section [1]
> > [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> > [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x606000-0x11d5fff] 
> > non-volatile
> > 
> > The pfn without the initialized struct page is 0x606. This is the
> > first pfn in a section.
> 
> Okay, so we're not dealing with the "early section" mess I described,
> different story.
> 
> Due to [1], is_mem_section_removable() called
> pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
> initialized.

Exactly!

> Let's assume this is indeed a reserved pfn in the altmap. What's the
> actual address of the memmap?

Not sure what exactly you are asking for but crash says
crash> kmem -p 606
  PAGE  PHYSICAL  MAPPING   INDEX CNT FLAGS
f8c600181800 60600  0 fc000
 
> I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?

I am not really familiar with the pmem so I would need more assistance
here. I've tried this (shot into the dark):
crash> struct page.pgmap f8c600181800
  pgmap = 0xf8c600181808
crash> struct -x dev_pagemap 0xf8c600181808
struct dev_pagemap {
  altmap = {
base_pfn = 0xf8c600181808, 
end_pfn = 0xf8c600181808, 
reserve = 0x0, 
free = 0x0, 
align = 0x0, 
alloc = 0x
  }, 
  res = {
start = 0x0, 
end = 0xfc000, 
name = 0xf8c600181848 "H\030\030", 
flags = 0xf8c600181848, 
desc = 0x0, 
parent = 0x0, 
sibling = 0x0, 
child = 0x
  }, 
  ref = 0x0, 
  internal_ref = {
count = {
  counter = 0xfc000
}, 
percpu_count_ptr = 0xf8c600181888, 
release = 0xf8c600181888, 
confirm_switch = 0x0, 
force_atomic = 0x0, 
allow_reinit = 0x0, 
rcu = {
  next = 0x0, 
  func = 0x
}
  }, 
  done = {
done = 0x0, 
wait = {
  lock = {
{
  rlock = {
raw_lock = {
  {
val = {
  counter = 0xc000
}, 
{
  locked = 0x0, 
  pending = 0x0
}, 
{
  locked_pending = 0x0, 
  tail = 0xc000
}
  }
}
  }
}
  }, 
  head = {
next = 0xf8c6001818c8, 
prev = 0xf8c6001818c8
  }
}
  }, 
  dev = 0x0, 
  type = 0, 
  flags = 0x0, 
  ops = 0x0
}

Not sure whether this is of any use.
 
> If it's not self-hosted, initializing the relevant memmaps should work
> just fine I guess. Otherwise things get more complicated.

-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-04 Thread David Hildenbrand
On 04.01.21 16:43, David Hildenbrand wrote:
> On 04.01.21 16:33, Michal Hocko wrote:
>> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
>>> On 04.01.21 16:10, Michal Hocko wrote:
>> [...]
>>> Do the physical addresses you see fall into the same section as boot
>>> memory? Or what's around these addresses?
>>
>> Yes I am getting garbage for the first struct page belonging to the
>> pmem section [1]
>> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
>> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x606000-0x11d5fff] 
>> non-volatile
>>
>> The pfn without the initialized struct page is 0x606. This is the
>> first pfn in a section.
> 
> Okay, so we're not dealing with the "early section" mess I described,
> different story.
> 
> Due to [1], is_mem_section_removable() called
> pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
> initialized.
> 
> Let's assume this is indeed a reserved pfn in the altmap. What's the
> actual address of the memmap?
> 
> I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
> 
> If it's not self-hosted, initializing the relevant memmaps should work
> just fine I guess. Otherwise things get more complicated.

Oh, I forgot: pfn_to_online_page() should at least in your example make
sure other pfn walkers are safe. It was just an issue of
is_mem_section_removable().


-- 
Thanks,

David / dhildenb



Re: uninitialized pmem struct pages

2021-01-04 Thread David Hildenbrand
On 04.01.21 16:33, Michal Hocko wrote:
> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
>> On 04.01.21 16:10, Michal Hocko wrote:
> [...]
>> Do the physical addresses you see fall into the same section as boot
>> memory? Or what's around these addresses?
> 
> Yes I am getting garbage for the first struct page belonging to the
> pmem section [1]
> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x606000-0x11d5fff] 
> non-volatile
> 
> The pfn without the initialized struct page is 0x606. This is the
> first pfn in a section.

Okay, so we're not dealing with the "early section" mess I described,
different story.

Due to [1], is_mem_section_removable() called
pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
initialized.

Let's assume this is indeed a reserved pfn in the altmap. What's the
actual address of the memmap?

I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?

If it's not self-hosted, initializing the relevant memmaps should work
just fine I guess. Otherwise things get more complicated.


-- 
Thanks,

David / dhildenb



Re: uninitialized pmem struct pages

2021-01-04 Thread Michal Hocko
On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> On 04.01.21 16:10, Michal Hocko wrote:
[...]
> Do the physical addresses you see fall into the same section as boot
> memory? Or what's around these addresses?

Yes I am getting garbage for the first struct page belonging to the
pmem section [1]
[0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
[0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x606000-0x11d5fff] 
non-volatile

The pfn without the initialized struct page is 0x606. This is the
first pfn in a section.

--- 
[1] it shares the same memory block with the regular memory as those
are in the same 2G range.
-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-04 Thread David Hildenbrand
On 04.01.21 16:10, Michal Hocko wrote:
> On Mon 04-01-21 15:51:35, David Hildenbrand wrote:
>> On 04.01.21 15:26, Michal Hocko wrote:
>>> On Mon 04-01-21 11:45:39, David Hildenbrand wrote:
> [...]
>>>> One instance where this is still an issue is
>>>> mm/memory-failure.c:memory_failure() and
>>>> mm/memory-failure.c:soft_offline_page(). I thought for a while about
>>>> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
>>>> actually the right approach.
>>>>
>>>> But worse, before ZONE_DEVICE hot-add
>>>> 1. The whole section memmap does already exist (early sections always
>>>> have a full memmap for the whole section)
>>>> 2. The whole section memmap is initialized (although eventually with
>>>> dummy node/zone 0/0 for memory holes until that part is fixed) and might
>>>> be accessed by pfn walkers.
>>>>
>>>> So when hotadding ZONE_DEVICE we are modifying already existing and
>>>> visible memmaps. Bad.
>>>
>>> Could you elaborate please?
>>
>> Simplistic example: Assume you have a VM with 64MB on x86-64.
>>
>> We need exactly one memory section (-> one memory block device). We
>> allocate the memmap for a full section - an "early section". So we have
>> a memmap for 128MB, while 64MB are actually in use, the other 64MB is
>> initialized (like a memory hole). pfn_to_online_page() would return a
>> valid struct page for the whole section memmap.
>>
>> The remaining 64MB can later be used for hot-adding ZONE_DEVICE memory,
>> essentially re-initializing that part of the already-existing memmap.
>>
>> See pfn_valid():
>>
>> /*
>>  * Traditionally early sections always returned pfn_valid() for
>>  * the entire section-sized span.
>>  */
>> return early_section(ms) || pfn_section_valid(ms, pfn);
>>
>>
>> Depending on the memory layout of the system, a pfn walker might just be
>> about to stumble over this range getting re-initialized.
> 
> Right. But as long as pfn walkers are not synchronized with the memory
> hotplug this is a general problem with any struct page. Whether it
> belongs to pmem or a regular memory, no?

Yes, however in this case even the memory hotplug lock does not help.
But yes, related issues.

> 
>>>> 2. Deferred init of ZONE_DEVICE ranges
>>>>
>>>> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
>>>> and outside the memhp lock. I did not follow if the use of
>>>> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
>>>> pagemap_range() actually completed. I don't think it does.
>>>
>>> So a pfn walker can see an uninitialized struct page for a while, right?
>>>
>>> The problem that I have encountered is that some zone device pages are
>>> not initialized at all. That sounds like a different issue from those 2 above.
>>> I am having a hard time tracking what kind of pages those are and why we
>>> cannot initialize their zone/node and make them reserved at least.
>>
>> And you are sure that these are in fact ZONE_DEVICE pages? Not memory
>> holes e.g., tackled by
> 
> Well, the physical address matches the pmem range so I believe this is
> the case.
> 
> [...]
>> However, I do remember a discussion regarding "reserved altmap space"
>> ZONE_DEVICE ranges, and whether to initialize them or leave them
>> uninitialized. See comment in
>>
>> commit 77e080e7680e1e615587352f70c87b9e98126d03
>> Author: Aneesh Kumar K.V 
>> Date:   Fri Oct 18 20:19:39 2019 -0700
>>
>> mm/memunmap: don't access uninitialized memmap in memunmap_pages()
> 
> yes, the reserved altmap space sounds like it might be it.

[...]

> Would it be possible to iterate over the reserved space and initialize
> Node/zones at least?

Right, I was confused by the terminology. We actually initialize the
pages used for memory mapping in
move_pfn_range_to_zone()->memmap_init_zone(). But we seem to exclude the
"reserved space" - I think for good reason.

I think the issue is that this "reserved space" might actually get
overridden by something else later, as it won't be used as a memmap, but
just to store "anything".

Do the physical addresses you see fall into the same section as boot
memory? Or what's around these addresses?


-- 
Thanks,

David / dhildenb



Re: uninitialized pmem struct pages

2021-01-04 Thread Michal Hocko
On Mon 04-01-21 15:51:35, David Hildenbrand wrote:
> On 04.01.21 15:26, Michal Hocko wrote:
> > On Mon 04-01-21 11:45:39, David Hildenbrand wrote:
[...]
> >> One instance where this is still an issue is
> >> mm/memory-failure.c:memory_failure() and
> >> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> >> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> >> actually the right approach.
> >>
> >> But worse, before ZONE_DEVICE hot-add
> >> 1. The whole section memmap does already exist (early sections always
> >> have a full memmap for the whole section)
> >> 2. The whole section memmap is initialized (although eventually with
> >> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> >> be accessed by pfn walkers.
> >>
> >> So when hotadding ZONE_DEVICE we are modifying already existing and
> >> visible memmaps. Bad.
> > 
> > Could you elaborate please?
> 
> Simplistic example: Assume you have a VM with 64MB on x86-64.
> 
> We need exactly one memory section (-> one memory block device). We
> allocate the memmap for a full section - an "early section". So we have
> a memmap for 128MB, while 64MB are actually in use, the other 64MB is
> initialized (like a memory hole). pfn_to_online_page() would return a
> valid struct page for the whole section memmap.
> 
> The remaining 64MB can later be used for hot-adding ZONE_DEVICE memory,
> essentially re-initializing that part of the already-existing memmap.
> 
> See pfn_valid():
> 
> /*
>  * Traditionally early sections always returned pfn_valid() for
>  * the entire section-sized span.
>  */
> return early_section(ms) || pfn_section_valid(ms, pfn);
> 
> 
> Depending on the memory layout of the system, a pfn walker might just be
> about to stumble over this range getting re-initialized.

Right. But as long as pfn walkers are not synchronized with the memory
hotplug this is a general problem with any struct page. Whether it
belongs to pmem or a regular memory, no?

> >> 2. Deferred init of ZONE_DEVICE ranges
> >>
> >> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> >> and outside the memhp lock. I did not follow if the use of
> >> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> >> pagemap_range() actually completed. I don't think it does.
> > 
> > So a pfn walker can see an uninitialized struct page for a while, right?
> > 
> > The problem that I have encountered is that some zone device pages are
> > not initialized at all. That sounds like a different issue from those 2 above.
> > I am having a hard time tracking what kind of pages those are and why we
> > cannot initialize their zone/node and make them reserved at least.
> 
> And you are sure that these are in fact ZONE_DEVICE pages? Not memory
> holes e.g., tackled by

Well, the physical address matches the pmem range so I believe this is
the case.

[...]
> However, I do remember a discussion regarding "reserved altmap space"
> ZONE_DEVICE ranges, and whether to initialize them or leave them
> uninitialized. See comment in
> 
> commit 77e080e7680e1e615587352f70c87b9e98126d03
> Author: Aneesh Kumar K.V 
> Date:   Fri Oct 18 20:19:39 2019 -0700
> 
> mm/memunmap: don't access uninitialized memmap in memunmap_pages()

yes, the reserved altmap space sounds like it might be it.

> "With an altmap, the memmap falling into the reserved altmap space are
> not initialized and, therefore, contain a garbage NID and a garbage zone.".
> 
> I think the issue is that the ZONE_DEVICE pages that *host* the memmap
> of other pages might be left uninitialized.
> 
> Like pfn_to_page(VIRT_TO_PFN(pfn_to_page(zone_device_pfn))), which falls
> onto ZONE_DEVICE with an altmap, could be uninitialized. This is very
> similar to our Oscar's vmemmap-on-hotadded-memory approach, however,
> there we implicitly initialize the memmap of these pages just by the way
> the vmemmap is placed at the beginning of the memory block.
> 
> If altmap-reserved space is placed partially into an early section that
> is marked as online (issue 1. I described), we have the same issue as
> 1., just a little harder to fix :)

Would it be possible to iterate over the reserved space and initialize
Node/zones at least?
 
Thanks!
-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-04 Thread David Hildenbrand
On 04.01.21 15:26, Michal Hocko wrote:
> On Mon 04-01-21 11:45:39, David Hildenbrand wrote:
>> On 04.01.21 11:03, Michal Hocko wrote:
>>> Hi,
>>> back in March [1] you have recommended 53cdc1cb29e8
>>> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
>>> backported to stable trees and that has led to a more general discussion
>>> about the current state of pfn walkers wrt. uninitialized pmem struct
>>> pages. We haven't concluded any specific solution for that except for a
>>> general sentiment that pfn_to_online_page should be able to catch those.
>>> I might have missed any follow ups on that but I do not think we have
>>> landed on any actual solution in the end. Have I just missed any followups?
>>
>> Thanks for raising this issue. It's still on my list of "broken and
>> non-trivial to fix" things.
>>
>> So, as far as I recall, we still have the following two issues remaining:
>>
>> 1. pfn_to_online_page() false positives
>>
>> The semantics of pfn_to_online_page() were broken with sub-section
>> hot-add in corner cases.
>>
>> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
>> boot memory, this memory section will contain parts ZONE_DEVICE memory
>> and parts !ZONE_DEVICE memory. This can happen in sub-section
>> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
>> memory parts as the whole section is marked as online. Bad.
> 
> OK, I was not aware of this problem. Anyway, those pages should still be
> allocated and should retain their last state. I would have to double
> check but this shouldn't be harmful. Or what would be an
> actual problem?
> 
>> One instance where this is still an issue is
>> mm/memory-failure.c:memory_failure() and
>> mm/memory-failure.c:soft_offline_page(). I thought for a while about
>> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
>> actually the right approach.
>>
>> But worse, before ZONE_DEVICE hot-add
>> 1. The whole section memmap does already exist (early sections always
>> have a full memmap for the whole section)
>> 2. The whole section memmap is initialized (although eventually with
>> dummy node/zone 0/0 for memory holes until that part is fixed) and might
>> be accessed by pfn walkers.
>>
>> So when hotadding ZONE_DEVICE we are modifying already existing and
>> visible memmaps. Bad.
> 
> Could you elaborate please?

Simplistic example: Assume you have a VM with 64MB on x86-64.

We need exactly one memory section (-> one memory block device). We
allocate the memmap for a full section - an "early section". So we have
a memmap for 128MB, while 64MB are actually in use, the other 64MB is
initialized (like a memory hole). pfn_to_online_page() would return a
valid struct page for the whole section memmap.

The remaining 64MB can later be used for hot-adding ZONE_DEVICE memory,
essentially re-initializing that part of the already-existing memmap.

See pfn_valid():

/*
 * Traditionally early sections always returned pfn_valid() for
 * the entire section-sized span.
 */
return early_section(ms) || pfn_section_valid(ms, pfn);


Depending on the memory layout of the system, a pfn walker might just be
about to stumble over this range getting re-initialized.

>  
>> 2. Deferred init of ZONE_DEVICE ranges
>>
>> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
>> and outside the memhp lock. I did not follow if the use of
>> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
>> pagemap_range() actually completed. I don't think it does.
> 
> So a pfn walker can see an uninitialized struct page for a while, right?
> 
> The problem that I have encountered is that some zone device pages are
> not initialized at all. That sounds like a different issue from those 2 above.
> I am having a hard time tracking what kind of pages those are and why we
> cannot initialize their zone/node and make them reserved at least.

And you are sure that these are in fact ZONE_DEVICE pages? Not memory
holes e.g., tackled by

commit 4b094b7851bf4bf551ad456195d3f26e1c03bd74
Author: David Hildenbrand 
Date:   Mon Feb 3 17:33:55 2020 -0800

mm/page_alloc.c: initialize memmap of unavailable memory directly


commit e822969cab48b786b64246aad1a3ba2a774f5d23
Author: David Hildenbrand 
Date:   Mon Feb 3 17:33:48 2020 -0800

mm/page_alloc.c: fix uninitialized memmaps on a partially populated
last section


(note there is currently an upstream discussion on improving this
initialization, especially getting better node/zone information, mostly
involving Andrea and Mike - but it only changes "how" these parts are
initialized, not "if" or "when")

---

However, I do remember a discussion regarding "reserved altmap space"
ZONE_DEVICE ranges, and whether to initialize them or leave them
uninitialized. See comment in

commit 77e080e7680e1e615587352f70c87b9e98126d03
Author: Aneesh Kumar K.V 
Date:   Fri Oct 18 20:19:39 2019 -0700

mm/memunmap: don't access uninitialized 

Re: uninitialized pmem struct pages

2021-01-04 Thread Michal Hocko
On Mon 04-01-21 11:45:39, David Hildenbrand wrote:
> On 04.01.21 11:03, Michal Hocko wrote:
> > Hi,
> > back in March [1] you have recommended 53cdc1cb29e8
> > ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> > backported to stable trees and that has led to a more general discussion
> > about the current state of pfn walkers wrt. uninitialized pmem struct
> > pages. We haven't concluded any specific solution for that except for a
> > general sentiment that pfn_to_online_page should be able to catch those.
> > I might have missed any follow ups on that but I do not think we have
> > landed on any actual solution in the end. Have I just missed any followups?
> 
> Thanks for raising this issue. It's still on my list of "broken and
> non-trivial to fix" things.
> 
> So, as far as I recall, we still have the following two issues remaining:
> 
> 1. pfn_to_online_page() false positives
> 
> The semantics of pfn_to_online_page() were broken with sub-section
> hot-add in corner cases.
> 
> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
> boot memory, this memory section will contain parts ZONE_DEVICE memory
> and parts !ZONE_DEVICE memory. This can happen in sub-section
> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
> memory parts as the whole section is marked as online. Bad.

OK, I was not aware of this problem. Anyway, those pages should still be
allocated and should retain their last state. I would have to double
check but this shouldn't be harmful. Or what would be an
actual problem?

> One instance where this is still an issue is
> mm/memory-failure.c:memory_failure() and
> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> actually the right approach.
> 
> But worse, before ZONE_DEVICE hot-add
> 1. The whole section memmap does already exist (early sections always
> have a full memmap for the whole section)
> 2. The whole section memmap is initialized (although eventually with
> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> be accessed by pfn walkers.
> 
> So when hotadding ZONE_DEVICE we are modifying already existing and
> visible memmaps. Bad.

Could you elaborate please?
 
> 2. Deferred init of ZONE_DEVICE ranges
> 
> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> and outside the memhp lock. I did not follow if the use of
> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> pagemap_range() actually completed. I don't think it does.

So a pfn walker can see an uninitialized struct page for a while, right?

The problem that I have encountered is that some zone device pages are
not initialized at all. That sounds like a different issue from those 2 above.
I am having a hard time tracking what kind of pages those are and why we
cannot initialize their zone/node and make them reserved at least.

> > Is anybody working on that?
> > 
> 
> I believe Dan mentioned somewhere that he wants to see a real instance
> of this producing a BUG before actually moving forward with a fix. I
> might be wrong.

We have seen reports about those uninitialized struct pages on our 5.3
based kernels. Backporting 53cdc1cb29e8 helped for the particular report
but I still consider it a workaround rather than a fix. I do not have
any reports for other pfn walkers but we might be just lucky and I will
sleep better if I do not have to rely on luck.

[...]

I will think about your proposed solutions after I manage to get through
my email backlog.

Thanks!

-- 
Michal Hocko
SUSE Labs


Re: uninitialized pmem struct pages

2021-01-04 Thread David Hildenbrand
On 04.01.21 11:03, Michal Hocko wrote:
> Hi,
> back in March [1] you have recommended 53cdc1cb29e8
> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> backported to stable trees and that has led to a more general discussion
> about the current state of pfn walkers wrt. uninitialized pmem struct
> pages. We haven't concluded any specific solution for that except for a
> general sentiment that pfn_to_online_page should be able to catch those.
> I might have missed any follow ups on that but I do not think we have
> landed on any actual solution in the end. Have I just missed any followups?

Thanks for raising this issue. It's still on my list of "broken and
non-trivial to fix" things.

So, as far as I recall, we still have the following two issues remaining:

1. pfn_to_online_page() false positives

The semantics of pfn_to_online_page() were broken with sub-section
hot-add in corner cases.

If we have ZONE_DEVICE hot-added memory that overlaps in a section with
boot memory, this memory section will contain parts ZONE_DEVICE memory
and parts !ZONE_DEVICE memory. This can happen in sub-section
granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
memory parts as the whole section is marked as online. Bad.

One instance where this is still an issue is
mm/memory-failure.c:memory_failure() and
mm/memory-failure.c:soft_offline_page(). I thought for a while about
"fixing" these, but to me it felt like fixing pfn_to_online_page() is
actually the right approach.

But worse, before ZONE_DEVICE hot-add
1. The whole section memmap does already exist (early sections always
have a full memmap for the whole section)
2. The whole section memmap is initialized (although eventually with
dummy node/zone 0/0 for memory holes until that part is fixed) and might
be accessed by pfn walkers.

So when hotadding ZONE_DEVICE we are modifying already existing and
visible memmaps. Bad.


2. Deferred init of ZONE_DEVICE ranges

memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
and outside the memhp lock. I did not follow if the use of
get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
pagemap_range() actually completed. I don't think it does.

> 
> Is anybody working on that?
> 

I believe Dan mentioned somewhere that he wants to see a real instance
of this producing a BUG before actually moving forward with a fix. I
might be wrong.


We might tackle 1. by:

a) [worked-around] doing get_dev_pagemap() before pfn_to_online_page() -
just plain ugly.

b) [worked-around] using a sub-section online-map and extending
pfn_to_online_page(). IMHO ugly to fix this corner case.

c) [worked-around] extending pfn_to_online_page() by zone checks. IMHO racy.

d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single
section. In the worst case, don't add partially present sections that
have big holes in the beginning/end. Like, if there is a 128MB section
with 126MB of memory followed by a 2MB hole, don't add that memory to
Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but
well, it would be the price to pay for simplicity. Being able to hotadd
PMEM is more important than using each and every last MB of memory.

e) decrease the section size and drop sub-section hotadd support. As
sub-section hotadd is 2MB and a MAX_ORDER - 1 allocation corresponds to
4MB, this is mostly impossible. Worse on aarch64 with 64k base pages - 1024MB
sections while MAX_ORDER - 1 corresponds to 512MB. I think this is not feasible.


We might tackle 2. by making sure that get_dev_pagemap() only succeeds
after memmap_init_zone_device() finished. As I said, not sure if that is
already done.


> Also is there any reference explaining what those struct pages are and
> why we cannot initialize them? I am sorry if this has been explained to
> me but I really cannot find that in my notes anywhere. Most pmem pages
> should be initialized via memmap_init_zone_device, right?

Long story short, IMHO we have to

a) fix pfn_to_online_page() to only succeed for !ZONE_DEVICE memory.
b) fix get_dev_pagemap() to only succeed if memmap_init_zone_device()
completed. (again, unless this already works as desired)

Dealing with races (memory_offlining() racing with pfn_to_online_page())
is another story and stuff for the future. Another level of complexity :)


> I am asking mostly because we are starting to see these issues in
> production and while the only visible problem so far is a crash when
> reading sysfs (removable file), I am worried we are just lucky that no
> other pfn walker stumbles over this.

Which exact issue are you seeing? The one tackled by "[PATCH v1]
drivers/base/memory.c: indicate all memory blocks as removable" ?


-- 
Thanks,

David / dhildenb