Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread Dan Williams
On Wed, Jan 22, 2020 at 11:09 AM Michal Hocko wrote:
>
> On Wed 22-01-20 19:46:15, David Hildenbrand wrote:
> > On 22.01.20 19:38, Michal Hocko wrote:
> [...]
> > > How exactly is check + offline more optimal than offline which makes
> > > check as its first step? I will get to your later points after this is
> > > clarified.
> >
> > Scanning (almost) lockless is more efficient than bouncing back and
> > forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock
> > and zone locks - as far as I understand.
>
> All but the zone lock shouldn't really be contended and as such
> shouldn't cause any trouble. zone->lock really depends on the page
> allocator usage, of course. But as soon as there is contention, it is
> just more likely that the result is less reliable.
>
> I would also be really curious about how much actual time could be saved
> by this - some real numbers - because hotplug operations shouldn't
> happen so often that this would stand out. At least that is my
> understanding.
>
> > And as far as I understood, that was the whole reason of the original
> > commit.
>
> Well, I have my doubts but it might be just me and I might be wrong. My
> experience from a large part of the memory hotplug functionality is that
> it was driven by good intentions but without the due diligence to think
> beyond the most obvious usecase. Having a removable flag on the memblock
> sounds like a neat idea, of course. But an inherently racy flag is just
> borderline useful.
>
> Anyway, I will stop at this moment and wait for real usecases.

...that and practical numbers showing that optimizing an interface
that can at best give rough estimate answers is worth the code change.


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread Michal Hocko
On Wed 22-01-20 19:46:15, David Hildenbrand wrote:
> On 22.01.20 19:38, Michal Hocko wrote:
[...]
> > How exactly is check + offline more optimal than offline which makes
> > check as its first step? I will get to your later points after this is
> > clarified.
> 
> Scanning (almost) lockless is more efficient than bouncing back and
> forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock
> and zone locks - as far as I understand.

All but the zone lock shouldn't really be contended and as such
shouldn't cause any trouble. zone->lock really depends on the page
allocator usage, of course. But as soon as there is contention, it is
just more likely that the result is less reliable.

I would also be really curious about how much actual time could be saved
by this - some real numbers - because hotplug operations shouldn't
happen so often that this would stand out. At least that is my
understanding.

> And as far as I understood, that was the whole reason of the original
> commit.

Well, I have my doubts but it might be just me and I might be wrong. My
experience from a large part of the memory hotplug functionality is that
it was driven by good intentions but without the due diligence to think
beyond the most obvious usecase. Having a removable flag on the memblock
sounds like a neat idea, of course. But an inherently racy flag is just
borderline useful.

Anyway, I will stop at this moment and wait for real usecases.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread Michal Hocko
On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
[...]
> c) CC relevant people I identify (lsmem/chmem/powerpc-utils/etc.) on the
> patch to see if we are missing other use cases/users/implications.
> 
> Sounds like a plan?

I would start with this step. Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread David Hildenbrand
On 22.01.20 19:38, Michal Hocko wrote:
> On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
>> On 22.01.20 17:46, Michal Hocko wrote:
>>> On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
> [...]
 Especially interesting for IBM z Systems, whereby memory
 onlining/offlining will trigger the actual population of memory in the
 hypervisor. So if an admin wants to offline some memory (to give it back
 to the hypervisor), it would use lsmem to identify such blocks first,
 instead of trying random blocks until one offlining request succeeds.
>>>
>>> I am sorry for being dense here but I still do not understand why s390
>>
>> It's good that we talk about it :) It's hard to reconstruct actual use
>> cases from tools and some documentation only ...
>>
>> Side note (just FYI): One difference on s390x compared to other
>> architectures (AFAIKS) is that once memory is offline, you might not be
>> allowed (by the hypervisor) to online it again - because it was
>> effectively unplugged. Such memory is not removed via remove_memory(),
>> it's simply kept offline.
> 
> I have a very vague understanding of s390 specialities but this is not
> really relevant to the discussion AFAICS because this happens _after_
> offlining.

Jep, that's why I flagged it as a side note.

>  
>>> and the way it does the hotremove matters here. After all, there are
>>> no arch-specific operations done until the memory is offlined. Also,
>>> randomly checking memory blocks and then hoping that the offline will
>>> succeed is not much different from just trying to offline the
>>> block. Both have to crawl through the pfn range and bail out on the
>>> unmovable memory.
>>
>> I think in general we have two approaches to memory unplugging.
>>
>> 1. Know explicitly what you want to unplug (e.g., a DIMM spanning
>> multiple memory blocks).
>>
>> 2. Find random memory blocks you can offline/unplug.
>>
>>
>> For 1, I think we both agree that we don't need this. Just try to
>> offline and you know if it worked.
>>
>> Now of course, for 2 you can try random blocks until you succeed. From
>> a sysadmin point of view that's very inefficient. From a powerpc-utils
>> point of view, that's inefficient.
> 
> How exactly is check + offline more optimal than offline which makes
> check as its first step? I will get to your later points after this is
> clarified.

Scanning (almost) lockless is more efficient than bouncing back and
forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock
and zone locks - as far as I understand. And as far as I understood,
that was the whole reason of the original commit.

Anyhow, you should have read until the end of my mail to find what you
were looking for :)

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread Michal Hocko
On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
> On 22.01.20 17:46, Michal Hocko wrote:
> > On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
[...]
> >> Especially interesting for IBM z Systems, whereby memory
> >> onlining/offlining will trigger the actual population of memory in the
> >> hypervisor. So if an admin wants to offline some memory (to give it back
> >> to the hypervisor), it would use lsmem to identify such blocks first,
> >> instead of trying random blocks until one offlining request succeeds.
> > 
> > I am sorry for being dense here but I still do not understand why s390
> 
> It's good that we talk about it :) It's hard to reconstruct actual use
> cases from tools and some documentation only ...
> 
> Side note (just FYI): One difference on s390x compared to other
> architectures (AFAIKS) is that once memory is offline, you might not be
> allowed (by the hypervisor) to online it again - because it was
> effectively unplugged. Such memory is not removed via remove_memory(),
> it's simply kept offline.

I have a very vague understanding of s390 specialities but this is not
really relevant to the discussion AFAICS because this happens _after_
offlining.
 
> > and the way it does the hotremove matters here. After all, there are
> > no arch-specific operations done until the memory is offlined. Also,
> > randomly checking memory blocks and then hoping that the offline will
> > succeed is not much different from just trying to offline the
> > block. Both have to crawl through the pfn range and bail out on the
> > unmovable memory.
> 
> I think in general we have two approaches to memory unplugging.
> 
> 1. Know explicitly what you want to unplug (e.g., a DIMM spanning
> multiple memory blocks).
> 
> 2. Find random memory blocks you can offline/unplug.
> 
> 
> For 1, I think we both agree that we don't need this. Just try to
> offline and you know if it worked.
> 
> Now of course, for 2 you can try random blocks until you succeed. From
> a sysadmin point of view that's very inefficient. From a powerpc-utils
> point of view, that's inefficient.

How exactly is check + offline more optimal than offline which makes
check as its first step? I will get to your later points after this is
clarified.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread David Hildenbrand
On 22.01.20 17:46, Michal Hocko wrote:
> On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
>> On 22.01.20 11:54, David Hildenbrand wrote:
>>> On 22.01.20 11:42, Michal Hocko wrote:
 On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
 Really, the interface is flawed and should have never been merged in the
 first place. We cannot simply remove it altogether I am afraid so let's
 at least remove the bogus code and pretend that the world is a better
 place where everything is removable except the reality sucks...
>>>
>>> As I expressed already, the interface works as designed/documented and
>>> has been used like that for years.
>>
>> It seems we do differ in the usefulness though. Using a crappy interface
>> for years doesn't make it less crappy. I do realize we cannot remove the
>> interface but we can remove issues with the implementation and I dare to
>> say that most existing users wouldn't really notice.
>
> Well, at least powerpc-utils (why this interface was introduced) will
> notice a) performance wise and b) because more logging output will be
> generated (obviously, offlining will be attempted on non-offlineable blocks).

 I would really appreciate some specific example for a real usecase. I am
 not familiar with powerpc-utils workflows myself.

>>>
>>> Not an expert myself:
>>>
>>> https://github.com/ibm-power-utilities/powerpc-utils
>>>
>>> -> src/drmgr/drslot_chrp_mem.c
>>>
>>> On request to remove some memory it will
>>>
>>> a) Read "->removable" of all memory blocks ("lmb")
>>> b) Check if the request can be fulfilled using the removable blocks
>>> c) Try to offline the memory blocks; if that succeeded, trigger
>>> removal of them using some hypervisor hooks.
>>>
>>> Interestingly, with "AMS ballooning", it will already consider the
>>> "removable" information useless (most probably, because of
>>> non-migratable balloon pages that can be offlined - I assume the powerpc
>>> code that I converted to proper balloon compaction just recently). a)
>>> and b) are skipped.
>>>
>>> Returning "yes" on all blocks will make them handle it just like if "AMS
>>> ballooning" is active. So any memory block will be tried. Should work
>>> but will be slower if no ballooning is active.
>>>
>>
>> On lsmem:
>>
>> https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html
>>
>> "
>> Removable
>> yes if the memory range can be set offline, no if it cannot be set
>> offline. A dash (-) means that the range is already offline. The kernel
>> method that identifies removable memory ranges is heuristic and not
>> exact. Occasionally, memory ranges are falsely reported as removable or
>> falsely reported as not removable.
>> "
>>
>> Usage of lsmem paired with chmem:
>>
>> https://access.redhat.com/solutions/3937181
>>
>>
>> Especially interesting for IBM z Systems, whereby memory
>> onlining/offlining will trigger the actual population of memory in the
>> hypervisor. So if an admin wants to offline some memory (to give it back
>> to the hypervisor), it would use lsmem to identify such blocks first,
>> instead of trying random blocks until one offlining request succeeds.
> 
> I am sorry for being dense here but I still do not understand why s390

It's good that we talk about it :) It's hard to reconstruct actual use
cases from tools and some documentation only ...

Side note (just FYI): One difference on s390x compared to other
architectures (AFAIKS) is that once memory is offline, you might not be
allowed (by the hypervisor) to online it again - because it was
effectively unplugged. Such memory is not removed via remove_memory(),
it's simply kept offline.


> and the way it does the hotremove matters here. After all, there are
> no arch-specific operations done until the memory is offlined. Also,
> randomly checking memory blocks and then hoping that the offline will
> succeed is not much different from just trying to offline the
> block. Both have to crawl through the pfn range and bail out on the
> unmovable memory.

I think in general we have two approaches to memory unplugging.

1. Know explicitly what you want to unplug (e.g., a DIMM spanning
multiple memory blocks).

2. Find random memory blocks you can offline/unplug.


For 1, I think we both agree that we don't need this. Just try to
offline and you know if it worked.

Now of course, for 2 you can try random blocks until you succeed. From
a sysadmin point of view that's very inefficient. From a powerpc-utils
point of view, that's inefficient.

I learned just now, "chmem"[1] has a mode where you can specify a "size"
and not only a range. So a sysadmin can still control onlining/offlining
for this use case with a few commands. In other tools (e.g.,
powerpc-utils), well, you have to try to offline random memory blocks
(just like chmem does).


AFAIK, once we turn /sys/.../removable useless, I 

Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread Michal Hocko
On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
> On 22.01.20 11:54, David Hildenbrand wrote:
> > On 22.01.20 11:42, Michal Hocko wrote:
> >> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
> >> Really, the interface is flawed and should have never been merged in the
> >> first place. We cannot simply remove it altogether I am afraid so let's
> >> at least remove the bogus code and pretend that the world is a better
> >> place where everything is removable except the reality sucks...
> >
> > As I expressed already, the interface works as designed/documented and
> > has been used like that for years.
> 
>  It seems we do differ in the usefulness though. Using a crappy interface
>  for years doesn't make it less crappy. I do realize we cannot remove the
>  interface but we can remove issues with the implementation and I dare to
>  say that most existing users wouldn't really notice.
> >>>
> >>> Well, at least powerpc-utils (why this interface was introduced) will
> >>> notice a) performance wise and b) because more logging output will be
> >>> generated (obviously, offlining will be attempted on non-offlineable blocks).
> >>
> >> I would really appreciate some specific example for a real usecase. I am
> >> not familiar with powerpc-utils workflows myself.
> >>
> > 
> > Not an expert myself:
> > 
> > https://github.com/ibm-power-utilities/powerpc-utils
> > 
> > -> src/drmgr/drslot_chrp_mem.c
> > 
> > On request to remove some memory it will
> > 
> > a) Read "->removable" of all memory blocks ("lmb")
> > b) Check if the request can be fulfilled using the removable blocks
> > c) Try to offline the memory blocks; if that succeeded, trigger
> > removal of them using some hypervisor hooks.
> > 
> > Interestingly, with "AMS ballooning", it will already consider the
> > "removable" information useless (most probably, because of
> > non-migratable balloon pages that can be offlined - I assume the powerpc
> > code that I converted to proper balloon compaction just recently). a)
> > and b) are skipped.
> > 
> > Returning "yes" on all blocks will make them handle it just like if "AMS
> > ballooning" is active. So any memory block will be tried. Should work
> > but will be slower if no ballooning is active.
> > 
> 
> On lsmem:
> 
> https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html
> 
> "
> Removable
> yes if the memory range can be set offline, no if it cannot be set
> offline. A dash (-) means that the range is already offline. The kernel
> method that identifies removable memory ranges is heuristic and not
> exact. Occasionally, memory ranges are falsely reported as removable or
> falsely reported as not removable.
> "
> 
> Usage of lsmem paired with chmem:
> 
> https://access.redhat.com/solutions/3937181
> 
> 
> Especially interesting for IBM z Systems, whereby memory
> onlining/offlining will trigger the actual population of memory in the
> hypervisor. So if an admin wants to offline some memory (to give it back
> to the hypervisor), it would use lsmem to identify such blocks first,
> instead of trying random blocks until one offlining request succeeds.

I am sorry for being dense here but I still do not understand why s390
and the way it does the hotremove matters here. After all, there are
no arch-specific operations done until the memory is offlined. Also,
randomly checking memory blocks and then hoping that the offline will
succeed is not much different from just trying to offline the
block. Both have to crawl through the pfn range and bail out on the
unmovable memory.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread David Hildenbrand
On 22.01.20 11:54, David Hildenbrand wrote:
> On 22.01.20 11:42, Michal Hocko wrote:
>> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
>> Really, the interface is flawed and should have never been merged in the
>> first place. We cannot simply remove it altogether I am afraid so let's
>> at least remove the bogus code and pretend that the world is a better
>> place where everything is removable except the reality sucks...
>
> As I expressed already, the interface works as designed/documented and
> has been used like that for years.

 It seems we do differ in the usefulness though. Using a crappy interface
 for years doesn't make it less crappy. I do realize we cannot remove the
 interface but we can remove issues with the implementation and I dare to
 say that most existing users wouldn't really notice.
>>>
>>> Well, at least powerpc-utils (why this interface was introduced) will
>>> notice a) performance wise and b) because more logging output will be
>>> generated (obviously, offlining will be attempted on non-offlineable blocks).
>>
>> I would really appreciate some specific example for a real usecase. I am
>> not familiar with powerpc-utils workflows myself.
>>
> 
> Not an expert myself:
> 
> https://github.com/ibm-power-utilities/powerpc-utils
> 
> -> src/drmgr/drslot_chrp_mem.c
> 
> On request to remove some memory it will
> 
> a) Read "->removable" of all memory blocks ("lmb")
> b) Check if the request can be fulfilled using the removable blocks
> c) Try to offline the memory blocks; if that succeeded, trigger
> removal of them using some hypervisor hooks.
> 
> Interestingly, with "AMS ballooning", it will already consider the
> "removable" information useless (most probably, because of
> non-migratable balloon pages that can be offlined - I assume the powerpc
> code that I converted to proper balloon compaction just recently). a)
> and b) are skipped.
> 
> Returning "yes" on all blocks will make them handle it just like if "AMS
> ballooning" is active. So any memory block will be tried. Should work
> but will be slower if no ballooning is active.
> 

On lsmem:

https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html

"
Removable
yes if the memory range can be set offline, no if it cannot be set
offline. A dash (-) means that the range is already offline. The kernel
method that identifies removable memory ranges is heuristic and not
exact. Occasionally, memory ranges are falsely reported as removable or
falsely reported as not removable.
"

Usage of lsmem paired with chmem:

https://access.redhat.com/solutions/3937181


Especially interesting for IBM z Systems, whereby memory
onlining/offlining will trigger the actual population of memory in the
hypervisor. So if an admin wants to offline some memory (to give it back
to the hypervisor), it would use lsmem to identify such blocks first,
instead of trying random blocks until one offlining request succeeds.

E.g., documented in

https://books.google.de/books?id=1UEhDQAAQBAJ&pg=PA117&dq=lsmem+removable


So I still think that the interface is useful and is getting used in
real life. Users tolerate false positives/negatives.

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread David Hildenbrand
On 22.01.20 11:42, Michal Hocko wrote:
> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
> Really, the interface is flawed and should have never been merged in the
> first place. We cannot simply remove it altogether I am afraid so let's
> at least remove the bogus code and pretend that the world is a better
> place where everything is removable except the reality sucks...

 As I expressed already, the interface works as designed/documented and
 has been used like that for years.
>>>
>>> It seems we do differ in the usefulness though. Using a crappy interface
>>> for years doesn't make it less crappy. I do realize we cannot remove the
>>> interface but we can remove issues with the implementation and I dare to
>>> say that most existing users wouldn't really notice.
>>
>> Well, at least powerpc-utils (why this interface was introduced) will
>> notice a) performance wise and b) because more logging output will be
>> generated (obviously, offlining will be attempted on non-offlineable blocks).
> 
> I would really appreciate some specific example for a real usecase. I am
> not familiar with powerpc-utils workflows myself.
> 

Not an expert myself:

https://github.com/ibm-power-utilities/powerpc-utils

-> src/drmgr/drslot_chrp_mem.c

On request to remove some memory it will

a) Read "->removable" of all memory blocks ("lmb")
b) Check if the request can be fulfilled using the removable blocks
c) Try to offline the memory blocks; if that succeeded, trigger
removal of them using some hypervisor hooks.

Interestingly, with "AMS ballooning", it will already consider the
"removable" information useless (most probably, because of
non-migratable balloon pages that can be offlined - I assume the powerpc
code that I converted to proper balloon compaction just recently). a)
and b) are skipped.

Returning "yes" on all blocks will make them handle it just like if "AMS
ballooning" is active. So any memory block will be tried. Should work
but will be slower if no ballooning is active.
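
A minimal userspace sketch of that scan (an illustration only, not the
actual drmgr code; the sysfs paths are the standard ones, everything
else - including step b), the request sizing - is simplified away):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define MEMDIR "/sys/devices/system/memory"

int main(void)
{
	char path[512], line[16];
	struct dirent *de;
	DIR *dir = opendir(MEMDIR);
	FILE *f;

	if (!dir)
		return 1;
	while ((de = readdir(dir))) {
		if (strncmp(de->d_name, "memory", 6) != 0)
			continue;
		/* step a): read the (racy) "removable" hint */
		snprintf(path, sizeof(path), MEMDIR "/%s/removable",
			 de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (!fgets(line, sizeof(line), f) || line[0] != '1') {
			fclose(f);
			continue;
		}
		fclose(f);
		/* step c): the offline attempt is the authoritative check */
		snprintf(path, sizeof(path), MEMDIR "/%s/state", de->d_name);
		f = fopen(path, "w");
		if (f) {
			fputs("offline", f);	/* may still fail */
			fclose(f);
		}
	}
	closedir(dir);
	return 0;
}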

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread Michal Hocko
On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
> >>> Really, the interface is flawed and should have never been merged in the
> >>> first place. We cannot simply remove it altogether I am afraid so let's
> >>> at least remove the bogus code and pretend that the world is a better
> >>> place where everything is removable except the reality sucks...
> >>
> >> As I expressed already, the interface works as designed/documented and
> >> has been used like that for years.
> > 
> > It seems we do differ in the usefulness though. Using a crappy interface
> > for years doesn't make it less crappy. I do realize we cannot remove the
> > interface but we can remove issues with the implementation and I dare to
> > say that most existing users wouldn't really notice.
> 
> Well, at least powerpc-utils (why this interface was introduced) will
> notice a) performance wise and b) because more logging output will be
> generated (obviously, offlining will be attempted on non-offlineable blocks).

I would really appreciate some specific example for a real usecase. I am
not familiar with powerpc-utils workflows myself.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-22 Thread David Hildenbrand
>>> Really, the interface is flawed and should have never been merged in the
>>> first place. We cannot simply remove it altogether I am afraid so let's
>>> at least remove the bogus code and pretend that the world is a better
>>> place where everything is removable except the reality sucks...
>>
>> As I expressed already, the interface works as designed/documented and
>> has been used like that for years.
> 
> It seems we do differ in the usefulness though. Using a crappy interface
> for years doesn't make it less crappy. I do realize we cannot remove the
> interface but we can remove issues with the implementation and I dare to
> say that most existing users wouldn't really notice.

Well, at least powerpc-utils (why this interface was introduced) will
notice a) performance wise and b) because more logging output will be
generated (obviously, offlining will be attempted on non-offlineable blocks).

However, it should not break, because we could have had races/false
positives before.

> 
>> I tend to agree that it never should have been merged like that.
>>
>> We have (at least) two places that are racy (with concurrent memory
>> hotplug):
>>
>> 1. /sys/.../memoryX/removable
>> - a) make it always return yes and make the interface useless
>> - b) add proper locking and keep it running as is (e.g., so David can
>>  identify offlineable memory blocks :) ).
>>
>> 2. /sys/.../memoryX/valid_zones
>> - a) always return "none" if the memory is online
>> - b) add proper locking and keep it running as is
>> - c) cache the result ("zone") when a block is onlined (e.g., in
>> mem->zone. If it is NULL, either mixed zones or unknown)
>>
>> At least 2. already screams for proper device_lock() locking as the
>> mem->state is not stable across the function call.
>>
>> 1a and 2a are the easiest solutions but remove all ways to identify if a
>> memory block could theoretically be offlined - without trying
>> (especially, also to identify the MOVABLE zone).
>>
>> I tend to prefer 1b) and 2c), paired with proper device_lock() locking.
>> We don't affect existing use cases but are able to simplify the code +
>> fix the races.
>>
>> What's your opinion? Any alternatives?
> 
> 1a) and 2c) if you ask me.
> 

I'll look into all of that, it just might take a little while (busy with
a lot of stuff). But after all, it does not seem to be urgent.

1a) will be easy, I'll post a patch soon that we can let rest in -next
for a bit to see if people start to scream out loud.

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-21 Thread Michal Hocko
On Mon 20-01-20 10:14:44, David Hildenbrand wrote:
> On 20.01.20 08:48, Michal Hocko wrote:
> > On Fri 17-01-20 08:57:51, Dan Williams wrote:
> > [...]
> >> Unless the user is willing to hold the device_hotplug_lock over the
> >> evaluation, the result is unreliable.
> > 
> > Do we want to hold the device_hotplug_lock from this user-readable file
> > in the first place? My book says that this is just waiting to become a
> > problem.
> 
> It was the "big hammer" solution for this RFC.
> 
> I think we could do with a try_lock() on the device_lock() paired with a
> device->removed flag. The latter is helpful for properly catching zombie
> devices on the onlining/offlining path either way (and on my todo list).

try_lock would be more considerate. It would at least make any potential
hammering a bit harder.

> > Really, the interface is flawed and should have never been merged in the
> > first place. We cannot simply remove it altogether I am afraid so let's
> > at least remove the bogus code and pretend that the world is a better
> > place where everything is removable except the reality sucks...
> 
> As I expressed already, the interface works as designed/documented and
> has been used like that for years.

It seems we do differ in the usefulness though. Using a crappy interface
for years doesn't make it less crappy. I do realize we cannot remove the
interface but we can remove issues with the implementation and I dare to
say that most existing users wouldn't really notice.

> I tend to agree that it never should have been merged like that.
> 
> We have (at least) two places that are racy (with concurrent memory
> hotplug):
> 
> 1. /sys/.../memoryX/removable
> - a) make it always return yes and make the interface useless
> - b) add proper locking and keep it running as is (e.g., so David can
>  identify offlineable memory blocks :) ).
> 
> 2. /sys/.../memoryX/valid_zones
> - a) always return "none" if the memory is online
> - b) add proper locking and keep it running as is
> - c) cache the result ("zone") when a block is onlined (e.g., in
> mem->zone. If it is NULL, either mixed zones or unknown)
> 
> At least 2. already screams for proper device_lock() locking as the
> mem->state is not stable across the function call.
> 
> 1a and 2a are the easiest solutions but remove all ways to identify if a
> memory block could theoretically be offlined - without trying
> (especially, also to identify the MOVABLE zone).
> 
> I tend to prefer 1b) and 2c), paired with proper device_lock() locking.
> We don't affect existing use cases but are able to simplify the code +
> fix the races.
> 
> What's your opinion? Any alternatives?

1a) and 2c) if you ask me.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-20 Thread David Hildenbrand
On 20.01.20 10:14, David Hildenbrand wrote:
> On 20.01.20 08:48, Michal Hocko wrote:
>> On Fri 17-01-20 08:57:51, Dan Williams wrote:
>> [...]
>>> Unless the user is willing to hold the device_hotplug_lock over the
>>> evaluation, the result is unreliable.
>>
>> Do we want to hold the device_hotplug_lock from this user-readable file
>> in the first place? My book says that this is just waiting to become a
>> problem.
> 
> It was the "big hammer" solution for this RFC.
> 
> I think we could do with a try_lock() on the device_lock() paired with a
> device->removed flag. The latter is helpful for properly catching zombie
> devices on the onlining/offlining path either way (and on my todo list).

We do have dev->p->dead which could come in handy.
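
For illustration, a sketch of how removable_show() could combine the
try-lock with that flag (my sketch only, not a posted patch; the
mem_block_offlineable() helper is hypothetical):

static ssize_t removable_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	struct memory_block *mem = to_memory_block(dev);
	int ret = 0;

	if (!device_trylock(dev))
		return -EBUSY;	/* concurrent hotplug, let userspace retry */
	/* memory.c lives in drivers/base, so it can look at dev->p */
	if (!dev->p->dead && mem->state == MEM_ONLINE)
		/* e.g., is_mem_section_offlineable() on each section */
		ret = mem_block_offlineable(mem);
	device_unlock(dev);

	return sprintf(buf, "%d\n", ret);
}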

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-20 Thread David Hildenbrand
On 20.01.20 08:48, Michal Hocko wrote:
> On Fri 17-01-20 08:57:51, Dan Williams wrote:
> [...]
>> Unless the user is willing to hold the device_hotplug_lock over the
>> evaluation, the result is unreliable.
> 
> Do we want to hold the device_hotplug_lock from this user-readable file
> in the first place? My book says that this is just waiting to become a
> problem.

It was the "big hammer" solution for this RFC.

I think we could do with a try_lock() on the device_lock() paired with a
device->removed flag. The latter is helpful for properly catching zombie
devices on the onlining/offlining path either way (and on my todo list).

> 
> Really, the interface is flawed and should have never been merged in the
> first place. We cannot simply remove it altogether I am afraid so let's
> at least remove the bogus code and pretend that the world is a better
> place where everything is removable except the reality sucks...

As I expressed already, the interface works as designed/documented and
has been used like that for years. I tend to agree that it never should
have been merged like that.

We have (at least) two places that are racy (with concurrent memory
hotplug):

1. /sys/.../memoryX/removable
- a) make it always return yes and make the interface useless
- b) add proper locking and keep it running as is (e.g., so David can
 identify offlineable memory blocks :) ).

2. /sys/.../memoryX/valid_zones
- a) always return "none" if the memory is online
- b) add proper locking and keep it running as is
- c) cache the result ("zone") when a block is onlined (e.g., in
mem->zone. If it is NULL, either mixed zones or unknown)
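
As a sketch of what 2c could look like (the cached field and the
split-out helper for the offline case are assumptions, not existing
code):

struct memory_block {
	/* ... existing fields ... */
	struct zone *zone;	/* set while online; NULL if mixed/unknown */
};

static ssize_t valid_zones_show(struct device *dev,
				struct device_attribute *attr, char *buf)
{
	struct memory_block *mem = to_memory_block(dev);
	struct zone *zone;

	device_lock(dev);
	if (mem->state == MEM_ONLINE) {
		/* no pfn walk and no race with offlining, just the cache */
		zone = mem->zone;
		device_unlock(dev);
		return sprintf(buf, "%s\n", zone ? zone->name : "none");
	}
	device_unlock(dev);
	/* offline blocks: keep reporting the candidate zones as today */
	return valid_zones_show_offline(dev, buf);	/* hypothetical */
}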

At least 2. already screams for proper device_lock() locking as the
mem->state is not stable across the function call.

1a and 2a are the easiest solutions but remove all ways to identify if a
memory block could theoretically be offlined - without trying
(especially, also to identify the MOVABLE zone).

I tend to prefer 1b) and 2c), paired with proper device_lock() locking.
We don't affect existing use cases but are able to simplify the code +
fix the races.

What's your opinion? Any alternatives?

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-19 Thread Michal Hocko
On Fri 17-01-20 08:57:51, Dan Williams wrote:
[...]
> Unless the user is willing to hold the device_hotplug_lock over the
> evaluation, the result is unreliable.

Do we want to hold the device_hotplug_lock from this user-readable file
in the first place? My book says that this is just waiting to become a
problem.

Really, the interface is flawed and should have never been merged in the
first place. We cannot simply remove it altogether I am afraid so let's
at least remove the bogus code and pretend that the world is a better
place where everything is removable except the reality sucks...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-17 Thread Dan Williams
On Fri, Jan 17, 2020 at 8:11 AM David Hildenbrand wrote:
>
> On 17.01.20 16:54, Dan Williams wrote:
> > On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko wrote:
> >>
> >> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> >>> On 17.01.20 15:52, Michal Hocko wrote:
>  On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> > On 17.01.20 12:33, Michal Hocko wrote:
> >> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >>> Let's refactor that code. We want to check if we can offline memory
> >>> blocks. Add a new function is_mem_section_offlineable() for that and
> >>> make it call is_mem_section_offlineable() for each contained section.
> >>> Within is_mem_section_offlineable(), add some more sanity checks and
> >>> directly bail out if the section contains holes or if it spans
> >>> multiple zones.
> >>
> >> I didn't read the patch (yet) but I am wondering. If we want to touch
> >> this code, can we simply always return true there? I mean whoever
> >> depends on this check is racy and the failure can happen even after
> >> the sysfs says good to go, right? The check is essentially as expensive
> >> as calling the offlining code itself. So the only usecase I can think of
> >> is a dumb driver to crawl over blocks and check which is removable and
> >> try to hotremove it. But just trying to offline one block after another
> >> is essentially going to achieve the same.
> >
> > Some thoughts:
> >
> > 1. It allows you to check if memory is likely to be offlineable without
> > doing expensive locking and trying to isolate pages (meaning:
> > zone->lock, mem_hotplug_lock, but also calling drain_all_pages()
> > when isolating)
> >
> > 2. There are use cases that want to identify a memory block/DIMM to
> > unplug. One example is PPC DLPAR code (see this patch). Going over all
> > memory blocks trying to offline them is an expensive operation.
> >
> > 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> > makes use of /sys/.../removable to speed up the search AFAIK.
> 
>  Well, while I do see those points I am not really sure they are worth
>  having a broken (by-definition) interface.
> >>>
> >>> It's a pure speedup. And for that, the interface has been working
> >>> perfectly fine for years?
> >>>
> 
> > 4. lsmem displays/groups by "removable".
> 
>  Is anybody really using that?
> >>>
> >>> Well at least I am using that when testing to identify which
> >>> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> >>> all the zone shrinking stuff I have been fixing)
> >>>
> >>> So there is at least one user ;)
> >>
> >> Fair enough. But I would argue that there are better ways to do the same
> >> solely for testing purposes. Rather than having subtly broken code to
> >> maintain.
> >>
> 
> >> Or does anybody see any reasonable usecase that would break if we did
> >> that unconditional behavior?
> >
> > If we always returned "true", then the whole reason the
> > interface was originally introduced would be "broken" (meaning, less
> > performant as you would try to offline any memory block).
> 
>  I would argue that the whole interface is broken ;). Not the first time
>  in the kernel development history and not the last time either. What I
>  am trying to say here is that unless there are _real_ usecases depending
>  on knowing that something surely is _not_ offlineable, I would just
>  try to drop the functionality while preserving the interface and see
>  what happens.
> >>>
> >>> I can see that, but I can perfectly well understand why - especially
> >>> powerpc - wants a fast way to sense which blocks it actually makes
> >>> sense to try to offline.
> >>>
> >>> The original patch correctly states
> >>>"which sections of
> >>> memory are likely to be removable before attempting the potentially
> >>> expensive operation."
> >>>
> >>> It works as designed I would say.
> >>
> >> Then I would just keep it crippled the same way it has been for years
> >> without anybody noticing.
> >
> > I tend to agree. At least the kmem driver that wants to unplug memory
> > could not use an interface that does not give stable answers. It just
> > relies on remove_memory() to return a definitive error.
> >
>
> Just because kmem cannot reuse such an interface doesn't mean we should
> not touch it (or I am not getting your point). Especially, this
> interface is about "can it be likely be offlined and then eventually be
> removed (if there is a HW interface for that)" (as documented), not
> about "will remove_memory()" work.

Unless the user is willing to hold the device_hotplug_lock over the
evaluation, the result is unreliable. For example, the changes to
removable_show() are no better than they were previously because the
result is invalidated as soon as 

Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-17 Thread David Hildenbrand
On 17.01.20 16:54, Dan Williams wrote:
On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko wrote:
>>
>> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
>>> On 17.01.20 15:52, Michal Hocko wrote:
 On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> On 17.01.20 12:33, Michal Hocko wrote:
>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>>> Let's refactor that code. We want to check if we can offline memory
>>> blocks. Add a new function is_mem_section_offlineable() for that and
>>> make it call is_mem_section_offlineable() for each contained section.
>>> Within is_mem_section_offlineable(), add some more sanity checks and
>>> directly bail out if the section contains holes or if it spans multiple
>>> zones.
>>
>> I didn't read the patch (yet) but I am wondering. If we want to touch
>> this code, can we simply always return true there? I mean whoever
>> depends on this check is racy and the failure can happen even after
>> the sysfs says good to go, right? The check is essentially as expensive
>> as calling the offlining code itself. So the only usecase I can think of
>> is a dumb driver to crawl over blocks and check which is removable and
>> try to hotremove it. But just trying to offline one block after another
>> is essentially going to achieve the same.
>
> Some thoughts:
>
> 1. It allows you to check if memory is likely to be offlineable without
> doing expensive locking and trying to isolate pages (meaning:
> zone->lock, mem_hotplug_lock, but also calling drain_all_pages()
> when isolating)
>
> 2. There are use cases that want to identify a memory block/DIMM to
> unplug. One example is PPC DLPAR code (see this patch). Going over all
> memory blocks trying to offline them is an expensive operation.
>
> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> makes use of /sys/.../removable to speed up the search AFAIK.

 Well, while I do see those points I am not really sure they are worth
 having a broken (by-definition) interface.
>>>
>>> It's a pure speedup. And for that, the interface has been working
>>> perfectly fine for years?
>>>

> 4. lsmem displays/groups by "removable".

 Is anybody really using that?
>>>
>>> Well at least I am using that when testing to identify which
>>> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
>>> all the zone shrinking stuff I have been fixing)
>>>
>>> So there is at least one user ;)
>>
>> Fair enough. But I would argue that there are better ways to do the same
>> solely for testing purposes. Rather than having subtly broken code to
>> maintain.
>>

>> Or does anybody see any reasonable usecase that would break if we did
>> that unconditional behavior?
>
> If we always returned "true", then the whole reason the
> interface was originally introduced would be "broken" (meaning, less
> performant as you would try to offline any memory block).

 I would argue that the whole interface is broken ;). Not the first time
 in the kernel development history and not the last time either. What I
 am trying to say here is that unless there are _real_ usecases depending
 on knowing that something surely is _not_ offlineable, I would just
 try to drop the functionality while preserving the interface and see
 what happens.
>>>
>>> I can see that, but I can perfectly well understand why - especially
>>> powerpc - wants a fast way to sense which blocks it actually makes
>>> sense to try to offline.
>>>
>>> The original patch correctly states
>>>"which sections of
>>> memory are likely to be removable before attempting the potentially
>>> expensive operation."
>>>
>>> It works as designed I would say.
>>
>> Then I would just keep it crippled the same way it has been for years
>> without anybody noticing.
> 
> I tend to agree. At least the kmem driver that wants to unplug memory
> could not use an interface that does not give stable answers. It just
> relies on remove_memory() to return a definitive error.
> 

Just because kmem cannot reuse such an interface doesn't mean we should
not touch it (or I am not getting your point). Especially, this
interface is about "can it likely be offlined and then eventually be
removed (if there is a HW interface for that)" (as documented), not
about "will remove_memory() work".

We do have users and if we agree to keep it (which I think we should, as
I expressed) then I think we should un-cripple and fix it. After all, we
have to maintain it. The current interface provides what was documented
- "likely to be offlineable". (The chosen name was just horribly bad -
as I expressed a while ago already :) )

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-17 Thread Dan Williams
On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko wrote:
>
> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> > On 17.01.20 15:52, Michal Hocko wrote:
> > > On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> > >> On 17.01.20 12:33, Michal Hocko wrote:
> > >>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >  Let's refactor that code. We want to check if we can offline memory
> >  blocks. Add a new function is_mem_section_offlineable() for that and
> >  make it call is_mem_section_offlineable() for each contained section.
> >  Within is_mem_section_offlineable(), add some more sanity checks and
> >  directly bail out if the section contains holes or if it spans multiple
> >  zones.
> > >>>
> > >>> I didn't read the patch (yet) but I am wondering. If we want to touch
> > >>> this code, can we simply always return true there? I mean whoever
> > >>> depends on this check is racy and the failure can happen even after
> > >>> the sysfs says good to go, right? The check is essentially as expensive
> > >>> as calling the offlining code itself. So the only usecase I can think of
> > >>> is a dumb driver to crawl over blocks and check which is removable and
> > >>> try to hotremove it. But just trying to offline one block after another
> > >>> is essentially going to achieve the same.
> > >>
> > >> Some thoughts:
> > >>
> > >> 1. It allows you to check if memory is likely to be offlineable without
> > >> doing expensive locking and trying to isolate pages (meaning:
> > >> zone->lock, mem_hotplug_lock, but also calling drain_all_pages()
> > >> when isolating)
> > >>
> > >> 2. There are use cases that want to identify a memory block/DIMM to
> > >> unplug. One example is PPC DLPAR code (see this patch). Going over all
> > >> memory blocks trying to offline them is an expensive operation.
> > >>
> > >> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> > >> makes use of /sys/.../removable to speed up the search AFAIK.
> > >
> > > Well, while I do see those points I am not really sure they are worth
> > > having a broken (by-definition) interface.
> >
> > It's a pure speedup. And for that, the interface has been working
> > perfectly fine for years?
> >
> > >
> > >> 4. lsmem displays/groups by "removable".
> > >
> > > Is anybody really using that?
> >
> > Well at least I am using that when testing to identify which
> > (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> > all the zone shrinking stuff I have been fixing)
> >
> > So there is at least one user ;)
>
> Fair enough. But I would argue that there are better ways to do the same
> solely for testing purposes. Rather than having subtly broken code to
> maintain.
>
> > >
> > >>> Or does anybody see any reasonable usecase that would break if we did
> > >>> that unconditional behavior?
> > >>
> > >> If we always returned "true", then the whole reason the
> > >> interface was originally introduced would be "broken" (meaning, less
> > >> performant as you would try to offline any memory block).
> > >
> > > I would argue that the whole interface is broken ;). Not the first time
> > > in the kernel development history and not the last time either. What I
> > > am trying to say here is that unless there are _real_ usecases depending
> > > on knowing that something surely is _not_ offlineable, I would just
> > > try to drop the functionality while preserving the interface and see
> > > what happens.
> >
> > I can see that, but I can perfectly well understand why - especially
> > powerpc - wants a fast way to sense which blocks it actually makes
> > sense to try to offline.
> >
> > The original patch correctly states
> >"which sections of
> > memory are likely to be removable before attempting the potentially
> > expensive operation."
> >
> > It works as designed I would say.
>
> Then I would just keep it crippled the same way it has been for years
> without anybody noticing.

I tend to agree. At least the kmem driver that wants to unplug memory
could not use an interface that does not give stable answers. It just
relies on remove_memory() to return a definitive error.


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-17 Thread Michal Hocko
On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> On 17.01.20 15:52, Michal Hocko wrote:
> > On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> >> On 17.01.20 12:33, Michal Hocko wrote:
> >>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>  Let's refactor that code. We want to check if we can offline memory
>  blocks. Add a new function is_mem_section_offlineable() for that and
>  make it call is_mem_section_offlineable() for each contained section.
>  Within is_mem_section_offlineable(), add some more sanity checks and
>  directly bail out if the section contains holes or if it spans multiple
>  zones.
> >>>
> >>> I didn't read the patch (yet) but I am wondering. If we want to touch
> >>> this code, can we simply always return true there? I mean whoever
> >>> depends on this check is racy and the failure can happen even after
> >>> the sysfs says good to go, right? The check is essentially as expensive
> >>> as calling the offlining code itself. So the only usecase I can think of
> >>> is a dumb driver to crawl over blocks and check which is removable and
> >>> try to hotremove it. But just trying to offline one block after another
> >>> is essentially going to achieve the same.
> >>
> >> Some thoughts:
> >>
> >> 1. It allows you to check if memory is likely to be offlineable without
> >> doing expensive locking and trying to isolate pages (meaning:
> >> zone->lock, mem_hotplug_lock, but also calling drain_all_pages()
> >> when isolating)
> >>
> >> 2. There are use cases that want to identify a memory block/DIMM to
> >> unplug. One example is PPC DLPAR code (see this patch). Going over all
> >> memory blocks trying to offline them is an expensive operation.
> >>
> >> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> >> makes use of /sys/.../removable to speed up the search AFAIK.
> > 
> > Well, while I do see those points I am not really sure they are worth
> > having a broken (by-definition) interface.
> 
> It's a pure speedup. And for that, the interface has been working
> perfectly fine for years?
> 
> >  
> >> 4. lsmem displays/groups by "removable".
> > 
> > Is anybody really using that?
> 
> Well at least I am using that when testing to identify which
> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> all the zone shrinking stuff I have been fixing)
> 
> So there is at least one user ;)

Fair enough. But I would argue that there are better ways to do the same
solely for testing purposes. Rather than having subtly broken code to
maintain.
 
> > 
> >>> Or does anybody see any reasonable usecase that would break if we did
> >>> that unconditional behavior?
> >>
> >> If we always returned "true", then the whole reason the
> >> interface was originally introduced would be "broken" (meaning, less
> >> performant as you would try to offline any memory block).
> > 
> > I would argue that the whole interface is broken ;). Not the first time
> > in the kernel development history and not the last time either. What I
> > am trying to say here is that unless there are _real_ usecases depending
> > on knowing that something surely is _not_ offlineable, I would just
> > try to drop the functionality while preserving the interface and see
> > what happens.
> 
> I can see that, but I can perfectly well understand why - especially
> powerpc - wants a fast way to sense which blocks it actually makes
> sense to try to offline.
> 
> The original patch correctly states
>"which sections of
> memory are likely to be removable before attempting the potentially
> expensive operation."
> 
> It works as designed I would say.

Then I would just keep it crippled the same way it has been for years
without anybody noticing.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-17 Thread David Hildenbrand
On 17.01.20 15:52, Michal Hocko wrote:
> On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
>> On 17.01.20 12:33, Michal Hocko wrote:
>>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
 Let's refactor that code. We want to check if we can offline memory
 blocks. Add a new function is_mem_section_offlineable() for that and
 make it call is_mem_section_offlineable() for each contained section.
 Within is_mem_section_offlineable(), add some more sanity checks and
 directly bail out if the section contains holes or if it spans multiple
 zones.
>>>
>>> I didn't read the patch (yet) but I am wondering. If we want to touch
>>> this code, can we simply always return true there? I mean whoever
>>> depends on this check is racy and the failure can happen even after
>>> the sysfs says good to go, right? The check is essentially as expensive
>>> as calling the offlining code itself. So the only usecase I can think of
>>> is a dumb driver to crawl over blocks and check which is removable and
>>> try to hotremove it. But just trying to offline one block after another
>>> is essentially going to achieve the same.
>>
>> Some thoughts:
>>
>> 1. It allows you to check if memory is likely to be offlineable without
>> doing expensive locking and trying to isolate pages (meaning:
>> zone->lock, mem_hotplug_lock, but also calling drain_all_pages()
>> when isolating)
>>
>> 2. There are use cases that want to identify a memory block/DIMM to
>> unplug. One example is PPC DLPAR code (see this patch). Going over all
>> memory blocks trying to offline them is an expensive operation.
>>
>> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
>> makes use of /sys/.../removable to speed up the search AFAIK.
> 
> Well, while I do see those points I am not really sure they are worth
> having a broken (by-definition) interface.

It's a pure speedup. And for that, the interface has been working
perfectly fine for years?

>  
>> 4. lsmem displays/groups by "removable".
> 
> Is anybody really using that?

Well at least I am using that when testing to identify which
(ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
all the zone shrinking stuff I have been fixing)

So there is at least one user ;)

[...]
> 
>>> Or does anybody see any reasonable usecase that would break if we did
>>> that unconditional behavior?
>>
>> If we always returned "true", then the whole reason the
>> interface was originally introduced would be "broken" (meaning, less
>> performant as you would try to offline any memory block).
> 
> I would argue that the whole interface is broken ;). Not the first time
> in the kernel development history and not the last time either. What I
> am trying to say here is that unless there are _real_ usecases depending
> on knowing that something surely is _not_ offlineable, I would just
> try to drop the functionality while preserving the interface and see
> what happens.

I can see that, but I can perfectly well understand why - especially
powerpc - wants a fast way to sense which blocks it actually makes
sense to try to offline.

The original patch correctly states
   "which sections of
memory are likely to be removable before attempting the potentially
expensive operation."

It works as designed I would say.

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-17 Thread Michal Hocko
On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> On 17.01.20 12:33, Michal Hocko wrote:
> > On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >> Let's refactor that code. We want to check if we can offline memory
> >> blocks. Add a new function is_mem_section_offlineable() for that and
> >> make it call is_mem_section_offlineable() for each contained section.
> >> Within is_mem_section_offlineable(), add some more sanity checks and
> >> directly bail out if the section contains holes or if it spans multiple
> >> zones.
> > 
> > I didn't read the patch (yet) but I am wondering. If we want to touch
> > this code, can we simply always return true there? I mean whoever
> > depends on this check is racy and the failure can happen even after
> > the sysfs says good to go, right? The check is essentially as expensive
> > as calling the offlining code itself. So the only usecase I can think of
> > is a dumb driver to crawl over blocks and check which is removable and
> > try to hotremove it. But just trying to offline one block after another
> > is essentially going to achieve the same.
> 
> Some thoughts:
> 
> 1. It allows you to check if memory is likely to be offlineable without
> doing expensive locking and trying to isolate pages (meaning:
> zone->lock, mem_hotplug_lock, but also calling drain_all_pages()
> when isolating)
> 
> 2. There are use cases that want to identify a memory block/DIMM to
> unplug. One example is PPC DLPAR code (see this patch). Going over all
> memory blocks trying to offline them is an expensive operation.
> 
> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> makes use of /sys/.../removable to speed up the search AFAIK.

Well, while I do see those points I am not really sure they are worth
having a broken (by-definition) interface.
 
> 4. lsmem displays/groups by "removable".

Is anybody really using that?

> 5. If "removable=false" then it usually really is not offlineable.
> Of course, there could also be races (freeing the last unmovable page),
> but it means "don't even try". OTOH, "removable=true" is more racy,
> and gives fewer guarantees. ("looks okay, feel free to try")

Yeah, but you could already be pessimistic and try movable zones before
other kernel zones.

> > Or does anybody see any reasonable usecase that would break if we did
> > that unconditional behavior?
> 
> If we always returned "true", then the whole reason the
> interface was originally introduced would be "broken" (meaning, less
> performant as you would try to offline any memory block).

I would argue that the whole interface is broken ;). Not the first time
in the kernel development history and not the last time either. What I
am trying to say here is that unless there are _real_ usecases depending
on knowing that something surely is _not_ offlineable, I would just
try to drop the functionality while preserving the interface and see
what happens.
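
For what it is worth, "dropping the functionality while preserving the
interface" (what the 2020-01-20 mails above label option 1a) could be as
small as this sketch (not an actual patch):

static ssize_t removable_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	/*
	 * Pretend everything is removable; actually trying to offline
	 * the block remains the only authoritative answer anyway.
	 */
	return sprintf(buf, "%d\n", 1);
}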
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-17 Thread David Hildenbrand
On 17.01.20 12:33, Michal Hocko wrote:
> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>> Let's refactor that code. We want to check if we can offline memory
>> blocks. Add a new function is_mem_section_offlineable() for that and
>> make it call is_mem_section_offlineable() for each contained section.
>> Within is_mem_section_offlineable(), add some more sanity checks and
>> directly bail out if the section contains holes or if it spans multiple
>> zones.
> 
> I didn't read the patch (yet) but I am wondering. If we want to touch
> this code, can we simply always return true there? I mean whoever
> depends on this check is racy and the failure can happen even after
> the sysfs says good to go, right? The check is essentially as expensive
> as calling the offlining code itself. So the only usecase I can think of
> is a dumb driver to crawl over blocks and check which is removable and
> try to hotremove it. But just trying to offline one block after another
> is essentially going to achieve the same.

Some thoughts:

1. It allows you to check if memory is likely to be offlineable without
doing expensive locking and trying to isolate pages (meaning:
zone->lock, mem_hotplug_lock, but also calling drain_all_pages()
when isolating)

2. There are use cases that want to identify a memory block/DIMM to
unplug. One example is PPC DLPAR code (see this patch). Going over all
memory blocks trying to offline them is an expensive operation.

3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
makes use of /sys/.../removable to speed up the search AFAIK.

4. lsmem displays/groups by "removable".

5. If "removable=false" then it usually really is not offlineable.
Of course, there could also be races (freeing the last unmovable page),
but it means "don't even try". OTOH, "removable=true" is more racy,
and gives fewer guarantees. ("looks okay, feel free to try")

> 
> Or does anybody see any reasonable usecase that would break if we did
> that unconditional behavior?

If we always returned "true", then the whole reason the
interface was originally introduced would be "broken" (meaning, less
performant as you would try to offline any memory block).

commit 5c755e9fd813810680abd56ec09a5f90143e815b
Author: Badari Pulavarty 
Date:   Wed Jul 23 21:28:19 2008 -0700

memory-hotplug: add sysfs removable attribute for hotplug memory remove

Memory may be hot-removed on a per-memory-block basis, particularly on
POWER where the SPARSEMEM section size often matches the memory-block
size.  A user-level agent must be able to identify which sections of
memory are likely to be removable before attempting the potentially
expensive operation.  This patch adds a file called "removable" to the
memory directory in sysfs to help such an agent.  In this patch, a memory
block is considered removable if;


I'd love to see this go away (just like "valid_zones"), but I don't
think it is possible.

So this patch makes it work a little more correctly (multiple zones, holes),
cleans up the code and avoids races with unplug code. It cannot, however,
give more guarantees that memory offlining will actually succeed.
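
A sketch of those extra checks (the function name is from the RFC, but
the body is only a guess at the described behavior, checking just the
first and last pfn of the section):

bool is_mem_section_offlineable(struct mem_section *ms)
{
	unsigned long pfn = section_nr_to_pfn(__section_nr(ms));

	if (!online_section(ms))
		return false;
	/* bail out if the section contains holes */
	if (!pfn_valid(pfn) || !pfn_valid(pfn + PAGES_PER_SECTION - 1))
		return false;
	/* bail out if the section spans multiple zones */
	if (page_zone(pfn_to_page(pfn)) !=
	    page_zone(pfn_to_page(pfn + PAGES_PER_SECTION - 1)))
		return false;
	/* ... the per-page movability checks would follow here ... */
	return true;
}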

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul

2020-01-17 Thread Michal Hocko
On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> Let's refactor that code. We want to check if we can offline memory
> blocks. Add a new function is_mem_section_offlineable() for that and
> make it call is_mem_section_offlineable() for each contained section.
> Within is_mem_section_offlineable(), add some more sanity checks and
> directly bail out if the section contains holes or if it spans multiple
> zones.

I didn't read the patch (yet) but I am wondering. If we want to touch
this code, can we simply always return true there? I mean whoever
depends on this check is racy and the failure can happen even after
the sysfs says good to go, right? The check is essentially as expensive
as calling the offlining code itself. So the only usecase I can think of
is a dumb driver to crawl over blocks and check which is removable and
try to hotremove it. But just trying to offline one block after another
is essentially going to achieve the same.

Or does anybody see any reasonable usecase that would break if we did
that unconditional behavior?
-- 
Michal Hocko
SUSE Labs