Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Wed, Jan 22, 2020 at 11:09 AM Michal Hocko wrote:
> On Wed 22-01-20 19:46:15, David Hildenbrand wrote:
> > On 22.01.20 19:38, Michal Hocko wrote:
> [...]
> > > How exactly is check + offline more optimal than offline, which makes check its first step? I will get to your later points after this is clarified.
> >
> > Scanning (almost) locklessly is more efficient than bouncing back and forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock and zone locks - as far as I understand.
>
> All but the zone lock shouldn't really be contended and as such shouldn't cause any trouble. zone->lock really depends on the page allocator usage, of course. But as soon as we have contention, it is just more likely that the result is less reliable.
>
> I would also be really curious how much actual time could be saved by this - some real numbers - because hotplug operations shouldn't happen so often that this would stand out. At least that is my understanding.
>
> > And as far as I understood, that was the whole reason for the original commit.
>
> Well, I have my doubts, but it might be just me and I might be wrong. My experience from a large part of the memory hotplug functionality is that it was driven by good intentions but without due diligence to think beyond the most obvious use case. Having a removable flag on the memblock sounds like a neat idea, of course. But an inherently racy flag is just borderline useful.
>
> Anyway, I will stop at this moment and wait for real use cases.

...that, and practical numbers showing that optimizing an interface that can at best give rough-estimate answers is worth the code change.
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Wed 22-01-20 19:46:15, David Hildenbrand wrote:
> On 22.01.20 19:38, Michal Hocko wrote:
[...]
> > How exactly is check + offline more optimal than offline, which makes check its first step? I will get to your later points after this is clarified.
>
> Scanning (almost) locklessly is more efficient than bouncing back and forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock and zone locks - as far as I understand.

All but the zone lock shouldn't really be contended and as such shouldn't cause any trouble. zone->lock really depends on the page allocator usage, of course. But as soon as we have contention, it is just more likely that the result is less reliable.

I would also be really curious how much actual time could be saved by this - some real numbers - because hotplug operations shouldn't happen so often that this would stand out. At least that is my understanding.

> And as far as I understood, that was the whole reason for the original commit.

Well, I have my doubts, but it might be just me and I might be wrong. My experience from a large part of the memory hotplug functionality is that it was driven by good intentions but without due diligence to think beyond the most obvious use case. Having a removable flag on the memblock sounds like a neat idea, of course. But an inherently racy flag is just borderline useful.

Anyway, I will stop at this moment and wait for real use cases.

Thanks!
--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
[...]
> c) CC relevant people I identify (lsmem/chmem/powerpc-utils/etc.) on the patch to see if we are missing other use cases/users/implications.
>
> Sounds like a plan?

I would start with this step.

Thanks!
--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 22.01.20 19:38, Michal Hocko wrote:
> On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
>> On 22.01.20 17:46, Michal Hocko wrote:
>>> On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
> [...]
>>>> Especially interesting for IBM z Systems, where memory onlining/offlining will trigger the actual population of memory in the hypervisor. So if an admin wants to offline some memory (to give it back to the hypervisor), they would use lsmem to identify such blocks first, instead of trying random blocks until one offlining request succeeds.
>>>
>>> I am sorry for being dense here but I still do not understand why s390
>>
>> It's good that we talk about it :) It's hard to reconstruct actual use cases from tools and some documentation only ...
>>
>> Side note (just FYI): One difference on s390x compared to other architectures (AFAICS) is that once memory is offline, you might not be allowed (by the hypervisor) to online it again - because it was effectively unplugged. Such memory is not removed via remove_memory(), it's simply kept offline.
>
> I have a very vague understanding of s390 specialities but this is not really relevant to the discussion AFAICS because this happens _after_ offlining.

Jep, that's why I flagged it as a side note.

>>> and the way it does the hotremove matters here. After all, there are no arch-specific operations done until the memory is offlined. Also, randomly checking memory blocks and then hoping that the offline will succeed is not much different from just trying to offline the block. Both have to crawl through the pfn range and bail out on the unmovable memory.
>>
>> I think in general we have two approaches to memory unplugging.
>>
>> 1. Know explicitly what you want to unplug (e.g., a DIMM spanning multiple memory blocks).
>>
>> 2. Find random memory blocks you can offline/unplug.
>>
>> For 1, I think we both agree that we don't need this. Just try to offline and you know if it worked.
>>
>> Now of course, for 2 you can try random blocks until you succeed. From a sysadmin point of view that's very inefficient. From a powerpc-utils point of view, that's inefficient.
>
> How exactly is check + offline more optimal than offline, which makes check its first step? I will get to your later points after this is clarified.

Scanning (almost) locklessly is more efficient than bouncing back and forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock and zone locks - as far as I understand.

And as far as I understood, that was the whole reason for the original commit.

Anyhow, you should have read until the end of my mail to find what you were looking for :)

--
Thanks,

David / dhildenb
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
> On 22.01.20 17:46, Michal Hocko wrote:
> > On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
[...]
> >> Especially interesting for IBM z Systems, where memory onlining/offlining will trigger the actual population of memory in the hypervisor. So if an admin wants to offline some memory (to give it back to the hypervisor), they would use lsmem to identify such blocks first, instead of trying random blocks until one offlining request succeeds.
> >
> > I am sorry for being dense here but I still do not understand why s390
>
> It's good that we talk about it :) It's hard to reconstruct actual use cases from tools and some documentation only ...
>
> Side note (just FYI): One difference on s390x compared to other architectures (AFAICS) is that once memory is offline, you might not be allowed (by the hypervisor) to online it again - because it was effectively unplugged. Such memory is not removed via remove_memory(), it's simply kept offline.

I have a very vague understanding of s390 specialities but this is not really relevant to the discussion AFAICS because this happens _after_ offlining.

> > and the way it does the hotremove matters here. After all, there are no arch-specific operations done until the memory is offlined. Also, randomly checking memory blocks and then hoping that the offline will succeed is not much different from just trying to offline the block. Both have to crawl through the pfn range and bail out on the unmovable memory.
>
> I think in general we have two approaches to memory unplugging.
>
> 1. Know explicitly what you want to unplug (e.g., a DIMM spanning multiple memory blocks).
>
> 2. Find random memory blocks you can offline/unplug.
>
> For 1, I think we both agree that we don't need this. Just try to offline and you know if it worked.
>
> Now of course, for 2 you can try random blocks until you succeed. From a sysadmin point of view that's very inefficient. From a powerpc-utils point of view, that's inefficient.

How exactly is check + offline more optimal than offline, which makes check its first step? I will get to your later points after this is clarified.

--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 22.01.20 17:46, Michal Hocko wrote:
> On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
>> On 22.01.20 11:54, David Hildenbrand wrote:
>>> On 22.01.20 11:42, Michal Hocko wrote:
>>>> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
>>>>>>>> Really, the interface is flawed and should have never been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...
>>>>>>>
>>>>>>> As I expressed already, the interface works as designed/documented and has been used like that for years.
>>>>>>
>>>>>> It seems we do differ in the usefulness though. Using a crappy interface for years doesn't make it less crappy. I do realize we cannot remove the interface, but we can remove issues with the implementation, and I dare say that most existing users wouldn't really notice.
>>>>>
>>>>> Well, at least powerpc-utils (why this interface was introduced) will notice a) performance-wise and b) because more logging output will be generated (obviously, non-offlineable blocks will be tried to offline).
>>>>
>>>> I would really appreciate some specific example for a real use case. I am not familiar with powerpc-utils workflows myself.
>>>
>>> Not an expert myself:
>>>
>>> https://github.com/ibm-power-utilities/powerpc-utils
>>>
>>> -> src/drmgr/drslot_chrp_mem.c
>>>
>>> On request to remove some memory it will
>>>
>>> a) Read "->removable" of all memory blocks ("lmb")
>>> b) Check if the request can be fulfilled using the removable blocks
>>> c) Try to offline the memory blocks by trying to offline them. If that succeeded, trigger removal using some hypervisor hooks.
>>>
>>> Interestingly, with "AMS ballooning", it will already consider the "removable" information useless (most probably because of non-migratable balloon pages that can be offlined - I assume the powerpc code that I converted to proper balloon compaction just recently). a) and b) are skipped.
>>>
>>> Returning "yes" on all blocks will make them handle it just like if "AMS ballooning" is active. So any memory block will be tried. Should work but will be slower if no ballooning is active.
>>
>> On lsmem:
>>
>> https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html
>>
>> "
>> Removable
>> yes if the memory range can be set offline, no if it cannot be set offline. A dash (-) means that the range is already offline. The kernel method that identifies removable memory ranges is heuristic and not exact. Occasionally, memory ranges are falsely reported as removable or falsely reported as not removable.
>> "
>>
>> Usage of lsmem paired with chmem:
>>
>> https://access.redhat.com/solutions/3937181
>>
>> Especially interesting for IBM z Systems, where memory onlining/offlining will trigger the actual population of memory in the hypervisor. So if an admin wants to offline some memory (to give it back to the hypervisor), they would use lsmem to identify such blocks first, instead of trying random blocks until one offlining request succeeds.
>
> I am sorry for being dense here but I still do not understand why s390

It's good that we talk about it :) It's hard to reconstruct actual use cases from tools and some documentation only ...

Side note (just FYI): One difference on s390x compared to other architectures (AFAICS) is that once memory is offline, you might not be allowed (by the hypervisor) to online it again - because it was effectively unplugged. Such memory is not removed via remove_memory(), it's simply kept offline.

> and the way it does the hotremove matters here. After all, there are no arch-specific operations done until the memory is offlined. Also, randomly checking memory blocks and then hoping that the offline will succeed is not much different from just trying to offline the block. Both have to crawl through the pfn range and bail out on the unmovable memory.

I think in general we have two approaches to memory unplugging.

1. Know explicitly what you want to unplug (e.g., a DIMM spanning multiple memory blocks).

2. Find random memory blocks you can offline/unplug.

For 1, I think we both agree that we don't need this. Just try to offline and you know if it worked.

Now of course, for 2 you can try random blocks until you succeed. From a sysadmin point of view that's very inefficient. From a powerpc-utils point of view, that's inefficient.

I learned just now that "chmem"[1] has a mode where you can specify a "size" and not only a range. So a sysadmin can still control onlining/offlining for this use case with a few commands. In other tools (e.g., powerpc-utils), well, you have to try to offline random memory blocks (just like chmem does). AFAIK, once we turn /sys/.../removable useless, I
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
> On 22.01.20 11:54, David Hildenbrand wrote:
> > On 22.01.20 11:42, Michal Hocko wrote:
> >> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
> >>>>>> Really, the interface is flawed and should have never been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...
> >>>>>
> >>>>> As I expressed already, the interface works as designed/documented and has been used like that for years.
> >>>>
> >>>> It seems we do differ in the usefulness though. Using a crappy interface for years doesn't make it less crappy. I do realize we cannot remove the interface, but we can remove issues with the implementation, and I dare say that most existing users wouldn't really notice.
> >>>
> >>> Well, at least powerpc-utils (why this interface was introduced) will notice a) performance-wise and b) because more logging output will be generated (obviously, non-offlineable blocks will be tried to offline).
> >>
> >> I would really appreciate some specific example for a real use case. I am not familiar with powerpc-utils workflows myself.
> >
> > Not an expert myself:
> >
> > https://github.com/ibm-power-utilities/powerpc-utils
> >
> > -> src/drmgr/drslot_chrp_mem.c
> >
> > On request to remove some memory it will
> >
> > a) Read "->removable" of all memory blocks ("lmb")
> > b) Check if the request can be fulfilled using the removable blocks
> > c) Try to offline the memory blocks by trying to offline them. If that succeeded, trigger removal using some hypervisor hooks.
> >
> > Interestingly, with "AMS ballooning", it will already consider the "removable" information useless (most probably because of non-migratable balloon pages that can be offlined - I assume the powerpc code that I converted to proper balloon compaction just recently). a) and b) are skipped.
> >
> > Returning "yes" on all blocks will make them handle it just like if "AMS ballooning" is active. So any memory block will be tried. Should work but will be slower if no ballooning is active.
>
> On lsmem:
>
> https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html
>
> "
> Removable
> yes if the memory range can be set offline, no if it cannot be set offline. A dash (-) means that the range is already offline. The kernel method that identifies removable memory ranges is heuristic and not exact. Occasionally, memory ranges are falsely reported as removable or falsely reported as not removable.
> "
>
> Usage of lsmem paired with chmem:
>
> https://access.redhat.com/solutions/3937181
>
> Especially interesting for IBM z Systems, where memory onlining/offlining will trigger the actual population of memory in the hypervisor. So if an admin wants to offline some memory (to give it back to the hypervisor), they would use lsmem to identify such blocks first, instead of trying random blocks until one offlining request succeeds.

I am sorry for being dense here but I still do not understand why s390 and the way it does the hotremove matters here. After all, there are no arch-specific operations done until the memory is offlined. Also, randomly checking memory blocks and then hoping that the offline will succeed is not much different from just trying to offline the block. Both have to crawl through the pfn range and bail out on the unmovable memory.

--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 22.01.20 11:54, David Hildenbrand wrote:
> On 22.01.20 11:42, Michal Hocko wrote:
>> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
>>>>>> Really, the interface is flawed and should have never been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...
>>>>>
>>>>> As I expressed already, the interface works as designed/documented and has been used like that for years.
>>>>
>>>> It seems we do differ in the usefulness though. Using a crappy interface for years doesn't make it less crappy. I do realize we cannot remove the interface, but we can remove issues with the implementation, and I dare say that most existing users wouldn't really notice.
>>>
>>> Well, at least powerpc-utils (why this interface was introduced) will notice a) performance-wise and b) because more logging output will be generated (obviously, non-offlineable blocks will be tried to offline).
>>
>> I would really appreciate some specific example for a real use case. I am not familiar with powerpc-utils workflows myself.
>
> Not an expert myself:
>
> https://github.com/ibm-power-utilities/powerpc-utils
>
> -> src/drmgr/drslot_chrp_mem.c
>
> On request to remove some memory it will
>
> a) Read "->removable" of all memory blocks ("lmb")
> b) Check if the request can be fulfilled using the removable blocks
> c) Try to offline the memory blocks by trying to offline them. If that succeeded, trigger removal using some hypervisor hooks.
>
> Interestingly, with "AMS ballooning", it will already consider the "removable" information useless (most probably because of non-migratable balloon pages that can be offlined - I assume the powerpc code that I converted to proper balloon compaction just recently). a) and b) are skipped.
>
> Returning "yes" on all blocks will make them handle it just like if "AMS ballooning" is active. So any memory block will be tried. Should work but will be slower if no ballooning is active.

On lsmem:

https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html

"
Removable
yes if the memory range can be set offline, no if it cannot be set offline. A dash (-) means that the range is already offline. The kernel method that identifies removable memory ranges is heuristic and not exact. Occasionally, memory ranges are falsely reported as removable or falsely reported as not removable.
"

Usage of lsmem paired with chmem:

https://access.redhat.com/solutions/3937181

Especially interesting for IBM z Systems, where memory onlining/offlining will trigger the actual population of memory in the hypervisor. So if an admin wants to offline some memory (to give it back to the hypervisor), they would use lsmem to identify such blocks first, instead of trying random blocks until one offlining request succeeds.

E.g., documented in https://books.google.de/books?id=1UEhDQAAQBAJ=PA117=PA117=lsmem+removable=bl=OzMfU6Gbzu=ACfU3U2IfH0eTVJs0qu50FdkysA3iC0elw=de=X=2ahUKEwjQpdXQkpfnAhVOzqQKHTN4BsoQ6AEwBXoECAoQAQ#v=onepage=lsmem%20removable=false

So I still think that the interface is useful and is getting used in real life. Users tolerate false positives/negatives.

--
Thanks,

David / dhildenb
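The lsmem/chmem flow described above ultimately just reads per-block sysfs attributes. As a rough illustration only - Python rather than the real C tools, with the sysfs root parameterized so the logic can be exercised anywhere, and the helper name invented here - scanning for offlineable candidates looks roughly like:

```python
from pathlib import Path

def scan_memory_blocks(sysfs_root):
    """Report, per memory block, what lsmem derives from
    /sys/devices/system/memory/memoryX/: True if the kernel's (heuristic!)
    'removable' attribute says the block looks offlineable, False if not,
    and None if the block is already offline (lsmem prints '-' there)."""
    result = {}
    for block in sorted(Path(sysfs_root).glob("memory*")):
        state = (block / "state").read_text().strip()
        if state != "online":
            result[block.name] = None
        else:
            removable = (block / "removable").read_text().strip()
            result[block.name] = removable == "1"
    return result
```

Note the caveat from the lsmem documentation quoted above: the attribute is heuristic, so a True here is only a hint - the actual offline request can still fail.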
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 22.01.20 11:42, Michal Hocko wrote:
> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
> >>>> Really, the interface is flawed and should have never been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...
> >>>
> >>> As I expressed already, the interface works as designed/documented and has been used like that for years.
> >>
> >> It seems we do differ in the usefulness though. Using a crappy interface for years doesn't make it less crappy. I do realize we cannot remove the interface, but we can remove issues with the implementation, and I dare say that most existing users wouldn't really notice.
> >
> > Well, at least powerpc-utils (why this interface was introduced) will notice a) performance-wise and b) because more logging output will be generated (obviously, non-offlineable blocks will be tried to offline).
>
> I would really appreciate some specific example for a real use case. I am not familiar with powerpc-utils workflows myself.

Not an expert myself:

https://github.com/ibm-power-utilities/powerpc-utils

-> src/drmgr/drslot_chrp_mem.c

On request to remove some memory it will

a) Read "->removable" of all memory blocks ("lmb")
b) Check if the request can be fulfilled using the removable blocks
c) Try to offline the memory blocks by trying to offline them. If that succeeded, trigger removal using some hypervisor hooks.

Interestingly, with "AMS ballooning", it will already consider the "removable" information useless (most probably because of non-migratable balloon pages that can be offlined - I assume the powerpc code that I converted to proper balloon compaction just recently). a) and b) are skipped.

Returning "yes" on all blocks will make them handle it just like if "AMS ballooning" is active. So any memory block will be tried. Should work but will be slower if no ballooning is active.

--
Thanks,

David / dhildenb
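The a)/b)/c) steps above can be sketched as follows. This is only a hedged model of the flow, not drmgr's actual code (which is C and additionally calls hypervisor hooks to remove the LMB); the function name, the early bail-out, and the parameterized sysfs root are all assumptions for illustration:

```python
from pathlib import Path

def try_remove_blocks(sysfs_root, blocks_wanted):
    """Model of the drmgr-style flow: (a) read 'removable' for every
    block, (b) give up early if not enough blocks look offlineable,
    (c) offline candidates one by one until the request is satisfied.
    'Offlining' here is just writing 'offline' to the state file; on
    real hardware that write can fail despite removable saying yes."""
    root = Path(sysfs_root)
    candidates = [b for b in sorted(root.glob("memory*"))            # (a)
                  if (b / "removable").read_text().strip() == "1"]
    if len(candidates) < blocks_wanted:                              # (b)
        return []
    offlined = []
    for block in candidates:                                         # (c)
        (block / "state").write_text("offline")
        offlined.append(block.name)
        if len(offlined) == blocks_wanted:
            break
    return offlined
```

With "AMS ballooning", per the description above, steps (a) and (b) would be skipped and every block simply tried in turn.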
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
> >>> Really, the interface is flawed and should have never been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...
> >>
> >> As I expressed already, the interface works as designed/documented and has been used like that for years.
> >
> > It seems we do differ in the usefulness though. Using a crappy interface for years doesn't make it less crappy. I do realize we cannot remove the interface, but we can remove issues with the implementation, and I dare say that most existing users wouldn't really notice.
>
> Well, at least powerpc-utils (why this interface was introduced) will notice a) performance-wise and b) because more logging output will be generated (obviously, non-offlineable blocks will be tried to offline).

I would really appreciate some specific example for a real use case. I am not familiar with powerpc-utils workflows myself.

--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
>>> Really, the interface is flawed and should have never been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...
>>
>> As I expressed already, the interface works as designed/documented and has been used like that for years.
>
> It seems we do differ in the usefulness though. Using a crappy interface for years doesn't make it less crappy. I do realize we cannot remove the interface, but we can remove issues with the implementation, and I dare say that most existing users wouldn't really notice.

Well, at least powerpc-utils (why this interface was introduced) will notice a) performance-wise and b) because more logging output will be generated (obviously, non-offlineable blocks will be tried to offline). However, it should not break, because we could have had races/false positives before.

>> I tend to agree that it never should have been merged like that.
>>
>> We have (at least) two places that are racy (with concurrent memory hotplug):
>>
>> 1. /sys/.../memoryX/removable
>> - a) make it always return yes and make the interface useless
>> - b) add proper locking and keep it running as is (e.g., so David can identify offlineable memory blocks :) ).
>>
>> 2. /sys/.../memoryX/valid_zones
>> - a) always return "none" if the memory is online
>> - b) add proper locking and keep it running as is
>> - c) cache the result ("zone") when a block is onlined (e.g., in mem->zone. If it is NULL, either mixed zones or unknown)
>>
>> At least 2. already screams for proper device_lock() locking, as mem->state is not stable across the function call.
>>
>> 1a and 2a are the easiest solutions but remove all ways to identify whether a memory block could theoretically be offlined - without trying (especially, also to identify the MOVABLE zone).
>>
>> I tend to prefer 1b) and 2c), paired with proper device_lock() locking. We don't affect existing use cases but are able to simplify the code + fix the races.
>>
>> What's your opinion? Any alternatives?
>
> 1a) and 2c) if you ask me.

I'll look into all of that; it just might take a little while (busy with a lot of stuff). But after all, it does not seem to be urgent.

1a) will be easy; I'll post a patch soon that we can let rest in -next for a bit to see if people start to scream out loud.

--
Thanks,

David / dhildenb
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Mon 20-01-20 10:14:44, David Hildenbrand wrote:
> On 20.01.20 08:48, Michal Hocko wrote:
> > On Fri 17-01-20 08:57:51, Dan Williams wrote:
> > [...]
> >> Unless the user is willing to hold the device_hotplug_lock over the evaluation then the result is unreliable.
> >
> > Do we want to hold the device_hotplug_lock from this user-readable file in the first place? My book says that this just waits to become a problem.
>
> It was the "big hammer" solution for this RFC.
>
> I think we could do with a try_lock() on the device_lock() paired with a device->removed flag. The latter is helpful for properly catching zombie devices on the onlining/offlining path either way (and on my todo list).

try_lock would be more considerate. It would at least make any potential hammering a bit harder.

> > Really, the interface is flawed and should have never been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...
>
> As I expressed already, the interface works as designed/documented and has been used like that for years.

It seems we do differ in the usefulness though. Using a crappy interface for years doesn't make it less crappy. I do realize we cannot remove the interface, but we can remove issues with the implementation, and I dare say that most existing users wouldn't really notice.

> I tend to agree that it never should have been merged like that.
>
> We have (at least) two places that are racy (with concurrent memory hotplug):
>
> 1. /sys/.../memoryX/removable
> - a) make it always return yes and make the interface useless
> - b) add proper locking and keep it running as is (e.g., so David can identify offlineable memory blocks :) ).
>
> 2. /sys/.../memoryX/valid_zones
> - a) always return "none" if the memory is online
> - b) add proper locking and keep it running as is
> - c) cache the result ("zone") when a block is onlined (e.g., in mem->zone. If it is NULL, either mixed zones or unknown)
>
> At least 2. already screams for proper device_lock() locking, as mem->state is not stable across the function call.
>
> 1a and 2a are the easiest solutions but remove all ways to identify whether a memory block could theoretically be offlined - without trying (especially, also to identify the MOVABLE zone).
>
> I tend to prefer 1b) and 2c), paired with proper device_lock() locking. We don't affect existing use cases but are able to simplify the code + fix the races.
>
> What's your opinion? Any alternatives?

1a) and 2c) if you ask me.

--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 20.01.20 10:14, David Hildenbrand wrote:
> On 20.01.20 08:48, Michal Hocko wrote:
>> On Fri 17-01-20 08:57:51, Dan Williams wrote:
>> [...]
>>> Unless the user is willing to hold the device_hotplug_lock over the evaluation then the result is unreliable.
>>
>> Do we want to hold the device_hotplug_lock from this user-readable file in the first place? My book says that this just waits to become a problem.
>
> It was the "big hammer" solution for this RFC.
>
> I think we could do with a try_lock() on the device_lock() paired with a device->removed flag. The latter is helpful for properly catching zombie devices on the onlining/offlining path either way (and on my todo list).

We do have dev->p->dead which could come in handy.

--
Thanks,

David / dhildenb
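The try-lock-plus-dead-flag pattern under discussion can be illustrated outside the kernel. The following is only a toy Python analogy of the idea - a non-blocking lock attempt plus a "removed" flag consulted under the lock, cf. dev->p->dead - not driver-core code, and all names here are invented:

```python
import threading

class ToyDevice:
    """Analogy for the pattern: a sysfs-style reader takes the per-device
    lock only via try-lock so it never blocks behind hotplug, and checks a
    'removed' flag under the lock so it never reports state for a zombie
    device that was already torn down."""
    def __init__(self):
        self._lock = threading.Lock()
        self.removed = False
        self.state = "online"

    def read_removable(self):
        # try-lock: if hotplug currently holds the lock, report that
        # instead of stalling the reader
        if not self._lock.acquire(blocking=False):
            return "busy"
        try:
            if self.removed:            # zombie: unplug already happened
                return "gone"
            return "yes" if self.state == "online" else "-"
        finally:
            self._lock.release()
```

The design point is that a sysfs read never blocks on a long-running hotplug operation; it either gets a consistent snapshot or backs off immediately.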
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 20.01.20 08:48, Michal Hocko wrote:
> On Fri 17-01-20 08:57:51, Dan Williams wrote:
> [...]
>> Unless the user is willing to hold the device_hotplug_lock over the evaluation then the result is unreliable.
>
> Do we want to hold the device_hotplug_lock from this user-readable file in the first place? My book says that this just waits to become a problem.

It was the "big hammer" solution for this RFC.

I think we could do with a try_lock() on the device_lock() paired with a device->removed flag. The latter is helpful for properly catching zombie devices on the onlining/offlining path either way (and on my todo list).

> Really, the interface is flawed and should have never been merged in the first place. We cannot simply remove it altogether, I am afraid, so let's at least remove the bogus code and pretend that the world is a better place where everything is removable, except the reality sucks...

As I expressed already, the interface works as designed/documented and has been used like that for years.

I tend to agree that it never should have been merged like that.

We have (at least) two places that are racy (with concurrent memory hotplug):

1. /sys/.../memoryX/removable
- a) make it always return yes and make the interface useless
- b) add proper locking and keep it running as is (e.g., so David can identify offlineable memory blocks :) ).

2. /sys/.../memoryX/valid_zones
- a) always return "none" if the memory is online
- b) add proper locking and keep it running as is
- c) cache the result ("zone") when a block is onlined (e.g., in mem->zone. If it is NULL, either mixed zones or unknown)

At least 2. already screams for proper device_lock() locking, as mem->state is not stable across the function call.

1a and 2a are the easiest solutions but remove all ways to identify whether a memory block could theoretically be offlined - without trying (especially, also to identify the MOVABLE zone).

I tend to prefer 1b) and 2c), paired with proper device_lock() locking. We don't affect existing use cases but are able to simplify the code + fix the races.

What's your opinion? Any alternatives?

--
Thanks,

David / dhildenb
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Fri 17-01-20 08:57:51, Dan Williams wrote:
[...]
> Unless the user is willing to hold the device_hotplug_lock over the
> evaluation then the result is unreliable.

Do we want to hold the device_hotplug_lock from this user readable file
in the first place? My book says that this just waits to become a
problem.

Really, the interface is flawed and should have never been merged in the
first place. We cannot simply remove it altogether I am afraid so let's
at least remove the bogus code and pretend that the world is a better
place where everything is removable except the reality sucks...

--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Fri, Jan 17, 2020 at 8:11 AM David Hildenbrand wrote:
>
> On 17.01.20 16:54, Dan Williams wrote:
> > On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko wrote:
> >>
> >> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> >>> On 17.01.20 15:52, Michal Hocko wrote:
> >>>> On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> >>>>> On 17.01.20 12:33, Michal Hocko wrote:
> >>>>>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >>>>>>> Let's refactor that code. We want to check if we can offline memory
> >>>>>>> blocks. Add a new function is_mem_section_offlineable() for that and
> >>>>>>> make it call is_mem_section_offlineable() for each contained section.
> >>>>>>> Within is_mem_section_offlineable(), add some more sanity checks and
> >>>>>>> directly bail out if the section contains holes or if it spans
> >>>>>>> multiple zones.
> >>>>>>
> >>>>>> I didn't read the patch (yet) but I am wondering. If we want to touch
> >>>>>> this code, can we simply always return true there? I mean whoever
> >>>>>> depends on this check is racy and the failure can happen even after
> >>>>>> the sysfs says good to go, right? The check is essentially as expensive
> >>>>>> as calling the offlining code itself. So the only usecase I can think
> >>>>>> of is a dumb driver to crawl over blocks and check which is removable
> >>>>>> and try to hotremove it. But just trying to offline one block after
> >>>>>> another is essentially going to achieve the same.
> >>>>>
> >>>>> Some thoughts:
> >>>>>
> >>>>> 1. It allows you to check if memory is likely to be offlineable without
> >>>>> doing expensive locking and trying to isolate pages (meaning:
> >>>>> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
> >>>>> when isolating)
> >>>>>
> >>>>> 2. There are use cases that want to identify a memory block/DIMM to
> >>>>> unplug. One example is PPC DLPAR code (see this patch). Going over all
> >>>>> memory block trying to offline them is an expensive operation.
> >>>>>
> >>>>> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> >>>>> makes use of /sys/.../removable to speed up the search AFAIK.
> >>>>
> >>>> Well, while I do see those points I am not really sure they are worth
> >>>> having a broken (by-definition) interface.
> >>>
> >>> It's a pure speedup. And for that, the interface has been working
> >>> perfectly fine for years?
> >>>
> >>>>
> >>>>> 4. lsmem displays/groups by "removable".
> >>>>
> >>>> Is anybody really using that?
> >>>
> >>> Well at least I am using that when testing to identify which
> >>> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> >>> all the zone shrinking stuff I have been fixing)
> >>>
> >>> So there is at least one user ;)
> >>
> >> Fair enough. But I would argue that there are better ways to do the same
> >> solely for testing purposes. Rather than having a subtly broken code to
> >> maintain.
> >>
> >>>>
> >>>>>> Or does anybody see any reasonable usecase that would break if we did
> >>>>>> that unconditional behavior?
> >>>>>
> >>>>> If we would return always "true", then the whole reason the
> >>>>> interface originally was introduced would be "broken" (meaning, less
> >>>>> performant as you would try to offline any memory block).
> >>>>
> >>>> I would argue that the whole interface is broken ;). Not the first time
> >>>> in the kernel development history and not the last time either. What I
> >>>> am trying to say here is that unless there are _real_ usecases depending
> >>>> on knowing that something surely is _not_ offlineable then I would just
> >>>> try to drop the functionality while preserving the interface and see
> >>>> what happens.
> >>>
> >>> I can see that, but I can perfectly well understand why - especially
> >>> powerpc - wants a fast way to sense which blocks actually make sense to
> >>> try to offline.
> >>>
> >>> The original patch correctly states "which sections of
> >>> memory are likely to be removable before attempting the potentially
> >>> expensive operation."
> >>>
> >>> It works as designed I would say.
> >>
> >> Then I would just keep it crippled the same way it has been for years
> >> without anybody noticing.
> >
> > I tend to agree. At least the kmem driver that wants to unplug memory
> > could not use an interface that does not give stable answers. It just
> > relies on remove_memory() to return a definitive error.
>
> Just because kmem cannot reuse such an interface doesn't mean we should
> not touch it (or I am not getting your point). Especially, this
> interface is about "can it likely be offlined and then eventually be
> removed (if there is a HW interface for that)" (as documented), not
> about "will remove_memory()" work.

Unless the user is willing to hold the device_hotplug_lock over the
evaluation then the result is unreliable. For example the changes to
removable_show() are no better than they were previously because the
result is invalidated as soon as
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 17.01.20 16:54, Dan Williams wrote:
> On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko wrote:
>>
>> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
>>> On 17.01.20 15:52, Michal Hocko wrote:
>>>> On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
>>>>> On 17.01.20 12:33, Michal Hocko wrote:
>>>>>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>>>>>>> Let's refactor that code. We want to check if we can offline memory
>>>>>>> blocks. Add a new function is_mem_section_offlineable() for that and
>>>>>>> make it call is_mem_section_offlineable() for each contained section.
>>>>>>> Within is_mem_section_offlineable(), add some more sanity checks and
>>>>>>> directly bail out if the section contains holes or if it spans
>>>>>>> multiple zones.
>>>>>>
>>>>>> I didn't read the patch (yet) but I am wondering. If we want to touch
>>>>>> this code, can we simply always return true there? I mean whoever
>>>>>> depends on this check is racy and the failure can happen even after
>>>>>> the sysfs says good to go, right? The check is essentially as expensive
>>>>>> as calling the offlining code itself. So the only usecase I can think
>>>>>> of is a dumb driver to crawl over blocks and check which is removable
>>>>>> and try to hotremove it. But just trying to offline one block after
>>>>>> another is essentially going to achieve the same.
>>>>>
>>>>> Some thoughts:
>>>>>
>>>>> 1. It allows you to check if memory is likely to be offlineable without
>>>>> doing expensive locking and trying to isolate pages (meaning:
>>>>> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
>>>>> when isolating)
>>>>>
>>>>> 2. There are use cases that want to identify a memory block/DIMM to
>>>>> unplug. One example is PPC DLPAR code (see this patch). Going over all
>>>>> memory block trying to offline them is an expensive operation.
>>>>>
>>>>> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
>>>>> makes use of /sys/.../removable to speed up the search AFAIK.
>>>>
>>>> Well, while I do see those points I am not really sure they are worth
>>>> having a broken (by-definition) interface.
>>>
>>> It's a pure speedup. And for that, the interface has been working
>>> perfectly fine for years?
>>>
>>>>
>>>>> 4. lsmem displays/groups by "removable".
>>>>
>>>> Is anybody really using that?
>>>
>>> Well at least I am using that when testing to identify which
>>> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
>>> all the zone shrinking stuff I have been fixing)
>>>
>>> So there is at least one user ;)
>>
>> Fair enough. But I would argue that there are better ways to do the same
>> solely for testing purposes. Rather than having a subtly broken code to
>> maintain.
>>
>>>>
>>>>>> Or does anybody see any reasonable usecase that would break if we did
>>>>>> that unconditional behavior?
>>>>>
>>>>> If we would return always "true", then the whole reason the
>>>>> interface originally was introduced would be "broken" (meaning, less
>>>>> performant as you would try to offline any memory block).
>>>>
>>>> I would argue that the whole interface is broken ;). Not the first time
>>>> in the kernel development history and not the last time either. What I
>>>> am trying to say here is that unless there are _real_ usecases depending
>>>> on knowing that something surely is _not_ offlineable then I would just
>>>> try to drop the functionality while preserving the interface and see
>>>> what happens.
>>>
>>> I can see that, but I can perfectly well understand why - especially
>>> powerpc - wants a fast way to sense which blocks actually make sense to
>>> try to offline.
>>>
>>> The original patch correctly states "which sections of
>>> memory are likely to be removable before attempting the potentially
>>> expensive operation."
>>>
>>> It works as designed I would say.
>>
>> Then I would just keep it crippled the same way it has been for years
>> without anybody noticing.
>
> I tend to agree. At least the kmem driver that wants to unplug memory
> could not use an interface that does not give stable answers. It just
> relies on remove_memory() to return a definitive error.

Just because kmem cannot reuse such an interface doesn't mean we should
not touch it (or I am not getting your point). Especially, this
interface is about "can it likely be offlined and then eventually be
removed (if there is a HW interface for that)" (as documented), not
about "will remove_memory()" work.

We do have users and if we agree to keep it (what I think we should as
I expressed) then I think we should un-cripple and fix it. After all we
have to maintain it. The current interface provides what was documented
- "likely to be offlineable". (the chosen name was just horribly bad -
as I expressed a while ago already :) )

--
Thanks,

David / dhildenb
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko wrote:
>
> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> > On 17.01.20 15:52, Michal Hocko wrote:
> > > On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> > >> On 17.01.20 12:33, Michal Hocko wrote:
> > >>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> > >>>> Let's refactor that code. We want to check if we can offline memory
> > >>>> blocks. Add a new function is_mem_section_offlineable() for that and
> > >>>> make it call is_mem_section_offlineable() for each contained section.
> > >>>> Within is_mem_section_offlineable(), add some more sanity checks and
> > >>>> directly bail out if the section contains holes or if it spans
> > >>>> multiple zones.
> > >>>
> > >>> I didn't read the patch (yet) but I am wondering. If we want to touch
> > >>> this code, can we simply always return true there? I mean whoever
> > >>> depends on this check is racy and the failure can happen even after
> > >>> the sysfs says good to go, right? The check is essentially as expensive
> > >>> as calling the offlining code itself. So the only usecase I can think
> > >>> of is a dumb driver to crawl over blocks and check which is removable
> > >>> and try to hotremove it. But just trying to offline one block after
> > >>> another is essentially going to achieve the same.
> > >>
> > >> Some thoughts:
> > >>
> > >> 1. It allows you to check if memory is likely to be offlineable without
> > >> doing expensive locking and trying to isolate pages (meaning:
> > >> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
> > >> when isolating)
> > >>
> > >> 2. There are use cases that want to identify a memory block/DIMM to
> > >> unplug. One example is PPC DLPAR code (see this patch). Going over all
> > >> memory block trying to offline them is an expensive operation.
> > >>
> > >> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> > >> makes use of /sys/.../removable to speed up the search AFAIK.
> > >
> > > Well, while I do see those points I am not really sure they are worth
> > > having a broken (by-definition) interface.
> >
> > It's a pure speedup. And for that, the interface has been working
> > perfectly fine for years?
> >
> > >
> > >> 4. lsmem displays/groups by "removable".
> > >
> > > Is anybody really using that?
> >
> > Well at least I am using that when testing to identify which
> > (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> > all the zone shrinking stuff I have been fixing)
> >
> > So there is at least one user ;)
>
> Fair enough. But I would argue that there are better ways to do the same
> solely for testing purposes. Rather than having a subtly broken code to
> maintain.
>
> > >
> > >>> Or does anybody see any reasonable usecase that would break if we did
> > >>> that unconditional behavior?
> > >>
> > >> If we would return always "true", then the whole reason the
> > >> interface originally was introduced would be "broken" (meaning, less
> > >> performant as you would try to offline any memory block).
> > >
> > > I would argue that the whole interface is broken ;). Not the first time
> > > in the kernel development history and not the last time either. What I
> > > am trying to say here is that unless there are _real_ usecases depending
> > > on knowing that something surely is _not_ offlineable then I would just
> > > try to drop the functionality while preserving the interface and see
> > > what happens.
> >
> > I can see that, but I can perfectly well understand why - especially
> > powerpc - wants a fast way to sense which blocks actually make sense to
> > try to offline.
> >
> > The original patch correctly states "which sections of
> > memory are likely to be removable before attempting the potentially
> > expensive operation."
> >
> > It works as designed I would say.
>
> Then I would just keep it crippled the same way it has been for years
> without anybody noticing.

I tend to agree. At least the kmem driver that wants to unplug memory
could not use an interface that does not give stable answers. It just
relies on remove_memory() to return a definitive error.
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> On 17.01.20 15:52, Michal Hocko wrote:
> > On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> >> On 17.01.20 12:33, Michal Hocko wrote:
> >>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >>>> Let's refactor that code. We want to check if we can offline memory
> >>>> blocks. Add a new function is_mem_section_offlineable() for that and
> >>>> make it call is_mem_section_offlineable() for each contained section.
> >>>> Within is_mem_section_offlineable(), add some more sanity checks and
> >>>> directly bail out if the section contains holes or if it spans
> >>>> multiple zones.
> >>>
> >>> I didn't read the patch (yet) but I am wondering. If we want to touch
> >>> this code, can we simply always return true there? I mean whoever
> >>> depends on this check is racy and the failure can happen even after
> >>> the sysfs says good to go, right? The check is essentially as expensive
> >>> as calling the offlining code itself. So the only usecase I can think
> >>> of is a dumb driver to crawl over blocks and check which is removable
> >>> and try to hotremove it. But just trying to offline one block after
> >>> another is essentially going to achieve the same.
> >>
> >> Some thoughts:
> >>
> >> 1. It allows you to check if memory is likely to be offlineable without
> >> doing expensive locking and trying to isolate pages (meaning:
> >> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
> >> when isolating)
> >>
> >> 2. There are use cases that want to identify a memory block/DIMM to
> >> unplug. One example is PPC DLPAR code (see this patch). Going over all
> >> memory block trying to offline them is an expensive operation.
> >>
> >> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> >> makes use of /sys/.../removable to speed up the search AFAIK.
> >
> > Well, while I do see those points I am not really sure they are worth
> > having a broken (by-definition) interface.
>
> It's a pure speedup. And for that, the interface has been working
> perfectly fine for years?
>
> >
> >> 4. lsmem displays/groups by "removable".
> >
> > Is anybody really using that?
>
> Well at least I am using that when testing to identify which
> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> all the zone shrinking stuff I have been fixing)
>
> So there is at least one user ;)

Fair enough. But I would argue that there are better ways to do the same
solely for testing purposes. Rather than having a subtly broken code to
maintain.

> >
> >>> Or does anybody see any reasonable usecase that would break if we did
> >>> that unconditional behavior?
> >>
> >> If we would return always "true", then the whole reason the
> >> interface originally was introduced would be "broken" (meaning, less
> >> performant as you would try to offline any memory block).
> >
> > I would argue that the whole interface is broken ;). Not the first time
> > in the kernel development history and not the last time either. What I
> > am trying to say here is that unless there are _real_ usecases depending
> > on knowing that something surely is _not_ offlineable then I would just
> > try to drop the functionality while preserving the interface and see
> > what happens.
>
> I can see that, but I can perfectly well understand why - especially
> powerpc - wants a fast way to sense which blocks actually make sense to
> try to offline.
>
> The original patch correctly states "which sections of
> memory are likely to be removable before attempting the potentially
> expensive operation."
>
> It works as designed I would say.

Then I would just keep it crippled the same way it has been for years
without anybody noticing.

--
Michal Hocko
SUSE Labs
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 17.01.20 15:52, Michal Hocko wrote:
> On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
>> On 17.01.20 12:33, Michal Hocko wrote:
>>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>>>> Let's refactor that code. We want to check if we can offline memory
>>>> blocks. Add a new function is_mem_section_offlineable() for that and
>>>> make it call is_mem_section_offlineable() for each contained section.
>>>> Within is_mem_section_offlineable(), add some more sanity checks and
>>>> directly bail out if the section contains holes or if it spans
>>>> multiple zones.
>>>
>>> I didn't read the patch (yet) but I am wondering. If we want to touch
>>> this code, can we simply always return true there? I mean whoever
>>> depends on this check is racy and the failure can happen even after
>>> the sysfs says good to go, right? The check is essentially as expensive
>>> as calling the offlining code itself. So the only usecase I can think
>>> of is a dumb driver to crawl over blocks and check which is removable
>>> and try to hotremove it. But just trying to offline one block after
>>> another is essentially going to achieve the same.
>>
>> Some thoughts:
>>
>> 1. It allows you to check if memory is likely to be offlineable without
>> doing expensive locking and trying to isolate pages (meaning:
>> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
>> when isolating)
>>
>> 2. There are use cases that want to identify a memory block/DIMM to
>> unplug. One example is PPC DLPAR code (see this patch). Going over all
>> memory block trying to offline them is an expensive operation.
>>
>> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
>> makes use of /sys/.../removable to speed up the search AFAIK.
>
> Well, while I do see those points I am not really sure they are worth
> having a broken (by-definition) interface.

It's a pure speedup. And for that, the interface has been working
perfectly fine for years?

>
>> 4. lsmem displays/groups by "removable".
>
> Is anybody really using that?

Well at least I am using that when testing to identify which
(ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
all the zone shrinking stuff I have been fixing)

So there is at least one user ;)

[...]

>
>>> Or does anybody see any reasonable usecase that would break if we did
>>> that unconditional behavior?
>>
>> If we would return always "true", then the whole reason the
>> interface originally was introduced would be "broken" (meaning, less
>> performant as you would try to offline any memory block).
>
> I would argue that the whole interface is broken ;). Not the first time
> in the kernel development history and not the last time either. What I
> am trying to say here is that unless there are _real_ usecases depending
> on knowing that something surely is _not_ offlineable then I would just
> try to drop the functionality while preserving the interface and see
> what happens.

I can see that, but I can perfectly well understand why - especially
powerpc - wants a fast way to sense which blocks actually make sense to
try to offline.

The original patch correctly states "which sections of
memory are likely to be removable before attempting the potentially
expensive operation."

It works as designed I would say.

--
Thanks,

David / dhildenb
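For reference, the lsmem grouping mentioned above comes from util-linux, which exposes a REMOVABLE column backed by the sysfs file under discussion. The column names below follow the lsmem man page; the invocation is a sketch and degrades gracefully where lsmem or the memory-block sysfs tree is unavailable:

```shell
# Show contiguous memory-block ranges grouped by state/removability.
# Output is system-dependent; fall back when lsmem or sysfs is missing.
if command -v lsmem >/dev/null 2>&1; then
    out=$(lsmem -o RANGE,SIZE,STATE,REMOVABLE,BLOCK 2>/dev/null) \
        || out="no memory block sysfs on this system"
else
    out="lsmem not installed"
fi
echo "$out"
```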
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> On 17.01.20 12:33, Michal Hocko wrote:
> > On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >> Let's refactor that code. We want to check if we can offline memory
> >> blocks. Add a new function is_mem_section_offlineable() for that and
> >> make it call is_mem_section_offlineable() for each contained section.
> >> Within is_mem_section_offlineable(), add some more sanity checks and
> >> directly bail out if the section contains holes or if it spans
> >> multiple zones.
> >
> > I didn't read the patch (yet) but I am wondering. If we want to touch
> > this code, can we simply always return true there? I mean whoever
> > depends on this check is racy and the failure can happen even after
> > the sysfs says good to go, right? The check is essentially as expensive
> > as calling the offlining code itself. So the only usecase I can think
> > of is a dumb driver to crawl over blocks and check which is removable
> > and try to hotremove it. But just trying to offline one block after
> > another is essentially going to achieve the same.
>
> Some thoughts:
>
> 1. It allows you to check if memory is likely to be offlineable without
> doing expensive locking and trying to isolate pages (meaning:
> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
> when isolating)
>
> 2. There are use cases that want to identify a memory block/DIMM to
> unplug. One example is PPC DLPAR code (see this patch). Going over all
> memory block trying to offline them is an expensive operation.
>
> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> makes use of /sys/.../removable to speed up the search AFAIK.

Well, while I do see those points I am not really sure they are worth
having a broken (by-definition) interface.

> 4. lsmem displays/groups by "removable".

Is anybody really using that?

> 5. If "removable=false" then it usually really is not offlineable.
> Of course, there could also be races (free the last unmovable page),
> but it means "don't even try". OTOH, "removable=true" is more racy,
> and gives less guarantees. ("looks okay, feel free to try")

Yeah, but you could be already pessimistic and try movable zones before
other kernel zones.

> > Or does anybody see any reasonable usecase that would break if we did
> > that unconditional behavior?
>
> If we would return always "true", then the whole reason the
> interface originally was introduced would be "broken" (meaning, less
> performant as you would try to offline any memory block).

I would argue that the whole interface is broken ;). Not the first time
in the kernel development history and not the last time either. What I
am trying to say here is that unless there are _real_ usecases depending
on knowing that something surely is _not_ offlineable then I would just
try to drop the functionality while preserving the interface and see
what happens.

--
Michal Hocko
SUSE Labs
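The "pessimistic" ordering Michal suggests can already be done from userspace with the existing valid_zones file: try blocks backed by ZONE_MOVABLE first, everything else later. A minimal sketch; the `valid_zones` file name and the "Movable" token match the real sysfs interface, but the mock tree and zone assignment here are invented so the loop can run anywhere:

```shell
# Stand-in for /sys/devices/system/memory with an invented zone layout.
SYSMEM=$(mktemp -d)
mkdir "$SYSMEM/memory0" "$SYSMEM/memory1" "$SYSMEM/memory2"
echo Normal  > "$SYSMEM/memory0/valid_zones"
echo Movable > "$SYSMEM/memory1/valid_zones"
echo Normal  > "$SYSMEM/memory2/valid_zones"

movable="" others=""
for blk in "$SYSMEM"/memory*; do
    if grep -q Movable "$blk/valid_zones"; then
        movable="$movable ${blk##*/}"   # best offlining odds, try first
    else
        others="$others ${blk##*/}"     # kernel zones, try last
    fi
done
echo "try first:$movable then:$others"
```

This ordering needs no removability flag at all, which is part of Michal's argument that the flag adds little.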
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On 17.01.20 12:33, Michal Hocko wrote:
> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>> Let's refactor that code. We want to check if we can offline memory
>> blocks. Add a new function is_mem_section_offlineable() for that and
>> make it call is_mem_section_offlineable() for each contained section.
>> Within is_mem_section_offlineable(), add some more sanity checks and
>> directly bail out if the section contains holes or if it spans
>> multiple zones.
>
> I didn't read the patch (yet) but I am wondering. If we want to touch
> this code, can we simply always return true there? I mean whoever
> depends on this check is racy and the failure can happen even after
> the sysfs says good to go, right? The check is essentially as expensive
> as calling the offlining code itself. So the only usecase I can think
> of is a dumb driver to crawl over blocks and check which is removable
> and try to hotremove it. But just trying to offline one block after
> another is essentially going to achieve the same.

Some thoughts:

1. It allows you to check if memory is likely to be offlineable without
doing expensive locking and trying to isolate pages (meaning:
zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
when isolating)

2. There are use cases that want to identify a memory block/DIMM to
unplug. One example is PPC DLPAR code (see this patch). Going over all
memory block trying to offline them is an expensive operation.

3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
makes use of /sys/.../removable to speed up the search AFAIK.

4. lsmem displays/groups by "removable".

5. If "removable=false" then it usually really is not offlineable.
Of course, there could also be races (free the last unmovable page),
but it means "don't even try". OTOH, "removable=true" is more racy,
and gives less guarantees. ("looks okay, feel free to try")

> Or does anybody see any reasonable usecase that would break if we did
> that unconditional behavior?

If we would return always "true", then the whole reason the
interface originally was introduced would be "broken" (meaning, less
performant as you would try to offline any memory block).

commit 5c755e9fd813810680abd56ec09a5f90143e815b
Author: Badari Pulavarty
Date:   Wed Jul 23 21:28:19 2008 -0700

    memory-hotplug: add sysfs removable attribute for hotplug memory remove

    Memory may be hot-removed on a per-memory-block basis, particularly
    on POWER where the SPARSEMEM section size often matches the
    memory-block size. A user-level agent must be able to identify which
    sections of memory are likely to be removable before attempting the
    potentially expensive operation. This patch adds a file called
    "removable" to the memory directory in sysfs to help such an agent.
    In this patch, a memory block is considered removable if;

I'd love to see this go away (just like "valid_zones"), but I don't
think it is possible. So this patch makes it work a little more
correctly (multiple zones, holes), cleans up the code and avoids races
with unplug code. It can, however, not give more guarantees if memory
offlining will actually succeed.

--
Thanks,

David / dhildenb
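The workflow that 2008 commit had in mind (and that powerpc-utils still follows, per point 3 above) reads `removable` first and only attempts the expensive offline for promising blocks. A sketch: the 0/1 contents of `removable` match the real sysfs file, but the block layout below is a throwaway mock tree so the logic can be exercised anywhere, and on a real system the directory would be /sys/devices/system/memory:

```shell
# Mock stand-in for /sys/devices/system/memory.
SYSMEM=$(mktemp -d)
for i in 0 1 2; do mkdir "$SYSMEM/memory$i"; done
echo 1 > "$SYSMEM/memory0/removable"   # likely offlineable
echo 0 > "$SYSMEM/memory1/removable"   # contains unmovable pages
echo 1 > "$SYSMEM/memory2/removable"

# Pre-filter: only blocks advertising removable=1 are worth trying.
candidates=""
for blk in "$SYSMEM"/memory*; do
    [ "$(cat "$blk/removable")" = "1" ] && candidates="$candidates ${blk##*/}"
done
echo "will try:$candidates"
```

The answer is only a hint: by the time the agent writes "offline" to a candidate block's `state` file, the situation may have changed, which is exactly the race this thread is about.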
Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> Let's refactor that code. We want to check if we can offline memory
> blocks. Add a new function is_mem_section_offlineable() for that and
> make it call is_mem_section_offlineable() for each contained section.
> Within is_mem_section_offlineable(), add some more sanity checks and
> directly bail out if the section contains holes or if it spans
> multiple zones.

I didn't read the patch (yet) but I am wondering. If we want to touch
this code, can we simply always return true there? I mean whoever
depends on this check is racy and the failure can happen even after
the sysfs says good to go, right? The check is essentially as expensive
as calling the offlining code itself. So the only usecase I can think of
is a dumb driver to crawl over blocks and check which is removable and
try to hotremove it. But just trying to offline one block after another
is essentially going to achieve the same.

Or does anybody see any reasonable usecase that would break if we did
that unconditional behavior?

--
Michal Hocko
SUSE Labs
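The "just try to offline one block after another" approach Michal describes amounts to a loop like the one below. This is a sketch only: on a real system it would run as root against /sys/devices/system/memory, and the write to `state` can fail (e.g. -EBUSY) for blocks with unmovable pages; here a throwaway mock tree stands in so the control flow can be shown:

```shell
# Mock stand-in for /sys/devices/system/memory (the real sysfs path).
SYSMEM=$(mktemp -d)
for i in 0 1 2; do
    mkdir "$SYSMEM/memory$i"
    echo online > "$SYSMEM/memory$i/state"
done

# "Dumb crawl": attempt to offline every block and count successes.
# Against real sysfs the redirect fails for non-offlineable blocks;
# the mock tree accepts every write.
offlined=0
for blk in "$SYSMEM"/memory*; do
    if echo offline > "$blk/state" 2>/dev/null; then
        offlined=$((offlined + 1))
    fi
done
echo "offlined $offlined of 3 blocks"
```

Each failed attempt in this loop pays the full isolation/migration cost, which is the expense the `removable` pre-check was introduced to avoid.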