Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/31/2017 11:34 PM, Dave Hansen wrote:
> On 01/30/2017 11:25 PM, John Hubbard wrote:
>> I also don't like having these policies hard-coded, and your 100x
>> example above helps clarify what can go wrong about it. It would be
>> nicer if, instead, we could better express the "distance" between nodes
>> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
>> system figure out the Right Thing To Do.
>>
>> I realize that this is not quite possible with NUMA just yet, but I
>> wonder if that's a reasonable direction to go with this?
>
> In the end, I don't think the kernel can make the "right" decision very
> widely here.
>
> Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that
> evidently has a higher latency than DRAM. Given a plain malloc(), how
> is the kernel to know that the memory will be used for AVX-512
> instructions that need lots of bandwidth vs. some random data structure
> that's latency-sensitive?

CDM has been designed to work with a driver which can make these kinds of
memory placement decisions along the way. But as per the above example of
a generic malloc() allocated buffer:

(1) System RAM gets allocated if the first faults are CPU faults
(2) CDM memory gets allocated if the first faults are device access faults
(3) After monitoring the access patterns thereafter, the driver can make
    the required "right" decisions about eventual placement and migrate
    memory as required

> In the end, I think all we can do is keep the kernel's existing default
> of "low latency to the CPU that allocated it", and let apps override
> when that policy doesn't fit them.

I think this is almost similar to what we are trying to achieve with the
CDM representation and driver based migrations. Don't you agree?
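The three-step first-touch placement policy above can be sketched as a small
decision table. This is only a hypothetical userspace illustration (the enum
and function names are invented), not the actual driver logic:

```c
#include <assert.h>

/* Hypothetical sketch (names invented) of the policy described above:
 * system RAM on first CPU fault, CDM memory on first device fault. */
enum mem_node { NODE_SYSRAM, NODE_CDM };
enum fault_src { FAULT_CPU, FAULT_DEVICE };

static enum mem_node place_on_first_fault(enum fault_src first_fault)
{
	/* (1) CPU touches first    -> back with system RAM
	 * (2) device touches first -> back with CDM memory */
	return first_fault == FAULT_CPU ? NODE_SYSRAM : NODE_CDM;
}

/* (3) the driver may later migrate based on observed access counts:
 * mostly-device accesses pull the buffer to CDM, otherwise to sysram. */
static enum mem_node migrate_decision(unsigned long cpu_accesses,
				      unsigned long dev_accesses)
{
	return dev_accesses > cpu_accesses ? NODE_CDM : NODE_SYSRAM;
}
```

The point of the sketch is that the kernel itself never has to guess at
malloc() time; the decision is deferred to the fault and to the driver's
monitoring.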
Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/31/2017 12:55 PM, John Hubbard wrote:
> On 01/30/2017 05:57 PM, Dave Hansen wrote:
>> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>>> system and it was just as fast as the rest of the RAM. Would we still
>>>> want it isolated like this? Or would we want a different policy?
>>>
>>> But then the other argument being, don't we want to keep this 100X more
>>> memory isolated for some special purpose to be utilized by specific
>>> applications?
>>
>> I was thinking that in this case, we wouldn't even want to bother with
>> having "system RAM" in the fallback lists. A device who got its memory
>> usage off by 1% could start to starve the rest of the system. A sane
>> policy in this case might be to isolate the "system RAM" from the
>> device's.
>
> I also don't like having these policies hard-coded, and your 100x
> example above helps clarify what can go wrong about it. It would be
> nicer if, instead, we could better express the "distance" between nodes
> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
> system figure out the Right Thing To Do.
>
> I realize that this is not quite possible with NUMA just yet, but I
> wonder if that's a reasonable direction to go with this?

That is a complete overhaul of the NUMA representation in the kernel. What
CDM attempts is to find a solution within the existing NUMA framework and
with as little code change as possible.
Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/31/2017 07:27 AM, Dave Hansen wrote:
> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>> system and it was just as fast as the rest of the RAM. Would we still
>>> want it isolated like this? Or would we want a different policy?
>>
>> But then the other argument being, don't we want to keep this 100X more
>> memory isolated for some special purpose to be utilized by specific
>> applications?
>
> I was thinking that in this case, we wouldn't even want to bother with
> having "system RAM" in the fallback lists. A device who got its memory

System RAM is in the fallback list of the CDM node for the following
purpose: if the user asks explicitly through mbind() and there is
insufficient memory on the CDM node to fulfill the request, then it is
better to fall back to a system RAM node than to fail the request. This is
in line with expectations from the mbind() call. There are other ways for
user space, like /proc/pid/numa_maps, to query where exactly a given page
has come from at runtime. But to keep options open, I have noted this down
in the cover letter:

"
FALLBACK zonelist creation:

A CDM node's FALLBACK zonelist can also be changed to accommodate other
CDM memory zones along with system RAM zones, in which case they can be
used as fallback options instead of first depending on the system RAM
zones when its own memory is insufficient during allocation.
"

> usage off by 1% could start to starve the rest of the system. A sane

Did not get this point. Could you please elaborate more on it?

> policy in this case might be to isolate the "system RAM" from the
> device's.

Hmm.

>>> Why do we need this hard-coded along with the cpuset stuff later in the
>>> series. Doesn't taking a node out of the cpuset also take it out of the
>>> fallback lists?
>>
>> There are two mutually exclusive approaches which are described in
>> this patch series.
>>
>> (1) zonelist modification based approach
>> (2) cpuset restriction based approach
>>
>> As mentioned in the cover letter,
>
> Well, I'm glad you coded both of them up, but now that we have them how
> do we pick which one to throw to the wolves? Or, do we just merge both
> of them and let one bitrot? ;)

I am just trying to see how each isolation method stacks up from a benefit
and cost point of view, so that we can have an informed debate about their
individual merit. Meanwhile, I have started looking at whether the core
buddy allocator __alloc_pages_nodemask() and its interaction with the
nodemask at various stages can also be modified to implement the intended
solution.
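The fallback behaviour argued for above — serve an mbind()-style request
from the CDM node first, and fall back to a system RAM node rather than fail
when CDM is exhausted — can be sketched in userspace roughly as follows. The
struct and function names are invented for illustration; this is not the
kernel's allocator:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: walking a CDM node's FALLBACK list in order,
 * where the list is [CDM node itself, then system RAM nodes]. */
struct node {
	int id;
	int is_cdm;
	long free_pages;
};

/* Return the id of the node that served the allocation, or -1. */
static int alloc_page_from(struct node *fallback[], size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (fallback[i]->free_pages > 0) {
			fallback[i]->free_pages--;
			return fallback[i]->id;
		}
	}
	return -1;	/* every node in the list is exhausted */
}
```

With the CDM node first in its own list, allocations land on CDM while it
has memory and silently spill over to system RAM afterwards, which is the
mbind() expectation described in the message.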
Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/31/2017 12:04 PM, Dave Hansen wrote:
> On 01/30/2017 11:25 PM, John Hubbard wrote:
>> I also don't like having these policies hard-coded, and your 100x
>> example above helps clarify what can go wrong about it. It would be
>> nicer if, instead, we could better express the "distance" between nodes
>> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
>> system figure out the Right Thing To Do.
>>
>> I realize that this is not quite possible with NUMA just yet, but I
>> wonder if that's a reasonable direction to go with this?
>
> In the end, I don't think the kernel can make the "right" decision very
> widely here.
>
> Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that
> evidently has a higher latency than DRAM. Given a plain malloc(), how
> is the kernel to know that the memory will be used for AVX-512
> instructions that need lots of bandwidth vs. some random data structure
> that's latency-sensitive?
>
> In the end, I think all we can do is keep the kernel's existing default
> of "low latency to the CPU that allocated it", and let apps override
> when that policy doesn't fit them.

I think John's point is that latency might not be the predominant factor
anymore for certain sections of the CPU and GPU world. What if a Phi has
MCDRAM physically attached, but DDR4 connected via QPI that still has lower
total latency (might be a stretch for a Phi, but not a stretch for GPUs
with deep-sorting memory controllers)? Lowest latency is probably the wrong
choice there. Latency has really been a numeric proxy for physical
proximity, under the assumption that the most closely coupled memory is the
right placement, but HBM/MCDRAM is causing that relationship to break down
in all sorts of interesting ways.
Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/30/2017 11:25 PM, John Hubbard wrote:
> I also don't like having these policies hard-coded, and your 100x
> example above helps clarify what can go wrong about it. It would be
> nicer if, instead, we could better express the "distance" between nodes
> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
> system figure out the Right Thing To Do.
>
> I realize that this is not quite possible with NUMA just yet, but I
> wonder if that's a reasonable direction to go with this?

In the end, I don't think the kernel can make the "right" decision very
widely here.

Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that evidently
has a higher latency than DRAM. Given a plain malloc(), how is the kernel
to know that the memory will be used for AVX-512 instructions that need
lots of bandwidth vs. some random data structure that's latency-sensitive?

In the end, I think all we can do is keep the kernel's existing default of
"low latency to the CPU that allocated it", and let apps override when that
policy doesn't fit them.
Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/30/2017 05:57 PM, Dave Hansen wrote:
> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>> system and it was just as fast as the rest of the RAM. Would we still
>>> want it isolated like this? Or would we want a different policy?
>>
>> But then the other argument being, don't we want to keep this 100X more
>> memory isolated for some special purpose to be utilized by specific
>> applications?
>
> I was thinking that in this case, we wouldn't even want to bother with
> having "system RAM" in the fallback lists. A device who got its memory
> usage off by 1% could start to starve the rest of the system. A sane
> policy in this case might be to isolate the "system RAM" from the
> device's.

I also don't like having these policies hard-coded, and your 100x example
above helps clarify what can go wrong about it. It would be nicer if,
instead, we could better express the "distance" between nodes (bandwidth,
latency, relative to sysmem, perhaps), and let the NUMA system figure out
the Right Thing To Do.

I realize that this is not quite possible with NUMA just yet, but I wonder
if that's a reasonable direction to go with this?

thanks,
john h
Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>> Let's say we had a CDM node with 100x more RAM than the rest of the
>> system and it was just as fast as the rest of the RAM. Would we still
>> want it isolated like this? Or would we want a different policy?
>
> But then the other argument being, don't we want to keep this 100X more
> memory isolated for some special purpose to be utilized by specific
> applications?

I was thinking that in this case, we wouldn't even want to bother with
having "system RAM" in the fallback lists. A device who got its memory
usage off by 1% could start to starve the rest of the system. A sane
policy in this case might be to isolate the "system RAM" from the
device's.

>> Why do we need this hard-coded along with the cpuset stuff later in the
>> series. Doesn't taking a node out of the cpuset also take it out of the
>> fallback lists?
>
> There are two mutually exclusive approaches which are described in
> this patch series.
>
> (1) zonelist modification based approach
> (2) cpuset restriction based approach
>
> As mentioned in the cover letter,

Well, I'm glad you coded both of them up, but now that we have them how do
we pick which one to throw to the wolves? Or, do we just merge both of
them and let one bitrot? ;)
Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/30/2017 11:04 PM, Dave Hansen wrote:
> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>> * CDM node's zones are not part of any other node's FALLBACK zonelist
>> * CDM node's FALLBACK list contains its own memory zones followed by
>>   all system RAM zones in regular order as before
>> * CDM node's zones are part of its own NOFALLBACK zonelist
>
> This seems like a sane policy for the system that you're describing.
> But, it's still a policy, and it's rather hard-coded into the kernel.

Right. In the original RFC which I had posted in October, I had thought
about this issue and created 'pglist_data->coherent_device' as a u64
element where each bit in the mask can indicate a specific policy request
for the hot plugged coherent device. But it looked too complicated for the
moment, in the absence of other potential coherent memory HW that really
requires anything other than isolation and an explicit allocation method.

> Let's say we had a CDM node with 100x more RAM than the rest of the
> system and it was just as fast as the rest of the RAM. Would we still
> want it isolated like this? Or would we want a different policy?

Though in this particular case this CDM can be hot plugged into the system
as a normal NUMA node (I don't see any reason why it should not be treated
as a normal NUMA node), I do understand the need for different policy
requirements for different kinds of coherent memory. But then the other
argument being, don't we want to keep this 100X more memory isolated for
some special purpose to be utilized by specific applications? There is a
sense that if the non-system-RAM memory is coherent and similar, there
cannot be many differences in what the applications would expect from the
kernel.

> Why do we need this hard-coded along with the cpuset stuff later in the
> series. Doesn't taking a node out of the cpuset also take it out of the
> fallback lists?

There are two mutually exclusive approaches which are described in this
patch series.

(1) zonelist modification based approach
(2) cpuset restriction based approach

As mentioned in the cover letter,

"
NOTE: These two sets of patches are mutually exclusive of each other and
represent two different approaches. Only one of these sets should be
applied at any point of time.

Set1:
  mm: Change generic FALLBACK zonelist creation process
  mm: Change mbind(MPOL_BIND) implementation for CDM nodes

Set2:
  cpuset: Add cpuset_inc() inside cpuset_init()
  mm: Exclude CDM nodes from task->mems_allowed and root cpuset
  mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE
"

>> 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
>> +#ifdef CONFIG_COHERENT_DEVICE
>> +		/*
>> +		 * CDM node's own zones should not be part of any other
>> +		 * node's fallback zonelist but only it's own fallback
>> +		 * zonelist.
>> +		 */
>> +		if (is_cdm_node(node) && (pgdat->node_id != node))
>> +			continue;
>> +#endif
>
> On a superficial note: Isn't that #ifdef unnecessary? is_cdm_node() has
> a 'return 0' stub when the config option is off anyway.

Right, will fix it up.
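The Set2 (cpuset restriction) semantics could be sketched roughly like
this: CDM nodes are invisible to ordinary allocations because they are
excluded from mems_allowed, and become reachable only through an explicit
__GFP_THISNODE-style request. The flag value, bitmask layout, and function
name here are all invented for illustration; this is not the kernel's
cpuset code:

```c
#include <assert.h>

/* Hypothetical sketch of the cpuset-based approach: a node may serve an
 * allocation if it is in the task's mems_allowed bitmask, or if the
 * caller explicitly targets it with a THISNODE-style flag, which the
 * series makes bypass cpuset enforcement. */
#define GFP_THISNODE 0x1u	/* invented flag value for the sketch */

static int node_allowed(unsigned long mems_allowed, int node,
			unsigned int gfp_flags)
{
	if (gfp_flags & GFP_THISNODE)
		return 1;	/* explicit request bypasses the cpuset */
	return (mems_allowed >> node) & 1;
}
```

Under this model no zonelist surgery is needed: CDM isolation falls out of
every task's mems_allowed simply never containing the CDM nodes.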
Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process
On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
> * CDM node's zones are not part of any other node's FALLBACK zonelist
> * CDM node's FALLBACK list contains its own memory zones followed by
>   all system RAM zones in regular order as before
> * CDM node's zones are part of its own NOFALLBACK zonelist

This seems like a sane policy for the system that you're describing. But,
it's still a policy, and it's rather hard-coded into the kernel.

Let's say we had a CDM node with 100x more RAM than the rest of the system
and it was just as fast as the rest of the RAM. Would we still want it
isolated like this? Or would we want a different policy?

Why do we need this hard-coded along with the cpuset stuff later in the
series? Doesn't taking a node out of the cpuset also take it out of the
fallback lists?

> 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
> +#ifdef CONFIG_COHERENT_DEVICE
> +		/*
> +		 * CDM node's own zones should not be part of any other
> +		 * node's fallback zonelist but only it's own fallback
> +		 * zonelist.
> +		 */
> +		if (is_cdm_node(node) && (pgdat->node_id != node))
> +			continue;
> +#endif

On a superficial note: Isn't that #ifdef unnecessary? is_cdm_node() has a
'return 0' stub when the config option is off anyway.
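The effect of the quoted hunk on the resulting lists can be sketched in
plain userspace C: when building a node's FALLBACK list, a CDM node is
included only in its own list. This is only an illustration of the skip
rule (names invented); the real build_zonelists() orders candidates via
find_next_best_node() rather than iterating in node-id order:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: build node `self`'s fallback list over nr_nodes
 * nodes, skipping any CDM node that is not `self` itself, mirroring the
 * `is_cdm_node(node) && pgdat->node_id != node` check in the patch. */
static size_t build_fallback(int self, const int is_cdm[], size_t nr_nodes,
			     int out[])
{
	size_t n = 0;

	for (int node = 0; node < (int)nr_nodes; node++) {
		/* CDM zones go only into the CDM node's own list */
		if (is_cdm[node] && node != self)
			continue;
		out[n++] = node;
	}
	return n;
}
```

So a system RAM node never sees CDM zones in its fallback list, while the
CDM node's own list still contains itself plus every system RAM node.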