Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-31 Thread Anshuman Khandual
On 01/31/2017 11:34 PM, Dave Hansen wrote:
> On 01/30/2017 11:25 PM, John Hubbard wrote:
>> I also don't like having these policies hard-coded, and your 100x
>> example above helps clarify what can go wrong about it. It would be
>> nicer if, instead, we could better express the "distance" between nodes
>> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
>> system figure out the Right Thing To Do.
>>
>> I realize that this is not quite possible with NUMA just yet, but I
>> wonder if that's a reasonable direction to go with this?
> 
> In the end, I don't think the kernel can make the "right" decision very
> widely here.
> 
> Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that
> evidently has a higher latency than DRAM.  Given a plain malloc(), how
> is the kernel to know that the memory will be used for AVX-512
> instructions that need lots of bandwidth vs. some random data structure
> that's latency-sensitive?

CDM has been designed to work with a driver which can make these kinds
of memory placement decisions along the way. For the above example of a
generic malloc() allocated buffer:

(1) System RAM gets allocated on a first CPU fault
(2) CDM memory gets allocated on a first device access fault
(3) After monitoring the access patterns thereafter, the driver can
    make the required "right" decisions about eventual placement and
    migrate memory as required
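
As a rough userspace illustration of (1) and (3) above (not the actual
driver path, which does all of this in the kernel), here is a sketch using
move_pages(2) to first query where a CPU-touched page landed and then
request migration; the CDM node number (1) is purely an assumption of this
sketch, and it builds with -lnuma:

/* Illustrative sketch only: query and migrate one page with move_pages(2). */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	void *buf = aligned_alloc(psz, psz);
	void *pages[1] = { buf };
	int status = -1;
	int target = 1;			/* assumed CDM node id */

	memset(buf, 0, psz);		/* first CPU touch -> system RAM */

	/* nodes == NULL: only report the node each page currently sits on */
	if (move_pages(0, 1, pages, NULL, &status, 0))
		perror("move_pages(query)");
	printf("page sits on node %d after CPU first touch\n", status);

	/* what a monitoring agent might do after watching access patterns */
	if (move_pages(0, 1, pages, &target, &status, MPOL_MF_MOVE))
		perror("move_pages(migrate)");
	printf("after migration request, status/node is %d\n", status);
	return 0;
}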

> 
> In the end, I think all we can do is keep the kernel's existing default
> of "low latency to the CPU that allocated it", and let apps override
> when that policy doesn't fit them.

I think this is quite similar to what we are trying to achieve with the
CDM representation and driver based migrations. Don't you agree?



Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-31 Thread Anshuman Khandual
On 01/31/2017 12:55 PM, John Hubbard wrote:
> On 01/30/2017 05:57 PM, Dave Hansen wrote:
>> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>>> system and it was just as fast as the rest of the RAM.  Would we still
>>>> want it isolated like this?  Or would we want a different policy?
>>>
>>> But then the other argument being, dont we want to keep this 100X more
>>> memory isolated for some special purpose to be utilized by specific
>>> applications ?
>>
>> I was thinking that in this case, we wouldn't even want to bother with
>> having "system RAM" in the fallback lists.  A device who got its memory
>> usage off by 1% could start to starve the rest of the system.  A sane
>> policy in this case might be to isolate the "system RAM" from the
>> device's.
> 
> I also don't like having these policies hard-coded, and your 100x
> example above helps clarify what can go wrong about it. It would be
> nicer if, instead, we could better express the "distance" between nodes
> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
> system figure out the Right Thing To Do.
> 
> I realize that this is not quite possible with NUMA just yet, but I
> wonder if that's a reasonable direction to go with this?

That would be a complete overhaul of the NUMA representation in the
kernel. What CDM attempts is to find a solution within the existing NUMA
framework, with as little code change as possible.



Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-31 Thread Anshuman Khandual
On 01/31/2017 07:27 AM, Dave Hansen wrote:
> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>> system and it was just as fast as the rest of the RAM.  Would we still
>>> want it isolated like this?  Or would we want a different policy?
>>
>> But then the other argument being, dont we want to keep this 100X more
>> memory isolated for some special purpose to be utilized by specific
>> applications ?
> 
> I was thinking that in this case, we wouldn't even want to bother with
> having "system RAM" in the fallback lists.  A device who got its memory

System RAM is in the fallback list of the CDM node for the following
reason.

If the user asks explicitly through mbind() and there is insufficient
memory on the CDM node to fulfill the request, it is better to fall back
to a system RAM node than to fail the request. This is in line with the
expectations from the mbind() call. User space also has other ways, like
/proc/pid/numa_maps, to query exactly which node a given page has come
from at runtime.
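
For reference, a minimal userspace sketch of the mbind() behaviour being
described, assuming the CDM memory shows up as node 1 (an assumption of
this sketch; build with -lnuma). With MPOL_BIND and no MPOL_MF_STRICT the
pages can legitimately land on the fallback system RAM zones, and
/proc/self/numa_maps shows where they actually came from:

/* Sketch: bind an anonymous mapping to an assumed CDM node, then dump numa_maps. */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4UL << 20;				/* 4MB */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1UL << 1;		/* node 1 assumed to be CDM */

	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");

	memset(buf, 0, len);				/* fault the pages in */

	char cmd[64];
	snprintf(cmd, sizeof(cmd), "cat /proc/%d/numa_maps", (int)getpid());
	system(cmd);					/* per-node page counts per VMA */
	return 0;
}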

But to keep the options open, I have noted this down in the cover letter.

"
FALLBACK zonelist creation:

CDM node's FALLBACK zonelist can also be changed to accommodate other CDM
memory zones along with system RAM zones, in which case they can be used as
fallback options instead of first depending on the system RAM zones when
its own memory falls short during allocation.
"

> usage off by 1% could start to starve the rest of the system.  A sane

I did not get this point. Could you please elaborate on it?

> policy in this case might be to isolate the "system RAM" from the device's.

Hmm.

> 
>>> Why do we need this hard-coded along with the cpuset stuff later in the
>>> series.  Doesn't taking a node out of the cpuset also take it out of the
>>> fallback lists?
>>
>> There are two mutually exclusive approaches which are described in
>> this patch series.
>>
>> (1) zonelist modification based approach
>> (2) cpuset restriction based approach
>>
>> As mentioned in the cover letter,
> 
> Well, I'm glad you coded both of them up, but now that we have them how
> to we pick which one to throw to the wolves?  Or, do we just merge both
> of them and let one bitrot? ;)

I am just trying to see how each isolation method stacks up from a
benefit and cost point of view, so that we can have an informed debate
about their individual merits. Meanwhile I have started looking at
whether the core buddy allocator __alloc_pages_nodemask() and its
interaction with the nodemask at various stages can also be modified to
implement the intended solution.



Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-31 Thread David Nellans


On 01/31/2017 12:04 PM, Dave Hansen wrote:
> On 01/30/2017 11:25 PM, John Hubbard wrote:
>> I also don't like having these policies hard-coded, and your 100x
>> example above helps clarify what can go wrong about it. It would be
>> nicer if, instead, we could better express the "distance" between nodes
>> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
>> system figure out the Right Thing To Do.
>>
>> I realize that this is not quite possible with NUMA just yet, but I
>> wonder if that's a reasonable direction to go with this?
> In the end, I don't think the kernel can make the "right" decision very
> widely here.
>
> Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that
> evidently has a higher latency than DRAM.  Given a plain malloc(), how
> is the kernel to know that the memory will be used for AVX-512
> instructions that need lots of bandwidth vs. some random data structure
> that's latency-sensitive?
>
> In the end, I think all we can do is keep the kernel's existing default
> of "low latency to the CPU that allocated it", and let apps override
> when that policy doesn't fit them.
>
I think John's point is that latency might not be the predominant factor
anymore for certain sections of the CPU and GPU world.  What if a Phi has
MCDRAM physically attached, but DDR4 connected via QPI that still has
lower total latency (this might be a stretch for Phi, but not for GPUs
with deep sorting memory controllers)?  Lowest latency is probably the
wrong choice there.  Latency has really been a numeric proxy for physical
proximity, under the assumption that the most closely coupled memory is
the right placement, but HBM/MCDRAM is causing that relationship to break
down in all sorts of interesting ways.


Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-31 Thread Dave Hansen
On 01/30/2017 11:25 PM, John Hubbard wrote:
> I also don't like having these policies hard-coded, and your 100x
> example above helps clarify what can go wrong about it. It would be
> nicer if, instead, we could better express the "distance" between nodes
> (bandwidth, latency, relative to sysmem, perhaps), and let the NUMA
> system figure out the Right Thing To Do.
> 
> I realize that this is not quite possible with NUMA just yet, but I
> wonder if that's a reasonable direction to go with this?

In the end, I don't think the kernel can make the "right" decision very
widely here.

Intel's Xeon Phis have some high-bandwidth memory (MCDRAM) that
evidently has a higher latency than DRAM.  Given a plain malloc(), how
is the kernel to know that the memory will be used for AVX-512
instructions that need lots of bandwidth vs. some random data structure
that's latency-sensitive?

In the end, I think all we can do is keep the kernel's existing default
of "low latency to the CPU that allocated it", and let apps override
when that policy doesn't fit them.
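
As a concrete sketch of such an application override (node 1 standing in
for a hypothetical MCDRAM/high-bandwidth node is an assumption here, not
something the kernel reports), set_mempolicy(2) can steer the
bandwidth-hungry phase toward that node and then drop back to the default
local-node policy; build with -lnuma:

/* Sketch: temporarily prefer an assumed high-bandwidth node for allocations. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	unsigned long hbm_mask = 1UL << 1;	/* hypothetical HBM node id 1 */
	size_t len = 64UL << 20;

	/* Prefer the high-bandwidth node for the allocations that follow */
	if (set_mempolicy(MPOL_PREFERRED, &hbm_mask, sizeof(hbm_mask) * 8))
		perror("set_mempolicy(MPOL_PREFERRED)");

	void *bw_buf = malloc(len);		/* bandwidth-bound working set */
	memset(bw_buf, 0, len);			/* first touch follows the policy */

	/* Back to the kernel default: low latency to the allocating CPU */
	if (set_mempolicy(MPOL_DEFAULT, NULL, 0))
		perror("set_mempolicy(MPOL_DEFAULT)");

	free(bw_buf);
	return 0;
}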


Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-30 Thread John Hubbard

On 01/30/2017 05:57 PM, Dave Hansen wrote:

> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>
>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>> system and it was just as fast as the rest of the RAM.  Would we still
>>> want it isolated like this?  Or would we want a different policy?
>>
>> But then the other argument being, dont we want to keep this 100X more
>> memory isolated for some special purpose to be utilized by specific
>> applications ?
>
> I was thinking that in this case, we wouldn't even want to bother with
> having "system RAM" in the fallback lists.  A device who got its memory
> usage off by 1% could start to starve the rest of the system.  A sane
> policy in this case might be to isolate the "system RAM" from the device's.


I also don't like having these policies hard-coded, and your 100x example above 
helps clarify what can go wrong about it. It would be nicer if, instead, we could 
better express the "distance" between nodes (bandwidth, latency, relative to sysmem, 
perhaps), and let the NUMA system figure out the Right Thing To Do.


I realize that this is not quite possible with NUMA just yet, but I wonder if that's 
a reasonable direction to go with this?
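
For what it's worth, the only per-node "distance" the kernel exports today
is the single SLIT-style scalar, which is exactly what makes expressing
bandwidth and latency separately awkward; a small sketch that dumps it via
the standard sysfs files:

/* Print the distance vector for each node that is present in sysfs. */
#include <stdio.h>

int main(void)
{
	char path[64], line[256];
	int node;

	for (node = 0; node < 8; node++) {	/* probe a handful of node ids */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		FILE *f = fopen(path, "r");
		if (!f)
			continue;		/* node not present */
		if (fgets(line, sizeof(line), f))
			printf("node%d distances: %s", node, line);
		fclose(f);
	}
	return 0;
}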


thanks,
john h




>>> Why do we need this hard-coded along with the cpuset stuff later in the
>>> series.  Doesn't taking a node out of the cpuset also take it out of the
>>> fallback lists?
>>
>> There are two mutually exclusive approaches which are described in
>> this patch series.
>>
>> (1) zonelist modification based approach
>> (2) cpuset restriction based approach
>>
>> As mentioned in the cover letter,
>
> Well, I'm glad you coded both of them up, but now that we have them how
> to we pick which one to throw to the wolves?  Or, do we just merge both
> of them and let one bitrot? ;)


Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-30 Thread Dave Hansen
On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>> Let's say we had a CDM node with 100x more RAM than the rest of the
>> system and it was just as fast as the rest of the RAM.  Would we still
>> want it isolated like this?  Or would we want a different policy?
> 
> But then the other argument being, dont we want to keep this 100X more
> memory isolated for some special purpose to be utilized by specific
> applications ?

I was thinking that in this case, we wouldn't even want to bother with
having "system RAM" in the fallback lists.  A device who got its memory
usage off by 1% could start to starve the rest of the system.  A sane
policy in this case might be to isolate the "system RAM" from the device's.

>> Why do we need this hard-coded along with the cpuset stuff later in the
>> series.  Doesn't taking a node out of the cpuset also take it out of the
>> fallback lists?
> 
> There are two mutually exclusive approaches which are described in
> this patch series.
> 
> (1) zonelist modification based approach
> (2) cpuset restriction based approach
> 
> As mentioned in the cover letter,

Well, I'm glad you coded both of them up, but now that we have them how
do we pick which one to throw to the wolves?  Or, do we just merge both
of them and let one bitrot? ;)



Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-30 Thread Anshuman Khandual
On 01/30/2017 11:04 PM, Dave Hansen wrote:
> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>> * CDM node's zones are not part of any other node's FALLBACK zonelist
>> * CDM node's FALLBACK list contains it's own memory zones followed by
>>   all system RAM zones in regular order as before
>> * CDM node's zones are part of it's own NOFALLBACK zonelist
> 
> This seems like a sane policy for the system that you're describing.
> But, it's still a policy, and it's rather hard-coded into the kernel.

Right. In the original RFC which I posted in October, I had thought
about this issue and created 'pglist_data->coherent_device' as a u64
element where each bit in the mask can indicate a specific policy
request for the hot plugged coherent device. But it looked too
complicated for the moment, in the absence of other potential coherent
memory HW which really requires anything other than isolation and an
explicit allocation method.
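
Purely to illustrate that abandoned per-node mask idea (every name below is
hypothetical and not part of the posted series), the bits might have looked
something like:

/* Hypothetical sketch of per-pgdat CDM policy bits, not from the patches. */
#define CDM_POLICY_ISOLATE_FALLBACK	(1UL << 0)	/* keep out of other nodes' zonelists */
#define CDM_POLICY_ALLOW_KSWAPD		(1UL << 1)	/* allow kswapd reclaim on this node */
#define CDM_POLICY_ALLOW_AUTONUMA	(1UL << 2)	/* include in automatic NUMA balancing */

struct cdm_policy_example {
	unsigned long coherent_device;	/* one bit per policy request for the node */
};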

> Let's say we had a CDM node with 100x more RAM than the rest of the
> system and it was just as fast as the rest of the RAM.  Would we still
> want it isolated like this?  Or would we want a different policy?

Though in this particular case the CDM could be hot plugged into the
system as a normal NUMA node (I don't see any reason why it should not
be treated as a normal NUMA node), I do understand the need for
different policy requirements for different kinds of coherent memory.

But then the other argument is: don't we want to keep this 100X more
memory isolated for some special purpose, to be utilized by specific
applications?

There is a sense that if the non system RAM memory is coherent and
similar, there cannot be much difference in what is expected from the
kernel.

> 
> Why do we need this hard-coded along with the cpuset stuff later in the
> series.  Doesn't taking a node out of the cpuset also take it out of the
> fallback lists?

There are two mutually exclusive approaches which are described in
this patch series.

(1) zonelist modification based approach
(2) cpuset restriction based approach

As mentioned in the cover letter,

"
NOTE: These two sets of patches are mutually exclusive of each other and
represent two different approaches. Only one of these sets should be
applied at any point of time.

Set1:
  mm: Change generic FALLBACK zonelist creation process
  mm: Change mbind(MPOL_BIND) implementation for CDM nodes

Set2:
  cpuset: Add cpuset_inc() inside cpuset_init()
  mm: Exclude CDM nodes from task->mems_allowed and root cpuset
  mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE
"

> 
>>  while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
>> +#ifdef CONFIG_COHERENT_DEVICE
>> +		/*
>> +		 * CDM node's own zones should not be part of any other
>> +		 * node's fallback zonelist but only it's own fallback
>> +		 * zonelist.
>> +		 */
>> +		if (is_cdm_node(node) && (pgdat->node_id != node))
>> +			continue;
>> +#endif
> 
> On a superficial note: Isn't that #ifdef unnecessary?  is_cdm_node() has
> a 'return 0' stub when the config option is off anyway.

Right, will fix it up.



Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

2017-01-30 Thread Dave Hansen
On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
> * CDM node's zones are not part of any other node's FALLBACK zonelist
> * CDM node's FALLBACK list contains it's own memory zones followed by
>   all system RAM zones in regular order as before
> * CDM node's zones are part of it's own NOFALLBACK zonelist

This seems like a sane policy for the system that you're describing.
But, it's still a policy, and it's rather hard-coded into the kernel.
Let's say we had a CDM node with 100x more RAM than the rest of the
system and it was just as fast as the rest of the RAM.  Would we still
want it isolated like this?  Or would we want a different policy?

Why do we need this hard-coded along with the cpuset stuff later in the
series.  Doesn't taking a node out of the cpuset also take it out of the
fallback lists?

>   while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
> +#ifdef CONFIG_COHERENT_DEVICE
> +	/*
> +	 * CDM node's own zones should not be part of any other
> +	 * node's fallback zonelist but only it's own fallback
> +	 * zonelist.
> +	 */
> +	if (is_cdm_node(node) && (pgdat->node_id != node))
> +		continue;
> +#endif

On a superficial note: Isn't that #ifdef unnecessary?  is_cdm_node() has
a 'return 0' stub when the config option is off anyway.

