Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-07-22 Thread David Rientjes

On Tue, 22 Jul 2014, Nishanth Aravamudan wrote:

> > I think there's two use cases of interest:
> > 
> >  - allocating from a memoryless node where numa_node_id() is memoryless, 
> >    and
> > 
> >  - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
> > 
> > I believe the first should have its own node_zonelist[0], whether it's 
> > memoryless or not, that points to a list of zones that start with those 
> > with the smallest distance.
> 
> Ok, and that would be used for falling back in the appropriate priority?
> 

There's no real fallback since there's never a case when you can allocate 
on a memoryless node.  The zonelist defines the appropriate order in which 
to try to allocate from zones, so it depends on things like the 
numa_node_id() in alloc_pages_current() and whether the zonelist for a 
memoryless node is properly initialized or whether this needs to be 
numa_mem_id().  It depends on the intended behavior of calling 
alloc_pages_{node,vma}() with a memoryless node.  The complexity of 
(re-)building the zonelists shouldn't be a concern, since bootstrap and 
memory hotplug aren't hotpaths.

This choice would also impact MPOL_PREFERRED mempolicies when MPOL_F_LOCAL 
is set.

> > I think its own node_zonelist[1], for __GFP_THISNODE allocations,
> > should point to the node with present memory that has the smallest
> > distance.
> 
> And so would this, but with the caveat that we can fail here and don't
> go further? Semantically, __GFP_THISNODE then means "as close as
> physically possible ignoring run-time memory constraints". I say that
> because obviously we might get off-node memory without memoryless nodes,
> but that shouldn't be used to satisfy __GFP_THISNODE allocations.
> 

alloc_pages_current() substitutes any existing mempolicy for the default 
local policy when __GFP_THISNODE is set, and that would require local 
allocation.  That, currently, is numa_node_id() and not numa_mem_id().

The slab allocator already only uses __GFP_THISNODE for numa_mem_id() so 
it will allocate remotely anyway.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-07-22 Thread Nishanth Aravamudan

On 22.07.2014 [14:43:11 -0700], Nishanth Aravamudan wrote:
> Hi David,



> on powerpc now, things look really good. On a KVM instance with the
> following topology:
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 
> 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 
> 97 98 99
> node 1 size: 16336 MB
> node 1 free: 14274 MB
> node distances:
> node   0   1 
>   0:  10  40 
>   1:  40  10 
> 
> 3.16.0-rc6 gives:
> 
>   Slab:        1039744 kB
>   SReclaimable:  38976 kB
>   SUnreclaim:  1000768 kB



> Adding my patch on top of Joonsoo's and the revert, I get:
> 
>   Slab: 411776 kB
>   SReclaimable:  40960 kB
>   SUnreclaim:   370816 kB
> 
> So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so
> much that we are close to OOM with small VM/LPAR sizes.

Just to clarify/add one more datapoint, with a balanced topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 8154 MB
node 0 free: 8075 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 
98 99
node 1 size: 8181 MB
node 1 free: 7776 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10

I see the following for my patch + Joonsoo's + the revert:

Slab: 495872 kB
SReclaimable:  46528 kB
SUnreclaim:   449344 kB

(Although these numbers fluctuate quite a bit between 250M and 500M),
which indicates that the memoryless node slab consumption is now on-par
with a populated topology. And both are still more than CONFIG_SLAB
requires.

Thanks,
Nish
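
As a quick check of the "about 3x" figure quoted above, the Slab totals
can be compared directly. This is only a small sketch using numbers
reported in this thread for the memoryless-node guest (1039744 kB on
3.16.0-rc6, 411776 kB with all patches applied, 122496 kB with
CONFIG_SLAB). ratio_tenths() is a hypothetical helper written for this
note, not a kernel function.

```c
#include <assert.h>

/* Slab totals in kB, copied from the measurements in this thread. */
enum {
	SLUB_RC6_KB     = 1039744,	/* CONFIG_SLUB, 3.16.0-rc6 */
	SLUB_PATCHED_KB = 411776,	/* CONFIG_SLUB, all patches applied */
	SLAB_KB         = 122496,	/* CONFIG_SLAB, for reference */
};

/*
 * Hypothetical helper: ratio a/b expressed in tenths and truncated,
 * so a return value of 33 means "about 3.3x".
 */
static long ratio_tenths(long a, long b)
{
	assert(b > 0);
	return (a * 10) / b;
}
```

ratio_tenths(SLUB_PATCHED_KB, SLAB_KB) gives 33 (roughly 3.3x), while
ratio_tenths(SLUB_RC6_KB, SLAB_KB) gives 84 (roughly 8.4x) for the
unpatched kernel.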


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-07-22 Thread Tejun Heo

Hello,

On Tue, Jul 22, 2014 at 02:43:11PM -0700, Nishanth Aravamudan wrote:
...
> "There is an issue currently where NUMA information is used on powerpc
> (and possibly ia64) before it has been read from the device-tree, which
> leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
> 
> NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
> after start_secondary(), similar to ia64, which is invoked via
> smp_init().
> 
> Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
> early_initcall()") made init_workqueues() be invoked via
> do_pre_smp_initcalls(), which is obviously before the secondary
> processors are online.
> ...
> Therefore, when init_workqueues() runs, it sees all CPUs as being on
> Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
> a high number of slab deactivations
> (http://www.spinics.net/lists/linux-mm/msg67489.html)."
> 
> Christoph/Tejun, do you see the issue I'm referring to? Is my analysis
> correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we
> have to be especially careful that users of cpu_to_{node,mem} and
> related APIs run *after* correct values are stored for all used CPUs?

Without delving into the code, yes, NUMA info should be set up as soon
as possible before major allocations happen.  All allocations which
happen beforehand would naturally be done with bogus NUMA information.

Thanks.

-- 
tejun
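
The ordering problem discussed here can be modelled in a few lines of
userspace C. This is only a sketch under assumptions: cpu_node_map[],
cached_node[], early_snapshot(), bring_up_cpu() and stale_entries() are
hypothetical stand-ins for the per-CPU node mapping, an early initcall
that records it, and secondary CPU bring-up. None of them are kernel
APIs.

```c
#include <assert.h>

#define NR_CPUS 4

/* Zero-initialised, so every CPU appears to sit on node 0 until set. */
static int cpu_node_map[NR_CPUS];
/* What an early consumer remembered about each CPU's node. */
static int cached_node[NR_CPUS];

/* Stand-in for an early initcall that records the CPU-to-node mapping. */
static void early_snapshot(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cached_node[cpu] = cpu_node_map[cpu];
}

/* Stand-in for secondary CPU bring-up publishing the real node. */
static void bring_up_cpu(int cpu, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_node_map[cpu] = node;
}

/* Count CPUs whose recorded node no longer matches the real mapping. */
static int stale_entries(void)
{
	int stale = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cached_node[cpu] != cpu_node_map[cpu])
			stale++;
	return stale;
}
```

Taking the snapshot before bring_up_cpu() runs leaves stale node-0
entries behind; taking it afterwards does not.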

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-07-22 Thread Nishanth Aravamudan

Hi David,

On 21.07.2014 [18:16:58 -0700], David Rientjes wrote:
> On Mon, 21 Jul 2014, Nishanth Aravamudan wrote:
> 
> > Sorry for bringing up this old thread again, but I had a question for
> > you, David. node_to_mem_node(), which does seem like a useful API,
> > doesn't seem like it can just use node_distance() solely, right? Because
> > that just tells us the relative cost (or so I think about it) of using
> > resources from that node. But we also need to know if that node itself
> > has memory, etc. So using the zonelists is required no matter what? And
> > upon memory hotplug (or unplug), the topology can change in a way that
> > affects things, so node online time isn't right either?
> > 
> 
> I think there's two use cases of interest:
> 
>  - allocating from a memoryless node where numa_node_id() is memoryless, 
>    and
> 
>  - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
> 
> I believe the first should have its own node_zonelist[0], whether it's 
> memoryless or not, that points to a list of zones that start with those 
> with the smallest distance.

Ok, and that would be used for falling back in the appropriate priority?

> I think its own node_zonelist[1], for __GFP_THISNODE allocations,
> should point to the node with present memory that has the smallest
> distance.

And so would this, but with the caveat that we can fail here and don't
go further? Semantically, __GFP_THISNODE then means "as close as
physically possible ignoring run-time memory constraints". I say that
because obviously we might get off-node memory without memoryless nodes,
but that shouldn't be used to satisfy __GFP_THISNODE allocations.

> For sure node_zonelist[0] cannot be NULL since things like 
> first_online_pgdat() would break and it should be unnecessary to do 
> node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES 
> since the zonelists should already be defined properly.  All nodes, 
> regardless of whether they have memory or not, should probably end up 
> having a struct pglist_data unless there's a reason for another level of 
> indirection.

So I've re-tested Joonsoo's patch 2 and 3 from the series he sent, and
on powerpc now, things look really good. On a KVM instance with the
following topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 
98 99
node 1 size: 16336 MB
node 1 free: 14274 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

3.16.0-rc6 gives:

Slab:        1039744 kB
SReclaimable:  38976 kB
SUnreclaim:  1000768 kB

Joonsoo's patches give:

Slab: 366144 kB
SReclaimable:  36928 kB
SUnreclaim:   329216 kB

For reference, CONFIG_SLAB gives:

Slab: 122496 kB
SReclaimable:  14912 kB
SUnreclaim:   107584 kB

At Tejun's request [adding him to Cc], I also partially reverted
81c98869faa5 ("kthread: ensure locality of task_struct allocations"): 

Slab: 428864 kB
SReclaimable:  44288 kB
SUnreclaim:   384576 kB

This seems slightly worse, but I think it's because of the same
root-cause that I indicated in my RFC patch 2/2, quoting it here:

"There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().

Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.
...
Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html)."

Christoph/Tejun, do you see the issue I'm referring to? Is my analysis
correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we
have to be especially careful that users of cpu_to_{node,mem} and
related APIs run *after* correct values are stored for all used CPUs?

In any case, with Joonsoo's patches, we shouldn't see slab deactivations
*if* the NUMA topology information is stored correctly. The full
changelog and patch is at http://patchwork.ozlabs.org/patch/371266/.

Adding my patch on top of Joonsoo's and the revert, I get:

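Slab: 411776 kB
SReclaimable:  40960 kB
SUnreclaim:   370816 kB

So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so
much that we are close to OOM with small VM/LPAR sizes.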

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-07-21 Thread David Rientjes

On Mon, 21 Jul 2014, Nishanth Aravamudan wrote:

> Sorry for bringing up this old thread again, but I had a question for
> you, David. node_to_mem_node(), which does seem like a useful API,
> doesn't seem like it can just use node_distance() solely, right? Because
> that just tells us the relative cost (or so I think about it) of using
> resources from that node. But we also need to know if that node itself
> has memory, etc. So using the zonelists is required no matter what? And
> upon memory hotplug (or unplug), the topology can change in a way that
> affects things, so node online time isn't right either?
> 

I think there's two use cases of interest:

 - allocating from a memoryless node where numa_node_id() is memoryless, 
   and

 - using node_to_mem_node() for a possibly-memoryless node for kmalloc().

I believe the first should have its own node_zonelist[0], whether it's 
memoryless or not, that points to a list of zones that start with those 
with the smallest distance.  I think its own node_zonelist[1], for 
__GFP_THISNODE allocations, should point to the node with present memory 
that has the smallest distance.

For sure node_zonelist[0] cannot be NULL since things like 
first_online_pgdat() would break and it should be unnecessary to do 
node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES 
since the zonelists should already be defined properly.  All nodes, 
regardless of whether they have memory or not, should probably end up 
having a struct pglist_data unless there's a reason for another level of 
indirection.
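
As a rough illustration of the two zonelists described above, here is a
minimal userspace C sketch. It is not kernel code: the tables and the
helpers build_fallback_order() and nearest_mem_node() are hypothetical
names chosen for this note, and the topology is the two-node KVM guest
reported elsewhere in this thread (node 0: 0 MB, node 1: 16336 MB,
distances 10/40). The assumption is that a fallback list contains only
nodes with memory, ordered by distance from the requesting node.

```c
#include <assert.h>

#define MAX_NODES 2

/* Distances and sizes taken from the KVM guest quoted in this thread. */
static const int node_distance_tbl[MAX_NODES][MAX_NODES] = {
	{ 10, 40 },
	{ 40, 10 },
};
static const unsigned long node_size_mb[MAX_NODES] = { 0, 16336 };

/*
 * Fill order[] with the nodes that have memory, nearest to 'node' first.
 * Returns the number of entries written. Models node_zonelist[0].
 */
static int build_fallback_order(int node, int order[MAX_NODES])
{
	int used[MAX_NODES] = { 0 };
	int count = 0;

	assert(node >= 0 && node < MAX_NODES);
	for (;;) {
		int best = -1;

		for (int n = 0; n < MAX_NODES; n++) {
			if (used[n] || node_size_mb[n] == 0)
				continue;	/* already listed, or memoryless */
			if (best < 0 || node_distance_tbl[node][n] <
					node_distance_tbl[node][best])
				best = n;
		}
		if (best < 0)
			break;
		used[best] = 1;
		order[count++] = best;
	}
	return count;
}

/*
 * Nearest node with memory, or -1 if there is none. Models the single
 * node that node_zonelist[1] would point at for __GFP_THISNODE.
 */
static int nearest_mem_node(int node)
{
	int order[MAX_NODES];

	return build_fallback_order(node, order) > 0 ? order[0] : -1;
}
```

With this topology, nearest_mem_node(0) and nearest_mem_node(1) both
return 1, since node 1 is the only node with memory.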

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-07-21 Thread Nishanth Aravamudan

On 10.02.2014 [10:09:36 +0900], Joonsoo Kim wrote:
> On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote:
> > On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> > 
> > > > It seems like a better approach would be to do this when a node is 
> > > > brought 
> > > > online and determine the fallback node based not on the zonelists as 
> > > > you 
> > > > do here but rather on locality (such as through a SLIT if provided, see 
> > > > node_distance()).
> > > 
> > > Hmm...
> > > > I guess that zonelist is based on locality. Zonelist is generated using
> > > > node_distance(), so I think that it reflects locality. But, I'm not an expert
> > > on NUMA, so please let me know what I am missing here :)
> > > 
> > 
> > The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
> > If your solution is going to become the generic kernel API that determines 
> > what node has local memory for a particular node, then it will have to 
> > support all definitions of node.  That includes nodes that consist solely 
> > of I/O, chipsets, networking, or storage devices.  These nodes may not 
> > have memory or cpus, so doing it as part of onlining cpus isn't going to 
> > be generic enough.  You want a node_to_mem_node() API for all possible 
> > node types (the possible node types listed above are straight from the 
> > ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
> > X and we can optimize for that, but any solution that relies on cpu online 
> > is probably shortsighted right now.
> > 
> > I think it would be much better to do this as a part of setting a node to 
> > be online.
> 
> Okay. I got your point.
> I will change it to rely on node online if this patch is really needed.

Sorry for bringing up this old thread again, but I had a question for
you, David. node_to_mem_node(), which does seem like a useful API,
doesn't seem like it can just use node_distance() solely, right? Because
that just tells us the relative cost (or so I think about it) of using
resources from that node. But we also need to know if that node itself
has memory, etc. So using the zonelists is required no matter what? And
upon memory hotplug (or unplug), the topology can change in a way that
affects things, so node online time isn't right either?

Thanks,
Nish


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-03-13 Thread Nishanth Aravamudan

On 24.02.2014 [13:54:35 -0600], Christoph Lameter wrote:
> On Mon, 24 Feb 2014, Joonsoo Kim wrote:
> 
> > > It will not commonly get there because of the tracking. Instead a per cpu
> > > object will be used.
> > > > get_partial_node() always fails even if there are some partial slab on
> > > > memoryless node's nearest node.
> > >
> > > Correct and that leads to a page allocator action whereupon the node will
> > > be marked as empty.
> >
> > Why do we need to request to a page allocator if there is partial slab?
> > Checking whether node is memoryless or not is really easy, so we don't need
> > to skip this. To skip this is suboptimal solution.
> 
> The page allocator action is also used to determine to which other node we
> should fall back if the node is empty. So we need to call the page
> allocator when the per cpu slab is exhausted with the node of the
> memoryless node to get memory from the proper fallback node.

Where do we stand with these patches? I feel like no resolution was
really found...

Thanks,
Nish


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-24 Thread Christoph Lameter

On Mon, 24 Feb 2014, Joonsoo Kim wrote:

> > It will not commonly get there because of the tracking. Instead a per cpu
> > object will be used.
> > > get_partial_node() always fails even if there are some partial slab on
> > > memoryless node's nearest node.
> >
> > Correct and that leads to a page allocator action whereupon the node will
> > be marked as empty.
>
> Why do we need to request to a page allocator if there is partial slab?
> Checking whether node is memoryless or not is really easy, so we don't need
> to skip this. To skip this is suboptimal solution.

The page allocator action is also used to determine to which other node we
should fall back if the node is empty. So we need to call the page
allocator when the per cpu slab is exhausted with the node of the
memoryless node to get memory from the proper fallback node.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-23 Thread Joonsoo Kim

On Tue, Feb 18, 2014 at 10:38:01AM -0600, Christoph Lameter wrote:
> On Mon, 17 Feb 2014, Joonsoo Kim wrote:
> 
> > On Wed, Feb 12, 2014 at 04:16:11PM -0600, Christoph Lameter wrote:
> > > Here is another patch with some fixes. The additional logic is only
> > > compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> > >
> > > Subject: slub: Memoryless node support
> > >
> > > Support memoryless nodes by tracking which allocations are failing.
> >
> > I still don't understand why this tracking is needed.
> 
> > It's an optimization to avoid calling the page allocator to figure out if
> > there is memory available on a particular node.
> > 
> > > All we need for allocation targeted to memoryless node is to fall back to
> > > the proper node, that is, numa_mem_id() node of targeted node. My previous
> > > patch implements it and uses proper fallback node on every allocation code
> > > path. Why is this tracking needed? Please elaborate more on this.
> > 
> > It's too slow to do that on every alloc. One needs to be able to satisfy
> most allocations without switching percpu slabs for optimal performance.

I don't think that we need to switch percpu slabs on every alloc.
Allocation targeted to specific node is rare. And most of these allocations
may be targeted to either numa_node_id() or numa_mem_id(). My patch considers
these cases, so most of allocations are processed by percpu slabs. There is
no suboptimal performance.

> 
> > > Allocations targeted to the nodes without memory fall back to the
> > > current available per cpu objects and if that is not available will
> > > create a new slab using the page allocator to fallback from the
> > > memoryless node to some other node.
> 
> And what about the next alloc? Assume there are N allocs from a memoryless
> node this means we push back the partial slab on each alloc and then fall
> back?
> 
> > >  {
> > >   void *object;
> > > - int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > > + int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> > >
> > >   object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > >   if (object || node != NUMA_NO_NODE)
> >
> > This isn't enough.
> > Consider that allocation targeted to memoryless node.
> 
> It will not commonly get there because of the tracking. Instead a per cpu
> object will be used.
> > get_partial_node() always fails even if there are some partial slab on
> > memoryless node's nearest node.
> 
> Correct and that leads to a page allocator action whereupon the node will
> be marked as empty.

Why do we need to request to a page allocator if there is partial slab?
Checking whether node is memoryless or not is really easy, so we don't need
to skip this. To skip this is suboptimal solution.
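
To make the alternative concrete: the idea is to pick the node whose
partial list is searched up front, rather than learning about empty
nodes from a failed page allocation. The following is a userspace
sketch only. present_pages[], mem_node_of[] and pick_search_node() are
hypothetical stand-ins for the kernel's per-node state and for a
node-to-nearest-memory-node lookup, filled in for a two-node system
where node 0 is memoryless.

```c
#include <assert.h>

#define NUMA_NO_NODE	(-1)
#define MAX_NODES	2

/* Hypothetical per-node tables standing in for kernel state. */
static const unsigned long present_pages[MAX_NODES] = { 0, 4096 };
/* node -> nearest node that has memory (assumed precomputed) */
static const int mem_node_of[MAX_NODES] = { 1, 1 };

/*
 * Choose the node whose partial slabs should be searched.
 * 'local_mem_node' plays the role of numa_mem_id() for the calling CPU.
 */
static int pick_search_node(int node, int local_mem_node)
{
	if (node == NUMA_NO_NODE)
		return local_mem_node;		/* no preference: stay local */
	assert(node >= 0 && node < MAX_NODES);
	if (present_pages[node] == 0)
		return mem_node_of[node];	/* memoryless target: redirect */
	return node;				/* populated target: honour it */
}
```

Here pick_search_node(0, 1) returns 1, so a request aimed at the
memoryless node 0 searches node 1's partial slabs without first going
to the page allocator.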

> > We should fallback to some proper node in this case, since there is no slab
> > on memoryless node.
> 
> NUMA is about optimization of memory allocations. It is often *not* about
> correctness but heuristics are used in many cases. F.e. see the zone
> reclaim logic, zone reclaim mode, fallback scenarios in the page allocator
> etc etc.

Okay. But, 'do our best' is preferable to me.

Thanks.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-20 Thread Christoph Lameter

On Wed, 19 Feb 2014, David Rientjes wrote:

> On Tue, 18 Feb 2014, Christoph Lameter wrote:
>
> > It's an optimization to avoid calling the page allocator to figure out if
> > there is memory available on a particular node.
> Thus this patch breaks with memory hot-add for a memoryless node.

As soon as the per cpu slab is exhausted the node number of the so far
"empty" node will be used for allocation. That will be successful and the
node will no longer be marked as empty.


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-19 Thread Christoph Lameter

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> the performance impact of the underlying NUMA configuration. I guess we
> could special-case memoryless/cpuless configurations somewhat, but I
> don't think there's any reason to do that if we can make memoryless-node
> support work in-kernel?

Well we can make it work in-kernel but it always has been a bit wacky (as
is the idea of numa "memory" nodes without memory).

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Nishanth Aravamudan

On 18.02.2014 [15:49:22 -0600], Christoph Lameter wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> > We use the topology provided by the hypervisor, it does actually reflect
> > where CPUs and memory are, and their corresponding performance/NUMA
> > characteristics.
> 
> And so there are actually nodes without memory that have processors?

Virtually (topologically as indicated to Linux), yes. Physically, I
don't think they are, but they might be exhausted, which is why we get sort
of odd-appearing NUMA configurations.

> Can the hypervisor or the linux arch code be convinced to ignore nodes
> without memory or assign a sane default node to processors?

I think this happens quite often, so I don't know that we want to ignore
the performance impact of the underlying NUMA configuration. I guess we
could special-case memoryless/cpuless configurations somewhat, but I
don't think there's any reason to do that if we can make memoryless-node
support work in-kernel?

> > > Ok then also move the memory of the local node somewhere?
> >
> > This happens below the OS, we don't control the hypervisor's decisions.
> > I'm not sure if that's what you are suggesting.
> 
> You could also do this from the powerpc arch code by sanitizing the
> processor / node information that is then used by Linux.

I see what you're saying now, thanks!

-Nish


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Christoph Lameter

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> We use the topology provided by the hypervisor, it does actually reflect
> where CPUs and memory are, and their corresponding performance/NUMA
> characteristics.

And so there are actually nodes without memory that have processors?
Can the hypervisor or the linux arch code be convinced to ignore nodes
without memory or assign a sane default node to processors?

> > Ok then also move the memory of the local node somewhere?
>
> This happens below the OS, we don't control the hypervisor's decisions.
> I'm not sure if that's what you are suggesting.

You could also do this from the powerpc arch code by sanitizing the
processor / node information that is then used by Linux.


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Nishanth Aravamudan

On 18.02.2014 [13:58:20 -0600], Christoph Lameter wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> >
> > Well, on powerpc, with the hypervisor providing the resources and the
> > topology, you can have cpuless and memoryless nodes. I'm not sure how
> > "fake" the NUMA is -- as I think since the resources are virtualized to
> > be one system, it's logically possible that the actual topology of the
> > resources can be CPUs from physical node 0 and memory from physical node
> > 2. I would think with KVM on a sufficiently large (physically NUMA
> > x86_64) and loaded system, one could cause the same sort of
> > configuration to occur for a guest?
> 
> Ok but since you have a virtualized environment: Why not provide a fake
> home node with fake memory that could be anywhere? This would avoid the
> whole problem of supporting such a config at the kernel level.

We use the topology provided by the hypervisor, it does actually reflect
where CPUs and memory are, and their corresponding performance/NUMA
characteristics.

> Do not have a fake node that has no memory.
> 
> > In any case, these configurations happen fairly often on long-running
> > (not rebooted) systems as LPARs are created/destroyed, resources are
> > DLPAR'd in and out of LPARs, etc.
> 
> Ok then also move the memory of the local node somewhere?

This happens below the OS, we don't control the hypervisor's decisions.
I'm not sure if that's what you are suggesting.

Thanks,
Nish


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Christoph Lameter

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

>
> Well, on powerpc, with the hypervisor providing the resources and the
> topology, you can have cpuless and memoryless nodes. I'm not sure how
> "fake" the NUMA is -- as I think since the resources are virtualized to
> be one system, it's logically possible that the actual topology of the
> resources can be CPUs from physical node 0 and memory from physical node
> 2. I would think with KVM on a sufficiently large (physically NUMA
> x86_64) and loaded system, one could cause the same sort of
> configuration to occur for a guest?

Ok but since you have a virtualized environment: Why not provide a fake
home node with fake memory that could be anywhere? This would avoid the
whole problem of supporting such a config at the kernel level.

Do not have a fake node that has no memory.

> In any case, these configurations happen fairly often on long-running
> (not rebooted) systems as LPARs are created/destroyed, resources are
> DLPAR'd in and out of LPARs, etc.

Ok then also move the memory of the local node somewhere?

> I might look into it, as it might have sped up testing these changes.

I guess that will be necessary in order to support the memoryless nodes
long term.


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Nishanth Aravamudan

On 18.02.2014 [10:57:09 -0600], Christoph Lameter wrote:
> On Mon, 17 Feb 2014, Joonsoo Kim wrote:
> 
> > On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> > > Hi Joonsoo,
> > > Also, given that only ia64 and (hopefully soon) ppc64 can set
> > > CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> > > memoryless nodes present? Even with fakenuma? Just curious.
> 
> x86_64 currently does not support memoryless nodes otherwise it would
> have set CONFIG_HAVE_MEMORYLESS_NODES in the kconfig. Memoryless nodes are
> a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning
> memory) with processors. MEMORYLESS nodes means that we have a fake NUMA
> node without memory but just processors. Not very efficient. Not sure why
> people use these configurations.

Well, on powerpc, with the hypervisor providing the resources and the
topology, you can have cpuless and memoryless nodes. I'm not sure how
"fake" the NUMA is -- as I think since the resources are virtualized to
be one system, it's logically possible that the actual topology of the
resources can be CPUs from physical node 0 and memory from physical node
2. I would think with KVM on a sufficiently large (physically NUMA
x86_64) and loaded system, one could cause the same sort of
configuration to occur for a guest?

In any case, these configurations happen fairly often on long-running
(not rebooted) systems as LPARs are created/destroyed, resources are
DLPAR'd in and out of LPARs, etc.

> > I don't know, because I'm not expert on NUMA system :)
> > At first glance, fakenuma can't be used for testing
> > CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.
> 
> Well yeah. You'd have to do some mods to enable that testing.

I might look into it, as it might have sped up testing these changes.

Thanks,
Nish


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Nishanth Aravamudan

On 12.02.2014 [16:16:11 -0600], Christoph Lameter wrote:
> Here is another patch with some fixes. The additional logic is only
> compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> 
> Subject: slub: Memoryless node support
> 
> Support memoryless nodes by tracking which allocations are failing.
> Allocations targeted to the nodes without memory fall back to the
> current available per cpu objects and if that is not available will
> create a new slab using the page allocator to fallback from the
> memoryless node to some other node.
> 
> Signed-off-by: Christoph Lameter 

Tested-by: Nishanth Aravamudan 
Acked-by: Nishanth Aravamudan 

> Index: linux/mm/slub.c
> ===
> --- linux.orig/mm/slub.c  2014-02-12 16:07:48.957869570 -0600
> +++ linux/mm/slub.c   2014-02-12 16:09:22.198928260 -0600
> @@ -134,6 +134,10 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +static nodemask_t empty_nodes;
> +#endif
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1409,28 @@ static struct page *new_slab(struct kmem
>   void *last;
>   void *p;
>   int order;
> + int alloc_node;
> 
>   BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>   page = allocate_slab(s,
>   flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> - if (!page)
> + if (!page) {
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> + if (node != NUMA_NO_NODE)
> + node_set(node, empty_nodes);
> +#endif
>   goto out;
> + }
> 
>   order = compound_order(page);
> - inc_slabs_node(s, page_to_nid(page), page->objects);
> + alloc_node = page_to_nid(page);
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> + node_clear(alloc_node, empty_nodes);
> + if (node != NUMA_NO_NODE && alloc_node != node)
> + node_set(node, empty_nodes);
> +#endif
> + inc_slabs_node(s, alloc_node, page->objects);
>   memcg_bind_pages(s, order);
>   page->slab_cache = s;
>   __SetPageSlab(page);
> @@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
>   struct kmem_cache_cpu *c)
>  {
>   void *object;
> - int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> + int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>   object = get_partial_node(s, get_node(s, searchnode), c, flags);
>   if (object || node != NUMA_NO_NODE)
> @@ -2117,8 +2133,19 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> - if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> + int page_node = page_to_nid(page);
> +
> + if (!page)
>   return 0;
> +
> + if (node != NUMA_NO_NODE) {
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> + if (node_isset(node, empty_nodes))
> + return 1;
> +#endif
> + if (page_node != node)
> + return 0;
> + }
>  #endif
>   return 1;
>  }
> 

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Christoph Lameter

On Mon, 17 Feb 2014, Joonsoo Kim wrote:

> On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> > Hi Joonsoo,
> > Also, given that only ia64 and (hopefully soon) ppc64 can set
> > CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> > memoryless nodes present? Even with fakenuma? Just curious.

x86_64 currently does not support memoryless nodes; otherwise it would
have set CONFIG_HAVE_MEMORYLESS_NODES in the Kconfig. Memoryless nodes are
a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning
memory) with processors. A memoryless node means that we have a fake NUMA
node without memory but just processors. Not very efficient. Not sure why
people use these configurations.

> I don't know, because I'm not an expert on NUMA systems :)
> At first glance, fakenuma can't be used for testing
> CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.

Well yeah. You'd have to do some mods to enable that testing.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Christoph Lameter

On Mon, 17 Feb 2014, Joonsoo Kim wrote:

> On Wed, Feb 12, 2014 at 04:16:11PM -0600, Christoph Lameter wrote:
> > Here is another patch with some fixes. The additional logic is only
> > compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> >
> > Subject: slub: Memoryless node support
> >
> > Support memoryless nodes by tracking which allocations are failing.
>
> I still don't understand why this tracking is needed.

It's an optimization to avoid calling the page allocator to figure out if
there is memory available on a particular node.

> All we need for an allocation targeted to a memoryless node is to fall back
> to the proper node, that is, the numa_mem_id() node of the targeted node.
> My previous patch implements this and uses the proper fallback node on
> every allocation code path. Why is this tracking needed? Please elaborate.

It's too slow to do that on every alloc. One needs to be able to satisfy
most allocations without switching percpu slabs for optimal performance.

> > Allocations targeted to the nodes without memory fall back to the
> > current available per cpu objects and if that is not available will
> > create a new slab using the page allocator to fallback from the
> > memoryless node to some other node.

And what about the next alloc? Assume there are N allocs from a memoryless
node: does this mean we push back the partial slab on each alloc and then
fall back?

> >  {
> > void *object;
> > -   int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > +   int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> >
> > object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > if (object || node != NUMA_NO_NODE)
>
> This isn't enough.
> Consider an allocation targeted to a memoryless node.

It will not commonly get there because of the tracking. Instead a per cpu
object will be used.

> get_partial_node() always fails even if there are some partial slabs on
> the memoryless node's nearest node.

Correct, and that leads to a page allocator action whereupon the node will
be marked as empty.

> We should fall back to some proper node in this case, since there is no slab
> on the memoryless node.

NUMA is about optimization of memory allocations. It is often *not* about
correctness; heuristics are used in many cases. For example, see the zone
reclaim logic, zone reclaim mode, fallback scenarios in the page allocator
etc.
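
As a rough illustration of the heuristic discussed above, here is a
userspace model of the node_match() decision. It is only a sketch: the
struct, the array and the names model_page, model_node_match, NO_NODE
and MAX_NODES are stand-ins invented for this example, not kernel code.
The model checks for a missing page before reading its node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_NODE   (-1)   /* stand-in for NUMA_NO_NODE */
#define MAX_NODES 64     /* hypothetical limit for this sketch */

/* Stand-in for struct page: only the owning node is modelled. */
struct model_page {
	int nid;
};

/* Stand-in for the empty_nodes nodemask: one flag per node. */
static bool empty_nodes[MAX_NODES];

/*
 * Return true if the current per-cpu slab page may serve a request
 * for 'node'.
 */
static bool model_node_match(const struct model_page *page, int node)
{
	if (page == NULL)
		return false;   /* no current slab at all */
	if (node == NO_NODE)
		return true;    /* caller does not care about the node */
	assert(node >= 0 && node < MAX_NODES);
	if (empty_nodes[node])
		return true;    /* target is known empty: take anything */
	return page->nid == node;
}
```

With the flag set for a node, every request for that node is served from
whatever per-cpu slab is current, which is the cheap path the tracking
is meant to enable.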

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-16 Thread Joonsoo Kim

On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> Hi Joonsoo,
> Also, given that only ia64 and (hopefully soon) ppc64 can set
> CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> memoryless nodes present? Even with fakenuma? Just curious.

I don't know, because I'm not an expert on NUMA systems :)
At first glance, fakenuma can't be used for testing
CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.

Thanks.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-16 Thread Joonsoo Kim

On Wed, Feb 12, 2014 at 04:16:11PM -0600, Christoph Lameter wrote:
> Here is another patch with some fixes. The additional logic is only
> compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> 
> Subject: slub: Memoryless node support
> 
> Support memoryless nodes by tracking which allocations are failing.

I still don't understand why this tracking is needed.
All we need for an allocation targeted to a memoryless node is to fall back
to the proper node, that is, the numa_mem_id() node of the targeted node.
My previous patch implements this and uses the proper fallback node on
every allocation code path. Why is this tracking needed? Please elaborate.

> Allocations targeted to the nodes without memory fall back to the
> current available per cpu objects and if that is not available will
> create a new slab using the page allocator to fallback from the
> memoryless node to some other node.
> 
> Signed-off-by: Christoph Lameter 
> 
> @@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
>   struct kmem_cache_cpu *c)
>  {
>   void *object;
> - int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> + int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>   object = get_partial_node(s, get_node(s, searchnode), c, flags);
>   if (object || node != NUMA_NO_NODE)

This isn't enough.
Consider an allocation targeted to a memoryless node.
get_partial_node() always fails even if there are some partial slabs on
the memoryless node's nearest node.
We should fall back to some proper node in this case, since there is no slab
on the memoryless node.

Thanks.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-12 Thread Nishanth Aravamudan

Hi Joonsoo,

On 11.02.2014 [16:42:00 +0900], Joonsoo Kim wrote:
> On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> > Hi Christoph,
> > 
> > On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > > Here is a draft of a patch to make this work with memoryless nodes.
> > > 
> > > The first thing is that we modify node_match to also match if we hit an
> > > empty node. In that case we simply take the current slab if its there.
> > > 
> > > If there is no current slab then a regular allocation occurs with the
> > > memoryless node. The page allocator will fallback to a possible node and
> > > that will become the current slab. Next alloc from a memoryless node
> > > will then use that slab.
> > > 
> > > For that we also add some tracking of allocations on nodes that were not
> > > satisfied using the empty_node[] array. A successful alloc on a node
> > > clears that flag.
> > > 
> > > I would rather avoid the empty_node[] array since its global and there may
> > > be thread specific allocation restrictions but it would be expensive to do
> > > an allocation attempt via the page allocator to make sure that there is
> > > really no page available from the page allocator.
> > 
> > With this patch on our test system (I pulled out the numa_mem_id()
> > change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> > kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> > patch 1):
> > 
> > > MemTotal:        8264704 kB
> > > MemFree:         5924608 kB
> > > ...
> > > Slab:            1402496 kB
> > > SReclaimable:     102848 kB
> > > SUnreclaim:      1299648 kB
> > > 
> > > And Anton's slabusage reports:
> > > 
> > > slab                   mem     objs    slabs
> > >                       used   active   active
> > > 
> > > kmalloc-16384       207 MB   98.60%  100.00%
> > > task_struct         134 MB   97.82%  100.00%
> > > kmalloc-8192        117 MB  100.00%  100.00%
> > > pgtable-2^12        111 MB  100.00%  100.00%
> > > pgtable-2^10        104 MB  100.00%  100.00%
> > 
> > For comparison, Anton's patch applied at the same point in the series:
> > 
> > meminfo:
> > 
> > > MemTotal:        8264704 kB
> > > MemFree:         4150464 kB
> > > ...
> > > Slab:            1590336 kB
> > > SReclaimable:     208768 kB
> > > SUnreclaim:      1381568 kB
> > > 
> > > slabusage:
> > > 
> > > slab                   mem     objs    slabs
> > >                       used   active   active
> > > 
> > > kmalloc-16384       227 MB   98.63%  100.00%
> > > kmalloc-8192        130 MB  100.00%  100.00%
> > > task_struct         129 MB   97.73%  100.00%
> > > pgtable-2^12        112 MB  100.00%  100.00%
> > > pgtable-2^10        106 MB  100.00%  100.00%
> > 
> > 
> > Consider this patch:
> > 
> > Acked-by: Nishanth Aravamudan 
> > Tested-by: Nishanth Aravamudan 
> 
> Hello,
> 
> I still think that there is another problem.
> Your report about CONFIG_SLAB said that SLAB uses just 200MB.
> Below is your previous report.
> 
>   Ok, with your patches applied and CONFIG_SLAB enabled:
> 
>   MemTotal:        8264640 kB
>   MemFree:         7119680 kB
>   Slab:             207232 kB
>   SReclaimable:      32896 kB
>   SUnreclaim:       174336 kB
> 
> The numbers for CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
> That is a large difference in slab usage.

Agreed. But, at least for now, this gets us to not OOM all the time :) I
think that's significant progress. I will continue to look at this
issue for where the other gaps are, but would like to see Christoph's
latest patch get merged (pending my re-testing).

> I should also note that the number of active objects in slabinfo can be
> wrong in some situations, since it doesn't consider the cpu slab (and cpu
> partial slabs).

Well, I grabbed everything from /sys/kernel/slab for you in the
tarballs, I believe.

> I recommend confirming page_to_nid() and the other things I mentioned
> earlier.

I believe these all worked once CONFIG_HAVE_MEMORYLESS_NODES was set for
ppc64, but I will test again when I have access to the test system.

Also, given that only ia64 and (hopefully soon) ppc64 can set
CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
memoryless nodes present? Even with fakenuma? Just curious.

-Nish

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-12 Thread Nishanth Aravamudan

On 12.02.2014 [16:16:11 -0600], Christoph Lameter wrote:
> Here is another patch with some fixes. The additional logic is only
> compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> 
> Subject: slub: Memoryless node support
> 
> Support memoryless nodes by tracking which allocations are failing.
> Allocations targeted to the nodes without memory fall back to the
> current available per cpu objects and if that is not available will
> create a new slab using the page allocator to fallback from the
> memoryless node to some other node.

I'll try and retest this once the LPAR in question comes free. Hopefully
in the next day or two.

Thanks,
Nish

> Signed-off-by: Christoph Lameter 
> 
> Index: linux/mm/slub.c
> ===
> --- linux.orig/mm/slub.c  2014-02-12 16:07:48.957869570 -0600
> +++ linux/mm/slub.c   2014-02-12 16:09:22.198928260 -0600
> @@ -134,6 +134,10 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +static nodemask_t empty_nodes;
> +#endif
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1409,28 @@ static struct page *new_slab(struct kmem
>   void *last;
>   void *p;
>   int order;
> + int alloc_node;
> 
>   BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>   page = allocate_slab(s,
>   flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> - if (!page)
> + if (!page) {
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> + if (node != NUMA_NO_NODE)
> + node_set(node, empty_nodes);
> +#endif
>   goto out;
> + }
> 
>   order = compound_order(page);
> - inc_slabs_node(s, page_to_nid(page), page->objects);
> + alloc_node = page_to_nid(page);
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> + node_clear(alloc_node, empty_nodes);
> + if (node != NUMA_NO_NODE && alloc_node != node)
> + node_set(node, empty_nodes);
> +#endif
> + inc_slabs_node(s, alloc_node, page->objects);
>   memcg_bind_pages(s, order);
>   page->slab_cache = s;
>   __SetPageSlab(page);
> @@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
>   struct kmem_cache_cpu *c)
>  {
>   void *object;
> - int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> + int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>   object = get_partial_node(s, get_node(s, searchnode), c, flags);
>   if (object || node != NUMA_NO_NODE)
> @@ -2117,8 +2133,19 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> - if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> + int page_node = page_to_nid(page);
> +
> + if (!page)
>   return 0;
> +
> + if (node != NUMA_NO_NODE) {
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> + if (node_isset(node, empty_nodes))
> + return 1;
> +#endif
> + if (page_node != node)
> + return 0;
> + }
>  #endif
>   return 1;
>  }
> 

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-12 Thread Christoph Lameter

Here is another patch with some fixes. The additional logic is only
compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.

Subject: slub: Memoryless node support

Support memoryless nodes by tracking which allocations are failing.
Allocations targeted to nodes without memory fall back to the
currently available per cpu objects and, if none are available,
create a new slab using the page allocator, which falls back from the
memoryless node to some other node.

Signed-off-by: Christoph Lameter 

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c2014-02-12 16:07:48.957869570 -0600
+++ linux/mm/slub.c 2014-02-12 16:09:22.198928260 -0600
@@ -134,6 +134,10 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+static nodemask_t empty_nodes;
+#endif
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1409,28 @@ static struct page *new_slab(struct kmem
void *last;
void *p;
int order;
+   int alloc_node;

BUG_ON(flags & GFP_SLAB_BUG_MASK);

page = allocate_slab(s,
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-   if (!page)
+   if (!page) {
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+   if (node != NUMA_NO_NODE)
+   node_set(node, empty_nodes);
+#endif
goto out;
+   }

order = compound_order(page);
-   inc_slabs_node(s, page_to_nid(page), page->objects);
+   alloc_node = page_to_nid(page);
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+   node_clear(alloc_node, empty_nodes);
+   if (node != NUMA_NO_NODE && alloc_node != node)
+   node_set(node, empty_nodes);
+#endif
+   inc_slabs_node(s, alloc_node, page->objects);
memcg_bind_pages(s, order);
page->slab_cache = s;
__SetPageSlab(page);
@@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
struct kmem_cache_cpu *c)
 {
void *object;
-   int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+   int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
@@ -2117,8 +2133,19 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-   if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+   int page_node = page_to_nid(page);
+
+   if (!page)
return 0;
+
+   if (node != NUMA_NO_NODE) {
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+   if (node_isset(node, empty_nodes))
+   return 1;
+#endif
+   if (page_node != node)
+   return 0;
+   }
 #endif
return 1;
 }
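
The bookkeeping that the new_slab() hunk above performs can be modelled
in userspace roughly as follows. This is a sketch under assumptions: the
bool array stands in for the nodemask, and model_track_alloc, NO_NODE
and MAX_NODES are names made up for the example. Passing NO_NODE as
alloc_node models a failed page allocation.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_NODE   (-1)   /* stand-in for NUMA_NO_NODE */
#define MAX_NODES 64     /* hypothetical limit for this sketch */

/* Stand-in for the empty_nodes nodemask: one flag per node. */
static bool empty_nodes[MAX_NODES];

/*
 * Update the tracking after a slab page allocation attempt.
 * 'node' is the requested node (or NO_NODE); 'alloc_node' is the node
 * the page actually came from, or NO_NODE if the allocation failed.
 */
static void model_track_alloc(int node, int alloc_node)
{
	assert(node == NO_NODE || (node >= 0 && node < MAX_NODES));
	assert(alloc_node == NO_NODE ||
	       (alloc_node >= 0 && alloc_node < MAX_NODES));

	if (alloc_node == NO_NODE) {
		/* Allocation failed: remember the requested node as empty. */
		if (node != NO_NODE)
			empty_nodes[node] = true;
		return;
	}

	/* The node that supplied the page clearly has memory. */
	empty_nodes[alloc_node] = false;

	/* The page came from elsewhere: the requested node looks empty. */
	if (node != NO_NODE && alloc_node != node)
		empty_nodes[node] = true;
}
```

The second branch is what lets a memoryless node get marked even though
the page allocator succeeds by falling back to another node.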

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-11 Thread Christoph Lameter

On Mon, 10 Feb 2014, Joonsoo Kim wrote:

> On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> >
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if its there.
>
> Why not inspect whether we can get the page on the best node, such as
> the numa_mem_id() node?

It's expensive to do so.

> empty_node cannot be set on a memoryless node, since the page allocation
> would succeed on a different node.

OK, then we need to add a check for being on the right node there too.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-10 Thread Joonsoo Kim

On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> Hi Christoph,
> 
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> > 
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if its there.
> > 
> > If there is no current slab then a regular allocation occurs with the
> > memoryless node. The page allocator will fallback to a possible node and
> > that will become the current slab. Next alloc from a memoryless node
> > will then use that slab.
> > 
> > For that we also add some tracking of allocations on nodes that were not
> > satisfied using the empty_node[] array. A successful alloc on a node
> > clears that flag.
> > 
> > I would rather avoid the empty_node[] array since its global and there may
> > be thread specific allocation restrictions but it would be expensive to do
> > an allocation attempt via the page allocator to make sure that there is
> > really no page available from the page allocator.
> 
> With this patch on our test system (I pulled out the numa_mem_id()
> change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> patch 1):
> 
> MemTotal:        8264704 kB
> MemFree:         5924608 kB
> ...
> Slab:            1402496 kB
> SReclaimable:     102848 kB
> SUnreclaim:      1299648 kB
> 
> And Anton's slabusage reports:
> 
> slab                   mem     objs    slabs
>                       used   active   active
> 
> kmalloc-16384       207 MB   98.60%  100.00%
> task_struct         134 MB   97.82%  100.00%
> kmalloc-8192        117 MB  100.00%  100.00%
> pgtable-2^12        111 MB  100.00%  100.00%
> pgtable-2^10        104 MB  100.00%  100.00%
> 
> For comparison, Anton's patch applied at the same point in the series:
> 
> meminfo:
> 
> MemTotal:        8264704 kB
> MemFree:         4150464 kB
> ...
> Slab:            1590336 kB
> SReclaimable:     208768 kB
> SUnreclaim:      1381568 kB
> 
> slabusage:
> 
> slab                   mem     objs    slabs
>                       used   active   active
> 
> kmalloc-16384       227 MB   98.63%  100.00%
> kmalloc-8192        130 MB  100.00%  100.00%
> task_struct         129 MB   97.73%  100.00%
> pgtable-2^12        112 MB  100.00%  100.00%
> pgtable-2^10        106 MB  100.00%  100.00%
> 
> 
> Consider this patch:
> 
> Acked-by: Nishanth Aravamudan 
> Tested-by: Nishanth Aravamudan 

Hello,

I still think that there is another problem.
Your report about CONFIG_SLAB said that SLAB uses just 200MB.
Below is your previous report.

  Ok, with your patches applied and CONFIG_SLAB enabled:

  MemTotal:        8264640 kB
  MemFree:         7119680 kB
  Slab:             207232 kB
  SReclaimable:      32896 kB
  SUnreclaim:       174336 kB

The numbers for CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
That is a large difference in slab usage.

I should also note that the number of active objects in slabinfo can be
wrong in some situations, since it doesn't consider the cpu slab (and cpu
partial slabs).

I recommend confirming page_to_nid() and the other things I mentioned earlier.

Thanks.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-10 Thread Nishanth Aravamudan

Hi Christoph,

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.
> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.

With this patch on our test system (I pulled out the numa_mem_id()
change, since you Acked Joonsoo's already), on top of 3.13.0 + my
kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
patch 1):

MemTotal:        8264704 kB
MemFree:         5924608 kB
...
Slab:            1402496 kB
SReclaimable:     102848 kB
SUnreclaim:      1299648 kB

And Anton's slabusage reports:

slab                   mem     objs    slabs
                      used   active   active

kmalloc-16384       207 MB   98.60%  100.00%
task_struct         134 MB   97.82%  100.00%
kmalloc-8192        117 MB  100.00%  100.00%
pgtable-2^12        111 MB  100.00%  100.00%
pgtable-2^10        104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series:

meminfo:

MemTotal:        8264704 kB
MemFree:         4150464 kB
...
Slab:            1590336 kB
SReclaimable:     208768 kB
SUnreclaim:      1381568 kB

slabusage:

slab                   mem     objs    slabs
                      used   active   active

kmalloc-16384       227 MB   98.63%  100.00%
kmalloc-8192        130 MB  100.00%  100.00%
task_struct         129 MB   97.73%  100.00%
pgtable-2^12        112 MB  100.00%  100.00%
pgtable-2^10        106 MB  100.00%  100.00%


Consider this patch:

Acked-by: Nishanth Aravamudan 
Tested-by: Nishanth Aravamudan 

I was thinking about your concerns about empty_node[]. Would it make
sense to use helper functions, rather than direct access to
empty_node[], such as:

bool is_node_empty(int nid)

void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and be a
no-op, respectively?

That way only architectures that have memoryless nodes pay the penalty
of the array allocation?
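
A minimal userspace sketch of what such helpers might look like, assuming
a plain bool array in place of the real storage. MAX_NODES and the
HAVE_MEMORYLESS_NODES define are stand-ins for this example only; in the
kernel the switch would be the CONFIG_HAVE_MEMORYLESS_NODES option.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64   /* hypothetical stand-in for MAX_NUMNODES */

/* Defined here to model an architecture with memoryless nodes. */
#define HAVE_MEMORYLESS_NODES 1

#ifdef HAVE_MEMORYLESS_NODES
/* Only configurations with memoryless nodes pay for the array. */
static bool empty_node[MAX_NODES];

static inline bool is_node_empty(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	return empty_node[nid];
}

static inline void set_node_empty(int nid, bool empty)
{
	assert(nid >= 0 && nid < MAX_NODES);
	empty_node[nid] = empty;
}
#else
/* Stubs: no node is ever reported empty, and setting is a no-op. */
static inline bool is_node_empty(int nid)
{
	(void)nid;
	return false;
}

static inline void set_node_empty(int nid, bool empty)
{
	(void)nid;
	(void)empty;
}
#endif
```

Callers would then use is_node_empty(node) in the match path and
set_node_empty() after each page allocation, without any #ifdef at the
call sites.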

Thanks,
Nish

> Index: linux/mm/slub.c
> ===
> --- linux.orig/mm/slub.c  2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c   2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>   void *last;
>   void *p;
>   int order;
> + int alloc_node;
> 
>   BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>   page = allocate_slab(s,
>   flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> - if (!page)
> + if (!page) {
> + if (node != NUMA_NO_NODE)
> + empty_node[node] = 1;
>   goto out;
> + }
> 
>   order = compound_order(page);
> - inc_slabs_node(s, page_to_nid(page), page->objects);
> + alloc_node = page_to_nid(page);
> + empty_node[alloc_node] = 0;
> + inc_slabs_node(s, alloc_node, page->objects);
>   memcg_bind_pages(s, order);
>   page->slab_cache = s;
>   __SetPageSlab(page);
> @@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
>   struct kmem_cache_cpu *c)
>  {
>   void *object;
> - int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> + int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>   object = get_partial_node(s, get_node(s, searchnode), c, flags);
>   if (object || node != NUMA_NO_NODE)
> @@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> - if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> + int page_node;
> +
> + /* No data means no match *

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-09 Thread Joonsoo Kim

On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.

Why not inspect whether we can get the page on the best node, such as
the numa_mem_id() node?

> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.
> 
> Index: linux/mm/slub.c
> ===
> --- linux.orig/mm/slub.c  2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c   2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>   void *last;
>   void *p;
>   int order;
> + int alloc_node;
> 
>   BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>   page = allocate_slab(s,
>   flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> - if (!page)
> + if (!page) {
> + if (node != NUMA_NO_NODE)
> + empty_node[node] = 1;
>   goto out;
> + }

empty_node cannot be set on a memoryless node, since the page allocation
would succeed on a different node.

Thanks.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-09 Thread Joonsoo Kim

On Fri, Feb 07, 2014 at 01:38:55PM -0800, Nishanth Aravamudan wrote:
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> 
> Hi Christoph, this should be tested instead of Joonsoo's patch 2 (and 3)?

Hello,

I guess that your system has another problem that makes my patches ineffective.
Maybe it will also affect Christoph's patch. Could you check page_to_nid(),
numa_mem_id() and node_present_pages? I mostly suspect page_to_nid().

Thanks.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-09 Thread Joonsoo Kim

On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote:
> On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> 
> > > It seems like a better approach would be to do this when a node is
> > > brought online and determine the fallback node based not on the
> > > zonelists as you do here but rather on locality (such as through a
> > > SLIT if provided, see node_distance()).
> > 
> > Hmm...
> > I guess that the zonelist is based on locality. The zonelist is generated
> > using node_distance(), so I think that it reflects locality. But I'm not
> > an expert on NUMA, so please let me know what I am missing here :)
> > 
> 
> The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
> If your solution is going to become the generic kernel API that determines 
> what node has local memory for a particular node, then it will have to 
> support all definitions of node.  That includes nodes that consist solely 
> of I/O, chipsets, networking, or storage devices.  These nodes may not 
> have memory or cpus, so doing it as part of onlining cpus isn't going to 
> be generic enough.  You want a node_to_mem_node() API for all possible 
> node types (the possible node types listed above are straight from the 
> ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
> X and we can optimize for that, but any solution that relies on cpu online 
> is probably shortsighted right now.
> 
> I think it would be much better to do this as a part of setting a node to 
> be online.

Okay. I got your point.
I will change it to rely on node online if this patch is really needed.
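
To make the idea concrete, here is a userspace sketch of picking the
fallback node by distance when a node comes online. Everything in it is
assumed for illustration: the distance table, the present_pages array
and the name model_node_to_mem_node are made up, and a real version
would use node_distance() and the kernel's own node state.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4   /* hypothetical topology size */

/* Hypothetical SLIT-like distance table (smaller means closer). */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Hypothetical present pages per node; nodes 0 and 3 are memoryless. */
static const unsigned long present_pages[NR_NODES] = { 0, 4096, 8192, 0 };

/*
 * Return 'nid' itself if it has memory, otherwise the closest node
 * that does, or -1 if no node has memory.
 */
static int model_node_to_mem_node(int nid)
{
	int best = -1;
	int best_dist = INT_MAX;

	assert(nid >= 0 && nid < NR_NODES);

	if (present_pages[nid] > 0)
		return nid;   /* the common case: the node has memory */

	for (int n = 0; n < NR_NODES; n++) {
		if (present_pages[n] == 0)
			continue;   /* skip other memoryless nodes */
		if (distance[nid][n] < best_dist) {
			best_dist = distance[nid][n];
			best = n;
		}
	}
	return best;
}
```

Because the lookup depends only on node state and distances, it works
for nodes without cpus as well, which is the case that a cpu-online
hook would miss.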

Thanks!

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-07 Thread Nishanth Aravamudan

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.

Hi Christoph, this should be tested instead of Joonsoo's patch 2 (and 3)?

Thanks,
Nish

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-07 Thread Christoph Lameter

Here is a draft of a patch to make this work with memoryless nodes.

The first thing is that we modify node_match to also match if we hit an
empty node. In that case we simply take the current slab if it's there.

If there is no current slab then a regular allocation occurs with the
memoryless node. The page allocator will fall back to a possible node and
that will become the current slab. The next alloc from a memoryless node
will then use that slab.

For that we also add some tracking of allocations on nodes that were not
satisfied, using the empty_node[] array. A successful alloc on a node
clears that flag.

I would rather avoid the empty_node[] array since it's global and there may
be thread-specific allocation restrictions, but it would be expensive to do
an allocation attempt via the page allocator to make sure that there is
really no page available from the page allocator.

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
+++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
@@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+static int empty_node[MAX_NUMNODES];
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+		if (node != NUMA_NO_NODE)
+			empty_node[node] = 1;
 		goto out;
+	}

 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+	empty_node[alloc_node] = 0;
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node;
+
+	/* No data means no match */
+	if (!page)
 		return 0;
+
+	/* Node does not matter. Therefore anything is a match */
+	if (node == NUMA_NO_NODE)
+		return 1;
+
+	/* Did we hit the requested node ? */
+	page_node = page_to_nid(page);
+	if (page_node == node)
+		return 1;
+
+	/* If the node has available data then we can use it. Mismatch */
+	return !empty_node[page_node];
+
+	/* Target node empty so just take anything */
 #endif
 	return 1;
 }
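
A small userspace model of the matching rule described in the text above may help when reading the draft. It is only a sketch: struct sketch_page, sketch_empty_node[] and the sketch_* functions are hypothetical stand-ins for struct page, empty_node[] and the slub.c code. Note one deliberate difference: the draft's final return indexes empty_node[] by the page's node, while this sketch indexes by the requested node, which is one reading of "also match if we hit an empty node" and is an assumption about the intent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 4
#define SKETCH_NO_NODE (-1)	/* stands in for NUMA_NO_NODE */

struct sketch_page {
	int nid;	/* node the slab page came from */
};

/* Set when an allocation aimed at a node could not be satisfied. */
static bool sketch_empty_node[SKETCH_MAX_NODES];

/* Mirrors the bookkeeping in the draft's new_slab(): mark on failure, clear on success. */
static void sketch_record_alloc(int requested, const struct sketch_page *page)
{
	if (!page) {
		if (requested != SKETCH_NO_NODE)
			sketch_empty_node[requested] = true;
		return;
	}
	sketch_empty_node[page->nid] = false;
}

static bool sketch_node_match(const struct sketch_page *page, int node)
{
	/* No slab means no match. */
	if (!page)
		return false;

	/* The caller does not care about the node. */
	if (node == SKETCH_NO_NODE)
		return true;

	/* The slab is on the requested node. */
	if (page->nid == node)
		return true;

	/*
	 * Assumption: a mismatch is tolerated only when the requested
	 * node is known to be empty, so the current slab is reused.
	 */
	return sketch_empty_node[node];
}
```

With this rule, a request for a node that still has memory continues to reject slabs from other nodes, and only requests aimed at a node marked empty reuse whatever slab is current.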


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-07 Thread Christoph Lameter

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> >
> > It seems like a better approach would be to do this when a node is brought
> > online and determine the fallback node based not on the zonelists as you
> > do here but rather on locality (such as through a SLIT if provided, see
> > node_distance()).
>
> Hmm...
> I guess that the zonelist is based on locality. The zonelist is generated
> using node_distance(), so I think that it reflects locality. But I'm not an
> expert on NUMA, so please let me know what I am missing here :)

The next node can be found by going through the zonelist of a node and
checking for available memory. See fallback_alloc().

There is a function node_distance() that determines the relative
performance of a memory access from one node to another.
The building of the fallback list for every node in build_zonelists()
relies on that.
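
As a rough illustration of the two steps above (order the nodes by distance, then walk that order looking for memory), here is a simplified userspace sketch. It is not the kernel's build_zonelists(); the distance table and all sketch_* names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_NODES 4

/* Hypothetical distance table standing in for node_distance(). */
static const int sketch_distance[SKETCH_MAX_NODES][SKETCH_MAX_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/*
 * Fill order[] with all nodes sorted by increasing distance from nid.
 * Insertion sort; ties keep the lower node id first.
 */
static void sketch_build_fallback(int nid, int order[SKETCH_MAX_NODES])
{
	for (int i = 0; i < SKETCH_MAX_NODES; i++) {
		int j = i;

		while (j > 0 &&
		       sketch_distance[nid][order[j - 1]] > sketch_distance[nid][i]) {
			order[j] = order[j - 1];
			j--;
		}
		order[j] = i;
	}
}

/* First node in the fallback order that has memory, or -1 if none does. */
static int sketch_first_mem_node(const int order[SKETCH_MAX_NODES],
				 const bool has_memory[SKETCH_MAX_NODES])
{
	for (int i = 0; i < SKETCH_MAX_NODES; i++)
		if (has_memory[order[i]])
			return order[i];
	return -1;
}
```

For node 2 in this table the order comes out as 2, 3, 0, 1, so a memoryless node 2 would fall back to node 3 if it has memory, and otherwise to node 0.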


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-06 Thread Joonsoo Kim

On Thu, Feb 06, 2014 at 12:52:11PM -0800, David Rientjes wrote:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> 
> > From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> > From: Joonsoo Kim 
> > Date: Thu, 6 Feb 2014 17:07:05 +0900
> > Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
> > determining the
> >  fallback node
> > 
> > We need to determine the fallback node in the slub allocator if the
> > allocation target node is a memoryless node. Without it, SLUB wrongly
> > selects a node which has no memory and can't use a partial slab because
> > of the node mismatch. The introduced function, node_numa_mem(X), will
> > return a node Y with memory that has the nearest distance. If X is a
> > memoryless node, it will return the nearest node with memory; if
> > X is a normal node, it will return X itself.
> > 
> > We will use this function in the following patch to determine the
> > fallback node.
> > 
> 
> I like the approach and it may fix the problem today, but it may not be 
> sufficient in the future: nodes may not only be memoryless but they may 
> also be cpuless.  It's possible that a node can only have I/O, networking, 
> or storage devices and we can define affinity for them that is remote from 
> every cpu and/or memory by the ACPI specification.
> 
> It seems like a better approach would be to do this when a node is brought 
> online and determine the fallback node based not on the zonelists as you 
> do here but rather on locality (such as through a SLIT if provided, see 
> node_distance()).

Hmm...
I guess that the zonelist is based on locality. The zonelist is generated
using node_distance(), so I think that it reflects locality. But I'm not an
expert on NUMA, so please let me know what I am missing here :)

> Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make 
> a lot of sense in generic code.  I'd suggest something like 
> node_to_mem_node().

That's much better!
If this patch turns out to be needed, I will update it.

Thanks.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-06 Thread Joonsoo Kim

On Thu, Feb 06, 2014 at 11:11:31AM -0800, Nishanth Aravamudan wrote:
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 12ae6ce..66b19b8 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >   * Use the accessor functions set_numa_mem(), numa_mem_id() and 
> > cpu_to_mem().
> >   */
> >  DECLARE_PER_CPU(int, _numa_mem_);
> > +int _node_numa_mem_[MAX_NUMNODES];
> 
> Should be static, I think?

Yes, will update it.

Thanks.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-06 Thread Nishanth Aravamudan

On 06.02.2014 [19:29:16 +0900], Joonsoo Kim wrote:
> 2014-02-06 David Rientjes :
> > On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> >
> >> Signed-off-by: Joonsoo Kim 
> >>
> >
> > I may be misunderstanding this patch and there's no help because there's
> > no changelog.
> 
> Sorry about that.
> I made this patch just for testing. :)
> Thanks for looking at this.
> 
> >> diff --git a/include/linux/topology.h b/include/linux/topology.h
> >> index 12ae6ce..a6d5438 100644
> >> --- a/include/linux/topology.h
> >> +++ b/include/linux/topology.h
> >> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >>   * Use the accessor functions set_numa_mem(), numa_mem_id() and 
> >> cpu_to_mem().
> >>   */
> >>  DECLARE_PER_CPU(int, _numa_mem_);
> >> +int _node_numa_mem_[MAX_NUMNODES];
> >>
> >>  #ifndef set_numa_mem
> >>  static inline void set_numa_mem(int node)
> >>  {
> >>   this_cpu_write(_numa_mem_, node);
> >> + _node_numa_mem_[numa_node_id()] = node;
> >> +}
> >> +#endif
> >> +
> >> +#ifndef get_numa_mem
> >> +static inline int get_numa_mem(int node)
> >> +{
> >> + return _node_numa_mem_[node];
> >>  }
> >>  #endif
> >>
> >> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
> >>  static inline void set_cpu_numa_mem(int cpu, int node)
> >>  {
> >>   per_cpu(_numa_mem_, cpu) = node;
> >> + _node_numa_mem_[numa_node_id()] = node;
> >
> > The intention seems to be that _node_numa_mem_[X] for a node X will return
> > a node Y with memory that has the nearest distance?  In other words,
> > caching the value returned by local_memory_node(X)?
> 
> Yes, you are right.
> 
> > That doesn't seem to be what it's doing since numa_node_id() is the node
> > of the cpu that current is running on so this ends up getting initialized
> > to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in
> > cpu_possible_mask.
> 
> Yes, I made a mistake.
> Thanks for the pointer.
> I have fixed it and attached v2.
> I'm out of the office now, so I'm not sure this second version is correct :(
> 
> Thanks.
> 
> --8<--
> From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim 
> Date: Thu, 6 Feb 2014 17:07:05 +0900
> Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
> determining the
>  fallback node
> 
> We need to determine the fallback node in the slub allocator if the
> allocation target node is a memoryless node. Without it, SLUB wrongly
> selects a node which has no memory and can't use a partial slab because
> of the node mismatch. The introduced function, node_numa_mem(X), will
> return a node Y with memory that has the nearest distance. If X is a
> memoryless node, it will return the nearest node with memory; if
> X is a normal node, it will return X itself.
> 
> We will use this function in the following patch to determine the
> fallback node.
> 
> Signed-off-by: Joonsoo Kim 
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 12ae6ce..66b19b8 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>   */
>  DECLARE_PER_CPU(int, _numa_mem_);
> +int _node_numa_mem_[MAX_NUMNODES];

Should be static, I think?

> 
>  #ifndef set_numa_mem
>  static inline void set_numa_mem(int node)
>  {
>   this_cpu_write(_numa_mem_, node);
> + _node_numa_mem_[numa_node_id()] = node;
> +}
> +#endif
> +
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> + return _node_numa_mem_[node];
>  }
>  #endif
> 
> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>  static inline void set_cpu_numa_mem(int cpu, int node)
>  {
>   per_cpu(_numa_mem_, cpu) = node;
> + _node_numa_mem_[cpu_to_node(cpu)] = node;
>  }
>  #endif
> 
> @@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
>  }
>  #endif
> 
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> + return node;
> +}
> +#endif
> +
>  #ifndef cpu_to_mem
>  static inline int cpu_to_mem(int cpu)
>  {
> -- 
> 1.7.9.5
> 


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-06 Thread Joonsoo Kim

2014-02-06 David Rientjes :
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
>
>> Signed-off-by: Joonsoo Kim 
>>
>
> I may be misunderstanding this patch and there's no help because there's
> no changelog.

Sorry about that.
I made this patch just for testing. :)
Thanks for looking at this.

>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 12ae6ce..a6d5438 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>>   * Use the accessor functions set_numa_mem(), numa_mem_id() and 
>> cpu_to_mem().
>>   */
>>  DECLARE_PER_CPU(int, _numa_mem_);
>> +int _node_numa_mem_[MAX_NUMNODES];
>>
>>  #ifndef set_numa_mem
>>  static inline void set_numa_mem(int node)
>>  {
>>   this_cpu_write(_numa_mem_, node);
>> + _node_numa_mem_[numa_node_id()] = node;
>> +}
>> +#endif
>> +
>> +#ifndef get_numa_mem
>> +static inline int get_numa_mem(int node)
>> +{
>> + return _node_numa_mem_[node];
>>  }
>>  #endif
>>
>> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>>  static inline void set_cpu_numa_mem(int cpu, int node)
>>  {
>>   per_cpu(_numa_mem_, cpu) = node;
>> + _node_numa_mem_[numa_node_id()] = node;
>
> The intention seems to be that _node_numa_mem_[X] for a node X will return
> a node Y with memory that has the nearest distance?  In other words,
> caching the value returned by local_memory_node(X)?

Yes, you are right.

> That doesn't seem to be what it's doing since numa_node_id() is the node
> of the cpu that current is running on so this ends up getting initialized
> to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in
> cpu_possible_mask.

Yes, I made a mistake.
Thanks for the pointer.
I have fixed it and attached v2.
I'm out of the office now, so I'm not sure this second version is correct :(

Thanks.
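
The indexing problem pointed out above can be reproduced with a tiny userspace model. This is only a sketch: the three-node topology, the tables and the sketch_* names are invented for illustration, with sketch_current_node standing in for numa_node_id() during setup and sketch_local_memory_node[] standing in for local_memory_node().

```c
#include <assert.h>

#define SKETCH_NR_CPUS 3
#define SKETCH_MAX_NODES 3

/* Hypothetical topology: one CPU per node, node 1 is memoryless. */
static const int sketch_cpu_to_node[SKETCH_NR_CPUS] = { 0, 1, 2 };
static const int sketch_local_memory_node[SKETCH_MAX_NODES] = { 0, 2, 2 };

/* Node of the CPU that runs the setup loop. */
static const int sketch_current_node = 0;

/* v1 behaviour: every write lands in the slot of the running CPU's node. */
static void sketch_fill_v1(int table[SKETCH_MAX_NODES])
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		int mem = sketch_local_memory_node[sketch_cpu_to_node[cpu]];

		table[sketch_current_node] = mem;
	}
}

/* v2 behaviour: the slot is chosen by the node of the CPU being set up. */
static void sketch_fill_v2(int table[SKETCH_MAX_NODES])
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		int node = sketch_cpu_to_node[cpu];

		table[node] = sketch_local_memory_node[node];
	}
}
```

In this model v1 leaves a single slot holding the value for the last CPU visited and the other slots untouched, while v2 fills one slot per node that has a CPU. A node without any CPU would still be left unset by v2.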

--8<--
From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Thu, 6 Feb 2014 17:07:05 +0900
Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
determining the
 fallback node

We need to determine the fallback node in the slub allocator if the
allocation target node is a memoryless node. Without it, SLUB wrongly
selects a node which has no memory and can't use a partial slab because
of the node mismatch. The introduced function, node_numa_mem(X), will
return a node Y with memory that has the nearest distance. If X is a
memoryless node, it will return the nearest node with memory; if
X is a normal node, it will return X itself.

We will use this function in the following patch to determine the
fallback node.

Signed-off-by: Joonsoo Kim 

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..66b19b8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -233,11 +233,20 @@ static inline int numa_node_id(void)
  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
  */
 DECLARE_PER_CPU(int, _numa_mem_);
+int _node_numa_mem_[MAX_NUMNODES];

 #ifndef set_numa_mem
 static inline void set_numa_mem(int node)
 {
 	this_cpu_write(_numa_mem_, node);
+	_node_numa_mem_[numa_node_id()] = node;
+}
+#endif
+
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return _node_numa_mem_[node];
 }
 #endif

@@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
 static inline void set_cpu_numa_mem(int cpu, int node)
 {
 	per_cpu(_numa_mem_, cpu) = node;
+	_node_numa_mem_[cpu_to_node(cpu)] = node;
 }
 #endif

@@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
 }
 #endif

+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return node;
+}
+#endif
+
 #ifndef cpu_to_mem
 static inline int cpu_to_mem(int cpu)
 {
-- 
1.7.9.5

[RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-06 Thread Joonsoo Kim

Signed-off-by: Joonsoo Kim 

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..a6d5438 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -233,11 +233,20 @@ static inline int numa_node_id(void)
  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
  */
 DECLARE_PER_CPU(int, _numa_mem_);
+int _node_numa_mem_[MAX_NUMNODES];
 
 #ifndef set_numa_mem
 static inline void set_numa_mem(int node)
 {
 	this_cpu_write(_numa_mem_, node);
+	_node_numa_mem_[numa_node_id()] = node;
+}
+#endif
+
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return _node_numa_mem_[node];
 }
 #endif
 
@@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
 static inline void set_cpu_numa_mem(int cpu, int node)
 {
 	per_cpu(_numa_mem_, cpu) = node;
+	_node_numa_mem_[numa_node_id()] = node;
 }
 #endif
 
@@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
 }
 #endif
 
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return node;
+}
+#endif
+
 #ifndef cpu_to_mem
 static inline int cpu_to_mem(int cpu)
 {
-- 
1.7.9.5
