Re: [PATCH V2] mm/page_alloc: Ensure that HUGETLB_PAGE_ORDER is less than MAX_ORDER
On Mon, 19 Apr 2021, Anshuman Khandual wrote:

> >> Unfortunately the build test fails on both the platforms (powerpc and ia64)
> >> which subscribe HUGETLB_PAGE_SIZE_VARIABLE and where this check would make
> >> sense. I somehow overlooked the cross compile build failure that actually
> >> detected this problem.
> >>
> >> But wondering why this assert is not holding true? And how do these platforms
> >> not see the warning during boot (or do they?) at mm/vmscan.c:1092 like
> >> arm64 did.
> >>
> >> static int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> >> {
> >> 	unsigned long requested = 1UL << order;
> >>
> >> 	if (WARN_ON_ONCE(order >= MAX_ORDER))
> >> 		return 0;
> >> }
> >>
> >> Can pageblock_order really exceed MAX_ORDER - 1?

You can have larger blocks, but you would need to allocate multiple contiguous max order blocks or do it at boot time before the buddy allocator is active. What IA64 did was to do this at boot time, thereby avoiding the buddy lists. And it had a separate virtual address range and page table for the huge pages. Looks like the current code does these allocations via CMA, which should also bypass the buddy allocator.

> > But it's kind of weird, isn't it? Let's assume we have MAX_ORDER - 1
> > correspond to 4 MiB and pageblock_order correspond to 8 MiB.
> >
> > Sure, we'd be grouping pages in 8 MiB chunks, however, we cannot even
> > allocate 8 MiB chunks via the buddy. So only alloc_contig_range()
> > could really grab them (IOW: gigantic pages).

Right. But then you can avoid the buddy allocator.

> > Further, we have code like deferred_free_range(), where we end up
> > calling __free_pages_core()->...->__free_one_page() with
> > pageblock_order. Wouldn't we end up setting the buddy order to
> > something > MAX_ORDER - 1 on that path?

Agreed. We would need to return the supersized block to the huge page pool and not to the buddy allocator.
There is a special callback in the compound page so that you can call an alternate free function that is not the buddy allocator.

> > Having pageblock_order > MAX_ORDER feels wrong and looks shaky.
>
> Agreed, definitely does not look right. Let's see what other folks
> might have to say on this.
>
> + Christoph Lameter

It was done for a long time successfully and is running in numerous configurations.
Re: 4.12-rc ppc64 4k-page needs costly allocations
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> SLUB versus SLAB, cpu versus memory? Since someone has taken the
> trouble to write it with ctors in the past, I didn't feel on firm
> enough ground to recommend such a change. But it may be obvious
> to someone else that your suggestion would be better (or worse).

Umm, how about using alloc_pages() for page frames?
Re: 4.12-rc ppc64 4k-page needs costly allocations
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> Thanks a lot for working that out. Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Use kzalloc to zero it. And here is another example of using slab allocations for page frames. Use the page allocator for this? The page allocator is there for allocating page frames. The slab allocator's main purpose is to allocate small objects.
Re: 4.12-rc ppc64 4k-page needs costly allocations
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> CONFIG_SLUB_DEBUG_ON=y. My SLAB|SLUB config options are
>
> CONFIG_SLUB_DEBUG=y
> # CONFIG_SLUB_MEMCG_SYSFS_ON is not set
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> # CONFIG_SLAB_FREELIST_RANDOM is not set
> CONFIG_SLUB_CPU_PARTIAL=y
> CONFIG_SLABINFO=y
> # CONFIG_SLUB_DEBUG_ON is not set
> CONFIG_SLUB_STATS=y

That's fine.

> But I think you are now surprised, when I say no slub_debug options
> were on. Here's the output from /sys/kernel/slab/pgtable-2^12/*
> (before I tried the new kernel with Aneesh's fix patch)
> in case they tell you anything...
>
> pgtable-2^12/poison:0
> pgtable-2^12/red_zone:0
> pgtable-2^12/reserved:0
> pgtable-2^12/sanity_checks:0
> pgtable-2^12/store_user:0

OK, so debugging was off, but the slab cache has a ctor callback, which mandates that the free pointer cannot use the free object space when the object is not in use. Thus the size of the object must be increased to accommodate the freepointer.
Re: 4.12-rc ppc64 4k-page needs costly allocations
> > I am curious as to what is going on there. Do you have the output from
> > these failed allocations?
>
> I thought the relevant output was in my mail. I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented. What more output are you looking for?

The output for the failing allocations when you disable debugging. For that I would think that you need to remove(!) the slub_debug statement on the kernel command line. You can verify that debug is off by inspecting the values in /sys/kernel/slab//

> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too? If so, then we ought to dig into it further.

No, it no longer does. I don't think slub_debug=O disables debugging (frankly, I am not sure what it does). Please do not specify any debug options.
Re: 4.12-rc ppc64 4k-page needs costly allocations
On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> > cache: pgtable-2^12, object size: 32768, buffer size: 65536, default
> > order: 4, min order: 4
> > pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh. OK, debugging increased the object size to order 4. This should be order 3 without debugging.

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from these failed allocations?
Re: 4.12-rc ppc64 4k-page needs costly allocations
On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down. Why are the slab allocators used to create slab caches for large object sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?
Re: [PATCH] percpu: improve generic percpu modify-return implementation
On Wed, 21 Sep 2016, Tejun Heo wrote: > Hello, Nick. > > How have you been? :) > He is baack. Are we getting SL!B? ;-)
Re: [kernel-hardening] Re: [PATCH 9/9] mm: SLUB hardened usercopy support
On Fri, 8 Jul 2016, Kees Cook wrote:

> Is check_valid_pointer() making sure the pointer is within the usable
> size? It seemed like it was checking that it was within the slub
> object (checks against s->size, wants it above base after moving
> pointer to include redzone, etc).

check_valid_pointer() verifies that a pointer is pointing to the start of an object. It is used to verify the internal pointers that SLUB uses and should not be modified to do anything different.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [kernel-hardening] Re: [PATCH 9/9] mm: SLUB hardened usercopy support
On Fri, 8 Jul 2016, Michael Ellerman wrote:

> > I wonder if this code should be using size_from_object() instead of s->size?
>
> Hmm, not sure. Who's SLUB maintainer? :)

Me. s->size is the size of the whole object including debugging info etc. ksize() gives you the actual usable size of an object.
Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

> > NAK. This is changing slob behavior. With no node specified it must use
> > alloc_pages because that obeys NUMA memory policies etc. It should not
> > force allocation from the current node like what is happening here after
> > the patch. See the code in slub.c that is similar.
>
> Doh, somehow I convinced myself that there's #else and alloc_pages() is only
> used for !CONFIG_NUMA so it doesn't matter. Here's a fixed version.

Acked-by: Christoph Lameter
Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

> --- a/mm/slob.c
> +++ b/mm/slob.c
> 	void *page;
>
> -#ifdef CONFIG_NUMA
> -	if (node != NUMA_NO_NODE)
> -		page = alloc_pages_exact_node(node, gfp, order);
> -	else
> -#endif
> -	page = alloc_pages(gfp, order);
> +	page = alloc_pages_node(node, gfp, order);

NAK. This is changing slob behavior. With no node specified it must use alloc_pages() because that obeys NUMA memory policies etc. It should not force allocation from the current node like what is happening here after the patch. See the code in slub.c that is similar.
Re: [PATCH v3 2/3] mm: unify checks in alloc_pages_node() and __alloc_pages_node()
Acked-by: Christoph Lameter
Re: [PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

> numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
> so it's a more robust fallback than the currently used numa_node_id().
>
> Suggested-by: Christoph Lameter
> Signed-off-by: Vlastimil Babka
> Acked-by: David Rientjes
> Acked-by: Mel Gorman

You can add my ack too if it helps.

Acked-by: Christoph Lameter
Re: [PATCH] mm: rename and document alloc_pages_exact_node
On Wed, 22 Jul 2015, David Rientjes wrote:

> Eek, yeah, that does look bad. I'm not even sure the
>
> 	if (nid < 0)
> 		nid = numa_node_id();
>
> is correct; I think this should be comparing to NUMA_NO_NODE rather than
> all negative numbers, otherwise we silently ignore overflow and nobody
> ever knows.

Comparing to NUMA_NO_NODE would be better. Also use numa_mem_id() instead to support memoryless nodes better?

> The only possible downside would be existing users of
> alloc_pages_node() that are calling it with an offline node. Since it's a
> VM_BUG_ON() that would catch that, I think it should be changed to a
> VM_WARN_ON() and eventually fixed up because it's nonsensical.

VM_BUG_ON() here should be avoided. The offline node thing could be addressed by using numa_mem_id()?
Re: [PATCH] mm: rename and document alloc_pages_exact_node
On Tue, 21 Jul 2015, Vlastimil Babka wrote:

> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.

Yup. I complained about this when this was introduced. Glad to see this fixed. Initially this was alloc_pages_node(), which just means that a node is specified. The exact behavior of the allocation is determined by flags such as GFP_THISNODE. I'd rather have that restored, because otherwise we get into weird code like the one below. And such an arrangement also leaves the way open to add more flags in the future that may change the allocation behavior.

> 	area->nid = nid;
> 	area->order = order;
> -	area->pages = alloc_pages_exact_node(area->nid,
> +	area->pages = alloc_pages_prefer_node(area->nid,
> 					GFP_KERNEL|__GFP_THISNODE,
> 					area->order);

This is not preferring a node but requiring allocation on that node.
Re: powerpc: Replace __get_cpu_var uses
On Wed, 29 Oct 2014, Michael Ellerman wrote:

> > #define __ARCH_IRQ_STAT
> >
> > -#define local_softirq_pending() __get_cpu_var(irq_stat).__softirq_pending
> > +#define local_softirq_pending() __this_cpu_read(irq_stat.__softirq_pending)
> > +#define set_softirq_pending(x) __this_cpu_write(irq_stat._softirq_pending, (x))
> > +#define or_softirq_pending(x) __this_cpu_or(irq_stat._softirq_pending, (x))
>
> This breaks the build, because we also get the version of set_ and or_ from
> include/linux/interrupt.h, and then because it's __softirq_pending.
>
> Fixed by adding:
>
> #define __ARCH_SET_SOFTIRQ_PENDING
>
> And fixing the typo.

Ok.

> > void __set_breakpoint(struct arch_hw_breakpoint *brk)
> > {
> > -	__get_cpu_var(current_brk) = *brk;
> > +	__this_cpu_write(current_brk, *brk);
>
> This breaks the build because we're trying to do a structure assignment but
> __this_cpu_write() only supports certain sizes.
>
> I replaced it with this which I think is right?
>
> 	memcpy(this_cpu_ptr(&current_brk), brk, sizeof(*brk));

Yes, that is right. Thank you.
Re: powerpc: Replace __get_cpu_var uses
On Tue, 28 Oct 2014, Michael Ellerman wrote:

> I'm happy to put it in a topic branch for 3.19, or move the definition or
> whatever, your choice Christoph.

Get the patch merged please.
Re: powerpc: Replace __get_cpu_var uses
Ping? We are planning to remove support for __get_cpu_var in the 3.19 merge period. I can move the definition for __get_cpu_var into the powerpc per cpu definition instead if we cannot get this merged?

On Tue, 21 Oct 2014, Christoph Lameter wrote:

> This still has not been merged and now powerpc is the only arch that does
> not have this change. Sorry about missing linuxppc-dev before.
>
> [full patch description and diffstat trimmed; they appear verbatim in the
> original posting below]
powerpc: Replace __get_cpu_var uses
This still has not been merged and now powerpc is the only arch that does not have this change. Sorry about missing linuxppc-dev before.

V2->V2
- Fix up to work against 3.18-rc1

__get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset.

Other use cases are for storing and retrieving data from the current processor's percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment.

__get_cpu_var() is defined as:

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables.

This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and fewer registers are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so the macro is removed too.

The patch set includes passes over all arches as well. Once these operations are used throughout then specialized macros can be defined in non-x86 arches as well in order to optimize per cpu access by f.e. using a global register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

   Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

   Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processor's instance of a per cpu variable.

	DEFINE_PER_CPU(int, y);
	int x = __get_cpu_var(y);

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y) = x;

   Converts to

	__this_cpu_write(y, x);

6. Increment/decrement etc. of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++;

   Converts to

	__this_cpu_inc(y);

Cc: Benjamin Herrenschmidt
CC: Paul Mackerras
Signed-off-by: Christoph Lameter
---
 arch/powerpc/include/asm/hardirq.h           |  4 +++-
 arch/powerpc/include/asm/tlbflush.h          |  4 ++--
 arch/powerpc/include/asm/xics.h              |  8
 arch/powerpc/kernel/dbell.c                  |  2 +-
 arch/powerpc/kernel/hw_breakpoint.c          |  6 +++---
 arch/powerpc/kernel/iommu.c                  |  2 +-
 arch/powerpc/kernel/irq.c                    |  4 ++--
 arch/powerpc/kernel/kgdb.c                   |  2 +-
 arch/powerpc/kernel/kprobes.c                |  6 +++---
 arch/powerpc/kernel/mce.c                    | 24
 arch/powerpc/kernel/process.c                | 10 +-
 arch/powerpc/kernel/smp.c                    |  6 +++---
 arch/powerpc/kernel/sysfs.c                  |  4 ++--
 arch/powerpc/kernel/time.c                   | 22 +++---
 arch/powerpc/kernel/traps.c                  |  6 +++---
 arch/powerpc/kvm/e500.c                      | 14 +++---
 arch/powerpc/kvm/e500mc.c                    |  4 ++--
 arch/powerpc/mm/hash_native_64.c             |  2 +-
 arch/powerpc/mm/hash_utils_64.c              |  2 +-
 arch/powerpc/mm/hugetlbpage-book3e.c         |  6 +++---
 arch/powerpc/mm/hugetlbpage.c                |  2 +-
 arch/powerpc/mm/stab.c                       | 12 ++--
 arch/powerpc/perf/core-book3s.c              | 22 +++---
 arch/powerpc/perf/core-fsl-emb.c             |  6 +++---
 arch/powerpc/platforms/cell/interrupt.c      |  6 +++---
 arch/powerpc/platforms/ps3/interrupt.c       |  2 +-
 arch/powerpc/platforms/pseries/dtl.c         |  2 +-
 arch/powerpc/platforms/pseries/hvCall_inst.c |  4 ++--
 arch/powerpc/platforms/pseries/iommu.c       |  8
 arch/powerpc/platforms/pseries/lpar.c        |  6 +++---
 arch/powerpc/platforms/pseries/ras.c         |  4 ++--
 arch/powerpc/sysdev/xics/xics-common.c       |  2 +-
 32 files changed, 108 insertions(+), 106 deletions(-)

Index: linux/arch/powerpc/include/asm/hardirq.h
===
Re: [RFC PATCH v3 1/4] topology: add support for node_to_mem_node() to determine the fallback node
On Wed, 13 Aug 2014, Nishanth Aravamudan wrote:

> +++ b/include/linux/topology.h
> @@ -119,11 +119,20 @@ static inline int numa_node_id(void)
>  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>  */
> DECLARE_PER_CPU(int, _numa_mem_);
> +extern int _node_numa_mem_[MAX_NUMNODES];

Why are these variables starting with an _? Maybe _numa_mem_ was defined that way because it is typically not defined. We don't do this in other situations.
RE: Kernel build issues after yesterdays merge by Linus
Gobbledygook due to a missing MIME header. Decoded, David's reply reads:

On Thu, 12 Jun 2014, David Laight wrote:

From: Anton Blanchard
...
> diff --git a/arch/powerpc/boot/install.sh b/arch/powerpc/boot/install.sh
> index b6a256b..e096e5a 100644
> --- a/arch/powerpc/boot/install.sh
> +++ b/arch/powerpc/boot/install.sh
> @@ -23,8 +23,8 @@ set -e
>
>  # User may have a custom install script
>
> -if [ -x ~/bin/${INSTALLKERNEL} ]; then exec ~/bin/${INSTALLKERNEL} "$@"; fi
> -if [ -x /sbin/${INSTALLKERNEL} ]; then exec /sbin/${INSTALLKERNEL} "$@"; fi
> +if [ -x ~/bin/${INSTALLKERNEL} ]; then exec ~/bin/${INSTALLKERNEL} $1 $2 $3 $4; fi
> +if [ -x /sbin/${INSTALLKERNEL} ]; then exec /sbin/${INSTALLKERNEL} $1 $2 $3 $4; fi

You probably want to enclose the $1 in " as:

> +if [ -x /sbin/${INSTALLKERNEL} ]; then exec /sbin/${INSTALLKERNEL} "$1" "$2" "$3" "$4"; fi

	David
power and percpu: Could we move the paca into the percpu area?
Looking at arch/powerpc/include/asm/percpu.h I see that the per cpu offset comes from a local_paca field, and local_paca is in r13. That means that for all percpu operations we first have to determine the address through a memory access.

Would it be possible to put the paca at the beginning of the percpu data area and then have r31 point to the percpu area? power has these nice instructions that fetch from an offset relative to a base register which could be used throughout for percpu operations in the kernel (similar to x86 segment registers).

With that we may also be able to use the atomic ops for fast percpu access so that we can avoid the irq enable/disable sequence that is now required for percpu atomics. Would result in fast and reliable percpu counters for powerpc.

I.e. the powerpc atomic_inc

static __inline__ void atomic_inc(atomic_t *v)
{
	int t;

	__asm__ __volatile__(
"1:	lwarx	%0,0,%2		# atomic_inc\n\
	addic	%0,%0,1\n"
	PPC405_ERR77(0,%2)
"	stwcx.	%0,0,%2 \n\
	bne-	1b"
	: "=&r" (t), "+m" (v->counter)
	: "r" (&v->counter)
	: "cc", "xer");
}

could be used as a template to get:

static __inline__ void raw_cpu_inc_4(__percpu void *v)
{
	int t;

	__asm__ __volatile__(
"1:	lwarx	%0,r31,%2	# percpu_inc\n\
	addic	%0,%0,1\n"
	PPC405_ERR77(0,%2)
"	stwcx.	%0,r31,%2 \n\
	bne-	1b"
	: "=&r" (t), "+m" (v)
	: "r" (&v->counter)
	: "cc", "xer");
}
Kernel build issues after yesterdays merge by Linus
This is under Ubuntu Utopic Unicorn on a Power 8 system while simply trying to build with the Ubuntu standard kernel config. It could be that these issues come about because we do not have an rc1 yet, but I wanted to give some early notice. Also, this is a new arch to me, so I may not be aware of how things work.

1. Bad relocation while building:

root@rd-power8:/rdhome/clameter/linux# make
  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  CALL    scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  SKIPPED include/generated/compile.h
  CALL    arch/powerpc/kernel/systbl_chk.sh
  CALL    arch/powerpc/kernel/prom_init_check.sh
  CHK     kernel/config_data.h
  CALL    arch/powerpc/relocs_check.pl
WARNING: 1 bad relocations
c0cc7df0 R_PPC64_ADDR64 __crc_TOC.

2. "make install" fails:

root@rd-power8:/rdhome/clameter/linux# make install
sh -x /rdhome/clameter/linux/arch/powerpc/boot/install.sh "3.15.0+" vmlinux System.map "/boot" arch/powerpc/boot/zImage.pseries arch/powerpc/boot/zImage.epapr
+ set -e
+ [ -x /home/clameter/bin/installkernel ]
+ [ -x /sbin/installkernel ]
+ exec /sbin/installkernel 3.15.0+ vmlinux System.map /boot arch/powerpc/boot/zImage.pseries arch/powerpc/boot/zImage.epapr
Usage: installkernel
/rdhome/clameter/linux/arch/powerpc/boot/Makefile:393: recipe for target 'install' failed
make[1]: *** [install] Error 1
/rdhome/clameter/linux/arch/powerpc/Makefile:294: recipe for target 'install' failed
make: *** [install] Error 2

3. Ubuntu "make-kpkg" fails:

clameter@rd-power8:~/linux$ fakeroot make-kpkg --initrd --revision 1 kernel_image
exec make kpkg_version=13.013 -f /usr/share/kernel-package/ruleset/minimal.mk debian DEBIAN_REVISION=1 INITRD=YES
== making target debian/stamp/conf/minimal_debian [new prereqs: ] ==
This is kernel package version 13.013.
test -d debian || mkdir debian
test ! -e stamp-building || rm -f stamp-building
install -p -m 755 /usr/share/kernel-package/rules debian/rules
for file in ChangeLog Control Control.bin86 config templates.in rules; do \
	cp -f /usr/share/kernel-package/$file ./debian/; \
done
cp: cannot stat ‘/usr/share/kernel-package/ChangeLog’: No such file or directory
for dir in Config docs examples ruleset scripts pkg po; do \
	cp -af /usr/share/kernel-package/$dir ./debian/; \
done
test -f debian/control || sed -e 's/=V/../g' \
	-e 's/=D/1/g' -e 's/=A/ppc64el/g' \
	-e 's/=SA//g' \
	-e 's/=I//g' \
	-e 's/=CV/./g' \
	-e 's/=M/Unknown Kernel Package Maintainer /g' \
	-e 's/=ST/linux/g' -e 's/=B/ppc64el/g' \
	-e 's/=R//g' /usr/share/kernel-package/Control > debian/control
test -f debian/changelog || sed -e 's/=V/../g' \
	-e 's/=D/1/g' -e 's/=A/ppc64el/g' \
	-e 's/=ST/linux/g' -e 's/=B/ppc64el/g' \
	-e 's/=M/Unknown Kernel Package Maintainer /g' \
	/usr/share/kernel-package/changelog > debian/changelog
chmod 0644 debian/control debian/changelog
test -d ./debian/stamp || mkdir debian/stamp
make -f debian/rules debian/stamp/conf/kernel-conf
make[1]: Entering directory '/rdhome/clameter/linux'
debian/ruleset/misc/checks.mk:36: *** Error. I do not know where the kernel image goes to [kimagedest undefined] The usual case for this is that I could not determine which arch or subarch this machine belongs to. Please specify a subarch, and try again.. Stop.
make[1]: Leaving directory '/rdhome/clameter/linux'
/usr/share/kernel-package/ruleset/minimal.mk:93: recipe for target 'debian/stamp/conf/minimal_debian' failed
make: *** [debian/stamp/conf/minimal_debian] Error 2
Failed to create a ./debian directory: at /usr/bin/make-kpkg line 966.

4. Errors during build: lots of integer to different pointer size conversions?
Re: Node 0 not necessary for powerpc?
On Mon, 19 May 2014, Nishanth Aravamudan wrote:

> I'm seeing a panic at boot with this change on an LPAR which actually
> has no Node 0. Here's what I think is happening:
>
> start_kernel
> ...
> -> setup_per_cpu_areas
>    -> pcpu_embed_first_chunk
>       -> pcpu_fc_alloc
>          -> ___alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu), ...
> -> smp_prepare_boot_cpu
>    -> set_numa_node(boot_cpuid)
>
> So we panic on the NODE_DATA call. It seems that ia64, at least, uses
> pcpu_alloc_first_chunk rather than embed. x86 has some code to handle
> early calls of cpu_to_node (early_cpu_to_node) and sets the mapping for
> all CPUs in setup_per_cpu_areas(). Maybe we can switch ia64 to embed?

Tejun: Why are there these dependencies?

> Thoughts? Does that mean we need something similar to x86 for powerpc?

Tejun is the expert in this area. CCing him.
Re: Bug in reclaim logic with exhausted nodes?
On Mon, 31 Mar 2014, Nishanth Aravamudan wrote:

> Yep. The node exists, it's just fully exhausted at boot (due to the
> presence of 16GB pages reserved at boot-time).

Well, if you want us to support that then I guess you need to propose patches to address this issue.

> I'd appreciate a bit more guidance? I'm suggesting that in this case the
> node functionally has no memory. So the page allocator should not allow
> allocations from it -- except (I need to investigate this still)
> userspace accessing the 16GB pages on that node, but that, I believe,
> doesn't go through the page allocator at all, it's all from hugetlb
> interfaces. It seems to me there is a bug in SLUB that we are noting
> that we have a useless per-node structure for a given nid, but not
> actually preventing requests to that node or reclaim because of those
> allocations.

Well, if you can address that without impacting the fastpath then we could do this. Otherwise we would need a fake structure here to avoid adding checks to the fastpath.

> I think there is a logical bug (even if it only occurs in this
> particular corner case) where if reclaim progresses for a THISNODE
> allocation, we don't check *where* the reclaim is progressing, and thus
> may falsely be indicating that we have done some progress when in fact
> the allocation that is causing reclaim will not possibly make any more
> progress.

OK, maybe we could address this corner case. How would you do this?
Re: Bug in reclaim logic with exhausted nodes?
On Thu, 27 Mar 2014, Nishanth Aravamudan wrote:

> > That looks to be the correct way to handle things. Maybe mark the node as
> > offline or somehow not present so that the kernel ignores it.
>
> This is a SLUB condition:
>
> mm/slub.c::early_kmem_cache_node_alloc():
> ...
>	page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
> ...

So the page allocation from the node failed. We have a strange boot condition where the OS is aware of a node but allocations on that node fail.

>	if (page_to_nid(page) != node) {
>		printk(KERN_ERR "SLUB: Unable to allocate memory from node %d\n", node);
>		printk(KERN_ERR "SLUB: Allocating a useless per node structure in order to be able to continue\n");
>	}
> ...
>
> Since this is quite early, and we have not set up the nodemasks yet,
> does it make sense to perhaps have a temporary init-time nodemask that
> we set bits in here, and "fix-up" those nodes when we setup the
> nodemasks?

Please take care of this earlier than this. The page allocator in general should allow allocations from all nodes with memory during boot.
Re: Bug in reclaim logic with exhausted nodes?
On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:

> On power, very early, we find the 16G pages (gpages in the powerpc arch
> code) in the device-tree:
>
> early_setup ->
>   early_init_mmu ->
>     htab_initialize ->
>       htab_init_page_sizes ->
>         htab_dt_scan_hugepage_blocks ->
>           memblock_reserve
>             which marks the memory as reserved
>           add_gpage
>             which saves the address off so future calls for
>             alloc_bootmem_huge_page()
>
> hugetlb_init ->
>   hugetlb_init_hstates ->
>     hugetlb_hstate_alloc_pages ->
>       alloc_bootmem_huge_page
>
> > Not sure if I understand that correctly.
>
> Basically this is present memory that is "reserved" for the 16GB usage
> per the LPAR configuration. We honor that configuration in Linux based
> upon the contents of the device-tree. It just so happens in the
> configuration from my original e-mail that a consequence of this is that
> a NUMA node has memory (topologically), but none of that memory is free,
> nor will it ever be free.

Well, don't do that.

> Perhaps, in this case, we could just remove that node from the N_MEMORY
> mask? Memory allocations will never succeed from the node, and we can
> never free these 16GB pages. It is really not any different than a
> memoryless node *except* when you are using the 16GB pages.

That looks to be the correct way to handle things. Maybe mark the node as offline or somehow not present so that the kernel ignores it.
Re: Bug in reclaim logic with exhausted nodes?
On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:

> On 25.03.2014 [11:17:57 -0500], Christoph Lameter wrote:
> > On Mon, 24 Mar 2014, Nishanth Aravamudan wrote:
> > > Anyone have any ideas here?
> >
> > Don't do that? Check on boot to not allow exhausting a node with huge
> > pages?
>
> Gigantic hugepages are allocated by the hypervisor (not the Linux VM),

Ok, so the kernel starts booting up and then suddenly the hypervisor takes the 2 16G pages before even the slab allocator is working? Not sure if I understand that correctly.
Re: Bug in reclaim logic with exhausted nodes?
On Mon, 24 Mar 2014, Nishanth Aravamudan wrote:

> Anyone have any ideas here?

Don't do that? Check on boot to not allow exhausting a node with huge pages?
Re: Node 0 not necessary for powerpc?
On Tue, 11 Mar 2014, Nishanth Aravamudan wrote:

> I have a P7 system that has no node0, but a node0 shows up in numactl
> --hardware, which has no cpus and no memory (and no PCI devices):

Well, as you can see from the code, the assumption so far has been that node 0 has memory. I have never run a machine that has no node 0 memory.
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Mon, 24 Feb 2014, Joonsoo Kim wrote:

> > It will not commonly get there because of the tracking. Instead a per cpu
> > object will be used.
> > > get_partial_node() always fails even if there are some partial slabs on
> > > the memoryless node's nearest node.
> >
> > Correct, and that leads to a page allocator action whereupon the node will
> > be marked as empty.
>
> Why do we need to request to a page allocator if there is a partial slab?
> Checking whether a node is memoryless or not is really easy, so we don't need
> to skip this. To skip this is a suboptimal solution.

The page allocator action is also used to determine to which other node we should fall back if the node is empty. So we need to call the page allocator when the per cpu slab is exhausted with the node of the memoryless node to get memory from the proper fallback node.
Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup
On Fri, 21 Feb 2014, Nishanth Aravamudan wrote:

> I added two calls to local_memory_node(), I *think* both are necessary,
> but am willing to be corrected.
>
> One is in map_cpu_to_node() and one is in start_secondary(). The
> start_secondary() path is fine, AFAICT, as we are up & running at that
> point. But in [the renamed function] update_numa_cpu_node() which is
> used by hotplug, we get called from do_init_bootmem(), which is before
> the zonelists are setup.
>
> I think both calls are necessary because I believe the
> arch_update_cpu_topology() is used for supporting firmware-driven
> home-noding, which does not invoke start_secondary() again (the
> processor is already running, we're just updating the topology in that
> situation).
>
> Then again, I could special-case the do_init_bootmem callpath, which is
> only called at kernel init time?

Well, that looks to be simpler.

> > I do agree that calling local_memory_node() too early then trying to
> > fudge around the consequences seems rather wrong.
>
> If the answer is to simply not call local_memory_node() early, I'll
> submit a patch to at least add a comment, as there's nothing in the code
> itself to prevent this from happening and it is guaranteed to oops.

Ok.
Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup
On Wed, 19 Feb 2014, Nishanth Aravamudan wrote:

> We can call local_memory_node() before the zonelists are setup. In that
> case, first_zones_zonelist() will not set zone and the reference to
> zone->node will Oops. Catch this case, and, since we're presumably running
> very early, just return that any node will do.

Really? Isn't there some way to avoid this call if the zonelists are not setup yet?
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Wed, 19 Feb 2014, David Rientjes wrote:

> On Tue, 18 Feb 2014, Christoph Lameter wrote:
>
> > Its an optimization to avoid calling the page allocator to figure out if
> > there is memory available on a particular node.
>
> Thus this patch breaks with memory hot-add for a memoryless node.

As soon as the per cpu slab is exhausted, the node number of the so far "empty" node will be used for allocation. That will be successful and the node will no longer be marked as empty.
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> the performance impact of the underlying NUMA configuration. I guess we
> could special-case memoryless/cpuless configurations somewhat, but I
> don't think there's any reason to do that if we can make memoryless-node
> support work in-kernel?

Well, we can make it work in-kernel, but it has always been a bit wacky (as is the idea of NUMA "memory" nodes without memory).
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> We use the topology provided by the hypervisor, it does actually reflect
> where CPUs and memory are, and their corresponding performance/NUMA
> characteristics.

And so there are actually nodes without memory that have processors? Can the hypervisor or the Linux arch code be convinced to ignore nodes without memory, or assign a sane default node to processors?

> > Ok then also move the memory of the local node somewhere?
>
> This happens below the OS, we don't control the hypervisor's decisions.
> I'm not sure if that's what you are suggesting.

You could also do this from the powerpc arch code by sanitizing the processor/node information that is then used by Linux.
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> Well, on powerpc, with the hypervisor providing the resources and the
> topology, you can have cpuless and memoryless nodes. I'm not sure how
> "fake" the NUMA is -- as I think since the resources are virtualized to
> be one system, it's logically possible that the actual topology of the
> resources can be CPUs from physical node 0 and memory from physical node
> 2. I would think with KVM on a sufficiently large (physically NUMA
> x86_64) and loaded system, one could cause the same sort of
> configuration to occur for a guest?

Ok, but since you have a virtualized environment: why not provide a fake home node with fake memory that could be anywhere? This would avoid the whole problem of supporting such a config at the kernel level. Do not have a fake node that has no memory.

> In any case, these configurations happen fairly often on long-running
> (not rebooted) systems as LPARs are created/destroyed, resources are
> DLPAR'd in and out of LPARs, etc.

Ok, then also move the memory of the local node somewhere?

> I might look into it, as it might have sped up testing these changes.

I guess that will be necessary in order to support the memoryless nodes long term.
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Mon, 17 Feb 2014, Joonsoo Kim wrote:

> On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> > Hi Joonsoo,
> > Also, given that only ia64 and (hopefully soon) ppc64 can set
> > CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> > memoryless nodes present? Even with fakenuma? Just curious.

x86_64 currently does not support memoryless nodes, otherwise it would have set CONFIG_HAVE_MEMORYLESS_NODES in the Kconfig. Memoryless nodes are a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning memory) with processors. A memoryless node means that we have a fake NUMA node without memory but just processors. Not very efficient. Not sure why people use these configurations.

> I don't know, because I'm not an expert on NUMA systems :)
> At first glance, fakenuma can't be used for testing
> CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.

Well, yeah. You'd have to do some mods to enable that testing.
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Mon, 17 Feb 2014, Joonsoo Kim wrote:

> On Wed, Feb 12, 2014 at 04:16:11PM -0600, Christoph Lameter wrote:
> > Here is another patch with some fixes. The additional logic is only
> > compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> >
> > Subject: slub: Memoryless node support
> >
> > Support memoryless nodes by tracking which allocations are failing.
>
> I still don't understand why this tracking is needed.

It's an optimization to avoid calling the page allocator to figure out if there is memory available on a particular node.

> All we need for an allocation targeted to a memoryless node is to fall back
> to a proper node, that is, the numa_mem_id() node of the targeted node. My
> previous patch implements it and uses the proper fallback node on every
> allocation code path. Why is this tracking needed? Please elaborate more on this.

It's too slow to do that on every alloc. One needs to be able to satisfy most allocations without switching percpu slabs for optimal performance.

> > Allocations targeted to the nodes without memory fall back to the
> > current available per cpu objects and if that is not available will
> > create a new slab using the page allocator to fallback from the
> > memoryless node to some other node.

And what about the next alloc? Assume there are N allocs from a memoryless node; does this mean we push back the partial slab on each alloc and then fall back?

> > {
> > 	void *object;
> > -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> >
> > 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > 	if (object || node != NUMA_NO_NODE)
>
> This isn't enough.
> Consider an allocation targeted to a memoryless node.

It will not commonly get there because of the tracking. Instead a per cpu object will be used.

> get_partial_node() always fails even if there are some partial slabs on
> the memoryless node's nearest node.
Correct, and that leads to a page allocator action whereupon the node will be marked as empty.

> We should fall back to some proper node in this case, since there is no slab
> on the memoryless node.

NUMA is about optimization of memory allocations. It is often *not* about correctness; heuristics are used in many cases. F.e. see the zone reclaim logic, zone reclaim mode, fallback scenarios in the page allocator, etc etc.
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
Here is another patch with some fixes. The additional logic is only compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.

Subject: slub: Memoryless node support

Support memoryless nodes by tracking which allocations are failing. Allocations targeted to the nodes without memory fall back to the currently available per cpu objects and, if that is not available, will create a new slab using the page allocator to fall back from the memoryless node to some other node.

Signed-off-by: Christoph Lameter

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c	2014-02-12 16:07:48.957869570 -0600
+++ linux/mm/slub.c	2014-02-12 16:09:22.198928260 -0600
@@ -134,6 +134,10 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+static nodemask_t empty_nodes;
+#endif
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1409,28 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+		if (node != NUMA_NO_NODE)
+			node_set(node, empty_nodes);
+#endif
 		goto out;
+	}

 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+	node_clear(alloc_node, empty_nodes);
+	if (node != NUMA_NO_NODE && alloc_node != node)
+		node_set(node, empty_nodes);
+#endif
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
 			struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2117,8 +2133,19 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node = page_to_nid(page);
+
+	if (!page)
 		return 0;
+
+	if (node != NUMA_NO_NODE) {
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+		if (node_isset(node, empty_nodes))
+			return 1;
+#endif
+		if (page_node != node)
+			return 0;
+	}
 #endif
 	return 1;
 }
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Mon, 10 Feb 2014, Joonsoo Kim wrote:

> On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> >
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if its there.
>
> Why not inspect whether we can get the page on the best node, such as the
> numa_mem_id() node?

It's expensive to do so.

> empty_node cannot be set on a memoryless node, since page allocation would
> succeed on a different node.

Ok, then we need to add a check for being on the right node there too.
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
Here is a draft of a patch to make this work with memoryless nodes.

The first thing is that we modify node_match to also match if we hit an empty node. In that case we simply take the current slab if it's there. If there is no current slab then a regular allocation occurs with the memoryless node. The page allocator will fall back to a possible node and that will become the current slab. The next alloc from a memoryless node will then use that slab.

For that we also add some tracking of allocations on nodes that were not satisfied using the empty_node[] array. A successful alloc on a node clears that flag.

I would rather avoid the empty_node[] array since it's global and there may be thread specific allocation restrictions, but it would be expensive to do an allocation attempt via the page allocator to make sure that there is really no page available from the page allocator.

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
+++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
@@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+static int empty_node[MAX_NUMNODES];
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+		if (node != NUMA_NO_NODE)
+			empty_node[node] = 1;
 		goto out;
+	}

 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+	empty_node[alloc_node] = 0;
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
 			struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node;
+
+	/* No data means no match */
+	if (!page)
 		return 0;
+
+	/* Node does not matter. Therefore anything is a match */
+	if (node == NUMA_NO_NODE)
+		return 1;
+
+	/* Did we hit the requested node ? */
+	page_node = page_to_nid(page);
+	if (page_node == node)
+		return 1;
+
+	/* If the node has available data then we can use it. Mismatch */
+	return !empty_node[page_node];
+
+	/* Target node empty so just take anything */
 #endif
 	return 1;
 }
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > It seems like a better approach would be to do this when a node is brought
> > online and determine the fallback node based not on the zonelists as you
> > do here but rather on locality (such as through a SLIT if provided, see
> > node_distance()).
>
> Hmm... I guess that the zonelist is based on locality. The zonelist is
> generated using node_distance(), so I think that it reflects locality.
> But, I'm not an expert on NUMA, so please let me know what I am missing here :)

The next node can be found by going through the zonelist of a node and checking for available memory. See fallback_alloc(). There is a function node_distance() that determines the relative performance of a memory access from one node to the other. The building of the fallback list for every node in build_zonelists() relies on that.
Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > This check would need to be something that checks for other contingencies
> > in the page allocator as well. A simple solution would be to actually run
> > a GFP_THISNODE alloc to see if you can grab a page from the proper node.
> > If that fails then fallback. See how fallback_alloc() does it in slab.
>
> Hello, Christoph.
>
> This !node_present_pages() ensures that allocation on this node cannot succeed,
> so we can directly use numa_mem_id() here.

Yes, of course we can use numa_mem_id(). But the check is only for not having any memory at all on a node. There are other reasons for allocations to fail on a certain node. The node could have memory that cannot be reclaimed, all dirty, beyond certain thresholds, not in the current set of allowed nodes, etc etc.
Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index cc1f995..c851f82 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1700,6 +1700,14 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> 	void *object;
> 	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
>
> +	if (node == NUMA_NO_NODE)
> +		searchnode = numa_mem_id();
> +	else {
> +		searchnode = node;
> +		if (!node_present_pages(node))

This check would need to be something that checks for other contingencies in the page allocator as well. A simple solution would be to actually run a GFP_THISNODE alloc to see if you can grab a page from the proper node. If that fails then fallback. See how fallback_alloc() does it in slab.

> +			searchnode = get_numa_mem(node);
> +	}
> @@ -2277,11 +2285,18 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> redo:
>
> 	if (unlikely(!node_match(page, node))) {
> -		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +		int searchnode = node;
> +
> +		if (node != NUMA_NO_NODE && !node_present_pages(node))

Same issue here. I would suggest not deactivating the slab and first checking if the node has no pages. If so then just take an object from the current cpu slab. If that is not available, do an allocation from the indicated node and take whatever the page allocator gave you.
Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
On Thu, 6 Feb 2014, David Rientjes wrote:

> I think you'll need to send these to Andrew since he appears to be picking
> up slub patches these days.

I can start managing merges again if Pekka no longer has the time.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Wed, 5 Feb 2014, Nishanth Aravamudan wrote:

> > Right so if we are ignoring the node then the simplest thing to do is to
> > not deactivate the current cpu slab but to take an object from it.
>
> Ok, that's what Anton's patch does, I believe. Are you ok with that
> patch as it is?

No. Again, his patch only works if the node is memoryless, not if there are other issues that prevent allocation from that node.
Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> Currently, if the allocation constraint to node is NUMA_NO_NODE, we search
> a partial slab on the numa_node_id() node. This doesn't work properly on a
> system having a memoryless node, since it can have no memory on that node and
> there must be no partial slab on that node.
>
> On that node, page allocation always falls back to numa_mem_id() first. So
> searching a partial slab on numa_mem_id() in that case is the proper solution
> for the memoryless node case.

Acked-by: Christoph Lameter
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:

> > If the target node allocation fails (for whatever reason) then I would
> > recommend for simplicity's sake to change the target node to
> > NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> > complex solution would be to look through partial lists in increasing
> > distance to find a partially used slab that is reasonably close to the
> > current node. Slab has logic like that in fallback_alloc(). Slub's
> > get_any_partial() function does something close to what you want.
>
> I apologize for my own ignorance, but I'm having trouble following.
> Anton's original patch did fall back to the current cpu slab, but I'm not
> sure any NUMA_NO_NODE change is necessary there. At the point we're
> deactivating the slab (in the current code, in __slab_alloc()), we have
> successfully allocated from somewhere, it's just not on the node we
> expected to be on.

Right, so if we are ignoring the node then the simplest thing to do is to not deactivate the current cpu slab but to take an object from it.

> So perhaps you are saying to make a change lower in the code? I'm not
> sure where it makes sense to change the target node in that case. I'd
> appreciate any guidance you can give.

This is not an easy thing to do. If the current slab is not on the right node but would be the node from which the page allocator would be returning memory, then the current slab can still be allocated from. If the fallback is to another node then the current cpu slab needs to be deactivated and the allocation from that node needs to proceed. Have a look at fallback_alloc() in the slab allocator. An allocation attempt from the page allocator can be restricted to a specific node through GFP_THISNODE.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> the $SUBJECT issue.

Hmmm... I am not sure that this is a general solution. The fallback to other nodes cannot only occur because a node has no memory, as his patch assumes.

If the target node allocation fails (for whatever reason) then I would recommend for simplicity's sake to change the target node to NUMA_NO_NODE and just take whatever is in the current cpu slab. A more complex solution would be to look through partial lists in increasing distance to find a partially used slab that is reasonably close to the current node. Slab has logic like that in fallback_alloc(). Slub's get_any_partial() function does something close to what you want.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> So what's the status of this patch? Christoph, do you think this is fine
> as it is?

Certainly enabling CONFIG_HAVE_MEMORYLESS_NODES is the right thing to do, and I already acked the patch.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Wed, 29 Jan 2014, Nishanth Aravamudan wrote:

> exactly what the caller intends.
>
>	int searchnode = node;
>	if (node == NUMA_NO_NODE)
>		searchnode = numa_mem_id();
>	if (!node_present_pages(node))
>		searchnode = local_memory_node(node);
>
> The difference in semantics from the previous is that here, if we have a
> memoryless node, rather than using the CPU's nearest NUMA node, we use
> the NUMA node closest to the requested one?

The idea here is that the page allocator will do the fallback to other nodes. This check for !node_present should not be necessary. SLUB needs to accept the page from whatever node the page allocator returned and work with that.

The problem is that the check for having a slab from the "right" node may fail again after another attempt to allocate from the same node. SLUB will then push the slab from the *wrong* node back to the partial lists and may attempt another allocation that will again be successful but return memory from another node. That way the partial lists of a particular node grow uselessly.

One way to solve this may be to check if memory is actually allocated from the requested node and fall back to NUMA_NO_NODE (which will use the last allocated slab) for future allocs if the page allocator returned memory from a different node (unless GFP_THISNODE is set, of course). Otherwise we end up replicating the page allocator logic in slub like in slab. That is what I wanted to avoid.
Re: [PATCH] powerpc: enable CONFIG_HAVE_MEMORYLESS_NODES
On Tue, 28 Jan 2014, Nishanth Aravamudan wrote:

> Anton Blanchard found an issue with an LPAR that had no memory in Node
> 0. Christoph Lameter recommended, as one possible solution, to use
> numa_mem_id() for locality of the nearest memory node-wise. However,
> numa_mem_id() [and the other related APIs] are only useful if
> CONFIG_HAVE_MEMORYLESS_NODES is set. This is only the case for ia64
> currently, but clearly we can have memoryless nodes on ppc64. Add the
> Kconfig option and define it to be the same value as CONFIG_NUMA.

Well, this is trivial, but if you need encouragement:

Reviewed-by: Christoph Lameter
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, 28 Jan 2014, Nishanth Aravamudan wrote: > This helps about the same as David's patch -- but I found the reason > why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch > shortly for that and one other case I found. Oww... ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

> What I find odd is that there are only 2 nodes on this system, node 0
> (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> should be coming from node 1 (thus node_match() should always be true?)

Well yes, that occurs if you specify the node or just always use the default memory allocation policy. In order to spread the allocations over both nodes you would have to set the task's memory allocation policy to MPOL_INTERLEAVE.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

> As to cpu_to_node() being passed to kmalloc_node(), I think an
> appropriate fix is to change that to cpu_to_mem()?

Yup.

> > Yeah, the default policy should be to fallback to local memory if the node
> > passed is memoryless.
>
> Thanks!

I would suggest using NUMA_NO_NODE instead. That will fit any slab that we may be currently allocating from or can get hold of, and is most efficient.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, 24 Jan 2014, David Rientjes wrote:

> kmalloc_node(nid) and kmem_cache_alloc_node(nid) should fallback to nodes
> other than nid when memory can't be allocated, these functions only
> indicate a preference.

The nid passed indicates a preference, unless __GFP_THISNODE is specified. Then the allocation must occur on that node.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, 24 Jan 2014, Wanpeng Li wrote:

> > diff --git a/mm/slub.c b/mm/slub.c
> > index 545a170..a1c6040 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >  	void *object;
> >  	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

This needs to be numa_mem_id(), and numa_mem_id() would need to be consistently used.

> > +	if (!node_present_pages(searchnode))
> > +		searchnode = numa_mem_id();

Probably won't need that?

> > +
> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
> >  		return object;
>
> The bug still can't be fixed w/ this patch.

Some more detail would be good. If memory is requested from a particular node then it would be best to use one that has memory. Callers also may have used numa_node_id() and that also would need to be fixed.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Mon, 20 Jan 2014, Wanpeng Li wrote: > >+ enum zone_type high_zoneidx = gfp_zone(flags); > > > >+ if (!node_present_pages(searchnode)) { > >+ zonelist = node_zonelist(searchnode, flags); > >+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { > >+ searchnode = zone_to_nid(zone); > >+ if (node_present_pages(searchnode)) > >+ break; > >+ } > >+ } > >object = get_partial_node(s, get_node(s, searchnode), c, flags); > >if (object || node != NUMA_NO_NODE) > >return object; > > > > The patch fix the bug. However, the kernel crashed very quickly after running > stress tests for a short while: This is not a good way of fixing it. How about not asking for memory from nodes that are memoryless? Use numa_mem_id() which gives you the next node that has memory instead of numa_node_id() (gives you the current node regardless if it has memory or not). [ 287.464285] Unable to handle kernel paging request for data at address 0x0001 [ 287.464289] Faulting instruction address: 0xc0445af8 [ 287.464294] Oops: Kernel access of bad area, sig: 11 [#1] [ 287.464296] SMP NR_CPUS=2048 NUMA pSeries [ 287.464301] Modules linked in: btrfs raid6_pq xor dm_service_time sg nfsv3 arc4 md4 rpcsec_gss_krb5 nfsv4 nls_utf8 cifs nfs fscache dns_resolver nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables ext4 mbcache jbd2 ibmvfc scsi_transport_fc ibmveth nx_crypto pseries_rng nfsd auth_rpcgss nfs_acl lockd binfmt_misc sunrpc uinput dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ibmvscsi scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [ 287.464374] CPU: 0 PID: 0 Comm: swapper/0 
Not tainted 3.10.0-71.el7.91831.ppc64 #1 [ 287.464378] task: c0fde590 ti: c001fffd task.ti: c10a4000 [ 287.464382] NIP: c0445af8 LR: c0445bcc CTR: c0445b90 [ 287.464385] REGS: c001fffd38e0 TRAP: 0300 Not tainted (3.10.0-71.el7.91831.ppc64) [ 287.464388] MSR: 80009032 CR: 88002084 XER: 0001 [ 287.464397] SOFTE: 0 [ 287.464398] CFAR: c000908c [ 287.464401] DAR: 0001, DSISR: 4000 [ 287.464403] GPR00: d3649a04 c001fffd3b60 c10a94d0 0003 GPR04: c0018d841048 c001fffd3bd0 0012 d364eff0 GPR08: c001fffd3bd0 0001 d364d688 c0445b90 GPR12: d364b960 c7e0 042ac510 0060 GPR16: 0020 fb19 c1122100 GPR20: c0a94680 c1122180 c0a94680 000a GPR24: 0100 0001 c001ef90 GPR28: c001d6c066f0 c001aea03520 c001bc9a2640 c0018d841680 [ 287.464447] NIP [c0445af8] .__dev_printk+0x28/0xc0 [ 287.464450] LR [c0445bcc] .dev_printk+0x3c/0x50 [ 287.464453] PACATMSCRATCH [80009032] [ 287.464455] Call Trace: [ 287.464458] [c001fffd3b60] [c001fffd3c00] 0xc001fffd3c00 (unreliable) [ 287.464467] [c001fffd3bf0] [d3649a04] .ibmvfc_scsi_done+0x334/0x3e0 [ibmvfc] [ 287.464474] [c001fffd3cb0] [d36495b8] .ibmvfc_handle_crq+0x2e8/0x320 [ibmvfc] [ 287.464488] [c001fffd3d30] [d3649fe4] .ibmvfc_tasklet+0xd4/0x250 [ibmvfc] [ 287.464494] [c001fffd3de0] [c009b46c] .tasklet_action+0xcc/0x1b0 [ 287.464498] [c001fffd3e90] [c009a668] .__do_softirq+0x148/0x360 [ 287.464503] [c001fffd3f90] [c00218a8] .call_do_softirq+0x14/0x24 [ 287.464507] [c001fffcfdf0] [c00107e0] .do_softirq+0xd0/0x100 [ 287.464511] [c001fffcfe80] [c009aba8] .irq_exit+0x1b8/0x1d0 [ 287.464514] [c001fffcff10] [c0010410] .__do_irq+0xc0/0x1e0 [ 287.464518] [c001fffcff90] [c00218cc] .call_do_irq+0x14/0x24 [ 287.464522] [c10a76d0] [c00105bc] .do_IRQ+0x8c/0x100 [ 287.464527] --- Exception: 501 at 0x [ 287.464527] LR = .arch_local_irq_restore+0x74/0x90 [ 287.464533] [c10a7770] [c0002494] hardware_interrupt_common+0x114/0x180 (unreliable) [ 287.464540] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4 [ 287.464540] LR = .check_and_cede_processor+0x24/0x40 [ 
287.464546] [c10a7a60] [0001] 0x1 (unreliable) [ 287.464550] [c10a7ad0] [c0074ecc] .shared_cede_loop+0x2c/0x70 [ 287.464555] [c10a7b50] [c0553
Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver
On Tue, 6 Aug 2013, Wladislav Wiebe wrote:

> ok, just saw in slab/for-linus branch that those stuff is reverted again..

No, that was only for the 3.11 merge by Linus. The 3.12 patches have not been put into Pekka's tree.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver
On Wed, 31 Jul 2013, Wladislav Wiebe wrote:

> Thanks for the point, do you plan to make kmalloc_large available for extern
> access in a separate mainline patch?
> Since kmalloc_large is statically defined in slub_def.h and when including it
> to seq_file.c
> we have a lot of conflicting types:

You cannot separately include slub_def.h. slab.h includes slub_def.h for you. What problem did you try to fix by doing so?

There is a patch pending that moves kmalloc_large to slab.h. So maybe we have to wait a merge period in order to be able to use it with other allocators than slub.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver
Crap you cannot do PAGE_SIZE allocations with kmalloc_large. Fails when freeing pages. Need to only do the multiple page allocs with kmalloc_large.

Subject: seq_file: Use kmalloc_large for page sized allocation

There is no point in using the slab allocation functions for large page order allocation. Use kmalloc_large().

This fixes the warning about large allocs but it will still cause large contiguous allocs that could fail because of memory fragmentation.

Signed-off-by: Christoph Lameter

Index: linux/fs/seq_file.c
===================================================================
--- linux.orig/fs/seq_file.c	2013-07-31 10:39:03.050472030 -0500
+++ linux/fs/seq_file.c	2013-07-31 10:39:03.050472030 -0500
@@ -136,7 +136,7 @@ static int traverse(struct seq_file *m,
 Eoverflow:
 	m->op->stop(m, p);
 	kfree(m->buf);
-	m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+	m->buf = kmalloc_large(m->size <<= 1, GFP_KERNEL);
 	return !m->buf ? -ENOMEM : -EAGAIN;
 }
 
@@ -232,7 +232,7 @@ ssize_t seq_read(struct file *file, char
 		goto Fill;
 	m->op->stop(m, p);
 	kfree(m->buf);
-	m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+	m->buf = kmalloc_large(m->size <<= 1, GFP_KERNEL);
 	if (!m->buf)
 		goto Enomem;
 	m->count = 0;

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver
This patch will suppress the warnings by using the page allocator wrappers of the slab allocators. These are page sized allocs after all.

Subject: seq_file: Use kmalloc_large for page sized allocation

There is no point in using the slab allocation functions for large page order allocation. Use the kmalloc_large() wrappers which will cause calls to the page allocator instead.

This fixes the warning about large allocs but it will still cause high order allocs to occur that could fail because of memory fragmentation. Maybe switch to vmalloc if we really want to allocate multi megabyte buffers for proc fs?

Signed-off-by: Christoph Lameter

Index: linux/fs/seq_file.c
===================================================================
--- linux.orig/fs/seq_file.c	2013-07-10 14:03:15.367134544 -0500
+++ linux/fs/seq_file.c	2013-07-31 10:11:42.671736131 -0500
@@ -96,7 +96,7 @@ static int traverse(struct seq_file *m,
 		return 0;
 	}
 	if (!m->buf) {
-		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+		m->buf = kmalloc_large(m->size = PAGE_SIZE, GFP_KERNEL);
 		if (!m->buf)
 			return -ENOMEM;
 	}
@@ -136,7 +136,7 @@ static int traverse(struct seq_file *m,
 Eoverflow:
 	m->op->stop(m, p);
 	kfree(m->buf);
-	m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+	m->buf = kmalloc_large(m->size <<= 1, GFP_KERNEL);
 	return !m->buf ? -ENOMEM : -EAGAIN;
 }
 
@@ -191,7 +191,7 @@ ssize_t seq_read(struct file *file, char
 
 	/* grab buffer if we didn't have one */
 	if (!m->buf) {
-		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+		m->buf = kmalloc_large(m->size = PAGE_SIZE, GFP_KERNEL);
 		if (!m->buf)
 			goto Enomem;
 	}
@@ -232,7 +232,7 @@ ssize_t seq_read(struct file *file, char
 		goto Fill;
 	m->op->stop(m, p);
 	kfree(m->buf);
-	m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+	m->buf = kmalloc_large(m->size <<= 1, GFP_KERNEL);
 	if (!m->buf)
 		goto Enomem;
 	m->count = 0;

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver
On Wed, 31 Jul 2013, Wladislav Wiebe wrote:

> on a PPC 32-Bit board with a Linux Kernel v3.10.0 I see trouble with
> kmalloc_slab.
> Basically at system startup, something request a size of 8388608 b,
> but KMALLOC_MAX_SIZE has 4194304 b in our case. It points a WARNING at:
> ..
> NIP [c0099fec] kmalloc_slab+0x60/0xe8
> LR [c0099fd4] kmalloc_slab+0x48/0xe8
> Call Trace:
> [ccd3be60] [c0099fd4] kmalloc_slab+0x48/0xe8 (unreliable)
> [ccd3be70] [c00ae650] __kmalloc+0x20/0x1b4
> [ccd3be90] [c00d46f4] seq_read+0x2a4/0x540
> [ccd3bee0] [c00fe09c] proc_reg_read+0x5c/0x90
> [ccd3bef0] [c00b4e1c] vfs_read+0xa4/0x150
> [ccd3bf10] [c00b500c] SyS_read+0x4c/0x84
> [ccd3bf40] [c000be80] ret_from_syscall+0x0/0x3c
> ..
>
> Do you have any idea how I can analyze where these 8388608 b coming from?

It comes from the kmalloc in seq_read(). An 8M read from the proc filesystem? Wow. Maybe switch the kmalloc to vmalloc()?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
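The 8388608 b figure falls straight out of the doubling in traverse()/seq_read(): the buffer starts at PAGE_SIZE and is doubled (`m->size <<= 1`) on every overflow, so with 4 KiB pages the first size past the 4 MiB KMALLOC_MAX_SIZE is exactly 8 MiB. A standalone C sketch of that growth (the macro and function names here are invented; the 4 KiB/4 MiB values are the ones quoted in the report):

```c
#define PAGE_SIZE_B	4096UL		/* assumed 4 KiB pages */
#define KMALLOC_MAX_B	(4UL << 20)	/* 4 MiB KMALLOC_MAX_SIZE on this config */

/*
 * Mimic seq_file's m->size <<= 1 growth until kmalloc_slab() would warn:
 * return the first size exceeding the kmalloc limit, count the doublings.
 */
static unsigned long first_oversized(unsigned long start, unsigned long max,
				     int *doublings)
{
	unsigned long size = start;

	*doublings = 0;
	while (size <= max) {
		size <<= 1;	/* the m->size <<= 1 in traverse()/seq_read() */
		(*doublings)++;
	}
	return size;
}
```

Eleven doublings take 4096 bytes to 8388608 bytes, which is precisely the size in the warning.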
Re: [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs
On Thu, 27 Dec 2012, Tang Chen wrote:

> On 12/26/2012 11:30 AM, Kamezawa Hiroyuki wrote:
> >> @@ -41,6 +42,7 @@ struct firmware_map_entry {
> >>	const char *type;	/* type of the memory range */
> >>	struct list_head list;	/* entry for the linked list */
> >>	struct kobject kobj;	/* kobject for each entry */
> >> +	unsigned int bootmem:1;	/* allocated from bootmem */
> >> };
> >
> > Can't we detect from which the object is allocated from, slab or bootmem ?
> >
> > Hm, for example,
> >
> > PageReserved(virt_to_page(address_of_obj)) ?
> > PageSlab(virt_to_page(address_of_obj)) ?
>
> Hi Kamezawa-san,
>
> I think we can detect it without a new member. I think bootmem:1 member
> is just for convenience. I think I can remove it. :)

Larger size slab allocations may fall back to the page allocator, but then the slab allocator does not track this allocation. That memory can be freed using the page allocator. If you see PageSlab then you can always free using the slab allocator. Otherwise the page allocator should work (unless it was some special case bootmem allocation).

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
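The dispatch described here can be sketched as a decision function. This is a userspace mock: the page-flag tests are stubbed out as an enum, since the real PageSlab()/PageReserved() macros take a struct page, and all names below are invented for illustration.

```c
/* Stand-ins for the real page flag predicates on virt_to_page(obj). */
enum page_kind { PAGE_SLAB, PAGE_RESERVED, PAGE_BUDDY };

enum free_via { FREE_VIA_SLAB, FREE_VIA_BOOTMEM, FREE_VIA_PAGE_ALLOC };

/*
 * Decide how to free a firmware_map_entry without a bootmem:1 flag:
 * PageSlab     -> slab memory, kfree() it;
 * PageReserved -> it came from bootmem and needs special handling;
 * otherwise    -> a large kmalloc fell back to the page allocator,
 *                 so freeing via the page allocator is correct.
 */
static enum free_via how_to_free(enum page_kind k)
{
	switch (k) {
	case PAGE_SLAB:
		return FREE_VIA_SLAB;
	case PAGE_RESERVED:
		return FREE_VIA_BOOTMEM;
	default:
		return FREE_VIA_PAGE_ALLOC;
	}
}
```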
Re: [RFC PATCH v3 0/13] memory-hotplug : hot-remove physical memory
On Mon, 9 Jul 2012, Yasuaki Ishimatsu wrote: > Even if you apply these patches, you cannot remove the physical memory > completely since these patches are still under development. I want you to > cooperate to improve the physical memory hot-remove. So please review these > patches and give your comment/idea. Could you at least give a method on how you want to do physical memory removal? You would have to remove all objects from the range you want to physically remove. That is only possible under special circumstances and with a limited set of objects. Even if you exclusively use ZONE_MOVEABLE you still may get cases where pages are pinned for a long time. I am not sure that these patches are useful unless we know where you are going with this. If we end up with a situation where we still cannot remove physical memory then this patchset is not helpful. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH SLAB 1/2 v3] duplicate the cache name in SLUB's saved_alias list, SLAB, and SLOB
> I was pointed by Glauber to the slab common code patches. I need some
> more time to read the patches. Now I think the slab/slot changes in this
> v3 are not needed, and can be ignored.

That may take some kernel cycles. You have a current issue here that needs to be fixed.

> >	down_write(&slub_lock);
> > -	s = find_mergeable(size, align, flags, name, ctor);
> > +	s = find_mergeable(size, align, flags, n, ctor);
> >	if (s) {
> >		s->refcount++;
> >		/*
> ..
> 	up_write(&slub_lock);
> 	return s;
> }
>
> Here, the function returns without name string n be kfreed.

That is intentional since the string n is still referenced by the entry that sysfs_slab_alias has created.

> But we couldn't kfree n here, because in sysfs_slab_alias(), if
> (slab_state < SYS_FS), the name need to be kept valid until
> slab_sysfs_init() is finished adding the entry into sysfs.

Right, that is why it is not freed, and that is what fixes the issue you see.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH SLAB 1/2 v3] duplicate the cache name in SLUB's saved_alias list, SLAB, and SLOB
I thought I posted this a couple of days ago. Would this not fix things without having to change all the allocators?

Subject: slub: Dup name earlier in kmem_cache_create

Dup the name earlier in kmem_cache_create so that alias processing is done using the copy of the string and not the string itself.

Signed-off-by: Christoph Lameter

---
 mm/slub.c |   29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2012-06-11 08:49:56.000000000 -0500
+++ linux-2.6/mm/slub.c	2012-07-03 15:17:37.000000000 -0500
@@ -3933,8 +3933,12 @@ struct kmem_cache *kmem_cache_create(con
 	if (WARN_ON(!name))
 		return NULL;
 
+	n = kstrdup(name, GFP_KERNEL);
+	if (!n)
+		goto out;
+
 	down_write(&slub_lock);
-	s = find_mergeable(size, align, flags, name, ctor);
+	s = find_mergeable(size, align, flags, n, ctor);
 	if (s) {
 		s->refcount++;
 		/*
@@ -3944,7 +3948,7 @@ struct kmem_cache *kmem_cache_create(con
 		s->objsize = max(s->objsize, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 
-		if (sysfs_slab_alias(s, name)) {
+		if (sysfs_slab_alias(s, n)) {
 			s->refcount--;
 			goto err;
 		}
@@ -3952,31 +3956,26 @@ struct kmem_cache *kmem_cache_create(con
 		return s;
 	}
 
-	n = kstrdup(name, GFP_KERNEL);
-	if (!n)
-		goto err;
-
 	s = kmalloc(kmem_size, GFP_KERNEL);
 	if (s) {
 		if (kmem_cache_open(s, n, size, align, flags, ctor)) {
 			list_add(&s->list, &slab_caches);
 			up_write(&slub_lock);
-			if (sysfs_slab_add(s)) {
-				down_write(&slub_lock);
-				list_del(&s->list);
-				kfree(n);
-				kfree(s);
-				goto err;
-			}
-			return s;
+			if (!sysfs_slab_add(s))
+				return s;
+
+			down_write(&slub_lock);
+			list_del(&s->list);
 		}
 		kfree(s);
 	}
-	kfree(n);
+
 err:
+	kfree(n);
 	up_write(&slub_lock);
+out:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slabcache %s\n", name);
 	else

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH powerpc 2/2] kfree the cache name of pgtable cache if SLUB is used
Looking through the emails it seems that there is an issue with alias strings. That can be solved by duping the name of the slab earlier in kmem_cache_create(). Does this patch fix the issue?

Subject: slub: Dup name earlier in kmem_cache_create

Dup the name earlier in kmem_cache_create so that alias processing is done using the copy of the string and not the string itself.

Signed-off-by: Christoph Lameter

---
 mm/slub.c |   29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2012-06-11 08:49:56.000000000 -0500
+++ linux-2.6/mm/slub.c	2012-07-03 15:17:37.000000000 -0500
@@ -3933,8 +3933,12 @@ struct kmem_cache *kmem_cache_create(con
 	if (WARN_ON(!name))
 		return NULL;
 
+	n = kstrdup(name, GFP_KERNEL);
+	if (!n)
+		goto out;
+
 	down_write(&slub_lock);
-	s = find_mergeable(size, align, flags, name, ctor);
+	s = find_mergeable(size, align, flags, n, ctor);
 	if (s) {
 		s->refcount++;
 		/*
@@ -3944,7 +3948,7 @@ struct kmem_cache *kmem_cache_create(con
 		s->objsize = max(s->objsize, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 
-		if (sysfs_slab_alias(s, name)) {
+		if (sysfs_slab_alias(s, n)) {
 			s->refcount--;
 			goto err;
 		}
@@ -3952,31 +3956,26 @@ struct kmem_cache *kmem_cache_create(con
 		return s;
 	}
 
-	n = kstrdup(name, GFP_KERNEL);
-	if (!n)
-		goto err;
-
 	s = kmalloc(kmem_size, GFP_KERNEL);
 	if (s) {
 		if (kmem_cache_open(s, n, size, align, flags, ctor)) {
 			list_add(&s->list, &slab_caches);
 			up_write(&slub_lock);
-			if (sysfs_slab_add(s)) {
-				down_write(&slub_lock);
-				list_del(&s->list);
-				kfree(n);
-				kfree(s);
-				goto err;
-			}
-			return s;
+			if (!sysfs_slab_add(s))
+				return s;
+
+			down_write(&slub_lock);
+			list_del(&s->list);
 		}
 		kfree(s);
 	}
-	kfree(n);
+
 err:
+	kfree(n);
 	up_write(&slub_lock);
+out:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slabcache %s\n", name);
 	else

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH powerpc 2/2] kfree the cache name of pgtable cache if SLUB is used
On Mon, 25 Jun 2012, Li Zhong wrote: > This patch tries to kfree the cache name of pgtables cache if SLUB is > used, as SLUB duplicates the cache name, and the original one is leaked. SLAB also does not free the name. Why would you have an #ifdef in there? ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] slub: fix kernel BUG at mm/slub.c:1950!
On Mon, 13 Jun 2011, Pekka Enberg wrote: > > Hmmm.. The allocpercpu in alloc_kmem_cache_cpus should take care of the > > alignment. Uhh.. I see that a patch that removes the #ifdef CMPXCHG_LOCAL > > was not applied? Pekka? > > This patch? > > http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=d4d84fef6d0366b585b7de13527a0faeca84d9ce > > It's queued and will be sent to Linus soon. Ok it will also fix Hugh's problem then. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] slub: fix kernel BUG at mm/slub.c:1950!
On Sun, 12 Jun 2011, Hugh Dickins wrote:

> 3.0-rc won't boot with SLUB on my PowerPC G5: kernel BUG at mm/slub.c:1950!
> Bisected to 1759415e630e "slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery".
>
> After giving myself a medal for finding the BUG on line 1950 of mm/slub.c
> (it's actually the
> 	VM_BUG_ON((unsigned long)(&pcp1) % (2 * sizeof(pcp1)));
> on line 268 of the morass that is include/linux/percpu.h)
> I tried the following alignment patch and found it to work.

Hmmm.. The allocpercpu in alloc_kmem_cache_cpus should take care of the alignment. Uhh.. I see that a patch that removes the #ifdef CMPXCHG_LOCAL was not applied? Pekka?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
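The VM_BUG_ON Hugh dug up expresses a simple invariant: an object used with cmpxchg_double must sit at an address aligned to twice its own size. The same check, written as a standalone helper (the function name is invented for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * The invariant behind the VM_BUG_ON in percpu.h: the double-word
 * compare-and-exchange needs its operand aligned to 2 * sizeof(operand),
 * which is what the percpu allocator was expected to guarantee here.
 */
static int cmpxchg_double_aligned(const void *p, size_t size)
{
	return ((uintptr_t)p % (2 * size)) == 0;
}
```

If the percpu allocator hands back storage that is only word-aligned, the predicate fails and the kernel's check fires, which matches the G5 boot crash described above.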
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Fri, 24 Sep 2010, Alan Cox wrote: > Whether you add new syscalls or do the fd passing using flags and hide > the ugly bits in glibc is another question. Use device specific ioctls instead of syscalls? ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Thu, 23 Sep 2010, john stultz wrote:

> > > 3) Further, the PTP hardware counter can be simply set to a new offset
> > > to put it in line with the network time. This could cause trouble with
> > > timekeeping much like unsynced TSCs do.
> >
> > You can do the same for system time.
>
> Settimeofday does allow CLOCK_REALTIME to jump, but the CLOCK_MONOTONIC
> time cannot jump around. Having a clocksource that is non-monotonic
> would break this.

Currently time runs at the same speed. CLOCK_MONOTONIC runs at an offset to CLOCK_REALTIME. We are creating APIs here that allow time to run at different speeds.

> The design actually avoids most userland induced latency.
>
> 1) On the PTP hardware syncing point, the reference packet gets
> timestamped with the PTP hardware time on arrival. This allows the
> offset calculation to be done in userland without introducing latency.

The timestamps allow the calculation of the network transmission time, I guess, and therefore it is more accurate to calculate that effect out. Ok, but then the overhead of getting to code in user space (that does the proper clock adjustments) results in the addition of a relatively long time that is subject to OS scheduling latencies and noise.

> 2) On the system syncing side, the proposal for the PPS interrupt allows
> the PTP hardware to trigger an interrupt on the second boundary that
> would take a timestamp of the system time. Then the pps interface allows
> for the timestamp to be read from userland allowing the offset to be
> calculated without introducing additional latency.

Sorry, I don't really get the whole picture here it seems. Sounds like one is going through additional unnecessary layers. Why would the PTP hardware trigger an interrupt? I thought the PTP messages came in via timestamping and are then processed by software. Then the software is issuing a hardware interrupt that then triggers the PPS subsystem. And that is supposed to be better than directly interfacing with the PTP?
> Additionally, even just in userland, it would be easy to bracket two
> reads of the system time around one read of the PTP clock to bound any
> userland latency fairly well. It may not be as good as the PPS interface
> (although that depends on the interrupt latency), but if the accesses
> are all local, it probably could get fairly close.

That sounds hacky.

> > Ok maybe we need some sort of control interface to manage the clock like
> > the others have.
>
> That's what the clock_adjtime call provides.

Ummm... You are managing a hardware device with hardware (driver) specific settings. That is currently being done via ioctls. Why generalize it?

> > The posix clocks today assumes one notion of real "time" in the kernel.
> > All clocks increase in lockstep (aside from offset updates).
>
> Not true. The cputime clockids do not increment at the same rate (as the
> apps don't always run). Further CLOCK_MONOTONIC_RAW provides a non-freq
> corrected view of CLOCK_MONOTONIC, so it increments at a slightly
> different rate.

cputime clockids are not tracking time but cpu resource use.

> Re-using the fairly nice (Alan of course disagrees :) posix interface
> seems at least a little better for application developers who actually
> have to use the hardware.

Well it may also be confusing for others. The application developers also will have a hard time using a generic clock interface to control PTP device specific things like frequencies, rates etc. So you always need an ioctl/device specific control interface regardless.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Thu, 23 Sep 2010, Christian Riesch wrote:

> > > It implies clock tuning in userspace for a potential sub microsecond
> > > accurate clock. The clock accuracy will be limited by user space
> > > latencies and noise. You wont be able to discipline the system clock
> > > accurately.
> >
> > Noise matters, latency doesn't.
>
> Well put! That's why we need hardware support for PTP timestamping to reduce
> the noise, but get along well with the clock servo that is steering the PHC in
> user space.

Even if I buy into the catch phrase above: User space is subject to noise that the in-kernel code is not. If you do the tuning over long intervals then it hopefully averages out, but it still causes jitter effects that affect the degree of accuracy (or sync) that you can reach. And the noise varies with the load on the system.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Thu, 23 Sep 2010, john stultz wrote:

> > The HPET or pit timesource are also quite slow these days. You only need
> > access periodically to essentially tune the TSC ratio.
>
> If we're using the TSC, then we're not using the PTP clock as you
> suggest. Further the HPET and PIT aren't used to steer the system time
> when we are using the TSC as a clocksource. Its only used to calibrate
> the initial constant freq used by the timekeeping code (and if its
> non-constant, we throw it out).

There is no other scalable time source available for fast timer access than the time stamp counter in the cpu. Other time sources require memory accesses, which are inherently slower. An accurate other time source is used to adjust this clock. NTP does that via the clock interfaces from user space, which has its problems with accuracy. PTP can provide the network synced time access that would allow a more accurate calibration of the time.

> 2) The way PTP clocks are steered to sync with network time causes their
> hardware freq to actually change. Since these adjustments are done on
> the hardware clock level, and not on the system time level, the
> adjustments to sync the system time/freq would then be made incorrect by
> PTP hardware adjustments.

Right. So use these as a way to fine tune the TSC clock (and thereby the system time).

> 3) Further, the PTP hardware counter can be simply set to a new offset
> to put it in line with the network time. This could cause trouble with
> timekeeping much like unsynced TSCs do.

You can do the same for system time.

> Now, what you seem to be suggesting is to use the TSC (or whatever
> clocksource the system time is using) but to steer the system time using
> the PTP clock. This is actually what is being proposed, however, the
> steering is done in userland. This is due to the fact that there are two
> components to the steering, 1) adjusting the PTP clock hardware to
> network time and 2) adjusting the system time to the PTP hardware. By
> exposing the PTP clock to userland via the posix clocks interface, we
> allow this to easily be done.

Userland code would introduce latencies that would make sub microsecond time sync very difficult.

> > We can switch underlying clocks for system time already. We can adapt to a
> > different hw frequency.
>
> Actually no. The timekeeping code requires a fixed freq counter. Dealing
> with hardware freq changes is difficult, because error is introduced by
> the latency between when the freq changes and when the timekeeping code
> is notified of it. So the system treats the hardware counters as fixed
> freq. Now, hardware does vary freq ever so slightly as thermal
> conditions change, but this is addressed in userland and corrected via
> adjtimex.

Academic hair splitting? I have repeatedly switched between different clocks on various systems. So it's difficult but we do it?

> Unnecessary layers? Where? This approach has less in-kernel layers, as
> it exposes the PTP clock to userland, instead of trying to layer things
> on top of it and stretching the system time abstraction to cover it.

You don't need the user APIs if you directly use the PTP time source to steer the system clock. In fact I think you have to do it in kernel space, since user space latencies will degrade accuracy otherwise.

> I've argued through the approach trying to keep it all internal to the
> kernel, but to do so would be anything but trivial. Further, there's the
> case of master-clocks, where the PTP hardware must be synced to system
> time, instead of the other way around. And then there's the case of
> boundary-clocks, which may have multiple PTP hardware clocks that have
> to be synced.

Ok, maybe we need some sort of control interface to manage the clock like the others have.

> I think exposing this through the posix clock interface is really the
> best approach. Its not a static clockid, so its not something most apps
> will ever have to deal with, but it allows the few apps that really need
> to have access to the PTP clock hardware can do so in a clean way.

It implies clock tuning in userspace for a potentially sub microsecond accurate clock. The clock accuracy will be limited by user space latencies and noise. You won't be able to discipline the system clock accurately.

The posix clocks today assume one notion of real "time" in the kernel. All clocks increase in lockstep (aside from offset updates). This approach here results in multiple notions of "time" increasing at various speeds. And it implies that someone in user space is trying to tinker around with extremely low latencies using system call APIs that take much longer than these intervals to process the data.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 6/8] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
On Thu, 23 Sep 2010, Alan Cox wrote: > > Please do not introduce useless additional layers for clock sync. Load > > these ptp clocks like the other regular clock modules and make them sync > > system time like any other clock. > > I don't think you understand PTP. PTP has masters, a system can need to > be honouring multiple conflicting masters at once. The upshot of it all has to be some synchronized notion of time regardless of how many other things are going on under the hood. And the spec here suggests hardware able to generate periodic accurate events that can be used to sync system time. > > Really guys: I want a PTP solution! Now! And not some idiotic additional > > kernel layers that just pass bits around because it's so much fun and > > screws up clock accuracy due to the latency noise introduced while > > having so much fun with the bits. > > There are some interesting complications in putting a PTP sync > interface in kernel. If the PTP logic internally has to juggle multiple clocks then that is a complication for the driver, OK. In any case the driver ultimately has to provide *one* source of time for the system to sync to.
Re: [PATCH 6/8] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
On Thu, 23 Sep 2010, Richard Cochran wrote: > +* Gianfar PTP clock nodes > + > +General Properties: > + > + - compatible Should be "fsl,etsec-ptp" > + - reg Offset and length of the register set for the device > + - interrupts There should be at least two interrupts. Some devices > + have as many as four PTP related interrupts. > + > +Clock Properties: > + > + - tclk-period Timer reference clock period in nanoseconds. > + - tmr-prsc Prescaler, divides the output clock. > + - tmr-add Frequency compensation value. > + - cksel 0= external clock, 1= eTSEC system clock, 3= RTC clock > input. > + Currently the driver only supports choice "1". > + - tmr-fiper1 Fixed interval period pulse generator. > + - tmr-fiper2 Fixed interval period pulse generator. > + - max-adj Maximum frequency adjustment in parts per billion. > + > + These properties set the operational parameters for the PTP > + clock. You must choose these carefully for the clock to work right. > + Here is how to figure good values: > + > + TimerOsc = system clock MHz > + tclk_period = desired clock period nanoseconds > + NominalFreq = 1000 / tclk_period MHz > + FreqDivRatio = TimerOsc / NominalFreq (must be greater than 1.0) > + tmr_add = ceil(2^32 / FreqDivRatio) > + OutputClock = NominalFreq / tmr_prsc MHz > + PulseWidth = 1 / OutputClock microseconds > + FiperFreq1 = desired frequency in Hz > + FiperDiv1 = 1000000 * OutputClock / FiperFreq1 > + tmr_fiper1 = tmr_prsc * tclk_period * FiperDiv1 - tclk_period > + max_adj = 1000000000 * (FreqDivRatio - 1.0) - 1 Great stuff for clock synchronization... > + The calculation for tmr_fiper2 is the same as for tmr_fiper1. The > + driver expects that tmr_fiper1 will be correctly set to produce a 1 > + Pulse Per Second (PPS) signal, since this will be offered to the PPS > + subsystem to synchronize the Linux clock. Argh. And conceptually completely screwed up.
Why go through the PPS subsystem if you can directly tune the system clock based on a number of the cool periodic clock features that you have above? See how the other clocks do that easily? Look into drivers/clocksource. Add it there. Please do not introduce useless additional layers for clock sync. Load these ptp clocks like the other regular clock modules and make them sync system time like any other clock. Really guys: I want a PTP solution! Now! And not some idiotic additional kernel layers that just pass bits around because it's so much fun and screws up clock accuracy due to the latency noise introduced while having so much fun with the bits.
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Thu, 23 Sep 2010, john stultz wrote: > This was my initial gut reaction as well, but in the end, I agree with > Richard that in the case of one or multiple PTP hardware clocks, we > really can't abstract over the different time domains. My (arguably still superficial) review of the source does not show anything that would make me reach that conclusion. > I really don't think the PTP clock can be used as a clocksource sanely. > > First, the hardware access is much too slow for system timekeeping. The HPET or PIT time sources are also quite slow these days. You only need access periodically to essentially tune the TSC ratio. > Second, there is the problem that the system time is a software clock, > and adjustments made (like freq) are made in the layer that interprets > the underlying hardware cycle counter. Adjustments made in PTP (in order > to sync the network timestamps) are made at the hardware level. From what I can see the PTP clocks are periodic hardware cycle counters like any other clock that we currently support. If it's configurable enough then set up a hardware cycle counter that mimics nanoseconds since the epoch as closely as possible and use that to sync the TSC rate to. Makes it very easy. > This would cause a disconnect between the hardware freq understood by > the system time management code and the actual hardware freq. We can switch underlying clocks for system time already. We can adapt to a different hw frequency. But then I do not know why one would adjust the freq? I thought the point was that the periodic clock was network synchronized and can be used as "the" master clock for multiple machines? > Richard, I'd actually strike this paragraph from the rationale, as I feel > it has the tendency to confuse as it suggests having the PHC as a > clocksource is feasible when really it isn't. Or alternatively, maybe > express more clearly why it's not feasible, so it doesn't just seem like > a minor design choice.
Sorry but I still feel that this is pretty much a misguided approach that creates unnecessary layers in the kernel. The trivial easy approach was not done (copy a driver from drivers/clocksource, modify so that it programs access to a centralized periodic ptp signal and uses it for system sync).
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Thu, 23 Sep 2010, Jacob Keller wrote: > > There is a reason for not being able to shift posix clocks: The system has > > one time base. The various clocks are contributing to maintaining that > > system wide time. > > > > Adjusting clocks is absolutely essential for proper functioning of the PTP > protocol. The slave obtains and calculates the offset from master and uses > that in order to adjust the clock properly. The problem is that the > timestamps are done via the hardware. We need a method to expose that > hardware so that the ptp software can properly adjust those clocks. There is no way to use that clock directly to avoid all the user space tuning etc? There are already tuning mechanisms in the kernel that do this with system time based on periodic clocks. If you calculate the nanoseconds since the epoch then you should be able to use that to tune system time. > > I do not understand why you want to maintain different clocks running at > > different speeds. Certainly interesting for some uses I guess that I > > do not have the energy to imagine right now. But can we get the PTP killer > > feature of synchronized accurate system time first? > > > > The problem is maintaining a hardware clock at the correct speed/frequency > and time. The timestamping is done via hardware, and that hardware clock > needs to be accurate. We need to be able to modify that clock. Yes, having > the system time be the same value would be nice, but the problem comes > because we don't want to jump through hoops to keep that hardware clock > accurate to the ptp protocol running on the network. Then allow system time == hardware clock? > All of the necessary features for microsecond or better accuracy are done > via the hardware. You can get accuracy to within <10 microseconds while only > sending sync packets and such once per second. The reason is because the > hardware timestamps are very accurate.
But if we can't properly adjust the > clock's time and frequency, we cannot maintain the accuracy of the > timestamps. You can already adjust the system time with the existing APIs. Tuning hardware clocks is currently done using device specific controls. But I would think that you do not need to expose this to user space if you can do it all in kernel.
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Thu, 23 Sep 2010, Richard Cochran wrote: > Support for obtaining timestamps from a PHC already exists via the > SO_TIMESTAMPING socket option, integrated in kernel version 2.6.30. > This patch set completes the picture by allowing user space programs to > adjust the PHC and to control its ancillary features. Is there a way to use the PHC as a system clock? I think the main benefit of PTP is to have synchronized time on multiple machines in a cluster. That may mean getting rid of ntp and using an in kernel PHC based way to sync time. >So as far as the POSIX standard is concerned, offering a clock id >to represent the PHC would be acceptable. Sure but what would you do with it? HPET timer support has no such need. > 3.2.1 Using the POSIX Clock API > > Looking at the mapping from PHC operation to the POSIX clock API, > we see that two of the basic clock operations, marked with *, have > no POSIX equivalent. The items marked NA are peculiar to PHCs and > will be discussed separately, below. >
>   Clock Operation                POSIX function
>   ----------------------------+-------------------------
>   Set time                       clock_settime
>   Get time                       clock_gettime
>   Shift the clock                *
>   Adjust clock frequency         *
>   ----------------------------+-------------------------
>   Time stamp external events     NA
>   Enable PPS events              NA
>   Periodic output signals        NA
>   One shot or periodic alarms    timer_create, timer_settime
>
> In contrast to the standard Linux system clock, a PHC is > adjustable in hardware, for example using frequency compensation > registers or a VCO. The ability to directly tune the PHC is > essential to reap the benefit of hardware timestamping. There is a reason for not being able to shift posix clocks: The system has one time base. The various clocks are contributing to maintaining that system wide time. I do not understand why you want to maintain different clocks running at different speeds. Certainly interesting for some uses I guess that I do not have the energy to imagine right now. But can we get the PTP killer feature of synchronized accurate system time first?
> 3.3 Synchronizing the Linux System Time > > >One could offer a PHC as a combined clock source and clock event >device. The advantage of this approach would be that it obviates >the need for synchronization when the PHC is selected as the system >timer. However, some PHCs, namely the PHY based clocks, cannot be >used in this way. Why not? Do PHY based clocks not at least provide a counter that increments in synchronized intervals throughout the network? >Instead, the patch set provides a way to offer a Pulse Per Second >(PPS) event from the PHC to the Linux PPS subsystem. A user space >application can read the PPS events and tune the system clock, just >like when using other external time sources like radio clocks or >GPS. User space is subject to various latencies created by the OS etc. I would think that in order to have fine grained (read: microsecond) accuracy we would have to run the portions that are relevant to obtaining the desired accuracy in the kernel.
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Mon, 1 Mar 2010, Mel Gorman wrote: > Christoph, how feasible would it be to allow parallel reclaimers in > __zone_reclaim() that back off at a rate depending on the number of > reclaimers? Not too hard. Zone locking is there but there may be a lot of bouncing cachelines if you run it concurrently.
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Tue, 23 Feb 2010, Anton Blanchard wrote: > zone_reclaim_mode.txt > Now we set zone_reclaim_mode = 1. On each iteration we continue to improve, > but even after 10 runs of stream we have > 10% remote node memory usage. The intent of zone reclaim was never to allocate all memory from one node. You should not expect all memory to come from the node even if zone reclaim works. > reclaim_4096_pages.txt > Instead of reclaiming 32 pages at a time, we try for a much larger batch > of 4096. The slope is much steeper but it still takes around 6 iterations > to get almost all local node memory. "almost all"? How much do you want? > wait_on_busy_flag.txt > Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest > we would need to check the GFP flags etc, but so far it looks the most > promising. We only get a few percent of remote node memory on the first > iteration and get all local node by the second. This would significantly impact performance. Zone reclaim should reclaim with minimal overhead. If zone reclaim is running on another processor then the OS already takes measures against the shortage of node local memory. The right thing to do is to take what is currently available which may be off node memory.
--- mm/vmscan.c~	2010-02-21 23:47:14.0 -0600
+++ mm/vmscan.c	2010-02-22 03:22:01.0 -0600
@@ -2534,7 +2534,7 @@
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+				       4096),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,
--- mm/vmscan.c~	2010-02-21 23:47:14.0 -0600
+++ mm/vmscan.c	2010-02-21 23:47:31.0 -0600
@@ -2634,8 +2634,8 @@
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
+	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
+		cpu_relax();
 	ret = __zone_reclaim(zone, gfp_mask, order);
 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Fri, 19 Feb 2010, Balbir Singh wrote: > >> zone_reclaim. The others back off and try the next zone in the zonelist > >> instead. I'm not sure what the original intention was but most likely it > >> was to prevent too many parallel reclaimers in the same zone potentially > >> dumping out way more data than necessary. > > > > Yes it was to prevent concurrency slowing down reclaim. At that time the > > number of processors per NUMA node was 2 or so. The number of pages that > > are reclaimed is limited to avoid tossing too many page cache pages. > > > > That is interesting, I always thought it was to try and free page > cache first. For example with zone->min_unmapped_pages, if > zone_pagecache_reclaimable is greater than unmapped pages, we start > reclaiming the cached pages first. The min_unmapped_pages almost sounds > like the higher level watermark - or am I misreading the code. Indeed the purpose is to free *old* page cache pages. The min_unmapped_pages is to protect a minimum of the page cache pages / fs metadata from zone reclaim so that ongoing file I/O is not impacted.
Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
On Fri, 19 Feb 2010, Mel Gorman wrote: > > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables > > > zone reclaim. > > > > I've no problem with the patch anyway. Nor do I. > > - We seem to end up racing between zone_watermark_ok, zone_reclaim and > > buffered_rmqueue. Since everyone is in here the memory one thread reclaims > > may be stolen by another thread. > > > > You're pretty much on the button here. Only one thread at a time enters > zone_reclaim. The others back off and try the next zone in the zonelist > instead. I'm not sure what the original intention was but most likely it > was to prevent too many parallel reclaimers in the same zone potentially > dumping out way more data than necessary. Yes it was to prevent concurrency slowing down reclaim. At that time the number of processors per NUMA node was 2 or so. The number of pages that are reclaimed is limited to avoid tossing too many page cache pages. > You could experiment with waiting on the bit if the GFP flags allow it? The > expectation would be that the reclaim operation does not take long. Wait > on the bit, if you are making forward progress, recheck the > watermarks before continuing. You could reclaim more pages during a zone reclaim pass? Increase the nr_to_reclaim in __zone_reclaim() and see if that helps. One zone reclaim pass should reclaim enough local pages to keep the processors on a node happy for a reasonable interval. Maybe do a fraction of a zone? 1/16th?
Re: [PATCH 2/2][v2] powerpc: Make the CMM memory hotplug aware
On Thu, 15 Oct 2009, Gerald Schaefer wrote: > > The pages allocated as __GFP_MOVABLE are used to store the list of pages > > allocated by the balloon. They reference virtual addresses and it would > > be fine for the kernel to migrate the physical pages for those, the > > balloon would not notice this. > > Does page migration really work for kernel pages that were allocated > with __get_free_page()? I was wondering if we can do this on s390, where > we have a 1:1 mapping of kernel virtual to physical addresses, but > looking at migrate_pages() and friends, it seems that kernel pages > w/o mapping and rmap should not be migrateable at all. Any thoughts from > the memory migration experts? page migration only works for pages where we have some way of accounting for all the references to a page. This usually means using reverse mappings (anon list, radix trees and page tables).
Re: [PATCH 6/6] Add support for __read_mostly to linux/cache.h
On Fri, 1 May 2009, Sam Ravnborg wrote: > Is there any specific reason why we do not support read_mostly on all > architectures? Not that I know of. > read_mostly is about grouping rarely written data together > so what is needed is to introduce this section in the remaining > architectures. > > Christoph - git log says you did the initial implementation. > Do you agree? Yes. There is some concern that __read_mostly is needlessly applied to numerous variables that are not used in hot code paths. This may make __read_mostly ineffective and actually increase the cache footprint of a function since global variables are no longer in the same cacheline. If such a function is called and the caches are cold then two cacheline fetches have to be done instead of one.
Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
Mel Gorman wrote: > With Eric's patch and libhugetlbfs, we can automatically back text/data[1], > malloc[2] and stacks without source modification. Fairly soon, libhugetlbfs > will also be able to override shmget() to add SHM_HUGETLB. That should cover > a lot of the memory-intensive apps without source modification. So we are quite far down the road to having a VM that supports 2 page sizes, 4k and 2M?
Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()
On Tue, 4 Mar 2008, Pekka Enberg wrote: > Looking at the code, it's triggerable in 2.6.24.3 at least. Why we don't have > a report yet, probably because (1) the default allocator is SLUB which doesn't > suffer from this and (2) you need a big honkin' NUMA box that causes fallback > allocations to happen to trigger it. Plus the issue only became a problem after the antifrag stuff went in. That came with SLUB as the default.
Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()
I think this is the correct fix. The NUMA fallback logic should be passing local_flags to kmem_getpages() and not simply the flags. Maybe a stable candidate since we are now simply passing on flags to the page allocator on the fallback path. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slab.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.25-rc3-mm1/mm/slab.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/slab.c	2008-03-04 12:01:07.430911920 -0800
+++ linux-2.6.25-rc3-mm1/mm/slab.c	2008-03-04 12:04:54.449857145 -0800
@@ -3277,7 +3277,7 @@ retry:
 	if (local_flags & __GFP_WAIT)
 		local_irq_enable();
 	kmem_flagcheck(cache, flags);
-	obj = kmem_getpages(cache, flags, -1);
+	obj = kmem_getpages(cache, local_flags, -1);
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	if (obj) {
Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()
On Tue, 4 Mar 2008, Pekka J Enberg wrote: > On Tue, 4 Mar 2008, Christoph Lameter wrote: > > Slab allocations should never be passed these flags since the slabs do > > their own thing there. > > > > The following patch would clear these in slub: > Here's the same fix for SLAB: That is an immediate fix, OK. But there must be some location where SLAB does the masking of the gfp bits where things go wrong. Looking for that.
Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()
On Tue, 4 Mar 2008, Pekka Enberg wrote: > > > >> [c9edf5f0] [c00b56e4] > > .__alloc_pages_internal+0xf8/0x470 > > > >> [c9edf6e0] [c00e0458] .kmem_getpages+0x8c/0x194 > > > >> [c9edf770] [c00e1050] .fallback_alloc+0x194/0x254 > > > >> [c9edf820] [c00e14b0] .kmem_cache_alloc+0xd8/0x144 Ahh! This is SLAB. slub does not suffer this problem since new_slab() masks the bits correctly. So we need to fix SLAB.
Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()
On Tue, 4 Mar 2008, Pekka Enberg wrote: > > > I suspect the WARN_ON() is bogus although I really don't know that part > > > of the code all too well. Mel? > > > > The warn-on is valid. A situation should not exist that allows both flags > > to be set. I suspect if remove-set_migrateflags.patch was reverted from -mm > > the warning would not trigger. Christoph, would it be reasonable to always > > clear __GFP_MOVABLE when __GFP_RECLAIMABLE is set for SLAB_RECLAIM_ACCOUNT. Slab allocations should never be passed these flags since the slabs do their own thing there. The following patch would clear these in slub:
---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.25-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/slub.c	2008-03-04 11:53:47.600342756 -0800
+++ linux-2.6.25-rc3-mm1/mm/slub.c	2008-03-04 11:55:40.153855150 -0800
@@ -1033,8 +1033,8 @@ static struct page *allocate_slab(struct
 	struct page *page;
 	int pages = 1 << s->order;
 
+	flags &= ~GFP_MOVABLE_MASK;
 	flags |= s->allocflags;
-
 	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
 							s->order);
 	if (unlikely(!page)) {
Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node
On Wed, 23 Jan 2008, Nishanth Aravamudan wrote: > Right, so it might have functioned before, but the correctness was > wobbly at best... Certainly the memoryless patch series has tightened > that up, but we missed these SLAB issues. > > I see that your patch fixed Olaf's machine, Pekka. Nice work on > everyone's part tracking this stuff down. Another important result is that I found that GFP_THISNODE is actually required for proper SLAB operation and not only an optimization. Fallback can lead to very bad results. I have two customer reported instances of SLAB corruption here that can be explained now due to fallback to another node. Foreign objects enter the per cpu queue. The wrong node lock is taken during cache_flusharray(). Fields in the struct slab can become corrupted. It typically hits the list field and the inuse field.
Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node
On Wed, 23 Jan 2008, Pekka Enberg wrote: > I think Mel said that their configuration did work with 2.6.23 > although I also wonder how that's possible. AFAIK there has been some > changes in the page allocator that might explain this. That is, if > kmem_getpages() returned pages for memoryless node before, bootstrap > would have worked. Regular kmem_getpages is called with GFP_THISNODE set. There was some breakage in 2.6.22 and before with GFP_THISNODE returning pages from the wrong node if a node had no memory. So it may have worked accidentally and in an unsafe manner because the pages would have been associated with the wrong node which could trigger bug ons and locking troubles.
Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node
On Wed, 23 Jan 2008, Pekka J Enberg wrote: > Fine. But, why are we hitting fallback_alloc() in the first place? It's > definitely not because of missing ->nodelists as we do: > > cache_cache.nodelists[node] = &initkmem_list3[CACHE_CACHE]; > > before attempting to set up kmalloc caches. Now, if I understood > correctly, we're booting off a memoryless node so kmem_getpages() will > return NULL thus forcing us to fallback_alloc() which is unavailable at > this point. > > As far as I can tell, there are two ways to fix this: > > (1) don't boot off a memoryless node (why are we doing this in the first > place?) Right. That is the solution that I would prefer. > (2) initialize cache_cache.nodelists with initmem_list3 equivalents > for *each node that has normal memory* Or simply do it for all. SLAB bootstrap is a very complex thing though. > > I am still wondering why this worked before, though. I doubt it did ever work for SLAB.
Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node
On Wed, 23 Jan 2008, Mel Gorman wrote: > This patch adds the necessary checks to make sure a kmem_list3 exists for > the preferred node used when growing the cache. If the preferred node has > no nodelist then the currently running node is used instead. This > problem only affects the SLAB allocator, SLUB appears to work fine. That is a dangerous thing to do. SLAB per cpu queues will contain foreign objects which may cause troubles when pushing the objects back. I think we may be lucky that these objects are consumed at boot. If all of the foreign objects are consumed at boot then we are fine. At least an explanation as to this issue should be added to the patch.
Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node
On Wed, 23 Jan 2008, Pekka J Enberg wrote: > Furthermore, don't let kmem_getpages() call alloc_pages_node() if nodeid > passed > to it is -1 as the latter will always translate that to numa_node_id() which > might not have ->nodelist that caused the invocation of fallback_alloc() in > the > first place (for example, during bootstrap). kmem_getpages is called without GFP_THISNODE. Thus alloc_pages_node(numa_node_id(), ...) will fall back to the next node with memory.
Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node
On Wed, 23 Jan 2008, Pekka J Enberg wrote: > I still think Christoph's kmem_getpages() patch is correct (to fix > cache_grow() oops) but I overlooked the fact that none of the callers of > cache_alloc_node() deal with bootstrapping (with the exception of > __cache_alloc_node() that even has a comment about it). My patch is useless. kmem_getpages() called with nodeid == -1 falls back correctly to the available node. The problem is that the node structures for the page do not exist. > But what I am really wondering about is, why wasn't the > N_NORMAL_MEMORY revert enough? I assume this used to work before so what > more do we need to revert for 2.6.24? I think that is because SLUB relaxed the requirements on having regular memory on the boot node. Now the expectation is that SLAB can do the same.