Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

> > NAK. This is changing slob behavior. With no node specified it must use
> > alloc_pages because that obeys NUMA memory policies etc etc. It should not
> > force allocation from the current node like what is happening here after
> > the patch. See the code in slub.c that is similar.
>
> Doh, somehow I convinced myself that there's #else and alloc_pages() is only
> used for !CONFIG_NUMA so it doesn't matter. Here's a fixed version.

Acked-by: Christoph Lameter 


Re: [PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

> numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
> so it's a more robust fallback than the currently used numa_node_id().
>
> Suggested-by: Christoph Lameter 
> Signed-off-by: Vlastimil Babka 
> Acked-by: David Rientjes 
> Acked-by: Mel Gorman 

You can add my ack too if it helps.

Acked-by: Christoph Lameter 


Re: [PATCH v3 2/3] mm: unify checks in alloc_pages_node() and __alloc_pages_node()

2015-07-30 Thread Christoph Lameter

Acked-by: Christoph Lameter 



Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

> --- a/mm/slob.c
> +++ b/mm/slob.c
>   void *page;
>
> -#ifdef CONFIG_NUMA
> - if (node != NUMA_NO_NODE)
> - page = alloc_pages_exact_node(node, gfp, order);
> - else
> -#endif
> - page = alloc_pages(gfp, order);
> + page = alloc_pages_node(node, gfp, order);

NAK. This is changing slob behavior. With no node specified it must use
alloc_pages because that obeys NUMA memory policies etc etc. It should not
force allocation from the current node like what is happening here after
the patch. See the code in slub.c that is similar.
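For reference, a minimal sketch of what keeps the old behavior (mirroring
the slub.c logic; illustrative only, not the final patch):

	if (node != NUMA_NO_NODE)
		page = __alloc_pages_node(node, gfp, order);
	else
		page = alloc_pages(gfp, order);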



Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-23 Thread Christoph Lameter
On Wed, 22 Jul 2015, David Rientjes wrote:

> Eek, yeah, that does look bad.  I'm not even sure the
>
>   if (nid < 0)
>   nid = numa_node_id();
>
> is correct; I think this should be comparing to NUMA_NO_NODE rather than
> all negative numbers, otherwise we silently ignore overflow and nobody
> ever knows.

Comparing to NUMA_NO_NODE would be better. Also use numa_mem_id() instead
to support memoryless nodes better?

> The only possible downside would be existing users of
> alloc_pages_node() that are calling it with an offline node.  Since it's a
> VM_BUG_ON() that would catch that, I think it should be changed to a
> VM_WARN_ON() and eventually fixed up because it's nonsensical.
> VM_BUG_ON() here should be avoided.

The offline node thing could be addressed by using numa_mem_id()?
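To make that concrete, a hedged sketch that combines both suggestions
(an illustration, not the final upstream wording):

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
{
	/* Compare against NUMA_NO_NODE explicitly so real overflow bugs
	 * are not silently papered over. numa_mem_id() copes with
	 * memoryless nodes. */
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	/* Warn rather than BUG on an offline node, as discussed above. */
	VM_WARN_ON(!node_online(nid));

	return __alloc_pages_node(nid, gfp_mask, order);
}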



Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-21 Thread Christoph Lameter
On Tue, 21 Jul 2015, Vlastimil Babka wrote:

> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.

Yup. I complained about this when this was introduced. Glad to see this
fixed. Initially this was alloc_pages_node() which just means that a node
is specified. The exact behavior of the allocation is determined by flags
such as GFP_THISNODE. I'd rather have that restored because otherwise we
get into weird code like the one below. And such an arrangement also
leaves the way open to add more flags in the future that may change the
allocation behavior.


>   area->nid = nid;
>   area->order = order;
> - area->pages = alloc_pages_exact_node(area->nid,
> + area->pages = alloc_pages_prefer_node(area->nid,
>   GFP_KERNEL|__GFP_THISNODE,
>   area->order);

This is not preferring a node but requiring allocation on that node.
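To illustrate: with that arrangement the call site stays the same and only
the gfp flags decide whether the node is a preference or a requirement
(a sketch, not code from the patch):

static struct page *example_alloc(int nid, unsigned int order, bool exact)
{
	gfp_t gfp = GFP_KERNEL;

	if (exact)
		gfp |= __GFP_THISNODE;	/* allocation must come from nid */

	/* without __GFP_THISNODE nid is only the preferred node */
	return alloc_pages_node(nid, gfp, order);
}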


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-21 Thread Christoph Lameter
On Fri, 16 Aug 2013, Nicholas A. Bellinger wrote:

> + spinlock_t  lock;

Remove the spinlock.

> + unsigned nr_free;
> + unsigned freelist[];
> +};
> +
> +static inline void move_tags(unsigned *dst, unsigned *dst_nr,
> +  unsigned *src, unsigned *src_nr,
> +  unsigned nr)
> +{
> + *src_nr -= nr;
> + memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
> + *dst_nr += nr;
> +}
> +

> +static inline unsigned alloc_local_tag(struct percpu_ida *pool,
> +struct percpu_ida_cpu *tags)

Pass the __percpu offset and not the tags pointer.

> +{
> + int tag = -ENOSPC;
> +
> + spin_lock(&tags->lock);

Interrupts are already disabled. Drop the spinlock.

> + if (tags->nr_free)
> + tag = tags->freelist[--tags->nr_free];

You can keep this or avoid address calculation through segment prefixes.
F.e.

if (__this_cpu_read(tags->nr_free)) {
	int n = __this_cpu_dec_return(tags->nr_free);

	tag = __this_cpu_read(tags->freelist[n]);
}

> + spin_unlock(&tags->lock);

Drop.

> + * Returns a tag - an integer in the range [0..nr_tags) (passed to
> + * tag_pool_init()), or otherwise -ENOSPC on allocation failure.
> + *
> + * Safe to be called from interrupt context (assuming it isn't passed
> + * __GFP_WAIT, of course).
> + *
> + * Will not fail if passed __GFP_WAIT.
> + */
> +int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
> +{
> + DEFINE_WAIT(wait);
> + struct percpu_ida_cpu *tags;
> + unsigned long flags;
> + int tag;
> +
> + local_irq_save(flags);
> + tags = this_cpu_ptr(pool->tag_cpu);

You could drop this_cpu_ptr if you pass pool->tag_cpu to alloc_local_tag.

> +/**
> + * percpu_ida_free - free a tag
> + * @pool: pool @tag was allocated from
> + * @tag: a tag previously allocated with percpu_ida_alloc()
> + *
> + * Safe to be called from interrupt context.
> + */
> +void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
> +{
> + struct percpu_ida_cpu *tags;
> + unsigned long flags;
> + unsigned nr_free;
> +
> + BUG_ON(tag >= pool->nr_tags);
> +
> + local_irq_save(flags);
> + tags = this_cpu_ptr(pool->tag_cpu);
> +
> + spin_lock(&tags->lock);

No need for spinlocking
> + tags->freelist[tags->nr_free++] = tag;

nr_free = __this_cpu_inc_return(pool->tag_cpu.nr_free) ?

__this_cpu_write(pool->tag_cpu.freelist[nr_free], tag)
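Putting the above together, a hedged sketch of the free path with this_cpu
ops and no per-cpu spinlock (field names follow the quoted patch; the
handling of the global pool and waiters is omitted here):

void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
{
	unsigned long flags;
	unsigned nr_free;

	BUG_ON(tag >= pool->nr_tags);

	local_irq_save(flags);
	/* Segment-prefixed accesses to this cpu's freelist; interrupts
	 * are off, so no lock is needed for the local state. */
	nr_free = __this_cpu_read(pool->tag_cpu->nr_free);
	__this_cpu_write(pool->tag_cpu->freelist[nr_free], tag);
	__this_cpu_inc(pool->tag_cpu->nr_free);
	local_irq_restore(flags);
}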




Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

2012-07-06 Thread Christoph Lameter
On Fri, 6 Jul 2012, James Bottomley wrote:

> What people might pay attention to is evidence that there's a problem in
> 3.5-rc6 (without any OFED crap).  If you're not going to bother
> investigating, it has to be in an environment they can reproduce (so
> ordinary hardware, not infiniband) otherwise it gets ignored as an
> esoteric hardware issue.

The OFED stuff in the meantime is part of 3.5-rc6. Infiniband has been
supported for a long time and it's a very important technology given the
problematic nature of ethernet at high network speeds.

OFED crap exists for those running RHEL5/6. The new enterprise distros are
based on the 3.2 kernel which has pretty good Infiniband support
out of the box.



Re: [PATCHv2] bitops: add _local bitops

2012-05-10 Thread Christoph Lameter
On Wed, 9 May 2012, Michael S. Tsirkin wrote:

> kvm needs to update some hypervisor variables atomically
> in a sense that the operation can't be interrupted
> in the middle. However the hypervisor always runs
> on the same CPU so it does not need any memory
> barrier or lock prefix.
>
> Add _local bitops for this purpose: define them
> as non-atomics for x86 and (for now) atomics for
> everyone else.

Have you tried to use the this_cpu_ops for that purpose? They create the
per cpu atomic instructions that you want without a lock prefix and can
also relocate the per cpu pointer to the correct processor via a
segment register prefix.

There are no bit operations provided right now but those can either be
improvised using this_cpu_cmpxchg or added.
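For illustration, a local set_bit could be improvised roughly like this
(a sketch; "cpu_flags" is an assumed DEFINE_PER_CPU(unsigned long, cpu_flags)
variable, not an existing kernel symbol):

static inline void local_set_bit(int nr)
{
	unsigned long old, new;

	do {
		old = __this_cpu_read(cpu_flags);
		new = old | BIT(nr);
	} while (this_cpu_cmpxchg(cpu_flags, old, new) != old);
}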


Re: [PATCH] Code clean up for percpu_xxx() functions

2011-10-19 Thread Christoph Lameter
On Wed, 19 Oct 2011, Alex,Shi wrote:

> Thanks for comments! I initialized the patch as following accordingly,
> And cc to more maintainers for review. I checked all code except xen/kvm
> part, totally a idiot for them.

Acked-by: Christoph Lameter 



Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-04-03 Thread Christoph Lameter
On Sun, 3 Apr 2011, KOSAKI Motohiro wrote:

> 1) Some bios don't have such knob. btw, OK, yes, *I* can switch NUMA off
> completely because I don't have such bios. 2) bios level turning off makes
> some side effects, example, scheduler load balancing don't care numa anymore.

Well then let's add a kernel parameter that switches all NUMA off.
Otherwise, if you just build the kernel without NUMA support you get a
similar effect.

Re #2) If the system tosses processes around then the load balancing
heuristics do not bring you any benefit.




Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-04-03 Thread Christoph Lameter
On Sat, 2 Apr 2011, Dave Chinner wrote:

> Fundamentally, if you just switch off memory reclaim to avoid the
> latencies involved with direct memory reclaim, then all you'll get
> instead is ENOMEM because there's no memory available and none will be
> reclaimed. That's even more fatal for the system than doing reclaim.

Not for my use cases here. No one will die if reclaim happens but it's bad
for the bottom line. Reducing the chance of memory reclaim occurring in a
critical section is sufficient.


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-04-01 Thread Christoph Lameter
On Fri, 1 Apr 2011, KOSAKI Motohiro wrote:

> > On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:
> >
> > > 1) zone reclaim doesn't work if the system has multiple node and the
> > >workload is file cache oriented (eg file server, web server, mail server, et al).
> > >because zone recliam make some much free pages than zone->pages_min and
> > >then new page cache request consume nearest node memory and then it
> > >bring next zone reclaim. Then, memory utilization is reduced and
> > >unnecessary LRU discard is increased dramatically.
> >
> > That is only true if the webserver only allocates from a single node. If
> > the allocation load is balanced then it will be fine. It is useful to
> > reclaim pages from the node where we allocate memory since that keeps the
> > dataset node local.
>
> Why?
> Scheduler load balancing only consider cpu load. Then, usually memory
> pressure is no complete symmetric. That's the reason why we got the
> bug report periodically.

The scheduler load balancing also considers caching effects. It does not
consider NUMA effects aside from heuristics though. If processes are
randomly moving around then zone reclaim is not effective. Processes need
to stay mainly on a certain node and memory needs to be allocatable from
that node in order to improve performance. zone_reclaim is useless if you
toss processes around the box.

> btw, when we are talking about memory distance aware reclaim, we have to
> recognize traditional numa (ie external node interconnect) and on-chip
> numa have different performance characteristics. on-chip remote node access
> is not so slow, then elaborated nearest node allocation effort doesn't have
> so much worth. especially, a workload use a lot of short lived object.
> Current zone-reclaim don't have so much issue when using traditiona numa
> because it's fit your original design and assumption and administrators of
> such systems have good skill and don't hesitate to learn esoteric knobs.
> But recent on-chip and cheap numa are used for much different people against
> past. therefore new issues and claims were raised.

You can switch NUMA off completely at the bios level. Then the distances
are not considered by the OS. If they are not relevant then lets just
switch NUMA off. Managing NUMA distances can cause significant overhead.


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread Christoph Lameter
On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:

> 1) zone reclaim doesn't work if the system has multiple node and the
>workload is file cache oriented (eg file server, web server, mail server, et al).
>because zone recliam make some much free pages than zone->pages_min and
>then new page cache request consume nearest node memory and then it
>bring next zone reclaim. Then, memory utilization is reduced and
>unnecessary LRU discard is increased dramatically.

That is only true if the webserver only allocates from a single node. If
the allocation load is balanced then it will be fine. It is useful to
reclaim pages from the node where we allocate memory since that keeps the
dataset node local.

>SGI folks added CPUSET specific solution in past (cpuset.memory_spread_page).
>But global recliam still have its issue. zone recliam is HPC workload specific
>feature and HPC folks has no motivation to don't use CPUSET.

The spreading can also be done via memory policies. But that is only
required if the application has an unbalanced allocation behavior.

> 2) Before 2.6.27, VM has only one LRU and calc_reclaim_mapped() is used to
>decide to filter out mapped pages. It made a lot of problems for DB servers
>and large application servers. Because, if the system has a lot of mapped
>pages, 1) LRU was churned and then reclaim algorithm become lotree one. 2)
>reclaim latency become terribly slow and hangup detectors misdetect its
>state and start to force reboot. That was big problem of RHEL5 based
>banking system.
>So, sc->may_unmap should be killed in future. Don't increase uses.

Because a bank could not configure its system properly we need to get rid
of may_unmap? Maybe raise min_unmapped_ratio instead and take care that
either the allocation load is balanced or a round robin scheme is
used by the app?

> And, this patch introduce new allocator fast path overhead. I haven't seen
> any justification for it.

We could do the triggering differently.

> In other words, you have to kill following three for getting ack 1) zone
> reclaim oriented reclaim 2) filter based LRU scanning (eg sc->may_unmap)
> 3) fastpath overhead. In other words, If you want a feature for vm guest,
> Any hardcoded machine configration assumption and/or workload assumption
> are wrong.

It would be good if you could come up with a new reclaim scheme that
avoids the need for zone reclaim and still allows one to take advantage of
memory distances. I agree that the current scheme sometimes requires
tuning too many esoteric knobs to get useful behavior.

> But, I agree that now we have to concern slightly large VM change parhaps
> (or parhaps not). Ok, it's good opportunity to fill out some thing.
> Historically, Linux MM has "free memory are waste memory" policy, and It
> worked completely fine. But now we have a few exceptions.
>
> 1) RT, embedded and finance systems. They really hope to avoid reclaim
>latency (ie avoid foreground reclaim completely) and they can accept
>to make slightly much free pages before memory shortage.

In general we need a mechanism to ensure we can avoid reclaim during
critical sections of an application. So some way to give the machine hints
to free up lots of memory (/proc/sys/vm/drop_caches is far too drastic) may
be useful.

> And, now we have four proposal of utilization related issues.
>
> 1) cleancache (from Oracle)
> 2) VirtFS (from IBM)
> 3) kstaled (from Google)
> 4) unmapped page reclaim (from you)
>
> Probably, we can't merge all of them and we need to consolidate some
> requirement and implementations.

Well all these approaches show that we have major issues with reclaim and
large memory. Things get overly complicated. Time for a new approach that
integrates all the goals that these try to accomplish?

> Personally I think cleancache or other multi level page cache framework
> looks promising. but another solution is also acceptable. Anyway, I hope
> to everyone back 1000feet bird eye at once and sorting out all requiremnt
> with all related person.

Would be good if you could tackle that problem.


Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-28 Thread Christoph Lameter
On Fri, 28 Jan 2011, KAMEZAWA Hiroyuki wrote:

> > > I see it as a tradeoff of when to check? add_to_page_cache or when we
> > > are want more free memory (due to allocation). It is OK to wakeup
> > > kswapd while allocating memory, somehow for this purpose (global page
> > > cache), add_to_page_cache or add_to_page_cache_locked does not seem
> > > the right place to hook into. I'd be open to comments/suggestions
> > > though from others as well.
>
> I don't like add hook here.
> AND I don't want to run kswapd because 'kswapd' has been a sign as
> there are memory shortage. (reusing code is ok.)
>
> How about adding new daemon ? Recently, khugepaged, ksmd works for
> managing memory. Adding one more daemon for special purpose is not
> very bad, I think. Then, you can do
>  - wake up without hook
>  - throttle its work.
>  - balance the whole system rather than zone.
>I think per-node balance is enough...


I think we already have enough kernel daemons floating around. They are
multiplying in an amazing way. What would be useful is to map all
the memory management background stuff into a process. Maybe call this memd
instead? Perhaps we can fold khugepaged into kswapd as well etc.


Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-26 Thread Christoph Lameter

Reviewed-by: Christoph Lameter 




Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v4)

2011-01-26 Thread Christoph Lameter

Reviewed-by: Christoph Lameter 



Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)

2011-01-21 Thread Christoph Lameter
On Fri, 21 Jan 2011, Balbir Singh wrote:

> * Christoph Lameter  [2011-01-20 09:00:09]:
>
> > On Thu, 20 Jan 2011, Balbir Singh wrote:
> >
> > > + unmapped_page_control
> > > + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
> > > + is enabled. It controls the amount of unmapped memory
> > > + that is present in the system. This boot option plus
> > > + vm.min_unmapped_ratio (sysctl) provide granular control
> >
> > min_unmapped_ratio is there to guarantee that zone reclaim does not
> > reclaim all unmapped pages.
> >
> > What you want here is a max_unmapped_ratio.
> >
>
> I thought about that, the logic for reusing min_unmapped_ratio was to
> keep a limit beyond which unmapped page cache shrinking should stop.

Right. That is the role of it. It's a minimum to leave. You want a maximum
size of the page cache.

> I think you are suggesting max_unmapped_ratio as the point at which
> shrinking should begin, right?

The role of min_unmapped_ratio is to never reclaim more pagecache if we
reach that ratio even if we have to go off node for an allocation.

AFAICT What you propose is a maximum size of the page cache. If the number
of page cache pages goes beyond that then you trim the page cache in
background reclaim.

> > > + reclaim_unmapped_pages(priority, zone, &sc);
> > > +
> > >   if (!zone_watermark_ok_safe(zone, order,
> >
> > Hmmm. Okay that means background reclaim does it. If so then we also want
> > zone reclaim to be able to work in the background I think.
>
> Anything specific you had in mind, works for me in testing, but is
> there anything specific that stands out in your mind that needs to be
> done?

Hmmm. So this would also work in a NUMA configuration, right? Limiting the
size of the page cache would avoid zone reclaim through these limits. Page
cache size would be limited by the max_unmapped_ratio.

zone_reclaim only would come into play if other allocations make the
memory on the node so tight that we would have to evict more page
cache pages in direct reclaim.
Then zone_reclaim could go down to shrink the page cache size to
min_unmapped_ratio.
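A hedged sketch of the distinction (zone_unmapped_file_pages() is from the
patch; zone->max_unmapped_pages is hypothetical here):

/* Ceiling: background trimming starts once the unmapped page cache grows
 * beyond the maximum. */
static bool over_max_unmapped(struct zone *zone)
{
	return zone_unmapped_file_pages(zone) > zone->max_unmapped_pages;
}

/* Floor: zone reclaim never shrinks the unmapped page cache below the
 * minimum, even if allocations then have to go off node. */
static bool above_min_unmapped(struct zone *zone)
{
	return zone_unmapped_file_pages(zone) > zone->min_unmapped_pages;
}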





Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)

2011-01-20 Thread Christoph Lameter
On Thu, 20 Jan 2011, Balbir Singh wrote:

> + unmapped_page_control
> + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
> + is enabled. It controls the amount of unmapped memory
> + that is present in the system. This boot option plus
> + vm.min_unmapped_ratio (sysctl) provide granular control

min_unmapped_ratio is there to guarantee that zone reclaim does not
reclaim all unmapped pages.

What you want here is a max_unmapped_ratio.


>  {
> @@ -2297,6 +2320,12 @@ loop_again:
>   shrink_active_list(SWAP_CLUSTER_MAX, zone,
>   &sc, priority, 0);
>
> + /*
> +  * We do unmapped page reclaim once here and once
> +  * below, so that we don't lose out
> +  */
> + reclaim_unmapped_pages(priority, zone, &sc);
> +
>   if (!zone_watermark_ok_safe(zone, order,

Hmmm. Okay that means background reclaim does it. If so then we also want
zone reclaim to be able to work in the background I think.
max_unmapped_ratio could also be useful to the zone reclaim logic.



Re: [REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)

2011-01-20 Thread Christoph Lameter

Reviewed-by: Christoph Lameter 



Re: [REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)

2011-01-20 Thread Christoph Lameter
On Thu, 20 Jan 2011, Balbir Singh wrote:

> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -253,11 +253,11 @@ extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
>
> +extern int sysctl_min_unmapped_ratio;
> +extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
>  #ifdef CONFIG_NUMA
>  extern int zone_reclaim_mode;
> -extern int sysctl_min_unmapped_ratio;
>  extern int sysctl_min_slab_ratio;
> -extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
>  #else
>  #define zone_reclaim_mode 0

So the end result of this patch is that zone reclaim is compiled
into vmscan.o even on !NUMA configurations but since zone_reclaim_mode ==
0 no one can ever call that code?



Re: BUG: sleeping function called from invalid context at mm/slub.c:793

2011-01-11 Thread Christoph Lameter


Reviewed-by: Christoph Lameter 



Re: BUG: sleeping function called from invalid context at mm/slub.c:793

2011-01-10 Thread Christoph Lameter

On Mon, 10 Jan 2011, Kirill A. Shutemov wrote:

> Every time I run qemu with KVM enabled I get this in dmesg:
>
> [  182.878328] BUG: sleeping function called from invalid context at 
> mm/slub.c:793
> [  182.878339] in_atomic(): 1, irqs_disabled(): 0, pid: 4992, name: qemu
> [  182.878355] Pid: 4992, comm: qemu Not tainted 2.6.37+ #31
> [  182.878361] Call Trace:
> [  182.878381]  [] ? __might_sleep+0xd0/0xd7
> [  182.878394]  [] ? slab_pre_alloc_hook.clone.39+0x23/0x27
> [  182.878404]  [] ? kmem_cache_alloc+0x22/0xc8
> [  182.878414]  [] ? init_fpu+0x44/0x7b

fpu_alloc() does call kmem_cache_alloc with GFP_KERNEL although we are in
an atomic context.
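For context, a hedged sketch of the pattern that trips the check (the call
follows the fpu_alloc() path of that kernel; whether the right fix is
GFP_ATOMIC or allocating before entering the atomic section depends on the
caller):

	/* slab_pre_alloc_hook() does might_sleep_if() for sleepable gfp
	 * masks, so this warns whenever in_atomic() is true: */
	fpu->state = kmem_cache_alloc(task_xstate_cachep, GFP_KERNEL);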


Re: [PATCH 3/3] Provide control over unmapped pages

2010-12-01 Thread Christoph Lameter
On Tue, 30 Nov 2010, Andrew Morton wrote:

> > +#define UNMAPPED_PAGE_RATIO 16
>
> Well.  Giving 16 a name didn't really clarify anything.  Attentive
> readers will want to know what this does, why 16 was chosen and what
> the effects of changing it will be.

The meaning is analogous to the other zone reclaim ratio. But yes it
should be justified and defined.

> > Reviewed-by: Christoph Lameter 
>
> So you're OK with shoving all this flotsam into 100,000,000 cellphones?
> This was a pretty outrageous patchset!

This is a feature that has been requested over and over for years. Using
/proc/sys/vm/drop_caches for fixing situations where one simply has too many
page cache pages is not so much fun in the long run.



Re: [PATCH 3/3] Provide control over unmapped pages

2010-11-30 Thread Christoph Lameter

Looks good.

Reviewed-by: Christoph Lameter 




Re: [PATCH 2/3] Refactor zone_reclaim

2010-11-30 Thread Christoph Lameter

Reviewed-by: Christoph Lameter 



Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA

2010-11-30 Thread Christoph Lameter

Reviewed-by: Christoph Lameter 




Re: [RFC][PATCH 1/3] Linux/Guest unmapped page cache control

2010-11-03 Thread Christoph Lameter
On Wed, 3 Nov 2010, Balbir Singh wrote:

> > > +#define UNMAPPED_PAGE_RATIO 16
> >
> > Maybe come up with a scheme that allows better configuration of the
> > minimum? I think in some settings we may want an absolute limit and in
> > other a fraction of something (total zone size or working set?)
> >
>
> Are you suggesting a sysctl or computation based on zone size and
> limit, etc? I understand it to be the latter.

Do a computation based on zone size on startup and then allow the
user to modify the absolute size of the page cache?


Hmmm.. That would have to be per zone/node or somehow distributed over all
zones/nodes.
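Roughly, the idea (a sketch; max_unmapped_pages as a per-zone field and the
1/16 default are illustrative assumptions):

static void init_max_unmapped(struct zone *zone)
{
	/* default derived from the zone size at startup ... */
	zone->max_unmapped_pages = zone->present_pages / 16;
	/* ... and later overridable by the admin as an absolute number of
	 * pages, distributed per zone/node. */
}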




Re: [RFC][PATCH 1/3] Linux/Guest unmapped page cache control

2010-11-03 Thread Christoph Lameter
On Fri, 29 Oct 2010, Balbir Singh wrote:

> A lot of the code is borrowed from zone_reclaim_mode logic for
> __zone_reclaim(). One might argue that the with ballooning and
> KSM this feature is not very useful, but even with ballooning,

Interesting use of zone reclaim. I am having a difficult time reviewing
the patch since you move and modify functions at the same time. Could you
separate that out a bit?

> +#define UNMAPPED_PAGE_RATIO 16

Maybe come up with a scheme that allows better configuration of the
minimum? I think in some settings we may want an absolute limit and in
other a fraction of something (total zone size or working set?)


> +bool should_balance_unmapped_pages(struct zone *zone)
> +{
> + if (unmapped_page_control &&
> + (zone_unmapped_file_pages(zone) >
> + UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages))
> + return true;
> + return false;
> +}


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Add vzalloc shortcut

2010-10-18 Thread Christoph Lameter
On Sat, 16 Oct 2010, Dave Young wrote:

> Add vzalloc for convinience of vmalloc-then-memset-zero case

Reviewed-by: Christoph Lameter 

Wish we would also have vzalloc_node() but I guess that can wait.
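Something along those lines could be built from existing helpers (a sketch,
not a proposed implementation):

void *vzalloc_node(unsigned long size, int node)
{
	void *p = vmalloc_node(size, node);

	if (p)
		memset(p, 0, size);
	return p;
}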



Re: [PATCH v5 03/12] Add async PF initialization to PV guest.

2010-08-23 Thread Christoph Lameter
On Mon, 23 Aug 2010, Gleb Natapov wrote:

> > The guest will have to align this on a 64 byte boundary, should this
> > be marked __aligned(64) here?
> >
> I do __aligned(64) when I declare variable of that type:
>
> static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);

64 byte boundary: You mean cacheline aligned? We have a special define for
that.

DEFINE_PER_CPU_SHARED_ALIGNED
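I.e., using the variable from the quoted hunk:

static DEFINE_PER_CPU_SHARED_ALIGNED(struct kvm_vcpu_pv_apf_data, apf_reason);

Note that this aligns to the architecture's cacheline size rather than a
literal 64 bytes.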



Re: [PATCH v2 10/12] Maintain preemptability count even for !CONFIG_PREEMPT kernels

2009-11-30 Thread Christoph Lameter
Ok so there is some variance in tests as usual due to cacheline placement.
But it seems that overall we are looking at a 1-2% regression.



Re: [PATCH v2 10/12] Maintain preemptability count even for !CONFIG_PREEMPT kernels

2009-11-24 Thread Christoph Lameter
On Tue, 24 Nov 2009, Gleb Natapov wrote:

> On Mon, Nov 23, 2009 at 11:30:02AM -0600, Christoph Lameter wrote:
> > This adds significant overhead for the !PREEMPT case adding lots of code
> > in critical paths all over the place.
> I want to measure it. Can you suggest benchmarks to try?

AIM9 (reaim9)?

Any test suite will do that tests OS performance.

Latency will also be negatively impacted. There are already significant
regressions in recent kernel releases so many of us who are sensitive
to these issues just stick with old kernels (2.6.22 f.e.) and hope
that the upstream issues are worked out at some point.

There is also lldiag package in my directory. See

http://www.kernel.org/pub/linux/kernel/people/christoph/lldiag

Try the latency test and the mcast test. Localhost multicast is typically
a good test for kernel performance.

There is also the page fault test that Kamezawa-san posted recently in the
thread where we tried to deal with the long term mmap_sem issues.


Re: [PATCH v2 10/12] Maintain preemptability count even for !CONFIG_PREEMPT kernels

2009-11-23 Thread Christoph Lameter
This adds significant overhead for the !PREEMPT case adding lots of code
in critical paths all over the place.




Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Christoph Lameter
On Wed, 12 Nov 2008, Lee Schermerhorn wrote:

> Might want/need to check for migration entry in do_swap_page() and loop
> back to migration_entry_wait() call when the changed pte is detected
> rather than returning an error to the caller.
>
> Does that sound reasonable?

The reference count freezing and the rechecking of the pte in
do_swap_page() does not work? Nick broke it during lock removal for the
lockless page cache?




Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Christoph Lameter
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
>
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in

If an anonymous page is a swap page then it has a mapping.
migrate_page_move_mapping() will lock the radix tree and ensure that no
additional reference (like done by do_swap_page) is established during
migration.

> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still

So far I have seen wild ideas, not bugs.





Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> So are you checking if there's an unresolved reference only in the
> very place I just quoted in the previous email? If answer is yes: what
> should prevent get_user_pages from running in parallel from another
> thread? get_user_pages will trigger a minor fault and get the elevated
> reference just after you read page_count. To you it looks like there
> is no o_direct in progress when you proceed to the core of migration
> code, but in effect o_direct just started a moment after you read the
> page count.

get_user_pages() cannot get to it since the pagetables have already been
modified. If get_user_pages runs then the fault handling will occur
which will block the thread until migration is complete.


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> > O_DIRECT does not take a refcount on the page in order to prevent this?
>
> It definitely does, it's also the only thing it does.

Then page migration will not occur because there is an unresolved
reference.

> The whole point is that O_DIRECT can start the instruction after
> page_count returns as far as I can tell.

But there must still be a reference for the bio and whatever may be going on
at the time in order to perform the I/O operation.

> If you check the three emails I linked in answer to Andrew on the
> topic, we agree the o_direct can't start under PT lock (or under
> mmap_sem in write mode but migrate.c rightefully takes the read
> mode). So the fix used in ksm page_wrprotect and in fork() is to check
> page_count vs page_mapcount inside PT lock before doing anything on
> the pte. If you just mark the page wprotect while O_DIRECT is in
> flight, that's enough for fork() to generate data corruption in the
> parent (not the child where the result would be undefined). But in the
> parent the result of the o-direct is defined and it'd never corrupt if
> this was a cached-I/O. The moment the parent pte is marked readonly, a thread
> in the parent could write to the last 512bytes of the page, leading to
> the first 512bytes coming with O_DIRECT from disk being lost (as the
> write will trigger a cow before I/O is complete and the dma will
> complete on the oldpage).

Have you actually seen corruption or is this conjecture? AFAICT the page
count is elevated while I/O is in progress and thus this is safe.



Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> this page_count check done with only the tree_lock won't prevent a
> task to start O_DIRECT after page_count has been read in the above line.
>
> If a thread starts O_DIRECT on the page, and the o_direct is still in
> flight by the time you copy the page to the new page, the read will
> not be represented fully in the newpage leading to userland data
> corruption.

O_DIRECT does not take a refcount on the page in order to prevent this?

> > Define a regular VM page? A page on the LRU?
>
> Yes, pages owned, allocated and worked on by the VM. So they can be
> swapped, collected, migrated etc... You can't possibly migrate a
> device driver page for example and infact those device driver pages
> can't be migrated either.

Oh they could be migrated if you had a callback to the device's method for
giving up references. Same as slab defrag.
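Roughly the kind of hook meant here (entirely hypothetical, sketched only
to illustrate; no such API exists):

struct driver_page_ops {
	/*
	 * Drop, or transfer to newpage, every reference the driver holds
	 * on page so that migration can proceed. Returns 0 on success.
	 */
	int (*giveup_references)(struct page *page, struct page *newpage);
};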

> The KSM page initially is a driver page, later we'd like to teach the
> VM how to swap it by introducing rmap methods and adding it to the
> LRU. As long as it's only anonymous memory that we're sharing/cloning,
> we won't have to patch pagecache radix tree and other stuff. BTW, if
> we ever decice to clone pagecache we could generate immense metadata
> ram overhead in the radix tree with just a single page of data. All
> issues that don't exist for anon ram.

Seems that we are tinkering around with the concept of what an anonymous
page is? Doesn't shmem have some means of converting pages to file backed?
Swizzling?



Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Avi Kivity wrote:

> Christoph Lameter wrote:
> > page migration requires the page to be on the LRU. That could be changed
> > if you have a different means of isolating a page from its page tables.
> >
>
> Isn't rmap the means of isolating a page from its page tables?  I guess I'm
> misunderstanding something.

In order to migrate a page one first has to make sure that userspace and
the kernel cannot access the page in any way. User space must be made to
generate page faults and all kernel references must be accounted for and
not be in use.

The user space portion involves changing the page tables so that faults
are generated.

The kernel portion isolates the page from the LRU (to exempt it from
kernel reclaim handling etc).

Only then can the page and its metadata be copied to a new location.

Guess you already have the LRU portion done. So you just need the user
space isolation portion?
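As an outline of that sequence (hedged: helper names roughly follow
migrate.c in later kernels, with locking and error handling omitted):

static int migrate_one_page(struct page *page, struct page *newpage)
{
	if (isolate_lru_page(page))	/* kernel side: take it off the LRU */
		return -EBUSY;

	/* User side: ptes are replaced by migration entries, so any
	 * access faults and waits until migration completes. */
	try_to_unmap(page, TTU_MIGRATION);

	if (page_mapped(page))		/* references not fully resolved */
		return -EAGAIN;

	return move_to_new_page(newpage, page);	/* copy data and metadata */
}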



Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Izik Eidus wrote:

> > What do you mean by kernel page? The kernel can allocate a page and then
> > point a user space pte to it. That is how page migration works.
> >
> i mean filebacked page (!AnonPage())

ok.

> ksm need the pte inside the vma to point from anonymous page into
> filebacked page. can migrate.c do it without changes?

So change anonymous to filebacked page?

Currently page migration assumes that the page will continue to be part
of the existing file or anon vma.

What you want sounds like assigning a swap pte to an anonymous page? That
way an anon page gains membership in a file backed mapping.




Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> btw, page_migration likely is buggy w.r.t. o_direct too (and now
> unfixable with gup_fast until the 2.4 brlock is added around it or
> similar) if it does the same thing but without any page_mapcount vs
> page_count check.

Details please?

> page_migration does too much for us, so us calling into migrate.c may
> not be ideal. It has to convert a fresh page to a VM page. In KSM we
> don't convert the newpage to be a VM page, we just replace the anon
> page with another page. The new page in the KSM case is not a page
> known by the VM, not in the lru etc...

A VM page as opposed to pages not in the VM? ???

page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.

> The way to go could be to change the page_migration to use
> replace_page (or __replace_page if called in some shared inner-lock
> context) after preparing the newpage to be a regular VM page. If we
> can do that, migrate.c will get the o_direct race fixed too for free.

Define a regular VM page? A page on the LRU?



Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Izik Eidus wrote:

> yes but it replace it with kernel allocated page.
> > page migration already kinda does that.  Is there common ground?
> >
> >
> page migration as far as i saw cant migrate anonymous page into kernel page.
> if you want we can change page_migration to do that, but i thought you will
> rather have ksm changes separate.

What do you mean by kernel page? The kernel can allocate a page and then
point a user space pte to it. That is how page migration works.