Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node
On Thu, 30 Jul 2015, Vlastimil Babka wrote:
> > NAK. This is changing slob behavior. With no node specified it must use
> > alloc_pages because that obeys NUMA memory policies etc etc. It should not
> > force allocation from the current node like what is happening here after
> > the patch. See the code in slub.c that is similar.
>
> Doh, somehow I convinced myself that there's #else and alloc_pages() is only
> used for !CONFIG_NUMA so it doesn't matter. Here's a fixed version.

Acked-by: Christoph Lameter

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()
On Thu, 30 Jul 2015, Vlastimil Babka wrote:
> numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
> so it's a more robust fallback than the currently used numa_node_id().
>
> Suggested-by: Christoph Lameter
> Signed-off-by: Vlastimil Babka
> Acked-by: David Rientjes
> Acked-by: Mel Gorman

You can add my ack too if it helps.

Acked-by: Christoph Lameter
Re: [PATCH v3 2/3] mm: unify checks in alloc_pages_node() and __alloc_pages_node()
Acked-by: Christoph Lameter
Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node
On Thu, 30 Jul 2015, Vlastimil Babka wrote:
> --- a/mm/slob.c
> +++ b/mm/slob.c
> 	void *page;
>
> -#ifdef CONFIG_NUMA
> -	if (node != NUMA_NO_NODE)
> -		page = alloc_pages_exact_node(node, gfp, order);
> -	else
> -#endif
> -		page = alloc_pages(gfp, order);
> +	page = alloc_pages_node(node, gfp, order);

NAK. This is changing slob behavior. With no node specified it must use
alloc_pages because that obeys NUMA memory policies etc etc. It should not
force allocation from the current node like what is happening here after
the patch. See the code in slub.c that is similar.
Re: [PATCH] mm: rename and document alloc_pages_exact_node
On Wed, 22 Jul 2015, David Rientjes wrote:
> Eek, yeah, that does look bad. I'm not even sure the
>
> 	if (nid < 0)
> 		nid = numa_node_id();
>
> is correct; I think this should be comparing to NUMA_NO_NODE rather than
> all negative numbers, otherwise we silently ignore overflow and nobody
> ever knows.

Comparing to NUMA_NO_NODE would be better. Also use numa_mem_id()
instead to support memoryless nodes better?

> The only possible downside would be existing users of
> alloc_pages_node() that are calling it with an offline node. Since it's a
> VM_BUG_ON() that would catch that, I think it should be changed to a
> VM_WARN_ON() and eventually fixed up because it's nonsensical.

VM_BUG_ON() here should be avoided. The offline node thing could be
addressed by using numa_mem_id()?
Re: [PATCH] mm: rename and document alloc_pages_exact_node
On Tue, 21 Jul 2015, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.

Yup. I complained about this when this was introduced. Glad to see this
fixed.

Initially this was alloc_pages_node() which just means that a node is
specified. The exact behavior of the allocation is determined by flags
such as GFP_THISNODE. I'd rather have that restored because otherwise we
get into weird code like the one below. And such an arrangement also
leaves the way open to add more flags in the future that may change the
allocation behavior.

> 	area->nid = nid;
> 	area->order = order;
> -	area->pages = alloc_pages_exact_node(area->nid,
> +	area->pages = alloc_pages_prefer_node(area->nid,
> 						GFP_KERNEL|__GFP_THISNODE,
> 						area->order);

This is not preferring a node but requiring allocation on that node.
Re: [PATCH-v3 1/4] idr: Percpu ida
On Fri, 16 Aug 2013, Nicholas A. Bellinger wrote:
> +	spinlock_t		lock;

Remove the spinlock.

> +	unsigned		nr_free;
> +	unsigned		freelist[];
> +};
> +
> +static inline void move_tags(unsigned *dst, unsigned *dst_nr,
> +			     unsigned *src, unsigned *src_nr,
> +			     unsigned nr)
> +{
> +	*src_nr -= nr;
> +	memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
> +	*dst_nr += nr;
> +}
> +
> +static inline unsigned alloc_local_tag(struct percpu_ida *pool,
> +				       struct percpu_ida_cpu *tags)

Pass the __percpu offset and not the tags pointer.

> +{
> +	int tag = -ENOSPC;
> +
> +	spin_lock(&tags->lock);

Interrupts are already disabled. Drop the spinlock.

> +	if (tags->nr_free)
> +		tag = tags->freelist[--tags->nr_free];

You can keep this or avoid address calculation through segment prefixes.

F.e.

	if (__this_cpu_read(tags->nr_free)) {
		int n = __this_cpu_dec_return(tags->nr_free);

		tag = __this_cpu_read(tags->freelist[n]);
	}

> +	spin_unlock(&tags->lock);

Drop.

> + * Returns a tag - an integer in the range [0..nr_tags) (passed to
> + * tag_pool_init()), or otherwise -ENOSPC on allocation failure.
> + *
> + * Safe to be called from interrupt context (assuming it isn't passed
> + * __GFP_WAIT, of course).
> + *
> + * Will not fail if passed __GFP_WAIT.
> + */
> +int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
> +{
> +	DEFINE_WAIT(wait);
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	int tag;
> +
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);

You could drop this_cpu_ptr if you pass pool->tag_cpu to alloc_local_tag.

> +/**
> + * percpu_ida_free - free a tag
> + * @pool: pool @tag was allocated from
> + * @tag: a tag previously allocated with percpu_ida_alloc()
> + *
> + * Safe to be called from interrupt context.
> + */
> +void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
> +{
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	unsigned nr_free;
> +
> +	BUG_ON(tag >= pool->nr_tags);
> +
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
> +
> +	spin_lock(&tags->lock);

No need for spinlocking.

> +	tags->freelist[tags->nr_free++] = tag;

	nr_free = __this_cpu_inc_return(pool->tag_cpu.nr_free);
	__this_cpu_write(pool->tag_cpu.freelist[nr_free], tag);
Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
On Fri, 6 Jul 2012, James Bottomley wrote:
> What people might pay attention to is evidence that there's a problem in
> 3.5-rc6 (without any OFED crap). If you're not going to bother
> investigating, it has to be in an environment they can reproduce (so
> ordinary hardware, not infiniband) otherwise it gets ignored as an
> esoteric hardware issue.

The OFED stuff in the meantime is part of 3.5-rc6. Infiniband has been
supported for a long time and it's a very important technology given the
problematic nature of ethernet at high network speeds.

OFED crap exists for those running RHEL5/6. The new enterprise distros
are based on the 3.2 kernel which has pretty good Infiniband support out
of the box.
Re: [PATCHv2] bitops: add _local bitops
On Wed, 9 May 2012, Michael S. Tsirkin wrote:
> kvm needs to update some hypervisor variables atomically
> in a sense that the operation can't be interrupted
> in the middle. However the hypervisor always runs
> on the same CPU so it does not need any memory
> barrier or lock prefix.
>
> Add _local bitops for this purpose: define them
> as non-atomics for x86 and (for now) atomics for
> everyone else.

Have you tried to use the this_cpu_ops for that purpose? They create the
per cpu atomic instructions that you want without a lock prefix and can
also relocate the per cpu pointer to the correct processor via a segment
register prefix. There are no bit operations provided right now but those
can either be improvised using this_cpu_cmpxchg or added.
Re: [PATCH] Code clean up for percpu_xxx() functions
On Wed, 19 Oct 2011, Alex Shi wrote:
> Thanks for comments! I initialized the patch as follows accordingly,
> and cc'd more maintainers for review. I checked all code except the
> xen/kvm part; totally an idiot for them.

Acked-by: Christoph Lameter
Re: [PATCH 0/3] Unmapped page cache control (v5)
On Sun, 3 Apr 2011, KOSAKI Motohiro wrote:
> 1) Some bios don't have such knob. btw, OK, yes, *I* can switch NUMA off
>    completely because I don't have such bios.
> 2) bios level turning off makes some side effects, example, scheduler
>    load balancing don't care numa anymore.

Well then let's add a kernel parameter that switches all NUMA off.
Otherwise: If you just run a kernel build without NUMA support then you
have a similar effect.

Re #2) If you have the system toss processes around the system then the
load balancing heuristics do not bring you any benefit.
Re: [PATCH 0/3] Unmapped page cache control (v5)
On Sat, 2 Apr 2011, Dave Chinner wrote:
> Fundamentally, if you just switch off memory reclaim to avoid the
> latencies involved with direct memory reclaim, then all you'll get
> instead is ENOMEM because there's no memory available and none will be
> reclaimed. That's even more fatal for the system than doing reclaim.

Not for my use cases here. No one will die if reclaim happens but it's
bad for the bottom line. Reducing the chance of memory reclaim occurring
in a critical section is sufficient.
Re: [PATCH 0/3] Unmapped page cache control (v5)
On Fri, 1 Apr 2011, KOSAKI Motohiro wrote:
> > On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:
> >
> > > 1) zone reclaim doesn't work if the system has multiple node and the
> > >    workload is file cache oriented (eg file server, web server, mail
> > >    server, et al).
> > >    because zone recliam make some much free pages than zone->pages_min and
> > >    then new page cache request consume nearest node memory and then it
> > >    bring next zone reclaim. Then, memory utilization is reduced and
> > >    unnecessary LRU discard is increased dramatically.
> >
> > That is only true if the webserver only allocates from a single node. If
> > the allocation load is balanced then it will be fine. It is useful to
> > reclaim pages from the node where we allocate memory since that keeps the
> > dataset node local.
>
> Why?
> Scheduler load balancing only consider cpu load. Then, usually memory
> pressure is no complete symmetric. That's the reason why we got the
> bug report periodically.

The scheduler load balancing also considers caching effects. It does not
consider NUMA effects aside from heuristics though. If processes are
randomly moving around then zone reclaim is not effective. Processes need
to stay mainly on a certain node and memory needs to be allocatable from
that node in order to improve performance. zone_reclaim is useless if you
toss processes around the box.

> btw, when we are talking about memory distance aware reclaim, we have to
> recognize traditional numa (ie external node interconnect) and on-chip
> numa have different performance characteristics. on-chip remote node access
> is not so slow, then elaborated nearest node allocation effort doesn't have
> so much worth. especially, a workload use a lot of short lived object.
> Current zone-reclaim don't have so much issue when using traditional numa
> because it's fit your original design and assumption and administrators of
> such systems have good skill and don't hesitate to learn esoteric knobs.
> But recent on-chip and cheap numa are used for much different people against
> past. therefore new issues and claims were raised.

You can switch NUMA off completely at the bios level. Then the distances
are not considered by the OS. If they are not relevant then let's just
switch NUMA off. Managing NUMA distances can cause significant overhead.
Re: [PATCH 0/3] Unmapped page cache control (v5)
On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:
> 1) zone reclaim doesn't work if the system has multiple node and the
>    workload is file cache oriented (eg file server, web server, mail server,
>    et al).
>    because zone recliam make some much free pages than zone->pages_min and
>    then new page cache request consume nearest node memory and then it
>    bring next zone reclaim. Then, memory utilization is reduced and
>    unnecessary LRU discard is increased dramatically.

That is only true if the webserver only allocates from a single node. If
the allocation load is balanced then it will be fine. It is useful to
reclaim pages from the node where we allocate memory since that keeps the
dataset node local.

>    SGI folks added CPUSET specific solution in past.
>    (cpuset.memory_spread_page)
>    But global reclaim still have its issue. zone reclaim is HPC workload
>    specific feature and HPC folks has no motivation to don't use CPUSET.

The spreading can also be done via memory policies. But that is only
required if the application has an unbalanced allocation behavior.

> 2) Before 2.6.27, VM has only one LRU and calc_reclaim_mapped() is used to
>    decide to filter out mapped pages. It made a lot of problems for DB servers
>    and large application servers. Because, if the system has a lot of mapped
>    pages, 1) LRU was churned and then reclaim algorithm become lotree one. 2)
>    reclaim latency become terribly slow and hangup detectors misdetect its
>    state and start to force reboot. That was big problem of RHEL5 based
>    banking system.
>    So, sc->may_unmap should be killed in future. Don't increase uses.

Because a bank could not configure its system properly we need to get rid
of may_unmap? Maybe raise min_unmapped_ratio instead and take care that
either the allocation load is balanced or a round robin scheme is used by
the app?

> And, this patch introduce new allocator fast path overhead. I haven't seen
> any justification for it.

We could do the triggering differently.

> In other words, you have to kill following three for getting ack 1) zone
> reclaim oriented reclaim 2) filter based LRU scanning (eg sc->may_unmap)
> 3) fastpath overhead. In other words, If you want a feature for vm guest,
> Any hardcoded machine configration assumption and/or workload assumption
> are wrong.

It would be good if you could come up with a new reclaim scheme that
avoids the need for zone reclaim and still allows one to take advantage
of memory distances. I agree that the current scheme sometimes requires
tuning too many esoteric knobs to get useful behavior.

> But, I agree that now we have to concern slightly large VM change perhaps
> (or perhaps not). Ok, it's good opportunity to fill out some thing.
> Historically, Linux MM has "free memory are waste memory" policy, and It
> worked completely fine. But now we have a few exceptions.
>
> 1) RT, embedded and finance systems. They really hope to avoid reclaim
>    latency (ie avoid foreground reclaim completely) and they can accept
>    to make slightly much free pages before memory shortage.

In general we need a mechanism to ensure we can avoid reclaim during
critical sections of an application. So some way to give some hints to
the machine to free up lots of memory (/proc/sys/vm/drop_caches is far
too drastic) may be useful.

> And, now we have four proposal of utilization related issues.
>
> 1) cleancache (from Oracle)
> 2) VirtFS (from IBM)
> 3) kstaled (from Google)
> 4) unmapped page reclaim (from you)
>
> Probably, we can't merge all of them and we need to consolidate some
> requirement and implementations.

Well, all these approaches show that we have major issues with reclaim
and large memory. Things get overly complicated. Time for a new approach
that integrates all the goals that these try to accomplish?

> Personally I think cleancache or other multi level page cache framework
> looks promising. but another solution is also acceptable. Anyway, I hope
> to everyone back 1000feet bird eye at once and sorting out all requirement
> with all related person.

Would be good if you could tackle that problem.
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
On Fri, 28 Jan 2011, KAMEZAWA Hiroyuki wrote:
> > > I see it as a tradeoff of when to check? add_to_page_cache or when we
> > > want more free memory (due to allocation). It is OK to wakeup
> > > kswapd while allocating memory, somehow for this purpose (global page
> > > cache), add_to_page_cache or add_to_page_cache_locked does not seem
> > > the right place to hook into. I'd be open to comments/suggestions
> > > though from others as well.
>
> I don't like add hook here.
> AND I don't want to run kswapd because 'kswapd' has been a sign as
> there are memory shortage. (reusing code is ok.)
>
> How about adding new daemon ? Recently, khugepaged, ksmd works for
> managing memory. Adding one more daemon for special purpose is not
> very bad, I think. Then, you can do
> - wake up without hook
> - throttle its work.
> - balance the whole system rather than zone.
>   I think per-node balance is enough...

I think we already have enough kernel daemons floating around. They are
multiplying in an amazing way. What would be useful is to map all the
memory management background stuff into a process. Maybe call this memd
instead? Perhaps we can fold khugepaged into kswapd as well etc.
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
Reviewed-by: Christoph Lameter
Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v4)
Reviewed-by: Christoph Lameter
Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
On Fri, 21 Jan 2011, Balbir Singh wrote:
> * Christoph Lameter [2011-01-20 09:00:09]:
>
> > On Thu, 20 Jan 2011, Balbir Singh wrote:
> >
> > > +	unmapped_page_control
> > > +	[KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
> > > +	is enabled. It controls the amount of unmapped memory
> > > +	that is present in the system. This boot option plus
> > > +	vm.min_unmapped_ratio (sysctl) provide granular control
> >
> > min_unmapped_ratio is there to guarantee that zone reclaim does not
> > reclaim all unmapped pages.
> >
> > What you want here is a max_unmapped_ratio.
>
> I thought about that, the logic for reusing min_unmapped_ratio was to
> keep a limit beyond which unmapped page cache shrinking should stop.

Right. That is the role of it. It's a minimum to leave. You want a
maximum size of the page cache.

> I think you are suggesting max_unmapped_ratio as the point at which
> shrinking should begin, right?

The role of min_unmapped_ratio is to never reclaim more pagecache if we
reach that ratio even if we have to go off node for an allocation. AFAICT
what you propose is a maximum size of the page cache. If the number of
page cache pages goes beyond that then you trim the page cache in
background reclaim.

> > > +	reclaim_unmapped_pages(priority, zone, &sc);
> > > +
> > > 	if (!zone_watermark_ok_safe(zone, order,
> >
> > Hmmm. Okay that means background reclaim does it. If so then we also
> > want zone reclaim to be able to work in the background I think.
>
> Anything specific you had in mind, works for me in testing, but is
> there anything specific that stands out in your mind that needs to be
> done?

Hmmm. So this would also work in a NUMA configuration, right? Limiting
the sizes of the page cache would avoid zone reclaim through these
limits. Page cache size would be limited by max_unmapped_ratio.

zone_reclaim only would come into play if other allocations make the
memory on the node so tight that we would have to evict more page cache
pages in direct reclaim. Then zone_reclaim could go down to shrink the
page cache size to min_unmapped_ratio.
Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
On Thu, 20 Jan 2011, Balbir Singh wrote:
> +	unmapped_page_control
> +	[KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
> +	is enabled. It controls the amount of unmapped memory
> +	that is present in the system. This boot option plus
> +	vm.min_unmapped_ratio (sysctl) provide granular control

min_unmapped_ratio is there to guarantee that zone reclaim does not
reclaim all unmapped pages.

What you want here is a max_unmapped_ratio.

> {
> @@ -2297,6 +2320,12 @@ loop_again:
> 			shrink_active_list(SWAP_CLUSTER_MAX, zone,
> 						&sc, priority, 0);
>
> +			/*
> +			 * We do unmapped page reclaim once here and once
> +			 * below, so that we don't lose out
> +			 */
> +			reclaim_unmapped_pages(priority, zone, &sc);
> +
> 			if (!zone_watermark_ok_safe(zone, order,

Hmmm. Okay that means background reclaim does it. If so then we also want
zone reclaim to be able to work in the background I think.

max_unmapped_ratio could also be useful to the zone reclaim logic.
Re: [REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)
Reviewed-by: Christoph Lameter
Re: [REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)
On Thu, 20 Jan 2011, Balbir Singh wrote:
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -253,11 +253,11 @@ extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
>
> +extern int sysctl_min_unmapped_ratio;
> +extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
>  #ifdef CONFIG_NUMA
>  extern int zone_reclaim_mode;
> -extern int sysctl_min_unmapped_ratio;
>  extern int sysctl_min_slab_ratio;
> -extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
>  #else
>  #define zone_reclaim_mode 0

So the end result of this patch is that zone reclaim is compiled into
vmscan.o even on !NUMA configurations but since zone_reclaim_mode == 0
no one can ever call that code?
Re: BUG: sleeping function called from invalid context at mm/slub.c:793
Reviewed-by: Christoph Lameter
Re: BUG: sleeping function called from invalid context at mm/slub.c:793
On Mon, 10 Jan 2011, Kirill A. Shutemov wrote:
> Every time I run qemu with KVM enabled I get this in dmesg:
>
> [  182.878328] BUG: sleeping function called from invalid context at mm/slub.c:793
> [  182.878339] in_atomic(): 1, irqs_disabled(): 0, pid: 4992, name: qemu
> [  182.878355] Pid: 4992, comm: qemu Not tainted 2.6.37+ #31
> [  182.878361] Call Trace:
> [  182.878381] [] ? __might_sleep+0xd0/0xd7
> [  182.878394] [] ? slab_pre_alloc_hook.clone.39+0x23/0x27
> [  182.878404] [] ? kmem_cache_alloc+0x22/0xc8
> [  182.878414] [] ? init_fpu+0x44/0x7b

fpu_alloc() does call kmem_cache_alloc with GFP_KERNEL although we are
in an atomic context.
Re: [PATCH 3/3] Provide control over unmapped pages
On Tue, 30 Nov 2010, Andrew Morton wrote:
> > +#define UNMAPPED_PAGE_RATIO 16
>
> Well. Giving 16 a name didn't really clarify anything. Attentive
> readers will want to know what this does, why 16 was chosen and what
> the effects of changing it will be.

The meaning is analogous to the other zone reclaim ratio. But yes it
should be justified and defined.

> > Reviewed-by: Christoph Lameter
>
> So you're OK with shoving all this flotsam into 100,000,000 cellphones?
> This was a pretty outrageous patchset!

This is a feature that has been requested over and over for years. Using
/proc/sys/vm/drop_caches for fixing situations where one simply has too
many page cache pages is not so much fun in the long run.
Re: [PATCH 3/3] Provide control over unmapped pages
Looks good.

Reviewed-by: Christoph Lameter
Re: [PATCH 2/3] Refactor zone_reclaim
Reviewed-by: Christoph Lameter
Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA
Reviewed-by: Christoph Lameter
Re: [RFC][PATCH 1/3] Linux/Guest unmapped page cache control
On Wed, 3 Nov 2010, Balbir Singh wrote:
> > > +#define UNMAPPED_PAGE_RATIO 16
> >
> > Maybe come up with a scheme that allows better configuration of the
> > minimum? I think in some setting we may want an absolute limit and in
> > other a fraction of something (total zone size or working set?)
>
> Are you suggesting a sysctl or computation based on zone size and
> limit, etc? I understand it to be the latter.

Do a computation based on zone size on startup and then allow the user to
modify the absolute size of the page cache? Hmmm.. that would have to be
per zone/node or somehow distributed over all zones/nodes.
Re: [RFC][PATCH 1/3] Linux/Guest unmapped page cache control
On Fri, 29 Oct 2010, Balbir Singh wrote:
> A lot of the code is borrowed from zone_reclaim_mode logic for
> __zone_reclaim(). One might argue that with ballooning and
> KSM this feature is not very useful, but even with ballooning,

Interesting use of zone reclaim. I am having a difficult time reviewing
the patch since you move and modify functions at the same time. Could you
separate that out a bit?

> +#define UNMAPPED_PAGE_RATIO 16

Maybe come up with a scheme that allows better configuration of the
minimum? I think in some setting we may want an absolute limit and in
other a fraction of something (total zone size or working set?)

> +bool should_balance_unmapped_pages(struct zone *zone)
> +{
> +	if (unmapped_page_control &&
> +		(zone_unmapped_file_pages(zone) >
> +			UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages))
> +		return true;
> +	return false;
> +}
Re: [PATCH 1/2] Add vzalloc shortcut
On Sat, 16 Oct 2010, Dave Young wrote:
> Add vzalloc for convenience of the vmalloc-then-memset-zero case

Reviewed-by: Christoph Lameter

Wish we would also have vzalloc_node() but I guess that can wait.
Re: [PATCH v5 03/12] Add async PF initialization to PV guest.
On Mon, 23 Aug 2010, Gleb Natapov wrote:
> > The guest will have to align this on a 64 byte boundary, should this
> > be marked __aligned(64) here?
>
> I do __aligned(64) when I declare variable of that type:
>
> static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);

64 byte boundary: You mean cacheline aligned? We have a special define
for that: DEFINE_PER_CPU_SHARED_ALIGNED.
Re: [PATCH v2 10/12] Maintain preemptability count even for !CONFIG_PREEMPT kernels
Ok so there is some variance in tests as usual due to cacheline
placement. But it seems that overall we are looking at a 1-2% regression.
Re: [PATCH v2 10/12] Maintain preemptability count even for !CONFIG_PREEMPT kernels
On Tue, 24 Nov 2009, Gleb Natapov wrote:
> On Mon, Nov 23, 2009 at 11:30:02AM -0600, Christoph Lameter wrote:
> > This adds significant overhead for the !PREEMPT case adding lots of code
> > in critical paths all over the place.
>
> I want to measure it. Can you suggest benchmarks to try? AIM9 (reaim9)?

Any test suite will do that tests OS performance. Latency will also be
negatively impacted. There are already significant regressions in recent
kernel releases so many of us who are sensitive to these issues just
stick with old kernels (2.6.22 f.e.) and hope that the upstream issues
are worked out at some point.

There is also the lldiag package in my directory. See
http://www.kernel.org/pub/linux/kernel/people/christoph/lldiag

Try the latency test and the mcast test. Localhost multicast is typically
a good test for kernel performance. There is also the page fault test
that Kamezawa-san posted recently in the thread where we tried to deal
with the long term mmap_sem issues.
Re: [PATCH v2 10/12] Maintain preemptability count even for !CONFIG_PREEMPT kernels
This adds significant overhead for the !PREEMPT case, adding lots of code in critical paths all over the place.
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Lee Schermerhorn wrote:

> Might want/need to check for migration entry in do_swap_page() and loop
> back to migration_entry_wait() call when the changed pte is detected
> rather than returning an error to the caller.
>
> Does that sound reasonable?

The reference count freezing and the rechecking of the pte in do_swap_page() do not work? Did Nick break it during lock removal for the lockless page cache?
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
>
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in

If an anonymous page is a swap page then it has a mapping. migrate_page_move_mapping() will lock the radix tree and ensure that no additional reference (like the one taken by do_swap_page) is established during migration.

> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still

So far I have seen wild ideas, not bugs.
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> So are you checking if there's an unresolved reference only in the
> very place I just quoted in the previous email? If answer is yes: what
> should prevent get_user_pages from running in parallel from another
> thread? get_user_pages will trigger a minor fault and get the elevated
> reference just after you read page_count. To you it looks like there
> is no o_direct in progress when you proceed to the core of migration
> code, but in effect o_direct just started a moment after you read the
> page count.

get_user_pages() cannot get to it since the pagetables have already been modified. If get_user_pages runs then the fault handling will occur, which will block the thread until migration is complete.
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> > O_DIRECT does not take a refcount on the page in order to prevent this?
>
> It definitely does, it's also the only thing it does.

Then page migration will not occur because there is an unresolved reference.

> The whole point is that O_DIRECT can start the instruction after
> page_count returns as far as I can tell.

But there must still be a reference for the bio and whatever may be going on at the time in order to perform the I/O operation.

> If you check the three emails I linked in answer to Andrew on the
> topic, we agree the o_direct can't start under PT lock (or under
> mmap_sem in write mode but migrate.c rightefully takes the read
> mode). So the fix used in ksm page_wrprotect and in fork() is to check
> page_count vs page_mapcount inside PT lock before doing anything on
> the pte. If you just mark the page wprotect while O_DIRECT is in
> flight, that's enough for fork() to generate data corruption in the
> parent (not the child where the result would be undefined). But in the
> parent the result of the o-direct is defined and it'd never corrupt if
> this was a cached-I/O. The moment the parent pte is marked readonly, a
> thread in the parent could write to the last 512 bytes of the page,
> leading to the first 512 bytes coming with O_DIRECT from disk being
> lost (as the write will trigger a cow before I/O is complete and the
> dma will complete on the oldpage).

Have you actually seen corruption, or is this conjecture? AFAICT the page count is elevated while I/O is in progress, and thus this is safe.
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> this page_count check done with only the tree_lock won't prevent a
> task to start O_DIRECT after page_count has been read in the above line.
>
> If a thread starts O_DIRECT on the page, and the o_direct is still in
> flight by the time you copy the page to the new page, the read will
> not be represented fully in the newpage leading to userland data
> corruption.

O_DIRECT does not take a refcount on the page in order to prevent this?

> > Define a regular VM page? A page on the LRU?
>
> Yes, pages owned, allocated and worked on by the VM. So they can be
> swapped, collected, migrated etc... You can't possibly migrate a
> device driver page for example and infact those device driver pages
> can't be migrated either.

Oh, they could be migrated if you had a callback to the device's method for giving up references. Same as slab defrag.

> The KSM page initially is a driver page, later we'd like to teach the
> VM how to swap it by introducing rmap methods and adding it to the
> LRU. As long as it's only anonymous memory that we're sharing/cloning,
> we won't have to patch pagecache radix tree and other stuff. BTW, if
> we ever decide to clone pagecache we could generate immense metadata
> ram overhead in the radix tree with just a single page of data. All
> issues that don't exist for anon ram.

Seems that we are tinkering around with the concept of what an anonymous page is? Doesn't shmem have some means of converting pages to file backed? Swizzling?
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Avi Kivity wrote:

> Christoph Lameter wrote:
> > page migration requires the page to be on the LRU. That could be changed
> > if you have a different means of isolating a page from its page tables.
>
> Isn't rmap the means of isolating a page from its page tables? I guess I'm
> misunderstanding something.

In order to migrate a page one first has to make sure that userspace and the kernel cannot access the page in any way. User space must be made to generate page faults, and all kernel references must be accounted for and not be in use.

The user space portion involves changing the page tables so that faults are generated. The kernel portion isolates the page from the LRU (to exempt it from kernel reclaim handling etc). Only then can the page and its metadata be copied to a new location.

Guess you already have the LRU portion done. So you just need the user space isolation portion?
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Izik Eidus wrote:

> > What do you mean by kernel page? The kernel can allocate a page and then
> > point a user space pte to it. That is how page migration works.
>
> i mean filebacked page (!AnonPage())

ok.

> ksm need the pte inside the vma to point from anonymous page into filebacked
> page
> can migrate.c do it without changes?

So change an anonymous page to a filebacked page? Currently page migration assumes that the page will continue to be part of the existing file or anon vma. What you want sounds like assigning a swap pte to an anonymous page? That way an anon page gains membership in a file backed mapping.
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> btw, page_migration likely is buggy w.r.t. o_direct too (and now
> unfixable with gup_fast until the 2.4 brlock is added around it or
> similar) if it does the same thing but without any page_mapcount vs
> page_count check.

Details please?

> page_migration does too much for us, so us calling into migrate.c may
> not be ideal. It has to convert a fresh page to a VM page. In KSM we
> don't convert the newpage to be a VM page, we just replace the anon
> page with another page. The new page in the KSM case is not a page
> known by the VM, not in the lru etc...

A VM page as opposed to pages not in the VM?

page migration requires the page to be on the LRU. That could be changed if you have a different means of isolating a page from its page tables.

> The way to go could be to change the page_migration to use
> replace_page (or __replace_page if called in some shared inner-lock
> context) after preparing the newpage to be a regular VM page. If we
> can do that, migrate.c will get the o_direct race fixed too for free.

Define a regular VM page? A page on the LRU?
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Izik Eidus wrote:

> yes but it replace it with kernel allocated page.
>
> > page migration already kinda does that. Is there common ground?
>
> page migration as far as i saw cant migrate anonymous page into kernel page.
> if you want we can change page_migration to do that, but i thought you will
> rather have ksm changes separate.

What do you mean by kernel page? The kernel can allocate a page and then point a user space pte to it. That is how page migration works.