PCI device assignment and mm, KSM
I'm sorry if my understanding is incorrect. Here are some topics on PCI passthrough to guests.

When PCI passthrough is used with KVM, all of the guest's memory is pinned by an extra reference count taken with get_page(). Those pinned pages can never be reclaimed or moved by migration, and cannot be merged by KSM. At the moment, the information that "this page is pinned by kvm" is represented only by page_count(). So there are the following problems:

a) The pages are on the anon LRU, so try_to_free_pages() and kswapd will scan XX GB of pages hopelessly.

b) KSM cannot recognize the pages in its early stage. So it breaks transparent huge pages mapped by kvm into small pages, but finally fails to merge them because of the raised page_count(). So all hugepages are split without any benefit.

Two ideas for fixing this:

For a), I guess the pages should go to the UNEVICTABLE list, but they are not mlocked. I think we could use PagePinned() instead and move the pages to the UNEVICTABLE list. Then kswapd etc. would ignore pinned pages.

For b), at first I thought qemu should call madvise(MADV_UNMERGEABLE). But the kernel may be able to handle the situation with an extra check, PagePinned() or a flag in mm_struct. Should we avoid this in userland or in the kernel?

BTW, I think pinned pages cannot be freed until the kvm process exits. Is that right?

Thanks,
-Kame
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
On Fri, 28 Jan 2011 09:20:02 -0600 (CST) Christoph Lameter c...@linux.com wrote:

> On Fri, 28 Jan 2011, KAMEZAWA Hiroyuki wrote:
>
> > > I see it as a tradeoff of when to check: add_to_page_cache, or when
> > > we want more free memory (due to allocation). It is OK to wake up
> > > kswapd while allocating memory; somehow, for this purpose (global
> > > page cache), add_to_page_cache or add_to_page_cache_locked does not
> > > seem the right place to hook into. I'd be open to
> > > comments/suggestions from others as well.
> >
> > I don't like adding a hook here. And I don't want to run kswapd,
> > because kswapd has been a sign that there is memory shortage.
> > (Reusing code is ok.) How about adding a new daemon? Recently
> > khugepaged and ksmd work for managing memory. Adding one more daemon
> > for a special purpose is not very bad, I think. Then you can
> >  - wake up without a hook
> >  - throttle its work
> >  - balance the whole system rather than a zone. I think per-node
> >    balance is enough...
>
> I think we already have enough kernel daemons floating around. They are
> multiplying in an amazing way. What would be useful is to map all the
> memory management background stuff into a process. Maybe call this memd
> instead? Perhaps we can fold khugepaged into kswapd as well, etc.

Is making kswapd slow with this additional work, requested by the user
rather than by the system, a good thing? I think a workqueue works well
enough; it scales based on workload, if using a thread is bad.

Thanks,
-Kame
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
On Fri, 28 Jan 2011 16:24:19 +0900 Minchan Kim minchan@gmail.com wrote:

> On Fri, Jan 28, 2011 at 3:48 PM, Balbir Singh bal...@linux.vnet.ibm.com wrote:
> > * MinChan Kim minchan@gmail.com [2011-01-28 14:44:50]:
> > > On Fri, Jan 28, 2011 at 11:56 AM, Balbir Singh bal...@linux.vnet.ibm.com wrote:
> > > > On Thu, Jan 27, 2011 at 4:42 AM, Minchan Kim minchan@gmail.com wrote:
> > > > [snip]
> > > >
> > > > index 7b56473..2ac8549 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1660,6 +1660,9 @@ zonelist_scan:
> > > >  		unsigned long mark;
> > > >  		int ret;
> > > >
> > > > +		if (should_reclaim_unmapped_pages(zone))
> > > > +			wakeup_kswapd(zone, order, classzone_idx);
> > > > +
> > >
> > > Do we really need the check in the fastpath? There are lots of
> > > callers of alloc_pages; many of them are not related to mapped
> > > pages. Could we move the check into add_to_page_cache_locked?
> >
> > The check is a simple check to see if the unmapped pages need
> > balancing; the reason I placed this check here is to allow other
> > allocations to benefit as well, if there are some unmapped pages to
> > be freed. add_to_page_cache_locked (a check under a critical section)
> > is even worse, IMHO.
>
> It just moves the overhead from the general case into a specific one
> (ie, allocating a page just for the page cache). Other cases (ie,
> allocating pages for purposes other than page cache, e.g. device
> drivers or fs allocations for internal use) aren't affected. So it
> would be better. The goal in this patch is to remove only page cache
> pages, isn't it? So I think we could do the balance check in
> add_to_page_cache and trigger reclaim. If we do so, what's the problem?

> I see it as a tradeoff of when to check: add_to_page_cache, or when we
> want more free memory (due to allocation). It is OK to wake up kswapd
> while allocating memory; somehow, for this purpose (global page
> cache), add_to_page_cache or add_to_page_cache_locked does not seem
> the right place to hook into. I'd be open to comments/suggestions from
> others as well.

I don't like adding a hook here. And I don't want to run kswapd, because
kswapd has been a sign that there is memory shortage.
(Reusing code is ok.) How about adding a new daemon? Recently khugepaged
and ksmd work for managing memory. Adding one more daemon for a special
purpose is not very bad, I think. Then you can
 - wake up without a hook
 - throttle its work
 - balance the whole system rather than a zone. I think per-node balance
   is enough...

>  		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
>  		if (zone_watermark_ok(zone, order, mark,
>  				    classzone_idx, alloc_flags))
>
> @@ -4167,8 +4170,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  	zone->spanned_pages = size;
>  	zone->present_pages = realsize;
> +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
>  	zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) / 100;
> +	zone->max_unmapped_pages = (realsize*sysctl_max_unmapped_ratio)
> +						/ 100;
> +#endif
>  #ifdef CONFIG_NUMA
>  	zone->node = nid;
>  	zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
>
> @@ -5084,6 +5091,7 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
>  	return 0;
>  }
>
> +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
>  int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
>  	void __user *buffer, size_t *length, loff_t *ppos)
>  {
>
> @@ -5100,6 +5108,23 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
>  	return 0;
>  }
>
> +int sysctl_max_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
> +	void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +	struct zone *zone;
> +	int rc;
> +
> +	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
> +	if (rc)
> +		return rc;
> +
> +	for_each_zone(zone)
> +		zone->max_unmapped_pages = (zone->present_pages *
> +				sysctl_max_unmapped_ratio) / 100;
> +	return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_NUMA
>  int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
>  	void __user *buffer, size_t *length, loff_t *ppos)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 02cc82e..6377411 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -159,6 +159,29 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  #define scanning_global_lru(sc) (1)
>  #endif
>
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +static
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
On Fri, 28 Jan 2011 13:49:28 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-28 16:56:05]:
>
> > BTW, it seems this doesn't work when some apps use huge shmem.
> > How to handle the issue?
>
> Could you elaborate further?

==
static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
{
	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
				 zone_page_state(zone, NR_ACTIVE_FILE);

	/*
	 * It's possible for there to be more file mapped pages than
	 * accounted for by the pages on the file LRU lists because
	 * tmpfs pages accounted for as ANON can also be FILE_MAPPED
	 */
	return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
}
==

Did you read this?

Thanks,
-Kame
Re: cgroup limits only affect kvm guest under certain conditions
On Thu, 06 Jan 2011 14:15:37 +0100 Dominik Klein d...@in-telegence.net wrote:

> Hi
>
> I am playing with cgroups and try to limit block io for guests. The
> proof of concept is:
>
> # mkdir /dev/cgroup/blkio
> # mount -t cgroup -o blkio blkio /dev/cgroup/blkio/
> # cd blkio/
> # mkdir test
> # cd test/
> # ls -l /dev/vdisks/kirk
> lrwxrwxrwx 1 root root 7 2011-01-06 13:46 /dev/vdisks/kirk -> ../dm-5
> # ls -l /dev/dm-5
> brw-rw---- 1 root disk 253, 5 2011-01-06 13:36 /dev/dm-5
> # echo "253:5 1048576" > blkio.throttle.write_bps_device
> # echo $$ > tasks
> # dd if=/dev/zero of=/dev/dm-5 bs=1M count=20
> 20+0 records in
> 20+0 records out
> 20971520 bytes (21 MB) copied, 20.0223 s, 1.0 MB/s
>
> So the limit applies to the dd child of my shell. Now I assign
> /dev/dm-5 (/dev/vdisks/kirk) to a vm and echo the qemu-kvm pid into
> tasks. Limits are not applied; the guest can happily use max io
> bandwidth.

qemu consists of several threads, and cgroups work per thread now.
Could you double-check that all of qemu's threads are in the cgroup?
I think you have to write all thread IDs to the tasks file when you
move qemu after starting it.

Thanks,
-Kame
Re: [PATCH 3/3] Provide control over unmapped pages
On Thu, 2 Dec 2010 10:22:16 +0900 (JST) KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com wrote:

> > On Tue, 30 Nov 2010, Andrew Morton wrote:
> >
> > > > +#define UNMAPPED_PAGE_RATIO 16
> > >
> > > Well. Giving 16 a name didn't really clarify anything. Attentive
> > > readers will want to know what this does, why 16 was chosen and
> > > what the effects of changing it will be.
> >
> > The meaning is analogous to the other zone reclaim ratio. But yes, it
> > should be justified and defined.
> >
> > > > Reviewed-by: Christoph Lameter c...@linux.com
> > >
> > > So you're OK with shoving all this flotsam into 100,000,000
> > > cellphones? This was a pretty outrageous patchset!
> >
> > This is a feature that has been requested over and over for years.
> > Using /proc/vm/drop_caches for fixing situations where one simply has
> > too many page cache pages is not so much fun in the long run.
>
> I'm not against a page cache limitation feature at all. But this is too
> ugly and too destructive to the fast path. I hope this patch reduces
> the negative impact more.

And I think min_mapped_unmapped_pages is ugly. It should be
unmapped_pagecache_limit or something, because it's for a limitation
feature.

Thanks,
-Kame
Re: [PATCH 2/3] Refactor zone_reclaim
On Tue, 30 Nov 2010 15:45:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> Refactor zone_reclaim(), move reusable functionality outside of
> zone_reclaim(). Make zone_reclaim_unmapped_pages() modular.
>
> Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com

Why is min_unmapped_pages based on zone (IOW, per-zone)?

Thanks,
-Kame
Re: [PATCH 3/3] Provide control over unmapped pages
On Tue, 30 Nov 2010 15:46:31 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> Provide control using zone_reclaim() and a boot parameter. The code
> reuses functionality from zone_reclaim() to isolate unmapped pages
> and reclaim them as a priority, ahead of other mapped pages.
>
> Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
> ---
>  include/linux/swap.h |    5 ++-
>  mm/page_alloc.c      |    7 +++--
>  mm/vmscan.c          |   72 +-
>  3 files changed, 79 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index eba53e7..78b0830 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -252,11 +252,12 @@ extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
>
> -#ifdef CONFIG_NUMA
> -extern int zone_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
>  extern int sysctl_min_slab_ratio;
>  extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> +extern bool should_balance_unmapped_pages(struct zone *zone);
> +#ifdef CONFIG_NUMA
> +extern int zone_reclaim_mode;
>  #else
>  #define zone_reclaim_mode 0
>  static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 62b7280..4228da3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1662,6 +1662,9 @@ zonelist_scan:
>  		unsigned long mark;
>  		int ret;
>
> +		if (should_balance_unmapped_pages(zone))
> +			wakeup_kswapd(zone, order);
> +

Hm, I'm not sure about the final vision of this feature. Can't this
reclaiming feature be called directly via the balloon driver just before
alloc_page()? Do you need to keep page caches small even when there is
free memory on the host?

Thanks,
-Kame
Re: [PATCH 3/3] Provide control over unmapped pages
On Wed, 1 Dec 2010 10:52:59 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> * Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:48:16]:
>
> > * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-12-01 10:32:54]:
> >
> > > On Tue, 30 Nov 2010 15:46:31 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:
> > >
> > > > Provide control using zone_reclaim() and a boot parameter. The
> > > > code reuses functionality from zone_reclaim() to isolate unmapped
> > > > pages and reclaim them as a priority, ahead of other mapped pages.
> > > >
> > > > Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
> > > >
> > > > @@ -1662,6 +1662,9 @@ zonelist_scan:
> > > >  		unsigned long mark;
> > > >  		int ret;
> > > >
> > > > +		if (should_balance_unmapped_pages(zone))
> > > > +			wakeup_kswapd(zone, order);
> > > > +
> > >
> > > Hm, I'm not sure about the final vision of this feature. Can't this
> > > reclaiming feature be called directly via the balloon driver just
> > > before alloc_page()?
> >
> > That is a separate patch; this is a boot-parameter-based control
> > approach.
>
> > > Do you need to keep page caches small even when there is free
> > > memory on the host?
> The goal is to avoid duplication; as you know, the page cache fills
> itself to consume as much memory as possible. The host generally does
> not have a lot of free memory in a consolidated environment.

That's a point. Then why does the guest have to do _extra_ work for the
host even when the host says nothing? I think triggering this by the
guests themselves is not very good.

Thanks,
-Kame
Re: [PATCH 3/3] Provide control over unmapped pages
On Wed, 1 Dec 2010 12:10:43 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> > That's a point. Then why does the guest have to do _extra_ work for
> > the host even when the host says nothing? I think triggering this by
> > the guests themselves is not very good.
>
> I've mentioned it before: the guest keeping free memory without a
> large performance hit helps; the balloon driver is able to quickly
> retrieve this memory if required, or the guest can use this memory for
> some other application/task. The cached data is mostly already present
> in the host page cache.

Why? Are there parameters/stats which show this is _true_? How can we
guarantee/show it to users? Please add an interface to show the share
rate between guest and host. If not, no admin will turn this on,
because the file cache status on the host is a black box for guest
admins. I think this patch skips some important steps.

The 2nd point is maybe about reducing total host memory usage and
increasing the number of guests on a host. For that, this feature is
useful only when all guests on a host are friendly and devoted to the
health of host memory management, because all settings must be done in
the guest. (This can even be passed by a qemu command line argument.)
And there is _no_ benefit for a guest which reduces its resources to
help host management, because there is no guarantee that the dropped
caches are in host memory.

So, for both claims, I want to see an interface that shows the number
of shared pages between hosts and guests rather than imagining it.

BTW, I don't like this kind of "please give us your victim, please
please please" logic. The host should be able to steal what it wants by
force. Then, I think there should be no visible on/off interfaces; the
VM firmware should tell the guest to turn this on if the administrator
of the host wants it.

BTW2, please test with some other benchmarks (which read file caches).
I don't think a kernel make is a good test for this.
Thanks,
-Kame
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
On Mon, 14 Jun 2010 12:19:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> > - Why don't you believe the LRU? And if the LRU doesn't work well,
> >   should it be fixed by a knob rather than a generic approach?
> > - No side effects?
>
> I believe in the LRU; it's just that the problem I am trying to solve
> is using double the memory for caching the same data (consider kvm
> running in cache=writethrough or writeback mode: both the hypervisor
> and the guest OS maintain a page cache of the same data). As the VMs
> grow, the overhead is substantial. In my runs I found up to 60%
> duplication in some cases.
>
> > - Linux vm guys tend to say free memory is bad memory. OK, what is
> >   the free memory created by your patch used for? IOW, I can't see
> >   the benefit. If the free memory your patch created will be used for
> >   another page cache, it will be dropped soon by your patch itself.
>
> Free memory is good for cases when you want to do more on the same
> system. I agree that in a bare metal environment that might be
> partially true. I don't have a problem with frequently used data being
> cached, but I am targeting a consolidated environment at the moment.
> Moreover, the administrator has control via a boot option, so it is
> non-intrusive in many ways.

It sounds that what you want is to improve performance etc. but to make
it easy to size the system and to help admins. Right? From a
performance perspective, I don't see any advantage in dropping caches
which can be dropped easily; it just uses cpus for a purpose that may
not be necessary.

Thanks,
-Kame
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
On Mon, 14 Jun 2010 13:06:46 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> > It sounds that what you want is to improve performance etc. but to
> > make it easy to size the system and to help admins. Right?
>
> Right, to allow freeing up of using double the memory to cache data.

Oh, sorry, let me ask again: it sounds that what you want is _not_ to
improve performance etc. but to make it ... ?

-Kame
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
On Mon, 14 Jun 2010 00:01:45 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> * Balbir Singh bal...@linux.vnet.ibm.com [2010-06-08 21:21:46]:
>
> > Selectively control Unmapped Page Cache (nospam version)
> >
> > From: Balbir Singh bal...@linux.vnet.ibm.com
> >
> > This patch implements unmapped page cache control via preferred page
> > cache reclaim. The current patch hooks into kswapd and reclaims page
> > cache if the user has requested unmapped page control. This is useful
> > in the following scenario:
> >
> > - In a virtualized environment with cache=writethrough, we see double
> >   caching - (one in the host and one in the guest). As we try to
> >   scale guests, cache usage across the system grows. The goal of this
> >   patch is to reclaim page cache when Linux is running as a guest and
> >   get the host to hold the page cache and manage it. There might be
> >   temporary duplication, but in the long run, memory in the guests
> >   would be used for mapped pages.
> > - The option is controlled via a boot option and the administrator
> >   can selectively turn it on, on a need-to-use basis.
> >
> > A lot of the code is borrowed from the zone_reclaim_mode logic for
> > __zone_reclaim(). One might argue that with ballooning and KSM this
> > feature is not very useful, but even with ballooning, we need extra
> > logic to balloon multiple VM machines, and it is hard to figure out
> > the correct amount of memory to balloon. With these patches applied,
> > each guest has a sufficient amount of free memory available, that can
> > be easily seen and reclaimed by the balloon driver. The additional
> > memory in the guest can be reused for additional applications or used
> > to start additional guests/balance memory in the host.
> >
> > KSM currently does not de-duplicate host and guest page cache. The
> > goal of this patch is to help automatically balance unmapped page
> > cache when instructed to do so.
> >
> > There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
> > and the number of pages to reclaim when the unmapped_page_control
> > argument is supplied.
> > These numbers were chosen to avoid aggressiveness in reaping page
> > cache ever so frequently, at the same time providing control. The
> > sysctl for min_unmapped_ratio provides further control from within
> > the guest on the amount of unmapped pages to reclaim.
>
> Are there any major objections to this patch?

This kind of patch needs a "how well it works" measurement.

- How did you measure the effect of the patch? kernbench is not enough,
  of course.
- Why don't you believe the LRU? And if the LRU doesn't work well,
  should it be fixed by a knob rather than a generic approach?
- No side effects?
- Linux vm guys tend to say free memory is bad memory. OK, what is the
  free memory created by your patch used for? IOW, I can't see the
  benefit. If the free memory your patch created will be used for
  another page cache, it will be dropped soon by your patch itself.

If your patch only dropped pages that are duplicated and no longer
necessary for other kvm guests, I would agree it may increase the
available size of page caches. But you just drop unmapped pages. Hmm.

Thanks,
-Kame
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Thu, 10 Jun 2010 17:07:32 -0700 Dave Hansen d...@linux.vnet.ibm.com wrote:

> On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:
> > > I'm not sure victimizing unmapped cache pages is a good idea.
> > > Shouldn't page selection use the LRU for recency information
> > > instead of the cost of guest reclaim? Dropping a frequently used
> > > unmapped cache page can be more expensive than dropping an unused
> > > text page that was loaded as part of some executable's
> > > initialization and forgotten.
> >
> > We victimize the unmapped cache only if it is unused (in LRU order).
> > We don't force the issue too much. We also have free slab cache to go
> > after.
>
> Just to be clear, let's say we have a mapped page (say of /sbin/init)
> that's been unreferenced since _just_ after the system booted. We also
> have an unmapped page cache page of a file often used at runtime, say
> one from /etc/resolv.conf or /etc/passwd.

Hmm. I'm not a fan of estimating working set size by calculation based
on some numbers without considering history or feedback. Can't we use
some kind of feedback algorithm, such as hi-low watermarks, random walk
or GA (or something more smart), to detect the size?

Thanks,
-Kame
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Fri, 11 Jun 2010 10:16:32 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:

> * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 10:54:41]:
>
> > Hmm. I'm not a fan of estimating working set size by calculation
> > based on some numbers without considering history or feedback. Can't
> > we use some kind of feedback algorithm, such as hi-low watermarks,
> > random walk or GA (or something more smart), to detect the size?
>
> Could you please clarify at what level you are suggesting size
> detection? I assume it is outside the OS, right?

The OS includes the kernel and system programs ;) I can think of both an
in-kernel approach and a userspace approach, and they should complement
each other.

An example of a kernel-based approach:

1. add a shrinker callback (A) for the balloon-driver-for-guest, as
   guest kswapd.
2. add a shrinker callback (B) for the balloon-driver-for-host, as host
   kswapd. (I guess the current balloon driver is only for the host.
   Please imagine.)

(A) increases free memory in the guest.
(B) increases free memory in the host.

This is an example of feedback-based memory resizing between host and
guest.
I think (B) is necessary at least before considering complicated things.
To implement something clever, (A) and (B) should take into account how
frequently memory reclaim happens in the guest (which requires some
I/O).

If doing this outside the kernel, I think using memcg is better than
depending on the balloon driver. But a cooperative balloon and memcg may
show us something good.

Thanks,
-Kame
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Fri, 11 Jun 2010 14:05:53 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

> I can think of both an in-kernel approach and a userspace approach, and
> they should complement each other.
>
> An example of a kernel-based approach:
>
> 1. add a shrinker callback (A) for the balloon-driver-for-guest, as
>    guest kswapd.
> 2. add a shrinker callback (B) for the balloon-driver-for-host, as
>    host kswapd. (I guess the current balloon driver is only for the
>    host. Please imagine.)

I meant guest. Sorry.

-Kame
Re: virsh dump blocking problem
On Tue, 06 Apr 2010 09:35:09 +0800 Gui Jianfeng guijianf...@cn.fujitsu.com wrote:

> Hi all,
>
> I'm not sure whether it's appropriate to post the problem here. I
> played with virsh under Fedora 12, and started a KVM fedora12 guest by
> the virsh start command. The fedora12 guest started successfully. Then
> I ran the following command to dump the guest core:
>
> #virsh dump 1 mycoredump (domain id is 1)
>
> This command seems to block and never return. According to the strace
> output, virsh dump seems to be blocking at the poll() call. I think the
> following should be the call trace of virsh:
>
> cmdDump() -> virDomainCoreDump() -> remoteDomainCoreDump() -> call()
> -> remoteIO() -> remoteIOEventLoop() -> poll(fds, ARRAY_CARDINALITY(fds), -1)
>
> Has anyone else encountered this problem? Any thoughts?

I met it, and it seems qemu-kvm continues counting the number of dirty
pages and gives no answer to libvirt. The guest never works and I have
to kill it.

I met this with 2.6.32 + qemu-0.12.3 + libvirt 0.7.7.1. When I updated
the host kernel to 2.6.33, qemu-kvm didn't work at all, so I moved back
to fedora12's latest qemu-kvm. Now, with 2.6.34-rc3 +
qemu-0.11.0-13.fc12.x86_64 + libvirt 0.7.7.1, "virsh dump" hangs.
In most cases, I see the following 2 backtraces (with gdb):

(gdb) bt
#0  ram_save_remaining () at /usr/src/debug/qemu-kvm-0.11.0/vl.c:3104
#1  ram_bytes_remaining () at /usr/src/debug/qemu-kvm-0.11.0/vl.c:3112
#2  0x004ab2cf in do_info_migrate (mon=0x16b7970) at migration.c:150
#3  0x00414b1a in monitor_handle_command (mon=<value optimized out>,
    cmdline=<value optimized out>) at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:2870
#4  0x00414c6a in monitor_command_cb (mon=0x16b7970, cmdline=<value optimized out>,
    opaque=<value optimized out>) at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3160
#5  0x0048b71b in readline_handle_byte (rs=0x208d6a0, ch=<value optimized out>)
    at readline.c:369
#6  0x00414cdc in monitor_read (opaque=<value optimized out>,
    buf=0x7fff1b1104b0 "info migrate\r", size=13)
    at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3146
#7  0x004b2a53 in tcp_chr_read (opaque=0x1614c30) at qemu-char.c:2006
#8  0x0040a6c7 in main_loop_wait (timeout=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4188
#9  0x0040eed5 in main_loop (argc=<value optimized out>,
    argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4414
#10 main (argc=<value optimized out>, argv=<value optimized out>,
    envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:6263

(gdb) bt
#0  0x003c2680e0bd in write () at ../sysdeps/unix/syscall-template.S:82
#1  0x004b304a in unix_write (fd=11, buf=<value optimized out>, len1=40)
    at qemu-char.c:512
#2  send_all (fd=11, buf=<value optimized out>, len1=40) at qemu-char.c:528
#3  0x00411201 in monitor_flush (mon=0x16b7970)
    at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:131
#4  0x00414cdc in monitor_read (opaque=<value optimized out>,
    buf=0x7fff1b1104b0 "info migrate\r", size=13)
    at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3146
#5  0x004b2a53 in tcp_chr_read (opaque=0x1614c30) at qemu-char.c:2006
#6  0x0040a6c7 in main_loop_wait (timeout=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4188
#7  0x0040eed5 in main_loop (argc=<value optimized out>,
    argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4414
#8  main (argc=<value optimized out>, argv=<value optimized out>,
    envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:6263

And I see no dump progress. I'm sorry if this is not a hang but just
very slow; I don't see any progress for at least 15 minutes, and
qemu-kvm continues to use 75% of the cpus. I'm not sure why the dump
command triggers migration code... How long does it take to do "virsh
dump xxx" on an idle VM with 2G of memory?

I'm sorry if I'm asking on the wrong mailing list.

Thanks,
-Kame
Re: [PATCH 4/4] add ksm kernel shared memory driver.
On Tue, 31 Mar 2009 15:21:53 +0300 Izik Eidus iei...@redhat.com wrote:
> > > kpage is actually what is going to be KsmPage, the shared page...
> > > Right now these pages are not swappable...; after ksm is merged we
> > > will make these pages swappable as well...
> >
> > Sure. If so, please
> >  - show the amount of kpages
> >  - allow users to set a limit on the usage of kpages, or preserve
> >    kpages at boot or by a user's command.
>
> kpages actually save memory..., and limiting their number would make
> you limit the number of shared pages...

Ah, I'm working on the memory control cgroup, and *KSM* would be out of
its control. It's OK to make the default limit INFINITY, but please add
knobs.

Thanks,
-Kame
Re: [PATCH 4/4] add ksm kernel shared memory driver.
On Tue, 31 Mar 2009 02:59:20 +0300 Izik Eidus iei...@redhat.com wrote:

Ksm is a driver that allows merging identical pages between one or more
applications, in a way invisible to the applications that use it. Pages
that are merged are marked read-only and are COWed when any application
tries to change them.

Ksm is used for cases where using fork() is not suitable; one such case
is where the pages of the application keep changing dynamically and the
application cannot know in advance which pages are going to be identical.

Ksm works by walking over the memory pages of the applications it scans
in order to find identical pages. It uses two sorted data structures,
called the stable and unstable trees, to find identical pages in an
effective way.

When ksm finds two identical pages, it marks them read-only and merges
them into a single page. After the pages are marked read-only and merged
into one page, Linux treats them as normal copy-on-write pages and will
break them apart when a write access happens.

Ksm scans just the memory areas that were registered to be scanned by it.

Ksm api:

KSM_GET_API_VERSION:
 Give userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
 Create a shared memory region fd that later allows the user to register
 the memory region to scan, using KSM_REGISTER_MEMORY_REGION and
 KSM_REMOVE_MEMORY_REGION.

KSM_START_STOP_KTHREAD:
 Return information about the kernel thread; the information is returned
 using the ksm_kthread_info structure:
 ksm_kthread_info:
  __u32 sleep: number of microseconds to sleep between each iteration of
   scanning.
  __u32 pages_to_scan: number of pages to scan in each iteration of
   scanning.
  __u32 max_pages_to_merge: maximum number of pages to merge in each
   iteration of scanning (so even if there are still more pages to scan,
   we stop this iteration).
  __u32 flags: flags to control ksmd (right now just
   ksm_control_flags_run is available).

KSM_REGISTER_MEMORY_REGION:
 Register a userspace virtual address range to be scanned by ksm.
 This ioctl uses the ksm_memory_region structure:
 ksm_memory_region:
  __u32 npages: number of pages to share inside this memory region.
  __u32 pad;
  __u64 addr: the beginning of the virtual address of this region.

KSM_REMOVE_MEMORY_REGION:
 Remove a memory region from ksm.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/ksm.h        |   69 +++
 include/linux/miscdevice.h |    1 +
 mm/Kconfig                 |    6 +
 mm/Makefile                |    1 +
 mm/ksm.c                   | 1431
 5 files changed, 1508 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..5776dce
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,69 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+        __u32 npages; /* number of pages to share */
+        __u32 pad;
+        __u64 addr; /* the beginning of the virtual address */
+        __u64 reserved_bits;
+};
+
+struct ksm_kthread_info {
+        __u32 sleep; /* number of microseconds to sleep */
+        __u32 pages_to_scan; /* number of pages to scan */
+        __u32 flags; /* control flags */
+        __u32 pad;
+        __u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION _IO(KSMIO, 0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */
+/*
+ * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed
+ * (can stop the kernel thread from working by setting running = 0)
+ */
+#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x02,\
+                                    struct ksm_kthread_info)
+/*
+ * KSM_GET_INFO_KTHREAD - return information about the kernel thread
+ * scanning speed.
+ */
+#define KSM_GET_INFO_KTHREAD _IOW(KSMIO, 0x03,\
+                                  struct ksm_kthread_info)
+
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by ksm.
+ */
+#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\
+                                        struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Thu, 13 Nov 2008 12:38:07 +0200 Izik Eidus [EMAIL PROTECTED] wrote:
> > If KSM pages are on a radix-tree, they will be accounted
> > automatically. Now, we have the Unevictable LRU, and mlock()ed pages
> > are smartly isolated into their own LRU. So, just doing
> >  - inode's radix-tree
> >  - make all pages mlocked.
> >  - provide a special page fault handler for your purpose
> > is a simple way.
>
> Well, in this version that I am going to merge, the pages aren't going
> to be swappable. Later, after Ksm gets merged, we will make the
> KsmPages swappable...

good to hear.

> so I think working with cgroups would be effective / useful only when
> KsmPages start to be swappable... Do you agree? (What I am saying is
> that right now let's not count the KsmPages inside the cgroup; let's
> do it when KsmPages become swappable.)

ok.

> If you feel these pages should be counted in the cgroup, I have no
> problem doing it via hooks, like page migration does.

thanks.

> > But ok, whatever implementation you'll do, I have to check it and
> > consider whether it should be tracked or not. Then, I'll add code to
> > memcg to track it, or ignore it, or comment on your patches ;)
> > It's helpful to add me to CC: when you post this set again.
>
> Sure will.

If necessary, I'll have to add an "ignore in this case" hook in memcg
(e.g. checking a PageKSM flag in memcg). If you suffer from memcg in
your tests, the cgroup_disable=memory boot option will allow you to
disable memcg.

Thanks,
-Kame
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Thank you for the answers.

On Wed, 12 Nov 2008 13:11:12 +0200 Izik Eidus [EMAIL PROTECTED] wrote:
> Avi Kivity wrote:
> > KAMEZAWA Hiroyuki wrote:
> > > Can I ask a question? (I'm working on the memory cgroup.)
> > > Now, we charge an anonymous page when
> > >  - charge(+1) when it is mapped for the first time (mapcount 0 -> 1)
> > >  - uncharge(-1) when it is fully unmapped (mapcount 1 -> 0)
> > > via page_remove_rmap().
> > > My question is:
> > > - Is it unnecessary for PageKSM pages to be tracked by the memory
> > >   cgroup?
>
> When we are replacing a page using page_replace() we have:
>  oldpage - the anonymous page that is going to be replaced by newpage
>  newpage - a kernel-allocated page (KsmPage)
> so for oldpage we are calling page_remove_rmap(), which will notify the
> cgroup, and newpage won't be counted inside the cgroup because it is a
> file-rmap page (we are calling page_add_file_rmap()), so right now
> PageKSM won't ever be tracked by the cgroup.

If it's not in a radix-tree, it's not tracked. (But we don't want to
track non-LRU pages which are not freeable.)

> > > - Can we know that the page was just replaced, so that we don't
> > >   need to charge/uncharge?
>
> The caller of page_replace() does know it; the only problem is that
> page_remove_rmap() automatically changes the cgroup for anonymous
> pages. If we want it not to change the cgroup, we can:
>  - increase the cgroup count before page_remove (but in that case,
>    what happens if we reach the limit???)
>  - give a parameter to page_remove_rmap() saying that we don't want
>    the cgroup to be changed.

Hmm, the current mem cgroup works via the page_cgroup struct to track
pages; page <-> page_cgroup is a one-to-one relationship. So, exchanging
the page itself causes trouble. But I may be able to provide the
necessary hooks to you, as I did for page migration.

> > > - Is an anonymous page from KSM worth tracking by the memory
> > >   cgroup? (IOW, is it on the LRU and can it be swapped out?)
>
> KSM has no anonymous pages (it shares anonymous pages into a KsmPAGE,
> a kernel-allocated page without a mapping), so it isn't on the LRU and
> cannot be swapped; only when KsmPAGEs are broken by do_wp_page() will
> the duplicates be able to swap.

Ok, thank you for the confirmation.

> > My feeling is that shared pages should be accounted as if they were
> > not shared; that is, a shared page should be accounted for each
> > process that shares it. Perhaps sharing within a cgroup should be
> > counted as 1 page for all the ptes pointing to it.

If KSM pages are on a radix-tree, they will be accounted automatically.
Now, we have the Unevictable LRU, and mlock()ed pages are smartly
isolated into their own LRU. So, just doing
 - inode's radix-tree
 - make all pages mlocked.
 - provide a special page fault handler for your purpose
is a simple way. But ok, whatever implementation you'll do, I have to
check it and consider whether it should be tracked or not. Then, I'll
add code to memcg to track it, or ignore it, or comment on your
patches ;) It's helpful to add me to CC: when you post this set again.

Thanks,
-Kame
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008 23:24:21 +0100 Andrea Arcangeli [EMAIL PROTECTED] wrote:
> On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote:
> > > ksm need the pte inside the vma to point from anonymous page into
> > > filebacked page can migrate.c do it without changes?
> >
> > So change anonymous to filebacked page? Currently page migration
> > assumes that the page will continue to be part of the existing file
> > or anon vma. What you want sounds like assigning a swap pte to an
> > anonymous page? That way an anon page gains membership in a file
> > backed mapping.
>
> KSM needs to convert anonymous pages to PageKSM, which means a page
> owned by ksm.c and only known by ksm.c. The Linux VM will free this
> page in munmap, but that's about it; all we do is match the number of
> anon ptes pointing to the page with the page_count. So besides freeing
> the page when the last user exit()s or COWs it, the VM will do nothing
> about it. Initially. Later it can swap it in a nonlinear way.

Can I ask a question? (I'm working on the memory cgroup.)
Now, we charge an anonymous page when
 - charge(+1) when it is mapped for the first time (mapcount 0 -> 1)
 - uncharge(-1) when it is fully unmapped (mapcount 1 -> 0)
via page_remove_rmap().

My questions are:
 - Is it unnecessary for PageKSM pages to be tracked by the memory cgroup?
 - Can we know that a page was just replaced, so that we don't need to
   charge/uncharge?
 - Is an anonymous page from KSM worth tracking by the memory cgroup?
   (IOW, is it on the LRU and can it be swapped out?)

Thanks,
-Kame