Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
On a two-CPU architecture, the page distribution is, theoretically, just like ABABAB, so every readahead by process A will create 4 unused readahead pages unless you are sure B will resume soon. Have you ever compared the results among UP, 2-CPU and 4-CPU?

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
The performance improvement should occur when the private pages of multiple processes are messed up with each other, such as on SMP. As to UP, my previous mail is driven by a timer, which only shows one fact: if pages are fully messed up, current readahead will degrade remarkably, and the unused readahead pages become a burden on the memory subsystem. You should re-test your testcases, following this advice, on Linux without my patch: do the normal testcases, then select one testcase randomly, record 'pswpin' from /proc/vmstat, and redo that testcase alone. If the results are close, your testcases didn't mess up private pages at all, as you expected, due to the Linux scheduler. Thank you!

2007/2/22, Rik van Riel <[EMAIL PROTECTED]>:
> yunfeng zhang wrote:
> > Any comments or suggestions are always welcomed.
>
> Same question as always: what problem are you trying to solve?
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
Any comments or suggestions are always welcomed.
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
The following arithmetic is based on the SwapSpace bitmap management discussed in the postscript section of my patch. It implements two purposes: one is allocating a group of fake continual swap entries, the other is re-allocating swap entries in stage 3 when, for example, a series length is too short.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

// 2 hardware cache lines. You can also concentrate it into one hardware cache line.
char bits_per_short[256] = {
	8, 7, 7, 6, 7, 6, 6, 5, 7, 6, 6, 5, 6, 5, 5, 4,
	7, 6, 6, 5, 6, 5, 5, 4, 6, 5, 5, 4, 5, 4, 4, 3,
	7, 6, 6, 5, 6, 5, 5, 4, 6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	7, 6, 6, 5, 6, 5, 5, 4, 6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2, 4, 3, 3, 2, 3, 2, 2, 1,
	7, 6, 6, 5, 6, 5, 5, 4, 6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2, 4, 3, 3, 2, 3, 2, 2, 1,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2, 4, 3, 3, 2, 3, 2, 2, 1,
	5, 4, 4, 3, 4, 3, 3, 2, 4, 3, 3, 2, 3, 2, 2, 1,
	4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0
};

unsigned char swap_bitmap[32];

// Allocate a group of fake continual swap entries.
int alloc(int size)
{
	int i, found = 0, result_offset;
	unsigned char a = 0, b = 0;
	for (i = 0; i < 32; i++) {
		b = bits_per_short[swap_bitmap[i]];
		if (a + b >= size) {
			found = 1;
			break;
		}
		a = b;
	}
	result_offset = i == 0 ? 0 : i - 1;
	result_offset = found ? result_offset : -1;
	return result_offset;
}

// Re-allocate in stage 3 if necessary.
int re_alloc(int position)
{
	int offset = position / 8;
	int a = offset == 0 ? 0 : offset - 1;
	int b = offset == 31 ? 31 : offset + 1;
	int i, empty_bits = 0;
	for (i = a; i <= b; i++) {
		empty_bits += bits_per_short[swap_bitmap[i]];
	}
	return empty_bits;
}

int main(int argc, char **argv)
{
	int i;
	for (i = 0; i < 32; i++) {
		swap_bitmap[i] = (unsigned char) (rand() % 0xff);
	}
	i = 9;
	int temp = alloc(i);
	temp = re_alloc(i);
	return 0;
}
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
You can apply my previous patch on 2.6.20 by changing

-#define VM_PURE_PRIVATE	0x0400	/* Is the vma is only belonging to a mm,

to

+#define VM_PURE_PRIVATE	0x0800	/* Is the vma is only belonging to a mm,

The new revision is based on 2.6.20 with my previous patch; the major changelogs are:
1) pte_unmap pairs on shrink_pvma_scan_ptes and pps_swapoff_scan_ptes.
2) Now, kppsd can be woken up by kswapd.
3) A new global variable accelerate_kppsd is appended to accelerate the reclamation process when a memory inode is low.

Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- linux-2.6.19.orig/Documentation/vm_pps.txt	2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-02-12 15:30:16.490797672 +0800
@@ -143,23 +143,32 @@
 2) mm/memory.c do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
    do_swap_page (page-fault)
 3) mm/memory.c get_user_pages (sometimes core need share PrivatePage with us)
+4) mm/vmscan.c balance_pgdat (kswapd/x can do stage 5 of its node pages,
+   while kppsd can do stage 1-4)
+5) mm/vmscan.c kppsd (new core daemon -- kppsd, see below)

 There isn't new lock order defined in pps, that is, it's compliable to Linux
-lock order.
+lock order. Locks in shrink_private_vma copied from shrink_list of 2.6.16.29
+(my initial version).
 // }])>

 // Others about pps <([{
 A new kernel thread -- kppsd is introduced in mm/vmscan.c, its task is to
-execute the stages of pps periodically, note an appropriate timeout ticks is
-necessary so we can give application a chance to re-map back its PrivatePage
-from UnmappedPTE to PTE, that is, show their conglomeration affinity.
-
-kppsd can be controlled by new fields -- scan_control::may_reclaim/reclaim_node
-may_reclaim = 1 means starting reclamation (stage 5). reclaim_node = (node
-number) is used when a memory node is low. Caller should set them to wakeup_sc,
-then wake up kppsd (vmscan.c:balance_pgdat). Note, if kppsd is started due to
-timeout, it doesn't do stage 5 at all (vmscan.c:kppsd). Other alive legacy
-fields are gfp_mask, may_writepage and may_swap.
+execute the stage 1 - 4 of pps periodically, note an appropriate timeout ticks
+(current 2 seconds) is necessary so we can give application a chance to re-map
+back its PrivatePage from UnmappedPTE to PTE, that is, show their
+conglomeration affinity.
+
+shrink_private_vma can be controlled by new fields -- may_reclaim, reclaim_node
+and is_kppsd of scan_control. may_reclaim = 1 means starting reclamation
+(stage 5). reclaim_node = (node number, -1 means all memory inode) is used when
+a memory node is low. Caller (kswapd/x), typically, set reclaim_node to start
+shrink_private_vma (vmscan.c:balance_pgdat). Note, only to kppsd is_kppsd = 1.
+Other alive legacy fields to pps are gfp_mask, may_writepage and may_swap.
+
+When a memory inode is low, kswapd/x can wake up kppsd by increasing global
+variable accelerate_kppsd (balance_pgdat), which accelerate stage 1 - 4, and
+call shrink_private_vma to do stage 5.

 PPS statistic data is appended to /proc/meminfo entry, its prototype is in
 include/linux/mm.h.

Index: linux-2.6.19/mm/swapfile.c
===
--- linux-2.6.19.orig/mm/swapfile.c	2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/mm/swapfile.c	2007-02-12 12:45:21.0 +0800
@@ -569,6 +569,7 @@
 			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(pte);
 	return 0;
 }

Index: linux-2.6.19/mm/vmscan.c
===
--- linux-2.6.19.orig/mm/vmscan.c	2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/mm/vmscan.c	2007-02-12 15:48:59.217292888 +0800
@@ -70,6 +70,7 @@
 	/* pps control command. See Documentation/vm_pps.txt. */
 	int may_reclaim;
 	int reclaim_node;
+	int is_kppsd;
 };

 /*
@@ -1101,9 +1102,9 @@
 	return ret;
 }

-// pps fields.
+// pps fields, see Documentation/vm_pps.txt.
 static wait_queue_head_t kppsd_wait;
-static struct scan_control wakeup_sc;
+static int accelerate_kppsd = 0;
 struct pps_info pps_info = {
 	.total = ATOMIC_INIT(0),
 	.pte_count = ATOMIC_INIT(0), // stage 1 and 2.
@@ -1118,24 +1119,22 @@
 	struct page* pages[MAX_SERIES_LENGTH];
 	int series_length;
 	int series_stage;
-} series;
+};

-static int get_series_stage(pte_t* pte, int index)
+static int get_series_stage(struct series_t* series, pte_t* pte, int index)
 {
-	series.orig_ptes[index] = *pte;
-	series.ptes[index] = pte;
-	if (pte_present(series.orig_ptes[index])) {
-		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
-		series.pages[index] = page;
+	series->orig_pte
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
> You have an interesting idea of "simplifies", given
>   16 files changed, 997 insertions(+), 25 deletions(-)
> (omitting your Documentation), and over 7k more code. You'll have to be
> much more persuasive (with good performance results) to get us to welcome
> your added layer of complexity.

If the whole idea is deployed on Linux, the following core objects should be erased:
1) anon_vma.
2) pgdata::active/inactive list and related methods -- mark_page_accessed etc.
3) PrivatePage::count and mapcount. If the core needs to share the page, add a PG_kmap flag. In fact, page::lru_list can safely be erased too.
4) All cases should be from up to down, which especially simplifies debugging.

> Please make an effort to support at least i386 3level pagetables: you
> don't actually need >4GB of memory to test CONFIG_HIGHMEM64G. HIGHMEM
> testing shows you're missing a couple of pte_unmap()s, in
> pps_swapoff_scan_ptes() and in shrink_pvma_scan_ptes().

Yes, it's my fault.

> It would be nice if you could support at least x86_64 too (you have
> pte_low code peculiar to i386 in vmscan.c, which is preventing that), but
> that's harder if you don't have the hardware.

Um! The data cmpxchged should include the access bit. And I have only an x86 PC with memory < 1G. The 3-level pagetable code was copied from other Linux functions.
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
The current test is based on the fact below from my previous mail:

> Current Linux page allocation fairly provides pages for every process.
> Since the swap daemon is only started when memory is low, by the time it
> starts to scan active_list, the private pages of processes are messed up
> with each other. vmscan.c:shrink_list() is the only approach to attach a
> disk swap page to a page on active_list; as the result, all private pages
> lose their affinity on the swap partition. I will give a test later...

Three testcases are imitated here:
1) matrix: Some softwares do a lot of matrix arithmetic in their PrivateVMA; in fact, this type is much more favorable to pps than to Linux.
2) c: C malloc uses an arithmetic just like slab, so when an application resumes from the swap partition, suppose it touches three variables whose sizes are different; as a result, it should touch three pages which are close to each other but aren't continual. I also imitate the case where the application starts full-speed running later (touching more pages).
3) destruction: Typically, if an application resumes because the user clicks the close button, it visits all its private data to execute object destruction.

Test stepping:
1) Run ./entry and say y to it; it may need root rights.
2) Wait a moment until it echoes 'primary wait'.
3) swapoff -a && swapon -a.
4) Run ./hog until count = 10.
5) 'cat primary entry secondary > /dev/null'.
6) 'cat /proc/vmstat' several times and record the 'pswpin' field when it's stable.
7) Type `1', `2' or `3' to select one of the 3 testcases; answer `2' to start the fullspeed testcase.
8) Record the new 'pswpin' field.
9) Which is better? See the 'pswpin' increment. pswpin is increased in mm/page_io.c:swap_readpage.

Test stepping purposes:
1) Step 1: 'entry' wakes up 'primary' and 'secondary' simultaneously; every time 'primary' allocates a page, 'secondary' inserts some pages into active_list close to it.
2) Step 3: we should re-allocate swap pages.
3) Step 4: flush 'entry primary secondary' to the swap partition.
4) Step 5: make the file content of 'entry primary secondary' present in memory.

The testcases are done in a VMware virtual machine 5.5 with 32M memory. If you argue with my circumstances, do your testcases following the steps advised:
1) Run multiple memory-consumers together and make them pause at a point (so all private pages are messed up in the pg_active list).
2) Flush them to the swap partition.
3) Wake up one of them, let it run full-speed for a while, and record pswpin of /proc/vmstat.
4) Invalidate all readaheaded pages.
5) Wake up another and repeat the test.
6) It's also good if you can record the hard-disk LED twinkling :)

Maybe your test resumes all memory-consumers together, so Linux readaheads some pages which are close to the page-fault page but belong to other processes, I think. By the way, what's the linux-mm mail address? It isn't in Documentation/SubmitPatches.

In fact, you will find the Linux column makes the hard-disk LED twinkle every 5 seconds.

Results (pswpin before/after each testcase; increment in parentheses):
-------------------------------------------------------------------
             Linux    pps
matrix       5241     1597
             5322     1620
             (81)     (23)
c            8028     1937
             8095     1954
             (67)     (17)
fullspeed    8313     1964
             (218)    (10)
destruction  9461     4445
             9825     4484
             (364)    (39)

With the secondary.c memset clause commented out, so 'secondary' won't interrupt page allocation in 'primary':
-------------------------------------------------------------------
             Linux    pps
matrix       207      38
             256      59
             (49)     (21)
c            1273     347
             1341     383
             (68)     (36)
fullspeed    1362     386
             (21)     (3)
destruction  2435     1178
             2513     1246
             (78)     (68)

entry.c
-------
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

pid_t pids[4];
int sem_set;
siginfo_t si;

int main(int argc, char **argv)
{
	int i, data;
	unsigned short init_data[4] = { 1, 1, 1, 1 };
	if ((sem_set = semget(123321, 4, IPC_CREAT)) == -1)
		goto failed;
	if (semctl(sem_set, 0, SETALL, init_data) == -1)
		goto failed;
	pid_t pid = vfork();
	if (pid == -1) {
		goto failed;
	} else if (pid == 0) {
		if (execlp("./primary", NULL) == -1)
			goto failed;
	} else {
		pids[0] = pid;
	}
	pid = vfork();
	if (pid == -1) {
		goto failed;
	} else if (pid == 0) {
		if (execlp("./secondary", NULL) == -1)
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
re-code my patch, tab = 8. Sorry!

Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- /dev/null	1970-01-01 00:00:00.0 +
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-01-23 11:32:02.0 +0800
@@ -0,0 +1,236 @@
+                     Pure Private Page System (pps)
+                          [EMAIL PROTECTED]
+                          December 24-26, 2006
+
+// Purpose <([{
+The file is used to document the idea which is published firstly at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of the document is for enchancing the performance of Linux swap
+subsystem. You can find the overview of the idea in section and how I patch it
+into Linux 2.6.19 in section .
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+Good idea originates from overall design and management ability, when you look
+down from a manager view, you will relief yourself from disordered code and
+find some problem immediately.
+
+OK! to modern OS, its memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+Maybe it makes you sense that Page/PTE should be placed on the 3rd layer, but
+here, it's placed on the 2nd layer since it's the basic unit of VMA.
+
+Since the 2nd layer assembles the much statistic of page-acess information, so
+it's nature that swap subsystem should be deployed and implemented on the 2nd
+layer.
+
+Undoubtedly, there are some virtues about it
+1) SwapDaemon can collect the statistic of process acessing pages and by it
+   unmaps ptes, SMP specially benefits from it for we can use flush_tlb_range
+   to unmap ptes batchly rather than frequently TLB IPI interrupt per a page in
+   current Linux legacy swap subsystem.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readaheads the pages relative to the SwapSpace position of current
+   page-fault page.
+3) It's conformable to POSIX madvise API family.
+4) It simplifies Linux memory model dramatically. Keep it in mind that new swap
+   strategy is from up to down. In fact, Linux legacy swap subsystem is maybe
+   the only one from down to up.
+
+Unfortunately, Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system on memory node::active_list/inactive_list.
+
+I've finished a patch, see section . Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps <([{
+As I've referred in previous section, perfectly applying my idea need to unroot
+page-surrounging swap subsystem to migrate it on VMA, but a huge gap has
+defeated me -- active_list and inactive_list. In fact, you can find
+lru_add_active code anywhere ... It's IMPOSSIBLE to me to complete it only by
+myself. It's also the difference between my design and Linux, in my OS, page is
+the charge of its new owner totally, however, to Linux, page management system
+is still tracing it by PG_active flag.
+
+So I conceive another solution:) That is, set up an independent page-recycle
+system rooted on Linux legacy page system -- pps, intercept all private pages
+belonging to PrivateVMA to pps, then use my pps to cycle them. By the way, the
+whole job should be consist of two parts, here is the first --
+PrivateVMA-oriented, other is SharedVMA-oriented (should be called SPS)
+scheduled in future. Of course, if both are done, it will empty Linux legacy
+page system.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages, the whole process is divided into six stages -- . PPS
+uses init_mm::mm_list to enumerate all swappable UserSpace (shrink_private_vma)
+of mm/vmscan.c. Other sections show the remain aspects of pps
+1) is basic data definition.
+2) is focused on synchronization.
+3) how private pages enter in/go off pps.
+4) which VMA is belonging to pps.
+5) new daemon thread kppsd, pps statistic data etc.
+
+I'm also glad to highlight my a new idea -- dftlb which is described in
+section .
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is instroduced by me to enhance flushing TLB efficiency, in
+brief, when we want to unmap a page from the page table of a process, why we
+send TLB IPI to other CPUs immediately, since every CPU has timer interrupt, we
+can insert flushing tasks into timer interrupt route to implement a
+free-charged TLB flushing.
+
+The trick is implemented in
+1) TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   f
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
Patched against 2.6.19 leads to: mm/vmscan.c: In function `shrink_pvma_scan_ptes': mm/vmscan.c:1340: too many arguments to function `page_remove_rmap' So I changed page_remove_rmap(series.pages[i], vma); to page_remove_rmap(series.pages[i]); I worked on 2.6.19, but the function changed when updating to 2.6.20-rc5. But your patch doesn't offer any swap-performance improvement for either swsusp or tmpfs; swap-in is still half the speed of swap-out. Current Linux page allocation provides pages fairly to every process, and since the swap daemon is only started when memory is low, by the time it starts to scan the active_list the private pages of processes are mixed up with each other. vmscan.c:shrink_list() is the only place that attaches a disk swap page to a page on the active_list; as a result, all private pages lose their affinity on the swap partition. I will give a test later... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
My patch is based on my new idea for the Linux swap subsystem; you can find more in Documentation/vm_pps.txt, which isn't only the patch illustration but also the file changelog. In brief, SwapDaemon should scan and reclaim pages on UserSpace::vmalist rather than the current zone::active/inactive. The change will conspicuously enhance swap subsystem performance by 1) SwapDaemon can collect statistics on which pages a process accesses and use them to unmap ptes. SMP especially benefits from this, since we can use flush_tlb_range to unmap ptes in batches rather than firing a TLB IPI interrupt per page as in the current Linux legacy swap subsystem. 2) Page-fault can issue better readahead requests, since history data shows all related pages have conglomerating affinity. In contrast, Linux page-fault readaheads the pages adjacent to the SwapSpace position of the current page-fault page. 3) It conforms to the POSIX madvise API family. 4) It simplifies the Linux memory model dramatically. Keep in mind that the new swap strategy works from the top down; in fact, the Linux legacy swap subsystem is perhaps the only one that works from the bottom up. Other questions asked about my pps: 1) There isn't a new lock order in my pps; it's compliant with the Linux lock order defined in mm/rmap.c. 2) When a memory inode is low, you can set scan_control::reclaim_node to let my kppsd reclaim that memory inode's pages. Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]> Index: linux-2.6.19/Documentation/vm_pps.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.19/Documentation/vm_pps.txt 2007-01-22 13:52:04.973820224 +0800 @@ -0,0 +1,237 @@ + Pure Private Page System (pps) + Copyright by Yunfeng Zhang on GFDL 1.2 + [EMAIL PROTECTED] + December 24-26, 2006 + +// Purpose <([{ +This file documents the idea first published at +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html as a part of my +OS -- main page http://blog.chinaunix.net/u/21764/index.php. 
In brief, the +patch this document describes enhances the performance of the Linux swap +subsystem. You can find an overview of the idea in section <How to Reclaim +Pages more Efficiently> and how I patch it into Linux 2.6.19 in section +<Pure Private Page System -- pps>. +// }])> + +// How to Reclaim Pages more Efficiently <([{ +A good idea originates from overall design and management ability; when you look +down from a manager's view, you free yourself from disordered code and +find some problems immediately. + +In a modern OS, the memory subsystem can be divided into three layers +1) Space layer (InodeSpace, UserSpace and CoreSpace). +2) VMA layer (PrivateVMA and SharedVMA, the memory architecture-independent layer). +3) Page table, zone and memory inode layer (architecture-dependent). +It may strike you that Page/PTE should be placed on the 3rd layer, but +here it is placed on the 2nd layer since it is the basic unit of a VMA. + +Since the 2nd layer gathers most of the page-access statistics, it is natural +that the swap subsystem should be deployed and implemented on the 2nd +layer. + +Undoubtedly, this approach has some virtues +1) SwapDaemon can collect statistics on which pages a process accesses and use + them to unmap ptes. SMP especially benefits from this, since we can use + flush_tlb_range to unmap ptes in batches rather than firing a TLB IPI interrupt per page as in the + current Linux legacy swap subsystem. +2) Page-fault can issue better readahead requests, since history data shows all + related pages have conglomerating affinity. In contrast, Linux page-fault + readaheads the pages adjacent to the SwapSpace position of the current + page-fault page. +3) It conforms to the POSIX madvise API family. +4) It simplifies the Linux memory model dramatically. Keep in mind that the new swap + strategy works from the top down; in fact, the Linux legacy swap subsystem is perhaps + the only one that works from the bottom up. 
+ +Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a +system on memory node::active_list/inactive_list. + +I've finished a patch, see section <Pure Private Page System -- pps>. Note, it +ISN'T perfect. +// }])> + +// Pure Private Page System -- pps <([{ +As I referred to in the previous section, perfectly applying my idea needs to uproot the +page-surrounding swap subsystem and migrate it onto the VMA, but a huge gap has +defeated me -- active_list and inactive_list. In fact, you can find +lru_add_active code anywhere ... It's IMPOSSIBLE for me to complete it by +myself alone. It's also the difference between my design and Linux: in my OS, a page is +wholly in the charge of its new owner; to Linux, however, the page management system +still traces it by the PG_active flag. + +So I conceive another solution:) That is, set up an independent page-recycle +system rooted on the Linux legacy page system -- pps, intercept all private pages +belonging to PrivateVMA into pps, then use my pps to recycle them. By the way, the whole job consists of two parts; here is the first -- PrivateVMA-oriented, the other is SharedVMA-oriented (should be called SPS), scheduled for the future.
Re: [PATCH 2.6.16.29 1/1] MM: enhance Linux swap subsystem
2007/1/11, yunfeng zhang <[EMAIL PROTECTED]>: 2007/1/11, Rik van Riel <[EMAIL PROTECTED]>: Have you actually measured this? If your measurements saw any performance gains, with what kind of workload did they happen, how big were they, and how do you explain those performance gains? How do you balance scanning the private memory against taking pages off the per-zone page lists? How do you deal with systems where some zones are really large and other zones are really small, e.g. a 32-bit system with one 880MB zone and one 15.1GB zone? Another solution is to add a new field preferred_zone to the scan_control of mm/vmscan.c; keep in mind that the new swap strategy works from the top down, from UserSpace to PrivatePage/SharedPage.
Re: [PATCH 2.6.16.29 1/1] MM: enhance Linux swap subsystem
2007/1/11, Rik van Riel <[EMAIL PROTECTED]>: Have you actually measured this? If your measurements saw any performance gains, with what kind of workload did they happen, how big were they, and how do you explain those performance gains? How do you balance scanning the private memory against taking pages off the per-zone page lists? How do you deal with systems where some zones are really large and other zones are really small, e.g. a 32-bit system with one 880MB zone and one 15.1GB zone? You didn't mention another problem: a VMA may reside on multiple zones. For a multiple-address-space, multiple-memory-inode architecture, we can introduce a new core object -- the section -- which has several features 1) A section is the atomic unit containing the pages of a VMA that reside in the section's memory inode. 2) When page migration occurs among different memory inodes, a new section should be set up to trace the pages. 3) A section can be scanned directly by the SwapDaemon of its memory inode. 4) All sections of a VMA are mutually exclusive, not overlapped. 5) A VMA is made up entirely of sections, but its section objects are scattered across memory inodes. So on such an architecture we can deploy the swap subsystem on an architecture-independent layer via sections and scan pages in batches. If the benefits come mostly from better IO clustering, would it not be safer/less invasive to add swapout clustering of the same style that the BSD kernels have? You mean add a new swap file/partition? I scan every UserSpace via init_mm::mmlist, which is already used by the swap subsystem, and I've patched mm/swapfile.c. For your reference, the BSD kernels do swapout clustering like this: 1) select a page off the end of the pageout list 2) then scan the page table the page is in, to find nearby pages that are also eligible for pageout 3) page them all out with one disk I/O operation The same could also be done for files. 
With peterz's dirty tracking (and possible dirty limiting) code in the kernel, this can be done without the kind of deadlocks that would have plagued earlier kernels when trying to do IO trickery from the pageout path... I've noticed FreeBSD may have a similar design to the one I mentioned; I'm reading their code to see whether they have another two features -- UnmappedPTE and dftlb -- from my Documentation/vm_pps.txt.
[PATCH 2.6.16.29 1/1] MM: enhance Linux swap subsystem
My patch is based on my new idea for the Linux swap subsystem; you can find more in Documentation/vm_pps.txt, which isn't only the patch illustration but also the file changelog. In brief, SwapDaemon should scan and reclaim pages on UserSpace::vmalist rather than the current zone::active/inactive. The change will conspicuously enhance swap subsystem performance by 1) SwapDaemon can collect statistics on which pages a process accesses and use them to unmap ptes. SMP especially benefits from this, since we can use flush_tlb_range to unmap ptes in batches rather than firing a TLB IPI interrupt per page as in the current Linux legacy swap subsystem. In fact, in some cases we can even flush the TLB without sending an IPI. 2) Page-fault can issue better readahead requests, since history data shows all related pages have conglomerating affinity. In contrast, Linux page-fault readaheads the pages adjacent to the SwapSpace position of the current page-fault page. 3) It conforms to the POSIX madvise API family. Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]> Index: linux-2.6.16.29/Documentation/vm_pps.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.16.29/Documentation/vm_pps.txt2007-01-11 10:51:15.649869656 +0800 @@ -0,0 +1,214 @@ + Pure Private Page System (pps) + Copyright by Yunfeng Zhang on GFDL 1.2 + [EMAIL PROTECTED] + December 24-26, 2006 + +// Purpose <([{ +This file documents the idea first published at +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html as a part of my +OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the +patch this document describes enhances the performance of the Linux swap +subsystem. You can find an overview of the idea in section <How to Reclaim +Pages more Efficiently> and how I patch it into Linux 2.6.16.29 in section +<Pure Private Page System -- pps>. +// }])> + +// How to Reclaim Pages more Efficiently <([{ +A good idea originates from overall design and management ability; when you look +down from a manager's view, you free yourself from disordered code and +find some problems immediately. 
+ +In a modern OS, the memory subsystem can be divided into three layers +1) Space layer (InodeSpace, UserSpace and CoreSpace). +2) VMA layer (PrivateVMA and SharedVMA, the memory architecture-independent layer). +3) PTE, zone/memory inode layer (architecture-dependent). +It may strike you that Page should be placed on the 3rd layer, but + here it is placed on the 2nd layer since it is the basic unit of a VMA. + +Since the 2nd layer gathers most of the page-access statistics, it is natural +that the swap subsystem should be deployed and implemented on the 2nd +layer. + +Undoubtedly, this approach has some virtues +1) SwapDaemon can collect statistics on which pages a process accesses and use + them to unmap ptes. SMP especially benefits from this, since we can use + flush_tlb_range to unmap ptes in batches rather than firing a TLB IPI interrupt per page as in the + current Linux legacy swap subsystem. +2) Page-fault can issue better readahead requests, since history data shows all + related pages have conglomerating affinity. In contrast, Linux page-fault + readaheads the pages adjacent to the SwapSpace position of the current + page-fault page. +3) It conforms to the POSIX madvise API family. + +Unfortunately, the Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a +system on zone::active_list/inactive_list. + +I've finished a patch, see section <Pure Private Page System -- pps>. Note, it +ISN'T perfect. +// }])> + +// Pure Private Page System -- pps <([{ +As I referred to in the previous section, perfectly applying my idea needs to uproot the +page-surrounding swap subsystem and migrate it onto the VMA, but a huge gap has +defeated me -- active_list and inactive_list. In fact, you can find +lru_add_active code anywhere ... It's IMPOSSIBLE for me to complete it by +myself alone. It's also the difference between my design and Linux: in my OS, a page is +wholly in the charge of its new owner; to Linux, however, the page management system +is still tracing it by the PG_active flag. 
+ +So I conceive another solution:) That is, set up an independent page-recycle +system rooted on the Linux legacy page system -- pps, intercept all private pages +belonging to PrivateVMA into pps, then use my pps to recycle them. By the way, the +whole job consists of two parts; here is the first -- +PrivateVMA-oriented (PPS), the other is SharedVMA-oriented (should be called SPS), +scheduled for the future. Of course, if all are done, the Linux legacy +page system will be emptied. + +In fact, pps is centered on how to better collect and unmap process private +pages in SwapDaemon mm/vmscan.c:shrink_private_vma; the whole process is +divided into six stages -- <Stage Definition>. Other sections cover the remaining +aspects of pps +1) <Data Definition> is the basic data definition. +2) <Concurrent racers of Shrinking pps> is focused on synchronization. +3) <Private Page Lifecycle of pps> -- how private pages enter and leave pps.
[PATCH 2.6.16.29 1/1] MM: enhance Linux swap subsystem
My patch is based on my new idea to Linux swap subsystem, you can find more in Documentation/vm_pps.txt which isn't only patch illustration but also file changelog. In brief, SwapDaemon should scan and reclaim pages on UserSpace::vmalist other than current zone::active/inactive. The change will conspicuously enhance swap subsystem performance by 1) SwapDaemon can collect the statistic of process acessing pages and by it unmaps ptes, SMP specially benefits from it for we can use flush_tlb_range to unmap ptes batchly rather than frequently TLB IPI interrupt per a page in current Linux legacy swap subsystem. In fact, in some cases, we can even flush TLB without sending IPI. 2) Page-fault can issue better readahead requests since history data shows all related pages have conglomerating affinity. In contrast, Linux page-fault readaheads the pages relative to the SwapSpace position of current page-fault page. 3) It's conformable to POSIX madvise API family. Signed-off-by: Yunfeng Zhang [EMAIL PROTECTED] Index: linux-2.6.16.29/Documentation/vm_pps.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.16.29/Documentation/vm_pps.txt2007-01-11 10:51:15.649869656 +0800 @@ -0,0 +1,214 @@ + Pure Private Page System (pps) + Copyright by Yunfeng Zhang on GFDL 1.2 + [EMAIL PROTECTED] + December 24-26, 2006 + +// Purpose ([{ +The file is used to document the idea which is published firstly at +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my +OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the +patch of the document is for enchancing the performance of Linux swap +subsystem. You can find the overview of the idea in section How to Reclaim +Pages more Efficiently and how I patch it into Linux 2.6.16.29 in section +Pure Private Page System -- pps. 
+// }]) + +// How to Reclaim Pages more Efficiently ([{ +Good idea originates from overall design and management ability, when you look +down from a manager view, you will relief yourself from disordered code and +find some problem immediately. + +OK! to modern OS, its memory subsystem can be divided into three layers +1) Space layer (InodeSpace, UserSpace and CoreSpace). +2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer). +3) PTE, zone/memory inode layer (architecture-dependent). +4) Maybe it makes you sense that Page should be placed on the 3rd layer, but + here, it's placed on the 2nd layer since it's the basic unit of VMA. + +Since the 2nd layer assembles the much statistic of page-acess information, so +it's nature that swap subsystem should be deployed and implemented on the 2nd +layer. + +Undoubtedly, there are some virtues about it +1) SwapDaemon can collect the statistic of process acessing pages and by it + unmaps ptes, SMP specially benefits from it for we can use flush_tlb_range + to unmap ptes batchly rather than frequently TLB IPI interrupt per a page in + current Linux legacy swap subsystem. +2) Page-fault can issue better readahead requests since history data shows all + related pages have conglomerating affinity. In contrast, Linux page-fault + readaheads the pages relative to the SwapSpace position of current + page-fault page. +3) It's conformable to POSIX madvise API family. + +Unfortunately, Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a +system on zone::active_list/inactive_list. + +I've finished a patch, see section Pure Private Page System -- pps. Note, it +ISN'T perfect. +// }]) + +// Pure Private Page System -- pps ([{ +As I've referred in previous section, perfectly applying my idea need to unroot +page-surrounging swap subsystem to migrate it on VMA, but a huge gap has +defeated me -- active_list and inactive_list. In fact, you can find +lru_add_active code anywhere ... 
It's IMPOSSIBLE for me to complete it only by +myself. It's also the difference between my design and Linux: in my OS, a page is +totally in the charge of its new owner; to Linux, however, the page management system +is still tracing it by the PG_active flag. + +So I conceive another solution:) That is, set up an independent page-recycle +system rooted on the Linux legacy page system -- pps, intercept all private pages +belonging to PrivateVMA into pps, then use pps to recycle them. By the way, the +whole job should consist of two parts; here is the first -- +PrivateVMA-oriented (PPS), the other is SharedVMA-oriented (should be called SPS), +scheduled for the future. Of course, if all are done, it will empty the Linux legacy +page system. + +In fact, pps is centered on how to better collect and unmap process private +pages in the SwapDaemon mm/vmscan.c:shrink_private_vma; the whole process is +divided into six stages -- Stage Definition. Other sections show the remaining +aspects of pps +1) Data Definition
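The batching argument in virtue 1) can be made concrete with a toy cost model (a user-space sketch, not kernel code; the function names and the one-IPI-round-per-page assumption are mine, invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Toy cost model (not kernel code): flushing pages one at a time costs
 * one IPI round per page; a flush_tlb_range-style batch covers the
 * whole span with a single IPI round. */
static size_t ipis_per_page_flush(size_t pages, size_t remote_cpus)
{
    return pages * remote_cpus;   /* one IPI per remote CPU, per page */
}

static size_t ipis_range_flush(size_t pages, size_t remote_cpus)
{
    (void)pages;                  /* whole span in one invalidation */
    return remote_cpus;           /* a single IPI round */
}
```

Under this model, unmapping a 32-page run with 3 remote CPUs drops from 96 remote interrupts to 3, which is the saving the patch attributes to flush_tlb_range.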
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
Sorry, I can't be online regularly, that is, I can't synchronize with Linux CVS, so I only work on a fixed kernel version. Documentation/vm_pps.txt isn't only a patch overview but also a changelog. Great! Do you have a patch against 2.6.19? Thanks! -- Al - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
Maybe there should be a memory maintainer in the Linux kernel group. Here I show some content from my patch (Documentation/vm_pps.txt). In brief, I make a revolution in the Linux swap subsystem; the idea is that the SwapDaemon should scan and reclaim pages on UserSpace::vmalist rather than the current zone::active/inactive lists. The change will conspicuously enhance swap subsystem performance because 1) the SwapDaemon can collect statistics about the pages a process accesses and unmap ptes accordingly; SMP benefits especially, since we can use flush_tlb_range to unmap ptes in batches rather than issuing a TLB IPI per page as the current Linux legacy swap subsystem does. In fact, in some cases we can even flush the TLB without sending an IPI. 2) Page fault can issue better readahead requests, since history data shows all related pages have conglomerating affinity. In contrast, Linux page-fault readahead fetches the pages around the SwapSpace position of the current page-fault page. 3) It conforms to the POSIX madvise API family.
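The readahead argument can be illustrated with the ABAB interleaving raised elsewhere in the thread (a user-space sketch under my own simplifying assumptions, not the patch's code): when two processes' pages alternate in SwapSpace, swap-position readahead pulls in pages of the other process, while VMA-based readahead only walks the faulting process's own pages.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: swap slots are owned alternately by process A (even slots)
 * and process B (odd slots).  Swap-position readahead takes the slots
 * after the faulting slot; count how many of `window` candidates belong
 * to the faulting process. */
static size_t useful_swappos_readahead(size_t fault_slot, size_t window)
{
    size_t i, useful = 0;
    size_t owner = fault_slot % 2;        /* ABAB layout */
    for (i = fault_slot + 1; i <= fault_slot + window; i++)
        if (i % 2 == owner)
            useful++;
    return useful;
}

static size_t useful_vma_readahead(size_t window)
{
    return window;   /* every candidate comes from the faulting VMA */
}
```

With a 4-page window, swap-position readahead wastes half its candidates in this layout, matching the "4 unused readahead pages" complaint quoted at the top of the thread.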
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
A new patch has been done by me, based on the previous quilt patch (2.6.16.29). Here is the changelog -- NEW A new kernel thread kppsd is added to execute the background scanning task periodically (mm/vmscan.c). PPS statistics are added into /proc/meminfo; the prototype is in include/linux/mm.h. Documentation/vm_pps.txt is also updated to show the above two new features; some sections are rewritten for comprehension. BUG New loop code is introduced in shrink_private_vma (mm/vmscan.c) and pps_swapoff (mm/swapfile.c); in contrast with the old code, it is safe even if lhtemp is freed during the loop. A bug is caught in mm/memory.c:zap_pte_range -- if a PrivatePage is being written back, it will be migrated back to the Linux legacy page system. A fault I made in the previous patch is remedied in stage 5; now stage 5 can work. MISCELLANEOUS UP code has been separated from SMP code in dftlb. -- Index: linux-2.6.16.29/Documentation/vm_pps.txt === --- linux-2.6.16.29.orig/Documentation/vm_pps.txt 2007-01-04 14:47:35.0 +0800 +++ linux-2.6.16.29/Documentation/vm_pps.txt2007-01-04 14:49:36.0 +0800 @@ -6,11 +6,11 @@ // Purpose <([{ The file is used to document the idea which is published firstly at http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my -OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, a patch -of the document to enhance the performance of Linux swap -subsystem. You can -find the overview of the idea in section How to Reclaim Pages more -Efficiently and how I patch it into Linux 2.6.16.29 in section Pure Private -Page System -- pps. +OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the +patch of the document is for enhancing the performance of Linux swap +subsystem. You can find the overview of the idea in section How to Reclaim +Pages more Efficiently and how I patch it into Linux 2.6.16.29 in section +Pure Private Page System -- pps. // }])> // How to Reclaim Pages more Efficiently <([{ @@ -21,7 +21,9 @@ OK! to modern OS, its memory subsystem can be divided into three layers 1) Space layer (InodeSpace, UserSpace and CoreSpace). 
2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer). -3) PTE and page layer (architecture-dependent). +3) PTE, zone/memory inode layer (architecture-dependent). +4) Maybe it makes you sense that Page should be placed on the 3rd layer, but + here, it's placed on the 2nd layer since it's the basic unit of VMA. Since the 2nd layer assembles the much statistic of page-acess information, so it's nature that swap subsystem should be deployed and implemented on the 2nd @@ -41,7 +43,8 @@ Unfortunately, Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a system on zone::active_list/inactive_list. -I've finished a patch, see section . Note, it ISN'T perfect. +I've finished a patch, see section . Note, it +ISN'T perfect. // }])> // Pure Private Page System -- pps <([{ @@ -70,7 +73,18 @@ 3) -- how private pages enter in/go off pps. 4) which VMA is belonging to pps. -PPS uses init_mm.mm_list list to enumerate all swappable UserSpace. +PPS uses init_mm.mm_list list to enumerate all swappable UserSpace +(shrink_private_vma). + +A new kernel thread -- kppsd is introduced in mm/vmscan.c, its task is to +execute the stages of pps periodically, note an appropriate timeout ticks is +necessary so we can give application a chance to re-map back its PrivatePage +from UnmappedPTE to PTE, that is, show their conglomeration affinity. +scan_control::pps_cmd field is used to control the behavior of kppsd, = 1 for +accelerating scanning process and reclaiming pages, it's used in balance_pgdat. + +PPS statistic data is appended to /proc/meminfo entry, its prototype is in +include/linux/mm.h. I'm also glad to highlight my a new idea -- dftlb which is described in section . @@ -97,15 +111,19 @@ gone when a CPU starts to execute the task in timer interrupt, so don't use dftlb. combine stage 1 with stage 2, and send IPI immediately in fill_in_tlb_tasks. + +dftlb increases mm_struct::mm_users to prevent the mm from being freed when +other CPU works on it. 
// }])> // Stage Definition <([{ The whole process of private page page-out is divided into six stages, as -showed in shrink_pvma_scan_ptes of mm/vmscan.c +showed in shrink_pvma_scan_ptes of mm/vmscan.c, the code groups the similar +pages to a series. 1) PTE to untouched PTE (access bit is cleared), append flushing tasks to dftlb. 2) Convert untouched PTE to UnmappedPTE. 3) Link SwapEntry to every UnmappedPTE. -4) Synchronize the page of a UnmappedPTE with its physical swap page. +4) Flush PrivatePage of UnmappedPTE to its disk SwapPage. 5) Reclaimed the page and shift UnmappedPTE to SwappedPTE. 6) SwappedPTE stage. // }])> @@ -114,7 +132,15 @@ New VMA flag (VM_PURE_PRIVATE) is appended into VMA in include/linux/mm.h. New PTE type (UnmappedPTE) is appended into PTE system in -include/asm-i386/pgtable.h. +include/asm-i386/pgtable.h. Its
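The six stages in the hunk above read naturally as a per-pte state machine. A minimal user-space sketch (the enum and helpers are my invention, mirroring the document's terminology, not real kernel types):

```c
#include <assert.h>

/* User-space sketch of the six-stage page-out walk; names mirror the
 * document's terminology, not real kernel types. */
enum pps_pte_state {
    PPS_PTE,                /* mapped, access bit set             */
    PPS_PTE_UNTOUCHED,      /* stage 1: access bit cleared        */
    PPS_UNMAPPED_PTE,       /* stage 2: unmapped from page table  */
    PPS_UNMAPPED_SWAPENTRY, /* stage 3: SwapEntry linked          */
    PPS_FLUSHED,            /* stage 4: PrivatePage written back  */
    PPS_SWAPPED_PTE         /* stages 5-6: page reclaimed         */
};

static enum pps_pte_state pps_next_stage(enum pps_pte_state s)
{
    /* each scan pass advances a pte by exactly one stage; the final
     * SwappedPTE stage is absorbing */
    return s == PPS_SWAPPED_PTE ? PPS_SWAPPED_PTE
                                : (enum pps_pte_state)(s + 1);
}

static enum pps_pte_state pps_run_stages(enum pps_pte_state s, int passes)
{
    while (passes-- > 0)
        s = pps_next_stage(s);
    return s;
}
```

The one-stage-per-pass shape is what gives an application the window (between stages 1 and 2) to touch a page again and pull it back before it is unmapped.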
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
No, it's a new idea to rewrite the swap subsystem entirely. In fact, it's an impossible task for me alone, so I provide a compromise solution -- pps (pure private page system). 2006/12/30, Zhou Yingchao <[EMAIL PROTECTED]>: 2006/12/27, yunfeng zhang <[EMAIL PROTECTED]>: > To a multiple address space, multiple memory inode architecture, we can introduce > a new core object -- section which has several features Do you mean "in-memory inode" or "memory node (pglist_data)" by "memory inode"? > The idea I raised is whether the swap subsystem should be deployed on layer 2 or > layer 3, which is described in Documentation/vm_pps.txt of my patch. To a multiple > memory inode architecture, the special memory model should be encapsulated on > layer 3 (architecture-dependent), I think. I guess that you want to do something to remove arch-dependent code in the swap subsystem, just like the pud introduced in the page-table related code. Is that right? However, you should verify that your changes will not deteriorate system performance. Also, you will need to maintain it for a long time with the evolution of the mainline kernel before it is accepted. Best regards -- Yingchao Zhou *** Institute Of Computing Technology Chinese Academy of Sciences ***
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
I've re-published my work with quilt, sorry. Index: linux-2.6.16.29/Documentation/vm_pps.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.16.29/Documentation/vm_pps.txt2006-12-29 14:36:36.507332384 +0800 @@ -0,0 +1,192 @@ + Pure Private Page System (pps) + Copyright by Yunfeng Zhang on GFDL 1.2 + [EMAIL PROTECTED] + December 24-26, 2006 + +// Purpose <([{ +The file documents the idea first published at +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my +OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, a patch +of this document to enhance the performance of the Linux swap subsystem. You can +find the overview of the idea in section How to Reclaim Pages more Efficiently +and how I patch it into Linux 2.6.16.29 in section Pure Private Page System -- pps. +// }])> + +// How to Reclaim Pages more Efficiently <([{ +A good idea originates from overall design and management ability; when you look +down from a manager's view, you relieve yourself of disordered code and +spot some problems immediately. + +OK! In a modern OS, the memory subsystem can be divided into three layers +1) Space layer (InodeSpace, UserSpace and CoreSpace). +2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer). +3) PTE and page layer (architecture-dependent). + +Since the 2nd layer gathers most of the page-access statistics, it is +natural that the swap subsystem should be deployed and implemented on the 2nd +layer. + +Undoubtedly, there are some virtues to it +1) The SwapDaemon can collect statistics about the pages a process accesses and + unmap ptes accordingly; SMP benefits especially, since we can use flush_tlb_range + to unmap ptes in batches rather than issuing a TLB IPI per page as the + current Linux legacy swap subsystem does. +2) Page fault can issue better readahead requests since history data shows all + related pages have conglomerating affinity. 
In contrast, Linux page-fault + readahead fetches the pages around the SwapSpace position of the current + page-fault page. +3) It conforms to the POSIX madvise API family. + +Unfortunately, the Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a +system on zone::active_list/inactive_list. + +I've finished a patch, see section Pure Private Page System -- pps. Note, it ISN'T perfect. +// }])> + +// Pure Private Page System -- pps <([{ +As I've mentioned in the previous section, perfectly applying my idea needs to uproot +the page-surrounding swap subsystem and migrate it onto VMA, but a huge gap has +defeated me -- active_list and inactive_list. In fact, you can find +lru_add_active code anywhere ... It's IMPOSSIBLE for me to complete it only by +myself. It's also the difference between my design and Linux: in my OS, a page is +totally in the charge of its new owner; to Linux, however, the page management system +is still tracing it by the PG_active flag. + +So I conceive another solution:) That is, set up an independent page-recycle +system rooted on the Linux legacy page system -- pps, intercept all private pages +belonging to PrivateVMA into pps, then use pps to recycle them. By the way, the +whole job should consist of two parts; here is the first -- +PrivateVMA-oriented (PPS), the other is SharedVMA-oriented (should be called SPS), +scheduled for the future. Of course, if all are done, it will empty the Linux legacy +page system. + +In fact, pps is centered on how to better collect and unmap process private +pages in the SwapDaemon mm/vmscan.c:shrink_private_vma; the whole process is +divided into six stages -- Stage Definition. Other sections show the remaining +aspects of pps +1) Data Definition is the basic data definition. +2) Concurrent Racers of Shrinking pps is focused on synchronization. +3) Private Page Lifecycle of pps -- how private pages enter in/go off pps. +4) VMA Lifecycle of pps -- which VMA belongs to pps. + +PPS uses the init_mm.mm_list list to enumerate all swappable UserSpace. + +I'm also glad to highlight my new idea -- dftlb, which is described in +section Delay to Flush TLB. 
+// }])> + +// Delay to Flush TLB (dftlb) <([{ +Delay to flush TLB is introduced by me to enhance TLB flushing efficiency; in +brief, when we want to unmap a page from the page table of a process, why should we +send a TLB IPI to other CPUs immediately? Since every CPU has a timer interrupt, we +can insert flushing tasks into the timer interrupt routine to implement +free-of-charge TLB flushing. + +The trick is implemented in +1) The TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c. +2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute + flushing tasks. +3) All data are defined in include/linux/mm.h. + +The restrictions of dftlb. The following conditions must be met +1) an atomic cmpxchg instruction. +2) CPUs atomically set the access bit after they first touch a pte. +3) To some architectures, the vma parameter of flush_tlb_range is maybe important; + if so, since it's possible that the vma of a TLB flushing task has + gone when a
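The deferral described above — queue a flush task at unmap time, drain the queue from the timer interrupt instead of sending an immediate IPI — can be modeled in user space like this (the names dftlb_fill_task/dftlb_timer_tick and the queue size are invented; the real patch uses fill_in_tlb_task and timer_flush_tlb_tasks):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal single-threaded model of dftlb's deferral; not kernel code. */
#define DFTLB_MAX_TASKS 64

struct dftlb_queue {
    unsigned long start[DFTLB_MAX_TASKS]; /* queued [start, end) ranges */
    unsigned long end[DFTLB_MAX_TASKS];
    size_t pending;   /* tasks waiting for the next tick */
    size_t flushed;   /* ranges flushed so far, in tick context */
};

/* Unmap path: only enqueue; no IPI is sent here. */
static int dftlb_fill_task(struct dftlb_queue *q,
                           unsigned long start, unsigned long end)
{
    if (q->pending == DFTLB_MAX_TASKS)
        return -1;    /* queue full: caller must fall back to eager flush */
    q->start[q->pending] = start;
    q->end[q->pending] = end;
    q->pending++;
    return 0;
}

/* Timer path: drain every queued range "for free" on the next tick. */
static size_t dftlb_timer_tick(struct dftlb_queue *q)
{
    size_t done = q->pending;
    q->flushed += done;
    q->pending = 0;
    return done;
}
```

The design trade is visible even in this sketch: the flush is free only because it is late, so correctness depends on the restrictions listed above (no CPU may rely on the stale translation between enqueue and tick).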
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
The job listed in Documentation/vm_pps.txt of my patch is too heavy for me alone, so I would appreciate it if the Linux kernel group could arrange a schedule to help me.
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
For a multiple address space, multiple memory inode architecture, we can introduce a new core object -- section -- which has several features 1) Section is used as the atomic unit to contain the pages of a VMA residing in the memory inode of the section. 2) When page migration occurs among different memory inodes, a new section should be set up to trace the pages. 3) A section can be scanned directly by the SwapDaemon of its memory inode. 4) All sections of a VMA are mutually exclusive, not overlapped. 5) A VMA is made up of sections entirely, but its section objects scatter over memory inodes. So on that architecture, we can deploy the swap subsystem on an architecture-independent layer by section and scan pages in batches. The idea I raised is whether the swap subsystem should be deployed on layer 2 or layer 3, which is described in Documentation/vm_pps.txt of my patch. For a multiple memory inode architecture, the special memory model should be encapsulated on layer 3 (architecture-dependent), I think.
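Features 4) and 5) amount to saying that a VMA's sections form a partition of its address range. A small user-space sketch of that invariant (the section granularity and helper are hypothetical, not from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical granularity for illustration only. */
#define SECTION_BYTES 0x10000UL

struct section { unsigned long start, end; };

/* Split [vma_start, vma_end) into aligned sections; returns the count.
 * The result is exclusive and contiguous: out[i].end == out[i+1].start,
 * the first section starts at vma_start and the last ends at vma_end. */
static size_t vma_to_sections(unsigned long vma_start, unsigned long vma_end,
                              struct section *out, size_t max)
{
    size_t n = 0;
    unsigned long pos = vma_start;

    while (pos < vma_end && n < max) {
        unsigned long next = (pos / SECTION_BYTES + 1) * SECTION_BYTES;
        out[n].start = pos;
        out[n].end = next < vma_end ? next : vma_end;
        pos = out[n].end;
        n++;
    }
    return n;
}
```

On a NUMA-style machine each resulting section could then live on (and be scanned by) a single memory inode, which is what makes the per-inode SwapDaemon scan in feature 3) possible.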
[PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
dr, next); - } while (pud++, addr = next, addr != end); -} - -// migrate all pages of pure private vma back to Linux legacy memory management. -static void migrate_back_legacy_linux(struct mm_struct* mm, struct vm_area_struct* vma) -{ - pgd_t* pgd; - unsigned long next; - unsigned long addr = vma->vm_start; - unsigned long end = vma->vm_end; - - pgd = pgd_offset(mm, addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - continue; - migrate_back_pud_range(mm, pgd, vma, addr, next); - } while (pgd++, addr = next, addr != end); -} - -LIST_HEAD(pps_head); -LIST_HEAD(pps_head_buddy); - -DEFINE_SPINLOCK(pps_lock); - -void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma) -{ - int condition = VM_READ | VM_WRITE | VM_EXEC | \ -VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | \ -VM_GROWSDOWN | VM_GROWSUP | \ -VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY | VM_ACCOUNT; - if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL) { - vma->vm_flags |= VM_PURE_PRIVATE; - if (list_empty(&mm->mmlist)) { - spin_lock(&mmlist_lock); - if (list_empty(&mm->mmlist)) - list_add(&mm->mmlist, &init_mm.mmlist); - spin_unlock(&mmlist_lock); - } - } -} - -void leave_pps(struct vm_area_struct* vma, int migrate_flag) -{ - struct mm_struct* mm = vma->vm_mm; - - if (vma->vm_flags & VM_PURE_PRIVATE) { - vma->vm_flags &= ~VM_PURE_PRIVATE; - if (migrate_flag) - migrate_back_legacy_linux(mm, vma); - } -} --- patch-linux/kernel/timer.c 2006-12-26 15:20:02.688545256 +0800 +++ linux-2.6.16.29/kernel/timer.c 2006-09-13 02:02:10.0 +0800 @@ -845,2 +844,0 @@ - - timer_flush_tlb_tasks(NULL); --- patch-linux/kernel/fork.c 2006-12-26 15:20:02.688545256 +0800 +++ linux-2.6.16.29/kernel/fork.c 2006-09-13 02:02:10.0 +0800 @@ -232 +231,0 @@ - leave_pps(mpnt, 1); --- patch-linux/Documentation/vm_pps.txt2006-12-26 15:45:33.203883456 +0800 +++ linux-2.6.16.29/Documentation/vm_pps.txt1970-01-01 08:00:00.0 +0800 @@ -1,190 +0,0 @@ - Pure Private Page System (pps) - Copyright by Yunfeng Zhang on GFDL 1.2 - [EMAIL 
PROTECTED] - December 24-26, 2006 - -// Purpose <([{ -The file is used to document the idea which is published firstly at -http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my -OS -- main page http://blog.chinaunix.net/u/21764/index.php. You can find the -overview of the idea in section and how -I patch it into Linux 2.6.16.29 in section . -// }])> - -// How to Reclaim Pages more Efficiently <([{ -Good idea originates from overall design and management ability, when you look -down from a manager view, you will relief yourself from disordered code and -find some problem immediately. - -OK! to modern OS, its memory subsystem can be divided into three layers -1) Space layer (InodeSpace, UserSpace and CoreSpace). -2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer). -3) PTE and page layer (architecture-dependent). - -Since the 2nd layer assembles the much statistic of page-acess information, so -it's nature that swap subsystem should be deployed and implemented on the 2nd -layer. - -Undoubtedly, there are some virtues about it -1) SwapDaemon can collect the statistic of process acessing pages and by it - unmaps ptes, SMP specially benefits from it for we can use flush_tlb_range - to relief frequently TLB IPI interrupt. -2) Page-fault can issue better readahead requests since history data shows all - related pages have conglomerating affinity. -3) It's conformable to POSIX madvise API family. - -Unfortunately, Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a -system surrounding page. - -I've commited a patch to it, see section . Note, it ISN'T perfect. -// }])> - -// Pure Private Page System -- pps <([{ -Current Linux is just like a monster and still growing, even its swap subsystem -... - -As I've referred in previous section, perfectly applying my idea need to unroot -page-surrounging swap subsystem to migrate it on VMA, but a huge gap has -defeated me -- active_list and inactive_list. 
In fact, you can find -lru_add_active anywhere ... It's IMPOSSIBLE to me to complete it only by -myself. It's also the difference between my design and Linux, in my OS, page is -the charge of its new owner totally, however, to Linux, page management system -is still tracing it by PG_active flag. - -So I conceive another solution:) That is, set up an independent page-recycle -system r
[PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
For a multiple-address-space, multiple-memory-inode architecture, we can introduce a new core object -- the section -- which has several features
1) A section is the atomic unit containing the pages of a VMA that reside in the section's memory inode.
2) When page migration occurs among different memory inodes, a new section should be set up to trace the pages.
3) A section can be scanned directly by the SwapDaemon of its memory inode.
4) All sections of a VMA are mutually exclusive, not overlapped.
5) A VMA is made up entirely of sections, but its section objects are scattered across the memory inodes.
So in this architecture we can deploy the swap subsystem on an architecture-independent layer by means of sections and scan pages in batches. The question I raised is whether the swap subsystem should be deployed on layer 2 or layer 3, as described in Documentation/vm_pps.txt of my patch. For a multiple-memory-inode architecture, the special memory model should be encapsulated on layer 3 (architecture-dependent), I think. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem
The job listed in Documentation/vm_pps.txt of my patch is too heavy for me alone, so I would appreciate it if the Linux kernel group could arrange a schedule to help me.