Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
On Thu, 23 Aug 2007 17:47:44 +0800 yunfeng zhang wrote:

> Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>
>
> The major changes are
> 1) Using nail arithmetic to maximize SwapDevice performance.
> 2) Add PG_pps bit to mark every pps page.
> 3) Some discussion about NUMA.
> See vm_pps.txt
>
> Index: linux-2.6.22/Documentation/vm_pps.txt
> ===
> --- /dev/null	1970-01-01 00:00:00.0 +
> +++ linux-2.6.22/Documentation/vm_pps.txt	2007-08-23 17:04:12.051837322 +0800
> @@ -0,0 +1,365 @@
> +
> +                    Pure Private Page System (pps)
> +                         [EMAIL PROTECTED]
> +                       December 24-26, 2006
> +                     Revised on Aug 23, 2007
> +
> +// Purpose <([{
> +The file is used to document the idea which was first published at
> +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
> +OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
> +patch of the document is for enhancing the performance of the Linux swap
> +subsystem. You can find the overview of the idea in section <How to Reclaim
> +Pages more Efficiently> and how I patch it into Linux 2.6.21 in section
> +<Pure Private Page System -- pps>.
> +// }])>

Hi,

What (text) format/markup language is the vm_pps.txt file in?

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
On a two-CPU machine, the page distribution theoretically looks like

    ABABAB...

so every readahead by process A will create 4 unused readahead pages, unless
you are sure B will resume soon. Have you ever compared the results among UP,
2-CPU and 4-CPU machines?
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
yunfeng zhang wrote:
> Performance improvement should occur when private pages of multiple
> processes are messed up,

Ummm, yes. Linux used to do this, but doing virtual scans just does not
scale when a system has a really large amount of memory, a large number of
processes and multiple zones. We've seen it fall apart with as little as
8GB of RAM.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
Performance improvement should occur when the private pages of multiple
processes are messed up with each other, such as on SMP. On UP, my previous
mail is driven by a timer, which only shows one fact: if pages are fully
messed up, current readahead degrades remarkably, and the unused readahead
pages place a burden on the memory subsystem. You should re-test your
testcases following this advice: on Linux without my patch, run the normal
testcases, then select one testcase randomly and record the 'pswpin' field of
/proc/vmstat; redo that testcase alone. If the results are close, your
testcases didn't mess up private pages at all as you expected, due to the
Linux scheduler. Thank you!

2007/2/22, Rik van Riel <[EMAIL PROTECTED]>:
> yunfeng zhang wrote:
> > Any comments or suggestions are always welcomed.
>
> Same question as always: what problem are you trying to solve?
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
yunfeng zhang wrote:
> Any comments or suggestions are always welcomed.

Same question as always: what problem are you trying to solve?
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
Any comments or suggestions are always welcomed.
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
The following algorithm is based on the SwapSpace bitmap management which is
discussed in the postscript section of my patch. Two operations are
implemented: one allocates a group of fake contiguous swap entries, the other
re-allocates swap entries in stage 3, e.g. when a series length is too short.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* 2 hardware cache lines. You can also compress it into one hardware cache
 * line. Entry i holds the number of zero (free) bits in byte value i. */
char bits_per_short[256] = {
	8, 7, 7, 6, 7, 6, 6, 5, 7, 6, 6, 5, 6, 5, 5, 4,
	7, 6, 6, 5, 6, 5, 5, 4, 6, 5, 5, 4, 5, 4, 4, 3,
	7, 6, 6, 5, 6, 5, 5, 4, 6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	7, 6, 6, 5, 6, 5, 5, 4, 6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2, 4, 3, 3, 2, 3, 2, 2, 1,
	7, 6, 6, 5, 6, 5, 5, 4, 6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2, 4, 3, 3, 2, 3, 2, 2, 1,
	6, 5, 5, 4, 5, 4, 4, 3, 5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2, 4, 3, 3, 2, 3, 2, 2, 1,
	5, 4, 4, 3, 4, 3, 3, 2, 4, 3, 3, 2, 3, 2, 2, 1,
	4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0
};

unsigned char swap_bitmap[32];

/* Allocate a group of fake contiguous swap entries. */
int alloc(int size)
{
	int i, found = 0, result_offset;
	unsigned char a = 0, b = 0;
	for (i = 0; i < 32; i++) {
		b = bits_per_short[swap_bitmap[i]];
		if (a + b >= size) {
			found = 1;
			break;
		}
		a = b;
	}
	result_offset = i == 0 ? 0 : i - 1;
	result_offset = found ? result_offset : -1;
	return result_offset;
}

/* Re-allocate in stage 3 if necessary. */
int re_alloc(int position)
{
	int offset = position / 8;
	int a = offset == 0 ? 0 : offset - 1;
	int b = offset == 31 ? 31 : offset + 1;
	int i, empty_bits = 0;
	for (i = a; i <= b; i++)
		empty_bits += bits_per_short[swap_bitmap[i]];
	return empty_bits;
}

int main(int argc, char **argv)
{
	int i;
	for (i = 0; i < 32; i++)
		swap_bitmap[i] = (unsigned char) (rand() % 0xff);
	i = 9;
	int temp = alloc(i);
	temp = re_alloc(i);
	return 0;
}
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
You can apply my previous patch on 2.6.20 by changing
-#define VM_PURE_PRIVATE	0x0400	/* Is the vma is only belonging to a mm,
to
+#define VM_PURE_PRIVATE	0x0800	/* Is the vma is only belonging to a mm,

The new revision is based on 2.6.20 with my previous patch; the major
changelogs are
1) pte_unmap pairs on shrink_pvma_scan_ptes and pps_swapoff_scan_ptes.
2) Now kppsd can be woken up by kswapd.
3) A new global variable accelerate_kppsd is appended to accelerate the
   reclamation process when a memory node is low.

Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- linux-2.6.19.orig/Documentation/vm_pps.txt	2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-02-12 15:30:16.490797672 +0800
@@ -143,23 +143,32 @@
 2) mm/memory.c   do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
    do_swap_page (page-fault)
 3) mm/memory.c   get_user_pages (sometimes core need share PrivatePage with us)
+4) mm/vmscan.c   balance_pgdat (kswapd/x can do stage 5 of its node pages,
+   while kppsd can do stage 1-4)
+5) mm/vmscan.c   kppsd (new core daemon -- kppsd, see below)
 
 There isn't new lock order defined in pps, that is, it's compliable to Linux
-lock order.
+lock order. Locks in shrink_private_vma copied from shrink_list of 2.6.16.29
+(my initial version).
 // }])>
 
 // Others about pps <([{
 A new kernel thread -- kppsd is introduced in mm/vmscan.c, its task is to
-execute the stages of pps periodically, note an appropriate timeout ticks is
-necessary so we can give application a chance to re-map back its PrivatePage
-from UnmappedPTE to PTE, that is, show their conglomeration affinity.
-
-kppsd can be controlled by new fields -- scan_control::may_reclaim/reclaim_node
-may_reclaim = 1 means starting reclamation (stage 5). reclaim_node = (node
-number) is used when a memory node is low. Caller should set them to wakeup_sc,
-then wake up kppsd (vmscan.c:balance_pgdat). Note, if kppsd is started due to
-timeout, it doesn't do stage 5 at all (vmscan.c:kppsd). Other alive legacy
-fields are gfp_mask, may_writepage and may_swap.
+execute the stage 1 - 4 of pps periodically, note an appropriate timeout ticks
+(current 2 seconds) is necessary so we can give application a chance to re-map
+back its PrivatePage from UnmappedPTE to PTE, that is, show their
+conglomeration affinity.
+
+shrink_private_vma can be controlled by new fields -- may_reclaim, reclaim_node
+and is_kppsd of scan_control. may_reclaim = 1 means starting reclamation
+(stage 5). reclaim_node = (node number, -1 means all memory inode) is used when
+a memory node is low. Caller (kswapd/x), typically, set reclaim_node to start
+shrink_private_vma (vmscan.c:balance_pgdat). Note, only to kppsd is_kppsd = 1.
+Other alive legacy fields to pps are gfp_mask, may_writepage and may_swap.
+
+When a memory inode is low, kswapd/x can wake up kppsd by increasing global
+variable accelerate_kppsd (balance_pgdat), which accelerate stage 1 - 4, and
+call shrink_private_vma to do stage 5.
 
 PPS statistic data is appended to /proc/meminfo entry, its prototype is in
 include/linux/mm.h.

Index: linux-2.6.19/mm/swapfile.c
===
--- linux-2.6.19.orig/mm/swapfile.c	2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/mm/swapfile.c	2007-02-12 12:45:21.0 +0800
@@ -569,6 +569,7 @@
 			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(pte);
 	return 0;
 }

Index: linux-2.6.19/mm/vmscan.c
===
--- linux-2.6.19.orig/mm/vmscan.c	2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/mm/vmscan.c	2007-02-12 15:48:59.217292888 +0800
@@ -70,6 +70,7 @@
 	/* pps control command. See Documentation/vm_pps.txt. */
 	int may_reclaim;
 	int reclaim_node;
+	int is_kppsd;
 };
 
 /*
@@ -1101,9 +1102,9 @@
 	return ret;
 }
 
-// pps fields.
+// pps fields, see Documentation/vm_pps.txt.
 static wait_queue_head_t kppsd_wait;
-static struct scan_control wakeup_sc;
+static int accelerate_kppsd = 0;
 struct pps_info pps_info = {
 	.total = ATOMIC_INIT(0),
 	.pte_count = ATOMIC_INIT(0), // stage 1 and 2.
@@ -1118,24 +1119,22 @@
 	struct page* pages[MAX_SERIES_LENGTH];
 	int series_length;
 	int series_stage;
-} series;
+};
 
-static int get_series_stage(pte_t* pte, int index)
+static int get_series_stage(struct series_t* series, pte_t* pte, int index)
 {
-	series.orig_ptes[index] = *pte;
-	series.ptes[index] = pte;
-	if (pte_present(series.orig_ptes[index])) {
-		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
-		series.pages[index] = page;
+	series->orig_ptes[index] = *pte;
+
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
> You have an interesting idea of "simplifies", given
>   16 files changed, 997 insertions(+), 25 deletions(-)
> (omitting your Documentation), and over 7k more code. You'll have to be
> much more persuasive (with good performance results) to get us to
> welcome your added layer of complexity.

If the whole idea is deployed on Linux, the following core objects should be
erased
1) anon_vma.
2) pgdata::active/inactive list and related methods -- mark_page_accessed
   etc.
3) PrivatePage::count and mapcount. If core needs to share the page, add a
   PG_kmap flag. In fact, page::lru_list can safely be erased too.
4) All cases would then work top-down, which especially simplifies debugging.

> Please make an effort to support at least i386 3level pagetables: you
> don't actually need >4GB of memory to test CONFIG_HIGHMEM64G. HIGHMEM
> testing shows you're missing a couple of pte_unmap()s, in
> pps_swapoff_scan_ptes() and in shrink_pvma_scan_ptes().

Yes, it's my fault.

> It would be nice if you could support at least x86_64 too (you have
> pte_low code peculiar to i386 in vmscan.c, which is preventing that),
> but that's harder if you don't have the hardware.

Um! The data cmpxchged should include the access bit. And I have only an x86
PC with memory < 1G. The 3level pagetable code was copied from other Linux
functions.
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
The current test is based on the fact below, from my previous mail:

  Current Linux page allocation fairly provides pages for every process;
  since the swap daemon is only started when memory is low, by the time it
  starts to scan active_list the private pages of processes are messed up
  with each other. vmscan.c:shrink_list() is the only approach to attach a
  disk swap page to a page on active_list, so as a result all private pages
  lose their affinity on the swap partition. I will give a test later...

Three testcases are imitated here
1) matrix: Some software does a lot of matrix arithmetic in its PrivateVMA;
   in fact, this type favors pps much more than Linux.
2) c: C malloc uses an algorithm just like slab, so when an application
   resumes from the swap partition, suppose it touches three variables whose
   sizes are different; as a result, it touches three pages which are close
   to each other but not contiguous. I also imitate the case that the
   application starts full-speed running later (touches more pages).
3) destruction: Typically, if an application resumes because the user clicks
   the close button, it visits all its private data to run object
   destruction.

Test stepping
1) Run ./entry and say y to it; root rights may be needed.
2) Wait a moment until it echoes 'primary wait'.
3) swapoff -a && swapon -a.
4) ./hog until count = 10.
5) 'cat primary entry secondary > /dev/null'.
6) 'cat /proc/vmstat' several times and record the 'pswpin' field when it's
   stable.
7) Type `1', `2' or `3' for the 3 testcases; answer `2' to start the
   fullspeed testcase.
8) Record the new 'pswpin' field.
9) Which is better? See the 'pswpin' increment. pswpin is increased in
   mm/page_io.c:swap_readpage.

Test stepping purposes
1) Step 1: 'entry' wakes up 'primary' and 'secondary' simultaneously; every
   time 'primary' allocates a page, 'secondary' inserts some pages into
   active_list close to it.
2) Step 3: we should re-allocate swap pages.
3) Step 4: flush 'entry primary secondary' to the swap partition.
4) Step 5: make the file content of 'entry primary secondary' present in
   memory.

Testcases are done in a vmware virtual machine 5.5, 32M memory. If you argue
with my circumstances, do your testcases following the steps advised
1) Run multiple memory-consumers together; make them pause at a point (so
   all private pages in the active list are messed up).
2) Flush them to the swap partition.
3) Wake up one of them, let it run full-speed for a while, and record pswpin
   of /proc/vmstat.
4) Invalidate all readaheaded pages.
5) Wake up another; repeat the test.
6) It's also good if you can record the hard-disk LED twinkling :)

Maybe your test resumes all memory-consumers together, so Linux readaheads
some pages which are close to the page-fault page but belong to other
processes, I think. By the way, what's the linux-mm mail address? It isn't in
Documentation/SubmitPatches.

In fact, you will find the Linux column makes the hard-disk LED twinkle
every 5 seconds.

pswpin results (before -> after, increment in parentheses):
-----------------------------------------------------------
              Linux               pps
matrix        5241 -> 5322        1597 -> 1620
              (+81)               (+23)
c             8028 -> 8095        1937 -> 1954
              (+67)               (+17)
  fullspeed        -> 8313             -> 1964
              (+218)              (+10)
destruction   9461 -> 9825        4445 -> 4484
              (+364)              (+39)

Comment out the secondary.c:memset clause, so 'secondary' won't interrupt
page allocation in 'primary':
-----------------------------------------------------------
              Linux               pps
matrix         207 ->  256          38 ->   59
              (+49)               (+21)
c             1273 -> 1341         347 ->  383
              (+68)               (+36)
  fullspeed        -> 1362             ->  386
              (+21)               (+3)
destruction   2435 -> 2513        1178 -> 1246
              (+78)               (+68)

entry.c
-------
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

pid_t pids[4];
int sem_set;
siginfo_t si;

int main(int argc, char **argv)
{
	int i, data;
	unsigned short init_data[4] = { 1, 1, 1, 1 };
	if ((sem_set = semget(123321, 4, IPC_CREAT)) == -1)
		goto failed;
	if (semctl(sem_set, 0, SETALL, init_data) == -1)
		goto failed;
	pid_t pid = vfork();
	if (pid == -1) {
		goto failed;
	} else if (pid == 0) {
		if (execlp("./primary", NULL) == -1)
			goto failed;
	} else {
		pids[0] = pid;
	}
	pid = vfork();
	if (pid == -1) {
		goto failed;
	} else if (pid == 0) {
		if (execlp("./secondary", NULL) == -1)
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
On Tue, 23 Jan 2007, yunfeng zhang wrote:
> re-code my patch, tab = 8. Sorry!

Please stop resending this patch until you can attend to the advice you've
been given: Pavel made several very useful remarks on Monday:

> No, this is not the way to submit major rewrite of swap subsystem.
>
> You need to (at minimum, making fundamental changes _is_ hard):
>
> 1) Fix your mailer not to wordwrap.
>
> 2) Get some testing. Identify workloads it improves.
>
> 3) Get some _external_ testing. You are retransmitting wordwrapped
> patch. That means noone other then you is actually using it.
>
> I thought I told you to read the CodingStyle in some previous mail?

Another piece of advice would be to stop mailing it to linux-kernel, and
direct it to linux-mm instead, where specialists might be more attentive. But
don't bother if you cannot follow Pavel's advice.

The only improvement I notice is that you are now sending a patch against a
recently developed kernel, 2.6.20-rc5, rather than the 2.6.16.29 you were
offering earlier in the month: good, thank you.

I haven't done anything at all with Tuesday's recode tab=8 version; I get too
sick of "malformed patch" if ever I try to apply your mail (it's probably
related to the "format=flowed" in your mailer), and don't usually have time to
spend fixing them up. But I did make the effort to reform it once before, and
again with Monday's version, the one that says:

> 4) It simplifies Linux memory model dramatically.

You have an interesting idea of "simplifies", given 16 files changed, 997
insertions(+), 25 deletions(-) (omitting your Documentation), and over 7k more
code. You'll have to be much more persuasive (with good performance results)
to get us to welcome your added layer of complexity.

Please make an effort to support at least i386 3-level pagetables: you don't
actually need >4GB of memory to test CONFIG_HIGHMEM64G. HIGHMEM testing shows
you're missing a couple of pte_unmap()s, in pps_swapoff_scan_ptes() and in
shrink_pvma_scan_ptes().
It would be nice if you could support at least x86_64 too (you have pte_low
code peculiar to i386 in vmscan.c, which is preventing that), but that's
harder if you don't have the hardware.

But I have to admit, I've not been trying your patch because I support it and
want to see it in: the reverse, I've been trying it because I want quickly to
check whether it's something we need to pay attention to and spend time on,
hoping to rule it out and turn to other matters instead.

And so far I've been (from that point of view) very pleased: the first tests I
ran went about 50% slower; but since they involve tmpfs (and I suspect you've
not considered the tmpfs use of swap at all) that seemed a bit unfair, so I
switched to running the simplest memhog kind of tests (you know, in a 512MB
machine with plenty of swap, try to malloc and touch 600MB in rotation: I
imagine should suit your design optimally): quickly killed Out Of Memory.

Tried running multiple hogs for smaller amounts (maybe one holds a lock you're
needing to free memory), but the same OOMs. Ended up just doing builds on disk
with limited memory and 100% swappiness: consistently around 50% slower
(elapsed time, also user time, also system time).

I've not reviewed the code at all; that would need a lot more time. But I get
the strong impression that you're imposing on Linux 2.6 ideas that seem
obvious to you, without finding out whether they really fit in and get good
results. I expect you'd be able to contribute much more if you spent a while
studying the behaviour of Linux swapping, and made incremental tweaks to
improve that (e.g. changing its swap allocation strategy), rather than coming
in with some preconceived plan.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
re-code my patch, tab = 8. Sorry!

Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- /dev/null	1970-01-01 00:00:00.0 +
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-01-23 11:32:02.0 +0800
@@ -0,0 +1,236 @@
+	Pure Private Page System (pps)
+	[EMAIL PROTECTED]
+	December 24-26, 2006
+
+// Purpose <([{
+The file is used to document the idea which was published first at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of the document is for enhancing the performance of the Linux swap
+subsystem. You can find the overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patch it into Linux 2.6.19 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability; when you
+look down from a manager's view, you relieve yourself from disordered code and
+find some problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+Maybe you would expect Page/PTE to be placed on the 3rd layer, but here it is
+placed on the 2nd layer since it is the basic unit of a VMA.
+
+Since the 2nd layer assembles most of the page-access statistics, it is
+natural that the swap subsystem should be deployed and implemented on the 2nd
+layer.
+
+Undoubtedly, there are some virtues about it
+1) SwapDaemon can collect the statistics of processes accessing pages and use
+   them to unmap ptes; SMP especially benefits from this, since we can use
+   flush_tlb_range to unmap ptes in batches rather than a frequent TLB IPI
+   interrupt per page as in the current Linux legacy swap subsystem.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readaheads the pages relative to the SwapSpace position of the current
+   page-fault page.
+3) It conforms to the POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the
+   new swap strategy is from top to bottom. In fact, the Linux legacy swap
+   subsystem is maybe the only one that goes from bottom to top.
+
+Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps <([{
+As I've said in the previous section, perfectly applying my idea needs to
+unroot the page-surrounding swap subsystem and migrate it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_add_active code anywhere ... It's IMPOSSIBLE for me to complete it
+only by myself. It's also the difference between my design and Linux: in my
+OS, a page is totally in the charge of its new owner; to Linux, however, the
+page management system is still tracing it by the PG_active flag.
+
+So I conceive another solution:) That is, set up an independent page-recycle
+system rooted on the Linux legacy page system -- pps, intercept all private
+pages belonging to PrivateVMAs into pps, then use my pps to recycle them. By
+the way, the whole job should consist of two parts; here is the first --
+PrivateVMA-oriented, the other is SharedVMA-oriented (should be called SPS),
+scheduled for the future. Of course, if both are done, they will empty the
+Linux legacy page system.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages; the whole process is divided into six stages -- <Stage Definition>. PPS
+uses init_mm::mm_list to enumerate all swappable UserSpace
+(shrink_private_vma) of mm/vmscan.c.
+
+Other sections show the remaining aspects of pps
+1) <Data Definition> is the basic data definition.
+2) <Concurrent racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> -- how private pages enter in/go off pps.
+4) <VMA Lifecycle of pps> -- which VMAs belong to pps.
+5) <Others about pps> -- the new daemon thread kppsd, pps statistic data etc.
+
+I'm also glad to highlight a new idea of mine -- dftlb, which is described in
+section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced by me to enhance TLB flushing efficiency. In
+brief, when we want to unmap a page from the page table of a process, why
+send a TLB IPI to other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert flushing tasks into the timer interrupt routine to
+implement a free-of-charge TLB flush.
+
+The trick is implemented in
+1) TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   flushing tasks.
+3) all data are
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
> Patched against 2.6.19 leads to:
>
> mm/vmscan.c: In function `shrink_pvma_scan_ptes':
> mm/vmscan.c:1340: too many arguments to function `page_remove_rmap'
>
> So changed page_remove_rmap(series.pages[i], vma); to
> page_remove_rmap(series.pages[i]);

I worked on 2.6.19, but when I updated to 2.6.20-rc5 the function was changed.

> But your patch doesn't offer any swap-performance improvement for both
> swsusp or tmpfs. Swap-in is still half the speed of swap-out.

Current Linux page allocation fairly provides pages for every process. Since
the swap daemon is only started when memory is low, by the time it starts to
scan active_list the private pages of processes are messed up with each other;
vmscan.c:shrink_list() is the only place that attaches a disk swap page to a
page on active_list, and as a result all private pages lose their affinity on
the swap partition. I will give a test later...
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
yunfeng zhang wrote: > My patch is based on my new idea to Linux swap subsystem, you can find > more in Documentation/vm_pps.txt which isn't only patch illustration but > also file changelog. In brief, SwapDaemon should scan and reclaim pages on > UserSpace::vmalist other than current zone::active/inactive. The change > will conspicuously enhance swap subsystem performance by > > 1) SwapDaemon can collect the statistic of process acessing pages and by > it unmaps ptes, SMP specially benefits from it for we can use > flush_tlb_range to unmap ptes batchly rather than frequently TLB IPI > interrupt per a page in current Linux legacy swap subsystem. > 2) Page-fault can issue better readahead requests since history data shows > all related pages have conglomerating affinity. In contrast, Linux > page-fault readaheads the pages relative to the SwapSpace position of > current page-fault page. > 3) It's conformable to POSIX madvise API family. > 4) It simplifies Linux memory model dramatically. Keep it in mind that new > swap strategy is from up to down. In fact, Linux legacy swap subsystem is > maybe the only one from down to up. > > Other problems asked about my pps are > 1) There isn't new lock order in my pps, it's compliant to Linux lock > order defined in mm/rmap.c. > 2) When a memory inode is low, you can set scan_control::reclaim_node to > let my kppsd to reclaim the memory inode page. Patched against 2.6.19 leads to: mm/vmscan.c: In function `shrink_pvma_scan_ptes': mm/vmscan.c:1340: too many arguments to function `page_remove_rmap' So changed page_remove_rmap(series.pages[i], vma); to page_remove_rmap(series.pages[i]); But your patch doesn't offer any swap-performance improvement for both swsusp or tmpfs. Swap-in is still half speed of Swap-out. Thanks! 
--
Al
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
Hi!

> My patch is based on my new idea to Linux swap subsystem, you can find more
> in Documentation/vm_pps.txt which isn't only patch illustration but also
> file changelog. In brief, SwapDaemon should scan and reclaim pages on
> UserSpace::vmalist other than current zone::active/inactive. The change
> will conspicuously enhance swap subsystem performance by

No, this is not the way to submit a major rewrite of the swap subsystem.

You need to (at minimum, making fundamental changes _is_ hard):

1) Fix your mailer not to wordwrap.

2) Get some testing. Identify workloads it improves.

3) Get some _external_ testing. You are retransmitting a wordwrapped patch.
That means no one other than you is actually using it.

4) Don't cc me; I'm not an mm expert, and I tend to read l-k, anyway.

Pavel

> +	Pure Private Page System (pps)
> +	Copyright by Yunfeng Zhang on GFDL 1.2

I am not sure GFDL is GPL compatible.

> +// Purpose <([{

You have certainly "interesting" heading style. What is this markup?

> +
> +// The prototype of the function is fit with the "func" of "int
> +// smp_call_function (void (*func) (void *info), void *info, int retry, int
> +// wait);" of include/linux/smp.h of 2.6.16.29. Call it with NULL.
> +void timer_flush_tlb_tasks(void* data /* = NULL */);

I thought I told you to read the CodingStyle in some previous mail?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
My patch is based on my new idea to Linux swap subsystem, you can find more in Documentation/vm_pps.txt which isn't only patch illustration but also file changelog. In brief, SwapDaemon should scan and reclaim pages on UserSpace::vmalist other than current zone::active/inactive. The change will conspicuously enhance swap subsystem performance by 1) SwapDaemon can collect the statistic of process acessing pages and by it unmaps ptes, SMP specially benefits from it for we can use flush_tlb_range to unmap ptes batchly rather than frequently TLB IPI interrupt per a page in current Linux legacy swap subsystem. 2) Page-fault can issue better readahead requests since history data shows all related pages have conglomerating affinity. In contrast, Linux page-fault readaheads the pages relative to the SwapSpace position of current page-fault page. 3) It's conformable to POSIX madvise API family. 4) It simplifies Linux memory model dramatically. Keep it in mind that new swap strategy is from up to down. In fact, Linux legacy swap subsystem is maybe the only one from down to up. Other problems asked about my pps are 1) There isn't new lock order in my pps, it's compliant to Linux lock order defined in mm/rmap.c. 2) When a memory inode is low, you can set scan_control::reclaim_node to let my kppsd to reclaim the memory inode page. Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]> Index: linux-2.6.19/Documentation/vm_pps.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.19/Documentation/vm_pps.txt 2007-01-22 13:52:04.973820224 +0800 @@ -0,0 +1,237 @@ + Pure Private Page System (pps) + Copyright by Yunfeng Zhang on GFDL 1.2 + [EMAIL PROTECTED] + December 24-26, 2006 + +// Purpose <([{ +The file is used to document the idea which is published firstly at +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my +OS -- main page http://blog.chinaunix.net/u/21764/index.php. 
In brief, the +patch of the document is for enchancing the performance of Linux swap +subsystem. You can find the overview of the idea in section and how I patch it into Linux 2.6.19 in section +. +// }])> + +// How to Reclaim Pages more Efficiently <([{ +Good idea originates from overall design and management ability, when you look +down from a manager view, you will relief yourself from disordered code and +find some problem immediately. + +OK! to modern OS, its memory subsystem can be divided into three layers +1) Space layer (InodeSpace, UserSpace and CoreSpace). +2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer). +3) Page table, zone and memory inode layer (architecture-dependent). +Maybe it makes you sense that Page/PTE should be placed on the 3rd layer, but +here, it's placed on the 2nd layer since it's the basic unit of VMA. + +Since the 2nd layer assembles the much statistic of page-acess information, so +it's nature that swap subsystem should be deployed and implemented on the 2nd +layer. + +Undoubtedly, there are some virtues about it +1) SwapDaemon can collect the statistic of process acessing pages and by it + unmaps ptes, SMP specially benefits from it for we can use flush_tlb_range + to unmap ptes batchly rather than frequently TLB IPI interrupt per a page in + current Linux legacy swap subsystem. +2) Page-fault can issue better readahead requests since history data shows all + related pages have conglomerating affinity. In contrast, Linux page-fault + readaheads the pages relative to the SwapSpace position of current + page-fault page. +3) It's conformable to POSIX madvise API family. +4) It simplifies Linux memory model dramatically. Keep it in mind that new swap + strategy is from up to down. In fact, Linux legacy swap subsystem is maybe + the only one from down to up. + +Unfortunately, Linux 2.6.19 swap subsystem is based on the 3rd layer -- a +system on memory node::active_list/inactive_list. 
+
+I've finished a patch; see section <Pure Private Page System -- pps>. Note,
+it ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps <([{
+As I referred to in the previous section, perfectly applying my idea needs
+uprooting the page-surrounding swap subsystem to migrate it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_add_active code anywhere ... It's IMPOSSIBLE for me to complete
+that by myself. It's also the difference between my design and Linux: in my
+OS, a page is totally in the charge of its new owner; in Linux, however, the
+page management system still traces it by the PG_active flag.
+
+So I conceive another solution :) That is, set up an independent
+page-recycle system rooted on the Linux legacy page system -- pps, intercept
+all private pages belonging to PrivateVMA into pps, then use pps to recycle
+them. By the way, the whole