Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-08-23 Thread Randy Dunlap
On Thu, 23 Aug 2007 17:47:44 +0800 yunfeng zhang wrote:

> Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>
> 
> The major changes are
> 1) Use nail arithmetic to maximize SwapDevice performance.
> 2) Add a PG_pps bit to mark every pps page.
> 3) Some discussion about NUMA.
> See vm_pps.txt
> 
> Index: linux-2.6.22/Documentation/vm_pps.txt
> ===
> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ linux-2.6.22/Documentation/vm_pps.txt 2007-08-23 17:04:12.051837322 
> +0800
> @@ -0,0 +1,365 @@
> +
> + Pure Private Page System (pps)
> +  [EMAIL PROTECTED]
> +  December 24-26, 2006
> +Revised on Aug 23, 2007
> +
> +// Purpose <([{
> +The file is used to document the idea which was first published at
> +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
> +OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
> +patch of the document is for enhancing the performance of the Linux swap
> +subsystem. You can find the overview of the idea in section <How to Reclaim
> +Pages more Efficiently> and how I patch it into Linux 2.6.21 in section
> +<Pure Private Page System -- pps>.
> +// }])>

Hi,
What (text) format/markup language is the vm_pps.txt file in?

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-24 Thread yunfeng zhang

On a two-CPU architecture, the page distribution theoretically looks like

ABABAB

so every readahead done by process A will create 4 unused readahead pages unless
you are sure B will resume soon.

Have you ever compared the results among UP, 2-CPU and 4-CPU systems?



Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-22 Thread Rik van Riel

yunfeng zhang wrote:
Performance improvement should occur when private pages of multiple 
processes are messed up,


Ummm, yes.  Linux used to do this, but doing virtual scans
just does not scale when a system has a really large amount
of memory, a large number of processes and multiple zones.

We've seen it fall apart with as little as 8GB of RAM.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-22 Thread yunfeng zhang

A performance improvement should occur when the private pages of multiple
processes are messed up with each other, as on SMP. On UP, my previous mail's
test was driven by a timer, which only shows one fact: if pages are fully
messed up, the current readahead degrades remarkably, and unused readahead
pages become a burden to the memory subsystem.

You should re-test your testcases following this advice on Linux without my
patch: run your normal testcases, then select one testcase at random and record
the 'pswpin' field of /proc/vmstat, redo that testcase alone, and if the
results are close, your testcases don't mess up private pages at all, as you
expected, due to the Linux scheduler. Thank you!


2007/2/22, Rik van Riel <[EMAIL PROTECTED]>:

yunfeng zhang wrote:
> Any comments or suggestions are always welcomed.

Same question as always: what problem are you trying to solve?




Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-21 Thread Rik van Riel

yunfeng zhang wrote:

Any comments or suggestions are always welcomed.


Same question as always: what problem are you trying to solve?



Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-21 Thread yunfeng zhang

Any comments or suggestions are always welcomed.




Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-20 Thread yunfeng zhang

The following algorithm is based on the SwapSpace bitmap management discussed
in the postscript section of my patch. It implements two purposes: one is
allocating a group of fake continual swap entries; the other is re-allocating
swap entries in stage 3 when, for example, a series length is too short.


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
// bits_per_short[v] is the number of zero (free) bits in the byte value v.
// Two hardware cache lines; you can also compress it into one hardware cache line.
char bits_per_short[256] = {
8, 7, 7, 6, 7, 6, 6, 5,
7, 6, 6, 5, 6, 5, 5, 4,
7, 6, 6, 5, 6, 5, 5, 4,
6, 5, 5, 4, 5, 4, 4, 3,
7, 6, 6, 5, 6, 5, 5, 4,
6, 5, 5, 4, 5, 4, 4, 3,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
7, 6, 6, 5, 6, 5, 5, 4,
6, 5, 5, 4, 5, 4, 4, 3,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
5, 4, 4, 3, 4, 3, 3, 2,
4, 3, 3, 2, 3, 2, 2, 1,
7, 6, 6, 5, 6, 5, 5, 4,
6, 5, 5, 4, 5, 4, 4, 3,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
5, 4, 4, 3, 4, 3, 3, 2,
4, 3, 3, 2, 3, 2, 2, 1,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
5, 4, 4, 3, 4, 3, 3, 2,
4, 3, 3, 2, 3, 2, 2, 1,
5, 4, 4, 3, 4, 3, 3, 2,
4, 3, 3, 2, 3, 2, 2, 1,
4, 3, 3, 2, 3, 2, 2, 1,
3, 2, 2, 1, 2, 1, 1, 0
};
unsigned char swap_bitmap[32];
// Allocate a group of fake continual swap entries: find the first pair of
// adjacent bytes whose free bits together can hold the request.
int alloc(int size)
{
	int i, found = 0, result_offset;
	unsigned char a = 0, b = 0;

	for (i = 0; i < 32; i++) {
		b = bits_per_short[swap_bitmap[i]];
		if (a + b >= size) {
			found = 1;
			break;
		}
		a = b;
	}
	result_offset = i == 0 ? 0 : i - 1;
	result_offset = found ? result_offset : -1;
	return result_offset;
}
// Re-allocate in stage 3 if necessary: count the free bits in the byte
// containing 'position' and its two neighbours.
int re_alloc(int position)
{
	int offset = position / 8;
	int a = offset == 0 ? 0 : offset - 1;
	int b = offset == 31 ? 31 : offset + 1;
	int i, empty_bits = 0;

	for (i = a; i <= b; i++)
		empty_bits += bits_per_short[swap_bitmap[i]];
	return empty_bits;
}
int main(int argc, char **argv)
{
	int i;

	for (i = 0; i < 32; i++)
		swap_bitmap[i] = (unsigned char) (rand() % 0xff);
	i = 9;
	int temp = alloc(i);
	temp = re_alloc(i);
	(void) temp;
	return 0;
}




Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-12 Thread yunfeng zhang

You can apply my previous patch on 2.6.20 by changing

-#define VM_PURE_PRIVATE	0x0400	/* Is the vma only belonging to a mm */
to
+#define VM_PURE_PRIVATE	0x0800	/* Is the vma only belonging to a mm */

New revision is based on 2.6.20 with my previous patch; the major changes are
1) pte_unmap pairs in shrink_pvma_scan_ptes and pps_swapoff_scan_ptes.
2) Now kppsd can be woken up by kswapd.
3) A new global variable accelerate_kppsd is appended to accelerate the
   reclamation process when a memory node is low.


Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- linux-2.6.19.orig/Documentation/vm_pps.txt  2007-02-12
12:45:07.0 +0800
+++ linux-2.6.19/Documentation/vm_pps.txt   2007-02-12 15:30:16.490797672 
+0800
@@ -143,23 +143,32 @@
2) mm/memory.c   do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
   do_swap_page (page-fault)
3) mm/memory.c   get_user_pages (sometimes core need share PrivatePage with us)
+4) mm/vmscan.c   balance_pgdat  (kswapd/x can do stage 5 of its node pages,
+   while kppsd can do stage 1-4)
+5) mm/vmscan.c   kppsd  (new core daemon -- kppsd, see below)

There isn't new lock order defined in pps, that is, it's compliable to Linux
-lock order.
+lock order. Locks in shrink_private_vma copied from shrink_list of 2.6.16.29
+(my initial version).
// }])>

// Others about pps <([{
A new kernel thread -- kppsd is introduced in mm/vmscan.c, its task is to
-execute the stages of pps periodically, note an appropriate timeout ticks is
-necessary so we can give application a chance to re-map back its PrivatePage
-from UnmappedPTE to PTE, that is, show their conglomeration affinity.
-
-kppsd can be controlled by new fields -- scan_control::may_reclaim/reclaim_node
-may_reclaim = 1 means starting reclamation (stage 5).  reclaim_node = (node
-number) is used when a memory node is low. Caller should set them to wakeup_sc,
-then wake up kppsd (vmscan.c:balance_pgdat). Note, if kppsd is started due to
-timeout, it doesn't do stage 5 at all (vmscan.c:kppsd). Other alive legacy
-fields are gfp_mask, may_writepage and may_swap.
+execute the stage 1 - 4 of pps periodically, note an appropriate timeout ticks
+(current 2 seconds) is necessary so we can give application a chance to re-map
+back its PrivatePage from UnmappedPTE to PTE, that is, show their
+conglomeration affinity.
+
+shrink_private_vma can be controlled by new fields -- may_reclaim, reclaim_node
+and is_kppsd of scan_control.  may_reclaim = 1 means starting reclamation
+(stage 5). reclaim_node = (node number, -1 means all memory inode) is used when
+a memory node is low. Caller (kswapd/x), typically, set reclaim_node to start
+shrink_private_vma (vmscan.c:balance_pgdat). Note, only to kppsd is_kppsd = 1.
+Other alive legacy fields to pps are gfp_mask, may_writepage and may_swap.
+
+When a memory inode is low, kswapd/x can wake up kppsd by increasing global
+variable accelerate_kppsd (balance_pgdat), which accelerate stage 1 - 4, and
+call shrink_private_vma to do stage 5.

PPS statistic data is appended to /proc/meminfo entry, its prototype is in
include/linux/mm.h.
Index: linux-2.6.19/mm/swapfile.c
===
--- linux-2.6.19.orig/mm/swapfile.c 2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/mm/swapfile.c  2007-02-12 12:45:21.0 +0800
@@ -569,6 +569,7 @@
}
}
} while (pte++, addr += PAGE_SIZE, addr != end);
+   pte_unmap(pte);
return 0;
}

Index: linux-2.6.19/mm/vmscan.c
===
--- linux-2.6.19.orig/mm/vmscan.c   2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/mm/vmscan.c2007-02-12 15:48:59.217292888 +0800
@@ -70,6 +70,7 @@
/* pps control command. See Documentation/vm_pps.txt. */
int may_reclaim;
int reclaim_node;
+   int is_kppsd;
};

/*
@@ -1101,9 +1102,9 @@
return ret;
}

-// pps fields.
+// pps fields, see Documentation/vm_pps.txt.
static wait_queue_head_t kppsd_wait;
-static struct scan_control wakeup_sc;
+static int accelerate_kppsd = 0;
struct pps_info pps_info = {
.total = ATOMIC_INIT(0),
.pte_count = ATOMIC_INIT(0), // stage 1 and 2.
@@ -1118,24 +1119,22 @@
struct page* pages[MAX_SERIES_LENGTH];
int series_length;
int series_stage;
-} series;
+};

-static int get_series_stage(pte_t* pte, int index)
+static int get_series_stage(struct series_t* series, pte_t* pte, int index)
{
-   series.orig_ptes[index] = *pte;
-   series.ptes[index] = pte;
-   if (pte_present(series.orig_ptes[index])) {
-   struct page* page = 
pfn_to_page(pte_pfn(series.orig_ptes[index]));
-   series.pages[index] = page;
+   series->orig_ptes[index] = *pte;
+   


Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-28 Thread yunfeng zhang

You have an interesting idea of "simplifies", given
 16 files changed, 997 insertions(+), 25 deletions(-)
(omitting your Documentation), and over 7k more code.
You'll have to be much more persuasive (with good performance
results) to get us to welcome your added layer of complexity.


If the whole idea is deployed on Linux, the following core objects could be erased
1) anon_vma.
2) pgdata::active/inactive list and related methods -- mark_page_accessed etc.
3) PrivatePage::count and mapcount. If the core needs to share the page, add a
   PG_kmap flag. In fact, page::lru_list can safely be erased too.
4) All cases work from top to bottom, which especially simplifies debugging.


Please make an effort to support at least i386 3level pagetables:
you don't actually need >4GB of memory to test CONFIG_HIGHMEM64G.
HIGHMEM testing shows you're missing a couple of pte_unmap()s,
in pps_swapoff_scan_ptes() and in shrink_pvma_scan_ptes().


Yes, it's my fault.


It would be nice if you could support at least x86_64 too
(you have pte_low code peculiar to i386 in vmscan.c, which is
preventing that), but that's harder if you don't have the hardware.


Um! The data cmpxchg'ed should include the access bit. And I have only an x86
PC with less than 1G of memory; the 3-level pagetable code was copied from
other Linux functions.




Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-25 Thread yunfeng zhang

Current test based on the fact below in my previous mail


Current Linux page allocation fairly provides pages for every process. Since
the swap daemon is only started when memory is low, by the time it starts to
scan active_list the private pages of processes are messed up with each other,
and vmscan.c:shrink_list() is the only place that attaches a disk swap page to
a page on active_list; as a result, all private pages lose their affinity on
the swap partition. I will give a test later...



Three testcases are simulated here
1) matrix: Some software does a lot of matrix arithmetic in its PrivateVMA;
  in fact, this type suits pps much better than Linux.
2) c: C malloc uses an algorithm just like slab, so when an application resumes
  from the swap partition, suppose it touches three variables whose sizes
  differ; as a result, it touches three pages which are close to each other
  but aren't contiguous. I also simulate the case where the application
  starts full-speed running later (touches more pages).
3) destruction: Typically, if an application resumes because the user clicks
  the close button, it visits all of its private data to execute object
  destruction.

Test stepping
1) Run ./entry and say y to it; it may need root rights.
2) Wait a moment until it echoes 'primarywait'.
3) swapoff -a && swapon -a.
4) ./hog until count = 10.
5) 'cat primary entry secondary > /dev/null'
6) 'cat /proc/vmstat' several times and record the 'pswpin' field when it's
   stable.
7) Type `1', `2' or `3' to select one of the 3 testcases; answer `2' to start
   the fullspeed testcase.
8) Record the new 'pswpin' field.
9) Which is better? See the 'pswpin' increment.
pswpin is increased in mm/page_io.c:swap_readpage.

Test stepping purposes
1) Step 1: 'entry' wakes up 'primary' and 'secondary' simultaneously; every
   time 'primary' allocates a page, 'secondary' inserts some pages into
   active_list close to it.
2) Step 3: we should re-allocate swap pages.
3) Step 4: flush 'entry primary secondary' to the swap partition.
4) Step 5: make the file content of 'entry primary secondary' present in
   memory.

Testcases are done in a VMware 5.5 virtual machine with 32M memory. If you
dispute my circumstances, do your own testcases following these steps
1) Run multiple memory-consumers together and make them pause at a point
  (so all private pages are messed up in the active list).
2) Flush them to the swap partition.
3) Wake up one of them, let it run full-speed for a while, and record pswpin of
  /proc/vmstat.
4) Invalidate all readaheaded pages.
5) Wake up another and repeat the test.
6) It's also good if you can record the hard disk LED blinking:)
Maybe your test resumes all memory-consumers together, so Linux readaheads some
pages which are close to the page-fault page but belong to other processes, I
think.

By the way, what's the linux-mm mail address? It isn't in
Documentation/SubmitPatches.

In fact, you will find the Linux column makes the hard disk LED blink every 5
seconds.
-
            Linux   pps
matrix      5241    1597
            5322    1620
            (81)    (23)

c           8028    1937
            8095    1954
fullspeed   8313    1964
            (67)    (17)
            (218)   (10)

destruction 9461    4445
            9825    4484
            (364)   (39)

Comment out the memset clause in secondary.c, so 'secondary' won't interrupt
page allocation in 'primary'.
-
            Linux   pps
matrix      207     38
            256     59
            (49)    (21)

c           1273    347
            1341    383
fullspeed   1362    386
            (68)    (36)
            (21)    (3)

destruction 2435    1178
            2513    1246
            (78)    (68)

entry.c
-
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

pid_t pids[4];
int sem_set;
siginfo_t si;

int main(int argc, char **argv)
{
	int i, data;
	unsigned short init_data[4] = { 1, 1, 1, 1 };

	if ((sem_set = semget(123321, 4, IPC_CREAT)) == -1)
		goto failed;
	if (semctl(sem_set, 0, SETALL, init_data) == -1)
		goto failed;
	pid_t pid = vfork();
	if (pid == -1) {
		goto failed;
	} else if (pid == 0) {
		if (execlp("./primary", "./primary", NULL) == -1)
			goto failed;
	} else {
		pids[0] = pid;
	}
	pid = vfork();
	if (pid == -1) {
		goto failed;
	} else if (pid == 0) {
		if (execlp("./secondary", "./secondary", NULL) == -1)


Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-24 Thread Hugh Dickins
On Tue, 23 Jan 2007, yunfeng zhang wrote:
> re-code my patch, tab = 8. Sorry!

Please stop resending this patch until you can attend to the advice
you've been given: Pavel made several very useful remarks on Monday:

> No, this is not the way to submit major rewrite of swap subsystem.
> 
> You need to (at minimum, making fundamental changes _is_ hard):
> 
> 1) Fix your mailer not to wordwrap.
> 
> 2) Get some testing. Identify workloads it improves.
> 
> 3) Get some _external_ testing. You are retransmitting a wordwrapped
> patch. That means no one other than you is actually using it.
> 
> I thought I told you to read the CodingStyle in some previous mail?

Another piece of advice would be to stop mailing it to linux-kernel,
and direct it to linux-mm instead, where specialists might be more
attentive.  But don't bother if you cannot follow Pavel's advice.

The only improvement I notice is that you are now sending a patch
against a recently developed kernel, 2.6.20-rc5, rather than the
2.6.16.29 you were offering earlier in the month: good, thank you.

I haven't done anything at all with Tuesday's recode tab=8 version,
I get too sick of "malformed patch" if ever I try to apply your mail
(it's probably related to the "format=flowed" in your mailer), and
don't usually have time to spend fixing them up.

But I did make the effort to reform it once before,
and again with Monday's version, the one that says:

> 4) It simplifies Linux memory model dramatically.

You have an interesting idea of "simplifies", given
 16 files changed, 997 insertions(+), 25 deletions(-)
(omitting your Documentation), and over 7k more code.
You'll have to be much more persuasive (with good performance
results) to get us to welcome your added layer of complexity.

Please make an effort to support at least i386 3level pagetables:
you don't actually need >4GB of memory to test CONFIG_HIGHMEM64G.
HIGHMEM testing shows you're missing a couple of pte_unmap()s,
in pps_swapoff_scan_ptes() and in shrink_pvma_scan_ptes().

It would be nice if you could support at least x86_64 too
(you have pte_low code peculiar to i386 in vmscan.c, which is
preventing that), but that's harder if you don't have the hardware.

But I have to admit, I've not been trying your patch because I
support it and want to see it in: the reverse, I've been trying
it because I want quickly to check whether it's something we
need to pay attention to and spend time on, hoping to rule
it out and turn to other matters instead.

And so far I've been (from that point of view) very pleased:
the first tests I ran went about 50% slower; but since they
involve tmpfs (and I suspect you've not considered the tmpfs
use of swap at all) that seemed a bit unfair, so I switched
to running the simplest memhog kind of tests (you know, in
a 512MB machine with plenty of swap, try to malloc and touch
600MB in rotation: I imagine should suit your design optimally):
quickly killed Out Of Memory.  Tried running multiple hogs for
smaller amounts (maybe one holds a lock you're needing to free
memory), but the same OOMs.  Ended up just doing builds on disk
with limited memory and 100% swappiness: consistently around
50% slower (elapsed time, also user time, also system time).

I've not reviewed the code at all, that would need a lot more
time.  But I get the strong impression that you're imposing on
Linux 2.6 ideas that seem obvious to you, without finding out
whether they really fit in and get good results.

I expect you'd be able to contribute much more if you spent a
while studying the behaviour of Linux swapping, and made incremental
tweaks to improve that (e.g. changing its swap allocation strategy),
rather than coming in with some preconceived plan.

Hugh




Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-22 Thread yunfeng zhang

re-code my patch, tab = 8. Sorry!

  Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.19/Documentation/vm_pps.txt   2007-01-23 11:32:02.0 
+0800
@@ -0,0 +1,236 @@
+ Pure Private Page System (pps)
+  [EMAIL PROTECTED]
+  December 24-26, 2006
+
+// Purpose <([{
+The file is used to document the idea which is published firstly at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of the document is for enhancing the performance of Linux swap
+subsystem. You can find the overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patch it into Linux 2.6.19 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+Good ideas originate from overall design and management ability; when you
+look down from a manager's view, you will free yourself from disordered code
+and find some problems immediately.
+
+OK! to modern OS, its memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+Maybe you would think that Page/PTE should be placed on the 3rd layer, but
+here, it's placed on the 2nd layer since it's the basic unit of VMA.
+
+Since the 2nd layer assembles most of the page-access statistics, it's
+natural that the swap subsystem should be deployed and implemented on the 2nd
+layer.
+
+Undoubtedly, there are some virtues about it
+1) SwapDaemon can collect the statistic of process acessing pages and by it
+   unmaps ptes, SMP specially benefits from it for we can use flush_tlb_range
+   to unmap ptes batchly rather than frequently TLB IPI interrupt per a page in
+   current Linux legacy swap subsystem.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readaheads the pages relative to the SwapSpace position of current
+   page-fault page.
+3) It's conformable to POSIX madvise API family.
+4) It simplifies Linux memory model dramatically. Keep it in mind that new swap
+   strategy is from up to down. In fact, Linux legacy swap subsystem is maybe
+   the only one from down to up.
+
+Unfortunately, Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note,
+it ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I've referred to in the previous section, perfectly applying my idea needs
+to unroot the page-surrounding swap subsystem to migrate it onto VMA, but a
+huge gap has
+defeated me -- active_list and inactive_list. In fact, you can find
+lru_add_active code anywhere ... It's IMPOSSIBLE to me to complete it only by
+myself. It's also the difference between my design and Linux, in my OS, page is
+the charge of its new owner totally, however, to Linux, page management system
+is still tracing it by PG_active flag.
+
+So I conceive another solution:) That is, set up an independent page-recycle
+system rooted on Linux legacy page system -- pps, intercept all private pages
+belonging to PrivateVMA to pps, then use my pps to cycle them.  By the way, the
+whole job should consist of two parts; here is the first --
+PrivateVMA-oriented, other is SharedVMA-oriented (should be called SPS)
+scheduled in future. Of course, if both are done, it will empty Linux legacy
+page system.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages, the whole process is divided into six stages -- <Stage Definition>. PPS
+uses init_mm::mm_list to enumerate all swappable UserSpace (shrink_private_vma)
+of mm/vmscan.c. Other sections show the remain aspects of pps
+1) <Data Definition> is basic data definition.
+2) <Concurrent racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> how private pages enter in/go off pps.
+4) <VMA Lifecycle of pps> which VMA is belonging to pps.
+5) <Others about pps> new daemon thread kppsd, pps statistic data etc.
+
+I'm also glad to highlight a new idea of mine -- dftlb, which is described in
+section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced by me to enhance TLB flushing efficiency. In
+brief, when we want to unmap a page from the page table of a process, why do we
+send a TLB IPI to the other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert flushing tasks into the timer interrupt route to
+implement a free-of-charge TLB flushing.
+
+The trick is implemented in
+1) TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   flushing tasks.
+3) all data are 

Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-22 Thread yunfeng zhang


Patched against 2.6.19 leads to:

mm/vmscan.c: In function `shrink_pvma_scan_ptes':
mm/vmscan.c:1340: too many arguments to function `page_remove_rmap'

So changed
 page_remove_rmap(series.pages[i], vma);
to
 page_remove_rmap(series.pages[i]);



I've worked on 2.6.19, but when I updated to 2.6.20-rc5, the function was changed.


But your patch doesn't offer any swap-performance improvement for both swsusp
or tmpfs.  Swap-in is still half speed of Swap-out.



Current Linux page allocation provides pages fairly to every process. Since the
swap daemon is only started when memory is low, by the time it starts to scan
active_list the private pages of different processes have been messed up with
each other. vmscan.c:shrink_list() is the only approach that attaches a disk
swap page to a page on active_list; as the result, all private pages lose their
affinity on the swap partition. I will give a test later...


Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-22 Thread Al Boldi
yunfeng zhang wrote:
> My patch is based on my new idea to Linux swap subsystem, you can find
> more in Documentation/vm_pps.txt which isn't only patch illustration but
> also file changelog. In brief, SwapDaemon should scan and reclaim pages on
> UserSpace::vmalist other than current zone::active/inactive. The change
> will conspicuously enhance swap subsystem performance by
>
> 1) SwapDaemon can collect the statistic of process acessing pages and by
> it unmaps ptes, SMP specially benefits from it for we can use
> flush_tlb_range to unmap ptes batchly rather than frequently TLB IPI
> interrupt per a page in current Linux legacy swap subsystem.
> 2) Page-fault can issue better readahead requests since history data shows
> all related pages have conglomerating affinity. In contrast, Linux
> page-fault readaheads the pages relative to the SwapSpace position of
> current page-fault page.
> 3) It's conformable to POSIX madvise API family.
> 4) It simplifies Linux memory model dramatically. Keep it in mind that new
> swap strategy is from up to down. In fact, Linux legacy swap subsystem is
> maybe the only one from down to up.
>
> Other problems asked about my pps are
> 1) There isn't new lock order in my pps, it's compliant to Linux lock
> order defined in mm/rmap.c.
> 2) When a memory inode is low, you can set scan_control::reclaim_node to
> let my kppsd to reclaim the memory inode page.

Patched against 2.6.19 leads to:

mm/vmscan.c: In function `shrink_pvma_scan_ptes':
mm/vmscan.c:1340: too many arguments to function `page_remove_rmap'

So changed
 page_remove_rmap(series.pages[i], vma);
to
 page_remove_rmap(series.pages[i]);

But your patch doesn't offer any swap-performance improvement for both swsusp 
or tmpfs.  Swap-in is still half speed of Swap-out.


Thanks!

--
Al



Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-22 Thread Pavel Machek
Hi!

> My patch is based on my new idea to Linux swap subsystem, you can find more 
> in
> Documentation/vm_pps.txt which isn't only patch illustration but also file
> changelog. In brief, SwapDaemon should scan and reclaim pages on
> UserSpace::vmalist other than current zone::active/inactive. The change will
> conspicuously enhance swap subsystem performance by

No, this is not the way to submit major rewrite of swap subsystem.

You need to (at minimum, making fundamental changes _is_ hard):

1) Fix your mailer not to wordwrap.

2) Get some testing. Identify workloads it improves.

3) Get some _external_ testing. You are retransmitting a wordwrapped
patch. That means no one other than you is actually using it.

4) Don't cc me; I'm not mm expert, and I tend to read l-k, anyway.

Pavel

> + Pure Private Page System (pps)
> + Copyright by Yunfeng Zhang on GFDL 1.2

I am not sure GFDL is GPL compatible.

> +// Purpose <([{

You have certainly "interesting" heading style. What is this markup?
> +
> +// The prototype of the function is fit with the "func" of "int
> +// smp_call_function (void (*func) (void *info), void *info, int retry, int
> +// wait);" of include/linux/smp.h of 2.6.16.29. Call it with NULL.
> +void timer_flush_tlb_tasks(void* data /* = NULL */);

I thought I told you to read the CodingStyle in some previous mail?

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html








[PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-21 Thread yunfeng zhang

My patch is based on my new idea for the Linux swap subsystem; you can find
more in Documentation/vm_pps.txt, which is not only the patch illustration but
also the file changelog. In brief, SwapDaemon should scan and reclaim pages on
UserSpace::vmalist rather than the current zone::active/inactive lists. The
change will conspicuously enhance swap subsystem performance because

1) SwapDaemon can collect statistics of process page access and use them to
  unmap ptes. SMP especially benefits from this, since we can use
  flush_tlb_range to unmap ptes in batches rather than sending a TLB IPI per
  page as the current legacy Linux swap subsystem does.
2) Page-fault can issue better readahead requests, since history data shows
  all related pages have conglomerating affinity. In contrast, Linux
  page-fault readahead fetches the pages adjacent to the SwapSpace position
  of the current page-fault page.
3) It conforms to the POSIX madvise API family.
4) It simplifies the Linux memory model dramatically. Keep in mind that the
  new swap strategy works from the top down. In fact, the legacy Linux swap
  subsystem is maybe the only one that works from the bottom up.

Other questions raised about my pps are
1) There is no new lock order in my pps; it complies with the Linux lock
  order defined in mm/rmap.c.
2) When a memory inode is low, you can set scan_control::reclaim_node to let
  my kppsd reclaim pages of that memory inode.

  Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.19/Documentation/vm_pps.txt   2007-01-22 13:52:04.973820224 +0800
@@ -0,0 +1,237 @@
+ Pure Private Page System (pps)
+ Copyright by Yunfeng Zhang on GFDL 1.2
+  [EMAIL PROTECTED]
+  December 24-26, 2006
+
+// Purpose <([{
+The file documents the idea first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of this document is for enhancing the performance of the Linux swap
+subsystem. You can find the overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patch it into Linux 2.6.19 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability: when you
+look down from a manager's view, you relieve yourself of disordered code and
+find some problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+Maybe you sense that Page/PTE should be placed on the 3rd layer, but here it
+is placed on the 2nd layer since it is the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it is natural
+that the swap subsystem should be deployed and implemented on the 2nd layer.
+
+Undoubtedly, there are some virtues about it
+1) SwapDaemon can collect statistics of process page access and use them to
+   unmap ptes. SMP especially benefits from this, since we can use
+   flush_tlb_range to unmap ptes in batches rather than sending a TLB IPI
+   per page as the current legacy Linux swap subsystem does.
+2) Page-fault can issue better readahead requests, since history data shows
+   all related pages have conglomerating affinity. In contrast, Linux
+   page-fault readahead fetches the pages adjacent to the SwapSpace position
+   of the current page-fault page.
+3) It conforms to the POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the
+   new swap strategy works from the top down. In fact, the legacy Linux swap
+   subsystem is maybe the only one that works from the bottom up.
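+Virtue 2 can be made concrete with a toy model: VMA-based readahead selects
+the virtually adjacent pages around the faulting address (clamped to the
+VMA), whereas Linux swap readahead selects pages whose SwapSpace slots are
+adjacent regardless of which address they belong to. The function below is a
+hypothetical userspace illustration, not code from the patch.
+
+```c
+#include <assert.h>
+#include <stdio.h>
+
+#define WIN 4   /* readahead window: pages on each side of the fault */
+
+/* VMA-based readahead: choose the virtual page numbers around the fault,
+ * clamped to [vma_start, vma_end]. The history data the text mentions
+ * would then map these addresses to their swap slots. */
+static int vma_readahead(unsigned long fault, unsigned long vma_start,
+                         unsigned long vma_end, unsigned long *out)
+{
+    int n = 0;
+    unsigned long lo = fault > vma_start + WIN ? fault - WIN : vma_start;
+    unsigned long hi = fault + WIN < vma_end ? fault + WIN : vma_end;
+    for (unsigned long p = lo; p <= hi; p++)
+        out[n++] = p;               /* pages to read ahead, in VA order */
+    return n;
+}
+
+int main(void)
+{
+    unsigned long pages[2 * WIN + 1];
+    int n = vma_readahead(100, 96, 200, pages);
+    /* window clamped at the VMA start: pages 96..104 */
+    printf("%d pages, first %lu last %lu\n", n, pages[0], pages[n - 1]);
+    return 0;
+}
+```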
+
+Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I referred to in the previous section, perfectly applying my idea needs to
+uproot the page-surrounding swap subsystem and migrate it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact,
+lru_add_active calls appear throughout the code ... it is IMPOSSIBLE for me to
+complete the job by myself. This is also where my design differs from Linux:
+in my OS, a page is entirely the charge of its new owner, whereas in Linux the
+page management system keeps tracking it through the PG_active flag.
+
+So I conceived another solution :) That is, set up an independent page-recycle
+system, pps, rooted on the legacy Linux page system; intercept all private
+pages belonging to PrivateVMA into pps, then let pps recycle them. By the way,
+the whole 
