Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-24 Thread yunfeng zhang

On a two-CPU machine, the page distribution is, theoretically, interleaved like

ABABAB

so every readahead on behalf of process A will create 4 unused readahead pages
unless you are sure B will resume soon.

Have you ever compared the results among UP, 2-CPU and 4-CPU configurations?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-22 Thread yunfeng zhang

The performance improvement should occur when the private pages of multiple
processes are interleaved with each other, as on SMP. On UP, my previous mail
drove the test by timer, which only demonstrates one fact: if pages are fully
interleaved, current readahead degrades remarkably, and unused readahead pages
become a burden to the memory subsystem.

You should re-run your testcases, following this advice, on Linux without my
patch: run the normal testcases, then pick one testcase at random, record the
'pswpin' field of /proc/vmstat, and redo that testcase alone. If the results
are close, your testcases didn't interleave private pages at all, as you
expected, thanks to the Linux scheduler. Thank you!


2007/2/22, Rik van Riel <[EMAIL PROTECTED]>:

yunfeng zhang wrote:
> Any comments or suggestions are always welcomed.

Same question as always: what problem are you trying to solve?




Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-21 Thread yunfeng zhang

Any comments or suggestions are always welcomed.



Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-20 Thread yunfeng zhang

The following algorithm is based on the SwapSpace bitmap management discussed
in the postscript section of my patch. It implements two operations: one
allocates a group of fake contiguous swap entries; the other re-allocates swap
entries in stage 3 when, for example, the series length is too short.


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/* Two hardware cache lines; you could also compress it into one hardware
 * cache line. Entry i is the number of zero (free) bits in byte value i. */
char bits_per_short[256] = {
8, 7, 7, 6, 7, 6, 6, 5,
7, 6, 6, 5, 6, 5, 5, 4,
7, 6, 6, 5, 6, 5, 5, 4,
6, 5, 5, 4, 5, 4, 4, 3,
7, 6, 6, 5, 6, 5, 5, 4,
6, 5, 5, 4, 5, 4, 4, 3,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
7, 6, 6, 5, 6, 5, 5, 4,
6, 5, 5, 4, 5, 4, 4, 3,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
5, 4, 4, 3, 4, 3, 3, 2,
4, 3, 3, 2, 3, 2, 2, 1,
7, 6, 6, 5, 6, 5, 5, 4,
6, 5, 5, 4, 5, 4, 4, 3,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
5, 4, 4, 3, 4, 3, 3, 2,
4, 3, 3, 2, 3, 2, 2, 1,
6, 5, 5, 4, 5, 4, 4, 3,
5, 4, 4, 3, 4, 3, 3, 2,
5, 4, 4, 3, 4, 3, 3, 2,
4, 3, 3, 2, 3, 2, 2, 1,
5, 4, 4, 3, 4, 3, 3, 2,
4, 3, 3, 2, 3, 2, 2, 1,
4, 3, 3, 2, 3, 2, 2, 1,
3, 2, 2, 1, 2, 1, 1, 0
};
unsigned char swap_bitmap[32];
// Allocate a group of fake contiguous swap entries.
int alloc(int size)
{
    int i, found = 0, result_offset;
    unsigned char a = 0, b = 0;

    /* scan until two adjacent bytes hold enough free bits together */
    for (i = 0; i < 32; i++) {
        b = bits_per_short[swap_bitmap[i]];
        if (a + b >= size) {
            found = 1;
            break;
        }
        a = b;
    }
    result_offset = i == 0 ? 0 : i - 1;
    result_offset = found ? result_offset : -1;
    return result_offset;
}
// Re-allocate in stage 3 if necessary.
int re_alloc(int position)
{
    int offset = position / 8;
    int a = offset == 0 ? 0 : offset - 1;
    int b = offset == 31 ? 31 : offset + 1;
    int i, empty_bits = 0;

    /* count free bits in the byte holding 'position' and its neighbours */
    for (i = a; i <= b; i++)
        empty_bits += bits_per_short[swap_bitmap[i]];
    return empty_bits;
}
int main(int argc, char **argv)
{
    int i;
    /* fill the bitmap with a random occupancy pattern */
    for (i = 0; i < 32; i++)
        swap_bitmap[i] = (unsigned char) (rand() % 0xff);
    i = 9;
    int temp = alloc(i);
    temp = re_alloc(i);
    return 0;
}



Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-02-12 Thread yunfeng zhang

You can apply my previous patch on 2.6.20 by changing

-#define VM_PURE_PRIVATE    0x0400  /* Is the vma is only belonging to a mm,

to

+#define VM_PURE_PRIVATE    0x0800  /* Is the vma is only belonging to a mm,

The new revision is based on 2.6.20 with my previous patch; the major changelogs are
1) pte_unmap pairs in shrink_pvma_scan_ptes and pps_swapoff_scan_ptes.
2) kppsd can now be woken up by kswapd.
3) A new global variable, accelerate_kppsd, is added to accelerate the
   reclamation process when a memory node is low.


Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- linux-2.6.19.orig/Documentation/vm_pps.txt  2007-02-12
12:45:07.0 +0800
+++ linux-2.6.19/Documentation/vm_pps.txt   2007-02-12 15:30:16.490797672 
+0800
@@ -143,23 +143,32 @@
2) mm/memory.c   do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
   do_swap_page (page-fault)
3) mm/memory.c   get_user_pages (sometimes core need share PrivatePage with us)
+4) mm/vmscan.c   balance_pgdat  (kswapd/x can do stage 5 of its node pages,
+   while kppsd can do stage 1-4)
+5) mm/vmscan.c   kppsd  (new core daemon -- kppsd, see below)

There isn't new lock order defined in pps, that is, it's compliable to Linux
-lock order.
+lock order. Locks in shrink_private_vma copied from shrink_list of 2.6.16.29
+(my initial version).
// }])>

// Others about pps <([{
A new kernel thread -- kppsd is introduced in mm/vmscan.c, its task is to
-execute the stages of pps periodically, note an appropriate timeout ticks is
-necessary so we can give application a chance to re-map back its PrivatePage
-from UnmappedPTE to PTE, that is, show their conglomeration affinity.
-
-kppsd can be controlled by new fields -- scan_control::may_reclaim/reclaim_node
-may_reclaim = 1 means starting reclamation (stage 5).  reclaim_node = (node
-number) is used when a memory node is low. Caller should set them to wakeup_sc,
-then wake up kppsd (vmscan.c:balance_pgdat). Note, if kppsd is started due to
-timeout, it doesn't do stage 5 at all (vmscan.c:kppsd). Other alive legacy
-fields are gfp_mask, may_writepage and may_swap.
+execute the stage 1 - 4 of pps periodically, note an appropriate timeout ticks
+(current 2 seconds) is necessary so we can give application a chance to re-map
+back its PrivatePage from UnmappedPTE to PTE, that is, show their
+conglomeration affinity.
+
+shrink_private_vma can be controlled by new fields -- may_reclaim, reclaim_node
+and is_kppsd of scan_control.  may_reclaim = 1 means starting reclamation
+(stage 5). reclaim_node = (node number, -1 means all memory inode) is used when
+a memory node is low. Caller (kswapd/x), typically, set reclaim_node to start
+shrink_private_vma (vmscan.c:balance_pgdat). Note, only to kppsd is_kppsd = 1.
+Other alive legacy fields to pps are gfp_mask, may_writepage and may_swap.
+
+When a memory inode is low, kswapd/x can wake up kppsd by increasing global
+variable accelerate_kppsd (balance_pgdat), which accelerate stage 1 - 4, and
+call shrink_private_vma to do stage 5.

PPS statistic data is appended to /proc/meminfo entry, its prototype is in
include/linux/mm.h.
Index: linux-2.6.19/mm/swapfile.c
===
--- linux-2.6.19.orig/mm/swapfile.c 2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/mm/swapfile.c  2007-02-12 12:45:21.0 +0800
@@ -569,6 +569,7 @@
}
}
} while (pte++, addr += PAGE_SIZE, addr != end);
+   pte_unmap(pte);
return 0;
}

Index: linux-2.6.19/mm/vmscan.c
===
--- linux-2.6.19.orig/mm/vmscan.c   2007-02-12 12:45:07.0 +0800
+++ linux-2.6.19/mm/vmscan.c2007-02-12 15:48:59.217292888 +0800
@@ -70,6 +70,7 @@
/* pps control command. See Documentation/vm_pps.txt. */
int may_reclaim;
int reclaim_node;
+   int is_kppsd;
};

/*
@@ -1101,9 +1102,9 @@
return ret;
}

-// pps fields.
+// pps fields, see Documentation/vm_pps.txt.
static wait_queue_head_t kppsd_wait;
-static struct scan_control wakeup_sc;
+static int accelerate_kppsd = 0;
struct pps_info pps_info = {
.total = ATOMIC_INIT(0),
.pte_count = ATOMIC_INIT(0), // stage 1 and 2.
@@ -1118,24 +1119,22 @@
struct page* pages[MAX_SERIES_LENGTH];
int series_length;
int series_stage;
-} series;
+};

-static int get_series_stage(pte_t* pte, int index)
+static int get_series_stage(struct series_t* series, pte_t* pte, int index)
{
-   series.orig_ptes[index] = *pte;
-   series.ptes[index] = pte;
-   if (pte_present(series.orig_ptes[index])) {
-   struct page* page = 
pfn_to_page(pte_pfn(series.orig_ptes[index]));
-   series.pages[index] = page;
+   series->orig_pte

Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-28 Thread yunfeng zhang

> You have an interesting idea of "simplifies", given
>  16 files changed, 997 insertions(+), 25 deletions(-)
> (omitting your Documentation), and over 7k more code.
> You'll have to be much more persuasive (with good performance
> results) to get us to welcome your added layer of complexity.


If the whole idea is deployed in Linux, the following core objects could be erased
1) anon_vma.
2) pgdata::active/inactive list and related methods -- mark_page_accessed etc.
3) PrivatePage::count and mapcount. If the core needs to share the page, add a
   PG_kmap flag. In fact, page::lru_list can safely be erased too.
4) All cases should go from top to bottom, which especially simplifies debugging.


> Please make an effort to support at least i386 3level pagetables:
> you don't actually need >4GB of memory to test CONFIG_HIGHMEM64G.
> HIGHMEM testing shows you're missing a couple of pte_unmap()s,
> in pps_swapoff_scan_ptes() and in shrink_pvma_scan_ptes().


Yes, it's my fault.


> It would be nice if you could support at least x86_64 too
> (you have pte_low code peculiar to i386 in vmscan.c, which is
> preventing that), but that's harder if you don't have the hardware.


Um! The data cmpxchg'ed should include the access bit. And I only have an x86
PC with less than 1G of memory. The 3-level pagetable code was copied from
other Linux functions.



Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-25 Thread yunfeng zhang

The current test is based on the fact below, from my previous mail


Current Linux page allocation fairly provides pages to every process; since
the swap daemon is only started when memory is low, by the time it starts to
scan active_list the private pages of processes are interleaved with each
other. vmscan.c:shrink_list() is the only place that attaches a disk swap page
to a page on active_list; as a result, all private pages lose their affinity
on the swap partition. I will give a test later...



Three testcases are imitated here
1) matrix: Some software does a lot of matrix arithmetic in its PrivateVMA;
  in fact, this type suits pps even better than Linux.
2) c: C malloc uses an algorithm much like slab, so when an application resumes
  from the swap partition, suppose it touches three variables whose sizes
  differ; as a result, it touches three pages which are close to each other
  but not contiguous. I also imitate the case where the application starts
  full-speed running later (touches more pages).
3) destruction: Typically, when an application resumes because the user clicks
  the close button, it visits all its private data to run object destruction.

Test stepping
1) Run ./entry and say y to it; this may need root privileges.
2) Wait a moment until it echoes 'primarywait'.
3) swapoff -a && swapon -a.
4) Run ./hog until count = 10.
5) 'cat primary entry secondary > /dev/null'.
6) 'cat /proc/vmstat' several times and record the 'pswpin' field when it's stable.
7) Type `1', `2' or `3' to select one of the 3 testcases; answer `2' to start the fullspeed testcase.
8) Record the new 'pswpin' field.
9) Which is better? See the 'pswpin' increment.
pswpin is increased in mm/page_io.c:swap_readpage.

Test stepping purposes
1) Step 1: 'entry' wakes up 'primary' and 'secondary' simultaneously; every
  time 'primary' allocates a page, 'secondary' inserts some pages into
  active_list close to it.
2) Step 3: we should re-allocate swap pages.
3) Step 4: flush 'entry primary secondary' to the swap partition.
4) Step 5: make the file content of 'entry primary secondary' present in memory.
The testcases were done in a VMware 5.5 virtual machine with 32M memory. If
you doubt my circumstances, run your own testcases following these steps
1) Run multiple memory consumers together and make them pause at a point
  (so all private pages in the active list are interleaved).
2) Flush them to the swap partition.
3) Wake up one of them, let it run full-speed for a while, and record pswpin
  from /proc/vmstat.
4) Invalidate all readahead pages.
5) Wake up another and repeat the test.
6) It's also good if you can record the hard disk LED blinking:)
Maybe your test resumes all memory consumers together, so Linux readaheads
pages that are close to the page-fault page but belong to other processes, I
think.

By the way, what is the linux-mm mailing list address? It isn't in Documentation/SubmittingPatches.

In fact, you will find the Linux column makes the hard disk LED blink every 5 seconds.
-
            Linux   pps
matrix      5241    1597
            5322    1620
            (81)    (23)

c           8028    1937
            8095    1954
fullspeed   8313    1964
            (67)    (17)
            (218)   (10)

destruction 9461    4445
            9825    4484
            (364)   (39)

Comment out the secondary.c:memset clause, so 'secondary' won't interrupt
page allocation in 'primary'.
-
            Linux   pps
matrix      207     38
            256     59
            (49)    (21)

c           1273    347
            1341    383
fullspeed   1362    386
            (68)    (36)
            (21)    (3)

destruction 2435    1178
            2513    1246
            (78)    (68)

entry.c
-
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

pid_t pids[4];
int sem_set;
siginfo_t si;

int main(int argc, char **argv)
{
int i, data;
unsigned short init_data[4] = { 1, 1, 1, 1 };
if ((sem_set = semget(123321, 4, IPC_CREAT)) == -1)
goto failed;
if (semctl(sem_set, 0, SETALL, init_data) == -1)
goto failed;
pid_t pid = vfork();
if (pid == -1) {
goto failed;
} else if (pid == 0) {
if (execlp("./primary", "./primary", NULL) == -1)
goto failed;
} else {
pids[0] = pid;
}
pid = vfork();
if (pid == -1) {
goto failed;
} else if (pid == 0) {
if (execlp("./secondary", "./secondary", NULL) == -1)
  

Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-22 Thread yunfeng zhang

I've re-coded my patch with tab = 8. Sorry!

  Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.19/Documentation/vm_pps.txt   2007-01-23 11:32:02.0 
+0800
@@ -0,0 +1,236 @@
+ Pure Private Page System (pps)
+  [EMAIL PROTECTED]
+  December 24-26, 2006
+
+// Purpose <([{
+The file is used to document the idea which is published firstly at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of the document is for enhancing the performance of Linux swap
+subsystem. You can find the overview of the idea in section  and how I patch it into Linux 2.6.19 in section
+.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability; when you
+look down from a manager's view, you relieve yourself of disordered code and
+find some problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+Maybe it strikes you that Page/PTE should be placed on the 3rd layer, but
+here, it's placed on the 2nd layer since it's the basic unit of VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it's natural
+that the swap subsystem should be deployed and implemented on the 2nd layer.
+
+Undoubtedly, there are some virtues about it
+1) The SwapDaemon can collect statistics on how processes access pages and
+   use them to unmap PTEs. SMP especially benefits from this, since we can
+   use flush_tlb_range to unmap PTEs in batches rather than issuing a TLB IPI
+   interrupt per page as in the current Linux legacy swap subsystem.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readaheads the pages relative to the SwapSpace position of current
+   page-fault page.
+3) It's conformable to POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the
+   new swap strategy works from the top down. In fact, the Linux legacy swap
+   subsystem is maybe the only one that works from the bottom up.
+
+Unfortunately, Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
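The batching claim in point 1 above is essentially an accounting argument: clearing a run of ptes and then issuing one flush_tlb_range costs a single shootdown round, whereas per-page unmapping pays one round per page. A minimal userspace model of that accounting follows; all names here are illustrative stand-ins, not the patch's kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 512

/* Toy model: pte[] marks mapped pages; flush_count stands in for the
 * number of TLB shootdown rounds (each an IPI broadcast on SMP). */
int pte[NPAGES];
int flush_count;

/* Legacy style: unmap one page at a time, one shootdown per page. */
int unmap_per_page(size_t start, size_t n)
{
    for (size_t i = start; i < start + n; i++) {
        pte[i] = 0;
        flush_count++;              /* models flush_tlb_page: one IPI round */
    }
    return flush_count;
}

/* Batched style: clear the whole range of ptes first, then flush once,
 * as a single flush_tlb_range would. */
int unmap_batched(size_t start, size_t n)
{
    for (size_t i = start; i < start + n; i++)
        pte[i] = 0;
    flush_count++;                  /* models flush_tlb_range: one IPI round */
    return flush_count;
}
```

For a 64-page run, the per-page path performs 64 shootdown rounds against a single round for the batched path, which is where the claimed SMP win would come from.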
+
+// Pure Private Page System -- pps  <([{
+As I've mentioned in the previous section, perfectly applying my idea needs to
+uproot the page-surrounding swap subsystem and migrate it onto VMAs, but a huge
+gap has defeated me -- active_list and inactive_list. In fact, you can find
+lru_add_active code everywhere ... It's IMPOSSIBLE for me to complete it by
+myself. It's also the difference between my design and Linux: in my OS, a page
+is totally in the charge of its new owner; in Linux, however, the page
+management system still traces it by the PG_active flag.
+
+So I conceive another solution:) That is, set up an independent page-recycle
+system rooted on the Linux legacy page system -- pps, intercept all private
+pages belonging to PrivateVMAs into pps, then use my pps to recycle them. By
+the way, the whole job consists of two parts; here is the first --
+PrivateVMA-oriented, the other is SharedVMA-oriented (should be called SPS),
+scheduled for the future. Of course, if both are done, it will empty the Linux
+legacy page system.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages; the whole process is divided into six stages -- <Stage Definition>. PPS
+uses init_mm::mm_list to enumerate all swappable UserSpaces (shrink_private_vma
+of mm/vmscan.c). Other sections cover the remaining aspects of pps
+1) <Data Definition> is the basic data definition.
+2) <Concurrent racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> -- how private pages enter/leave pps.
+4) <VMA Lifecycle of pps> -- which VMAs belong to pps.
+5) <Others about pps> -- the new daemon thread kppsd, pps statistic data etc.
+
+I'm also glad to highlight my new idea -- dftlb, which is described in
+section <Delay to Flush TLB>.
+// }])>
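The shrink_private_vma walk described above -- enumerate every swappable UserSpace from init_mm::mm_list, then visit its private VMAs -- reduces to a plain two-level list traversal. The sketch below is a userspace model under that assumption; the structs are illustrative stand-ins for the kernel's mm_struct and vm_area_struct, not the patch's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for mm_struct / vm_area_struct. */
struct vma { int is_private; struct vma *next; };
struct mm  { struct vma *vmas; struct mm *next; };

/* Model of the shrink_private_vma walk: start from the list anchored at
 * init_mm (skipping init_mm itself, which is the kernel's own mm) and
 * count the private VMAs a real walk would collect and unmap. */
size_t shrink_private_vma_model(struct mm *init_mm)
{
    size_t scanned = 0;

    for (struct mm *mm = init_mm->next; mm != NULL; mm = mm->next)
        for (struct vma *v = mm->vmas; v != NULL; v = v->next)
            if (v->is_private)
                scanned++;   /* a real walk would unmap this VMA's pages */
    return scanned;
}
```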
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced by me to enhance TLB flushing efficiency. In
+brief, when we want to unmap a page from the page table of a process, why
+send a TLB IPI to the other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert flushing tasks into the timer interrupt route to get
+TLB flushing free of charge.
+
+The trick is implemented in
+1) TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   f
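The dftlb scheme sketched in the two numbered points can be modeled in userspace as a bounded queue of pending flush ranges that the timer tick drains. The `_model` suffixes mark these functions as illustrative stand-ins for the patch's fill_in_tlb_task/timer_flush_tlb_tasks, and the single-threaded simulation ignores the real per-CPU synchronization.

```c
#include <assert.h>
#include <stddef.h>

/* A pending flush task: a virtual address range some CPU still has to flush. */
struct flush_task { unsigned long start, end; };

#define MAX_TASKS 16
struct flush_task queue[MAX_TASKS];
size_t queued;

/* Unmap path: instead of sending an IPI right away, queue the range. */
int fill_in_tlb_task_model(unsigned long start, unsigned long end)
{
    if (queued == MAX_TASKS)
        return -1;              /* queue full: caller falls back to an IPI */
    queue[queued].start = start;
    queue[queued].end   = end;
    queued++;
    return 0;
}

/* Timer tick: drain every queued range "for free", with no extra IPI.
 * Returns how many ranges this tick flushed. */
size_t timer_flush_tlb_tasks_model(void)
{
    size_t done = queued;
    queued = 0;                 /* all queued ranges are now flushed */
    return done;
}
```

The design trade-off the model exposes: pages stay mapped in remote TLBs until the next tick, so the unmap path must treat queued-but-unflushed ptes as still live.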

Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-22 Thread yunfeng zhang


Patched against 2.6.19 leads to:

mm/vmscan.c: In function `shrink_pvma_scan_ptes':
mm/vmscan.c:1340: too many arguments to function `page_remove_rmap'

So changed
 page_remove_rmap(series.pages[i], vma);
to
 page_remove_rmap(series.pages[i]);



I worked against 2.6.19, but when updating to 2.6.20-rc5, the function signature was changed.


But your patch doesn't offer any swap-performance improvement for either swsusp
or tmpfs.  Swap-in is still at half the speed of swap-out.



Current Linux page allocation fairly provides pages to every process. Since the
swap daemon is only started when memory is low, by the time it starts to scan
the active_list, the private pages of processes are already messed up with each
other; vmscan.c:shrink_list() is the only approach that attaches a disk swap
page to a page on the active_list, and as a result, all private pages lose
their affinity on the swap partition. I will give a test later...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-21 Thread yunfeng zhang

My patch is based on my new idea for the Linux swap subsystem; you can find
more in Documentation/vm_pps.txt, which isn't only a patch illustration but
also the file changelog. In brief, SwapDaemon should scan and reclaim pages on
UserSpace::vmalist rather than the current zone::active/inactive lists. The
change will conspicuously enhance swap subsystem performance because

1) SwapDaemon can collect statistics on how processes access pages and unmap
  ptes accordingly. SMP especially benefits from this, since we can use
  flush_tlb_range to unmap ptes in batches rather than raising a TLB IPI
  interrupt per page as in the current Linux legacy swap subsystem.
2) Page-fault can issue better readahead requests since history data shows all
  related pages have conglomerating affinity. In contrast, Linux page-fault
  readahead fetches the pages adjacent to the SwapSpace position of the
  current page-fault page.
3) It conforms to the POSIX madvise API family.
4) It simplifies the Linux memory model dramatically. Keep in mind that the new
  swap strategy works from the top down; the Linux legacy swap subsystem is
  perhaps the only one that works from the bottom up.

Other questions asked about my pps are
1) There isn't a new lock order in my pps; it's compliant with the Linux lock
  order defined in mm/rmap.c.
2) When a memory inode is low, you can set scan_control::reclaim_node to let my
  kppsd reclaim that memory inode's pages.

  Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.19/Documentation/vm_pps.txt   2007-01-22 13:52:04.973820224 
+0800
@@ -0,0 +1,237 @@
+ Pure Private Page System (pps)
+ Copyright by Yunfeng Zhang on GFDL 1.2
+  [EMAIL PROTECTED]
+  December 24-26, 2006
+
+// Purpose <([{
+The file is used to document the idea which was first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of this document is for enhancing the performance of the Linux swap
+subsystem. You can find the overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patch it into Linux 2.6.19 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability; when you
+look down from a manager's view, you will free yourself from disordered code
+and find some problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+It may strike you that Page/PTE should be placed on the 3rd layer, but
+here, it's placed on the 2nd layer since it's the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it's
+natural that the swap subsystem should be deployed and implemented on the 2nd
+layer.
+
+Undoubtedly, there are some virtues about it
+1) SwapDaemon can collect statistics on how processes access pages and unmap
+   ptes accordingly. SMP especially benefits from this, since we can use
+   flush_tlb_range to unmap ptes in batches rather than raising a TLB IPI
+   interrupt per page as in the current Linux legacy swap subsystem.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readahead fetches the pages adjacent to the SwapSpace position of the
+   current page-fault page.
+3) It conforms to the POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the new
+   swap strategy works from the top down; the Linux legacy swap subsystem is
+   perhaps the only one that works from the bottom up.
+
+Unfortunately, Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I've mentioned in the previous section, perfectly applying my idea needs to
+uproot the page-surrounding swap subsystem and migrate it onto VMAs, but a huge
+gap has defeated me -- active_list and inactive_list. In fact, you can find
+lru_add_active code everywhere ... It's IMPOSSIBLE for me to complete it by
+myself. It's also the difference between my design and Linux: in my OS, a page
+is totally in the charge of its new owner; in Linux, however, the page
+management system still traces it by the PG_active flag.
+
+So I conceive another solution:) That is, set up an independent page-recycle
+system rooted on the Linux legacy page system -- pps, intercept all private
+pages belonging to PrivateVMAs into pps, then use my pps to recycle them.  B


Re: [PATCH 2.6.16.29 1/1] MM: enhance Linux swap subsystem

2007-01-11 Thread yunfeng zhang

2007/1/11, yunfeng zhang <[EMAIL PROTECTED]>:
2007/1/11, Rik van Riel <[EMAIL PROTECTED]>:


Have you actually measured this?

If your measurements saw any performance gains, with what kind
of workload did they happen, how big were they and how do you
explain those performance gains?

How do you balance scanning the private memory with taking
pages off the per-zone page lists?

How do you deal with systems where some zones are really
large and other zones are really small, eg. a 32 bit system
with one 880MB zone and one 15.1GB zone?



Another solution is to add a new field, preferred_zone, to the scan_control of
mm/vmscan.c; keep in mind that the new swap strategy works from the top down,
from UserSpace to PrivatePage/SharedPage.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.16.29 1/1] MM: enhance Linux swap subsystem

2007-01-11 Thread yunfeng zhang

2007/1/11, Rik van Riel <[EMAIL PROTECTED]>:


Have you actually measured this?

If your measurements saw any performance gains, with what kind
of workload did they happen, how big were they and how do you
explain those performance gains?

How do you balance scanning the private memory with taking
pages off the per-zone page lists?

How do you deal with systems where some zones are really
large and other zones are really small, eg. a 32 bit system
with one 880MB zone and one 15.1GB zone?



You didn't mention another problem: a VMA may reside in multiple zones. For a
multiple-address-space, multiple-memory-inode architecture, we can introduce
a new core object -- section -- which has several features
1) Section is used as the atomic unit to contain the pages of a VMA residing in
 the memory inode of the section.
2) When page migration occurs among different memory inodes, a new section
 should be set up to trace the pages.
3) A section can be scanned directly by the SwapDaemon of its memory inode.
4) All sections of a VMA are mutually exclusive, not overlapped.
5) A VMA is made up entirely of sections, but its section objects are scattered
 across memory inodes.
On such an architecture, we can deploy the swap subsystem on an
architecture-independent layer by means of sections and scan pages in batches.
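The five properties above suggest a data layout along the following lines. This struct is purely hypothetical -- a guess at a shape that satisfies the listed properties, not anything taken from the patch -- and the disjointness check illustrates property 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout for the proposed "section" object: each section
 * covers the part of one VMA whose pages live on one memory inode (node). */
struct pps_section {
    unsigned long vm_start, vm_end;  /* range of the owning VMA it covers */
    int node;                        /* memory inode the pages reside on */
    struct pps_section *vma_next;    /* sections of the same VMA (disjoint) */
    struct pps_section *node_next;   /* scan list for this node's SwapDaemon */
};

/* Property 4: sections of one VMA must not overlap. */
int sections_disjoint(const struct pps_section *a, const struct pps_section *b)
{
    return a->vm_end <= b->vm_start || b->vm_end <= a->vm_start;
}
```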


If the benefits come mostly from better IO clustering, would
it not be safer/less invasive to add swapout clustering of the
same style that the BSD kernels have?



You mean add a new swap file/partition? I scan every UserSpace via
init_mm::mmlist, which is already used by the swap subsystem, and I've patched
that in mm/swapfile.c.


For your reference, the BSD kernels do swapout clustering like this:
1) select a page off the end of the pageout list
2) then scan the page table the page is in, to find
nearby pages that are also eligable for pageout
3) page them all out with one disk I/O operation

The same could also be done for files.
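The three BSD clustering steps quoted above can be sketched as one small routine: take the victim page, extend the cluster across eligible neighbours in the same page table, and hand the whole run to a single I/O. CLUSTER and the eligibility map are illustrative parameters, not the BSD kernels' actual code.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER 8   /* scan up to 8 neighbours on each side (illustrative) */

/* Model of BSD-style swapout clustering: given the index of a victim page
 * (step 1) and an eligibility map for its page table, collect the run of
 * eligible neighbours (step 2) so they can be written with one disk I/O
 * (step 3). Returns the cluster size; stores its first index in *first. */
size_t cluster_pageout(const int *eligible, size_t npages, size_t victim,
                       size_t *first)
{
    size_t lo = victim, hi = victim;

    while (lo > 0 && victim - (lo - 1) <= CLUSTER && eligible[lo - 1])
        lo--;                          /* extend the run backwards */
    while (hi + 1 < npages && (hi + 1) - victim <= CLUSTER && eligible[hi + 1])
        hi++;                          /* extend the run forwards */

    *first = lo;
    return hi - lo + 1;                /* one I/O would cover [lo, hi] */
}
```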

With peterz's dirty tracking (and possible dirty limiting)
code in the kernel, this can be done without the kind of
deadlocks that would have plagued earlier kernels, when
trying to do IO trickery from the pageout path...



I've noticed FreeBSD may have a design similar to what I mentioned; I'm reading
their code to see whether they have two other features -- UnmappedPTE and dftlb
from my Documentation/vm_pps.txt.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.16.29 1/1] MM: enhance Linux swap subsystem

2007-01-10 Thread yunfeng zhang

My patch is based on my new idea for the Linux swap subsystem; you can find
more in Documentation/vm_pps.txt, which isn't only a patch illustration but
also the file changelog. In brief, SwapDaemon should scan and reclaim pages on
UserSpace::vmalist rather than the current zone::active/inactive lists. The
change will conspicuously enhance swap subsystem performance because

1) SwapDaemon can collect statistics on how processes access pages and unmap
 ptes accordingly. SMP especially benefits from this, since we can use
 flush_tlb_range to unmap ptes in batches rather than raising a TLB IPI
 interrupt per page as in the current Linux legacy swap subsystem. In fact,
 in some cases, we can even flush the TLB without sending an IPI.
2) Page-fault can issue better readahead requests since history data shows all
 related pages have conglomerating affinity. In contrast, Linux page-fault
 readahead fetches the pages adjacent to the SwapSpace position of the
 current page-fault page.
3) It conforms to the POSIX madvise API family.

Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.16.29/Documentation/vm_pps.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.16.29/Documentation/vm_pps.txt2007-01-11 10:51:15.649869656 
+0800
@@ -0,0 +1,214 @@
+ Pure Private Page System (pps)
+ Copyright by Yunfeng Zhang on GFDL 1.2
+  [EMAIL PROTECTED]
+  December 24-26, 2006
+
+// Purpose <([{
+The file is used to document the idea which was first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of this document is for enhancing the performance of the Linux swap
+subsystem. You can find the overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patch it into Linux 2.6.16.29 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability; when you
+look down from a manager's view, you will free yourself from disordered code
+and find some problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) PTE, zone/memory inode layer (architecture-dependent).
+It may strike you that Page should be placed on the 3rd layer, but
+here, it's placed on the 2nd layer since it's the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it's
+natural that the swap subsystem should be deployed and implemented on the 2nd
+layer.
+
+Undoubtedly, there are some virtues about it
+1) SwapDaemon can collect statistics on how processes access pages and unmap
+   ptes accordingly. SMP especially benefits from this, since we can use
+   flush_tlb_range to unmap ptes in batches rather than raising a TLB IPI
+   interrupt per page as in the current Linux legacy swap subsystem.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readahead fetches the pages adjacent to the SwapSpace position of the
+   current page-fault page.
+3) It conforms to the POSIX madvise API family.
+
+Unfortunately, Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a
+system on zone::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I've mentioned in the previous section, perfectly applying my idea needs to
+uproot the page-surrounding swap subsystem and migrate it onto VMAs, but a huge
+gap has defeated me -- active_list and inactive_list. In fact, you can find
+lru_add_active code everywhere ... It's IMPOSSIBLE for me to complete it by
+myself. It's also the difference between my design and Linux: in my OS, a page
+is totally in the charge of its new owner; in Linux, however, the page
+management system still traces it by the PG_active flag.
+
+So I conceive another solution:) That is, set up an independent page-recycle
+system rooted on the Linux legacy page system -- pps, intercept all private
+pages belonging to PrivateVMAs into pps, then use my pps to recycle them. By
+the way, the whole job consists of two parts; here is the first --
+PrivateVMA-oriented (PPS), the other is SharedVMA-oriented (should be called
+SPS), scheduled for the future. Of course, if all are done, it will empty the
+Linux legacy page system.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages in the SwapDaemon (mm/vmscan.c:shrink_private_vma); the whole process is
+divided into six stages -- <Stage Definition>. Other sections cover the
+remaining aspects of pps
+1) <Data Definition> is the basic data definition.
+2) <Concurrent racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> -- how private pages enter in/go of

[PATCH 2.6.16.29 1/1] MM: enhance Linux swap subsystem

2007-01-10 Thread yunfeng zhang

My patch is based on my new idea for the Linux swap subsystem; you can find
more in Documentation/vm_pps.txt, which isn't only a patch illustration but
also a file changelog. In brief, the SwapDaemon should scan and reclaim pages
on UserSpace::vmalist rather than the current zone::active/inactive lists.
The change will conspicuously enhance swap subsystem performance by

1) SwapDaemon can collect statistics on how processes access pages and use
 them to unmap ptes. SMP especially benefits from this, since we can use
 flush_tlb_range to unmap ptes in batches rather than issuing a TLB IPI
 interrupt per page as in the current Linux legacy swap subsystem. In fact,
 in some cases, we can even flush the TLB without sending an IPI.
2) Page-fault can issue better readahead requests, since history data shows
 all related pages have conglomerating affinity. In contrast, Linux
 page-fault readahead fetches the pages adjacent to the SwapSpace position of
 the current page-fault page.
3) It's conformable to POSIX madvise API family.

Signed-off-by: Yunfeng Zhang [EMAIL PROTECTED]

Index: linux-2.6.16.29/Documentation/vm_pps.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.16.29/Documentation/vm_pps.txt	2007-01-11 10:51:15.649869656 +0800
@@ -0,0 +1,214 @@
+ Pure Private Page System (pps)
+ Copyright by Yunfeng Zhang on GFDL 1.2
+  [EMAIL PROTECTED]
+  December 24-26, 2006
+
+// Purpose ([{
+The file is used to document the idea which is published firstly at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of the document is for enhancing the performance of the Linux swap
+subsystem. You can find the overview of the idea in section How to Reclaim
+Pages more Efficiently and how I patch it into Linux 2.6.16.29 in section
+Pure Private Page System -- pps.
+// }])
+
+// How to Reclaim Pages more Efficiently ([{
+A good idea originates from overall design and management ability: when you
+look down from a manager's view, you free yourself from disordered code and
+spot problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) PTE, zone/memory inode layer (architecture-dependent).
+4) It may seem that Page should be placed on the 3rd layer, but here it's
+   placed on the 2nd layer since it's the basic unit of VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it's natural
+that the swap subsystem should be deployed and implemented on the 2nd layer.
+
+Undoubtedly, there are some virtues to it
+1) SwapDaemon can collect statistics on how processes access pages and use
+   them to unmap ptes. SMP especially benefits from this, since we can use
+   flush_tlb_range to unmap ptes in batches rather than issuing a TLB IPI
+   interrupt per page as in the current Linux legacy swap subsystem.
+2) Page-fault can issue better readahead requests, since history data shows
+   all related pages have conglomerating affinity. In contrast, Linux
+   page-fault readahead fetches the pages adjacent to the SwapSpace position
+   of the current page-fault page.
+3) It's conformable to POSIX madvise API family.
+
+Unfortunately, Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a
+system on zone::active_list/inactive_list.
+
+I've finished a patch, see section Pure Private Page System -- pps. Note, it
+ISN'T perfect.
+// }])
+
+// Pure Private Page System -- pps  ([{
+As I've referred to in the previous section, perfectly applying my idea needs
+to unroot the page-surrounding swap subsystem and migrate it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_add_active code anywhere ... It's IMPOSSIBLE for me to complete it
+by myself. It's also the difference between my design and Linux: in my OS, a
+page is totally in the charge of its new owner; to Linux, however, the page
+management system still traces it by the PG_active flag.
+
+So I conceive another solution:) That is, set up an independent page-recycle
+system rooted on the Linux legacy page system -- pps, intercept all private
+pages belonging to PrivateVMA into pps, then use pps to recycle them. By the
+way, the whole job should consist of two parts; here is the first --
+PrivateVMA-oriented (PPS), the other is SharedVMA-oriented (should be called
+SPS), scheduled in future. Of course, if both are done, the Linux legacy
+page system will be emptied.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages in the SwapDaemon mm/vmscan.c:shrink_private_vma; the whole process is
+divided into six stages -- Stage Definition. Other sections show the
+remaining aspects of pps
+1) Data Definition is basic data definition.
+2) Concurrent racers of Shrinking pps is focused on synchronization.
+3) Private Page Lifecycle of pps -- how private pages enter in/go off pps.
+4) VMA Lifecycle of pps -- which VMAs belong to pps.

Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

2007-01-09 Thread yunfeng zhang

Sorry, I can't be online regularly, that is, I can't synchronize with the
Linux CVS, so I only work on a fixed kernel version. Documentation/vm_pps.txt
isn't only a patch overview but also a changelog.



Great!

Do you have patch against 2.6.19?


Thanks!

--
Al



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

2007-01-09 Thread yunfeng zhang

Maybe there should be a memory maintainer in the Linux kernel group.

Here, I show some content from my patch (Documentation/vm_pps.txt). In brief,
I propose a revolution in the Linux swap subsystem: the SwapDaemon should
scan and reclaim pages on UserSpace::vmalist rather than the current
zone::active/inactive lists. The change will conspicuously enhance swap
subsystem performance by

1) SwapDaemon can collect statistics on how processes access pages and use
  them to unmap ptes. SMP especially benefits from this, since we can use
  flush_tlb_range to unmap ptes in batches rather than issuing a TLB IPI
  interrupt per page as in the current Linux legacy swap subsystem. In fact,
  in some cases, we can even flush the TLB without sending an IPI.
2) Page-fault can issue better readahead requests, since history data shows
  all related pages have conglomerating affinity. In contrast, Linux
  page-fault readahead fetches the pages adjacent to the SwapSpace position
  of the current page-fault page.
3) It's conformable to POSIX madvise API family.


Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

2007-01-04 Thread yunfeng zhang

I've made a new patch, based on the previous quilt patch (2.6.16.29). Here is
the changelog

--

NEW

New kernel thread kppsd is added to execute background scanning task
periodically (mm/vmscan.c).

PPS statistic is added into /proc/meminfo, its prototype is in
include/linux/mm.h.

Documentation/vm_pps.txt is also updated to show the above two new features;
some sections are re-written for comprehension.


BUG

New loop code is introduced in shrink_private_vma (mm/vmscan.c) and
pps_swapoff (mm/swapfile.c); in contrast with the old code, it's safe even if
lhtemp is freed during the loop.

A bug is caught in mm/memory.c:zap_pte_range -- if a PrivatePage is
being written back, it will be migrated back to the Linux legacy page system.

A fault I made in the previous patch is remedied in stage 5; now stage 5
works.


MISCELLANEOUS

UP code has been separated from SMP code in dftlb.
--

Index: linux-2.6.16.29/Documentation/vm_pps.txt
===
--- linux-2.6.16.29.orig/Documentation/vm_pps.txt   2007-01-04
14:47:35.0 +0800
+++ linux-2.6.16.29/Documentation/vm_pps.txt2007-01-04 14:49:36.0 
+0800
@@ -6,11 +6,11 @@
// Purpose <([{
The file is used to document the idea which is published firstly at
http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
-OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, a patch
-of the document to enchance the performance of Linux swap subsystem. You can
-find the overview of the idea in section <How to Reclaim Pages more
-Efficiently> and how I patch it into Linux 2.6.16.29 in section <Pure Private
-Page System -- pps>.
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of the document is for enchancing the performance of Linux swap
+subsystem. You can find the overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patch it into Linux 2.6.16.29 in section
+<Pure Private Page System -- pps>.
// }])>

// How to Reclaim Pages more Efficiently <([{
@@ -21,7 +21,9 @@
OK! to modern OS, its memory subsystem can be divided into three layers
1) Space layer (InodeSpace, UserSpace and CoreSpace).
2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
-3) PTE and page layer (architecture-dependent).
+3) PTE, zone/memory inode layer (architecture-dependent).
+4) Maybe it makes you sense that Page should be placed on the 3rd layer, but
+   here, it's placed on the 2nd layer since it's the basic unit of VMA.

Since the 2nd layer assembles the much statistic of page-acess information, so
it's nature that swap subsystem should be deployed and implemented on the 2nd
@@ -41,7 +43,8 @@
Unfortunately, Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a
system on zone::active_list/inactive_list.

-I've finished a patch, see section <Pure Private Page System -- pps>.
Note, it ISN'T perfect.
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
// }])>

// Pure Private Page System -- pps  <([{
@@ -70,7 +73,18 @@
3) <Private Page Lifecycle of pps> -- how private pages enter in/go off pps.
4) <VMA Lifecycle of pps> which VMA is belonging to pps.

-PPS uses init_mm.mm_list list to enumerate all swappable UserSpace.
+PPS uses init_mm.mm_list list to enumerate all swappable UserSpace
+(shrink_private_vma).
+
+A new kernel thread -- kppsd is introduced in mm/vmscan.c, its task is to
+execute the stages of pps periodically, note an appropriate timeout ticks is
+necessary so we can give application a chance to re-map back its PrivatePage
+from UnmappedPTE to PTE, that is, show their conglomeration affinity.
+scan_control::pps_cmd field is used to control the behavior of kppsd, = 1 for
+accelerating scanning process and reclaiming pages, it's used in balance_pgdat.
+
+PPS statistic data is appended to /proc/meminfo entry, its prototype is in
+include/linux/mm.h.

I'm also glad to highlight my new idea -- dftlb, which is described in
section <Delay to Flush TLB>.
@@ -97,15 +111,19 @@
   gone when a CPU starts to execute the task in timer interrupt, so don't use
   dftlb.
combine stage 1 with stage 2, and send IPI immediately in fill_in_tlb_tasks.
+
+dftlb increases mm_struct::mm_users to prevent the mm from being freed when
+other CPU works on it.
// }])>

// Stage Definition <([{
The whole process of private page page-out is divided into six stages, as
-showed in shrink_pvma_scan_ptes of mm/vmscan.c
+showed in shrink_pvma_scan_ptes of mm/vmscan.c, the code groups the similar
+pages to a series.
1) PTE to untouched PTE (access bit is cleared), append flushing
tasks to dftlb.
2) Convert untouched PTE to UnmappedPTE.
3) Link SwapEntry to every UnmappedPTE.
-4) Synchronize the page of a UnmappedPTE with its physical swap page.
+4) Flush PrivatePage of UnmappedPTE to its disk SwapPage.
5) Reclaimed the page and shift UnmappedPTE to SwappedPTE.
6) SwappedPTE stage.
// }])>
@@ -114,7 +132,15 @@
New VMA flag (VM_PURE_PRIVATE) is appended into VMA in include/linux/mm.h.

New PTE type (UnmappedPTE) is appended into PTE system in
-include/asm-i386/pgtable.h.
+include/asm-i386/pgtable.h. Its 

Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

2007-01-04 Thread yunfeng zhang

No, it's a new idea to re-write the swap subsystem entirely. In fact, that's
an impossible task for me alone, so I provide a compromise solution -- pps
(pure private page system).

2006/12/30, Zhou Yingchao <[EMAIL PROTECTED]>:

2006/12/27, yunfeng zhang <[EMAIL PROTECTED]>:
> To multiple address space, multiple memory inode architecture, we can 
introduce
> a new core object -- section which has several features
Do you mean "in-memory inode"  or "memory node(pglist_data)" by "memory inode" ?
> The idea issued by me is whether swap subsystem should be deployed on layer 2 
or
> layer 3 which is described in Documentation/vm_pps.txt of my patch. To 
multiple
> memory inode architecture, the special memory model should be encapsulated on
> layer 3 (architecture-dependent), I think.
I guess that you are  wanting to do something to remove arch-dependent
code in swap subsystem.  Just like the pud introduced in the
page-table related codes. Is it right?
However, you should verify that your changes will not deteriorate
system performance. Also, you need to maintain it for a long time with
the evolution of mainline kernel before it is accepted.

Best regards
--
Yingchao Zhou
***
 Institute Of Computing Technology
 Chinese Academy of Sciences
***




Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

2006-12-28 Thread yunfeng zhang

I've re-published my work on quilt, sorry.


Index: linux-2.6.16.29/Documentation/vm_pps.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.16.29/Documentation/vm_pps.txt	2006-12-29 14:36:36.507332384 +0800
@@ -0,0 +1,192 @@
+ Pure Private Page System (pps)
+ Copyright by Yunfeng Zhang on GFDL 1.2
+  [EMAIL PROTECTED]
+  December 24-26, 2006
+
+// Purpose <([{
+The file is used to document the idea which is published firstly at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, a patch
+of the document to enhance the performance of the Linux swap subsystem. You
+can find the overview of the idea in section <How to Reclaim Pages more
+Efficiently> and how I patch it into Linux 2.6.16.29 in section <Pure Private
+Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability: when you
+look down from a manager's view, you free yourself from disordered code and
+spot problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) PTE and page layer (architecture-dependent).
+
+Since the 2nd layer gathers most of the page-access statistics, it's natural
+that the swap subsystem should be deployed and implemented on the 2nd layer.
+
+Undoubtedly, there are some virtues to it
+1) SwapDaemon can collect statistics on how processes access pages and use
+   them to unmap ptes. SMP especially benefits from this, since we can use
+   flush_tlb_range to unmap ptes in batches rather than issuing a TLB IPI
+   interrupt per page as in the current Linux legacy swap subsystem.
+2) Page-fault can issue better readahead requests, since history data shows
+   all related pages have conglomerating affinity. In contrast, Linux
+   page-fault readahead fetches the pages adjacent to the SwapSpace position
+   of the current page-fault page.
+3) It's conformable to POSIX madvise API family.
+
+Unfortunately, Linux 2.6.16.29 swap subsystem is based on the 3rd layer -- a
+system on zone::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note,
+it ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I've referred to in the previous section, perfectly applying my idea needs
+to unroot the page-surrounding swap subsystem and migrate it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_add_active code anywhere ... It's IMPOSSIBLE for me to complete it
+by myself. It's also the difference between my design and Linux: in my OS, a
+page is totally in the charge of its new owner; to Linux, however, the page
+management system still traces it by the PG_active flag.
+
+So I conceive another solution:) That is, set up an independent page-recycle
+system rooted on the Linux legacy page system -- pps, intercept all private
+pages belonging to PrivateVMA into pps, then use pps to recycle them. By the
+way, the whole job should consist of two parts; here is the first --
+PrivateVMA-oriented (PPS), the other is SharedVMA-oriented (should be called
+SPS), scheduled in future. Of course, if both are done, the Linux legacy
+page system will be emptied.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages in the SwapDaemon mm/vmscan.c:shrink_private_vma; the whole process is
+divided into six stages -- <Stage Definition>. Other sections show the
+remaining aspects of pps
+1) <Data Definition> is basic data definition.
+2) <Concurrent racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> -- how private pages enter in/go off pps.
+4) <VMA Lifecycle of pps> -- which VMAs belong to pps.
+
+PPS uses init_mm.mm_list list to enumerate all swappable UserSpace.
+
+I'm also glad to highlight my new idea -- dftlb, which is described in
+section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced by me to enhance TLB-flushing efficiency.
+In brief, when we want to unmap a page from the page table of a process, why
+send a TLB IPI to the other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert flushing tasks into the timer interrupt route to
+implement a free-of-charge TLB flush.
+
+The trick is implemented in
+1) TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   flushing tasks.
+3) all data are defined in include/linux/mm.h.
+
+The restrictions of dftlb. The following conditions must be met
+1) an atomic cmpxchg instruction.
+2) the access bit is set atomically when a pte is first touched.
+3) To some architectures, the vma parameter of flush_tlb_range may be
+   important; if so, don't use dftlb, since it's possible that the vma of a
+   TLB flushing task has gone by the time a CPU executes the task in its
+   timer interrupt.
+combine stage 1 with stage 2, and send IPI immediately in fill_in_tlb_tasks.


Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

2006-12-26 Thread yunfeng zhang

The job listed in Documentation/vm_pps.txt of my patch is too heavy for me,
so I would appreciate it if the Linux kernel group could arrange a schedule
to help me.


Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

2006-12-26 Thread yunfeng zhang

For a multiple-address-space, multiple-memory-inode architecture, we can
introduce a new core object -- the section -- which has several features
1) A section is the atomic unit containing the pages of a VMA that reside in
   the memory inode of the section.
2) When page migration occurs among different memory inodes, a new section
   should be set up to trace the pages.
3) A section can be scanned directly by the SwapDaemon of its memory inode.
4) All sections of a VMA are mutually exclusive, never overlapping.
5) A VMA is made up entirely of sections, but its section objects are
   scattered across memory inodes.
So on such an architecture, we can deploy the swap subsystem on an
architecture-independent layer by means of sections and scan pages in batches.
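Features 4 and 5 above amount to an invariant: the sections of a VMA must exactly tile its address range. A small userspace C sketch can make that concrete; the struct layout and names (pps_section, sections_tile_vma) are my assumptions for illustration, not anything from the patch.

```c
/* Illustrative sketch of the proposed "section" object: each section
 * covers [start, end) of a VMA and lives on one memory inode.
 * Names and layout are assumptions, not patch code. */
#include <stddef.h>

struct pps_section {
    unsigned long start, end;      /* page-aligned range inside the VMA */
    int mem_inode;                 /* which memory inode holds its pages */
    struct pps_section *next;      /* sections sorted by start address */
};

/* Check features 4/5: a sorted section list must tile the VMA
 * [vm_start, vm_end) with no gaps and no overlaps. Returns 1 if so. */
int sections_tile_vma(const struct pps_section *s,
                      unsigned long vm_start, unsigned long vm_end)
{
    unsigned long expect = vm_start;
    for (; s; s = s->next) {
        if (s->start != expect || s->end <= s->start)
            return 0;              /* gap, overlap, or empty section */
        expect = s->end;
    }
    return expect == vm_end;       /* list must reach the VMA's end */
}
```

With this invariant, a per-inode SwapDaemon can scan its own sections in batches (feature 3) while still covering every page of the VMA exactly once across inodes.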

The question I raised is whether the swap subsystem should be deployed on
layer 2 or layer 3, which are described in Documentation/vm_pps.txt of my
patch. For a multiple-memory-inode architecture, the special memory model
should be encapsulated on layer 3 (architecture-dependent), I think.


[PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

2006-12-26 Thread yunfeng zhang
dr, next);
-   } while (pud++, addr = next, addr != end);
-}
-
-// migrate all pages of pure private vma back to Linux legacy memory management.
-static void migrate_back_legacy_linux(struct mm_struct* mm, struct vm_area_struct* vma)
-{
-   pgd_t* pgd;
-   unsigned long next;
-   unsigned long addr = vma->vm_start;
-   unsigned long end = vma->vm_end;
-
-   pgd = pgd_offset(mm, addr);
-   do {
-   next = pgd_addr_end(addr, end);
-   if (pgd_none_or_clear_bad(pgd))
-   continue;
-   migrate_back_pud_range(mm, pgd, vma, addr, next);
-   } while (pgd++, addr = next, addr != end);
-}
-
-LIST_HEAD(pps_head);
-LIST_HEAD(pps_head_buddy);
-
-DEFINE_SPINLOCK(pps_lock);
-
-void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma)
-{
-   int condition = VM_READ | VM_WRITE | VM_EXEC | \
-VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | \
-VM_GROWSDOWN | VM_GROWSUP | \
-                VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY | VM_ACCOUNT;
-   if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL) {
-   vma->vm_flags |= VM_PURE_PRIVATE;
-   if (list_empty(&mm->mmlist)) {
-   spin_lock(&mmlist_lock);
-   if (list_empty(&mm->mmlist))
-   list_add(&mm->mmlist, &init_mm.mmlist);
-   spin_unlock(&mmlist_lock);
-   }
-   }
-}
-
-void leave_pps(struct vm_area_struct* vma, int migrate_flag)
-{
-   struct mm_struct* mm = vma->vm_mm;
-
-   if (vma->vm_flags & VM_PURE_PRIVATE) {
-   vma->vm_flags &= ~VM_PURE_PRIVATE;
-   if (migrate_flag)
-   migrate_back_legacy_linux(mm, vma);
-   }
-}
--- patch-linux/kernel/timer.c  2006-12-26 15:20:02.688545256 +0800
+++ linux-2.6.16.29/kernel/timer.c  2006-09-13 02:02:10.0 +0800
@@ -845,2 +844,0 @@
-
-   timer_flush_tlb_tasks(NULL);
--- patch-linux/kernel/fork.c   2006-12-26 15:20:02.688545256 +0800
+++ linux-2.6.16.29/kernel/fork.c   2006-09-13 02:02:10.0 +0800
@@ -232 +231,0 @@
-   leave_pps(mpnt, 1);
--- patch-linux/Documentation/vm_pps.txt    2006-12-26 15:45:33.203883456 +0800
+++ linux-2.6.16.29/Documentation/vm_pps.txt    1970-01-01 08:00:00.0 +0800
@@ -1,190 +0,0 @@
- Pure Private Page System (pps)
- Copyright by Yunfeng Zhang on GFDL 1.2
-  [EMAIL PROTECTED]
-  December 24-26, 2006
-
-// Purpose <([{
-The file is used to document the idea which is published firstly at
-http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
-OS -- main page http://blog.chinaunix.net/u/21764/index.php.  You can find the
-overview of the idea in section <How to Reclaim Pages more Efficiently> and how
-I patch it into Linux 2.6.16.29 in section <Pure Private Page System -- pps>.
-// }])>
-
-// How to Reclaim Pages more Efficiently <([{
-A good idea originates from overall design and management ability; when you
-look down from a manager's view, you will free yourself from disordered code
-and find some problems immediately.
-
-OK! to modern OS, its memory subsystem can be divided into three layers
-1) Space layer (InodeSpace, UserSpace and CoreSpace).
-2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
-3) PTE and page layer (architecture-dependent).
-
-Since the 2nd layer gathers most of the page-access statistics, it's natural
-that the swap subsystem should be deployed and implemented on the 2nd layer.
-
-Undoubtedly, there are some virtues about it
-1) The SwapDaemon can collect statistics on processes accessing pages and use
-   them to unmap ptes; SMP especially benefits from this, since we can use
-   flush_tlb_range to relieve frequent TLB IPI interrupts.
-2) Page-fault can issue better readahead requests, since history data shows
-   that all related pages have conglomerating affinity.
-3) It conforms to the POSIX madvise API family.
-
-Unfortunately, the Linux 2.6.16.29 swap subsystem is based on the 3rd layer --
-a system surrounding the page.
-
-I've committed a patch for it; see section <Pure Private Page System -- pps>.
-Note, it ISN'T perfect.
-// }])>
-
-// Pure Private Page System -- pps  <([{
-Current Linux is just like a monster and is still growing, even its swap
-subsystem ...
-
-As I've mentioned in the previous section, perfectly applying my idea needs
-uprooting the page-surrounding swap subsystem to migrate it onto VMAs, but a
-huge gap has defeated me -- active_list and inactive_list. In fact, you can
-find lru_add_active anywhere ... It's IMPOSSIBLE for me to complete it by
-myself. It's also the difference between my design and Linux: in my OS, a page
-is totally the charge of its new owner; in Linux, however, the page management
-system still traces it by the PG_active flag.
-
-So I conceive another solution:) That is, set up an independent page-recycle
-system rooted on the Linux legacy page system -- pps, intercept all private
-pages belonging to PrivateVMAs into pps, then use my pps to recycle them. By
-the way, the whole job should consist of two parts; here is the first --
-PrivateVMA-oriented (PPS), the other is SharedVMA-oriented (should be called
