Re: [PATCH 1/6] cpuset write dirty map
Andrew Morton wrote:
> On Tue, 11 Sep 2007 18:36:34 -0700
> Ethan Solomita <[EMAIL PROTECTED]> wrote:
>
>> Add a dirty map to struct address_space
>
> I get a tremendous number of rejects trying to wedge this stuff on top of
> Peter's mm-dirty-balancing-for-tasks changes.  More rejects than I am
> prepared to partially-fix so that I can usefully look at these changes in
> tkdiff, so this is all based on a quick peek at the diff itself.

This isn't surprising. We're both changing the calculation of dirty
limits. If his code is already in your workspace, then I'll have to do the
merging after you release it.

>> +#if MAX_NUMNODES <= BITS_PER_LONG
>
> The patch is sprinkled full of this conditional.
>
> I don't understand why this is being done.  afaict it isn't described
> in a code comment (it should be) nor even in the changelogs?

I can add comments.

> Given its overall complexity and its likelihood to change in the
> future, I'd suggest that this conditional be centralised in a single
> place.  Something like
>
> /*
>  * nice comment goes here
>  */
> #if MAX_NUMNODES <= BITS_PER_LONG
> #define CPUSET_DIRTY_LIMITS 1
> #else
> #define CPUSET_DIRTY_LIMITS 0
> #endif
>
> Then use #if CPUSET_DIRTY_LIMITS everywhere else.
>
> (This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll get a
> warning if someone typos '#if CPUSET_DITRY_LIMITS')

I can add something like this.
Probably something like: CPUSET_DIRTY_LIMITS_USEPTR

>> --- 0/include/linux/fs.h	2007-09-11 14:35:58.0 -0700
>> +++ 1/include/linux/fs.h	2007-09-11 14:36:24.0 -0700
>> @@ -516,6 +516,13 @@ struct address_space {
>>  	spinlock_t		private_lock;	/* for use by the address_space */
>>  	struct list_head	private_list;	/* ditto */
>>  	struct address_space	*assoc_mapping;	/* ditto */
>> +#ifdef CONFIG_CPUSETS
>> +#if MAX_NUMNODES <= BITS_PER_LONG
>> +	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
>> +#else
>> +	nodemask_t		*dirty_nodes;	/* pointer to map if dirty */
>> +#endif
>> +#endif
>
> afacit there is no code comment and no changelog text which explains the
> above design decision?  There should be, please.

OK.

> There is talk of making cpusets available with CONFIG_SMP=n.  Will this new
> feature be available in that case?  (it should be).

I'm not sure how useful it would be in that scenario, but for consistency we
should still be able to specify varying dirty ratios (from patch 6/6). The
above code wouldn't mean anything with SMP=n since there's only the one
node. We'd just be indicating whether the inode has any dirty pages, which
we already know.

>> 	} __attribute__((aligned(sizeof(long))));
>> 	/*
>> 	 * On most architectures that alignment is already the case; but
>> diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
>> --- 0/include/linux/writeback.h	2007-09-11 14:35:58.0 -0700
>> +++ 1/include/linux/writeback.h	2007-09-11 14:37:46.0 -0700
>> @@ -62,6 +62,7 @@ struct writeback_control {
>>  	unsigned for_writepages:1;	/* This is a writepages() call */
>>  	unsigned range_cyclic:1;	/* range_start is cyclic */
>>  	void *fs_private;		/* For use by ->writepages() */
>> +	nodemask_t *nodes;		/* Set of nodes of interest */
>> };
>
> That comment is a bit terse.  It's always good to be lavish when commenting
> data structures, for understanding those is key to understanding a design.
OK.

>> /*
>> diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
>> --- 0/kernel/cpuset.c	2007-09-11 14:35:58.0 -0700
>> +++ 1/kernel/cpuset.c	2007-09-11 14:36:24.0 -0700
>> @@ -4,7 +4,7 @@
>>   *  Processor and Memory placement constraints for sets of tasks.
>>   *
>>   *  Copyright (C) 2003 BULL SA.
>> - *  Copyright (C) 2004-2006 Silicon Graphics, Inc.
>> + *  Copyright (C) 2004-2007 Silicon Graphics, Inc.
>>   *  Copyright (C) 2006 Google, Inc
>>   *
>>   *  Portions derived from Patrick Mochel's sysfs code.
>> @@ -14,6 +14,7 @@
>>   *  2003-10-22 Updates by Stephen Hemminger.
>>   *  2004 May-July Rework by Paul Jackson.
>>   *  2006 Rework by Paul Menage to use generic containers
>> + *  2007 Cpuset writeback by Christoph Lameter.
>>   *
>>   *  This file is su
Re: [PATCH 6/6] cpuset dirty limits
Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Andrew Morton wrote:
>
>>> +	mutex_lock(&callback_mutex);
>>> +	*cs_int = val;
>>> +	mutex_unlock(&callback_mutex);
>>
>> I don't think this locking does anything?
>
> Locking is wrong here. The lock needs to be taken before the cs pointer
> is dereferenced from the caller.

I think we can just remove the callback_mutex lock. Since the change is
coming from an update to a cpuset filesystem file, the cpuset is not going
anywhere since the inode is open. And I don't see that any code really
cares whether the dirty ratios change out from under them.

>>> +	return 0;
>>> +}
>>> +
>>> /*
>>>  * Frequency meter - How fast is some event occurring?
>>>  *
>>> ...
>>> +void cpuset_get_current_ratios(int *background_ratio, int *throttle_ratio)
>>> +{
>>> +	int background = -1;
>>> +	int throttle = -1;
>>> +	struct task_struct *tsk = current;
>>> +
>>> +	task_lock(tsk);
>>> +	background = task_cs(tsk)->background_dirty_ratio;
>>> +	throttle = task_cs(tsk)->throttle_dirty_ratio;
>>> +	task_unlock(tsk);
>>
>> ditto?
>
> It is required to take the task lock while dereferencing the task's cpuset
> pointer.

Agreed.
-- Ethan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
[PATCH 6/6] cpuset dirty limits
Per cpuset dirty ratios

This implements dirty ratios per cpuset. Two new files are added to the
cpuset directories:

background_dirty_ratio	Percentage at which background writeback starts

throttle_dirty_ratio	Percentage at which the application is throttled
			and we start synchronous writeout.

Both variables are set to -1 by default which means that the global limits
(/proc/sys/vm/vm_dirty_ratio and /proc/sys/vm/dirty_background_ratio) are
used for a cpuset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---
Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 5/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 5/include/linux/cpuset.h	2007-09-11 14:50:48.0 -0700
+++ 7/include/linux/cpuset.h	2007-09-11 14:51:12.0 -0700
@@ -77,6 +77,7 @@ extern void cpuset_track_online_nodes(vo
 extern int current_cpuset_is_being_rebound(void);

+extern void cpuset_get_current_ratios(int *background, int *ratio);
 /*
  * We need macros since struct address_space is not defined yet
  */
diff -uprN -X 0/Documentation/dontdiff 5/kernel/cpuset.c 7/kernel/cpuset.c
--- 5/kernel/cpuset.c	2007-09-11 14:50:49.0 -0700
+++ 7/kernel/cpuset.c	2007-09-11 14:56:18.0 -0700
@@ -51,6 +51,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -92,6 +93,9 @@ struct cpuset {
 	int mems_generation;

 	struct fmeter fmeter;		/* memory_pressure filter */
+
+	int background_dirty_ratio;
+	int throttle_dirty_ratio;
 };

 /* Retrieve the cpuset for a container */
@@ -169,6 +173,8 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.background_dirty_ratio = -1,
+	.throttle_dirty_ratio = -1,
 };

 /*
@@ -785,6 +791,21 @@ static int update_flag(cpuset_flagbits_t
 	return 0;
 }

+static int update_int(int *cs_int, char *buf, int min, int max)
+{
+	char *endp;
+	int val;
+
+	val = simple_strtol(buf, &endp, 10);
+	if (val < min || val > max)
+		return -EINVAL;
+
+	mutex_lock(&callback_mutex);
+	*cs_int = val;
+	mutex_unlock(&callback_mutex);
+	return 0;
+}
+
 /*
  * Frequency meter - How fast is some event occurring?
  *
@@ -933,6 +954,8 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_THROTTLE_DIRTY_RATIO,
+	FILE_BACKGROUND_DIRTY_RATIO,
 } cpuset_filetype_t;

 static ssize_t cpuset_common_file_write(struct container *cont,
@@ -997,6 +1020,12 @@ static ssize_t cpuset_common_file_write(
 		retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
 		cs->mems_generation = cpuset_mems_generation++;
 		break;
+	case FILE_BACKGROUND_DIRTY_RATIO:
+		retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100);
+		break;
+	case FILE_THROTTLE_DIRTY_RATIO:
+		retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out2;
@@ -1090,6 +1119,12 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_SPREAD_SLAB:
 		*s++ = is_spread_slab(cs) ? '1' : '0';
 		break;
+	case FILE_BACKGROUND_DIRTY_RATIO:
+		s += sprintf(s, "%d", cs->background_dirty_ratio);
+		break;
+	case FILE_THROTTLE_DIRTY_RATIO:
+		s += sprintf(s, "%d", cs->throttle_dirty_ratio);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1173,6 +1208,20 @@ static struct cftype cft_spread_slab = {
 	.private = FILE_SPREAD_SLAB,
 };

+static struct cftype cft_background_dirty_ratio = {
+	.name = "background_dirty_ratio",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_BACKGROUND_DIRTY_RATIO,
+};
+
+static struct cftype cft_throttle_dirty_ratio = {
+	.name = "throttle_dirty_ratio",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_THROTTLE_DIRTY_RATIO,
+};
+
 static int cpuset_populate(struct container_subsys *ss, struct container *cont)
 {
 	int err;
@@ -1193,6 +1242,10 @@ static int cpuset_populate(struct contai
 		return err;
 	if ((err = container_add_file(cont, ss, &cft_spread_slab)) < 0)
 		return err;
+	if ((err = container_add_file(cont, ss, &cft_background_dirty_ratio)) < 0)
+		return err;
+	if ((err = container_add_file(cont, ss, &cft_throttle_di
[PATCH 5/6] cpuset write vm writeout
Throttle VM writeout in a cpuset aware way

This bases the vm throttling from the reclaim path on the dirty ratio of
the cpuset. Note that the cpuset nodemask is only effective if shrink_zone
is called from direct reclaim. kswapd has a cpuset context that includes
the whole machine, so VM throttling will only work during synchronous
reclaim and not from kswapd.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---
Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h	2007-09-11 14:49:47.0 -0700
+++ 5/include/linux/writeback.h	2007-09-11 14:50:52.0 -0700
@@ -94,7 +94,7 @@ static inline void inode_sync_wait(struc
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask);

 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c	2007-09-11 14:49:47.0 -0700
+++ 5/mm/page-writeback.c	2007-09-11 14:50:52.0 -0700
@@ -386,7 +386,7 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);

-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
 	struct dirty_limits dl;

@@ -401,7 +401,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 	for ( ; ; ) {
-		get_dirty_limits(&dl, NULL, &node_online_map);
+		get_dirty_limits(&dl, NULL, nodes);

 		/*
 		 * Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c	2007-09-11 14:50:41.0 -0700
+++ 5/mm/vmscan.c	2007-09-11 14:50:52.0 -0700
@@ -1185,7 +1185,7 @@ static unsigned long shrink_zone(int pri
 		}
 	}

-	throttle_vm_writeout(sc->gfp_mask);
+	throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);

 	atomic_dec(&zone->reclaim_in_progress);

 	return nr_reclaimed;
[PATCH 3/6] cpuset write throttle
Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset. If, e.g., a
cpuset contains only 1/10th of available memory then all of the memory of
the cpuset can be dirtied without any writes being triggered. If all of the
cpuset's memory is dirty then only 10% of total memory is dirty. The
background writeback threshold is usually set at 10% and the synchronous
threshold at 40%. So we are still below the global limits while the dirty
ratio in the cpuset is 100%! Writeback throttling and background writeout
do not work at all in such scenarios.

This patch makes dirty writeout cpuset aware. When determining the dirty
limits in get_dirty_limits() we calculate values based on the nodes that
are reachable from the current process (that has been dirtying the page).
Then we can trigger writeout based on the dirty ratio of the memory in the
cpuset.

We trigger writeout in a cpuset-specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have a
cpuset. However, please remember that the global counters were only
introduced recently. Before 2.6.18 we did add up per-processor counters for
each processor on each invocation of get_dirty_limits(). We now add up
per-node information, which I think is equal or less effort since there are
fewer nodes than processors.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-09-11 14:39:22.0 -0700 +++ 3/mm/page-writeback.c 2007-09-11 14:49:35.0 -0700 @@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); +struct dirty_limits { + long thresh_background; + long thresh_dirty; + unsigned long nr_dirty; + unsigned long nr_unstable; + unsigned long nr_writeback; +}; + /* * Work out the current dirty-memory clamping and background writeout * thresholds. @@ -121,16 +129,20 @@ static void background_writeout(unsigned * clamping level. */ -static unsigned long highmem_dirtyable_memory(unsigned long total) +static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total) { #ifdef CONFIG_HIGHMEM int node; unsigned long x = 0; + if (nodes == NULL) + nodes = &node_online_mask; for_each_node_state(node, N_HIGH_MEMORY) { struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM]; + if (!node_isset(node, nodes)) + continue; x += zone_page_state(z, NR_FREE_PAGES) + zone_page_state(z, NR_INACTIVE) + zone_page_state(z, NR_ACTIVE); @@ -154,26 +166,74 @@ static unsigned long determine_dirtyable x = global_page_state(NR_FREE_PAGES) + global_page_state(NR_INACTIVE) + global_page_state(NR_ACTIVE); - x -= highmem_dirtyable_memory(x); + x -= highmem_dirtyable_memory(NULL, x); return x + 1; /* Ensure that we never return 0 */ } -static void -get_dirty_limits(long *pbackground, long *pdirty, - struct address_space *mapping) +static int +get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping, + nodemask_t *nodes) { int background_ratio; /* Percentages */ int dirty_ratio; int unmapped_ratio; long background; long dirty; - unsigned long available_memory = determine_dirtyable_memory(); + unsigned long 
available_memory; + unsigned long nr_mapped; struct task_struct *tsk; + int is_subset = 0; - unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) + - global_page_state(NR_ANON_PAGES)) * 100) / - available_memory; +#ifdef CONFIG_CPUSETS + if (unlikely(nodes && + !nodes_subset(node_online_map, *nodes))) { + int node; + /* +* Calculate the limits relative to the current cpuset. +* +* We do not disregard highmem because all nodes (except +* maybe node 0) have either all memory in HIGHMEM (32 bit) or +* all memory in non HIGHMEM (64 bit). If we would disregard +
[PATCH 4/6] cpuset write vmscan
Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist and carefully check each
zone to see if it's a member of the active cpuset. But then we call pdflush
without enforcing the same restrictions. In a larger system this may have
the effect of a massive amount of pages being dirtied and then either

A. No writeout occurs because global dirty limits have not been reached, or

B. Writeout starts randomly for some dirty inode in the system. Pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected and
writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---
Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c	2007-09-11 14:41:56.0 -0700
+++ 4/mm/vmscan.c	2007-09-11 14:50:41.0 -0700
@@ -1301,7 +1301,8 @@ unsigned long do_try_to_free_pages(struc
 	 */
 	if (total_scanned > sc->swap_cluster_max +
 				sc->swap_cluster_max / 2) {
-		wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+		wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				&cpuset_current_mems_allowed);
 		sc->may_writepage = 1;
 	}
[PATCH 2/6] cpuset write pdflush nodemask
pdflush: Allow the passing of a nodemask parameter If we want to support nodeset specific writeout then we need a way to communicate the set of nodes that an operation should affect. So add a nodemask_t parameter to the pdflush functions and also store the nodemask in the pdflush control structure. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c --- 1/fs/buffer.c 2007-09-11 14:36:24.0 -0700 +++ 2/fs/buffer.c 2007-09-11 14:39:22.0 -0700 @@ -372,7 +372,7 @@ static void free_more_memory(void) struct zone **zones; pg_data_t *pgdat; - wakeup_pdflush(1024); + wakeup_pdflush(1024, NULL); yield(); for_each_online_pgdat(pgdat) { diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c --- 1/fs/super.c2007-09-11 14:36:05.0 -0700 +++ 2/fs/super.c2007-09-11 14:39:22.0 -0700 @@ -616,7 +616,7 @@ int do_remount_sb(struct super_block *sb return 0; } -static void do_emergency_remount(unsigned long foo) +static void do_emergency_remount(unsigned long foo, nodemask_t *bar) { struct super_block *sb; @@ -644,7 +644,7 @@ static void do_emergency_remount(unsigne void emergency_remount(void) { - pdflush_operation(do_emergency_remount, 0); + pdflush_operation(do_emergency_remount, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c --- 1/fs/sync.c 2007-09-11 14:36:05.0 -0700 +++ 2/fs/sync.c 2007-09-11 14:39:22.0 -0700 @@ -21,9 +21,9 @@ * sync everything. Start out by waking pdflush, because that writes back * all queues in parallel. 
*/ -static void do_sync(unsigned long wait) +static void do_sync(unsigned long wait, nodemask_t *unused) { - wakeup_pdflush(0); + wakeup_pdflush(0, NULL); sync_inodes(0); /* All mappings, inodes and their blockdevs */ DQUOT_SYNC(NULL); sync_supers(); /* Write the superblocks */ @@ -38,13 +38,13 @@ static void do_sync(unsigned long wait) asmlinkage long sys_sync(void) { - do_sync(1); + do_sync(1, NULL); return 0; } void emergency_sync(void) { - pdflush_operation(do_sync, 0); + pdflush_operation(do_sync, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h --- 1/include/linux/writeback.h 2007-09-11 14:37:46.0 -0700 +++ 2/include/linux/writeback.h 2007-09-11 14:39:22.0 -0700 @@ -91,7 +91,7 @@ static inline void inode_sync_wait(struc /* * mm/page-writeback.c */ -int wakeup_pdflush(long nr_pages); +int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); void throttle_vm_writeout(gfp_t gfp_mask); @@ -122,7 +122,8 @@ balance_dirty_pages_ratelimited(struct a typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, void *data); -int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); +int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes), + unsigned long arg0, nodemask_t *nodes); int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); int write_cache_pages(struct address_space *mapping, diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c --- 1/mm/page-writeback.c 2007-09-11 14:36:24.0 -0700 +++ 2/mm/page-writeback.c 2007-09-11 14:39:22.0 -0700 @@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -static void background_writeout(unsigned long _min_pages); +static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); /* * Work out the current dirty-memory clamping and background writeout @@ -272,7 
+272,7 @@ static void balance_dirty_pages(struct a */ if ((laptop_mode && pages_written) || (!laptop_mode && (nr_reclaimable > background_thresh))) - pdflush_operation(background_writeout, 0); + pdflush_operation(background_writeout, 0, NULL); } void set_page_dirty_balance(struct page *page) @@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. */ -static void background_writeout(unsigned long _min_pages) +static void background_writeout(unsigned long _min_pages, nodemask_t *unused) { long min_pages = _min_pages; struct writeback_control wbc = { @@ -402,12 +402,12 @@ static void background_writeout(unsigned * the whole world. Returns 0 if a pdflush thread was dispat
[PATCH 1/6] cpuset write dirty map
Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for
applications that are constrained to a portion of the memory of the system
as required by cpusets.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes

cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes map

cpuset_init_dirty_nodes(mapping)	First time init of the map

The dirty map may be stored either directly in the mapping (for NUMA
systems with less than BITS_PER_LONG nodes) or separately allocated for
systems with a large number of nodes (e.g. IA64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large
configurations. Therefore we protect the allocation and setting of a node
in the map through the tree_lock. The tree_lock is already taken when a
page is dirtied so there is no additional locking overhead if we insert the
updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared. At that
point no pages are attached to the inode anymore and therefore it can be
done without any locking. The dirty map therefore records all nodes that
have been used for dirty pages by that inode until the inode is no longer
used.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c --- 0/fs/buffer.c 2007-09-11 14:35:58.0 -0700 +++ 1/fs/buffer.c 2007-09-11 14:36:24.0 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -723,6 +724,7 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + cpuset_update_dirty_nodes(mapping, page); write_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c --- 0/fs/fs-writeback.c 2007-09-11 14:35:58.0 -0700 +++ 1/fs/fs-writeback.c 2007-09-11 14:36:24.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include #include "internal.h" int sysctl_inode_debug __read_mostly; @@ -476,6 +477,12 @@ int generic_sync_sb_inodes(struct super_ continue; /* blockdev has wrong queue */ } + if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) { + /* No pages on the nodes under writeback */ + list_move(&inode->i_list, &sb->s_dirty); + continue; + } + /* Was this inode dirtied after sync_sb_inodes was called? 
*/ if (time_after(inode->dirtied_when, start)) break; diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c --- 0/fs/inode.c2007-09-11 14:35:58.0 -0700 +++ 1/fs/inode.c2007-09-11 14:36:24.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include /* * This is needed for the following functions: @@ -157,6 +158,7 @@ static struct inode *alloc_inode(struct mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; + cpuset_init_dirty_nodes(mapping); /* * If the block_device provides a backing_dev_info for client @@ -264,6 +266,7 @@ void clear_inode(struct inode *inode) bd_forget(inode); if (S_ISCHR(inode->i_mode) && inode->i_cdev) cd_forget(inode); + cpuset_clear_dirty_nodes(inode->i_mapping); inode->i_state = I_CLEAR; } diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h --- 0/include/linux/cpuset.h2007-09-11 14:35:58.0 -0700 +++ 1/include/linux/cpuset.h2007-09-11 14:36:24.0 -0700 @@ -77,6 +77,45 @@ extern void cpuset_track_online_nodes(vo extern int current_cpuset_is_being_rebound(void); +/* + * We need macros since struct address_space is not defined yet + */ +#if MAX_NUMNODES <= BITS_PER_LONG +#define cpuset_update_dirty_nodes(__mapping, __page) \ + do {\ + int node = page_to_nid(__page); \ + if (!node_isset(node, (__mapping)->dirty_nodes))\ +
Re: [PATCH 0/6] cpuset aware writeback
Perform writeback and dirty throttling with awareness of cpuset
mems_allowed.

The theory of operation has two primary elements:

1. Add a nodemask per mapping which indicates the nodes which have set
   PageDirty on any page of the mapping.

2. Add a nodemask argument to wakeup_pdflush() which is propagated down
   to sync_sb_inodes.

This leaves sync_sb_inodes() with two nodemasks. One is passed to it and
specifies the nodes the caller is interested in syncing; it will either be
NULL (i.e. all nodes) or will be cpuset_current_mems_allowed in the
caller's context. The second nodemask is attached to the inode's mapping
and shows who has modified data in the inode. sync_sb_inodes() will then
skip syncing of inodes if the nodemask argument does not intersect with the
mapping nodemask.

cpuset_current_mems_allowed will be passed in to pdflush
background_writeout by try_to_free_pages and balance_dirty_pages.
balance_dirty_pages also passes the nodemask in to writeback_inodes
directly when doing active reclaim. Other callers do not limit inode
writeback, passing in a NULL nodemask pointer.

A final change is to get_dirty_limits. It takes a nodemask argument, and
when it is NULL there is no change in behavior. If the nodemask is set,
page statistics are accumulated only for the specified nodes, and the
background and throttle dirty ratios will be read from a new per-cpuset
ratio feature.

For testing I did a variety of basic tests, verifying individual features
of the patchset. To verify that it fixes the core problem, I created a
stress test which involved using cpusets and mems_allowed to split memory
so that all daemons had memory set aside for them, and my memory stress
test had a separate set of memory. The stress test was mmaping 7GB of a
very large file on disk. It then scans the entire 7GB of memory, reading
and modifying each byte. 7GB is more than the amount of physical memory
made available to the stress test.
Using iostat I can see the initial period of reading from disk, followed by
a period of simultaneous reads and writes as dirty bytes are pushed to make
room for new reads.

In a separate log-in, in the other cpuset, I am running:

while `true`; do date | tee -a date.txt; sleep 5; done

date.txt resides on the same disk as the large file mentioned above. The
above while-loop serves the dual purpose of providing me visual clues of
progress along with the opportunity for the "tee" command to become
throttled writing to the disk.

The effect of this patchset is straightforward. Without it there are long
hangs between appearances of the date. With it the dates are all 5 (or
sometimes 6) seconds apart. I also added printks to the kernel to verify
that, without these patches, the tee was being throttled (along with lots
of other things), and with the patch only pdflush is being throttled.

These patches are mostly unchanged from Chris Lameter's original changelist
posted previously to linux-mm.
Re: [PATCH 0/6] cpuset aware writeback
Christoph Lameter wrote:
> On Tue, 17 Jul 2007 14:23:14 -0700
> Ethan Solomita <[EMAIL PROTECTED]> wrote:
>
>> These patches are mostly unchanged from Chris Lameter's original
>> changelist posted previously to linux-mm.
>
> Thanks for keeping these patches up to date. Add your signoff if you
> did modifications to a patch. Also include the description of the tests
> in the introduction to the patchset.

So switch from an Ack to a Signed-off-by? OK, and I'll add descriptions of
testing.

Everyone other than you has been silent on these patches. Does silence
equal consent?
-- Ethan
[PATCH 6/6] cpuset dirty limits
Per cpuset dirty ratios This implements dirty ratios per cpuset. Two new files are added to the cpuset directories: background_dirty_ratio Percentage at which background writeback starts throttle_dirty_ratioPercentage at which the application is throttled and we start synchrononous writeout. Both variables are set to -1 by default which means that the global limits (/proc/sys/vm/vm_dirty_ratio and /proc/sys/vm/dirty_background_ratio) are used for a cpuset. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 6/include/linux/cpuset.h 7/include/linux/cpuset.h --- 6/include/linux/cpuset.h2007-07-11 21:17:08.0 -0700 +++ 7/include/linux/cpuset.h2007-07-11 21:17:41.0 -0700 @@ -76,6 +76,7 @@ extern void cpuset_track_online_nodes(vo extern int current_cpuset_is_being_rebound(void); +extern void cpuset_get_current_ratios(int *background, int *ratio); /* * We need macros since struct address_space is not defined yet */ diff -uprN -X 0/Documentation/dontdiff 6/kernel/cpuset.c 7/kernel/cpuset.c --- 6/kernel/cpuset.c 2007-07-12 12:15:20.0 -0700 +++ 7/kernel/cpuset.c 2007-07-12 12:15:34.0 -0700 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -92,6 +93,9 @@ struct cpuset { int mems_generation; struct fmeter fmeter; /* memory_pressure filter */ + + int background_dirty_ratio; + int throttle_dirty_ratio; }; /* Update the cpuset for a container */ @@ -175,6 +179,8 @@ static struct cpuset top_cpuset = { .flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)), .cpus_allowed = CPU_MASK_ALL, .mems_allowed = NODE_MASK_ALL, + .background_dirty_ratio = -1, + .throttle_dirty_ratio = -1, }; /* @@ -776,6 +782,21 @@ static int update_flag(cpuset_flagbits_t return 0; } +static int update_int(int *cs_int, char *buf, int min, int max) +{ + char *endp; + int val; + + val = simple_strtol(buf, &endp, 10); + if (val < min || val > max) + return -EINVAL; 
+ + mutex_lock(&callback_mutex); + *cs_int = val; + mutex_unlock(&callback_mutex); + return 0; +} + /* * Frequency meter - How fast is some event occurring? * @@ -924,6 +945,8 @@ typedef enum { FILE_MEMORY_PRESSURE, FILE_SPREAD_PAGE, FILE_SPREAD_SLAB, + FILE_THROTTLE_DIRTY_RATIO, + FILE_BACKGROUND_DIRTY_RATIO, } cpuset_filetype_t; static ssize_t cpuset_common_file_write(struct container *cont, @@ -988,6 +1011,12 @@ static ssize_t cpuset_common_file_write( retval = update_flag(CS_SPREAD_SLAB, cs, buffer); cs->mems_generation = cpuset_mems_generation++; break; + case FILE_BACKGROUND_DIRTY_RATIO: + retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100); + break; + case FILE_THROTTLE_DIRTY_RATIO: + retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100); + break; default: retval = -EINVAL; goto out2; @@ -1081,6 +1110,12 @@ static ssize_t cpuset_common_file_read(s case FILE_SPREAD_SLAB: *s++ = is_spread_slab(cs) ? '1' : '0'; break; + case FILE_BACKGROUND_DIRTY_RATIO: + s += sprintf(s, "%d", cs->background_dirty_ratio); + break; + case FILE_THROTTLE_DIRTY_RATIO: + s += sprintf(s, "%d", cs->throttle_dirty_ratio); + break; default: retval = -EINVAL; goto out; @@ -1164,6 +1199,20 @@ static struct cftype cft_spread_slab = { .private = FILE_SPREAD_SLAB, }; +static struct cftype cft_background_dirty_ratio = { + .name = "background_dirty_ratio", + .read = cpuset_common_file_read, + .write = cpuset_common_file_write, + .private = FILE_BACKGROUND_DIRTY_RATIO, +}; + +static struct cftype cft_throttle_dirty_ratio = { + .name = "throttle_dirty_ratio", + .read = cpuset_common_file_read, + .write = cpuset_common_file_write, + .private = FILE_THROTTLE_DIRTY_RATIO, +}; + int cpuset_populate(struct container_subsys *ss, struct container *cont) { int err; @@ -1184,6 +1233,10 @@ int cpuset_populate(struct container_sub return err; if ((err = container_add_file(cont, &cft_spread_slab)) < 0) return err; + if ((err = container_add_file(cont, &cft_background_dirty_ratio)) < 
0) + return err; + if ((err = container_add_file(cont, &cft_throttle_dirty_ratio)) < 0) +
[PATCH 5/6] cpuset write vm writeout
Throttle VM writeout in a cpuset aware way This bases the vm throttling from the reclaim path on the dirty ratio of the cpuset. Note that a cpuset is only effective if shrink_zone is called from direct reclaim. kswapd has a cpuset context that includes the whole machine. VM throttling will only work during synchronous reclaim and not from kswapd. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h --- 4/include/linux/writeback.h 2007-07-11 21:16:25.0 -0700 +++ 5/include/linux/writeback.h 2007-07-11 21:16:50.0 -0700 @@ -95,7 +95,7 @@ static inline void inode_sync_wait(struc int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(gfp_t gfp_mask); +void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask); /* These are exported to sysctl.
*/ extern int dirty_background_ratio; diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c --- 4/mm/page-writeback.c 2007-07-16 18:31:13.0 -0700 +++ 5/mm/page-writeback.c 2007-07-16 18:32:08.0 -0700 @@ -384,7 +384,7 @@ void balance_dirty_pages_ratelimited_nr( } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(gfp_t gfp_mask) +void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask) { struct dirty_limits dl; @@ -399,7 +399,7 @@ void throttle_vm_writeout(gfp_t gfp_mask } for ( ; ; ) { - get_dirty_limits(&dl, NULL, &node_online_map); + get_dirty_limits(&dl, NULL, nodes); /* * Boost the allowable dirty threshold a bit for page diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c --- 4/mm/vmscan.c 2007-07-11 21:16:26.0 -0700 +++ 5/mm/vmscan.c 2007-07-11 21:16:50.0 -0700 @@ -1064,7 +1064,7 @@ static unsigned long shrink_zone(int pri } } - throttle_vm_writeout(sc->gfp_mask); + throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask); atomic_dec(&zone->reclaim_in_progress); return nr_reclaimed; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/6] cpuset write vmscan
Direct reclaim: cpuset aware writeout During direct reclaim we traverse down a zonelist, carefully checking whether each zone is a member of the active cpuset. But then we call pdflush without enforcing the same restrictions. In a larger system this may have the effect of a massive amount of pages being dirtied and then either A. No writeout occurs because global dirty limits have not been reached or B. Writeout starts randomly for some dirty inode in the system. Pdflush may just write out data for nodes in another cpuset and miss doing proper dirty handling for the current cpuset. In both cases dirty pages in the zones of interest may not be affected and writeout may not occur as necessary. Fix that by restricting pdflush to the active cpuset. Writeout will occur from direct reclaim the same way as without a cpuset. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c --- 3/mm/vmscan.c 2007-07-11 21:16:14.0 -0700 +++ 4/mm/vmscan.c 2007-07-11 21:16:26.0 -0700 @@ -1183,7 +1183,8 @@ unsigned long try_to_free_pages(struct z */ if (total_scanned > sc.swap_cluster_max + sc.swap_cluster_max / 2) { - wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL); + wakeup_pdflush(laptop_mode ? 0 : total_scanned, + &cpuset_current_mems_allowed); sc.may_writepage = 1; }
[PATCH 2/6] cpuset write pdflush nodemask
pdflush: Allow the passing of a nodemask parameter If we want to support nodeset specific writeout then we need a way to communicate the set of nodes that an operation should affect. So add a nodemask_t parameter to the pdflush functions and also store the nodemask in the pdflush control structure. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c --- 1/fs/buffer.c 2007-07-11 21:08:04.0 -0700 +++ 2/fs/buffer.c 2007-07-11 21:15:47.0 -0700 @@ -359,7 +359,7 @@ static void free_more_memory(void) struct zone **zones; pg_data_t *pgdat; - wakeup_pdflush(1024); + wakeup_pdflush(1024, NULL); yield(); for_each_online_pgdat(pgdat) { diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c --- 1/fs/super.c2007-07-11 21:07:41.0 -0700 +++ 2/fs/super.c2007-07-11 21:15:47.0 -0700 @@ -615,7 +615,7 @@ int do_remount_sb(struct super_block *sb return 0; } -static void do_emergency_remount(unsigned long foo) +static void do_emergency_remount(unsigned long foo, nodemask_t *bar) { struct super_block *sb; @@ -643,7 +643,7 @@ static void do_emergency_remount(unsigne void emergency_remount(void) { - pdflush_operation(do_emergency_remount, 0); + pdflush_operation(do_emergency_remount, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c --- 1/fs/sync.c 2007-07-11 21:07:41.0 -0700 +++ 2/fs/sync.c 2007-07-11 21:15:47.0 -0700 @@ -21,9 +21,9 @@ * sync everything. Start out by waking pdflush, because that writes back * all queues in parallel. 
*/ -static void do_sync(unsigned long wait) +static void do_sync(unsigned long wait, nodemask_t *unused) { - wakeup_pdflush(0); + wakeup_pdflush(0, NULL); sync_inodes(0); /* All mappings, inodes and their blockdevs */ DQUOT_SYNC(NULL); sync_supers(); /* Write the superblocks */ @@ -38,13 +38,13 @@ static void do_sync(unsigned long wait) asmlinkage long sys_sync(void) { - do_sync(1); + do_sync(1, NULL); return 0; } void emergency_sync(void) { - pdflush_operation(do_sync, 0); + pdflush_operation(do_sync, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h --- 1/include/linux/writeback.h 2007-07-11 21:12:25.0 -0700 +++ 2/include/linux/writeback.h 2007-07-11 21:15:47.0 -0700 @@ -92,7 +92,7 @@ static inline void inode_sync_wait(struc /* * mm/page-writeback.c */ -int wakeup_pdflush(long nr_pages); +int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); void throttle_vm_writeout(gfp_t gfp_mask); @@ -123,7 +123,8 @@ balance_dirty_pages_ratelimited(struct a typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, void *data); -int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); +int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes), + unsigned long arg0, nodemask_t *nodes); int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); int write_cache_pages(struct address_space *mapping, diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c --- 1/mm/page-writeback.c 2007-07-11 21:08:04.0 -0700 +++ 2/mm/page-writeback.c 2007-07-11 21:15:47.0 -0700 @@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -static void background_writeout(unsigned long _min_pages); +static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); /* * Work out the current dirty-memory clamping and background writeout @@ -272,7 
+272,7 @@ static void balance_dirty_pages(struct a */ if ((laptop_mode && pages_written) || (!laptop_mode && (nr_reclaimable > background_thresh))) - pdflush_operation(background_writeout, 0); + pdflush_operation(background_writeout, 0, NULL); } void set_page_dirty_balance(struct page *page) @@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. */ -static void background_writeout(unsigned long _min_pages) +static void background_writeout(unsigned long _min_pages, nodemask_t *unused) { long min_pages = _min_pages; struct writeback_control wbc = { @@ -402,12 +402,12 @@ static void background_writeout(unsigned * the whole world. Returns 0 if a pdflush thread was dispat
[PATCH 3/6] cpuset write throttle
Make page writeback obey cpuset constraints Currently dirty throttling does not work properly in a cpuset. If, for example, a cpuset contains only 1/10th of available memory then all of the memory of a cpuset can be dirtied without any writes being triggered. If all of the cpuset's memory is dirty then only 10% of total memory is dirty. The background writeback threshold is usually set at 10% and the synchronous threshold at 40%. So we are still below the global limits while the dirty ratio in the cpuset is 100%! Writeback throttling and background writeout do not work at all in such scenarios. This patch makes dirty writeout cpuset aware. When determining the dirty limits in get_dirty_limits() we calculate values based on the nodes that are reachable from the current process (that has been dirtying the page). Then we can trigger writeout based on the dirty ratio of the memory in the cpuset. We trigger writeout in a cpuset-specific way. We go through the dirty inodes and search for inodes that have dirty pages on the nodes of the active cpuset. If an inode fulfills that requirement then we begin writeout of the dirty pages of that inode. Adding up all the counters for each node in a cpuset may seem to be quite an expensive operation (in particular for large cpusets with hundreds of nodes) compared to just accessing the global counters if we do not have a cpuset. However, please remember that the global counters were only introduced recently. Before 2.6.18 we did add up per processor counters for each processor on each invocation of get_dirty_limits(). We now add per node information which I think is equal or less effort since there are fewer nodes than processors.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-07-11 21:15:47.0 -0700 +++ 3/mm/page-writeback.c 2007-07-16 18:30:01.0 -0700 @@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); +struct dirty_limits { + long thresh_background; + long thresh_dirty; + unsigned long nr_dirty; + unsigned long nr_unstable; + unsigned long nr_writeback; +}; + /* * Work out the current dirty-memory clamping and background writeout * thresholds. @@ -121,13 +129,15 @@ static void background_writeout(unsigned * clamping level. */ -static unsigned long highmem_dirtyable_memory(unsigned long total) +static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total) { #ifdef CONFIG_HIGHMEM int node; unsigned long x = 0; - for_each_online_node(node) { + if (nodes == NULL) + nodes = &node_online_mask; + for_each_node_mask(node, *nodes) { struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM]; @@ -154,26 +164,74 @@ static unsigned long determine_dirtyable x = global_page_state(NR_FREE_PAGES) + global_page_state(NR_INACTIVE) + global_page_state(NR_ACTIVE); - x -= highmem_dirtyable_memory(x); + x -= highmem_dirtyable_memory(NULL, x); return x + 1; /* Ensure that we never return 0 */ } -static void -get_dirty_limits(long *pbackground, long *pdirty, - struct address_space *mapping) +static int +get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping, + nodemask_t *nodes) { int background_ratio; /* Percentages */ int dirty_ratio; int unmapped_ratio; long background; long dirty; - unsigned long available_memory = determine_dirtyable_memory(); + unsigned long available_memory; + unsigned long nr_mapped; struct task_struct *tsk; + int is_subset = 0; - unmapped_ratio = 100 - 
((global_page_state(NR_FILE_MAPPED) + - global_page_state(NR_ANON_PAGES)) * 100) / - available_memory; +#ifdef CONFIG_CPUSETS + if (unlikely(nodes && + !nodes_subset(node_online_map, *nodes))) { + int node; + /* +* Calculate the limits relative to the current cpuset. +* +* We do not disregard highmem because all nodes (except +* maybe node 0) have either all memory in HIGHMEM (32 bit) or +* all memory in non HIGHMEM (64 bit). If we would disregard +* highmem then cpuset throttling would not work on 32 bit. +*/ + is_subset = 1; + memset(dl, 0, sizeof(struct dirty_limits)); + available_memory = 0
[PATCH 1/6] cpuset write dirty map
Add a dirty map to struct address_space In a NUMA system it is helpful to know where the dirty pages of a mapping are located. That way we will be able to implement writeout for applications that are constrained to a portion of the memory of the system as required by cpusets. This patch implements the management of dirty node maps for an address space through the following functions: cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes cpuset_update_nodes(mapping, page) Record a node in the dirty nodes map cpuset_init_dirty_nodes(mapping) First time init of the map The dirty map may be stored either directly in the mapping (for NUMA systems with fewer than BITS_PER_LONG nodes) or separately allocated for systems with a large number of nodes (e.g. IA64 with 1024 nodes). Updating the dirty map may involve allocating it first for large configurations. Therefore we protect the allocation and setting of a node in the map through the tree_lock. The tree_lock is already taken when a page is dirtied so there is no additional locking overhead if we insert the updating of the nodemask there. The dirty map is only cleared (or freed) when the inode is cleared. At that point no pages are attached to the inode anymore and therefore it can be done without any locking. The dirty map therefore records all nodes that have been used for dirty pages by that inode until the inode is no longer used.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c --- 0/fs/buffer.c 2007-07-11 20:30:55.0 -0700 +++ 1/fs/buffer.c 2007-07-11 21:08:04.0 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -710,6 +711,7 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + cpuset_update_dirty_nodes(mapping, page); write_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c --- 0/fs/fs-writeback.c 2007-07-11 20:30:55.0 -0700 +++ 1/fs/fs-writeback.c 2007-07-11 21:08:04.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include #include "internal.h" int sysctl_inode_debug __read_mostly; @@ -492,6 +493,12 @@ int generic_sync_sb_inodes(struct super_ continue; /* blockdev has wrong queue */ } + if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) { + /* No pages on the nodes under writeback */ + list_move(&inode->i_list, &sb->s_dirty); + continue; + } + /* Was this inode dirtied after sync_sb_inodes was called? 
*/ if (time_after(inode->dirtied_when, start)) break; diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c --- 0/fs/inode.c2007-07-11 20:30:55.0 -0700 +++ 1/fs/inode.c2007-07-11 21:08:04.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include /* * This is needed for the following functions: @@ -157,6 +158,7 @@ static struct inode *alloc_inode(struct mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; + cpuset_init_dirty_nodes(mapping); /* * If the block_device provides a backing_dev_info for client @@ -264,6 +266,7 @@ void clear_inode(struct inode *inode) bd_forget(inode); if (S_ISCHR(inode->i_mode) && inode->i_cdev) cd_forget(inode); + cpuset_clear_dirty_nodes(inode->i_mapping); inode->i_state = I_CLEAR; } diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h --- 0/include/linux/cpuset.h2007-07-11 20:30:56.0 -0700 +++ 1/include/linux/cpuset.h2007-07-11 21:08:04.0 -0700 @@ -76,6 +76,45 @@ extern void cpuset_track_online_nodes(vo extern int current_cpuset_is_being_rebound(void); +/* + * We need macros since struct address_space is not defined yet + */ +#if MAX_NUMNODES <= BITS_PER_LONG +#define cpuset_update_dirty_nodes(__mapping, __page) \ + do {\ + int node = page_to_nid(__page); \ + if (!node_isset(node, (__mapping)->dirty_nodes))\ +
[PATCH 0/6] cpuset aware writeback
Perform writeback and dirty throttling with awareness of cpuset mems_allowed. The theory of operation has two primary elements: 1. Add a nodemask per mapping which indicates the nodes which have set PageDirty on any page of the mapping. 2. Add a nodemask argument to wakeup_pdflush() which is propagated down to sync_sb_inodes. This leaves sync_sb_inodes() with two nodemasks. One is passed to it and specifies the nodes the caller is interested in syncing, and will either be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the caller's context. The second nodemask is attached to the inode's mapping and shows who has modified data in the inode. sync_sb_inodes() will then skip syncing of inodes if the nodemask argument does not intersect with the mapping nodemask. cpuset_current_mems_allowed will be passed in to pdflush background_writeout by try_to_free_pages and balance_dirty_pages. balance_dirty_pages also passes the nodemask in to writeback_inodes directly when doing active reclaim. Other callers do not limit inode writeback, passing in a NULL nodemask pointer. A final change is to get_dirty_limits. It takes a nodemask argument, and when it is null there is no change in behavior. If the nodemask is set, page statistics are accumulated only for specified nodes, and the background and throttle dirty ratios will be read from a new per-cpuset ratio feature. These patches are mostly unchanged from Christoph Lameter's original changelist posted previously to linux-mm.
Re: [RFC 1/7] cpuset write dirty map
Christoph Lameter wrote: > > This may be a leftover from earlier times when the logic was different in > throttle vm writeout? Sorry -- my merge error when looking at an earlier kernel, no issue with mainline or -mm. -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Christoph -- I have a question about one part of the patches. In throttle_vm_writeout() you added a clause that checks for __GFP_FS | __GFP_IO and if they're not both set it calls blk_congestion_wait() immediately and then returns, with no chance of looping. Two questions: 1. This seems like an unrelated bug fix. Should you submit it as a standalone patch? 2. You put this gfp check before the check for get_dirty_limits. It's possible that this will block even though without your change it would have returned straight away. Would it be better, instead of adding the if-clause at the top of the function, to embed the gfp check at the end of the for-loop after calling blk_congestion_wait? -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Christoph Lameter wrote: > On Wed, 27 Jun 2007, Ethan Solomita wrote: > >> I looked over it at one point. Most of the code doesn't conflict, but I >> believe that the code path which calculates the dirty limits will need >> some merging. Doable but non-trivial. >> -- Ethan > > I hope you will keep on updating the patchset and posting it against > current mm? > I have no new changes, but I can update it against the current mm. Or did the per-bdi throttling change get taken by Andrew? -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Andrew Morton wrote: > > One open question is the interaction between these changes and with Peter's > per-device-dirty-throttling changes. They also are in my queue somewhere. I looked over it at one point. Most of the code doesn't conflict, but I believe that the code path which calculates the dirty limits will need some merging. Doable but non-trivial. -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Christoph Lameter wrote: > > What testing was done? Would you include the results of tests in your next > post? Sorry for the delay in responding -- I was chasing phantom failures. I created a stress test which involved using cpusets and mems_allowed to split memory so that all daemons had memory set aside for them, and my memory stress test had a separate set of memory. The stress test was mmapping 7GB of a very large file on disk. It then scans the entire 7GB of memory reading and modifying each byte. 7GB is more than the amount of physical memory made available to the stress test. Using iostat I can see the initial period of reading from disk, followed by a period of simultaneous reads and writes as dirty bytes are pushed to make room for new reads. In a separate log-in, in the other cpuset, I am running: while `true`; do date | tee -a date.txt; sleep 5; done date.txt resides on the same disk as the large file mentioned above. The above while-loop serves the dual purpose of providing me visual clues of progress along with the opportunity for the "tee" command to become throttled writing to the disk. The effect of this patchset is straightforward. Without it there are long hangs between appearances of the date. With it the dates are all 5 (or sometimes 6) seconds apart. I also added printks to the kernel to verify that, without these patches, the tee was being throttled (along with lots of other things), and with the patch only pdflush is being throttled. -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Christoph Lameter wrote: > On Thu, 31 May 2007, Ethan Solomita wrote: > >> The dirty map is only cleared (or freed) when the inode is cleared. >> At that point no pages are attached to the inode anymore and therefore it can >> be done without any locking. The dirty map therefore records all nodes that >> have been used for dirty pages by that inode until the inode is no longer >> used. >> >> Originally by Christoph Lameter <[EMAIL PROTECTED]> > > You should preserve my Signed-off-by: since I wrote most of this. Is there > a changelog? > I wasn't sure of the etiquette -- I'd thought that by saying you had signed it off that meant you were accepting my modifications, and didn't want to presume. But I will change it if you like. No slight intended. Unfortunately I don't have a changelog, and since I've since forward ported the changes it would be hard to produce. If you want to review it you should probably review it all, because the forward porting may have introduced issues. -- Ethan
[RFC 7/7] cpuset dirty limits
Per cpuset dirty ratios This implements dirty ratios per cpuset. Two new files are added to the cpuset directories: background_dirty_ratio Percentage at which background writeback starts throttle_dirty_ratio Percentage at which the application is throttled and we start synchronous writeout. Both variables are set to -1 by default which means that the global limits (/proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio) are used for a cpuset. Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 6/include/linux/cpuset.h 7/include/linux/cpuset.h --- 6/include/linux/cpuset.h2007-05-30 11:39:17.0 -0700 +++ 7/include/linux/cpuset.h2007-05-30 11:39:48.0 -0700 @@ -75,6 +75,7 @@ static inline int cpuset_do_slab_mem_spr extern void cpuset_track_online_nodes(void); +extern void cpuset_get_current_ratios(int *background, int *ratio); /* * We need macros since struct address_space is not defined yet */ diff -uprN -X 0/Documentation/dontdiff 6/kernel/cpuset.c 7/kernel/cpuset.c --- 6/kernel/cpuset.c 2007-05-30 11:39:17.0 -0700 +++ 7/kernel/cpuset.c 2007-05-30 11:39:48.0 -0700 @@ -49,6 +49,7 @@ #include #include #include +#include #include #include @@ -99,6 +100,9 @@ struct cpuset { int mems_generation; struct fmeter fmeter; /* memory_pressure filter */ + + int background_dirty_ratio; + int throttle_dirty_ratio; }; /* bits in struct cpuset flags field */ @@ -176,6 +180,8 @@ static struct cpuset top_cpuset = { .count = ATOMIC_INIT(0), .sibling = LIST_HEAD_INIT(top_cpuset.sibling), .children = LIST_HEAD_INIT(top_cpuset.children), + .background_dirty_ratio = -1, + .throttle_dirty_ratio = -1, }; static struct vfsmount *cpuset_mount; @@ -1030,6 +1036,21 @@ static int update_flag(cpuset_flagbits_t return 0; } +static int update_int(int *cs_int, char *buf, int min, int max) +{ + char *endp; + int val; + + val = simple_strtol(buf, &endp, 10); + if (val < min || val > max) + return
-EINVAL; + + mutex_lock(&callback_mutex); + *cs_int = val; + mutex_unlock(&callback_mutex); + return 0; +} + /* * Frequency meter - How fast is some event occurring? * @@ -1238,6 +1259,8 @@ typedef enum { FILE_SPREAD_PAGE, FILE_SPREAD_SLAB, FILE_TASKLIST, + FILE_THROTTLE_DIRTY_RATIO, + FILE_BACKGROUND_DIRTY_RATIO, } cpuset_filetype_t; static ssize_t cpuset_common_file_write(struct file *file, @@ -1308,6 +1331,12 @@ static ssize_t cpuset_common_file_write( case FILE_TASKLIST: retval = attach_task(cs, buffer, &pathbuf); break; + case FILE_BACKGROUND_DIRTY_RATIO: + retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100); + break; + case FILE_THROTTLE_DIRTY_RATIO: + retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100); + break; default: retval = -EINVAL; goto out2; @@ -1420,6 +1449,12 @@ static ssize_t cpuset_common_file_read(s case FILE_SPREAD_SLAB: *s++ = is_spread_slab(cs) ? '1' : '0'; break; + case FILE_BACKGROUND_DIRTY_RATIO: + s += sprintf(s, "%d", cs->background_dirty_ratio); + break; + case FILE_THROTTLE_DIRTY_RATIO: + s += sprintf(s, "%d", cs->throttle_dirty_ratio); + break; default: retval = -EINVAL; goto out; @@ -1788,6 +1823,16 @@ static struct cftype cft_spread_slab = { .private = FILE_SPREAD_SLAB, }; +static struct cftype cft_background_dirty_ratio = { + .name = "background_dirty_ratio", + .private = FILE_BACKGROUND_DIRTY_RATIO, +}; + +static struct cftype cft_throttle_dirty_ratio = { + .name = "throttle_dirty_ratio", + .private = FILE_THROTTLE_DIRTY_RATIO, +}; + static int cpuset_populate_dir(struct dentry *cs_dentry) { int err; @@ -1810,6 +1855,10 @@ static int cpuset_populate_dir(struct de return err; if ((err = cpuset_add_file(cs_dentry, &cft_spread_slab)) < 0) return err; + if ((err = cpuset_add_file(cs_dentry, &cft_background_dirty_ratio)) < 0) + return err; + if ((err = cpuset_add_file(cs_dentry, &cft_throttle_dirty_ratio)) < 0) + return err; if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0) return err; return 0; @@ 
-1849,6 +1898,8 @@ static long cpuset_create(struct cpuset INIT_LIST_HEAD(&cs
[RFC 6/7] cpuset write fixes
Remove unneeded local variable. Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 5/mm/page-writeback.c 6/mm/page-writeback.c --- 5/mm/page-writeback.c 2007-05-30 11:37:01.0 -0700 +++ 6/mm/page-writeback.c 2007-05-30 11:39:25.0 -0700 @@ -177,7 +177,6 @@ get_dirty_limits(struct dirty_limits *dl int unmapped_ratio; long background; long dirty; - unsigned long available_memory = determine_dirtyable_memory(); unsigned long dirtyable_memory; unsigned long nr_mapped; struct task_struct *tsk;
[RFC 5/7] cpuset write vm writeout
Throttle VM writeout in a cpuset aware way This bases the vm throttling from the reclaim path on the dirty ratio of the cpuset. Note that a cpuset is only effective if shrink_zone is called from direct reclaim. kswapd has a cpuset context that includes the whole machine. VM throttling will only work during synchronous reclaim and not from kswapd. Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h --- 4/include/linux/writeback.h 2007-05-30 11:36:14.0 -0700 +++ 5/include/linux/writeback.h 2007-05-30 11:37:01.0 -0700 @@ -89,7 +89,7 @@ static inline void wait_on_inode(struct int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(gfp_t gfp_mask); +void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask); /* These are exported to sysctl. */ extern int dirty_background_ratio; diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c --- 4/mm/page-writeback.c 2007-05-30 11:36:15.0 -0700 +++ 5/mm/page-writeback.c 2007-05-30 11:37:01.0 -0700 @@ -384,7 +384,7 @@ void balance_dirty_pages_ratelimited_nr( } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(gfp_t gfp_mask) +void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask) { struct dirty_limits dl; @@ -399,7 +399,7 @@ void throttle_vm_writeout(gfp_t gfp_mask } for ( ; ; ) { - get_dirty_limits(&dl, NULL, &node_online_map); + get_dirty_limits(&dl, NULL, nodes); /* * Boost the allowable dirty threshold a bit for page diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c --- 4/mm/vmscan.c 2007-05-30 11:36:17.0 -0700 +++ 5/mm/vmscan.c 2007-05-30 11:37:01.0 -0700 @@ -1079,7 +1079,7 @@ static unsigned long shrink_zone(int pri } } - throttle_vm_writeout(sc->gfp_mask); +
throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask); atomic_dec(&zone->reclaim_in_progress); return nr_reclaimed; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 4/7] cpuset write vmscan
Direct reclaim: cpuset-aware writeout

During direct reclaim we traverse down a zonelist, carefully checking whether each zone is a member of the active cpuset. But then we call pdflush without enforcing the same restrictions. In a larger system this may have the effect of a massive number of pages being dirtied, and then either

A. no writeout occurs because global dirty limits have not been reached, or
B. writeout starts randomly for some dirty inode in the system. Pdflush may just write out data for nodes in another cpuset and miss doing proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur from direct reclaim the same way as without a cpuset.

Originally by Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]>
---
diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c	2007-05-30 11:34:21.0 -0700
+++ 4/mm/vmscan.c	2007-05-30 11:36:17.0 -0700
@@ -1198,7 +1198,8 @@ unsigned long try_to_free_pages(struct z
 	 */
 	if (total_scanned > sc.swap_cluster_max + sc.swap_cluster_max / 2) {
-		wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+		wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				&cpuset_current_mems_allowed);
 		sc.may_writepage = 1;
 	}
-
[RFC 3/7] cpuset write throttle
Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset. If, for example, a cpuset contains only 1/10th of available memory, then all of the memory of the cpuset can be dirtied without any writes being triggered. If all of the cpuset's memory is dirty then only 10% of total memory is dirty. The background writeback threshold is usually set at 10% and the synchronous threshold at 40%, so we are still below the global limits while the dirty ratio in the cpuset is 100%! Writeback throttling and background writeout do not work at all in such scenarios.

This patch makes dirty writeout cpuset-aware. When determining the dirty limits in get_dirty_limits() we calculate values based on the nodes that are reachable from the current process (which has been dirtying the page). Then we can trigger writeout based on the dirty ratio of the memory in the cpuset.

We trigger writeout in a cpuset-specific way: we go through the dirty inodes and search for inodes that have dirty pages on the nodes of the active cpuset. If an inode fulfills that requirement then we begin writeout of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite an expensive operation (in particular for large cpusets with hundreds of nodes) compared to just accessing the global counters if we do not have a cpuset. However, please remember that the global counters were only introduced recently. Before 2.6.18 we added up per-processor counters for each processor on each invocation of get_dirty_limits(). We now add up per-node information, which I think is equal or less effort since there are fewer nodes than processors.
Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-05-30 11:31:22.0 -0700 +++ 3/mm/page-writeback.c 2007-05-30 11:34:26.0 -0700 @@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); +struct dirty_limits { + long thresh_background; + long thresh_dirty; + unsigned long nr_dirty; + unsigned long nr_unstable; + unsigned long nr_writeback; +}; + /* * Work out the current dirty-memory clamping and background writeout * thresholds. @@ -121,13 +129,15 @@ static void background_writeout(unsigned * clamping level. */ -static unsigned long highmem_dirtyable_memory(unsigned long total) +static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total) { #ifdef CONFIG_HIGHMEM int node; unsigned long x = 0; - for_each_online_node(node) { + if (nodes == NULL) + nodes = &node_online_mask; + for_each_node_mask(node, *nodes) { struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM]; @@ -154,13 +164,13 @@ static unsigned long determine_dirtyable x = global_page_state(NR_FREE_PAGES) + global_page_state(NR_INACTIVE) + global_page_state(NR_ACTIVE); - x -= highmem_dirtyable_memory(x); + x -= highmem_dirtyable_memory(NULL, x); return x + 1; /* Ensure that we never return 0 */ } -static void -get_dirty_limits(long *pbackground, long *pdirty, - struct address_space *mapping) +static int +get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping, + nodemask_t *nodes) { int background_ratio; /* Percentages */ int dirty_ratio; @@ -168,12 +178,60 @@ get_dirty_limits(long *pbackground, long long background; long dirty; unsigned long available_memory = determine_dirtyable_memory(); + unsigned long dirtyable_memory; + unsigned long nr_mapped; struct task_struct *tsk; + int is_subset = 0; + +#ifdef CONFIG_CPUSETS + if 
(unlikely(nodes && + !nodes_subset(node_online_map, *nodes))) { + int node; + + /* +* Calculate the limits relative to the current cpuset. +* +* We do not disregard highmem because all nodes (except +* maybe node 0) have either all memory in HIGHMEM (32 bit) or +* all memory in non HIGHMEM (64 bit). If we would disregard +* highmem then cpuset throttling would not work on 32 bit. +*/ + is_subset = 1; + memset(dl, 0, sizeof(struct dirty_limits)); + dirtyable_memory = 0; + nr_mapped = 0; + for_each_node_mask(node, *nodes) { + if (!node_online(node)) + continue; + dl-&g
[RFC 2/7] cpuset write pdflush nodemask
pdflush: Allow the passing of a nodemask parameter If we want to support nodeset specific writeout then we need a way to communicate the set of nodes that an operation should affect. So add a nodemask_t parameter to the pdflush functions and also store the nodemask in the pdflush control structure. Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c --- 1/fs/buffer.c 2007-05-29 17:44:33.0 -0700 +++ 2/fs/buffer.c 2007-05-30 11:31:22.0 -0700 @@ -359,7 +359,7 @@ static void free_more_memory(void) struct zone **zones; pg_data_t *pgdat; - wakeup_pdflush(1024); + wakeup_pdflush(1024, NULL); yield(); for_each_online_pgdat(pgdat) { diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c --- 1/fs/super.c2007-05-29 17:43:00.0 -0700 +++ 2/fs/super.c2007-05-30 11:31:22.0 -0700 @@ -615,7 +615,7 @@ int do_remount_sb(struct super_block *sb return 0; } -static void do_emergency_remount(unsigned long foo) +static void do_emergency_remount(unsigned long foo, nodemask_t *bar) { struct super_block *sb; @@ -643,7 +643,7 @@ static void do_emergency_remount(unsigne void emergency_remount(void) { - pdflush_operation(do_emergency_remount, 0); + pdflush_operation(do_emergency_remount, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c --- 1/fs/sync.c 2007-05-29 17:43:00.0 -0700 +++ 2/fs/sync.c 2007-05-30 11:31:22.0 -0700 @@ -21,9 +21,9 @@ * sync everything. Start out by waking pdflush, because that writes back * all queues in parallel. 
*/ -static void do_sync(unsigned long wait) +static void do_sync(unsigned long wait, nodemask_t *unused) { - wakeup_pdflush(0); + wakeup_pdflush(0, NULL); sync_inodes(0); /* All mappings, inodes and their blockdevs */ DQUOT_SYNC(NULL); sync_supers(); /* Write the superblocks */ @@ -38,13 +38,13 @@ static void do_sync(unsigned long wait) asmlinkage long sys_sync(void) { - do_sync(1); + do_sync(1, NULL); return 0; } void emergency_sync(void) { - pdflush_operation(do_sync, 0); + pdflush_operation(do_sync, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h --- 1/include/linux/writeback.h 2007-05-30 11:20:16.0 -0700 +++ 2/include/linux/writeback.h 2007-05-30 11:31:22.0 -0700 @@ -86,7 +86,7 @@ static inline void wait_on_inode(struct /* * mm/page-writeback.c */ -int wakeup_pdflush(long nr_pages); +int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); void throttle_vm_writeout(gfp_t gfp_mask); @@ -117,7 +117,8 @@ balance_dirty_pages_ratelimited(struct a typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, void *data); -int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); +int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes), + unsigned long arg0, nodemask_t *nodes); int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); int write_cache_pages(struct address_space *mapping, diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c --- 1/mm/page-writeback.c 2007-05-29 17:44:33.0 -0700 +++ 2/mm/page-writeback.c 2007-05-30 11:31:22.0 -0700 @@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -static void background_writeout(unsigned long _min_pages); +static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); /* * Work out the current dirty-memory clamping and background writeout @@ -272,7 
+272,7 @@ static void balance_dirty_pages(struct a */ if ((laptop_mode && pages_written) || (!laptop_mode && (nr_reclaimable > background_thresh))) - pdflush_operation(background_writeout, 0); + pdflush_operation(background_writeout, 0, NULL); } void set_page_dirty_balance(struct page *page) @@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. */ -static void background_writeout(unsigned long _min_pages) +static void background_writeout(unsigned long _min_pages, nodemask_t *unused) { long min_pages = _min_pages; struct writeback_control wbc = { @@ -402,12 +402,12 @@ static void background_writeout(unsigned * the whole world. Returns 0 if a pdflush thread was dispatched. Returns * -1 if all
[RFC 1/7] cpuset write dirty map
Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping are located. That way we will be able to implement writeout for applications that are constrained to a portion of the memory of the system, as required by cpusets. This patch implements the management of dirty node maps for an address space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes
cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes map
cpuset_init_dirty_nodes(mapping)	First-time init of the map

The dirty map may be stored either directly in the mapping (for NUMA systems with fewer than BITS_PER_LONG nodes) or separately allocated for systems with a large number of nodes (e.g. IA64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large configurations. Therefore we protect the allocation and the setting of a node in the map through the tree_lock. The tree_lock is already taken when a page is dirtied, so there is no additional locking overhead if we insert the updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared. At that point no pages are attached to the inode anymore and therefore it can be done without any locking. The dirty map therefore records all nodes that have been used for dirty pages by that inode until the inode is no longer used.
Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c --- 0/fs/buffer.c 2007-05-29 17:42:07.0 -0700 +++ 1/fs/buffer.c 2007-05-29 17:44:33.0 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -710,6 +711,7 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + cpuset_update_dirty_nodes(mapping, page); write_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c --- 0/fs/fs-writeback.c 2007-05-29 17:42:07.0 -0700 +++ 1/fs/fs-writeback.c 2007-05-29 18:13:48.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include #include "internal.h" int sysctl_inode_debug __read_mostly; @@ -483,6 +484,12 @@ int generic_sync_sb_inodes(struct super_ continue; /* blockdev has wrong queue */ } + if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) { + /* No pages on the nodes under writeback */ + redirty_head(inode); + continue; + } + /* Was this inode dirtied after sync_sb_inodes was called? 
*/ if (time_after(inode->dirtied_when, start)) break; diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c --- 0/fs/inode.c2007-05-29 17:42:07.0 -0700 +++ 1/fs/inode.c2007-05-29 17:44:33.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include /* * This is needed for the following functions: @@ -148,6 +149,7 @@ static struct inode *alloc_inode(struct mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; + cpuset_init_dirty_nodes(mapping); /* * If the block_device provides a backing_dev_info for client @@ -255,6 +257,7 @@ void clear_inode(struct inode *inode) bd_forget(inode); if (S_ISCHR(inode->i_mode) && inode->i_cdev) cd_forget(inode); + cpuset_clear_dirty_nodes(inode->i_mapping); inode->i_state = I_CLEAR; } diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h --- 0/include/linux/cpuset.h2007-05-29 17:40:07.0 -0700 +++ 1/include/linux/cpuset.h2007-05-29 17:44:33.0 -0700 @@ -75,6 +75,45 @@ static inline int cpuset_do_slab_mem_spr extern void cpuset_track_online_nodes(void); +/* + * We need macros since struct address_space is not defined yet + */ +#if MAX_NUMNODES <= BITS_PER_LONG +#define cpuset_update_dirty_nodes(__mapping, __page) \ + do {\ + int node = page_to_nid(__page); \ + if (!node_isset(node, (__mapping)->dirty_nodes))\ + node_set(node, (__mapping)->dir
Re: NR_UNSTABLE_FS vs. NR_FILE_DIRTY: double counting pages?
Ethan Solomita wrote:
Trond Myklebust wrote:
It should not happen. If the page is on the unstable list, then it will be committed before nfs_updatepage is allowed to redirty it. See the recent fixes in 2.6.21-rc7.

Above I present a codepath called straight from sys_write() which seems to do what I say. I could be wrong, but can you address the code paths I show above which seem to set both?

Sorry about my quick reply, I'd misunderstood what you were saying. I'll take a look at what you say. Thanks,
-- Ethan
Re: NR_UNSTABLE_FS vs. NR_FILE_DIRTY: double counting pages?
Trond Myklebust wrote:
On Fri, 2007-04-27 at 18:21 -0700, Ethan Solomita wrote:
There are several places where we add together NR_UNSTABLE_NFS and NR_FILE_DIRTY:
	sync_inodes_sb()
	balance_dirty_pages()
	wakeup_pdflush()
	wb_kupdate()
	prefetch_suitable()
I can trace a standard codepath where it seems both of these are set on the same page:
nfs_file_aops.commit_write -> nfs_commit_write
  nfs_updatepages
    nfs_writepage_setup
      nfs_wb_page
        nfs_wb_page_priority
          nfs_writepage_locked
            nfs_flush_mapping
              nfs_flush_list
                nfs_flush_multi
                  nfs_write_partial_ops.rpc_call_done
                    nfs_writeback_done_partial
                      nfs_writepage_release
                        nfs_reschedule_unstable_write
                          nfs_mark_request_commit
                            incr NR_UNSTABLE_NFS
nfs_file_aops.commit_write -> nfs_commit_write
  nfs_updatepage
    __set_page_dirty_nobuffers
      incr NR_FILE_DIRTY
This is the standard code path that derives from sys_write(). Can someone either show how this code sequence can't happen, or confirm for me that there's a bug?
-- Ethan

It should not happen. If the page is on the unstable list, then it will be committed before nfs_updatepage is allowed to redirty it. See the recent fixes in 2.6.21-rc7.

Above I present a codepath called straight from sys_write() which seems to do what I say. I could be wrong, but can you address the code paths I show above which seem to set both?
-- Ethan
NR_UNSTABLE_FS vs. NR_FILE_DIRTY: double counting pages?
There are several places where we add together NR_UNSTABLE_NFS and NR_FILE_DIRTY:
	sync_inodes_sb()
	balance_dirty_pages()
	wakeup_pdflush()
	wb_kupdate()
	prefetch_suitable()

I can trace a standard codepath where it seems both of these are set on the same page:

nfs_file_aops.commit_write -> nfs_commit_write
  nfs_updatepages
    nfs_writepage_setup
      nfs_wb_page
        nfs_wb_page_priority
          nfs_writepage_locked
            nfs_flush_mapping
              nfs_flush_list
                nfs_flush_multi
                  nfs_write_partial_ops.rpc_call_done
                    nfs_writeback_done_partial
                      nfs_writepage_release
                        nfs_reschedule_unstable_write
                          nfs_mark_request_commit
                            incr NR_UNSTABLE_NFS

nfs_file_aops.commit_write -> nfs_commit_write
  nfs_updatepage
    __set_page_dirty_nobuffers
      incr NR_FILE_DIRTY

This is the standard code path that derives from sys_write(). Can someone either show how this code sequence can't happen, or confirm for me that there's a bug?
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
On Fri, 20 Apr 2007, Ethan Solomita wrote:
cpuset_write_dirty_map.htm: In __set_page_dirty_nobuffers() you always call cpuset_update_dirty_nodes() but in __set_page_dirty_buffers() you call it only if page->mapping is still set after locking. Is there a reason for the difference? Also a question not about your patch: why do those functions call __mark_inode_dirty() even if the dirty page has been truncated and mapping == NULL?

If page->mapping has been cleared then the page was removed from the mapping. __mark_inode_dirty just dirties the inode. If a truncation occurs then the inode was modified.

You didn't address the first half. Why do the buffers() and nobuffers() variants act differently when calling cpuset_update_dirty_nodes()?

cpuset_write_throttle.htm: I noticed that several lines have leading spaces. I didn't check if other patches have the problem too.

Maybe download the patches? How did those strange .htm endings get appended to the patches?

Something weird with Firefox, but instead of jumping on me did you consider double-checking your patches? I just went back, found the text versions, and the spaces are still there, e.g.:
+ unsigned long dirtyable_memory;

In get_dirty_limits(), when cpusets are configured you don't subtract highmem the same way that is done without cpusets. Is this intentional?

That is something in flux upstream. Linus changed it recently. Do it one way or the other.

Exactly -- your patch should be consistent and do it the same way as whatever your patch is built against. Your patch is built against a kernel that subtracts off highmem. "Do it..." -- are you handing off the patch and are done with it?

It seems that dirty_exceeded is still a global punishment across cpusets. Should it be addressed?

Sure. It would be best if you could place that somehow in a cpuset.

Again it sounds like you're handing them off. I'm not objecting, I just hadn't understood that.
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
H Sorry. I got distracted and I have sent them to Kame-san who was interested in working on them. I have placed the most recent version at http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty

Hi Christoph -- a few comments on the patches:

cpuset_write_dirty_map.htm: In __set_page_dirty_nobuffers() you always call cpuset_update_dirty_nodes() but in __set_page_dirty_buffers() you call it only if page->mapping is still set after locking. Is there a reason for the difference? Also a question not about your patch: why do those functions call __mark_inode_dirty() even if the dirty page has been truncated and mapping == NULL?

cpuset_write_throttle.htm: I noticed that several lines have leading spaces. I didn't check if other patches have the problem too. In get_dirty_limits(), when cpusets are configured you don't subtract highmem the same way that is done without cpusets. Is this intentional? It seems that dirty_exceeded is still a global punishment across cpusets. Should it be addressed?
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
On Wed, 18 Apr 2007, Ethan Solomita wrote:
Any new ETA? I'm trying to decide whether to go back to your original patches or wait for the new set. Adding new knobs isn't as important to me as having something that fixes the core problem, so hopefully this isn't waiting on them. They could always be patches on top of your core patches.
-- Ethan

H Sorry. I got distracted and I have sent them to Kame-san who was interested in working on them. I have placed the most recent version at http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty

Do you expect any conflicts with the per-bdi dirty throttling patches?
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
On Wed, 21 Mar 2007, Ethan Solomita wrote:
Christoph Lameter wrote:
On Thu, 1 Feb 2007, Ethan Solomita wrote:
Hi Christoph -- has anything come of resolving the NFS / OOM concerns that Andrew Morton expressed concerning the patch? I'd be happy to see some progress on getting this patch (i.e. the one you posted on 1/23) through.

Peter Zijlstra addressed the NFS issue. I will submit the patch again as soon as the writeback code stabilizes a bit.

I'm pinging to see if this has gotten anywhere. Are you ready to resubmit? Do we have the evidence to convince Andrew that the NFS issues are resolved and so this patch won't obscure anything?

The NFS patch went into Linus' tree a couple of days ago and I have a new version ready with additional support to set per dirty ratios per cpu. There is some interest in adding more VM controls to this patch. I hope I can post the next rev by tomorrow.

Any new ETA? I'm trying to decide whether to go back to your original patches or wait for the new set. Adding new knobs isn't as important to me as having something that fixes the core problem, so hopefully this isn't waiting on them. They could always be patches on top of your core patches.
-- Ethan
Re: [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
Maneesh Soni wrote:
> I have modified the previous patch (which was dropped from -mm) and now keeping
> the statement making s_dentry as NULL in sysfs_d_iput(), so this should
> _safely_ fix sysfs_readdir() oops.

If you could find some additional places in sysfs code to add new BUG() checks I'd appreciate it. Especially if it turns out that you can't reproduce it, I'd like to have as many asserts as is reasonable.
-- Ethan
Re: [FIXED] Re: tty OOPS (Re: 2.6.21-rc5-mm2)
Apologies -- I didn't notice lkml on the cc list. I'll catch up from lkml directly.
-- Ethan
Re: [FIXED] Re: tty OOPS (Re: 2.6.21-rc5-mm2)
Andreas Mohr wrote:
Hi,
On Wed, Mar 28, 2007 at 10:56:32PM +0400, Alexey Dobriyan wrote:
The only suspicious new patch in -rc5-mm1 to me is fix-sysfs-reclaim-crash.patch which removes "sd->s_dentry = NULL;". Note that whole sysfs_drop_dentry() is a NOP if ->s_dentry is NULL. Could you try to revert it?
Alexey, who knows very little about sysfs internals

Apparently that's still too much knowledge ;) Or, in other words: 6 reboots already and not a single problem! So yes, the removal of the NULLing line in this patch most likely has caused this issue on my box. Now the question is whether something as simple as that is a fully correct fix or whether something should be done entirely differently. I'll let people more familiar with those parts decide about it...

Sorry -- I've only just been cc'd on this mail thread. Are we claiming that this patch/fix has caused a new problem, or successfully fixed an old problem? Thanks!
-- Ethan
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code
Nick Piggin wrote:
Eric W. Biederman wrote:
First touch page ownership does not guarantee give me anything useful for knowing if I can run my application or not. Because of page sharing my application might run inside the rss limit only because I got lucky and happened to share a lot of pages with another running application. If the next time I run it isn't running, my application will fail.

That is ridiculous. Let's be practical here, what you're asking is basically impossible. (Unless by deterministic you mean that it never enters a non-trivial syscall, in which case you just want to know about maximum RSS of the process, which we already account.)

If we used Beancounters as Pavel and Kirill mentioned, that would keep track of each container that has referenced a page, not just the first container. It sounds like beancounters can return a usage count where each page is divided by the number of referencing containers (e.g. 1/3rd if 3 containers share a page). Presumably it could also return a full count of 1 to each container.

If we look at data in the latter form, i.e. each container must pay fully for each page used, then Eric could use that to determine real usage needs of the container. However, we could also use the fractional count in order to do things such as charging the container for its actual usage: i.e. full count for setting guarantees, fractional for actual usage.
-- Ethan
Re: Linux-VServer example results for sharing vs. separate mappings ...
Herbert Poetzl wrote:
On Sat, Mar 24, 2007 at 12:19:06PM -0800, Andrew Morton wrote:
Or change the reclaim code so that a page which hasn't been referenced from a process within its hardware container is considered unreferenced (so it gets reclaimed).

that might easily lead to some ping-pong behaviour, when two similar guests are executing similar binaries but not at the same time ...

It might lead to that, but I don't think it would become pathological "easily". If a system has been up for a long time, it's easy to imagine pagecache pages lying everywhere just because someone somewhere is still using them.

I suggest a variant on what Andrew says: don't change reclaim. Instead, when referencing a page, don't mark the page as referenced if the current task is not permitted to allocate from the page's node. I'm thinking in terms of cpusets, with each task having a nodemask of mems_allowed. This may result in a page being thrown out unnecessarily and brought back in from disk, but when memory is tight that is what happens.

An optimization might be to keep track of who is referencing the page and migrate it to their memory instead of reclaiming it, but that would require reclaim to know the task/cpuset/container of the referencing task.
-- Ethan
Re: sysfs reclaim crash
Hi Maneesh -- I will start testing with the patch you provided. If you come up with any further issues please let me know. Also, if you could suggest some additional BUG() lines that I could insert I would appreciate it. Since the bug is hard to reproduce, it may be easier to catch a race condition in the making via BUG() than an actual failure due to a race condition. Thanks! -- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote: On Thu, 1 Feb 2007, Ethan Solomita wrote: Hi Christoph -- has anything come of resolving the NFS / OOM concerns that Andrew Morton expressed concerning the patch? I'd be happy to see some progress on getting this patch (i.e. the one you posted on 1/23) through. Peter Zijlstra addressed the NFS issue. I will submit the patch again as soon as the writeback code stabilizes a bit. I'm pinging to see if this has gotten anywhere. Are you ready to resubmit? Do we have the evidence to convince Andrew that the NFS issues are resolved and so this patch won't obscure anything? Thanks, -- Ethan
Re: [PATCH 1/1] mm: Inconsistent use of node IDs
Ping! -- Ethan Ethan Solomita wrote: > Andi Kleen wrote: > >> On Monday 12 March 2007 23:51, Ethan Solomita wrote: > >>> This patch corrects inconsistent use of node numbers (variously "nid" or >>> "node") in the presence of fake NUMA. >>> >> I think it's very consistent -- your patch would make it inconsistent though. >> > > It's consistent to call node_online() with a physical node ID when the > online node mask is composed of fake nodes? > > >> Sorry, but when you ask for NUMA emulation you will get it. I don't see >> any point in a "half way only for some subsystems I like" NUMA emulation. >> It's unlikely that your ideas of where it is useful and where is not >> matches other NUMA emulation user's ideas too. >> > > I don't understand your comments. My code is intended to work for all > systems. If the system is non-NUMA by nature, then all CPUs map to fake > node 0. > > As an example, on a two chip dual-core AMD opteron system, there are 4 > "cpus" where CPUs 0 and 1 are close to the first half of memory, and > CPUs 2 and 3 are close to the second half. Without this change CPUs 2 > and 3 are mapped to fake node 1. This results in awful performance. With > this change, CPUs 2 and 3 are mapped to (roughly) 1/2 the fake node > count. Their zonelists[] are ordered to do allocations preferentially > from zones that are local to CPUs 2 and 3. > > Can you tell me the scenario where my code makes things worse? > > >> Besides adding such a secondary node space would likely be a huge long term >> maintenance issue. I just can see it breaking with every non trivial change. >> > > I'm adding no data structures to do this. The current code already has > get_phys_node. My changes use the existing information about node > layout, both the physical and fake, and define a mapping. The current > mapping just takes a physical node and says "it's the fake node too". > > >> NACK. >> > > I wish you would include some specifics as to why you think what you > do. 
You're suggesting we leave in place a system that destroys NUMA > locality when using fake numa, and passes around physical node ids as an > index into nodes[] which is indexed by fake nodes. My change has no > effect without fake numa, and harms no one with fake numa. > -- Ethan
Re: [PATCH 1/1] mm: Inconsistent use of node IDs
Andi Kleen wrote: On Monday 12 March 2007 23:51, Ethan Solomita wrote: This patch corrects inconsistent use of node numbers (variously "nid" or "node") in the presence of fake NUMA. I think it's very consistent -- your patch would make it inconsistent though. It's consistent to call node_online() with a physical node ID when the online node mask is composed of fake nodes? Sorry, but when you ask for NUMA emulation you will get it. I don't see any point in a "half way only for some subsystems I like" NUMA emulation. It's unlikely that your ideas of where it is useful and where is not matches other NUMA emulation user's ideas too. I don't understand your comments. My code is intended to work for all systems. If the system is non-NUMA by nature, then all CPUs map to fake node 0. As an example, on a two chip dual-core AMD opteron system, there are 4 "cpus" where CPUs 0 and 1 are close to the first half of memory, and CPUs 2 and 3 are close to the second half. Without this change CPUs 2 and 3 are mapped to fake node 1. This results in awful performance. With this change, CPUs 2 and 3 are mapped to (roughly) 1/2 the fake node count. Their zonelists[] are ordered to do allocations preferentially from zones that are local to CPUs 2 and 3. Can you tell me the scenario where my code makes things worse? Besides adding such a secondary node space would likely be a huge long term maintenance issue. I just can see it breaking with every non trivial change. I'm adding no data structures to do this. The current code already has get_phys_node. My changes use the existing information about node layout, both the physical and fake, and define a mapping. The current mapping just takes a physical node and says "it's the fake node too". NACK. I wish you would include some specifics as to why you think what you do. 
You're suggesting we leave in place a system that destroys NUMA locality when using fake numa, and passes around physical node ids as an index into nodes[] which is indexed by fake nodes. My change has no effect without fake numa, and harms no one with fake numa. -- Ethan
[PATCH 1/1] mm: Inconsistent use of node IDs
This patch corrects inconsistent use of node numbers (variously "nid" or "node") in the presence of fake NUMA. Both the AMD and Intel x86_64 discovery code will determine a CPU's physical node and use that node when calling numa_add_cpu() to associate the CPU with the node, but numa_add_cpu() treats the node argument as a fake node. This physical node may not exist within the fake nodespace, and even if it does, it will likely incorrectly associate the CPU with a fake memory node that may not share the same underlying physical NUMA node. Similarly, the PCI code which determines the node of the PCI bus saves it in the pci_sysdata structure. This node then propagates down to other buses and devices which hang off the PCI bus, and is used to specify a node when allocating memory. The purpose is to provide NUMA locality, but the node is a physical node, while the memory allocation code expects a fake node argument. Provide a routine (get_fake_node()) to map a physical node ID to a fake node ID, where the fake node contains memory on the specified physical node. This fake node's zonelist is tied to other close fake nodes, maintaining NUMA locality. Also provide node_online_phys(), which is the same as node_online() but takes a physical node ID. Change init_cpu_to_node(), the x86_64 discovery code and the PCI code to use get_fake_node() and node_online_phys() in order to convert to an appropriate fake node ID. 
Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]>
---
 arch/i386/pci/acpi.c          |  6 +++
 arch/x86_64/kernel/setup.c    | 14
 arch/x86_64/mm/numa.c         | 70 +-
 arch/x86_64/pci/k8-bus.c      |  3 +
 include/asm-x86_64/topology.h |  8
 5 files changed, 85 insertions(+), 16 deletions(-)

diff -uprN -x install -X linux-2.6.21-rc3-mm2/Documentation/dontdiff linux-2.6.21-rc3-mm2/arch/i386/pci/acpi.c linux-2.6.21-rc3-mm2-phystofake/arch/i386/pci/acpi.c
--- linux-2.6.21-rc3-mm2/arch/i386/pci/acpi.c	2007-03-09 16:42:42.0 -0800
+++ linux-2.6.21-rc3-mm2-phystofake/arch/i386/pci/acpi.c	2007-03-12 12:36:50.0 -0700
@@ -35,8 +35,13 @@ struct pci_bus * __devinit pci_acpi_scan
 	pxm = acpi_get_pxm(device->handle);
 #ifdef CONFIG_ACPI_NUMA
-	if (pxm >= 0)
+	if (pxm >= 0) {
 		sd->node = pxm_to_node(pxm);
+#ifdef CONFIG_NUMA_EMU
+		if (sd->node != -1)
+			sd->node = get_fake_node(sd->node);
+#endif
+	}
 #endif
 
 	bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
diff -uprN -x install -X linux-2.6.21-rc3-mm2/Documentation/dontdiff linux-2.6.21-rc3-mm2/arch/x86_64/kernel/setup.c linux-2.6.21-rc3-mm2-phystofake/arch/x86_64/kernel/setup.c
--- linux-2.6.21-rc3-mm2/arch/x86_64/kernel/setup.c	2007-03-09 16:42:42.0 -0800
+++ linux-2.6.21-rc3-mm2-phystofake/arch/x86_64/kernel/setup.c	2007-03-12 12:44:31.0 -0700
@@ -476,20 +476,20 @@ static void __cpuinit display_cacheinfo(
 }
 
 #ifdef CONFIG_NUMA
-static int nearby_node(int apicid)
+static int __init nearby_node(int apicid)
 {
 	int i;
 	for (i = apicid - 1; i >= 0; i--) {
 		int node = apicid_to_node[i];
-		if (node != NUMA_NO_NODE && node_online(node))
+		if (node != NUMA_NO_NODE && node_online_phys(node))
 			return node;
 	}
 	for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) {
 		int node = apicid_to_node[i];
-		if (node != NUMA_NO_NODE && node_online(node))
+		if (node != NUMA_NO_NODE && node_online_phys(node))
 			return node;
 	}
-	return first_node(node_online_map); /* Shouldn't happen */
+	return NUMA_NO_NODE; /* Shouldn't happen */
 }
 #endif
@@ -528,7 +528,7 @@ static void __init amd_detect_cmp(struct
 	node = c->phys_proc_id;
 	if (apicid_to_node[apicid] != NUMA_NO_NODE)
 		node = apicid_to_node[apicid];
-	if (!node_online(node)) {
+	if (!node_online_phys(node)) {
 		/* Two possibilities here:
 		   - The CPU is missing memory and no node was created.
 		     In that case try picking one from a nearby CPU
@@ -543,9 +543,10 @@ static void __init amd_detect_cmp(struct
 		    apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
 			node = apicid_to_node[ht_nodeid];
 		/* Pick a nearby node */
-		if (!node_online(node))
+		if (!node_online_phys(node))
 			node = nearby_node(apicid);
 	}
+	node = get_fake_node(node);
 	numa_set_node(cpu, node);
 	printk(KERN_INFO "CPU %d/%x -> Node %d\n", cpu, apicid, node);
@@ -679,7 +680,7 @@ static int __cpuinit intel_num_cpu_cores
 	return 1;
 }
 
-static void srat_detect_node(void)
+static void
Re: [RFC 0/8] Cpuset aware writeback
Hi Christoph -- has anything come of resolving the NFS / OOM concerns that Andrew Morton expressed concerning the patch? I'd be happy to see some progress on getting this patch (i.e. the one you posted on 1/23) through. Thanks, -- Ethan