Re: [PATCH 1/6] cpuset write dirty map
Andrew Morton wrote:
> On Tue, 11 Sep 2007 18:36:34 -0700
> Ethan Solomita <[EMAIL PROTECTED]> wrote:
>
>> Add a dirty map to struct address_space
>
> I get a tremendous number of rejects trying to wedge this stuff on top of
> Peter's mm-dirty-balancing-for-tasks changes.  More rejects than I am
> prepared to partially-fix so that I can usefully look at these changes in
> tkdiff, so this is all based on a quick peek at the diff itself.

This isn't surprising. We're both changing the calculation of dirty
limits. If his code is already in your workspace, then I'll have to do the
merging after you release it.

>> +#if MAX_NUMNODES <= BITS_PER_LONG
>
> The patch is sprinkled full of this conditional.
>
> I don't understand why this is being done.  afaict it isn't described
> in a code comment (it should be) nor even in the changelogs?

I can add comments.

> Given its overall complexity and its likelihood to change in the
> future, I'd suggest that this conditional be centralised in a single
> place.  Something like
>
> /*
>  * nice comment goes here
>  */
> #if MAX_NUMNODES <= BITS_PER_LONG
> #define CPUSET_DIRTY_LIMITS 1
> #else
> #define CPUSET_DIRTY_LIMITS 0
> #endif
>
> Then use #if CPUSET_DIRTY_LIMITS everywhere else.
>
> (This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll get a
> warning if someone typos '#if CPUSET_DITRY_LIMITS')

I can add something like this.
Probably something like: CPUSET_DIRTY_LIMITS_USEPTR

>> --- 0/include/linux/fs.h	2007-09-11 14:35:58.0 -0700
>> +++ 1/include/linux/fs.h	2007-09-11 14:36:24.0 -0700
>> @@ -516,6 +516,13 @@ struct address_space {
>>  	spinlock_t		private_lock;	/* for use by the address_space */
>>  	struct list_head	private_list;	/* ditto */
>>  	struct address_space	*assoc_mapping;	/* ditto */
>> +#ifdef CONFIG_CPUSETS
>> +#if MAX_NUMNODES <= BITS_PER_LONG
>> +	nodemask_t		dirty_nodes;	/* nodes with dirty pages */
>> +#else
>> +	nodemask_t		*dirty_nodes;	/* pointer to map if dirty */
>> +#endif
>> +#endif
>
> afacit there is no code comment and no changelog text which explains the
> above design decision?  There should be, please.

OK.

> There is talk of making cpusets available with CONFIG_SMP=n.  Will this new
> feature be available in that case?  (it should be).

I'm not sure how useful it would be in that scenario, but for consistency we
should still be able to specify varying dirty ratios (from patch 6/6). The
above code wouldn't mean anything with SMP=n since there's only the one
node. We'd just be indicating whether the inode has any dirty pages, which
we already know.

>> 	} __attribute__((aligned(sizeof(long))));
>> 	/*
>> 	 * On most architectures that alignment is already the case; but
>> diff -uprN -X 0/Documentation/dontdiff 0/include/linux/writeback.h 1/include/linux/writeback.h
>> --- 0/include/linux/writeback.h	2007-09-11 14:35:58.0 -0700
>> +++ 1/include/linux/writeback.h	2007-09-11 14:37:46.0 -0700
>> @@ -62,6 +62,7 @@ struct writeback_control {
>>  	unsigned for_writepages:1;	/* This is a writepages() call */
>>  	unsigned range_cyclic:1;	/* range_start is cyclic */
>>  	void *fs_private;		/* For use by ->writepages() */
>> +	nodemask_t *nodes;		/* Set of nodes of interest */
>> };
>
> That comment is a bit terse.  It's always good to be lavish when commenting
> data structures, for understanding those is key to understanding a design.
OK.

>> /*
>> diff -uprN -X 0/Documentation/dontdiff 0/kernel/cpuset.c 1/kernel/cpuset.c
>> --- 0/kernel/cpuset.c	2007-09-11 14:35:58.0 -0700
>> +++ 1/kernel/cpuset.c	2007-09-11 14:36:24.0 -0700
>> @@ -4,7 +4,7 @@
>>   *  Processor and Memory placement constraints for sets of tasks.
>>   *
>>   *  Copyright (C) 2003 BULL SA.
>> - *  Copyright (C) 2004-2006 Silicon Graphics, Inc.
>> + *  Copyright (C) 2004-2007 Silicon Graphics, Inc.
>>   *  Copyright (C) 2006 Google, Inc
>>   *
>>   *  Portions derived from Patrick Mochel's sysfs code.
>> @@ -14,6 +14,7 @@
>>   *  2003-10-22 Updates by Stephen Hemminger.
>>   *  2004 May-July Rework by Paul Jackson.
>>   *  2006 Rework by Paul Menage to use generic containers
>> + *  2007 Cpuset writeback by Christoph Lameter.
>>   *
>>   *  This file is su
Re: [PATCH 6/6] cpuset dirty limits
Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Andrew Morton wrote:
>
>>> +	mutex_lock(&callback_mutex);
>>> +	*cs_int = val;
>>> +	mutex_unlock(&callback_mutex);
>>
>> I don't think this locking does anything?
>
> Locking is wrong here. The lock needs to be taken before the cs pointer
> is dereferenced from the caller.

I think we can just remove the callback_mutex lock. Since the change is
coming from an update to a cpuset filesystem file, the cpuset is not going
anywhere since the inode is open. And I don't see that any code really
cares whether the dirty ratios change out from under them.

>>> +	return 0;
>>> +}
>>> +
>>> /*
>>>  * Frequency meter - How fast is some event occurring?
>>>  *
>>> ...
>>> +void cpuset_get_current_ratios(int *background_ratio, int *throttle_ratio)
>>> +{
>>> +	int background = -1;
>>> +	int throttle = -1;
>>> +	struct task_struct *tsk = current;
>>> +
>>> +	task_lock(tsk);
>>> +	background = task_cs(tsk)->background_dirty_ratio;
>>> +	throttle = task_cs(tsk)->throttle_dirty_ratio;
>>> +	task_unlock(tsk);
>>
>> ditto?
>
> It is required to take the task lock while dereferencing the task's cpuset
> pointer.

Agreed.
-- Ethan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
[PATCH 6/6] cpuset dirty limits
Per cpuset dirty ratios

This implements dirty ratios per cpuset. Two new files are added to the
cpuset directories:

background_dirty_ratio	Percentage at which background writeback starts

throttle_dirty_ratio	Percentage at which the application is throttled
			and we start synchronous writeout.

Both variables are set to -1 by default which means that the global limits
(/proc/sys/vm/vm_dirty_ratio and /proc/sys/vm/dirty_background_ratio) are
used for a cpuset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---
Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 5/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 5/include/linux/cpuset.h	2007-09-11 14:50:48.0 -0700
+++ 7/include/linux/cpuset.h	2007-09-11 14:51:12.0 -0700
@@ -77,6 +77,7 @@ extern void cpuset_track_online_nodes(vo
 extern int current_cpuset_is_being_rebound(void);

+extern void cpuset_get_current_ratios(int *background, int *ratio);
 /*
  * We need macros since struct address_space is not defined yet
  */
diff -uprN -X 0/Documentation/dontdiff 5/kernel/cpuset.c 7/kernel/cpuset.c
--- 5/kernel/cpuset.c	2007-09-11 14:50:49.0 -0700
+++ 7/kernel/cpuset.c	2007-09-11 14:56:18.0 -0700
@@ -51,6 +51,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -92,6 +93,9 @@ struct cpuset {
 	int mems_generation;

 	struct fmeter fmeter;		/* memory_pressure filter */
+
+	int background_dirty_ratio;
+	int throttle_dirty_ratio;
 };

 /* Retrieve the cpuset for a container */
@@ -169,6 +173,8 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.background_dirty_ratio = -1,
+	.throttle_dirty_ratio = -1,
 };

 /*
@@ -785,6 +791,21 @@ static int update_flag(cpuset_flagbits_t
 	return 0;
 }

+static int update_int(int *cs_int, char *buf, int min, int max)
+{
+	char *endp;
+	int val;
+
+	val = simple_strtol(buf, &endp, 10);
+	if (val < min || val > max)
+		return -EINVAL;
+
+	mutex_lock(&callback_mutex);
+	*cs_int = val;
+	mutex_unlock(&callback_mutex);
+	return 0;
+}
+
 /*
  * Frequency meter - How fast is some event occurring?
  *
@@ -933,6 +954,8 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_THROTTLE_DIRTY_RATIO,
+	FILE_BACKGROUND_DIRTY_RATIO,
 } cpuset_filetype_t;

 static ssize_t cpuset_common_file_write(struct container *cont,
@@ -997,6 +1020,12 @@ static ssize_t cpuset_common_file_write(
 		retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
 		cs->mems_generation = cpuset_mems_generation++;
 		break;
+	case FILE_BACKGROUND_DIRTY_RATIO:
+		retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100);
+		break;
+	case FILE_THROTTLE_DIRTY_RATIO:
+		retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out2;
@@ -1090,6 +1119,12 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_SPREAD_SLAB:
 		*s++ = is_spread_slab(cs) ? '1' : '0';
 		break;
+	case FILE_BACKGROUND_DIRTY_RATIO:
+		s += sprintf(s, "%d", cs->background_dirty_ratio);
+		break;
+	case FILE_THROTTLE_DIRTY_RATIO:
+		s += sprintf(s, "%d", cs->throttle_dirty_ratio);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1173,6 +1208,20 @@ static struct cftype cft_spread_slab = {
 	.private = FILE_SPREAD_SLAB,
 };

+static struct cftype cft_background_dirty_ratio = {
+	.name = "background_dirty_ratio",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_BACKGROUND_DIRTY_RATIO,
+};
+
+static struct cftype cft_throttle_dirty_ratio = {
+	.name = "throttle_dirty_ratio",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_THROTTLE_DIRTY_RATIO,
+};
+
 static int cpuset_populate(struct container_subsys *ss, struct container *cont)
 {
 	int err;
@@ -1193,6 +1242,10 @@ static int cpuset_populate(struct contai
 		return err;
 	if ((err = container_add_file(cont, ss, &cft_spread_slab)) < 0)
 		return err;
+	if ((err = container_add_file(cont, ss, &cft_background_dirty_ratio)) < 0)
+		return err;
+	if ((err = container_add_file(cont, ss, &cft_throttle_di
[PATCH 5/6] cpuset write vm writeout
Throttle VM writeout in a cpuset aware way

This bases the vm throttling from the reclaim path on the dirty ratio of
the cpuset. Note that the cpuset nodemask is only effective if shrink_zone
is called from direct reclaim. kswapd has a cpuset context that includes
the whole machine, so VM throttling will only work during synchronous
reclaim and not from kswapd.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---
Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h	2007-09-11 14:49:47.0 -0700
+++ 5/include/linux/writeback.h	2007-09-11 14:50:52.0 -0700
@@ -94,7 +94,7 @@ static inline void inode_sync_wait(struc
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask);

 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c	2007-09-11 14:49:47.0 -0700
+++ 5/mm/page-writeback.c	2007-09-11 14:50:52.0 -0700
@@ -386,7 +386,7 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);

-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
 	struct dirty_limits dl;

@@ -401,7 +401,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 	for ( ; ; ) {
-		get_dirty_limits(&dl, NULL, &node_online_map);
+		get_dirty_limits(&dl, NULL, nodes);

 		/*
 		 * Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c	2007-09-11 14:50:41.0 -0700
+++ 5/mm/vmscan.c	2007-09-11 14:50:52.0 -0700
@@ -1185,7 +1185,7 @@ static unsigned long shrink_zone(int pri
 		}
 	}

-	throttle_vm_writeout(sc->gfp_mask);
+	throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);

 	atomic_dec(&zone->reclaim_in_progress);

 	return nr_reclaimed;
[PATCH 3/6] cpuset write throttle
Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset. If, e.g., a
cpuset contains only 1/10th of available memory then all of the memory of
the cpuset can be dirtied without any writes being triggered. If all of the
cpuset's memory is dirty then only 10% of total memory is dirty. The
background writeback threshold is usually set at 10% and the synchronous
threshold at 40%. So we are still below the global limits while the dirty
ratio in the cpuset is 100%! Writeback throttling and background writeout
do not work at all in such scenarios.

This patch makes dirty writeout cpuset aware. When determining the dirty
limits in get_dirty_limits() we calculate values based on the nodes that
are reachable from the current process (that has been dirtying the page).
Then we can trigger writeout based on the dirty ratio of the memory in the
cpuset.

We trigger writeout in a cpuset-specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have a
cpuset. However, please remember that the global counters were only
introduced recently. Before 2.6.18 we did add up per-processor counters for
each processor on each invocation of get_dirty_limits(). We now add up
per-node information, which I think is equal or less effort since there are
fewer nodes than processors.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-09-11 14:39:22.0 -0700 +++ 3/mm/page-writeback.c 2007-09-11 14:49:35.0 -0700 @@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); +struct dirty_limits { + long thresh_background; + long thresh_dirty; + unsigned long nr_dirty; + unsigned long nr_unstable; + unsigned long nr_writeback; +}; + /* * Work out the current dirty-memory clamping and background writeout * thresholds. @@ -121,16 +129,20 @@ static void background_writeout(unsigned * clamping level. */ -static unsigned long highmem_dirtyable_memory(unsigned long total) +static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total) { #ifdef CONFIG_HIGHMEM int node; unsigned long x = 0; + if (nodes == NULL) + nodes = &node_online_mask; for_each_node_state(node, N_HIGH_MEMORY) { struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM]; + if (!node_isset(node, nodes)) + continue; x += zone_page_state(z, NR_FREE_PAGES) + zone_page_state(z, NR_INACTIVE) + zone_page_state(z, NR_ACTIVE); @@ -154,26 +166,74 @@ static unsigned long determine_dirtyable x = global_page_state(NR_FREE_PAGES) + global_page_state(NR_INACTIVE) + global_page_state(NR_ACTIVE); - x -= highmem_dirtyable_memory(x); + x -= highmem_dirtyable_memory(NULL, x); return x + 1; /* Ensure that we never return 0 */ } -static void -get_dirty_limits(long *pbackground, long *pdirty, - struct address_space *mapping) +static int +get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping, + nodemask_t *nodes) { int background_ratio; /* Percentages */ int dirty_ratio; int unmapped_ratio; long background; long dirty; - unsigned long available_memory = determine_dirtyable_memory(); + unsigned long 
available_memory; + unsigned long nr_mapped; struct task_struct *tsk; + int is_subset = 0; - unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) + - global_page_state(NR_ANON_PAGES)) * 100) / - available_memory; +#ifdef CONFIG_CPUSETS + if (unlikely(nodes && + !nodes_subset(node_online_map, *nodes))) { + int node; + /* +* Calculate the limits relative to the current cpuset. +* +* We do not disregard highmem because all nodes (except +* maybe node 0) have either all memory in HIGHMEM (32 bit) or +* all memory in non HIGHMEM (64 bit). If we would disregard +
[PATCH 4/6] cpuset write vmscan
Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist and carefully check each
zone to see if it's a member of the active cpuset. But then we call pdflush
without enforcing the same restrictions. In a larger system this may have
the effect of a massive amount of pages being dirtied and then either

A. No writeout occurs because global dirty limits have not been reached, or

B. Writeout starts randomly for some dirty inode in the system. Pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected and
writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---
Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c	2007-09-11 14:41:56.0 -0700
+++ 4/mm/vmscan.c	2007-09-11 14:50:41.0 -0700
@@ -1301,7 +1301,8 @@ unsigned long do_try_to_free_pages(struc
 	 */
 	if (total_scanned > sc->swap_cluster_max +
 				sc->swap_cluster_max / 2) {
-		wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+		wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				&cpuset_current_mems_allowed);
 		sc->may_writepage = 1;
 	}
[PATCH 2/6] cpuset write pdflush nodemask
pdflush: Allow the passing of a nodemask parameter If we want to support nodeset specific writeout then we need a way to communicate the set of nodes that an operation should affect. So add a nodemask_t parameter to the pdflush functions and also store the nodemask in the pdflush control structure. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c --- 1/fs/buffer.c 2007-09-11 14:36:24.0 -0700 +++ 2/fs/buffer.c 2007-09-11 14:39:22.0 -0700 @@ -372,7 +372,7 @@ static void free_more_memory(void) struct zone **zones; pg_data_t *pgdat; - wakeup_pdflush(1024); + wakeup_pdflush(1024, NULL); yield(); for_each_online_pgdat(pgdat) { diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c --- 1/fs/super.c2007-09-11 14:36:05.0 -0700 +++ 2/fs/super.c2007-09-11 14:39:22.0 -0700 @@ -616,7 +616,7 @@ int do_remount_sb(struct super_block *sb return 0; } -static void do_emergency_remount(unsigned long foo) +static void do_emergency_remount(unsigned long foo, nodemask_t *bar) { struct super_block *sb; @@ -644,7 +644,7 @@ static void do_emergency_remount(unsigne void emergency_remount(void) { - pdflush_operation(do_emergency_remount, 0); + pdflush_operation(do_emergency_remount, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c --- 1/fs/sync.c 2007-09-11 14:36:05.0 -0700 +++ 2/fs/sync.c 2007-09-11 14:39:22.0 -0700 @@ -21,9 +21,9 @@ * sync everything. Start out by waking pdflush, because that writes back * all queues in parallel. 
*/ -static void do_sync(unsigned long wait) +static void do_sync(unsigned long wait, nodemask_t *unused) { - wakeup_pdflush(0); + wakeup_pdflush(0, NULL); sync_inodes(0); /* All mappings, inodes and their blockdevs */ DQUOT_SYNC(NULL); sync_supers(); /* Write the superblocks */ @@ -38,13 +38,13 @@ static void do_sync(unsigned long wait) asmlinkage long sys_sync(void) { - do_sync(1); + do_sync(1, NULL); return 0; } void emergency_sync(void) { - pdflush_operation(do_sync, 0); + pdflush_operation(do_sync, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h --- 1/include/linux/writeback.h 2007-09-11 14:37:46.0 -0700 +++ 2/include/linux/writeback.h 2007-09-11 14:39:22.0 -0700 @@ -91,7 +91,7 @@ static inline void inode_sync_wait(struc /* * mm/page-writeback.c */ -int wakeup_pdflush(long nr_pages); +int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); void throttle_vm_writeout(gfp_t gfp_mask); @@ -122,7 +122,8 @@ balance_dirty_pages_ratelimited(struct a typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, void *data); -int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); +int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes), + unsigned long arg0, nodemask_t *nodes); int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); int write_cache_pages(struct address_space *mapping, diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c --- 1/mm/page-writeback.c 2007-09-11 14:36:24.0 -0700 +++ 2/mm/page-writeback.c 2007-09-11 14:39:22.0 -0700 @@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -static void background_writeout(unsigned long _min_pages); +static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); /* * Work out the current dirty-memory clamping and background writeout @@ -272,7 
+272,7 @@ static void balance_dirty_pages(struct a */ if ((laptop_mode && pages_written) || (!laptop_mode && (nr_reclaimable > background_thresh))) - pdflush_operation(background_writeout, 0); + pdflush_operation(background_writeout, 0, NULL); } void set_page_dirty_balance(struct page *page) @@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. */ -static void background_writeout(unsigned long _min_pages) +static void background_writeout(unsigned long _min_pages, nodemask_t *unused) { long min_pages = _min_pages; struct writeback_control wbc = { @@ -402,12 +402,12 @@ static void background_writeout(unsigned * the whole world. Returns 0 if a pdflush thread was dispat
[PATCH 1/6] cpuset write dirty map
Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for
applications that are constrained to a portion of the memory of the system
as required by cpusets.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes

cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes map

cpuset_init_dirty_nodes(mapping)	First time init of the map

The dirty map may be stored either directly in the mapping (for NUMA
systems with less than BITS_PER_LONG nodes) or separately allocated for
systems with a large number of nodes (e.g. IA64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large
configurations. Therefore we protect the allocation and setting of a node
in the map through the tree_lock. The tree_lock is already taken when a
page is dirtied so there is no additional locking overhead if we insert the
updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared. At that
point no pages are attached to the inode anymore and therefore it can be
done without any locking. The dirty map therefore records all nodes that
have been used for dirty pages by that inode until the inode is no longer
used.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c --- 0/fs/buffer.c 2007-09-11 14:35:58.0 -0700 +++ 1/fs/buffer.c 2007-09-11 14:36:24.0 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -723,6 +724,7 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + cpuset_update_dirty_nodes(mapping, page); write_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c --- 0/fs/fs-writeback.c 2007-09-11 14:35:58.0 -0700 +++ 1/fs/fs-writeback.c 2007-09-11 14:36:24.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include #include "internal.h" int sysctl_inode_debug __read_mostly; @@ -476,6 +477,12 @@ int generic_sync_sb_inodes(struct super_ continue; /* blockdev has wrong queue */ } + if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) { + /* No pages on the nodes under writeback */ + list_move(&inode->i_list, &sb->s_dirty); + continue; + } + /* Was this inode dirtied after sync_sb_inodes was called? 
*/ if (time_after(inode->dirtied_when, start)) break; diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c --- 0/fs/inode.c2007-09-11 14:35:58.0 -0700 +++ 1/fs/inode.c2007-09-11 14:36:24.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include /* * This is needed for the following functions: @@ -157,6 +158,7 @@ static struct inode *alloc_inode(struct mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; + cpuset_init_dirty_nodes(mapping); /* * If the block_device provides a backing_dev_info for client @@ -264,6 +266,7 @@ void clear_inode(struct inode *inode) bd_forget(inode); if (S_ISCHR(inode->i_mode) && inode->i_cdev) cd_forget(inode); + cpuset_clear_dirty_nodes(inode->i_mapping); inode->i_state = I_CLEAR; } diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h --- 0/include/linux/cpuset.h2007-09-11 14:35:58.0 -0700 +++ 1/include/linux/cpuset.h2007-09-11 14:36:24.0 -0700 @@ -77,6 +77,45 @@ extern void cpuset_track_online_nodes(vo extern int current_cpuset_is_being_rebound(void); +/* + * We need macros since struct address_space is not defined yet + */ +#if MAX_NUMNODES <= BITS_PER_LONG +#define cpuset_update_dirty_nodes(__mapping, __page) \ + do {\ + int node = page_to_nid(__page); \ + if (!node_isset(node, (__mapping)->dirty_nodes))\ +
Re: [PATCH 0/6] cpuset aware writeback
Perform writeback and dirty throttling with awareness of cpuset
mems_allowed.

The theory of operation has two primary elements:

1. Add a nodemask per mapping which indicates the nodes which have set
   PageDirty on any page of the mapping.

2. Add a nodemask argument to wakeup_pdflush() which is propagated down
   to sync_sb_inodes.

This leaves sync_sb_inodes() with two nodemasks. One is passed to it and
specifies the nodes the caller is interested in syncing; it will either be
NULL (i.e. all nodes) or will be cpuset_current_mems_allowed in the
caller's context. The second nodemask is attached to the inode's mapping
and shows who has modified data in the inode. sync_sb_inodes() will then
skip syncing of inodes if the nodemask argument does not intersect with the
mapping nodemask.

cpuset_current_mems_allowed will be passed in to pdflush
background_writeout by try_to_free_pages and balance_dirty_pages.
balance_dirty_pages also passes the nodemask in to writeback_inodes
directly when doing active reclaim. Other callers do not limit inode
writeback, passing in a NULL nodemask pointer.

A final change is to get_dirty_limits. It takes a nodemask argument, and
when it is NULL there is no change in behavior. If the nodemask is set,
page statistics are accumulated only for the specified nodes, and the
background and throttle dirty ratios will be read from a new per-cpuset
ratio feature.

For testing I did a variety of basic tests, verifying individual features
of the patchset. To verify that it fixes the core problem, I created a
stress test which involved using cpusets and mems_allowed to split memory
so that all daemons had memory set aside for them, and my memory stress
test had a separate set of memory. The stress test was mmaping 7GB of a
very large file on disk. It then scans the entire 7GB of memory, reading
and modifying each byte. 7GB is more than the amount of physical memory
made available to the stress test.
Using iostat I can see the initial period of reading from disk, followed by
a period of simultaneous reads and writes as dirty bytes are pushed to make
room for new reads.

In a separate log-in, in the other cpuset, I am running:

while `true`; do date | tee -a date.txt; sleep 5; done

date.txt resides on the same disk as the large file mentioned above. The
above while-loop serves the dual purpose of providing me visual clues of
progress along with the opportunity for the "tee" command to become
throttled writing to the disk.

The effect of this patchset is straightforward. Without it there are long
hangs between appearances of the date. With it the dates are all 5 (or
sometimes 6) seconds apart. I also added printks to the kernel to verify
that, without these patches, the tee was being throttled (along with lots
of other things), and with the patch only pdflush is being throttled.

These patches are mostly unchanged from Chris Lameter's original changelist
posted previously to linux-mm.
Re: [PATCH 0/6] cpuset aware writeback
Christoph Lameter wrote:
> On Tue, 17 Jul 2007 14:23:14 -0700
> Ethan Solomita <[EMAIL PROTECTED]> wrote:
>
>> These patches are mostly unchanged from Chris Lameter's original
>> changelist posted previously to linux-mm.
>
> Thanks for keeping these patches up to date. Add your signoff if you
> did modifications to a patch. Also include the description of the tests
> in the introduction to the patchset.

So switch from an Ack to a Signed-off-by? OK, and I'll add descriptions of
testing.

Everyone other than you has been silent on these patches. Does silence
equal consent?
-- Ethan
[PATCH 6/6] cpuset dirty limits
Per cpuset dirty ratios This implements dirty ratios per cpuset. Two new files are added to the cpuset directories: background_dirty_ratio Percentage at which background writeback starts throttle_dirty_ratioPercentage at which the application is throttled and we start synchrononous writeout. Both variables are set to -1 by default which means that the global limits (/proc/sys/vm/vm_dirty_ratio and /proc/sys/vm/dirty_background_ratio) are used for a cpuset. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 6/include/linux/cpuset.h 7/include/linux/cpuset.h --- 6/include/linux/cpuset.h2007-07-11 21:17:08.0 -0700 +++ 7/include/linux/cpuset.h2007-07-11 21:17:41.0 -0700 @@ -76,6 +76,7 @@ extern void cpuset_track_online_nodes(vo extern int current_cpuset_is_being_rebound(void); +extern void cpuset_get_current_ratios(int *background, int *ratio); /* * We need macros since struct address_space is not defined yet */ diff -uprN -X 0/Documentation/dontdiff 6/kernel/cpuset.c 7/kernel/cpuset.c --- 6/kernel/cpuset.c 2007-07-12 12:15:20.0 -0700 +++ 7/kernel/cpuset.c 2007-07-12 12:15:34.0 -0700 @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -92,6 +93,9 @@ struct cpuset { int mems_generation; struct fmeter fmeter; /* memory_pressure filter */ + + int background_dirty_ratio; + int throttle_dirty_ratio; }; /* Update the cpuset for a container */ @@ -175,6 +179,8 @@ static struct cpuset top_cpuset = { .flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)), .cpus_allowed = CPU_MASK_ALL, .mems_allowed = NODE_MASK_ALL, + .background_dirty_ratio = -1, + .throttle_dirty_ratio = -1, }; /* @@ -776,6 +782,21 @@ static int update_flag(cpuset_flagbits_t return 0; } +static int update_int(int *cs_int, char *buf, int min, int max) +{ + char *endp; + int val; + + val = simple_strtol(buf, &endp, 10); + if (val < min || val > max) + return -EINVAL; 
+ + mutex_lock(&callback_mutex); + *cs_int = val; + mutex_unlock(&callback_mutex); + return 0; +} + /* * Frequency meter - How fast is some event occurring? * @@ -924,6 +945,8 @@ typedef enum { FILE_MEMORY_PRESSURE, FILE_SPREAD_PAGE, FILE_SPREAD_SLAB, + FILE_THROTTLE_DIRTY_RATIO, + FILE_BACKGROUND_DIRTY_RATIO, } cpuset_filetype_t; static ssize_t cpuset_common_file_write(struct container *cont, @@ -988,6 +1011,12 @@ static ssize_t cpuset_common_file_write( retval = update_flag(CS_SPREAD_SLAB, cs, buffer); cs->mems_generation = cpuset_mems_generation++; break; + case FILE_BACKGROUND_DIRTY_RATIO: + retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100); + break; + case FILE_THROTTLE_DIRTY_RATIO: + retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100); + break; default: retval = -EINVAL; goto out2; @@ -1081,6 +1110,12 @@ static ssize_t cpuset_common_file_read(s case FILE_SPREAD_SLAB: *s++ = is_spread_slab(cs) ? '1' : '0'; break; + case FILE_BACKGROUND_DIRTY_RATIO: + s += sprintf(s, "%d", cs->background_dirty_ratio); + break; + case FILE_THROTTLE_DIRTY_RATIO: + s += sprintf(s, "%d", cs->throttle_dirty_ratio); + break; default: retval = -EINVAL; goto out; @@ -1164,6 +1199,20 @@ static struct cftype cft_spread_slab = { .private = FILE_SPREAD_SLAB, }; +static struct cftype cft_background_dirty_ratio = { + .name = "background_dirty_ratio", + .read = cpuset_common_file_read, + .write = cpuset_common_file_write, + .private = FILE_BACKGROUND_DIRTY_RATIO, +}; + +static struct cftype cft_throttle_dirty_ratio = { + .name = "throttle_dirty_ratio", + .read = cpuset_common_file_read, + .write = cpuset_common_file_write, + .private = FILE_THROTTLE_DIRTY_RATIO, +}; + int cpuset_populate(struct container_subsys *ss, struct container *cont) { int err; @@ -1184,6 +1233,10 @@ int cpuset_populate(struct container_sub return err; if ((err = container_add_file(cont, &cft_spread_slab)) < 0) return err; + if ((err = container_add_file(cont, &cft_background_dirty_ratio)) < 
0) + return err; + if ((err = container_add_file(cont, &cft_throttle_dirty_ratio)) < 0) +
[PATCH 5/6] cpuset write vm writeout
Throttle VM writeout in a cpuset aware way This bases the vm throttling from the reclaim path on the dirty ratio of the cpuset. Note that a cpuset is only effective if shrink_zone is called from direct reclaim. kswapd has a cpuset context that includes the whole machine. VM throttling will only work during synchronous reclaim and not from kswapd. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h --- 4/include/linux/writeback.h 2007-07-11 21:16:25.0 -0700 +++ 5/include/linux/writeback.h 2007-07-11 21:16:50.0 -0700 @@ -95,7 +95,7 @@ static inline void inode_sync_wait(struc int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(gfp_t gfp_mask); +void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask); /* These are exported to sysctl.
*/ extern int dirty_background_ratio; diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c --- 4/mm/page-writeback.c 2007-07-16 18:31:13.0 -0700 +++ 5/mm/page-writeback.c 2007-07-16 18:32:08.0 -0700 @@ -384,7 +384,7 @@ void balance_dirty_pages_ratelimited_nr( } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(gfp_t gfp_mask) +void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask) { struct dirty_limits dl; @@ -399,7 +399,7 @@ void throttle_vm_writeout(gfp_t gfp_mask } for ( ; ; ) { - get_dirty_limits(&dl, NULL, &node_online_map); + get_dirty_limits(&dl, NULL, nodes); /* * Boost the allowable dirty threshold a bit for page diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c --- 4/mm/vmscan.c 2007-07-11 21:16:26.0 -0700 +++ 5/mm/vmscan.c 2007-07-11 21:16:50.0 -0700 @@ -1064,7 +1064,7 @@ static unsigned long shrink_zone(int pri } } - throttle_vm_writeout(sc->gfp_mask); + throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask); atomic_dec(&zone->reclaim_in_progress); return nr_reclaimed; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/6] cpuset write vmscan
Direct reclaim: cpuset aware writeout During direct reclaim we traverse down a zonelist, carefully checking whether each zone is a member of the active cpuset. But then we call pdflush without enforcing the same restrictions. In a larger system this may have the effect of a massive amount of pages being dirtied and then either A. No writeout occurs because global dirty limits have not been reached or B. Writeout starts randomly for some dirty inode in the system. Pdflush may just write out data for nodes in another cpuset and miss doing proper dirty handling for the current cpuset. In both cases dirty pages in the zones of interest may not be affected and writeout may not occur as necessary. Fix that by restricting pdflush to the active cpuset. Writeout will occur from direct reclaim the same way as without a cpuset. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c --- 3/mm/vmscan.c 2007-07-11 21:16:14.0 -0700 +++ 4/mm/vmscan.c 2007-07-11 21:16:26.0 -0700 @@ -1183,7 +1183,8 @@ unsigned long try_to_free_pages(struct z */ if (total_scanned > sc.swap_cluster_max + sc.swap_cluster_max / 2) { - wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL); + wakeup_pdflush(laptop_mode ? 0 : total_scanned, + &cpuset_current_mems_allowed); sc.may_writepage = 1; }
[PATCH 2/6] cpuset write pdflush nodemask
pdflush: Allow the passing of a nodemask parameter If we want to support nodeset specific writeout then we need a way to communicate the set of nodes that an operation should affect. So add a nodemask_t parameter to the pdflush functions and also store the nodemask in the pdflush control structure. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c --- 1/fs/buffer.c 2007-07-11 21:08:04.0 -0700 +++ 2/fs/buffer.c 2007-07-11 21:15:47.0 -0700 @@ -359,7 +359,7 @@ static void free_more_memory(void) struct zone **zones; pg_data_t *pgdat; - wakeup_pdflush(1024); + wakeup_pdflush(1024, NULL); yield(); for_each_online_pgdat(pgdat) { diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c --- 1/fs/super.c2007-07-11 21:07:41.0 -0700 +++ 2/fs/super.c2007-07-11 21:15:47.0 -0700 @@ -615,7 +615,7 @@ int do_remount_sb(struct super_block *sb return 0; } -static void do_emergency_remount(unsigned long foo) +static void do_emergency_remount(unsigned long foo, nodemask_t *bar) { struct super_block *sb; @@ -643,7 +643,7 @@ static void do_emergency_remount(unsigne void emergency_remount(void) { - pdflush_operation(do_emergency_remount, 0); + pdflush_operation(do_emergency_remount, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c --- 1/fs/sync.c 2007-07-11 21:07:41.0 -0700 +++ 2/fs/sync.c 2007-07-11 21:15:47.0 -0700 @@ -21,9 +21,9 @@ * sync everything. Start out by waking pdflush, because that writes back * all queues in parallel. 
*/ -static void do_sync(unsigned long wait) +static void do_sync(unsigned long wait, nodemask_t *unused) { - wakeup_pdflush(0); + wakeup_pdflush(0, NULL); sync_inodes(0); /* All mappings, inodes and their blockdevs */ DQUOT_SYNC(NULL); sync_supers(); /* Write the superblocks */ @@ -38,13 +38,13 @@ static void do_sync(unsigned long wait) asmlinkage long sys_sync(void) { - do_sync(1); + do_sync(1, NULL); return 0; } void emergency_sync(void) { - pdflush_operation(do_sync, 0); + pdflush_operation(do_sync, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h --- 1/include/linux/writeback.h 2007-07-11 21:12:25.0 -0700 +++ 2/include/linux/writeback.h 2007-07-11 21:15:47.0 -0700 @@ -92,7 +92,7 @@ static inline void inode_sync_wait(struc /* * mm/page-writeback.c */ -int wakeup_pdflush(long nr_pages); +int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); void throttle_vm_writeout(gfp_t gfp_mask); @@ -123,7 +123,8 @@ balance_dirty_pages_ratelimited(struct a typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, void *data); -int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); +int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes), + unsigned long arg0, nodemask_t *nodes); int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); int write_cache_pages(struct address_space *mapping, diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c --- 1/mm/page-writeback.c 2007-07-11 21:08:04.0 -0700 +++ 2/mm/page-writeback.c 2007-07-11 21:15:47.0 -0700 @@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -static void background_writeout(unsigned long _min_pages); +static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); /* * Work out the current dirty-memory clamping and background writeout @@ -272,7 
+272,7 @@ static void balance_dirty_pages(struct a */ if ((laptop_mode && pages_written) || (!laptop_mode && (nr_reclaimable > background_thresh))) - pdflush_operation(background_writeout, 0); + pdflush_operation(background_writeout, 0, NULL); } void set_page_dirty_balance(struct page *page) @@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. */ -static void background_writeout(unsigned long _min_pages) +static void background_writeout(unsigned long _min_pages, nodemask_t *unused) { long min_pages = _min_pages; struct writeback_control wbc = { @@ -402,12 +402,12 @@ static void background_writeout(unsigned * the whole world. Returns 0 if a pdflush thread was dispat
[PATCH 3/6] cpuset write throttle
Make page writeback obey cpuset constraints Currently dirty throttling does not work properly in a cpuset. If, for example, a cpuset contains only 1/10th of available memory then all of the memory of a cpuset can be dirtied without any writes being triggered. If all of the cpuset's memory is dirty then only 10% of total memory is dirty. The background writeback threshold is usually set at 10% and the synchronous threshold at 40%. So we are still below the global limits while the dirty ratio in the cpuset is 100%! Writeback throttling and background writeout do not work at all in such scenarios. This patch makes dirty writeout cpuset aware. When determining the dirty limits in get_dirty_limits() we calculate values based on the nodes that are reachable from the current process (that has been dirtying the page). Then we can trigger writeout based on the dirty ratio of the memory in the cpuset. We trigger writeout in a cpuset-specific way. We go through the dirty inodes and search for inodes that have dirty pages on the nodes of the active cpuset. If an inode fulfills that requirement then we begin writeout of the dirty pages of that inode. Adding up all the counters for each node in a cpuset may seem to be quite an expensive operation (in particular for large cpusets with hundreds of nodes) compared to just accessing the global counters if we do not have a cpuset. However, please remember that the global counters were only introduced recently. Before 2.6.18 we did add up per processor counters for each processor on each invocation of get_dirty_limits(). We now add per node information which I think is equal or less effort since there are fewer nodes than processors.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-07-11 21:15:47.0 -0700 +++ 3/mm/page-writeback.c 2007-07-16 18:30:01.0 -0700 @@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); +struct dirty_limits { + long thresh_background; + long thresh_dirty; + unsigned long nr_dirty; + unsigned long nr_unstable; + unsigned long nr_writeback; +}; + /* * Work out the current dirty-memory clamping and background writeout * thresholds. @@ -121,13 +129,15 @@ static void background_writeout(unsigned * clamping level. */ -static unsigned long highmem_dirtyable_memory(unsigned long total) +static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total) { #ifdef CONFIG_HIGHMEM int node; unsigned long x = 0; - for_each_online_node(node) { + if (nodes == NULL) + nodes = &node_online_mask; + for_each_node_mask(node, *nodes) { struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM]; @@ -154,26 +164,74 @@ static unsigned long determine_dirtyable x = global_page_state(NR_FREE_PAGES) + global_page_state(NR_INACTIVE) + global_page_state(NR_ACTIVE); - x -= highmem_dirtyable_memory(x); + x -= highmem_dirtyable_memory(NULL, x); return x + 1; /* Ensure that we never return 0 */ } -static void -get_dirty_limits(long *pbackground, long *pdirty, - struct address_space *mapping) +static int +get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping, + nodemask_t *nodes) { int background_ratio; /* Percentages */ int dirty_ratio; int unmapped_ratio; long background; long dirty; - unsigned long available_memory = determine_dirtyable_memory(); + unsigned long available_memory; + unsigned long nr_mapped; struct task_struct *tsk; + int is_subset = 0; - unmapped_ratio = 100 - 
((global_page_state(NR_FILE_MAPPED) + - global_page_state(NR_ANON_PAGES)) * 100) / - available_memory; +#ifdef CONFIG_CPUSETS + if (unlikely(nodes && + !nodes_subset(node_online_map, *nodes))) { + int node; + /* +* Calculate the limits relative to the current cpuset. +* +* We do not disregard highmem because all nodes (except +* maybe node 0) have either all memory in HIGHMEM (32 bit) or +* all memory in non HIGHMEM (64 bit). If we would disregard +* highmem then cpuset throttling would not work on 32 bit. +*/ + is_subset = 1; + memset(dl, 0, sizeof(struct dirty_limits)); + available_memory = 0
[PATCH 1/6] cpuset write dirty map
Add a dirty map to struct address_space In a NUMA system it is helpful to know where the dirty pages of a mapping are located. That way we will be able to implement writeout for applications that are constrained to a portion of the memory of the system as required by cpusets. This patch implements the management of dirty node maps for an address space through the following functions: cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes cpuset_update_nodes(mapping, page) Record a node in the dirty nodes map cpuset_init_dirty_nodes(mapping) First time init of the map The dirty map may be stored either directly in the mapping (for NUMA systems with fewer than BITS_PER_LONG nodes) or separately allocated for systems with a large number of nodes (e.g. IA64 with 1024 nodes). Updating the dirty map may involve allocating it first for large configurations. Therefore we protect the allocation and setting of a node in the map through the tree_lock. The tree_lock is already taken when a page is dirtied so there is no additional locking overhead if we insert the updating of the nodemask there. The dirty map is only cleared (or freed) when the inode is cleared. At that point no pages are attached to the inode anymore and therefore it can be done without any locking. The dirty map therefore records all nodes that have been used for dirty pages by that inode until the inode is no longer used.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Acked-by: Ethan Solomita <[EMAIL PROTECTED]> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c --- 0/fs/buffer.c 2007-07-11 20:30:55.0 -0700 +++ 1/fs/buffer.c 2007-07-11 21:08:04.0 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -710,6 +711,7 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + cpuset_update_dirty_nodes(mapping, page); write_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c --- 0/fs/fs-writeback.c 2007-07-11 20:30:55.0 -0700 +++ 1/fs/fs-writeback.c 2007-07-11 21:08:04.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include #include "internal.h" int sysctl_inode_debug __read_mostly; @@ -492,6 +493,12 @@ int generic_sync_sb_inodes(struct super_ continue; /* blockdev has wrong queue */ } + if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) { + /* No pages on the nodes under writeback */ + list_move(&inode->i_list, &sb->s_dirty); + continue; + } + /* Was this inode dirtied after sync_sb_inodes was called? 
*/ if (time_after(inode->dirtied_when, start)) break; diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c --- 0/fs/inode.c2007-07-11 20:30:55.0 -0700 +++ 1/fs/inode.c2007-07-11 21:08:04.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include /* * This is needed for the following functions: @@ -157,6 +158,7 @@ static struct inode *alloc_inode(struct mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; + cpuset_init_dirty_nodes(mapping); /* * If the block_device provides a backing_dev_info for client @@ -264,6 +266,7 @@ void clear_inode(struct inode *inode) bd_forget(inode); if (S_ISCHR(inode->i_mode) && inode->i_cdev) cd_forget(inode); + cpuset_clear_dirty_nodes(inode->i_mapping); inode->i_state = I_CLEAR; } diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h --- 0/include/linux/cpuset.h2007-07-11 20:30:56.0 -0700 +++ 1/include/linux/cpuset.h2007-07-11 21:08:04.0 -0700 @@ -76,6 +76,45 @@ extern void cpuset_track_online_nodes(vo extern int current_cpuset_is_being_rebound(void); +/* + * We need macros since struct address_space is not defined yet + */ +#if MAX_NUMNODES <= BITS_PER_LONG +#define cpuset_update_dirty_nodes(__mapping, __page) \ + do {\ + int node = page_to_nid(__page); \ + if (!node_isset(node, (__mapping)->dirty_nodes))\ +
[PATCH 0/6] cpuset aware writeback
Perform writeback and dirty throttling with awareness of cpuset mems_allowed. The theory of operation has two primary elements: 1. Add a nodemask per mapping which indicates the nodes which have set PageDirty on any page of the mapping. 2. Add a nodemask argument to wakeup_pdflush() which is propagated down to sync_sb_inodes. This leaves sync_sb_inodes() with two nodemasks. One is passed to it and specifies the nodes the caller is interested in syncing, and will either be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the caller's context. The second nodemask is attached to the inode's mapping and shows who has modified data in the inode. sync_sb_inodes() will then skip syncing of inodes if the nodemask argument does not intersect with the mapping nodemask. cpuset_current_mems_allowed will be passed in to pdflush background_writeout by try_to_free_pages and balance_dirty_pages. balance_dirty_pages also passes the nodemask in to writeback_inodes directly when doing active reclaim. Other callers do not limit inode writeback, passing in a NULL nodemask pointer. A final change is to get_dirty_limits. It takes a nodemask argument, and when it is null there is no change in behavior. If the nodemask is set, page statistics are accumulated only for specified nodes, and the background and throttle dirty ratios will be read from a new per-cpuset ratio feature. These patches are mostly unchanged from Christoph Lameter's original changelist posted previously to linux-mm.
Re: [RFC 1/7] cpuset write dirty map
Christoph Lameter wrote: > > This may be a leftover from earlier times when the logic was different in > throttle vm writeout? Sorry -- my merge error when looking at an earlier kernel, no issue with mainline or -mm. -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Christoph -- I have a question about one part of the patches. In throttle_vm_writeout() you added a clause that checks for __GFP_FS | __GFP_IO and if they're not both set it calls blk_congestion_wait() immediately and then returns, with no chance of looping. Two questions: 1. This seems like an unrelated bug fix. Should you submit it as a standalone patch? 2. You put this gfp check before the check for get_dirty_limits. It's possible that this will block even though without your change it would have returned straight away. Would it be better, instead of adding the if-clause at the top of the function, to embed the gfp check at the end of the for-loop after calling blk_congestion_wait? -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Christoph Lameter wrote: > On Wed, 27 Jun 2007, Ethan Solomita wrote: > >> I looked over it at one point. Most of the code doesn't conflict, but I >> believe that the code path which calculates the dirty limits will need >> some merging. Doable but non-trivial. >> -- Ethan > > I hope you will keep on updating the patchset and posting it against > current mm? > I have no new changes, but I can update it against the current mm. Or did the per-bdi throttling change get taken by Andrew? -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Andrew Morton wrote: > > One open question is the interaction between these changes and with Peter's > per-device-dirty-throttling changes. They also are in my queue somewhere. I looked over it at one point. Most of the code doesn't conflict, but I believe that the code path which calculates the dirty limits will need some merging. Doable but non-trivial. -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Christoph Lameter wrote: > > What testing was done? Would you include the results of tests in your next > post? Sorry for the delay in responding -- I was chasing phantom failures. I created a stress test which involved using cpusets and mems_allowed to split memory so that all daemons had memory set aside for them, and my memory stress test had a separate set of memory. The stress test was mmapping 7GB of a very large file on disk. It then scans the entire 7GB of memory reading and modifying each byte. 7GB is more than the amount of physical memory made available to the stress test. Using iostat I can see the initial period of reading from disk, followed by a period of simultaneous reads and writes as dirty bytes are pushed to make room for new reads. In a separate log-in, in the other cpuset, I am running: while `true`; do date | tee -a date.txt; sleep 5; done date.txt resides on the same disk as the large file mentioned above. The above while-loop serves the dual purpose of providing me visual clues of progress along with the opportunity for the "tee" command to become throttled writing to the disk. The effect of this patchset is straightforward. Without it there are long hangs between appearances of the date. With it the dates are all 5 (or sometimes 6) seconds apart. I also added printks to the kernel to verify that, without these patches, the tee was being throttled (along with lots of other things), and with the patch only pdflush is being throttled. -- Ethan
Re: [RFC 1/7] cpuset write dirty map
Christoph Lameter wrote: > On Thu, 31 May 2007, Ethan Solomita wrote: > >> The dirty map is only cleared (or freed) when the inode is cleared. >> At that point no pages are attached to the inode anymore and therefore it can >> be done without any locking. The dirty map therefore records all nodes that >> have been used for dirty pages by that inode until the inode is no longer >> used. >> >> Originally by Christoph Lameter <[EMAIL PROTECTED]> > > You should preserve my Signed-off-by: since I wrote most of this. Is there > a changelog? > I wasn't sure of the etiquette -- I'd thought that by saying you had signed it off that meant you were accepting my modifications, and didn't want to presume. But I will change it if you like. No slight intended. Unfortunately I don't have a changelog, and since I've since forward ported the changes it would be hard to produce. If you want to review it you should probably review it all, because the forward porting may have introduced issues. -- Ethan
[RFC 7/7] cpuset dirty limits
Per cpuset dirty ratios This implements dirty ratios per cpuset. Two new files are added to the cpuset directories: background_dirty_ratio Percentage at which background writeback starts throttle_dirty_ratio Percentage at which the application is throttled and we start synchronous writeout. Both variables are set to -1 by default which means that the global limits (/proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio) are used for a cpuset. Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 6/include/linux/cpuset.h 7/include/linux/cpuset.h --- 6/include/linux/cpuset.h2007-05-30 11:39:17.0 -0700 +++ 7/include/linux/cpuset.h2007-05-30 11:39:48.0 -0700 @@ -75,6 +75,7 @@ static inline int cpuset_do_slab_mem_spr extern void cpuset_track_online_nodes(void); +extern void cpuset_get_current_ratios(int *background, int *ratio); /* * We need macros since struct address_space is not defined yet */ diff -uprN -X 0/Documentation/dontdiff 6/kernel/cpuset.c 7/kernel/cpuset.c --- 6/kernel/cpuset.c 2007-05-30 11:39:17.0 -0700 +++ 7/kernel/cpuset.c 2007-05-30 11:39:48.0 -0700 @@ -49,6 +49,7 @@ #include #include #include +#include #include #include @@ -99,6 +100,9 @@ struct cpuset { int mems_generation; struct fmeter fmeter; /* memory_pressure filter */ + + int background_dirty_ratio; + int throttle_dirty_ratio; }; /* bits in struct cpuset flags field */ @@ -176,6 +180,8 @@ static struct cpuset top_cpuset = { .count = ATOMIC_INIT(0), .sibling = LIST_HEAD_INIT(top_cpuset.sibling), .children = LIST_HEAD_INIT(top_cpuset.children), + .background_dirty_ratio = -1, + .throttle_dirty_ratio = -1, }; static struct vfsmount *cpuset_mount; @@ -1030,6 +1036,21 @@ static int update_flag(cpuset_flagbits_t return 0; } +static int update_int(int *cs_int, char *buf, int min, int max) +{ + char *endp; + int val; + + val = simple_strtol(buf, &endp, 10); + if (val < min || val > max) + return
-EINVAL; + + mutex_lock(&callback_mutex); + *cs_int = val; + mutex_unlock(&callback_mutex); + return 0; +} + /* * Frequency meter - How fast is some event occurring? * @@ -1238,6 +1259,8 @@ typedef enum { FILE_SPREAD_PAGE, FILE_SPREAD_SLAB, FILE_TASKLIST, + FILE_THROTTLE_DIRTY_RATIO, + FILE_BACKGROUND_DIRTY_RATIO, } cpuset_filetype_t; static ssize_t cpuset_common_file_write(struct file *file, @@ -1308,6 +1331,12 @@ static ssize_t cpuset_common_file_write( case FILE_TASKLIST: retval = attach_task(cs, buffer, &pathbuf); break; + case FILE_BACKGROUND_DIRTY_RATIO: + retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100); + break; + case FILE_THROTTLE_DIRTY_RATIO: + retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100); + break; default: retval = -EINVAL; goto out2; @@ -1420,6 +1449,12 @@ static ssize_t cpuset_common_file_read(s case FILE_SPREAD_SLAB: *s++ = is_spread_slab(cs) ? '1' : '0'; break; + case FILE_BACKGROUND_DIRTY_RATIO: + s += sprintf(s, "%d", cs->background_dirty_ratio); + break; + case FILE_THROTTLE_DIRTY_RATIO: + s += sprintf(s, "%d", cs->throttle_dirty_ratio); + break; default: retval = -EINVAL; goto out; @@ -1788,6 +1823,16 @@ static struct cftype cft_spread_slab = { .private = FILE_SPREAD_SLAB, }; +static struct cftype cft_background_dirty_ratio = { + .name = "background_dirty_ratio", + .private = FILE_BACKGROUND_DIRTY_RATIO, +}; + +static struct cftype cft_throttle_dirty_ratio = { + .name = "throttle_dirty_ratio", + .private = FILE_THROTTLE_DIRTY_RATIO, +}; + static int cpuset_populate_dir(struct dentry *cs_dentry) { int err; @@ -1810,6 +1855,10 @@ static int cpuset_populate_dir(struct de return err; if ((err = cpuset_add_file(cs_dentry, &cft_spread_slab)) < 0) return err; + if ((err = cpuset_add_file(cs_dentry, &cft_background_dirty_ratio)) < 0) + return err; + if ((err = cpuset_add_file(cs_dentry, &cft_throttle_dirty_ratio)) < 0) + return err; if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0) return err; return 0; @@ 
-1849,6 +1898,8 @@ static long cpuset_create(struct cpuset INIT_LIST_HEAD(&cs
[RFC 6/7] cpuset write fixes
Remove unneeded local variable. Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 5/mm/page-writeback.c 6/mm/page-writeback.c --- 5/mm/page-writeback.c 2007-05-30 11:37:01.0 -0700 +++ 6/mm/page-writeback.c 2007-05-30 11:39:25.0 -0700 @@ -177,7 +177,6 @@ get_dirty_limits(struct dirty_limits *dl int unmapped_ratio; long background; long dirty; - unsigned long available_memory = determine_dirtyable_memory(); unsigned long dirtyable_memory; unsigned long nr_mapped; struct task_struct *tsk;
[RFC 5/7] cpuset write vm writeout
Throttle VM writeout in a cpuset aware way This bases the vm throttling from the reclaim path on the dirty ratio of the cpuset. Note that a cpuset is only effective if shrink_zone is called from direct reclaim. kswapd has a cpuset context that includes the whole machine. VM throttling will only work during synchronous reclaim and not from kswapd. Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h --- 4/include/linux/writeback.h 2007-05-30 11:36:14.0 -0700 +++ 5/include/linux/writeback.h 2007-05-30 11:37:01.0 -0700 @@ -89,7 +89,7 @@ static inline void wait_on_inode(struct int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(gfp_t gfp_mask); +void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask); /* These are exported to sysctl. */ extern int dirty_background_ratio; diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c --- 4/mm/page-writeback.c 2007-05-30 11:36:15.0 -0700 +++ 5/mm/page-writeback.c 2007-05-30 11:37:01.0 -0700 @@ -384,7 +384,7 @@ void balance_dirty_pages_ratelimited_nr( } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(gfp_t gfp_mask) +void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask) { struct dirty_limits dl; @@ -399,7 +399,7 @@ void throttle_vm_writeout(gfp_t gfp_mask } for ( ; ; ) { - get_dirty_limits(&dl, NULL, &node_online_map); + get_dirty_limits(&dl, NULL, nodes); /* * Boost the allowable dirty threshold a bit for page diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c --- 4/mm/vmscan.c 2007-05-30 11:36:17.0 -0700 +++ 5/mm/vmscan.c 2007-05-30 11:37:01.0 -0700 @@ -1079,7 +1079,7 @@ static unsigned long shrink_zone(int pri } } - throttle_vm_writeout(sc->gfp_mask); +
throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask); atomic_dec(&zone->reclaim_in_progress); return nr_reclaimed; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 4/7] cpuset write vmscan
Direct reclaim: cpuset-aware writeout

During direct reclaim we traverse down a zonelist, carefully checking whether each zone is a member of the active cpuset. But then we call pdflush without enforcing the same restrictions. In a larger system this may have the effect of a massive number of pages being dirtied, and then either

A. no writeout occurs because global dirty limits have not been reached, or
B. writeout starts randomly for some dirty inode in the system. Pdflush may just write out data for nodes in another cpuset and miss doing proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur from direct reclaim the same way as without a cpuset.

Originally by Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]>
---
diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c	2007-05-30 11:34:21.0 -0700
+++ 4/mm/vmscan.c	2007-05-30 11:36:17.0 -0700
@@ -1198,7 +1198,8 @@ unsigned long try_to_free_pages(struct z
 	 */
 	if (total_scanned > sc.swap_cluster_max + sc.swap_cluster_max / 2) {
-		wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+		wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				&cpuset_current_mems_allowed);
 		sc.may_writepage = 1;
 	}
-
[RFC 3/7] cpuset write throttle
Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset. If, for example, a cpuset contains only 1/10th of available memory, then all of the memory of the cpuset can be dirtied without any writes being triggered. If all of the cpuset's memory is dirty then only 10% of total memory is dirty. The background writeback threshold is usually set at 10% and the synchronous threshold at 40%, so we are still below the global limits while the dirty ratio in the cpuset is 100%! Writeback throttling and background writeout do not work at all in such scenarios.

This patch makes dirty writeout cpuset-aware. When determining the dirty limits in get_dirty_limits() we calculate values based on the nodes that are reachable from the current process (which has been dirtying the page). Then we can trigger writeout based on the dirty ratio of the memory in the cpuset.

We trigger writeout in a cpuset-specific way: we go through the dirty inodes and search for inodes that have dirty pages on the nodes of the active cpuset. If an inode fulfills that requirement then we begin writeout of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite an expensive operation (in particular for large cpusets with hundreds of nodes) compared to just accessing the global counters if we do not have a cpuset. However, please remember that the global counters were only introduced recently. Before 2.6.18 we added up per-processor counters for each processor on each invocation of get_dirty_limits(). We now add up per-node information, which I think is equal or less effort since there are fewer nodes than processors.
Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-05-30 11:31:22.0 -0700 +++ 3/mm/page-writeback.c 2007-05-30 11:34:26.0 -0700 @@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); +struct dirty_limits { + long thresh_background; + long thresh_dirty; + unsigned long nr_dirty; + unsigned long nr_unstable; + unsigned long nr_writeback; +}; + /* * Work out the current dirty-memory clamping and background writeout * thresholds. @@ -121,13 +129,15 @@ static void background_writeout(unsigned * clamping level. */ -static unsigned long highmem_dirtyable_memory(unsigned long total) +static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total) { #ifdef CONFIG_HIGHMEM int node; unsigned long x = 0; - for_each_online_node(node) { + if (nodes == NULL) + nodes = &node_online_mask; + for_each_node_mask(node, *nodes) { struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM]; @@ -154,13 +164,13 @@ static unsigned long determine_dirtyable x = global_page_state(NR_FREE_PAGES) + global_page_state(NR_INACTIVE) + global_page_state(NR_ACTIVE); - x -= highmem_dirtyable_memory(x); + x -= highmem_dirtyable_memory(NULL, x); return x + 1; /* Ensure that we never return 0 */ } -static void -get_dirty_limits(long *pbackground, long *pdirty, - struct address_space *mapping) +static int +get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping, + nodemask_t *nodes) { int background_ratio; /* Percentages */ int dirty_ratio; @@ -168,12 +178,60 @@ get_dirty_limits(long *pbackground, long long background; long dirty; unsigned long available_memory = determine_dirtyable_memory(); + unsigned long dirtyable_memory; + unsigned long nr_mapped; struct task_struct *tsk; + int is_subset = 0; + +#ifdef CONFIG_CPUSETS + if 
(unlikely(nodes && + !nodes_subset(node_online_map, *nodes))) { + int node; + + /* +* Calculate the limits relative to the current cpuset. +* +* We do not disregard highmem because all nodes (except +* maybe node 0) have either all memory in HIGHMEM (32 bit) or +* all memory in non HIGHMEM (64 bit). If we would disregard +* highmem then cpuset throttling would not work on 32 bit. +*/ + is_subset = 1; + memset(dl, 0, sizeof(struct dirty_limits)); + dirtyable_memory = 0; + nr_mapped = 0; + for_each_node_mask(node, *nodes) { + if (!node_online(node)) + continue; + dl-&g
[RFC 2/7] cpuset write pdflush nodemask
pdflush: Allow the passing of a nodemask parameter If we want to support nodeset specific writeout then we need a way to communicate the set of nodes that an operation should affect. So add a nodemask_t parameter to the pdflush functions and also store the nodemask in the pdflush control structure. Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c --- 1/fs/buffer.c 2007-05-29 17:44:33.0 -0700 +++ 2/fs/buffer.c 2007-05-30 11:31:22.0 -0700 @@ -359,7 +359,7 @@ static void free_more_memory(void) struct zone **zones; pg_data_t *pgdat; - wakeup_pdflush(1024); + wakeup_pdflush(1024, NULL); yield(); for_each_online_pgdat(pgdat) { diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c --- 1/fs/super.c2007-05-29 17:43:00.0 -0700 +++ 2/fs/super.c2007-05-30 11:31:22.0 -0700 @@ -615,7 +615,7 @@ int do_remount_sb(struct super_block *sb return 0; } -static void do_emergency_remount(unsigned long foo) +static void do_emergency_remount(unsigned long foo, nodemask_t *bar) { struct super_block *sb; @@ -643,7 +643,7 @@ static void do_emergency_remount(unsigne void emergency_remount(void) { - pdflush_operation(do_emergency_remount, 0); + pdflush_operation(do_emergency_remount, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c --- 1/fs/sync.c 2007-05-29 17:43:00.0 -0700 +++ 2/fs/sync.c 2007-05-30 11:31:22.0 -0700 @@ -21,9 +21,9 @@ * sync everything. Start out by waking pdflush, because that writes back * all queues in parallel. 
*/ -static void do_sync(unsigned long wait) +static void do_sync(unsigned long wait, nodemask_t *unused) { - wakeup_pdflush(0); + wakeup_pdflush(0, NULL); sync_inodes(0); /* All mappings, inodes and their blockdevs */ DQUOT_SYNC(NULL); sync_supers(); /* Write the superblocks */ @@ -38,13 +38,13 @@ static void do_sync(unsigned long wait) asmlinkage long sys_sync(void) { - do_sync(1); + do_sync(1, NULL); return 0; } void emergency_sync(void) { - pdflush_operation(do_sync, 0); + pdflush_operation(do_sync, 0, NULL); } /* diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h --- 1/include/linux/writeback.h 2007-05-30 11:20:16.0 -0700 +++ 2/include/linux/writeback.h 2007-05-30 11:31:22.0 -0700 @@ -86,7 +86,7 @@ static inline void wait_on_inode(struct /* * mm/page-writeback.c */ -int wakeup_pdflush(long nr_pages); +int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); void throttle_vm_writeout(gfp_t gfp_mask); @@ -117,7 +117,8 @@ balance_dirty_pages_ratelimited(struct a typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, void *data); -int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); +int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes), + unsigned long arg0, nodemask_t *nodes); int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); int write_cache_pages(struct address_space *mapping, diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c --- 1/mm/page-writeback.c 2007-05-29 17:44:33.0 -0700 +++ 2/mm/page-writeback.c 2007-05-30 11:31:22.0 -0700 @@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -static void background_writeout(unsigned long _min_pages); +static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); /* * Work out the current dirty-memory clamping and background writeout @@ -272,7 
+272,7 @@ static void balance_dirty_pages(struct a */ if ((laptop_mode && pages_written) || (!laptop_mode && (nr_reclaimable > background_thresh))) - pdflush_operation(background_writeout, 0); + pdflush_operation(background_writeout, 0, NULL); } void set_page_dirty_balance(struct page *page) @@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. */ -static void background_writeout(unsigned long _min_pages) +static void background_writeout(unsigned long _min_pages, nodemask_t *unused) { long min_pages = _min_pages; struct writeback_control wbc = { @@ -402,12 +402,12 @@ static void background_writeout(unsigned * the whole world. Returns 0 if a pdflush thread was dispatched. Returns * -1 if all
[RFC 1/7] cpuset write dirty map
Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping are located. That way we will be able to implement writeout for applications that are constrained to a portion of the memory of the system, as required by cpusets. This patch implements the management of dirty node maps for an address space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes
cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes map
cpuset_init_dirty_nodes(mapping)	First-time init of the map

The dirty map may be stored either directly in the mapping (for NUMA systems with fewer than BITS_PER_LONG nodes) or separately allocated for systems with a large number of nodes (e.g. IA64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large configurations. Therefore we protect the allocation and the setting of a node in the map through the tree_lock. The tree_lock is already taken when a page is dirtied, so there is no additional locking overhead if we insert the updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared. At that point no pages are attached to the inode anymore and therefore it can be done without any locking. The dirty map therefore records all nodes that have been used for dirty pages by that inode until the inode is no longer used.
Originally by Christoph Lameter <[EMAIL PROTECTED]> Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]> --- diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c --- 0/fs/buffer.c 2007-05-29 17:42:07.0 -0700 +++ 1/fs/buffer.c 2007-05-29 17:44:33.0 -0700 @@ -41,6 +41,7 @@ #include #include #include +#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -710,6 +711,7 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + cpuset_update_dirty_nodes(mapping, page); write_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c --- 0/fs/fs-writeback.c 2007-05-29 17:42:07.0 -0700 +++ 1/fs/fs-writeback.c 2007-05-29 18:13:48.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include #include "internal.h" int sysctl_inode_debug __read_mostly; @@ -483,6 +484,12 @@ int generic_sync_sb_inodes(struct super_ continue; /* blockdev has wrong queue */ } + if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) { + /* No pages on the nodes under writeback */ + redirty_head(inode); + continue; + } + /* Was this inode dirtied after sync_sb_inodes was called? 
*/ if (time_after(inode->dirtied_when, start)) break; diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c --- 0/fs/inode.c2007-05-29 17:42:07.0 -0700 +++ 1/fs/inode.c2007-05-29 17:44:33.0 -0700 @@ -22,6 +22,7 @@ #include #include #include +#include /* * This is needed for the following functions: @@ -148,6 +149,7 @@ static struct inode *alloc_inode(struct mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; + cpuset_init_dirty_nodes(mapping); /* * If the block_device provides a backing_dev_info for client @@ -255,6 +257,7 @@ void clear_inode(struct inode *inode) bd_forget(inode); if (S_ISCHR(inode->i_mode) && inode->i_cdev) cd_forget(inode); + cpuset_clear_dirty_nodes(inode->i_mapping); inode->i_state = I_CLEAR; } diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h --- 0/include/linux/cpuset.h2007-05-29 17:40:07.0 -0700 +++ 1/include/linux/cpuset.h2007-05-29 17:44:33.0 -0700 @@ -75,6 +75,45 @@ static inline int cpuset_do_slab_mem_spr extern void cpuset_track_online_nodes(void); +/* + * We need macros since struct address_space is not defined yet + */ +#if MAX_NUMNODES <= BITS_PER_LONG +#define cpuset_update_dirty_nodes(__mapping, __page) \ + do {\ + int node = page_to_nid(__page); \ + if (!node_isset(node, (__mapping)->dirty_nodes))\ + node_set(node, (__mapping)->dir
Re: NR_UNSTABLE_FS vs. NR_FILE_DIRTY: double counting pages?
Ethan Solomita wrote:
Trond Myklebust wrote:
It should not happen. If the page is on the unstable list, then it will be committed before nfs_updatepage is allowed to redirty it. See the recent fixes in 2.6.21-rc7.

Above I present a codepath called straight from sys_write() which seems to do what I say. I could be wrong, but can you address the code paths I show above which seem to set both?

Sorry about my quick reply, I'd misunderstood what you were saying. I'll take a look at what you say. Thanks,
-- Ethan
Re: NR_UNSTABLE_FS vs. NR_FILE_DIRTY: double counting pages?
Trond Myklebust wrote:
On Fri, 2007-04-27 at 18:21 -0700, Ethan Solomita wrote:
There are several places where we add together NR_UNSTABLE_NFS and NR_FILE_DIRTY:
	sync_inodes_sb()
	balance_dirty_pages()
	wakeup_pdflush()
	wb_kupdate()
	prefetch_suitable()
I can trace a standard codepath where it seems both of these are set on the same page:
nfs_file_aops.commit_write -> nfs_commit_write
  nfs_updatepages
    nfs_writepage_setup
      nfs_wb_page
        nfs_wb_page_priority
          nfs_writepage_locked
            nfs_flush_mapping
              nfs_flush_list
                nfs_flush_multi
                  nfs_write_partial_ops.rpc_call_done
                    nfs_writeback_done_partial
                      nfs_writepage_release
                        nfs_reschedule_unstable_write
                          nfs_mark_request_commit
                            incr NR_UNSTABLE_NFS
nfs_file_aops.commit_write -> nfs_commit_write
  nfs_updatepage
    __set_page_dirty_nobuffers
      incr NR_FILE_DIRTY
This is the standard code path that derives from sys_write(). Can someone either show how this code sequence can't happen, or confirm for me that there's a bug?
-- Ethan

It should not happen. If the page is on the unstable list, then it will be committed before nfs_updatepage is allowed to redirty it. See the recent fixes in 2.6.21-rc7.

Above I present a codepath called straight from sys_write() which seems to do what I say. I could be wrong, but can you address the code paths I show above which seem to set both?
-- Ethan
NR_UNSTABLE_FS vs. NR_FILE_DIRTY: double counting pages?
There are several places where we add together NR_UNSTABLE_NFS and NR_FILE_DIRTY:
	sync_inodes_sb()
	balance_dirty_pages()
	wakeup_pdflush()
	wb_kupdate()
	prefetch_suitable()

I can trace a standard codepath where it seems both of these are set on the same page:

nfs_file_aops.commit_write -> nfs_commit_write
  nfs_updatepages
    nfs_writepage_setup
      nfs_wb_page
        nfs_wb_page_priority
          nfs_writepage_locked
            nfs_flush_mapping
              nfs_flush_list
                nfs_flush_multi
                  nfs_write_partial_ops.rpc_call_done
                    nfs_writeback_done_partial
                      nfs_writepage_release
                        nfs_reschedule_unstable_write
                          nfs_mark_request_commit
                            incr NR_UNSTABLE_NFS

nfs_file_aops.commit_write -> nfs_commit_write
  nfs_updatepage
    __set_page_dirty_nobuffers
      incr NR_FILE_DIRTY

This is the standard code path that derives from sys_write(). Can someone either show how this code sequence can't happen, or confirm for me that there's a bug?
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
On Fri, 20 Apr 2007, Ethan Solomita wrote:
cpuset_write_dirty_map.htm: In __set_page_dirty_nobuffers() you always call cpuset_update_dirty_nodes() but in __set_page_dirty_buffers() you call it only if page->mapping is still set after locking. Is there a reason for the difference? Also a question not about your patch: why do those functions call __mark_inode_dirty() even if the dirty page has been truncated and mapping == NULL?

If page->mapping has been cleared then the page was removed from the mapping. __mark_inode_dirty just dirties the inode. If a truncation occurs then the inode was modified.

You didn't address the first half. Why do the buffers() and nobuffers() variants act differently when calling cpuset_update_dirty_nodes()?

cpuset_write_throttle.htm: I noticed that several lines have leading spaces. I didn't check if other patches have the problem too.

Maybe download the patches? How did those strange .htm endings get appended to the patches?

Something weird with Firefox, but instead of jumping on me did you consider double-checking your patches? I just went back, found the text versions, and the spaces are still there, e.g.:
+ unsigned long dirtyable_memory;

In get_dirty_limits(), when cpusets are configured you don't subtract highmem the same way that is done without cpusets. Is this intentional?

That is something in flux upstream. Linus changed it recently. Do it one way or the other.

Exactly -- your patch should be consistent and do it the same way as whatever your patch is built against. Your patch is built against a kernel that subtracts off highmem. "Do it..." -- are you handing off the patch and are done with it?

It seems that dirty_exceeded is still a global punishment across cpusets. Should it be addressed?

Sure. It would be best if you could place that somehow in a cpuset.

Again it sounds like you're handing them off. I'm not objecting, I just hadn't understood that.
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
H Sorry. I got distracted and I have sent them to Kame-san who was interested in working on them. I have placed the most recent version at http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty

Hi Christoph -- a few comments on the patches:

cpuset_write_dirty_map.htm: In __set_page_dirty_nobuffers() you always call cpuset_update_dirty_nodes() but in __set_page_dirty_buffers() you call it only if page->mapping is still set after locking. Is there a reason for the difference? Also a question not about your patch: why do those functions call __mark_inode_dirty() even if the dirty page has been truncated and mapping == NULL?

cpuset_write_throttle.htm: I noticed that several lines have leading spaces. I didn't check if other patches have the problem too. In get_dirty_limits(), when cpusets are configured you don't subtract highmem the same way that is done without cpusets. Is this intentional? It seems that dirty_exceeded is still a global punishment across cpusets. Should it be addressed?
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
On Wed, 18 Apr 2007, Ethan Solomita wrote:
Any new ETA? I'm trying to decide whether to go back to your original patches or wait for the new set. Adding new knobs isn't as important to me as having something that fixes the core problem, so hopefully this isn't waiting on them. They could always be patches on top of your core patches.
-- Ethan

H Sorry. I got distracted and I have sent them to Kame-san who was interested in working on them. I have placed the most recent version at http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty

Do you expect any conflicts with the per-bdi dirty throttling patches?
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
On Wed, 21 Mar 2007, Ethan Solomita wrote:
Christoph Lameter wrote:
On Thu, 1 Feb 2007, Ethan Solomita wrote:
Hi Christoph -- has anything come of resolving the NFS / OOM concerns that Andrew Morton expressed concerning the patch? I'd be happy to see some progress on getting this patch (i.e. the one you posted on 1/23) through.

Peter Zijlstra addressed the NFS issue. I will submit the patch again as soon as the writeback code stabilizes a bit.

I'm pinging to see if this has gotten anywhere. Are you ready to resubmit? Do we have the evidence to convince Andrew that the NFS issues are resolved and so this patch won't obscure anything?

The NFS patch went into Linus' tree a couple of days ago and I have a new version ready with additional support to set per dirty ratios per cpu. There is some interest in adding more VM controls to this patch. I hope I can post the next rev by tomorrow.

Any new ETA? I'm trying to decide whether to go back to your original patches or wait for the new set. Adding new knobs isn't as important to me as having something that fixes the core problem, so hopefully this isn't waiting on them. They could always be patches on top of your core patches.
-- Ethan
Re: [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
Maneesh Soni wrote:
> I have modified the previous patch (which was dropped from -mm) and now keeping
> the statement making s_dentry as NULL in sysfs_d_iput(), so this should
> _safely_ fix sysfs_readdir() oops.

If you could find some additional places in sysfs code to add new BUG() checks I'd appreciate it. Especially if it turns out that you can't reproduce it, I'd like to have as many asserts as is reasonable.
-- Ethan
Re: [FIXED] Re: tty OOPS (Re: 2.6.21-rc5-mm2)
Apologies -- I didn't notice lkml on the cc list. I'll catch up from lkml directly.
-- Ethan
Re: [FIXED] Re: tty OOPS (Re: 2.6.21-rc5-mm2)
Andreas Mohr wrote:
Hi,
On Wed, Mar 28, 2007 at 10:56:32PM +0400, Alexey Dobriyan wrote:
The only suspicious new patch in -rc5-mm1 to me is fix-sysfs-reclaim-crash.patch which removes "sd->s_dentry = NULL;". Note that whole sysfs_drop_dentry() is a NOP if ->s_dentry is NULL. Could you try to revert it?
Alexey, who knows very little about sysfs internals

Apparently that's still too much knowledge ;) Or, in other words: 6 reboots already and not a single problem! So yes, the removal of the NULLing line in this patch most likely has caused this issue on my box. Now the question is whether something as simple as that is a fully correct fix or whether something should be done entirely differently. I'll let people more familiar with those parts decide about it...

Sorry -- I've only just been cc'd on this mail thread. Are we claiming that this patch/fix has caused a new problem, or successfully fixed an old problem? Thanks!
-- Ethan
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code
Nick Piggin wrote:
Eric W. Biederman wrote:
First touch page ownership does not guarantee give me anything useful for knowing if I can run my application or not. Because of page sharing my application might run inside the rss limit only because I got lucky and happened to share a lot of pages with another running application. If the next time I run it isn't running, my application will fail.

That is ridiculous. Let's be practical here, what you're asking is basically impossible. (Unless by deterministic you mean that it never enters a non-trivial syscall, in which case you just want to know about maximum RSS of the process, which we already account.)

If we used Beancounters as Pavel and Kirill mentioned, that would keep track of each container that has referenced a page, not just the first container. It sounds like beancounters can return a usage count where each page is divided by the number of referencing containers (e.g. 1/3rd if 3 containers share a page). Presumably it could also return a full count of 1 to each container.

If we look at data in the latter form, i.e. each container must pay fully for each page used, then Eric could use that to determine real usage needs of the container. However, we could also use the fractional count in order to do things such as charging the container for its actual usage: i.e. full count for setting guarantees, fractional for actual usage.
-- Ethan
Re: Linux-VServer example results for sharing vs. separate mappings ...
Herbert Poetzl wrote:
On Sat, Mar 24, 2007 at 12:19:06PM -0800, Andrew Morton wrote:
Or change the reclaim code so that a page which hasn't been referenced from a process within its hardware container is considered unreferenced (so it gets reclaimed).

that might easily lead to some ping-pong behaviour, when two similar guests are executing similar binaries but not at the same time ...

It might lead to that, but I don't think it would become pathological "easily". If a system has been up for a long time, it's easy to imagine pagecache pages lying everywhere just because someone somewhere is still using them.

I suggest a variant on what Andrew says: don't change reclaim. Instead, when referencing a page, don't mark the page as referenced if the current task is not permitted to allocate from the page's node. I'm thinking in terms of cpusets, with each task having a nodemask of mems_allowed. This may result in a page being thrown out unnecessarily and brought back in from disk, but when memory is tight that is what happens.

An optimization might be to keep track of who is referencing the page and migrate it to their memory instead of reclaiming it, but that would require reclaim to know the task/cpuset/container of the referencing task.
-- Ethan
Re: sysfs reclaim crash
Hi Maneesh -- I will start testing with the patch you provided. If you come up with any further issues please let me know. Also, if you could suggest some additional BUG() lines that I could insert I would appreciate it. Since the bug is hard to reproduce, it may be easier to catch a race condition in the making via BUG() than an actual failure due to a race condition. Thanks! -- Ethan
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote: On Thu, 1 Feb 2007, Ethan Solomita wrote: Hi Christoph -- has anything come of resolving the NFS / OOM concerns that Andrew Morton expressed concerning the patch? I'd be happy to see some progress on getting this patch (i.e. the one you posted on 1/23) through. Peter Zijlstra addressed the NFS issue. I will submit the patch again as soon as the writeback code stabilizes a bit. I'm pinging to see if this has gotten anywhere. Are you ready to resubmit? Do we have the evidence to convince Andrew that the NFS issues are resolved and so this patch won't obscure anything? Thanks, -- Ethan
Re: [PATCH 1/1] mm: Inconsistent use of node IDs
Ping! -- Ethan Ethan Solomita wrote: > Andi Kleen wrote: > >> On Monday 12 March 2007 23:51, Ethan Solomita wrote: > >>> This patch corrects inconsistent use of node numbers (variously "nid" or >>> "node") in the presence of fake NUMA. >>> >> I think it's very consistent -- your patch would make it inconsistent though. >> > > It's consistent to call node_online() with a physical node ID when the > online node mask is composed of fake nodes? > > >> Sorry, but when you ask for NUMA emulation you will get it. I don't see >> any point in a "half way only for some subsystems I like" NUMA emulation. >> It's unlikely that your ideas of where it is useful and where is not >> matches other NUMA emulation user's ideas too. >> > > I don't understand your comments. My code is intended to work for all > systems. If the system is non-NUMA by nature, then all CPUs map to fake > node 0. > > As an example, on a two chip dual-core AMD opteron system, there are 4 > "cpus" where CPUs 0 and 1 are close to the first half of memory, and > CPUs 2 and 3 are close to the second half. Without this change CPUs 2 > and 3 are mapped to fake node 1. This results in awful performance. With > this change, CPUs 2 and 3 are mapped to (roughly) 1/2 the fake node > count. Their zonelists[] are ordered to do allocations preferentially > from zones that are local to CPUs 2 and 3. > > Can you tell me the scenario where my code makes things worse? > > >> Besides adding such a secondary node space would likely be a huge long term >> maintenance issue. I just can see it breaking with every non trivial change. >> > > I'm adding no data structures to do this. The current code already has > get_phys_node. My changes use the existing information about node > layout, both the physical and fake, and define a mapping. The current > mapping just takes a physical node and says "it's the fake node too". > > >> NACK. >> > > I wish you would include some specifics as to why you think what you > do. 
You're suggesting we leave in place a system that destroys NUMA > locality when using fake numa, and passes around physical node ids as an > index into nodes[] which is indexed by fake nodes. My change has no > effect without fake numa, and harms no one with fake numa. > -- Ethan
Re: [PATCH 1/1] mm: Inconsistent use of node IDs
Andi Kleen wrote: On Monday 12 March 2007 23:51, Ethan Solomita wrote: This patch corrects inconsistent use of node numbers (variously "nid" or "node") in the presence of fake NUMA. I think it's very consistent -- your patch would make it inconsistent though. It's consistent to call node_online() with a physical node ID when the online node mask is composed of fake nodes? Sorry, but when you ask for NUMA emulation you will get it. I don't see any point in a "half way only for some subsystems I like" NUMA emulation. It's unlikely that your ideas of where it is useful and where is not matches other NUMA emulation user's ideas too. I don't understand your comments. My code is intended to work for all systems. If the system is non-NUMA by nature, then all CPUs map to fake node 0. As an example, on a two chip dual-core AMD opteron system, there are 4 "cpus" where CPUs 0 and 1 are close to the first half of memory, and CPUs 2 and 3 are close to the second half. Without this change CPUs 2 and 3 are mapped to fake node 1. This results in awful performance. With this change, CPUs 2 and 3 are mapped to (roughly) 1/2 the fake node count. Their zonelists[] are ordered to do allocations preferentially from zones that are local to CPUs 2 and 3. Can you tell me the scenario where my code makes things worse? Besides adding such a secondary node space would likely be a huge long term maintenance issue. I just can see it breaking with every non trivial change. I'm adding no data structures to do this. The current code already has get_phys_node. My changes use the existing information about node layout, both the physical and fake, and define a mapping. The current mapping just takes a physical node and says "it's the fake node too". NACK. I wish you would include some specifics as to why you think what you do. 
You're suggesting we leave in place a system that destroys NUMA locality when using fake numa, and passes around physical node ids as an index into nodes[] which is indexed by fake nodes. My change has no effect without fake numa, and harms no one with fake numa. -- Ethan
[PATCH 1/1] mm: Inconsistent use of node IDs
This patch corrects inconsistent use of node numbers (variously "nid" or "node") in the presence of fake NUMA. Both the AMD and Intel x86_64 discovery code will determine a CPU's physical node and use that node when calling numa_add_cpu() to associate the CPU with the node, but numa_add_cpu() treats the node argument as a fake node. This physical node may not exist within the fake nodespace, and even if it does, it will likely incorrectly associate the CPU with a fake memory node that may not share the same underlying physical NUMA node. Similarly, the PCI code which determines the node of the PCI bus saves it in the pci_sysdata structure. This node then propagates down to other buses and devices which hang off the PCI bus, and is used to specify a node when allocating memory. The purpose is to provide NUMA locality, but the node is a physical node, while the memory allocation code expects a fake node argument. Provide a routine (get_fake_node()) to map a physical node ID to a fake node ID, where the fake node contains memory on the specified physical node. This fake node's zonelist is tied to other close fake nodes, maintaining NUMA locality. Also provide node_online_phys(), which is the same as node_online() but takes a physical node ID. Change init_cpu_to_node(), the x86_64 discovery code and the PCI code to use get_fake_node() and node_online_phys() in order to convert to an appropriate fake node ID. 
Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]>
---
 arch/i386/pci/acpi.c          |  6 +++
 arch/x86_64/kernel/setup.c    | 14
 arch/x86_64/mm/numa.c         | 70 +-
 arch/x86_64/pci/k8-bus.c      |  3 +
 include/asm-x86_64/topology.h |  8
 5 files changed, 85 insertions(+), 16 deletions(-)

diff -uprN -x install -X linux-2.6.21-rc3-mm2/Documentation/dontdiff linux-2.6.21-rc3-mm2/arch/i386/pci/acpi.c linux-2.6.21-rc3-mm2-phystofake/arch/i386/pci/acpi.c
--- linux-2.6.21-rc3-mm2/arch/i386/pci/acpi.c	2007-03-09 16:42:42.0 -0800
+++ linux-2.6.21-rc3-mm2-phystofake/arch/i386/pci/acpi.c	2007-03-12 12:36:50.0 -0700
@@ -35,8 +35,13 @@ struct pci_bus * __devinit pci_acpi_scan
 	pxm = acpi_get_pxm(device->handle);
 #ifdef CONFIG_ACPI_NUMA
-	if (pxm >= 0)
+	if (pxm >= 0) {
 		sd->node = pxm_to_node(pxm);
+#ifdef CONFIG_NUMA_EMU
+		if (sd->node != -1)
+			sd->node = get_fake_node(sd->node);
+#endif
+	}
 #endif
 
 	bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
diff -uprN -x install -X linux-2.6.21-rc3-mm2/Documentation/dontdiff linux-2.6.21-rc3-mm2/arch/x86_64/kernel/setup.c linux-2.6.21-rc3-mm2-phystofake/arch/x86_64/kernel/setup.c
--- linux-2.6.21-rc3-mm2/arch/x86_64/kernel/setup.c	2007-03-09 16:42:42.0 -0800
+++ linux-2.6.21-rc3-mm2-phystofake/arch/x86_64/kernel/setup.c	2007-03-12 12:44:31.0 -0700
@@ -476,20 +476,20 @@ static void __cpuinit display_cacheinfo(
 }
 
 #ifdef CONFIG_NUMA
-static int nearby_node(int apicid)
+static int __init nearby_node(int apicid)
 {
 	int i;
 	for (i = apicid - 1; i >= 0; i--) {
 		int node = apicid_to_node[i];
-		if (node != NUMA_NO_NODE && node_online(node))
+		if (node != NUMA_NO_NODE && node_online_phys(node))
 			return node;
 	}
 	for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) {
 		int node = apicid_to_node[i];
-		if (node != NUMA_NO_NODE && node_online(node))
+		if (node != NUMA_NO_NODE && node_online_phys(node))
 			return node;
 	}
-	return first_node(node_online_map); /* Shouldn't happen */
+	return NUMA_NO_NODE; /* Shouldn't happen */
 }
 #endif
@@ -528,7 +528,7 @@ static void __init amd_detect_cmp(struct
 	node = c->phys_proc_id;
 	if (apicid_to_node[apicid] != NUMA_NO_NODE)
 		node = apicid_to_node[apicid];
-	if (!node_online(node)) {
+	if (!node_online_phys(node)) {
 		/* Two possibilities here:
 		   - The CPU is missing memory and no node was created.
 		     In that case try picking one from a nearby CPU
@@ -543,9 +543,10 @@ static void __init amd_detect_cmp(struct
 		    apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
 			node = apicid_to_node[ht_nodeid];
 		/* Pick a nearby node */
-		if (!node_online(node))
+		if (!node_online_phys(node))
 			node = nearby_node(apicid);
 	}
+	node = get_fake_node(node);
 	numa_set_node(cpu, node);
 	printk(KERN_INFO "CPU %d/%x -> Node %d\n", cpu, apicid, node);
@@ -679,7 +680,7 @@ static int __cpuinit intel_num_cpu_cores
 	return 1;
 }
 
-static void srat_detect_node(void)
+static void
Re: [RFC 0/8] Cpuset aware writeback
Hi Christoph -- has anything come of resolving the NFS / OOM concerns that Andrew Morton expressed concerning the patch? I'd be happy to see some progress on getting this patch (i.e. the one you posted on 1/23) through. Thanks, -- Ethan