Re: NFS livelock / starvation ?
On Mon, 2007-04-16 at 16:24 +0800, Zhou Yingchao wrote:

> When we run two NFS clients and an NFS server in the following way, we
> hit a livelock / starvation condition.
>
>	Machine A	Machine B
>	 Client1	 Client2
>	 Server
>
> As shown in the figure, we run a client and the server on one machine,
> and run another client on a second machine. When Client1 and Client2
> issue many writes at the same time, Client1's requests are blocked
> until Client2's writes have finished.
>
> Checking the code, Client1 is blocked in generic_file_write ->
> ... -> balance_dirty_pages; balance_dirty_pages calls writeback_inodes
> to flush data of (only) the related fs. With NFS, we found that the
> Server had exhausted the dirty_thresh. So in the writeback_inodes loop
> Client1 has no data of its own to write out, and the condition
>
>	nr_reclaimable + wbs.nr_writeback <= dirty_thresh
>
> will not become true until Client2 finishes its write requests to the
> Server. So the loop only ends after Client2 has finished its write job.
>
> The problem in this path is: why do we write out only pages of the
> related fs in writeback_inodes, but check the dirty threshold against
> the total page counts?

I am working on patches to fix this. Current version at (against -mm):

  http://programming.kicks-ass.net/kernel-patches/balance_dirty_pages/

However, after a rewrite of the BDI statistics work there are some
funnies, which I haven't had time to analyse yet :-/ I hope to post a
new version soonish...
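[For readers unfamiliar with the code path being discussed, here is a
simplified, paraphrased sketch of the 2.6.21-era balance_dirty_pages()
loop (not the verbatim kernel source; locking and details omitted). The
exit test is against global counters, while writeback_inodes() can only
push out the caller's own filesystem:

	static void balance_dirty_pages(struct address_space *mapping)
	{
		long background_thresh, dirty_thresh, nr_reclaimable;

		for (;;) {
			struct writeback_control wbc = {
				.bdi		= mapping->backing_dev_info,
				.sync_mode	= WB_SYNC_NONE,
				.nr_to_write	= write_chunk,
			};

			get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
					 global_page_state(NR_UNSTABLE_NFS);

			/* Exit test is global: the co-located NFS server keeps
			 * this sum above dirty_thresh on Client2's behalf... */
			if (nr_reclaimable +
			    global_page_state(NR_WRITEBACK) <= dirty_thresh)
				break;

			/* ...but writeback is per-fs: Client1's own fs has
			 * nothing to flush, so this cannot make progress. */
			writeback_inodes(&wbc);
			congestion_wait(WRITE, HZ/10);
		}
	}
]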
[PATCH 00/12] per device dirty throttling -v4
The latest version of the per device dirty throttling.

Dropped all the congestion_wait() churn, will contemplate a rename
patch. Reworked the BDI statistics to use percpu_counter.

Against 2.6.21-rc6-mm1; the first patch is for easy application,
Andrew can of course just drop the patch it reverts.
[PATCH 07/12] mm: count dirty pages per BDI
Count per BDI dirty pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/buffer.c                 |    1 +
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |    2 ++
 mm/truncate.c               |    1 +
 4 files changed, 5 insertions(+)

Index: linux-2.6-mm/fs/buffer.c
===================================================================
--- linux-2.6-mm.orig/fs/buffer.c
+++ linux-2.6-mm/fs/buffer.c
@@ -740,6 +740,7 @@ int __set_page_dirty_buffers(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6-mm/mm/page-writeback.c
===================================================================
--- linux-2.6-mm.orig/mm/page-writeback.c
+++ linux-2.6-mm/mm/page-writeback.c
@@ -828,6 +828,7 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
+				__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(mapping->page_tree,
@@ -961,6 +962,7 @@ int clear_page_dirty_for_io(struct page
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 			return 1;
 		}
 		return 0;
Index: linux-2.6-mm/mm/truncate.c
===================================================================
--- linux-2.6-mm.orig/mm/truncate.c
+++ linux-2.6-mm/mm/truncate.c
@@ -71,6 +71,7 @@ void cancel_dirty_page(struct page *page
 	struct address_space *mapping = page->mapping;
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		dec_zone_page_state(page, NR_FILE_DIRTY);
+		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 		if (account_size)
 			task_io_account_cancelled_write(account_size);
 	}
Index: linux-2.6-mm/include/linux/backing-dev.h
===================================================================
--- linux-2.6-mm.orig/include/linux/backing-dev.h
+++ linux-2.6-mm/include/linux/backing-dev.h
@@ -26,6 +26,7 @@ enum bdi_state {
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
+	BDI_DIRTY,
 	NR_BDI_STAT_ITEMS
 };
[PATCH 11/12] mm: per device dirty threshold
Scale writeback cache per backing device, proportional to its writeout
speed. By decoupling the BDI dirty thresholds a number of problems we
currently have will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs
are idling. By giving each BDI a 'fair' share of the dirty limit, each
one can have dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A
writes to B, and A generates enough dirty pages to get throttled, B will
never start writeback until the dirty pages go away. Again, by giving
each BDI its own 'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit
across the BDIs fairly and efficiently. A BDI that has a large dirty
limit but does not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the BDIs based on
writeback completions. This way faster/more active devices get a larger
share than slower/idle devices.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   48 +++++++
 mm/backing-dev.c            |    3 
 mm/page-writeback.c         |  185 ++++++++++++++++++++++++++++----
 3 files changed, 207 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-16 11:47:14.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-16 11:47:14.000000000 +0200
@@ -29,6 +29,7 @@ enum bdi_stat_item {
 	BDI_DIRTY,
 	BDI_WRITEBACK,
 	BDI_UNSTABLE,
+	BDI_WRITEOUT,
 	NR_BDI_STAT_ITEMS
 };
 
@@ -44,6 +45,13 @@ struct backing_dev_info {
 	void *unplug_io_data;
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+	/*
+	 * data used for scaling the writeback cache
+	 */
+	spinlock_t lock;	/* protect the cycle count */
+	unsigned long cycles;	/* writeout cycles */
+	int dirty_exceeded;
 };
 
 void bdi_init(struct backing_dev_info *bdi);
@@ -55,6 +63,12 @@ static inline void __mod_bdi_stat(struct
 	percpu_counter_mod(&bdi->bdi_stat[item], amount);
 }
 
+static inline void __mod_bdi_stat64(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	percpu_counter_mod64(&bdi->bdi_stat[item], amount);
+}
+
 static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
@@ -87,12 +101,46 @@ static inline void dec_bdi_stat(struct b
 	local_irq_restore(flags);
 }
 
+static inline s64 __bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_read(&bdi->bdi_stat[item]);
+}
+
 static inline s64 bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
 	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
 }
 
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_sum(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	s64 sum;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sum = __bdi_stat_sum(bdi, item);
+	local_irq_restore(flags);
+
+	return sum;
+}
+
+static inline unsigned long bdi_stat_delta(void)
+{
+#ifdef CONFIG_SMP
+	return NR_CPUS * FBC_BATCH;
+#else
+	return 1UL;
+#endif
+}
+
 /*
  * Flags in backing_dev_info::capability
  * - The first two flags control whether dirty pages will contribute to the
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-16 11:47:14.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-16 16:21:58.000000000 +0200
@@ -49,8 +49,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;	/* Dirty mem may be over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +101,88 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by tracking a floating average per BDI and a global floating
+ * average. We optimize away the '/= 2' for the global average by noting that:
+ *
+ *   if (++i > thresh) i /= 2:
+ *
+ * Can be approximated by:
+ *
+ *   thresh/2 + (++i % thresh/2)
+ *
+ * Furthermore, when we choose thresh to be 2^n it can be written in terms of
+ * binary
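[As a sanity check on the '/= 2' approximation quoted in the comment
above, a small standalone illustration (userspace C, hypothetical
values; not part of the patch):

	#include <stdio.h>

	int main(void)
	{
		/* thresh chosen as 2^n so '% (thresh/2)' is a cheap mask */
		unsigned long thresh = 1024, half = thresh / 2;
		unsigned long i = 0, n;

		for (n = 0; n < 3000; n++) {
			/* approximation of: if (++i > thresh) i /= 2; */
			i = half + ((i + 1) & (half - 1));
		}

		/* i always lands in [thresh/2, thresh), the same band the
		 * exact halving scheme confines it to */
		printf("%lu <= %lu < %lu\n", half, i, thresh);
		return 0;
	}
]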
[PATCH 06/12] mm: scalable bdi statistics counters.
Provide scalable per backing_dev_info statistics counters.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   50 ++++++++++++++++++++++++++++++++++++++--
 mm/backing-dev.c            |   26 ++++++++++++++++++++
 2 files changed, 74 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-12 13:27:13.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-12 13:28:40.000000000 +0200
@@ -8,6 +8,7 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
+#include <linux/percpu_counter.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +25,10 @@ enum bdi_state {
 
 typedef int (congested_fn)(void *, int);
 
+enum bdi_stat_item {
+	NR_BDI_STAT_ITEMS
+};
+
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long ra_pages0; /* min readahead on start of file */
@@ -34,14 +39,55 @@ struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+
+	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 };
 
-static inline void bdi_init(struct backing_dev_info *bdi)
+void bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s32 amount)
+{
+	percpu_counter_mod(&bdi->bdi_stat[item], amount);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	__mod_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
 {
+	__mod_bdi_stat(bdi, item, -1);
+}
+
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_bdi_stat(bdi, item);
+	local_irq_restore(flags);
 }
 
-static inline void bdi_destroy(struct backing_dev_info *bdi)
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
 {
+	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
 }
 
 /*
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c	2007-04-12 13:27:10.000000000 +0200
+++ linux-2.6/mm/backing-dev.c	2007-04-12 13:28:26.000000000 +0200
@@ -5,6 +5,30 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 
+void bdi_init(struct backing_dev_info *bdi)
+{
+	int i;
+
+	if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+		return;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_init(&bdi->bdi_stat[i], 0);
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+	int i;
+
+	if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+		return;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_destroy(&bdi->bdi_stat[i]);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
 static wait_queue_head_t congestion_wqh[2] = {
 	__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 	__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
@@ -70,3 +94,5 @@ long congestion_wait_interruptible(int r
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
+
+
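[For context, the trade-off these wrappers inherit from percpu_counter,
as a userspace model of the idea (illustrative, not the kernel source;
the real implementation lives in lib/percpu_counter.c): each CPU batches
updates in a local s32 and folds into the shared 64-bit count only when
the local value crosses +/- FBC_BATCH, so writes scale but a plain read
of the shared count can be stale by up to nr_cpus * FBC_BATCH:

	enum { FBC_BATCH = 32, NR_CPUS = 4 };

	struct model_counter {
		long long count;	/* shared; spinlocked in the kernel */
		int local[NR_CPUS];	/* per-CPU batches */
	};

	static void model_mod(struct model_counter *c, int cpu, int amount)
	{
		int v = c->local[cpu] + amount;

		if (v >= FBC_BATCH || v <= -FBC_BATCH) {
			c->count += v;	/* fold the whole batch at once */
			c->local[cpu] = 0;
		} else {
			c->local[cpu] = v;
		}
	}
]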
[PATCH 10/12] mm: expose BDI statistics in sysfs.
Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Index: linux-2.6-mm/block/ll_rw_blk.c
===================================================================
--- linux-2.6-mm.orig/block/ll_rw_blk.c
+++ linux-2.6-mm/block/ll_rw_blk.c
@@ -3976,6 +3976,20 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_DIRTY));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_WRITEBACK));
+}
+
+static ssize_t queue_nr_unstable_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_UNSTABLE));
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4006,6 +4020,21 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
};
 
+static struct queue_sysfs_entry queue_dirty_entry = {
+	.attr = {.name = "dirty_pages", .mode = S_IRUGO },
+	.show = queue_nr_dirty_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+	.attr = {.name = "writeback_pages", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
+static struct queue_sysfs_entry queue_unstable_entry = {
+	.attr = {.name = "unstable_pages", .mode = S_IRUGO },
+	.show = queue_nr_unstable_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4018,6 +4047,9 @@ static struct attribute *default_attrs[]
 	queue_initial_ra_entry.attr,
 	queue_max_hw_sectors_entry.attr,
 	queue_max_sectors_entry.attr,
+	queue_dirty_entry.attr,
+	queue_writeback_entry.attr,
+	queue_unstable_entry.attr,
 	queue_iosched_entry.attr,
 	NULL,
 };
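[With this applied, the counters read like any other queue attribute;
an illustrative session (device name and the values shown are
hypothetical):

	$ cat /sys/block/sda/queue/dirty_pages
	1283
	$ cat /sys/block/sda/queue/writeback_pages
	96
	$ cat /sys/block/sda/queue/unstable_pages
	0
]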
[PATCH 03/12] lib: dampen the percpu_counter FBC_BATCH
With the current logic the percpu_counter's accuracy delta is quadratic
wrt the number of cpus in the system; reduce this to O(n ln n).

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Index: linux-2.6-mm/include/linux/percpu_counter.h
===================================================================
--- linux-2.6-mm.orig/include/linux/percpu_counter.h
+++ linux-2.6-mm/include/linux/percpu_counter.h
@@ -11,6 +11,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/types.h>
+#include <linux/log2.h>
 
 #ifdef CONFIG_SMP
 
@@ -20,11 +21,7 @@ struct percpu_counter {
 	s32 *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH	(NR_CPUS*2)
-#else
-#define FBC_BATCH	(NR_CPUS*4)
-#endif
+#define FBC_BATCH	(8*ilog2(NR_CPUS))
 
 static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
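[To put rough numbers on the scaling claim (illustrative, taking the
worst-case drift of a counter to be about NR_CPUS * FBC_BATCH, cf. the
bdi_stat_delta() helper later in this series), for NR_CPUS = 64:

	before: FBC_BATCH = NR_CPUS*2   = 128  =>  drift ~ 64 * 128 = 8192 pages  (grows as 2n^2)
	after:  FBC_BATCH = 8*ilog2(64) =  48  =>  drift ~ 64 *  48 = 3072 pages  (grows as 8n log2 n)
]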
[PATCH 05/12] mm: bdi init hooks
Provide BDI constructor/destructor hooks.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c               |    2 ++
 drivers/block/rd.c              |    6 ++++++
 drivers/char/mem.c              |    2 ++
 drivers/mtd/mtdcore.c           |    5 +++++
 fs/char_dev.c                   |    1 +
 fs/configfs/configfs_internal.h |    2 ++
 fs/configfs/inode.c             |    8 ++++++++
 fs/configfs/mount.c             |    2 ++
 fs/fuse/inode.c                 |    2 ++
 fs/hugetlbfs/inode.c            |    3 +++
 fs/nfs/client.c                 |    3 +++
 fs/ocfs2/dlm/dlmfs.c            |    6 +++++-
 fs/ramfs/inode.c                |    7 ++++++-
 fs/sysfs/inode.c                |    5 +++++
 fs/sysfs/mount.c                |    2 ++
 fs/sysfs/sysfs.h                |    1 +
 include/linux/backing-dev.h     |    7 +++++++
 kernel/cpuset.c                 |    3 +++
 mm/shmem.c                      |    1 +
 mm/swap.c                       |    2 ++
 20 files changed, 68 insertions(+), 2 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c	2007-04-12 11:35:53.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c	2007-04-12 13:19:40.000000000 +0200
@@ -1771,6 +1771,7 @@ static void blk_release_queue(struct kob
 
 	blk_trace_shutdown(q);
 
+	bdi_destroy(&q->backing_dev_info);
 	kmem_cache_free(requestq_cachep, q);
 }
 
@@ -1836,6 +1837,7 @@ request_queue_t *blk_alloc_queue_node(gf
 	q->kobj.ktype = &queue_ktype;
 	kobject_init(&q->kobj);
 
 	q->backing_dev_info = default_backing_dev_info;
+	bdi_init(&q->backing_dev_info);
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
 
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c	2007-04-12 11:35:51.000000000 +0200
+++ linux-2.6/drivers/block/rd.c	2007-04-12 11:35:59.000000000 +0200
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
 		blk_cleanup_queue(rd_queue[i]);
 	}
 	unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+	bdi_destroy(&rd_file_backing_dev_info);
+	bdi_destroy(&rd_backing_dev_info);
 }
 
 /*
@@ -421,6 +424,9 @@ static int __init rd_init(void)
 	int i;
 	int err = -ENOMEM;
 
+	bdi_init(&rd_backing_dev_info);
+	bdi_init(&rd_file_backing_dev_info);
+
 	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
 			(rd_blocksize & (rd_blocksize-1))) {
 		printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c	2007-04-12 11:35:51.000000000 +0200
+++ linux-2.6/drivers/char/mem.c	2007-04-12 11:35:59.000000000 +0200
@@ -987,6 +987,8 @@ static int __init chr_dev_init(void)
 			      MKDEV(MEM_MAJOR, devlist[i].minor),
 			      devlist[i].name);
 
+	bdi_init(&zero_bdi);
+
 	return 0;
 }
 
Index: linux-2.6/fs/char_dev.c
===================================================================
--- linux-2.6.orig/fs/char_dev.c	2007-04-12 11:35:51.000000000 +0200
+++ linux-2.6/fs/char_dev.c	2007-04-12 11:35:59.000000000 +0200
@@ -546,6 +546,7 @@ static struct kobject *base_probe(dev_t
 void __init chrdev_init(void)
 {
 	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+	bdi_init(&directly_mappable_cdev_bdi);
 }
 
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c	2007-04-12 11:35:51.000000000 +0200
+++ linux-2.6/fs/fuse/inode.c	2007-04-12 11:35:59.000000000 +0200
@@ -415,6 +415,7 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->num_waiting, 0);
 		fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
+		bdi_init(&fc->bdi);
 		fc->reqctr = 0;
 		fc->blocked = 1;
 		get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
@@ -428,6 +429,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		mutex_destroy(&fc->inst_mutex);
+		bdi_destroy(&fc->bdi);
 		kfree(fc);
 	}
 }
Index: linux-2.6/fs/nfs/client.c
===================================================================
--- linux-2.6.orig/fs/nfs/client.c	2007-04-12 11:35:51.000000000 +0200
+++ linux-2.6/fs/nfs/client.c	2007-04-12 11:35:59.000000000 +0200
@@ -657,6 +657,8 @@ static void nfs_server_set_fsinfo(struct
 	if (server->rsize > NFS_MAX_FILE_IO_SIZE)
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
[PATCH 12/12] debug: expose BDI statistics in sysfs.
Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c   |   49 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c |    2 +-
 2 files changed, 50 insertions(+), 1 deletion(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c	2007-04-12 17:35:03.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c	2007-04-12 17:36:49.000000000 +0200
@@ -3991,6 +3991,37 @@ static ssize_t queue_nr_unstable_show(st
 	return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_UNSTABLE));
 }
 
+extern void get_writeout_scale(struct backing_dev_info *, int *, int *);
+
+static ssize_t queue_nr_cache_ratio_show(struct request_queue *q, char *page)
+{
+	int scale, div;
+
+	get_writeout_scale(&q->backing_dev_info, &scale, &div);
+	scale *= 1024;
+	scale /= div;
+
+	return sprintf(page, "%d\n", scale);
+}
+
+extern void
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+		 struct backing_dev_info *bdi);
+
+static ssize_t queue_nr_cache_size_show(struct request_queue *q, char *page)
+{
+	long background, dirty, bdi_dirty;
+	get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+	return sprintf(page, "%ld\n", bdi_dirty);
+}
+
+static ssize_t queue_nr_cache_total_show(struct request_queue *q, char *page)
+{
+	long background, dirty, bdi_dirty;
+	get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+	return sprintf(page, "%ld\n", dirty);
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -4035,6 +4066,21 @@ static struct queue_sysfs_entry queue_un
 	.show = queue_nr_unstable_show,
 };
 
+static struct queue_sysfs_entry queue_cache_ratio_entry = {
+	.attr = {.name = "cache_ratio", .mode = S_IRUGO },
+	.show = queue_nr_cache_ratio_show,
+};
+
+static struct queue_sysfs_entry queue_cache_size_entry = {
+	.attr = {.name = "cache_size", .mode = S_IRUGO },
+	.show = queue_nr_cache_size_show,
+};
+
+static struct queue_sysfs_entry queue_cache_total_entry = {
+	.attr = {.name = "cache_total", .mode = S_IRUGO },
+	.show = queue_nr_cache_total_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4050,6 +4096,9 @@ static struct attribute *default_attrs[]
 	queue_dirty_entry.attr,
 	queue_writeback_entry.attr,
 	queue_unstable_entry.attr,
+	queue_cache_ratio_entry.attr,
+	queue_cache_size_entry.attr,
+	queue_cache_total_entry.attr,
 	queue_iosched_entry.attr,
 	NULL,
 };
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-12 17:36:05.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-12 17:36:49.000000000 +0200
@@ -237,7 +237,7 @@ static unsigned long determine_dirtyable
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-static void
+void
 get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 		 struct backing_dev_info *bdi)
 {
[PATCH 01/12] revert per-backing_dev-dirty-and-writeback-page-accounting
For ease of application..

---
 block/ll_rw_blk.c           |   29 -----------------------------
 fs/buffer.c                 |    1 -
 include/linux/backing-dev.h |    2 --
 mm/page-writeback.c         |   13 ++-----------
 mm/truncate.c               |    1 -
 5 files changed, 2 insertions(+), 44 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c	2007-04-10 16:30:55.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c	2007-04-10 16:35:24.000000000 +0200
@@ -201,8 +201,6 @@ EXPORT_SYMBOL(blk_queue_softirq_done);
  **/
 void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
 {
-	struct backing_dev_info *bdi = &q->backing_dev_info;
-
 	/*
 	 * set defaults
 	 */
@@ -210,8 +208,6 @@ void blk_queue_make_request(request_queu
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	q->make_request_fn = mfn;
-	atomic_long_set(&bdi->nr_dirty, 0);
-	atomic_long_set(&bdi->nr_writeback, 0);
 	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
 	blk_queue_hardsect_size(q, 512);
 	blk_queue_dma_alignment(q, 511);
@@ -3978,19 +3974,6 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
-static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
-{
-	return sprintf(page, "%lu\n",
-			atomic_long_read(&q->backing_dev_info.nr_dirty));
-
-}
-
-static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
-{
-	return sprintf(page, "%lu\n",
-			atomic_long_read(&q->backing_dev_info.nr_writeback));
-
-}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4021,16 +4004,6 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
-static struct queue_sysfs_entry queue_nr_dirty_entry = {
-	.attr = {.name = "nr_dirty", .mode = S_IRUGO },
-	.show = queue_nr_dirty_show,
-};
-
-static struct queue_sysfs_entry queue_nr_writeback_entry = {
-	.attr = {.name = "nr_writeback", .mode = S_IRUGO },
-	.show = queue_nr_writeback_show,
-};
-
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4043,8 +4016,6 @@ static struct attribute *default_attrs[]
 	queue_initial_ra_entry.attr,
 	queue_max_hw_sectors_entry.attr,
 	queue_max_sectors_entry.attr,
-	queue_nr_dirty_entry.attr,
-	queue_nr_writeback_entry.attr,
 	queue_iosched_entry.attr,
 	NULL,
 };
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-04-10 16:30:15.000000000 +0200
+++ linux-2.6/fs/buffer.c	2007-04-10 16:35:03.000000000 +0200
@@ -740,7 +740,6 @@ int __set_page_dirty_buffers(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
-			atomic_long_inc(&mapping->backing_dev_info->nr_dirty);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-10 16:30:55.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-10 16:35:03.000000000 +0200
@@ -30,8 +30,6 @@ struct backing_dev_info {
 	unsigned long ra_thrash_bytes;	/* estimated thrashing threshold */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
-	atomic_long_t nr_dirty;	/* Pages dirty against this BDI */
-	atomic_long_t nr_writeback;/* Pages under writeback against this BDI */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-10 16:30:15.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-10 16:35:03.000000000 +0200
@@ -828,8 +828,6 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
-				atomic_long_inc(&mapping->backing_dev_info->
[PATCH 02/12] nfs: remove congestion_end()
It's redundant; clear_bdi_congested() already wakes the waiters.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/write.c              |    4 +---
 include/linux/backing-dev.h |    1 -
 mm/backing-dev.c            |   13 -------------
 3 files changed, 1 insertion(+), 17 deletions(-)

Index: linux-2.6-mm/fs/nfs/write.c
===================================================================
--- linux-2.6-mm.orig/fs/nfs/write.c	2007-04-05 16:24:50.000000000 +0200
+++ linux-2.6-mm/fs/nfs/write.c	2007-04-05 16:25:04.000000000 +0200
@@ -235,10 +235,8 @@ static void nfs_end_page_writeback(struc
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
 		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
-		congestion_end(WRITE);
-	}
 }
 
 /*
Index: linux-2.6-mm/include/linux/backing-dev.h
===================================================================
--- linux-2.6-mm.orig/include/linux/backing-dev.h	2007-04-05 16:24:50.000000000 +0200
+++ linux-2.6-mm/include/linux/backing-dev.h	2007-04-05 16:25:08.000000000 +0200
@@ -96,7 +96,6 @@ void clear_bdi_congested(struct backing_
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
 long congestion_wait_interruptible(int rw, long timeout);
-void congestion_end(int rw);
 
 #define bdi_cap_writeback_dirty(bdi) \
 	(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6-mm/mm/backing-dev.c
===================================================================
--- linux-2.6-mm.orig/mm/backing-dev.c	2007-04-05 16:24:50.000000000 +0200
+++ linux-2.6-mm/mm/backing-dev.c	2007-04-05 16:25:16.000000000 +0200
@@ -70,16 +70,3 @@ long congestion_wait_interruptible(int r
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
-	wait_queue_head_t *wqh = &congestion_wqh[rw];
-
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);
[PATCH 08/12] mm: count writeback pages per BDI
Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |   12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-12 17:32:17.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-12 17:34:24.000000000 +0200
@@ -977,14 +977,18 @@ int test_clear_page_writeback(struct pag
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -1000,14 +1004,18 @@ int test_set_page_writeback(struct page
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-12 17:32:17.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-12 17:32:40.000000000 +0200
@@ -27,6 +27,7 @@ typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
 	BDI_DIRTY,
+	BDI_WRITEBACK,
 	NR_BDI_STAT_ITEMS
 };
[PATCH 04/12] lib: percpu_counter_mod64
Add percpu_counter_mod64() to allow large modifications.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |    9 +++++++++
 lib/percpu_counter.c           |   28 ++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-04-12 13:54:55.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-04-12 14:00:21.000000000 +0200
@@ -36,6 +36,7 @@ static inline void percpu_counter_destro
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
@@ -81,6 +82,14 @@ percpu_counter_mod(struct percpu_counter
 	preempt_enable();
 }
 
+static inline void
+percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+	preempt_disable();
+	fbc->count += amount;
+	preempt_enable();
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2006-07-31 13:07:38.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-04-12 14:17:12.000000000 +0200
@@ -25,6 +25,34 @@ void percpu_counter_mod(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_mod);
 
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+	long count;
+	s32 *pcount;
+	int cpu;
+
+	if (amount >= FBC_BATCH || amount <= -FBC_BATCH) {
+		spin_lock(&fbc->lock);
+		fbc->count += amount;
+		spin_unlock(&fbc->lock);
+		return;
+	}
+
+	cpu = get_cpu();
+	pcount = per_cpu_ptr(fbc->counters, cpu);
+	count = *pcount + amount;
+	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+		spin_lock(&fbc->lock);
+		fbc->count += count;
+		*pcount = 0;
+		spin_unlock(&fbc->lock);
+	} else {
+		*pcount = count;
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(percpu_counter_mod64);
+
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
 * but much slower version of percpu_counter_read_positive()
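[A hypothetical usage sketch of the new helper, not taken from the
series: percpu_counter_mod() only accepts an s32, so deltas that may
exceed 32 bits need the 64-bit variant. Deltas at or beyond FBC_BATCH
are applied straight to the shared count under the lock, so the sum
below is exact:

	struct percpu_counter c;

	percpu_counter_init(&c, 0);
	percpu_counter_mod64(&c, 5LL << 32);	/* delta wider than 32 bits */
	percpu_counter_mod64(&c, -(1LL << 33));
	/* percpu_counter_sum(&c) now yields 3LL << 32 exactly;
	 * percpu_counter_read(&c) may lag by the batching slack */
]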
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Tue, 2007-04-17 at 21:19 -0400, Trond Myklebust wrote:

> I've split the issues introduced by the 2.6.21-rcX write code up into
> 4 subproblems.
>
> The first patch is just a cleanup in order to ease review.
>
> Patch number 2 ensures that we never release the PG_writeback flag
> until _after_ we've either discarded the unstable request altogether,
> or put it on the nfs_inode's commit or dirty lists.
>
> Patch number 3 fixes the 'desynchronized value of nfs_i.ncommit'
> error. It uses the PG_NEED_COMMIT flag as an indicator for whether or
> not the request may be redirtied.
>
> Patch number 4 protects the NFS '.set_page_dirty' address_space
> operation against races with nfs_inode_add_request.

Ok, stuck them in, and my debug patch from yesterday, just in case...

However, I can't seem to run long enough to establish whether the
problem is gone. It deadlocks between 10-30 minutes due to missing IO
completions, whereas yesterday it took between 45-60 minutes to trigger
the 'desynchronized value of nfs_i.ncommit' messages.

I will continue trying to get a good run; however, if you have some
(perhaps experimental .22) patches you want me to try..
Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues
On Wed, 2007-04-18 at 10:19 +0200, Peter Zijlstra wrote:

> On Tue, 2007-04-17 at 21:19 -0400, Trond Myklebust wrote:
> > I've split the issues introduced by the 2.6.21-rcX write code up
> > into 4 subproblems.
> >
> > The first patch is just a cleanup in order to ease review.
> >
> > Patch number 2 ensures that we never release the PG_writeback flag
> > until _after_ we've either discarded the unstable request
> > altogether, or put it on the nfs_inode's commit or dirty lists.
> >
> > Patch number 3 fixes the 'desynchronized value of nfs_i.ncommit'
> > error. It uses the PG_NEED_COMMIT flag as an indicator for whether
> > or not the request may be redirtied.
> >
> > Patch number 4 protects the NFS '.set_page_dirty' address_space
> > operation against races with nfs_inode_add_request.
>
> Ok, stuck them in, and my debug patch from yesterday, just in case...
>
> However, I can't seem to run long enough to establish whether the
> problem is gone. It deadlocks between 10-30 minutes due to missing IO
> completions, whereas yesterday it took between 45-60 minutes to
> trigger the 'desynchronized value of nfs_i.ncommit' messages.
>
> I will continue trying to get a good run,

Just got one around 80-90 minutes, no 'desynchronized value of
nfs_i.ncommit' errors.
Re: [PATCH 11/12] mm: per device dirty threshold
On Thu, 2007-04-19 at 19:49 +0200, Miklos Szeredi wrote:

> > +static inline unsigned long bdi_stat_delta(void)
> > +{
> > +#ifdef CONFIG_SMP
> > +	return NR_CPUS * FBC_BATCH;
>
> Shouldn't this be multiplied by the number of counters to sum?
> I.e. 3 if dirty and unstable are separate, and 2 if they are not.

Ah, yes, good catch. How about this:

---
Since we're adding 3 stat counters, triple the per counter delta as
well.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-19 19:59:26.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-19 20:00:09.000000000 +0200
@@ -321,7 +321,7 @@ static void balance_dirty_pages(struct a
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		if (bdi_thresh < bdi_stat_delta()) {
+		if (bdi_thresh < 3*bdi_stat_delta()) {
 			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY) +
 					bdi_stat_sum(bdi, BDI_UNSTABLE);
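[To make that concrete with illustrative numbers: with NR_CPUS = 64 and
FBC_BATCH = 48, a single counter read can be off by up to
64 * 48 = 3072 pages, and this test involves three counters (BDI_DIRTY,
BDI_UNSTABLE and BDI_WRITEBACK), so the combined slack is up to
3 * 3072 = 9216 pages. Below that, the cheap approximate reads cannot
be trusted against bdi_thresh, hence the fallback to the exact but slow
bdi_stat_sum().]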
Re: [PATCH 09/12] mm: count unstable pages per BDI
On Thu, 2007-04-19 at 19:44 +0200, Miklos Szeredi wrote:

> > Count per BDI unstable pages.
>
> I'm wondering, is it really worth having this category separate from
> per BDI dirty pages? With the exception of the export to sysfs,
> always the sum of unstable + dirty is used.

I guess you are right, but it offends my sense of aesthetics to break
symmetry with the zone statistics. However, it has the added advantage
of only needing 2 deltas as well.
Re: [PATCH 09/12] mm: count unstable pages per BDI
On Thu, 2007-04-19 at 20:12 +0200, Peter Zijlstra wrote:

> On Thu, 2007-04-19 at 19:44 +0200, Miklos Szeredi wrote:
> > > Count per BDI unstable pages.
> >
> > I'm wondering, is it really worth having this category separate
> > from per BDI dirty pages? With the exception of the export to
> > sysfs, always the sum of unstable + dirty is used.
>
> I guess you are right, but it offends my sense of aesthetics to break
> symmetry with the zone statistics. However, it has the added
> advantage of only needing 2 deltas as well.

I guess, this should do.

---
 fs/buffer.c                 |    2 +-
 fs/nfs/write.c              |   11 +++++++----
 include/linux/backing-dev.h |    3 +--
 mm/page-writeback.c         |   16 +++++++---------
 mm/truncate.c               |    2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-04-19 19:59:26.000000000 +0200
+++ linux-2.6/fs/buffer.c	2007-04-19 20:35:39.000000000 +0200
@@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
-			__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIM);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c	2007-04-19 19:59:26.000000000 +0200
+++ linux-2.6/fs/nfs/write.c	2007-04-19 20:39:03.000000000 +0200
@@ -456,7 +456,7 @@ nfs_mark_request_commit(struct nfs_page
 	nfsi->ncommit++;
 	spin_unlock(&nfsi->req_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
+	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIM);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 #endif
@@ -518,7 +518,8 @@ static void nfs_cancel_commit_list(struc
 	while(!list_empty(head)) {
 		req = nfs_list_entry(head->next);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIM);
 		nfs_list_remove_request(req);
 		nfs_inode_remove_request(req);
 		nfs_unlock_request(req);
@@ -1247,7 +1248,8 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIM);
 		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
@@ -1272,7 +1274,8 @@ static void nfs_commit_done(struct rpc_t
 		req = nfs_list_entry(data->pages.next);
 		nfs_list_remove_request(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIM);
 
 		dprintk("NFS: commit (%s/%Ld [EMAIL PROTECTED])",
 			req->wb_context->dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-19 19:59:26.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-19 20:39:24.000000000 +0200
@@ -26,9 +26,8 @@ enum bdi_state {
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
-	BDI_DIRTY,
+	BDI_RECLAIM,
 	BDI_WRITEBACK,
-	BDI_UNSTABLE,
 	BDI_WRITEOUT,
 	NR_BDI_STAT_ITEMS
 };
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-19 20:00:09.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-19 20:40:29.000000000 +0200
@@ -300,8 +300,7 @@ static void balance_dirty_pages(struct a
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
-					bdi_stat(bdi, BDI_UNSTABLE);
+		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIM
Re: [PATCH 09/12] mm: count unstable pages per BDI
On Thu, 2007-04-19 at 20:46 +0200, Peter Zijlstra wrote:

> On Thu, 2007-04-19 at 20:12 +0200, Peter Zijlstra wrote:
> > On Thu, 2007-04-19 at 19:44 +0200, Miklos Szeredi wrote:
> > > > Count per BDI unstable pages.
> > >
> > > I'm wondering, is it really worth having this category separate
> > > from per BDI dirty pages? With the exception of the export to
> > > sysfs, always the sum of unstable + dirty is used.
> >
> > I guess you are right, but it offends my sense of aesthetics to
> > break symmetry with the zone statistics. However, it has the added
> > advantage of only needing 2 deltas as well.
>
> I guess, this should do.

OK, the compiler told me I messed up :-/. I'll respin the whole series
and repost tomorrow or something...
Re: [PATCH 09/12] mm: count unstable pages per BDI
On Thu, 2007-04-19 at 21:20 +0200, Miklos Szeredi wrote:

> > Index: linux-2.6/fs/buffer.c
> > ===================================================================
> > --- linux-2.6.orig/fs/buffer.c	2007-04-19 19:59:26.000000000 +0200
> > +++ linux-2.6/fs/buffer.c	2007-04-19 20:35:39.000000000 +0200
> > @@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page
> >  	if (page->mapping) {	/* Race with truncate? */
> >  		if (mapping_cap_account_dirty(mapping)) {
> >  			__inc_zone_page_state(page, NR_FILE_DIRTY);
> > -			__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > +			__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIM);
>
> This name suggests it's _under_ reclaim, which is not true. You might
> rather want to call it BDI_RECLAIMABLE, or something similar in
> meaning.

Yeah, my fingers got lazy on me :-) will instruct them to type more.
[PATCH 09/10] mm: expose BDI statistics in sysfs.
Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Index: linux-2.6-mm/block/ll_rw_blk.c
===================================================================
--- linux-2.6-mm.orig/block/ll_rw_blk.c
+++ linux-2.6-mm/block/ll_rw_blk.c
@@ -3976,6 +3976,15 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_RECLAIMABLE));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_WRITEBACK));
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4006,6 +4020,16 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_reclaimable_entry = {
+	.attr = {.name = "reclaimable_pages", .mode = S_IRUGO },
+	.show = queue_nr_reclaimable_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+	.attr = {.name = "writeback_pages", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4018,6 +4047,8 @@ static struct attribute *default_attrs[]
 	queue_initial_ra_entry.attr,
 	queue_max_hw_sectors_entry.attr,
 	queue_max_sectors_entry.attr,
+	queue_reclaimable_entry.attr,
+	queue_writeback_entry.attr,
 	queue_iosched_entry.attr,
 	NULL,
 };
[PATCH 00/10] per device dirty throttling -v5
The latest version of the per device dirty throttling.

Against 2.6.21-rc6-mm1; the first patch is for easy application,
Andrew can of course just drop the patch it reverts.

Merged BDI_DIRTY and BDI_UNSTABLE into BDI_RECLAIMABLE, and multiplied
bdi_stat_delta() by the number of counters summed. Thanks to Miklos
for pointing these out.
[PATCH 01/10] revert per-backing_dev-dirty-and-writeback-page-accounting
For ease of application..

---
 block/ll_rw_blk.c           |   29 -----------------------------
 fs/buffer.c                 |    1 -
 include/linux/backing-dev.h |    2 --
 mm/page-writeback.c         |   13 ++-----------
 mm/truncate.c               |    1 -
 5 files changed, 2 insertions(+), 44 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c	2007-04-10 16:30:55.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c	2007-04-10 16:35:24.000000000 +0200
@@ -201,8 +201,6 @@ EXPORT_SYMBOL(blk_queue_softirq_done);
  **/
 void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
 {
-	struct backing_dev_info *bdi = &q->backing_dev_info;
-
 	/*
 	 * set defaults
 	 */
@@ -210,8 +208,6 @@ void blk_queue_make_request(request_queu
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	q->make_request_fn = mfn;
-	atomic_long_set(&bdi->nr_dirty, 0);
-	atomic_long_set(&bdi->nr_writeback, 0);
 	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
 	blk_queue_hardsect_size(q, 512);
 	blk_queue_dma_alignment(q, 511);
@@ -3978,19 +3974,6 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
-static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
-{
-	return sprintf(page, "%lu\n",
-			atomic_long_read(&q->backing_dev_info.nr_dirty));
-
-}
-
-static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
-{
-	return sprintf(page, "%lu\n",
-			atomic_long_read(&q->backing_dev_info.nr_writeback));
-
-}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4021,16 +4004,6 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
-static struct queue_sysfs_entry queue_nr_dirty_entry = {
-	.attr = {.name = "nr_dirty", .mode = S_IRUGO },
-	.show = queue_nr_dirty_show,
-};
-
-static struct queue_sysfs_entry queue_nr_writeback_entry = {
-	.attr = {.name = "nr_writeback", .mode = S_IRUGO },
-	.show = queue_nr_writeback_show,
-};
-
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4043,8 +4016,6 @@ static struct attribute *default_attrs[]
 	queue_initial_ra_entry.attr,
 	queue_max_hw_sectors_entry.attr,
 	queue_max_sectors_entry.attr,
-	queue_nr_dirty_entry.attr,
-	queue_nr_writeback_entry.attr,
 	queue_iosched_entry.attr,
 	NULL,
 };
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-04-10 16:30:15.000000000 +0200
+++ linux-2.6/fs/buffer.c	2007-04-10 16:35:03.000000000 +0200
@@ -740,7 +740,6 @@ int __set_page_dirty_buffers(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
-			atomic_long_inc(&mapping->backing_dev_info->nr_dirty);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-10 16:30:55.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-10 16:35:03.000000000 +0200
@@ -30,8 +30,6 @@ struct backing_dev_info {
 	unsigned long ra_thrash_bytes;	/* estimated thrashing threshold */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
-	atomic_long_t nr_dirty;	/* Pages dirty against this BDI */
-	atomic_long_t nr_writeback;/* Pages under writeback against this BDI */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-10 16:30:15.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-10 16:35:03.000000000 +0200
@@ -828,8 +828,6 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
-				atomic_long_inc(&mapping->backing_dev_info->
[PATCH 04/10] lib: percpu_counter_mod64
Add percpu_counter_mod64() to allow large modifications.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |    9 +++++++++
 lib/percpu_counter.c           |   28 ++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-04-12 13:54:55.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-04-12 14:00:21.000000000 +0200
@@ -36,6 +36,7 @@ static inline void percpu_counter_destro
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
@@ -81,6 +82,14 @@ percpu_counter_mod(struct percpu_counter
 	preempt_enable();
 }
 
+static inline void
+percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+	preempt_disable();
+	fbc->count += amount;
+	preempt_enable();
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2006-07-31 13:07:38.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-04-12 14:17:12.000000000 +0200
@@ -25,6 +25,34 @@ void percpu_counter_mod(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_mod);
 
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+	long count;
+	s32 *pcount;
+	int cpu;
+
+	if (amount >= FBC_BATCH || amount <= -FBC_BATCH) {
+		spin_lock(&fbc->lock);
+		fbc->count += amount;
+		spin_unlock(&fbc->lock);
+		return;
+	}
+
+	cpu = get_cpu();
+	pcount = per_cpu_ptr(fbc->counters, cpu);
+	count = *pcount + amount;
+	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+		spin_lock(&fbc->lock);
+		fbc->count += count;
+		*pcount = 0;
+		spin_unlock(&fbc->lock);
+	} else {
+		*pcount = count;
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(percpu_counter_mod64);
+
 /*
 * Add up all the per-cpu counts, return the result.  This is a more accurate
 * but much slower version of percpu_counter_read_positive()
[PATCH 07/10] mm: count reclaimable pages per BDI
Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/buffer.c                 |    2 ++
 fs/nfs/write.c              |    7 +++++++
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |    4 ++++
 mm/truncate.c               |    2 ++
 5 files changed, 16 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-04-20 15:20:48.000000000 +0200
+++ linux-2.6/fs/buffer.c	2007-04-20 15:26:31.000000000 +0200
@@ -740,6 +740,8 @@ int __set_page_dirty_buffers(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-20 15:20:48.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-20 15:27:28.000000000 +0200
@@ -828,6 +828,8 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
+				__inc_bdi_stat(mapping->backing_dev_info,
+						BDI_RECLAIMABLE);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(mapping->page_tree,
@@ -961,6 +963,8 @@ int clear_page_dirty_for_io(struct page
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			return 1;
 		}
 		return 0;
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-04-20 15:20:48.000000000 +0200
+++ linux-2.6/mm/truncate.c	2007-04-20 15:27:38.000000000 +0200
@@ -71,6 +71,8 @@ void cancel_dirty_page(struct page *page
 	struct address_space *mapping = page->mapping;
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		dec_zone_page_state(page, NR_FILE_DIRTY);
+		dec_bdi_stat(mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		if (account_size)
 			task_io_account_cancelled_write(account_size);
 	}
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c	2007-04-20 15:20:05.000000000 +0200
+++ linux-2.6/fs/nfs/write.c	2007-04-20 15:27:07.000000000 +0200
@@ -449,6 +449,7 @@ nfs_mark_request_commit(struct nfs_page
 	nfsi->ncommit++;
 	spin_unlock(&nfsi->req_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 #endif
@@ -509,6 +510,8 @@ static void nfs_cancel_commit_list(struc
 	while(!list_empty(head)) {
 		req = nfs_list_entry(head->next);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_list_remove_request(req);
 		nfs_inode_remove_request(req);
 		nfs_unlock_request(req);
@@ -1234,6 +1237,8 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
@@ -1258,6 +1263,8 @@ static void nfs_commit_done(struct rpc_t
 		req = nfs_list_entry(data->pages.next);
 		nfs_list_remove_request(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 
 		dprintk("NFS: commit (%s/%Ld [EMAIL PROTECTED])",
 			req->wb_context->dentry->d_inode->i_sb->s_id,
Index
[PATCH 10/10] mm: per device dirty threshold
Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds a number of problems we currently have will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are idling. By giving each BDI a 'fair' share of the dirty limit, each one can have dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to B, and A generates enough dirty pages to get throttled, B will never start writeback until the dirty pages go away. Again, by giving each BDI its own 'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across the BDIs fairly and efficiently. A BDI that has a large dirty limit but does not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the BDIs based on writeback completions. This way faster/more active devices get a larger share than slower/idle devices.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   51 ++++++
 mm/backing-dev.c            |    3 
 mm/page-writeback.c         |  181 ++++++++++++++-----
 3 files changed, 206 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-20 15:28:17.0 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-20 15:33:59.0 +0200
@@ -28,6 +28,7 @@
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITEOUT,
 	NR_BDI_STAT_ITEMS
 };
 
@@ -43,6 +44,13 @@ struct backing_dev_info {
 	void *unplug_io_data;
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+	/*
+	 * data used for scaling the writeback cache
+	 */
+	spinlock_t lock;	/* protect the cycle count */
+	unsigned long cycles;	/* writeout cycles */
+	int dirty_exceeded;
 };
 
 void bdi_init(struct backing_dev_info *bdi);
@@ -54,6 +62,12 @@ static inline void __mod_bdi_stat(struct
 	percpu_counter_mod(&bdi->bdi_stat[item], amount);
 }
 
+static inline void __mod_bdi_stat64(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	percpu_counter_mod64(&bdi->bdi_stat[item], amount);
+}
+
 static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
@@ -86,12 +100,49 @@ static inline void dec_bdi_stat(struct b
 	local_irq_restore(flags);
 }
 
+static inline s64 __bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_read(&bdi->bdi_stat[item]);
+}
+
 static inline s64 bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
 	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
 }
 
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_sum(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	s64 sum;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sum = __bdi_stat_sum(bdi, item);
+	local_irq_restore(flags);
+
+	return sum;
+}
+
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_delta(void)
+{
+#ifdef CONFIG_SMP
+	return NR_CPUS * FBC_BATCH;
+#else
+	return 1UL;
+#endif
+}
+
 /*
  * Flags in backing_dev_info::capability
  * - The first two flags control whether dirty pages will contribute to the
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c	2007-04-20 15:28:10.0 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-20 15:35:01.0 +0200
@@ -49,8 +49,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;	/* Dirty mem may be over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +101,88 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by tracking a floating average per BDI and a global floating
+ * average. We optimize away the '/= 2' for the global average by noting that:
+ *
+ *  if (++i > thresh) i /= 2:
+ *
+ * Can be approximated by:
+ *
+ *  thresh/2 + (++i % thresh/2)
+ *
+ * Furthermore, when
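To see the approximation from that comment in isolation, here is a minimal user-space sketch (plain C, not kernel code; THRESH and the tick count are arbitrary stand-in values). Both update rules keep the counter cycling within roughly [thresh/2, thresh], which is what lets the code skip an explicit '/= 2' pass over the global average:

	#include <stdio.h>

	#define THRESH 1024UL

	int main(void)
	{
		unsigned long exact = 0;

		for (unsigned long tick = 1; tick <= 5000; tick++) {
			/* the literal rule: halve on crossing the threshold */
			if (++exact > THRESH)
				exact /= 2;

			/* the approximation: half the threshold plus a residue */
			unsigned long approx = THRESH/2 + (tick % (THRESH/2));

			if (tick % 1000 == 0)
				printf("tick %lu: exact=%lu approx=%lu\n",
				       tick, exact, approx);
		}
		return 0;
	}

Both variants saw-tooth with the same period (thresh/2 events), so the proportion derived from them decays at the same rate.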
[PATCH 08/10] mm: count writeback pages per BDI
Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |   12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c	2007-04-20 15:27:28.0 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-20 15:28:10.0 +0200
@@ -979,14 +979,18 @@ int test_clear_page_writeback(struct pag
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -1002,14 +1006,18 @@ int test_set_page_writeback(struct page
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-20 15:25:47.0 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-20 15:28:17.0 +0200
@@ -27,6 +27,7 @@
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
+	BDI_WRITEBACK,
 	NR_BDI_STAT_ITEMS
 };
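For context, this counter and BDI_RECLAIMABLE together feed the per-device throttling test that patch 10/10 adds to balance_dirty_pages(); the shape of that test (as quoted in the review thread further down) is roughly:

	if (bdi_stat(bdi, BDI_RECLAIMABLE) +
	    bdi_stat(bdi, BDI_WRITEBACK) <= bdi_thresh)
		break;	/* this bdi is within its share of the dirty limit */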
[PATCH 02/10] nfs: remove congestion_end()
It's redundant, clear_bdi_congested() already wakes the waiters.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/write.c              |    4 +---
 include/linux/backing-dev.h |    1 -
 mm/backing-dev.c            |   13 -------------
 3 files changed, 1 insertion(+), 17 deletions(-)

Index: linux-2.6-mm/fs/nfs/write.c
===
--- linux-2.6-mm.orig/fs/nfs/write.c	2007-04-05 16:24:50.0 +0200
+++ linux-2.6-mm/fs/nfs/write.c	2007-04-05 16:25:04.0 +0200
@@ -235,10 +235,8 @@ static void nfs_end_page_writeback(struc
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
 		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
-		congestion_end(WRITE);
-	}
 }
 
 /*
Index: linux-2.6-mm/include/linux/backing-dev.h
===
--- linux-2.6-mm.orig/include/linux/backing-dev.h	2007-04-05 16:24:50.0 +0200
+++ linux-2.6-mm/include/linux/backing-dev.h	2007-04-05 16:25:08.0 +0200
@@ -96,7 +96,6 @@ void clear_bdi_congested(struct backing_
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
 long congestion_wait_interruptible(int rw, long timeout);
-void congestion_end(int rw);
 
 #define bdi_cap_writeback_dirty(bdi) \
 	(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6-mm/mm/backing-dev.c
===
--- linux-2.6-mm.orig/mm/backing-dev.c	2007-04-05 16:24:50.0 +0200
+++ linux-2.6-mm/mm/backing-dev.c	2007-04-05 16:25:16.0 +0200
@@ -70,16 +70,3 @@ long congestion_wait_interruptible(int r
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
-	wait_queue_head_t *wqh = &congestion_wqh[rw];
-
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);
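For reference, the wakeup that makes congestion_end() redundant sits at the tail of clear_bdi_congested(); the mm/backing-dev.c code of this era looks roughly like:

	void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
	{
		enum bdi_state bit;
		wait_queue_head_t *wqh = &congestion_wqh[rw];

		bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
		clear_bit(bit, &bdi->state);
		smp_mb__after_clear_bit();
		if (waitqueue_active(wqh))
			wake_up(wqh);
	}

so calling congestion_end(WRITE) right after it just wakes the same waitqueue a second time.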
[PATCH 03/10] lib: dampen the percpu_counter FBC_BATCH
With the current logic the percpu_counter's accuracy delta is quadratic wrt the number of cpus in the system; reduce this to O(n ln n).

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Index: linux-2.6-mm/include/linux/percpu_counter.h
===
--- linux-2.6-mm.orig/include/linux/percpu_counter.h
+++ linux-2.6-mm/include/linux/percpu_counter.h
@@ -11,6 +11,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/types.h>
+#include <linux/log2.h>
 
 #ifdef CONFIG_SMP
 
@@ -20,11 +21,7 @@ struct percpu_counter {
 	s32 *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH	(NR_CPUS*2)
-#else
-#define FBC_BATCH	(NR_CPUS*4)
-#endif
+#define FBC_BATCH	(8*ilog2(NR_CPUS))
 
 static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
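The worst-case error of the counter is roughly ncpus * FBC_BATCH, since every cpu can hold up to a batch in its local delta. A quick user-space calculation (plain C, using the batch formulas above) shows the old quadratic growth against the new n log n bound:

	#include <stdio.h>

	/* user-space stand-in for the kernel's ilog2() */
	static unsigned int ilog2_(unsigned long n)
	{
		unsigned int r = 0;
		while (n >>= 1)
			r++;
		return r;
	}

	int main(void)
	{
		for (unsigned long n = 4; n <= 1024; n <<= 2) {
			unsigned long old_batch = (n >= 16) ? n * 2 : n * 4;
			unsigned long new_batch = 8 * ilog2_(n);

			printf("%4lu cpus: error ~%8lu -> ~%6lu\n",
			       n, n * old_batch, n * new_batch);
		}
		return 0;
	}

For 1024 cpus this is ~2097152 versus ~81920 counts of maximal drift.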
[PATCH 06/10] mm: scalable bdi statistics counters.
Provide scalable per backing_dev_info statistics counters.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   50 ++++++++++++++++++++++++++++++++++++++++--
 mm/backing-dev.c            |   26 ++++++++++++++++++++++
 2 files changed, 74 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-12 13:27:13.0 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-12 13:28:40.0 +0200
@@ -8,6 +8,7 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
+#include <linux/percpu_counter.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +25,10 @@ enum bdi_state {
 
 typedef int (congested_fn)(void *, int);
 
+enum bdi_stat_item {
+	NR_BDI_STAT_ITEMS
+};
+
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long ra_pages0; /* min readahead on start of file */
@@ -34,14 +39,55 @@ struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+
+	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 };
 
-static inline void bdi_init(struct backing_dev_info *bdi)
+void bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s32 amount)
+{
+	percpu_counter_mod(&bdi->bdi_stat[item], amount);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	__mod_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
 {
+	__mod_bdi_stat(bdi, item, -1);
+}
+
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_bdi_stat(bdi, item);
+	local_irq_restore(flags);
 }
 
-static inline void bdi_destroy(struct backing_dev_info *bdi)
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
 {
+	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
 }
 
 /*
Index: linux-2.6/mm/backing-dev.c
===
--- linux-2.6.orig/mm/backing-dev.c	2007-04-12 13:27:10.0 +0200
+++ linux-2.6/mm/backing-dev.c	2007-04-12 13:28:26.0 +0200
@@ -5,6 +5,30 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 
+void bdi_init(struct backing_dev_info *bdi)
+{
+	int i;
+
+	if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+		return;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_init(&bdi->bdi_stat[i], 0);
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+	int i;
+
+	if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+		return;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_destroy(&bdi->bdi_stat[i]);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
 static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
@@ -70,3 +94,5 @@ long congestion_wait_interruptible(int r
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
+
+
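Usage follows the zone page-state convention: __inc_bdi_stat()/__dec_bdi_stat() for callers that already run with interrupts disabled, the plain inc/dec variants otherwise, and bdi_stat() for a cheap approximate read. A hypothetical caller, just for illustration (the function and threshold are made up here, and it uses the BDI_RECLAIMABLE item that patch 07 adds to the enum):

	static int example_over_limit(struct address_space *mapping,
				      unsigned long bdi_thresh)
	{
		struct backing_dev_info *bdi = mapping->backing_dev_info;

		inc_bdi_stat(bdi, BDI_RECLAIMABLE);	/* irq-safe variant */

		/* approximate per-cpu read; may lag by the batched deltas */
		return bdi_stat(bdi, BDI_RECLAIMABLE) > bdi_thresh;
	}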
[PATCH 05/10] mm: bdi init hooks
provide BDI constructor/destructor hooks

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c               |    2 ++
 drivers/block/rd.c              |    6 ++++++
 drivers/char/mem.c              |    2 ++
 drivers/mtd/mtdcore.c           |    5 +++++
 fs/char_dev.c                   |    1 +
 fs/configfs/configfs_internal.h |    2 ++
 fs/configfs/inode.c             |    8 ++++++++
 fs/configfs/mount.c             |    2 ++
 fs/fuse/inode.c                 |    2 ++
 fs/hugetlbfs/inode.c            |    3 +++
 fs/nfs/client.c                 |    3 +++
 fs/ocfs2/dlm/dlmfs.c            |    6 +++++-
 fs/ramfs/inode.c                |    7 ++++++-
 fs/sysfs/inode.c                |    5 +++++
 fs/sysfs/mount.c                |    2 ++
 fs/sysfs/sysfs.h                |    1 +
 include/linux/backing-dev.h     |    7 +++++++
 kernel/cpuset.c                 |    3 +++
 mm/shmem.c                      |    1 +
 mm/swap.c                       |    2 ++
 20 files changed, 68 insertions(+), 2 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c	2007-04-12 11:35:53.0 +0200
+++ linux-2.6/block/ll_rw_blk.c	2007-04-12 13:19:40.0 +0200
@@ -1771,6 +1771,7 @@ static void blk_release_queue(struct kob
 
 	blk_trace_shutdown(q);
 
+	bdi_destroy(&q->backing_dev_info);
 	kmem_cache_free(requestq_cachep, q);
 }
 
@@ -1836,6 +1837,7 @@ request_queue_t *blk_alloc_queue_node(gf
 	q->kobj.ktype = &queue_ktype;
 	kobject_init(&q->kobj);
 
 	q->backing_dev_info = default_backing_dev_info;
+	bdi_init(&q->backing_dev_info);
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
 
Index: linux-2.6/drivers/block/rd.c
===
--- linux-2.6.orig/drivers/block/rd.c	2007-04-12 11:35:51.0 +0200
+++ linux-2.6/drivers/block/rd.c	2007-04-12 11:35:59.0 +0200
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
 		blk_cleanup_queue(rd_queue[i]);
 	}
 	unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+	bdi_destroy(&rd_file_backing_dev_info);
+	bdi_destroy(&rd_backing_dev_info);
 }
 
 /*
@@ -421,6 +424,9 @@ static int __init rd_init(void)
 	int i;
 	int err = -ENOMEM;
 
+	bdi_init(&rd_backing_dev_info);
+	bdi_init(&rd_file_backing_dev_info);
+
 	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
 			(rd_blocksize & (rd_blocksize-1))) {
 		printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
Index: linux-2.6/drivers/char/mem.c
===
--- linux-2.6.orig/drivers/char/mem.c	2007-04-12 11:35:51.0 +0200
+++ linux-2.6/drivers/char/mem.c	2007-04-12 11:35:59.0 +0200
@@ -987,6 +987,8 @@ static int __init chr_dev_init(void)
 			      MKDEV(MEM_MAJOR, devlist[i].minor),
 			      devlist[i].name);
 
+	bdi_init(&zero_bdi);
+
 	return 0;
 }
 
Index: linux-2.6/fs/char_dev.c
===
--- linux-2.6.orig/fs/char_dev.c	2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/char_dev.c	2007-04-12 11:35:59.0 +0200
@@ -546,6 +546,7 @@ static struct kobject *base_probe(dev_t
 void __init chrdev_init(void)
 {
 	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+	bdi_init(&directly_mappable_cdev_bdi);
 }
 
Index: linux-2.6/fs/fuse/inode.c
===
--- linux-2.6.orig/fs/fuse/inode.c	2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/fuse/inode.c	2007-04-12 11:35:59.0 +0200
@@ -415,6 +415,7 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->num_waiting, 0);
 		fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
+		bdi_init(&fc->bdi);
 		fc->reqctr = 0;
 		fc->blocked = 1;
 		get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
@@ -428,6 +429,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		mutex_destroy(&fc->inst_mutex);
+		bdi_destroy(&fc->bdi);
 		kfree(fc);
 	}
 }
Index: linux-2.6/fs/nfs/client.c
===
--- linux-2.6.orig/fs/nfs/client.c	2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/nfs/client.c	2007-04-12 11:35:59.0 +0200
@@ -657,6 +657,8 @@ static void nfs_server_set_fsinfo(struct
 	if (server->rsize > NFS_MAX_FILE_IO_SIZE)
 		server->rsize = NFS_MAX_FILE_IO_SIZE
Re: [PATCH 03/10] lib: dampen the percpu_counter FBC_BATCH
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote: On Fri, 20 Apr 2007 17:51:57 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: With the current logic the percpu_counter's accuracy delta is quadratic wrt the number of cpus in the system; reduce this to O(n ln n).

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Index: linux-2.6-mm/include/linux/percpu_counter.h
===
--- linux-2.6-mm.orig/include/linux/percpu_counter.h
+++ linux-2.6-mm/include/linux/percpu_counter.h
@@ -11,6 +11,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/types.h>
+#include <linux/log2.h>
 
 #ifdef CONFIG_SMP
 
@@ -20,11 +21,7 @@ struct percpu_counter {
 	s32 *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH	(NR_CPUS*2)
-#else
-#define FBC_BATCH	(NR_CPUS*4)
-#endif
+#define FBC_BATCH	(8*ilog2(NR_CPUS))
 
 static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {

I worry that this might be too small when there are hundreds of CPUs online. With 1024 CPUs we go for the lock once per 80 counts. That's not much. If we have 1024 CPUs, each one of which is incrementing this counter at N Hz, we have 1024/80 = 12.8 CPUs all going for the same lock at N Hz. It could get bad. But I don't know what the gain is for this loss. Your changelog should have told us. What problem is this patch solving?

In 10/10 I introduce bdi_stat_delta() which gives the maximum error of a single counter. That is used to switch between precise (percpu_counter_sum) and imprecise (percpu_counter_read) accesses of the stats. I worried that the current quadratic error would be too large; and as the ZVC counters also use a logarithmic error bound, I thought it would be good to have here as well.
Re: [PATCH 04/10] lib: percpu_counter_mod64
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote: On Fri, 20 Apr 2007 17:51:58 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: Add percpu_counter_mod64() to allow large modifications.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |    9 +++++++++
 lib/percpu_counter.c           |   28 ++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-04-12 13:54:55.0 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-04-12 14:00:21.0 +0200
@@ -36,6 +36,7 @@ static inline void percpu_counter_destro
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
@@ -81,6 +82,14 @@ percpu_counter_mod(struct percpu_counter
 	preempt_enable();
 }
 
+static inline void
+percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+	preempt_disable();
+	fbc->count += amount;
+	preempt_enable();
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
Index: linux-2.6/lib/percpu_counter.c
===
--- linux-2.6.orig/lib/percpu_counter.c	2006-07-31 13:07:38.0 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-04-12 14:17:12.0 +0200
@@ -25,6 +25,34 @@ void percpu_counter_mod(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_mod);
 
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+	long count;
+	s32 *pcount;
+	int cpu;
+
+	if (amount >= FBC_BATCH || amount <= -FBC_BATCH) {
+		spin_lock(&fbc->lock);
+		fbc->count += amount;
+		spin_unlock(&fbc->lock);
+		return;
+	}

This is wrong, a little. If the counter was at -FBC_BATCH/2 and the caller passed in FBC_BATCH, we could just set the cpu-local counter to FBC_BATCH/2 instead of going for the lock. Probably doesn't matter though.

Right, I could have taken along the current percpu offset.

+	cpu = get_cpu();
+	pcount = per_cpu_ptr(fbc->counters, cpu);
+	count = *pcount + amount;
+	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+		spin_lock(&fbc->lock);
+		fbc->count += count;
+		*pcount = 0;
+		spin_unlock(&fbc->lock);
+	} else {
+		*pcount = count;
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(percpu_counter_mod64);

Bloaty. Surely we won't be needing this on 32-bit kernels? Even monster PAE has only 64,000,000 pages and won't be using deltas of more than 4 gigapages? Does even 64-bit need to handle 4 gigapages in a single hit? /me suspects another changelog bug

Yeah, /me chastises himself for that... This is because percpu_counter is s64 instead of the native long; I need to halve the counter at some point (bdi_writeout_norm) and do that by subtracting half the current value. If percpu_counter_mod is limited to s32 this might not always work (although in practice it might just fit).
Re: [PATCH 07/10] mm: count reclaimable pages per BDI
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote: On Fri, 20 Apr 2007 17:52:01 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

hm. Aggregating dirty and unstable at inc/dec time is a bit kludgy. If later on we want to know just dirty then we're in trouble. I can see the logic behind it though. Perhaps one could have separate BDI_DIRTY and BDI_UNSTABLE and treat them separately at inc/dec time, but give them the same numerical value, so they in fact refer to the same counter. That's kludgy too. :-(

I struggled with it too; don't have a ready solution either. I'll do whatever the consensus agrees upon.
Re: [PATCH 08/10] mm: count writeback pages per BDI
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote: On Fri, 20 Apr 2007 17:52:02 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |   12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c	2007-04-20 15:27:28.0 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-20 15:28:10.0 +0200
@@ -979,14 +979,18 @@ int test_clear_page_writeback(struct pag
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__dec_bdi_stat(bdi, BDI_WRITEBACK);

Why do we test bdi_cap_writeback_dirty() here? If we remove that test, we end up accumulating statistics for non-writebackable backing devs, but does that matter?

It would not, had I not cheated:

+void bdi_init(struct backing_dev_info *bdi)
+{
+	int i;
+
+	if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+		return;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_init(&bdi->bdi_stat[i], 0);
+}
+EXPORT_SYMBOL(bdi_init);

Probably the common case is writebackable backing-devs, so eliminating the test-n-branch might be a net microgain.

Time vs space. Now we don't even have storage for those BDIs.. Don't particularly care on this point though, I just thought it might be worthwhile to save on the percpu data.
Re: [PATCH 09/10] mm: expose BDI statistics in sysfs.
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote: On Fri, 20 Apr 2007 17:52:03 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Index: linux-2.6-mm/block/ll_rw_blk.c
===
--- linux-2.6-mm.orig/block/ll_rw_blk.c
+++ linux-2.6-mm/block/ll_rw_blk.c
@@ -3976,6 +3976,15 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_RECLAIMABLE));
+}

We try to present memory statistics to userspace in bytes or kbytes rather than number-of-pages, because page-size varies between architectures and between .configs. Displaying number-of-pages is just inviting people to write it-broke-when-i-moved-it-to-ia64 applications. Plus kbytes is a bit more user-friendly, particularly when the user will want to compare these numbers to /proc/meminfo, for example. Using %llu might be more appropriate than %lld.

Right. The biggest problem I actually have with this piece of code is that it does not represent all BDIs in the system. For example, the BDI of NFS mounts is not accessible. Will fix the other points.
Re: [PATCH 10/10] mm: per device dirty threshold
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote: On Fri, 20 Apr 2007 17:52:04 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: Scale writeback cache per backing device, proportional to its writeout speed. By decoupling the BDI dirty thresholds a number of problems we currently have will go away, namely: - mutual interference starvation (for any number of BDIs); - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts). It might be that all dirty pages are for a single BDI while other BDIs are idling. By giving each BDI a 'fair' share of the dirty limit, each one can have dirty pages outstanding and make progress. A global threshold also creates a deadlock for stacked BDIs; when A writes to B, and A generates enough dirty pages to get throttled, B will never start writeback until the dirty pages go away. Again, by giving each BDI its own 'independent' dirty limit, this problem is avoided. So the problem is to determine how to distribute the total dirty limit across the BDIs fairly and efficiently. A BDI that has a large dirty limit but does not have any dirty pages outstanding is a waste. What is done is to keep a floating proportion between the BDIs based on writeback completions. This way faster/more active devices get a larger share than slower/idle devices.

This is a pretty major improvement to various nasty corner-cases, if it works. Does it work? Please describe the testing you did, and the results.

The testing I did was several dd instances racing each other to various devices, usually one in a loop and the other a single, timed, instance. I tested disk vs disk, disk vs usbstick, and disk vs nfs-mount. Using the debug patch from the last series, the one which exposed the actual ratio assigned and the total, I monitored (where possible) that the ratio was around the relative writeout speeds. The main indicator was that the writes should complete in roughly the same time as if they were done on an idle system.

The disk vs usbstick case gave the most dramatic improvement; on mainline the usbstick is totally starved by a heavy disk writer, with these patches it takes about the same time as it would on an idle system. Along with the first series was a number of results; those still stand.

Has this been confirmed to fix Miklos's FUSE and loopback problems?

I must defer to Miklos for that.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   51 ++++++
 mm/backing-dev.c            |    3 
 mm/page-writeback.c         |  181 ++++++++++++++-----
 3 files changed, 206 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-20 15:28:17.0 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-20 15:33:59.0 +0200
@@ -28,6 +28,7 @@
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITEOUT,
 	NR_BDI_STAT_ITEMS
 };

Whoa, head is now swimming. What's the difference between writeback and writeout?

writeback is the number of pages in the writeback state. writeout is a relative proportion (against all other BDIs) of completed writeouts.
@@ -43,6 +44,13 @@ struct backing_dev_info {
 	void *unplug_io_data;
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+	/*
+	 * data used for scaling the writeback cache
+	 */
+	spinlock_t lock;	/* protect the cycle count */
+	unsigned long cycles;	/* writeout cycles */
+	int dirty_exceeded;
 };
 
 void bdi_init(struct backing_dev_info *bdi);
@@ -54,6 +62,12 @@ static inline void __mod_bdi_stat(struct
 	percpu_counter_mod(&bdi->bdi_stat[item], amount);
 }
 
+static inline void __mod_bdi_stat64(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	percpu_counter_mod64(&bdi->bdi_stat[item], amount);
+}
+
 static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
@@ -86,12 +100,49 @@ static inline void dec_bdi_stat(struct b
 	local_irq_restore(flags);
 }
 
+static inline s64 __bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_read(&bdi->bdi_stat[item]);
+}
+
 static inline s64 bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
 	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
 }

So here, the __ means it doesn't do the force-it-positive treatment.

+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_sum(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct
Re: [PATCH 10/10] mm: per device dirty threshold
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_delta(void)
+{
+#ifdef CONFIG_SMP
+	return NR_CPUS * FBC_BATCH;

This is enormously wrong for CONFIG_NR_CPUS=1024 on a 2-way.

Right, I knew about that but, uhm. I wanted to make that num_online_cpus(), and install a hotplug notifier to fold the percpu delta back into the total on cpu offline. But I have to look into doing that hotplug notifier stuff.
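To put numbers on "enormously wrong" (using the FBC_BATCH = 8*ilog2(NR_CPUS) from patch 03/10, so FBC_BATCH = 80 for NR_CPUS = 1024), the assumed slack versus the real worst case on a 2-way box is:

	NR_CPUS           * FBC_BATCH = 1024 * 80 = 81920 pages
	num_online_cpus() * FBC_BATCH =    2 * 80 =   160 pages

i.e. the compile-time bound overestimates the possible per-cpu error by a factor of 512 here.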
[RFC][PATCH] reiserfs vs BKL
Hi all,

Obviously nobody still uses reiserfs; or at least not with PREEMPT_BKL=n. We seem to have grown all kinds of scheduling assumptions all over that code. An excerpt from my bootlog when I tried:

| preempt count: 0001 ]
| 1-level deep critical section nesting:
..
[_spin_lock+16/112] _spin_lock+0x10/0x70
.[__rcu_process_callbacks+397/432] .. ( = __rcu_process_callbacks+0x18d/0x1b0)
[show_trace_log_lvl+33/80] show_trace_log_lvl+0x21/0x50
[show_trace+18/32] show_trace+0x12/0x20
[dump_stack+22/32] dump_stack+0x16/0x20
[schedule+1903/2096] schedule+0x76f/0x830
[io_schedule+34/48] io_schedule+0x22/0x30
[sync_buffer+53/64] sync_buffer+0x35/0x40
[__wait_on_bit+69/112] __wait_on_bit+0x45/0x70
[out_of_line_wait_on_bit+109/128] out_of_line_wait_on_bit+0x6d/0x80
[__wait_on_buffer+39/48] __wait_on_buffer+0x27/0x30
[pg0+944846854/1069073408] search_by_key+0x156/0x11d0 [reiserfs]
[pg0+944764484/1069073408] reiserfs_read_locked_inode+0x64/0x550 [reiserfs]
[pg0+944765867/1069073408] reiserfs_iget+0x7b/0x90 [reiserfs]
[pg0+944752186/1069073408] reiserfs_lookup+0xca/0x130 [reiserfs]
[do_lookup+305/368] do_lookup+0x131/0x170
[__link_path_walk+2077/3600] __link_path_walk+0x81d/0xe10
[link_path_walk+70/208] link_path_walk+0x46/0xd0
[do_path_lookup+139/464] do_path_lookup+0x8b/0x1d0
[__path_lookup_intent_open+74/144] __path_lookup_intent_open+0x4a/0x90
[path_lookup_open+33/48] path_lookup_open+0x21/0x30
[open_namei+98/1568] open_namei+0x62/0x620
[do_filp_open+44/80] do_filp_open+0x2c/0x50
[do_sys_open+71/224] do_sys_open+0x47/0xe0
[sys_open+28/32] sys_open+0x1c/0x20
[sysenter_past_esp+106/149] sysenter_past_esp+0x6a/0x95
===
---
| preempt count: 0001 ]
| 1-level deep critical section nesting:
..
[lock_kernel+28/176] lock_kernel+0x1c/0xb0
.[pg0+944752069/1069073408] .. ( = reiserfs_lookup+0x55/0x130 [reiserfs])
[show_trace_log_lvl+33/80] show_trace_log_lvl+0x21/0x50
[show_trace+18/32] show_trace+0x12/0x20
[dump_stack+22/32] dump_stack+0x16/0x20
[schedule+1903/2096] schedule+0x76f/0x830
[__cond_resched+18/48] __cond_resched+0x12/0x30
[cond_resched+42/64] cond_resched+0x2a/0x40
[pg0+944846892/1069073408] search_by_key+0x17c/0x11d0 [reiserfs]
[pg0+944766152/1069073408] reiserfs_update_sd_size+0x108/0x320 [reiserfs]
[pg0+944811796/1069073408] reiserfs_dirty_inode+0x74/0xa0 [reiserfs]
[__mark_inode_dirty+48/400] __mark_inode_dirty+0x30/0x190
[touch_atime+142/240] touch_atime+0x8e/0xf0
[__link_path_walk+2980/3600] __link_path_walk+0xba4/0xe10
[link_path_walk+70/208] link_path_walk+0x46/0xd0
[do_path_lookup+139/464] do_path_lookup+0x8b/0x1d0
[__path_lookup_intent_open+74/144] __path_lookup_intent_open+0x4a/0x90
[path_lookup_open+33/48] path_lookup_open+0x21/0x30
[open_exec+39/176] open_exec+0x27/0xb0
[do_execve+56/544] do_execve+0x38/0x220
[sys_execve+46/144] sys_execve+0x2e/0x90
[sysenter_past_esp+106/149] sysenter_past_esp+0x6a/0x95
===

And the sole reason I tried that was because I managed to get 1 second (yes _second_) latencies due to BKL priority inversion. Since this is not really a workable situation I propose we do something about it; below is my attempt. However, since I know absolutely nothing about reiserfs, it might well be very wrong and will eat all your data (babies and favourite pets too while we're at it).

---

Replace all the lock_kernel() instances with reiserfs_write_lock(sb), and make that use an actual per super-block mutex instead of lock_kernel(). This should make reiserfs safe from PREEMPT_BKL=n, since it seems to rely on being able to schedule.
Also, it removes the dependency on the BKL, and thereby is not prone to cause prio inversion with remaining BKL users (notably tty).

Compile tested only, since I didn't dare boot it.

NOT-Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/reiserfs/inode.c            |    4 ++--
 fs/reiserfs/journal.c          |   12 ++++++------
 fs/reiserfs/super.c            |    1 +
 fs/reiserfs/xattr.c            |   16 ++++++++--------
 include/linux/reiserfs_fs.h    |    4 ++--
 include/linux/reiserfs_fs_sb.h |    2 ++
 6 files changed, 21 insertions(+), 18 deletions(-)

Index: linux-2.6-twins/fs/reiserfs/inode.c
===
--- linux-2.6-twins.orig/fs/reiserfs/inode.c
+++ linux-2.6-twins/fs/reiserfs/inode.c
@@ -485,10 +485,10 @@ static int reiserfs_get_blocks_direct_io
 	   disappeared */
 	if (REISERFS_I(inode)->i_flags & i_pack_on_close_mask) {
 		int err;
-		lock_kernel();
+		reiserfs_write_lock(inode->i_sb);
 		err = reiserfs_commit_for_inode(inode);
 		REISERFS_I(inode)->i_flags &= ~i_pack_on_close_mask;
-		unlock_kernel
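The per super-block lock the description refers to would look roughly like this (a sketch under the patch's stated design; the actual reiserfs_fs_sb.h and reiserfs_fs.h hunks are not shown in this excerpt). Note that, unlike the BKL, a plain mutex does not allow recursive acquisition by the same task - which is exactly the nesting property Jeff's NACK below points at:

	/* in struct reiserfs_sb_info (include/linux/reiserfs_fs_sb.h): */
	struct mutex lock;	/* replaces lock_kernel() for this sb */

	/* in include/linux/reiserfs_fs.h: */
	static inline void reiserfs_write_lock(struct super_block *sb)
	{
		mutex_lock(&REISERFS_SB(sb)->lock);
	}

	static inline void reiserfs_write_unlock(struct super_block *sb)
	{
		mutex_unlock(&REISERFS_SB(sb)->lock);
	}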
Re: [RFC][PATCH] reiserfs vs BKL
On Sat, 2007-04-21 at 12:14 -0400, Jeff Mahoney wrote: Peter Zijlstra wrote: Replace all the lock_kernel() instances with reiserfs_write_lock(sb), and make that use an actual per super-block mutex instead of lock_kernel(). This should make reiserfs safe from PREEMPT_BKL=n, since it seems to rely on being able to schedule. Also, it removes the dependency on the BKL, and thereby is not prone to cause prio inversion with remaining BKL users (notably tty). Compile tested only, since I didn't dare boot it.

NACK.

Darn, I suspected it would be wrong :-/ ; it was too easy to be right.

Believe me, I would *love* to nuke the BKL from reiserfs, but a search and replace of this nature is just wrong. reiserfs_write_lock() using the BKL isn't an accident - it depends on its nesting properties. If you did try to boot this kernel, you'd deadlock pretty quickly.

Right, I see. This one has been on my TODO list for a long time.

Interestingly, I've been doing reiserfs xattr development recently using 2.6.21-rc7-git2, and I'm not seeing any of these messages.

Yeah, that would be pretty close to what I ran.

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

I have CONFIG_PREEMPT=y, but seeing how one of those points came from a cond_resched() within a lock_kernel() section, I'm not seeing how you don't get these.

Ah well, I really do hope you come up with something some day; for me it's time to go change filesystems - bah, the pain...
Re: [PATCH 04/10] lib: percpu_counter_mod64
On Sat, 2007-04-21 at 12:21 -0700, Andrew Morton wrote: On Sat, 21 Apr 2007 13:02:26 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

+	cpu = get_cpu();
+	pcount = per_cpu_ptr(fbc->counters, cpu);
+	count = *pcount + amount;
+	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+		spin_lock(&fbc->lock);
+		fbc->count += count;
+		*pcount = 0;
+		spin_unlock(&fbc->lock);
+	} else {
+		*pcount = count;
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(percpu_counter_mod64);

Bloaty. Surely we won't be needing this on 32-bit kernels? Even monster PAE has only 64,000,000 pages and won't be using deltas of more than 4 gigapages? Does even 64-bit need to handle 4 gigapages in a single hit? /me suspects another changelog bug

Yeah, /me chastises himself for that... This is because percpu_counter is s64 instead of the native long; I need to halve the counter at some point (bdi_writeout_norm) and do that by subtracting half the current value. If percpu_counter_mod is limited to s32 this might not always work (although in practice it might just fit).

ah, the mysterious bdi_writeout_norm(). I don't think it's possible to precisely halve a percpu_counter - there has to be some error involved. I guess that's acceptable within the inscrutable bdi_writeout_norm(). otoh, there's a chance that the attempt to halve the counter will take the counter negative, due to races. Does the elusive bdi_writeout_norm() handle that? If not, it should. If it does, then there should be comments around the places where this is being handled, because it is subtle, and unobvious, and others might break it by accident.

The counter it is halving is only ever incremented, so we might be off a little, but only to the safe side. I shall do the comment thing along with all the other missing comments :-)
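In other words, the halving reduces to a plain subtraction of half the approximate value; a sketch (bdi_writeout_norm() itself is not shown anywhere in this series excerpt, so this is an illustration of the argument, not the actual function):

	static void writeout_norm(struct percpu_counter *fbc)
	{
		/*
		 * Racing increments only make the real count larger than
		 * the value read here, so subtracting half of a possibly
		 * stale read can never drive the counter negative.
		 */
		percpu_counter_mod64(fbc, -(percpu_counter_read(fbc) / 2));
	}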
Re: [PATCH 10/10] mm: per device dirty threshold
On Sat, 2007-04-21 at 14:15 +0200, Peter Zijlstra wrote:

+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_delta(void)
+{
+#ifdef CONFIG_SMP
+	return NR_CPUS * FBC_BATCH;

This is enormously wrong for CONFIG_NR_CPUS=1024 on a 2-way.

Right, I knew about that but, uhm. I wanted to make that num_online_cpus(), and install a hotplug notifier to fold the percpu delta back into the total on cpu offline. But I have to look into doing that hotplug notifier stuff.

Something like this should do I think; I just looked at other hotplug code and imitated the pattern. I assumed CONFIG_HOTPLUG_CPU requires CONFIG_SMP, I didn't actually try that one :-)

---

In order to estimate the per stat counter error more accurately, use num_online_cpus() instead of NR_CPUS, and install a cpu hotplug notifier (when cpu hotplug is enabled) that flushes whatever percpu delta was present into the total on cpu unplug.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h    |    6 +-
 include/linux/percpu_counter.h |    1 +
 lib/percpu_counter.c           |   11 ++++++
 mm/backing-dev.c               |   47 +++++++++++++++++++++++++++
 4 files changed, 64 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h	2007-04-21 21:32:49.0 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-04-21 21:33:28.0 +0200
@@ -51,6 +51,10 @@ struct backing_dev_info {
 	spinlock_t lock;	/* protect the cycle count */
 	unsigned long cycles;	/* writeout cycles */
 	int dirty_exceeded;
+
+#ifdef CONFIG_HOTPLUG_CPU
+	struct notifier_block hotplug_nb;
+#endif
 };
 
 void bdi_init(struct backing_dev_info *bdi);
@@ -137,7 +141,7 @@ static inline s64 bdi_stat_sum(struct ba
 static inline unsigned long bdi_stat_delta(void)
 {
 #ifdef CONFIG_SMP
-	return NR_CPUS * FBC_BATCH;
+	return num_online_cpus() * FBC_BATCH;
 #else
 	return 1UL;
 #endif
Index: linux-2.6/include/linux/percpu_counter.h
===
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-04-21 21:32:49.0 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-04-21 21:33:17.0 +0200
@@ -38,6 +38,7 @@ static inline void percpu_counter_destro
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
 void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
+void percpu_counter_fold(struct percpu_counter *fbc, int cpu);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
Index: linux-2.6/lib/percpu_counter.c
===
--- linux-2.6.orig/lib/percpu_counter.c	2007-04-21 21:32:49.0 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-04-21 21:33:17.0 +0200
@@ -72,3 +72,14 @@ s64 percpu_counter_sum(struct percpu_cou
 	return ret < 0 ? 0 : ret;
 }
 EXPORT_SYMBOL(percpu_counter_sum);
+
+void percpu_counter_fold(struct percpu_counter *fbc, int cpu)
+{
+	s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+	if (*pcount) {
+		spin_lock(&fbc->lock);
+		fbc->count += *pcount;
+		*pcount = 0;
+		spin_unlock(&fbc->lock);
+	}
+}
Index: linux-2.6/mm/backing-dev.c
===
--- linux-2.6.orig/mm/backing-dev.c	2007-04-21 21:32:49.0 +0200
+++ linux-2.6/mm/backing-dev.c	2007-04-21 21:34:47.0 +0200
@@ -4,6 +4,49 @@
 #include <linux/fs.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/cpu.h>
+
+#ifdef CONFIG_HOTPLUG_CPU
+static int bdi_stat_fold(struct notifier_block *nb,
+		unsigned long action, void *hcpu)
+{
+	struct backing_dev_info *bdi =
+		container_of(nb, struct backing_dev_info, hotplug_nb);
+	unsigned long flags;
+	int cpu = (unsigned long)hcpu;
+	int i;
+
+	if (action == CPU_DEAD) {
+		local_irq_save(flags);
+		for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+			percpu_counter_fold(&bdi->bdi_stat[i], cpu);
+		local_irq_restore(flags);
+	}
+	return NOTIFY_OK;
+}
+
+static void bdi_init_hotplug(struct backing_dev_info *bdi)
+{
+	bdi->hotplug_nb = (struct notifier_block){
+		.notifier_call = bdi_stat_fold,
+		.priority = 0,
+	};
+	register_hotcpu_notifier(&bdi->hotplug_nb);
+}
+
+static void bdi_destroy_hotplug(struct backing_dev_info *bdi)
+{
+	unregister_hotcpu_notifier(&bdi->hotplug_nb);
+}
+#else
+static void bdi_init_hotplug(struct backing_dev_info *bdi)
+{
+}
+
+static void bdi_destroy_hotplug(struct backing_dev_info
Re: Loud pop coming from hard drive on reboot
On Wed, 2007-04-18 at 17:27 -0400, Chuck Ebbert wrote: Bartlomiej Zolnierkiewicz wrote: On Wednesday 18 April 2007, Chuck Ebbert wrote: Mark Lord wrote: Mark Lord wrote: With the patch applied, I don't see *any* new activity in those S.M.A.R.T. attributes over multiple hibernates (Linux suspend-to-disk). Scratch that -- operator failure. ;) The patch makes no difference over hibernates in the SMART logs. It's still logging extra Power-Off_Retract_Count pegs, which it DID NOT USED TO DO not so long ago.

Just to add to the fun, my problems are happening with the old IDE drivers...

The issue you are experiencing results in the same problem (disk doing power off retract) but it has a totally different root cause - your notebook loses power on reboot. It is actually a hardware problem and, as you have reported, the same problem is present when using the other OS.

My power off retract count increases whether I do a halt/poweroff or a reboot. The only difference is the volume of the noise. And I just noticed my seek error rate is increasing. /me plans purchase of another drive, definitely not Seagate...

I think that the issue needs to be fixed (by detecting affected notebook(s) using DMI?) in Linux PM handling and not in the IDE subsystem because:

 * there may be some other hardware devices affected by the power loss
   (== they require a shutdown sequence)
 * the same problem will bite if somebody decides to use libata (FC7?)

Yeah, this needs fixing too. I've been playing with another notebook and the power does stay on during reboot, so I wonder how widespread the problem is?

/me too

Thinkpad T23, with a ST980815A. Ticks every few seconds, but seems to mostly go away with hdparm -B255 /dev/sda1, though I have an increasing seek error rate as well. I got the ST disk because thinkwiki suggested it. I suspect this problem killed the previous disk in this laptop.
Re: [PATCH 08/10] mm: count writeback pages per BDI
On Sun, 2007-04-22 at 00:19 -0700, Andrew Morton wrote: It could be that we never call test_clear_page_writeback() against !bdi_cap_writeback_dirty() pages anyway. I can't think why we would, but the relationships there aren't very clear. Does "don't account for dirty memory" imply "doesn't ever do writeback"? One would need to check, and it's perhaps a bit fragile.

I did, that's how that test ended up there; I guess a comment would have been a good thing, no? :-) end_swap_bio_write() calls end_page_writeback(), and swap_backing_dev_info has neither cap_writeback nor cap_account_dirty.

It's worth checking though. Boy we're doing a lot of stuff in there nowadays. OT: it might be worth looking into batching this work up - the predominant caller should be mpage_end_io_write(), and he has a whole bunch of pages which are usually all from the same file, all contiguous. It's pretty inefficient to be handling that data one-page-at-a-time, and some significant speedups may be available.

Right, that might be a good spot to hook into, I'll have a look.

Instead, everyone seems to think that variable pagecache page size is the only way of improving things. Shudder.

hehe, I guess you haven't looked at my concurrent pagecache patches yet either :-)
Re: [PATCH 10/10] mm: per device dirty threshold
On Sat, 2007-04-21 at 22:25 +0200, Miklos Szeredi wrote: The other deadlock, in throttle_vm_writeout(), is still to be solved.

Let's go back to the original changelog:

Author: marcelo.tosatti marcelo.tosatti
Date: Tue Mar 8 17:25:19 2005 +

[PATCH] vm: pageout throttling

With silly pageout testcases it is possible to place huge amounts of memory under I/O. With a large request queue (CFQ uses 8192 requests) it is possible to place _all_ memory under I/O at the same time. This means that all memory is pinned and unreclaimable and the VM gets upset and goes oom.

The patch limits the amount of memory which is under pageout writeout to be a little more than the amount of memory at which balance_dirty_pages() callers will synchronously throttle.

This means that heavy pageout activity can starve heavy writeback activity completely, but heavy writeback activity will not cause starvation of pageout. Because we don't want a simple `dd' to be causing excessive latencies in page reclaim.

Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

(A good one! I wrote it ;))

I believe that the combination of dirty-page-tracking and its calls to balance_dirty_pages() mean that we can now never get more than dirty_ratio of memory into the dirty-or-writeback condition. The vm scanner can convert dirty pages into clean, under-writeback pages, but it cannot increase the total of dirty+writeback.

What about swapout? That can increase the number of writeback pages, without decreasing the number of dirty pages, no?

Could we not solve that by enabling cap_account_writeback on swapper_space, and thereby account swap writeback pages? Then the VM knows it has outstanding IO and need not panic.
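A sketch of what that suggestion could look like (hypothetical - this series doesn't contain it; BDI_CAP_NO_WRITEBACK is the capability flag that keeps a bdi out of the writeback accounting, and the field values here assume the mm/swap_state.c definition of this era):

	static struct backing_dev_info swap_backing_dev_info = {
	-	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
	+	.capabilities	= BDI_CAP_NO_ACCT_DIRTY,
		.unplug_io_fn	= swap_unplug_io_fn,
	};

With that, end_swap_bio_write()'s end_page_writeback() would show up in BDI_WRITEBACK like any other writeback completion.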
Re: [PATCH 10/10] mm: per device dirty threshold
On Mon, 2007-04-23 at 08:48 -0700, Christoph Lameter wrote: On Sat, 21 Apr 2007, Peter Zijlstra wrote: This is enormously wrong for CONFIG_NR_CPUS=1024 on a 2-way. Right, I knew about that but, uhm. I wanted to make that num_online_cpus(), and install a hotplug notifier to fold the percpu delta back into the total on cpu offline.

Use nr_cpu_ids instead. It contains the maximum possible cpus on this hardware and allows handling the hotplug case easily.

Ooh, that's handy... /me ditches the hotplug code again. That is, unless it's very common to have half empty boxens.. ?
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 12:58 +1000, Neil Brown wrote: On Friday April 20, [EMAIL PROTECTED] wrote: Scale writeback cache per backing device, proportional to its writeout speed.

So it works like this: We account for writeout in full pages. When a page has the Writeback flag cleared, we account that as a successfully retired write for the relevant bdi. By using floating averages we keep track of how many writes each bdi has retired 'recently', where the unit of time in which we understand 'recently' is a single page written.

That is actually that period I keep referring to. So recently is the last 'period' number of writeout completions.

We keep a floating average for each bdi, and a floating average for the total writeouts (that 'average' is, of course, 1.)

1 in the sense of unity, yes :-)

Using these numbers we can calculate what fraction of 'recently' retired writes were retired by each bdi (get_writeout_scale). Multiplying this fraction by the system-wide number of pages that are allowed to be dirty before write-throttling, we get the number of pages that the bdi can have dirty before write-throttling the bdi.

I note that the same fraction is *not* applied to background_thresh. Should it be? I guess not - there would be interesting starting transients, as a bdi which had done no writeout would not be allowed any dirty pages, so background writeout would start immediately, which isn't what you want... or is it?

This is something I have not been able to come to a conclusive answer on yet,...

For each bdi we also track the number of (dirty, writeback, unstable) pages and do not allow this to exceed the limit set for this bdi.

The calculations involving 'reserve' in get_dirty_limits are a little confusing. It looks like you're calculating how much total head-room there is for the bdi (pages that the system can still dirty - pages this bdi has dirty) and making sure the number returned in pbdi_dirty doesn't allow more than that to be used.

Yes, it limits the earned share of the total dirty limit to the possible share, ensuring that the total dirty limit is never exceeded. This is especially relevant when the proportions change faster than the pages get written out, ie. when the period < total dirty limit.

This is probably a reasonable thing to do but it doesn't feel like the right place. I think get_dirty_limits should return the raw threshold, and balance_dirty_pages should do both tests - the bdi-local test and the system-wide test.

Ok, that makes sense I guess. Currently you have a rather odd situation where

	+	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
	+		break;

might include numbers obtained with bdi_stat_sum being compared with numbers obtained with bdi_stat.

Yes, I was aware of that. The bdi_thresh is based on bdi_stat() numbers, whereas the others could be bdi_stat_sum(). I think this is ok, since the threshold is a 'guess' anyway; we just _need_ to ensure we do not get trapped by writeouts not arriving (due to getting stuck in the per cpu deltas).

--

I have all this commented in the new version.

With these patches, the VM still (I think) assumes that each BDI has a reasonable queue limit, so that writeback_inodes will block on a full queue. If a BDI has a very large queue, balance_dirty_pages will simply turn lots of DIRTY pages into WRITEBACK pages and then think "We've done our duty" without actually blocking at all.

It will block once we exceed the total number of dirty pages allowed for that BDI. But yes, this does not take away the need for queue limits.
This work was primarily aimed at allowing multiple queues to not interfere as much, so they all can make progress and not get starved.

With the extra accounting that we now have, I would like to see balance_dirty_pages() wait until RECLAIMABLE+WRITEBACK is actually less than 'threshold'. This would probably mean that we would need to support per-bdi background_writeout to smooth things out. Maybe that is fodder for another patch-set.

Indeed, I still have to wrap my mind around the background thing. Your input is appreciated.

You set:

	+	vm_cycle_shift = 1 + ilog2(vm_total_pages);

Can you explain that?

You found the one random knob I hid :-)

My experience is that scaling dirty limits with main memory isn't what we really want. When you get machines with very large memory, the amount that you want to be dirty is more a function of the speed of your IO devices rather than the amount of memory; otherwise you can sometimes see large filesystem lags ('sync' taking minutes?)

I wonder if it makes sense to try to limit the dirty data for a bdi to the amount that it can write out in some period of time - maybe 3 seconds. Probably configurable. You seem to have almost all the infrastructure in place to do that, and I think it
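The scaling step Neil summarizes reduces to something like the following (a sketch; it assumes get_writeout_scale(), which the discussion names, hands back the bdi's fraction of recent writeouts as a numerator/denominator pair - the exact signature is not shown in this excerpt):

	unsigned long scale, div;
	u64 bdi_dirty;

	/* fraction of recent writeout completions owned by this bdi */
	get_writeout_scale(bdi, &scale, &div);	/* scale/div <= 1 */

	/* this bdi's slice of the global dirty threshold */
	bdi_dirty = (u64)dirty_thresh * scale;
	do_div(bdi_dirty, div);

A bdi retiring half of all recent writeouts thus earns half the global dirty limit; an idle bdi's share decays toward zero as other devices complete writes.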
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 10:19 +0200, Miklos Szeredi wrote: This is probably a reasonable thing to do but it doesn't feel like the right place. I think get_dirty_limits should return the raw threshold, and balance_dirty_pages should do both tests - the bdi-local test and the system-wide test. Ok, that makes sense I guess.

Well, my narrow minded world view says it's not such a good idea, because it would again introduce the deadlock scenario we're trying to avoid.

I was only referring to the placement of the clipping; exactly where that happens does not affect the deadlock.

In a sense allowing a queue to go over the global limit just a little bit is a good thing. Actually the very original code does that: if writeback was started for write_chunk number of pages, then we allow ratelimit (8) _new_ pages to be dirtied, effectively ignoring the global limit.

It might be time to get rid of that rate-limiting. balance_dirty_pages()'s fast path is not nearly as heavy as it used to be. All these fancy counter systems have removed quite a bit of iteration from there.

That's why I've been saying that the current code is so unfair: if there are lots of dirty pages to be written back to a particular device, then balance_dirty_pages() allows the dirty producer to make even more pages dirty; but if there are _no_ dirty pages for a device, and we are over the limit, then that dirty producer is allowed absolutely no new dirty pages until the global counts subside.

Well, that got fixed on a per device basis with this patch; it is still true for multiple tasks writing to the same device.

I'm still not quite sure what purpose the above soft limiting serves. It seems to just give an advantage to writers which managed to accumulate lots of dirty pages, and then can convert that into even more dirtyings.

The queues only limit the actual in-flight writeback pages; balance_dirty_pages() considers all pages that might become writeback as well as those that are.

Would it make sense to remove this behavior, and ensure that balance_dirty_pages() doesn't return until the per-queue limits have been complied with?

I don't think that will help; balance_dirty_pages drives the queues. That is, it converts pages from mere dirty to writeback.
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 11:14 +0200, Miklos Szeredi wrote: I'm still not quite sure what purpose the above soft limiting serves. It seems to just give an advantage to writers which managed to accumulate lots of dirty pages, and then can convert that into even more dirtyings. The queues only limit the actual in-flight writeback pages; balance_dirty_pages() considers all pages that might become writeback as well as those that are. Would it make sense to remove this behavior, and ensure that balance_dirty_pages() doesn't return until the per-queue limits have been complied with? I don't think that will help; balance_dirty_pages drives the queues. That is, it converts pages from mere dirty to writeback.

Yes. But current logic says that if you convert write_chunk dirty to writeback, you are allowed to dirty ratelimit more.

	D: number of dirty pages
	W: number of writeback pages
	L: global limit
	C: write_chunk = ratelimit_pages * 1.5
	R: ratelimit

If D+W >= L, then R = 8.

Let's assume that D == L and W == 0, and that all of the dirty pages belong to a single device. Also, for simplicity, let's assume an infinite length queue and a slow device. Then while converting the dirty pages to writeback, D / C * R new dirty pages can be created. So when all existing dirty pages have been converted:

	D = L / C * R
	W = L
	D + W = L * (1 + R / C)

So we see that we're now even more above the limit than before the conversion. This means that we starve writers to other devices, which don't have as many dirty pages, because until the slow device finishes these writes they will not get to do anything.

Your patch helps this in that if the other writers have an empty queue and no dirty pages, they will be allowed to slowly start writing. But they will not gain their full share until the slow dirty-hog goes below the global limit, which may take some time.

So I think the logical thing to do is: if the dirty-hog is over its queue limit, don't let it dirty any more until its dirty+writeback go below the limit. That allows other devices to more quickly gain their share of dirty pages.

Ahh, now I see; I had totally blocked out these few lines:

	pages_written += write_chunk - wbc.nr_to_write;
	if (pages_written >= write_chunk)
		break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those?
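Plugging illustrative numbers into that result (hypothetical values, just to gauge the overshoot: L = 100000 pages, R = 8, and C = 1536, i.e. ratelimit_pages = 1024 so write_chunk = 1.5 * 1024):

	D + W = L * (1 + R/C)
	      = 100000 * (1 + 8/1536)
	      ~ 100521 pages

A single pass thus overshoots by only ~0.5%, but because R new dirtyings are granted again after every C pages converted, a writer that always has pages to convert can stay pinned above L indefinitely - which is the starvation of the other writers that Miklos describes.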
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 03:00 -0700, Andrew Morton wrote: On Tue, 24 Apr 2007 11:47:20 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote: Ahh, now I see; I had totally blocked out these few lines:

	pages_written += write_chunk - wbc.nr_to_write;
	if (pages_written >= write_chunk)
		break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Although I'm a little afraid that Akpm will tell me again that I'm a stupid git, and that those lines are in fact vitally important ;) It depends what they're replaced with. That code is there, iirc, to prevent a process from getting stuck in balance_dirty_pages() forever due to the dirtying activity of other processes. hm, we ask the process to write write_chunk pages each time around the loop. So if it wrote write_chunk/2 pages on the first pass it might end up writing write_chunk*1.5 pages total. I guess that's rare, and it doesn't matter much if it does happen - the upper bound is write_chunk*2-1, I think. Right, but I think the problem is that it's dirty -> writeback, not dirty -> writeback completed. Ie. they don't guarantee progress; it could be that the total nr_reclaimable + nr_writeback will steadily increase due to this break. How about ensuring that vm_writeout_total increases by at least 2*sync_writeback_pages() during our stay in balance_dirty_pages()? That way we have the guarantee that more pages get written out than can be dirtied.
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 12:19 +0200, Miklos Szeredi wrote: Ahh, now I see; I had totally blocked out these few lines:

	pages_written += write_chunk - wbc.nr_to_write;
	if (pages_written >= write_chunk)
		break;		/* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I think he agrees. Shall we just kill those? I think we should. Although I'm a little afraid that Akpm will tell me again that I'm a stupid git, and that those lines are in fact vitally important ;) It depends what they're replaced with. That code is there, iirc, to prevent a process from getting stuck in balance_dirty_pages() forever due to the dirtying activity of other processes. hm, we ask the process to write write_chunk pages each time around the loop. So if it wrote write_chunk/2 pages on the first pass it might end up writing write_chunk*1.5 pages total. I guess that's rare, and it doesn't matter much if it does happen - the upper bound is write_chunk*2-1, I think. Right, but I think the problem is that it's dirty -> writeback, not dirty -> writeback completed. Ie. they don't guarantee progress; it could be that the total nr_reclaimable + nr_writeback will steadily increase due to this break. How about ensuring that vm_writeout_total increases by at least 2*sync_writeback_pages() during our stay in balance_dirty_pages()? That way we have the guarantee that more pages get written out than can be dirtied. No, because that's a global counter, which many writers could be looking at. We'd need a per-task writeout counter, but when finishing the write we don't know anymore which task it was performed for. Yeah, just reached that conclusion myself too - again, I ran into that when trying to figure out how to do the per-task balancing right.
Re: [PATCH 3/7] revoke: core code
On Fri, 2007-03-09 at 10:15 +0200, Pekka J Enberg wrote:

+static int revoke_vma(struct vm_area_struct *vma, struct zap_details *details)
+{
+	unsigned long restart_addr, start_addr, end_addr;
+	int need_break;
+
+	start_addr = vma->vm_start;
+	end_addr = vma->vm_end;
+
+	/*
+	 * Not holding ->mmap_sem here.
+	 */
+	vma->vm_flags |= VM_REVOKED;
+	smp_mb();

Hmm, i_mmap_lock pins the vma and excludes modifications, but doesn't exclude concurrent faults. I guess it's safe.

+  again:
+	restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr,
+					details);
+
+	need_break = need_resched() || need_lockbreak(details->i_mmap_lock);
+	if (need_break)
+		goto out_need_break;
+
+	if (restart_addr < end_addr) {
+		start_addr = restart_addr;
+		goto again;
+	}
+	return 0;
+
+  out_need_break:
+	spin_unlock(details->i_mmap_lock);
+	cond_resched();
+	spin_lock(details->i_mmap_lock);
+	return -EINTR;

I'm not sure this scheme works; given a sufficiently loaded machine, this might never complete.

+}
+
+static int revoke_mapping(struct address_space *mapping, struct file *to_exclude)
+{
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	struct zap_details details;
+	int err = 0;
+
+	details.i_mmap_lock = &mapping->i_mmap_lock;
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
+		if (vma->vm_flags & VM_SHARED && vma->vm_file != to_exclude) {

I'm never sure of operator precedence and prefer:

	(vma->vm_flags & VM_SHARED) && ...

which leaves no room for error.

+			err = revoke_vma(vma, &details);
+			if (err)
+				goto out;
+		}
+	}
+
+	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
+		if (vma->vm_flags & VM_SHARED && vma->vm_file != to_exclude) {

Idem.

+			err = revoke_vma(vma, &details);
+			if (err)
+				goto out;
+		}
+	}
+ out:
+	spin_unlock(&mapping->i_mmap_lock);
+	return err;
+}
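[One possible way out of the "might never complete" trap, sketched against the code above: return the restart address instead of -EINTR, so a retried walk resumes where it stopped rather than from vma->vm_start. Illustrative only - the helper semantics (zap_page_range() returning a restart address, need_lockbreak()) are assumed from kernels of this vintage, and this is not the posted patch:]

	static unsigned long revoke_vma_resumable(struct vm_area_struct *vma,
						  unsigned long start_addr,
						  struct zap_details *details)
	{
		unsigned long end_addr = vma->vm_end;
		unsigned long restart_addr;

		while (start_addr < end_addr) {
			restart_addr = zap_page_range(vma, start_addr,
						      end_addr - start_addr,
						      details);
			if (need_resched() ||
			    need_lockbreak(details->i_mmap_lock))
				return restart_addr; /* caller retries from here */
			start_addr = restart_addr;
		}
		return end_addr;	/* done */
	}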
Re: [RFC][PATCH 0/3] swsusp: Stop using page flags
On Sun, 2007-03-11 at 11:17 +0100, Rafael J. Wysocki wrote: Hi, The following three patches make swsusp use its own data structures for memory management instead of special page flags. Thus the page flags used so far by swsusp (PG_nosave, PG_nosave_free) can be used for other purposes, and I believe there are some urgent needs for them. :-) Last week I sent these patches to the linux-pm and linux-mm lists and there were no negative comments. Also, I've been testing them on my x86_64 boxes for a few days and apparently they don't break anything. I think they can go into -mm for testing. Comments are welcome. These patches have my blessing; they look good to me, but I'm not much involved with the swsusp code, so I won't ACK them. Again, thanks a bunch for freeing up 2 page flags :-) Peter
Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)
On Sun, 2007-03-11 at 15:50 +0200, Michael S. Tsirkin wrote: Quoting Roland Dreier [EMAIL PROTECTED]: Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!

Feb 27 17:47:52 sw169 kernel: [<8053aaf1>] _spin_lock_irqsave+0x15/0x24
Feb 27 17:47:52 sw169 kernel: [<88067a23>] :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139

It looks like this is deadlocking trying to take priv->lock in ipoib_neigh_destructor(). One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING turned on, and then rerun this test. There's a good chance that this would diagnose the deadlock. (I don't have good access to my test machines right now, or else I would do it myself.) OK, I did that. But I get

[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.

and I am not sure what triggers this, or how to fix it to have the validator actually do its job. It usually indicates a spinlock that is not properly initialized - like __SPIN_LOCK_UNLOCKED() used in a non-static context; use spin_lock_init() in those cases. However, looking at the code, ipoib_neigh_destructor only uses priv->lock, and that seems to get properly initialized in ipoib_setup() using spin_lock_init(). So either there are other sites that instantiate those objects and forget about the lock init, or the object is corrupted (use after free?). Ingo, what key does the message refer to? The stack dump seems to point to drivers/infiniband/ulp/ipoib/ipoib_main.c line 829. Full message below:

[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.
[13440.778008]  [<c023c082>] __lock_acquire+0xae4/0xbb9
[13440.783078]  [<c023c43d>] lock_acquire+0x56/0x71
[13440.787784]  [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.794412]  [<c051ad41>] _spin_lock_irqsave+0x32/0x41
[13440.799649]  [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.806275]  [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.812897]  [<c04a1c1b>] dst_run_gc+0xc/0x118
[13440.817439]  [<c022af6e>] run_timer_softirq+0x37/0x16b
[13440.822673]  [<c04a1c0f>] dst_run_gc+0x0/0x118
[13440.827221]  [<c04a3eab>] neigh_destroy+0xbe/0x104
[13440.832114]  [<c04a1bb1>] dst_destroy+0x4d/0xab
[13440.836751]  [<c04a1c64>] dst_run_gc+0x55/0x118
[13440.841384]  [<c022b03f>] run_timer_softirq+0x108/0x16b
[13440.846711]  [<c0227634>] __do_softirq+0x5a/0xd5
[13440.851427]  [<c023b435>] trace_hardirqs_on+0x106/0x141
[13440.856754]  [<c0227643>] __do_softirq+0x69/0xd5
[13440.861470]  [<c02276e6>] do_softirq+0x37/0x4d
[13440.866016]  [<c02167b0>] smp_apic_timer_interrupt+0x6b/0x77
[13440.871774]  [<c02029ef>] default_idle+0x3b/0x54
[13440.876491]  [<c02029ef>] default_idle+0x3b/0x54
[13440.881211]  [<c0204c33>] apic_timer_interrupt+0x33/0x38
[13440.886624]  [<c02029ef>] default_idle+0x3b/0x54
[13440.891342]  [<c02029f1>] default_idle+0x3d/0x54
[13440.896061]  [<c0202aaa>] cpu_idle+0xa2/0xbb
[13440.900436]  ===
[13768.711447] BUG: spinlock lockup on CPU#1, swapper/0, c0687880
[13768.717353]  [<c031f919>] _raw_spin_lock+0xda/0xfd
[13768.722247]  [<c051ad48>] _spin_lock_irqsave+0x39/0x41
[13768.727486]  [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13768.734110]  [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13768.740735]  [<c04a1c1b>] dst_run_gc+0xc/0x118
[13768.745276]  [<c022af6e>] run_timer_softirq+0x37/0x16b
[13768.750517]  [<c04a1c0f>] dst_run_gc+0x0/0x118
[13768.755061]  [<c04a3eab>] neigh_destroy+0xbe/0x104
[13768.759955]  [<c04a1bb1>] dst_destroy+0x4d/0xab
[13768.764586]  [<c04a1c64>] dst_run_gc+0x55/0x118
[13768.769218]  [<c022b03f>] run_timer_softirq+0x108/0x16b
[13768.774542]  [<c0227634>] __do_softirq+0x5a/0xd5
[13768.779261]  [<c023b435>] trace_hardirqs_on+0x106/0x141
[13768.784588]  [<c0227643>] __do_softirq+0x69/0xd5
[13768.789308]  [<c02276e6>] do_softirq+0x37/0x4d
[13768.793851]  [<c02167b0>] smp_apic_timer_interrupt+0x6b/0x77
[13768.799609]  [<c02029ef>] default_idle+0x3b/0x54
[13768.804326]  [<c02029ef>] default_idle+0x3b/0x54
[13768.809054]  [<c0204c33>] apic_timer_interrupt+0x33/0x38
[13768.814471]  [<c02029ef>] default_idle+0x3b/0x54
[13768.819187]  [<c02029f1>] default_idle+0x3d/0x54
[13768.823903]  [<c0202aaa>] cpu_idle+0xa2/0xbb
[13768.828279]  ===
-- MST
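[For readers hitting the same "trying to register non-static key" message: the usual shape of the bug and its fix, in a minimal hedged sketch - obj and the obj_alloc_* names are invented:]

	struct obj {
		spinlock_t lock;
		/* ... */
	};

	static struct obj *obj_alloc_buggy(gfp_t gfp)
	{
		struct obj *o = kmalloc(sizeof(*o), gfp);
		/* BUG: o->lock is never initialized; the first spin_lock()
		 * makes lockdep complain about a non-static key. */
		return o;
	}

	static struct obj *obj_alloc_fixed(gfp_t gfp)
	{
		struct obj *o = kmalloc(sizeof(*o), gfp);
		if (o)
			spin_lock_init(&o->lock); /* registers a proper lockdep key */
		return o;
	}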
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Mon, 2007-03-12 at 08:26 -0700, Linus Torvalds wrote: So good fairness really should involve some notion of work done for others. It's just not very easy to do.. A solution that is already in demand is a class-based scheduler, where the thread doing work for a client (temporarily) joins the class of the client. The in-kernel virtualization guys also want this, for pretty much the same reasons. Peter
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Mon, 2007-03-12 at 21:11 +0100, Mike Galbraith wrote: How would you go about ensuring that there won't be any cycles wasted? SCHED_IDLE or otherwise nice 19. Killing the known corner-case starvation scenarios is wonderful, but let's not just pretend that interactive tasks don't have any special requirements. Interaction wants low latency; getting that is traditionally expressed in priorities - the highest priority gets the least latency (all RTOSes work like that). There is nothing that warrants giving them more CPU time, IMHO; if you think they deserve more, express that using priorities. Priorities are a well-understood concept and they work; heuristics can (and Murphy tells us they will) go wrong. Getting the server/client thing working can be done without heuristics using class-based scheduling.
Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers
Hi, I've been working on an alternative solution (see patch below). However I haven't posted yet because I'm not quite satisfied and haven't done a lot of testing. The patch relies on the per backing dev dirty/writeback counts currently in -mm to which David Chinner objected. I plan to rework those as percpu counters. I think my solution might behave better because it fully decouples the device throttling. --- Scale writeback cache per backing device, proportional to its writeout speed. akpm sayeth: Which problem are we trying to solve here? afaik our two uppermost problems are: a) Heavy write to queue A causes light writer to queue B to blok for a long time in balance_dirty_pages(). Even if the devices have the same speed. This one; esp when not the same speed. The - my usb stick makes my computer suck - problem. But even on similar speed, the separation of device should avoid blocking dev B when dev A is being throttled. The writeout speed is measure dynamically, so when it doesn't have anything to write out for a while its writeback cache size goes to 0. Conversely, when starting up it will in the beginning act almost synchronous but will quickly build up a 'fair' share of the writeback cache. b) heavy write to device A causes light write to device A to block for a long time in balance_dirty_pages(), occasionally. Harder to fix. This will indeed take more. I've thought about it though. But one quickly ends up with per task state. How it all works: We pick a 2^n value based on the vm_dirty_ratio and total vm size to act as a period - vm_cycle_shift. This period measures 'time' in writeout events. Each writeout increases time and adds to a per bdi counter. This counter is halved when a period expires. So per bdi speed is: 0.5 * (previous cycle speed) + this cycle's events. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- block/ll_rw_blk.c |3 + include/linux/backing-dev.h |7 +++ include/linux/writeback.h | 10 kernel/sysctl.c | 10 +++- mm/page-writeback.c | 102 ++-- 5 files changed, 119 insertions(+), 13 deletions(-) Index: linux-2.6/include/linux/backing-dev.h === --- linux-2.6.orig/include/linux/backing-dev.h +++ linux-2.6/include/linux/backing-dev.h @@ -34,6 +34,13 @@ struct backing_dev_info { void *congested_data; /* Pointer to aux data for congested func */ void (*unplug_io_fn)(struct backing_dev_info *, struct page *); void *unplug_io_data; + + /* +* data used for scaling the writeback cache +*/ + spinlock_t lock;/* protect the cycle count */ + atomic_long_t nr_writeout; /* writeout scale */ + unsigned long cycles; /* writeout cycles */ }; Index: linux-2.6/include/linux/writeback.h === --- linux-2.6.orig/include/linux/writeback.h +++ linux-2.6/include/linux/writeback.h @@ -4,6 +4,8 @@ #ifndef WRITEBACK_H #define WRITEBACK_H +#include linux/log2.h + struct backing_dev_info; extern spinlock_t inode_lock; @@ -89,11 +91,19 @@ void throttle_vm_writeout(gfp_t gfp_mask /* These are exported to sysctl. 
*/ extern int dirty_background_ratio; extern int vm_dirty_ratio; +extern int vm_cycle_shift; extern int dirty_writeback_interval; extern int dirty_expire_interval; extern int block_dump; extern int laptop_mode; +extern long vm_total_pages; /* reduce dependancy stuff */ +static inline void update_cycle_shift(void) +{ + unsigned long dirty_pages = (vm_dirty_ratio * vm_total_pages) / 100; + vm_cycle_shift = 2 + ilog2_up(int_sqrt(dirty_pages)); +} + struct ctl_table; struct file; int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *, Index: linux-2.6/kernel/sysctl.c === --- linux-2.6.orig/kernel/sysctl.c +++ linux-2.6/kernel/sysctl.c @@ -612,6 +612,14 @@ static ctl_table kern_table[] = { static int zero; static int one_hundred = 100; +static int proc_dointvec_vm_dirty_ratio(ctl_table *table, int write, + struct file *filp, void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos); + update_cycle_shift(); + return ret; +} static ctl_table vm_table[] = { { @@ -663,7 +671,7 @@ static ctl_table vm_table[] = { .data = vm_dirty_ratio, .maxlen = sizeof(vm_dirty_ratio), .mode = 0644, - .proc_handler = proc_dointvec_minmax, + .proc_handler = proc_dointvec_vm_dirty_ratio, .strategy = sysctl_intvec, .extra1 = zero, .extra2 = one_hundred, Index: linux
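[The patch above is cut off mid-diff in this archive. Its choice of period is easy to sanity-check in userspace, though: the program below evaluates 2 + ilog2_up(int_sqrt(dirty_pages)) from update_cycle_shift() for a few example memory sizes. int_sqrt() and ilog2_up() are reimplemented locally (the kernel versions differ in detail) and vm_dirty_ratio is assumed to be 40 purely for illustration:]

	#include <stdio.h>
	#include <math.h>

	static unsigned long int_sqrt(unsigned long x)
	{
		unsigned long r = (unsigned long)sqrt((double)x);
		while ((r + 1) * (r + 1) <= x) r++;
		while (r * r > x) r--;
		return r;
	}

	static int ilog2_up(unsigned long x)	/* ceil(log2(x)) */
	{
		int bits = 0;
		unsigned long v = 1;
		while (v < x) { v <<= 1; bits++; }
		return bits;
	}

	int main(void)
	{
		/* 4kB pages: 128MB, 1GB, 16GB of RAM */
		unsigned long totals[] = { 1UL << 15, 1UL << 18, 1UL << 22 };
		for (int i = 0; i < 3; i++) {
			unsigned long dirty = totals[i] * 40 / 100;
			int shift = 2 + ilog2_up(int_sqrt(dirty));
			printf("total=%lu pages -> cycle_shift=%d (period=%lu events)\n",
			       totals[i], shift, 1UL << shift);
		}
		return 0;
	}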
Re: NPTL patch for linux 2.4.28
On Thu, 2007-03-15 at 03:14 +0530, Syed Ahemed wrote: Getting RHEL's source (http://lkml.org/lkml/2005/3/21/380) was an idea I thought about, but then a download of the RHEL source from the following location was denied: http://download.fedora.redhat.com/pub/fedora/linux/core/1/SRPMS/ and the rpmfind site. (Guess you need to be a paid subscriber for that, right?) Strangely enough, you try to download Fedora Core SRPMs whilst you speak of RHEL. Try this one: ftp://ftp.redhat.com/pub/redhat/linux/enterprise/3/en/os/i386/SRPMS/kernel-2.4.21-4.EL.src.rpm Also, CentOS would have similar sources. Google could have informed you of that. I still wonder why there aren't any NPTL patches available on the non-redhat sites for kernels like 2.4.21 or later. Because most people, especially the ones on this mailing list, moved on to 2.6 quite some time ago. May I suggest you do the same?
Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed
On Wed, 2007-03-14 at 15:58 -0400, Ashif Harji wrote: This patch unconditionally calls mark_page_accessed to prevent pages, especially for small files, from being evicted from the page cache despite frequent access. Since we're haggling over the use-once stuff again... /me brings up: http://marc.info/?l=linux-mm&m=115316894804385&w=2 and ducks.
Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote: Hi. I'm pleased to present these patches, which improve linux futex performance and scalability on UP, SMP and NUMA configs. I had this idea last year but I was not understood, probably because I gave not enough explanations. Sorry if this mail is really long... I started playing with it after your last reference to it; I have some code here (against -rt): http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/ which I will post once I have found what keeps pthread_join() from completing :-( It basically adds a per-task vma lookup cache which can also activate the private logic without explicit use of the new interface.
Re: [2.6.20] BUG: workqueue leaked lock
On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote: On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden [EMAIL PROTECTED] wrote: ...

[ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x/3577
[ 1756.728271]   last function: laundromat_main+0x0/0x69 [nfsd]
[ 1756.728392] 2 locks held by nfsd4/3577:
[ 1756.728435] #0:  (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[ 1756.728679] #1:  (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
[ 1756.728923]  [<c1003d57>] show_trace_log_lvl+0x1a/0x30
[ 1756.729015]  [<c1003d7f>] show_trace+0x12/0x14
[ 1756.729103]  [<c1003e79>] dump_stack+0x16/0x18
[ 1756.729187]  [<c102c2e8>] run_workqueue+0x167/0x170
[ 1756.729276]  [<c102c437>] worker_thread+0x146/0x165
[ 1756.729368]  [<c102f797>] kthread+0x97/0xc4
[ 1756.729456]  [<c1003bdb>] kernel_thread_helper+0x7/0x10
[ 1756.729547]  ===
[ 1792.436492] svc: unknown version (0 for prog 13, nfsd)
[ 1846.683648] BUG: workqueue leaked lock or atomic: nfsd4/0x/3577
(the same report, with an identical backtrace, repeats at 1846.68)

Oleg, that's a fairly incomprehensible message we have in there. Can you please explain what it means? I think I'm responsible for this message (commit d5abe669172f20a4129a711de0f250a4e07db298); what it says is that the function executed by the workqueue (here laundromat_main) leaked an atomic context or is still holding locks (2 locks in this case).
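[For reference, the detection behind that message is roughly the following - a paraphrase from memory of the commit named above, not a verbatim quote; work_fn stands for the queued function pointer. After running a work item, the workqueue thread verifies it restored its context:]

	work_fn(work);	/* the queued function, e.g. laundromat_main */

	if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
		printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
				"%s/0x%08x/%d\n",
				current->comm, preempt_count(), current->pid);
		printk(KERN_ERR "    last function: ");
		print_symbol("%s\n", (unsigned long)work_fn);
		debug_show_held_locks(current);
		dump_stack();
	}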
Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
On Fri, 2007-03-16 at 10:30 +0100, Eric Dumazet wrote: On Friday 16 March 2007 09:05, Peter Zijlstra wrote: On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote: Hi. I'm pleased to present these patches, which improve linux futex performance and scalability on UP, SMP and NUMA configs. I had this idea last year but I was not understood, probably because I gave not enough explanations. Sorry if this mail is really long... I started playing with it after your last reference to it; I have some code here (against -rt): http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/ which I will post once I have found what keeps pthread_join() from completing :-( It basically adds a per-task vma lookup cache which can also activate the private logic without explicit use of the new interface. Hi Peter. I don't think yet another cache will help in the general case. A typical program uses many vmas at once... glibc has internal futexes on a different vma than the futexes declared in your program. Each shared library is going to have its own vma for its data (and futexes) (244 vmas on one kmail program, for example). Yeah, I was just hoping a few cache entries would be enough to get the worst of them. A benchmark will have to tell, I guess. About your guess_futex_shared() thing, I miss the vma_anon() definition. http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch But if it has to walk the vmas (and take mmap_sem), you already lose the PRIVATE benefit. It doesn't take mmap_sem; I am aware of the problems.
Re: [PATCH 2.6.21-rc3-mm2 3/4] futex_requeue_pi optimization
On Tue, 2007-03-13 at 10:52 +0100, [EMAIL PROTECTED] wrote: plain text document attachment (futex-requeue-pi.diff) This patch provides the futex_requeue_pi functionality. This provides an optimization, already used for (normal) futexes, to be used for PI-futexes. This optimization is currently used by the glibc in pthread_broadcast, when using normal mutexes. With futex_requeue_pi, it can be used with PRIO_INHERIT mutexes too. Signed-off-by: Pierre Peiffer [EMAIL PROTECTED]
---

 /*
+ * Retrieve the original address used to compute this key
+ */
+static void *get_futex_address(union futex_key *key)
+{
+	void *uaddr;
+
+	if (key->both.offset & 1) {
+		/* shared mapping */
+		uaddr = (void *)((key->shared.pgoff << PAGE_SHIFT)
+				 + key->shared.offset - 1);
+	} else {
+		/* private mapping */
+		uaddr = (void *)(key->private.address + key->private.offset);
+	}
+
+	return uaddr;
+}

This will not work for nonlinear vmas; granted, not a lot of ppl stick futexes in nonlinear vmas, but the futex_key stuff handles it, this doesn't.
Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
On Fri, 2007-03-16 at 11:30 +0100, Eric Dumazet wrote: On Friday 16 March 2007 11:10, Peter Zijlstra wrote: http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch Oh, thanks. But if it has to walk the vmas (and take mmap_sem), you already lose the PRIVATE benefit. It doesn't take mmap_sem; I am aware of the problems. Yes, but the vma_anon() -> vma_cache_find() path needs to read 3 cache lines on x86_64 (sizeof(struct vma_cache) = 136) and dirty one bit, so it might be more expensive than the mmap_sem... I thought the cacheline was 128 bytes, but point taken.
Re: [GIT] NFS client updates for 2.6.20
On Mon, 2007-02-12 at 23:39 -0800, Trond Myklebust wrote: commit 7c85d9007d05436e71d2b805b96c1e36a8193bd4 Author: Trond Myklebust [EMAIL PROTECTED] Date: Wed Dec 13 15:23:48 2006 -0500 NFS: Fixup some outdated comments... Signed-off-by: Trond Myklebust [EMAIL PROTECTED] A few new comments would be appreciated :-) commit 02241bc47e8961768de83d855accd0dcad1df045 Author: Trond Myklebust [EMAIL PROTECTED] Date: Sat Jan 13 02:28:07 2007 -0500 NFS: Ensure that ->writepage() uses flush_stable() when reclaiming pages Signed-off-by: Trond Myklebust [EMAIL PROTECTED] That patch seems to make all ->writepage() calls use stable storage. Of course, given that there are no other ->writepage() calls, this works out. I've been using this:
---
Subject: nfs: only use stable storage for swap

unstable writes don't make sense for swap pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
Cc: Trond Myklebust [EMAIL PROTECTED]
---
 fs/nfs/write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-git/fs/nfs/write.c
===
--- linux-2.6-git.orig/fs/nfs/write.c	2007-01-09 11:38:25.0 +0100
+++ linux-2.6-git/fs/nfs/write.c	2007-01-09 13:40:34.0 +0100
@@ -231,7 +231,7 @@ static int nfs_writepage_setup(struct nf
 static int wb_priority(struct writeback_control *wbc)
 {
 	if (wbc->for_reclaim)
-		return FLUSH_HIGHPRI;
+		return FLUSH_HIGHPRI|FLUSH_STABLE;
 	if (wbc->for_kupdate)
 		return FLUSH_LOWPRI;
 	return 0;
[PATCH] lockdep: annotate BLKPG_DEL_PARTITION
=
[ INFO: possible recursive locking detected ]
2.6.19-1.2909.fc7 #1
-
anaconda/587 is trying to acquire lock:
 (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
but task is already holding lock:
 (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24

other info that might help us debug this:
1 lock held by anaconda/587:
 #0:  (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24

stack backtrace:
 [<c0405812>] show_trace_log_lvl+0x1a/0x2f
 [<c0405db2>] show_trace+0x12/0x14
 [<c0405e36>] dump_stack+0x16/0x18
 [<c043bd84>] __lock_acquire+0x116/0xa09
 [<c043c960>] lock_acquire+0x56/0x6f
 [<c05fb1fa>] __mutex_lock_slowpath+0xe5/0x24a
 [<c05fb380>] mutex_lock+0x21/0x24
 [<c04d82fb>] blkdev_ioctl+0x600/0x76d
 [<c04946b1>] block_ioctl+0x1b/0x1f
 [<c047ed5a>] do_ioctl+0x22/0x68
 [<c047eff2>] vfs_ioctl+0x252/0x265
 [<c047f04e>] sys_ioctl+0x49/0x63
 [<c0404070>] syscall_call+0x7/0xb

Annotate BLKPG_DEL_PARTITION's bd_mutex locking and add a little comment clarifying the bd_mutex locking, because I confused myself and initially thought the lock order was wrong too.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ioctl.c  | 2 +-
 fs/block_dev.c | 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

Index: linux-2.6.20.noarch/block/ioctl.c
===
--- linux-2.6.20.noarch.orig/block/ioctl.c
+++ linux-2.6.20.noarch/block/ioctl.c
@@ -82,7 +82,7 @@ static int blkpg_ioctl(struct block_devi
 			fsync_bdev(bdevp);
 			invalidate_bdev(bdevp, 0);
 
-			mutex_lock(&bdev->bd_mutex);
+			mutex_lock_nested(&bdev->bd_mutex, 1);
 			delete_partition(disk, part);
 			mutex_unlock(&bdev->bd_mutex);
 			mutex_unlock(&bdevp->bd_mutex);
Index: linux-2.6.20.noarch/fs/block_dev.c
===
--- linux-2.6.20.noarch.orig/fs/block_dev.c
+++ linux-2.6.20.noarch/fs/block_dev.c
@@ -1101,6 +1101,13 @@ static int __blkdev_get(struct block_dev
 			int for_part);
 static int __blkdev_put(struct block_device *bdev, int for_part);
 
+/*
+ * bd_mutex locking:
+ *
+ *  mutex_lock(part->bd_mutex)
+ *    mutex_lock_nested(whole->bd_mutex, 1)
+ */
+
 static int do_open(struct block_device *bdev, struct file *file, int for_part)
 {
 	struct module *owner = NULL;
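[For readers unfamiliar with the annotation: mutex_lock_nested() takes a lockdep subclass, so two legitimately nested locks of the same class (here the partition's and the whole disk's bd_mutex) don't look like recursion to the validator. A hedged sketch of the pattern, not the kernel code itself:]

	static void two_level_lock(struct block_device *part,
				   struct block_device *whole)
	{
		mutex_lock(&part->bd_mutex);		/* subclass 0 */
		mutex_lock_nested(&whole->bd_mutex, 1);	/* subclass 1 */
		/* ... operate on both ... */
		mutex_unlock(&whole->bd_mutex);
		mutex_unlock(&part->bd_mutex);
	}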
Re: [PATCH 2.6.20 1/1] fbdev,mm: hecuba/E-Ink fbdev driver
On Sat, 2007-02-17 at 11:42 +0100, Jaya Kumar wrote: Hi James, Geert, lkml and mm, Hi Jaya, This patch adds support for the Hecuba/E-Ink display with deferred IO. The changes from the previous version are to switch to using a mutex and lock_page. I welcome your feedback and advice. This changelog ought to be a little more extensive, esp. because you're using these fancy new functions ->page_mkwrite() and page_mkclean() in a novel way. Also, I'd still like to see a way to call msync() on the mmap'ed region to force a flush. I think providing an fb_fsync() method in fbmem.c and a hook down to the driver ought to work. Also, you now seem to use a fixed 1-second delay; perhaps provide an ioctl or something to customize this? And, as Andrew suggested last time around, could you perhaps push this fancy new idea into the FB layer so that more drivers can make use of it?
[RFC][PATCH 2/6] mm: count dirty pages per BDI
Count per BDI dirty pages. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/buffer.c |1 + include/linux/backing-dev.h |1 + mm/page-writeback.c |2 ++ mm/truncate.c |1 + 4 files changed, 5 insertions(+) Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -732,6 +732,7 @@ int __set_page_dirty_buffers(struct page if (page-mapping) {/* Race with truncate? */ if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); + __inc_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); task_io_account_write(PAGE_CACHE_SIZE); } radix_tree_tag_set(mapping-page_tree, Index: linux-2.6/mm/page-writeback.c === --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -766,6 +766,7 @@ int __set_page_dirty_nobuffers(struct pa BUG_ON(mapping2 != mapping); if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); + __inc_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); task_io_account_write(PAGE_CACHE_SIZE); } radix_tree_tag_set(mapping-page_tree, @@ -892,6 +893,7 @@ int clear_page_dirty_for_io(struct page set_page_dirty(page); if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); return 1; } return 0; Index: linux-2.6/mm/truncate.c === --- linux-2.6.orig/mm/truncate.c +++ linux-2.6/mm/truncate.c @@ -71,6 +71,7 @@ void cancel_dirty_page(struct page *page struct address_space *mapping = page-mapping; if (mapping mapping_cap_account_dirty(mapping)) { dec_zone_page_state(page, NR_FILE_DIRTY); + dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); if (account_size) task_io_account_cancelled_write(account_size); } Index: linux-2.6/include/linux/backing-dev.h === --- linux-2.6.orig/include/linux/backing-dev.h +++ linux-2.6/include/linux/backing-dev.h @@ -23,6 +23,7 @@ enum bdi_state { }; enum bdi_stat_item { + BDI_DIRTY, NR_BDI_STAT_ITEMS }; -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 3/6] mm: count writeback pages per BDI
Count per BDI writeback pages. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/backing-dev.h |1 + mm/page-writeback.c |8 ++-- 2 files changed, 7 insertions(+), 2 deletions(-) Index: linux-2.6/mm/page-writeback.c === --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -912,10 +912,12 @@ int test_clear_page_writeback(struct pag write_lock_irqsave(mapping-tree_lock, flags); ret = TestClearPageWriteback(page); - if (ret) + if (ret) { radix_tree_tag_clear(mapping-page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); + __dec_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK); + } write_unlock_irqrestore(mapping-tree_lock, flags); } else { ret = TestClearPageWriteback(page); @@ -933,10 +935,12 @@ int test_set_page_writeback(struct page write_lock_irqsave(mapping-tree_lock, flags); ret = TestSetPageWriteback(page); - if (!ret) + if (!ret) { radix_tree_tag_set(mapping-page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); + __inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK); + } if (!PageDirty(page)) radix_tree_tag_clear(mapping-page_tree, page_index(page), Index: linux-2.6/include/linux/backing-dev.h === --- linux-2.6.orig/include/linux/backing-dev.h +++ linux-2.6/include/linux/backing-dev.h @@ -24,6 +24,7 @@ enum bdi_state { enum bdi_stat_item { BDI_DIRTY, + BDI_WRITEBACK, NR_BDI_STAT_ITEMS }; -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 1/6] mm: scalable bdi statistics counters.
Provide scalable per backing_dev_info statistics counters modeled on the ZVC code. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- block/ll_rw_blk.c |1 drivers/block/rd.c |2 drivers/char/mem.c |2 fs/char_dev.c |1 include/linux/backing-dev.h | 86 mm/backing-dev.c| 103 6 files changed, 195 insertions(+) Index: linux-2.6/block/ll_rw_blk.c === --- linux-2.6.orig/block/ll_rw_blk.c +++ linux-2.6/block/ll_rw_blk.c @@ -211,6 +211,7 @@ void blk_queue_make_request(request_queu q-backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; q-backing_dev_info.state = 0; q-backing_dev_info.capabilities = BDI_CAP_MAP_COPY; + bdi_stat_init(q-backing_dev_info); blk_queue_max_sectors(q, SAFE_MAX_SECTORS); blk_queue_hardsect_size(q, 512); blk_queue_dma_alignment(q, 511); Index: linux-2.6/include/linux/backing-dev.h === --- linux-2.6.orig/include/linux/backing-dev.h +++ linux-2.6/include/linux/backing-dev.h @@ -22,6 +22,17 @@ enum bdi_state { BDI_unused, /* Available bits start here */ }; +enum bdi_stat_item { + NR_BDI_STAT_ITEMS +}; + +#ifdef CONFIG_SMP +struct bdi_per_cpu_data { + s8 stat_threshold; + s8 bdi_stat_diff[NR_BDI_STAT_ITEMS]; +} cacheline_aligned_in_smp; +#endif + typedef int (congested_fn)(void *, int); struct backing_dev_info { @@ -32,8 +43,83 @@ struct backing_dev_info { void *congested_data; /* Pointer to aux data for congested func */ void (*unplug_io_fn)(struct backing_dev_info *, struct page *); void *unplug_io_data; + + atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS]; +#ifdef CONFIG_SMP + struct bdi_per_cpu_data pcd[NR_CPUS]; +#endif }; +extern atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS]; + +static inline void bdi_stat_add(long x, struct backing_dev_info *bdi, + enum bdi_stat_item item) +{ + atomic_long_add(x, bdi-bdi_stats[item]); + atomic_long_add(x, bdi_stats[item]); +} + +/* + * cannot be unsigned long and clip on 0. 
+ */ +static inline unsigned long global_bdi_stat(enum bdi_stat_item item) +{ + long x = atomic_long_read(bdi_stats[item]); +#ifdef CONFIG_SMP + if (x 0) + x = 0; +#endif + return x; +} + +static inline unsigned long bdi_stat(struct backing_dev_info *bdi, + enum bdi_stat_item item) +{ + long x = atomic_long_read(bdi-bdi_stats[item]); +#ifdef CONFIG_SMP + if (x 0) + x = 0; +#endif + return x; +} + +#ifdef CONFIG_SMP +void __mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int delta); +void __inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item); +void __dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item); + +void mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int delta); +void inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item); +void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item); + +#else /* CONFIG_SMP */ + +static inline void __mod_bdi_stat(struct backing_dev_info *bdi, + enum bdi_stat_item item, int delta) +{ + bdi_stat_add(delta, bdi, item); +} + +static inline void __inc_bdi_stat(struct backing_dev_info *bdi, + enum bdi_stat_item item) +{ + atomic_long_inc(bdi-bdi_stats[item]); + atomic_long_inc(bdi_stats[item]); +} + +static inline void __dec_bdi_stat(struct backing_dev_info *bdi, + enum bdi_stat_item item) +{ + atomic_long_dec(bdi-bdi_stats[item]); + atomic_long_dec(bdi_stats[item]); +} + +#define mod_bdi_stat __mod_bdi_stat +#define inc_bdi_stat __inc_bdi_stat +#define dec_bdi_stat __dec_bdi_stat +#endif + +void bdi_stat_init(struct backing_dev_info *bdi); /* * Flags in backing_dev_info::capability Index: linux-2.6/mm/backing-dev.c === --- linux-2.6.orig/mm/backing-dev.c +++ linux-2.6/mm/backing-dev.c @@ -67,3 +67,106 @@ void congestion_end(int rw) wake_up(wqh); } EXPORT_SYMBOL(congestion_end); + +atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS]; +EXPORT_SYMBOL(bdi_stats); + +void bdi_stat_init(struct backing_dev_info *bdi) +{ + int i; + + for (i = 0; i NR_BDI_STAT_ITEMS; i++) + atomic_long_set(bdi-bdi_stats[i], 0); + +#ifdef CONFIG_SMP + for (i = 0; i NR_CPUS; i++) { + int j; + for (j = 0; j NR_BDI_STAT_ITEMS; j++) + bdi-pcd[i].bdi_stat_diff[j] = 0; + bdi-pcd[i].stat_threshold = 8 * ilog2(num_online_cpus()); + } +#endif +} +EXPORT_SYMBOL(bdi_stat_init); + +#ifdef
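[The patch above is truncated in this archive. The per-cpu counter scheme it introduces is the interesting part, though, and it can be modeled in plain userspace C: each "CPU" accumulates a small signed diff and only folds it into the shared atomic when it crosses stat_threshold, which is exactly why readers can momentarily see values off by up to threshold*NR_CPUS - and why the read side above has to clip negatives to 0. All numbers below are invented:]

	#include <stdio.h>

	#define THRESHOLD 32	/* stands in for pcd->stat_threshold */

	struct pcpu { int diff; };

	static long global_count;	/* the shared atomic_long_t */

	static void mod_stat(struct pcpu *pc, int delta)
	{
		int x = pc->diff + delta;

		if (x > THRESHOLD || x < -THRESHOLD) {
			global_count += x;	/* atomic_long_add() in the patch */
			x = 0;
		}
		pc->diff = x;
	}

	int main(void)
	{
		struct pcpu cpu = { 0 };

		for (int i = 0; i < 100; i++)
			mod_stat(&cpu, 1);
		printf("global=%ld resid=%d (reads may be off by +-THRESHOLD)\n",
		       global_count, cpu.diff);
		return 0;
	}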
[RFC][PATCH 5/6] mm: per device dirty threshold
Scale writeback cache per backing device, proportional to its writeout speed. akpm sayeth: Which problem are we trying to solve here? afaik our two uppermost problems are: a) Heavy write to queue A causes light writer to queue B to blok for a long time in balance_dirty_pages(). Even if the devices have the same speed. This one; esp when not the same speed. The - my usb stick makes my computer suck - problem. But even on similar speed, the separation of device should avoid blocking dev B when dev A is being throttled. The writeout speed is measure dynamically, so when it doesn't have anything to write out for a while its writeback cache size goes to 0. Conversely, when starting up it will in the beginning act almost synchronous but will quickly build up a 'fair' share of the writeback cache. b) heavy write to device A causes light write to device A to block for a long time in balance_dirty_pages(), occasionally. Harder to fix. This will indeed take more. I've thought about it though. But one quickly ends up with per task state. How it all works: We pick a 2^n value based on the total vm size to act as a period - vm_cycle_shift. This period measures 'time' in writeout events. Each writeout increases time and adds to a per bdi counter. This counter is halved when a period expires. So per bdi speed is: 0.5 * (previous cycle speed) + this cycle's events. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/backing-dev.h |8 ++ mm/backing-dev.c|3 mm/page-writeback.c | 145 ++-- 3 files changed, 125 insertions(+), 31 deletions(-) Index: linux-2.6/include/linux/backing-dev.h === --- linux-2.6.orig/include/linux/backing-dev.h +++ linux-2.6/include/linux/backing-dev.h @@ -26,6 +26,8 @@ enum bdi_stat_item { BDI_DIRTY, BDI_WRITEBACK, BDI_UNSTABLE, + BDI_WRITEOUT, + BDI_WRITEOUT_TOTAL, NR_BDI_STAT_ITEMS }; @@ -47,6 +49,12 @@ struct backing_dev_info { void (*unplug_io_fn)(struct backing_dev_info *, struct page *); void *unplug_io_data; + /* +* data used for scaling the writeback cache +*/ + spinlock_t lock;/* protect the cycle count */ + unsigned long cycles; /* writeout cycles */ + atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS]; #ifdef CONFIG_SMP struct bdi_per_cpu_data pcd[NR_CPUS]; Index: linux-2.6/mm/page-writeback.c === --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -49,8 +49,6 @@ */ static long ratelimit_pages = 32; -static int dirty_exceeded __cacheline_aligned_in_smp; /* Dirty mem may be over limit */ - /* * When balance_dirty_pages decides that the caller needs to perform some * non-background writeback, this is how many pages it will attempt to write. @@ -103,6 +101,77 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages); /* + * Scale the writeback cache size proportional to the relative writeout speeds. + * + * We do this by tracking a floating average per BDI and a global floating + * average. We optimize away the '/= 2' for the global average by noting that: + * + * if (++i thresh) i /= 2: + * + * Can be approximated by: + * + * thresh/2 + (++i % thresh/2) + * + * Furthermore, when we choose thresh to be 2^n it can be written in terms of + * binary operations and wraparound artifacts disappear. + */ +static int vm_cycle_shift __read_mostly; + +/* + * Sync up the per BDI average to the global cycle. + * + * NOTE: we mask out the MSB of the cycle count because bdi_stats really are + * not unsigned long. 
(see comment in backing_dev.h) + */ +static void bdi_writeout_norm(struct backing_dev_info *bdi) +{ + int bits = vm_cycle_shift; + unsigned long cycle = 1UL bits; + unsigned long mask = ~(cycle - 1) | (1UL BITS_PER_LONG-1); + unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL) 1; + unsigned long flags; + + if ((bdi-cycles mask) == (total mask)) + return; + + spin_lock_irqsave(bdi-lock, flags); + while ((bdi-cycles mask) != (total mask)) { + unsigned long half = bdi_stat(bdi, BDI_WRITEOUT) / 2; + + mod_bdi_stat(bdi, BDI_WRITEOUT, -half); + bdi-cycles += cycle; + } + spin_unlock_irqrestore(bdi-lock, flags); +} + +static void bdi_writeout_inc(struct backing_dev_info *bdi) +{ + if (!bdi_cap_writeback_dirty(bdi)) + return; + + __inc_bdi_stat(bdi, BDI_WRITEOUT); + __inc_bdi_stat(bdi, BDI_WRITEOUT_TOTAL); + + bdi_writeout_norm(bdi); +} + +void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div) +{ + int bits = vm_cycle_shift - 1; + unsigned long total = global_bdi_stat
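[Again truncated in the archive. To see how the halving scheme above converges on a device's writeout share, here is a toy userspace model of the per-period recurrence speed = speed/2 + events; the event rates are invented:]

	#include <stdio.h>

	int main(void)
	{
		long fast = 0, slow = 0, total;

		for (int period = 1; period <= 8; period++) {
			fast = fast / 2 + 800;	/* 800 writeouts/period */
			slow = slow / 2 + 100;	/* 100 writeouts/period */
			total = fast + slow;
			printf("period %d: fast share %.1f%%, slow share %.1f%%\n",
			       period,
			       100.0 * fast / total, 100.0 * slow / total);
		}
		/* converges on 8/9 vs 1/9, i.e. proportional to writeout speed */
		return 0;
	}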
[RFC][PATCH 4/6] mm: count unstable pages per BDI
Count per BDI unstable pages. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/write.c |4 include/linux/backing-dev.h |1 + 2 files changed, 5 insertions(+) Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -474,6 +474,7 @@ nfs_mark_request_commit(struct nfs_page nfsi-ncommit++; spin_unlock(nfsi-req_lock); inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); + inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } #endif @@ -545,6 +546,7 @@ static void nfs_cancel_commit_list(struc while(!list_empty(head)) { req = nfs_list_entry(head-next); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); + dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); nfs_list_remove_request(req); nfs_inode_remove_request(req); nfs_unlock_request(req); @@ -1278,6 +1280,7 @@ nfs_commit_list(struct inode *inode, str nfs_list_remove_request(req); nfs_mark_request_commit(req); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); + dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); nfs_clear_page_writeback(req); } return -ENOMEM; @@ -1302,6 +1305,7 @@ static void nfs_commit_done(struct rpc_t req = nfs_list_entry(data-pages.next); nfs_list_remove_request(req); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); + dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); dprintk(NFS: commit (%s/%Ld [EMAIL PROTECTED]), req-wb_context-dentry-d_inode-i_sb-s_id, Index: linux-2.6/include/linux/backing-dev.h === --- linux-2.6.orig/include/linux/backing-dev.h +++ linux-2.6/include/linux/backing-dev.h @@ -25,6 +25,7 @@ enum bdi_state { enum bdi_stat_item { BDI_DIRTY, BDI_WRITEBACK, + BDI_UNSTABLE, NR_BDI_STAT_ITEMS }; -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH 0/6] per device dirty throttling
This patch-set implements per device dirty page throttling, which should solve the problem we currently have with one device hogging the dirty limit. Preliminary testing shows good results:

mem=128M
time (dd if=/dev/zero of=/mnt/<dev>/zero bs=4096 count=$((1024*1024/4)); sync)
(three runs each)

1GB to disk
  real  0m33.074s  0m34.596s  0m33.387s
  user  0m0.147s   0m0.163s   0m0.142s
  sys   0m7.872s   0m8.409s   0m8.395s

1GB to usb-flash
  real  3m21.170s  3m15.512s  3m23.889s
  user  0m0.135s   0m0.146s   0m0.127s
  sys   0m7.327s   0m7.328s   0m7.342s

2.6.20, device vs device:

1GB disk vs disk
  real  1m30.736s  1m16.133s  1m42.068s
  user  0m0.204s   0m0.167s   0m0.222s
  sys   0m10.438s  0m7.958s   0m10.599s

1GB usb-flash vs background disk
  N/A 30m+

1GB disk vs background usb-flash
  real  4m0.687s   2m20.145s  4m12.923s
  user  0m0.173s   0m0.185s   0m0.161s
  sys   0m8.227s   0m8.581s   0m8.345s

2.6.20-writeback (this patch-set):

1GB disk vs disk
  real  0m36.696s  0m40.837s  0m38.679s
  user  0m0.161s   0m0.148s   0m0.160s
  sys   0m8.240s   0m8.068s   0m8.174s

1GB usb-flash vs background disk
  real  3m37.464s  3m49.720s  4m5.805s
  user  0m0.167s   0m0.166s   0m0.149s
  sys   0m7.195s   0m7.281s   0m7.199s

1GB disk vs background usb-flash
  real  0m41.585s  0m30.888s  0m34.493s
  user  0m0.161s   0m0.167s   0m0.162s
  sys   0m7.826s   0m7.807s   0m7.821s
[RFC][PATCH 6/6] mm: expose BDI statistics in sysfs.
Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c | 51 +++
 1 file changed, 51 insertions(+)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -3923,6 +3923,33 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lu\n",
+			bdi_stat(&q->backing_dev_info, BDI_DIRTY));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lu\n",
+			bdi_stat(&q->backing_dev_info, BDI_WRITEBACK));
+}
+
+static ssize_t queue_nr_unstable_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lu\n",
+			bdi_stat(&q->backing_dev_info, BDI_UNSTABLE));
+}
+
+extern void get_writeout_scale(struct backing_dev_info *, int *, int *);
+
+static ssize_t queue_nr_cache_show(struct request_queue *q, char *page)
+{
+	int scale, div;
+
+	get_writeout_scale(&q->backing_dev_info, &scale, &div);
+	scale *= 1024;
+	scale /= div;
+
+	return sprintf(page, "%d\n", scale);
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -3947,6 +3974,26 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_dirty_entry = {
+	.attr = {.name = "dirty_pages", .mode = S_IRUGO },
+	.show = queue_nr_dirty_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+	.attr = {.name = "writeback_pages", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
+static struct queue_sysfs_entry queue_unstable_entry = {
+	.attr = {.name = "unstable_pages", .mode = S_IRUGO },
+	.show = queue_nr_unstable_show,
+};
+
+static struct queue_sysfs_entry queue_cache_entry = {
+	.attr = {.name = "cache_ratio", .mode = S_IRUGO },
+	.show = queue_nr_cache_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -3958,6 +4005,10 @@ static struct attribute *default_attrs[]
 	queue_ra_entry.attr,
 	queue_max_hw_sectors_entry.attr,
 	queue_max_sectors_entry.attr,
+	queue_dirty_entry.attr,
+	queue_writeback_entry.attr,
+	queue_unstable_entry.attr,
+	queue_cache_entry.attr,
 	queue_iosched_entry.attr,
 	NULL,
 };
--
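[Assuming the patch above applies as-is, the new attributes read like any other queue file; the device name and values below are invented for illustration:

  # cat /sys/block/sda/queue/dirty_pages
  1269
  # cat /sys/block/sda/queue/cache_ratio
  293

cache_ratio is scale * 1024 / div, i.e. the device's current proportional share of the writeback cache in 1/1024 units, so 293 would mean this device currently gets roughly 29% of the dirty limit.]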
Re: [RFC][PATCH 0/6] per device dirty throttling
Sorry for the duplicates; I was fooled by an MTA hanging on to them for a few hours. I had counted them as lost in cyberspace.
[RFC][PATCH 7/6] assorted fixes
Just taking out the MSB isn't enough to counter the clipping on 0 done by the stats counter accessors. Create some accessors that don't do that. Also, increase the period to about the side of memory (TODO, determine some upper bound here). This should give much more stable results. (Best would be to keep it in the order of whatever vm_dirty_ratio gives, however changing vm_cycle_shift is dangerous). Finally, limit the adjustment rate to not grow faster than available dirty space. Without this the analytic model can use up to 2 times the dirty limit and the discrete model is basically unbounded. It goes *BANG* when using NFS,... need to look into that. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/backing-dev.h | 12 mm/page-writeback.c | 37 ++--- 2 files changed, 34 insertions(+), 15 deletions(-) Index: linux-2.6/include/linux/backing-dev.h === --- linux-2.6.orig/include/linux/backing-dev.h +++ linux-2.6/include/linux/backing-dev.h @@ -70,6 +70,12 @@ static inline void bdi_stat_add(long x, atomic_long_add(x, bdi_stats[item]); } + +static inline unsigned long __global_bdi_stat(enum bdi_stat_item item) +{ + return atomic_long_read(bdi_stats[item]); +} + /* * cannot be unsigned long and clip on 0. */ @@ -83,6 +89,12 @@ static inline unsigned long global_bdi_s return x; } +static inline unsigned long __bdi_stat(struct backing_dev_info *bdi, + enum bdi_stat_item item) +{ + return atomic_long_read(bdi-bdi_stats[item]); +} + static inline unsigned long bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item) { Index: linux-2.6/mm/page-writeback.c === --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -119,16 +119,13 @@ static int vm_cycle_shift __read_mostly; /* * Sync up the per BDI average to the global cycle. - * - * NOTE: we mask out the MSB of the cycle count because bdi_stats really are - * not unsigned long. 
(see comment in backing_dev.h) */ static void bdi_writeout_norm(struct backing_dev_info *bdi) { int bits = vm_cycle_shift; unsigned long cycle = 1UL bits; - unsigned long mask = ~(cycle - 1) | (1UL BITS_PER_LONG-1); - unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL) 1; + unsigned long mask = ~(cycle - 1); + unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL) 1; unsigned long flags; if ((bdi-cycles mask) == (total mask)) @@ -136,7 +133,7 @@ static void bdi_writeout_norm(struct bac spin_lock_irqsave(bdi-lock, flags); while ((bdi-cycles mask) != (total mask)) { - unsigned long half = bdi_stat(bdi, BDI_WRITEOUT) / 2; + unsigned long half = __bdi_stat(bdi, BDI_WRITEOUT) / 2; mod_bdi_stat(bdi, BDI_WRITEOUT, -half); bdi-cycles += cycle; @@ -158,13 +155,13 @@ static void bdi_writeout_inc(struct back void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div) { int bits = vm_cycle_shift - 1; - unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL); + unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL); unsigned long cycle = 1UL bits; unsigned long mask = cycle - 1; if (bdi_cap_writeback_dirty(bdi)) { bdi_writeout_norm(bdi); - *scale = bdi_stat(bdi, BDI_WRITEOUT); + *scale = __bdi_stat(bdi, BDI_WRITEOUT); } else *scale = 0; @@ -234,19 +231,29 @@ get_dirty_limits(long *pbackground, long *pdirty = dirty; if (mapping) { + struct backing_dev_info *bdi = mapping-backing_dev_info; + long reserve; long long tmp = dirty; int scale, div; - get_writeout_scale(mapping-backing_dev_info, scale, div); - - if (scale div) - scale = div; + get_writeout_scale(bdi, scale, div); - tmp = (tmp * 122) 7; /* take ~95% of total dirty value */ tmp *= scale; do_div(tmp, div); - *pbdi_dirty = (long)tmp; + reserve = dirty - + (global_bdi_stat(BDI_DIRTY) + +global_bdi_stat(BDI_WRITEBACK) + +global_bdi_stat(BDI_UNSTABLE)); + + if (reserve 0) + reserve = 0; + + reserve += bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_WRITEBACK) + + bdi_stat(bdi, BDI_UNSTABLE); + + *pbdi_dirty = min((long)tmp, reserve); } } @@ -627,7 +634,7 @@ void __init page_writeback_init(void) mod_timer(wb_timer, jiffies
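[Truncated again. The clamp this patch adds to get_dirty_limits() is easiest to see with toy numbers: a device's analytic share (tmp) may not exceed what it already uses plus the globally remaining headroom, which keeps the sum of the per-bdi limits at or below the global limit. All values below are invented:]

	#include <stdio.h>

	static long min(long a, long b) { return a < b ? a : b; }

	int main(void)
	{
		long dirty_limit = 1000;
		long global_used = 950;	/* dirty+writeback+unstable, all bdis */
		long bdi_used    = 300;	/* this bdi's part of the above */
		long tmp         = 600;	/* analytic share: limit * scale/div */

		long reserve = dirty_limit - global_used;	/* 50 left */
		if (reserve < 0)
			reserve = 0;
		reserve += bdi_used;				/* 350 */

		/* the device is capped at 350, not its analytic 600 */
		printf("bdi limit = %ld (analytic share was %ld)\n",
		       min(tmp, reserve), tmp);
		return 0;
	}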
Re: [RFC][PATCH 0/6] per device dirty throttling
On Tue, 2007-03-20 at 18:47 +1100, David Chinner wrote: On Mon, Mar 19, 2007 at 04:57:37PM +0100, Peter Zijlstra wrote: This patch-set implements per device dirty page throttling, which should solve the problem we currently have with one device hogging the dirty limit. Preliminary testing shows good results: I just ran some higher throughput numbers on this patchset. Identical 4-disk dm stripes, XFS, 4p x86_64, 16GB RAM, dirty_ratio = 5:

One dm stripe:    320MB/s
Two dm stripes:   310+315MB/s
Three dm stripes: 254+253+253MB/s (pci-x bus bound)

The three-stripe test was for 100GB of data to each filesystem - all the writes finished within 1s of each other at 7m4s. Interestingly, the amount of memory in cache for each of these devices was almost exactly the same - about 5.2GB each. Looks good so far. Hmmm - small problem - the root disk (XFS) got stuck in balance_dirty_pages_ratelimited_nr() after the above write test while attempting to unmount the filesystems (i.e. umount trying to modify /etc/mtab got stuck and the root fs locked up). (reboot) Hmm, interesting, I'll look into it.

Non-identical dm stripes, XFS, run alone:
Single disk:      80MB/s
2 disk dm stripe: 155MB/s
4 disk dm stripe: 310MB/s

Combined, after some runtime:
# ls -sh /mnt/dm*/test
10G /mnt/dm0/test  19G /mnt/dm1/test   41G /mnt/dm2/test
15G /mnt/dm0/test  27G /mnt/dm1/test   52G /mnt/dm2/test
18G /mnt/dm0/test  32G /mnt/dm1/test   64G /mnt/dm2/test
24G /mnt/dm0/test  45G /mnt/dm1/test   86G /mnt/dm2/test
27G /mnt/dm0/test  51G /mnt/dm1/test   95G /mnt/dm2/test
29G /mnt/dm0/test  52G /mnt/dm1/test   97G /mnt/dm2/test
29G /mnt/dm0/test  54G /mnt/dm1/test  101G /mnt/dm2/test [done]
35G /mnt/dm0/test  65G /mnt/dm1/test  101G /mnt/dm2/test
38G /mnt/dm0/test  70G /mnt/dm1/test  101G /mnt/dm2/test

And so on. Final numbers:
Single disk:      70MB/s
2 disk dm stripe: 130MB/s
4 disk dm stripe: 260MB/s

So overall we've lost about 15-20% of the theoretical aggregate performance, but we haven't starved any of the devices over a long period of time. However, looking at vmstat for total throughput, there are periods of time where it appears that the fastest disk goes idle. That is, we drop from an aggregate of about 550MB/s to below 300MB/s for several seconds at a time. You can sort of see this from the file size output above - long term the ratios remain the same, but in the short term we see quite a bit of variability. I suspect you did not apply 7/6? There is some trouble with signed vs unsigned in the initial patch set that I tried to 'fix' by masking out the MSB, but that doesn't work and results in 'time' getting stuck for about half the time. When the fast disk completed, I saw almost the same thing, but this time it seems like the slow disk stopped (i.e. ~230MB/s to ~150MB/s) for several seconds. I haven't really digested what the patches do. If you have questions please ask, I'll try to write up coherent answers :-) but it's almost like it is throttling a device completely while it allows another to finish writing its quota (underestimating bandwidth?). Yeah, there is some lumpiness in BIO submission or write completions, it seems, and when that granularity (multiplied by the number of active devices) is larger than the 'time' period over which we average (indicated by vm_cycle_shift), very weird stuff can happen. (umount after writes hung again. Same root disk thing as before.) This is looking promising, Peter. When it is more stable I'll run some more tests. Thanks for the tests.
Re: [RFC][PATCH 0/6] per device dirty throttling
On Tue, 2007-03-20 at 20:38 +1100, David Chinner wrote: On Tue, Mar 20, 2007 at 09:08:24AM +0100, Peter Zijlstra wrote: On Tue, 2007-03-20 at 18:47 +1100, David Chinner wrote:

So overall we've lost about 15-20% of the theoretical aggregate performance, but we haven't starved any of the devices over a long period of time. However, looking at vmstat for total throughput, there are periods of time where it appears that the fastest disk goes idle. That is, we drop from an aggregate of about 550MB/s to below 300MB/s for several seconds at a time. You can sort of see this from the file size output above - long term the ratios remain the same, but in the short term we see quite a bit of variability.

I suspect you did not apply 7/6? There is some trouble with signed vs unsigned in the initial patch set that I tried to 'fix' by masking out the MSB, but that doesn't work and results in 'time' getting stuck for about half the time.

I applied the fixes patch as well, so I had all that you posted...

Humm, not that then.

but it's almost like it is throttling a device completely while it allows another to finish writing its quota (underestimating bandwidth?).

Yeah, there is some lumpiness in BIO submission or write completions it seems, and when that granularity (multiplied by the number of active devices) is larger than the 'time' period over which we average (indicated by vm_cycle_shift), very weird stuff can happen.

Sounds like the period is a bit too short atm if we can get into this sort of problem with only 2 active devices.

Yeah, trouble is, I significantly extended this period in 7/6. Will have to ponder a bit on what is happening then.

Anyway, thanks for the feedback. I'll try and reproduce the umount problem, maybe that will give some hints.
Re: [RFC][PATCH 0/6] per device dirty throttling
This seems to fix the worst of it. I'll run with it for a few days, then respin and repost the patches if nothing weird happens.

---

 - Found missing bdi_stat_init() sites for NFS and Fuse.
 - Optimize bdi_writeout_norm(): break out of the loop when we hit zero.
   This will allow 'new' BDIs to catch up without triggering NMI/softlockup msgs.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/fuse/inode.c     |    1 +
 fs/nfs/client.c     |    1 +
 mm/page-writeback.c |   25 +++++++++++++++++++------
 3 files changed, 21 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/nfs/client.c
===
--- linux-2.6.orig/fs/nfs/client.c
+++ linux-2.6/fs/nfs/client.c
@@ -657,6 +657,7 @@ static void nfs_server_set_fsinfo(struct
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	bdi_stat_init(&server->backing_dev_info);

 	if (server->wsize > max_rpc_payload)
 		server->wsize = max_rpc_payload;

Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -114,6 +114,13 @@ static void background_writeout(unsigned
  *
  * Furthermore, when we choose thresh to be 2^n it can be written in terms of
  * binary operations and wraparound artifacts disappear.
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ *   i / thresh
+ *
+ * Its monotonically increasing property can be applied to mitigate the
+ * wraparound issue.
  */
 static int vm_cycle_shift __read_mostly;
@@ -125,19 +132,25 @@ static void bdi_writeout_norm(struct bac
 	int bits = vm_cycle_shift;
 	unsigned long cycle = 1UL << bits;
 	unsigned long mask = ~(cycle - 1);
-	unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1;
+	unsigned long global_cycle =
+		(__global_bdi_stat(BDI_WRITEOUT_TOTAL) << 1) & mask;
 	unsigned long flags;

-	if ((bdi->cycles & mask) == (total & mask))
+	if ((bdi->cycles & mask) == global_cycle)
 		return;

 	spin_lock_irqsave(&bdi->lock, flags);
-	while ((bdi->cycles & mask) != (total & mask)) {
-		unsigned long half = __bdi_stat(bdi, BDI_WRITEOUT) / 2;
+	while ((bdi->cycles & mask) != global_cycle) {
+		unsigned long val = __bdi_stat(bdi, BDI_WRITEOUT);
+		unsigned long half = (val + 1) / 2;
+
+		if (!val)
+			break;

 		mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
 		bdi->cycles += cycle;
 	}
+	bdi->cycles = global_cycle;
 	spin_unlock_irqrestore(&bdi->lock, flags);
 }
@@ -146,10 +159,10 @@ static void bdi_writeout_inc(struct back
 	if (!bdi_cap_writeback_dirty(bdi))
 		return;

+	bdi_writeout_norm(bdi);
+
 	__inc_bdi_stat(bdi, BDI_WRITEOUT);
 	__inc_bdi_stat(bdi, BDI_WRITEOUT_TOTAL);
-
-	bdi_writeout_norm(bdi);
 }

 void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)

Index: linux-2.6/fs/fuse/inode.c
===
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -413,6 +413,7 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->num_waiting, 0);
 		fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
+		bdi_stat_init(&fc->bdi);
 		fc->reqctr = 0;
 		fc->blocked = 1;
 		get_random_bytes(fc->scramble_key, sizeof(fc->scramble_key));
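The normalisation being fixed here is easier to see in isolation. A toy user-space model (all constants invented for illustration) of the per-period halving, including the new early exit:

	#include <stdio.h>

	/*
	 * Toy model of bdi_writeout_norm(): for each elapsed global period
	 * the per-BDI writeout count is halved, so old activity decays away.
	 */
	int main(void)
	{
		unsigned long writeout = 1024;	/* per-BDI count */
		int periods_behind = 4;		/* global periods this BDI missed */

		while (periods_behind--) {
			unsigned long val = writeout;
			unsigned long half = (val + 1) / 2;

			if (!val)	/* the new early exit: nothing left to decay */
				break;
			writeout -= half;
			printf("writeout now %lu\n", writeout);
		}
		return 0;
	}

Running it prints 512, 256, 128, 64: a BDI that missed many periods converges quickly instead of looping for an unbounded number of iterations.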
Re: [PATCH 2.6.21-rc3-mm2 3/4] futex_requeue_pi optimization
On Tue, 2007-03-20 at 16:32 +0100, Pierre Peiffer wrote: Peter Zijlstra a écrit :

+static void *get_futex_address(union futex_key *key)
+{
+	void *uaddr;
+
+	if (key->both.offset & 1) {
+		/* shared mapping */
+		uaddr = (void *)((key->shared.pgoff << PAGE_SHIFT)
+				 + key->shared.offset - 1);
+	} else {
+		/* private mapping */
+		uaddr = (void *)(key->private.address + key->private.offset);
+	}
+
+	return uaddr;
+}

This will not work for nonlinear vmas; granted, not a lot of people stick futexes in nonlinear vmas, but the futex_key stuff handles it, this doesn't.

Indeed! Thanks for pointing me to this. Since I'm not familiar with the vmm, does this code look more correct to you?

Unfortunately not; nonlinear vmas don't have a linear relation between address and offset. What you would need to do is a linear walk of the page tables. But even that might not suffice if nonlinear vmas may form a non-injective, surjective mapping. /me checks.. Hmm, yes that seems valid, so in general this reverse mapping does not uniquely exist for non-linear vmas. :-( What to do... disallow futexes in nonlinear mappings, store the address in the key?

static void *get_futex_address(union futex_key *key)
{
	void *uaddr;
	struct vm_area_struct *vma = current->mm->mmap;

	if (key->both.offset & 1) {
		/* shared mapping */
		struct file *vmf;

		do {
			if ((vmf = vma->vm_file) &&
			    (key->shared.inode == vmf->f_dentry->d_inode))
				break;
			vma = vma->vm_next;
		} while (vma);

		if (likely(!(vma->vm_flags & VM_NONLINEAR)))
			uaddr = (void *)((key->shared.pgoff << PAGE_SHIFT)
					 + key->shared.offset - 1);
		else
			uaddr = (void *)vma->vm_start
				+ ((key->shared.pgoff - vma->vm_pgoff) << PAGE_SHIFT)
				+ key->shared.offset - 1;
	} else {
		/* private mapping */
		uaddr = (void *)(key->private.address + key->private.offset);
	}

	return uaddr;
}

Or is there a more direct way to retrieve the vma corresponding to the given inode?

The vma_prio_tree would be able to give all vmas associated with a mapping (a sketch follows).
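For what that prio tree lookup might look like - a sketch only, untested, against the 2.6.21-era rmap API, and the helper name is made up - mapping and pgoff would come from the futex key:

	/*
	 * Sketch: find a vma in the current mm that maps the given file
	 * offset, via the i_mmap prio tree instead of walking mm->mmap.
	 */
	static unsigned long futex_find_uaddr(struct address_space *mapping,
					      pgoff_t pgoff, unsigned long off)
	{
		struct vm_area_struct *vma;
		struct prio_tree_iter iter;
		unsigned long uaddr = 0;

		spin_lock(&mapping->i_mmap_lock);
		vma_prio_tree_foreach(vma, iter, &mapping->i_mmap, pgoff, pgoff) {
			if (vma->vm_mm != current->mm)
				continue;
			/* linear vmas only; nonlinear ones need a pte scan */
			if (vma->vm_flags & VM_NONLINEAR)
				continue;
			uaddr = vma->vm_start +
				((pgoff - vma->vm_pgoff) << PAGE_SHIFT) + off;
			break;
		}
		spin_unlock(&mapping->i_mmap_lock);
		return uaddr;
	}

Note this still leaves the non-injective nonlinear case unsolved, as discussed above.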
Re: RSDL v0.31
On Wed, 2007-03-21 at 15:57 +0100, Mike Galbraith wrote:

'f' is a progglet which sleeps a bit and burns a bit, duration depending on the argument given. 'sh' is a shell 100% hog. In this scenario, the argument was set such that 'f' used right at 50% cpu. All are started at the same time, and I froze top when the first 'f' reached 1:00.

May one enquire how much CPU the mythical 'f' uses when run alone? Just to get a gauge for the numbers?
Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock
On Thu, 2007-03-22 at 07:11 +0100, Jarek Poplawski wrote: Here is some joke:

[PATCH] lockdep: lockdep_depth vs. debug_locks

lockdep really shouldn't be used when debug_locks == 0! This happens when lockdep reports a fatal error, at which point it will stop tracking locks and leave whatever state was there.

Reported-by: Folkert van Heusden [EMAIL PROTECTED]
Inspired-by: Oleg Nesterov [EMAIL PROTECTED]
Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

This looks sane, thanks for figuring this out.

Acked-by: Peter Zijlstra [EMAIL PROTECTED]

---
diff -Nurp 2.6.21-rc4-git4-/include/linux/lockdep.h 2.6.21-rc4-git4/include/linux/lockdep.h
--- 2.6.21-rc4-git4-/include/linux/lockdep.h	2007-03-20 20:24:17.0 +0100
+++ 2.6.21-rc4-git4/include/linux/lockdep.h	2007-03-21 22:32:41.0 +0100
@@ -245,7 +245,7 @@ extern void lock_release(struct lockdep_

 # define INIT_LOCKDEP	.lockdep_recursion = 0,

-#define lockdep_depth(tsk)	((tsk)->lockdep_depth)
+#define lockdep_depth(tsk)	(debug_locks ? (tsk)->lockdep_depth : 0)

 #else /* !LOCKDEP */
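For context, the check that produced the original 'workqueue leaked lock' report is roughly the following (quoting run_workqueue() from memory, so details may differ); with the fix above, a stale lockdep_depth left behind after debug_locks was turned off no longer trips it:

	/* roughly the sanity check in run_workqueue() */
	if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
		printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
		       "%s/0x%08x/%d\n", current->comm,
		       preempt_count(), current->pid);
		debug_show_held_locks(current);
		dump_stack();
	}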
Re: [PATCH] lockdep: debug_show_all_locks debug_show_held_locks vs. debug_locks
On Thu, 2007-03-22 at 07:57 +0100, Jarek Poplawski wrote: And here is some addition.

[PATCH] lockdep: debug_show_all_locks & debug_show_held_locks vs. debug_locks

lockdep's data shouldn't be used when debug_locks == 0 because it's not updated after this, so it's more misleading than helpful.

PS: probably lockdep's current-> fields should be reset after it turns debug_locks off: so after printing a bug report, but before the return from exported functions; but there are really a lot of these possibilities (e.g. after DEBUG_LOCKS_WARN_ON), so something could be missed. (Of course direct use of these fields isn't recommended either.)

Reported-by: Folkert van Heusden [EMAIL PROTECTED]
Inspired-by: Oleg Nesterov [EMAIL PROTECTED]
Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

Acked-by: Peter Zijlstra [EMAIL PROTECTED]

---
diff -Nurp 2.6.21-rc4-git4-/kernel/lockdep.c 2.6.21-rc4-git4/kernel/lockdep.c
--- 2.6.21-rc4-git4-/kernel/lockdep.c	2007-03-21 22:46:26.0 +0100
+++ 2.6.21-rc4-git4/kernel/lockdep.c	2007-03-21 23:05:17.0 +0100
@@ -2742,6 +2742,10 @@ void debug_show_all_locks(void)
 	int count = 10;
 	int unlock = 1;

+	if (unlikely(!debug_locks)) {
+		printk("INFO: lockdep is turned off.\n");
+		return;
+	}
 	printk("\nShowing all locks held in the system:\n");

 	/*
@@ -2785,6 +2789,10 @@ EXPORT_SYMBOL_GPL(debug_show_all_locks);

 void debug_show_held_locks(struct task_struct *task)
 {
+	if (unlikely(!debug_locks)) {
+		printk("INFO: lockdep is turned off.\n");
+		return;
+	}
 	lockdep_print_held_locks(task);
 }
Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
On Sat, 2007-03-24 at 22:55 +0100, Miklos Szeredi wrote:

This is a slightly different take on the fix for the deadlock in fuse with dirty balancing. David Chinner convinced me that per-bdi counters are too expensive, and that it's not worth trying to account the number of pages under writeback, as they will be limited by the queue anyway.

Please have a look at this: http://lkml.org/lkml/2007/3/19/220
Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
On Sun, 2007-03-25 at 13:34 +0200, Miklos Szeredi wrote:

Please have a look at this: http://lkml.org/lkml/2007/3/19/220

+		if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) >=
+				bdi_thresh)
+			break;

Yes, this will resolve the deadlock as well, where balance_dirty_pages() is currently looping forever with:

Almost. This

-		if (nr_reclaimable) {
+		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);

still makes it loop forever if bdi_nr_reclaimable == 0, since the exit condition is not checked. Shouldn't it break out of the loop if bdi_stat(bdi, BDI_WRITEBACK) >= bdi_thresh in this case?

	for (;;) {
		struct writeback_control wbc = {
			.bdi		 = bdi,
			.sync_mode	 = WB_SYNC_NONE,
			.older_than_this = NULL,
			.nr_to_write	 = write_chunk,
			.range_cyclic	 = 1,
		};

		get_dirty_limits(&background_thresh, &dirty_thresh,
				 &bdi_thresh, bdi);
		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
				     bdi_stat(bdi, BDI_UNSTABLE);
(A)		if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) >=
				bdi_thresh)
			break;

		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
		 * Unstable writes are a feature of certain networked
		 * filesystems (i.e. NFS) in which data may have been
		 * written to the server's write cache, but has not yet
		 * been flushed to permanent storage.
		 */
(B)		if (bdi_nr_reclaimable) {
			writeback_inodes(&wbc);

			get_dirty_limits(&background_thresh, &dirty_thresh,
					 &bdi_thresh, bdi);
			bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
					     bdi_stat(bdi, BDI_UNSTABLE);
(C)			if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) >=
					bdi_thresh)
				break;

			pages_written += write_chunk - wbc.nr_to_write;
			if (pages_written >= write_chunk)
				break;	/* We've done our duty */
		}
		congestion_wait(WRITE, HZ/10);
	}

I'm thinking that if bdi_nr_reclaimable == 0, (A) reduces to bdi_stat(bdi, BDI_WRITEBACK) >= bdi_thresh and we're still out of the loop, no?
Re: [patch 2/3] only allow nonlinear vmas for ram backed filesystems
On Sat, 2007-03-24 at 23:09 +0100, Miklos Szeredi wrote: From: Miklos Szeredi [EMAIL PROTECTED]

Dirty page accounting/limiting doesn't work for nonlinear mappings, so for non-ram backed filesystems emulate with linear mappings. This retains ABI compatibility with previous kernels at minimal code cost. All known users of nonlinear mappings actually use tmpfs, so this shouldn't have any negative effect.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
Acked-by: Peter Zijlstra [EMAIL PROTECTED]
---
Index: linux-2.6.21-rc4-mm1/mm/fremap.c
===
--- linux-2.6.21-rc4-mm1.orig/mm/fremap.c	2007-03-24 22:30:05.0 +0100
+++ linux-2.6.21-rc4-mm1/mm/fremap.c	2007-03-24 22:37:59.0 +0100
@@ -181,6 +181,24 @@ asmlinkage long sys_remap_file_pages(uns
 		goto retry;
 	}
 	mapping = vma->vm_file->f_mapping;
+	/*
+	 * page_mkclean doesn't work on nonlinear vmas, so if dirty
+	 * pages need to be accounted, emulate with linear vmas.
+	 */
+	if (mapping_cap_account_dirty(mapping)) {
+		unsigned long addr;
+
+		flags &= MAP_NONBLOCK;
+		addr = mmap_region(vma->vm_file, start, size, flags,
+				   vma->vm_flags, pgoff, 1);
+		if (IS_ERR_VALUE(addr))
+			err = addr;
+		else {
+			BUG_ON(addr != start);
+			err = 0;
+		}
+		goto out;
+	}
 	spin_lock(&mapping->i_mmap_lock);
 	flush_dcache_mmap_lock(mapping);
 	vma->vm_flags |= VM_NONLINEAR;
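For anyone unfamiliar with the interface being emulated: a minimal user of remap_file_pages() looks like the program below (the /dev/shm path is an assumption - any shared file mapping works). With this patch, the same call against e.g. an ext3 file would transparently be replaced by a linear mapping of the requested file page.

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		size_t page = sysconf(_SC_PAGESIZE);
		/* assumes /dev/shm is tmpfs; any shared file mapping works */
		FILE *f = fopen("/dev/shm/fremap-demo", "w+");
		char *p;

		if (!f || ftruncate(fileno(f), 4 * page))
			return 1;

		p = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fileno(f), 0);
		if (p == MAP_FAILED)
			return 1;

		/* rearrange: let the first window show file page 3 */
		if (remap_file_pages(p, page, 0, 3, 0))
			perror("remap_file_pages");

		return 0;
	}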
Re: [patch 1/3] split mmap
On Sat, 2007-03-24 at 23:07 +0100, Miklos Szeredi wrote: From: Miklos Szeredi [EMAIL PROTECTED]

This is a straightforward split of do_mmap_pgoff() into two functions:

 - do_mmap_pgoff() checks the parameters, and calculates the vma flags. Then it calls
 - mmap_region(), which does the actual mapping

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
Acked-by: Peter Zijlstra [EMAIL PROTECTED]
---
Index: linux/mm/mmap.c
===
--- linux.orig/mm/mmap.c	2007-03-24 21:00:40.0 +0100
+++ linux/mm/mmap.c	2007-03-24 22:28:52.0 +0100
@@ -893,14 +893,11 @@ unsigned long do_mmap_pgoff(struct file
 			unsigned long flags, unsigned long pgoff)
 {
 	struct mm_struct * mm = current->mm;
-	struct vm_area_struct * vma, * prev;
 	struct inode *inode;
 	unsigned int vm_flags;
-	int correct_wcount = 0;
 	int error;
-	struct rb_node ** rb_link, * rb_parent;
 	int accountable = 1;
-	unsigned long charged = 0, reqprot = prot;
+	unsigned long reqprot = prot;

 	/*
 	 * Does the application expect PROT_READ to imply PROT_EXEC?
@@ -1025,7 +1022,25 @@ unsigned long do_mmap_pgoff(struct file
 	error = security_file_mmap(file, reqprot, prot, flags);
 	if (error)
 		return error;
-
+
+	return mmap_region(file, addr, len, flags, vm_flags, pgoff,
+			   accountable);
+}
+EXPORT_SYMBOL(do_mmap_pgoff);
+
+unsigned long mmap_region(struct file *file, unsigned long addr,
+			  unsigned long len, unsigned long flags,
+			  unsigned int vm_flags, unsigned long pgoff,
+			  int accountable)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	int correct_wcount = 0;
+	int error;
+	struct rb_node **rb_link, *rb_parent;
+	unsigned long charged = 0;
+	struct inode *inode = file ? file->f_path.dentry->d_inode : NULL;
+
 	/* Clear old maps */
 	error = -ENOMEM;
 munmap_back:
@@ -1174,8 +1189,6 @@ unacct_error:
 	return error;
 }

-EXPORT_SYMBOL(do_mmap_pgoff);
-
 /* Get an address range which is currently unmapped.
  * For shmat() with addr=0.
  *
Index: linux/include/linux/mm.h
===
--- linux.orig/include/linux/mm.h	2007-03-24 21:00:40.0 +0100
+++ linux/include/linux/mm.h	2007-03-24 22:28:52.0 +0100
@@ -1035,6 +1035,10 @@ extern unsigned long get_unmapped_area(s
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
 	unsigned long flag, unsigned long pgoff);
+extern unsigned long mmap_region(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long flags,
+	unsigned int vm_flags, unsigned long pgoff,
+	int accountable);

 static inline unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
Re: [patch 3/3] update ctime and mtime for mmaped write
On Sat, 2007-03-24 at 23:11 +0100, Miklos Szeredi wrote: From: Miklos Szeredi [EMAIL PROTECTED]

Changes:
v3:
 o rename is_page_modified to test_clear_page_modified
v2:
 o set AS_CMTIME flag in clear_page_dirty_for_io() too
 o don't clear AS_CMTIME in file_update_time()
 o check the dirty bit in the page tables
v1:
 o moved check from __fput() to remove_vma(), which is more logical
 o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
 o cleaned up #ifdef CONFIG_BLOCK mess

This patch makes writing to shared memory mappings update st_ctime and st_mtime as defined by SUSv3:

  The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED and PROT_WRITE shall be marked for update at some point in the interval between a write reference to the mapped region and the next call to msync() with MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is no such call and if the underlying file is modified as a result of a write reference, then these fields shall be marked for update at some time after the write reference.

A new address_space flag is introduced: AS_CMTIME. This is set each time a page is dirtied through a userspace memory mapping. This includes write accesses via get_user_pages(). Note, the flag is set unconditionally, even if the page is already dirty. This is important, because the page might have been dirtied earlier by a non-mmap write.

This flag is checked in msync() and munmap()/mremap(), and if set, the file times are updated and the flag is cleared. Msync also needs to check the dirty bit in the page tables, because the data might change again after an msync(MS_ASYNC), while the page is already dirty and read-write. This also makes the time updating work for memory backed filesystems such as tmpfs.

This implementation walks the pages in the synced range, and uses rmap to find all the ptes for each page. Non-linear vmas are ignored, since the ptes can only be found by scanning the whole vma, which is very inefficient. As an optimization, if dirty pages are accounted, then only walk the dirty pages, since the clean pages necessarily have clean ptes. This doesn't work for memory backed filesystems, where no dirty accounting is done.

An alternative implementation could check for all intersecting vmas in the mapping and walk the page tables for each. This would probably be more efficient for memory backed filesystems and if the number of dirty pages is near the total number of pages in the range.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

A few comments..

Index: linux-2.6.21-rc4-mm1/mm/rmap.c
===
--- linux-2.6.21-rc4-mm1.orig/mm/rmap.c	2007-03-24 19:03:11.0 +0100
+++ linux-2.6.21-rc4-mm1/mm/rmap.c	2007-03-24 19:34:30.0 +0100
@@ -507,6 +507,43 @@ int page_mkclean(struct page *page)
 EXPORT_SYMBOL_GPL(page_mkclean);

 /**
+ * test_clear_page_modified - check and clear the dirty bit for all mappings of a page
+ * @page:	the page to check
+ */
+bool test_clear_page_modified(struct page *page)
+{
+	struct address_space *mapping = page->mapping;

page_mapping(page)? Otherwise that BUG_ON(!mapping) a few lines down isn't of much use.

+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	bool modified = false;
+
+	BUG_ON(!mapping);
+	BUG_ON(!page_mapped(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (vma->vm_flags & VM_SHARED) {
+			struct mm_struct *mm = vma->vm_mm;
+			unsigned long addr = vma_address(page, vma);
+			pte_t *pte;
+			spinlock_t *ptl;
+
+			if (addr != -EFAULT &&
+			    (pte = page_check_address(page, mm, addr, &ptl))) {
+				if (ptep_clear_flush_dirty(vma, addr, pte))
+					modified = true;
+				pte_unmap_unlock(pte, ptl);
+			}

It's against coding style to do assignments in conditionals.

+		}
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	if (page_test_and_clear_dirty(page))
+		modified = true;
+	return modified;
+}

Why not parametrize page_mkclean() to conditionally wrprotect clean pages? Something like:

--- mm/rmap.c~	2007-03-11 17:52:20.0 +0100
+++ mm/rmap.c	2007-03-25 14:01:55.0 +0200
@@ -432,7 +432,8 @@
 	return referenced;
 }

-static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+static int
+page_mkclean_one(struct page *page,
Re: [patch 2/3] only allow nonlinear vmas for ram backed filesystems
On Sun, 2007-03-25 at 16:00 -0800, Andrew Morton wrote: On Sat, 24 Mar 2007 23:09:19 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote:

Dirty page accounting/limiting doesn't work for nonlinear mappings,

Doesn't it? iirc the problem is that we don't correctly re-clean the ptes while starting writeout. And the dirty-page accounting is in fact correct (it'd darn well better be).

If we do not re-protect the pages on writeout, we'll decrement the dirty count but not get a fault on re-dirty. Hence the dirty count will actually skew.

In order to make page_mkclean() work for nonlinear vmas we need to do a full pte scan for each invocation (we could perhaps only scan 1 in n times to try and limit the damage), and that hurts. This will basically render it useless. The other solution is adding rmap information to nonlinear vmas, but doubling the memory overhead for nonlinear mappings was not deemed a good idea.

so for non-ram backed filesystems emulate with linear mappings. This retains ABI compatibility with previous kernels at minimal code cost. All known users of nonlinear mappings actually use tmpfs, so this shouldn't have any negative effect.

Unless someone is using remap_file_pages() against an ext3 file, in which case their application stops working?

It'll work up to a certain point (when you hit the max vma count). When this issue first came up a few weeks ago, nobody knew of any non-tmpfs users.

That would be a problem. These guys: http://www.technovelty.org/code/linux/fremap.html, for example, will be in for a little surprise.

It's an example, it doesn't show whether there are actually users of this, but point taken. We actually could make this very example work by noting that they map PROT_READ only.
Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
On Mon, 2007-03-26 at 02:08 -0800, Andrew Morton wrote: On Mon, 26 Mar 2007 11:32:47 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

Stopping writers which have idle queues is completely unproductive, and that is basically what the current algorithm does.

This is because the kernel permits all of its allotment of dirty+writeback pages to be dirty+writeback against a single device. A good way of solving the one-device-starves-another-one problem is to dynamically adjust the per-device dirty+writeback levels so that (for example) if two devices are being written to, each gets 50% of the allotment.

This is exactly what happens with my patch if both devices write at the same speed. (Or at least, that is what is supposed to happen ;-)
Re: [PATCH 10/12] mm: page_alloc_wait
On Thu, 2007-04-05 at 15:57 -0700, Andrew Morton wrote: On Thu, 05 Apr 2007 19:42:19 +0200 [EMAIL PROTECTED] wrote:

Introduce a mechanism to wait on free memory. Currently congestion_wait() is abused to do this.

Such a very small explanation for such a terrifying change.

Yes, I suck at writing changelogs, bad me. Normally I would take a day to write them, but I just wanted to get this code out there. Perhaps a bad decision.

...
--- linux-2.6-mm.orig/mm/vmscan.c	2007-04-05 16:29:46.0 +0200
+++ linux-2.6-mm/mm/vmscan.c	2007-04-05 16:29:49.0 +0200
@@ -1436,6 +1436,7 @@ static int kswapd(void *p)
 		finish_wait(&pgdat->kswapd_wait, &wait);

 		balance_pgdat(pgdat, order);
+		page_alloc_ok();
 	}
 	return 0;
 }

For a start, we don't know that kswapd freed pages which are in a suitable zone. And we don't know that kswapd freed pages which are in a suitable cpuset.

congestion_wait() is similarly ignorant of the suitability of the pages, but the whole idea behind congestion_wait() is that it will throttle page allocators to some speed which is proportional to the speed at which the IO systems can retire writes - view it as a variable-speed polling operation, in which the polling frequency goes up when the IO system gets faster. This patch changes that philosophy fundamentally. That's worth more than a 2-line changelog.

Also, there might be situations in which kswapd gets stuck in some dark corner. Perhaps the process which is waiting in the page allocator holds filesystem locks which kswapd is blocked on. Or kswapd might be blocked on a particular request queue, or a dead NFS server or something. The timeout will save us, but things will be slow.

There could be other problems too, dunno - this stuff is tricky. Why are you changing it, what problems are being solved, etc?

Let's start with the why: because of 12/12. I wanted to introduce per BDI congestion feedback, and hence needed a BDI context for congestion_wait(). These specific callers weren't in the context of a BDI but of a more global idea.

Perhaps I could call page_alloc_ok() from bdi_congestion_end() irrespective of the actual BDI uncongested? That would more or less give the old semantics.
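(For readers without the patch in front of them: the mechanism under discussion is presumably a waitqueue along these lines; apart from page_alloc_ok(), the names below are guesses, not the actual 10/12 code.)

	/* sketch of a free-memory waitqueue, not the actual patch */
	static DECLARE_WAIT_QUEUE_HEAD(page_alloc_wq);

	void page_alloc_ok(void)
	{
		if (waitqueue_active(&page_alloc_wq))
			wake_up(&page_alloc_wq);
	}

	/*
	 * Allocator slow path: instead of congestion_wait(WRITE, HZ/10),
	 * sleep until kswapd reports progress, with a timeout as backstop.
	 */
	static void page_alloc_wait(void)
	{
		DEFINE_WAIT(wait);

		prepare_to_wait(&page_alloc_wq, &wait, TASK_UNINTERRUPTIBLE);
		schedule_timeout(HZ/10);
		finish_wait(&page_alloc_wq, &wait);
	}

The timeout is what Andrew refers to as the thing that "will save us" when kswapd gets stuck.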
Re: [PATCH 11/12] mm: accurate pageout congestion wait
On Thu, 2007-04-05 at 16:17 -0700, Andrew Morton wrote: On Thu, 05 Apr 2007 19:42:20 +0200 [EMAIL PROTECTED] wrote:

Only do the congestion wait when we actually encountered congestion.

The name congestion_wait() was accurate back in 2002, but it isn't accurate any more, and you got misled. It does not only wait for a queue to become uncongested.

Quite so indeed.

See clear_bdi_congested()'s callers. As long as the queue is in an uncongested state, we deliver wakeups to congestion_wait() blockers on every IO completion. As I said before, it is so that the MM's polling operations poll at a higher frequency when the IO system is working faster. (It is also to synchronise with end_page_writeback()'s feeding of clean pages to us via rotate_reclaimable_page()).

Hmm, but the condition under which we did call congestion_wait() is a bit magical.

Page reclaim can get into trouble without any request queue having entered a congested state. For example, think about a machine which has a single disk, and the operator has increased that disk's request queue size to 100,000. With your patch all the VM's throttling would be bypassed and we'd go into a busy loop and declare OOM instantly. There are probably other situations in which page reclaim gets into trouble without a request queue being congested.

Ok, in the light of all this, I will think on this some more.

Minor point: bdi_congested() can be arbitrarily expensive - for DM stackups it is roughly proportional to the number of subdevices in the device. We need to be careful about how frequently we call it.

Yuck, ok, good point.
Re: [PATCH 12/12] mm: per BDI congestion feedback
On Thu, 2007-04-05 at 16:24 -0700, Andrew Morton wrote: On Thu, 05 Apr 2007 19:42:21 +0200 [EMAIL PROTECTED] wrote:

Now that we have per BDI dirty throttling, it makes sense to also have per BDI congestion feedback; why wait on another device if the current one is not congested.

Similar comments apply. congestion_wait() should be called throttle_at_a_rate_proportional_to_the_speed_of_presently_uncongested_queues(). If a process is throttled in the page allocator waiting for pages to become reclaimable, that process absolutely does not care whether those pages were previously dirty against /dev/sda or against /dev/sdb. It wants to be woken up for writeout completion against any queue.

OK, so you disagree with Miklos' 2nd point here: http://lkml.org/lkml/2007/4/4/137

And in the light of clear_bdi_congested() being called for each writeout completion under the threshold, this does make sense.

So this whole 8-12/12 series is not needed and just served as a learning experience :-/
Re: [PATCH 02/12] mm: scalable bdi statistics counters.
On Thu, 2007-04-05 at 15:37 -0700, Andrew Morton wrote: On Thu, 05 Apr 2007 19:42:11 +0200 [EMAIL PROTECTED] wrote:

Provide scalable per backing_dev_info statistics counters modeled on the ZVC code.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c           |    1
 drivers/block/rd.c          |    2
 drivers/char/mem.c          |    2
 fs/char_dev.c               |    1
 fs/fuse/inode.c             |    1
 fs/nfs/client.c             |    1
 include/linux/backing-dev.h |   98 +
 mm/backing-dev.c            |  103

madness! Quite duplicative of vmstat.h, yet all this infrastructure is still only usable in one specific application.

Can we please look at generalising the vmstat.h stuff? Or, the API in percpu_counter.h appears suitable to this application. (The comment at line 6 is a total lie).

Ok, I'll see what I can come up with.
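The percpu_counter API Andrew points at is small; a sketch (2.6.21-era API from memory, so treat the signatures as approximate) of what a BDI stat could look like on top of it:

	#include <linux/percpu_counter.h>

	/* sketch: one counter per BDI stat item, replacing the hand-rolled code */
	struct backing_dev_info_stats {
		struct percpu_counter writeout;
	};

	static inline void bdi_writeout_account(struct backing_dev_info_stats *s)
	{
		percpu_counter_inc(&s->writeout);	/* cheap, per-cpu batched */
	}

	static inline s64 bdi_writeout_read(struct backing_dev_info_stats *s)
	{
		return percpu_counter_read_positive(&s->writeout); /* approximate */
	}

This is indeed where the -v4 series ended up, per the 00/12 note that the BDI statistics were reworked to use percpu_counter.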
Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
Hi,

some thoughts on shared futexes;

Could we get rid of the mmap_sem on the shared futexes in the following manner:

 - do a page table walk to find the pte;
 - get a page using pfn_to_page (skipping VM_PFNMAP)
 - get the futex key from page->mapping->host and page->index, and the offset from addr % PAGE_SIZE (sketched below).

or, given a key:

 - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff, possibly loading the page using mapping->a_ops->readpage().

then:

 - perform the futex operation on a kmap of the page

This should all work except for VM_PFNMAP. Since the address is passed from userspace we cannot trust it to not point into a VM_PFNMAP area. However, with the RCU VMA lookup patches I'm working on we could do that check without holding locks and without exclusive cachelines; the question is, is that good enough? Or is there an alternative way of determining a pfnmap given a pfn/struct page?
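A sketch of the key derivation proposed above (untested, and the helper name is made up; the field layout is assumed from the existing futex_key, with the low offset bit marking 'shared'):

	/*
	 * Sketch: derive a shared futex key from a page found by the page
	 * table walk, without consulting the vma.  'addr' is the userspace
	 * address the walk started from.
	 */
	static void futex_key_from_page(struct page *page, unsigned long addr,
					union futex_key *key)
	{
		struct address_space *mapping = page->mapping;

		key->shared.inode = mapping->host;
		key->shared.pgoff = page->index;
		key->shared.offset = (addr % PAGE_SIZE) + 1; /* bit 0: shared */
	}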
Re: [PATCH] FUTEX : new PRIVATE futexes
On Thu, 2007-04-05 at 19:49 +0200, Eric Dumazet wrote: Hi all

I'm pleased to present this patch which improves Linux futex performance and scalability, merely by avoiding taking the mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon as he gets a Fedora kernel with it :)

Andrew, could we get this in -mm as well? This version is against 2.6.21-rc5-mm4 (so it supports 64bit futexes). In this third version I dropped the NUMA optims and the process private hash table, to let the new API come in and be tested.

Good work, Thanks!

Acked-by: Peter Zijlstra [EMAIL PROTECTED]
Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
On Fri, 2007-04-06 at 14:02 +0100, Hugh Dickins wrote: On Fri, 6 Apr 2007, Peter Zijlstra wrote:

some thoughts on shared futexes; Could we get rid of the mmap_sem on the shared futexes in the following manner: - do a page table walk to find the pte;

(walk meaning descent down the levels, I presume, rather than across)

indeed.

I've not had time to digest your proposal, and I'm about to go out: let me sound a warning that springs to mind, maybe it's entirely inappropriate, but better said than kept silent. It looks as if you're supposing that mmap_sem is needed to find_vma, but not for going down the pagetables. It's not as simple as that: you need to be careful that a concurrent munmap from another thread isn't freeing pagetables from under you. Holding (down_read of) mmap_sem is one way to protect against that. try_to_unmap doesn't have that luxury: in its case, it's made safe by the way free_pgtables does anon_vma_unlink and unlink_file_vma before freeing any pagetables, so try_to_unmap etc. won't get there; but you can't do that.

Ah, drat. I had hoped that the pte_lock would solve that race, but that doesn't cover the upper levels of the tree.

Back to the drawing board..
Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
On Fri, 2007-04-06 at 23:15 +1000, Nick Piggin wrote: Hugh Dickins wrote: On Fri, 6 Apr 2007, Peter Zijlstra wrote:

some thoughts on shared futexes; Could we get rid of the mmap_sem on the shared futexes in the following manner:

I'd imagine shared futexes would be much less common than private for threaded programs... I'd say we should reevaluate things once we have private futexes, and malloc/free stop hammering mmap_sem so hard...

Indeed, private futexes are by far the most common.

- get a page using pfn_to_page (skipping VM_PFNMAP) - get the futex key from page->mapping->host and page->index, and the offset from addr % PAGE_SIZE. or, given a key: - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff, possibly loading the page using mapping->a_ops->readpage().

For shared futexes, wouldn't i_mapping be worse, because you'd be ping-ponging the tree_lock between processes, rather than having each use their own mmap_sem?

Your lockless pagecache work would solve most of that, no?

That also only helps for the wakeup case too, doesn't it? You have to use the vmas to find out which inode to use to do the wait, I think? (unless you introduce a new shared futex API).

One could do something like this:

	struct address_space *mapping = page_mapping(page);

	if (!mapping || mapping == &swapper_space)
		do_private_futex();
	else
		do_shared_futex(mapping->host, page->index,
				addr % PAGE_SIZE);

But alas, it seems I overlooked that the mmap_sem also protects the page tables, as pointed out by Hugh, so this is all in vain it seems. Ah well.
Re: [PATCH 0/8] RSS controller based on process containers (v2)
*ugh* /me no like.

The basic premise seems to be that we can track page owners perfectly (although this patch set does not yet do so), through get/release operations (on _mapcount).

This is simply not true for unmapped pagecache pages. Those receive no 'release' event; (the usage by find_get_page() could be seen as a 'get').

Also, you don't seem to balance the active/inactive scanning on a per container basis. This skews the per container working set logic.

Lastly, you don't call the slab shrinker for container reclaim; which would leave slab reclaim only to those few non process specific allocations, and would greatly skew the pagecache/slab balance.

Let us call

	struct reclaim_struct {
		struct list_head active_list;
		struct list_head inactive_list;
		unsigned long nr_active;
		unsigned long nr_inactive;
	};

Let's recognise three distinct page categories:
 - anonymous memory,
 - mapped pagecache, and
 - unmapped pagecache.

We then keep anonymous pages on a per container reclaim_struct; these pages are fully accounted to each container.

We keep mapped pagecache pages on per inode reclaim_structs; these files could be shared between containers, and we could either just account all pages belonging to each file proportionally to the number of containers involved, or do a more precise accounting.

We keep unmapped pagecache pages on a global reclaim_struct; these pages can, in general, not be pinned to a specific container; all we can do is keep a floating proportion relative to container 'get' events (find_get_page() and perhaps add_to_page_cache()).

Reclaim will then have to fairly reclaim pages from all of these lists (see the sketch below). If we schedule such that it appears that these lists are parallel instead of serial - that is, each tail is really a tail, not the head of another list - the current reclaim semantics are preserved.

The slab shrinker should be called proportionally to the container's size relative to the machine. Global reclaim will have to call each container's reclaim in proportional fashion.

The biggest problem with this approach is that there is no per zone reclaim left, which is relied upon by the allocator to provide free pages in a given physical address range. However there has been talk of creating a proper range allocator independent of zones.

Just my 0.02 euro..

Peter
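A sketch (not working code; the names are this proposal's, not an existing API) of what 'fair' could mean here - scan each reclaim_struct in proportion to its share of the pages under consideration, so the per container, per inode and global lists behave like parallel tails of one LRU:

	/*
	 * Sketch: how many pages to scan off one reclaim_struct when the
	 * global target is 'global_scan' out of 'total_pages'.
	 */
	static unsigned long reclaim_scan_target(struct reclaim_struct *rs,
						 unsigned long total_pages,
						 unsigned long global_scan)
	{
		unsigned long size = rs->nr_active + rs->nr_inactive;

		if (!total_pages)
			return 0;
		return global_scan * size / total_pages;
	}

The same ratio, applied to the shrinker, would give the proportional slab reclaim argued for above.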
Re: Why kmem_cache_free occupy CPU for more than 10 seconds?
On Wed, 2007-04-11 at 02:53 -0700, Paul Jackson wrote:

I'm confused - which end of this stack is up? cpuset_exit doesn't call do_exit, rather it's the other way around. But put_files_struct doesn't call do_exit, rather do_exit calls __exit_files, which calls put_files_struct.

I'm guessing it's x86_64, which generates crap traces.