Re: NFS livelock / starvation ?

2007-04-16 Thread Peter Zijlstra
On Mon, 2007-04-16 at 16:24 +0800, Zhou Yingchao wrote:
 When we run two NFS clients and an NFS server in the following way, we
 hit a livelock / starvation condition.
 
  MachineA        MachineB
   Client1         Client2
   Server
 
 As shown in the figure, we run a client and the server on one machine, and
 run another client on a second machine. When Client1 and Client2 issue many
 writes at the same time, Client1's requests are blocked until Client2's
 writes have finished.
 
 Checking the code, Client1 is blocked in generic_file_write -> ... ->
 balance_dirty_pages; balance_dirty_pages calls writeback_inodes to flush
 data of (only) the related fs.
 
 With NFS, we found that the Server has exceeded the dirty_thresh. So in
 the writeback_inodes loop, Client1 has no data of its own left to write
 out, and the condition nr_reclaimable + wbs.nr_writeback <= dirty_thresh
 will not become true until Client2 finishes its write requests to the
 Server. So the loop only ends after Client2 has finished its write job.
 
 The problem in this path is: why do we write out only pages of the related
 fs in writeback_inodes, but check the dirty threshold against the total
 page counts?
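As a toy userspace model of the loop described above (illustrative only,
not the kernel code; the page counts are made up):

#include <stdio.h>

/* pages dirtied against the shared, machine-wide counters */
static long dirty_c1 = 100;		/* Client1's own dirty pages */
static long dirty_c2 = 900;		/* dirtied via Client2 / the Server */
static const long dirty_thresh = 500;

int main(void)
{
	long spins = 0;

	/*
	 * Client1 is throttled: it may only write back its *own* pages,
	 * but the exit condition looks at the *global* dirty count.
	 */
	while (dirty_c1 + dirty_c2 > dirty_thresh) {
		if (dirty_c1 > 0)
			dirty_c1--;	/* writeback_inodes() on its own fs */
		else
			dirty_c2--;	/* progress only when Client2's I/O completes */
		spins++;
	}
	printf("Client1 spent %ld iterations throttled, "
	       "%ld of them waiting purely on Client2\n", spins, spins - 100);
	return 0;
}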

I am working on patches to fix this.

Current version at (against -mm):
  http://programming.kicks-ass.net/kernel-patches/balance_dirty_pages/

However, after a rewrite of the BDI statistics work there are some
funnies, which I haven't had time to analyse yet :-/

I hope to post a new version soonish...



[PATCH 00/12] per device dirty throttling -v4

2007-04-17 Thread Peter Zijlstra
The latest version of the per device dirty throttling.

Dropped all the congestion_wait() churn, will contemplate a rename patch.
Reworked the BDI statistics to use percpu_counter.

against 2.6.21-rc6-mm1; the first patch is for easy application.
Andrew can of course just drop the patch it reverts.




[PATCH 07/12] mm: count dirty pages per BDI

2007-04-17 Thread Peter Zijlstra
Count per BDI dirty pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/buffer.c |1 +
 include/linux/backing-dev.h |1 +
 mm/page-writeback.c |2 ++
 mm/truncate.c   |1 +
 4 files changed, 5 insertions(+)

Index: linux-2.6-mm/fs/buffer.c
===
--- linux-2.6-mm.orig/fs/buffer.c
+++ linux-2.6-mm/fs/buffer.c
@@ -740,6 +740,7 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) {/* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+   __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6-mm/mm/page-writeback.c
===
--- linux-2.6-mm.orig/mm/page-writeback.c
+++ linux-2.6-mm/mm/page-writeback.c
@@ -828,6 +828,7 @@ int __set_page_dirty_nobuffers(struct pa
BUG_ON(mapping2 != mapping);
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+   __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(mapping->page_tree,
@@ -961,6 +962,7 @@ int clear_page_dirty_for_io(struct page 
 */
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
return 1;
}
return 0;
Index: linux-2.6-mm/mm/truncate.c
===
--- linux-2.6-mm.orig/mm/truncate.c
+++ linux-2.6-mm/mm/truncate.c
@@ -71,6 +71,7 @@ void cancel_dirty_page(struct page *page
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
if (account_size)
task_io_account_cancelled_write(account_size);
}
Index: linux-2.6-mm/include/linux/backing-dev.h
===
--- linux-2.6-mm.orig/include/linux/backing-dev.h
+++ linux-2.6-mm/include/linux/backing-dev.h
@@ -26,6 +26,7 @@ enum bdi_state {
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
+   BDI_DIRTY,
NR_BDI_STAT_ITEMS
 };
 

-- 



[PATCH 11/12] mm: per device dirty threshold

2007-04-17 Thread Peter Zijlstra
Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the BDIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.
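As a rough standalone illustration of the idea (the per-BDI numbers are made
up, and the real patch tracks a decaying "floating" proportion rather than
the plain ratio used here):

#include <stdio.h>

int main(void)
{
	/* recent writeout completions per BDI (illustrative) */
	unsigned long writeout[3] = { 800, 150, 50 };
	unsigned long dirty_thresh = 1000;	/* global dirty limit, in pages */
	unsigned long total = 0;
	int i;

	for (i = 0; i < 3; i++)
		total += writeout[i];

	/*
	 * Each BDI gets a share of the global limit proportional to how
	 * quickly it has been completing writeback.
	 */
	for (i = 0; i < 3; i++)
		printf("bdi%d: thresh = %lu pages\n",
		       i, dirty_thresh * writeout[i] / total);

	return 0;
}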

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   48 +++
 mm/backing-dev.c|3 
 mm/page-writeback.c |  185 +---
 3 files changed, 207 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-16 11:47:14.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-16 11:47:14.0 
+0200
@@ -29,6 +29,7 @@ enum bdi_stat_item {
BDI_DIRTY,
BDI_WRITEBACK,
BDI_UNSTABLE,
+   BDI_WRITEOUT,
NR_BDI_STAT_ITEMS
 };
 
@@ -44,6 +45,13 @@ struct backing_dev_info {
void *unplug_io_data;
 
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+   /*
+* data used for scaling the writeback cache
+*/
+   spinlock_t lock;/* protect the cycle count */
+   unsigned long cycles;   /* writeout cycles */
+   int dirty_exceeded;
 };
 
 void bdi_init(struct backing_dev_info *bdi);
@@ -55,6 +63,12 @@ static inline void __mod_bdi_stat(struct
percpu_counter_mod(&bdi->bdi_stat[item], amount);
 }
 
+static inline void __mod_bdi_stat64(struct backing_dev_info *bdi,
+   enum bdi_stat_item item, s64 amount)
+{
+   percpu_counter_mod64(&bdi->bdi_stat[item], amount);
+}
+
 static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
 {
@@ -87,12 +101,46 @@ static inline void dec_bdi_stat(struct b
local_irq_restore(flags);
 }
 
+static inline s64 __bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   return percpu_counter_read(&bdi->bdi_stat[item]);
+}
+
 static inline s64 bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
 {
return percpu_counter_read_positive(&bdi->bdi_stat[item]);
 }
 
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   return percpu_counter_sum(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   s64 sum;
+   unsigned long flags;
+
+   local_irq_save(flags);
+   sum = __bdi_stat_sum(bdi, item);
+   local_irq_restore(flags);
+
+   return sum;
+}
+
+static inline unsigned long bdi_stat_delta(void)
+{
+#ifdef CONFIG_SMP
+   return NR_CPUS * FBC_BATCH;
+#else
+   return 1UL;
+#endif
+}
+
 /*
  * Flags in backing_dev_info::capability
  * - The first two flags control whether dirty pages will contribute to the
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-16 11:47:14.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-16 16:21:58.0 +0200
@@ -49,8 +49,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;  /* Dirty mem may be 
over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +101,88 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by tracking a floating average per BDI and a global floating
+ * average. We optimize away the '/= 2' for the global average by noting that:
+ *
 *  if (++i > thresh) i /= 2:
+ *
+ * Can be approximated by:
+ *
+ *   thresh/2 + (++i % thresh/2)
+ *
+ * Furthermore, when we choose thresh to be 2^n it can be written in terms of
+ * binary
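
A standalone check of the approximation described in the comment above (both
expressions settle into the same [thresh/2, thresh) band; the values are
illustrative):

#include <stdio.h>

int main(void)
{
	unsigned long thresh = 1024;
	unsigned long exact = 0;	/* the "if (++i > thresh) i /= 2" recurrence */
	unsigned long i = 0;		/* free-running counter for the approximation */
	unsigned long n;

	for (n = 0; n < 10000; n++) {
		if (++exact > thresh)
			exact /= 2;
		++i;
	}

	printf("exact: %lu, approximated: %lu\n",
	       exact, thresh / 2 + (i % (thresh / 2)));
	return 0;
}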

[PATCH 06/12] mm: scalable bdi statistics counters.

2007-04-17 Thread Peter Zijlstra
Provide scalable per backing_dev_info statistics counters.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   50 ++--
 mm/backing-dev.c|   26 ++
 2 files changed, 74 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-12 13:27:13.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-12 13:28:40.0 
+0200
@@ -8,6 +8,7 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
+#include <linux/percpu_counter.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +25,10 @@ enum bdi_state {
 
 typedef int (congested_fn)(void *, int);
 
+enum bdi_stat_item {
+   NR_BDI_STAT_ITEMS
+};
+
 struct backing_dev_info {
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long ra_pages0; /* min readahead on start of file */
@@ -34,14 +39,55 @@ struct backing_dev_info {
void *congested_data;   /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+
+   struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 };
 
-static inline void bdi_init(struct backing_dev_info *bdi)
+void bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item, s32 amount)
+{
+   percpu_counter_mod(&bdi->bdi_stat[item], amount);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   __mod_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __inc_bdi_stat(bdi, item);
+   local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
 {
+   __mod_bdi_stat(bdi, item, -1);
+}
+
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __dec_bdi_stat(bdi, item);
+   local_irq_restore(flags);
 }
 
-static inline void bdi_destroy(struct backing_dev_info *bdi)
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
 {
+   return percpu_counter_read_positive(&bdi->bdi_stat[item]);
 }
 
 /*
Index: linux-2.6/mm/backing-dev.c
===
--- linux-2.6.orig/mm/backing-dev.c 2007-04-12 13:27:10.0 +0200
+++ linux-2.6/mm/backing-dev.c  2007-04-12 13:28:26.0 +0200
@@ -5,6 +5,30 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 
+void bdi_init(struct backing_dev_info *bdi)
+{
+   int i;
+
+   if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+   return;
+
+   for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+   percpu_counter_init(&bdi->bdi_stat[i], 0);
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+   int i;
+
+   if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+   return;
+
+   for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+   percpu_counter_destroy(&bdi->bdi_stat[i]);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
 static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
@@ -70,3 +94,5 @@ long congestion_wait_interruptible(int r
return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
+
+

-- 



[PATCH 10/12] mm: expose BDI statistics in sysfs.

2007-04-17 Thread Peter Zijlstra
Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   32 
 1 file changed, 32 insertions(+)

Index: linux-2.6-mm/block/ll_rw_blk.c
===
--- linux-2.6-mm.orig/block/ll_rw_blk.c
+++ linux-2.6-mm/block/ll_rw_blk.c
@@ -3976,6 +3976,20 @@ static ssize_t queue_max_hw_sectors_show
return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_DIRTY));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_WRITEBACK));
+}
+
+static ssize_t queue_nr_unstable_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_UNSTABLE));
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4006,6 +4020,21 @@ static struct queue_sysfs_entry queue_ma
.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_dirty_entry = {
+   .attr = {.name = "dirty_pages", .mode = S_IRUGO },
+   .show = queue_nr_dirty_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+   .attr = {.name = "writeback_pages", .mode = S_IRUGO },
+   .show = queue_nr_writeback_show,
+};
+
+static struct queue_sysfs_entry queue_unstable_entry = {
+   .attr = {.name = "unstable_pages", .mode = S_IRUGO },
+   .show = queue_nr_unstable_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -4018,6 +4047,9 @@ static struct attribute *default_attrs[]
queue_initial_ra_entry.attr,
queue_max_hw_sectors_entry.attr,
queue_max_sectors_entry.attr,
+   queue_dirty_entry.attr,
+   queue_writeback_entry.attr,
+   queue_unstable_entry.attr,
queue_iosched_entry.attr,
NULL,
 };

-- 



[PATCH 03/12] lib: dampen the percpu_counter FBC_BATCH

2007-04-17 Thread Peter Zijlstra
With the current logic the percpu_counter's accuracy delta is quadratic in
the number of CPUs in the system; reduce this to O(n ln n).
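A standalone illustration of the worst-case drift of percpu_counter_read()
(roughly NR_CPUS * FBC_BATCH) under the old and new batch definitions; the
CPU counts are arbitrary:

#include <stdio.h>

static unsigned long ilog2_ul(unsigned long n)
{
	unsigned long r = 0;

	while (n >>= 1)
		r++;
	return r;
}

int main(void)
{
	unsigned long cpus[] = { 2, 16, 64, 256, 1024 };
	unsigned long i;

	printf("%6s %10s %10s %12s %12s\n",
	       "cpus", "old batch", "new batch", "old drift", "new drift");
	for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		unsigned long n = cpus[i];
		unsigned long old_batch = (n >= 16) ? n * 2 : n * 4;
		unsigned long new_batch = 8 * ilog2_ul(n);

		printf("%6lu %10lu %10lu %12lu %12lu\n",
		       n, old_batch, new_batch, n * old_batch, n * new_batch);
	}
	return 0;
}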

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

Index: linux-2.6-mm/include/linux/percpu_counter.h
===
--- linux-2.6-mm.orig/include/linux/percpu_counter.h
+++ linux-2.6-mm/include/linux/percpu_counter.h
@@ -11,6 +11,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/types.h>
+#include <linux/log2.h>
 
 #ifdef CONFIG_SMP
 
@@ -20,11 +21,7 @@ struct percpu_counter {
s32 *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH  (NR_CPUS*2)
-#else
-#define FBC_BATCH  (NR_CPUS*4)
-#endif
+#define FBC_BATCH  (8*ilog2(NR_CPUS))
 
 static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {

-- 



[PATCH 05/12] mm: bdi init hooks

2007-04-17 Thread Peter Zijlstra
provide BDI constructor/destructor hooks

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c   |2 ++
 drivers/block/rd.c  |6 ++
 drivers/char/mem.c  |2 ++
 drivers/mtd/mtdcore.c   |5 +
 fs/char_dev.c   |1 +
 fs/configfs/configfs_internal.h |2 ++
 fs/configfs/inode.c |8 
 fs/configfs/mount.c |2 ++
 fs/fuse/inode.c |2 ++
 fs/hugetlbfs/inode.c|3 +++
 fs/nfs/client.c |3 +++
 fs/ocfs2/dlm/dlmfs.c|6 +-
 fs/ramfs/inode.c|7 ++-
 fs/sysfs/inode.c|5 +
 fs/sysfs/mount.c|2 ++
 fs/sysfs/sysfs.h|1 +
 include/linux/backing-dev.h |7 +++
 kernel/cpuset.c |3 +++
 mm/shmem.c  |1 +
 mm/swap.c   |2 ++
 20 files changed, 68 insertions(+), 2 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c2007-04-12 11:35:53.0 +0200
+++ linux-2.6/block/ll_rw_blk.c 2007-04-12 13:19:40.0 +0200
@@ -1771,6 +1771,7 @@ static void blk_release_queue(struct kob
 
blk_trace_shutdown(q);
 
+   bdi_destroy(&q->backing_dev_info);
kmem_cache_free(requestq_cachep, q);
 }
 
@@ -1836,6 +1837,7 @@ request_queue_t *blk_alloc_queue_node(gf
q->kobj.ktype = &queue_ktype;
kobject_init(&q->kobj);
q->backing_dev_info = default_backing_dev_info;
+   bdi_init(&q->backing_dev_info);

q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q->backing_dev_info.unplug_io_data = q;
Index: linux-2.6/drivers/block/rd.c
===
--- linux-2.6.orig/drivers/block/rd.c   2007-04-12 11:35:51.0 +0200
+++ linux-2.6/drivers/block/rd.c2007-04-12 11:35:59.0 +0200
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
blk_cleanup_queue(rd_queue[i]);
}
unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+   bdi_destroy(&rd_file_backing_dev_info);
+   bdi_destroy(&rd_backing_dev_info);
 }
 
 /*
@@ -421,6 +424,9 @@ static int __init rd_init(void)
int i;
int err = -ENOMEM;
 
+   bdi_init(&rd_backing_dev_info);
+   bdi_init(&rd_file_backing_dev_info);
+
if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
(rd_blocksize & (rd_blocksize-1))) {
printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
Index: linux-2.6/drivers/char/mem.c
===
--- linux-2.6.orig/drivers/char/mem.c   2007-04-12 11:35:51.0 +0200
+++ linux-2.6/drivers/char/mem.c2007-04-12 11:35:59.0 +0200
@@ -987,6 +987,8 @@ static int __init chr_dev_init(void)
  MKDEV(MEM_MAJOR, devlist[i].minor),
  devlist[i].name);
 
+   bdi_init(&zero_bdi);
+
return 0;
 }
 
Index: linux-2.6/fs/char_dev.c
===
--- linux-2.6.orig/fs/char_dev.c2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/char_dev.c 2007-04-12 11:35:59.0 +0200
@@ -546,6 +546,7 @@ static struct kobject *base_probe(dev_t 
 void __init chrdev_init(void)
 {
cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+   bdi_init(&directly_mappable_cdev_bdi);
 }
 
 
Index: linux-2.6/fs/fuse/inode.c
===
--- linux-2.6.orig/fs/fuse/inode.c  2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/fuse/inode.c   2007-04-12 11:35:59.0 +0200
@@ -415,6 +415,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(&fc->num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
+   bdi_init(&fc->bdi);
fc->reqctr = 0;
fc->blocked = 1;
get_random_bytes(fc->scramble_key, sizeof(fc->scramble_key));
@@ -428,6 +429,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
mutex_destroy(&fc->inst_mutex);
+   bdi_destroy(&fc->bdi);
kfree(fc);
}
 }
Index: linux-2.6/fs/nfs/client.c
===
--- linux-2.6.orig/fs/nfs/client.c  2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/nfs/client.c   2007-04-12 11:35:59.0 +0200
@@ -657,6 +657,8 @@ static void nfs_server_set_fsinfo(struct
if (server->rsize > NFS_MAX_FILE_IO_SIZE)
server->rsize = NFS_MAX_FILE_IO_SIZE

[PATCH 12/12] debug: expose BDI statistics in sysfs.

2007-04-17 Thread Peter Zijlstra
Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c   |   49 +
 mm/page-writeback.c |2 +-
 2 files changed, 50 insertions(+), 1 deletion(-)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c2007-04-12 17:35:03.0 +0200
+++ linux-2.6/block/ll_rw_blk.c 2007-04-12 17:36:49.0 +0200
@@ -3991,6 +3991,37 @@ static ssize_t queue_nr_unstable_show(st
return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_UNSTABLE));
}
 
+extern void get_writeout_scale(struct backing_dev_info *, int *, int *);
+
+static ssize_t queue_nr_cache_ratio_show(struct request_queue *q, char *page)
+{
+   int scale, div;
+
+   get_writeout_scale(&q->backing_dev_info, &scale, &div);
+   scale *= 1024;
+   scale /= div;
+
+   return sprintf(page, "%d\n", scale);
+}
+
+extern void
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+   struct backing_dev_info *bdi);
+
+static ssize_t queue_nr_cache_size_show(struct request_queue *q, char *page)
+{
+   long background, dirty, bdi_dirty;
+   get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+   return sprintf(page, "%ld\n", bdi_dirty);
+}
+
+static ssize_t queue_nr_cache_total_show(struct request_queue *q, char *page)
+{
+   long background, dirty, bdi_dirty;
+   get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+   return sprintf(page, "%ld\n", dirty);
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -4035,6 +4066,21 @@ static struct queue_sysfs_entry queue_un
.show = queue_nr_unstable_show,
 };
 
+static struct queue_sysfs_entry queue_cache_ratio_entry = {
+   .attr = {.name = "cache_ratio", .mode = S_IRUGO },
+   .show = queue_nr_cache_ratio_show,
+};
+
+static struct queue_sysfs_entry queue_cache_size_entry = {
+   .attr = {.name = "cache_size", .mode = S_IRUGO },
+   .show = queue_nr_cache_size_show,
+};
+
+static struct queue_sysfs_entry queue_cache_total_entry = {
+   .attr = {.name = "cache_total", .mode = S_IRUGO },
+   .show = queue_nr_cache_total_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -4050,6 +4096,9 @@ static struct attribute *default_attrs[]
queue_dirty_entry.attr,
queue_writeback_entry.attr,
queue_unstable_entry.attr,
+   queue_cache_ratio_entry.attr,
+   queue_cache_size_entry.attr,
+   queue_cache_total_entry.attr,
queue_iosched_entry.attr,
NULL,
 };
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-12 17:36:05.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-12 17:36:49.0 +0200
@@ -237,7 +237,7 @@ static unsigned long determine_dirtyable
return x + 1;   /* Ensure that we never return 0 */
 }
 
-static void
+void
 get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 struct backing_dev_info *bdi)
 {

-- 



[PATCH 01/12] revert per-backing_dev-dirty-and-writeback-page-accounting

2007-04-17 Thread Peter Zijlstra
For ease of application..

---
 block/ll_rw_blk.c   |   29 -
 fs/buffer.c |1 -
 include/linux/backing-dev.h |2 --
 mm/page-writeback.c |   13 ++---
 mm/truncate.c   |1 -
 5 files changed, 2 insertions(+), 44 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c2007-04-10 16:30:55.0 +0200
+++ linux-2.6/block/ll_rw_blk.c 2007-04-10 16:35:24.0 +0200
@@ -201,8 +201,6 @@ EXPORT_SYMBOL(blk_queue_softirq_done);
  **/
 void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
 {
-   struct backing_dev_info *bdi = &q->backing_dev_info;
-
/*
 * set defaults
 */
@@ -210,8 +208,6 @@ void blk_queue_make_request(request_queu
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
q->make_request_fn = mfn;
-   atomic_long_set(&bdi->nr_dirty, 0);
-   atomic_long_set(&bdi->nr_writeback, 0);
blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
blk_queue_hardsect_size(q, 512);
blk_queue_dma_alignment(q, 511);
@@ -3978,19 +3974,6 @@ static ssize_t queue_max_hw_sectors_show
return queue_var_show(max_hw_sectors_kb, (page));
 }
 
-static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
-{
-   return sprintf(page, "%lu\n",
-   atomic_long_read(&q->backing_dev_info.nr_dirty));
-
-}
-
-static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
-{
-   return sprintf(page, "%lu\n",
-   atomic_long_read(&q->backing_dev_info.nr_writeback));
-
-}
 
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4021,16 +4004,6 @@ static struct queue_sysfs_entry queue_ma
.show = queue_max_hw_sectors_show,
 };
 
-static struct queue_sysfs_entry queue_nr_dirty_entry = {
-   .attr = {.name = "nr_dirty", .mode = S_IRUGO },
-   .show = queue_nr_dirty_show,
-};
-
-static struct queue_sysfs_entry queue_nr_writeback_entry = {
-   .attr = {.name = "nr_writeback", .mode = S_IRUGO },
-   .show = queue_nr_writeback_show,
-};
-
 static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -4043,8 +4016,6 @@ static struct attribute *default_attrs[]
queue_initial_ra_entry.attr,
queue_max_hw_sectors_entry.attr,
queue_max_sectors_entry.attr,
-   queue_nr_dirty_entry.attr,
-   queue_nr_writeback_entry.attr,
queue_iosched_entry.attr,
NULL,
 };
Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c  2007-04-10 16:30:15.0 +0200
+++ linux-2.6/fs/buffer.c   2007-04-10 16:35:03.0 +0200
@@ -740,7 +740,6 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) {/* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
-   atomic_long_inc(&mapping->backing_dev_info->nr_dirty);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-10 16:30:55.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-10 16:35:03.0 
+0200
@@ -30,8 +30,6 @@ struct backing_dev_info {
unsigned long ra_thrash_bytes;  /* estimated thrashing threshold */
unsigned long state;/* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
-   atomic_long_t nr_dirty; /* Pages dirty against this BDI */
-   atomic_long_t nr_writeback;/* Pages under writeback against this BDI */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data;   /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-10 16:30:15.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-10 16:35:03.0 +0200
@@ -828,8 +828,6 @@ int __set_page_dirty_nobuffers(struct pa
BUG_ON(mapping2 != mapping);
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
-   atomic_long_inc(&mapping->backing_dev_info->
-   

[PATCH 02/12] nfs: remove congestion_end()

2007-04-17 Thread Peter Zijlstra
It's redundant; clear_bdi_congested() already wakes the waiters.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/write.c  |4 +---
 include/linux/backing-dev.h |1 -
 mm/backing-dev.c|   13 -
 3 files changed, 1 insertion(+), 17 deletions(-)

Index: linux-2.6-mm/fs/nfs/write.c
===
--- linux-2.6-mm.orig/fs/nfs/write.c2007-04-05 16:24:50.0 +0200
+++ linux-2.6-mm/fs/nfs/write.c 2007-04-05 16:25:04.0 +0200
@@ -235,10 +235,8 @@ static void nfs_end_page_writeback(struc
struct nfs_server *nfss = NFS_SERVER(inode);
 
end_page_writeback(page);
-   if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+   if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
clear_bdi_congested(&nfss->backing_dev_info, WRITE);
-   congestion_end(WRITE);
-   }
 }
 
 /*
Index: linux-2.6-mm/include/linux/backing-dev.h
===
--- linux-2.6-mm.orig/include/linux/backing-dev.h   2007-04-05 
16:24:50.0 +0200
+++ linux-2.6-mm/include/linux/backing-dev.h2007-04-05 16:25:08.0 
+0200
@@ -96,7 +96,6 @@ void clear_bdi_congested(struct backing_
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
 long congestion_wait_interruptible(int rw, long timeout);
-void congestion_end(int rw);
 
 #define bdi_cap_writeback_dirty(bdi) \
(!((bdi)-capabilities  BDI_CAP_NO_WRITEBACK))
Index: linux-2.6-mm/mm/backing-dev.c
===
--- linux-2.6-mm.orig/mm/backing-dev.c  2007-04-05 16:24:50.0 +0200
+++ linux-2.6-mm/mm/backing-dev.c   2007-04-05 16:25:16.0 +0200
@@ -70,16 +70,3 @@ long congestion_wait_interruptible(int r
return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
-   wait_queue_head_t *wqh = &congestion_wqh[rw];
-
-   if (waitqueue_active(wqh))
-   wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);

-- 



[PATCH 08/12] mm: count writeback pages per BDI

2007-04-17 Thread Peter Zijlstra
Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |1 +
 mm/page-writeback.c |   12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-12 17:32:17.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-12 17:34:24.0 +0200
@@ -977,14 +977,18 @@ int test_clear_page_writeback(struct pag
int ret;
 
if (mapping) {
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;

write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
-   if (ret)
+   if (ret) {
radix_tree_tag_clear(mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+   if (bdi_cap_writeback_dirty(bdi))
+   __dec_bdi_stat(bdi, BDI_WRITEBACK);
+   }
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
@@ -1000,14 +1004,18 @@ int test_set_page_writeback(struct page 
int ret;
 
if (mapping) {
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;

write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
-   if (!ret)
+   if (!ret) {
radix_tree_tag_set(mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+   if (bdi_cap_writeback_dirty(bdi))
+   __inc_bdi_stat(bdi, BDI_WRITEBACK);
+   }
if (!PageDirty(page))
radix_tree_tag_clear(mapping->page_tree,
page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-12 17:32:17.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-12 17:32:40.0 
+0200
@@ -27,6 +27,7 @@ typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
BDI_DIRTY,
+   BDI_WRITEBACK,
NR_BDI_STAT_ITEMS
 };
 

-- 



[PATCH 04/12] lib: percpu_counter_mod64

2007-04-17 Thread Peter Zijlstra
Add percpu_counter_mod64() to allow large modifications.
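A standalone note on why the 32-bit interface is not enough here: a delta
that does not fit in an s32 would be silently truncated (the value below is
arbitrary):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t amount = 3LL << 31;		/* a large 64-bit modification */
	int32_t as_s32 = (int32_t)amount;	/* what an s32-only API would see */

	printf("s64 amount: %lld, truncated to s32: %d\n",
	       (long long)amount, as_s32);
	return 0;
}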

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |9 +
 lib/percpu_counter.c   |   28 
 2 files changed, 37 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===
--- linux-2.6.orig/include/linux/percpu_counter.h   2007-04-12 
13:54:55.0 +0200
+++ linux-2.6/include/linux/percpu_counter.h2007-04-12 14:00:21.0 
+0200
@@ -36,6 +36,7 @@ static inline void percpu_counter_destro
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
@@ -81,6 +82,14 @@ percpu_counter_mod(struct percpu_counter
preempt_enable();
 }
 
+static inline void
+percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+   preempt_disable();
+   fbc->count += amount;
+   preempt_enable();
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
return fbc->count;
Index: linux-2.6/lib/percpu_counter.c
===
--- linux-2.6.orig/lib/percpu_counter.c 2006-07-31 13:07:38.0 +0200
+++ linux-2.6/lib/percpu_counter.c  2007-04-12 14:17:12.0 +0200
@@ -25,6 +25,34 @@ void percpu_counter_mod(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_mod);
 
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+   long count;
+   s32 *pcount;
+   int cpu;
+
+   if (amount >= FBC_BATCH || amount <= -FBC_BATCH) {
+   spin_lock(&fbc->lock);
+   fbc->count += amount;
+   spin_unlock(&fbc->lock);
+   return;
+   }
+
+   cpu = get_cpu();
+   pcount = per_cpu_ptr(fbc->counters, cpu);
+   count = *pcount + amount;
+   if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+   spin_lock(&fbc->lock);
+   fbc->count += count;
+   *pcount = 0;
+   spin_unlock(&fbc->lock);
+   } else {
+   *pcount = count;
+   }
+   put_cpu();
+}
+EXPORT_SYMBOL(percpu_counter_mod64);
+
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()

-- 



Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Peter Zijlstra
On Tue, 2007-04-17 at 21:19 -0400, Trond Myklebust wrote:
 I've split the issues introduced by the 2.6.21-rcX write code up into 4
 subproblems.
 
 The first patch is just a cleanup in order to ease review.
 
 Patch number 2 ensures that we never release the PG_writeback flag until
 _after_ we've either discarded the unstable request altogether, or put it
 on the nfs_inode's commit or dirty lists.
 
 Patch number 3 fixes the 'desynchronized value of nfs_i.ncommit' error. It
 uses the PG_NEED_COMMIT flag as an indicator for whether or not the request
 may be redirtied.
 
 Patch number 4 protects the NFS '.set_page_dirty' address_space operation
 against races with nfs_inode_add_request.

Ok, stuck them in, and my debug patch from yesterday, just in case...

However, I can't seem to run long enough to establish whether the
problem is gone. It deadlocks between 10-30 minutes due to missing IO
completions, whereas yesterday it took between 45-60 minutes to trigger
the 'desynchronized value of nfs_i.ncommit' messages.

I will continue trying to get a good run; however, if you have some
(perhaps experimental .22) patches you want me to try..





Re: [PATCH 0/4] 2.6.21-rc7 NFS writes: fix a series of issues

2007-04-18 Thread Peter Zijlstra
On Wed, 2007-04-18 at 10:19 +0200, Peter Zijlstra wrote:
 On Tue, 2007-04-17 at 21:19 -0400, Trond Myklebust wrote:
  I've split the issues introduced by the 2.6.21-rcX write code up into 4
  subproblems.
  
  The first patch is just a cleanup in order to ease review.
  
  Patch number 2 ensures that we never release the PG_writeback flag until
  _after_ we've either discarded the unstable request altogether, or put it
  on the nfs_inode's commit or dirty lists.
  
  Patch number 3 fixes the 'desynchronized value of nfs_i.ncommit' error. It
  uses the PG_NEED_COMMIT flag as an indicator for whether or not the request
  may be redirtied.
  
  Patch number 4 protects the NFS '.set_page_dirty' address_space operation
  against races with nfs_inode_add_request.
 
 Ok, stuck them in, and my debug patch from yesterday, just in case...
 
 However, I can't seem to run long enough to establish whether the
 problem is gone. It deadlocks between 10-30 minutes due to missing IO
 completions, whereas yesterday it took between 45-60 minutes to trigger
 the 'desynchronized value of nfs_i.ncommit' messages.
 
 I will continue trying to get a good run,

Just got one around 80-90 minutes, no 'desynchronized value of
nfs_i.ncommit' errors.



Re: [PATCH 11/12] mm: per device dirty threshold

2007-04-19 Thread Peter Zijlstra
On Thu, 2007-04-19 at 19:49 +0200, Miklos Szeredi wrote:
  +static inline unsigned long bdi_stat_delta(void)
  +{
  +#ifdef CONFIG_SMP
  +   return NR_CPUS * FBC_BATCH;
 
 Shouln't this be multiplied by the number of counters to sum?  I.e. 3
 if dirty and unstable are separate, and 2 if they are not.

Ah, yes, good catch. How about this:

---

Since we're adding 3 stat counters, triple the per counter delta as
well.
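
Rough arithmetic behind the factor (my reading of the code, for
illustration): each percpu counter read can be off by up to
bdi_stat_delta() ~= NR_CPUS * FBC_BATCH, and the throttling path sums three
of them (BDI_DIRTY + BDI_UNSTABLE for the reclaimable count, plus
BDI_WRITEBACK), so the combined error can reach about 3 * bdi_stat_delta();
only below that threshold is paying for the exact bdi_stat_sum() worthwhile.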

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page-writeback.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-19 19:59:26.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-19 20:00:09.0 +0200
@@ -321,7 +321,7 @@ static void balance_dirty_pages(struct a
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);

-   if (bdi_thresh < bdi_stat_delta()) {
+   if (bdi_thresh < 3*bdi_stat_delta()) {
bdi_nr_reclaimable =
bdi_stat_sum(bdi, BDI_DIRTY) +
bdi_stat_sum(bdi, BDI_UNSTABLE);




Re: [PATCH 09/12] mm: count unstable pages per BDI

2007-04-19 Thread Peter Zijlstra
On Thu, 2007-04-19 at 19:44 +0200, Miklos Szeredi wrote:
  Count per BDI unstable pages.
  
 
 I'm wondering, is it really worth having this category separate from
 per BDI dirty pages?
 
 With the exception of the export to sysfs, always the sum of unstable
 + dirty is used.

I guess you are right, but it offends my sense of aesthetics to break
symmetry with the zone statistics. However, it has the added advantage
of only needing 2 deltas as well.



Re: [PATCH 09/12] mm: count unstable pages per BDI

2007-04-19 Thread Peter Zijlstra
On Thu, 2007-04-19 at 20:12 +0200, Peter Zijlstra wrote:
 On Thu, 2007-04-19 at 19:44 +0200, Miklos Szeredi wrote:
   Count per BDI unstable pages.
   
  
  I'm wondering, is it really worth having this category separate from
  per BDI dirty pages?
  
  With the exception of the export to sysfs, always the sum of unstable
  + dirty is used.
 
 I guess you are right, but it offends my sense of aesthetics to break
 symmetry with the zone statistics. However, it has the added advantage
 of only needing 2 deltas as well.

I guess, this should do.

---
 fs/buffer.c |2 +-
 fs/nfs/write.c  |   11 +++
 include/linux/backing-dev.h |3 +--
 mm/page-writeback.c |   16 +++-
 mm/truncate.c   |2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c  2007-04-19 19:59:26.0 +0200
+++ linux-2.6/fs/buffer.c   2007-04-19 20:35:39.0 +0200
@@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) {/* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
-   __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
+   __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIM);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c   2007-04-19 19:59:26.0 +0200
+++ linux-2.6/fs/nfs/write.c2007-04-19 20:39:03.0 +0200
@@ -456,7 +456,7 @@ nfs_mark_request_commit(struct nfs_page 
nfsi->ncommit++;
spin_unlock(&nfsi->req_lock);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-   inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
+   inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIM);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 #endif
@@ -518,7 +518,8 @@ static void nfs_cancel_commit_list(struc
while(!list_empty(head)) {
req = nfs_list_entry(head->next);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-   dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
+   dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+   BDI_RECLAIM);
nfs_list_remove_request(req);
nfs_inode_remove_request(req);
nfs_unlock_request(req);
@@ -1247,7 +1248,8 @@ nfs_commit_list(struct inode *inode, str
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-   dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
+   dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+   BDI_RECLAIM);
nfs_clear_page_writeback(req);
}
return -ENOMEM;
@@ -1272,7 +1274,8 @@ static void nfs_commit_done(struct rpc_t
req = nfs_list_entry(data->pages.next);
nfs_list_remove_request(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-   dec_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
+   dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+   BDI_RECLAIM);

dprintk(NFS: commit (%s/%Ld [EMAIL PROTECTED]),
req->wb_context->dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-19 19:59:26.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-19 20:39:24.0 
+0200
@@ -26,9 +26,8 @@ enum bdi_state {
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
-   BDI_DIRTY,
+   BDI_RECLAIM,
BDI_WRITEBACK,
-   BDI_UNSTABLE,
BDI_WRITEOUT,
NR_BDI_STAT_ITEMS
 };
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-19 20:00:09.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-19 20:40:29.0 +0200
@@ -300,8 +300,7 @@ static void balance_dirty_pages(struct a
 
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
-   bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
-   bdi_stat(bdi, BDI_UNSTABLE);
+   bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIM

Re: [PATCH 09/12] mm: count unstable pages per BDI

2007-04-19 Thread Peter Zijlstra
On Thu, 2007-04-19 at 20:46 +0200, Peter Zijlstra wrote:
 On Thu, 2007-04-19 at 20:12 +0200, Peter Zijlstra wrote:
  On Thu, 2007-04-19 at 19:44 +0200, Miklos Szeredi wrote:
Count per BDI unstable pages.

   
   I'm wondering, is it really worth having this category separate from
   per BDI dirty pages?
   
   With the exception of the export to sysfs, always the sum of unstable
   + dirty is used.
  
  I guess you are right, but it offends my sense of aesthetics to break
  symmetry with the zone statistics. However, it has the added advantage
  of only needing 2 deltas as well.
 
 I guess, this should do.

OK, the compiler told me I messed up :-/. I'll respin the whole series
and repost tomorrow or something...



Re: [PATCH 09/12] mm: count unstable pages per BDI

2007-04-19 Thread Peter Zijlstra
On Thu, 2007-04-19 at 21:20 +0200, Miklos Szeredi wrote:
  Index: linux-2.6/fs/buffer.c
  ===
  --- linux-2.6.orig/fs/buffer.c  2007-04-19 19:59:26.0 +0200
  +++ linux-2.6/fs/buffer.c   2007-04-19 20:35:39.0 +0200
  @@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page
  if (page->mapping) {/* Race with truncate? */
  if (mapping_cap_account_dirty(mapping)) {
  __inc_zone_page_state(page, NR_FILE_DIRTY);
  -   __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
  +   __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIM);
 
 This name suggests it's _under_ reclaim, which is not true.  You might
 rather want to call it BDI_RECLAIMABLE, or something similar in
 meaning.

Yeah, my fingers got lazy on me :-) will instruct them to type more.



[PATCH 09/10] mm: expose BDI statistics in sysfs.

2007-04-20 Thread Peter Zijlstra
Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   32 
 1 file changed, 32 insertions(+)

Index: linux-2.6-mm/block/ll_rw_blk.c
===
--- linux-2.6-mm.orig/block/ll_rw_blk.c
+++ linux-2.6-mm/block/ll_rw_blk.c
@@ -3976,6 +3976,15 @@ static ssize_t queue_max_hw_sectors_show
return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_RECLAIMABLE));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lld\n", bdi_stat(&q->backing_dev_info, BDI_WRITEBACK));
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4006,6 +4020,16 @@ static struct queue_sysfs_entry queue_ma
.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_reclaimable_entry = {
+   .attr = {.name = "reclaimable_pages", .mode = S_IRUGO },
+   .show = queue_nr_reclaimable_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+   .attr = {.name = "writeback_pages", .mode = S_IRUGO },
+   .show = queue_nr_writeback_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -4018,6 +4047,8 @@ static struct attribute *default_attrs[]
queue_initial_ra_entry.attr,
queue_max_hw_sectors_entry.attr,
queue_max_sectors_entry.attr,
+   queue_reclaimable_entry.attr,
+   queue_writeback_entry.attr,
queue_iosched_entry.attr,
NULL,
 };

-- 



[PATCH 00/10] per device dirty throttling -v5

2007-04-20 Thread Peter Zijlstra
The latest version of the per device dirty throttling.

against 2.6.21-rc6-mm1; the first patch is for easy application.
Andrew can of course just drop the patch it reverts.

Merged BDI_DIRTY and BDI_UNSTABLE into BDI_RECLAIMABLE, and multiplied
bdi_stat_delta() by the number of counters summed.

Thanks to Miklos for pointing these out.

-- 



[PATCH 01/10] revert per-backing_dev-dirty-and-writeback-page-accounting

2007-04-20 Thread Peter Zijlstra
For ease of application..

---
 block/ll_rw_blk.c   |   29 -
 fs/buffer.c |1 -
 include/linux/backing-dev.h |2 --
 mm/page-writeback.c |   13 ++---
 mm/truncate.c   |1 -
 5 files changed, 2 insertions(+), 44 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c2007-04-10 16:30:55.0 +0200
+++ linux-2.6/block/ll_rw_blk.c 2007-04-10 16:35:24.0 +0200
@@ -201,8 +201,6 @@ EXPORT_SYMBOL(blk_queue_softirq_done);
  **/
 void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
 {
-   struct backing_dev_info *bdi = &q->backing_dev_info;
-
/*
 * set defaults
 */
@@ -210,8 +208,6 @@ void blk_queue_make_request(request_queu
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
q->make_request_fn = mfn;
-   atomic_long_set(&bdi->nr_dirty, 0);
-   atomic_long_set(&bdi->nr_writeback, 0);
blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
blk_queue_hardsect_size(q, 512);
blk_queue_dma_alignment(q, 511);
@@ -3978,19 +3974,6 @@ static ssize_t queue_max_hw_sectors_show
return queue_var_show(max_hw_sectors_kb, (page));
 }
 
-static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
-{
-   return sprintf(page, "%lu\n",
-   atomic_long_read(&q->backing_dev_info.nr_dirty));
-
-}
-
-static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
-{
-   return sprintf(page, "%lu\n",
-   atomic_long_read(&q->backing_dev_info.nr_writeback));
-
-}
 
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4021,16 +4004,6 @@ static struct queue_sysfs_entry queue_ma
.show = queue_max_hw_sectors_show,
 };
 
-static struct queue_sysfs_entry queue_nr_dirty_entry = {
-   .attr = {.name = "nr_dirty", .mode = S_IRUGO },
-   .show = queue_nr_dirty_show,
-};
-
-static struct queue_sysfs_entry queue_nr_writeback_entry = {
-   .attr = {.name = "nr_writeback", .mode = S_IRUGO },
-   .show = queue_nr_writeback_show,
-};
-
 static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -4043,8 +4016,6 @@ static struct attribute *default_attrs[]
queue_initial_ra_entry.attr,
queue_max_hw_sectors_entry.attr,
queue_max_sectors_entry.attr,
-   queue_nr_dirty_entry.attr,
-   queue_nr_writeback_entry.attr,
queue_iosched_entry.attr,
NULL,
 };
Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c  2007-04-10 16:30:15.0 +0200
+++ linux-2.6/fs/buffer.c   2007-04-10 16:35:03.0 +0200
@@ -740,7 +740,6 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) {/* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
-   atomic_long_inc(&mapping->backing_dev_info->nr_dirty);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-10 16:30:55.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-10 16:35:03.0 
+0200
@@ -30,8 +30,6 @@ struct backing_dev_info {
unsigned long ra_thrash_bytes;  /* estimated thrashing threshold */
unsigned long state;/* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
-   atomic_long_t nr_dirty; /* Pages dirty against this BDI */
-   atomic_long_t nr_writeback;/* Pages under writeback against this BDI */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data;   /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-10 16:30:15.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-10 16:35:03.0 +0200
@@ -828,8 +828,6 @@ int __set_page_dirty_nobuffers(struct pa
BUG_ON(mapping2 != mapping);
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
-   atomic_long_inc(&mapping->backing_dev_info->
-   

[PATCH 04/10] lib: percpu_counter_mod64

2007-04-20 Thread Peter Zijlstra
Add percpu_counter_mod64() to allow large modifications.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |9 +
 lib/percpu_counter.c   |   28 
 2 files changed, 37 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===
--- linux-2.6.orig/include/linux/percpu_counter.h   2007-04-12 
13:54:55.0 +0200
+++ linux-2.6/include/linux/percpu_counter.h2007-04-12 14:00:21.0 
+0200
@@ -36,6 +36,7 @@ static inline void percpu_counter_destro
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
@@ -81,6 +82,14 @@ percpu_counter_mod(struct percpu_counter
preempt_enable();
 }
 
+static inline void
+percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+   preempt_disable();
+   fbc->count += amount;
+   preempt_enable();
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
return fbc->count;
Index: linux-2.6/lib/percpu_counter.c
===
--- linux-2.6.orig/lib/percpu_counter.c 2006-07-31 13:07:38.0 +0200
+++ linux-2.6/lib/percpu_counter.c  2007-04-12 14:17:12.0 +0200
@@ -25,6 +25,34 @@ void percpu_counter_mod(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_mod);
 
+void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+   long count;
+   s32 *pcount;
+   int cpu;
+
+   if (amount >= FBC_BATCH || amount <= -FBC_BATCH) {
+   spin_lock(&fbc->lock);
+   fbc->count += amount;
+   spin_unlock(&fbc->lock);
+   return;
+   }
+
+   cpu = get_cpu();
+   pcount = per_cpu_ptr(fbc->counters, cpu);
+   count = *pcount + amount;
+   if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+   spin_lock(&fbc->lock);
+   fbc->count += count;
+   *pcount = 0;
+   spin_unlock(&fbc->lock);
+   } else {
+   *pcount = count;
+   }
+   put_cpu();
+}
+EXPORT_SYMBOL(percpu_counter_mod64);
+
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()

-- 



[PATCH 07/10] mm: count reclaimable pages per BDI

2007-04-20 Thread Peter Zijlstra
Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/buffer.c |2 ++
 fs/nfs/write.c  |7 +++
 include/linux/backing-dev.h |1 +
 mm/page-writeback.c |4 
 mm/truncate.c   |2 ++
 5 files changed, 16 insertions(+)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c  2007-04-20 15:20:48.0 +0200
+++ linux-2.6/fs/buffer.c   2007-04-20 15:26:31.0 +0200
@@ -740,6 +740,8 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) {/* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+   __inc_bdi_stat(mapping->backing_dev_info,
+   BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-20 15:20:48.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-20 15:27:28.0 +0200
@@ -828,6 +828,8 @@ int __set_page_dirty_nobuffers(struct pa
BUG_ON(mapping2 != mapping);
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+   __inc_bdi_stat(mapping->backing_dev_info,
+   BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(mapping->page_tree,
@@ -961,6 +963,8 @@ int clear_page_dirty_for_io(struct page 
 */
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   dec_bdi_stat(mapping->backing_dev_info,
+   BDI_RECLAIMABLE);
return 1;
}
return 0;
Index: linux-2.6/mm/truncate.c
===
--- linux-2.6.orig/mm/truncate.c2007-04-20 15:20:48.0 +0200
+++ linux-2.6/mm/truncate.c 2007-04-20 15:27:38.0 +0200
@@ -71,6 +71,8 @@ void cancel_dirty_page(struct page *page
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   dec_bdi_stat(mapping->backing_dev_info,
+   BDI_RECLAIMABLE);
if (account_size)
task_io_account_cancelled_write(account_size);
}
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c   2007-04-20 15:20:05.0 +0200
+++ linux-2.6/fs/nfs/write.c2007-04-20 15:27:07.0 +0200
@@ -449,6 +449,7 @@ nfs_mark_request_commit(struct nfs_page 
nfsi-ncommit++;
spin_unlock(nfsi-req_lock);
inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
+   inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 #endif
@@ -509,6 +510,8 @@ static void nfs_cancel_commit_list(struc
while(!list_empty(head)) {
req = nfs_list_entry(head-next);
dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
+   dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
+   BDI_RECLAIMABLE);
nfs_list_remove_request(req);
nfs_inode_remove_request(req);
nfs_unlock_request(req);
@@ -1234,6 +1237,8 @@ nfs_commit_list(struct inode *inode, str
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
+   dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
+   BDI_RECLAIMABLE);
nfs_clear_page_writeback(req);
}
return -ENOMEM;
@@ -1258,6 +1263,8 @@ static void nfs_commit_done(struct rpc_t
req = nfs_list_entry(data-pages.next);
nfs_list_remove_request(req);
dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
+   dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
+   BDI_RECLAIMABLE);
 
dprintk(NFS: commit (%s/%Ld [EMAIL PROTECTED]),
req-wb_context-dentry-d_inode-i_sb-s_id,
Index

[PATCH 10/10] mm: per device dirty threshold

2007-04-20 Thread Peter Zijlstra
Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the BDIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.
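
As an illustration of the idea (not part of the patch), a minimal userspace
sketch of how a BDI's slice of the global dirty threshold could be derived
from its share of recent writeout completions; all names and numbers here are
made up:

#include <stdio.h>

static unsigned long bdi_dirty_limit(unsigned long bdi_writeouts,
				     unsigned long total_writeouts,
				     unsigned long dirty_thresh)
{
	if (!total_writeouts)
		return dirty_thresh;	/* no history yet: fall back to the global limit */
	/* share of the global limit, proportional to recent writeout speed */
	return (unsigned long)((unsigned long long)dirty_thresh *
			       bdi_writeouts / total_writeouts);
}

int main(void)
{
	/* a fast disk retired 900 of the last 1000 writeouts, a slow stick 100 */
	printf("disk:  %lu pages\n", bdi_dirty_limit(900, 1000, 32768));
	printf("stick: %lu pages\n", bdi_dirty_limit(100, 1000, 32768));
	return 0;
}

A faster device thus earns a larger slice of the same global limit, which is
the 'fair share' behaviour described above.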

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   51 
 mm/backing-dev.c|3 
 mm/page-writeback.c |  181 
 3 files changed, 206 insertions(+), 29 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-20 15:28:17.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-20 15:33:59.0 
+0200
@@ -28,6 +28,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
+   BDI_WRITEOUT,
NR_BDI_STAT_ITEMS
 };
 
@@ -43,6 +44,13 @@ struct backing_dev_info {
void *unplug_io_data;
 
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+   /*
+* data used for scaling the writeback cache
+*/
+   spinlock_t lock;/* protect the cycle count */
+   unsigned long cycles;   /* writeout cycles */
+   int dirty_exceeded;
 };
 
 void bdi_init(struct backing_dev_info *bdi);
@@ -54,6 +62,12 @@ static inline void __mod_bdi_stat(struct
percpu_counter_mod(bdi-bdi_stat[item], amount);
 }
 
+static inline void __mod_bdi_stat64(struct backing_dev_info *bdi,
+   enum bdi_stat_item item, s64 amount)
+{
+   percpu_counter_mod64(bdi-bdi_stat[item], amount);
+}
+
 static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
 {
@@ -86,12 +100,49 @@ static inline void dec_bdi_stat(struct b
local_irq_restore(flags);
 }
 
+static inline s64 __bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   return percpu_counter_read(bdi-bdi_stat[item]);
+}
+
 static inline s64 bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
 {
return percpu_counter_read_positive(bdi-bdi_stat[item]);
 }
 
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   return percpu_counter_sum(bdi-bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   s64 sum;
+   unsigned long flags;
+
+   local_irq_save(flags);
+   sum = __bdi_stat_sum(bdi, item);
+   local_irq_restore(flags);
+
+   return sum;
+}
+
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_delta(void)
+{
+#ifdef CONFIG_SMP
+   return NR_CPUS * FBC_BATCH;
+#else
+   return 1UL;
+#endif
+}
+
 /*
  * Flags in backing_dev_info::capability
  * - The first two flags control whether dirty pages will contribute to the
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-20 15:28:10.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-20 15:35:01.0 +0200
@@ -49,8 +49,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;  /* Dirty mem may be 
over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +101,88 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by tracking a floating average per BDI and a global floating
+ * average. We optimize away the '/= 2' for the global average by noting that:
+ *
+ *  if (++i > thresh) i /= 2:
+ *
+ * Can be approximated by:
+ *
+ *   thresh/2 + (++i % thresh/2)
+ *
+ * Furthermore, when
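
A quick userspace check of the approximation described in the comment above
(illustrative only, not part of the patch): the exact counter halves itself
once it crosses 'thresh', the approximation folds that into a modulo, and both
oscillate between roughly thresh/2 and thresh. The loop bound and thresh are
made up.

#include <stdio.h>

int main(void)
{
	unsigned long thresh = 1024, exact = 0, i;

	for (i = 1; i <= 10000; i++) {
		unsigned long approx;

		if (++exact > thresh)	/* the exact '/= 2' variant */
			exact /= 2;
		approx = thresh / 2 + (i % (thresh / 2));

		if (i % 2500 == 0)
			printf("i=%5lu exact=%4lu approx=%4lu\n",
			       i, exact, approx);
	}
	return 0;
}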

[PATCH 08/10] mm: count writeback pages per BDI

2007-04-20 Thread Peter Zijlstra
Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |1 +
 mm/page-writeback.c |   12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c  2007-04-20 15:27:28.0 +0200
+++ linux-2.6/mm/page-writeback.c   2007-04-20 15:28:10.0 +0200
@@ -979,14 +979,18 @@ int test_clear_page_writeback(struct pag
int ret;
 
if (mapping) {
+   struct backing_dev_info *bdi = mapping-backing_dev_info;
unsigned long flags;
 
write_lock_irqsave(mapping-tree_lock, flags);
ret = TestClearPageWriteback(page);
-   if (ret)
+   if (ret) {
radix_tree_tag_clear(mapping-page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+   if (bdi_cap_writeback_dirty(bdi))
+   __dec_bdi_stat(bdi, BDI_WRITEBACK);
+   }
write_unlock_irqrestore(mapping-tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
@@ -1002,14 +1006,18 @@ int test_set_page_writeback(struct page 
int ret;
 
if (mapping) {
+   struct backing_dev_info *bdi = mapping-backing_dev_info;
unsigned long flags;
 
write_lock_irqsave(mapping-tree_lock, flags);
ret = TestSetPageWriteback(page);
-   if (!ret)
+   if (!ret) {
radix_tree_tag_set(mapping-page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+   if (bdi_cap_writeback_dirty(bdi))
+   __inc_bdi_stat(bdi, BDI_WRITEBACK);
+   }
if (!PageDirty(page))
radix_tree_tag_clear(mapping-page_tree,
page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-20 15:25:47.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-20 15:28:17.0 
+0200
@@ -27,6 +27,7 @@ typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
BDI_RECLAIMABLE,
+   BDI_WRITEBACK,
NR_BDI_STAT_ITEMS
 };
 



[PATCH 02/10] nfs: remove congestion_end()

2007-04-20 Thread Peter Zijlstra
It's redundant; clear_bdi_congested() already wakes the waiters.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/write.c  |4 +---
 include/linux/backing-dev.h |1 -
 mm/backing-dev.c|   13 -
 3 files changed, 1 insertion(+), 17 deletions(-)

Index: linux-2.6-mm/fs/nfs/write.c
===
--- linux-2.6-mm.orig/fs/nfs/write.c2007-04-05 16:24:50.0 +0200
+++ linux-2.6-mm/fs/nfs/write.c 2007-04-05 16:25:04.0 +0200
@@ -235,10 +235,8 @@ static void nfs_end_page_writeback(struc
struct nfs_server *nfss = NFS_SERVER(inode);
 
end_page_writeback(page);
-   if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+   if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
clear_bdi_congested(nfss-backing_dev_info, WRITE);
-   congestion_end(WRITE);
-   }
 }
 
 /*
Index: linux-2.6-mm/include/linux/backing-dev.h
===
--- linux-2.6-mm.orig/include/linux/backing-dev.h   2007-04-05 
16:24:50.0 +0200
+++ linux-2.6-mm/include/linux/backing-dev.h2007-04-05 16:25:08.0 
+0200
@@ -96,7 +96,6 @@ void clear_bdi_congested(struct backing_
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
 long congestion_wait_interruptible(int rw, long timeout);
-void congestion_end(int rw);
 
 #define bdi_cap_writeback_dirty(bdi) \
(!((bdi)-capabilities  BDI_CAP_NO_WRITEBACK))
Index: linux-2.6-mm/mm/backing-dev.c
===
--- linux-2.6-mm.orig/mm/backing-dev.c  2007-04-05 16:24:50.0 +0200
+++ linux-2.6-mm/mm/backing-dev.c   2007-04-05 16:25:16.0 +0200
@@ -70,16 +70,3 @@ long congestion_wait_interruptible(int r
return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
-   wait_queue_head_t *wqh = congestion_wqh[rw];
-
-   if (waitqueue_active(wqh))
-   wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);



[PATCH 03/10] lib: dampen the percpu_counter FBC_BATCH

2007-04-20 Thread Peter Zijlstra
With the current logic the percpu_counter's accuracy delta is quadratic
with respect to the number of CPUs in the system; reduce this to O(n ln n).

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/percpu_counter.h |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

Index: linux-2.6-mm/include/linux/percpu_counter.h
===
--- linux-2.6-mm.orig/include/linux/percpu_counter.h
+++ linux-2.6-mm/include/linux/percpu_counter.h
@@ -11,6 +11,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/types.h>
+#include <linux/log2.h>
 
 #ifdef CONFIG_SMP
 
@@ -20,11 +21,7 @@ struct percpu_counter {
s32 *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH  (NR_CPUS*2)
-#else
-#define FBC_BATCH  (NR_CPUS*4)
-#endif
+#define FBC_BATCH  (8*ilog2(NR_CPUS))
 
 static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
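
For reference, a rough comparison (not from the patch) of the worst-case
counter error under the old and new batch sizes, taking the error as
nr_cpus * FBC_BATCH, as the bdi_stat_delta() helper later in this series does:

#include <stdio.h>

static unsigned int ilog2_u(unsigned int n)
{
	unsigned int l = 0;

	while (n >>= 1)
		l++;
	return l;
}

int main(void)
{
	unsigned int cpus[] = { 2, 16, 64, 256, 1024 };
	unsigned int i;

	for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		unsigned int n = cpus[i];
		unsigned int old_batch = (n >= 16) ? n * 2 : n * 4;	/* old #define */
		unsigned int new_batch = 8 * ilog2_u(n);		/* new #define */

		printf("%4u cpus: old error %8u, new error %8u\n",
		       n, n * old_batch, n * new_batch);
	}
	return 0;
}

The quadratic growth of the old bound versus the n*log(n) growth of the new
one is what the changelog refers to.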



[PATCH 06/10] mm: scalable bdi statistics counters.

2007-04-20 Thread Peter Zijlstra
Provide scalable per backing_dev_info statistics counters.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   50 ++--
 mm/backing-dev.c|   26 ++
 2 files changed, 74 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-12 13:27:13.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-12 13:28:40.0 
+0200
@@ -8,6 +8,7 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
+#include <linux/percpu_counter.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +25,10 @@ enum bdi_state {
 
 typedef int (congested_fn)(void *, int);
 
+enum bdi_stat_item {
+   NR_BDI_STAT_ITEMS
+};
+
 struct backing_dev_info {
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long ra_pages0; /* min readahead on start of file */
@@ -34,14 +39,55 @@ struct backing_dev_info {
void *congested_data;   /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+
+   struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 };
 
-static inline void bdi_init(struct backing_dev_info *bdi)
+void bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item, s32 amount)
+{
+   percpu_counter_mod(bdi-bdi_stat[item], amount);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   __mod_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __inc_bdi_stat(bdi, item);
+   local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
 {
+   __mod_bdi_stat(bdi, item, -1);
+}
+
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __dec_bdi_stat(bdi, item);
+   local_irq_restore(flags);
 }
 
-static inline void bdi_destroy(struct backing_dev_info *bdi)
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
 {
+   return percpu_counter_read_positive(bdi-bdi_stat[item]);
 }
 
 /*
Index: linux-2.6/mm/backing-dev.c
===
--- linux-2.6.orig/mm/backing-dev.c 2007-04-12 13:27:10.0 +0200
+++ linux-2.6/mm/backing-dev.c  2007-04-12 13:28:26.0 +0200
@@ -5,6 +5,30 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 
+void bdi_init(struct backing_dev_info *bdi)
+{
+   int i;
+
+   if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+   return;
+
+   for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+   percpu_counter_init(bdi-bdi_stat[i], 0);
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+   int i;
+
+   if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+   return;
+
+   for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+   percpu_counter_destroy(bdi-bdi_stat[i]);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
 static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
@@ -70,3 +94,5 @@ long congestion_wait_interruptible(int r
return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
+
+



[PATCH 05/10] mm: bdi init hooks

2007-04-20 Thread Peter Zijlstra
provide BDI constructor/destructor hooks

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c   |2 ++
 drivers/block/rd.c  |6 ++
 drivers/char/mem.c  |2 ++
 drivers/mtd/mtdcore.c   |5 +
 fs/char_dev.c   |1 +
 fs/configfs/configfs_internal.h |2 ++
 fs/configfs/inode.c |8 
 fs/configfs/mount.c |2 ++
 fs/fuse/inode.c |2 ++
 fs/hugetlbfs/inode.c|3 +++
 fs/nfs/client.c |3 +++
 fs/ocfs2/dlm/dlmfs.c|6 +-
 fs/ramfs/inode.c|7 ++-
 fs/sysfs/inode.c|5 +
 fs/sysfs/mount.c|2 ++
 fs/sysfs/sysfs.h|1 +
 include/linux/backing-dev.h |7 +++
 kernel/cpuset.c |3 +++
 mm/shmem.c  |1 +
 mm/swap.c   |2 ++
 20 files changed, 68 insertions(+), 2 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c2007-04-12 11:35:53.0 +0200
+++ linux-2.6/block/ll_rw_blk.c 2007-04-12 13:19:40.0 +0200
@@ -1771,6 +1771,7 @@ static void blk_release_queue(struct kob
 
blk_trace_shutdown(q);
 
+   bdi_destroy(q-backing_dev_info);
kmem_cache_free(requestq_cachep, q);
 }
 
@@ -1836,6 +1837,7 @@ request_queue_t *blk_alloc_queue_node(gf
q-kobj.ktype = queue_ktype;
kobject_init(q-kobj);
q-backing_dev_info = default_backing_dev_info;
+   bdi_init(q-backing_dev_info);
 
q-backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q-backing_dev_info.unplug_io_data = q;
Index: linux-2.6/drivers/block/rd.c
===
--- linux-2.6.orig/drivers/block/rd.c   2007-04-12 11:35:51.0 +0200
+++ linux-2.6/drivers/block/rd.c2007-04-12 11:35:59.0 +0200
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
blk_cleanup_queue(rd_queue[i]);
}
unregister_blkdev(RAMDISK_MAJOR, ramdisk);
+
+   bdi_destroy(rd_file_backing_dev_info);
+   bdi_destroy(rd_backing_dev_info);
 }
 
 /*
@@ -421,6 +424,9 @@ static int __init rd_init(void)
int i;
int err = -ENOMEM;
 
+   bdi_init(rd_backing_dev_info);
+   bdi_init(rd_file_backing_dev_info);
+
	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
	    (rd_blocksize & (rd_blocksize-1))) {
printk(RAMDISK: wrong blocksize %d, reverting to defaults\n,
Index: linux-2.6/drivers/char/mem.c
===
--- linux-2.6.orig/drivers/char/mem.c   2007-04-12 11:35:51.0 +0200
+++ linux-2.6/drivers/char/mem.c2007-04-12 11:35:59.0 +0200
@@ -987,6 +987,8 @@ static int __init chr_dev_init(void)
  MKDEV(MEM_MAJOR, devlist[i].minor),
  devlist[i].name);
 
+   bdi_init(zero_bdi);
+
return 0;
 }
 
Index: linux-2.6/fs/char_dev.c
===
--- linux-2.6.orig/fs/char_dev.c2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/char_dev.c 2007-04-12 11:35:59.0 +0200
@@ -546,6 +546,7 @@ static struct kobject *base_probe(dev_t 
 void __init chrdev_init(void)
 {
cdev_map = kobj_map_init(base_probe, chrdevs_lock);
+   bdi_init(directly_mappable_cdev_bdi);
 }
 
 
Index: linux-2.6/fs/fuse/inode.c
===
--- linux-2.6.orig/fs/fuse/inode.c  2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/fuse/inode.c   2007-04-12 11:35:59.0 +0200
@@ -415,6 +415,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(fc-num_waiting, 0);
fc-bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc-bdi.unplug_io_fn = default_unplug_io_fn;
+   bdi_init(fc-bdi);
fc-reqctr = 0;
fc-blocked = 1;
get_random_bytes(fc-scramble_key, sizeof(fc-scramble_key));
@@ -428,6 +429,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (fc-destroy_req)
fuse_request_free(fc-destroy_req);
mutex_destroy(fc-inst_mutex);
+   bdi_destroy(fc-bdi);
kfree(fc);
}
 }
Index: linux-2.6/fs/nfs/client.c
===
--- linux-2.6.orig/fs/nfs/client.c  2007-04-12 11:35:51.0 +0200
+++ linux-2.6/fs/nfs/client.c   2007-04-12 11:35:59.0 +0200
@@ -657,6 +657,8 @@ static void nfs_server_set_fsinfo(struct
if (server-rsize  NFS_MAX_FILE_IO_SIZE)
server-rsize = NFS_MAX_FILE_IO_SIZE

Re: [PATCH 03/10] lib: dampen the percpu_counter FBC_BATCH

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote:
 On Fri, 20 Apr 2007 17:51:57 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  With the current logic the percpu_counter's accuracy delta is quadric
  wrt the number of cpus in the system, reduce this to O(n ln n).
  
  Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
  ---
   include/linux/percpu_counter.h |7 ++-
   1 file changed, 2 insertions(+), 5 deletions(-)
  
  Index: linux-2.6-mm/include/linux/percpu_counter.h
  ===
  --- linux-2.6-mm.orig/include/linux/percpu_counter.h
  +++ linux-2.6-mm/include/linux/percpu_counter.h
  @@ -11,6 +11,7 @@
   #include linux/threads.h
   #include linux/percpu.h
   #include linux/types.h
  +#include linux/log2.h
   
   #ifdef CONFIG_SMP
   
  @@ -20,11 +21,7 @@ struct percpu_counter {
  s32 *counters;
   };
   
  -#if NR_CPUS = 16
  -#define FBC_BATCH  (NR_CPUS*2)
  -#else
  -#define FBC_BATCH  (NR_CPUS*4)
  -#endif
  +#define FBC_BATCH  (8*ilog2(NR_CPUS))
   
   static inline void percpu_counter_init(struct percpu_counter *fbc, s64 
  amount)
   {
 
 I worry that this might be too small when there are hundreds of CPUs online.
 
 With 1024 CPUs we go for the lock once per 80 counts.  That's not much. 
 
 If we have 1024 CPUs, each one of which is incrementing this counter at N
 Hz, we have 1024/80=12 CPUs all going for the same lock at N Hz.  It could
 get bad.
 
 But I don't know what the gain is for this loss.  Your changelog should
 have told us.
 
 What problem is this patch solving?

In 10/10 I introduce bdi_stat_delta() which gives the maximum error of a
single counter. That is used to switch between precise
(percpu_counter_sum) and imprecise (percpu_counter_read) accesses of the
stats.

I worried that the current quadratic error would be too large; and as the
ZVC counters also use a logarithmic error bound I thought it would be
good to have here as well.



Re: [PATCH 04/10] lib: percpu_counter_mod64

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote:
 On Fri, 20 Apr 2007 17:51:58 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  Add percpu_counter_mod64() to allow large modifications.
  
  Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
  ---
   include/linux/percpu_counter.h |9 +
   lib/percpu_counter.c   |   28 
   2 files changed, 37 insertions(+)
  
  Index: linux-2.6/include/linux/percpu_counter.h
  ===
  --- linux-2.6.orig/include/linux/percpu_counter.h   2007-04-12 
  13:54:55.0 +0200
  +++ linux-2.6/include/linux/percpu_counter.h2007-04-12 
  14:00:21.0 +0200
  @@ -36,6 +36,7 @@ static inline void percpu_counter_destro
   }
   
   void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
  +void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
   s64 percpu_counter_sum(struct percpu_counter *fbc);
   
   static inline s64 percpu_counter_read(struct percpu_counter *fbc)
  @@ -81,6 +82,14 @@ percpu_counter_mod(struct percpu_counter
  preempt_enable();
   }
   
  +static inline void
  +percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
  +{
  +   preempt_disable();
  +   fbc-count += amount;
  +   preempt_enable();
  +}
  +
   static inline s64 percpu_counter_read(struct percpu_counter *fbc)
   {
  return fbc-count;
  Index: linux-2.6/lib/percpu_counter.c
  ===
  --- linux-2.6.orig/lib/percpu_counter.c 2006-07-31 13:07:38.0 
  +0200
  +++ linux-2.6/lib/percpu_counter.c  2007-04-12 14:17:12.0 +0200
  @@ -25,6 +25,34 @@ void percpu_counter_mod(struct percpu_co
   }
   EXPORT_SYMBOL(percpu_counter_mod);
   
  +void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
  +{
  +   long count;
  +   s32 *pcount;
  +   int cpu;
  +
  +   if (amount = FBC_BATCH || amount = -FBC_BATCH) {
  +   spin_lock(fbc-lock);
  +   fbc-count += amount;
  +   spin_unlock(fbc-lock);
  +   return;
  +   }
 
 This is wrong, a little.
 
 If the counter was at -FBC_BATCH/2 and the caller passed in FBC_BATCH, we
 could just set the cpu-local counter to FBC_BATCH/2 instead of going for
 the lock.
 
 Probably doesn't matter though.

Right, I could have taken the current percpu offset into account as well.

  +   cpu = get_cpu();
  +   pcount = per_cpu_ptr(fbc-counters, cpu);
  +   count = *pcount + amount;
  +   if (count = FBC_BATCH || count = -FBC_BATCH) {
  +   spin_lock(fbc-lock);
  +   fbc-count += count;
  +   *pcount = 0;
  +   spin_unlock(fbc-lock);
  +   } else {
  +   *pcount = count;
  +   }
  +   put_cpu();
  +}
  +EXPORT_SYMBOL(percpu_counter_mod64);
 
 Bloaty.  Surely we won't be needing this on 32-bit kernels?  Even monster
 PAE has only 64,000,000 pages and won't be using deltas of more than 4
 gigapages?
 
 Does even 64-bit need to handle 4 gigapages in a single hit?  /me suspects
 another changelog bug

Yeah, /me chastises himself for that...

This is because percpu_counter is s64 instead of the native long; I need
to halve the counter at some point (bdi_writeout_norm) and do that by
subtracting half the current value.

If percpu_counter_mod is limited to s32 this might not always work
(although in practice it might just fit).
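
A tiny illustration (not from the patch) of why the halving step wants a
64-bit modification: subtracting half of a count that exceeds the s32 range
cannot be expressed as a single s32 delta. The numbers are made up.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t count = 6000000000LL;	/* counter value beyond the s32 range */
	int64_t delta = -(count / 2);	/* halving delta, needs 64 bits */

	count += delta;
	printf("halved count: %lld\n", (long long)count);
	return 0;
}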





Re: [PATCH 07/10] mm: count reclaimable pages per BDI

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote:
 On Fri, 20 Apr 2007 17:52:01 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.
 
 hm.  Aggregating dirty and unstable at inc/dec time is a bit kludgy.  If
 later on we want to know just dirty then we're in trouble.
 
 I can see the logic behind it though.
 
 Perhaps one could have separate BDI_DIRTY and BDI_UNSTABLE and treat them
 separately at inc/dec time, but give them the same numerical value, so
 they in fact refer to the same counter.  That's kludgy too.

:-(

I struggled with it too; don't have a ready solution either. I'll do
whatever the consensus agrees upon.
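
One way to express the "same numerical value" idea from above is an enum
alias, so both names index the same counter slot. This is only a sketch of
that suggestion, not what the posted patch does:

#include <stdio.h>

enum bdi_stat_item {
	BDI_DIRTY,
	BDI_UNSTABLE = BDI_DIRTY,	/* alias: both map onto one counter */
	BDI_WRITEBACK,
	NR_BDI_STAT_ITEMS
};

int main(void)
{
	printf("slots=%d dirty=%d unstable=%d writeback=%d\n",
	       NR_BDI_STAT_ITEMS, BDI_DIRTY, BDI_UNSTABLE, BDI_WRITEBACK);
	return 0;
}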



Re: [PATCH 08/10] mm: count writeback pages per BDI

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote:
 On Fri, 20 Apr 2007 17:52:02 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  Count per BDI writeback pages.
  
  Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
  ---
   include/linux/backing-dev.h |1 +
   mm/page-writeback.c |   12 ++--
   2 files changed, 11 insertions(+), 2 deletions(-)
  
  Index: linux-2.6/mm/page-writeback.c
  ===
  --- linux-2.6.orig/mm/page-writeback.c  2007-04-20 15:27:28.0 
  +0200
  +++ linux-2.6/mm/page-writeback.c   2007-04-20 15:28:10.0 +0200
  @@ -979,14 +979,18 @@ int test_clear_page_writeback(struct pag
  int ret;
   
  if (mapping) {
  +   struct backing_dev_info *bdi = mapping-backing_dev_info;
  unsigned long flags;
   
  write_lock_irqsave(mapping-tree_lock, flags);
  ret = TestClearPageWriteback(page);
  -   if (ret)
  +   if (ret) {
  radix_tree_tag_clear(mapping-page_tree,
  page_index(page),
  PAGECACHE_TAG_WRITEBACK);
  +   if (bdi_cap_writeback_dirty(bdi))
  +   __dec_bdi_stat(bdi, BDI_WRITEBACK);
 
 Why do we test bdi_cap_writeback_dirty() here?
 
 If we remove that test, we end up accumulating statistics for
 non-writebackable backing devs, but does that matter? 

It would not, had I not cheated:

+void bdi_init(struct backing_dev_info *bdi)
+{
+   int i;
+
+   if (!(bdi_cap_writeback_dirty(bdi) || bdi_cap_account_dirty(bdi)))
+   return;
+
+   for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+   percpu_counter_init(bdi-bdi_stat[i], 0);
+}
+EXPORT_SYMBOL(bdi_init);

  Probably the common
 case is writebackable backing-devs, so eliminating the test-n-branch might
 be a net microgain.

Time vs space. Now we don't even have storage for those BDIs..

Don't particularly care on this point though, I just thought it might be
worthwhile to save on the percpu data.



Re: [PATCH 09/10] mm: expose BDI statistics in sysfs.

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote:
 On Fri, 20 Apr 2007 17:52:03 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  Expose the per BDI stats in /sys/block/dev/queue/*
  
  Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
  ---
   block/ll_rw_blk.c |   32 
   1 file changed, 32 insertions(+)
  
  Index: linux-2.6-mm/block/ll_rw_blk.c
  ===
  --- linux-2.6-mm.orig/block/ll_rw_blk.c
  +++ linux-2.6-mm/block/ll_rw_blk.c
  @@ -3976,6 +3976,15 @@ static ssize_t queue_max_hw_sectors_show
  return queue_var_show(max_hw_sectors_kb, (page));
   }
   
  +static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char 
  *page)
  +{
  +   return sprintf(page, %lld\n, bdi_stat(q-backing_dev_info, 
  BDI_RECLAIMABLE));
  +}
 
 We try to present memory statistics to userspace in bytes or kbytes rather
 than number-of-pages.  Because page-size varies between architectures and
 between .configs.  Displaying number-of-pages is just inviting people to write
 it-broke-when-i-moved-it-to-ia64 applications.
 
 Plus kbytes is a bit more user-friendly, particularly when the user will
 want to compare these numbers to /proc/meminfo, for example.
 
 Using %llu might be more appropriate than %lld.


Right, the biggest problem I actually have with this piece of code is
that it does not represent all BDIs in the system. For example, the BDI
of NFS mounts is not accessible.

Will fix the other points.
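
For the kbytes point, the conversion itself is simple; a minimal userspace
sketch (the names and the counter value are made up, and the real change would
go into the sysfs show function):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long long nr_reclaimable = 12345;	/* made-up page count */
	long page_kb = sysconf(_SC_PAGESIZE) / 1024;

	/* report in kbytes, so the number is comparable to /proc/meminfo */
	printf("%llu\n", (unsigned long long)(nr_reclaimable * page_kb));
	return 0;
}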



Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 02:55 -0700, Andrew Morton wrote:
 On Fri, 20 Apr 2007 17:52:04 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  Scale writeback cache per backing device, proportional to its writeout 
  speed.
  
  By decoupling the BDI dirty thresholds a number of problems we currently 
  have
  will go away, namely:
  
   - mutual interference starvation (for any number of BDIs);
   - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).
  
  It might be that all dirty pages are for a single BDI while other BDIs are
  idling. By giving each BDI a 'fair' share of the dirty limit, each one can 
  have
  dirty pages outstanding and make progress.
  
  A global threshold also creates a deadlock for stacked BDIs; when A writes 
  to
  B, and A generates enough dirty pages to get throttled, B will never start
  writeback until the dirty pages go away. Again, by giving each BDI its own
  'independent' dirty limit, this problem is avoided.
  
  So the problem is to determine how to distribute the total dirty limit 
  across
  the BDIs fairly and efficiently. A DBI that has a large dirty limit but does
  not have any dirty pages outstanding is a waste.
  
  What is done is to keep a floating proportion between the DBIs based on
  writeback completions. This way faster/more active devices get a larger 
  share
  than slower/idle devices.
 
 This is a pretty major improvement to various nasty corner-cases, if it
 works.
 
 Does it work?  Please describe the testing you did, and the results.

The testing I did was several dd instances racing each other to various
devices; usually one in a loop and the other a single, timed, instance.

I tested, disk vs disk, disk vs usbstick, disk vs nfs-mount.

Using the debug patch from the last series (the one which exposed the
actual ratio assigned and the total), I monitored (where possible) that
the ratio tracked the relative writeout speeds.

The main indicator was that the writes should complete in roughly the
same time as if they were done on an idle system.

The disk vs usbstick case gave the most dramatic improvement; on mainline the
usbstick is totally starved by a heavy disk writer, while with these patches
it takes about the same time as it would on an idle system.

Along with the first series was a number of results; those still stand.

 Has this been confirmed to fix Miklos's FUSE and loopback problems?

I must defer to Miklos for that.

  Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
  ---
   include/linux/backing-dev.h |   51 
   mm/backing-dev.c|3 
   mm/page-writeback.c |  181 
  
   3 files changed, 206 insertions(+), 29 deletions(-)
  
  Index: linux-2.6/include/linux/backing-dev.h
  ===
  --- linux-2.6.orig/include/linux/backing-dev.h  2007-04-20 
  15:28:17.0 +0200
  +++ linux-2.6/include/linux/backing-dev.h   2007-04-20 15:33:59.0 
  +0200
  @@ -28,6 +28,7 @@ typedef int (congested_fn)(void *, int);
   enum bdi_stat_item {
  BDI_RECLAIMABLE,
  BDI_WRITEBACK,
  +   BDI_WRITEOUT,
  NR_BDI_STAT_ITEMS
   };
 
 Whoa, head is now swimming.  What's the difference between writeback and
 writeout?

writeback is the number of pages in the writeback state.

writeout is a relative proportion (against all other BDIs) of completed
writeouts.

  @@ -43,6 +44,13 @@ struct backing_dev_info {
  void *unplug_io_data;
   
  struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
  +
  +   /*
  +* data used for scaling the writeback cache
  +*/
  +   spinlock_t lock;/* protect the cycle count */
  +   unsigned long cycles;   /* writeout cycles */
  +   int dirty_exceeded;
   };
   
   void bdi_init(struct backing_dev_info *bdi);
  @@ -54,6 +62,12 @@ static inline void __mod_bdi_stat(struct
  percpu_counter_mod(bdi-bdi_stat[item], amount);
   }
   
  +static inline void __mod_bdi_stat64(struct backing_dev_info *bdi,
  +   enum bdi_stat_item item, s64 amount)
  +{
  +   percpu_counter_mod64(bdi-bdi_stat[item], amount);
  +}
  +
   static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
  enum bdi_stat_item item)
   {
  @@ -86,12 +100,49 @@ static inline void dec_bdi_stat(struct b
  local_irq_restore(flags);
   }
   
  +static inline s64 __bdi_stat(struct backing_dev_info *bdi,
  +   enum bdi_stat_item item)
  +{
  +   return percpu_counter_read(bdi-bdi_stat[item]);
  +}
  +
   static inline s64 bdi_stat(struct backing_dev_info *bdi,
  enum bdi_stat_item item)
   {
  return percpu_counter_read_positive(bdi-bdi_stat[item]);
   }
 
 So here, the __ means it doesn't do the force-it-positive treatment.
 
  +static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
  +   enum bdi_stat_item item)
  +{
  +   return percpu_counter_sum(bdi-bdi_stat[item]);
  +}
  +
  +static inline s64 bdi_stat_sum(struct

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-21 Thread Peter Zijlstra

   +/*
   + * maximal error of a stat counter.
   + */
   +static inline unsigned long bdi_stat_delta(void)
   +{
   +#ifdef CONFIG_SMP
   + return NR_CPUS * FBC_BATCH;
  
  This is enormously wrong for CONFIG_NR_CPUS=1024 on a 2-way.

Right, I knew about that but, uhm.

I wanted to make that num_online_cpus(), and install a hotplug notifier
to fold the percpu delta back into the total on cpu offline.

But I have to look into doing that hotplug notifier stuff.



[RFC][PATCH] reiserfs vs BKL

2007-04-21 Thread Peter Zijlstra
Hi all,

Obviously nobody still uses reiserfs; or at least not with
PREEMPT_BKL=n.

We seem to have grown all kinds of scheduling assumptions all over that
code. An excerpt from my bootlog when I tried:


 | preempt count: 0001 ]
 | 1-level deep critical section nesting:
 
 .. [_spin_lock+16/112]  _spin_lock+0x10/0x70
 .[__rcu_process_callbacks+397/432] ..   ( = 
__rcu_process_callbacks+0x18d/0x1b0)

  [show_trace_log_lvl+33/80] show_trace_log_lvl+0x21/0x50
  [show_trace+18/32] show_trace+0x12/0x20
  [dump_stack+22/32] dump_stack+0x16/0x20
  [schedule+1903/2096] schedule+0x76f/0x830
  [io_schedule+34/48] io_schedule+0x22/0x30
  [sync_buffer+53/64] sync_buffer+0x35/0x40
  [__wait_on_bit+69/112] __wait_on_bit+0x45/0x70
  [out_of_line_wait_on_bit+109/128] out_of_line_wait_on_bit+0x6d/0x80
  [__wait_on_buffer+39/48] __wait_on_buffer+0x27/0x30
  [pg0+944846854/1069073408] search_by_key+0x156/0x11d0 [reiserfs]
  [pg0+944764484/1069073408] reiserfs_read_locked_inode+0x64/0x550 [reiserfs]
  [pg0+944765867/1069073408] reiserfs_iget+0x7b/0x90 [reiserfs]
  [pg0+944752186/1069073408] reiserfs_lookup+0xca/0x130 [reiserfs]
  [do_lookup+305/368] do_lookup+0x131/0x170
  [__link_path_walk+2077/3600] __link_path_walk+0x81d/0xe10
  [link_path_walk+70/208] link_path_walk+0x46/0xd0
  [do_path_lookup+139/464] do_path_lookup+0x8b/0x1d0
  [__path_lookup_intent_open+74/144] __path_lookup_intent_open+0x4a/0x90
  [path_lookup_open+33/48] path_lookup_open+0x21/0x30
  [open_namei+98/1568] open_namei+0x62/0x620
  [do_filp_open+44/80] do_filp_open+0x2c/0x50
  [do_sys_open+71/224] do_sys_open+0x47/0xe0
  [sys_open+28/32] sys_open+0x1c/0x20
  [sysenter_past_esp+106/149] sysenter_past_esp+0x6a/0x95
  ===


 ---
 | preempt count: 0001 ]
 | 1-level deep critical section nesting:
 
 .. [lock_kernel+28/176]  lock_kernel+0x1c/0xb0
 .[pg0+944752069/1069073408] ..   ( = reiserfs_lookup+0x55/0x130 
[reiserfs])

  [show_trace_log_lvl+33/80] show_trace_log_lvl+0x21/0x50
  [show_trace+18/32] show_trace+0x12/0x20
  [dump_stack+22/32] dump_stack+0x16/0x20
  [schedule+1903/2096] schedule+0x76f/0x830
  [__cond_resched+18/48] __cond_resched+0x12/0x30
  [cond_resched+42/64] cond_resched+0x2a/0x40
  [pg0+944846892/1069073408] search_by_key+0x17c/0x11d0 [reiserfs]
  [pg0+944766152/1069073408] reiserfs_update_sd_size+0x108/0x320 [reiserfs]
  [pg0+944811796/1069073408] reiserfs_dirty_inode+0x74/0xa0 [reiserfs]
  [__mark_inode_dirty+48/400] __mark_inode_dirty+0x30/0x190
  [touch_atime+142/240] touch_atime+0x8e/0xf0
  [__link_path_walk+2980/3600] __link_path_walk+0xba4/0xe10
  [link_path_walk+70/208] link_path_walk+0x46/0xd0
  [do_path_lookup+139/464] do_path_lookup+0x8b/0x1d0
  [__path_lookup_intent_open+74/144] __path_lookup_intent_open+0x4a/0x90
  [path_lookup_open+33/48] path_lookup_open+0x21/0x30
  [open_exec+39/176] open_exec+0x27/0xb0
  [do_execve+56/544] do_execve+0x38/0x220
  [sys_execve+46/144] sys_execve+0x2e/0x90
  [sysenter_past_esp+106/149] sysenter_past_esp+0x6a/0x95
  ===



And the sole reason I tried that was because I managed to get 1 second
(yes _second_) latencies due to BKL priority inversion.

Since this is not really a workable situation I propose we do something
about it; below is my attempt. However, since I know absolutely nothing
about reiserfs, it might well be very wrong and will eat all your data
(babies and favourite pets too while we're at it).

---
Replace all the lock_kernel() instances with reiserfs_write_lock(sb),
and make that use an actual per super-block mutex instead of
lock_kernel().

This should make reiserfs safe from PREEMPT_BKL=n, since it seems to
rely on being able to schedule. Also, it removes the dependency on the
BKL, and thereby is not prone to cause prio inversion with remaining BKL
users (notably tty).

Compile tested only, since I didn't dare boot it.

NOT-Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/reiserfs/inode.c|4 ++--
 fs/reiserfs/journal.c  |   12 ++--
 fs/reiserfs/super.c|1 +
 fs/reiserfs/xattr.c|   16 
 include/linux/reiserfs_fs.h|4 ++--
 include/linux/reiserfs_fs_sb.h |2 ++
 6 files changed, 21 insertions(+), 18 deletions(-)

Index: linux-2.6-twins/fs/reiserfs/inode.c
===
--- linux-2.6-twins.orig/fs/reiserfs/inode.c
+++ linux-2.6-twins/fs/reiserfs/inode.c
@@ -485,10 +485,10 @@ static int reiserfs_get_blocks_direct_io
   disappeared */
	if (REISERFS_I(inode)->i_flags & i_pack_on_close_mask) {
		int err;
-   lock_kernel();
+   reiserfs_write_lock(inode->i_sb);
		err = reiserfs_commit_for_inode(inode);
		REISERFS_I(inode)->i_flags &= ~i_pack_on_close_mask;
-   unlock_kernel

Re: [RFC][PATCH] reiserfs vs BKL

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 12:14 -0400, Jeff Mahoney wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 Peter Zijlstra wrote:
  Replace all the lock_kernel() instances with reiserfs_write_lock(sb),
  and make that use an actual per super-block mutex instead of
  lock_kernel().
  
  This should make reiserfs safe from PREEMPT_BKL=n, since it seems to
  rely on being able to schedule. Also, it removes the dependency on the
  BKL, and thereby is not prone to cause prio inversion with remaining BKL
  users (notably tty).
  
  Compile tested only, since I didn't dare boot it.
 
 NACK.

darn, I suspected it would be wrong :-/, it was too easy to be right.

 Believe me, I would *love* to nuke the BKL from reiserfs, but a search
 and replace of this nature is just wrong. reiserfs_write_lock() using
 the BKL isn't an accident - it depends on its nesting properties. If you
 did try to boot this kernel, you'd deadlock pretty quickly.

Right, I see. 

 This one has been on my TODO list for a long time. Interestingly, I've
 been doing reiserfs xattr development recently using 2.6.21-rc7-git2,
 and I'm not seeing any of these messages.

Yeah, that would be pretty close to what I ran.

 # CONFIG_PREEMPT_NONE is not set
 CONFIG_PREEMPT_VOLUNTARY=y
 # CONFIG_PREEMPT is not set
 # CONFIG_PREEMPT_BKL is not set

I have CONFIG_PREEMPT=y, but seeing how one of those points came from a
cond_resched() within a lock_kernel() section, I'm not seeing how you
don't get these.

Ah well, I really do hope you come up with something some day; for me it's
time to go change filesystems - bah, the pain...



Re: [PATCH 04/10] lib: percpu_counter_mod64

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 12:21 -0700, Andrew Morton wrote:
 On Sat, 21 Apr 2007 13:02:26 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
+   cpu = get_cpu();
+   pcount = per_cpu_ptr(fbc-counters, cpu);
+   count = *pcount + amount;
+   if (count = FBC_BATCH || count = -FBC_BATCH) {
+   spin_lock(fbc-lock);
+   fbc-count += count;
+   *pcount = 0;
+   spin_unlock(fbc-lock);
+   } else {
+   *pcount = count;
+   }
+   put_cpu();
+}
+EXPORT_SYMBOL(percpu_counter_mod64);
   
   Bloaty.  Surely we won't be needing this on 32-bit kernels?  Even monster
   PAE has only 64,000,000 pages and won't be using deltas of more than 4
   gigapages?
   
   Does even 64-bit need to handle 4 gigapages in a single hit?  /me 
   suspects
   another changelog bug
  
  Yeah, /me chastises himself for that...
  
  This is because percpu_counter is s64 instead of the native long; I need
  to halve the counter at some point (bdi_writeout_norm) and do that by
  subtracting half the current value.
 
 ah, the mysterious bdi_writeout_norm().
 
 I don't think it's possible to precisely halve a percpu_counter - there has
 to be some error involved.  I guess that's acceptable within the
 inscrutable bdi_writeout_norm().
 
 otoh, there's a chance that the attempt to halve the counter will take the
 counter negative, due to races.  Does the elusive bdi_writeout_norm()
 handle that?  If not, it should.  If it does, then there should be comments
 around the places where this is being handled, because it is subtle, and 
 unobvious,
 and others might break it by accident.

The counter it is halving is only ever incremented, so we might be off a
little, but only to the safe side.

I shall do the comment thing along with all the other missing
comments :-)



Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-21 Thread Peter Zijlstra
On Sat, 2007-04-21 at 14:15 +0200, Peter Zijlstra wrote:
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_delta(void)
+{
+#ifdef CONFIG_SMP
+   return NR_CPUS * FBC_BATCH;
   
   This is enormously wrong for CONFIG_NR_CPUS=1024 on a 2-way.
 
 Right, I knew about that but, uhm.
 
 I wanted to make that num_online_cpus(), and install a hotplug notifier
 to fold the percpu delta back into the total on cpu offline.
 
 But I have to look into doing that hotplug notifier stuff.

Something like this should do I think, I just looked at other hotplug
code and imitated the pattern.

I assumed CONFIG_HOTPLUG_CPU requires CONFIG_SMP, I didn't actually try
that one :-)

---

In order to estimate the per stat counter error more accurately, using
num_online_cpus() instead of NR_CPUS, install a cpu hotplug notifier
(when cpu hotplug is enabled) that flushes whatever percpu delta was
present into the total on cpu unplug.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h|6 -
 include/linux/percpu_counter.h |1 
 lib/percpu_counter.c   |   11 +
 mm/backing-dev.c   |   47 +
 4 files changed, 64 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h  2007-04-21 21:32:49.0 
+0200
+++ linux-2.6/include/linux/backing-dev.h   2007-04-21 21:33:28.0 
+0200
@@ -51,6 +51,10 @@ struct backing_dev_info {
spinlock_t lock;/* protect the cycle count */
unsigned long cycles;   /* writeout cycles */
int dirty_exceeded;
+
+#ifdef CONFIG_HOTPLUG_CPU
+   struct notifier_block hotplug_nb;
+#endif
 };
 
 void bdi_init(struct backing_dev_info *bdi);
@@ -137,7 +141,7 @@ static inline s64 bdi_stat_sum(struct ba
 static inline unsigned long bdi_stat_delta(void)
 {
 #ifdef CONFIG_SMP
-   return NR_CPUS * FBC_BATCH;
+   return num_online_cpus() * FBC_BATCH;
 #else
return 1UL;
 #endif
Index: linux-2.6/include/linux/percpu_counter.h
===
--- linux-2.6.orig/include/linux/percpu_counter.h   2007-04-21 
21:32:49.0 +0200
+++ linux-2.6/include/linux/percpu_counter.h2007-04-21 21:33:17.0 
+0200
@@ -38,6 +38,7 @@ static inline void percpu_counter_destro
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
 void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
+void percpu_counter_fold(struct percpu_counter *fbx, int cpu);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
Index: linux-2.6/lib/percpu_counter.c
===
--- linux-2.6.orig/lib/percpu_counter.c 2007-04-21 21:32:49.0 +0200
+++ linux-2.6/lib/percpu_counter.c  2007-04-21 21:33:17.0 +0200
@@ -72,3 +72,14 @@ s64 percpu_counter_sum(struct percpu_cou
	return ret < 0 ? 0 : ret;
 }
 EXPORT_SYMBOL(percpu_counter_sum);
+
+void percpu_counter_fold(struct percpu_counter *fbc, int cpu)
+{
+   s32 *pcount = per_cpu_ptr(fbc-counters, cpu);
+   if (*pcount) {
+   spin_lock(fbc-lock);
+   fbc-count += *pcount;
+   *pcount = 0;
+   spin_unlock(fbc-lock);
+   }
+}
Index: linux-2.6/mm/backing-dev.c
===
--- linux-2.6.orig/mm/backing-dev.c 2007-04-21 21:32:49.0 +0200
+++ linux-2.6/mm/backing-dev.c  2007-04-21 21:34:47.0 +0200
@@ -4,6 +4,49 @@
 #include <linux/fs.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/cpu.h>
+
+#ifdef CONFIG_HOTPLUG_CPU
+static int bdi_stat_fold(struct notifier_block *nb,
+   unsigned long action, void *hcpu)
+{
+   struct backing_dev_info *bdi =
+   container_of(nb, struct backing_dev_info, hotplug_nb);
+   unsigned long flags;
+   int cpu = (unsigned long)hcpu;
+   int i;
+
+   if (action == CPU_DEAD) {
+   local_irq_save(flags);
+   for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+   percpu_counter_fold(bdi-bdi_stat[i], cpu);
+   local_irq_restore(flags);
+   }
+   return NOTIFY_OK;
+}
+
+static void bdi_init_hotplug(struct backing_dev_info *bdi)
+{
+   bdi-hotplug_nb = (struct notifier_block){
+   .notifier_call = bdi_stat_fold,
+   .priority = 0,
+   };
+   register_hotcpu_notifier(bdi-hotplug_nb);
+}
+
+static void bdi_destroy_hotplug(struct backing_dev_info *bdi)
+{
+   unregister_hotcpu_notifier(bdi-hotplug_nb);
+}
+#else
+static void bdi_init_hotplug(struct backing_dev_info *bdi)
+{
+}
+
+static void bdi_destroy_hotplug(struct backing_dev_info

Re: Loud pop coming from hard drive on reboot

2007-04-21 Thread Peter Zijlstra
On Wed, 2007-04-18 at 17:27 -0400, Chuck Ebbert wrote:
 Bartlomiej Zolnierkiewicz wrote:
  On Wednesday 18 April 2007, Chuck Ebbert wrote:
  Mark Lord wrote:
  Mark Lord wrote:
  With the patch applied, I don't see *any* new activity in those
  S.M.A.R.T.
  attributes over multiple hibernates (Linux suspend-to-disk).
  Scratch that -- operator failure.  ;)
  The patch makes no difference over hibernates in the SMART logs.
 
  It's still logging extra Power-Off_Retract_Count pegs,
  which it DID NOT USED TO DO not so long ago.
 
  Just to add to the fun, my problems are happening with the old
  IDE drivers...
  
  The issue you are experiencing results in the same problem (disk doing
  power off retract) but it has a totally different root cause - your notebook
  loses power on reboot.  It is actually a hardware problem and as you have
  reported the same problem is present when using the other OS.
  
 
 My power off retract count increases whether I do a halt/poweroff or
 a reboot. The only difference is the volume of the noise.
 
 And I just noticed my seek error rate is increasing.
 
 /me plans purchase of another drive, definitely not Seagate...
 
  I think that the issue needs to be fixed (by detecting affected notebook(s)
  using DMI?) in Linux PM handling and not in IDE subsystem because:
 
  * there may be some other hardware devices affected by the power loss
(== they require shutdown sequence)
 
  * the same problem will bite if somebody decides to use libata (FC7?)
 
 Yeah, this needs fixing too. I've been playing with another notebook and
 the power does stay on during reboot, so I wonder how widespread the problem 
 is?

/me too

Thinkpad T23, with a ST980815A

Ticks every few seconds, but seems to mostly go away with 
  hdparm -B255 /dev/sda1

but I have an increasing seek error rate as well. I got the ST disk
because thinkwiki suggested it.

I suspect this problem killed the previous disk in this laptop.



Re: [PATCH 08/10] mm: count writeback pages per BDI

2007-04-22 Thread Peter Zijlstra
On Sun, 2007-04-22 at 00:19 -0700, Andrew Morton wrote:

 It could be that we never call test_clear_page_writeback() against
 !bdi_cap_writeback_dirty() pages anwyay.  I can't think why we would, but
 the relationships there aren't very clear.  Does don't account for dirty
 memory imply doesn't ever do writeback?  One would need to check, and
 it's perhaps a bit fragile.

I did, that's how that test ended up there; I guess a comment would have
been a good thing, no? :-)

end_swap_bio_write() calls end_page_writeback(), and
swap_backing_dev_info has neither cap_writeback nor cap_account_dirty.
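
A tiny sketch of the capability test in question (the flag values are made up
for illustration; the real definitions live in backing-dev.h): a BDI that
opted out of writeback accounting, like the swap BDI, is simply skipped.

#include <stdio.h>

#define BDI_CAP_NO_WRITEBACK	0x1
#define BDI_CAP_NO_ACCT_DIRTY	0x2

struct backing_dev_info {
	unsigned int capabilities;
};

static int bdi_cap_writeback_dirty(const struct backing_dev_info *bdi)
{
	return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
}

int main(void)
{
	struct backing_dev_info swap_bdi = {
		BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY
	};
	struct backing_dev_info disk_bdi = { 0 };

	printf("swap accounted: %d\n", bdi_cap_writeback_dirty(&swap_bdi));
	printf("disk accounted: %d\n", bdi_cap_writeback_dirty(&disk_bdi));
	return 0;
}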

 It's worth checking though.  Boy we're doing a lot of stuff in there
 nowadays.
 
 OT: it might be worth looking into batching this work up - the predominant
 caller should be mpage_end_io_write(), and he has a whole bunch of pages
 which are usually all from the same file, all contiguous.  It's pretty
 inefficient to be handling that data one-page-at-a-time, and some
 significant speedups may be available.

Right, that might be a good spot to hook into, I'll have a look.

 Instead, everyone seems to think that variable pagecache page size is the
 only way of improving things.  Shudder.

hehe, I guess you haven't looked at my concurrent pagecache patches yet
either :-)



Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-23 Thread Peter Zijlstra
On Sat, 2007-04-21 at 22:25 +0200, Miklos Szeredi wrote: 
   The other deadlock, in throttle_vm_writeout() is still to be solved.
  
  Let's go back to the original changelog:
  
  Author: marcelo.tosatti marcelo.tosatti
  Date:   Tue Mar 8 17:25:19 2005 +
  
  [PATCH] vm: pageout throttling
  
  With silly pageout testcases it is possible to place huge amounts of 
  memory
  under I/O.  With a large request queue (CFQ uses 8192 requests) it is
  possible to place _all_ memory under I/O at the same time.
  
  This means that all memory is pinned and unreclaimable and the VM gets
  upset and goes oom.
  
  The patch limits the amount of memory which is under pageout writeout 
  to be
  a little more than the amount of memory at which balance_dirty_pages()
  callers will synchronously throttle.
  
  This means that heavy pageout activity can starve heavy writeback 
  activity
  completely, but heavy writeback activity will not cause starvation of
  pageout.  Because we don't want a simple `dd' to be causing excessive
  latencies in page reclaim.
  
  Signed-off-by: Andrew Morton [EMAIL PROTECTED]
  Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
  
  (A good one!  I wrote it ;))
  
  
  I believe that the combination of dirty-page-tracking and its calls to
  balance_dirty_pages() mean that we can now never get more than dirty_ratio
  of memory into the dirty-or-writeback condition.
  
  The vm scanner can convert dirty pages into clean, under-writeback pages,
  but it cannot increase the total of dirty+writeback.
 
 What about swapout?  That can increase the number of writeback pages,
 without decreasing the number of dirty pages, no?

Could we not solve that by enabling cap_account_writeback on
swapper_space, and thereby account swap writeback pages? Then the VM
knows it has outstanding IO and need not panic.



Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-23 Thread Peter Zijlstra
On Mon, 2007-04-23 at 08:48 -0700, Christoph Lameter wrote:
 On Sat, 21 Apr 2007, Peter Zijlstra wrote:
 
This is enormously wrong for CONFIG_NR_CPUS=1024 on a 2-way.
  
  Right, I knew about that but, uhm.
  
  I wanted to make that num_online_cpus(), and install a hotplug notifier
  to fold the percpu delta back into the total on cpu offline.
 
 Use nr_cpu_ids instead. Contains the maximum possible cpus on this 
 hardware and allows to handle the hotplug case easily.

Ooh, that's handy... /me ditches the hotplug code again.
That is, unless it's very common to have half empty boxens.. ?



Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 12:58 +1000, Neil Brown wrote:
 On Friday April 20, [EMAIL PROTECTED] wrote:
  Scale writeback cache per backing device, proportional to its writeout 
  speed.
 
 So it works like this:
 
  We account for writeout in full pages.
  When a page has the Writeback flag cleared, we account that as a
  successfully retired write for the relevant bdi.
  By using floating averages we keep track of how many writes each bdi
  has retired 'recently' where the unit of time in which we understand
  'recently' is a single page written.

That is actually the period I keep referring to. So 'recently' is the
last 'period' number of writeout completions.

  We keep a floating average for each bdi, and a floating average for
  the total writeouts (that 'average' is, of course, 1.)

1 in the sense of unity, yes :-)

  Using these numbers we can calculate what fraction of 'recently'
  retired writes were retired by each bdi (get_writeout_scale).
 
  Multiplying this fraction by the system-wide number of pages that are
  allowed to be dirty before write-throttling, we get the number of
  pages that the bdi can have dirty before write-throttling the bdi.
 
  I note that the same fraction is *not* applied to background_thresh.
  Should it be?  I guess not - there would be interesting starting
  transients, as a bdi which had done no writeout would not be allowed
  any dirty pages, so background writeout would start immediately,
  which isn't what you want... or is it?

This is something I have not yet been able to come to a conclusive
answer on...

  For each bdi we also track the number of (dirty, writeback, unstable)
  pages and do not allow this to exceed the limit set for this bdi.
 
  The calculations involving 'reserve' in get_dirty_limits are a little
  confusing.  It looks like you calculating how much total head-room
  there is for the bdi (pages that the system can still dirty - pages
  this bdi has dirty) and making sure the number returned in pbdi_dirty
  doesn't allow more than that to be used.  

Yes, it limits the earned share of the total dirty limit to the possible
share, ensuring that the total dirty limit is never exceeded.

This is especially relevant when the proportions change faster than the
pages get written out, ie. when the period < the total dirty limit.
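
Roughly, that clipping looks like the following; the names are invented, but
the shape matches the description: the share earned via the writeout fraction
(scale/div) is capped by the head-room other devices have left under the
global limit.

        #include <stdio.h>

        static unsigned long bdi_dirty_limit(unsigned long dirty_thresh, /* global limit        */
                                             unsigned long nr_total,     /* dirty+writeback, all */
                                             unsigned long nr_bdi,       /* dirty+writeback, bdi */
                                             unsigned long scale, unsigned long div)
        {
                /* earned share: fraction of recent writeouts times the global limit */
                unsigned long bdi_thresh = dirty_thresh * scale / div;
                /* possible share: what other devices have not already used up */
                unsigned long others = nr_total - nr_bdi;
                unsigned long reserve = dirty_thresh > others ? dirty_thresh - others : 0;

                return bdi_thresh < reserve ? bdi_thresh : reserve;
        }

        int main(void)
        {
                /* earned 80% of recent writeouts, but other devices hold 900 of 1000 */
                printf("%lu\n", bdi_dirty_limit(1000, 950, 50, 8, 10)); /* prints 100, not 800 */
                return 0;
        }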

 This is probably a
  reasonable thing to do but it doesn't feel like the right place.  I
  think get_dirty_limits should return the raw threshold, and
  balance_dirty_pages should do both tests - the bdi-local test and the
  system-wide test.

Ok, that makes sense I guess.

  Currently you have a rather odd situation where
 + if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 + break;
  might included numbers obtained with bdi_stat_sum being compared with
  numbers obtained with bdi_stat.

Yes, I was aware of that. The bdi_thresh is based on bdi_stat() numbers,
whereas the others could be bdi_stat_sum(). I think this is ok, since
the threshold is a 'guess' anyway, we just _need_ to ensure we do not
get trapped by writeouts not arriving (due to getting stuck in the per
cpu deltas).  -- I have all this commented in the new version.
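
In user-space miniature, the difference between the two reads (an illustration
of the idea, not the kernel's counter code): the cheap read returns only the
folded total and can be off by roughly NR_CPUS * threshold, while the
expensive read also sums the per-cpu deltas.

        #include <stdio.h>

        #define NR_CPUS   4
        #define THRESHOLD 8             /* per-cpu delta folded into the total at +/-8 */

        struct counter {
                long total;             /* the "approximately right" global value */
                long delta[NR_CPUS];    /* per-cpu contributions not yet folded in */
        };

        static void counter_add(struct counter *c, int cpu, long x)
        {
                c->delta[cpu] += x;
                if (c->delta[cpu] >= THRESHOLD || c->delta[cpu] <= -THRESHOLD) {
                        c->total += c->delta[cpu];
                        c->delta[cpu] = 0;
                }
        }

        static long counter_read(const struct counter *c)       /* like bdi_stat()     */
        {
                return c->total;
        }

        static long counter_read_sum(const struct counter *c)   /* like bdi_stat_sum() */
        {
                long sum = c->total;
                int cpu;

                for (cpu = 0; cpu < NR_CPUS; cpu++)
                        sum += c->delta[cpu];
                return sum;
        }

        int main(void)
        {
                struct counter c = { 0 };
                int i;

                for (i = 0; i < 10; i++)
                        counter_add(&c, i % NR_CPUS, 1);

                printf("approx %ld, exact %ld, error bound ~%d\n",
                       counter_read(&c), counter_read_sum(&c), NR_CPUS * THRESHOLD);
                return 0;
        }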

  With these patches, the VM still (I think) assumes that each BDI has
  a reasonable queue limit, so that writeback_inodes will block on a
  full queue.  If a BDI has a very large queue, balance_dirty_pages
  will simply turn lots of DIRTY pages into WRITEBACK pages and then
  think "We've done our duty" without actually blocking at all.

It will block once we exceed the total number of dirty pages allowed for
that BDI. But yes, this does not take away the need for queue limits.

This work was primarily aimed at allowing multiple queues to not
interfere as much, so they all can make progress and not get starved.

  With the extra accounting that we now have, I would like to see
  balance_dirty_pages wait until RECLAIMABLE+WRITEBACK is
  actually less than 'threshold'.  This would probably mean that we
  would need to support per-bdi background_writeout to smooth things
  out.  Maybe that is fodder for another patch-set.

Indeed, I still have to wrap my mind around the background thing. Your
input is appreciated.

  You set:
 + vm_cycle_shift = 1 + ilog2(vm_total_pages);
 
  Can you explain that?

You found the one random knob I hid :-)

   My experience is that scaling dirty limits
  with main memory isn't what we really want.  When you get machines
  with very large memory, the amount that you want to be dirty is more
  a function of the speed of your IO devices, rather than the amount
  of memory, otherwise you can sometimes see large filesystem lags
  ('sync' taking minutes?)
 
  I wonder if it makes sense to try to limit the dirty data for a bdi
  to the amount that it can write out in some period of time - maybe 3
  seconds.  Probably configurable.  You seem to have almost all the
  infrastructure in place to do that, and I think it 

Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 10:19 +0200, Miklos Szeredi wrote:
   This is probably a
reasonable thing to do but it doesn't feel like the right place.  I
think get_dirty_limits should return the raw threshold, and
balance_dirty_pages should do both tests - the bdi-local test and the
system-wide test.
  
  Ok, that makes sense I guess.
 
 Well, my narrow minded world view says it's not such a good idea,
 because it would again introduce the deadlock scenario, we're trying
 to avoid.

I was only referring to the placement of the clipping; and exactly where
that happens does not affect the deadlock.

 In a sense allowing a queue to go over the global limit just a little
 bit is a good thing.  Actually the very original code does that: if
 writeback was started for write_chunk number of pages, then we allow
 ratelimit (8) _new_ pages to be dirtied, effectively ignoring the
 global limit.

It might be time to get rid of that rate-limiting.
balance_dirty_pages()'s fast path is not nearly as heavy as it used to
be. All these fancy counter systems have removed quite a bit of
iteration from there.

 That's why I've been saying, that the current code is so unfair: if
 there are lots of dirty pages to be written back to a particular
 device, then balance_dirty_pages() allows the dirty producer to make
 even more pages dirty, but if there are _no_ dirty pages for a device,
 and we are over the limit, then that dirty producer is allowed
 absolutely no new dirty pages until the global counts subside.

Well, that got fixed on a per-device basis with this patch; it is still
true for multiple tasks writing to the same device.

 I'm still not quite sure what purpose the above soft limiting
 serves.  It seems to just give advantage to writers, which managed to
 accumulate lots of dirty pages, and then can convert that into even
 more dirtyings.

The queues only limit the actual in-flight writeback pages,
balance_dirty_pages() considers all pages that might become writeback as
well as those that are.

 Would it make sense to remove this behavior, and ensure that
 balance_dirty_pages() doesn't return until the per-queue limits have
 been complied with?

I don't think that will help; balance_dirty_pages drives the queues.
That is, it converts pages from merely dirty to writeback.
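
As a toy model of that interaction (all numbers invented): the throttled task
turns DIRTY into WRITEBACK by submitting I/O, the device retires WRITEBACK at
its own pace, and the task only gets out once dirty + writeback drops under
the threshold.

        #include <stdio.h>

        int main(void)
        {
                long dirty = 2000, writeback = 0, thresh = 1000;
                long write_chunk = 48;         /* what one pass submits              */
                long device_rate = 16;         /* what the device completes per pass */
                int iter = 0;

                while (dirty + writeback > thresh) {
                        long submit = dirty < write_chunk ? dirty : write_chunk;

                        dirty -= submit;       /* dirty -> writeback (I/O submitted) */
                        writeback += submit;

                        if (writeback > device_rate)
                                writeback -= device_rate;   /* I/O completed */
                        else
                                writeback = 0;
                        iter++;
                }
                printf("under the threshold after %d passes\n", iter);
                return 0;
        }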

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 11:14 +0200, Miklos Szeredi wrote:

   I'm still not quite sure what purpose the above soft limiting
   serves.  It seems to just give advantage to writers, which managed to
   accumulate lots of dirty pages, and then can convert that into even
   more dirtyings.
  
  The queues only limit the actual in-flight writeback pages,
  balance_dirty_pages() considers all pages that might become writeback as
  well as those that are.
  
   Would it make sense to remove this behavior, and ensure that
   balance_dirty_pages() doesn't return until the per-queue limits have
   been complied with?
  
  I don't think that will help, balance_dirty_pages drives the queues.
  That is, it converts pages from mere dirty to writeback.
 
 Yes.  But current logic says, that if you convert write_chunk dirty
 to writeback, you are allowed to dirty ratelimit more. 
 
 D: number of dirty pages
 W: number of writeback pages
 L: global limit
 C: write_chunk = ratelimit_pages * 1.5
 R: ratelimit
 
 If D+W >= L, then R = 8
 
 Let's assume, that D == L and W == 0.  And that all of the dirty pages
 belong to a single device.  Also for simplicity, lets assume an
 infinite length queue, and a slow device.
 
 Then while converting the dirty pages to writeback, D / C * R new
 dirty pages can be created.  So when all existing dirty have been
 converted:
 
   D = L / C * R
   W = L
 
   D + W = L * (1 + R / C)
 
 So we see, that we're now even more above the limit than before the
 conversion.  This means, that we starve writers to other devices,
 which don't have as many dirty pages, because until the slow device
 doesn't finish these writes they will not get to do anything.
 
 Your patch helps this in that if the other writers have an empty queue
 and no dirty, they will be allowed to slowly start writing.  But they
 will not gain their full share until the slow dirty-hog goes below the
 global limit, which may take some time.
 
 So I think the logical thing to do is: if the dirty-hog is over its
 queue limit, don't let it dirty any more until its dirty+writeback go
 below the limit.  That allows other devices to more quickly gain
 their share of dirty pages.

Ahh, now I see; I had totally blocked out these few lines:

pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break;  /* We've done our duty */

yeah, those look dubious indeed... And reading back Neil's comments, I
think he agrees.

Shall we just kill those?
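
For scale, plugging the defaults visible in this series into the bound Miklos
derives above: with ratelimit_pages at its initial value of 32, C = 48 and
R = 8, so starting from D = L and W = 0 the break lets D + W grow to
L * (1 + 8/48), i.e. roughly 17% past the global limit - small per pass, but
nothing in it requires any writeback to have actually completed.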

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 03:00 -0700, Andrew Morton wrote:
 On Tue, 24 Apr 2007 11:47:20 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:
 
   Ahh, now I see; I had totally blocked out these few lines:
   
 pages_written += write_chunk - wbc.nr_to_write;
 if (pages_written >= write_chunk)
 break;  /* We've done our duty */
   
   yeah, those look dubious indeed... And reading back Neil's comments, I
   think he agrees.
   
   Shall we just kill those?
  
  I think we should.
  
  Although I'm a little afraid, that Akpm will tell me again, that I'm a
  stupid git, and that those lines are in fact vitally important ;)
  
 
 It depends what they're replaced with.
 
 That code is there, iirc, to prevent a process from getting stuck in
 balance_dirty_pages() forever due to the dirtying activity of other
 processes.
 
 hm, we ask the process to write write_chunk pages each go around the loop.
 So if it wrote write_chunk/2 pages on the first pass it might end up writing
 write_chunk*1.5 pages total.  I guess that's rare and doesn't matter much
 if it does happen - the upper bound is write_chunk*2-1, I think.

Right, but I think the problem is that it's dirty -> writeback, not dirty
-> writeback completed.

Ie. they don't guarantee progress; it could be that the total
nr_reclaimable + nr_writeback will steadily increase due to this break.

How about ensuring that vm_writeout_total increases by at least
2*sync_writeback_pages() during our stay in balance_dirty_pages()? That
way we have the guarantee that more pages get written out than can be
dirtied.
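
Sketched out (illustrative pseudo-kernel code, not the actual patch;
vm_writeout_total is the hypothetical global completion counter from the
paragraph above), the idea would look something like this - and, as the
follow-up notes, a single global counter gets murky once several writers
watch it at the same time:

        static void balance_dirty_pages_sketch(struct address_space *mapping)
        {
                struct writeback_control wbc = {
                        .bdi            = mapping->backing_dev_info,
                        .sync_mode      = WB_SYNC_NONE,
                        .nr_to_write    = sync_writeback_pages(),
                };
                unsigned long start = vm_writeout_total;    /* assumed counter */

                for (;;) {
                        /* keep feeding this bdi's dirty pages to the device */
                        writeback_inodes(&wbc);

                        /* only return once enough writeback has *completed* */
                        if (vm_writeout_total - start >= 2 * sync_writeback_pages())
                                break;

                        congestion_wait(WRITE, HZ/10);
                        wbc.nr_to_write = sync_writeback_pages();
                }
        }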

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 10/10] mm: per device dirty threshold

2007-04-24 Thread Peter Zijlstra
On Tue, 2007-04-24 at 12:19 +0200, Miklos Szeredi wrote:
 Ahh, now I see; I had totally blocked out these few lines:
 
   pages_written += write_chunk - wbc.nr_to_write;
   if (pages_written >= write_chunk)
   break;  /* We've done our duty 
 */
 
 yeah, those look dubious indeed... And reading back Neil's comments, I
 think he agrees.
 
 Shall we just kill those?

I think we should.

Although I'm a little afraid, that Akpm will tell me again, that I'm a
stupid git, and that those lines are in fact vitally important ;)

   
   It depends what they're replaced with.
   
   That code is there, iirc, to prevent a process from getting stuck in
   balance_dirty_pages() forever due to the dirtying activity of other
   processes.
   
   hm, we ask the process to write write_chunk pages each go around the loop.
   So if it wrote write_chunk/2 pages on the first pass it might end up 
   writing
   write_chunk*1.5 pages total.  I guess that's rare and doesn't matter much
   if it does happen - the upper bound is write_chunk*2-1, I think.
  
  Right, but I think the problem is that it's dirty -> writeback, not dirty
  -> writeback completed.
  
  Ie. they don't guarantee progress; it could be that the total
  nr_reclaimable + nr_writeback will steadily increase due to this break.
  
  How about ensuring that vm_writeout_total increases by at least
  2*sync_writeback_pages() during our stay in balance_dirty_pages()? That
  way we have the guarantee that more pages get written out than can be
  dirtied.
 
 No, because that's a global counter, which many writers could be
 looking at.
 
 We'd need a per-task writeout counter, but when finishing the write we
 don't know anymore which task it was performed for.

Yeah, just reached that conclusion myself too - again, I ran into that
when trying to figure out how to do the per task balancing right.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/7] revoke: core code

2007-03-09 Thread Peter Zijlstra
On Fri, 2007-03-09 at 10:15 +0200, Pekka J Enberg wrote:

 +static int revoke_vma(struct vm_area_struct *vma, struct zap_details 
 *details)
 +{
 + unsigned long restart_addr, start_addr, end_addr;
 + int need_break;
 +
 + start_addr = vma->vm_start;
 + end_addr = vma->vm_end;
 +
 + /*
 +  * Not holding ->mmap_sem here.
 +  */
 + vma->vm_flags |= VM_REVOKED;
 + smp_mb();

Hmm, i_mmap_lock pins the vma and excludes modifications, but doesn't
exclude concurrent faults.

I guess it's safe.

 +  again:
 + restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr,
 +   details);
 +
 + need_break = need_resched() || need_lockbreak(details->i_mmap_lock);
 + if (need_break)
 + goto out_need_break;
 +
 + if (restart_addr < end_addr) {
 + start_addr = restart_addr;
 + goto again;
 + }
 + return 0;
 +
 +  out_need_break:
 + spin_unlock(details->i_mmap_lock);
 + cond_resched();
 + spin_lock(details->i_mmap_lock);
 + return -EINTR;

I'm not sure this scheme works; given a sufficiently loaded machine,
this might never complete.

 +}
 +
 +static int revoke_mapping(struct address_space *mapping, struct file 
 *to_exclude)
 +{
 + struct vm_area_struct *vma;
 + struct prio_tree_iter iter;
 + struct zap_details details;
 + int err = 0;
 +
 + details.i_mmap_lock = &mapping->i_mmap_lock;
 +
 + spin_lock(&mapping->i_mmap_lock);
 + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
 + if (vma->vm_flags & VM_SHARED && vma->vm_file != to_exclude) {

I'm never sure of operator precedence and prefer:

 (vma->vm_flags & VM_SHARED) && ...

which leaves no room for error.

 + err = revoke_vma(vma, details);
 + if (err)
 + goto out;
 + }
 + }
 +
 + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, 
 shared.vm_set.list) {
 + if (vma->vm_flags & VM_SHARED && vma->vm_file != to_exclude) {

Idem.

 + err = revoke_vma(vma, details);
 + if (err)
 + goto out;
 + }
 + }
 +  out:
 + spin_unlock(&mapping->i_mmap_lock);
 + return err;
 +}


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/3] swsusp: Stop using page flags

2007-03-11 Thread Peter Zijlstra
On Sun, 2007-03-11 at 11:17 +0100, Rafael J. Wysocki wrote:
 Hi,
 
 The following three patches make swsusp use its own data structures for memory
 management instead of special page flags.  Thus the page flags used so far by
 swsusp (PG_nosave, PG_nosave_free) can be used for other purposes and I 
 believe
 there are some urgent needs for them. :-)
 
 Last week I sent these patches to the linux-pm and linux-mm lists and there
 were no negative comments.  Also I've been testing them on my x86_64 boxes for
 a few days and apparently they don't break anything.  I think they can go into
 -mm for testing.
 
 Comments are welcome.

These patches have my blessing, they look good to me, but I'm not much
involved with the swsusp code, so I won't ACK them.

Again, thanks a bunch for freeing up 2 page flags :-)

Peter

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)

2007-03-11 Thread Peter Zijlstra
On Sun, 2007-03-11 at 15:50 +0200, Michael S. Tsirkin wrote:
  Quoting Roland Dreier [EMAIL PROTECTED]:
  Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!
  
  Feb 27 17:47:52 sw169 kernel:  [8053aaf1] 
  _spin_lock_irqsave+0x15/0x24
  Feb 27 17:47:52 sw169 kernel:  [88067a23] 
  :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139
  
  It looks like this is deadlocking trying to take priv-lock in 
  ipoib_neigh_destructor().
  One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING 
  turned on, and then rerun this test.  There's a good chance that this would
  diagnose the deadlock.  (I don't have good access to my test machines right 
  now, or
  else I would do it myself)
 
 OK, I did that. But I get
   [13440.761857] INFO: trying to register non-static key.
   [13440.766903] the code is fine but needs lockdep annotation.
   [13440.772455] turning off the locking correctness validator.
 and I am not sure what triggers this, or how to fix it to have the
 validator actually do its job.

It usually indicates a spinlock that is not properly initialized, like
__SPIN_LOCK_UNLOCKED() used in a non-static context; use
spin_lock_init() in such cases.

However, looking at the code, ipoib_neigh_destructor only uses
priv->lock, and that seems to get properly initialized in ipoib_setup()
using spin_lock_init().

So either there are other sites that instantiate those objects and
forget about the lock init, or the object is corrupted (use after free?).
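
The typical shape of the bug that warning points at (illustrative, not the
ipoib code): a lock living in runtime-allocated memory has no static key for
lockdep to hang a class off, so it must go through spin_lock_init().

        #include <linux/slab.h>
        #include <linux/spinlock.h>

        struct foo {
                spinlock_t lock;
        };

        static DEFINE_SPINLOCK(static_lock);    /* fine: static storage, static key */

        static struct foo *foo_alloc(void)
        {
                struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

                if (!f)
                        return NULL;
                /*
                 * Leaving f->lock as whatever kmalloc() handed back, or assigning
                 * SPIN_LOCK_UNLOCKED to it, is what produces "trying to register
                 * non-static key"; dynamically allocated locks want this instead:
                 */
                spin_lock_init(&f->lock);
                return f;
        }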

 Ingo, what key does the message refer to?
 
 The stack dump seems to point to drivers/infiniband/ulp/ipoib/ipoib_main.c 
 line
 829.
 
 Full message below:
   
 [13440.761857] INFO: trying to register non-static key.
 [13440.766903] the code is fine but needs lockdep annotation.
 [13440.772455] turning off the locking correctness validator.
 [13440.778008]  [c023c082] __lock_acquire+0xae4/0xbb9
 [13440.783078]  [c023c43d] lock_acquire+0x56/0x71
 [13440.787784]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13440.794412]  [c051ad41] _spin_lock_irqsave+0x32/0x41
 [13440.799649]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13440.806275]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13440.812897]  [c04a1c1b] dst_run_gc+0xc/0x118
 [13440.817439]  [c022af6e] run_timer_softirq+0x37/0x16b
 [13440.822673]  [c04a1c0f] dst_run_gc+0x0/0x118
 [13440.827221]  [c04a3eab] neigh_destroy+0xbe/0x104
 [13440.832114]  [c04a1bb1] dst_destroy+0x4d/0xab
 [13440.836751]  [c04a1c64] dst_run_gc+0x55/0x118
 [13440.841384]  [c022b03f] run_timer_softirq+0x108/0x16b
 [13440.846711]  [c0227634] __do_softirq+0x5a/0xd5
 [13440.851427]  [c023b435] trace_hardirqs_on+0x106/0x141
 [13440.856754]  [c0227643] __do_softirq+0x69/0xd5
 [13440.861470]  [c02276e6] do_softirq+0x37/0x4d
 [13440.866016]  [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
 [13440.871774]  [c02029ef] default_idle+0x3b/0x54
 [13440.876491]  [c02029ef] default_idle+0x3b/0x54
 [13440.881211]  [c0204c33] apic_timer_interrupt+0x33/0x38
 [13440.886624]  [c02029ef] default_idle+0x3b/0x54
 [13440.891342]  [c02029f1] default_idle+0x3d/0x54
 [13440.896061]  [c0202aaa] cpu_idle+0xa2/0xbb
 [13440.900436]  ===
 [13768.711447] BUG: spinlock lockup on CPU#1, swapper/0, c0687880
 [13768.717353]  [c031f919] _raw_spin_lock+0xda/0xfd
 [13768.722247]  [c051ad48] _spin_lock_irqsave+0x39/0x41
 [13768.727486]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13768.734110]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13768.740735]  [c04a1c1b] dst_run_gc+0xc/0x118
 [13768.745276]  [c022af6e] run_timer_softirq+0x37/0x16b
 [13768.750517]  [c04a1c0f] dst_run_gc+0x0/0x118
 [13768.755061]  [c04a3eab] neigh_destroy+0xbe/0x104
 [13768.759955]  [c04a1bb1] dst_destroy+0x4d/0xab
 [13768.764586]  [c04a1c64] dst_run_gc+0x55/0x118
 [13768.769218]  [c022b03f] run_timer_softirq+0x108/0x16b
 [13768.774542]  [c0227634] __do_softirq+0x5a/0xd5
 [13768.779261]  [c023b435] trace_hardirqs_on+0x106/0x141
 [13768.784588]  [c0227643] __do_softirq+0x69/0xd5
 [13768.789308]  [c02276e6] do_softirq+0x37/0x4d
 [13768.793851]  [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
 [13768.799609]  [c02029ef] default_idle+0x3b/0x54
 [13768.804326]  [c02029ef] default_idle+0x3b/0x54
 [13768.809054]  [c0204c33] apic_timer_interrupt+0x33/0x38
 [13768.814471]  [c02029ef] default_idle+0x3b/0x54
 [13768.819187]  [c02029f1] default_idle+0x3d/0x54
 [13768.823903]  [c0202aaa] cpu_idle+0xa2/0xbb
 [13768.828279]  ===
 
 
 -- 
 MST
 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please 

Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Peter Zijlstra
On Mon, 2007-03-12 at 08:26 -0700, Linus Torvalds wrote:

 So good fairness really should involve some notion of work done for 
 others. It's just not very easy to do..

A solution that is already in demand is a class based scheduler, where
the thread doing work for a client (temp.) joins the class of the
client.

The in-kernel virtualization guys also want to have this, for pretty
much the same reasons.

Peter

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-12 Thread Peter Zijlstra
On Mon, 2007-03-12 at 21:11 +0100, Mike Galbraith wrote:

 How would you go about ensuring that there won't be any cycles wasted?

SCHED_IDLE or otherwise nice 19
 
 Killing the known corner case starvation scenarios is wonderful, but
 let's not just pretend that interactive tasks don't have any special
 requirements.

Interaction wants low latency; getting that is traditionally expressed
in priorities - the highest prio gets the least latency (all RTOSes work
like that).

There is nothing that warrants giving them more CPU time IMHO; if you
think they deserve more, express that using priorities.

Priorities are a well understood concept and they work; heuristics can
(and Murphy tells us they will) go wrong.

Getting the server/client thing working can be done without heuristics
using class based scheduling.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] VM throttling: avoid blocking occasional writers

2007-03-14 Thread Peter Zijlstra
Hi,

I've been working on an alternative solution (see patch below). However
I haven't posted yet because I'm not quite satisfied and haven't done a
lot of testing.

The patch relies on the per backing dev dirty/writeback counts currently
in -mm to which David Chinner objected. I plan to rework those as percpu
counters.

I think my solution might behave better because it fully decouples the
device throttling.

---

Scale writeback cache per backing device, proportional to its writeout speed.

akpm sayeth:
 Which problem are we trying to solve here?  afaik our two uppermost
 problems are:
 
 a) Heavy write to queue A causes light writer to queue B to block for a long
 time in balance_dirty_pages().  Even if the devices have the same speed.  

This one; esp when not the same speed. The - my usb stick makes my
computer suck - problem. But even on similar speed, the separation of
device should avoid blocking dev B when dev A is being throttled.

The writeout speed is measured dynamically, so when it doesn't have
anything to write out for a while its writeback cache size goes to 0.

Conversely, when starting up it will in the beginning act almost
synchronous but will quickly build up a 'fair' share of the writeback
cache.

 b) heavy write to device A causes light write to device A to block for a
 long time in balance_dirty_pages(), occasionally.  Harder to fix.

This will indeed take more. I've thought about it though. But one
quickly ends up with per task state.


How it all works:

We pick a 2^n value based on the vm_dirty_ratio and total vm size to act as a
period - vm_cycle_shift. This period measures 'time' in writeout events.

Each writeout increases time and adds to a per bdi counter. This counter is 
halved when a period expires. So per bdi speed is:

  0.5 * (previous cycle speed) + this cycle's events.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c   |3 +
 include/linux/backing-dev.h |7 +++
 include/linux/writeback.h   |   10 
 kernel/sysctl.c |   10 +++-
 mm/page-writeback.c |  102 ++--
 5 files changed, 119 insertions(+), 13 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -34,6 +34,13 @@ struct backing_dev_info {
void *congested_data;   /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+
+   /*
+* data used for scaling the writeback cache
+*/
+   spinlock_t lock;/* protect the cycle count */
+   atomic_long_t nr_writeout;  /* writeout scale */
+   unsigned long cycles;   /* writeout cycles */
 };
 
 
Index: linux-2.6/include/linux/writeback.h
===
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -4,6 +4,8 @@
 #ifndef WRITEBACK_H
 #define WRITEBACK_H
 
+#include linux/log2.h
+
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
@@ -89,11 +91,19 @@ void throttle_vm_writeout(gfp_t gfp_mask
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
+extern int vm_cycle_shift;
 extern int dirty_writeback_interval;
 extern int dirty_expire_interval;
 extern int block_dump;
 extern int laptop_mode;
 
+extern long vm_total_pages; /* reduce dependancy stuff */
+static inline void update_cycle_shift(void)
+{
+   unsigned long dirty_pages = (vm_dirty_ratio * vm_total_pages) / 100;
+   vm_cycle_shift = 2 + ilog2_up(int_sqrt(dirty_pages));
+}
+
 struct ctl_table;
 struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,
Index: linux-2.6/kernel/sysctl.c
===
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -612,6 +612,14 @@ static ctl_table kern_table[] = {
 static int zero;
 static int one_hundred = 100;
 
+static int proc_dointvec_vm_dirty_ratio(ctl_table *table, int write,
+   struct file *filp, void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+   update_cycle_shift();
+   return ret;
+}
 
 static ctl_table vm_table[] = {
{
@@ -663,7 +671,7 @@ static ctl_table vm_table[] = {
.data   = &vm_dirty_ratio,
.maxlen = sizeof(vm_dirty_ratio),
.mode   = 0644,
-   .proc_handler   = &proc_dointvec_minmax,
+   .proc_handler   = &proc_dointvec_vm_dirty_ratio,
.strategy   = &sysctl_intvec,
.extra1 = &zero,
.extra2 = &one_hundred,
Index: linux

Re: NPTL patch for linux 2.4.28

2007-03-15 Thread Peter Zijlstra
On Thu, 2007-03-15 at 03:14 +0530, Syed Ahemed wrote:

 Getting RHEL's source ( http://lkml.org/lkml/2005/3/21/380 ) was an
 idea i thought about but then a download of the RHEL source from the
 following location was denied .
 http://download.fedora.redhat.com/pub/fedora/linux/core/1/SRPMS/  and
 the rpmfind site.
 (Guess need to be a paid subscriber for that right ?)

Strangely enough you try to download Fedora Core SRPMs whilst you speak
of RHEL. Try this one:

ftp://ftp.redhat.com/pub/redhat/linux/enterprise/3/en/os/i386/SRPMS/kernel-2.4.21-4.EL.src.rpm

Also, CentOS would have similar sources. Google could have informed you
on that.

 I still wonder why there aren't any NPTL patches available in the
 non-redhat sites for kernels like 2.4.21 or more .

Because most people, especially the ones on this mailing list, moved on
to 2.6 quite some time ago. May I suggest you do the same?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed

2007-03-15 Thread Peter Zijlstra
On Wed, 2007-03-14 at 15:58 -0400, Ashif Harji wrote:
 This patch unconditionally calls mark_page_accessed to prevent pages, 
 especially for small files, from being evicted from the page cache despite 
 frequent access.

Since we're hackling over the use-once stuff again...

/me brings up: http://marc.info/?l=linux-mmm=115316894804385w=2 and
ducks.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements

2007-03-16 Thread Peter Zijlstra
On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote:
 Hi
 
 I'm pleased to present these patches which improve linux futex performance 
 and 
 scalability, on both UP, SMP and NUMA configs.
 
 I had this idea last year but I was not understood, probably because I gave 
 not enough explanations. Sorry if this mail is really long...

I started playing with it after your last reference to it, I have some
code here (against -rt):
  http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/

Which I will post once I have found what keeps pthread_join() from
completing :-(

It basically adds a per task vma lookup cache which can also activate
the private logic without explicit use of the new interface.
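
The real code is behind the URL above; as a rough picture of the idea only
(field and function names invented here): remember the last few ranges a task
did futex operations on, so the shared/private decision can usually be made
without touching mmap_sem.

        #define VMA_CACHE_SIZE 4

        struct vma_cache_entry {
                unsigned long start, end;       /* range covered by the cached vma */
                int shared;                     /* was it a MAP_SHARED mapping?    */
        };

        struct task_vma_cache {
                struct vma_cache_entry entry[VMA_CACHE_SIZE];
                int next;                       /* simple round-robin replacement  */
        };

        /* 1 = shared, 0 = private, -1 = unknown: fall back to the mmap_sem path */
        static int vma_cache_lookup(struct task_vma_cache *cache, unsigned long addr)
        {
                int i;

                for (i = 0; i < VMA_CACHE_SIZE; i++) {
                        struct vma_cache_entry *e = &cache->entry[i];

                        if (addr >= e->start && addr < e->end)
                                return e->shared;
                }
                return -1;
        }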

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6.20] BUG: workqueue leaked lock

2007-03-16 Thread Peter Zijlstra
On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
  On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden [EMAIL PROTECTED] 
  wrote:
  ...
  [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x/3577
  [ 1756.728271] last function: laundromat_main+0x0/0x69 [nfsd]
  [ 1756.728392] 2 locks held by nfsd4/3577:
  [ 1756.728435]  #0:  (client_mutex){--..}, at: [c1205b88] 
  mutex_lock+0x8/0xa
  [ 1756.728679]  #1:  (inode-i_mutex){--..}, at: [c1205b88] 
  mutex_lock+0x8/0xa
  [ 1756.728923]  [c1003d57] show_trace_log_lvl+0x1a/0x30
  [ 1756.729015]  [c1003d7f] show_trace+0x12/0x14
  [ 1756.729103]  [c1003e79] dump_stack+0x16/0x18
  [ 1756.729187]  [c102c2e8] run_workqueue+0x167/0x170
  [ 1756.729276]  [c102c437] worker_thread+0x146/0x165
  [ 1756.729368]  [c102f797] kthread+0x97/0xc4
  [ 1756.729456]  [c1003bdb] kernel_thread_helper+0x7/0x10
  [ 1756.729547]  ===
  [ 1792.436492] svc: unknown version (0 for prog 13, nfsd)
  [ 1846.683648] BUG: workqueue leaked lock or atomic: nfsd4/0x/3577
  [ 1846.683701] last function: laundromat_main+0x0/0x69 [nfsd]
  [ 1846.683832] 2 locks held by nfsd4/3577:
  [ 1846.683885]  #0:  (client_mutex){--..}, at: [c1205b88] 
  mutex_lock+0x8/0xa
  [ 1846.683980]  #1:  (inode-i_mutex){--..}, at: [c1205b88] 
  mutex_lock+0x8/0xa
  [ 1846.683988]  [c1003d57] show_trace_log_lvl+0x1a/0x30
  [ 1846.683994]  [c1003d7f] show_trace+0x12/0x14
  [ 1846.683997]  [c1003e79] dump_stack+0x16/0x18
  [ 1846.684001]  [c102c2e8] run_workqueue+0x167/0x170
  [ 1846.684006]  [c102c437] worker_thread+0x146/0x165
  [ 1846.684012]  [c102f797] kthread+0x97/0xc4
  [ 1846.684023]  [c1003bdb] kernel_thread_helper+0x7/0x10
 
 Oleg, that's a fairly incomprehensible message we have in there.  Can you
 please explain what it means?

I think I'm responsible for this message (commit
d5abe669172f20a4129a711de0f250a4e07db298); what it says is that the
function executed by the workqueue (here laundromat_main) leaked an
atomic context or is still holding locks (2 locks in this case).



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements

2007-03-16 Thread Peter Zijlstra
On Fri, 2007-03-16 at 10:30 +0100, Eric Dumazet wrote:
 On Friday 16 March 2007 09:05, Peter Zijlstra wrote:
  On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote:
   Hi
  
   I'm pleased to present these patches which improve linux futex
   performance and scalability, on both UP, SMP and NUMA configs.
  
   I had this idea last year but I was not understood, probably because I
   gave not enough explanations. Sorry if this mail is really long...
 
  I started playing with it after your last reference to it, I have some
  code here (against -rt):
http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/
 
  Which I will post once I have the found what keeps pthread_join() from
  completing :-(
 
  It basically adds a per task vma lookup cache which can also activate
  the private logic without explicit use of the new interface.
 
 Hi Peter
 
 I dont think yet another cache will help in the general case.
 A typical program uses many vmas at once...
 
 glibc has internal futexes, on a different vma than futexes declared in your 
 program. Each shared library is going to have its own vma for its data (and 
 futexes)
 
 (244 vmas on one kmail program for example)

Yeah, I was just hoping a few cache entries would be enough to get the
worst of them. A benchmark will have to tell I guess.

 About your guess_futex_shared() thing, I miss the vma_anon() definition.

http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch

 But if it has to walk the vmas (and take mmap_sem), you already lose the 
 PRIVATE benefit.

It doesn't take mmap_sem, I am aware of the problems.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.21-rc3-mm2 3/4] futex_requeue_pi optimization

2007-03-16 Thread Peter Zijlstra
On Tue, 2007-03-13 at 10:52 +0100, [EMAIL PROTECTED] wrote:
 plain text document attachment (futex-requeue-pi.diff)
 This patch provides the futex_requeue_pi functionality.
 
 This provides an optimization, already used for (normal) futexes, to be used 
 for
 PI-futexes.
 
 This optimization is currently used by the glibc in pthread_broadcast, when
 using normal mutexes. With futex_requeue_pi, it can be used with 
 PRIO_INHERIT
 mutexes too.
 
 Signed-off-by: Pierre Peiffer [EMAIL PROTECTED]
 
 ---

  /*
 + * Retrieve the original address used to compute this key
 + */
 +static void *get_futex_address(union futex_key *key)
 +{
 + void *uaddr;
 +
 + if (key->both.offset & 1) {
 + /* shared mapping */
 + uaddr = (void*)((key->shared.pgoff << PAGE_SHIFT)
 + + key->shared.offset - 1);
 + } else {
 + /* private mapping */
 + uaddr = (void*)(key->private.address + key->private.offset);
 + }
 +
 + return uaddr;
 +}

This will not work for nonlinear vmas; granted, not a lot of people stick
futexes in nonlinear vmas, but the futex_key stuff handles them and this
doesn't.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements

2007-03-16 Thread Peter Zijlstra
On Fri, 2007-03-16 at 11:30 +0100, Eric Dumazet wrote:
 On Friday 16 March 2007 11:10, Peter Zijlstra wrote:
 
  http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.p
 atch
 
 Oh thanks
 
 
   But if it has to walk the vmas (and take mmap_sem), you already loose the
   PRIVATE benefit.
 
  It doesn't take mmap_sem, I am aware of the problems.
 
 Yes but the vma_anon() -> vma_cache_find() needs to read 3 cache lines on 
 x86_64 (sizeof(struct vma_cache) = 136)
 and dirty one bit, so it might be more expensive than the mmap_sem ...

I thought the cacheline was 128 bytes, but point taken.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT] NFS client updates for 2.6.20

2007-02-13 Thread Peter Zijlstra
On Mon, 2007-02-12 at 23:39 -0800, Trond Myklebust wrote:

 commit 7c85d9007d05436e71d2b805b96c1e36a8193bd4
 Author: Trond Myklebust [EMAIL PROTECTED]
 Date:   Wed Dec 13 15:23:48 2006 -0500
 
 NFS: Fixup some outdated comments...
 
 Signed-off-by: Trond Myklebust [EMAIL PROTECTED]

A few new comments would be appreciated :-)

 commit 02241bc47e8961768de83d855accd0dcad1df045
 Author: Trond Myklebust [EMAIL PROTECTED]
 Date:   Sat Jan 13 02:28:07 2007 -0500
 
 NFS: Ensure that ->writepage() uses flush_stable() when reclaiming pages
 
 Signed-off-by: Trond Myklebust [EMAIL PROTECTED]

That patch seems to make all ->writepage() calls use stable storage.
Of course, given that there are no other ->writepage() calls, this
works out.

I've been using this:
---

Subject: nfs: only use stable storage for swap

unstable writes don't make sense for swap pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
Cc: Trond Myklebust [EMAIL PROTECTED]
---
 fs/nfs/write.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-git/fs/nfs/write.c
===
--- linux-2.6-git.orig/fs/nfs/write.c   2007-01-09 11:38:25.0 +0100
+++ linux-2.6-git/fs/nfs/write.c2007-01-09 13:40:34.0 +0100
@@ -231,7 +231,7 @@ static int nfs_writepage_setup(struct nf
 static int wb_priority(struct writeback_control *wbc)
 {
if (wbc->for_reclaim)
-   return FLUSH_HIGHPRI;
+   return FLUSH_HIGHPRI|FLUSH_STABLE;
if (wbc->for_kupdate)
return FLUSH_LOWPRI;
return 0;


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] lockdep: annotate BLKPG_DEL_PARTITION

2007-02-16 Thread Peter Zijlstra


=
[ INFO: possible recursive locking detected ]
2.6.19-1.2909.fc7 #1
-
anaconda/587 is trying to acquire lock:
 (bdev-bd_mutex){--..}, at: [c05fb380] mutex_lock+0x21/0x24

but task is already holding lock:
 (bdev-bd_mutex){--..}, at: [c05fb380] mutex_lock+0x21/0x24

other info that might help us debug this:
1 lock held by anaconda/587:
 #0:  (bdev-bd_mutex){--..}, at: [c05fb380] mutex_lock+0x21/0x24

stack backtrace:
 [c0405812] show_trace_log_lvl+0x1a/0x2f
 [c0405db2] show_trace+0x12/0x14
 [c0405e36] dump_stack+0x16/0x18
 [c043bd84] __lock_acquire+0x116/0xa09
 [c043c960] lock_acquire+0x56/0x6f
 [c05fb1fa] __mutex_lock_slowpath+0xe5/0x24a
 [c05fb380] mutex_lock+0x21/0x24
 [c04d82fb] blkdev_ioctl+0x600/0x76d
 [c04946b1] block_ioctl+0x1b/0x1f
 [c047ed5a] do_ioctl+0x22/0x68
 [c047eff2] vfs_ioctl+0x252/0x265
 [c047f04e] sys_ioctl+0x49/0x63
 [c0404070] syscall_call+0x7/0xb

Annotate BLKPG_DEL_PARTITION's bd_mutex locking and add a little comment
clarifying the bd_mutex locking, because I confused myself and initially
thought the lock order was wrong too.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ioctl.c  |2 +-
 fs/block_dev.c |7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

Index: linux-2.6.20.noarch/block/ioctl.c
===
--- linux-2.6.20.noarch.orig/block/ioctl.c
+++ linux-2.6.20.noarch/block/ioctl.c
@@ -82,7 +82,7 @@ static int blkpg_ioctl(struct block_devi
fsync_bdev(bdevp);
invalidate_bdev(bdevp, 0);
 
-   mutex_lock(&bdev->bd_mutex);
+   mutex_lock_nested(&bdev->bd_mutex, 1);
delete_partition(disk, part);
mutex_unlock(&bdev->bd_mutex);
mutex_unlock(&bdevp->bd_mutex);
Index: linux-2.6.20.noarch/fs/block_dev.c
===
--- linux-2.6.20.noarch.orig/fs/block_dev.c
+++ linux-2.6.20.noarch/fs/block_dev.c
@@ -1101,6 +1101,13 @@ static int __blkdev_get(struct block_dev
int for_part);
 static int __blkdev_put(struct block_device *bdev, int for_part);
 
+/*
+ * bd_mutex locking:
+ *
+ *  mutex_lock(part->bd_mutex)
+ *    mutex_lock_nested(whole->bd_mutex, 1)
+ */
+
 static int do_open(struct block_device *bdev, struct file *file, int for_part)
 {
struct module *owner = NULL;


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.20 1/1] fbdev,mm: hecuba/E-Ink fbdev driver

2007-02-17 Thread Peter Zijlstra
On Sat, 2007-02-17 at 11:42 +0100, Jaya Kumar wrote:
 Hi James, Geert, lkml and mm,

Hi Jaya,

 This patch adds support for the Hecuba/E-Ink display with deferred IO.
 The changes from the previous version are to switch to using a mutex
 and lock_page. I welcome your feedback and advice.

This changelog ought to be a little more extensive; esp. because you're
using these fancy new functions ->page_mkwrite() and page_mkclean() in a
novel way.

Also, I'd still like to see a way to call msync() on the mmap'ed region
to force a flush. I think providing a fb_fsync() method in fbmem.c and a
hook down to the driver ought to work.
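
Roughly what that hook could look like; fb_fsync below is the suggested,
not-yet-existing fb_ops member, the fb_info lookup via file->private_data is
an assumption, and only the fsync prototype matches the 2.6.20-era
file_operations.

        static int fb_fsync(struct file *file, struct dentry *dentry, int datasync)
        {
                struct fb_info *info = file->private_data;  /* assumed set at open */

                if (info && info->fbops->fb_fsync)          /* hypothetical member */
                        return info->fbops->fb_fsync(info);
                return 0;
        }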

Also, you now seem to use a fixed 1 second delay, perhaps provide an
ioctl or something to customize this?

And, as Andrew suggested last time around, could you perhaps push this
fancy new idea into the FB layer so that more drivers can make use of it?


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 2/6] mm: count dirty pages per BDI

2007-03-19 Thread Peter Zijlstra
Count per BDI dirty pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/buffer.c |1 +
 include/linux/backing-dev.h |1 +
 mm/page-writeback.c |2 ++
 mm/truncate.c   |1 +
 4 files changed, 5 insertions(+)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -732,6 +732,7 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) {   /* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+   __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -766,6 +766,7 @@ int __set_page_dirty_nobuffers(struct pa
BUG_ON(mapping2 != mapping);
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+   __inc_bdi_stat(mapping->backing_dev_info, 
BDI_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
@@ -892,6 +893,7 @@ int clear_page_dirty_for_io(struct page 
set_page_dirty(page);
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
return 1;
}
return 0;
Index: linux-2.6/mm/truncate.c
===
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -71,6 +71,7 @@ void cancel_dirty_page(struct page *page
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
if (account_size)
task_io_account_cancelled_write(account_size);
}
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -23,6 +23,7 @@ enum bdi_state {
 };
 
 enum bdi_stat_item {
+   BDI_DIRTY,
NR_BDI_STAT_ITEMS
 };
 

--

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 3/6] mm: count writeback pages per BDI

2007-03-19 Thread Peter Zijlstra
Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |1 +
 mm/page-writeback.c |8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -912,10 +912,12 @@ int test_clear_page_writeback(struct pag
 
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
-   if (ret)
+   if (ret) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+   __dec_bdi_stat(mapping->backing_dev_info, 
BDI_WRITEBACK);
+   }
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
@@ -933,10 +935,12 @@ int test_set_page_writeback(struct page 
 
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
-   if (!ret)
+   if (!ret) {
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+   __inc_bdi_stat(mapping->backing_dev_info, 
BDI_WRITEBACK);
+   }
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -24,6 +24,7 @@ enum bdi_state {
 
 enum bdi_stat_item {
BDI_DIRTY,
+   BDI_WRITEBACK,
NR_BDI_STAT_ITEMS
 };
 

--

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 1/6] mm: scalable bdi statistics counters.

2007-03-19 Thread Peter Zijlstra
Provide scalable per backing_dev_info statistics counters modeled on the ZVC
code.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c   |1 
 drivers/block/rd.c  |2 
 drivers/char/mem.c  |2 
 fs/char_dev.c   |1 
 include/linux/backing-dev.h |   86 
 mm/backing-dev.c|  103 
 6 files changed, 195 insertions(+)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -211,6 +211,7 @@ void blk_queue_make_request(request_queu
q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / 
PAGE_CACHE_SIZE;
q->backing_dev_info.state = 0;
q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
+   bdi_stat_init(&q->backing_dev_info);
blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
blk_queue_hardsect_size(q, 512);
blk_queue_dma_alignment(q, 511);
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -22,6 +22,17 @@ enum bdi_state {
BDI_unused, /* Available bits start here */
 };
 
+enum bdi_stat_item {
+   NR_BDI_STAT_ITEMS
+};
+
+#ifdef CONFIG_SMP
+struct bdi_per_cpu_data {
+   s8 stat_threshold;
+   s8 bdi_stat_diff[NR_BDI_STAT_ITEMS];
+} cacheline_aligned_in_smp;
+#endif
+
 typedef int (congested_fn)(void *, int);
 
 struct backing_dev_info {
@@ -32,8 +43,83 @@ struct backing_dev_info {
void *congested_data;   /* Pointer to aux data for congested func */
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+
+   atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+#ifdef CONFIG_SMP
+   struct bdi_per_cpu_data pcd[NR_CPUS];
+#endif
 };
 
+extern atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+
+static inline void bdi_stat_add(long x, struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   atomic_long_add(x, &bdi->bdi_stats[item]);
+   atomic_long_add(x, &bdi_stats[item]);
+}
+
+/*
+ * cannot be unsigned long and clip on 0.
+ */
+static inline unsigned long global_bdi_stat(enum bdi_stat_item item)
+{
+   long x = atomic_long_read(&bdi_stats[item]);
+#ifdef CONFIG_SMP
+   if (x < 0)
+   x = 0;
+#endif
+   return x;
+}
+
+static inline unsigned long bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   long x = atomic_long_read(&bdi->bdi_stats[item]);
+#ifdef CONFIG_SMP
+   if (x < 0)
+   x = 0;
+#endif
+   return x;
+}
+
+#ifdef CONFIG_SMP
+void __mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int 
delta);
+void __inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+void __dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+
+void mod_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item, int 
delta);
+void inc_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+void dec_bdi_stat(struct backing_dev_info *bdi, enum bdi_stat_item item);
+
+#else /* CONFIG_SMP */
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item, int delta)
+{
+   bdi_stat_add(delta, bdi, item);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   atomic_long_inc(&bdi->bdi_stats[item]);
+   atomic_long_inc(&bdi_stats[item]);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   atomic_long_dec(&bdi->bdi_stats[item]);
+   atomic_long_dec(&bdi_stats[item]);
+}
+
+#define mod_bdi_stat __mod_bdi_stat
+#define inc_bdi_stat __inc_bdi_stat
+#define dec_bdi_stat __dec_bdi_stat
+#endif
+
+void bdi_stat_init(struct backing_dev_info *bdi);
 
 /*
  * Flags in backing_dev_info::capability
Index: linux-2.6/mm/backing-dev.c
===
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -67,3 +67,106 @@ void congestion_end(int rw)
wake_up(wqh);
 }
 EXPORT_SYMBOL(congestion_end);
+
+atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
+EXPORT_SYMBOL(bdi_stats);
+
+void bdi_stat_init(struct backing_dev_info *bdi)
+{
+   int i;
+
+   for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+   atomic_long_set(&bdi->bdi_stats[i], 0);
+
+#ifdef CONFIG_SMP
+   for (i = 0; i < NR_CPUS; i++) {
+   int j;
+   for (j = 0; j < NR_BDI_STAT_ITEMS; j++)
+   bdi->pcd[i].bdi_stat_diff[j] = 0;
+   bdi->pcd[i].stat_threshold = 8 * ilog2(num_online_cpus());
+   }
+#endif
+}
+EXPORT_SYMBOL(bdi_stat_init);
+
+#ifdef

[RFC][PATCH 5/6] mm: per device dirty threshold

2007-03-19 Thread Peter Zijlstra
Scale writeback cache per backing device, proportional to its writeout speed.

akpm sayeth:
 Which problem are we trying to solve here?  afaik our two uppermost
 problems are:
 
 a) Heavy write to queue A causes light writer to queue B to blok for a long
 time in balance_dirty_pages().  Even if the devices have the same speed.  

This one; esp when not the same speed. The - my usb stick makes my
computer suck - problem. But even on similar speed, the separation of
device should avoid blocking dev B when dev A is being throttled.

The writeout speed is measured dynamically, so when it doesn't have
anything to write out for a while its writeback cache size goes to 0.

Conversely, when starting up it will in the beginning act almost
synchronous but will quickly build up a 'fair' share of the writeback
cache.

 b) heavy write to device A causes light write to device A to block for a
 long time in balance_dirty_pages(), occasionally.  Harder to fix.

This will indeed take more. I've thought about it though. But one
quickly ends up with per task state.


How it all works:

We pick a 2^n value based on the total vm size to act as a period -
vm_cycle_shift. This period measures 'time' in writeout events.

Each writeout increases time and adds to a per bdi counter. This counter is 
halved when a period expires. So per bdi speed is:

  0.5 * (previous cycle speed) + this cycle's events.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |8 ++
 mm/backing-dev.c|3 
 mm/page-writeback.c |  145 ++--
 3 files changed, 125 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -26,6 +26,8 @@ enum bdi_stat_item {
BDI_DIRTY,
BDI_WRITEBACK,
BDI_UNSTABLE,
+   BDI_WRITEOUT,
+   BDI_WRITEOUT_TOTAL,
NR_BDI_STAT_ITEMS
 };
 
@@ -47,6 +49,12 @@ struct backing_dev_info {
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
 
+   /*
+* data used for scaling the writeback cache
+*/
+   spinlock_t lock;/* protect the cycle count */
+   unsigned long cycles;   /* writeout cycles */
+
atomic_long_t bdi_stats[NR_BDI_STAT_ITEMS];
 #ifdef CONFIG_SMP
struct bdi_per_cpu_data pcd[NR_CPUS];
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -49,8 +49,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;  /* Dirty mem may be 
over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +101,77 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by tracking a floating average per BDI and a global floating
+ * average. We optimize away the '/= 2' for the global average by noting that:
+ *
+ *  if (++i > thresh) i /= 2:
+ *
+ * Can be approximated by:
+ *
+ *   thresh/2 + (++i % thresh/2)
+ *
+ * Furthermore, when we choose thresh to be 2^n it can be written in terms of
+ * binary operations and wraparound artifacts disappear.
+ */
+static int vm_cycle_shift __read_mostly;
+
+/*
+ * Sync up the per BDI average to the global cycle.
+ *
+ * NOTE: we mask out the MSB of the cycle count because bdi_stats really are
+ * not unsigned long. (see comment in backing_dev.h)
+ */
+static void bdi_writeout_norm(struct backing_dev_info *bdi)
+{
+   int bits = vm_cycle_shift;
+   unsigned long cycle = 1UL  bits;
+   unsigned long mask = ~(cycle - 1) | (1UL  BITS_PER_LONG-1);
+   unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL)  1;
+   unsigned long flags;
+
+   if ((bdi-cycles  mask) == (total  mask))
+   return;
+
+   spin_lock_irqsave(bdi-lock, flags);
+   while ((bdi-cycles  mask) != (total  mask)) {
+   unsigned long half = bdi_stat(bdi, BDI_WRITEOUT) / 2;
+
+   mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
+   bdi-cycles += cycle;
+   }
+   spin_unlock_irqrestore(bdi-lock, flags);
+}
+
+static void bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+   if (!bdi_cap_writeback_dirty(bdi))
+   return;
+
+   __inc_bdi_stat(bdi, BDI_WRITEOUT);
+   __inc_bdi_stat(bdi, BDI_WRITEOUT_TOTAL);
+
+   bdi_writeout_norm(bdi);
+}
+
+void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)
+{
+   int bits = vm_cycle_shift - 1;
+   unsigned long total = global_bdi_stat

[RFC][PATCH 4/6] mm: count unstable pages per BDI

2007-03-19 Thread Peter Zijlstra
Count per BDI unstable pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/write.c  |4 
 include/linux/backing-dev.h |1 +
 2 files changed, 5 insertions(+)

Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -474,6 +474,7 @@ nfs_mark_request_commit(struct nfs_page 
nfsi->ncommit++;
spin_unlock(&nfsi->req_lock);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+   inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 #endif
@@ -545,6 +546,7 @@ static void nfs_cancel_commit_list(struc
while(!list_empty(head)) {
req = nfs_list_entry(head-next);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+   dec_bdi_stat(req->wb_page->mapping->backing_dev_info, 
BDI_UNSTABLE);
nfs_list_remove_request(req);
nfs_inode_remove_request(req);
nfs_unlock_request(req);
@@ -1278,6 +1280,7 @@ nfs_commit_list(struct inode *inode, str
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+   dec_bdi_stat(req->wb_page->mapping->backing_dev_info, 
BDI_UNSTABLE);
nfs_clear_page_writeback(req);
}
return -ENOMEM;
@@ -1302,6 +1305,7 @@ static void nfs_commit_done(struct rpc_t
req = nfs_list_entry(data-pages.next);
nfs_list_remove_request(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+   dec_bdi_stat(req->wb_page->mapping->backing_dev_info, 
BDI_UNSTABLE);
 
dprintk(NFS: commit (%s/%Ld [EMAIL PROTECTED]),
req-wb_context-dentry-d_inode-i_sb-s_id,
Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -25,6 +25,7 @@ enum bdi_state {
 enum bdi_stat_item {
BDI_DIRTY,
BDI_WRITEBACK,
+   BDI_UNSTABLE,
NR_BDI_STAT_ITEMS
 };
 

--

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 0/6] per device dirty throttling

2007-03-19 Thread Peter Zijlstra
This patch-set implements per device dirty page throttling, which should solve
the problem we currently have with one device hogging the dirty limit.

Preliminary testing shows good results:

mem=128M

time (dd if=/dev/zero of=/mnt/dev/zero bs=4096 count=$((1024*1024/4)); sync)

1GB to disk

real    0m33.074s   0m34.596s   0m33.387s
user    0m0.147s    0m0.163s    0m0.142s
sys     0m7.872s    0m8.409s    0m8.395s

1GB to usb-flash

real    3m21.170s   3m15.512s   3m23.889s
user    0m0.135s    0m0.146s    0m0.127s
sys     0m7.327s    0m7.328s    0m7.342s


2.6.20 device vs device

1GB disk vs disk

real    1m30.736s   1m16.133s   1m42.068s
user    0m0.204s    0m0.167s    0m0.222s
sys     0m10.438s   0m7.958s    0m10.599s

1GB usb-flash vs background disk

N/A 30m+

1GB disk vs background usb-flash

real    4m0.687s    2m20.145s   4m12.923s
user    0m0.173s    0m0.185s    0m0.161s
sys     0m8.227s    0m8.581s    0m8.345s


2.6.20-writeback

1GB disk vs disk

real    0m36.696s   0m40.837s   0m38.679s
user    0m0.161s    0m0.148s    0m0.160s
sys     0m8.240s    0m8.068s    0m8.174s

1GB usb-flash vs background disk

real    3m37.464s   3m49.720s   4m5.805s
user    0m0.167s    0m0.166s    0m0.149s
sys     0m7.195s    0m7.281s    0m7.199s

1GB disk vs background usb-flash

real    0m41.585s   0m30.888s   0m34.493s
user    0m0.161s    0m0.167s    0m0.162s
sys     0m7.826s    0m7.807s    0m7.821s
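
The mechanism behind these numbers can be illustrated with a toy calculation (purely illustrative, not code from the patch set): the global dirty threshold is split between devices in proportion to their recent writeout completions, which is roughly what get_writeout_scale() feeds into the per-device limit. All numbers below are invented:

#include <stdio.h>

int main(void)
{
        unsigned long dirty_thresh = 1000;              /* total dirty pages allowed */
        unsigned long writeout[2] = { 900, 100 };       /* recent completions: disk, usb-flash */
        unsigned long total = writeout[0] + writeout[1];
        int i;

        /* each device may dirty pages in proportion to its writeout speed */
        for (i = 0; i < 2; i++)
                printf("bdi%d dirty limit = %lu pages\n", i,
                       dirty_thresh * writeout[i] / total);
        return 0;
}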


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 6/6] mm: expose BDI statistics in sysfs.

2007-03-19 Thread Peter Zijlstra
Expose the per BDI stats in /sys/block/<dev>/queue/*

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   51 +++
 1 file changed, 51 insertions(+)

Index: linux-2.6/block/ll_rw_blk.c
===
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -3923,6 +3923,33 @@ static ssize_t queue_max_hw_sectors_show
return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_DIRTY));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_WRITEBACK));
+}
+
+static ssize_t queue_nr_unstable_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%lu\n", bdi_stat(&q->backing_dev_info, BDI_UNSTABLE));
+}
+
+extern void get_writeout_scale(struct backing_dev_info *, int *, int *);
+
+static ssize_t queue_nr_cache_show(struct request_queue *q, char *page)
+{
+   int scale, div;
+
+   get_writeout_scale(&q->backing_dev_info, &scale, &div);
+   scale *= 1024;
+   scale /= div;
+
+   return sprintf(page, "%d\n", scale);
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -3947,6 +3974,26 @@ static struct queue_sysfs_entry queue_ma
.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_dirty_entry = {
+   .attr = {.name = "dirty_pages", .mode = S_IRUGO },
+   .show = queue_nr_dirty_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+   .attr = {.name = "writeback_pages", .mode = S_IRUGO },
+   .show = queue_nr_writeback_show,
+};
+
+static struct queue_sysfs_entry queue_unstable_entry = {
+   .attr = {.name = "unstable_pages", .mode = S_IRUGO },
+   .show = queue_nr_unstable_show,
+};
+
+static struct queue_sysfs_entry queue_cache_entry = {
+   .attr = {.name = "cache_ratio", .mode = S_IRUGO },
+   .show = queue_nr_cache_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
.show = elv_iosched_show,
@@ -3958,6 +4005,10 @@ static struct attribute *default_attrs[]
queue_ra_entry.attr,
queue_max_hw_sectors_entry.attr,
queue_max_sectors_entry.attr,
+   queue_dirty_entry.attr,
+   queue_writeback_entry.attr,
+   queue_unstable_entry.attr,
+   queue_cache_entry.attr,
queue_iosched_entry.attr,
NULL,
 };

--
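
For context: with a patch like the above applied, the new read-only attributes would show up next to the existing queue attributes, e.g. /sys/block/sda/queue/dirty_pages, writeback_pages, unstable_pages and cache_ratio (the device name here is just an example). Going by queue_nr_cache_show() above, cache_ratio reports the BDI's share of recent writeout scaled to the range 0..1024 rather than a percentage.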

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/6] per device dirty throttling

2007-03-19 Thread Peter Zijlstra
Sorry for duplicates, I was fooled by an MTA hanging on to them for a
few hours. I counted them lost in cyberspace.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 7/6] assorted fixes

2007-03-19 Thread Peter Zijlstra

Just taking out the MSB isn't enough to counter the clipping on 0 done by
the stats counter accessors. Create some accessors that don't do that.

Also, increase the period to about the size of memory (TODO, determine some
upper bound here). This should give much more stable results. (Best would be
to keep it in the order of whatever vm_dirty_ratio gives, however changing
vm_cycle_shift is dangerous).

Finally, limit the adjustment rate to not grow faster than available dirty
space. Without this the analytic model can use up to 2 times the dirty
limit and the discrete model is basically unbounded.

It goes *BANG* when using NFS,... need to look into that.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/backing-dev.h |   12 
 mm/page-writeback.c |   37 ++---
 2 files changed, 34 insertions(+), 15 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -70,6 +70,12 @@ static inline void bdi_stat_add(long x, 
atomic_long_add(x, &bdi_stats[item]);
 }
 
+
+static inline unsigned long __global_bdi_stat(enum bdi_stat_item item)
+{
+   return atomic_long_read(&bdi_stats[item]);
+}
+
 /*
  * cannot be unsigned long and clip on 0.
  */
@@ -83,6 +89,12 @@ static inline unsigned long global_bdi_s
return x;
 }
 
+static inline unsigned long __bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item)
+{
+   return atomic_long_read(&bdi->bdi_stats[item]);
+}
+
 static inline unsigned long bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
 {
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -119,16 +119,13 @@ static int vm_cycle_shift __read_mostly;
 
 /*
  * Sync up the per BDI average to the global cycle.
- *
- * NOTE: we mask out the MSB of the cycle count because bdi_stats really are
- * not unsigned long. (see comment in backing_dev.h)
  */
 static void bdi_writeout_norm(struct backing_dev_info *bdi)
 {
int bits = vm_cycle_shift;
 unsigned long cycle = 1UL << bits;
-   unsigned long mask = ~(cycle - 1) | (1UL << BITS_PER_LONG-1);
-   unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL) >> 1;
+   unsigned long mask = ~(cycle - 1);
+   unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL) >> 1;
 unsigned long flags;
 
 if ((bdi->cycles & mask) == (total & mask))
@@ -136,7 +133,7 @@ static void bdi_writeout_norm(struct bac
 
 spin_lock_irqsave(&bdi->lock, flags);
 while ((bdi->cycles & mask) != (total & mask)) {
-   unsigned long half = bdi_stat(bdi, BDI_WRITEOUT) / 2;
+   unsigned long half = __bdi_stat(bdi, BDI_WRITEOUT) / 2;
 
 mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
 bdi->cycles += cycle;
@@ -158,13 +155,13 @@ static void bdi_writeout_inc(struct back
 void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)
 {
 int bits = vm_cycle_shift - 1;
-   unsigned long total = global_bdi_stat(BDI_WRITEOUT_TOTAL);
+   unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL);
 unsigned long cycle = 1UL << bits;
unsigned long mask = cycle - 1;
 
if (bdi_cap_writeback_dirty(bdi)) {
bdi_writeout_norm(bdi);
-   *scale = bdi_stat(bdi, BDI_WRITEOUT);
+   *scale = __bdi_stat(bdi, BDI_WRITEOUT);
} else
*scale = 0;
 
@@ -234,19 +231,29 @@ get_dirty_limits(long *pbackground, long
*pdirty = dirty;
 
if (mapping) {
+   struct backing_dev_info *bdi = mapping->backing_dev_info;
+   long reserve;
 long long tmp = dirty;
 int scale, div;
 
-   get_writeout_scale(mapping->backing_dev_info, &scale, &div);
-
-   if (scale > div)
-   scale = div;
+   get_writeout_scale(bdi, &scale, &div);
 
-   tmp = (tmp * 122) >> 7; /* take ~95% of total dirty value */
 tmp *= scale;
 do_div(tmp, div);
 
-   *pbdi_dirty = (long)tmp;
+   reserve = dirty -
+   (global_bdi_stat(BDI_DIRTY) +
+global_bdi_stat(BDI_WRITEBACK) +
+global_bdi_stat(BDI_UNSTABLE));
+
+   if (reserve < 0)
+   reserve = 0;
+
+   reserve += bdi_stat(bdi, BDI_DIRTY) +
+   bdi_stat(bdi, BDI_WRITEBACK) +
+   bdi_stat(bdi, BDI_UNSTABLE);
+
+   *pbdi_dirty = min((long)tmp, reserve);
}
 }
 
@@ -627,7 +634,7 @@ void __init page_writeback_init(void)
mod_timer(wb_timer, jiffies
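
As a rough worked example of the clamp introduced above (all numbers invented): with a global limit of dirty = 1000 pages, a proportional share tmp = 400 for some BDI, 750 pages already dirty/writeback/unstable globally and 100 of those against this BDI, reserve becomes (1000 - 750) + 100 = 350, so the BDI's limit is min(400, 350) = 350; a device is never promised more dirty space than is actually still available to it.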

Re: [RFC][PATCH 0/6] per device dirty throttling

2007-03-20 Thread Peter Zijlstra
On Tue, 2007-03-20 at 18:47 +1100, David Chinner wrote:
 On Mon, Mar 19, 2007 at 04:57:37PM +0100, Peter Zijlstra wrote:
  This patch-set implements per device dirty page throttling. Which should 
  solve
  the problem we currently have with one device hogging the dirty limit.
  
  Preliminary testing shows good results:
 
 I just ran some higher throughput number on this patchset.
 
 Identical 4-disk dm stripes, XFS, 4p x86_64, 16GB RAM, dirty_ratio = 5:
 
 One dm stripe: 320MB/s
 two dm stripes: 310+315MB/s
 three dm stripes: 254+253+253MB/s (pci-x bus bound)
 
 The three stripe test was for 100GB of data to each
 filesystem - all the writes finished with 1s of each other
 at 7m4s. Interestingly, the amount of memory in cache for
 each of these devices was almost exactly the same - about
 5.2GB each. Looks good so far
 
 Hmmm - small problem - root disk (XFS) got stuck in
 balance_dirty_pages_ratelimited_nr() after the above write test
 attempting to unmount the filesystems (i.e. umount trying
 to modify /etc/mtab got stuck and the root fs locked up)
 
 (reboot)

Hmm, interesting, I'll look into it.

 None-identical dm stripes, XFS, run alone:
 
 Single disk: 80MB/s
 2 disk dm stripe: 155MB/s
 4 disk dm stripe: 310MB/s
 
 Combined, after some runtime:
 
 # ls -sh /mnt/dm*/test
 10G /mnt/dm0/test 19G /mnt/dm1/test   41G /mnt/dm2/test
 15G /mnt/dm0/test 27G /mnt/dm1/test   52G /mnt/dm2/test
 18G /mnt/dm0/test 32G /mnt/dm1/test   64G /mnt/dm2/test
 24G /mnt/dm0/test 45G /mnt/dm1/test   86G /mnt/dm2/test
 27G /mnt/dm0/test 51G /mnt/dm1/test   95G /mnt/dm2/test
 29G /mnt/dm0/test 52G /mnt/dm1/test   97G /mnt/dm2/test
 29G /mnt/dm0/test 54G /mnt/dm1/test   101G /mnt/dm2/test [done]
 35G /mnt/dm0/test 65G /mnt/dm1/test   101G /mnt/dm2/test
 38G /mnt/dm0/test 70G /mnt/dm1/test   101G /mnt/dm2/test
 
 And so on. Final number:
 
 Single disk: 70MB/s
 2 disk dm stripe: 130MB/s
 4 disk dm stripe: 260MB/s
 
 So overall we've lost about 15-20% of the theoretical aggregate
 performance, but we haven't starved any of the devices over a
 long period of time.
 
 However, looking at vmstat for total throughput, there are periods
 of time where it appears that the fastest disk goes idle. That is,
 we drop from an aggregate of about 550MB/s to below 300MB/s for
 several seconds at a time. You can sort of see this from the file
 size output above - long term the ratios remain the same, but in the
 short term we see quite a bit of variability.

I suspect you did not apply 7/6? There is some trouble with signed vs
unsigned in the initial patch set that I tried to 'fix' by masking out
the MSB, but that doesn't work and results in 'time' getting stuck for
about half the time.

 When the fast disk completed, I saw almost the same thing, but
 this time it seems like the slow disk (i.e. ~230MB/s to ~150MB/s)
 stopped for several seconds.
 
 I haven't really digested what the patches do,

If you have questions please ask, I'll try to write up coherent
answers :-)

  but it's almost
 like it is throttling a device completely while it allows another
 to finish writing its quota (underestimating bandwidth?).

Yeah, there is some lumpy-ness in BIO submission or write completions it
seems, and when that granularity (multiplied by the number of active
devices) is larger than the 'time' period over with we average
(indicated by vm_cycle_shift) very weird stuff can happen.

 (umount after writes hung again. Same root disk thing as before)
 
 This is looking promising, Peter. When it is more stable I'll run
 some more tests

Thanks for the tests.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/6] per device dirty throttling

2007-03-20 Thread Peter Zijlstra
On Tue, 2007-03-20 at 20:38 +1100, David Chinner wrote:
 On Tue, Mar 20, 2007 at 09:08:24AM +0100, Peter Zijlstra wrote:
  On Tue, 2007-03-20 at 18:47 +1100, David Chinner wrote:
   So overall we've lost about 15-20% of the theoretical aggregate
   performance, but we haven't starved any of the devices over a
   long period of time.
   
   However, looking at vmstat for total throughput, there are periods
   of time where it appears that the fastest disk goes idle. That is,
   we drop from an aggregate of about 550MB/s to below 300MB/s for
   several seconds at a time. You can sort of see this from the file
   size output above - long term the ratios remain the same, but in the
   short term we see quite a bit of variability.
  
  I suspect you did not apply 7/6? There is some trouble with signed vs
  unsigned in the initial patch set that I tried to 'fix' by masking out
  the MSB, but that doesn't work and results in 'time' getting stuck for
  about half the time.
 
 I applied the fixes patch as well, so i had all that you posted...

Humm, not that then.

but it's almost
   like it is throttling a device completely while it allows another
   to finish writing its quota (underestimating bandwidth?).
  
  Yeah, there is some lumpy-ness in BIO submission or write completions it
  seems, and when that granularity (multiplied by the number of active
  devices) is larger than the 'time' period over with we average
  (indicated by vm_cycle_shift) very weird stuff can happen.
 
 Sounds like the period is a bit too short atm if we can get into this
 sort of problem with only 2 active devices

Yeah, trouble is, I significantly extended this period in 7/6.
Will have to ponder a bit on what is happening then.

Anyway, thanks for the feedback.

I'll try and reproduce the umount problem, maybe that will give some
hints.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/6] per device dirty throttling

2007-03-20 Thread Peter Zijlstra

This seems to fix the worst of it, I'll run with it for a few days and
respin the patches and repost when nothing weird happens.

---
Found missing bdi_stat_init() sites for NFS and Fuse

Optimize bdi_writeout_norm(), break out of the loop when we hit zero. This will
allow 'new' BDIs to catch up without triggering NMI/softlockup msgs.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/fuse/inode.c |1 +
 fs/nfs/client.c |1 +
 mm/page-writeback.c |   25 +++--
 3 files changed, 21 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/nfs/client.c
===
--- linux-2.6.orig/fs/nfs/client.c
+++ linux-2.6/fs/nfs/client.c
@@ -657,6 +657,7 @@ static void nfs_server_set_fsinfo(struct
 server->rsize = NFS_MAX_FILE_IO_SIZE;
 server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+   bdi_stat_init(&server->backing_dev_info);
 
 if (server->wsize > max_rpc_payload)
 server->wsize = max_rpc_payload;
Index: linux-2.6/mm/page-writeback.c
===
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -114,6 +114,13 @@ static void background_writeout(unsigned
  *
  * Furthermore, when we choose thresh to be 2^n it can be written in terms of
  * binary operations and wraparound artifacts disappear.
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ *   i / thresh
+ *
+ * Its monotonically increasing property can be applied to mitigate the wrap-
+ * around issue.
  */
 static int vm_cycle_shift __read_mostly;
 
@@ -125,19 +132,25 @@ static void bdi_writeout_norm(struct bac
int bits = vm_cycle_shift;
unsigned long cycle = 1UL << bits;
unsigned long mask = ~(cycle - 1);
-   unsigned long total = __global_bdi_stat(BDI_WRITEOUT_TOTAL) >> 1;
+   unsigned long global_cycle =
+   (__global_bdi_stat(BDI_WRITEOUT_TOTAL) >> 1) & mask;
unsigned long flags;
 
-   if ((bdi->cycles & mask) == (total & mask))
+   if ((bdi->cycles & mask) == global_cycle)
return;
 
spin_lock_irqsave(&bdi->lock, flags);
-   while ((bdi->cycles & mask) != (total & mask)) {
-   unsigned long half = __bdi_stat(bdi, BDI_WRITEOUT) / 2;
+   while ((bdi->cycles & mask) != global_cycle) {
+   unsigned long val = __bdi_stat(bdi, BDI_WRITEOUT);
+   unsigned long half = (val + 1) / 2;
+
+   if (!val)
+   break;
 
mod_bdi_stat(bdi, BDI_WRITEOUT, -half);
bdi->cycles += cycle;
}
+   bdi->cycles = global_cycle;
spin_unlock_irqrestore(&bdi->lock, flags);
 }
 
@@ -146,10 +159,10 @@ static void bdi_writeout_inc(struct back
if (!bdi_cap_writeback_dirty(bdi))
return;
 
+   bdi_writeout_norm(bdi);
+
__inc_bdi_stat(bdi, BDI_WRITEOUT);
__inc_bdi_stat(bdi, BDI_WRITEOUT_TOTAL);
-
-   bdi_writeout_norm(bdi);
 }
 
 void get_writeout_scale(struct backing_dev_info *bdi, int *scale, int *div)
Index: linux-2.6/fs/fuse/inode.c
===
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -413,6 +413,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(&fc->num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
+   bdi_stat_init(&fc->bdi);
fc->reqctr = 0;
fc->blocked = 1;
get_random_bytes(fc->scramble_key, sizeof(fc->scramble_key));


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.21-rc3-mm2 3/4] futex_requeue_pi optimization

2007-03-20 Thread Peter Zijlstra
On Tue, 2007-03-20 at 16:32 +0100, Pierre Peiffer wrote:
 Peter Zijlstra a écrit :
  +static void *get_futex_address(union futex_key *key)
  +{
  +  void *uaddr;
  +
+  if (key->both.offset & 1) {
+  /* shared mapping */
+  uaddr = (void*)((key->shared.pgoff << PAGE_SHIFT)
+  + key->shared.offset - 1);
+  } else {
+  /* private mapping */
+  uaddr = (void*)(key->private.address + key->private.offset);
  +  }
  +
  +  return uaddr;
  +}
  
  This will not work for nonlinear vmas, granted, not a lot of ppl stick
  futexes in nonlinear vmas, but the futex_key stuff handles it, this
  doesn't.
 
 Indeed ! Thanks for pointing me to this.
 
 Since I'm not familiar with vmm, does this code look more correct to you ?

Unfortunately not, nonlinear vmas don't have a linear relation between
address and offset. What you would need to do is do a linear walk of the
page tables. But even that might not suffice if nonlinear vmas may form
a non-injective, surjective mapping.

/me checks..

Hmm, yes that seems valid, so in general, this reverse mapping does not
uniquely exist for non-linear vmas. :-(

What to do... disallow futexes in nonlinear mappings, store the address
in the key?

 static void *get_futex_address(union futex_key *key)
 {
   void *uaddr;
   struct vm_area_struct *vma = current->mm->mmap;
 
   if (key->both.offset & 1) {
   /* shared mapping */
   struct file * vmf;
 
   do {
   if ((vmf = vma->vm_file)
&& (key->shared.inode == vmf->f_dentry->d_inode))
   break;
   vma = vma->vm_next;
   } while (vma);
 
   if (likely(!(vma->vm_flags & VM_NONLINEAR)))
   uaddr = (void*)((key->shared.pgoff << PAGE_SHIFT)
   + key->shared.offset - 1);
   else
   uaddr = (void*) vma->vm_start
   + ((key->shared.pgoff - vma->vm_pgoff)
   << PAGE_SHIFT)
   + key->shared.offset - 1;
   } else {
   /* private mapping */
   uaddr = (void*)(key->private.address + key->private.offset);
   }
 
   return uaddr;
 }
 
 Or is there a more direct way to retrieve the vma corresponding to the given 
 inode ?

the vma_prio_tree would be able to give all vmas associated with a
mapping.
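
To illustrate that last remark, here is a rough, untested sketch (against the interfaces of that era, ignoring error handling, pfnmap and locking subtleties) of how the prio tree could be walked to recover a user address for a shared key in the current mm; the offset handling mirrors the get_futex_address() quoted above:

static void *futex_key_to_address(union futex_key *key)
{
	struct address_space *mapping = key->shared.inode->i_mapping;
	pgoff_t pgoff = key->shared.pgoff;
	struct vm_area_struct *vma;
	struct prio_tree_iter iter;
	void *uaddr = NULL;

	spin_lock(&mapping->i_mmap_lock);
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
		if (vma->vm_mm != current->mm)
			continue;
		if (vma->vm_flags & VM_NONLINEAR)
			continue;	/* cannot invert a nonlinear mapping here */
		uaddr = (void *)(vma->vm_start +
				 ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) +
				 (key->shared.offset - 1));
		break;
	}
	spin_unlock(&mapping->i_mmap_lock);
	return uaddr;
}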

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RSDL v0.31

2007-03-21 Thread Peter Zijlstra
On Wed, 2007-03-21 at 15:57 +0100, Mike Galbraith wrote:

 'f' is a progglet which sleeps a bit and burns a bit, duration depending
 on argument given. 'sh' is a shell 100% hog.  In this scenario, the
 argument was set such that 'f' used right at 50% cpu.  All are started
 at the same time, and I froze top when the first 'f' reached 1:00.

May one enquire how much CPU the mythical 'f' uses when ran alone? Just
to get a gauge for the numbers?



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lockdep: lockdep_depth vs. debug_locks Re: [2.6.20] BUG: workqueue leaked lock

2007-03-22 Thread Peter Zijlstra
On Thu, 2007-03-22 at 07:11 +0100, Jarek Poplawski wrote:
 Here is some joke:
 
 [PATCH] lockdep: lockdep_depth vs. debug_locks
 
 lockdep really shouldn't be used when debug_locks == 0!

This happens when lockdep reports a fatal error, at which point
it will stop tracking locks and leave whatever state was there.

 Reported-by: Folkert van Heusden [EMAIL PROTECTED]
 Inspired-by: Oleg Nesterov [EMAIL PROTECTED]
 Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

This looks sane, thanks for figuring this out.

Acked-by: Peter Zijlstra [EMAIL PROTECTED]

 ---
 
 diff -Nurp 2.6.21-rc4-git4-/include/linux/lockdep.h 
 2.6.21-rc4-git4/include/linux/lockdep.h
 --- 2.6.21-rc4-git4-/include/linux/lockdep.h  2007-03-20 20:24:17.0 
 +0100
 +++ 2.6.21-rc4-git4/include/linux/lockdep.h   2007-03-21 22:32:41.0 
 +0100
 @@ -245,7 +245,7 @@ extern void lock_release(struct lockdep_
  
  # define INIT_LOCKDEP .lockdep_recursion = 0,
  
 -#define lockdep_depth(tsk)   ((tsk)->lockdep_depth)
 +#define lockdep_depth(tsk)   (debug_locks ? (tsk)->lockdep_depth : 0)
  
  #else /* !LOCKDEP */
  

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lockdep: debug_show_all_locks debug_show_held_locks vs. debug_locks

2007-03-22 Thread Peter Zijlstra
On Thu, 2007-03-22 at 07:57 +0100, Jarek Poplawski wrote:
 And here is some addition.
 
 [PATCH] lockdep: debug_show_all_locks & debug_show_held_locks vs. debug_locks
  
 lockdep's data shouldn't be used when debug_locks == 0
 because it's not updated after this, so it's more misleading
 than helpful.
 
 PS: probably lockdep's current-> fields should be reset after
 it turns debug_locks off: so, after printing a bug report, but
 before return from exported functions, but there are really
 a lot of these possibilities (e.g. after DEBUG_LOCKS_WARN_ON),
 so, something could be missed. (Of course direct use of this
 fields isn't recommended either.)
 
 Reported-by: Folkert van Heusden [EMAIL PROTECTED]
 Inspired-by: Oleg Nesterov [EMAIL PROTECTED]
 Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

Acked-by: Peter Zijlstra [EMAIL PROTECTED]

 ---
 
 diff -Nurp 2.6.21-rc4-git4-/kernel/lockdep.c 2.6.21-rc4-git4/kernel/lockdep.c
 --- 2.6.21-rc4-git4-/kernel/lockdep.c 2007-03-21 22:46:26.0 +0100
 +++ 2.6.21-rc4-git4/kernel/lockdep.c  2007-03-21 23:05:17.0 +0100
 @@ -2742,6 +2742,10 @@ void debug_show_all_locks(void)
   int count = 10;
   int unlock = 1;
  
 + if (unlikely(!debug_locks)) {
 + printk("INFO: lockdep is turned off.\n");
 + return;
 + }
   printk("\nShowing all locks held in the system:\n");
  
   /*
 @@ -2785,6 +2789,10 @@ EXPORT_SYMBOL_GPL(debug_show_all_locks);
  
  void debug_show_held_locks(struct task_struct *task)
  {
 + if (unlikely(!debug_locks)) {
 + printk("INFO: lockdep is turned off.\n");
 + return;
 + }
   lockdep_print_held_locks(task);
  }
  

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()

2007-03-25 Thread Peter Zijlstra
On Sat, 2007-03-24 at 22:55 +0100, Miklos Szeredi wrote:
 This is a slightly different take on the fix for the deadlock in fuse
 with dirty balancing.  David Chinner convinced me, that per-bdi
 counters are too expensive, and that it's not worth trying to account
 the number of pages under writeback, as they will be limited by the
 queue anyway.
 

Please have a look at this:
  http://lkml.org/lkml/2007/3/19/220


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()

2007-03-25 Thread Peter Zijlstra
On Sun, 2007-03-25 at 13:34 +0200, Miklos Szeredi wrote:
   
   Please have a look at this:
 http://lkml.org/lkml/2007/3/19/220
  
  
  
    + if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) >=
   + bdi_thresh)
   + break;
   
  
  Yes, this will resolve the deadlock as well, where balance_dirty_pages()
  is currently looping forever with:
 
 Almost.
 
 This
 
  -   if (nr_reclaimable) {
  +   if (bdi_nr_reclaimable) {
   writeback_inodes(&wbc);
 
 still makes it loop forever if bdi_nr_reclaimable == 0, since the exit
 condition is not checked. 
 
 Shouldn't it break out of the loop if bdi_stat(bdi, BDI_WRITEBACK) >=
 bdi_thresh in this case?

for (;;) {
struct writeback_control wbc = {
.bdi= bdi,
.sync_mode  = WB_SYNC_NONE,
.older_than_this = NULL,
.nr_to_write= write_chunk,
.range_cyclic   = 1,
};

get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
bdi_stat(bdi, BDI_UNSTABLE);
(A) if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) >=
bdi_thresh)
break;

/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 * Unstable writes are a feature of certain networked
 * filesystems (i.e. NFS) in which data may have been
 * written to the server's write cache, but has not yet
 * been flushed to permanent storage.
 */
(B)  if (bdi_nr_reclaimable) {
writeback_inodes(&wbc);

get_dirty_limits(&background_thresh, &dirty_thresh,
   &bdi_thresh, bdi);
bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
bdi_stat(bdi, BDI_UNSTABLE);
(C) if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) >=
bdi_thresh)
break;

pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break;  /* We've done our duty */
}
congestion_wait(WRITE, HZ/10);
}

I'm thinking that if bdi_nr_reclaimable == 0, A reduces to
bdi_stat(bdi, BDI_WRITEBACK) >= bdi_thresh and we're still out of the
loop, no?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/3] only allow nonlinear vmas for ram backed filesystems

2007-03-25 Thread Peter Zijlstra
On Sat, 2007-03-24 at 23:09 +0100, Miklos Szeredi wrote:
 From: Miklos Szeredi [EMAIL PROTECTED]
 
 Dirty page accounting/limiting doesn't work for nonlinear mappings, so
 for non-ram backed filesystems emulate with linear mappings.  This
 retains ABI compatibility with previous kernels at minimal code cost.
 
 All known users of nonlinear mappings actually use tmpfs, so this
 shouldn't have any negative effect.
 
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Acked-by: Peter Zijlstra [EMAIL PROTECTED]

 ---
 
 Index: linux-2.6.21-rc4-mm1/mm/fremap.c
 ===
 --- linux-2.6.21-rc4-mm1.orig/mm/fremap.c 2007-03-24 22:30:05.0 
 +0100
 +++ linux-2.6.21-rc4-mm1/mm/fremap.c  2007-03-24 22:37:59.0 +0100
 @@ -181,6 +181,24 @@ asmlinkage long sys_remap_file_pages(uns
   goto retry;
   }
   mapping = vma->vm_file->f_mapping;
 + /*
 +  * page_mkclean doesn't work on nonlinear vmas, so if dirty
 +  * pages need to be accounted, emulate with linear vmas.
 +  */
 + if (mapping_cap_account_dirty(mapping)) {
 + unsigned long addr;
 +
 + flags &= MAP_NONBLOCK;
 + addr = mmap_region(vma->vm_file, start, size, flags,
 +vma->vm_flags, pgoff, 1);
 + if (IS_ERR_VALUE(addr))
 + err = addr;
 + else {
 + BUG_ON(addr != start);
 + err = 0;
 + }
 + goto out;
 + }
   spin_lock(&mapping->i_mmap_lock);
   flush_dcache_mmap_lock(mapping);
   vma->vm_flags |= VM_NONLINEAR;

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] split mmap

2007-03-25 Thread Peter Zijlstra
On Sat, 2007-03-24 at 23:07 +0100, Miklos Szeredi wrote:
 From: Miklos Szeredi [EMAIL PROTECTED]
 
 This is a straightforward split of do_mmap_pgoff() into two functions:
 
  - do_mmap_pgoff() checks the parameters, and calculates the vma
flags.  Then it calls
 
  - mmap_region(), which does the actual mapping
 
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Acked-by: Peter Zijlstra [EMAIL PROTECTED]

 ---
 
 Index: linux/mm/mmap.c
 ===
 --- linux.orig/mm/mmap.c  2007-03-24 21:00:40.0 +0100
 +++ linux/mm/mmap.c   2007-03-24 22:28:52.0 +0100
 @@ -893,14 +893,11 @@ unsigned long do_mmap_pgoff(struct file 
   unsigned long flags, unsigned long pgoff)
  {
   struct mm_struct * mm = current->mm;
 - struct vm_area_struct * vma, * prev;
   struct inode *inode;
   unsigned int vm_flags;
 - int correct_wcount = 0;
   int error;
 - struct rb_node ** rb_link, * rb_parent;
   int accountable = 1;
 - unsigned long charged = 0, reqprot = prot;
 + unsigned long reqprot = prot;
  
   /*
* Does the application expect PROT_READ to imply PROT_EXEC?
 @@ -1025,7 +1022,25 @@ unsigned long do_mmap_pgoff(struct file 
   error = security_file_mmap(file, reqprot, prot, flags);
   if (error)
   return error;
 - 
 +
 + return mmap_region(file, addr, len, flags, vm_flags, pgoff,
 +accountable);
 +}
 +EXPORT_SYMBOL(do_mmap_pgoff);
 +
 +unsigned long mmap_region(struct file *file, unsigned long addr,
 +   unsigned long len, unsigned long flags,
 +   unsigned int vm_flags, unsigned long pgoff,
 +   int accountable)
 +{
  + struct mm_struct *mm = current->mm;
 + struct vm_area_struct *vma, *prev;
 + int correct_wcount = 0;
 + int error;
 + struct rb_node **rb_link, *rb_parent;
 + unsigned long charged = 0;
  + struct inode *inode =  file ? file->f_path.dentry->d_inode : NULL;
 +
   /* Clear old maps */
   error = -ENOMEM;
  munmap_back:
 @@ -1174,8 +1189,6 @@ unacct_error:
   return error;
  }
  
 -EXPORT_SYMBOL(do_mmap_pgoff);
 -
  /* Get an address range which is currently unmapped.
   * For shmat() with addr=0.
   *
 Index: linux/include/linux/mm.h
 ===
 --- linux.orig/include/linux/mm.h 2007-03-24 21:00:40.0 +0100
 +++ linux/include/linux/mm.h  2007-03-24 22:28:52.0 +0100
 @@ -1035,6 +1035,10 @@ extern unsigned long get_unmapped_area(s
  extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
   unsigned long len, unsigned long prot,
   unsigned long flag, unsigned long pgoff);
 +extern unsigned long mmap_region(struct file *file, unsigned long addr,
 + unsigned long len, unsigned long flags,
 + unsigned int vm_flags, unsigned long pgoff,
 + int accountable);
  
  static inline unsigned long do_mmap(struct file *file, unsigned long addr,
   unsigned long len, unsigned long prot,

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/3] update ctime and mtime for mmaped write

2007-03-25 Thread Peter Zijlstra
On Sat, 2007-03-24 at 23:11 +0100, Miklos Szeredi wrote:
 From: Miklos Szeredi [EMAIL PROTECTED]
 
 Changes:
 v3:
  o rename is_page_modified to test_clear_page_modified
 v2:
  o set AS_CMTIME flag in clear_page_dirty_for_io() too
  o don't clear AS_CMTIME in file_update_time()
  o check the dirty bit in the page tables
 v1:
  o moved check from __fput() to remove_vma(), which is more logical
  o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
  o cleaned up #ifdef CONFIG_BLOCK mess
 
 This patch makes writing to shared memory mappings update st_ctime and
 st_mtime as defined by SUSv3:
 
The st_ctime and st_mtime fields of a file that is mapped with
MAP_SHARED and PROT_WRITE shall be marked for update at some point
in the interval between a write reference to the mapped region and
the next call to msync() with MS_ASYNC or MS_SYNC for that portion
of the file by any process. If there is no such call and if the
underlying file is modified as a result of a write reference, then
these fields shall be marked for update at some time after the
write reference.
 
 A new address_space flag is introduced: AS_CMTIME.  This is set each
 time a page is dirtied through a userspace memory mapping.  This
 includes write accesses via get_user_pages().
 
 Note, the flag is set unconditionally, even if the page is already
 dirty.  This is important, because the page might have been dirtied
 earlier by a non-mmap write.
 
 This flag is checked in msync() and munmap()/mremap(), and if set, the
 file times are updated and the flag is cleared.
 
 Msync also needs to check the dirty bit in the page tables, because
 the data might change again after an msync(MS_ASYNC), while the page
 is already dirty and read-write.  This also makes the time updating
 work for memory backed filesystems such as tmpfs.
 
 This implementation walks the pages in the synced range, and uses rmap
 to find all the ptes for each page.  Non-linear vmas are ignored,
 since the ptes can only be found by scanning the whole vma, which is
 very inefficient.
 
 As an optimization, if dirty pages are accounted, then only walk the
 dirty pages, since the clean pages necessarily have clean ptes.  This
 doesn't work for memory backed filesystems, where no dirty accounting
 is done.
 
 An alternative implementation could check for all intersecting vmas in
 the mapping and walk the page tables for each.  This would probably be
 more efficient for memory backed filesystems and if the number of
 dirty pages is near the total number of pages in the range.
 
 Fixes Novell Bugzilla #206431.
 
 Inspired by Peter Staubach's patch and the resulting comments.
 
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 ---

A few comments..

 Index: linux-2.6.21-rc4-mm1/mm/rmap.c
 ===
 --- linux-2.6.21-rc4-mm1.orig/mm/rmap.c   2007-03-24 19:03:11.0 
 +0100
 +++ linux-2.6.21-rc4-mm1/mm/rmap.c2007-03-24 19:34:30.0 +0100
 @@ -507,6 +507,43 @@ int page_mkclean(struct page *page)
  EXPORT_SYMBOL_GPL(page_mkclean);
  
  /**
 + * test_clear_page_modified - check and clear the dirty bit for all mappings 
 of a page
 + * @page:the page to check
 + */
 +bool test_clear_page_modified(struct page *page)
 +{
  + struct address_space *mapping = page->mapping;

page_mapping(page)? Otherwise that BUG_ON(!mapping) a few lines down
isn't of much use.

  + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 + struct vm_area_struct *vma;
 + struct prio_tree_iter iter;
 + bool modified = false;
 +
 + BUG_ON(!mapping);
 + BUG_ON(!page_mapped(page));
 +
  + spin_lock(&mapping->i_mmap_lock);
  + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
  + if (vma->vm_flags & VM_SHARED) {
  + struct mm_struct *mm = vma->vm_mm;
  + unsigned long addr = vma_address(page, vma);
  + pte_t *pte;
  + spinlock_t *ptl;
  +
  + if (addr != -EFAULT &&
  + (pte = page_check_address(page, mm, addr, &ptl))) {
  + if (ptep_clear_flush_dirty(vma, addr, pte))
  + modified = true;
  + pte_unmap_unlock(pte, ptl);
  + }

It's against coding style to do assignments in conditionals.

  + }
  + }
  + spin_unlock(&mapping->i_mmap_lock);
 + if (page_test_and_clear_dirty(page))
 + modified = true;
 + return modified;
 +}

Why not parametrize page_mkclean() to conditionally wrprotect clean
pages? Something like:

--- mm/rmap.c~  2007-03-11 17:52:20.0 +0100
+++ mm/rmap.c   2007-03-25 14:01:55.0 +0200
@@ -432,7 +432,8 @@
return referenced;
 }
 
-static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+static int
+page_mkclean_one(struct page *page, 

Re: [patch 2/3] only allow nonlinear vmas for ram backed filesystems

2007-03-25 Thread Peter Zijlstra
On Sun, 2007-03-25 at 16:00 -0800, Andrew Morton wrote:
 On Sat, 24 Mar 2007 23:09:19 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote:
 
  Dirty page accounting/limiting doesn't work for nonlinear mappings,
 
 Doesn't it?  iirc the problem is that we don't correctly re-clean the ptes
 while starting writeout.  And the dirty-page accounting is in fact correct
 (it'd darn well better be).

If we do not re-protect the pages on writeout, we'll decrement the dirty
count but not get a fault on re-dirty. Hence the dirty count will
actually skew.

In order to make page_mkclean() work for nonlinear vmas we need to do a
full pte scan for each invocation (we could perhaps only scan 1 in n
times to try and limit the damage) and that hurts. This will basically
render it useless.

The other solution is adding rmap information to nonlinear vmas but
doubling the memory overhead for nonlinear mappings was not deemed a
good idea.

  so
  for non-ram backed filesystems emulate with linear mappings.  This
  retains ABI compatibility with previous kernels at minimal code cost.
  
  All known users of nonlinear mappings actually use tmpfs, so this
  shouldn't have any negative effect.
 
 Unless someone is using remap_file_pages() against an ext3 file, in which
 case their application stops working?

it'll work up to a certain point (when you hit the max vma count). When
this issue first came up a few weeks ago, nobody knew of any non tmpfs
users.

 That would be a problem.  These guys:
 http://www.technovelty.org/code/linux/fremap.html, for example, will be in
 for a little surprise.

Its an example, it doesn't show if there are actually users of this, but
point taken. We actually could make this very example work by noting
that they map PROT_READ only.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()

2007-03-26 Thread Peter Zijlstra
On Mon, 2007-03-26 at 02:08 -0800, Andrew Morton wrote:
 On Mon, 26 Mar 2007 11:32:47 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:
 
  Stopping writers which have idle queues is completely unproductive,
  and that is basically what the current algorithm does.
 
 This is because the kernel permits all of its allotment of dirty+writeback
 pages to be dirty+writeback against a single device.
 
 A good way of solving the one-device-starves-another-one problem is to
 dynamically adjust the per-device dirty+writeback levels so that (for
 example) if two devices are being written to, each gets 50% of the
 allotment.

This is exactly what happens with my patch if both devices write at the
same speed. (Or at least, that is what is supposed to happen ;-)

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 10/12] mm: page_alloc_wait

2007-04-06 Thread Peter Zijlstra
On Thu, 2007-04-05 at 15:57 -0700, Andrew Morton wrote:
 On Thu, 05 Apr 2007 19:42:19 +0200
 [EMAIL PROTECTED] wrote:
 
  Introduce a mechanism to wait on free memory.
  
  Currently congestion_wait() is abused to do this.
 
 Such a very small explanation for such a terrifying change.

Yes, I suck at writing changelogs, bad me. Normally I would take a day
to write them, but I just wanted to get this code out there. Perhaps a
bad decision.

  ...
 
  --- linux-2.6-mm.orig/mm/vmscan.c   2007-04-05 16:29:46.0 +0200
  +++ linux-2.6-mm/mm/vmscan.c2007-04-05 16:29:49.0 +0200
  @@ -1436,6 +1436,7 @@ static int kswapd(void *p)
  finish_wait(pgdat-kswapd_wait, wait);
   
  balance_pgdat(pgdat, order);
  +   page_alloc_ok();
  }
  return 0;
   }
 
 For a start, we don't know that kswapd freed pages which are in a suitable
 zone.  And we don't know that kswapd freed pages which are in a suitable
 cpuset.
 
 congestion_wait() is similarly ignorant of the suitability of the pages,
 but the whole idea behind congestion_wait is that it will throttle page
 allocators to some speed which is proportional to the speed at which the IO
 systems can retire writes - view it as a variable-speed polling operation,
 in which the polling frequency goes up when the IO system gets faster. 
 This patch changes that philosophy fundamentally.  That's worth more than a
 2-line changelog.
 
 Also, there might be situations in which kswapd gets stuck in some dark
 corner.  Perhaps the process which is waiting in the page allocator holds
 filesystem locks which kswapd is blocked on.  Or kswapd might be blocked on
 a particular request queue, or a dead NFS server or something.  The timeout
 will save us, but things will be slow.
 
 There could be other problems too, dunno - this stuff is tricky.  Why are
 you changing it, what problems are being solved, etc?

Lets start with the why, because of 12/12; I wanted to introduce per BDI
congestion feedback, and hence needed a BDI context for
congestion_wait(). These specific callers weren't in the context of a
BDI but of a more global idea.

Perhaps I could call page_alloc_ok() from bdi_congestion_end()
irrespective of the actual BDI uncongested? That would more or less give
the old semantics.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/12] mm: accurate pageout congestion wait

2007-04-06 Thread Peter Zijlstra
On Thu, 2007-04-05 at 16:17 -0700, Andrew Morton wrote:
 On Thu, 05 Apr 2007 19:42:20 +0200
 [EMAIL PROTECTED] wrote:
 
  Only do the congestion wait when we actually encountered congestion.
 
 The name congestion_wait() was accurate back in 2002, but it isn't accurate
 any more, and you got misled.  It does not only wait for a queue to become
 uncongested.

Quite so indeed.

 See clear_bdi_congested()'s callers.  As long as the queue is in an
 uncongested state, we deliver wakeups to congestion_wait() blockers on
 every IO completion.  As I said before, it is so that the MM's polling
 operations poll at a higher frequency when the IO system is working faster.
 (It is also to synchronise with end_page_writeback()'s feeding of clean
 pages to us via rotate_reclaimable_page()).

Hmm, but the condition under which we did call congestion_wait() is a
bit magical.

 Page reclaim can get into trouble without any request queue having entered
 a congested state.  For example, think about a machine which has a single
 disk, and the operator has increased that disk's request queue size to
 100,000.  With your patch all the VM's throttling would be bypassed and we
 go into a busy loop and declare OOM instantly.
 
 There are probably other situations in which page reclaim gets into trouble
 without a request queue being congested.

Ok, in the light of all this, I will think on this some more.

 Minor point: bdi_congested() can be arbitrarily expensive - for DM stackups
 it is roughly proportional to the number of subdevices in the device.  We
 need to be careful about how frequently we call it.

Yuck, ok, good point.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 12/12] mm: per BDI congestion feedback

2007-04-06 Thread Peter Zijlstra
On Thu, 2007-04-05 at 16:24 -0700, Andrew Morton wrote:
 On Thu, 05 Apr 2007 19:42:21 +0200
 [EMAIL PROTECTED] wrote:
 
  Now that we have per BDI dirty throttling it makes sense to also have per 
  BDI
  congestion feedback; why wait on another device if the current one is not
  congested.
 
 Similar comments apply.  congestion_wait() should be called
 throttle_at_a_rate_proportional_to_the_speed_of_presently_uncongested_queues().
 
 If a process is throttled in the page allocator waiting for pages to become
 reclaimable, that process absolutely does not care whether those pages were
 previously dirty against /dev/sda or against /dev/sdb.  It wants to be woken
 up for writeout completion against any queue.

OK, so you disagree with Miklos' 2nd point here:
  http://lkml.org/lkml/2007/4/4/137

And in the light of clear_bdi_congestion() being called for each
writeout completion under the threshold this does make sense.

So this whole 8-12/12 series is not needed and just served as a
learning experience :-/


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 02/12] mm: scalable bdi statistics counters.

2007-04-06 Thread Peter Zijlstra
On Thu, 2007-04-05 at 15:37 -0700, Andrew Morton wrote:
 On Thu, 05 Apr 2007 19:42:11 +0200
 [EMAIL PROTECTED] wrote:
 
  Provide scalable per backing_dev_info statistics counters modeled on the ZVC
  code.
  
  Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
  ---
   block/ll_rw_blk.c   |1 
   drivers/block/rd.c  |2 
   drivers/char/mem.c  |2 
   fs/char_dev.c   |1 
   fs/fuse/inode.c |1 
   fs/nfs/client.c |1 
   include/linux/backing-dev.h |   98 
  +
   mm/backing-dev.c|  103 
  
 
 madness!  Quite duplicative of vmstat.h, yet all this infrastructure
 is still only usable in one specific application.
 
 Can we please look at generalising the vmstat.h stuff?
 
 Or, the API in percpu_counter.h appears suitable to this application.
 (The comment at line 6 is a total lie).

Ok, I'll see what I can come up with.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)

2007-04-06 Thread Peter Zijlstra

Hi,

some thoughts on shared futexes;

Could we get rid of the mmap_sem on the shared futexes in the following
manner:

 - do a page table walk to find the pte;
 - get a page using pfn_to_page (skipping VM_PFNMAP)
 - get the futex key from page->mapping->host and page->index
   and offset from addr % PAGE_SIZE.

or given a key:

 - lookup the page from key.shared.inode-i_mapping by key.shared.pgoff
   possibly loading the page using mapping->a_ops->readpage().

then:

 - perform the futex operation on a kmap of the page


This should all work except for VM_PFNMAP.
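
A compressed restatement of the lookup direction above as code (illustrative only; the helper name is invented, the field names follow union futex_key and the bit-0 'shared' flag from the existing key handling):

/*
 * Illustrative only: build a shared futex key from the page found by
 * the page table walk, as sketched in the steps above.
 */
static void page_to_shared_key(struct page *page, unsigned long uaddr,
                               union futex_key *key)
{
        struct address_space *mapping = page->mapping;

        key->shared.inode = mapping->host;
        key->shared.pgoff = page->index;
        key->shared.offset = (uaddr % PAGE_SIZE) | 1;   /* bit 0 marks 'shared' */
}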

Since the address is passed from userspace we cannot trust it to not
point into a VM_PFNMAP area.

However, with the RCU VMA lookup patches I'm working on we could do that
check without holding locks and without exclusive cachelines; the
question is, is that good enough?

Or is there an alternative way of determining a pfnmap given a
pfn/struct page?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-06 Thread Peter Zijlstra
On Thu, 2007-04-05 at 19:49 +0200, Eric Dumazet wrote:
 Hi all
 
 I'm pleased to present this patch which improves linux futexes performance 
 and 
 scalability, merely avoiding taking mmap_sem rwlock.
 
 Ulrich agreed with the API and said glibc work could start as soon
 as he gets a Fedora kernel with it :)
 
 Andrew, could we get this in mm as well ? This version is against 
 2.6.21-rc5-mm4
 (so supports 64bit futexes)
 
 In this third version I dropped the NUMA optims and process private hash 
 table,
 to let new API come in and be tested.

Good work, Thanks!

Acked-by: Peter Zijlstra [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)

2007-04-06 Thread Peter Zijlstra
On Fri, 2007-04-06 at 14:02 +0100, Hugh Dickins wrote:
 On Fri, 6 Apr 2007, Peter Zijlstra wrote:
  
  some thoughts on shared futexes;
  
  Could we get rid of the mmap_sem on the shared futexes in the following
  manner:
  
   - do a page table walk to find the pte;
 
 (walk meaning descent down the levels, I presume, rather than across)

indeed.

 I've not had time to digest your proposal, and I'm about to go out:
 let me sound a warning that springs to mind, maybe it's entirely
 inapproriate, but better said than kept silent.
 
 It looks as if you're supposing that mmap_sem is needed to find_vma,
 but not for going down the pagetables.  It's not a simple as that:
 you need to be careful that a concurrent munmap from another thread
 isn't freeing pagetables from under you.
 
 Holding (down_read) of mmap_sem is one way to protect against that.
 try_to_unmap doesn't have that luxury: in its case, it's made safe
 by the way free_pgtables does anon_vma_unlink and unlink_file_vma
 before freeing any pagetables, so try_to_unmap etc. won't get there;
 but you can't do that.

Ah, drat.

I had hoped that the pte_lock would solve that race, but that doesn't
cover the upper levels of the tree.

Back to the drawing board..

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)

2007-04-06 Thread Peter Zijlstra
On Fri, 2007-04-06 at 23:15 +1000, Nick Piggin wrote:
 Hugh Dickins wrote:
  On Fri, 6 Apr 2007, Peter Zijlstra wrote:
  
 some thoughts on shared futexes;
 
 Could we get rid of the mmap_sem on the shared futexes in the following
 manner:
 
 I'd imagine shared futexes would be much less common than private for
 threaded programs... I'd say we should reevaluate things once we have
 private futexes, and malloc/free stop hammering mmap_sem so hard...

Indeed, private futexes are by far the most common.

  - get a page using pfn_to_page (skipping VM_PFNMAP)
  - get the futex key from page->mapping->host and page->index
and offset from addr % PAGE_SIZE.
 
 or given a key:
 
  - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
possibly loading the page using mapping-a_ops-readpage().
 
 For shared futexes, wouldn't i_mapping be worse, because you'd be
 ping-ponging the tree_lock between processes, rather than have each
 use their own mmap_sem?

Your lockless pagecache work would solve most of that, no?

 That also only helps for the wakeup case too, doesn't it? You have
 to use the vmas to find out which inode to use to do the wait, I think?
 (unless you introduce a new shared futex API).

one could do something like this:

 struct address_space *mapping = page_mapping(page);
 if (!mapping || mapping == &swapper_space)
   do_private_futex();
 else
   do_shared_futex(mapping->host, page->index, addr % PAGE_SIZE);


But alas, it seems I overlooked that the mmap_sem also protects the page
tables as pointed out by Hugh, so this is all in vain it seems.

Ah well.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8] RSS controller based on process containers (v2)

2007-04-09 Thread Peter Zijlstra

*ugh* /me no like.

The basic premises seems to be that we can track page owners perfectly
(although this patch set does not yet do so), through get/release
operations (on _mapcount).

This is simply not true for unmapped pagecache pages. Those receive no
'release' event; (the usage by find_get_page() could be seen as 'get').

Also, you don't seem to balance the active/inactive scanning on a per
container basis. This skews the per container working set logic.

Lastly, you don't call the slab shrinker for container reclaim; which
would leave slab reclaim only for those few non process specific
allocations, which would greatly skew the pagecache/slab balance.



Let us call 

struct reclaim_struct {
struct list_head active_list;
struct list_head inactive_list;
unsigned long nr_active;
unsigned long nr_inactive;
}

Lets recognise three distinct page categories:
 - anonymous memory,
 - mapped pagecache, and
 - unmapped pagecache.


We then keep anonymous pages on a per container reclaim_struct, these
pages are fully accounted to each container.

We keep mapped pagecache pages on per inode reclaim_structs, these files
could be shared between containers and we could either just account all
pages belonging to each file proportional to the number of containers
involved, or do a more precise accounting.

We keep unmapped pagecache pages on a global reclaim_struct, these pages
can, in general, not be pinned to a specific container; all we can do is
keep a floating proportion relative to container 'get' events
(find_get_page() and perhaps add_to_page_cache()).

Reclaim will then have to fairly reclaim pages from all of these lists.
If we schedule such that it appears that these lists are parallel
instead of serial - that is a each tail is really a tail, not the head
of another list - the current reclaim semantics are preserved.

The slab shrinker should be called proportional to the containers size
relative to the machine.

Global reclaim will have to call each container reclaim in proportional
fashion.
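
One way to picture that "proportional fashion" (a rough sketch only, not kernel code; shrink_reclaim_struct() is an invented helper standing in for the actual per-list scanning, and reclaim_struct is the structure sketched above):

static void shrink_reclaim_struct(struct reclaim_struct *rs, unsigned long nr);

static void shrink_parallel(struct reclaim_struct **lists, int nr_lists,
                            unsigned long nr_to_scan)
{
        unsigned long total = 0;
        int i;

        for (i = 0; i < nr_lists; i++)
                total += lists[i]->nr_active + lists[i]->nr_inactive;

        if (!total)
                return;

        /* each list gives up its proportional share; tails stay tails */
        for (i = 0; i < nr_lists; i++) {
                unsigned long size = lists[i]->nr_active +
                                     lists[i]->nr_inactive;

                shrink_reclaim_struct(lists[i], nr_to_scan * size / total);
        }
}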

The biggest problem with this approach is that there is no per zone
reclaim left, which is relied upon by the allocator to provide free
pages in a given physical address range. However there has been talk to
create a proper range allocator independent of zones.

Just my 0.02 euro..

Peter


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why kmem_cache_free occupy CPU for more than 10 seconds?

2007-04-11 Thread Peter Zijlstra
On Wed, 2007-04-11 at 02:53 -0700, Paul Jackson wrote:
 I'm confused - which end of ths stack is up?
 
 cpuset_exit doesn't call do_exit, rather it's the other
 way around.  But put_files_struct doesn't call do_exit,
 rather do_exit calls __exit_files calls put_files_struct.

I'm guessing it's x86_64 which generates crap traces.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

