Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-11-26 Thread Dave Chinner
On Mon, Nov 26, 2012 at 11:53:45AM +, Alan Cox wrote:
> > It's not like there is any shortage of flag bits, so what's the harm
> > of reserving the bit?
> 
> Why not just reserve a small group of bits for fs private use in that
> case - for any fs.

That's flawed - a flag bit would have to mean one thing for all
filesystems; otherwise the same binary could behave very differently
on different filesystems.

Besides, we already have a mechanism for adding filesystem specific
interfaces. It's called an ioctl.  That's what it's there for - a
free-form extensible interface that can be wholly defined and
contained in the out-of-tree patch.

Most filesystems implement ioctls for their own specific
functionality, including for one-off preallocation semantics (e.g.
XFS_IOC_ZERO_RANGE). There is no reason why ext4 can't do the same
thing and we can drop the whole issue of having to modify a syscall
API with magic, undocumented flag bits with unpredictable
behaviour.
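
To illustrate the shape of such an interface: a filesystem-private
preallocation ioctl is nothing more than an ioctl number and an
argument structure defined in the filesystem's own headers. A rough
sketch (the names are purely illustrative, not an existing UAPI):

#include <linux/ioctl.h>
#include <linux/types.h>

/* illustrative only - not an existing interface */
struct myfs_prealloc_arg {
	__u64	offset;		/* start of the range, in bytes */
	__u64	length;		/* length of the range, in bytes */
	__u32	flags;		/* filesystem-private semantics */
	__u32	pad;
};

#define MYFS_IOC_PREALLOC	_IOW('f', 200, struct myfs_prealloc_arg)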

ext4 is not a special snowflake that allows developers to bend rules
whenever they want. If the ext4 developers want to support out of
tree functionality for their filesystem, then they can do it within
their filesystem via ioctls like everyone else does.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH 06/19] list: add a new LRU list type

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Several subsystems use the same construct for LRU lists - a list
head, a spin lock and an item count. They also use exactly the same
code for adding and removing items from the LRU. Create a generic
type for these LRU lists.

This is the beginning of generic, node aware LRUs for shrinkers to
work with.
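
As an illustration of the intended usage, a minimal user of this API
would look something like the sketch below (the cache object type and
shrink routine are hypothetical, not part of this patch):

#include <linux/list.h>
#include <linux/list_lru.h>
#include <linux/spinlock.h>

/* hypothetical cache object - illustrative only */
struct my_object {
	struct list_head	lru;	/* linked into the list_lru */
	/* ... object payload ... */
};

static struct list_lru my_lru;		/* list_lru_init(&my_lru) at setup */

/*
 * Isolate callback: the item has been selected for reclaim, so take it
 * off the LRU and onto a private dispose list supplied via cb_arg.
 */
static int my_isolate(struct list_head *item, spinlock_t *lock, void *cb_arg)
{
	struct list_head *dispose = cb_arg;

	list_move(item, dispose);
	return 0;			/* 0 == item removed from the LRU */
}

static void my_cache_shrink(long nr_to_scan)
{
	LIST_HEAD(dispose);

	/* objects were added with list_lru_add(&my_lru, &obj->lru) */
	list_lru_walk(&my_lru, my_isolate, &dispose, nr_to_scan);

	/* now free everything on the dispose list with no LRU lock held */
}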

Signed-off-by: Dave Chinner 
---
 include/linux/list_lru.h |   36 ++
 lib/Makefile |2 +-
 lib/list_lru.c   |  117 ++
 3 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/list_lru.h
 create mode 100644 lib/list_lru.c

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
new file mode 100644
index 000..3423949
--- /dev/null
+++ b/include/linux/list_lru.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#ifndef _LRU_LIST_H
+#define _LRU_LIST_H 0
+
+#include 
+
+struct list_lru {
+   spinlock_t  lock;
+   struct list_headlist;
+   longnr_items;
+};
+
+int list_lru_init(struct list_lru *lru);
+int list_lru_add(struct list_lru *lru, struct list_head *item);
+int list_lru_del(struct list_lru *lru, struct list_head *item);
+
+static inline long list_lru_count(struct list_lru *lru)
+{
+   return lru->nr_items;
+}
+
+typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
+   void *cb_arg);
+typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
+
+long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+  void *cb_arg, long nr_to_walk);
+
+long list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
+
+#endif /* _LRU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..a0849d7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 idr.o int_sqrt.o extable.o \
 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-is_single_threaded.o plist.o decompress.o
+is_single_threaded.o plist.o decompress.o list_lru.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/list_lru.c b/lib/list_lru.c
new file mode 100644
index 000..475d0e9
--- /dev/null
+++ b/lib/list_lru.c
@@ -0,0 +1,117 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#include 
+#include 
+#include 
+
+int
+list_lru_add(
+   struct list_lru *lru,
+   struct list_head *item)
+{
+   spin_lock(&lru->lock);
+   if (list_empty(item)) {
+   list_add_tail(item, &lru->list);
+   lru->nr_items++;
+   spin_unlock(&lru->lock);
+   return 1;
+   }
+   spin_unlock(&lru->lock);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_add);
+
+int
+list_lru_del(
+   struct list_lru *lru,
+   struct list_head *item)
+{
+   spin_lock(&lru->lock);
+   if (!list_empty(item)) {
+   list_del_init(item);
+   lru->nr_items--;
+   spin_unlock(&lru->lock);
+   return 1;
+   }
+   spin_unlock(&lru->lock);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_del);
+
+long
+list_lru_walk(
+   struct list_lru *lru,
+   list_lru_walk_cb isolate,
+   void*cb_arg,
+   longnr_to_walk)
+{
+   struct list_head *item, *n;
+   long removed = 0;
+restart:
+   spin_lock(&lru->lock);
+   list_for_each_safe(item, n, &lru->list) {
+   int ret;
+
+   if (nr_to_walk-- < 0)
+   break;
+
+   ret = isolate(item, &lru->lock, cb_arg);
+   switch (ret) {
+   case 0: /* item removed from list */
+   lru->nr_items--;
+   removed++;
+   break;
+   case 1: /* item referenced, give another pass */
+   list_move_tail(item, &lru->list);
+   break;
+   case 2: /* item cannot be locked, skip */
+   break;
+   case 3: /* item not freeable, lock dropped */
+   goto restart;
+   default:
+   BUG();
+   }
+   }
+   spin_unlock(&lru->lock);
+   return removed;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+long
+list_lru_dispose_all(
+   struct list_lru *lru,
+   list_lru_dispose_cb dispose)
+{
+   long disposed = 0;
+   LIST_HEAD(dispose_list);
+
+   spin_lock(&lru->lock);
+   while (!list_empty(&

[PATCH 14/19] xfs: use generic AG walk for background inode reclaim

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

The per-ag inode cache radix trees are not walked by the shrinkers
any more, so there is no need for a special walker containing
heuristics to prevent multiple shrinker instances from colliding
with each other. Hence we can just remove that and convert the code
to use the generic walker.

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_ag.h  |2 -
 fs/xfs/xfs_icache.c  |  217 +++---
 fs/xfs/xfs_icache.h  |4 +-
 fs/xfs/xfs_mount.c   |1 -
 fs/xfs/xfs_qm_syscalls.c |2 +-
 5 files changed, 55 insertions(+), 171 deletions(-)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index f2aeedb..40a7df9 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -218,8 +218,6 @@ typedef struct xfs_perag {
spinlock_t  pag_ici_lock;   /* incore inode cache lock */
struct radix_tree_root pag_ici_root;/* incore inode cache root */
int pag_ici_reclaimable;/* reclaimable inodes */
-   struct mutexpag_ici_reclaim_lock;   /* serialisation point */
-   unsigned long   pag_ici_reclaim_cursor; /* reclaim restart point */
 
/* buffer cache index */
spinlock_t  pag_buf_lock;   /* lock for pag_buf_tree */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 82b053f..5cfc2eb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -468,7 +468,8 @@ out_error_or_again:
 
 STATIC int
 xfs_inode_ag_walk_grab(
-   struct xfs_inode*ip)
+   struct xfs_inode*ip,
+   int flags)
 {
struct inode*inode = VFS_I(ip);
 
@@ -517,6 +518,7 @@ STATIC int
 xfs_inode_ag_walk(
struct xfs_mount*mp,
struct xfs_perag*pag,
+   int (*grab)(struct xfs_inode *ip, int flags),
int (*execute)(struct xfs_inode *ip,
   struct xfs_perag *pag, int flags,
   void *args),
@@ -530,6 +532,9 @@ xfs_inode_ag_walk(
int done;
int nr_found;
 
+   if (!grab)
+   grab = xfs_inode_ag_walk_grab;
+
 restart:
done = 0;
skipped = 0;
@@ -564,7 +569,7 @@ restart:
for (i = 0; i < nr_found; i++) {
struct xfs_inode *ip = batch[i];
 
-   if (done || xfs_inode_ag_walk_grab(ip))
+   if (done || grab(ip, flags))
batch[i] = NULL;
 
/*
@@ -593,7 +598,8 @@ restart:
if (!batch[i])
continue;
error = execute(batch[i], pag, flags, args);
-   IRELE(batch[i]);
+   if (grab == xfs_inode_ag_walk_grab)
+   IRELE(batch[i]);
if (error == EAGAIN) {
skipped++;
continue;
@@ -617,35 +623,10 @@ restart:
return last_error;
 }
 
-/*
- * Background scanning to trim post-EOF preallocated space. This is queued
- * based on the 'background_prealloc_discard_period' tunable (5m by default).
- */
-STATIC void
-xfs_queue_eofblocks(
-   struct xfs_mount *mp)
-{
-   rcu_read_lock();
-   if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_EOFBLOCKS_TAG))
-   queue_delayed_work(mp->m_eofblocks_workqueue,
-  &mp->m_eofblocks_work,
-  msecs_to_jiffies(xfs_eofb_secs * 1000));
-   rcu_read_unlock();
-}
-
-void
-xfs_eofblocks_worker(
-   struct work_struct *work)
-{
-   struct xfs_mount *mp = container_of(to_delayed_work(work),
-   struct xfs_mount, m_eofblocks_work);
-   xfs_icache_free_eofblocks(mp, NULL);
-   xfs_queue_eofblocks(mp);
-}
-
 int
 xfs_inode_ag_iterator(
struct xfs_mount*mp,
+   int (*grab)(struct xfs_inode *ip, int flags),
int (*execute)(struct xfs_inode *ip,
   struct xfs_perag *pag, int flags,
   void *args),
@@ -660,7 +641,8 @@ xfs_inode_ag_iterator(
ag = 0;
while ((pag = xfs_perag_get(mp, ag))) {
ag = pag->pag_agno + 1;
-   error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1);
+   error = xfs_inode_ag_walk(mp, pag, grab, execute,
+ flags, args, -1);
xfs_perag_put(pag);
if (error) {
last_error = error;
@@ -674,6 +656,7 @@ xfs_inode_ag_iterator(
 int
 xfs_inode_ag_iterator_tag(
struct xfs_mount*mp,
+   int (*grab)(struct xfs_i

[PATCH 18/19] shrinker: convert remaining shrinkers to count/scan API

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Convert the remaining couple of random shrinkers in the tree to the
new API.
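
The conversion pattern is the same in each case; a minimal shrinker
under the count/scan API looks roughly like this (the object counter
and prune function are hypothetical stand-ins for the cache's own
machinery):

#include <linux/atomic.h>
#include <linux/shrinker.h>

static atomic_long_t my_cache_nr_objects;	/* hypothetical object count */

static long
my_cache_count(struct shrinker *shrink, struct shrink_control *sc)
{
	/* cheap count of freeable objects - no reclaim work done here */
	return atomic_long_read(&my_cache_nr_objects);
}

static long
my_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	/* scan up to sc->nr_to_scan objects, return the number freed */
	return my_cache_prune(sc->nr_to_scan);	/* hypothetical pruner */
}

static struct shrinker my_cache_shrinker = {
	.count_objects	= my_cache_count,
	.scan_objects	= my_cache_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* registered with register_shrinker(&my_cache_shrinker) at init time */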

Signed-off-by: Dave Chinner 
---
 arch/x86/kvm/mmu.c |   35 +--
 net/sunrpc/auth.c  |   45 +++--
 2 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6f85fe0..3dbc3c0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4212,26 +4212,28 @@ restart:
spin_unlock(&kvm->mmu_lock);
 }
 
-static void kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm,
+static long kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm,
struct list_head *invalid_list)
 {
struct kvm_mmu_page *page;
 
if (list_empty(&kvm->arch.active_mmu_pages))
-   return;
+   return 0;
 
page = container_of(kvm->arch.active_mmu_pages.prev,
struct kvm_mmu_page, link);
-   kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
+   return kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
 }
 
-static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
+
+static long
+mmu_shrink_scan(
+   struct shrinker *shrink,
+   struct shrink_control   *sc)
 {
struct kvm *kvm;
int nr_to_scan = sc->nr_to_scan;
-
-   if (nr_to_scan == 0)
-   goto out;
+   long freed = 0;
 
raw_spin_lock(&kvm_lock);
 
@@ -4259,24 +4261,37 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
idx = srcu_read_lock(&kvm->srcu);
spin_lock(&kvm->mmu_lock);
 
-   kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
+   freed += kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
 
+   /*
+* unfair on small ones
+* per-vm shrinkers cry out
+* sadness comes quickly
+*/
list_move_tail(&kvm->vm_list, &vm_list);
break;
}
 
raw_spin_unlock(&kvm_lock);
+   return freed;
 
-out:
+}
+
+static long
+mmu_shrink_count(
+   struct shrinker *shrink,
+   struct shrink_control   *sc)
+{
return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
 }
 
 static struct shrinker mmu_shrinker = {
-   .shrink = mmu_shrink,
+   .count_objects = mmu_shrink_count,
+   .scan_objects = mmu_shrink_scan,
.seeks = DEFAULT_SEEKS * 10,
 };
 
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index b5c067b..969c629 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -340,12 +340,13 @@ EXPORT_SYMBOL_GPL(rpcauth_destroy_credcache);
 /*
  * Remove stale credentials. Avoid sleeping inside the loop.
  */
-static int
+static long
 rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 {
spinlock_t *cache_lock;
struct rpc_cred *cred, *next;
unsigned long expired = jiffies - RPC_AUTH_EXPIRY_MORATORIUM;
+   long freed = 0;
 
list_for_each_entry_safe(cred, next, &cred_unused, cr_lru) {
 
@@ -357,10 +358,11 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
 */
if (time_in_range(cred->cr_expire, expired, jiffies) &&
test_bit(RPCAUTH_CRED_HASHED, &cred->cr_flags) != 0)
-   return 0;
+   break;
 
list_del_init(&cred->cr_lru);
number_cred_unused--;
+   freed++;
if (atomic_read(&cred->cr_count) != 0)
continue;
 
@@ -373,29 +375,43 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
}
spin_unlock(cache_lock);
}
-   return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
+   return freed;
 }
 
 /*
  * Run memory cache shrinker.
  */
-static int
-rpcauth_cache_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+static long
+rpcauth_cache_shrink_scan(
+   struct shrinker *shrink,
+   struct shrink_control   *sc)
+
 {
LIST_HEAD(free);
-   int res;
-   int nr_to_scan = sc->nr_to_scan;
-   gfp_t gfp_mask = sc->gfp_mask;
+   long freed;
+
+   if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+   return -1;
 
-   if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
-   return (nr_to_scan == 0) ? 0 : -1;
+   /* nothing left, don't come back */
if (list_empty(&cred_unused))
-   return 0;
+   return -1;
+
spin_lock(&rpc_credc

[PATCH 19/19] shrinker: Kill old ->shrink API.

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

There are no more users of this API, so kill it dead, dead, dead and
quietly bury the corpse in a shallow, unmarked grave in a dark
forest deep in the hills...

Signed-off-by: Dave Chinner 
---
 include/linux/shrinker.h  |   15 +--
 include/trace/events/vmscan.h |4 ++--
 mm/vmscan.c   |   39 ---
 3 files changed, 15 insertions(+), 43 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index e71286f..d4636a0 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -7,14 +7,15 @@
  *
  * The 'gfpmask' refers to the allocation we are currently trying to
  * fulfil.
- *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrink_control {
gfp_t gfp_mask;
 
-   /* How many slab objects shrinker() should scan and try to reclaim */
+   /*
+* How many objects scan_objects should scan and try to reclaim.
+* This is reset before every call, so it is safe for callees
+* to modify.
+*/
long nr_to_scan;
 
/* shrink from these nodes */
@@ -24,11 +25,6 @@ struct shrink_control {
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
- * @shrink() should look through the least-recently-used 'nr_to_scan' entries
- * and attempt to free them up.  It should return the number of objects which
- * remain in the cache.  If it returns -1, it means it cannot do any scanning at
- * this time (eg. there is a risk of deadlock).
- *
  * @count_objects should return the number of freeable items in the cache. If
  * there are no objects to free or the number of freeable items cannot be
  * determined, it should return 0. No deadlock checks should be done during the
@@ -44,7 +40,6 @@ struct shrink_control {
  * @scan_objects will be made from the current reclaim context.
  */
 struct shrinker {
-   int (*shrink)(struct shrinker *, struct shrink_control *sc);
long (*count_objects)(struct shrinker *, struct shrink_control *sc);
long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 63cfccc..132a985 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -202,7 +202,7 @@ TRACE_EVENT(mm_shrink_slab_start,
 
TP_fast_assign(
__entry->shr = shr;
-   __entry->shrink = shr->shrink;
+   __entry->shrink = shr->scan_objects;
__entry->nr_objects_to_shrink = nr_objects_to_shrink;
__entry->gfp_flags = sc->gfp_mask;
__entry->pgs_scanned = pgs_scanned;
@@ -241,7 +241,7 @@ TRACE_EVENT(mm_shrink_slab_end,
 
TP_fast_assign(
__entry->shr = shr;
-   __entry->shrink = shr->shrink;
+   __entry->shrink = shr->scan_objects;
__entry->unused_scan = unused_scan_cnt;
__entry->new_scan = new_scan_cnt;
__entry->retval = shrinker_retval;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4a602ec..81731f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -176,14 +176,6 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
-static inline int do_shrinker_shrink(struct shrinker *shrinker,
-struct shrink_control *sc,
-unsigned long nr_to_scan)
-{
-   sc->nr_to_scan = nr_to_scan;
-   return (*shrinker->shrink)(shrinker, sc);
-}
-
 #define SHRINK_BATCH 128
 /*
  * Call the shrink functions to age shrinkable caches
@@ -229,11 +221,8 @@ unsigned long shrink_slab(struct shrink_control *sc,
long batch_size = shrinker->batch ? shrinker->batch
  : SHRINK_BATCH;
 
-   if (shrinker->scan_objects) {
-   max_pass = shrinker->count_objects(shrinker, sc);
-   WARN_ON(max_pass < 0);
-   } else
-   max_pass = do_shrinker_shrink(shrinker, sc, 0);
+   max_pass = shrinker->count_objects(shrinker, sc);
+   WARN_ON(max_pass < 0);
if (max_pass <= 0)
continue;
 
@@ -252,7 +241,7 @@ unsigned long shrink_slab(struct shrink_control *sc,
if (total_scan < 0) {
printk(KERN_ERR
"shrink_slab: %pF negative objects to delete nr=%ld\n",
-  shrinker->shrink, total_scan);
+  shrinker->scan_objects, total_scan);
total_scan = max_pass;

[PATCH 10/19] shrinker: add node awareness

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Pass the node of the current zone being reclaimed to shrink_slab(),
allowing the shrinker control nodemask to be set appropriately for
node aware shrinkers.
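
On the shrinker side, a node-aware implementation can then restrict
its scan to the nodes in sc->nodes_to_scan, along the lines of this
sketch (count/scan form from later in the series; the per-node prune
function is a hypothetical stand-in):

#include <linux/nodemask.h>
#include <linux/shrinker.h>

static long
my_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	long freed = 0;
	int nid;

	/* only scan the nodes the VM is currently reclaiming from */
	for_each_node_mask(nid, sc->nodes_to_scan)
		freed += my_cache_prune_node(nid, sc->nr_to_scan);

	return freed;
}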

Signed-off-by: Dave Chinner 
---
 fs/drop_caches.c |1 +
 include/linux/shrinker.h |3 +++
 mm/memory-failure.c  |2 ++
 mm/vmscan.c  |   12 +---
 4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c00e055..9fd702f 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -44,6 +44,7 @@ static void drop_slab(void)
.gfp_mask = GFP_KERNEL,
};
 
+   nodes_setall(shrink.nodes_to_scan);
do {
nr_objects = shrink_slab(&shrink, 1000, 1000);
} while (nr_objects > 10);
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 4f59615..e71286f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -16,6 +16,9 @@ struct shrink_control {
 
/* How many slab objects shrinker() should scan and try to reclaim */
long nr_to_scan;
+
+   /* shrink from these nodes */
+   nodemask_t nodes_to_scan;
 };
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..7bcbde4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -248,10 +248,12 @@ void shake_page(struct page *p, int access)
 */
if (access) {
int nr;
+   int nid = page_to_nid(p);
do {
struct shrink_control shrink = {
.gfp_mask = GFP_KERNEL,
};
+   node_set(nid, shrink.nodes_to_scan);
 
nr = shrink_slab(&shrink, 1000, 1000);
if (page_count(p) == 1)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 55c4fc9..4a602ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2122,15 +2122,20 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 */
if (global_reclaim(sc)) {
unsigned long lru_pages = 0;
+
+   nodes_clear(shrink->nodes_to_scan);
for_each_zone_zonelist(zone, z, zonelist,
gfp_zone(sc->gfp_mask)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
 
lru_pages += zone_reclaimable_pages(zone);
+   node_set(zone_to_nid(zone),
+shrink->nodes_to_scan);
}
 
shrink_slab(shrink, sc->nr_scanned, lru_pages);
+
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -2682,6 +2687,8 @@ loop_again:
shrink_zone(zone, &sc);
 
reclaim_state->reclaimed_slab = 0;
+   nodes_clear(shrink.nodes_to_scan);
+   node_set(zone_to_nid(zone), shrink.nodes_to_scan);
nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
@@ -3318,10 +3325,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 * number of slab pages and shake the slab until it is reduced
 * by the same nr_pages that we used for reclaiming unmapped
 * pages.
-*
-* Note that shrink_slab will free memory on all zones and may
-* take a long time.
 */
+   nodes_clear(shrink.nodes_to_scan);
+   node_set(zone_to_nid(zone), shrink.nodes_to_scan);
for (;;) {
unsigned long lru_pages = zone_reclaimable_pages(zone);
 
-- 
1.7.10



[PATCH 13/19] xfs: Node aware direct inode reclaim

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

XFS currently only tracks inodes for reclaim via tag bits in the
inode cache radix tree. While this is awesome for background reclaim
because it allows inodes to be reclaimed in ascending disk offset
order, it sucks for direct memory reclaim which really is trying to
free the oldest inodes from memory.

As such, the direct reclaim code is a bit of a mess. It has all
sorts of heuristics code to try to avoid dueling shrinker problems
and to limit each radix tree to a single direct reclaim walker at a
time. We can do better.

Given that by the time we mark an inode as under reclaim it has
already been evicted from the VFS inode cache, we can reuse the
struct inode LRU fields to hold our own reclaim-ordered LRU list.
With the generic LRU code this doesn't impact scalability, and the
shrinker can walk the LRU lists directly, giving us node-aware
inode cache reclaim.

This means that we get the best of both worlds - background reclaim
runs very efficiently in terms of IO for cleaning dirty reclaimable
inodes, while direct reclaim can walk the LRU lists and pick inodes
to reclaim that suit the MM subsystem the best.

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_icache.c |   77 ++-
 fs/xfs/xfs_icache.h |4 +--
 fs/xfs/xfs_linux.h  |1 +
 fs/xfs/xfs_mount.h  |2 +-
 fs/xfs/xfs_super.c  |6 ++--
 5 files changed, 65 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2f91e2b..82b053f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -244,6 +244,8 @@ xfs_iget_cache_hit(
 
spin_unlock(&ip->i_flags_lock);
spin_unlock(&pag->pag_ici_lock);
+
+   list_lru_del(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
} else {
/* If the VFS inode is being torn down, pause and try again. */
if (!igrab(inode)) {
@@ -990,6 +992,17 @@ reclaim:
spin_unlock(&pag->pag_ici_lock);
 
/*
+* It is safe to do this unlocked check as we've guaranteed that we have
+* exclusive access to this inode via the XFS_IRECLAIM flag. Hence
+* concurrent LRU list walks will avoid removing this inode from the
+* list. For direct reclaim, we know the inode has already been removed
+* from any list it might be on, hence there's no need to traffic the
+* LRU code to find that out.
+*/
+   if (!list_empty(&VFS_I(ip)->i_lru))
+   list_lru_del(&ip->i_mount->m_inode_lru, &VFS_I(ip)->i_lru);
+
+   /*
 * Here we do an (almost) spurious inode lock in order to coordinate
 * with inode cache radix tree lookups.  This is because the lookup
 * can reference the inodes in the cache without taking references.
@@ -1155,6 +1168,32 @@ xfs_reclaim_inodes(
return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
 }
 
+static int
+xfs_reclaim_inode_isolate(
+   struct list_head*item,
+   spinlock_t  *lru_lock,
+   void*cb_arg)
+{
+   struct inode*inode = container_of(item, struct inode,
+ i_lru);
+   struct list_head*dispose = cb_arg;
+
+   rcu_read_lock();
+   if (xfs_reclaim_inode_grab(XFS_I(inode), SYNC_TRYLOCK)) {
+   /* not a reclaim candidate, skip it */
+   rcu_read_unlock();
+   return 2;
+   }
+   rcu_read_unlock();
+
+   /*
+* We have the XFS_IRECLAIM flag set now, so nobody is going to touch
+* this inode now except us.
+*/
+   list_move(item, dispose);
+   return 0;
+}
+
 /*
  * Scan a certain number of inodes for reclaim.
  *
@@ -1167,36 +1206,34 @@ xfs_reclaim_inodes(
 long
 xfs_reclaim_inodes_nr(
struct xfs_mount*mp,
-   longnr_to_scan)
+   longnr_to_scan,
+   nodemask_t  *nodes_to_scan)
 {
-   long nr = nr_to_scan;
+   LIST_HEAD(dispose);
+   long freed;
 
/* kick background reclaimer and push the AIL */
xfs_reclaim_work_queue(mp);
xfs_ail_push_all(mp->m_ail);
 
-   xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr);
-   return nr_to_scan - nr;
-}
+   freed = list_lru_walk_nodemask(&mp->m_inode_lru,
+  xfs_reclaim_inode_isolate, &dispose,
+  nr_to_scan, nodes_to_scan);
 
-/*
- * Return the number of reclaimable inodes in the filesystem for
- * the shrinker to determine how much to reclaim.
- */
-long
-xfs_reclaim_inodes_count(
-   struct xfs_mount*mp)
-{
-   struct xfs_perag*pag;
-   xfs_agnumber_t  ag = 0;
-   longreclaimable = 0;
+   while (!list_

[PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Convert the driver shrinkers to the new API. Most changes are
compile tested only because I either don't have the hardware or it's
staging stuff.

FWIW, the md and android code is pretty good, but the rest of it
makes me want to claw my eyes out.  The amount of broken code I just
encountered is mind boggling.  I've added comments explaining what
is broken, but I fear that some of the code would be best dealt with
by being dragged behind the bike shed, buried in mud up to its
neck and then run over repeatedly with a blunt lawn mower.

Special mention goes to the zcache/zcache2 drivers. They can't
co-exist in the build at the same time, they are under different
menu options in menuconfig, they only show up when you've got the
right set of mm subsystem options configured and so even compile
testing is an exercise in pulling teeth.  And that doesn't even take
into account the horrible, broken code...

Signed-off-by: Dave Chinner 
---
 drivers/gpu/drm/i915/i915_dma.c   |4 +-
 drivers/gpu/drm/i915/i915_gem.c   |   64 +---
 drivers/gpu/drm/ttm/ttm_page_alloc.c  |   48 ++---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c  |   55 +++-
 drivers/md/dm-bufio.c |   65 +++--
 drivers/staging/android/ashmem.c  |   44 ---
 drivers/staging/android/lowmemorykiller.c |   60 +-
 drivers/staging/ramster/zcache-main.c |   58 ++---
 drivers/staging/zcache/zcache-main.c  |   40 ++
 9 files changed, 297 insertions(+), 141 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 61ae104..0ddec32 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1658,7 +1658,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
return 0;
 
 out_gem_unload:
-   if (dev_priv->mm.inactive_shrinker.shrink)
+   if (dev_priv->mm.inactive_shrinker.scan_objects)
unregister_shrinker(&dev_priv->mm.inactive_shrinker);
 
if (dev->pdev->msi_enabled)
@@ -1695,7 +1695,7 @@ int i915_driver_unload(struct drm_device *dev)
 
i915_teardown_sysfs(dev);
 
-   if (dev_priv->mm.inactive_shrinker.shrink)
+   if (dev_priv->mm.inactive_shrinker.scan_objects)
unregister_shrinker(&dev_priv->mm.inactive_shrinker);
 
mutex_lock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 107f09b..ceab752 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -53,8 +53,10 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
 struct drm_i915_fence_reg *fence,
 bool enable);
 
-static int i915_gem_inactive_shrink(struct shrinker *shrinker,
+static long i915_gem_inactive_count(struct shrinker *shrinker,
struct shrink_control *sc);
+static long i915_gem_inactive_scan(struct shrinker *shrinker,
+  struct shrink_control *sc);
 static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
 static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
 static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
@@ -4197,7 +4199,8 @@ i915_gem_load(struct drm_device *dev)
 
dev_priv->mm.interruptible = true;
 
-   dev_priv->mm.inactive_shrinker.shrink = i915_gem_inactive_shrink;
+   dev_priv->mm.inactive_shrinker.count_objects = i915_gem_inactive_count;
+   dev_priv->mm.inactive_shrinker.scan_objects = i915_gem_inactive_scan;
dev_priv->mm.inactive_shrinker.seeks = DEFAULT_SEEKS;
register_shrinker(&dev_priv->mm.inactive_shrinker);
 }
@@ -4407,35 +4410,64 @@ void i915_gem_release(struct drm_device *dev, struct drm_file *file)
spin_unlock(&file_priv->mm.lock);
 }
 
-static int
-i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
+/*
+ * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
+ *
+ * i915_gem_purge() expects a byte count to be passed, and the minimum object
+ * size is PAGE_SIZE. The shrinker doesn't work on bytes - it works on
+ * *objects*. So it passes a nr_to_scan of 128 objects, which is interpreted
+ * here to mean "free 128 bytes". That means a single object will be freed, as
+ * the minimum object size is a page.
+ *
+ * But the craziest part comes when i915_gem_purge() has walked all the objects
+ * and can't free any memory. That results in i915_gem_shrink_all() being
+ * called, which idles the GPU and frees everything the driver has in its
+ * active and inactiv

[PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Before we split up the dcache_lru_lock, the unused dentry counter
needs to be made independent of the global dcache_lru_lock. Convert
it to per-cpu counters to do this.

Signed-off-by: Dave Chinner 
Reviewed-by: Christoph Hellwig 
---
 fs/dcache.c |   17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3a463d0..2fc0daa 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
 };
 
 static DEFINE_PER_CPU(unsigned int, nr_dentry);
+static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 static int get_nr_dentry(void)
@@ -129,10 +130,20 @@ static int get_nr_dentry(void)
return sum < 0 ? 0 : sum;
 }
 
+static int get_nr_dentry_unused(void)
+{
+   int i;
+   int sum = 0;
+   for_each_possible_cpu(i)
+   sum += per_cpu(nr_dentry_unused, i);
+   return sum < 0 ? 0 : sum;
+}
+
 int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
   size_t *lenp, loff_t *ppos)
 {
dentry_stat.nr_dentry = get_nr_dentry();
+   dentry_stat.nr_unused = get_nr_dentry_unused();
return proc_dointvec(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -312,7 +323,7 @@ static void dentry_lru_add(struct dentry *dentry)
spin_lock(&dcache_lru_lock);
list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
dentry->d_sb->s_nr_dentry_unused++;
-   dentry_stat.nr_unused++;
+   this_cpu_inc(nr_dentry_unused);
spin_unlock(&dcache_lru_lock);
}
 }
@@ -322,7 +333,7 @@ static void __dentry_lru_del(struct dentry *dentry)
list_del_init(&dentry->d_lru);
dentry->d_flags &= ~DCACHE_SHRINK_LIST;
dentry->d_sb->s_nr_dentry_unused--;
-   dentry_stat.nr_unused--;
+   this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -360,7 +371,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
if (list_empty(&dentry->d_lru)) {
list_add_tail(&dentry->d_lru, list);
dentry->d_sb->s_nr_dentry_unused++;
-   dentry_stat.nr_unused++;
+   this_cpu_inc(nr_dentry_unused);
} else {
list_move_tail(&dentry->d_lru, list);
}
-- 
1.7.10



[PATCH 16/19] fs: convert fs shrinkers to new scan/count API

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Convert the filesystem shrinkers to use the new API, and standardise
some of the behaviours of the shrinkers at the same time. For
example, nr_to_scan means the number of objects to scan, not the
number of objects to free.

I refactored the CIFS idmap shrinker a little - it really needs to
be broken up into a shrinker per tree, with an item count kept at
each tree root so that we don't need to walk the tree every time the
shrinker needs to count the number of objects in it (i.e. all the
time under memory pressure).
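
A sketch of that refactor - the count kept beside each tree root so
the count side becomes O(1) instead of an rbtree walk - would look
something like this (structure and names are hypothetical, not part
of this patch):

#include <linux/rbtree.h>
#include <linux/spinlock.h>

struct cifs_idmap_tree {
	struct rb_root	root;
	spinlock_t	lock;
	long		nr_items;	/* maintained on insert and erase */
};

static long
cifs_idmap_tree_count(struct cifs_idmap_tree *tree)
{
	long count;

	spin_lock(&tree->lock);
	count = tree->nr_items;		/* no tree walk required */
	spin_unlock(&tree->lock);
	return count;
}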

Signed-off-by: Dave Chinner 
---
 fs/cifs/cifsacl.c |  112 -
 fs/gfs2/glock.c   |   23 ++-
 fs/gfs2/main.c|3 +-
 fs/gfs2/quota.c   |   12 +++---
 fs/gfs2/quota.h   |4 +-
 fs/mbcache.c  |   53 ++---
 fs/nfs/dir.c  |   21 --
 fs/nfs/internal.h |4 +-
 fs/nfs/super.c|3 +-
 fs/quota/dquot.c  |   37 --
 10 files changed, 163 insertions(+), 109 deletions(-)

diff --git a/fs/cifs/cifsacl.c b/fs/cifs/cifsacl.c
index 0fb15bb..a0e5c22 100644
--- a/fs/cifs/cifsacl.c
+++ b/fs/cifs/cifsacl.c
@@ -44,66 +44,95 @@ static const struct cifs_sid sid_user = {1, 2 , {0, 0, 0, 0, 0, 5}, {} };
 
 const struct cred *root_cred;
 
-static void
-shrink_idmap_tree(struct rb_root *root, int nr_to_scan, int *nr_rem,
-   int *nr_del)
+static long
+cifs_idmap_tree_scan(
+   struct rb_root  *root,
+   spinlock_t  *tree_lock,
+   longnr_to_scan)
 {
struct rb_node *node;
-   struct rb_node *tmp;
-   struct cifs_sid_id *psidid;
+   long freed = 0;
 
+   spin_lock(tree_lock);
node = rb_first(root);
-   while (node) {
+   while (nr_to_scan-- >= 0 && node) {
+   struct cifs_sid_id *psidid;
+   struct rb_node *tmp;
+
tmp = node;
node = rb_next(tmp);
psidid = rb_entry(tmp, struct cifs_sid_id, rbnode);
-   if (nr_to_scan == 0 || *nr_del == nr_to_scan)
-   ++(*nr_rem);
-   else {
-   if (time_after(jiffies, psidid->time + SID_MAP_EXPIRE)
-   && psidid->refcount == 0) {
-   rb_erase(tmp, root);
-   ++(*nr_del);
-   } else
-   ++(*nr_rem);
+   if (time_after(jiffies, psidid->time + SID_MAP_EXPIRE)
+   && psidid->refcount == 0) {
+   rb_erase(tmp, root);
+   freed++;
}
}
+   spin_unlock(tree_lock);
+   return freed;
+}
+
+static long
+cifs_idmap_tree_count(
+   struct rb_root  *root,
+   spinlock_t  *tree_lock)
+{
+   struct rb_node *node;
+   long count = 0;
+
+   spin_lock(tree_lock);
+   node = rb_first(root);
+   while (node) {
+   node = rb_next(node);
+   count++;
+   }
+   spin_unlock(tree_lock);
+   return count;
 }
 
 /*
- * Run idmap cache shrinker.
+ * idmap tree shrinker.
+ *
+ * XXX (dchinner): this should really be 4 separate shrinker instances (one per
+ * tree structure) so that each are shrunk proportionally to their individual
+ * sizes.
  */
-static int
-cifs_idmap_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+static long
+cifs_idmap_shrink_scan(
+   struct shrinker *shrink,
+   struct shrink_control   *sc)
 {
-   int nr_to_scan = sc->nr_to_scan;
-   int nr_del = 0;
-   int nr_rem = 0;
-   struct rb_root *root;
+   long freed = 0;
 
-   root = &uidtree;
-   spin_lock(&siduidlock);
-   shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
-   spin_unlock(&siduidlock);
+   freed += cifs_idmap_tree_scan(&uidtree, &siduidlock, sc->nr_to_scan);
+   freed += cifs_idmap_tree_scan(&gidtree, &sidgidlock, sc->nr_to_scan);
+   freed += cifs_idmap_tree_scan(&siduidtree, &siduidlock, sc->nr_to_scan);
+   freed += cifs_idmap_tree_scan(&sidgidtree, &sidgidlock, sc->nr_to_scan);
 
-   root = &gidtree;
-   spin_lock(&sidgidlock);
-   shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
-   spin_unlock(&sidgidlock);
+   return freed;
+}
 
-   root = &siduidtree;
-   spin_lock(&uidsidlock);
-   shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
-   spin_unlock(&uidsidlock);
+static long
+cifs_idmap_shrink_count(
+   struct shrinker *shrink,
+   struct shrink_control   *sc)
+{
+   long count = 0;
 
-   root = &sidgidtree;
-   spin_lock(&gidsidlock);
-   shrink_idmap_tree(root, nr_to_scan, &nr_rem, &

[PATCH 02/19] dentry: move to per-sb LRU locks

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

With the dentry LRUs being per-sb structures, there is no real need
for a global dentry_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.

Signed-off-by: Dave Chinner 
Reviewed-by: Christoph Hellwig 
---
 fs/dcache.c|   37 ++---
 fs/super.c |1 +
 include/linux/fs.h |4 +++-
 3 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 2fc0daa..e0c97fe 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -48,7 +48,7 @@
  *   - the dcache hash table
  * s_anon bl list spinlock protects:
  *   - the s_anon list (see __d_drop)
- * dcache_lru_lock protects:
+ * dentry->d_sb->s_dentry_lru_lock protects:
  *   - the dcache lru lists and counters
  * d_lock protects:
  *   - d_flags
@@ -63,7 +63,7 @@
  * Ordering:
  * dentry->d_inode->i_lock
  *   dentry->d_lock
- * dcache_lru_lock
+ * dentry->d_sb->s_dentry_lru_lock
  * dcache_hash_bucket lock
  * s_anon lock
  *
@@ -81,7 +81,6 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
@@ -320,11 +319,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
 static void dentry_lru_add(struct dentry *dentry)
 {
if (list_empty(&dentry->d_lru)) {
-   spin_lock(&dcache_lru_lock);
+   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
dentry->d_sb->s_nr_dentry_unused++;
this_cpu_inc(nr_dentry_unused);
-   spin_unlock(&dcache_lru_lock);
+   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
}
 }
 
@@ -342,9 +341,9 @@ static void __dentry_lru_del(struct dentry *dentry)
 static void dentry_lru_del(struct dentry *dentry)
 {
if (!list_empty(&dentry->d_lru)) {
-   spin_lock(&dcache_lru_lock);
+   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
__dentry_lru_del(dentry);
-   spin_unlock(&dcache_lru_lock);
+   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
}
 }
 
@@ -359,15 +358,15 @@ static void dentry_lru_prune(struct dentry *dentry)
if (dentry->d_flags & DCACHE_OP_PRUNE)
dentry->d_op->d_prune(dentry);
 
-   spin_lock(&dcache_lru_lock);
+   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
__dentry_lru_del(dentry);
-   spin_unlock(&dcache_lru_lock);
+   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
}
 }
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
-   spin_lock(&dcache_lru_lock);
+   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
if (list_empty(&dentry->d_lru)) {
list_add_tail(&dentry->d_lru, list);
dentry->d_sb->s_nr_dentry_unused++;
@@ -375,7 +374,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
} else {
list_move_tail(&dentry->d_lru, list);
}
-   spin_unlock(&dcache_lru_lock);
+   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
 
 /**
@@ -879,14 +878,14 @@ void prune_dcache_sb(struct super_block *sb, int count)
LIST_HEAD(tmp);
 
 relock:
-   spin_lock(&dcache_lru_lock);
+   spin_lock(&sb->s_dentry_lru_lock);
while (!list_empty(&sb->s_dentry_lru)) {
dentry = list_entry(sb->s_dentry_lru.prev,
struct dentry, d_lru);
BUG_ON(dentry->d_sb != sb);
 
if (!spin_trylock(&dentry->d_lock)) {
-   spin_unlock(&dcache_lru_lock);
+   spin_unlock(&sb->s_dentry_lru_lock);
cpu_relax();
goto relock;
}
@@ -902,11 +901,11 @@ relock:
if (!--count)
break;
}
-   cond_resched_lock(&dcache_lru_lock);
+   cond_resched_lock(&sb->s_dentry_lru_lock);
}
if (!list_empty(&referenced))
list_splice(&referenced, &sb->s_dentry_lru);
-   spin_unlock(&dcache_lru_lock);
+   spin_unlock(&sb->s_dentry_lru_lock);
 
shrink_dentry_list(&tmp);
 }
@@ -922,14 +921,14 @@ void shrink_dcache_sb(struct super_block *sb)
 {
LIST_HEAD(

[PATCH 12/19] xfs: convert buftarg LRU to generic code

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Convert the buftarg LRU to use the new generic LRU list and take
advantage of the functionality it supplies to make the buffer cache
shrinker node aware.

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_buf.c |  162 +-
 fs/xfs/xfs_buf.h |5 +-
 2 files changed, 76 insertions(+), 91 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index a80195b..1011c59 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -85,20 +85,14 @@ xfs_buf_vmap_len(
  * The LRU takes a new reference to the buffer so that it will only be freed
  * once the shrinker takes the buffer off the LRU.
  */
-STATIC void
+static void
 xfs_buf_lru_add(
struct xfs_buf  *bp)
 {
-   struct xfs_buftarg *btp = bp->b_target;
-
-   spin_lock(&btp->bt_lru_lock);
-   if (list_empty(&bp->b_lru)) {
-   atomic_inc(&bp->b_hold);
-   list_add_tail(&bp->b_lru, &btp->bt_lru);
-   btp->bt_lru_nr++;
+   if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
+   atomic_inc(&bp->b_hold);
}
-   spin_unlock(&btp->bt_lru_lock);
 }
 
 /*
@@ -107,24 +101,13 @@ xfs_buf_lru_add(
  * The unlocked check is safe here because it only occurs when there are not
  * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
  * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free(). i.e. it removes an unnecessary round trip on the
- * bt_lru_lock.
+ * xfs_buf_free().
  */
-STATIC void
+static void
 xfs_buf_lru_del(
struct xfs_buf  *bp)
 {
-   struct xfs_buftarg *btp = bp->b_target;
-
-   if (list_empty(&bp->b_lru))
-   return;
-
-   spin_lock(&btp->bt_lru_lock);
-   if (!list_empty(&bp->b_lru)) {
-   list_del_init(&bp->b_lru);
-   btp->bt_lru_nr--;
-   }
-   spin_unlock(&btp->bt_lru_lock);
+   list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
 }
 
 /*
@@ -151,18 +134,10 @@ xfs_buf_stale(
bp->b_flags &= ~_XBF_DELWRI_Q;
 
atomic_set(&(bp)->b_lru_ref, 0);
-   if (!list_empty(&bp->b_lru)) {
-   struct xfs_buftarg *btp = bp->b_target;
-
-   spin_lock(&btp->bt_lru_lock);
-   if (!list_empty(&bp->b_lru) &&
-   !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
-   list_del_init(&bp->b_lru);
-   btp->bt_lru_nr--;
-   atomic_dec(&bp->b_hold);
-   }
-   spin_unlock(&btp->bt_lru_lock);
-   }
+   if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+   (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
+   atomic_dec(&bp->b_hold);
+
ASSERT(atomic_read(&bp->b_hold) >= 1);
 }
 
@@ -1476,81 +1451,92 @@ xfs_buf_iomove(
  * returned. These buffers will have an elevated hold count, so wait on those
  * while freeing all the buffers only held by the LRU.
  */
-void
-xfs_wait_buftarg(
-   struct xfs_buftarg  *btp)
+static int
+xfs_buftarg_wait_rele(
+   struct list_head*item,
+   spinlock_t  *lru_lock,
+   void*arg)
 {
-   struct xfs_buf  *bp;
+   struct xfs_buf  *bp = container_of(item, struct xfs_buf, b_lru);
 
-restart:
-   spin_lock(&btp->bt_lru_lock);
-   while (!list_empty(&btp->bt_lru)) {
-   bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-   if (atomic_read(&bp->b_hold) > 1) {
-   spin_unlock(&btp->bt_lru_lock);
-   delay(100);
-   goto restart;
-   }
+   if (atomic_read(&bp->b_hold) > 1) {
+   /* need to wait */
+   spin_unlock(lru_lock);
+   delay(100);
+   } else {
/*
 * clear the LRU reference count so the buffer doesn't get
 * ignored in xfs_buf_rele().
 */
atomic_set(&bp->b_lru_ref, 0);
-   spin_unlock(&btp->bt_lru_lock);
+   spin_unlock(lru_lock);
xfs_buf_rele(bp);
-   spin_lock(&btp->bt_lru_lock);
}
-   spin_unlock(&btp->bt_lru_lock);
+   return 3;
 }
 
-int
-xfs_buftarg_shrink(
+void
+xfs_wait_buftarg(
+   struct xfs_buftarg  *btp)
+{
+   while (list_lru_count(&btp->bt_lru))
+   list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
+ NULL, LONG_MAX);
+}
+
+static int
+xfs_bufta

[PATCH 08/19] dcache: convert to use new lru list infrastructure

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Signed-off-by: Dave Chinner 
---
 fs/dcache.c|  171 +---
 fs/super.c |   10 +--
 include/linux/fs.h |   15 +++--
 3 files changed, 82 insertions(+), 114 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ca647b8..d72e388 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "mount.h"
 
@@ -318,20 +319,8 @@ static void dentry_unlink_inode(struct dentry * dentry)
  */
 static void dentry_lru_add(struct dentry *dentry)
 {
-   if (list_empty(&dentry->d_lru)) {
-   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-   list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
-   dentry->d_sb->s_nr_dentry_unused++;
+   if (list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
this_cpu_inc(nr_dentry_unused);
-   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-   }
-}
-
-static void __dentry_lru_del(struct dentry *dentry)
-{
-   list_del_init(&dentry->d_lru);
-   dentry->d_sb->s_nr_dentry_unused--;
-   this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -341,11 +330,8 @@ static void dentry_lru_del(struct dentry *dentry)
 {
BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
 
-   if (!list_empty(&dentry->d_lru)) {
-   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-   __dentry_lru_del(dentry);
-   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-   }
+   if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
+   this_cpu_dec(nr_dentry_unused);
 }
 
 /*
@@ -361,35 +347,19 @@ static void dentry_lru_del(struct dentry *dentry)
  */
 static void dentry_lru_prune(struct dentry *dentry)
 {
-   if (!list_empty(&dentry->d_lru)) {
+   int prune = dentry->d_flags & DCACHE_OP_PRUNE;
 
-   if (dentry->d_flags & DCACHE_OP_PRUNE)
-   dentry->d_op->d_prune(dentry);
-
-   if ((dentry->d_flags & DCACHE_SHRINK_LIST))
-   list_del_init(&dentry->d_lru);
-   else {
-   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-   __dentry_lru_del(dentry);
-   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
-   }
-   dentry->d_flags &= ~DCACHE_SHRINK_LIST;
-   }
-}
-
-static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
-{
-   BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
-
-   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-   if (list_empty(&dentry->d_lru)) {
-   list_add_tail(&dentry->d_lru, list);
-   } else {
-   list_move_tail(&dentry->d_lru, list);
-   dentry->d_sb->s_nr_dentry_unused--;
+   if (!list_empty(&dentry->d_lru) &&
+   (dentry->d_flags & DCACHE_SHRINK_LIST))
+   list_del_init(&dentry->d_lru);
+   else if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
this_cpu_dec(nr_dentry_unused);
-   }
-   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+   else
+   prune = 0;
+
+   dentry->d_flags &= ~DCACHE_SHRINK_LIST;
+   if (prune)
+   dentry->d_op->d_prune(dentry);
 }
 
 /**
@@ -880,6 +850,51 @@ static void shrink_dentry_list(struct list_head *list)
rcu_read_unlock();
 }
 
+static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
+   void *arg)
+{
+   struct list_head *freeable = arg;
+   struct dentry   *dentry = container_of(item, struct dentry, d_lru);
+
+
+   /*
+* we are inverting the lru lock/dentry->d_lock here,
+* so use a trylock. If we fail to get the lock, just skip
+* it
+*/
+   if (!spin_trylock(&dentry->d_lock))
+   return 2;
+
+   /*
+* Referenced dentries are still in use. If they have active
+* counts, just remove them from the LRU. Otherwise give them
+* another pass through the LRU.
+*/
+   if (dentry->d_count) {
+   list_del_init(&dentry->d_lru);
+   spin_unlock(&dentry->d_lock);
+   return 0;
+   }
+
+   if (dentry->d_flags & DCACHE_REFERENCED) {
+   dentry->d_flags &= ~DCACHE_REFERENCED;
+   spin_unlock(&dentry->d_lock);
+
+   /*
+* XXX: this list move should be done under d_lock. Need to
+* determine if it is safe just 

[PATCH 09/19] list_lru: per-node list infrastructure

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Now that we have an LRU list API, we can start to enhance the
implementation.  This splits the single LRU list into per-node lists
and locks to improve scalability. Items are placed on lists
according to the node the memory belongs to. To make scanning the
lists efficient, we also track which per-node lists have entries in
them via an active nodemask.
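
A consequence worth spelling out: because the node is derived from the
memory backing the item, an object allocated node-locally is accounted
to that node's list automatically. For example (hypothetical object
type, illustrative only):

#include <linux/list_lru.h>
#include <linux/slab.h>

struct my_object {
	struct list_head	lru;
	/* ... payload ... */
};

/* allocate the object on a specific node ... */
static struct my_object *my_object_alloc(int nid)
{
	struct my_object *obj = kmalloc_node(sizeof(*obj), GFP_KERNEL, nid);

	if (obj)
		INIT_LIST_HEAD(&obj->lru);
	return obj;
}

/*
 * ... and when it later becomes unused, list_lru_add() places it on the
 * list of the node the object's memory (typically) lives on, i.e. nid.
 */
static void my_object_make_unused(struct list_lru *lru, struct my_object *obj)
{
	list_lru_add(lru, &obj->lru);
}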

Signed-off-by: Dave Chinner 
---
 include/linux/list_lru.h |   14 ++--
 lib/list_lru.c   |  160 +++---
 2 files changed, 129 insertions(+), 45 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3423949..b0e3ba2 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -8,21 +8,23 @@
 #define _LRU_LIST_H 0
 
 #include 
+#include 
 
-struct list_lru {
+struct list_lru_node {
spinlock_t  lock;
struct list_headlist;
longnr_items;
+} cacheline_aligned_in_smp;
+
+struct list_lru {
+   struct list_lru_nodenode[MAX_NUMNODES];
+   nodemask_t  active_nodes;
 };
 
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-
-static inline long list_lru_count(struct list_lru *lru)
-{
-   return lru->nr_items;
-}
+long list_lru_count(struct list_lru *lru);
 
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
void *cb_arg);
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 475d0e9..881e342 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -6,6 +6,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 
 int
@@ -13,14 +14,19 @@ list_lru_add(
struct list_lru *lru,
struct list_head *item)
 {
-   spin_lock(&lru->lock);
+   int nid = page_to_nid(virt_to_page(item));
+   struct list_lru_node *nlru = &lru->node[nid];
+
+   spin_lock(&nlru->lock);
+   BUG_ON(nlru->nr_items < 0);
if (list_empty(item)) {
-   list_add_tail(item, &lru->list);
-   lru->nr_items++;
-   spin_unlock(&lru->lock);
+   list_add_tail(item, &nlru->list);
+   if (nlru->nr_items++ == 0)
+   node_set(nid, lru->active_nodes);
+   spin_unlock(&nlru->lock);
return 1;
}
-   spin_unlock(&lru->lock);
+   spin_unlock(&nlru->lock);
return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
@@ -30,43 +36,72 @@ list_lru_del(
struct list_lru *lru,
struct list_head *item)
 {
-   spin_lock(&lru->lock);
+   int nid = page_to_nid(virt_to_page(item));
+   struct list_lru_node *nlru = &lru->node[nid];
+
+   spin_lock(&nlru->lock);
if (!list_empty(item)) {
list_del_init(item);
-   lru->nr_items--;
-   spin_unlock(&lru->lock);
+   if (--nlru->nr_items == 0)
+   node_clear(nid, lru->active_nodes);
+   BUG_ON(nlru->nr_items < 0);
+   spin_unlock(&nlru->lock);
return 1;
}
-   spin_unlock(&lru->lock);
+   spin_unlock(&nlru->lock);
return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_walk(
-   struct list_lru *lru,
-   list_lru_walk_cb isolate,
-   void*cb_arg,
-   longnr_to_walk)
+list_lru_count(
+   struct list_lru *lru)
 {
+   long count = 0;
+   int nid;
+
+   for_each_node_mask(nid, lru->active_nodes) {
+   struct list_lru_node *nlru = &lru->node[nid];
+
+   spin_lock(&nlru->lock);
+   BUG_ON(nlru->nr_items < 0);
+   count += nlru->nr_items;
+   spin_unlock(&nlru->lock);
+   }
+
+   return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
+
+static long
+list_lru_walk_node(
+   struct list_lru *lru,
+   int nid,
+   list_lru_walk_cbisolate,
+   void*cb_arg,
+   long*nr_to_walk)
+{
+   struct list_lru_node*nlru = &lru->node[nid];
struct list_head *item, *n;
-   long removed = 0;
+   long isolated = 0;
 restart:
-   spin_lock(&lru->lock);
-   list_for_each_safe(item, n, &lru->list) {
+   spin_lock(&nlru->lock);
+   list_for_each_safe(item, n, &nlru->list) {
int ret;
 
-   if (nr_to_walk-- < 0)
+   if ((*nr_to_walk)-- < 0)
break;
 
-   ret = isolate(item, &lru->lock, cb_arg);
+   ret = isolate(item, &nlru->lock, cb_arg);
swit

[PATCH 07/19] inode: convert inode lru list to generic lru list code.

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Signed-off-by: Dave Chinner 
---
 fs/inode.c |  173 +---
 fs/super.c |   11 ++--
 include/linux/fs.h |6 +-
 3 files changed, 75 insertions(+), 115 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 3624ae0..2662305 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -17,6 +17,7 @@
 #include 
 #include  /* for inode_has_buffers */
 #include 
+#include 
 #include "internal.h"
 
 /*
@@ -24,7 +25,7 @@
  *
  * inode->i_lock protects:
  *   inode->i_state, inode->i_hash, __iget()
- * inode->i_sb->s_inode_lru_lock protects:
+ * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
@@ -37,7 +38,7 @@
  *
  * inode_sb_list_lock
  *   inode->i_lock
- * inode->i_sb->s_inode_lru_lock
+ * Inode LRU list locks
  *
  * bdi->wb.list_lock
  *   inode->i_lock
@@ -399,24 +400,14 @@ EXPORT_SYMBOL(ihold);
 
 static void inode_lru_list_add(struct inode *inode)
 {
-   spin_lock(&inode->i_sb->s_inode_lru_lock);
-   if (list_empty(&inode->i_lru)) {
-   list_add(&inode->i_lru, &inode->i_sb->s_inode_lru);
-   inode->i_sb->s_nr_inodes_unused++;
+   if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_inc(nr_unused);
-   }
-   spin_unlock(&inode->i_sb->s_inode_lru_lock);
 }
 
 static void inode_lru_list_del(struct inode *inode)
 {
-   spin_lock(&inode->i_sb->s_inode_lru_lock);
-   if (!list_empty(&inode->i_lru)) {
-   list_del_init(&inode->i_lru);
-   inode->i_sb->s_nr_inodes_unused--;
+   if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_dec(nr_unused);
-   }
-   spin_unlock(&inode->i_sb->s_inode_lru_lock);
 }
 
 /**
@@ -660,24 +651,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
return busy;
 }
 
-static int can_unuse(struct inode *inode)
-{
-   if (inode->i_state & ~I_REFERENCED)
-   return 0;
-   if (inode_has_buffers(inode))
-   return 0;
-   if (atomic_read(&inode->i_count))
-   return 0;
-   if (inode->i_data.nrpages)
-   return 0;
-   return 1;
-}
-
 /*
- * Walk the superblock inode LRU for freeable inodes and attempt to free them.
- * This is called from the superblock shrinker function with a number of inodes
- * to trim from the LRU. Inodes to be freed are moved to a temporary list and
- * then are freed outside inode_lock by dispose_list().
+ * Isolate the inode from the LRU in preparation for freeing it.
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  If the inode has metadata buffers attached to
@@ -691,90 +666,78 @@ static int can_unuse(struct inode *inode)
  * LRU does not have strict ordering. Hence we don't want to reclaim inodes
  * with this flag set because they are the inodes that are out of order.
  */
-long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
+   void *arg)
 {
-   LIST_HEAD(freeable);
-   long nr_scanned;
-   long freed = 0;
-   unsigned long reap = 0;
+   struct list_head *freeable = arg;
+   struct inode*inode = container_of(item, struct inode, i_lru);
 
-   spin_lock(&sb->s_inode_lru_lock);
-   for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
-   struct inode *inode;
+   /*
+* we are inverting the lru lock/inode->i_lock here, so use a trylock.
+* If we fail to get the lock, just skip it.
+*/
+   if (!spin_trylock(&inode->i_lock))
+   return 2;
 
-   if (list_empty(&sb->s_inode_lru))
-   break;
+   /*
+* Referenced or dirty inodes are still in use. Give them another pass
+* through the LRU as we cannot reclaim them now.
+*/
+   if (atomic_read(&inode->i_count) ||
+   (inode->i_state & ~I_REFERENCED)) {
+   list_del_init(&inode->i_lru);
+   spin_unlock(&inode->i_lock);
+   this_cpu_dec(nr_unused);
+   return 0;
+   }
 
-   inode = list_entry(sb->s_inode_lru.prev, struct inode, i_lru);
+   /* recently referenced inodes get one more pass */
+   if (inode->i_state & I_REFERENCED) {
+   inode->i_state &= ~I_REFERENCED;
+   spin_unlock(&inode->i_lock);
+   return 1;
+   }
 
-   /*
-* we

[PATCH 03/19] dcache: remove dentries from LRU before putting on dispose list

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

One of the big problems with modifying the way the dcache shrinker
and LRU implementation works is that the LRU is abused in several
ways. One of these is shrink_dentry_list().

Basically, we can move a dentry off the LRU onto a different list
without doing any accounting changes, and then use dentry_lru_prune()
to remove it from whatever list it is now on to do the LRU
accounting at that point.

This makes it -really hard- to change the LRU implementation. The
use of the per-sb LRU lock serialises movement of the dentries
between the different lists and the removal of them, and this is the
only reason that it works. If we want to break up the dentry LRU
lock and lists into, say, per-node lists, we remove the only
serialisation that allows this lru list/dispose list abuse to work.

To make this work effectively, the dispose list has to be isolated
from the LRU list - dentries have to be removed from the LRU
*before* being placed on the dispose list. This means that the LRU
accounting and isolation is completed before disposal is started,
and that means we can change the LRU implementation freely in
future.

This means that dentries *must* be marked with DCACHE_SHRINK_LIST
when they are placed on the dispose list so that we don't think that
parent dentries found in try_prune_one_dentry() are on the LRU when
they are actually on the dispose list. This would result in
accounting the dentry to the LRU a second time. Hence
dentry_lru_prune() has to handle the DCACHE_SHRINK_LIST case
differently because the dentry isn't on the LRU list.

Signed-off-by: Dave Chinner 
---
 fs/dcache.c |   73 +++
 1 file changed, 63 insertions(+), 10 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e0c97fe..0124a84 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -330,7 +330,6 @@ static void dentry_lru_add(struct dentry *dentry)
 static void __dentry_lru_del(struct dentry *dentry)
 {
list_del_init(&dentry->d_lru);
-   dentry->d_flags &= ~DCACHE_SHRINK_LIST;
dentry->d_sb->s_nr_dentry_unused--;
this_cpu_dec(nr_dentry_unused);
 }
@@ -340,6 +339,8 @@ static void __dentry_lru_del(struct dentry *dentry)
  */
 static void dentry_lru_del(struct dentry *dentry)
 {
+   BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
if (!list_empty(&dentry->d_lru)) {
spin_lock(&dentry->d_sb->s_dentry_lru_lock);
__dentry_lru_del(dentry);
@@ -351,28 +352,42 @@ static void dentry_lru_del(struct dentry *dentry)
  * Remove a dentry that is unreferenced and about to be pruned
  * (unhashed and destroyed) from the LRU, and inform the file system.
  * This wrapper should be called _prior_ to unhashing a victim dentry.
+ *
+ * Check that the dentry really is on the LRU as it may be on a private dispose
+ * list and in that case we do not want to call the generic LRU removal
+ * functions. This typically happens when shrink_dcache_sb() clears the LRU in
+ * one go and then try_prune_one_dentry() walks back up the parent chain finding
+ * dentries that are also on the dispose list.
  */
 static void dentry_lru_prune(struct dentry *dentry)
 {
if (!list_empty(&dentry->d_lru)) {
+
if (dentry->d_flags & DCACHE_OP_PRUNE)
dentry->d_op->d_prune(dentry);
 
-   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
-   __dentry_lru_del(dentry);
-   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+   if ((dentry->d_flags & DCACHE_SHRINK_LIST))
+   list_del_init(&dentry->d_lru);
+   else {
+   spin_lock(&dentry->d_sb->s_dentry_lru_lock);
+   __dentry_lru_del(dentry);
+   spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+   }
+   dentry->d_flags &= ~DCACHE_SHRINK_LIST;
}
 }
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
+   BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
spin_lock(&dentry->d_sb->s_dentry_lru_lock);
if (list_empty(&dentry->d_lru)) {
list_add_tail(&dentry->d_lru, list);
-   dentry->d_sb->s_nr_dentry_unused++;
-   this_cpu_inc(nr_dentry_unused);
} else {
list_move_tail(&dentry->d_lru, list);
+   dentry->d_sb->s_nr_dentry_unused--;
+   this_cpu_dec(nr_dentry_unused);
}
spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
@@ -840,12 +855,18 @@ static void shrink_dentry_list(struct list_head *list)
}
 
/*
+* The dispose list is isolated and dentries are not accounted
+

[RFC, PATCH 00/19] Numa aware LRU lists and shrinkers

2012-11-27 Thread Dave Chinner
Hi Glauber,

Here's a working version of my patchset for generic LRU lists and
NUMA-aware shrinkers.

There are several parts to this patch set. The NUMA aware shrinkers
are based on having a generic node-based LRU list implementation,
and there are subsystems that need to be converted to use these
lists as part of the process. There is also a long overdue change to
the shrinker API to give it separate object count and object scan
callbacks, getting rid of the magic and confusing "nr_to_scan = 0"
semantics.

First of all, the patch set is based on a current 3.7-rc7 tree with
the current xfs dev tree merged into it [can be found at
git://oss.sgi.com/xfs/xfs]. That's because there are lots of XFS
changes in the patch set, and there's no way I'm going to write them
a second time in a couple of weeks when the current dev tree is
merged into 3.8-rc1.

So, here's what the patches do:

[PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu
[PATCH 02/19] dentry: move to per-sb LRU locks
[PATCH 03/19] dcache: remove dentries from LRU before putting on

These three patches are preparation of the dcache for moving to the
generic LRU list API. It basically gets rid of the global dentry LRU
lock, and in doing so has to remove several creative abuses of the
LRU list state that allowed dentries on shrink lists to still
magically be on the LRU list. The main change here is that now
dentries on the shrink lists *must* have the DCACHE_SHRINK_LIST flag
set and be entirely removed from the LRU before being disposed of.

This is probably a good cleanup to do regardless of the rest of the
patch set because it removes a couple of landmines in
shrink_dentry_list() that took me a while to work out...

[PATCH 04/19] mm: new shrinker API
[PATCH 05/19] shrinker: convert superblock shrinkers to new API

These introduce the count/scan shrinker API, and for testing
purposes convert the superblock shrinker to use it before any other
changes are made. This gives a clean separation of counting the
number of objects in a cache for pressure calculations, and the act
of scanning objects in an attempt to free memory. Indeed, the
scan_objects() callback now returns the number of objects freed by
the scan instead of having to try to work out whether any progress
was made by comparing absolute counts.

This is also more efficient as we don't have to count all the
objects in a cache on every scan pass. It is now done once per
shrink_slab() invocation to calculate how much to scan, and we get
direct feedback on how much gets reclaimed in that pass. i.e. we get
reliable, accurate feedback on shrinker progress.
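
As a concrete illustration of the shape of the new API (not part of the
series - the cache and its counter below are made-up placeholders; only
the struct shrinker fields and shrink_control members come from patch
04), a minimal count/scan shrinker looks something like this:

/*
 * Illustrative sketch only. "my_cache" is a placeholder cache with a
 * simple freeable-object counter.
 */
#include <linux/shrinker.h>
#include <linux/atomic.h>

static atomic_long_t my_cache_nr_objects;	/* freeable object count */

static long my_cache_count_objects(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	/* cheap, lock-free count; 0 means "nothing to reclaim here" */
	return atomic_long_read(&my_cache_nr_objects);
}

static long my_cache_scan_objects(struct shrinker *shrink,
				  struct shrink_control *sc)
{
	long freed = 0;

	/*
	 * Free up to sc->nr_to_scan objects and report how many actually
	 * went away (or -1 if no progress can be made right now).
	 */
	while (freed < sc->nr_to_scan &&
	       atomic_long_read(&my_cache_nr_objects) > 0) {
		atomic_long_dec(&my_cache_nr_objects);
		freed++;
	}
	return freed;
}

static struct shrinker my_cache_shrinker = {
	.count_objects	= my_cache_count_objects,
	.scan_objects	= my_cache_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};

/* register_shrinker(&my_cache_shrinker) at cache init time. */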

[PATCH 06/19] list: add a new LRU list type
[PATCH 07/19] inode: convert inode lru list to generic lru list
[PATCH 08/19] dcache: convert to use new lru list infrastructure

These add the generic LRU list API and infrastructure and convert
the inode and dentry caches to use it. This is still just a single
global list per LRU at this point, so it's really only changing
where the LRU implementation lives rather than the fundamental
algorithm. It does, however, introduce a new method of walking the
LRU lists and building the dispose list of items for shrinkers, but
because we are still dealing with a global list the algorithmic
changes are minimal.
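
For reference, a conversion to the generic list_lru ends up looking
something like the sketch below. The object type and free function are
placeholders, and the isolate return convention is inferred from the
inode and dquot conversions later in the series:

#include <linux/list.h>
#include <linux/list_lru.h>

struct my_object {
	struct list_head	lru;
	/* ... */
};

static struct list_lru my_lru;		/* list_lru_init(&my_lru) at setup */

static void my_object_unused(struct my_object *obj)
{
	/* returns non-zero only if the object wasn't already on the LRU */
	list_lru_add(&my_lru, &obj->lru);
}

static int my_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
			  void *cb_arg)
{
	struct list_head *dispose = cb_arg;

	/* move the object to the caller's private dispose list */
	list_move(item, dispose);
	return 0;	/* tell the walker the item has been isolated */
}

static long my_shrink(long nr_to_scan)
{
	LIST_HEAD(dispose);
	long removed;

	removed = list_lru_walk(&my_lru, my_lru_isolate, &dispose, nr_to_scan);

	/* free everything on the dispose list outside the LRU lock */
	while (!list_empty(&dispose)) {
		struct my_object *obj;

		obj = list_first_entry(&dispose, struct my_object, lru);
		list_del_init(&obj->lru);
		/* my_free_object(obj); - placeholder free routine */
	}
	return removed;
}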

[PATCH 09/19] list_lru: per-node list infrastructure

This makes the generic LRU list much more scalable by changing it to
a {list,lock,count} tuple per node. There are no external API
changes in this changeover, so it is transparent to current users.
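
Roughly speaking, the internal layout becomes something like the
following sketch - the real field names in patch 09 may differ, this is
just to illustrate the per-node {lock, list, count} tuple:

#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/nodemask.h>

struct list_lru_node {
	spinlock_t		lock;
	struct list_head	list;
	long			nr_items;
} ____cacheline_aligned_in_smp;

struct list_lru {
	struct list_lru_node	node[MAX_NUMNODES];
	nodemask_t		active_nodes;	/* nodes with items on them */
};

/*
 * An add picks the node the object's memory belongs to (e.g. via
 * page_to_nid(virt_to_page(item))), and counts/walks can then be
 * restricted to the nodes a caller cares about.
 */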

[PATCH 10/19] shrinker: add node awareness
[PATCH 11/19] fs: convert inode and dentry shrinking to be node

Adds a nodemask to the struct shrink_control for callers of
shrink_slab to set appropriately for their reclaim context. This
nodemask is then passed by the inode and dentry cache reclaim code
to the generic LRU list code to implement node aware shrinking.
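
A node-aware scan callback then just threads the nodemask through to
the LRU walk, along these lines (placeholder names reused from the
earlier sketches; list_lru_walk_nodemask() and sc->nodes_to_scan are
what these two patches add):

static long my_node_aware_scan(struct shrinker *shrink,
			       struct shrink_control *sc)
{
	LIST_HEAD(dispose);
	long freed;

	/* only walk the per-node lists the reclaim context cares about */
	freed = list_lru_walk_nodemask(&my_lru, my_lru_isolate, &dispose,
				       sc->nr_to_scan, &sc->nodes_to_scan);

	/* dispose of the isolated objects here, outside the LRU locks */
	return freed;
}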

What this doesn't do is convert the internal shrink_slab() algorithm
to be node aware. I'm not sure what the best approach is here, but
it strikes me that it should really be calculating and keeping track
of scan counts and pressure on a per-node basis. The current code
seems to work OK at the moment, though.

[PATCH 12/19] xfs: convert buftarg LRU to generic code
[PATCH 13/19] xfs: Node aware direct inode reclaim
[PATCH 14/19] xfs: use generic AG walk for background inode reclaim
[PATCH 15/19] xfs: convert dquot cache lru to list_lru

These patches convert all the XFS LRUs and shrinkers to be node
aware. This gets rid of a lot of hairy, special case code in the
inode cache shrinker for avoiding concurrent shrinker contention and
to throttle direct reclaim to prevent premature OOM conditions.
Removing this code greatly simplifies inode cache reclaim whilst
reducing overhead and improving performance. In all, it converts
three separate caches and shrinkers to use the generic LRU lists and
pass nodemasks around appropriately.

This is how I've really tested the code - lots of interestin

[PATCH 15/19] xfs: convert dquot cache lru to list_lru

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Convert the XFS dquot lru to use the list_lru construct and convert
the shrinker to being node aware.

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_dquot.c |7 +-
 fs/xfs/xfs_qm.c|  307 ++--
 fs/xfs/xfs_qm.h|4 +-
 3 files changed, 156 insertions(+), 162 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 9e1bf52..4fcd42e 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -843,13 +843,8 @@ xfs_qm_dqput_final(
 
trace_xfs_dqput_free(dqp);
 
-   mutex_lock(&qi->qi_lru_lock);
-   if (list_empty(&dqp->q_lru)) {
-   list_add_tail(&dqp->q_lru, &qi->qi_lru_list);
-   qi->qi_lru_count++;
+   if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
XFS_STATS_INC(xs_qm_dquot_unused);
-   }
-   mutex_unlock(&qi->qi_lru_lock);
 
/*
 * If we just added a udquot to the freelist, then we want to release
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index e6a0af0..534b924 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -50,7 +50,6 @@
  */
 STATIC int xfs_qm_init_quotainos(xfs_mount_t *);
 STATIC int xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int xfs_qm_shake(struct shrinker *, struct shrink_control *);
 
 /*
  * We use the batch lookup interface to iterate over the dquots as it
@@ -196,12 +195,9 @@ xfs_qm_dqpurge(
 * We move dquots to the freelist as soon as their reference count
 * hits zero, so it really should be on the freelist here.
 */
-   mutex_lock(&qi->qi_lru_lock);
ASSERT(!list_empty(&dqp->q_lru));
-   list_del_init(&dqp->q_lru);
-   qi->qi_lru_count--;
+   list_lru_del(&qi->qi_lru, &dqp->q_lru);
XFS_STATS_DEC(xs_qm_dquot_unused);
-   mutex_unlock(&qi->qi_lru_lock);
 
xfs_qm_dqdestroy(dqp);
 
@@ -617,6 +613,156 @@ xfs_qm_dqdetach(
}
 }
 
+STATIC void
+xfs_qm_dqfree_one(
+   struct xfs_dquot*dqp)
+{
+   struct xfs_mount*mp = dqp->q_mount;
+   struct xfs_quotainfo*qi = mp->m_quotainfo;
+
+   mutex_lock(&qi->qi_tree_lock);
+   radix_tree_delete(XFS_DQUOT_TREE(qi, dqp->q_core.d_flags),
+be32_to_cpu(dqp->q_core.d_id));
+
+   qi->qi_dquots--;
+   mutex_unlock(&qi->qi_tree_lock);
+
+   xfs_qm_dqdestroy(dqp);
+}
+
+struct xfs_qm_isolate {
+   struct list_headbuffers;
+   struct list_headdispose;
+};
+
+static int
+xfs_qm_dquot_isolate(
+   struct list_head*item,
+   spinlock_t  *lru_lock,
+   void*arg)
+{
+   struct xfs_dquot*dqp = container_of(item,
+   struct xfs_dquot, q_lru);
+   struct xfs_qm_isolate   *isol = arg;
+
+   if (!xfs_dqlock_nowait(dqp))
+   goto out_miss_busy;
+
+   /*
+    * This dquot has acquired a reference in the meantime; remove it from
+* the freelist and try again.
+*/
+   if (dqp->q_nrefs) {
+   xfs_dqunlock(dqp);
+   XFS_STATS_INC(xs_qm_dqwants);
+
+   trace_xfs_dqreclaim_want(dqp);
+   list_del_init(&dqp->q_lru);
+   XFS_STATS_DEC(xs_qm_dquot_unused);
+   return 0;
+   }
+
+   /*
+* If the dquot is dirty, flush it. If it's already being flushed, just
+* skip it so there is time for the IO to complete before we try to
+* reclaim it again on the next LRU pass.
+*/
+   if (!xfs_dqflock_nowait(dqp)) {
+   xfs_dqunlock(dqp);
+   goto out_miss_busy;
+   }
+
+   if (XFS_DQ_IS_DIRTY(dqp)) {
+   struct xfs_buf  *bp = NULL;
+   int error;
+
+   trace_xfs_dqreclaim_dirty(dqp);
+
+   /* we have to drop the LRU lock to flush the dquot */
+   spin_unlock(lru_lock);
+
+   error = xfs_qm_dqflush(dqp, &bp);
+   if (error) {
+   xfs_warn(dqp->q_mount, "%s: dquot %p flush failed",
+__func__, dqp);
+   goto out_unlock_dirty;
+   }
+
+   xfs_buf_delwri_queue(bp, &isol->buffers);
+   xfs_buf_relse(bp);
+   goto out_unlock_dirty;
+   }
+   xfs_dqfunlock(dqp);
+
+   /*
+* Prevent lookups now that we are past the point of no return.
+*/
+   dqp->dq_flags |= XFS_DQ_FREEING;
+   xfs_dqunlock(dqp);
+
+   ASSERT(dqp->q_nrefs == 0);
+   list_move_tail(&dqp->q_lru, &isol->dispose);
+   XFS_STATS_DEC(xs_qm_dquot_unused);
+   trace_xfs_dqreclaim_done(dqp);
+   XFS_STA

[PATCH 05/19] shrinker: convert superblock shrinkers to new API

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Convert superblock shrinker to use the new count/scan API, and
propagate the API changes through to the filesystem callouts. The
filesystem callouts already use a count/scan API, so it's just
changing counters to longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted
to the count/scan API. This is mainly a mechanical change.

Signed-off-by: Dave Chinner 
---
 fs/dcache.c |   10 +---
 fs/inode.c  |7 +++--
 fs/internal.h   |3 +++
 fs/super.c  |   71 ++-
 fs/xfs/xfs_icache.c |   17 +++-
 fs/xfs/xfs_icache.h |4 +--
 fs/xfs/xfs_super.c  |8 +++---
 include/linux/fs.h  |8 ++
 8 files changed, 74 insertions(+), 54 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0124a84..ca647b8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -892,11 +892,12 @@ static void shrink_dentry_list(struct list_head *list)
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-void prune_dcache_sb(struct super_block *sb, int count)
+long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
 {
struct dentry *dentry;
LIST_HEAD(referenced);
LIST_HEAD(tmp);
+   long freed = 0;
 
 relock:
spin_lock(&sb->s_dentry_lru_lock);
@@ -921,7 +922,8 @@ relock:
this_cpu_dec(nr_dentry_unused);
sb->s_nr_dentry_unused--;
spin_unlock(&dentry->d_lock);
-   if (!--count)
+   freed++;
+   if (!--nr_to_scan)
break;
}
cond_resched_lock(&sb->s_dentry_lru_lock);
@@ -931,6 +933,7 @@ relock:
spin_unlock(&sb->s_dentry_lru_lock);
 
shrink_dentry_list(&tmp);
+   return freed;
 }
 
 /*
@@ -1317,9 +1320,8 @@ rename_retry:
 void shrink_dcache_parent(struct dentry * parent)
 {
LIST_HEAD(dispose);
-   int found;
 
-   while ((found = select_parent(parent, &dispose)) != 0)
+   while (select_parent(parent, &dispose))
shrink_dentry_list(&dispose);
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
diff --git a/fs/inode.c b/fs/inode.c
index c9fb382..3624ae0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -691,10 +691,11 @@ static int can_unuse(struct inode *inode)
  * LRU does not have strict ordering. Hence we don't want to reclaim inodes
  * with this flag set because they are the inodes that are out of order.
  */
-void prune_icache_sb(struct super_block *sb, int nr_to_scan)
+long prune_icache_sb(struct super_block *sb, long nr_to_scan)
 {
LIST_HEAD(freeable);
-   int nr_scanned;
+   long nr_scanned;
+   long freed = 0;
unsigned long reap = 0;
 
spin_lock(&sb->s_inode_lru_lock);
@@ -764,6 +765,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
list_move(&inode->i_lru, &freeable);
sb->s_nr_inodes_unused--;
this_cpu_dec(nr_unused);
+   freed++;
}
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -774,6 +776,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
current->reclaim_state->reclaimed_slab += reap;
 
dispose_list(&freeable);
+   return freed;
 }
 
 static void __wait_on_freeing_inode(struct inode *inode);
diff --git a/fs/internal.h b/fs/internal.h
index 916b7cb..7d7908b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,6 +110,8 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
+extern long prune_icache_sb(struct super_block *sb, long nr_to_scan);
+
 
 /*
  * fs-writeback.c
@@ -124,3 +126,4 @@ extern int invalidate_inodes(struct super_block *, bool);
  * dcache.c
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
+extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan);
diff --git a/fs/super.c b/fs/super.c
index 21abf02..fda6f12 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,11 +53,14 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
-static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
+static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
struct super_block *sb;
-   int fs_objects = 0;
-   int total_objects;
+   longfs_objects = 0;
+   longtotal_objects;
+   longfreed = 0;
+   longdentries;
+   longinodes;
 
sb = container_of(shrink, struct super_block, s_shrink);
 
@@ -65,7 +68,7 @@ static int prune_

[PATCH 11/19] fs: convert inode and dentry shrinking to be node aware

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

Now that the shrinker is passing a nodemask in the scan control
structure, we can pass this to the the generic LRU list code to
isolate reclaim to the lists on matching nodes.

This requires a small amount of refactoring of the LRU list API,
which might be best split out into a separate patch.

Signed-off-by: Dave Chinner 
---
 fs/dcache.c  |7 ---
 fs/inode.c   |7 ---
 fs/internal.h|6 --
 fs/super.c   |   22 +-
 fs/xfs/xfs_super.c   |6 --
 include/linux/fs.h   |4 ++--
 include/linux/list_lru.h |   19 ---
 lib/list_lru.c   |   18 ++
 8 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d72e388..7f107fb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -907,13 +907,14 @@ static int dentry_lru_isolate(struct list_head *item, 
spinlock_t *lru_lock,
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
+long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
+nodemask_t *nodes_to_walk)
 {
LIST_HEAD(dispose);
long freed;
 
-   freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
- &dispose, nr_to_scan);
+   freed = list_lru_walk_nodemask(&sb->s_dentry_lru, dentry_lru_isolate,
+  &dispose, nr_to_scan, nodes_to_walk);
shrink_dentry_list(&dispose);
return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 2662305..3857f9f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -731,13 +731,14 @@ static int inode_lru_isolate(struct list_head *item, 
spinlock_t *lru_lock,
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+long prune_icache_sb(struct super_block *sb, long nr_to_scan,
+nodemask_t *nodes_to_walk)
 {
LIST_HEAD(freeable);
long freed;
 
-   freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
-   &freeable, nr_to_scan);
+   freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
+  &freeable, nr_to_scan, nodes_to_walk);
dispose_list(&freeable);
return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 7d7908b..95c4e9b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,7 +110,8 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, long nr_to_scan);
+extern long prune_icache_sb(struct super_block *sb, long nr_to_scan,
+   nodemask_t *nodes_to_scan);
 
 
 /*
@@ -126,4 +127,5 @@ extern int invalidate_inodes(struct super_block *, bool);
  * dcache.c
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
-extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan);
+extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
+   nodemask_t *nodes_to_scan);
diff --git a/fs/super.c b/fs/super.c
index b1d24ef..3c975b1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -75,10 +75,10 @@ static long super_cache_scan(struct shrinker *shrink, 
struct shrink_control *sc)
return -1;
 
if (sb->s_op && sb->s_op->nr_cached_objects)
-   fs_objects = sb->s_op->nr_cached_objects(sb);
+   fs_objects = sb->s_op->nr_cached_objects(sb, &sc->nodes_to_scan);
 
-   inodes = list_lru_count(&sb->s_inode_lru);
-   dentries = list_lru_count(&sb->s_dentry_lru);
+   inodes = list_lru_count_nodemask(&sb->s_inode_lru, &sc->nodes_to_scan);
+   dentries = list_lru_count_nodemask(&sb->s_dentry_lru, &sc->nodes_to_scan);
total_objects = dentries + inodes + fs_objects + 1;
 
/* proportion the scan between the caches */
@@ -89,12 +89,13 @@ static long super_cache_scan(struct shrinker *shrink, 
struct shrink_control *sc)
 * prune the dcache first as the icache is pinned by it, then
 * prune the icache, followed by the filesystem specific caches
 */
-   freed = prune_dcache_sb(sb, dentries);
-   freed += prune_icache_sb(sb, inodes);
+   freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan);
+   freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan);
 
if (fs_objects) {
fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;
-   freed += sb->s_op->free_cached_objects(sb, fs_o

[PATCH 04/19] mm: new shrinker API

2012-11-27 Thread Dave Chinner
From: Dave Chinner 

The current shrinker callout API uses a single shrinker call for
multiple functions. To determine the function, a special magical
value is passed in a parameter to change the behaviour. This
complicates the implementation and return value specification for
the different behaviours.

Separate the two different behaviours into separate operations, one
to return a count of freeable objects in the cache, and another to
scan a certain number of objects in the cache for freeing. In
defining these new operations, ensure the return values and
resultant behaviours are clearly defined and documented.

Modify shrink_slab() to use the new API and implement the callouts
for all the existing shrinkers.

Signed-off-by: Dave Chinner 
---
 include/linux/shrinker.h |   37 --
 mm/vmscan.c  |   50 +-
 2 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index ac6b8ee..4f59615 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,31 +4,47 @@
 /*
  * This struct is used to pass information from page reclaim to the shrinkers.
  * We consolidate the values for easier extention later.
+ *
+ * The 'gfpmask' refers to the allocation we are currently trying to
+ * fulfil.
+ *
+ * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
+ * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrink_control {
gfp_t gfp_mask;
 
/* How many slab objects shrinker() should scan and try to reclaim */
-   unsigned long nr_to_scan;
+   long nr_to_scan;
 };
 
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
- * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
- * and a 'gfpmask'.  It should look through the least-recently-used
- * 'nr_to_scan' entries and attempt to free them up.  It should return
- * the number of objects which remain in the cache.  If it returns -1, it means
- * it cannot do any scanning at this time (eg. there is a risk of deadlock).
+ * @shrink() should look through the least-recently-used 'nr_to_scan' entries
+ * and attempt to free them up.  It should return the number of objects which
+ * remain in the cache.  If it returns -1, it means it cannot do any scanning at
+ * this time (eg. there is a risk of deadlock).
  *
- * The 'gfpmask' refers to the allocation we are currently trying to
- * fulfil.
+ * @count_objects should return the number of freeable items in the cache. If
+ * there are no objects to free or the number of freeable items cannot be
+ * determined, it should return 0. No deadlock checks should be done during the
+ * count callback - the shrinker relies on aggregating scan counts that couldn't
+ * be executed due to potential deadlocks to be run at a later call when the
+ * deadlock condition is no longer pending.
  *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
+ * @scan_objects will only be called if @count_objects returned a positive
+ * value for the number of freeable objects. The callout should scan the cache
+ * and attempt to free items from the cache. It should then return the number of
+ * objects freed during the scan, or -1 if progress cannot be made due to
+ * potential deadlocks. If -1 is returned, then no further attempts to call the
+ * @scan_objects will be made from the current reclaim context.
  */
 struct shrinker {
int (*shrink)(struct shrinker *, struct shrink_control *sc);
+   long (*count_objects)(struct shrinker *, struct shrink_control *sc);
+   long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
+
int seeks;  /* seeks to recreate an obj */
long batch; /* reclaim batch size, 0 = default */
 
@@ -36,6 +52,7 @@ struct shrinker {
struct list_head list;
atomic_long_t nr_in_batch; /* objs pending delete */
 };
+
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 extern void register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..55c4fc9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,19 +204,19 @@ static inline int do_shrinker_shrink(struct shrinker 
*shrinker,
  *
  * Returns the number of slab objects which we shrunk.
  */
-unsigned long shrink_slab(struct shrink_control *shrink,
+unsigned long shrink_slab(struct shrink_control *sc,
  unsigned long nr_pages_scanned,
  unsigned long lru_pages)
 {
struct shrinker *shrinker;
-   unsigned long ret = 0;
+   unsigned long freed = 0;
 
if (nr_pages_scanned == 0)
nr_pages_scanne

Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-27 Thread Dave Chinner
On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner  wrote:
> > +/*
> > + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've 
> > seen.
> > + *
> > + * i915_gem_purge() expects a byte count to be passed, and the minimum 
> > object
> > + * size is PAGE_SIZE.
> 
> No, purge() expects a count of pages to be freed. Each pass of the
> shrinker therefore tries to free a minimum of 128 pages.

Ah, I got the shifts mixed up. I'd been looking at way too much crap
already when I saw this. But the fact this can be misunderstood says
something about the level of documentation that the code has (i.e.
none).

> > The shrinker doesn't work on bytes - it works on
> > + * *objects*.
> 
> And I thought you were reviewing the shrinker API to be useful where a
> single object may range between 4K and 4G.

Which requires rewriting all the algorithms to not be dependent on
the subsystems using a fixed size object. The shrinker control
function is called shrink_slab() for a reason - it was expected to
be used to shrink caches of fixed sized objects allocated from slab
memory.

It has no concept of the amount of memory that each object consumes,
just an idea of how much *IO* it takes to replace the object in
memory once it's been reclaimed. DEFAULT_SEEKS is designed to
encode the fact it generally takes 2 IOs to replace either a LRU
page or a filesystem slab object, and so balances the scanning based
on that value. i.e. the shrinker algorithms are solidly based around
fixed sized objects that have some relationship to the cost of
physical IO operations to replace them in the cache.

The API change is the first step in the path to removing these built
in assumptions. The current API is just insane and any attempt to
build on it is going to be futile. The way I see this developing is
this:

- make the shrink_slab count -> scan algorithm per node

- add information about size of objects in the cache for
  fixed size object caches.
- the shrinker now has some idea of how many objects
  need to be freed to be able to free a page of
  memory, as well as the relative penalty for
  replacing them.
- tells the shrinker the size of the cache
  in bytes so overall memory footprint of the caches
  can be taken into account

- add new count and scan operations for caches that are
  based on memory used, not object counts
- allows us to use the same count/scan algorithm for
  calculating how much pressure to put on caches
  with variable size objects.

My care factor mostly ends here, as it will allow XFS to correctly
balance the metadata buffer cache (variable size objects) against the
inode, dentry and dquot caches which are object based. The next
steps that I'm about to give you are based on some discussions with
some MM people over bottles of red wine, so take it with a grain of
salt...

- calculate a "pressure" value for each cache controlled by a
  shrinker so that the relative memory pressure between
  caches can be compared. This allows the shrinkers to bias
  reclaim based on where the memory pressure is being
  generated

- start grouping shrinkers into a heirarchy, allowing
  related shrinkers (e.g. all the caches in a memcg) to be
  shrunk according resource limits that can be placed on the
  group. i.e. memory pressure is proportioned across
  groups rather than many individual shrinkers.

- comments have been made to the extent that with generic
  per-node lists and a node aware shrinker, all of the page
  scanning could be driven by the shrinker infrastructure,
  rather than the shrinkers being driven by how many pages
  in the page cache just got scanned for reclaim.

  IOWs, the main memory reclaim algorithm walks all the
  shrinkers groups to calculate overall memory pressure,
  calculate how much reclaim is necessary, and then
  proportion reclaim across all the shrinker groups. i.e.
  everything is a shrinker.

This patch set is really just the start of a long process. Balance
between the page cache and VFS/filesystem shrinkers is critical to
the efficient operation of the OS under many, many workloads, so I'm
not about to change more than one little thing at a time. This API
change is just one little step. You'll get what you want eventually,
but you're not going to get it as a first step.

> > + * But the craziest part comes when i915_gem_purge() has walked all the 
> > objects
> > + * and can't free any memory. That results in i915_g

Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-28 Thread Dave Chinner
On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
> On 11/28/2012 07:17 AM, Dave Chinner wrote:
> > On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
> >> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner  
> >> wrote:
> >>> +/*
> >>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've 
> >>> seen.
> >>> + *
> >>> + * i915_gem_purge() expects a byte count to be passed, and the minimum 
> >>> object
> >>> + * size is PAGE_SIZE.
> >>
> >> No, purge() expects a count of pages to be freed. Each pass of the
> >> shrinker therefore tries to free a minimum of 128 pages.
> > 
> > Ah, I got the shifts mixed up. I'd been looking at way too much crap
> > already when I saw this. But the fact this can be misunderstood says
> > something about the level of documentation that the code has (i.e.
> > none).
> > 
> >>> The shrinker doesn't work on bytes - it works on
> >>> + * *objects*.
> >>
> >> And I thought you were reviewing the shrinker API to be useful where a
> >> single object may range between 4K and 4G.
> > 
> > Which requires rewriting all the algorithms to not be dependent on
> > the subsystems using a fixed size object. The shrinker control
> > function is called shrink_slab() for a reason - it was expected to
> > be used to shrink caches of fixed sized objects allocated from slab
> > memory.
> > 
> > It has no concept of the amount of memory that each object consumes,
> > just an idea of how much *IO* it takes to replace the object in
> > memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> > encode the fact it generally takes 2 IOs to replace either a LRU
> > page or a filesystem slab object, and so balances the scanning based
> > on that value. i.e. the shrinker algorithms are solidly based around
> > fixed sized objects that have some relationship to the cost of
> > physical IO operations to replace them in the cache.
> 
> One nit: It shouldn't take 2IOs to replace a slab object, right?
> objects.

A random dentry in a small directory will take one IO to read the
inode, then another to read the block the dirent sits in. To read an
inode from a cached dentry will generally take one IO to read the
inode, and another to read related, out-of-inode information (e.g.
attributes or extent/block maps). Sometimes it will only take one IO,
sometimes it might take 3 or, in the case of dirents, could take
hundreds of IOs if the directory structure is large enough.

So a default of 2 seeks to replace any single dentry/inode in the
cache is a pretty good default to use.
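
i.e. a cache of that kind just keeps the default cost estimate when it
registers its shrinker - purely illustrative, placeholder callback
names:

static struct shrinker my_metadata_shrinker = {
	.count_objects	= my_count_objects,
	.scan_objects	= my_scan_objects,
	.seeks		= DEFAULT_SEEKS,	/* ~2 IOs to re-read an object */
};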

> This
> should be the cost of allocating a new page, that can contain, multiple
> Once the page is in, a new object should be quite cheap to come up with.

It's not the cost of allocating the page (a couple of microseconds)
that is being considered - it's the 3-4 orders of magnitude worse cost
of reading the object from disk (could be 20ms). The slab/page
allocation is lost in the noise compared to the time it takes to
fill the page cache page with data or a single slab object.
Essentially, slab pages with multiple objects in them are much more
expensive to replace in the cache than a page cache page

> This is a very wild thought, but now that I am diving deep in the
> shrinker API, and seeing things like this:
> 
> if (reclaim_state) {
> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> reclaim_state->reclaimed_slab = 0;
> }

That's not part of the shrinker - that's part of the vmscan
code, external to the shrinker infrastructure. It's getting
information back from the slab caches behind the shrinkers, and it's
not the full picture because many shrinkers are not backed by slab
caches. It's a workaround for not having accurate feedback from
the shrink_slab() code about how many pages were freed.

Essentially, the problem is an impedance mismatch between the way
the LRUs are scanned/balanced (in pages) and slab caches are managed
(by objects). That's what needs unifying...

> I am becoming more convinced that we should have a page-based mechanism,
> like the rest of vmscan.

Been thought about and considered before. Would you like to rewrite
the slab code?

> Also, if we are seeing pressure from someone requesting user pages, what
> good does it make to free, say, 35 Mb of memory, if this means we are
> freeing objects across 5k different pages, without actually releasing
> any of them? (still is TBD if this is a theoretical problem or a
> practical one). It would maybe be better to free objects that are
> moderately hot, but are on page

Re: [PATCH] tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)

2012-11-28 Thread Dave Chinner
On Wed, Nov 28, 2012 at 05:22:03PM -0800, Hugh Dickins wrote:
> Revert 3.5's f21f8062201f ("tmpfs: revert SEEK_DATA and SEEK_HOLE")
> to reinstate 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"),
> with the intervening additional arg to generic_file_llseek_size().
> 
> In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper
> SEEK_DATA and SEEK_HOLE support; and a good case has now been made
> for it on tmpfs, so let's join the party.
> 
> It's quite easy for tmpfs to scan the radix_tree to support llseek's new
> SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still
> on my mind (in particular, the !PageUptodate-ness of pages fallocated but
> still unwritten).
> 
> [a...@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
> Signed-off-by: Hugh Dickins 
> ---

Does it pass the seek hole/data tests (285, 286) in xfstests?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)

2012-11-29 Thread Dave Chinner
\0\0\0\0"..., 
33554432) = 33554432
mmap(NULL, 134225920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f629fcfd000
munmap(0x7f62a7cff000, 67117056)= 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
67108864) = 67108864
mmap(NULL, 268443648, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f628fcfb000
munmap(0x7f629fcfd000, 134225920)   = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
134217728) = 134217728
mmap(NULL, 536879104, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f626fcf9000
munmap(0x7f628fcfb000, 268443648)   = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
268435456) = 268435456
mmap(NULL, 1073750016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f622fcf7000
munmap(0x7f626fcf9000, 536879104)   = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
536870912) = 536870912
mmap(NULL, 2147491840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f61afcf5000
munmap(0x7f622fcf7000, 1073750016)  = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
1073741824) = 1073741824
mmap(NULL, 4294975488, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f60afcf3000
munmap(0x7f61afcf5000, 2147491840)  = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
2147483648) = 2147479552
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
4096) = 4096
mmap(NULL, 8589942784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= -1 ENOMEM (Cannot allocate memory)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-29 Thread Dave Chinner
On Thu, Nov 29, 2012 at 02:29:33PM +0400, Glauber Costa wrote:
> On 11/29/2012 01:28 AM, Dave Chinner wrote:
> > On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
> >> On 11/28/2012 07:17 AM, Dave Chinner wrote:
> >>> On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
> >>>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner  
> >>>> wrote:
> >>>>> The shrinker doesn't work on bytes - it works on
> >>>>> + * *objects*.
> >>>>
> >>>> And I thought you were reviewing the shrinker API to be useful where a
> >>>> single object may range between 4K and 4G.
> >>>
> >>> Which requires rewriting all the algorithms to not be dependent on
> >>> the subsystems using a fixed size object. The shrinker control
> >>> function is called shrink_slab() for a reason - it was expected to
> >>> be used to shrink caches of fixed sized objects allocated from slab
> >>> memory.
> >>>
> >>> It has no concept of the amount of memory that each object consumes,
> >>> just an idea of how much *IO* it takes to replace the object in
> >>> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> >>> encode the fact it generally takes 2 IOs to replace either a LRU
> >>> page or a filesystem slab object, and so balances the scanning based
> >>> on that value. i.e. the shrinker algorithms are solidly based around
> >>> fixed sized objects that have some relationship to the cost of
> >>> physical IO operations to replace them in the cache.
> >>
> >> One nit: It shouldn't take 2IOs to replace a slab object, right?
> >> objects.
> > 
> > A random dentry in a small directory will take on IO to read the
> > inode, then another to read the block the dirent sits in. TO read an
> > inode froma cached dentry will generally take one IO to read the
> > inode, and another to read related, out of inode information (e.g.
> > attributes or extent/block maps). Sometimes it will only take on IO,
> > sometimes it might take 3 or, in the case of dirents, coult take
> > hundreds of IOs if the directory structure is large enough.
> > 
> > So a default of 2 seeks to replace any single dentry/inode in the
> > cache is a pretty good default to use.
> > 
> >> This
> >> should be the cost of allocating a new page, that can contain, multiple
> >> Once the page is in, a new object should be quite cheap to come up with.
> > 
> 
> Indeed. More on this in the next paragraph...

I'm not sure what you are trying to say here. Are you saying that
you think that the IO cost for replacing a slab cache object doesn't
matter?

> > It's not the cost of allocating the page (a couple of microseconds)
> > that is being considered - it the 3-4 orders of magnitude worse cost
> > of reading the object from disk (could be 20ms). The slab/page
> > allocation is lost in the noise compared to the time it takes to
> > fill the page cache page with data or a single slab object.
> > Essentially, slab pages with multiple objects in them are much more
> > expensive to replace in the cache than a page cache page
> > 
> >> This is a very wild thought, but now that I am diving deep in the
> >> shrinker API, and seeing things like this:
> >>
> >> if (reclaim_state) {
> >> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> >> reclaim_state->reclaimed_slab = 0;
> >> }
> > 
> > That's not part of the shrinker - that's part of the vmscan
> > code, external to the shrinker infrastructure. It's getting
> > information back from the slab caches behind the shrinkers, and it's
> > not the full picture because many shrinkers are not backed by slab
> > caches. It's a work around for not not having accurate feedback from
> > the shrink_slab() code about how many pages were freed.
> > 
> I know it is not part of the shrinkers, and that is precisely my point.
> vmscan needs to go through this kinds of hacks because our API is not
> strong enough to just give it back the answer that matters to the caller.

What matters is that the slab caches are shrunk in proportion to the
page cache. i.e. balanced reclaim. For dentry and inode caches, what
matters is the number of objects reclaimed because the shrinker
algorithm balances based on the relative cost of object replacement
in the cache.

e.g. if you have 1000 pages in the page LRUs, and 1000 objects in
the dentry cache, each takes 1 IO

Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers

2012-11-29 Thread Dave Chinner
On Thu, Nov 29, 2012 at 11:02:24AM -0800, Andi Kleen wrote:
> Dave Chinner  writes:
> >
> > Comments, thoughts and flames all welcome.
> 
> Doing the reclaim per CPU sounds like a big change in the VM balance. 

It's per node, not per CPU. And AFAICT, it hasn't changed the
balance of page cache vs inode/dentry caches under general, global
workloads at all.

> Doesn't this invalidate some zone reclaim mode settings?

No, because zone reclaim is per-node and the shrinkers now can
reclaim just from a single node. i.e. the behaviour is now better
suited to the aims of zone reclaim which is to free memory from a
single, targeted node. Indeed, I removed a hack in the zone reclaim
code that sprayed slab reclaim across the entire machine until
sufficient objects had been freed from the target node

> How did you validate all this?

fakenuma setups, various workloads that generate even dentry/slab
cache loadings across all nodes, adding page cache pressure on a
single node and watching slab reclaim from a single node. That sort of
thing.

I haven't really done any performance testing other than "not
obviously slower". There's no point optimising anything before
there's any sort of agreement as to whether this is the right
approach to take or not

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Dave Chinner
On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason  wrote:
> >
> > Just reading the new blkdev_get_blocks, it looks like we're mixing
> > shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
> > it has no relation at all to the actual block size of the device.  The
> > interface is abusing b_size to ask for as large a mapping as possible.
> 
> Ugh. That's a big violation of how buffer-heads are supposed to work:
> the block number is very much defined to be in multiples of b_size
> (see for example "submit_bh()" that turns it into a sector number).
> 
> But you're right. The direct-IO code really *is* violating that, and
> knows that get_block() ends up being defined in i_blkbits regardless
> of b_size.

Same with mpage_readpages(), so it's not just direct IO that has
this problem

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

2013-03-17 Thread Dave Chinner
On Sun, Mar 17, 2013 at 01:06:59PM +, David Howells wrote:
> Miklos Szeredi  wrote:
> 
> > Export do_splice_direct() to modules.  Needed by overlay filesystem.
> 
> Apparently you cannot call this from any function that is holding an i_mutex
> if the target of the splice uses generic_file_splice_write().
> 
> The problem is a potential deadlock situation:
> 
> We have places already that do:
> 
>   mnt_want_write()
>   mutex_lock()
> 
> This can be found in do_last() for example.
> 
> However, mnt_want_write() calls sb_start_write() as does
> generic_file_splice_write().  So now in ovl_copy_up_locked() you're adding:
> 
>   mutex_lock()
>   sb_start_write()
> 
> which lockdep reports as a potential ABBA deadlock.
> 
> Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> might operate, so it's possible that this is a false alarm.  Maybe Jan Kara 
> can
> illuminate further, so I've added him to the cc list.
> 
> I've attached the report I got with unionmount.

There's plenty of problems with splice locking that can lead to
deadlocks. Here's another that's been known for ages:

http://oss.sgi.com/archives/xfs/2011-08/msg00168.html

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-18 Thread Dave Chinner
On Wed, Mar 13, 2013 at 10:00:51PM -0400, Peter Hurley wrote:
> On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> > We don't care about the ordering between multiple concurrent
> > metadata modifications - what matters is whether the ongoing data IO
> > around them is ordered correctly.
> 
> Dave,
> 
> The point that Michel is making is that there never was any ordering
> guarantee by rwsem. It's an illusion.

Weasel words.

> The reason is simple: to even get to the lock the cpu has to be
> sleep-able. So for every submission that you believe is ordered, is by
> its very nature __not ordered__, even when used by kernel code.
>
> Why? Because any thread on its way to claim the lock (reader or writer)
> could be pre-empted for some other task, thus delaying the submission of
> whatever i/o you believed to be ordered.

You think I don't know this?  You're arguing fine grained, low level
behaviour between tasks is unpredictable. I get that. I understand
that. But I'm not arguing about fine-grained, low level, microsecond
semantics of the locking order

What you (and Michel) appear to be failing to see is what happens
on a macro level when you have read locks being held for periods
measured in *seconds* (e.g. direct IO gets queued behind a few
thousand other IOs in the elevator waiting for a request slot),
and the subsequent effect of inserting an operation that requires a
write lock into that IO stream.

IOWs, it simply doesn't matter if there's a micro-level race between
the write lock and a couple of the readers. That's the level you
guys are arguing at but it simply does not matter in the cases I'm
describing. I'm talking about high level serialisation behaviours
that might take *seconds* to play out and the ordering behaviours
observed at that scale.

That is, I don't care if a couple of threads out of a few thousand
race with the write lock over few tens to hundreds of microseconds,
but I most definitely care if a few thousand IOs issued seconds
after the write lock is queued jump over the write lock. That is a
gross behavioural change at the macro-level.

> So just to reiterate: there is no 'queue' and no 'barrier'. The
> guarantees that rwsem makes are;
> 1. Multiple readers can own the lock.
> 2. Only a single writer can own the lock.
> 3. Readers will not starve writers.

You've conveniently ignored the fact that the current implementation
also provides the following guarantee:

4. new readers will block behind existing writers

And that's the behaviour we currently depend on, whether you like it
or not.
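
To illustrate the macro-level pattern (made-up function names, stock
rwsem calls only):

/*
 * Illustrative sketch only, using the stock rwsem API from
 * <linux/rwsem.h>. The point is the ordering described above: once the
 * writer has queued for the lock, later read-side submitters queue
 * behind it instead of jumping over it.
 */
#include <linux/rwsem.h>

static DECLARE_RWSEM(io_lock);

static void submit_dio(void)		/* thousands of these in flight */
{
	down_read(&io_lock);
	/* issue and wait for the direct IO */
	up_read(&io_lock);
}

static void punch_hole(void)		/* occasional metadata operation */
{
	/*
	 * Waits for in-flight readers to drain. Guarantee (4) above is
	 * what stops a continuous stream of new submit_dio() callers
	 * from starving this writer for seconds at a time.
	 */
	down_write(&io_lock);
	/* modify the file layout */
	up_write(&io_lock);
}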

> Where lock policy can have a significant impact is on performance. But
> predicting that impact is difficult -- it's better just to measure.

Predicting the impact in this case is trivial - it's obvious that
ordering of operations will change and break high level assumptions
that userspace currently makes about various IO operations on XFS
filesystems

> It's not my intention to convince you (or anyone else) that there should
> only be One True Rwsem, because I don't believe that. But I didn't want
> the impression to persist that rwsem does anything more than implement a
> fair reader/writer semaphore.

I'm sorry, but redefining "fair" to suit your own needs doesn't
convince me of anything. rwsem behaviour has been unchanged for at
least 10 years and hence the current implementation defines what is
"fair", not what you say is fair

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 3/4] writeback: replace custom worker pool implementation with unbound workqueue

2013-03-20 Thread Dave Chinner
On Tue, Mar 19, 2013 at 10:28:24AM -0700, Tejun Heo wrote:
> Hello, Jan.
> 
> On Tue, Mar 19, 2013 at 8:46 AM, Jan Kara  wrote:
> >   Well, but what you often get is just output of sysrq-w, or sysrq-t, or
> > splat from scheduler about stuck task. You often don't have the comfort of
> > tracing... Can't we somehow change 'comm' of the task when it starts
> > processing work of some bdi?
> 
> You sure can but I'd prefer not to do that. If you wanna do it
> properly, you have to grab task lock every time a work item starts
> execution. I'm not sure how beneficial having the block device
> identifier would be. Backtrace would be there the same. Is identifying
> the block device that important?

When you have a system that has 50+ active filesystems (pretty
common in the distributed storage environments where every disk has
its own filesystem), knowing which filesystem(s) are getting stuck
in writeback from the sysrq-w or hangcheck output is pretty damn
important

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/6 RFC] Mapping range lock

2013-02-05 Thread Dave Chinner
On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote:
> On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> > > c) i_mutex doesn't allow any paralellism of operations using it and some
> > >filesystems workaround this for specific cases (e.g. DIO reads). Using
> > >range locking allows for concurrent operations (e.g. writes, DIO) on
> > >different parts of the file. Of course, range locking itself isn't
> > >enough to make the parallelism possible. Filesystems still have to
> > >somehow deal with the concurrency when manipulating inode allocation
> > >data. But the range locking at least provides a common VFS mechanism 
> > > for
> > >serialization VFS itself needs and it's upto each filesystem to
> > >serialize more if it needs to.
> > 
> > That would be useful to end-users, but I'm having trouble predicting
> > *how* useful.
>   As Zheng said, there are people interested in this for DIO. Currently
> filesystems each invent their own tweaks to avoid the serialization at
> least for the easiest cases.

The thing is, this won't replace the locking those filesystems use
to parallelise DIO - it just adds another layer of locking they'll
need to use. The locks filesystems like XFS use to serialise IO
against hole punch also serialise against many more internal
functions and so if these range locks don't have the same capability
we're going to have to retain those locks even after the range locks
are introduced. It basically means we're going to have two layers
of range locks - one for IO sanity and atomicity, and then this
layer just for hole punch vs mmap.

As i've said before, what we really need in XFS is IO range locks
because we need to be able to serialise operations against IO in
progress, not page cache operations in progress.  IOWs, locking at
the mapping tree level does not provide the right exclusion
semantics we need to get rid of the existing filesystem locking that
allows concurrent IO to be managed.  Hence the XFS IO path locking
suddenly becomes 4 locks deep:

i_mutex
  XFS_IOLOCK_{SHARED,EXCL}
mapping range lock
  XFS_ILOCK_{SHARED,EXCL}

That's because the buffered IO path uses per-page lock ranges and to
provide atomicity of read vs write, read vs truncate, etc we still
need to use the XFS_IOLOCK_EXCL to provide this functionality.

Hence I really think we need to be driving this lock outwards to
where the i_mutex currently sits, turning it into an *IO range
lock*, and not an inner-level mapping range lock, i.e. flattening the
locking to:

io_range_lock(off, len)
  fs internal inode metadata modification lock

Yes, I know this causes problems with mmap and locking orders, but
perhaps we should be trying to get that fixed first because it
simplifies the whole locking schema we need for filesystems to
behave sanely. i.e. shouldn't we be aiming to simplify things
as we rework locking rather than make them more complex?

IOWs, I think the "it's a mapping range lock" approach is not the
right level to be providing IO exclusion semantics. After all, it's
entire IO ranges that we need to provide -atomic- exclusion for...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/6 RFC] Mapping range lock

2013-02-06 Thread Dave Chinner
On Wed, Feb 06, 2013 at 08:25:34PM +0100, Jan Kara wrote:
> On Wed 06-02-13 10:25:12, Dave Chinner wrote:
> > On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote:
> > > On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> > > > > c) i_mutex doesn't allow any paralellism of operations using it and 
> > > > > some
> > > > >filesystems workaround this for specific cases (e.g. DIO reads). 
> > > > > Using
> > > > >range locking allows for concurrent operations (e.g. writes, DIO) 
> > > > > on
> > > > >different parts of the file. Of course, range locking itself isn't
> > > > >enough to make the parallelism possible. Filesystems still have to
> > > > >somehow deal with the concurrency when manipulating inode 
> > > > > allocation
> > > > >data. But the range locking at least provides a common VFS 
> > > > > mechanism for
> > > > >serialization VFS itself needs and it's upto each filesystem to
> > > > >serialize more if it needs to.
> > > > 
> > > > That would be useful to end-users, but I'm having trouble predicting
> > > > *how* useful.
> > >   As Zheng said, there are people interested in this for DIO. Currently
> > > filesystems each invent their own tweaks to avoid the serialization at
> > > least for the easiest cases.
> > 
> > The thing is, this won't replace the locking those filesystems use
> > to parallelise DIO - it just adds another layer of locking they'll
> > need to use. The locks filesystems like XFS use to serialise IO
> > against hole punch also serialise against many more internal
> > functions and so if these range locks don't have the same capability
> > we're going to have to retain those locks even after the range locks
> > are introduced. It basically means we're going to have two layers
> > of range locks - one for IO sanity and atomicity, and then this
> > layer just for hole punch vs mmap.
> > 
> > As i've said before, what we really need in XFS is IO range locks
> > because we need to be able to serialise operations against IO in
> > progress, not page cache operations in progress.
>   Hum, I'm not sure I follow you here. So mapping tree lock + PageLocked +
> PageWriteback serialize all IO for part of the file underlying the page.
> I.e. at most one of truncate (punch hole), DIO, writeback, buffered write,
> buffered read, page fault can run on that part of file.

Right, it serialises page cache operations sufficient to avoid
page cache coherence problems, but it does not serialise operations
sufficiently to provide atomicity between operations that should be
atomic w.r.t. each other.

> So how come it
> doesn't provide enough serialization for XFS?
> 
> Ah, is it the problem that if two threads do overlapping buffered writes
> to a file then we can end up with data mixed from the two writes (if we
> didn't have something like i_mutex)?

That's one case of specific concern - the POSIX write() atomicity
guarantee - but it indicates the cause of many of my other concerns,
too. e.g. write vs prealloc, write vs punch, read vs truncate, write
vs truncate, buffered vs direct write, etc.

Basically, page-cache granularity locking for buffered IO means that
it cannot be wholly serialised against any other operation in
progress. That means we can't use the range lock to provide a
barrier to guarantee that no IO is currently in progress at all, and
hence it doesn't provide the IO barrier semantics we need for
various operations within XFS.

An example of this is that the online defrag ioctl requires that the
copyin + mtime updates in the write path be atomic w.r.t. the swap
extents ioctl so that it can detect concurrent modification of the file being
defragged and abort. The page cache based range locks simply don't
provide this coverage, and so we'd need to maintain the IO operation
locking we currently have to provide this exclusion..

Truncate is something I also see as particularly troublesome,
because the i_mutex current provides mutual exclusion against the
operational range of a buffered write (i.e. at the .aio_write level)
and not page granularity like this patch set would result in. Hence
the behaviour of write vs truncate races could change quite
significantly. e.g.  instead of "write completes then truncate" or
"truncate completes then write", we could have "partial write,
truncate, write continues and completes" resulting in a bunch of
zeros inside the range the write call wrote to. The application
won't even realise that the data it wrote has been corrupted.
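To make that window concrete, here's a rough sketch of a buffered
write taking a page-granularity range lock (names are illustrative
assumptions, not the actual patch API):

	while (count > 0) {
		size_t bytes = min_t(size_t, count,
				     PAGE_SIZE - offset_in_page(pos));

		mapping_range_lock(mapping, pos, pos + bytes - 1);
		/* copy this page's worth of data from userspace */
		mapping_range_unlock(mapping, pos, pos + bytes - 1);

		/*
		 * Nothing is held here, so a truncate can run between
		 * two pages of the same write() syscall - the "partial
		 * write, truncate, write continues" case above.
		 */
		pos += bytes;
		count -= bytes;
	}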

Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type

2013-02-11 Thread Dave Chinner
On Mon, Feb 11, 2013 at 05:25:58PM +0900, Namjae Jeon wrote:
> From: Namjae Jeon 
> 
> This patch is a follow up on below patch:
> 
> [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
> commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 

> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
> index a836118..3391800 100644
> --- a/fs/xfs/xfs_export.c
> +++ b/fs/xfs/xfs_export.c
> @@ -48,7 +48,7 @@ static int xfs_fileid_length(int fileid_type)
>   case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG:
>   return 6;
>   }
> - return 255; /* invalid */
> + return FILEID_INVALID; /* invalid */
>  }

I think you can drop the "/* invalid */" comment from there now as
it is redundant with this change.
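i.e. the tail of the function would simply end up as (sketch of the
suggested end result):

	case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG:
		return 6;
	}
	return FILEID_INVALID;
}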

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.8-rc5 xfs corruption

2013-01-30 Thread Dave Chinner
On Wed, Jan 30, 2013 at 10:16:47PM -0500, CAI Qian wrote:
> Hello,
> 
> (Sorry to post to xfs mailing lists but unsure about which one is the
> best for this.)

Trimmed to just x...@oss.sgi.com.

> I have seen something like this once during testing on a system with a
> EMC VNX FC/multipath back-end.

This is a trace from the verifier code that was added in 3.8-rc1 so
I doubt it has anything to do with any problem you've seen in the
past

Can you tell us what workload you were running and what hardware you
are using as per:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

As it is, if you mounted the filesystem after this problem was
detected, log recovery probably propagated it to disk. I'd suggest
that you run xfs_repair -n on the device and post the output so we
can see if any corruption has actually made it to disk. If no
corruption made it to disk, it's possible that we've got the
incorrect verifier attached to the buffer.

> [ 3025.063024] 8801a0d5: 2e 2e 2f 2e 2e 2f 75 73 72 2f 6c 69 62 2f 6d 
> 6f  ../../usr/lib/mo 

The start of a block contains a path and the only
type of block that can contain this format of metadata is a remote
symlink block. Remote symlink blocks don't have a verifier attached
to them as there is nothing that can currently be used to verify
them as correct.

I can't see exactly how this can occur as stale buffers have the
verifier ops cleared before being returned to the new user, and
newly allocated xfs_bufs are zeroed before being initialised. I
really need to know what you are doing to be able to get to the
bottom of it

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] xfs: Fix possible truncation of log data in xlog_bread_noalign()

2013-02-23 Thread Dave Chinner
On Sat, Feb 23, 2013 at 07:06:10AM +, Tony Lu wrote:
> >From: Dave Chinner [mailto:da...@fromorbit.com]
> >On Fri, Feb 22, 2013 at 08:12:52AM +, Tony Lu wrote:
> >> I encountered the following panic when using xfs partitions as rootfs, 
> >> which
> >> is due to the truncated log data read by xlog_bread_noalign(). We should
> >> extend the buffer by one extra log sector to ensure there's enough space to
> >> accommodate requested log data, which we indeed did in xlog_get_bp(), but 
> >> we
> >> forgot to do in xlog_bread_noalign().
> >
> >We've never done that round up in xlog_bread_noalign(). It shouldn't
> >be necessary as xlog_get_bp() and xlog_bread_noalign() are doing
> >fundamentally different things. That is, xlog_get_bp() is ensuring
> >the buffer is large enough for the upcoming IO that will be
> >requested, while xlog_bread_noalign() is simply ensuring what it is
> >passed is correctly aligned to device sector boundaries.
> 
> I set the sector size as 4096 when making the xfs filesystem.
> -sh-4.1# mkfs.xfs -s size=4096 -f /dev/sda3

.

> In this case, xlog_bread_noalign() needs to do such round up and
> round down frequently. And it is used to ensure what it is passed
> is aligned to the log sector size, but not the device sector
> boundaries.

If you have a 4k sector device, then the log sector size is the same
as the physical device. Hence the log code assumes that if you have
a specific log sector size, it is operating on a device that has the
physical IO constraints of that sector size.

> >So, if you have to fudge an extra block for xlog_bread_noalign(),
> >that implies that what xlog_bread_noalign() was passed was
> >probably not correct. It also implies that you are using sector
> >sizes larger than 512 bytes, because that's the only time this
> >might matter. Put simply, this:
> 
> While debugging, I found when it crashed, the blk_no was not align
> to the log sector size and nnblks was aligned to the log sector
> size, which makes sense.

Actually, it doesn't. The log writes done by the kernel are supposed
to be aligned and padded to sector size, which means that we should
never see unaligned block numbers when reading log buffer headers
back off disk. i.e. when you run mkfs.xfs -s size=4096, you end up
with a log stripe unit of 4096 bytes, which means it pads
every write to 4096 byte boundaries rather than 512 byte boundaries.

> Starting XFS recovery on filesystem: ram0 (logdev: internal)

Ok, you're using a ramdisk and not a real 4k sector device. Hence it
won't fail if we do 512 byte aligned IO rather than 4k aligned IO.

> xlog_bread_noalign--before round down/up: blk_no=0xf4d,nbblks=0x1
> xlog_bread_noalign--after round down/up: blk_no=0xf4c,nbblks=0x4
> xlog_bread_noalign--before round down/up: blk_no=0xf4d,nbblks=0x1
> xlog_bread_noalign--after round down/up: blk_no=0xf4c,nbblks=0x4
> xlog_bread_noalign--before round down/up: blk_no=0xf4e,nbblks=0x3f
> xlog_bread_noalign--after round down/up: blk_no=0xf4c,nbblks=0x40
> XFS: xlog_recover_process_data: bad clientid
>
> For example, if xlog_bread_noalign() wants to read blocks from #1
> to # 9, in which case the passed parameter blk_no is 1, and nbblks
> is 8, sectBBsize is 8, after the round down and round up
> operations, we get blk_no as 0, and nbblks as still 8. We
> definitely lose the last block of the log data.

Yes, I fully understand that. But I also understand how the log
works and that this behaviour *should not happen*. That's why
I'm asking questions about what the problem you are trying to fix.

The issue here is that the log buffer write was not aligned to the
underlying sector size. That is, what we see here is a header block
read, followed by the log buffer data read. The header size is
determined by the iclogbuf size - a 512 byte block implies default
32k iclogbuf size - and the following data region read of 63 blocks
also indicates a 32k iclogbuf size.

IOWs, what we have here is a 32k log buffer write apparently at a
sector-unaligned block address (0xf4d = 3917 which is not a multiple
of 8). This is why log recovery went wrong: a fundamental
architectural assumption the log is built around has somehow been
violated.

That is, the log recovery failure does not appear to be a problem
with the sector alignment done by xlog_bread_noalign() - it appears
to be a failure with the alignment of log buffer IO written to the
log. That's a far more serious problem than a log recovery problem,
but I can't see how that could occur and so I need a test case that
reproduces the recovery failure for deeper analysis

> I was using the 2.6.38.6 kernel, and using xfs as a rootfs
> partition. After untaring the rootfs files on 

Re: Debugging system freezes on filesystem writes

2013-02-23 Thread Dave Chinner
On Sat, Feb 23, 2013 at 01:27:38AM +0200, Marcus Sundman wrote:
> >$ cat /proc/mounts
> >rootfs / rootfs rw 0 0
> >sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
> >proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
> >udev /dev devtmpfs rw,relatime,size=1964816k,nr_inodes=491204,mode=755 0 0
> >devpts /dev/pts devpts
> >rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
> >tmpfs /run tmpfs rw,nosuid,relatime,size=789652k,mode=755 0 0
> >/dev/disk/by-uuid/5bfa7a58-2d35-4758-954e-4deafb09b892 / ext4
> >rw,noatime,discard,errors=remount-ro 0 0
  ^^^

> >none /sys/fs/fuse/connections fusectl rw,relatime 0 0
> >none /sys/kernel/debug debugfs rw,relatime 0 0
> >none /sys/kernel/security securityfs rw,relatime 0 0
> >none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
> >none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
> >none /run/user tmpfs
> >rw,nosuid,nodev,noexec,relatime,size=102400k,mode=755 0 0
> >/dev/sda6 /home ext4 rw,noatime,discard 0 0
   ^^^

I'd say that's your problem

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] xfs: Fix possible truncation of log data in xlog_bread_noalign()

2013-02-24 Thread Dave Chinner
On Sun, Feb 24, 2013 at 04:46:30AM +, Tony Lu wrote:
> >> For example, if xlog_bread_noalign() wants to read blocks from #1
> >> to # 9, in which case the passed parameter blk_no is 1, and nbblks
> >> is 8, sectBBsize is 8, after the round down and round up
> >> operations, we get blk_no as 0, and nbblks as still 8. We
> >> definitely lose the last block of the log data.
> >
> >Yes, I fully understand that. But I also understand how the log
> >works and that this behaviour *should not happen*. That's why
> >I'm asking questions about what the problem you are trying to fix.
> 
> I am not sure about this, since I saw many reads on
> non-sector-align blocks even when successfully mounting good XFS
> partitions.

I didn't say that non-sector align reads should not be attempted by
log recovery - it's obvious from the on disk format of the log that
we have to parse it in chunks of 512 bytes to make sense of it's
contents, and that leads to the 512 byte reads and other subsequent
unaligned reads.

*However*

Seeing that there are unaligned reads occurring does not mean that
the structures in the log should be unaligned. Your test output
indicated a log record header at an unaligned block address, and
that's incorrect. It doesn't matter what the rest of the log
recovery code does with non-aligned IO - the fact is that your debug
implies that the contents of the log is corrupt and that implies a
deeper problem

> And also there is code in xlog_write_log_records() which handles
> non-sector-align reads and writes.

Yes, it does handle it, but that doesn't mean that it is correct to
pass unaligned block ranges to it.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] xfs: Fix possible truncation of log data in xlog_bread_noalign()

2013-02-26 Thread Dave Chinner
On Tue, Feb 26, 2013 at 07:28:19AM +, Tony Lu wrote:
> I get a reliable way to reproduce this bug. The logprint and metadump are 
> attached.
> 
> Kernel version: 2.6.38.8

This is important



 because this:

> 4 umount /dev/sda1 /mnt
> 5 mount /dev/sda1 /mnt
> XFS mounting filesystem sda1
> Starting XFS recovery on filesystem: sda1 (logdev: internal)
> Ending XFS recovery on filesystem: sda1 (logdev: internal)

Indicates that the unmount record is either not being written, is
being written while the log has not been fully flushed, or is not
being found by log recovery. You need to copy out the log
first to determine what the state of the log is before you mount the
filesystem - that way if log recovery is run you can see whether it
was supposed to run. (i.e. a clean log should never run recovery,
and unmount should always leave a clean log).

Either way, I'm more than 10,000 iterations into a run of 100k
iterations of this script on 3.8.0, and I have not seen a single log
recovery attempt occur. That implies you are seeing a bug in 2.6.38
that has since been fixed. It would be a good idea for you to
upgrade the system to a 3.8 kernel and determine if you can still
reproduce the problem on your system - that way we'll know if the
bug really has been fixed or not

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] ext4 updates for 3.9

2013-02-28 Thread Dave Chinner
On Wed, Feb 27, 2013 at 02:29:07PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
> > 
> > Looks like it's fixed here too.
> > 
> > How did this make it through -next without anyone hitting it ?
> 
> > Is anyone running xfstests or similar on linux-next regularly ?
> 
> I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
> before I submitted a pull request.  The problem is that XFSTESTS is
> S-L-O-W if you use large partitions, so typically I use a 5GB
> partition sizes for my test runs.

This isn't the case for XFS. I typically see 1TB scratch devices
only being ~10-20% slower than 10GB scratch devices, and 10TB only
being a little slower than 1TB scratch devices. I have to use sparse
devices and --large-fs for 100TB filesystem testing, so I can't
directly compare the speeds to those that I run on physical devices.
However I can say that it isn't significantly slower than using
small scratch devices...

> So what we probably need to do is to have a separate set of tests
> using a loopback mount, and perhaps an artificially created file
> system which has a large percentage of the blocks in the middle of the
> file system busied out, to make efficient testing of these sorts of

That's exactly what the --large-fs patch set I posted months ago does
for ext4 - it uses fallocate() to fill all but 50GB of the large
filesystem without actually writing any data and runs the standard
tests in the remaining unused space.
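For reference, that approach boils down to a single preallocating
fallocate() call from userspace - a minimal sketch (path and size are
made up for illustration):

	#define _GNU_SOURCE
	#include <sys/types.h>
	#include <fcntl.h>
	#include <err.h>

	int main(void)
	{
		/* fill all but ~50GB of a (here assumed) 1TB scratch fs */
		off_t fill = 950LL * 1024 * 1024 * 1024;
		int fd = open("/mnt/scratch/filler", O_CREAT | O_WRONLY, 0644);

		if (fd < 0)
			err(1, "open");
		/* mode 0: allocate blocks and extend i_size, but write no data */
		if (fallocate(fd, 0, 0, fill) < 0)
			err(1, "fallocate");
		return 0;
	}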

However, last time I tested ext4 with this patchset (when I posted
the patches months ago), multi-TB preallocation on ext4 was still too
slow to make it practical for testing on devices larger than 2-3TB.
Perhaps it would make testing 1-2TB ext4 filesystems fast enough
that you could do it regularly...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 02/16] xfs: Store projectid as a single variable.

2013-02-18 Thread Dave Chinner
On Sun, Feb 17, 2013 at 05:10:55PM -0800, Eric W. Biederman wrote:
> From: "Eric W. Biederman" 
> 
> xfs_get_projid is torturous to read and it will not work at all when
> project ids are stored in a kprojid_t.  So add a i_projid to
> xfs_inode, that is cheap to read and can handle future needs, and
> update all callers of xfs_get_projid to use i_projid.
> 
> Retain xfs_set_projid to handle the needed double updates, as there
> are now two places the value needs to be set.
> 
> In xfs_iread after filling in i_d drom the on-disk inode update the
> new i_projid field.
> 
> Cc: Ben Myers 
> Cc: Alex Elder 
> Cc: Dave Chinner 
> Signed-off-by: "Eric W. Biederman" 
> ---
>  fs/xfs/xfs_icache.c   |2 +-
>  fs/xfs/xfs_inode.c|6 +-
>  fs/xfs/xfs_inode.h|7 ++-
>  fs/xfs/xfs_ioctl.c|6 +++---
>  fs/xfs/xfs_iops.c |2 +-
>  fs/xfs/xfs_itable.c   |4 ++--
>  fs/xfs/xfs_qm.c   |   10 +-
>  fs/xfs/xfs_qm_bhv.c   |2 +-
>  fs/xfs/xfs_rename.c   |2 +-
>  fs/xfs/xfs_vnodeops.c |6 +++---
>  10 files changed, 24 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 96e344e..4f109ca 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1210,7 +1210,7 @@ xfs_inode_match_id(
>   return 0;
>  
>   if (eofb->eof_flags & XFS_EOF_FLAGS_PRID &&
> - xfs_get_projid(ip) != eofb->eof_prid)
> + ip->i_projid != eofb->eof_prid)
>   return 0;

Please retain the xfs_get_projid(ip) wrapper and do all the
necessary conversions via that wrapper. We don't need a second copy
of the project ID to support namespace aware project ID support.
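For reference, the wrapper being asked for is essentially just this
(reconstructed from the hunk above, so treat it as a sketch):

	static inline prid_t
	xfs_get_projid(struct xfs_inode *ip)
	{
		return (prid_t)ip->i_d.di_projid_hi << 16 | ip->i_d.di_projid_lo;
	}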

>   return 1;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 66282dc..51c2597 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1013,6 +1013,10 @@ xfs_iread(
>*/
>   if (dip->di_mode) {
>   xfs_dinode_from_disk(&ip->i_d, dip);
> +
> + ip->i_projid = ((projid_t)ip->i_d.di_projid_hi << 16) |
> +   ip->i_d.di_projid_lo;
> +

This does not belong here. At minimum, it would need to be in
xfs_iformat(). Further, if there is a requirement it is initialised
correctly, it needs to be zeroed in xfs_inode_alloc() where we pull
a newly allocated inode out of the slab.

As it is, however, having read further through the patches, I can't
really see why we even need a separate variable - the conversions
should be done at the edge of the filesystem (i.e. the VFS and ioctl
interfaces), and the core of the filesystem left completely
untouched.

>  static inline void
>  xfs_set_projid(struct xfs_inode *ip,
>   prid_t projid)
>  {
> + ip->i_projid = projid;
>   ip->i_d.di_projid_hi = (__uint16_t) (projid >> 16);
>   ip->i_d.di_projid_lo = (__uint16_t) (projid & 0x);
>  }

All you are doing here is introducing a requirement that we keep two
variables in sync and increasing the size of the struct xfs_inode
unnecessarily. History says this sort of duplication is a source of
subtle bugs

> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 2ea7d40..cf5b1d0 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -91,8 +91,8 @@ xfs_bulkstat_one_int(
>* further change.
>*/
>   buf->bs_nlink = dic->di_nlink;
> - buf->bs_projid_lo = dic->di_projid_lo;
> - buf->bs_projid_hi = dic->di_projid_hi;
> + buf->bs_projid_lo = (u16)(ip->i_projid & 0x);
> + buf->bs_projid_hi = (u16)(ip->i_projid >> 16);

There is no need for this change at all. Even if we have a second
variable, the two are in sync and reading from the dic is
perfectly OK.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 03/16] xfs: Always read uids and gids from the vfs inode

2013-02-18 Thread Dave Chinner
, namespace aware value.

> @@ -539,8 +539,8 @@ xfs_setattr_nonsize(
>* while we didn't have the inode locked, inode's dquot(s)
>* would have changed also.
>*/
> - iuid = ip->i_d.di_uid;
> - igid = ip->i_d.di_gid;
> + iuid = VFS_I(ip)->i_uid;
> + igid = VFS_I(ip)->i_gid;
>   gid = (mask & ATTR_GID) ? iattr->ia_gid : igid;
>   uid = (mask & ATTR_UID) ? iattr->ia_uid : iuid;
>  
> @@ -587,8 +587,7 @@ xfs_setattr_nonsize(
>   olddquot1 = xfs_qm_vop_chown(tp, ip,
>   &ip->i_udquot, udqp);
>   }
> - ip->i_d.di_uid = uid;
> - inode->i_uid = uid;
> + xfs_set_uid(ip, uid);

Please keep these as separate updates, that way we can see clearly
that we are updating both the VFS inode and the XFS inode here.
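i.e. keep the two stores visible side by side, as in the code being
replaced (sketch only - it glosses over the kuid_t conversion
question itself):

	ip->i_d.di_uid = uid;	/* XFS incore/on-disk copy */
	inode->i_uid = uid;	/* VFS inode copy */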

> @@ -1155,8 +1153,6 @@ xfs_setup_inode(
>  
>   inode->i_mode   = ip->i_d.di_mode;
>   set_nlink(inode, ip->i_d.di_nlink);
> - inode->i_uid= ip->i_d.di_uid;
> - inode->i_gid= ip->i_d.di_gid;

Which further empahsises the layer violation...

>   switch (inode->i_mode & S_IFMT) {
>   case S_IFBLK:
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index cf5b1d0..a9e07dd 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -95,8 +95,8 @@ xfs_bulkstat_one_int(
>   buf->bs_projid_hi = (u16)(ip->i_projid >> 16);
>   buf->bs_ino = ino;
>   buf->bs_mode = dic->di_mode;
> - buf->bs_uid = dic->di_uid;
> - buf->bs_gid = dic->di_gid;
> + buf->bs_uid = VFS_I(ip)->i_uid;
> + buf->bs_gid = VFS_I(ip)->i_gid;

Same as the project ID changes - bulkstat is supposed to return the
raw on disk values, not namespace munged values.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 08/16] xfs: Use kprojids when allocating inodes.

2013-02-18 Thread Dave Chinner
On Sun, Feb 17, 2013 at 05:11:01PM -0800, Eric W. Biederman wrote:
> From: "Eric W. Biederman" 
> 
> In xfs_create and xfs_symlink compute the desired kprojid and pass it
> down into xfs_ialloc.

NACK.

The first time you posted this code I NACKed it because:

>> This sort of thing just makes me cringe. This is all internal
>> project ID management that has nothing to do with namespaces.
>> It's for project ID's that are inherited from the parent inode,
>> and as such we do not care one bit what the namespace is.
>> Internal passing of project IDs like this this should not be
>> converted at all as it has nothing at all to do with the
>> namespaces.

Please drop this patch or replace it with a simple patch that passes
the project ID as an xfs_dqid_t (i.e. a flat, 32 bit quota
identifier) instead so you can kill the prid_t type.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 07/16] xfs: Update ioctl(XFS_IOC_FREE_EOFBLOCKS) to handle callers in any userspace

2013-02-18 Thread Dave Chinner
On Sun, Feb 17, 2013 at 05:11:00PM -0800, Eric W. Biederman wrote:
> From: "Eric W. Biederman" 
> 
> - Modify the ioctl to convert from uids, gid, and projids in the
>   current user namespace to kuids, kgids, and kprojids, and to report
>   an error if the conversion fails.

Please just convert to xfs_dqid_t at the interface.

> - Create struct xfs_internal_eofblocks to hold the same information as
>   struct xfs_eofblocks but with uids, gids, and projids stored as
>   kuids, kgids, and kprojids preventing confusion.

No need. Just convert the struct xfs_eof_blocks to define them all
as xfs_dqid_t and convert them in place to the type that is
compatible with the XFS core use of these fields (i.e. comparing
them with the on-disk inode uid/gid/prid values).
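A rough sketch of the kind of boundary conversion being suggested
(field and variable names are assumptions, not a prescription):

	/* convert at the ioctl boundary, store flat 32 bit ids */
	kuid_t kuid = make_kuid(current_user_ns(), user_eofb.eof_uid);

	if (!uid_valid(kuid))
		return -EINVAL;
	/* xfs_dqid_t, directly comparable with the on-disk di_uid */
	eofb.eof_uid = from_kuid(&init_user_ns, kuid);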

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 05/16] xfs: Update xfs_ioctl_setattr to handle projids in any user namespace

2013-02-18 Thread Dave Chinner
On Sun, Feb 17, 2013 at 05:10:58PM -0800, Eric W. Biederman wrote:
> From: "Eric W. Biederman" 
> 
> - Convert the userspace value in fa->fsx_projid into a kprojid and
>   store it in the variable projid.
> - Verify that xfs can store the projid after it is converted into
>   xfs's user namespace.
> - Replace uses of fa->fsx_projid with projid throughout
>   xfs_ioctl_setattr.
> 
> Cc: Ben Myers 
> Cc: Alex Elder 
> Cc: Dave Chinner 
> Signed-off-by: "Eric W. Biederman" 
> ---
>  fs/xfs/xfs_ioctl.c |   26 ++
>  1 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 016624b..4a55f50 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -925,6 +925,7 @@ xfs_ioctl_setattr(
>   struct xfs_dquot*gdqp = NULL;
>   struct xfs_dquot*olddquot = NULL;
>   int code;
> + kprojid_t   projid = INVALID_PROJID;
>  
>   trace_xfs_ioctl_setattr(ip);
>  
> @@ -934,11 +935,20 @@ xfs_ioctl_setattr(
>   return XFS_ERROR(EIO);
>  
>   /*
> -  * Disallow 32bit project ids when projid32bit feature is not enabled.
> +  * Verify the specifid project id is valid.
>*/
> - if ((mask & FSX_PROJID) && (fa->fsx_projid > (__uint16_t)-1) &&
> - !xfs_sb_version_hasprojid32bit(&ip->i_mount->m_sb))
> - return XFS_ERROR(EINVAL);
> + if (mask & FSX_PROJID) {
> + projid = make_kprojid(current_user_ns(), fa->fsx_projid);
> + if (!projid_valid(projid))
> + return XFS_ERROR(EINVAL);
> +
> + /*
> +  * Disallow 32bit project ids when projid32bit feature is not 
> enabled.
> +  */
> + if ((from_kprojid(&init_user_ns, projid) > (__uint16_t)-1) &&
> +     !xfs_sb_version_hasprojid32bit(&ip->i_mount->m_sb))
> + return XFS_ERROR(EINVAL);
> + }

That looks busted. Why does one use current_user_ns() and the other
&init_user_ns()?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 09/16] xfs: Modify xfs_qm_vop_dqalloc to take kuids, kgids, and kprojids.

2013-02-18 Thread Dave Chinner
On Sun, Feb 17, 2013 at 05:11:02PM -0800, Eric W. Biederman wrote:
> From: "Eric W. Biederman" 
> 
> Cc: Ben Myers 
> Cc: Alex Elder 
> Cc: Dave Chinner 
> Signed-off-by: "Eric W. Biederman" 
> ---
>  fs/xfs/xfs_qm.c|6 +++---
>  fs/xfs/xfs_quota.h |4 ++--
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 6fce3d3..80b8c81 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -1617,9 +1617,9 @@ xfs_qm_write_sb_changes(
>  int
>  xfs_qm_vop_dqalloc(
>   struct xfs_inode*ip,
> - uid_t   uid,
> - gid_t   gid,
> - prid_t  prid,
> + kuid_t  uid,
> + kgid_t  gid,
> + kprojid_t   prid,
>   uintflags,
>   struct xfs_dquot**O_udqpp,
>   struct xfs_dquot**O_gdqpp)

Probably they should use xfs_dqid_t as they are supposed to be
internal XFS quota ID types

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 10/16] xfs: Push struct kqid into xfs_qm_scall_qmlim and xfs_qm_scall_getquota

2013-02-18 Thread Dave Chinner
On Sun, Feb 17, 2013 at 05:11:03PM -0800, Eric W. Biederman wrote:
> From: "Eric W. Biederman" 
> 
> - In xfs_qm_scall_getquota map the quota id into the callers
>   user namespace in the returned struct fs_disk_quota
> 
> - Add a helper is_superquota and use it in xfs_qm_scall_setqlimi
>   to see if we are setting the superusers quota limit.  Setting
>   the superuses quota limit on xfs sets the default quota limits
>   for all users.

These seem fine.

> - Move xfs_quota_type into xfs_qm_syscalls.c where it is now used.

Now that I've seen the code, I really don't see any advantage to
driving the kqid into the XFS quota subsystem (i.e. the rest of this
patch and the subsequent follow up patches that drive it further
inward).

I did say previously:

>> From there, targetted patches can drive the kernel structures
>> inward from the entry points where it makes sense to do so (e.g.
>> common places that the quota entry points call that take a
>> type/id pair).  The last thing that should happen is internal
>> structures be converted from type/id pairs to the kernel types if
>> it makes sense to do so and it makes the code simpler and easier
>> to read

But seeing the result, IMO, it doesn't actually improve the code
(it's neither simpler nor easier to read), and it doesn't actually
add any functionality. It makes the code strange and different and
somewhat inconsistent and litters id/namespace conversions all over
the place, so i don't think these cahgnes are necessary.

Hence I'd say just do the absolute minimum needed for the
is_superquota() checks to work and leave all the kqid ->
xfs_dqid_t+type conversion at the boundary of the quota subsystem
where it already is

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 10/12] userns: Convert xfs to use kuid/kgid/kprojid where appropriate

2013-02-18 Thread Dave Chinner
On Sun, Feb 17, 2013 at 05:25:43PM -0800, Eric W. Biederman wrote:
> Dave Chinner  writes:
> 
> > On Wed, Feb 13, 2013 at 10:13:16AM -0800, Eric W. Biederman wrote:
> >
> >> The crazy thing is that is that xfs appears to
> >> directly write their incore inode structure into their journal. 
> >
> > Off topic, but it's actually a very sane thing to do. It's called
> > logical object logging, as opposed to physical logging like ext3/4
> > and ocfs2 use. XFS uses a combination of logical logging
> > (superblock, dquots, inodes) and physical logging (via buffers).
> 
> Not putting your structures in disk-endian before putting them on-disk
> seems silly.  As far as I can tell if you switch endianness of the
> machine accessing your xfs filesystem and have to do a log recover
> it won't work because a lot of the log entries will appear corrupted.
> 
> It also seems silly to require your in-memory structure to be binary
> compatibile with your log when you immediately copy that structure to
> another buffer when it comes time to queue a version of it to put into
> the log.
> 
> The fact that you sometimes need to allocate memory and make a copy so
> you can stuff your data into the logvec whose only purpose is to then
> copy the data a second time seems silly and wasteful.
> 
> Logical logging itself seems reasonable.  I just find the implementation
> in xfs odd.
> 
> It looks like with a few little changes xfs could retain backwards
> compatibility with today, remove extra memory copies, and completely
> decouple the format of the in-core structures with the format of the
> on-disk structures.  Allowing scary comments to be removed.

If you think removing the copies is that easy, go right ahead - I'd
love to see patches that checkpoint changes directly from the
in-memory objects to the log without deadlocking

Decoupling the in-memory structure from the log format could be done
at any time. But it's just not something that is needed, and for the
rare cases where it is needed it's better to put the format
detection and conversion code into log recovery. i.e. take the
conversion penalty once when needed on the slow path rather than on
every operation through the fast path

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems

2013-02-21 Thread Dave Chinner
On Tue, Feb 19, 2013 at 01:50:55PM -0500, Waiman Long wrote:
> It was found that the Oracle database software issues a lot of call
> to the seq_path() kernel function which translates a (dentry, mnt)
> pair to an absolute path. The seq_path() function will eventually
> take the following two locks:

Nobody should be doing reverse dentry-to-name lookups in a quantity
sufficient for it to become a performance limiting factor. What is
the Oracle DB actually using this path for?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems

2013-02-22 Thread Dave Chinner
On Thu, Feb 21, 2013 at 11:13:27PM -0500, Waiman Long wrote:
> On 02/21/2013 07:13 PM, Andi Kleen wrote:
> >Dave Chinner  writes:
> >
> >>On Tue, Feb 19, 2013 at 01:50:55PM -0500, Waiman Long wrote:
> >>>It was found that the Oracle database software issues a lot of call
> >>>to the seq_path() kernel function which translates a (dentry, mnt)
> >>>pair to an absolute path. The seq_path() function will eventually
> >>>take the following two locks:
> >>Nobody should be doing reverse dentry-to-name lookups in a quantity
> >>sufficient for it to become a performance limiting factor. What is
> >>the Oracle DB actually using this path for?
> >Yes calling d_path frequently is usually a bug elsewhere.
> >Is that through /proc ?
> >
> >-Andi
> >
> >
> A sample strace of Oracle indicates that it opens a lot of /proc
> filesystem files such as the stat, maps, etc many times while
> running. Oracle has a very detailed system performance reporting
> infrastructure in place to report almost all aspect of system
> performance through its AWR reporting tool or the browser-base
> enterprise manager. Maybe that is the reason why it is hitting this
> performance bottleneck.

That seems to me like an application problem - poking at what the
kernel is doing via diagnostic interfaces so often that it gets in
the way of the kernel actually doing stuff is not a problem the
kernel can solve.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] xfs: Fix possible truncation of log data in xlog_bread_noalign()

2013-02-22 Thread Dave Chinner
On Fri, Feb 22, 2013 at 08:12:52AM +, Tony Lu wrote:
> I encountered the following panic when using xfs partitions as rootfs, which
> is due to the truncated log data read by xlog_bread_noalign(). We should
> extend the buffer by one extra log sector to ensure there's enough space to
> accommodate requested log data, which we indeed did in xlog_get_bp(), but we
> forgot to do in xlog_bread_noalign().

We've never done that round up in xlog_bread_noalign(). It shouldn't
be necessary as xlog_get_bp() and xlog_bread_noalign() are doing
fundamentally different things. That is, xlog_get_bp() is ensuring
the buffer is large enough for the upcoming IO that will be
requested, while xlog_bread_noalign() is simply ensuring what it is
passed is correctly aligned to device sector boundaries.
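i.e. the alignment fixup xlog_bread_noalign() does is roughly this
(a sketch from memory, not a quote of the kernel in question):

	if (nbblks > 0 && log->l_sectBBsize > 1) {
		blk_no = round_down(blk_no, log->l_sectBBsize);
		nbblks = round_up(nbblks, log->l_sectBBsize);
	}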

So, if you have to fudge an extra block for xlog_bread_noalign(),
that implies that what xlog_bread_noalign() was passed was probably
not correct. It also implies that you are using sector sizes larger
than 512 bytes, because that's the only time this might matter. Put
simply, this:

> XFS mounting filesystem sda2
> Starting XFS recovery on filesystem: sda2 (logdev: internal)
> XFS: xlog_recover_process_data: bad clientid
> XFS: log mount/recovery failed: error 5
> XFS: log mount failed

Is not sufficient information for me to determine if you've correctly
analysed the problem you were seeing and that this is the correct
fix for it. I don't even know what kernel you are seeing this on, or
how you are reproducing it.

Note that I'm not saying the fix isn't necessary or correct, just
that I cannot review it based this commit message.  Given that this
code is essentially unchanged in behaviour since the large sector
size support was adding in 2003(*), understanding how it is
deficient is critical part of the reviewi process

Information you need to provide so I have a chance of reviewing
whether it is correct or not:

- what kernel you saw this on,
- what the filesystem configuration was
- what workload reproduced this problem (a test case would
  be nice, and xfstest even better)
- the actual contents of the log that lead to the short read
  during recovery
- whether xfs_logprint was capable of parsing the log
  correctly
- where in the actual log recovery process the failure
  occurred (e.g. was it trying to recover transactions from
  a section of a wrapped log?)

IOWs, please show your working so we can determine if this is the
root cause of the problem you are seeing. :)

(*) 
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=f14e527f411712f89178c31370b5d733ea1d0280

FWIW, I think your change might need work - there's the possibility
that it can round up the length beyond the end of the log if we ask
to read up to the last sector of the log (i.e. blkno + blklen ==
end of log) and then round up blklen by one sector
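i.e. any such round up would also need a bounds check so the read
can't run off the end of the log - something like (sketch only, field
name assumed):

	if (blk_no + nbblks > log->l_logBBsize)
		nbblks = log->l_logBBsize - blk_no;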

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: torrent hash failures since 3.9.0-rc1

2013-03-12 Thread Dave Chinner
On Tue, Mar 12, 2013 at 09:28:54AM +0100, Sander wrote:
> Markus Trippelsdorf wrote (ao):
> > On 2013.03.11 at 16:37 -0400, Theodore Ts'o wrote:
> > > We actually run fsx in a number of different configruations as part of
> > > our regression testing before we send Linus a pull request, and
> > > haven't found any issues.  So unless it's a hardware problem, it seems
> > > unlikely to me that your running fsx would turn up anything.
> > 
> > Yes, I let it run for a while anyway and it didn't report any failure.
> 
> > Please note that my local rtorrent version was configured with
> > "--with-posix-fallocate".
> 
> Would it be possible to enhance fsx to detect such an issue?

fsx in xfstests already uses fallocate() for preallocation and hole
punching, so such problems related to these operations can be found
using fsx. The issue here, however, involves memory reclaim
interactions and so is not something fsx can reproduce in isolation.  :/


Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

2013-03-12 Thread Dave Chinner
On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> Hi Dave,
> 
> On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner  wrote:
> > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> >> - since all readers are woken at once, you might see burst of direct
> >> IO operations followed by bursts of truncate operations, instead of
> >> having them interleaved in smaller groups as before.
> >
> > And this will result in the same application IO pattern giving
> > vastly different results. e.g. a read IO promoted ahead of a
> > truncate will now return data instead of -1 (beyond EOF).
> 
> I will reply to this part first, as I think this gives a more concrete
> anchor to the discussion.
> 
> The crucial point is that the truncate system call hasn't completed
> yet - that thread is still queued in the rwsem.
> 
> You still have the guarantee that the truncate will complete
> eventually (even if there is a never-ending stream of IOs), and that
> all IO system calls that start after the truncate system call
> completes will see the file as truncated.

Sure, but the problem is not about when the syscall completes - the
problem is that you are changing where in the pipeline of IO the
truncate is initially executed.  i.e. ordering is not defined by
when an operation completes, but by the order in which the queue is
processed after a blocking operation completes. That is not when the
syscall completes, but when the filesystem drops the exclusive lock.

From a single threaded userspace application perspective or
multithreaded apps with their own userspace locking, truncates
complete when either the syscall completes or some time after when
the application drops it's lock. Either way, we're looking at
completion time serialisation, and in that case what XFS or the
kernel does simply doesn't matter.

However, if we are talking about userspace applications that use
lockless IO algorithms or kernel side applications (like knfsd) that
are exposed directly to the filesystem locking semantics,
serialisation behaviour is actually defined by filesystem's
submission side locking behaviour. There is no external
serialisation of IO completion, so serialisation and ordering is
defined (for XFS) solely by the rwsem behaviour.

> You don't have guarantees about which system call will "appear to run
> before the other" if the execution times of the two system calls
> overlap each other, but you actually never had such a guarantee from a
> correctness perspective (i.e. you could never guarantee which of the
> two would get queued on the rwsem ahead of the other).

Sure, but as long as the submission side ordering is deterministic,
it doesn't matter.

> > Ok, so I can see where your latency figure comes from, but it's
> > still a change of ordering in that W2 is no longer a barrier to the
> > reads queued after it.
> 
> My claim is that it's not a visible change from a correctness
> perspective

I am not arguing that it is incorrect. I'm arguing that the change
of ordering semantics breaks assumptions a lot of code makes about
how these locks work.

> > similar to this:
> >
> > W1R1W2R2W3R3.WnRm
> >
> > By my reading of the algorithm you are proposing, after W1 is
> > released, we end up with the queue being treated like this:
> >
> > R1R2R3RmW2W3Wn
> >
> > Right?
> 
> Yes, if what you are representing is the state of the queue at a given
> point of time (which implies R1..Rm must be distinct threads, not just
> the same thread doing repeated requests).

Yup, that's very typical.

> As new requests come in over time, one important point to remember is
> that each writer will see at most one batch of readers wake up ahead
> of it (though this batch may include readers that were originally
> queued both before and after it).

And that's *exactly* the problem with the changes you are proposing
- rwsems will no longer provide strongly ordered write vs read
barrier semantics.
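That barrier usage, in XFS terms, is basically this pattern (sketch;
the comment spells out the semantic being relied on):

	/*
	 * Taking the rwsem exclusive waits for all IO already submitted
	 * under the shared lock to drain, and queues all new IO behind
	 * the exclusive holder - that is the write barrier.
	 */
	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	/* ... truncate / preallocate / punch hole ... */
	xfs_iunlock(ip, XFS_IOLOCK_EXCL);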

> I find the name 'barrier' actually confusing when used to describe
> synchronous operations.  To me a barrier is usualy between
> asynchronous operations, and it is well defined which operations
> are ahead or behind of the barrier (either because they were
> queued by the same thread, or they were queued by different
> threads which may have synchronized together some other way).

When you have hundreds or thousands of threads doing IO to the one
file, it doesn't matter if the IO is issued synchronously or
asynchronously by the threads - you simply have a huge amount of
data IO concurrency and very, very deep pipelines.

Inserting a metadata modification (truncate, preallocation,
hole punch, ...)

Re: [PATCH] userns: Add basic quota support v4

2012-08-28 Thread Dave Chinner
OSPC later */
>   if (dqp->dq_flags & XFS_DQ_PROJ)
>   return;
> - quota_send_warning((dqp->dq_flags & XFS_DQ_USER) ? USRQUOTA : GRPQUOTA,
> -be32_to_cpu(dqp->q_core.d_id), mp->m_super->s_dev,
> -type);
> + qtype = (dqp->dq_flags & XFS_DQ_USER) ? USRQUOTA : GRPQUOTA;
> + qid = make_kqid(&init_user_ns, qtype, be32_to_cpu(dqp->q_core.d_id));
> + quota_send_warning(qid, mp->m_super->s_dev, type);
>  }
>  
>  /*
> diff --git a/include/linux/quota.h b/include/linux/quota.h
> index 524ede8..0e73250 100644
> --- a/include/linux/quota.h
> +++ b/include/linux/quota.h
> @@ -181,10 +181,161 @@ enum {
>  #include 
>  
>  #include 
> +#include 
>  
>  typedef __kernel_uid32_t qid_t; /* Type in which we store ids in memory */
>  typedef long long qsize_t;   /* Type in which we store sizes */

From fs/xfs/xfs_types.h:

typedef __uint32_t  prid_t; /* project ID */

Perhaps it would be better to have an official kprid_t definition
here, i.e:

>  
> +struct kqid {/* Type in which we store the quota 
> identifier */
> + union {
> + kuid_t uid;
> + kgid_t gid;
> + qid_t prj;

kprid_t prid;

> + };
> + int type; /* USRQUOTA (uid) or GRPQUOTA (gid) or XQM_PRJQUOTA (prj) */
> +};
> +
> +static inline bool qid_eq(struct kqid left, struct kqid right)
> +{
> + if (left.type != right.type)
> + return false;
> + switch(left.type) {
> + case USRQUOTA:
> + return uid_eq(left.uid, right.uid);
> + case GRPQUOTA:
> + return gid_eq(left.gid, right.gid);
> + case XQM_PRJQUOTA:
> + return left.prj == right.prj;
> + default:
> + BUG();

BUG()? Seriously? The most this justifies is a WARN_ON_ONCE() to
indicate a potential programming error, not bringing down the entire
machine.
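i.e. something along these lines (sketch of the non-fatal handling
being suggested; what to return in that case is a judgement call):

	default:
		WARN_ON_ONCE(1);
		return false;
	}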

> + }
> +}
> +
> +static inline bool qid_lt(struct kqid left, struct kqid right)
> +{
> + if (left.type < right.type)
> + return true;
> + if (left.type > right.type)
> + return false;
> + switch (left.type) {
> + case USRQUOTA:
> + return uid_lt(left.uid, right.uid);
> + case GRPQUOTA:
> + return gid_lt(left.gid, right.gid);
> + case XQM_PRJQUOTA:
> + return left.prj < right.prj;
> + default:
> + BUG();
> + }
> +}

What is this function used for? It's not referenced at all by the
patch, and there's no documentation/comments explaining why it
exists or how it is intended to be used

> +static inline qid_t from_kqid(struct user_namespace *user_ns, struct kqid 
> qid)
> +{
> + switch (qid.type) {
> + case USRQUOTA:
> + return from_kuid(user_ns, qid.uid);
> + case GRPQUOTA:
> + return from_kgid(user_ns, qid.gid);
> + case XQM_PRJQUOTA:
> + return (user_ns == &init_user_ns) ? qid.prj : -1;
> + default:
> + BUG();
> + }
> +}

Oh, this can return an error. That's only checked in a couple of
places this function is called. It needs to be checked everywhere,
otherwise we now have the possibility of quota usage being accounted
to uid/gid/prid 0xffffffff when namespace matches are not found.
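i.e. every caller needs something like (sketch; the error value
returned here is an assumption):

	qid_t id = from_kqid(user_ns, qid);

	if (id == (qid_t)-1)
		return -EOVERFLOW;	/* no mapping in this namespace */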

> +static inline qid_t from_kqid_munged(struct user_namespace *user_ns,
> +struct kqid qid)

What does munging do to the return value? how is it different to
from_kqid()? Document your API


> +static inline struct kqid make_kqid(struct user_namespace *user_ns,
> + int type, qid_t qid)
> +{
> + struct kqid kqid;
> +
> + kqid.type = type;
> + switch (type) {
> + case USRQUOTA:
> + kqid.uid = make_kuid(user_ns, qid);
> + break;
> + case GRPQUOTA:
> + kqid.gid = make_kgid(user_ns, qid);
> + break;
> + case XQM_PRJQUOTA:
> + if (user_ns == &init_user_ns)
> + kqid.prj = qid;
> + else
> + kqid.prj = -1;
> + break;

kqid.prj = (user_ns == &init_user_ns) ? qid : -1;

> + default:
> + BUG();
> + }
> + return kqid;
> +}
> +
> +static inline struct kqid make_kqid_invalid(int type)
> +{
> + struct kqid kqid;
> +
> + kqid.type = type;
> + switch (type) {
> +     case USRQUOTA:
> + kqid.uid = INVALID_UID;
> + break;
> + case GRPQUOTA:
> +

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-28 Thread Dave Chinner
On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
> On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote:
> > On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
> > > Let me explain more to clarify my whole scenario. If a memory error
> > > hits on a dirty pagecache, kernel works like below:
> > > 
> > >   1. handles a MCE interrupt (logging MCE events,)
> > >   2. calls memory error handler (doing 3 to 6,)
> > >   3. sets PageHWPoison flag on the error page,
> > >   4. unmaps all mappings to processes' virtual addresses,
> > 
> > So nothing in userspace sees the bad page after this.
> > 
> > >   5. sets AS_HWPOISON on mappings to which the error page belongs
> > >   6. invalidates the error page (unlinks it from LRU list and removes
> > >  it from pagecache,)
> > >   (memory error handler finished)
> > 
> > Ok, so the moment a memory error is handled, the page has been
> > removed from the inode's mapping, and it will never be seen by
> > aplications again. It's a transient error
> > 
> > >   7. later accesses to the file returns -EIO,
> > >   8. AS_HWPOISON is cleared when the file is removed or completely
> > >  truncated.
> > 
> >  so why do we have to keep an EIO on the inode forever?
> > 
> > If the page is not dirty, then just tossing it from the cache (as
> > is already done) and rereading it from disk next time it is accessed
> > removes the need for any error to be reported at all. It's
> > effectively a transient error at this point, and as such no errors
> > should be visible from userspace.
> > 
> > If the page is dirty, then it needs to be treated just like any
> > other failed page write - the page is invalidated and the address
> > space is marked with AS_EIO, and that is reported to the next
> > operation that waits on IO on that file (i.e. fsync)
> > 
> > If you have a second application that reads the files that depends
> > on a guarantee of good data, then the first step in that process is
> > that application that writes it needs to use fsync to check the data
> > was written correctly. That ensures that you only have clean pages
> > in the cache before the writer closes the file, and any h/w error
> > then devolves to the above transient clean page invalidation case.
> 
> Thank you for detailed explanations.
> And yes, I understand it's ideal, but many applications choose not to
> do that for performance reason.

You choose: data integrity or performance.

> So I think it's helpful if we can surely report to such applications.

If performance is chosen over data integrity, we are under no
obligation to keep the error around indefinitely.  Fundamentally,
ensuring a write completes successfully is the responsibility of the
application, not the kernel. There are so many different filesystem
and storage errors that can be lost right now because data is not
fsync()d, adding another one to them really doesn't change anything.
IOWs, a memory error is no different to a disk failing or the system
crashing when it comes to data integrity. If you care, you use
fsync().
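For completeness, the pattern being described is nothing more than
this userspace sketch (helper and path names are made up):

	#include <sys/types.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* returns 0 only if the data is known to be on stable storage */
	static int write_with_integrity(const char *path, const void *buf,
					size_t len)
	{
		int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

		if (fd < 0)
			return -1;
		if (write(fd, buf, len) != (ssize_t)len)
			goto fail;
		if (fsync(fd) < 0)	/* failed writeback (e.g. EIO) shows up here */
			goto fail;
		return close(fd);
	fail:
		close(fd);
		return -1;
	}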

> > Hence I fail to see why this type of IO error needs to be sticky.
> > The error on the mapping is transient - it is gone as soon as the
> > page is removed from the mapping. Hence the error can be dropped as
> > soon as it is reported to userspace because the mapping is now error
> > free.
> 
> It's error free only for the applications which do fsync check in
> each write, but not for the applications which don't do.
> I think the penalty for the latters (ignore dirty data lost and get
> wrong results) is too big to consider it as a reasonable trade-off.

I'm guessing that you don't deal with data integrity issues very
often. What you are suggesting is not a reasonable tradeoff - either
applications are coded correctly for data integrity, or they give
up any expectation that errors will be detected and reported
reliably.  Hoping that we might be able to report an error somewhere
to someone who didn't care enough to avoid or collect it in the first place
does not improve the situation for anyone

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] userns: Add basic quota support v4

2012-08-30 Thread Dave Chinner
On Wed, Aug 29, 2012 at 02:31:26AM -0700, Eric W. Biederman wrote:
> 
> Dave thanks for taking the time to take a detailed look at this code.
> 
> Dave Chinner  writes:
> 
> > On Tue, Aug 28, 2012 at 12:09:56PM -0700, Eric W. Biederman wrote:
> >> 
> >> Add the data type struct kqid which holds the kernel internal form of
> >> the owning identifier of a quota.  struct kqid is a replacement for
> >> the implicit union of uid, gid and project stored in an unsigned int
> >> and the quota type field that is was used in the quota data
> >> structures.  Making the data type explicit allows the kuid_t and
> >> kgid_t type safety to propogate more thoroughly through the code,
> >> revealing more places where uid/gid conversions need be made.
> >> 
> >> Along with the data type struct kqid comes the helper functions
> >> qid_eq, qid_lt, from_kqid, from_kqid_munged, qid_valid, make_kqid,
> >
> > I think Jan's comment about from_kqid being named id_from_kgid is
> > better, though I also think it would read better as kqid_to_id().
> > ie:
> >
> > id = kqid_to_id(ns, qid);
> 
> kqid and qid are the same thing just in a different encoding.
> Emphasizing the quota identifier instead of the kernel vs user encoding
> change is paying attention to the wrong thing.

Not from a quota perspective. The only thing the quota code really
cares about is the quota identifier, not the encoding.

Fundamentally, from_kqid() doesn't tell me anything about what I'm
getting from the kqid. There's code all over the place that uses the
"_to_" convention because it's obvious what is being converted
from/to, e.g. cpu_to_beXX, compat_to_ptr, dma_to_phys, pfn_to_page,
etc.  Best practices say "follow existing conventions".

> Using make_kqid and from_kqid follows the exact same conventions as I have
> established for kuids and kgids.  So if you learn one you have learned
> them all.

For those of us that have to look at it once every few months,
following the same conventions as all the other code in the kernel
(i.e. kqid_to_id()) tells me everything I need to know without
having to go through the process of looking up the unusual
from_kqid() function and then from_kuid() to find out what it is
actually doing

> >> make_kqid_invalid, make_kqid_uid, make_kqid_gid.
> >
> > and these named something like uid_to_kqid()
> 
> The last two are indeed weird, and definitely not the common case,
> since there is no precedent I can almost see doing something different
> but I don't see a good case for a different name.

There's plenty of precedent in other code that converts formats.
A very common convention that is used everywhere is DEFINE_...().
That would make the code easier to grasp than "make...".

> >> Change struct dquot dq_id to a struct kqid and remove the now
> >> unnecessary dq_type.
> >> 
> >> Update the signature of dqget, quota_send_warning, dquot_get_dqblk,
> >> and dquot_set_dqblk to use struct kqid.
> >> 
> >> Make minimal changes to ext3, ext4, gfs2, ocfs2, and xfs to deal with
> >> the change in quota structures and signatures.  The ocfs2 changes are
> >> larger than most because of the extensive tracing throughout the ocfs2
> >> quota code that prints out dq_id.
> >
> > How did you test that this all works?
> 
> By making it a compile error if you get a conversion wrong and making it
> a rule not to make any logic changes.
>
> That combined with code review
> and running the code a bit to make certain I did not horribly mess up.

But no actual regression testing. You're messing with code that I
will have to triage when it goes wrong for a user, so IMO your code
has to pass the same bar as the code I write has to pass for review
- please regression test your code and write new regression tests
for new functionality.

> > e.g. run xfstests -g quota on
> > each of those filesystems and check for no regressions? And if you
> > wrote any tests, can you convert them to be part of xfstests so that
> > namespace aware quotas get tested regularly?
> 
> I have not written any tests, and running the xfstests in a namespace
> should roughly be a matter of "unshare -U xfstest -g quota"  It isn't
> quite that easy because  /proc/self/uid_map and /proc/self/gid_map need

Asking people to run the entire regression test suite differently
and with special setup magic won't get the code tested regularly.
Writing a new, self contained test that exercises quota in multiple
namespaces simultaneously is what is needed - that way people who
don't even know that namespaces exist will be regression testing it.

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-09-02 Thread Dave Chinner
On Wed, Aug 29, 2012 at 02:32:04PM +0900, Jun'ichi Nomura wrote:
> On 08/29/12 11:59, Dave Chinner wrote:
> > On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
> >> And yes, I understand it's ideal, but many applications choose not to
> >> do that for performance reason.
> >> So I think it's helpful if we can surely report to such applications.
> 
> I suspect "performance vs. integrity" is not a correct
> description of the problem.

Right, to be  more precise, it's a "eat my data" vs "integrity"
problem. And in almost all cases I've seen over the years, "eat my
data" is done for performance reasons...

> > If performance is chosen over data integrity, we are under no
> > obligation to keep the error around indefinitely.  Fundamentally,
> > ensuring a write completes successfully is the reponsibility of the
> > application, not the kernel. There are so many different filesytem
> > and storage errors that can be lost right now because data is not
> > fsync()d, adding another one to them really doesn't change anything.
> > IOWs, a memory error is no different to a disk failing or the system
> > crashing when it comes to data integrity. If you care, you use
> > fsync().
> 
> I agree that applications should fsync() or O_SYNC
> when it wants to make sure the written data in on disk.
> 
> AFAIU, what Naoya is going to address is the case where
> fsync() is not necessarily needed.
> 
> For example, if someone do:
>   $ patch -p1 < ../a.patch
>   $ tar cf . > ../a.tar
> 
> and disk failure occurred between "patch" and "tar",
> "tar" will either see uptodate data or I/O error.

No, it won't. The only place AS_EIO is tested is in
filemap_fdatawait_range(), which is only called in the fsync() path.
The only way to report async write IO errors is to use fsync() -
subsequent reads of the file do *not* see the write error.

IOWs, tar will be oblivious of any IO error that preceded it
reading the files it is copying.

> OTOH, if the failure was detected on dirty pagecache, the current memory
> failure handler invalidates the dirty page and the "tar" command will
> re-read old contents from disk without error.

After an IO error, the dirty page is no longer uptodate - that gets
cleared - so when the page is read the data will be re-read from
disk just like if a memory error occurred. So tar will behave the
same regardless of whether it is a memory error or an IO error (i.e.
reread old data from disk)

> (Well, the failures above are permanent failures.

Write IO errors can also be transient or permanent. Transient, for
example, when a path failure occurs and multipathing then detects
this and fails over to a good path. A subsequent write will then
succeed. Permanent, for example, when someone unplugs a USB drive.

> IOW, the current
>  memory failure handler turns permanent failure into transient error,
>  which is often more difficult to handle, I think.)

The patch I commented on is turning a transient error (error in a
page that is then poisoned and never used again) into a permanent
error (error on an address space that is reported on every future
operation that tries to insert a page into the page cache).

> Naoya's patch will keep the failure information and allows the reader
> to get I/O error when it reads from broken pagecache.

It only adds a hwpoison check in add_to_page_cache_locked(). If the
page is already in the cache, then no error will be sent to the
reader because it never gets to add_to_page_cache_locked().

So there's no guarantee that the reader is even going to see the
error, or that they see the error on the page that actually caused
it - access to any missing page in the page cache will trigger it.
And as memory reclaim clears pages off the inode, more and more of
the range of the inode's data will return an error, even though
there is nothing wrong with the data in most of the file.

Indeed, what happens if the application calls fsync, gets the error
and tries to rewrite the page? i.e. it does everything correctly to
handle the write error? With this patch set, the application
cannot insert a replacement page into the page cache, so all
subsequent writes fail! IOWs, it makes it impossible for
applications to recover from a detected and handled memory failure.
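i.e. the standard recovery sequence that would now fail looks
something like this (untested user-space sketch):

#include <unistd.h>

/* Untested sketch: detect a writeback error and rewrite the data. */
static int rewrite_on_error(int fd, const void *buf, size_t len, off_t off)
{
	int tries = 3;

	while (tries--) {
		if (pwrite(fd, buf, len, off) != (ssize_t)len)
			return -1;	/* can't reinsert the page - give up */
		if (fsync(fd) == 0)
			return 0;	/* error handled, data now on disk */
	}
	return -1;
}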

I have no issue with reporting the problem to userspace - that needs
to happen. Nor am I saying that the current IO reporting is wonderful and can't
be improved. What I am saying, though, is that I really don't think
this patch set has been well thought through from either an IO path
or userspace error handling point of view.  The problems with this
patch set are quite significant:
- permanent, unclearable errors ...

Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

2012-09-02 Thread Dave Chinner
On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> 
> [..]
> > > Performance aside, punting submission to per device worker in case of deep
> > > stack usage sounds cleaner solution to me.
> > 
> > Agreed, but performance tends to matter in the real world. And either
> > way the tricky bits are going to be confined to a few functions, so I
> > don't think it matters that much.
> > 
> > If someone wants to code up the workqueue version and test it, they're
> > more than welcome...
> 
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

Given that we are working around stack depth issues in the
filesystems already in several places, and now it seems like there's
a reason to work around it in the block layers as well, shouldn't we
simply increase the default stack size rather than introduce
complexity and performance regressions to try and work around not
having enough stack?

I mean, we can deal with it like the ia32 4k stack issue was dealt
with (i.e. ignore those stupid XFS people, that's an XFS bug), or
we can face the reality that storage stacks have become so complex
that 8k is no longer a big enough stack for a modern system

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.5.2: moving files from xfs/disk -> nfs: radix_tree_lookup_slot+0xe/0x10

2012-09-02 Thread Dave Chinner
On Sat, Sep 01, 2012 at 07:13:38PM -0400, Christoph Hellwig wrote:
> I'd suspect it's something with the actual radix tree code, Ccing
> linux-mm in case they know more.

I don't think it has anything to do with the radix tree code

> On Mon, Aug 27, 2012 at 11:00:10AM -0400, Justin Piszcz wrote:
> > Hi,
> > 
> > Moving ~276GB of files (mainly large backups) and everything has
> > seemed to lockup on the client moving data to the server, it is still
> > in this state..
.

> > [75716.705720] Call Trace:
> > [75716.705729]  [] ? radix_tree_lookup_slot+0xe/0x10

It's just a symbol that was found in the stack. The real trace is
this:

> > [75716.705747]  [] schedule+0x24/0x70
> > [75716.705751]  [] schedule_timeout+0x1a9/0x210
> > [75716.705764]  [] wait_for_common+0xc0/0x150
> > [75716.705773]  [] wait_for_completion+0x18/0x20
> > [75716.705777]  [] writeback_inodes_sb_nr+0x77/0xa0
> > [75716.705785]  [] writeback_inodes_sb+0x29/0x40
> > [75716.705788]  [] __sync_filesystem+0x47/0x90
> > [75716.705791]  [] sync_one_sb+0x1b/0x20
> > [75716.705795]  [] iterate_supers+0xe1/0xf0
> > [75716.705798]  [] sys_sync+0x2b/0x60
> > [75716.705802]  [] system_call_fastpath+0x1a/0x1f
> > [75836.701197] INFO: task sync:8790 blocked for more than 120 seconds.

Which simply says that writeback of the dirty data at the time of
the sync call has taken longer than 120s.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

2012-09-04 Thread Dave Chinner
On Tue, Sep 04, 2012 at 11:26:33AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 04, 2012 at 09:54:23AM -0400, Vivek Goyal wrote:
> > > Given that we are working around stack depth issues in the
> > > filesystems already in several places, and now it seems like there's
> > > a reason to work around it in the block layers as well, shouldn't we
> > > simply increase the default stack size rather than introduce
> > > complexity and performance regressions to try and work around not
> > > having enough stack?
> > 
> > Dave,
> > 
> > In this particular instance, we really don't have any bug reports of
> > stack overflowing. Just discussing what will happen if we make 
> > generic_make_request() recursive again.
> 
> I think there was one and that's why we added the bio_list thing.

There was more than one - it was regular enough to be considered a
feature... :/

> > > I mean, we can deal with it like the ia32 4k stack issue was dealt
> > > with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> > > we can face the reality that storage stacks have become so complex
> > > that 8k is no longer a big enough stack for a modern system
> > 
> > So first question will be, what's the right stack size? If we make
> > generic_make_request() recursive, then at some storage stack depth we will
> > overflow stack anyway (if we have created too deep a stack). Hence
> > keeping current logic kind of makes sense as in theory we can support
> > arbitrary depth of storage stack.
> 
> But, yeah, this can't be solved by enlarging the stack size.  The
> upper limit is unbound.

Sure, but recursion issue is isolated to the block layer.

If we can still submit IO directly through the block layer without
pushing it off to a work queue, then the overall stack usage problem
still exists. But if the block layer always pushes the IO off into
another workqueue to avoid stack overflows, then the context
switches are going to cause significant performance regressions for
high IOPS workloads.  I don't really like either situation.

So while you are discussing stack issues, think a little about the
bigger picture outside of the immediate issue at hand - a better
solution for everyone might pop up

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] Add ratelimited printk for different alert levels

2012-09-11 Thread Dave Chinner
On Wed, Sep 12, 2012 at 03:43:22AM +0530, raghu.prabh...@gmail.com wrote:
> From: Raghavendra D Prabhu 
> 
> Ratelimited printk will be useful in printing xfs messages which are otherwise
> not required to be printed always due to their high rate (to prevent kernel
> ring buffer from overflowing), while at the same time required to be printed.
> 
> Signed-off-by: Raghavendra D Prabhu 
> ---
>  fs/xfs/xfs_message.h | 28 
>  1 file changed, 28 insertions(+)
> 
> diff --git a/fs/xfs/xfs_message.h b/fs/xfs/xfs_message.h
> index 56dc0c1..87999a5 100644
> --- a/fs/xfs/xfs_message.h
> +++ b/fs/xfs/xfs_message.h
> @@ -1,6 +1,8 @@
>  #ifndef __XFS_MESSAGE_H
>  #define __XFS_MESSAGE_H 1
>  
> +#include 
> +

Include this in xfs_linux.h rather than here.

>  struct xfs_mount;
>  
>  extern __printf(2, 3)
> @@ -30,6 +32,32 @@ void xfs_debug(const struct xfs_mount *mp, const char 
> *fmt, ...)
>  }
>  #endif
>  
> +#define xfs_printk_ratelimited(xfs_printk, dev, fmt, ...)\
> +do { \
> + static DEFINE_RATELIMIT_STATE(_rs,  \
> +   DEFAULT_RATELIMIT_INTERVAL,   \
> +   DEFAULT_RATELIMIT_BURST); \
> + if (__ratelimit(&_rs))  \
> + xfs_printk(dev, fmt, ##__VA_ARGS__);\
> +} while (0)

Use "func" not xfs_printk here. xfs_printk looks too much like a
real function name (indeed, we already have __xfs_printk) rather
than a macro parameter.
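i.e. something like this (the same macro as quoted above, just with
the parameter renamed):

#define xfs_printk_ratelimited(func, dev, fmt, ...)		\
do {								\
	static DEFINE_RATELIMIT_STATE(_rs,			\
				      DEFAULT_RATELIMIT_INTERVAL, \
				      DEFAULT_RATELIMIT_BURST);	\
	if (__ratelimit(&_rs))					\
		func(dev, fmt, ##__VA_ARGS__);			\
} while (0)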

> +#define xfs_emerg_ratelimited(dev, fmt, ...) \
> + xfs_printk_ratelimited(xfs_emerg, dev, fmt, ##__VA_ARGS__)
> +#define xfs_alert_ratelimited(dev, fmt, ...) \
> + xfs_printk_ratelimited(xfs_alert, dev, fmt, ##__VA_ARGS__)
> +#define xfs_crit_ratelimited(dev, fmt, ...)  \
> + xfs_printk_ratelimited(xfs_crit, dev, fmt, ##__VA_ARGS__)
> +#define xfs_err_ratelimited(dev, fmt, ...)   \
> + xfs_printk_ratelimited(xfs_err, dev, fmt, ##__VA_ARGS__)
> +#define xfs_warn_ratelimited(dev, fmt, ...)  \
> + xfs_printk_ratelimited(xfs_warn, dev, fmt, ##__VA_ARGS__)
> +#define xfs_notice_ratelimited(dev, fmt, ...)\
> + xfs_printk_ratelimited(xfs_notice, dev, fmt, ##__VA_ARGS__)
> +#define xfs_info_ratelimited(dev, fmt, ...)  \
> + xfs_printk_ratelimited(xfs_info, dev, fmt, ##__VA_ARGS__)
> +#define xfs_dbg_ratelimited(dev, fmt, ...)   \
> + xfs_printk_ratelimited(xfs_dbg, dev, fmt, ##__VA_ARGS__)

Here's the problem with adding macros that aren't used. xfs_dbg
does not exist - the function is xfs_debug(). The compiler won't
catch that until the macro is used, so only add the macros which are
needed for this patch series.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] XFS: Print error when xfs_ialloc_ag_select fails to find continuous free space.

2012-09-11 Thread Dave Chinner
On Wed, Sep 12, 2012 at 03:43:23AM +0530, raghu.prabh...@gmail.com wrote:
> From: Raghavendra D Prabhu 
> 
> When xfs_ialloc_ag_select fails to find any AG with continuous free blocks
> required for inode allocation, printk the error in ratelimited manner.
> 
> Signed-off-by: Raghavendra D Prabhu 
> ---
>  fs/xfs/xfs_ialloc.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index 5aceb3f..e75a39d 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -539,8 +539,11 @@ nextag:
>   if (agno >= agcount)
>   agno = 0;
>   if (agno == pagno) {
> - if (flags == 0)
> + if (flags == 0) {
> + xfs_err_ratelimited(mp,
> + "Out of continuous free blocks for 
> inode allocation");

http://oss.sgi.com/archives/xfs/2012-06/msg00041.html


Couple of things for all 3 patches. Firstly - 80 columns. We tend
to keep the format string on a single line so it is easy to grep
for like so:

pr_err_once(mp,
"Insufficient contiguous free space for inode allocation");


So, you need to change the error message to the one suggested, and
follow 80-character width limits like the rest of the code.

Also, I think the error message is better at the caller site, not in
the function itself. i.e. if we get a NULLAGNUMBER returned, the
caller decides whether to emit an error message or not.
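i.e. something along these lines at the call site (untested sketch -
the xfs_ialloc_ag_select() argument list is from memory and may not
match exactly):

	agno = xfs_ialloc_ag_select(tp, parent, mode, okalloc);
	if (agno == NULLAGNUMBER) {
		xfs_err_ratelimited(mp,
	"Insufficient contiguous free space for inode allocation");
		/* existing NULLAGNUMBER error handling stays unchanged */
	}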

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] XFS: Print error when unable to allocate inodes or out of free inodes.

2012-09-11 Thread Dave Chinner
On Wed, Sep 12, 2012 at 03:43:24AM +0530, raghu.prabh...@gmail.com wrote:
> From: Raghavendra D Prabhu 
> 
> When xfs_dialloc is unable to allocate required number of inodes or there are
> no AGs with free inodes, printk the error in ratelimited manner.
> 
> Signed-off-by: Raghavendra D Prabhu 
> ---
>  fs/xfs/xfs_ialloc.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index e75a39d..034131b 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -990,8 +990,11 @@ xfs_dialloc(
>   goto out_error;
>  
>   xfs_perag_put(pag);
> - *inop = NULLFSINO;
> - return 0;
> +
> + xfs_err_ratelimited(mp,
> + "Unable to allocate inodes in AG %d: Required 
> %d, Current %llu, Maximum %llu",
> + agno, XFS_IALLOC_INODES(mp), 
> mp->m_sb.sb_icount, mp->m_maxicount);
> + goto out_spc;

This changes the error to be returned from 0 to ENOSPC. Adding error
messages shouldn't change the logic of the code.

Also, you might want to look at how ENOSPC is returned from
xfs_ialloc_ag_alloc(). it only occurs when:

if (mp->m_maxicount &&
mp->m_sb.sb_icount + newlen > mp->m_maxicount) {

i.e. it is exactly the same error case as the "noroom" error below.
It has nothing to do with being unable to allocate inodes in the
specific AG - the global inode count is too high. IOWs, the error
message is not correct.
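i.e. the message can be emitted without changing the return value, and
it should describe the real condition (untested sketch only):

	xfs_perag_put(pag);
	xfs_err_ratelimited(mp,
	"Inode allocation failed: maximum inode count reached");
	*inop = NULLFSINO;	/* unchanged */
	return 0;		/* unchanged - don't turn this into ENOSPC */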

Also, 80 columns.

>   }
>  
>   if (ialloced) {
> @@ -1016,11 +1019,19 @@ nextag:
>   if (++agno == mp->m_sb.sb_agcount)
>   agno = 0;
>   if (agno == start_agno) {
> - *inop = NULLFSINO;
> - return noroom ? ENOSPC : 0;
> + if (noroom) {
> + xfs_err_ratelimited(mp,
> + "Out of AGs with free inodes: Required 
> %d, Current %llu, Maximum %llu",
> +  XFS_IALLOC_INODES(mp), 
> mp->m_sb.sb_icount, mp->m_maxicount);

The error message here is misleading - the error is that we've
exceeded the maximum inode count for the filesystem (same as the
above error message case), so no further allocations are allowed.

What about the !noroom case? Isn't that a real ENOSPC condition?
i.e. we've tried to allocate inodes in every AG and yet we've failed
in all of them because there is no aligned, contiguous free space in
any of the AGs. Shouldn't that emit an appropriate warning?

> + goto out_spc;
> + }
> + return 0;
>   }
>   }
>  
> +out_spc:
> + *inop = NULLFSINO;
> + return ENOSPC;
>  out_alloc:
>   *IO_agbp = NULL;
>   return xfs_dialloc_ag(tp, agbp, parent, inop);

Default behaviour on a loop break is to allocate inodes, not return
ENOSPC.

BTW, there's no need to cc LKML for XFS specific patches. LKML is
noisy enough as it is without unnecessary cross-posts

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] Add ratelimited printk for different alert levels

2012-09-12 Thread Dave Chinner
On Tue, Sep 11, 2012 at 08:22:39PM -0700, Joe Perches wrote:
> On Wed, 2012-09-12 at 03:43 +0530, raghu.prabh...@gmail.com wrote:
> > Ratelimited printk will be useful in printing xfs messages which are otherwise
> > not required to be printed always due to their high rate (to prevent kernel
> > ring buffer from overflowing), while at the same time required to be printed.
> []
> > diff --git a/fs/xfs/xfs_message.h b/fs/xfs/xfs_message.h
> []
> > @@ -30,6 +32,32 @@ void xfs_debug(const struct xfs_mount *mp, const char 
> > *fmt, ...)
> >  }
> >  #endif
> >  
> > +#define xfs_printk_ratelimited(xfs_printk, dev, fmt, ...)  \
> > +do {   
> > \
> > +   static DEFINE_RATELIMIT_STATE(_rs,  \
> > + DEFAULT_RATELIMIT_INTERVAL,   \
> > + DEFAULT_RATELIMIT_BURST); \
> > +   if (__ratelimit(&_rs))  \
> > +   xfs_printk(dev, fmt, ##__VA_ARGS__);\
> > +} while (0)
> 
> It might be better to use an xfs singleton RATELIMIT_STATE
> 
> DEFINE_RATELIMIT_STATE(xfs_rs);
> ...
> #define xfs_printk_ratelimited(xfs_printk, dev, fmt, ...) \
> do {  \
>   if (__ratelimit(&xfs_rs))   \
>   xfs_printk(dev, fmt, ##__VA_ARGS__);\
> } while (0)

Which would then result in ratelimiting dropping potentially
important, unique messages. I think it's much better to guarantee
ratelimited messages get emitted at least once, especially as there
is the potential for multiple filesystems to emit messages
simultaneously.

I think per-location rate limiting is fine for the current usage -
ratelimiting is not widespread so there isn't a massive increase in
size as a result of this. If we do start to use ratelimiting in lots
of places in XFS, then we might have to revisit this, but it's OK
for now.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] loop: Limit the number of requests in the bio list

2012-10-02 Thread Dave Chinner
On Tue, Oct 02, 2012 at 10:52:05AM +0200, Lukáš Czerner wrote:
> On Mon, 1 Oct 2012, Jeff Moyer wrote:
> > Date: Mon, 01 Oct 2012 12:52:19 -0400
> > From: Jeff Moyer 
> > To: Lukas Czerner 
> > Cc: Jens Axboe , linux-kernel@vger.kernel.org,
> > Dave Chinner 
> > Subject: Re: [PATCH] loop: Limit the number of requests in the bio list
> > 
> > Lukas Czerner  writes:
> > 
> > > Currently there is not limitation of number of requests in the loop bio
> > > list. This can lead into some nasty situations when the caller spawns
> > > tons of bio requests taking huge amount of memory. This is even more
> > > obvious with discard where blkdev_issue_discard() will submit all bios
> > > for the range and wait for them to finish afterwards. On really big loop
> > > devices this can lead to OOM situation as reported by Dave Chinner.
> > >
> > > With this patch we will wait in loop_make_request() if the number of
> > > bios in the loop bio list would exceed 'nr_requests' number of requests.
> > > We'll wake up the process as we process the bios form the list.
> > 
> > I think you might want to do something similar to what is done for
> > request_queues by implementing a congestion on and off threshold.  As
> > Jens writes in this commit (predating the conversion to git):
> 
> Right, I've had the same idea. However my first proof-of-concept
> worked quite well without this and my simple performance testing did
> not show any regression.
> 
> I've basically done just fstrim, and blkdiscard on huge loop device
> measuring time to finish and dd bs=4k throughput. None of those showed
> any performance regression. I've chosen those for being quite simple
> and supposedly issuing quite a lot of bios. Any better
> recommendation to test this ?
> 
> Also I am still unable to reproduce the problem Dave originally
> experienced and I was hoping that he can test whether this helps or
> not.
> 
> Dave could you give it a try please ? By creating huge (500T, 1000T,
> 1500T) loop device on machine with 2GB memory I was not able to reproduce
> that. Maybe it's that xfs punch hole implementation is so damn fast
> :). Please let me know.

Try a file with a few hundred thousand extents in it (preallocate
them). I found this while testing large block devices on loopback
devices, not with empty files.
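Something like this untested sketch will do it - preallocate every
second 4k block so each allocation ends up as a separate extent:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Untested sketch: create a preallocated file with nr_extents extents. */
static int make_fragmented_file(const char *path, long nr_extents)
{
	int fd = open(path, O_CREAT | O_RDWR, 0644);
	long i;

	if (fd < 0)
		return -1;
	for (i = 0; i < nr_extents; i++) {
		/* leave a 4k hole between extents so they can't merge */
		if (fallocate(fd, 0, (off_t)i * 8192, 4096) < 0) {
			close(fd);
			return -1;
		}
	}
	return close(fd);
}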

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-02 Thread Dave Chinner
On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
> On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> > Kent Overstreet  writes:
> > 
> > > So, I and other people keep running into things where we really need to
> > > add an interface to pass some auxiliary... stuff along with a pread() or
> > > pwrite().
> > >
> > > A few examples:
> > >
> > > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > > either priorities or a cgroup - by specifying a cgroup you can have a
> > > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > > quotas.
> > 
> > You can do this today by splitting I/O between processes and placing
> > those processes in different cgroups.  For io priority, there is
> > ioprio_set, which incurs an extra system call, but can be used.  Not
> > elegant, but possible.
> 
> Yes - those are things I'm trying to replace. Doing it that way is a
> real pain, both as it's a lousy interface for this and it does impact
> performance (ioprio_set doesn't really work too well with aio, too).
> 
> > > * Cache hints. For bcache and other things, userspace may want to specify
> > > "this data should be cached", "this data should bypass the cache", etc.
> > 
> > Please explain how you will differentiate this from posix_fadvise.
> 
> Oh sorry, I think about SSD caching so much I forget to say that's what
> I'm talking about. posix_fadvise is for the page cache, we want
> something different for an SSD cache (IMO it'd be really ugly to use it
> for both, and posix_fadvise() can't really specifify everything we'd
> want to for an SSD cache).

Similar discussions about posix_fadvise() are being had for marking
ranges of files as volatile (i.e. useful for determining what can be
evicted from a cache when space reclaim is required).

https://lkml.org/lkml/2012/10/2/501

If you have requirements for specific cache management, then it
might be worth seeing if you can steer an existing interface
proposal for some form of cache management in the direction you
need.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Extensible AIO interface

2012-10-03 Thread Dave Chinner
On Tue, Oct 02, 2012 at 07:41:10PM -0700, Kent Overstreet wrote:
> On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote:
> > On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
> > > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> > > > Kent Overstreet  writes:
> > > > 
> > > > > So, I and other people keep running into things where we really need 
> > > > > to
> > > > > add an interface to pass some auxiliary... stuff along with a pread() 
> > > > > or
> > > > > pwrite().
> > > > >
> > > > > A few examples:
> > > > >
> > > > > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > > > > either priorities or a cgroup - by specifying a cgroup you can have a
> > > > > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > > > > quotas.
> > > > 
> > > > You can do this today by splitting I/O between processes and placing
> > > > those processes in different cgroups.  For io priority, there is
> > > > ioprio_set, which incurs an extra system call, but can be used.  Not
> > > > elegant, but possible.
> > > 
> > > Yes - those are things I'm trying to replace. Doing it that way is a
> > > real pain, both as it's a lousy interface for this and it does impact
> > > performance (ioprio_set doesn't really work too well with aio, too).
> > > 
> > > > > * Cache hints. For bcache and other things, userspace may want to 
> > > > > specify
> > > > > "this data should be cached", "this data should bypass the cache", 
> > > > > etc.
> > > > 
> > > > Please explain how you will differentiate this from posix_fadvise.
> > > 
> > > Oh sorry, I think about SSD caching so much I forget to say that's what
> > > I'm talking about. posix_fadvise is for the page cache, we want
> > > something different for an SSD cache (IMO it'd be really ugly to use it
> > > for both, and posix_fadvise() can't really specifify everything we'd
> > > want to for an SSD cache).
> > 
> > Similar discussions about posix_fadvise() are being had for marking
> > ranges of files as volatile (i.e. useful for determining what can be
> > evicted from a cache when space reclaim is required).
> > 
> > https://lkml.org/lkml/2012/10/2/501
> 
> Hmm, interesting
> 
> Speaking as an implementor though, hints that aren't associated with any
> specific IO are harder to make use of - stuff is in the cache. What you
> really want is to know, for a given IO, whether to cache it or not, and
> possibly where in the LRU to stick it.

I can see how it might be useful, but it needs to have a defined set
of attributes that a file IO is allowed to have. If you don't define
the set, then what you really have is an arbitrary set of storage-device
specific interfaces.

Of course, once we have a defined set of per-file IO policy
attributes, we don't really need per-IO attributes - you can just
set them through a range interface like fadvise() or fallocate().

> Well, it's quite possible that different implementations would have no
> trouble making use of those kinds of hints, I'm no doubt biased by
> having implemented bcache. With bcache though, cache replacement is done
> in terms of physical address space, not logical (i.e. the address space
> of the device being cached). 
> 
> So to handle posix_fadvise, we'd have to walk the btree and chase
> pointers to buckets, and modify the bucket priorities up or down... but
> what about the other data in those buckets? It's not clear what should
> happen, but there isn't any good way to take that into account.
> 
> (The exception is dropping data from the cache entirely, we can just
> invalidate that part of the keyspace and garbage collection will reclaim
> the buckets they pointed to. Bcache does that for discard requests,
> currently).

It sounds to me like you are saying that the design of bcache is
unsuited to file-level management of caching policy, and that is why
you want to pass attributes directly to bcache with specific IOs. Is
that a fair summary of the problem you are describing here?

My problem with this approach has nothing to do with the per-IO
nature of it - it's to do with the layering violations and the
amount of storage specific knowledge needed to make effective use of
it. i.e. it seems like an interface that can only be used by people
intimately familiar with underlying storage implementations.

Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-23 Thread Dave Chinner
On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
> Hi,
>   Recently we ran into the bug that an opened file's ra_pages does not
> synchronize with it's backing device's when the latter is changed
> with blockdev --setra, the application needs to reopen the file
> to know the change,

or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
window to the (new) bdi default.

> which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us
why you need this and why existing methods of handling such changes
are insufficient...

Optimal readahead windows tend to be a physical property of the
storage and that does not tend to change dynamically. Hence block
device readahead should only need to be set up once, and generally
that can be done before the filesystem is mounted and files are
opened (e.g. via udev rules). Hence you need to explain why you need
to change the default block device readahead on the fly, and why
fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
windows to the new defaults.
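i.e. all the application needs to do after the block device readahead
has been changed is something like:

#define _XOPEN_SOURCE 600	/* for the posix_fadvise() prototype */
#include <fcntl.h>

/* reset this fd's readahead window to the (new) bdi default */
static int reset_readahead(int fd)
{
	return posix_fadvise(fd, 0, 0, POSIX_FADV_NORMAL);
}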

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/16] f2fs: add inode operations for special inodes

2012-10-23 Thread Dave Chinner
On Wed, Oct 17, 2012 at 12:50:11PM +, Arnd Bergmann wrote:
> On Tuesday 16 October 2012, Jaegeuk Kim wrote:
> > > IIRC, f2fs uses 4k inodes, so IMO per-inode xattr trees with
> > > internal storage before spilling to an external block is probably
> > > the best approach to take...
> > 
> > Yes, indeed this is the best approach to f2fs's xattr.
> > Apart from giving fs hints, it is worth enough to optimize later.
> 
> I've thought a bit more about how this could be represented efficiently
> in 4KB nodes. This would require a significant change of the way you
> represent inodes, but can improve a number of things at the same time.
> 
> The idea is to replace the fixed area in the inode that contains block
> pointers with an extensible TLV (type/length/value) list that can contain
> multiple variable-length fields, like this.

You've just re-invented inode forks... ;)

> All TLVs together with the
> fixed-length inode data can fill a 4KB block.
> 
> The obvious types would be:
> 
> * Direct file contents if the file is less than a block
> * List of block pointers, as before, minimum 1, maximum until the end
>   of the block
> * List of indirect pointers, now also a variable length, similar to the
>   list of block pointers
> * List of double-indirect block pointers
> * direct xattr: zero-terminated attribute name followed by contents
> * indirect xattr: zero-terminated attribute name followed by up to
>   16 block pointers to store a maximum of 64KB sized xattrs
> 
> This could be extended later to cover additional types, e.g. a list
> of erase block pointers, triple-indirect blocks or extents.

An inode fork doesn't care about the data in it - it's just an
independent block mapping index. i.e. inline, direct,
indirect, double indirect. The data in the fork is managed
externally to the format of the fork. e.g. XFS has two forks - one
for storing data (file data, directory contents, etc) and the other
for storing attributes.

The main issue with supporting an arbitrary number of forks is space
management of the inode literal area.  e.g. one fork is in inline
format (e.g.  direct file contents) and then we add an attribute.
The attribute won't fit inline, nor will an extent form fork header,
so the inline data fork has to be converted to extent format before
the xattr can be added. Now scale that problem up to an arbitrary
number of forks

> As a variation of this, it would also be nice to turn around the order
> in which the pointers are walked, to optimize for space and for growing
> files, rather than for reading the beginning of a file. With this, you
> can represent a 9 KB file using a list of two block pointers, and 1KB
> of direct data, all in the inode. When the user adds another byte, you
> only need to rewrite the inode. Similarly, a 5 MB file would have a
> single indirect node (covering block pointers for 4 MB), plus 256
> separate block pointers (covering the last megabyte), and a 5 GB file
> can be represented using 1 double-indirect node and 256 indirect nodes,
> and each of them can still be followed by direct "tail" data and
> extended attributes.

I'm not sure that the resultant code complexity is worth saving an
extra block here and there.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-23 Thread Dave Chinner
On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> 
> > > Yes, I tried this approach - it involves doing LOCK instruction on read 
> > > lock, remembering the cpu and doing another LOCK instruction on read 
> > > unlock (which will hopefully be on the same CPU, so no cacheline bouncing 
> > > happens in the common case). It was slower than the approach without any 
> > > LOCK instructions (43.3 seconds seconds for the implementation with 
> > > per-cpu LOCKed access, 42.7 seconds for this implementation without 
> > > atomic 
> > > instruction; the benchmark involved doing 512-byte direct-io reads and 
> > > writes on a ramdisk with 8 processes on 8-core machine).
> > 
> > So why is that a problem? Surely that's already tons better then what
> > you've currently got.
> 
> Percpu rw-semaphores do not improve performance at all. I put them there 
> to avoid performance regression, not to improve performance.
> 
> All Linux kernels have a race condition - when you change block size of a 
> block device and you read or write the device at the same time, a crash 
> may happen. This bug is there since ever. Recently, this bug started to 
> cause major trouble - multiple high profile business sites report crashes 
> because of this race condition.
>
> You can fix this race by using a read lock around I/O paths and write lock 
> around block size changing, but normal rw semaphore cause cache line 
> bouncing when taken for read by multiple processors and I/O performance 
> degradation because of it is measurable.

This doesn't sound like a new problem.  Hasn't this global access,
single modifier exclusion problem been solved before in the VFS?
e.g. mnt_want_write()/mnt_make_readonly()
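i.e. the frequent path takes a cheap per-mount reference around each
modification, and the rare state change excludes them all - roughly
(a sketch of the existing VFS usage pattern, not new code):

	/* frequent path: every writer takes a cheap reference */
	error = mnt_want_write(mnt);
	if (error)
		return error;
	/* ... modify the filesystem ... */
	mnt_drop_write(mnt);

	/*
	 * rare path: mnt_make_readonly() flips the mount state and fails
	 * if writers are in progress, without the write side needing a
	 * globally bouncing lock.
	 */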

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-24 Thread Dave Chinner
On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
> Hi Dave,
> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner  wrote:
> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
> >> Hi,
> >>   Recently we ran into the bug that an opened file's ra_pages does not
> >> synchronize with it's backing device's when the latter is changed
> >> with blockdev --setra, the application needs to reopen the file
> >> to know the change,
> >
> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
> > window to the (new) bdi default.
> >
> >> which is inappropriate under our circumstances.
> >
> > Which are? We don't know your circumstances, so you need to tell us
> > why you need this and why existing methods of handling such changes
> > are insufficient...
> >
> > Optimal readahead windows tend to be a physical property of the
> > storage and that does not tend to change dynamically. Hence block
> > device readahead should only need to be set up once, and generally
> > that can be done before the filesystem is mounted and files are
> > opened (e.g. via udev rules). Hence you need to explain why you need
> > to change the default block device readahead on the fly, and why
> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
> > windows to the new defaults.
> Our system is a fuse-based file system, fuse creates a
> pseudo backing device for the user space file systems, the default readahead
> size is 128KB and it can't fully utilize the backing storage's read ability,
> so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this
at mount time before the application opens any files. i.e.  you've
simply stated the reason why readahead is tunable, not why you need
to be fully dynamic.

> The above third-party application using our file system maintains
> some long-opened files, we does not have any chances
> to force them to call fadvise(POSIX_FADV_NORMAL). :(

So raise a bug/feature request with the third party.  Modifying
kernel code because you can't directly modify the application isn't
the best solution for anyone. This really is an application problem
- the kernel already provides the mechanisms to solve this
problem...  :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-24 Thread Dave Chinner
On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote:
> On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner  wrote:
> > On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:
> >> Hi Dave,
> >> On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner  wrote:
> >> > On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:
> >> >> Hi,
> >> >>   Recently we ran into the bug that an opened file's ra_pages does not
> >> >> synchronize with it's backing device's when the latter is changed
> >> >> with blockdev --setra, the application needs to reopen the file
> >> >> to know the change,
> >> >
> >> > or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
> >> > window to the (new) bdi default.
> >> >
> >> >> which is inappropriate under our circumstances.
> >> >
> >> > Which are? We don't know your circumstances, so you need to tell us
> >> > why you need this and why existing methods of handling such changes
> >> > are insufficient...
> >> >
> >> > Optimal readahead windows tend to be a physical property of the
> >> > storage and that does not tend to change dynamically. Hence block
> >> > device readahead should only need to be set up once, and generally
> >> > that can be done before the filesystem is mounted and files are
> >> > opened (e.g. via udev rules). Hence you need to explain why you need
> >> > to change the default block device readahead on the fly, and why
> >> > fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
> >> > windows to the new defaults.
> >> Our system is a fuse-based file system, fuse creates a
> >> pseudo backing device for the user space file systems, the default 
> >> readahead
> >> size is 128KB and it can't fully utilize the backing storage's read 
> >> ability,
> >> so we should tune it.
> >
> > Sure, but that doesn't tell me anything about why you can't do this
> > at mount time before the application opens any files. i.e.  you've
> > simply stated the reason why readahead is tunable, not why you need
> > to be fully dynamic.
> We store our file system's data on different disks so we need to change 
> ra_pages
> dynamically according to where the data resides, it can't be fixed at mount 
> time
> or when we open files.

That doesn't make a whole lot of sense to me. let me try to get this
straight.

There is data that resides on two devices (A + B), and a fuse
filesystem to access that data. There is a single file in the fuse
fs that has data on both devices. An app has the file open, and when the
data it is accessing is on device A you need to set the readahead to
what is best for device A? And when the app tries to access data for
that file that is on device B, you need to set the readahead to what
is best for device B? And you are changing the fuse BDI readahead
settings according to where the data in the back end lies?

It seems to me that you should be setting the fuse readahead to the
maximum of the readahead windows the data devices have configured at
mount time and leaving it at that

> The abstract bdi of fuse and btrfs provides some dynamically changing
> bdi.ra_pages
> based on the real backing device. IMHO this should not be ignored.

btrfs simply takes into account the number of disks it has for a
given storage pool when setting up the default bdi ra_pages during
mount.  This is basically doing what I suggested above.  Same with
the generic fuse code - it's simply setting a sensible default value
for the given fuse configuration.

Neither are dynamic in the sense you are talking about, though.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] brw_mutex: big read-write mutex

2012-10-25 Thread Dave Chinner
On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 24 Oct 2012, Dave Chinner wrote:
> 
> > On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> > > 
> > > > > Yes, I tried this approach - it involves doing LOCK instruction on 
> > > > > read 
> > > > > lock, remembering the cpu and doing another LOCK instruction on read 
> > > > > unlock (which will hopefully be on the same CPU, so no cacheline 
> > > > > bouncing 
> > > > > happens in the common case). It was slower than the approach without 
> > > > > any 
> > > > > LOCK instructions (43.3 seconds seconds for the implementation with 
> > > > > per-cpu LOCKed access, 42.7 seconds for this implementation without 
> > > > > atomic 
> > > > > instruction; the benchmark involved doing 512-byte direct-io reads 
> > > > > and 
> > > > > writes on a ramdisk with 8 processes on 8-core machine).
> > > > 
> > > > So why is that a problem? Surely that's already tons better then what
> > > > you've currently got.
> > > 
> > > Percpu rw-semaphores do not improve performance at all. I put them there 
> > > to avoid performance regression, not to improve performance.
> > > 
> > > All Linux kernels have a race condition - when you change block size of a 
> > > block device and you read or write the device at the same time, a crash 
> > > may happen. This bug is there since ever. Recently, this bug started to 
> > > cause major trouble - multiple high profile business sites report crashes 
> > > because of this race condition.
> > >
> > > You can fix this race by using a read lock around I/O paths and write 
> > > lock 
> > > around block size changing, but normal rw semaphore cause cache line 
> > > bouncing when taken for read by multiple processors and I/O performance 
> > > degradation because of it is measurable.
> > 
> > This doesn't sound like a new problem.  Hasn't this global access,
> > single modifier exclusion problem been solved before in the VFS?
> > e.g. mnt_want_write()/mnt_make_readonly()
> > 
> > Cheers,
> > 
> > Dave.
> 
> Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw 
> semaphores. I think you can convert mnt_want_write()/mnt_make_readonly() 
> to use percpu rw semaphores and remove the duplicated code.

I think you misunderstood my point - that rather than re-inventing
the wheel, why didn't you just copy something that is known to
work?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-25 Thread Dave Chinner
On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:
> Hi Chen,
> 
> > But how can bdi related ra_pages reflect different files' readahead
> > window? Maybe these different files are sequential read, random read
> > and so on.
> 
> It's simple: sequential reads will get ra_pages readahead size while
> random reads will not get readahead at all.
> 
> Talking about the below chunk, it might hurt someone that explicitly
> takes advantage of the behavior, however the ra_pages*2 seems more
> like a hack than general solution to me: if the user will need
> POSIX_FADV_SEQUENTIAL to double the max readahead window size for
> improving IO performance, then why not just increase bdi->ra_pages and
> benefit all reads? One may argue that it offers some differential
> behavior to specific applications, however it may also present as a
> counter-optimization: if the root already tuned bdi->ra_pages to the
> optimal size, the doubled readahead size will only cost more memory
> and perhaps IO latency.
> 
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t 
> len, int advice)
> spin_unlock(&file->f_lock);
> break;
> case POSIX_FADV_SEQUENTIAL:
> -   file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is
not a set-and-forget value. e.g.  shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do sequential
IO should reset the readahead to the maximum ra_pages value
supported.
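i.e. keep a reset there rather than removing the assignment entirely -
something like this (sketch; the rest of the case is unchanged):

	case POSIX_FADV_SEQUENTIAL:
		/*
		 * Reset to the bdi default rather than dropping the
		 * assignment - shrink_readahead_size_eio() may have
		 * shrunk ra_pages after IO errors.
		 */
		file->f_ra.ra_pages = bdi->ra_pages;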

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hang in XFS reclaim on 3.7.0-rc3

2012-11-01 Thread Dave Chinner
On Thu, Nov 01, 2012 at 04:30:10PM -0500, Ben Myers wrote:
> Hi Dave,
> 
> On Tue, Oct 30, 2012 at 09:26:13AM +1100, Dave Chinner wrote:
> > On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
> > > After experiencing a hang of all IO yesterday (
> > > http://marc.info/?l=linux-kernel&m=135142236520624&w=2 ), I turned on
> > > LOCKDEP after upgrading to -rc3.
> > > 
> > > I then tried to replicate the load that hung yesterday and got the
> > > following lockdep report, implicating XFS instead of by stacking swap
> > > onto dm-crypt and md.
> > > 
> > > [ 2844.971913]
> > > [ 2844.971920] =
> > > [ 2844.971921] [ INFO: inconsistent lock state ]
> > > [ 2844.971924] 3.7.0-rc3 #1 Not tainted
> > > [ 2844.971925] -
> > > [ 2844.971927] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> > > [ 2844.971929] kswapd0/725 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > > [ 2844.971931] (&(&ip->i_lock)->mr_lock){?.}, at: 
> > > [] xfs_ilock+0x84/0xb0
> > > [ 2844.971941] {RECLAIM_FS-ON-W} state was registered at:
> > > [ 2844.971942]   [] mark_held_locks+0x7e/0x130
> > > [ 2844.971947]   [] lockdep_trace_alloc+0x63/0xc0
> > > [ 2844.971949]   [] kmem_cache_alloc+0x35/0xe0
> > > [ 2844.971952]   [] vm_map_ram+0x271/0x770
> > > [ 2844.971955]   [] _xfs_buf_map_pages+0x46/0xe0
.
> > We shouldn't be mapping pages there. See if the patch below fixes
> > it.
> > 
> > Fundamentally, though, the lockdep warning has come about because
> > vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
> > doing GFP_NOFS - we are within a transaction here, so memory reclaim
> > is not allowed to recurse back into the filesystem.
> > 
> > mm-folk: can we please get this vmalloc/gfp_flags passing API
> > fixed once and for all? This is the fourth time in the last month or
> > so that I've seen XFS bug reports with silent hangs and associated
> > lockdep output that implicate GFP_KERNEL allocations from vm_map_ram
> > in GFP_NOFS conditions as the potential cause
> > 
> > xfs: don't vmap inode cluster buffers during free
> 
> Could you write up a little more background for the commit message?

Sure, that was just a test patch and often I don't bother putting a
detailed description in them until I know they fix the problem. My
current tree has:

xfs: don't vmap inode cluster buffers during free

Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: VFS hot tracking: How to calculate data temperature?

2012-11-05 Thread Dave Chinner
On Mon, Nov 05, 2012 at 10:35:50AM +0800, Zhi Yong Wu wrote:
> On Sat, Nov 3, 2012 at 5:27 AM, Mingming.cao  wrote:
> > On Fri, 2012-11-02 at 14:38 +0800, Zhi Yong Wu wrote:
> >> Here also has another question.
> >>
> >> How to save the file temperature among the umount to be able to
> >> preserve the file tempreture after reboot?
> >>
> >> This above is the requirement from DB product.
> >> I thought that we can save file temperature in its inode struct, that
> >> is, add one new field in struct inode, then this info will be written
> >> to disk with inode.
> >>
> >> Any comments or ideas are appreciated, thanks.
> >>
> >>
> >
> > Maybe could save the last file temperature with extended attributes.
> It seems that only ext4 has the concept of extended attributes.

All major filesystems have xattr support. They are used extensively
by the security and integrity subsystems, for example.

Saving the information might be something that is useful to certain
applications, but let's have the people that need that functionality
spell out their requirements before discussing how or what to
implement.  Indeed, discussion should really focus on getting the
core, in-memory infrastructure sorted out first before trying to
expand the functionality further...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Problem with DISCARD and RAID5

2012-11-05 Thread Dave Chinner
On Fri, Nov 02, 2012 at 09:40:58AM +0800, Shaohua Li wrote:
> On Thu, Nov 01, 2012 at 05:38:54PM +1100, NeilBrown wrote:
> > 
> > Hi Shaohua,
> >  I've been doing some testing and discovered a problem with your discard
> >  support for RAID5.
> > 
> >  The code in blkdev_issue_discard assumes that the 'granularity' is a power
> >  of 2, and for example subtracts 1 to get a mask.
> > 
> >  However RAID5 sets the granularity to be the stripe size which often is not
> >  a power of two.  When this happens you can easily get into an infinite 
> > loop.
> > 
> >  I suspect that to make this work properly, blkdev_issue_discard will need 
> > to
> >  be changed to allow 'granularity' to be an arbitrary value.
> >  When it is a power of two, the current masking can be used.
> >  When it is anything else, it will need to use sector_div().
> 
> Yep, looks we need use sector_div. And this isn't the only problem. discard
> request can be merged, and the merge check only checks max_discard_sectors.
> That means the split requests in blkdev_issue_discard can be merged again. The
> split nerver works.
> 
> I'm wondering what's purpose of discard_alignment and discard_granularity. Are
> there devices with discard_granularity not 1 sector?

Most certainly. Thin provisioned storage often has granularity in the
order of megabytes

> If bio isn't discard
> aligned, what device will do?

Up to the device.

> Further, why driver handles alignment/granularity
> if device will ignore misaligned request.

When you send a series of sequential unaligned requests, the device
may ignore them all. Hence you end up with nothing being discarded,
even though the entire range being discarded is much, much larger
than the discard granularity
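
To make that concrete: once granularity is not a power of two, the
rounding in blkdev_issue_discard has to go through sector_div() as
Neil suggested above. A rough sketch of that (variable names are
illustrative only):

        unsigned int granularity = q->limits.discard_granularity >> 9;
        sector_t tmp = sector;
        unsigned int rem;

        rem = sector_div(tmp, granularity);     /* tmp = sector / granularity */
        if (rem)
                sector += granularity - rem;    /* round start up to the next
                                                 * granularity boundary */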

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8] Set bi_rw when alloc bio before call bio_add_page.

2012-07-30 Thread Dave Chinner
On Mon, Jul 30, 2012 at 03:14:28PM +0800, majianpeng wrote:
> When exec bio_alloc, the bi_rw is zero.But after calling bio_add_page,
> it will use bi_rw.
> Fox example, in functiion __bio_add_page,it will call merge_bvec_fn().
> The merge_bvec_fn of raid456 will use the bi_rw to judge the merge.
> >> if ((bvm->bi_rw & 1) == WRITE)
> >> return biovec->bv_len; /* always allow writes to be mergeable */

So if bio_add_page() requires bi_rw to be set, then shouldn't it be
set up for every caller? I noticed there are about 50 call sites for
bio_add_page(), and you've only touched about 10 of them. Indeed, I
notice that the RAID0/1 code uses bio_add_page, and as that can be
stacked on top of RAID456, it also needs to set bi_rw correctly.
As a result, your patch set is nowhere near complete, nor does it
document that bio_add_page requires that bi_rw be set before calling
(which is the new API requirement, AFAICT).
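
For reference, the calling convention being implied is that the
direction is set on the bio before any pages are added. A minimal
sketch of that (the surrounding setup and error handling are
illustrative only):

        struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);

        bio->bi_bdev = bdev;
        bio->bi_sector = first_sector;
        bio->bi_rw = WRITE;             /* set before bio_add_page() so that
                                         * merge_bvec_fn sees the direction */
        if (bio_add_page(bio, page, len, offset) != len) {
                /* rejected by queue limits - submit this bio, start another */
        }
        submit_bio(bio->bi_rw, bio);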

So, my question is whether the RAID456 code is doing something
valid.  That write optimisation is clearly not enabled for a
significant amount of code and filesystems, so the first thing to do
is quantify the benefit of the optimisation. I can't evalute the
merit of this change without data telling me it is worthwhile, and
it's a lot of code to churn for no benefit

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] xfs: check for possible overflow in xfs_ioc_trim

2012-07-30 Thread Dave Chinner
On Mon, Jul 30, 2012 at 10:13:44AM +0200, Tomas Racek wrote:
> If range.start argument was between ULLONG_MAX - BBSIZE and ULLONG_MAX,
> BTOBB macro resulted in overflow which caused start to be set to 0.
> Now, invalid argument error is returned instead.
> 
> Signed-off-by: Tomas Racek 
> ---
>  fs/xfs/xfs_discard.c |4 
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> index f9c3fe3..0ef7dd4 100644
> --- a/fs/xfs/xfs_discard.c
> +++ b/fs/xfs/xfs_discard.c
> @@ -179,6 +179,10 @@ xfs_ioc_trim(
>* used by the fstrim application.  In the end it really doesn't
>* matter as trimming blocks is an advisory interface.
>*/
> +
> + if (range.start > ULLONG_MAX - BBSIZE)
> + return -XFS_ERROR(EINVAL);
> +

There's no point checking for overflow on the range start - what we
need to check is whether it is larger than the size of the
filesystem. We do that after the conversion of range.start to basic
blocks, so that check needs to be promoted to before this. i.e.

if (range.start >= XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks))
return -XFS_ERROR(EINVAL);

>   start = BTOBB(range.start);
>   end = start + BTOBBT(range.len) - 1;
>   minlen = BTOBB(max_t(u64, granularity, range.minlen));

And that will prevent the overflow in BTOBB() just as effectively...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: [PATCH 0/8] Set bi_rw when alloc bio before call bio_add_page.

2012-07-30 Thread Dave Chinner
On Tue, Jul 31, 2012 at 08:55:59AM +0800, majianpeng wrote:
> On 2012-07-31 05:42 Dave Chinner  Wrote:
> >On Mon, Jul 30, 2012 at 03:14:28PM +0800, majianpeng wrote:
> >> When exec bio_alloc, the bi_rw is zero.But after calling bio_add_page,
> >> it will use bi_rw.
> >> Fox example, in functiion __bio_add_page,it will call merge_bvec_fn().
> >> The merge_bvec_fn of raid456 will use the bi_rw to judge the merge.
> >> >> if ((bvm->bi_rw & 1) == WRITE)
> >> >> return biovec->bv_len; /* always allow writes to be mergeable */
> >
> >So if bio_add_page() requires bi_rw to be set, then shouldn't it be
> >set up for every caller? I noticed there are about 50 call sites for
> >bio_add_page(), and you've only touched about 10 of them. Indeed, I
> >notice that the RAID0/1 code uses bio_add_page, and as that can be
> >stacked on top of RAID456, it also needs to set bi_rw correctly.
> >As a result, your patch set is nowhere near complete, not does it
> >document that bio_add_page requires that bi_rw be set before calling
> >(which is the new API requirement, AFAICT).
> There are many place call bio_add_page and I send some of those. Because my 
> abilty, so I only send 
> some patchs which i understand clearly.

Sure, but my point is that there is no point changing only a few and
ignoring the great majority of callers. Either fix them all, fix it
some other way (e.g. API change), or remove the code from the RAID5
function that requires it.

> In __bio_add_page:
> >>if (q->merge_bvec_fn) {
> >>struct bvec_merge_data bvm = {
> >>/* prev_bvec is already charged in
> >>   bi_size, discharge it in order to
> >>   simulate merging updated prev_bvec
> >>   as new bvec. */
> >>.bi_bdev = bio->bi_bdev,
> >>.bi_sector = bio->bi_sector,
> >>.bi_size = bio->bi_size - prev_bv_len,
> >>.bi_rw = bio->bi_rw,
> >>};
> it used bio->bi_rw.
> Before raid5_mergeable_bvec appearing, in kernel 'merge_bvec_fn' did not use 
> bio->bi_rw.

Right, but as things stand right now, the RAID5 code is a no-op
because nobody is setting bio->bi_rw correctly. it is effectively
dead code.

> But i think we shold not suppose bi_rw not meanless.

To decide whether we should take it to have meaning, data is
required to show that the RAID5 optimisation it enables is
worthwhile.  If the optimisation is not worthwhile, then the correct
thing to do is remove the optimisation in the RAID5 code and remove
the bi_rw field from the struct bvec_merge_data.

> >So, my question is whether the RAID456 code is doing something
> >valid.  That write optimisation is clearly not enabled for a
> >significant amount of code and filesystems, so the first thing to do
> >is quantify the benefit of the optimisation. I can't evalute the
> >merit of this change without data telling me it is worthwhile, and
> >it's a lot of code to churn for no benefit
> >
> Sorry, we do not think the 'merge_bvec_fn' did not use bi_rw.

It's entirely possible that when bi_rw was added to struct
bvec_merge_data, the person who added it was mistaken that bi_rw was
set at this point in time when in fact it never has been. Hence its
presence and reliance on it would be a bug.

That's what I'm asking - is this actually beneficial, or should it
simply be removed from struct bvec_merge_data? Data is needed to
answer that question

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-21 Thread Dave Chinner
On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> This patchset optimizes several places which take the per inode spin lock.
> They have not been fully tested yet, thus they are marked as RFC. 

Inodes are RCU freed. The i_lock spinlock on the i_state field forms
part of the memory barrier that allows the RCU read side to
correctly detect a freed inode during a RCU protected cache lookup
(hash list traversal for the VFS, or a radix tree traversal for XFS).
The i_lock usage around the hash list operations ensures the hash
list operations are atomic with state changes so that such changes
are correctly detected during RCU-protected traversals...

IOWs, removing the i_lock from around the i_state transitions and
inode hash insert/remove/traversal operations will cause races in
the RCU lookups and result in incorrectly using freed inodes instead
of failing the lookup and creating a new one.

So I don't think this is a good idea at all...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-23 Thread Dave Chinner
On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> 
> > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > This patchset optimizes several places which take the per inode spin lock.
> > > They have not been fully tested yet, thus they are marked as RFC. 
> > 
> > Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> > part of the memory barrier that allows the RCU read side to
> > correctly detect a freed inode during a RCU protected cache lookup
> > (hash list traversal for the VFS, or a radix tree traversal for XFS).
> > The i_lock usage around the hahs list operations ensures the hash
> > list operations are atomic with state changes so that such changes
> > are correctly detected during RCU-protected traversals...
> > 
> > IOWs, removing the i_lock from around the i_state transitions and
> > inode hash insert/remove/traversal operations will cause races in
> > the RCU lookups and result in incorrectly using freed inodes instead
> > of failing the lookup and creating a new one.
> > 
> > So I don't think this is a good idea at all...
> >
> 
> Hello, Dave:
> 
>   Thanks for your explanation.
>  
>   Though I can't fully understand it, your concern seems to be that
> RCU inode lookup will be bothered by this change. But we do not have 
> RCU inode lookup in VFS: inode lookup is done by rather a tranditional
> way. 

Ah, I'd forgotten that neither of these RCU-based lookups ever got
merged:

https://lkml.org/lkml/2010/6/23/397
http://thread.gmane.org/gmane.linux.kernel/1056494

That, however, is the approach that the inode caches should be
moving towards - RCU lookups to reduce locking, not changing
i_lock/i_state atomicity that has been designed to facilitate RCU
safe lookups...

>   XFS gives me the impression that it implements its own inode cache.
> There may be such thing there. I have little knowledge on XFS, but I
> guess it's unlikely impacted by the change of code implementing VFS 
> inode cache.

Yeah, I dropped the generic inode hash RCU conversion - the
SLAB_DESTROY_BY_RCU was proving to be rather complex, and I didn't
have any motivation to see it through because I'd already converted
XFS to avoid the global inode_hash_lock and use RCU lookups on its
internal inode cache...

>   As far as I can see, RCU inode free is for RCU dentry lookup, which
> seems have nothing to do with 'detect a freed inode'.

If you know nothing of the history of this, then it might seem that
way

> Taking i_lock in these
> places looks like to me a result of following old lock scheme blindly when 
> breaking the big global inode lock.

The i_state/i_hash_list/i_lock relationship was created specifically
during the inode_lock breakup to allow us to guarantee that certain
fields of the inode are unchanging without needing to take multiple
nested locks:

$ gl -n 1 250df6e
commit 250df6ed274d767da844a5d9f05720b804240197
Author: Dave Chinner 
Date:   Tue Mar 22 22:23:36 2011 +1100

fs: protect inode->i_state with inode->i_lock

Protect inode state transitions and validity checks with the
inode->i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.

This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.

Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode->i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.

Signed-off-by: Dave Chinner 
Signed-off-by: Al Viro 

The inode hash lookup needs to check i_state atomically during the
traversal so inodes being freed are skipped (e.g. I_FREEING,
I_WILL_FREE). those i_state flags are set only with the i_lock held,
and so inode hash lookups need to take the i_lock to guarantee the
i_state field is correct. This inode data field synchronisation is
separate to the cache hash list traversal protection.

The only way to do this is to have an inner lock (inode->i_lock)
that protects both the inode->i_hash_list and inode->i_state fields,
and a lock order that provides outer list traversal protections
(inode_hash_lock). Whether the outer lock is the inode_hash_lock or
rcu_read_lock(), the lock order and the data fields the locks are
protecting are the same
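
In condensed form, the lookup pattern that lock order exists to
support looks like this (simplified from the find_inode()/
find_inode_fast() code, with the error paths trimmed):

        spin_lock(&inode_hash_lock);            /* outer: list traversal */
        hlist_for_each_entry(inode, node, head, i_hash) {
                spin_lock(&inode->i_lock);      /* inner: i_state/i_hash */
                if (inode->i_sb != sb || inode->i_ino != ino) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }
                if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
                        spin_unlock(&inode->i_lock);
                        /* drops and retakes the hash lock internally */
                        __wait_on_freeing_inode(inode);
                        goto repeat;
                }
                __iget(inode);                  /* atomic with the i_state check */
                spin_unlock(&inode->i_lock);
                break;
        }
        spin_unlock(&inode_hash_lock);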

> Of course,

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-23 Thread Dave Chinner
On Mon, Sep 24, 2012 at 02:12:05PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 02:23:43PM +1000, Dave Chinner wrote:
> > On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> > > On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> > > 
> > > > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > > > This patchset optimizes several places which take the per inode spin 
> > > > > lock.
> > > > > They have not been fully tested yet, thus they are marked as RFC. 
> > > > 
> > > > Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> > > > part of the memory barrier that allows the RCU read side to
> > > > correctly detect a freed inode during a RCU protected cache lookup
> > > > (hash list traversal for the VFS, or a radix tree traversal for XFS).
> > > > The i_lock usage around the hahs list operations ensures the hash
> > > > list operations are atomic with state changes so that such changes
> > > > are correctly detected during RCU-protected traversals...
> > > > 
> > > > IOWs, removing the i_lock from around the i_state transitions and
> > > > inode hash insert/remove/traversal operations will cause races in
> > > > the RCU lookups and result in incorrectly using freed inodes instead
> > > > of failing the lookup and creating a new one.
> > > > 
> > > > So I don't think this is a good idea at all...
.
> > The inode hash lookup needs to check i_state atomically during the
> > traversal so inodes being freed are skipped (e.g. I_FREEING,
> > I_WILL_FREE). those i_state flags are set only with the i_lock held,
> > and so inode hash lookups need to take the i_lock to guarantee the
> > i_state field is correct. This inode data field synchronisation is
> > separate to the cache hash list traversal protection.
> > 
> > The only way to do this is to have an inner lock (inode->i_lock)
> > that protects both the inode->i_hash_list and inode->i_state fields,
> > and a lock order that provides outer list traversal protections
> > (inode_hash_lock). Whether the outer lock is the inode_hash_lock or
> > rcu_read_lock(), the lock order and the data fields the locks are
> > protecting are the same
> > 
> > > Of course, maybe they are there for something. Could you speak
> > > more about the race this change (patch 1,2?) brings up? Thank you.
> > 
> > When you drop the lock from the i_state initialisation, you end up
> > dropping the implicit unlock->lock memory barrier that the
> > inode->i_lock provides. i.e. you get this in iget_locked():
> > 
> > 
> > 	thread 1			thread 2
> > 
> > 	lock(inode_hash_lock)
> > 	for_each_hash_item()
> > 
> > 					inode->i_state = I_NEW
> > 					hash_list_insert
> > 
> > 	<finds newly inserted inode>
> > 
> > 					lock(inode->i_lock)
> > 					unlock(inode->i_lock)
> > 	unlock(inode_hash_lock)
> > 
> > 	wait_on_inode()
> > 	  <sees i_state = 0>
> > 	  <thinks inode initialisation is complete>
> > 
> > IOWs, there is no unlock->lock transition occurring on any lock, so
> > there are no implicit memory barriers in this code, and so other
> > CPUs are not guaranteed to see the "inode->i_state = I_NEW" write
> > that thread 2 did. The lock/unlock pair around this I_NEW assignment
> > guarantees that thread 1 will see the change to i_state correctly.
> > 
> > So even without RCU, dropping the i_lock from these
> > i_state/hash insert/remove operations will result in races
> > occurring...
>
> This interleave can never happen because of inode_hash_lock.

Ah, sorry, I'm context switching too much right now.

s/lock(inode_hash_lock)/rcu_read_lock.

And that's the race condition that the locking order is *intended* to
avoid. It's just that we haven't done the last piece of the work,
which is replacing the read side inode_hash_lock usage with
rcu_read_lock.

> > Seriously, if you want to improve the locking of this code, go back
> > an resurrect the basic RCU hash traversal patches (i.e. Nick's
> > original patch rather than my SLAB_DESTROY_BY_RCU based ones). That
> > has much more benefit to many more workloads than just removing a
> > non-global, uncontended locks like this patch set does.
> > 
> 
> Ah, this is intended to be a code clean patchset actually. I thought these
> locks are

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-24 Thread Dave Chinner
On Mon, Sep 24, 2012 at 03:08:52PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 04:28:12PM +1000, Dave Chinner wrote:
> > > Ah, this is intended to be a code clean patchset actually. I thought these
> > > locks are redundant in an obvious and trivial manner. If, on the 
> > > contrary, 
> > > they are such tricky, then never mind :) Thanks for your patient.
> > 
> > The RCU conversion is actually trivial - everything is already set
> > up for it to be done, and is simpler than this patch set. It pretty
> > much is simply replacing all the read side inode_hash_lock pairs
> > with rcu_read_lock()/rcu_read_unlock() pairs. Like I said, if you
> > want to clean up this code, then RCU traversals are the conversion
> > to make.
> >
> 
> Thanks for your suggestion. Though I doubt it's such trivial, I will try this
> after a little investigation.

Probably best to start with the patch below - it's run under heavy
concurrent load on ext4 with a working set of inodes about 20x
larger than can fit in memory for the past hour....

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

fs: Use RCU lookups for inode cache

From: Dave Chinner 

Convert inode cache lookups to be protected by RCU locking rather
than the global inode_hash_lock. This will improve scalability of
inode lookup intensive workloads.

Smoke tested w/ ext4 on concurrent fsmark/lookup/unlink workloads
over 50 million or so inodes...

Signed-off-by: Dave Chinner 
---
 fs/inode.c |   74 ++--
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ac8d904..2e92674 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -464,7 +464,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long 
hashval)
 
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
-   hlist_add_head(&inode->i_hash, b);
+   hlist_add_head_rcu(&inode->i_hash, b);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
 }
@@ -480,7 +480,7 @@ void __remove_inode_hash(struct inode *inode)
 {
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
-   hlist_del_init(&inode->i_hash);
+   hlist_del_init_rcu(&inode->i_hash);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
 }
@@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
struct hlist_head *head,
int (*test)(struct inode *, void *),
-   void *data)
+   void *data, bool locked)
 {
struct hlist_node *node;
struct inode *inode = NULL;
 
 repeat:
-   hlist_for_each_entry(inode, node, head, i_hash) {
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(inode, node, head, i_hash) {
spin_lock(&inode->i_lock);
+   if (inode_unhashed(inode)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
if (inode->i_sb != sb) {
spin_unlock(&inode->i_lock);
continue;
@@ -800,13 +805,20 @@ repeat:
continue;
}
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+   rcu_read_unlock();
+   if (locked)
+   spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
+   if (locked)
+   spin_lock(&inode_hash_lock);
goto repeat;
}
__iget(inode);
spin_unlock(&inode->i_lock);
+   rcu_read_unlock();
return inode;
}
+   rcu_read_unlock();
return NULL;
 }
 
@@ -815,14 +827,20 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-   struct hlist_head *head, unsigned long ino)
+struct hlist_head *head,
+unsigned long ino, bool locked)
 {
struct hlist_node *node;
struct inode *inode = NULL;
 
 repeat:
-   hlist_for_each_entry(inode, node, head, i_hash) {
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(inode, node, head, i_hash) {
spin_lock(&inode->i_lock);
+   if (inode_unhashed(inode)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
if (inode->i_i

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-24 Thread Dave Chinner
9.42MB/sec |  10.55MB/sec |
> > |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
> > |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
> > |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
> > |8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
> > |4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
> > -

While this set of numbers looks good, it's very limited in scope.
I can't evaluate whether the change is worthwhile or not from this
test. If I was writing this patch, the questions I'd be seeking to
answer before proposing it for inclusion are as follows

1. what's the comparison in performance to typical NFS
server writeback parameter tuning? i.e. dirty_background_ratio=5,
dirty_ratio=10, dirty_expire_centiseconds=1000,
dirty_writeback_centisecs=1? i.e. does this give change give any
benefit over the current common practice for configuring NFS
servers?

2. what happens when you have 10 clients all writing to the server
at once? Or a 100? NFS servers rarely have a single writer to a
single file at a time, so what impact does this change have on
multiple concurrent file write performance from multiple clients?

3. Following on from the multiple client test, what difference does it
make to file fragmentation rates? Writing more frequently means
smaller allocations and writes, and that tends to lead to higher
fragmentation rates, especially when multiple files are being
written concurrently. Higher fragmentation also means lower
performance over time as fragmentation accelerates filesystem aging
effects on performance.  IOWs, it may be faster when new, but it
will be slower 3 months down the track and that's a bad tradeoff to
make.

4. What happens for higher bandwidth network links? e.g. gigE or
10gigE? Are the improvements still there? Or does it cause
regressions at higher speeds? I'm especially interested in what
happens to multiple writers at higher network speeds, because that's
a key performance metric used to measure enterprise level NFS
servers.

5. Are the improvements consistent across different filesystem
types?  We've had writeback changes in the past cause improvements
on one filesystem but significant regressions on others.  I'd
suggest that you need to present results for ext4, XFS and btrfs so
that we have a decent idea of what we can expect from the change to
the generic code.

Yeah, I'm asking a lot of questions. That's because the generic
writeback code is extremely important to performance and the impact
of a change cannot be evaluated from a single test.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 01/10] vfs: introduce private rb structures

2012-09-25 Thread Dave Chinner
ATA_TYPE_RANGE (1 << 1)

is just fine by itself.


> +/* A frequency data struct holds values that are used to
> + * determine temperature of files and file ranges. These structs
> + * are members of hot_inode_item and hot_range_item
> + */

/*
 * This is a
 * multiline comment. ;)
 */

> +struct hot_freq_data {
> + struct timespec last_read_time;
> + struct timespec last_write_time;
> + u32 nr_reads;
> + u32 nr_writes;
> + u64 avg_delta_reads;
> + u64 avg_delta_writes;
> + u8 flags;
> + u32 last_temperature;

may as well make the flags a u32 - the compiler will use that much
space anyway as it aligned the u32 last_temperature variable after
it.
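
(Quick illustration of the padding point, assuming natural alignment:

        struct a { u8 flags;  u32 last_temperature; }; /* sizeof == 8, 3 pad bytes */
        struct b { u32 flags; u32 last_temperature; }; /* sizeof == 8, no padding  */

so widening flags to u32 costs nothing.)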

> +};
> +
> +/* An item representing an inode and its access frequency */
> +struct hot_inode_item {
> + /* node for hot_inode_tree rb_tree */
> + struct rb_node rb_node;
> + /* tree of ranges in this inode */
> + struct hot_range_tree hot_range_tree;
> + /* frequency data for this inode */
> + struct hot_freq_data hot_freq_data;
> + /* inode number, copied from inode */
> + unsigned long i_ino;
> + /* used to check for errors in ref counting */
> + u8 in_tree;
> + /* protects hot_freq_data, i_no, in_tree */
> + spinlock_t lock;
> + /* prevents kfree */
> + struct kref refs;

It's hard to see the code in the comments, and some of the comments
are redundant. It's easier to read if you do this:

struct hot_inode_item {
struct rb_node node;/* hot_inode_tree index */
struct hot_range_tree hot_range_tree;   /* tree of ranges */
struct hot_freq_data hot_freq_data; /* frequency data */
unsigned long i_ino;/* inode number from inode */
u8 in_tree; /* ref counting check */
spinlock_t lock;/* protects object data */
struct kref refs;   /* prevents kfree */
}

Also: 
- i_ino really needs to be a 64 bit quantity as some
  filesystems can use 64 bit inode numbers even on 32
  bit systems (e.g. XFS).
- in_tree can be u32 or a flags field if it is boolean. if
  it is just debug, then maybe it can be removed when the
  code is ready for commit.

> +};
> +
> +/*
> + * An item representing a range inside of an inode whose frequency
> + * is being tracked
> + */
> +struct hot_range_item {
> + /* node for hot_range_tree rb_tree */
> + struct rb_node rb_node;
> + /* frequency data for this range */
> + struct hot_freq_data hot_freq_data;
> + /* the hot_inode_item associated with this hot_range_item */
> + struct hot_inode_item *hot_inode;
> + /* starting offset of this range */
> + u64 start;
> + /* length of this range */
> + u64 len;

What units?
u64 start;  /* start offset in bytes */
u64 len /* length in bytes */

> + /* used to check for errors in ref counting */
> + u8 in_tree;
> + /* protects hot_freq_data, start, len, and in_tree */
> + spinlock_t lock;
> + /* prevents kfree */
> + struct kref refs;
> +};
> +
> +struct hot_info {
> + /* red-black tree that keeps track of fs-wide hot data */
> + struct hot_inode_tree hot_inode_tree;
> +};

The comment is redundant...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 02/10] vfs: add support for updating access frequency

2012-09-25 Thread Dave Chinner
 manner
to xfs_buf_get_map/_xfs_buf_find()

(And yes, you'll see a lot of similarities between that code and the
suggestions I've been making, like a single function that does both
lookup and insert...)

> +
> + spin_lock(&he->lock);
> + hot_rb_update_freq(&he->hot_freq_data, rw);
> + spin_unlock(&he->lock);

Probably should move the he->lock into hot_inode_freq_update().

> +
> +out:
> + return he;
> +}
> +
> +/* Update range frequency struct */
> +static bool hot_rb_update_range_freq(struct hot_inode_item *he,
> + u64 off, u64 len, int rw,
> + struct hot_info *root)
> +{
> + struct hot_range_tree *hrtree = &(he->hot_range_tree);
> + struct hot_range_item *hr = NULL;
> + u64 start_off = off & RANGE_SIZE_MASK;
> + u64 end_off = (off + len - 1) & RANGE_SIZE_MASK;
> + u64 cur;
> + int ret = true;
> +
> + if (len == 0)
> + return false;
> +
> + /*
> +  * Align ranges on RANGE_SIZE boundary to prevent proliferation
> +  * of range structs
> +  */
> + for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) {
> + read_lock(&hrtree->lock);
> + hr = hot_rb_lookup_hot_range_item(hrtree, cur);
> + read_unlock(&hrtree->lock);
> +
> + if (!hr) {
> + hr = kmem_cache_alloc(hot_range_item_cache,
> + GFP_KERNEL | GFP_NOFS);
> + if (!hr) {
> + ret = false;
> + goto out;
> + }
> +
> + write_lock(&hrtree->lock);
> + hot_rb_range_item_init(hr);
> + hr->start = cur & RANGE_SIZE_MASK;
> + hr->len = RANGE_SIZE;
> + hr->hot_inode = he;
> + hot_rb_add_hot_range_item(hrtree, hr);
> + write_unlock(&hrtree->lock);
> + }
> +
> + spin_lock(&hr->lock);
> + hot_rb_update_freq(&hr->hot_freq_data, rw);
> + spin_unlock(&hr->lock);
> + hot_rb_free_hot_range_item(hr);
> + }

Same comments again about locking.

I note that the code will always insert range items of a length
RANGE_SIZE. This means you have a fixed object granularity and hence
you have no need for a range based search. That is, you could use a
radix tree where each entry in the radix tree points directly to the
range object similar to how the page cache uses a radix tree for
indexing pages. That brings the possibility of lockless range item
lookups
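
As a rough sketch of what that could look like - RANGE_SHIFT, the
hot_range_root field and the refcounting here are illustrative only,
and it assumes the range items are freed via RCU:

        struct hot_range_item *hr;

        rcu_read_lock();
        hr = radix_tree_lookup(&he->hot_range_root, start >> RANGE_SHIFT);
        if (hr)
                kref_get(&hr->refs);    /* a real version must also close the
                                         * race against a concurrent final put */
        rcu_read_unlock();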


> +
> +out:
> + return ret;
> +}
> +
> +/* main function to update access frequency from read/writepage(s) hooks */
> +void hot_rb_update_freqs(struct inode *inode, u64 start,
> + u64 len, int rw)
> +{
> + struct hot_inode_item *he;
> +
> + he = hot_rb_update_inode_freq(inode, rw);
> +
> + WARN_ON(!he);
> +
> + if (he) {
> + hot_rb_update_range_freq(he, start, len,
> + rw, &(inode->i_sb->s_hotinfo));
> +
> + hot_rb_free_hot_inode_item(he);
> + }

This is very asymmetric. It would be much better to collapse all
the abstraction down to something much simpler, say:


int hot_inode_update_freqs()
{
        he = hot_inode_item_find(tree, inum, null)
        if (!he) {
                new_he = allocate()
                if (!new_he)
                        return -ENOMEM;

                he = hot_inode_item_find(tree, inum, new_he)
                if (he != new_he)
                        free new_he
        }
        hot_inode_freq_update(&he->hot_freq_data, write)

        for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) {
                hr = hot_range_item_find(tree, cur, NULL)
                if (!hr) {
                        new_hr = allocate()
                        if (!new_hr)
                                return -ENOMEM;

                        hr = hot_range_item_find(tree, cur, new_hr)
                        if (hr != new_hr)
                                free new_hr
                }
                hot_inode_freq_update(&hr->hot_freq_data, write)
                hot_range_item_put(hr);
        }
        hot_inode_item_put(he);
}






> +}
> +
> +/*
>   * Initialize kmem cache for hot_inode_item
>   * and hot_range_item
>   */
> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
> index 269b67a..2ba29e4 100644
> --- a/fs/hot_tracking.h
> +++ b/fs/hot_tracking.h
> @@ -21,6 +21,21 @@
>  #define FREQ_DATA_TYPE_INODE (1 << 0)
>  /* freq data struct is for a range */
>  #define FREQ_DATA_TYPE_RANGE (1 << 1)
> +/* size of sub-file ranges */
> +#define RANGE_SIZE (1 << 20)
> +#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1)))

You might want to state what units these ranges are in, and that
there is one tracking object per range per inode

> +#define FREQ_POWER 4
> +
> +struct hot_info;
> +struct inode;
> +
> +struct hot_inode_item

You've already included include/linux/hot_tracking.h, so you
shouldn't need these forward declarations...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-25 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
> From: Zhi Yong Wu 
> 
>   Introduce one new mount option '-o hottrack',
> and add its parsing support.
>   Its usage looks like:
>mount -o hottrack
>mount -o nouser,hottrack
>mount -o nouser,hottrack,loop
>mount -o hottrack,nouser

I think that this option parsing should be done by the filesystem,
even though the tracking functionality is in the VFS. That way only
the filesystems that can use the tracking information will turn it
on, rather than being able to turn it on for everything regardless
of whether it is useful or not.

Along those lines, just using a normal superblock flag to indicate
it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you
don't need to allocate the sb->s_hot_info structure just to be able
to check whether we are tracking hot inodes or not.

This then means the hot inode tracking for the superblock can be
initialised by the filesystem as part of its fill_super method,
along with the filesystem specific code that will use the hot
tracking information the VFS gathers
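
i.e. something along these lines in each filesystem that opts in (the
option variable and the init helper here are illustrative, not from
the patch set):

        /* in the filesystem's fill_super(), after parsing its own options */
        if (hot_track_opt) {
                sb->s_flags |= MS_HOT_INODE_TRACKING;
                hot_track_init(sb);     /* sets up sb->s_hot_info etc. */
        }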

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-25 Thread Dave Chinner
On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode 
> > *inode);
> >  static struct inode *find_inode(struct super_block *sb,
> > struct hlist_head *head,
> > int (*test)(struct inode *, void *),
> > -   void *data)
> > +   void *data, bool locked)
> >  {
> > struct hlist_node *node;
> > struct inode *inode = NULL;
> > 
> >  repeat:
> > -   hlist_for_each_entry(inode, node, head, i_hash) {
> > +   rcu_read_lock();
> > +   hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> > spin_lock(&inode->i_lock);
> > +   if (inode_unhashed(inode)) {
> > +   spin_unlock(&inode->i_lock);
> > +   continue;
> > +   }
> 
> Is this check too early? If the unhashed inode happened to be the target
> inode, we are wasting our time to continue the traversal and we do not wait 
> on it.

If the inode is unhashed, then it is already passing through evict()
or has already passed through. If it has already passed through
evict() then it is too late to call __wait_on_freeing_inode() as the
wakeup occurs in evict() immediately after the inode is removed
from the hash. i.e:

remove_inode_hash(inode);

spin_lock(&inode->i_lock);
wake_up_bit(&inode->i_state, __I_NEW);
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
spin_unlock(&inode->i_lock);

i.e. if we get the case:

Thread 1, RCU hash traversal		Thread 2, evicting foo

rcu_read_lock()
found inode foo
					remove_inode_hash(inode);
					spin_lock(&foo->i_lock);
					wake_up(I_NEW)
					spin_unlock(&foo->i_lock);
					destroy_inode()
					.....
spin_lock(foo->i_lock)
match sb, ino
<sees I_FREEING>
  rcu_read_unlock()
  <foo can now be freed at any time>

  wait_on_freeing_inode
    wait_on_bit(I_NEW)
    <the wakeup has already happened, and foo may already be freed>

Hence if the inode is unhashed, it doesn't matter what inode it is,
it is never valid to use it any further because it may have already
been freed and the only reason we can safely access it here is that
the RCU grace period will not expire until we call
rcu_read_unlock().

> > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, 
> > unsigned long ino)
> > struct inode *old;
> > 
> > spin_lock(&inode_hash_lock);
> > -   /* We released the lock, so.. */
> > -   old = find_inode_fast(sb, head, ino);
> > +   old = find_inode_fast(sb, head, ino, true);
> > if (!old) {
> > inode->i_ino = ino;
> > spin_lock(&inode->i_lock);
> 
> E ... couldn't we use memory barrier API instead of irrelevant spin
> lock on newly allocated inode to publish I_NEW?

Yes, we could.

However, having multiple synchronisation methods for a single
variable that should only be used in certain circumstances is
something that is easy to misunderstand and get wrong. Memory
barriers are much more subtle and harder to understand than spin
locks, and every memory barrier needs to be commented to explain
what the barrier is actually protecting against.

In the case where a spin lock is guaranteed to be uncontended and
the cache line hot in the CPU cache, it makes no sense to replace
the spin lock with a memory barrier, especially when every other
place we modify the i_state/i_hash fields we have to wrap them
with i_lock

Simple code is good code - save the complexity for something that
needs it.

> I go through many mails of the last trend of scaling VFS. Many patches
> seem quite natural, say RCU inode lookup

Sure, but the implementation in those RCU lookup patches sucked.

> or per-bucket inode hash lock or 

It was a bad idea. At minimum, you can't use lockdep on it. Worse
for the realtime guys is the fact it can't be converted to a
sleeping lock. Worst was the refusal to change it in any way to
address concerns.

And realistically, the fundamental problem is not with the
inode_hash_lock, it's with the fact that the cache is based on a
hash table rather than a more scalable structure like a radix tree
or btree. This is a primary reason for XFS having its own inode
cache - hashes can only hold a certain number of entries before
performance collapses catastrophically and so don't scale well to
tens o

Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page

2012-10-28 Thread Dave Chinner
On Sat, Oct 27, 2012 at 06:16:26PM -0400, Theodore Ts'o wrote:
> On Fri, Oct 26, 2012 at 10:24:23PM +, Luck, Tony wrote:
> > > Well, we could set a new attribute bit on the file which indicates
> > > that the file has been corrupted, and this could cause any attempts to
> > > open the file to return some error until the bit has been cleared.
> > 
> > That sounds a lot better than renaming/moving the file.
> 
> What I would recommend is adding a 
> 
> #define FS_CORRUPTED_FL   0x0100 /* File is corrupted */
> 
> ... and which could be accessed and cleared via the lsattr and chattr
> programs.

Except that there are filesystems that cannot implement such flags,
or require on-disk format changes to add more of those flags. This
is most definitely not a filesystem specific behaviour, so any sort
of VFS level per-file state needs to be kept in xattrs, not special
flags. Filesystems are welcome to optimise the storage of such
special xattrs (e.g. down to a single boolean flag in an inode), but
using an xattr that could, in fact, store the exact offset and length
of the corruption is far better than just storing a "something is
corrupted in this file" bit
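
A rough sketch of what that might look like from the kernel side - the
xattr name and the encoding are made up for illustration, and a real
version would need to handle multiple ranges:

        struct corrupt_range {
                __le64  offset;                 /* byte offset of corruption */
                __le64  len;                    /* length in bytes */
        } cr = {
                .offset = cpu_to_le64(offset),
                .len    = cpu_to_le64(len),
        };

        vfs_setxattr(dentry, "system.corrupted-range", &cr, sizeof(cr), 0);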

> > > Application programs could also get very confused when any attempt to
> > > open or read from a file suddenly returned some new error code (EIO,
> > > or should we designate a new errno code for this purpose, so there is
> > > a better indication of what the heck was going on?)
> > 
> > EIO sounds wrong ... but it is perhaps the best of the existing codes. 
> > Adding
> > a new one is also challenging too.
> 
> I think we really need a different error code from EIO; it's already
> horribly overloaded already, and if this is new behavior when the
> customers get confused and call up the distribution help desk, they
> won't thank us if we further overload EIO.  This is abusing one of the
> System V stream errno's, but no one else is using it:
> 
> #define EADV   68  /* Advertise error */
> 
> I note that we've already added a new error code:
> 
> #define EHWPOISON 133   /* Memory page has hardware error */
> 
> ... although the glibc shipping with Debian testing hasn't been taught
> what it is, so strerror(EHWPOISON) returns "Unknown error 133".  We
> could simply allow open(2) and stat(2) return this error, although I
> wonder if we're just better off defining a new error code.

If we are going to add special new "file corrupted" errors, we
should add EFSCORRUPTED (i.e. "filesystem corrupted") at the same
time

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v4 03/15] vfs,hot_track: add the function for collecting I/O frequency

2012-10-28 Thread Dave Chinner
On Sun, Oct 28, 2012 at 09:51:48PM +0800, Zhi Yong Wu wrote:
> On Sun, Oct 28, 2012 at 3:55 PM, Zheng Liu  wrote:
> > Hi Zhiyong,
> >
> > On Thu, Oct 25, 2012 at 11:08:55PM +0800, zwu.ker...@gmail.com wrote:
> > [snip]
> >> @@ -199,6 +342,54 @@ err:
> >>  }
> >>
> >>  /*
> >> + * Main function to update access frequency from read/writepage(s) hooks
> >> + */
> >> +inline void hot_update_freqs(struct inode *inode, u64 start,
> >> + u64 len, int rw)
> >
> > This function seems too big.  So we really need to inline this function?
> As Dave said in his comments, it will add a function call
> overhead even when tracking is not enabled. a static inline function
> will just result in no extra overhead other than the if
> statement

I don't think I said that with respect to this code. I think I said
it w.r.t. a define or a small wrapper that decides to call
hot_update_freqs().  A static inline function should only be a
couple of lines of code at most.

A static function, OTOH, can be inlined by the compiler if the
compiler thinks that is a win. But

.

> >> +EXPORT_SYMBOL_GPL(hot_update_freqs);

... it's an exported function, so it can't be inline or static, so
using "inline" is wrong whatever way you look at it. ;)
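
i.e. the usual way to get both is a tiny static inline wrapper in the
header that does the check, calling a normal exported function that
does the work. A sketch, with the names and the flag check being
illustrative only:

        static inline void hot_update_freqs(struct inode *inode, u64 start,
                                            u64 len, int rw)
        {
                if (inode->i_sb->s_flags & MS_HOT_INODE_TRACKING)
                        __hot_update_freqs(inode, start, len, rw);
        }

where __hot_update_freqs() is the out-of-line, EXPORT_SYMBOL_GPL'd
function in fs/hot_tracking.c.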

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hang in XFS reclaim on 3.7.0-rc3

2012-10-29 Thread Dave Chinner
On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
> After experiencing a hang of all IO yesterday (
> http://marc.info/?l=linux-kernel&m=135142236520624&w=2 ), I turned on
> LOCKDEP after upgrading to -rc3.
> 
> I then tried to replicate the load that hung yesterday and got the
> following lockdep report, implicating XFS instead of by stacking swap
> onto dm-crypt and md.
> 
> [ 2844.971913]
> [ 2844.971920] =
> [ 2844.971921] [ INFO: inconsistent lock state ]
> [ 2844.971924] 3.7.0-rc3 #1 Not tainted
> [ 2844.971925] -
> [ 2844.971927] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [ 2844.971929] kswapd0/725 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 2844.971931] (&(&ip->i_lock)->mr_lock){?.}, at: [] 
> xfs_ilock+0x84/0xb0
> [ 2844.971941] {RECLAIM_FS-ON-W} state was registered at:
> [ 2844.971942]   [] mark_held_locks+0x7e/0x130
> [ 2844.971947]   [] lockdep_trace_alloc+0x63/0xc0
> [ 2844.971949]   [] kmem_cache_alloc+0x35/0xe0
> [ 2844.971952]   [] vm_map_ram+0x271/0x770
> [ 2844.971955]   [] _xfs_buf_map_pages+0x46/0xe0
> [ 2844.971959]   [] xfs_buf_get_map+0x8a/0x130
> [ 2844.971961]   [] xfs_trans_get_buf_map+0xa9/0xd0
> [ 2844.971964]   [] xfs_ifree_cluster+0x129/0x670
> [ 2844.971967]   [] xfs_ifree+0xe9/0xf0
> [ 2844.971969]   [] xfs_inactive+0x2af/0x480
> [ 2844.971972]   [] xfs_fs_evict_inode+0x70/0x80
> [ 2844.971974]   [] evict+0xaf/0x1b0
> [ 2844.971977]   [] iput+0x105/0x210
> [ 2844.971979]   [] dentry_iput+0xa0/0xe0
> [ 2844.971981]   [] dput+0x150/0x280
> [ 2844.971983]   [] sys_renameat+0x21b/0x290
> [ 2844.971986]   [] sys_rename+0x16/0x20
> [ 2844.971988]   [] system_call_fastpath+0x16/0x1b

We shouldn't be mapping pages there. See if the patch below fixes
it.

Fundamentally, though, the lockdep warning has come about because
vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
doing GFP_NOFS - we are within a transaction here, so memory reclaim
is not allowed to recurse back into the filesystem.

mm-folk: can we please get this vmalloc/gfp_flags passing API
fixed once and for all? This is the fourth time in the last month or
so that I've seen XFS bug reports with silent hangs and associated
lockdep output that implicate GFP_KERNEL allocations from vm_map_ram
in GFP_NOFS conditions as the potential cause....

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: don't vmap inode cluster buffers during free

From: Dave Chinner 

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_inode.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c4add46..82f6e5d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1781,7 +1781,8 @@ xfs_ifree_cluster(
 * to mark all the active inodes on the buffer stale.
 */
bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
-   mp->m_bsize * blks_per_cluster, 0);
+   mp->m_bsize * blks_per_cluster,
+   XBF_UNMAPPED);
 
if (!bp)
return ENOMEM;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hang in XFS reclaim on 3.7.0-rc3

2012-10-29 Thread Dave Chinner
[add the linux-mm cc I forgot to add before sending]

On Tue, Oct 30, 2012 at 09:26:13AM +1100, Dave Chinner wrote:
> On Mon, Oct 29, 2012 at 09:03:15PM +0100, Torsten Kaiser wrote:
> > After experiencing a hang of all IO yesterday (
> > http://marc.info/?l=linux-kernel&m=135142236520624&w=2 ), I turned on
> > LOCKDEP after upgrading to -rc3.
> > 
> > I then tried to replicate the load that hung yesterday and got the
> > following lockdep report, implicating XFS instead of by stacking swap
> > onto dm-crypt and md.
> > 
> > [ 2844.971913]
> > [ 2844.971920] =
> > [ 2844.971921] [ INFO: inconsistent lock state ]
> > [ 2844.971924] 3.7.0-rc3 #1 Not tainted
> > [ 2844.971925] -
> > [ 2844.971927] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> > [ 2844.971929] kswapd0/725 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > [ 2844.971931] (&(&ip->i_lock)->mr_lock){?.}, at: [] 
> > xfs_ilock+0x84/0xb0
> > [ 2844.971941] {RECLAIM_FS-ON-W} state was registered at:
> > [ 2844.971942]   [] mark_held_locks+0x7e/0x130
> > [ 2844.971947]   [] lockdep_trace_alloc+0x63/0xc0
> > [ 2844.971949]   [] kmem_cache_alloc+0x35/0xe0
> > [ 2844.971952]   [] vm_map_ram+0x271/0x770
> > [ 2844.971955]   [] _xfs_buf_map_pages+0x46/0xe0
> > [ 2844.971959]   [] xfs_buf_get_map+0x8a/0x130
> > [ 2844.971961]   [] xfs_trans_get_buf_map+0xa9/0xd0
> > [ 2844.971964]   [] xfs_ifree_cluster+0x129/0x670
> > [ 2844.971967]   [] xfs_ifree+0xe9/0xf0
> > [ 2844.971969]   [] xfs_inactive+0x2af/0x480
> > [ 2844.971972]   [] xfs_fs_evict_inode+0x70/0x80
> > [ 2844.971974]   [] evict+0xaf/0x1b0
> > [ 2844.971977]   [] iput+0x105/0x210
> > [ 2844.971979]   [] dentry_iput+0xa0/0xe0
> > [ 2844.971981]   [] dput+0x150/0x280
> > [ 2844.971983]   [] sys_renameat+0x21b/0x290
> > [ 2844.971986]   [] sys_rename+0x16/0x20
> > [ 2844.971988]   [] system_call_fastpath+0x16/0x1b
> 
> We shouldn't be mapping pages there. See if the patch below fixes
> it.
> 
> Fundamentally, though, the lockdep warning has come about because
> vm_map_ram is doing a GFP_KERNEL allocation when we need it to be
> doing GFP_NOFS - we are within a transaction here, so memory reclaim
> is not allowed to recurse back into the filesystem.
> 
> mm-folk: can we please get this vmalloc/gfp_flags passing API
> fixed once and for all? This is the fourth time in the last month or
> so that I've seen XFS bug reports with silent hangs and associated
> lockdep output that implicate GFP_KERNEL allocations from vm_map_ram
> in GFP_NOFS conditions as the potential cause
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> xfs: don't vmap inode cluster buffers during free
> 
> From: Dave Chinner 
> 
> Signed-off-by: Dave Chinner 
> ---
>  fs/xfs/xfs_inode.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c4add46..82f6e5d 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1781,7 +1781,8 @@ xfs_ifree_cluster(
>* to mark all the active inodes on the buffer stale.
>*/
>   bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
> - mp->m_bsize * blks_per_cluster, 0);
> + mp->m_bsize * blks_per_cluster,
> + XBF_UNMAPPED);
>  
>   if (!bp)
>   return ENOMEM;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page

2012-10-30 Thread Dave Chinner
On Mon, Oct 29, 2012 at 06:11:58PM +, Luck, Tony wrote:
> > What I would recommend is adding a 
> >
> > #define FS_CORRUPTED_FL 0x0100 /* File is corrupted */
> >
> > ... and which could be accessed and cleared via the lsattr and chattr
> > programs.
> 
> Good - but we need some space to save the corrupted range information
> too. These errors should be quite rare, so one range per file should be
> enough.
> 
> New file systems should plan to add space in their on-disk format. The
> corruption isn't going to go away across a reboot.

No, not at all. If you want to store something in the filesystem
permanently, then use xattrs. You cannot rely on the filesystem
being able to store random application specific data in their
on-disk format. That's the *exact purpose* that xattrs were
invented for - they are an extensible, user-defined, per-file
metadata storage mechanism that is not tied to the filesystem
on-disk format.

The kernel already makes extensive use of xattrs for such metadata -
just look at all the security and integrity code that uses xattrs to
store their application-specific metadata.  Hence *anything* that
the kernel wants to store on permanent storage should be using
xattrs because then the application has complete control of what is
stored without caring about what filesystem it is storing it on.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoid useless inodes and dentries reclamation

2013-08-29 Thread Dave Chinner
On Wed, Aug 28, 2013 at 02:52:12PM -0700, Tim Chen wrote:
> This patch detects that when free inodes and dentries are really
> low, their reclamation is skipped so we do not have to contend
> on the global sb_lock uselessly under memory pressure. Otherwise
> we create a log jam trying to acquire the sb_lock in prune_super(),
> with little or no freed memory to show for the effort.
> 
> The profile below shows a multi-threaded large file read exerting
> pressure on memory with page cache usage.  It is dominated
> by the sb_lock contention in the cpu cycles profile.  The patch
> eliminates the sb_lock contention almost entirely for prune_super().
> 
> 43.94%   usemem  [kernel.kallsyms] [k] _raw_spin_lock
>  |
>  --- _raw_spin_lock
> |
> |--32.44%-- grab_super_passive
> |  prune_super
> |  shrink_slab
> |  do_try_to_free_pages
> |  try_to_free_pages
> |  __alloc_pages_nodemask
> |  alloc_pages_current
> |
> |--32.18%-- put_super
> |  drop_super
> |  prune_super
> |  shrink_slab
> |  do_try_to_free_pages
> |  try_to_free_pages
> |  __alloc_pages_nodemask
> |  alloc_pages_current
> 
> Signed-off-by: Tim Chen 
> ---
>  fs/super.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 68307c0..70fa26c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
>   * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
>   * take a passive reference to the superblock to avoid this from occurring.
>   */
> +#define SB_CACHE_LOW 5
>  static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  {
>   struct super_block *sb;
> @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct 
> shrink_control *sc)
>   if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
>   return -1;
>  
> + /*
> +  * Don't prune if we have few cached objects to reclaim to
> +  * avoid useless sb_lock contention
> +  */
> + if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> + return -1;

Those counters no longer exist in the current mmotm tree and the
shrinker infrastructure is somewhat different, so this patch isn't
the right way to solve this problem.

Given that superblock LRUs and shrinkers in mmotm are node aware,
there may even be more pressure on the sblock in such a workload.  I
think the right way to deal with this is to give the shrinker itself
a "minimum call count" so that we can avoid even attempting to
shrink caches that don't have enough entries in them to be worth
shrinking.
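
e.g. something along these lines in the shrinker core - the
min_objects field is hypothetical:

        freeable = shrinker->count_objects(shrinker, shrinkctl);
        if (freeable < shrinker->min_objects)
                return 0;       /* too few objects to be worth the
                                 * locking cost of a scan */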

That said, the memcg guys have been saying that even small numbers
of items per cache can be meaningful in terms of memory reclaim
(e.g. when there are lots of memcgs), so such a threshold might
only be appropriate for caches that are not memcg controlled. In
that case, handling it in the shrinker infrastructure itself is a
much better idea than hacking thresholds into individual shrinker
callouts.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-03 Thread Dave Chinner
On Wed, Oct 02, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 1 Oct 2013, Joe Thornber wrote:
> 
> > > Alternatively, delaying them will stall the filesystem because it's
> > > waiting for said REQ_FUA IO to complete. For example, journal writes
> > > in XFS are extremely IO latency sensitive in workloads that have a
> > > significant number of ordering constraints (e.g. O_SYNC writes,
> > > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> > > filesystem for the majority of that barrier_deadline_ms.
> > 
> > Yes, this is a valid concern, but I assume Akira has benchmarked.
> > With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
> > see if there are any other FUA requests on my queue that can be
> > aggregated into a single flush.  I agree with you that the target
> > should never delay waiting for new io; that's asking for trouble.
> > 
> > - Joe
> 
> You could send the first REQ_FUA/REQ_FLUSH request directly to the disk 
> and aggregate all the requests that were received while you processed the 
> initial request. This way, you can do request batching without introducing 
> artificial delays.

Yes, that's what XFS does with its log when lots of fsync requests
come in. i.e. the first is dispatched immediately, and the others
are gathered into the next log buffer until it is either full or the
original REQ_FUA log write completes.
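
For illustration, a toy model of that "dispatch the first, aggregate
the rest" pattern (userspace sketch only - not the XFS log code and
not a dm target):

#include <stdbool.h>
#include <stdio.h>

/*
 * The first flush is dispatched immediately; anything arriving while
 * it is in flight is aggregated and issued as a single flush when
 * the first one completes.  No artificial timeout is involved.
 */
static bool in_flight;
static int batched;

static void issue_flush(const char *why)
{
	printf("issue REQ_FLUSH/REQ_FUA (%s)\n", why);
}

static void flush_request(void)
{
	if (!in_flight) {
		in_flight = true;
		issue_flush("first request, sent straight away");
	} else {
		batched++;	/* piggyback on the in-flight flush */
	}
}

static void flush_completed(void)
{
	in_flight = false;
	if (batched) {
		printf("%d requests aggregated into one flush\n", batched);
		batched = 0;
		in_flight = true;
		issue_flush("aggregated batch");
	}
}

int main(void)
{
	flush_request();	/* dispatched immediately */
	flush_request();	/* arrives while the first is in flight */
	flush_request();
	flush_completed();	/* first completes, the batch goes out */
	return 0;
}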

That's where arbitrary delays in the storage stack below XFS cause
problems - if the first FUA log write is delayed, the next log
buffer will get filled, issued and delayed, and when we run out of
log buffers (there are 8 maximum) the entire log subsystem will
stall, stopping *all* log commit operations until log buffer
IOs complete and become free again. i.e. it can stall modifications
across the entire filesystem while we wait for batch timeouts to
expire and issue and complete FUA requests.

IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
point where they are issued - any attempt to further optimise them
by adding delays down in the stack to aggregate FUA operations will
only increase latency of the operations that the issuer wants to
have complete as fast as possible.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: fs/attr.c:notify_change locking warning.

2013-10-04 Thread Dave Chinner
On Fri, Oct 04, 2013 at 08:52:10PM -0400, Dave Jones wrote:
> WARNING: CPU: 3 PID: 26128 at fs/attr.c:178 notify_change+0x34d/0x360()
> Modules linked in: dlci 8021q garp sctp snd_seq_dummy bridge stp tun fuse 
> rfcomm hidp ipt_ULOG nfc caif_socket caif af_802154 phonet af_rxrpc bnep 
> bluetooth rfkill can_bcm can_raw can llc2 pppoe pppox ppp_generic slhc irda 
> crc_ccitt rds scsi_transport_iscsi nfnetlink af_key rose x25 atm netrom 
> appletalk ipx p8023 psnap p8022 llc ax25 coretemp hwmon x86_pkg_temp_thermal 
> kvm_intel kvm snd_hda_codec_hdmi xfs snd_hda_codec_realtek snd_hda_intel 
> snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc 
> libcrc32c snd_timer crct10dif_pclmul crc32c_intel snd ghash_clmulni_intel 
> e1000e microcode usb_debug serio_raw pcspkr ptp soundcore pps_core shpchp
> CPU: 3 PID: 26128 Comm: trinity-child57 Not tainted 3.12.0-rc3+ #93 
>  81a3e2ec 880071d71bb8 8172a763 
>  880071d71bf0 810552cd 0a00 88023d6e8b90
>  8ad0 880071d71c50 8802392882c8 880071d71c00
> Call Trace:
>  [] dump_stack+0x4e/0x82
>  [] warn_slowpath_common+0x7d/0xa0
>  [] warn_slowpath_null+0x1a/0x20
>  [] notify_change+0x34d/0x360
>  [] file_remove_suid+0x87/0xa0
>  [] ? __mnt_drop_write+0x29/0x50
>  [] ? __mnt_drop_write_file+0x12/0x20
>  [] ? file_update_time+0x8a/0xd0
>  [] xfs_file_aio_write_checks+0xea/0xf0 [xfs]
>  [] xfs_file_dio_aio_write+0xd0/0x3e0 [xfs]
>  [] xfs_file_aio_write+0x129/0x130 [xfs]
>  [] do_sync_readv_writev+0x4c/0x80
>  [] do_readv_writev+0xbb/0x240
>  [] ? xfs_file_buffered_aio_write+0x2a0/0x2a0 [xfs]
>  [] ? do_sync_read+0x90/0x90
>  [] ? _raw_spin_unlock+0x31/0x60
>  [] ? trace_hardirqs_on_caller+0x16/0x1e0
>  [] vfs_writev+0x38/0x60
>  [] SyS_writev+0x50/0xd0
>  [] tracesys+0xdd/0xe2
> ---[ end trace 201843ae71ab5a7c ]---
> 
> 
> 177 
> 178 WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));

Yup, we don't hold the i_mutex *at all* through the fast path for
direct IO writes. Having to grab the i_mutex on every IO just for
the extremely unlikely case we need to remove a suid bit on the file
would add a significant serialisation point into the direct IO model
that XFS uses, and is the difference between 50,000 and 2+ million
direct IO IOPS to a single file.

I'm unwilling to sacrifice the concurrency of direct IO writes just
to shut up this warning, especially as the actual modifications that
are made to remove SUID bits are correctly serialised within XFS
once notify_change() calls ->setattr(). If it really matters, I'll
just open code file_remove_suid() into XFS like ocfs2 does just so
we don't get that warning being emitted by trinity.
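
If it came to open coding it, the shape would be something along
these lines (userspace model with hypothetical names, not the actual
XFS or ocfs2 code): test the mode bits up front so the common case
never takes the lock at all.

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

static bool needs_setid_removal(mode_t mode)
{
	if (mode & S_ISUID)
		return true;
	/* setgid only matters when group execute is also set */
	if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
		return true;
	return false;
}

static void dio_write_checks(mode_t mode)
{
	if (!needs_setid_removal(mode)) {
		printf("mode %04o: fast path, no i_mutex taken\n",
		       (unsigned int)mode);
		return;
	}
	/*
	 * Rare case: take the lock and clear the bits, serialised
	 * properly inside the filesystem's own setattr path.
	 */
	printf("mode %04o: slow path, strip the set-id bits\n",
	       (unsigned int)mode);
}

int main(void)
{
	dio_write_checks(0644);		/* normal file */
	dio_write_checks(04755);	/* setuid-style mode */
	return 0;
}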

FWIW, buffered IO on XFS - the normal case for most operations -
holds the i_mutex over the call to file_remove_suid(), and that's
why this warning is pretty much never seen - direct IO writes to
suid files are very unusual.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: linux-next: slab shrinkers: BUG at mm/list_lru.c:92

2013-07-09 Thread Dave Chinner
On Mon, Jul 08, 2013 at 02:53:52PM +0200, Michal Hocko wrote:
> On Thu 04-07-13 18:36:43, Michal Hocko wrote:
> > On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> > > On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > > > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > > > [...]
> > > > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > > > info, Michal, it's time to go look at the code
> > > > 
> > > > OK, just in case we will need it, I am keeping the machine in this state
> > > > for now. So we still can play with crash and check all the juicy
> > > > internals.
> > > 
> > > My current suspect is the LRU_RETRY code. I don't think what it is
> > > doing is at all valid - list_for_each_safe() is not safe if you drop
> > > the lock that protects the list. i.e. there is nothing that protects
> > > the stored next pointer from being removed from the list by someone
> > > else. Hence what I think is occurring is this:
> > > 
> > > 
> > > thread 1  thread 2
> > > lock(lru)
> > > list_for_each_safe(lru)   lock(lru)
> > >   isolate ..
> > > lock(i_lock)
> > > has buffers
> > >   __iget
> > >   unlock(i_lock)
> > >   unlock(lru)
> > >   .   (gets lru lock)
> > >   list_for_each_safe(lru)
> > > walks all the inodes
> > > finds inode being isolated by other thread
> > > isolate
> > >   i_count > 0
> > > list_del_init(i_lru)
> > > return LRU_REMOVED;
> > >  moves to next inode, inode that
> > >  other thread has stored as next
> > >  isolate
> > >i_state |= I_FREEING
> > >list_move(dispose_list)
> > >return LRU_REMOVED
> > >
> > >unlock(lru)
> > >   lock(lru)
> > >   return LRU_RETRY;
> > >   if (!first_pass)
> > > 
> > >   --nr_to_scan
> > >   (loop again using next, which has already been removed from the
> > >   LRU by the other thread!)
> > >   isolate
> > > lock(i_lock)
> > > if (i_state & ~I_REFERENCED)
> > >   list_del_init(i_lru)<<<<< inode is on dispose list!
> > >   <<<<< inode is now isolated, with I_FREEING set
> > >   return LRU_REMOVED;
> > > 
> > > That fits the corpse left on your machine, Michal. One thread has
> > > moved the inode to a dispose list, the other thread thinks it is
> > > still on the LRU and should be removed, and removes it.
> > > 
> > > This also explains the lru item count going negative - the same item
> > > is being removed from the lru twice. So it seems like all the
> > > problems you've been seeing are caused by this one problem
> > > 
> > > Patch below that should fix this.
> > 
> > Good news! The test was running since morning and it didn't hang nor
> > crashed. So this really looks like the right fix. It will run also
> > during weekend to be 100% sure. But I guess it is safe to say
> 
> Hmm, it seems I was too optimistic or we have yet another issue here (I
> guess the later is more probable).
> 
> The weekend testing got stuck as well. 

> 20761 [] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> [] xlog_grant_head_check+0xc6/0xe0 [xfs]
> [] xfs_log_reserve+0xff/0x240 [xfs]
> [] xfs_trans_reserve+0x234/0x240 [xfs]
> [] xfs_create+0x1a9/0x5c0 [xfs]
> [] xfs_vn_mknod+0x8a/0x1a0 [xfs]
> [] xfs_vn_create+0xe/0x10 [xfs]
> [] vfs_create+0xad/0xd0
> [] lookup_open+0x1b8/0x1d0
> [] do_last+0x2de/0x780
> [] path_openat+0xda/0x400
> [] do_filp_open+0x43/0xa0
> [] do_sys_open+0x160/0x1e0
> [] sys_open+0x1c/0x20
> [] system_call_fastpath+0x16/0x1b
> [] 0x

That's an XFS log space issue, indicating that it has run out of
space in the log and it is waiting for more to come free. That
requires IO completion to occur.
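
Going back to the LRU_RETRY analysis quoted above, here is a toy
illustration of the safe pattern (not the actual list_lru patch): if
the lock protecting the list can be dropped mid-walk, a saved next
pointer may already have been taken off the list by someone else, so
restart from the head instead of trusting it.

#include <stdio.h>

struct item {
	struct item *next;
	int isolated;	/* stands in for "moved to a dispose list" */
	int id;
};

static void walk(struct item *head)
{
	struct item *i;

restart:
	for (i = head; i; i = i->next) {
		if (i->isolated)
			continue;
		i->isolated = 1;
		printf("isolated item %d\n", i->id);
		/*
		 * This is where the real code drops and retakes the
		 * list lock; continuing from i->next here would be
		 * the bug, so go back to the head instead.
		 */
		goto restart;
	}
}

int main(void)
{
	struct item c = { NULL, 0, 3 };
	struct item b = { &c, 0, 2 };
	struct item a = { &b, 0, 1 };

	walk(&a);
	return 0;
}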

> [276962.652076] INFO: task xfs-data/sd

Re: [PATCH 10/13] xfs: use get_unused_fd_flags(0) instead of get_unused_fd()

2013-07-10 Thread Dave Chinner
On Wed, Jul 10, 2013 at 12:00:57PM +0200, Yann Droneaud wrote:
> Hi,
> 
> >On 09.07.2013 22:53, Ben Myers wrote:
> >On Mon, Jul 08, 2013 at 05:41:33PM -0500, Ben Myers wrote:
> >>On Tue, Jul 02, 2013 at 06:39:34PM +0200, Yann Droneaud wrote:
> 
> >>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> >>> index 5e99968..dc5b659 100644
> >>> --- a/fs/xfs/xfs_ioctl.c
> >>> +++ b/fs/xfs/xfs_ioctl.c
> >>> @@ -248,7 +248,7 @@ xfs_open_by_handle(
> >>>   goto out_dput;
> >>>   }
> >>>
> >>> - fd = get_unused_fd();
> >>> + fd = get_unused_fd_flags(0);
> >>
> >>O_CLOEXEC should be fine in this case.
> >>
> >>Reviewed-by: Ben Myers 
> >
> >Applied at git://oss.sgi.com/xfs/xfs.git.  Looks like I was wrong about
> >O_CLOEXEC being ok here.  There may be applications which
> >open_by_handle then
> >fork/exec and expect to still be able to use that file descriptor.
> >
> 
> OK, it's very important to not cause regression here.
> 
> For the record, xfs_open_by_handle() is not related to
> open_by_handle_at() syscall.
> 
> It's an ioctl (XFS_IOC_OPEN_BY_HANDLE) which is used by xfsprogs's
> libhandle
> in functions open_by_fshandle() and open_by_handle().
> 
> http://sources.debian.net/src/xfsprogs/3.1.9/libhandle/handle.c?hl=284#L284
> http://sources.debian.net/src/xfsprogs/3.1.9/libhandle/handle.c?hl=308#L308
> 
> According to codesearch.debian.org, libhandle's open_by_handle() is
> only used
> by xfsdump
> 
> http://sources.debian.net/src/xfsdump/3.1.1/restore/tree.c?hl=2534#L2534
> 
> So there are not many *known* users of this feature ... but it's
> more important not to break *unknown* users of it.

There are commercial products (i.e. proprietary, closed source) that
use it. SGI has one (DMF) and there are a couple of other backup
programs that have used it in the past, too.
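
To see why the flag choice matters to those users, here is a generic
userspace illustration (an arbitrary file stands in for the
handle-opened descriptor; this is not libhandle or xfsdump code): an
fd created with flags 0 survives exec, one created with O_CLOEXEC
does not.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* stand-in for the fd XFS_IOC_OPEN_BY_HANDLE would return */
	int fd = open("/etc/hostname", O_RDONLY /* | O_CLOEXEC */);
	char path[32];

	if (fd < 0) {
		perror("open");
		return 1;
	}

	snprintf(path, sizeof(path), "/dev/fd/%d", fd);

	/* with O_CLOEXEC, the exec'd program could not use this fd */
	execlp("cat", "cat", path, (char *)NULL);
	perror("execlp");
	return 1;
}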

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs: sync: fixed performance regression

2013-07-10 Thread Dave Chinner
On Wed, Jul 10, 2013 at 04:12:36PM -0700, Paul Taysom wrote:
> The following commit introduced a 10x regression for
> syncing inodes in ext4 with relatime enabled where just
> the atime had been modified.
> 
> commit 4ea425b63a3dfeb7707fc7cc7161c11a51e871ed
> Author: Jan Kara 
> Date:   Tue Jul 3 16:45:34 2012 +0200
> vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder 
> sync passes
> 
> See also: http://www.kernelhub.org/?msg=93100&p=2
> 
> Fixed by putting back in the call to writeback_inodes_sb.
> 
> I'll attach the test in a reply to this e-mail.
> 
> The test starts by creating 512 files, syncing, reading one byte
> from each of those files, syncing, and then deleting each file
> and syncing. The time to do each sync is printed. The process
> is then repeated for 1024 files and then the next power of
> two up to 262144 files.
> 
> Note, when running the test, the slow down doesn't always happen
> but most of the tests will show a slow down.

Can you please check if the patch attached to this mail:

http://marc.info/?l=linux-kernel&m=137276874025813&w=2

Fixes your problem?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

