Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-14 Thread Johannes Weiner
On Fri, Feb 14, 2014 at 03:05:41PM +0900, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > Thanks for the report.  There is already a fix for this in -mm:
> > http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
> > 
> > It was merged on the 7th, so it should show up in -next... any day
> > now?
> 
> That patch solved this problem but breaks the build instead.
> 
>   ERROR: "list_lru_init_key" [fs/xfs/xfs.ko] undefined!
>   ERROR: "list_lru_init_key" [fs/gfs2/gfs2.ko] undefined!
>   make[1]: *** [__modpost] Error 1

There is a follow-up fix in -mm:

http://marc.info/?l=linux-mm-commits&m=139180636814624&w=2
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-13 Thread Tetsuo Handa
Johannes Weiner wrote:
> Thanks for the report.  There is already a fix for this in -mm:
> http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
> 
> It was merged on the 7th, so it should show up in -next... any day
> now?

That patch solved this problem but breaks the build instead.

  ERROR: "list_lru_init_key" [fs/xfs/xfs.ko] undefined!
  ERROR: "list_lru_init_key" [fs/gfs2/gfs2.ko] undefined!
  make[1]: *** [__modpost] Error 1

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 2a5b8fd..f1a0db1 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -143,7 +143,7 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
 	}
 	return 0;
 }
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(list_lru_init_key);
 
 void list_lru_destroy(struct list_lru *lru)
 {



Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-13 Thread Stephen Rothwell
Hi Andrew,

On Thu, 13 Feb 2014 14:24:07 -0800 Andrew Morton wrote:
>
> On Thu, 13 Feb 2014 17:11:26 -0500 Johannes Weiner wrote:
> 
> > On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> > > Hello.
> > > 
> > > I got a lockdep warning shown below, and the bad commit seems to be de055616
> > > "mm: keep page cache radix tree nodes in check" as of next-20140212
> > > on linux-next.git.
> > 
> > Thanks for the report.  There is already a fix for this in -mm:
> > http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
> > 
> > It was merged on the 7th, so it should show up in -next... any day
> > now?
> 
> Tomorrow, if Stephen works weekends, which I expect he doesn't ;)

Actually later today (my time), since Friday is not a weekend :-(

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au


Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-13 Thread Andrew Morton
On Thu, 13 Feb 2014 17:11:26 -0500 Johannes Weiner wrote:

> Hi Tetsuo,
> 
> On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> > Hello.
> > 
> > I got a lockdep warning shown below, and the bad commit seems to be de055616
> > "mm: keep page cache radix tree nodes in check" as of next-20140212
> > on linux-next.git.
> 
> Thanks for the report.  There is already a fix for this in -mm:
> http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
> 
> It was merged on the 7th, so it should show up in -next... any day
> now?

Tomorrow, if Stephen works weekends, which I expect he doesn't ;)

http://ozlabs.org/~akpm/mmotm/broken-out/mm-keep-page-cache-radix-tree-nodes-in-check-fix.patch


Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-13 Thread Johannes Weiner
Hi Tetsuo,

On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> Hello.
> 
> I got a lockdep warning shown below, and the bad commit seems to be de055616
> "mm: keep page cache radix tree nodes in check" as of next-20140212
> on linux-next.git.

Thanks for the report.  There is already a fix for this in -mm:
http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2

It was merged on the 7th, so it should show up in -next... any day
now?

> Regards.
> 
> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 3.14.0-rc1-00099-gde05561 #126 Tainted: GF
> ---------------------------------------------------------
> swapper/0/0 just changed the state of lock:
>  (&(&mapping->tree_lock)->rlock){..-.-.}, at: [] test_clear_page_writeback+0x48/0x190
> but this lock took another, SOFTIRQ-unsafe lock in the past:
>  (&(&lru->node[i].lock)->rlock){+.+.-.}
> 
> and interrupts could create inverse lock ordering between them.
> 
> 
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&lru->node[i].lock)->rlock);
>                                local_irq_disable();
>                                lock(&(&mapping->tree_lock)->rlock);
>                                lock(&(&lru->node[i].lock)->rlock);
>   <Interrupt>
>     lock(&(&mapping->tree_lock)->rlock);
> 
>  *** DEADLOCK ***
> 
> no locks held by swapper/0/0.
> 
> the shortest dependencies between 2nd lock and 1st lock:
>  -> (&(&lru->node[i].lock)->rlock){+.+.-.} ops: 445715 {
> HARDIRQ-ON-W at:
>   [] mark_irqflags+0x130/0x190
>   [] __lock_acquire+0x3bc/0x5e0
>   [] lock_acquire+0x9e/0x170
>   [] _raw_spin_lock+0x3e/0x80
>   [] list_lru_add+0x5b/0xf0
>   [] dput+0xbc/0x120
>   [] __fput+0x1d2/0x310
>   [] fput+0xe/0x10
>   [] task_work_run+0xad/0xe0
>   [] do_notify_resume+0x75/0x80
>   [] int_signal+0x12/0x17
> SOFTIRQ-ON-W at:
>   [] mark_irqflags+0x154/0x190
>   [] __lock_acquire+0x3bc/0x5e0
>   [] lock_acquire+0x9e/0x170
>   [] _raw_spin_lock+0x3e/0x80
>   [] list_lru_add+0x5b/0xf0
>   [] dput+0xbc/0x120
>   [] __fput+0x1d2/0x310
>   [] fput+0xe/0x10
>   [] task_work_run+0xad/0xe0
>   [] do_notify_resume+0x75/0x80
>   [] int_signal+0x12/0x17
> IN-RECLAIM_FS-W at:
>  [] mark_irqflags+0xc6/0x190
>  [] __lock_acquire+0x3bc/0x5e0
>  [] lock_acquire+0x9e/0x170
>  [] _raw_spin_lock+0x3e/0x80
>  [] list_lru_count_node+0x28/0x70
>  [] super_cache_count+0x83/0x120
>  [] shrink_slab_node+0x47/0x350
>  [] shrink_slab+0x8d/0x160
>  [] kswapd_shrink_zone+0x130/0x1c0
>  [] balance_pgdat+0x389/0x520
>  [] kswapd+0x1bf/0x380
>  [] kthread+0xee/0x110
>  [] ret_from_fork+0x7c/0xb0
> INITIAL USE at:
>  [] __lock_acquire+0x214/0x5e0
>  [] lock_acquire+0x9e/0x170
>  [] _raw_spin_lock+0x3e/0x80
>  [] list_lru_add+0x5b/0xf0
>  [] dput+0xbc/0x120
>  [] __fput+0x1d2/0x310
>  [] fput+0xe/0x10
>  [] task_work_run+0xad/0xe0
>  [] do_notify_resume+0x75/0x80
>  [] int_signal+0x12/0x17
>   }
>   ... key  at: [] __key.23573+0x0/0xc
>   ... acquired at:
>[] validate_chain+0x6e1/0x840
>[] __lock_acquire+0x367/0x5e0
>[] lock_acquire+0x9e/0x170
>[] _raw_spin_lock+0x3e/0x80
>[] list_lru_add+0x5b/0xf0
>[] page_cache_tree_delete+0x140/0x1a0
>[] __delete_from_page_cache+0x50/0x1c0
>[] __remove_mapping+0x9d/0x170
>[] shrink_page_list+0x617/0x7f0
>[] shrink_inactive_list+0x26a/0x520
>[] shrink_lruvec+0x336/0x420
>[] shrink_zone+0x5c/0x120
>[] kswapd_shrink_zone+0xfb/0x1c0
>[] balance_pgdat+0x389/0x520
>[] kswapd+0x1bf/0x380
>[] kthread+0xee/0x110
>[] ret_from_fork+0x7c/0xb0
> 
> -> (&(&mapping->tree_lock)->rlock){..-.-.} ops: 11597 {
>IN-SOFTIRQ-W at:
> [] mark_irqflags+0x109/0x190
> [] __lock_acquire+0x3bc/0x5e0
> [] lock_acquire+0x9e/0x170
> [] _raw_spin_lock_irqsave+0x50/0x90
> [] 


Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-12 Thread Tetsuo Handa
Hello.

I got a lockdep warning shown below, and the bad commit seems to be de055616
"mm: keep page cache radix tree nodes in check" as of next-20140212
on linux-next.git.

Regards.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc1-00099-gde05561 #126 Tainted: GF
---------------------------------------------------------
swapper/0/0 just changed the state of lock:
 (&(&mapping->tree_lock)->rlock){..-.-.}, at: [] test_clear_page_writeback+0x48/0x190
but this lock took another, SOFTIRQ-unsafe lock in the past:
 (&(&lru->node[i].lock)->rlock){+.+.-.}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&lru->node[i].lock)->rlock);
                               local_irq_disable();
                               lock(&(&mapping->tree_lock)->rlock);
                               lock(&(&lru->node[i].lock)->rlock);
  <Interrupt>
    lock(&(&mapping->tree_lock)->rlock);

 *** DEADLOCK ***

no locks held by swapper/0/0.

the shortest dependencies between 2nd lock and 1st lock:
 -> (&(&lru->node[i].lock)->rlock){+.+.-.} ops: 445715 {
HARDIRQ-ON-W at:
  [] mark_irqflags+0x130/0x190
  [] __lock_acquire+0x3bc/0x5e0
  [] lock_acquire+0x9e/0x170
  [] _raw_spin_lock+0x3e/0x80
  [] list_lru_add+0x5b/0xf0
  [] dput+0xbc/0x120
  [] __fput+0x1d2/0x310
  [] fput+0xe/0x10
  [] task_work_run+0xad/0xe0
  [] do_notify_resume+0x75/0x80
  [] int_signal+0x12/0x17
SOFTIRQ-ON-W at:
  [] mark_irqflags+0x154/0x190
  [] __lock_acquire+0x3bc/0x5e0
  [] lock_acquire+0x9e/0x170
  [] _raw_spin_lock+0x3e/0x80
  [] list_lru_add+0x5b/0xf0
  [] dput+0xbc/0x120
  [] __fput+0x1d2/0x310
  [] fput+0xe/0x10
  [] task_work_run+0xad/0xe0
  [] do_notify_resume+0x75/0x80
  [] int_signal+0x12/0x17
IN-RECLAIM_FS-W at:
 [] mark_irqflags+0xc6/0x190
 [] __lock_acquire+0x3bc/0x5e0
 [] lock_acquire+0x9e/0x170
 [] _raw_spin_lock+0x3e/0x80
 [] list_lru_count_node+0x28/0x70
 [] super_cache_count+0x83/0x120
 [] shrink_slab_node+0x47/0x350
 [] shrink_slab+0x8d/0x160
 [] kswapd_shrink_zone+0x130/0x1c0
 [] balance_pgdat+0x389/0x520
 [] kswapd+0x1bf/0x380
 [] kthread+0xee/0x110
 [] ret_from_fork+0x7c/0xb0
INITIAL USE at:
 [] __lock_acquire+0x214/0x5e0
 [] lock_acquire+0x9e/0x170
 [] _raw_spin_lock+0x3e/0x80
 [] list_lru_add+0x5b/0xf0
 [] dput+0xbc/0x120
 [] __fput+0x1d2/0x310
 [] fput+0xe/0x10
 [] task_work_run+0xad/0xe0
 [] do_notify_resume+0x75/0x80
 [] int_signal+0x12/0x17
  }
  ... key  at: [] __key.23573+0x0/0xc
  ... acquired at:
   [] validate_chain+0x6e1/0x840
   [] __lock_acquire+0x367/0x5e0
   [] lock_acquire+0x9e/0x170
   [] _raw_spin_lock+0x3e/0x80
   [] list_lru_add+0x5b/0xf0
   [] page_cache_tree_delete+0x140/0x1a0
   [] __delete_from_page_cache+0x50/0x1c0
   [] __remove_mapping+0x9d/0x170
   [] shrink_page_list+0x617/0x7f0
   [] shrink_inactive_list+0x26a/0x520
   [] shrink_lruvec+0x336/0x420
   [] shrink_zone+0x5c/0x120
   [] kswapd_shrink_zone+0xfb/0x1c0
   [] balance_pgdat+0x389/0x520
   [] kswapd+0x1bf/0x380
   [] kthread+0xee/0x110
   [] ret_from_fork+0x7c/0xb0

-> (&(&mapping->tree_lock)->rlock){..-.-.} ops: 11597 {
   IN-SOFTIRQ-W at:
[] mark_irqflags+0x109/0x190
[] __lock_acquire+0x3bc/0x5e0
[] lock_acquire+0x9e/0x170
[] _raw_spin_lock_irqsave+0x50/0x90
[] test_clear_page_writeback+0x48/0x190
[] end_page_writeback+0x20/0x60
[] ext4_finish_bio+0x168/0x220 [ext4]
[] ext4_end_bio+0x97/0xe0 [ext4]
[] bio_endio+0x53/0xa0
[] blk_update_request+0x213/0x430
[] blk_update_bidi_request+0x27/0xb0
[] blk_end_bidi_request+0x2f/0x80
[] blk_end_request+0x10/0x20
[] scsi_end_request+0x40/0xb0
[] 


Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-04 Thread Johannes Weiner
On Tue, Feb 04, 2014 at 03:14:24PM -0800, Andrew Morton wrote:
> On Mon,  3 Feb 2014 19:53:32 -0500 Johannes Weiner wrote:
> 
> > o Fix vmstat build problems on UP (Fengguang Wu's build bot)
> > 
> > o Clarify why optimistic radix_tree_node->private_list link checking
> >   is safe without holding the list_lru lock (Dave Chinner)
> > 
> > o Assert locking balance when the list_lru isolator says it dropped
> >   the list lock (Dave Chinner)
> > 
> > o Remove remnant of a manual reclaim counter in the shadow isolator,
> >   the list_lru-provided accounting is accurate now that we added
> >   LRU_REMOVED_RETRY (Dave Chinner)
> > 
> > o Set an object limit for the shadow shrinker instead of messing with
> >   its seeks setting.  The configured seeks define how pressure applied
> >   to pages translates to pressure on the object pool, in itself it is
> >   not enough to replace proper object valuation to classify expired
> >   and in-use objects.  Shadow nodes contain up to 64 shadow entries
> >   from different/alternating zones that have their own atomic age
> >   counter, so determining if a node is overall expired is crazy
> >   expensive.  Instead, use an object limit above which nodes are very
> >   likely to be expired.
> > 
> > o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)
> > 
> > o radix_tree_node->count accessors for pages and shadows (Minchan Kim)
> > 
> > o Rebase to v3.14-rc1 and add review tags
> 
> An earlier version caused a 24-byte inode bloatage.  That appears to
> have been reduced to 8 bytes, yes?  What was done there?

Instead of inodes, the shrinker now directly tracks radix tree nodes
that contain only shadow entries.  So the 16 bytes for the list_head
are now in struct radix_tree_node, but due to different slab packing
it didn't increase memory consumption.
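
The 16 bytes mentioned above follow from a list_head being two pointers on a 64-bit build.  A minimal userspace sketch, not kernel code: the struct mirrors the kernel's two-pointer layout, and list_head_bytes() is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Same two-pointer layout as the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Compile-time check: no padding, just two pointers. */
static_assert(sizeof(struct list_head) == 2 * sizeof(void *),
	      "list_head is exactly two pointers");

/* Hypothetical helper: bytes one embedded list_head adds to a struct. */
static size_t list_head_bytes(void)
{
	return sizeof(struct list_head);
}
```

Whether those bytes raise real memory use depends on how the slab allocator packs the enclosing object, which this sketch does not model.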

> > 69 files changed, 1438 insertions(+), 462 deletions(-)
> 
> omigod

Most of it is comments and Minchan's accessor functions.


Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-04 Thread Andrew Morton
On Mon,  3 Feb 2014 19:53:32 -0500 Johannes Weiner wrote:

> o Fix vmstat build problems on UP (Fengguang Wu's build bot)
> 
> o Clarify why optimistic radix_tree_node->private_list link checking
>   is safe without holding the list_lru lock (Dave Chinner)
> 
> o Assert locking balance when the list_lru isolator says it dropped
>   the list lock (Dave Chinner)
> 
> o Remove remnant of a manual reclaim counter in the shadow isolator,
>   the list_lru-provided accounting is accurate now that we added
>   LRU_REMOVED_RETRY (Dave Chinner)
> 
> o Set an object limit for the shadow shrinker instead of messing with
>   its seeks setting.  The configured seeks define how pressure applied
>   to pages translates to pressure on the object pool, in itself it is
>   not enough to replace proper object valuation to classify expired
>   and in-use objects.  Shadow nodes contain up to 64 shadow entries
>   from different/alternating zones that have their own atomic age
>   counter, so determining if a node is overall expired is crazy
>   expensive.  Instead, use an object limit above which nodes are very
>   likely to be expired.
> 
> o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)
> 
> o radix_tree_node->count accessors for pages and shadows (Minchan Kim)
> 
> o Rebase to v3.14-rc1 and add review tags

An earlier version caused a 24-byte inode bloatage.  That appears to
have been reduced to 8 bytes, yes?  What was done there?

> 69 files changed, 1438 insertions(+), 462 deletions(-)

omigod



[patch 00/10] mm: thrash detection-based file cache sizing v9

2014-02-03 Thread Johannes Weiner
Changes in this revision

o Fix vmstat build problems on UP (Fengguang Wu's build bot)

o Clarify why optimistic radix_tree_node->private_list link checking
  is safe without holding the list_lru lock (Dave Chinner)

o Assert locking balance when the list_lru isolator says it dropped
  the list lock (Dave Chinner)

o Remove remnant of a manual reclaim counter in the shadow isolator,
  the list_lru-provided accounting is accurate now that we added
  LRU_REMOVED_RETRY (Dave Chinner)

o Set an object limit for the shadow shrinker instead of messing with
  its seeks setting.  The configured seeks define how pressure applied
  to pages translates to pressure on the object pool, in itself it is
  not enough to replace proper object valuation to classify expired
  and in-use objects.  Shadow nodes contain up to 64 shadow entries
  from different/alternating zones that have their own atomic age
  counter, so determining if a node is overall expired is crazy
  expensive.  Instead, use an object limit above which nodes are very
  likely to be expired.

o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)

o radix_tree_node->count accessors for pages and shadows (Minchan Kim)

o Rebase to v3.14-rc1 and add review tags

Summary

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently
used list "inactive list" and the frequently used list "active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.  This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.
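
The detection limit of the 1:1 split can be put into a few lines of C.
This is only a simplified sketch, not kernel code: it assumes the whole
cache is split evenly, that the inactive list behaves like a plain FIFO,
and that pages are 4K.  The names reuse_detected() and PAGES_PER_GB are
made up for illustration.

```c
#include <stdbool.h>

/* Pages per gigabyte, assuming 4K pages (an assumption of this sketch). */
#define PAGES_PER_GB (1024UL * 1024UL / 4UL)

/*
 * A page that is faulted in sits on the inactive list until nr_inactive
 * other pages have been faulted in behind it.  A second reference is
 * only noticed if it arrives before that, i.e. if the reuse distance
 * (in pages) fits into the inactive list.
 */
bool reuse_detected(unsigned long cache_pages, unsigned long reuse_distance)
{
	unsigned long nr_inactive = cache_pages / 2;	/* 1:1 split */

	return reuse_distance <= nr_inactive;
}
```

Under these assumptions, a 6G file set read in a loop on a machine with
8G of cache has a reuse distance of 6G against an inactive list of 4G,
so reuse_detected(8 * PAGES_PER_GB, 6 * PAGES_PER_GB) is false and the
set is never promoted.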

This happens on file servers and media streaming servers, where the
popular files and file sections change over time.  Even though the
individual files might be smaller than half of memory, concurrent
access to many of them may still result in their inter-reference
distance being greater than half of memory.  It's also been reported
as a problem on database workloads that switch back and forth between
tables that are bigger than half of memory.  In these cases the VM
never recognizes the new working set and will for the remainder of the
workload thrash disk data which could easily live in memory.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to detect frequently
used pages regardless of inactive list size and facilitate working set
transitions.
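
How such an eviction history can be used is sketched below as a small
user-space model.  It is an illustration under assumptions, not the
code from this series: it assumes one eviction counter per LRU, that an
evicted page leaves behind a snapshot of that counter (a "shadow"), and
that a refaulting page is activated when the number of evictions since
its own eviction does not exceed the size of the active list.  The
names struct lru_model, evict_page() and refault_should_activate() are
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of one inactive/active list pair. */
struct lru_model {
	unsigned long evictions;	/* pages evicted from the inactive list so far */
	unsigned long nr_active;	/* current size of the active list, in pages */
};

/* Evict one page; the return value is the shadow stored in its place. */
unsigned long evict_page(struct lru_model *lru)
{
	return ++lru->evictions;
}

/*
 * On refault, the number of evictions since this page was evicted
 * approximates how much more inactive list space the page would have
 * needed to stay cached.  If the active list could have provided that
 * space, treat the page as part of the working set.
 */
bool refault_should_activate(const struct lru_model *lru, unsigned long shadow)
{
	unsigned long distance = lru->evictions - shadow;

	return distance <= lru->nr_active;
}
```

In this model, a page that refaults after 50 further evictions with 100
active pages is activated, while one that refaults after 150 further
evictions is not.  The decision no longer depends on the size of the
inactive list itself.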

Tests

The reported database workload is easily demonstrated on an 8G machine
with two filesets of 6G each.  This fio workload operates on one set first,
then switches to the other.  The VM should obviously always cache the
set that the workload is currently using.

This test is based on a problem encountered by Citus Data customers:
http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data

unpatched:
db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%

real    27m15.541s
user    0m19.059s
sys     0m51.459s

patched:
db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%

real    6m8.630s
user    0m14.714s
sys     0m31.233s

As can be seen, the unpatched kernel simply never adapts to the
working set change and db2 is stuck indefinitely at secondary storage
speed.  The patched kernel needs 2-3 iterations over db2 before it
replaces db1 and reaches full memory speed.  Given the unbounded
negative effect of the existing VM behavior, these patches should be
considered correctness fixes rather than performance optimizations.
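
The fio figures above can be cross-checked with a little arithmetic.
The helper below is only a sketch for that purpose; the function names
are made up, and it assumes that fio's MB and KB are binary units
(1MB = 1024KB), which agrees with the reported mint values.

```c
#include <stdio.h>

/* Seconds needed to move io_mb megabytes at kb_per_s kilobytes per second. */
double transfer_seconds(double io_mb, double kb_per_s)
{
	return io_mb * 1024.0 / kb_per_s;
}

/* Number of full passes over a file set of set_mb that io_mb represents. */
double passes_over_set(double io_mb, double set_mb)
{
	return io_mb / set_mb;
}

/* Print the derived figures for the db2 phase of both runs. */
void print_db2_summary(void)
{
	const double io_mb = 98304.0;		/* io= from the fio output */
	const double unpatched = 66169.0;	/* aggrb= in KB/s, unpatched db2 */
	const double patched = 397449.0;	/* aggrb= in KB/s, patched db2 */

	printf("passes over a 6G set: %.0f\n",
	       passes_over_set(io_mb, 6.0 * 1024.0));
	printf("unpatched db2: %.1f s\n", transfer_seconds(io_mb, unpatched));
	printf("patched db2:   %.1f s\n", transfer_seconds(io_mb, patched));
	printf("db2 bandwidth ratio: %.1fx\n", patched / unpatched);
}
```

With these inputs each 6G set is read 16 times, the db2 phase takes
roughly 1521 seconds unpatched and roughly 253 seconds patched, and the
aggregate db2 bandwidth improves by about a factor of six.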

Another test resembles a fileserver or streaming server workload,
where data in excess of
