Re: [patch 00/10] mm: thrash detection-based file cache sizing v9
On Fri, Feb 14, 2014 at 03:05:41PM +0900, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > Thanks for the report.  There is already a fix for this in -mm:
> > http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
> >
> > It was merged on the 7th, so it should show up in -next... any day
> > now?
>
> That patch solved this problem but breaks the build instead.
>
> ERROR: "list_lru_init_key" [fs/xfs/xfs.ko] undefined!
> ERROR: "list_lru_init_key" [fs/gfs2/gfs2.ko] undefined!
> make[1]: *** [__modpost] Error 1

There is a follow-up fix in -mm:
http://marc.info/?l=linux-mm-commits&m=139180636814624&w=2
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/10] mm: thrash detection-based file cache sizing v9
Johannes Weiner wrote:
> Thanks for the report.  There is already a fix for this in -mm:
> http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
>
> It was merged on the 7th, so it should show up in -next... any day
> now?

That patch solved this problem but breaks the build instead.

ERROR: "list_lru_init_key" [fs/xfs/xfs.ko] undefined!
ERROR: "list_lru_init_key" [fs/gfs2/gfs2.ko] undefined!
make[1]: *** [__modpost] Error 1

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 2a5b8fd..f1a0db1 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -143,7 +143,7 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
 	}
 	return 0;
 }
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(list_lru_init_key);
 
 void list_lru_destroy(struct list_lru *lru)
 {
Re: [patch 00/10] mm: thrash detection-based file cache sizing v9
Hi Andrew,

On Thu, 13 Feb 2014 14:24:07 -0800 Andrew Morton wrote:
> On Thu, 13 Feb 2014 17:11:26 -0500 Johannes Weiner wrote:
> > On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> > > Hello.
> > >
> > > I got a lockdep warning shown below, and the bad commit seems to be
> > > de055616 "mm: keep page cache radix tree nodes in check" as of
> > > next-20140212 on linux-next.git.
> >
> > Thanks for the report.  There is already a fix for this in -mm:
> > http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
> >
> > It was merged on the 7th, so it should show up in -next... any day
> > now?
>
> Tomorrow, if Stephen works weekends, which I expect he doesn't ;)

Actually later today (my time), since Friday is not a weekend :-(
--
Cheers,
Stephen Rothwell s...@canb.auug.org.au
Re: [patch 00/10] mm: thrash detection-based file cache sizing v9
On Thu, 13 Feb 2014 17:11:26 -0500 Johannes Weiner wrote:
> Hi Tetsuo,
>
> On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> > Hello.
> >
> > I got a lockdep warning shown below, and the bad commit seems to be
> > de055616 "mm: keep page cache radix tree nodes in check" as of
> > next-20140212 on linux-next.git.
>
> Thanks for the report.  There is already a fix for this in -mm:
> http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
>
> It was merged on the 7th, so it should show up in -next... any day
> now?

Tomorrow, if Stephen works weekends, which I expect he doesn't ;)

http://ozlabs.org/~akpm/mmotm/broken-out/mm-keep-page-cache-radix-tree-nodes-in-check-fix.patch
Re: [patch 00/10] mm: thrash detection-based file cache sizing v9
Hi Tetsuo,

On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> Hello.
>
> I got a lockdep warning shown below, and the bad commit seems to be
> de055616 "mm: keep page cache radix tree nodes in check" as of
> next-20140212 on linux-next.git.

Thanks for the report.  There is already a fix for this in -mm:
http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2

It was merged on the 7th, so it should show up in -next... any day
now?

> Regards.
>
> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 3.14.0-rc1-00099-gde05561 #126 Tainted: GF
> ---------------------------------------------------------
> swapper/0/0 just changed the state of lock:
>  (&(&mapping->tree_lock)->rlock){..-.-.}, at: []
>  test_clear_page_writeback+0x48/0x190
> but this lock took another, SOFTIRQ-unsafe lock in the past:
>  (&(&lru->node[i].lock)->rlock){+.+.-.}
>
> and interrupts could create inverse lock ordering between them.
>
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&lru->node[i].lock)->rlock);
>                                local_irq_disable();
>                                lock(&(&mapping->tree_lock)->rlock);
>                                lock(&(&lru->node[i].lock)->rlock);
>   <Interrupt>
>     lock(&(&mapping->tree_lock)->rlock);
>
>  *** DEADLOCK ***
>
> no locks held by swapper/0/0.
>
> the shortest dependencies between 2nd lock and 1st lock:
> -> (&(&lru->node[i].lock)->rlock){+.+.-.} ops: 445715 {
>    HARDIRQ-ON-W at:
>      [] mark_irqflags+0x130/0x190
>      [] __lock_acquire+0x3bc/0x5e0
>      [] lock_acquire+0x9e/0x170
>      [] _raw_spin_lock+0x3e/0x80
>      [] list_lru_add+0x5b/0xf0
>      [] dput+0xbc/0x120
>      [] __fput+0x1d2/0x310
>      [] fput+0xe/0x10
>      [] task_work_run+0xad/0xe0
>      [] do_notify_resume+0x75/0x80
>      [] int_signal+0x12/0x17
>    SOFTIRQ-ON-W at:
>      [] mark_irqflags+0x154/0x190
>      [] __lock_acquire+0x3bc/0x5e0
>      [] lock_acquire+0x9e/0x170
>      [] _raw_spin_lock+0x3e/0x80
>      [] list_lru_add+0x5b/0xf0
>      [] dput+0xbc/0x120
>      [] __fput+0x1d2/0x310
>      [] fput+0xe/0x10
>      [] task_work_run+0xad/0xe0
>      [] do_notify_resume+0x75/0x80
>      [] int_signal+0x12/0x17
>    IN-RECLAIM_FS-W at:
>      [] mark_irqflags+0xc6/0x190
>      [] __lock_acquire+0x3bc/0x5e0
>      [] lock_acquire+0x9e/0x170
>      [] _raw_spin_lock+0x3e/0x80
>      [] list_lru_count_node+0x28/0x70
>      [] super_cache_count+0x83/0x120
>      [] shrink_slab_node+0x47/0x350
>      [] shrink_slab+0x8d/0x160
>      [] kswapd_shrink_zone+0x130/0x1c0
>      [] balance_pgdat+0x389/0x520
>      [] kswapd+0x1bf/0x380
>      [] kthread+0xee/0x110
>      [] ret_from_fork+0x7c/0xb0
>    INITIAL USE at:
>      [] __lock_acquire+0x214/0x5e0
>      [] lock_acquire+0x9e/0x170
>      [] _raw_spin_lock+0x3e/0x80
>      [] list_lru_add+0x5b/0xf0
>      [] dput+0xbc/0x120
>      [] __fput+0x1d2/0x310
>      [] fput+0xe/0x10
>      [] task_work_run+0xad/0xe0
>      [] do_notify_resume+0x75/0x80
>      [] int_signal+0x12/0x17
>  }
>  ... key at: [] __key.23573+0x0/0xc
>  ... acquired at:
>    [] validate_chain+0x6e1/0x840
>    [] __lock_acquire+0x367/0x5e0
>    [] lock_acquire+0x9e/0x170
>    [] _raw_spin_lock+0x3e/0x80
>    [] list_lru_add+0x5b/0xf0
>    [] page_cache_tree_delete+0x140/0x1a0
>    [] __delete_from_page_cache+0x50/0x1c0
>    [] __remove_mapping+0x9d/0x170
>    [] shrink_page_list+0x617/0x7f0
>    [] shrink_inactive_list+0x26a/0x520
>    [] shrink_lruvec+0x336/0x420
>    [] shrink_zone+0x5c/0x120
>    [] kswapd_shrink_zone+0xfb/0x1c0
>    [] balance_pgdat+0x389/0x520
>    [] kswapd+0x1bf/0x380
>    [] kthread+0xee/0x110
>    [] ret_from_fork+0x7c/0xb0
>
> -> (&(&mapping->tree_lock)->rlock){..-.-.} ops: 11597 {
>    IN-SOFTIRQ-W at:
>      [] mark_irqflags+0x109/0x190
>      [] __lock_acquire+0x3bc/0x5e0
>      [] lock_acquire+0x9e/0x170
>      [] _raw_spin_lock_irqsave+0x50/0x90
>      []
Re: [patch 00/10] mm: thrash detection-based file cache sizing v9
Hello.

I got a lockdep warning shown below, and the bad commit seems to be
de055616 "mm: keep page cache radix tree nodes in check" as of
next-20140212 on linux-next.git.

Regards.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc1-00099-gde05561 #126 Tainted: GF
---------------------------------------------------------
swapper/0/0 just changed the state of lock:
 (&(&mapping->tree_lock)->rlock){..-.-.}, at: []
 test_clear_page_writeback+0x48/0x190
but this lock took another, SOFTIRQ-unsafe lock in the past:
 (&(&lru->node[i].lock)->rlock){+.+.-.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&lru->node[i].lock)->rlock);
                               local_irq_disable();
                               lock(&(&mapping->tree_lock)->rlock);
                               lock(&(&lru->node[i].lock)->rlock);
  <Interrupt>
    lock(&(&mapping->tree_lock)->rlock);

 *** DEADLOCK ***

no locks held by swapper/0/0.

the shortest dependencies between 2nd lock and 1st lock:
-> (&(&lru->node[i].lock)->rlock){+.+.-.} ops: 445715 {
   HARDIRQ-ON-W at:
     [] mark_irqflags+0x130/0x190
     [] __lock_acquire+0x3bc/0x5e0
     [] lock_acquire+0x9e/0x170
     [] _raw_spin_lock+0x3e/0x80
     [] list_lru_add+0x5b/0xf0
     [] dput+0xbc/0x120
     [] __fput+0x1d2/0x310
     [] fput+0xe/0x10
     [] task_work_run+0xad/0xe0
     [] do_notify_resume+0x75/0x80
     [] int_signal+0x12/0x17
   SOFTIRQ-ON-W at:
     [] mark_irqflags+0x154/0x190
     [] __lock_acquire+0x3bc/0x5e0
     [] lock_acquire+0x9e/0x170
     [] _raw_spin_lock+0x3e/0x80
     [] list_lru_add+0x5b/0xf0
     [] dput+0xbc/0x120
     [] __fput+0x1d2/0x310
     [] fput+0xe/0x10
     [] task_work_run+0xad/0xe0
     [] do_notify_resume+0x75/0x80
     [] int_signal+0x12/0x17
   IN-RECLAIM_FS-W at:
     [] mark_irqflags+0xc6/0x190
     [] __lock_acquire+0x3bc/0x5e0
     [] lock_acquire+0x9e/0x170
     [] _raw_spin_lock+0x3e/0x80
     [] list_lru_count_node+0x28/0x70
     [] super_cache_count+0x83/0x120
     [] shrink_slab_node+0x47/0x350
     [] shrink_slab+0x8d/0x160
     [] kswapd_shrink_zone+0x130/0x1c0
     [] balance_pgdat+0x389/0x520
     [] kswapd+0x1bf/0x380
     [] kthread+0xee/0x110
     [] ret_from_fork+0x7c/0xb0
   INITIAL USE at:
     [] __lock_acquire+0x214/0x5e0
     [] lock_acquire+0x9e/0x170
     [] _raw_spin_lock+0x3e/0x80
     [] list_lru_add+0x5b/0xf0
     [] dput+0xbc/0x120
     [] __fput+0x1d2/0x310
     [] fput+0xe/0x10
     [] task_work_run+0xad/0xe0
     [] do_notify_resume+0x75/0x80
     [] int_signal+0x12/0x17
 }
 ... key at: [] __key.23573+0x0/0xc
 ... acquired at:
   [] validate_chain+0x6e1/0x840
   [] __lock_acquire+0x367/0x5e0
   [] lock_acquire+0x9e/0x170
   [] _raw_spin_lock+0x3e/0x80
   [] list_lru_add+0x5b/0xf0
   [] page_cache_tree_delete+0x140/0x1a0
   [] __delete_from_page_cache+0x50/0x1c0
   [] __remove_mapping+0x9d/0x170
   [] shrink_page_list+0x617/0x7f0
   [] shrink_inactive_list+0x26a/0x520
   [] shrink_lruvec+0x336/0x420
   [] shrink_zone+0x5c/0x120
   [] kswapd_shrink_zone+0xfb/0x1c0
   [] balance_pgdat+0x389/0x520
   [] kswapd+0x1bf/0x380
   [] kthread+0xee/0x110
   [] ret_from_fork+0x7c/0xb0

-> (&(&mapping->tree_lock)->rlock){..-.-.} ops: 11597 {
   IN-SOFTIRQ-W at:
     [] mark_irqflags+0x109/0x190
     [] __lock_acquire+0x3bc/0x5e0
     [] lock_acquire+0x9e/0x170
     [] _raw_spin_lock_irqsave+0x50/0x90
     [] test_clear_page_writeback+0x48/0x190
     [] end_page_writeback+0x20/0x60
     [] ext4_finish_bio+0x168/0x220 [ext4]
     [] ext4_end_bio+0x97/0xe0 [ext4]
     [] bio_endio+0x53/0xa0
     [] blk_update_request+0x213/0x430
     [] blk_update_bidi_request+0x27/0xb0
     [] blk_end_bidi_request+0x2f/0x80
     [] blk_end_request+0x10/0x20
     [] scsi_end_request+0x40/0xb0
     []
Re: [patch 00/10] mm: thrash detection-based file cache sizing v9
On Tue, Feb 04, 2014 at 03:14:24PM -0800, Andrew Morton wrote:
> On Mon, 3 Feb 2014 19:53:32 -0500 Johannes Weiner wrote:
>
> > o Fix vmstat build problems on UP (Fengguang Wu's build bot)
> >
> > o Clarify why optimistic radix_tree_node->private_list link checking
> >   is safe without holding the list_lru lock (Dave Chinner)
> >
> > o Assert locking balance when the list_lru isolator says it dropped
> >   the list lock (Dave Chinner)
> >
> > o Remove remnant of a manual reclaim counter in the shadow isolator,
> >   the list_lru-provided accounting is accurate now that we added
> >   LRU_REMOVED_RETRY (Dave Chinner)
> >
> > o Set an object limit for the shadow shrinker instead of messing with
> >   its seeks setting.  The configured seeks define how pressure applied
> >   to pages translates to pressure on the object pool, in itself it is
> >   not enough to replace proper object valuation to classify expired
> >   and in-use objects.  Shadow nodes contain up to 64 shadow entries
> >   from different/alternating zones that have their own atomic age
> >   counter, so determining if a node is overall expired is crazy
> >   expensive.  Instead, use an object limit above which nodes are very
> >   likely to be expired.
> >
> > o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)
> >
> > o radix_tree_node->count accessors for pages and shadows (Minchan Kim)
> >
> > o Rebase to v3.14-rc1 and add review tags
>
> An earlier version caused a 24-byte inode bloatage.  That appears to
> have been reduced to 8 bytes, yes?  What was done there?

Instead of inodes, the shrinker now directly tracks radix tree nodes
that contain only shadow entries.  So the 16 bytes for the list_head
are now in struct radix_tree_node, but due to different slab packing
it didn't increase memory consumption.

> > 69 files changed, 1438 insertions(+), 462 deletions(-)
>
> omigod

Most of it is comments and Minchan's accessor functions.
Re: [patch 00/10] mm: thrash detection-based file cache sizing v9
On Mon, 3 Feb 2014 19:53:32 -0500 Johannes Weiner wrote:

> o Fix vmstat build problems on UP (Fengguang Wu's build bot)
>
> o Clarify why optimistic radix_tree_node->private_list link checking
>   is safe without holding the list_lru lock (Dave Chinner)
>
> o Assert locking balance when the list_lru isolator says it dropped
>   the list lock (Dave Chinner)
>
> o Remove remnant of a manual reclaim counter in the shadow isolator,
>   the list_lru-provided accounting is accurate now that we added
>   LRU_REMOVED_RETRY (Dave Chinner)
>
> o Set an object limit for the shadow shrinker instead of messing with
>   its seeks setting.  The configured seeks define how pressure applied
>   to pages translates to pressure on the object pool, in itself it is
>   not enough to replace proper object valuation to classify expired
>   and in-use objects.  Shadow nodes contain up to 64 shadow entries
>   from different/alternating zones that have their own atomic age
>   counter, so determining if a node is overall expired is crazy
>   expensive.  Instead, use an object limit above which nodes are very
>   likely to be expired.
>
> o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)
>
> o radix_tree_node->count accessors for pages and shadows (Minchan Kim)
>
> o Rebase to v3.14-rc1 and add review tags

An earlier version caused a 24-byte inode bloatage.  That appears to
have been reduced to 8 bytes, yes?  What was done there?

> 69 files changed, 1438 insertions(+), 462 deletions(-)

omigod
[patch 00/10] mm: thrash detection-based file cache sizing v9
Changes in this revision

o Fix vmstat build problems on UP (Fengguang Wu's build bot)

o Clarify why optimistic radix_tree_node->private_list link checking
  is safe without holding the list_lru lock (Dave Chinner)

o Assert locking balance when the list_lru isolator says it dropped
  the list lock (Dave Chinner)

o Remove remnant of a manual reclaim counter in the shadow isolator,
  the list_lru-provided accounting is accurate now that we added
  LRU_REMOVED_RETRY (Dave Chinner)

o Set an object limit for the shadow shrinker instead of messing with
  its seeks setting.  The configured seeks define how pressure applied
  to pages translates to pressure on the object pool, in itself it is
  not enough to replace proper object valuation to classify expired
  and in-use objects.  Shadow nodes contain up to 64 shadow entries
  from different/alternating zones that have their own atomic age
  counter, so determining if a node is overall expired is crazy
  expensive.  Instead, use an object limit above which nodes are very
  likely to be expired.

o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)

o radix_tree_node->count accessors for pages and shadows (Minchan Kim)

o Rebase to v3.14-rc1 and add review tags

Summary

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.
This means that working set changes bigger than half of cache memory
go undetected and thrash indefinitely, whereas working sets bigger
than half of cache memory are unprotected against used-once streams
that don't even need caching.

This happens on file servers and media streaming servers, where the
popular files and file sections change over time.  Even though the
individual files might be smaller than half of memory, concurrent
access to many of them may still result in their inter-reference
distance being greater than half of memory.  It's also been reported
as a problem on database workloads that switch back and forth between
tables that are bigger than half of memory.  In these cases the VM
never recognizes the new working set and will for the remainder of the
workload thrash disk data which could easily live in memory.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more gracetime in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to detect frequently
used pages regardless of inactive list size and facilitate working set
transitions.

Tests

The reported database workload is easily demonstrated on an 8G machine
with two 6G filesets.  This fio workload operates on one set first,
then switches to the other.  The VM should obviously always cache the
set that the workload is currently using.
This test is based on a problem encountered by Citus Data customers:
http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data

unpatched:
db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s,
     maxb=885559KB/s, mint=113672msec, maxt=113672msec
db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s,
     maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
sdb: ios=835750/4, merge=2/1, ticks=4659739/60016,
     in_queue=4719203, util=98.92%

real    27m15.541s
user    0m19.059s
sys     0m51.459s

patched:
db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s,
     maxb=877783KB/s, mint=114679msec, maxt=114679msec
db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s,
     maxb=397449KB/s, mint=253273msec, maxt=253273msec
sdb: ios=170587/4, merge=2/1, ticks=954910/61123,
     in_queue=1015923, util=90.40%

real    6m8.630s
user    0m14.714s
sys     0m31.233s

As can be seen, the unpatched kernel simply never adapts to the
working set change and db2 is stuck indefinitely at secondary storage
speed.  The patched kernel needs 2-3 iterations over db2 before it
replaces db1 and reaches full memory speed.  Given the unbounded
negative effect of the existing VM behavior, these patches should be
considered correctness fixes rather than performance optimizations.

Another test resembles a fileserver or streaming server workload,
where data in excess of