Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Saturday 29 September 2007 19:27, Andrew Morton wrote:
> > On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > > > oom-killings, or page allocation failures?  The latter, one hopes.
> > >
> > > Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2
> > > (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
> > >
> > > ...
> > >
> > >
> > > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
> > > Call Trace:
> > > 611b3878:  [<6002dd28>] printk_ratelimit+0x15/0x17
> > > 611b3888:  [<60052ed4>] out_of_memory+0x80/0x100
> > > 611b38c8:  [<60054b0c>] __alloc_pages+0x1ed/0x280
> > > 611b3948:  [<6006c608>] allocate_slab+0x5b/0xb0
> > > 611b3968:  [<6006c705>] new_slab+0x7e/0x183
> > > 611b39a8:  [<6006cbae>] __slab_alloc+0xc9/0x14b
> > > 611b39b0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > 611b39b8:  [<600980f2>] do_mpage_readpage+0x3b3/0x472
> > > 611b39e0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > 611b39f8:  [<6006cc81>] kmem_cache_alloc+0x51/0x98
> > > 611b3a38:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> > > 611b3a58:  [<6004f8e2>] add_to_page_cache+0x22/0xf7
> > > 611b3a98:  [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
> > > 611b3ab8:  [<6009821e>] mpage_readpages+0x6d/0x109
> > > 611b3ac0:  [<600d59f0>] ext3_get_block+0x0/0xf2
> > > 611b3b08:  [<6005483d>] get_page_from_freelist+0x8d/0xc1
> > > 611b3b88:  [<600d6937>] ext3_readpages+0x18/0x1a
> > > 611b3b98:  [<60056f00>] read_pages+0x37/0x9b
> > > 611b3bd8:  [<60057064>] __do_page_cache_readahead+0x100/0x157
> > > 611b3c48:  [<60057196>] do_page_cache_readahead+0x52/0x5f
> > > 611b3c78:  [<60050ab4>] filemap_fault+0x145/0x278
> > > 611b3ca8:  [<60022b61>] run_syscall_stub+0xd1/0xdd
> > > 611b3ce8:  [<6005eae3>] __do_fault+0x7e/0x3ca
> > > 611b3d68:  [<6005ee60>] do_linear_fault+0x31/0x33
> > > 611b3d88:  [<6005f149>] handle_mm_fault+0x14e/0x246
> > > 611b3da8:  [<60120a7b>] __up_read+0x73/0x7b
> > > 611b3de8:  [<60013177>] handle_page_fault+0x11f/0x23b
> > > 611b3e48:  [<60013419>] segv+0xac/0x297
> > > 611b3f28:  [<60013367>] segv_handler+0x68/0x6e
> > > 611b3f48:  [<600232ad>] get_skas_faultinfo+0x9c/0xa1
> > > 611b3f68:  [<60023853>] userspace+0x13a/0x19d
> > > 611b3fc8:  [<60010d58>] fork_handler+0x86/0x8d
> >
> > OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
> > allocations aren't supposed to fail.
> >
> > I'm suspecting that did_some_progress thing.
> 
> The allocation didn't fail -- it invoked the OOM killer because the kernel
> ran out of unfragmented memory.

We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL
allocation in this workload.  We go and synchronously free stuff up to make
it work.

How did this get broken?

> Probably because higher order
> allocations are the new vogue in -mm at the moment ;)

That's a different bug.

bug 1: We shouldn't be doing higher-order allocations in slub because of
the considerable damage this does to atomic allocations.

bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.
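
For reference, a paraphrased sketch of the __alloc_pages() slow path in
question -- the "did_some_progress thing" -- shaped after 2.6.23's
mm/page_alloc.c from memory, not the exact source:

rebalance:
	did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);

	if (likely(did_some_progress)) {
		page = get_page_from_freelist(gfp_mask, order,
					      zonelist, alloc_flags);
		if (page)
			goto got_pg;
	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
		/* reclaim reported no progress at all: OOM-kill something
		 * and start over -- the path the trace above shows */
		out_of_memory(zonelist, gfp_mask, order);
		goto restart;
	}

	/* orders <= PAGE_ALLOC_COSTLY_ORDER (3) retry rather than return
	 * NULL, which is why an order-2 GFP_KERNEL request reaching
	 * out_of_memory() suggests the "did some progress" accounting broke */
	if (order <= PAGE_ALLOC_COSTLY_ORDER || (gfp_mask & __GFP_REPEAT))
		goto rebalance;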




Re: Versioning file system

2007-09-29 Thread Sorin Faibish

Interesting that you mention the multitude of file systems, because
I was very surprised to see NILFS being promoted in the latest Linux
Magazine with no mention of the other, more important file systems
currently in the works, like UnionFS, ChunkFS, or the much-publicized
ext4. I can say I was disappointed by the article. I still haven't
seen any real proof that NILFS is the best file system since sliced
bread. Nor have I seen any comments on NILFS from Andrew and others,
and yet this is the best new file system coming to Linux. Maybe I
missed something that happened in Ottawa.

/Sorin


On Mon, 18 Jun 2007 05:45:24 -0400, Andreas Dilger <[EMAIL PROTECTED]> wrote:

On Jun 16, 2007  16:53 +0200, Jörn Engel wrote:

On Fri, 15 June 2007 15:51:07 -0700, alan wrote:
> >Thus, in the end it turns out that this stuff is better handled by
> >explicit version-control systems (which require explicit operations to
> >manage revisions) and atomic snapshots (for backup.)
>
> ZFS is the cool new thing in that space.  Too bad the license makes it
> hard to incorporate it into the kernel.

It may be the coolest, but there are others as well.  Btrfs looks good,
nilfs finally has a cleaner and may be worth a try, logfs will get
snapshots sooner or later.  Heck, even my crusty old cowlinks can be
viewed as snapshots.

If one has spare cycles to waste, working on one of those makes more
sense than implementing file versioning.


Too bad everyone is spending time on 10 similar-but-slightly-different
filesystems.  This will likely end up with a bunch of filesystems that
implement some easy subset of features, but will not get polished for
users or have a full set of features implemented (e.g. ACL, quota, fsck,
etc).  While I don't think there is a single answer to every question,
it does seem that the number of filesystem projects has climbed lately.

Maybe there should be a BOF at OLS to merge these filesystem projects
(btrfs, chunkfs, tilefs, logfs, etc) into a single project with multiple
people working on getting it solid, scalable (parallel readers/writers on
lots of CPUs), robust (checksums, failure localization), recoverable, etc.
I thought Val's FS summits were designed to get developers to collaborate,
but it seems everyone has gone back to their corners to work on their own
filesystem?

Working on getting hooks into DM/MD so that the filesystem and RAID layers
can move beyond "ignorance is bliss" when talking to each other would be
great: not rebuilding empty parts of the fs, limiting parity resync to
parts of the fs that were in the previous transaction, using fs-supplied
checksums to verify that on-disk data is correct, using RAID geometry when
doing allocations, etc.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

--
Best Regards
Sorin Faibish
Senior Technologist
Senior Consulting Software Engineer Network Storage Group

   EMC²
where information lives

Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [EMAIL PROTECTED]


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Nick Piggin
On Saturday 29 September 2007 04:41, Christoph Lameter wrote:
> On Fri, 28 Sep 2007, Peter Zijlstra wrote:
> > memory got massively fragmented, as anti-frag gets easily defeated.
> > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> > order blocks to stay available, so we don't mix types. however 12M on
> > 128M is rather a lot.
>
> Yes, strict ordering would be much better. On NUMA it may be possible to
> completely forbid merging. We can fall back to other nodes if necessary.
> 12M is not much on a NUMA system.
>
> But this shows that (unsurprisingly) we may have issues on systems with
> small amounts of memory, and we may not want to use higher orders on such
> systems.
>
> The case you got may be good to use as a testcase for the virtual
> fallback. Hmmm... Maybe it is possible to allocate the stack as a virtual
> compound page. Got some script/code to produce that problem?

Yeah, you could do that, but we generally don't have big problems allocating
stacks in mainline, because we have very few users of higher order pages,
and the few that are there don't seem to be a problem.
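
Since Christoph asked for a reproducer: a minimal sketch of the workload
Peter described earlier in the thread (two instances writing sequentially
through separate 64M file mmaps, plus one 64M anonymous). The usage,
names, and minimal error handling are illustrative only:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (64UL << 20)				/* 64M */

int main(int argc, char **argv)
{
	char *p;

	if (argc > 1 && strcmp(argv[1], "anon")) {
		/* file-backed variant: run two of these on separate files */
		int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
		if (fd < 0 || ftruncate(fd, SZ) < 0) {
			perror("open/ftruncate");
			return 1;
		}
		p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	} else {
		/* "anon" (or no argument): the 64M anonymous variant */
		p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	}
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (;;)				/* sequential writes, forever */
		for (size_t i = 0; i < SZ; i += 4096)
			p[i] = (char)i;
}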


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Nick Piggin
On Saturday 29 September 2007 19:27, Andrew Morton wrote:
> On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > > oom-killings, or page allocation failures?  The latter, one hopes.
> >
> > Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2
> > (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
> >
> > ...
> >
> >
> > mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
> > Call Trace:
> > 611b3878:  [<6002dd28>] printk_ratelimit+0x15/0x17
> > 611b3888:  [<60052ed4>] out_of_memory+0x80/0x100
> > 611b38c8:  [<60054b0c>] __alloc_pages+0x1ed/0x280
> > 611b3948:  [<6006c608>] allocate_slab+0x5b/0xb0
> > 611b3968:  [<6006c705>] new_slab+0x7e/0x183
> > 611b39a8:  [<6006cbae>] __slab_alloc+0xc9/0x14b
> > 611b39b0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> > 611b39b8:  [<600980f2>] do_mpage_readpage+0x3b3/0x472
> > 611b39e0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> > 611b39f8:  [<6006cc81>] kmem_cache_alloc+0x51/0x98
> > 611b3a38:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> > 611b3a58:  [<6004f8e2>] add_to_page_cache+0x22/0xf7
> > 611b3a98:  [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
> > 611b3ab8:  [<6009821e>] mpage_readpages+0x6d/0x109
> > 611b3ac0:  [<600d59f0>] ext3_get_block+0x0/0xf2
> > 611b3b08:  [<6005483d>] get_page_from_freelist+0x8d/0xc1
> > 611b3b88:  [<600d6937>] ext3_readpages+0x18/0x1a
> > 611b3b98:  [<60056f00>] read_pages+0x37/0x9b
> > 611b3bd8:  [<60057064>] __do_page_cache_readahead+0x100/0x157
> > 611b3c48:  [<60057196>] do_page_cache_readahead+0x52/0x5f
> > 611b3c78:  [<60050ab4>] filemap_fault+0x145/0x278
> > 611b3ca8:  [<60022b61>] run_syscall_stub+0xd1/0xdd
> > 611b3ce8:  [<6005eae3>] __do_fault+0x7e/0x3ca
> > 611b3d68:  [<6005ee60>] do_linear_fault+0x31/0x33
> > 611b3d88:  [<6005f149>] handle_mm_fault+0x14e/0x246
> > 611b3da8:  [<60120a7b>] __up_read+0x73/0x7b
> > 611b3de8:  [<60013177>] handle_page_fault+0x11f/0x23b
> > 611b3e48:  [<60013419>] segv+0xac/0x297
> > 611b3f28:  [<60013367>] segv_handler+0x68/0x6e
> > 611b3f48:  [<600232ad>] get_skas_faultinfo+0x9c/0xa1
> > 611b3f68:  [<60023853>] userspace+0x13a/0x19d
> > 611b3fc8:  [<60010d58>] fork_handler+0x86/0x8d
>
> OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
> allocations aren't supposed to fail.
>
> I'm suspecting that did_some_progress thing.

The allocation didn't fail -- it invoked the OOM killer because the kernel
ran out of unfragmented memory. Probably because higher order
allocations are the new vogue in -mm at the moment ;)


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> > oom-killings, or page allocation failures?  The latter, one hopes.
> 
> 
> Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 
> (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
> 
> ...
> 
> 
> mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
> Call Trace:
> 611b3878:  [<6002dd28>] printk_ratelimit+0x15/0x17
> 611b3888:  [<60052ed4>] out_of_memory+0x80/0x100
> 611b38c8:  [<60054b0c>] __alloc_pages+0x1ed/0x280
> 611b3948:  [<6006c608>] allocate_slab+0x5b/0xb0
> 611b3968:  [<6006c705>] new_slab+0x7e/0x183
> 611b39a8:  [<6006cbae>] __slab_alloc+0xc9/0x14b
> 611b39b0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> 611b39b8:  [<600980f2>] do_mpage_readpage+0x3b3/0x472
> 611b39e0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> 611b39f8:  [<6006cc81>] kmem_cache_alloc+0x51/0x98
> 611b3a38:  [<6011f89f>] radix_tree_preload+0x70/0xbf
> 611b3a58:  [<6004f8e2>] add_to_page_cache+0x22/0xf7
> 611b3a98:  [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
> 611b3ab8:  [<6009821e>] mpage_readpages+0x6d/0x109
> 611b3ac0:  [<600d59f0>] ext3_get_block+0x0/0xf2
> 611b3b08:  [<6005483d>] get_page_from_freelist+0x8d/0xc1
> 611b3b88:  [<600d6937>] ext3_readpages+0x18/0x1a
> 611b3b98:  [<60056f00>] read_pages+0x37/0x9b
> 611b3bd8:  [<60057064>] __do_page_cache_readahead+0x100/0x157
> 611b3c48:  [<60057196>] do_page_cache_readahead+0x52/0x5f
> 611b3c78:  [<60050ab4>] filemap_fault+0x145/0x278
> 611b3ca8:  [<60022b61>] run_syscall_stub+0xd1/0xdd
> 611b3ce8:  [<6005eae3>] __do_fault+0x7e/0x3ca
> 611b3d68:  [<6005ee60>] do_linear_fault+0x31/0x33
> 611b3d88:  [<6005f149>] handle_mm_fault+0x14e/0x246
> 611b3da8:  [<60120a7b>] __up_read+0x73/0x7b
> 611b3de8:  [<60013177>] handle_page_fault+0x11f/0x23b
> 611b3e48:  [<60013419>] segv+0xac/0x297
> 611b3f28:  [<60013367>] segv_handler+0x68/0x6e
> 611b3f48:  [<600232ad>] get_skas_faultinfo+0x9c/0xa1
> 611b3f68:  [<60023853>] userspace+0x13a/0x19d
> 611b3fc8:  [<60010d58>] fork_handler+0x86/0x8d

OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
allocations aren't supposed to fail.

I'm suspecting that did_some_progress thing.
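
To make "order-2 GFP_KERNEL" concrete: the reported gfp_mask decodes as
GFP_KERNEL | __GFP_COMP. A quick userspace check, with the flag values
assumed from 2.6.23's include/linux/gfp.h:

#include <stdio.h>

#define __GFP_WAIT	0x10u
#define __GFP_IO	0x40u
#define __GFP_FS	0x80u
#define __GFP_COMP	0x4000u
#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)

int main(void)
{
	unsigned mask = 0x40d0, order = 2;

	printf("GFP_KERNEL bits present: %s\n",
	       (mask & GFP_KERNEL) == GFP_KERNEL ? "yes" : "no");
	printf("remaining bits: %#x (__GFP_COMP)\n", mask & ~GFP_KERNEL);
	printf("order-%u = %u contiguous pages = %u kB\n",
	       order, 1u << order, 4u << order);
	return 0;
}

/* prints: GFP_KERNEL bits present: yes
 *         remaining bits: 0x4000 (__GFP_COMP)
 *         order-2 = 4 contiguous pages = 16 kB */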


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Peter Zijlstra

On Sat, 2007-09-29 at 02:01 -0700, Andrew Morton wrote:
> On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > 
> > On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:
> > 
> > > Ah, right, that was the detail... all this lumpy reclaim is useless for
> > > atomic allocations. And with SLUB using higher order pages, atomic !0
> > > order allocations will be very very common.
> > > 
> > > One I can remember was:
> > > 
> > >   add_to_page_cache()
> > > radix_tree_insert()
> > >   radix_tree_node_alloc()
> > > kmem_cache_alloc()
> > > 
> > > which is an atomic callsite.
> > > 
> > > Which leaves us in a situation where we can load pages, because there is
> > > free memory, but can't manage to allocate memory to track them.. 
> > 
> > Ah, I found a boot log of one of these sessions, it's also full of
> > order-2 OOMs.. :-/
> 
> oom-killings, or page allocation failures?  The latter, one hopes.


Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 
(Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007

...


mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
Call Trace:
611b3878:  [<6002dd28>] printk_ratelimit+0x15/0x17
611b3888:  [<60052ed4>] out_of_memory+0x80/0x100
611b38c8:  [<60054b0c>] __alloc_pages+0x1ed/0x280
611b3948:  [<6006c608>] allocate_slab+0x5b/0xb0
611b3968:  [<6006c705>] new_slab+0x7e/0x183
611b39a8:  [<6006cbae>] __slab_alloc+0xc9/0x14b
611b39b0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
611b39b8:  [<600980f2>] do_mpage_readpage+0x3b3/0x472
611b39e0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
611b39f8:  [<6006cc81>] kmem_cache_alloc+0x51/0x98
611b3a38:  [<6011f89f>] radix_tree_preload+0x70/0xbf
611b3a58:  [<6004f8e2>] add_to_page_cache+0x22/0xf7
611b3a98:  [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
611b3ab8:  [<6009821e>] mpage_readpages+0x6d/0x109
611b3ac0:  [<600d59f0>] ext3_get_block+0x0/0xf2
611b3b08:  [<6005483d>] get_page_from_freelist+0x8d/0xc1
611b3b88:  [<600d6937>] ext3_readpages+0x18/0x1a
611b3b98:  [<60056f00>] read_pages+0x37/0x9b
611b3bd8:  [<60057064>] __do_page_cache_readahead+0x100/0x157
611b3c48:  [<60057196>] do_page_cache_readahead+0x52/0x5f
611b3c78:  [<60050ab4>] filemap_fault+0x145/0x278
611b3ca8:  [<60022b61>] run_syscall_stub+0xd1/0xdd
611b3ce8:  [<6005eae3>] __do_fault+0x7e/0x3ca
611b3d68:  [<6005ee60>] do_linear_fault+0x31/0x33
611b3d88:  [<6005f149>] handle_mm_fault+0x14e/0x246
611b3da8:  [<60120a7b>] __up_read+0x73/0x7b
611b3de8:  [<60013177>] handle_page_fault+0x11f/0x23b
611b3e48:  [<60013419>] segv+0xac/0x297
611b3f28:  [<60013367>] segv_handler+0x68/0x6e
611b3f48:  [<600232ad>] get_skas_faultinfo+0x9c/0xa1
611b3f68:  [<60023853>] userspace+0x13a/0x19d
611b3fc8:  [<60010d58>] fork_handler+0x86/0x8d

Mem-info:
Normal per-cpu:
CPU0: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
Active:11 inactive:9 dirty:0 writeback:1 unstable:0
 free:19533 slab:10587 mapped:0 pagetables:260 bounce:0
Normal free:78132kB min:4096kB low:5120kB high:6144kB active:44kB inactive:36kB 
present:129280kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
Normal: 7503*4kB 5977*8kB 19*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 
0*1024kB 0*2048kB 0*4096kB = 78132kB
Swap cache: add 1192822, delete 1192790, find 491441/626861, race 0+1
Free swap  = 455300kB
Total swap = 524280kB
Free swap:   455300kB
32768 pages of RAM
0 pages of HIGHMEM
1948 reserved pages
11 pages shared
32 pages swap cached
Out of memory: kill process 2647 (portmap) score 2233 or a child
Killed process 2647 (portmap)
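
The buddy printout above is the telling part: checking the arithmetic
(free-list counts copied from the Mem-info dump) shows plenty of free
memory but almost nothing usable at order 2 and above:

#include <stdio.h>

int main(void)
{
	/* pages free per order, from the "Normal:" line above */
	unsigned count[11] = { 7503, 5977, 19, 0, 0, 0, 0, 0, 0, 0, 0 };
	unsigned kb = 0;

	for (int order = 0; order < 11; order++)
		kb += count[order] * (4u << order);

	printf("total free: %u kB\n", kb);	/* 78132 kB, as reported */
	printf("blocks that can satisfy an order-2 (16 kB) request: %u\n",
	       count[2]);			/* just 19, and none above */
	return 0;
}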




Re: Upgrading datastructures between different filesystem versions

2007-09-29 Thread Christoph Hellwig
On Fri, Sep 28, 2007 at 03:47:24PM -0400, Theodore Tso wrote:
> Ext3 does something similar, zapping space at the beginning AND the
> end of the partition (because the MD superblocks are at the end).
> It's just a misfeature of reiserfs's mkfs that it doesn't do this.

mkfs.xfs of course also wipes at the end.  I just wanted to show how
easy this is to fix.
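
A minimal sketch of the zapping being discussed -- zero a little space at
both ends of the device so stale filesystem and MD superblocks don't
survive a re-mkfs. The 64k wipe length is an arbitrary choice and error
handling is abbreviated:

#include <fcntl.h>
#include <linux/fs.h>		/* BLKGETSIZE64 */
#include <sys/ioctl.h>
#include <unistd.h>

#define WIPE_LEN 65536

static int wipe_ends(const char *dev)
{
	static char zeros[WIPE_LEN];	/* static => zero-initialized */
	unsigned long long size;
	int fd = open(dev, O_WRONLY);

	if (fd < 0 || ioctl(fd, BLKGETSIZE64, &size) < 0)
		return -1;
	pwrite(fd, zeros, WIPE_LEN, 0);			/* start of device */
	pwrite(fd, zeros, WIPE_LEN, size - WIPE_LEN);	/* end: MD superblock */
	fsync(fd);
	return close(fd);
}

int main(int argc, char **argv)
{
	return argc > 1 ? wipe_ends(argv[1]) : 1;
}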



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Sat, 29 Sep 2007 10:47:12 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> 
> On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote:
> > On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> > > 
> > > > > start 2 processes that each mmap a separate 64M file, and which do
> > > > > sequential writes on them. start a 3rd process that does the same with
> > > > > 64M anonymous.
> > > > > 
> > > > > wait for a while, and you'll see order=1 failures.
> > > > 
> > > > Really? That means we can no longer even allocate stacks for forking.
> > > > 
> > > > It's surprising that neither lumpy reclaim nor the mobility patches can 
> > > > deal with it? Lumpy reclaim should be able to free neighboring pages to 
> > > > avoid the order-1 failure unless there are lots of pinned pages.
> > > > 
> > > > I guess then that lots of pages are pinned through I/O?
> > > 
> > > memory got massively fragmented, as anti-frag gets easily defeated.
> > > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> > > order blocks to stay available, so we don't mix types. however 12M on
> > > 128M is rather a lot.
> > > 
> > > it's still on my todo list to look at it further..
> > > 
> > 
> > That would be really really bad (as in: patch-dropping time) if those
> > order-1 allocations are not atomic.
> > 
> > What's the callsite? 
> 
> Ah, right, that was the detail... all this lumpy reclaim is useless for
> atomic allocations. And with SLUB using higher order pages, atomic !0
> order allocations will be very very common.

Oh OK.

I thought we'd already fixed slub so that it didn't do that.  Maybe that
fix is in -mm but I don't think so.

Trying to do atomic order-1 allocations on behalf of arbitrary slab caches
just won't fly - this is a significant degradation in kernel reliability,
as you've very easily demonstrated.

> One I can remember was:
> 
>   add_to_page_cache()
> radix_tree_insert()
>   radix_tree_node_alloc()
> kmem_cache_alloc()
> 
> which is an atomic callsite.
> 
> Which leaves us in a situation where we can load pages, because there is
> free memory, but can't manage to allocate memory to track them.. 

Right.  Leading to application failure which for many is equivalent to a
complete system outage.



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> 
> On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:
> 
> > Ah, right, that was the detail... all this lumpy reclaim is useless for
> > atomic allocations. And with SLUB using higher order pages, atomic !0
> > order allocations will be very very common.
> > 
> > One I can remember was:
> > 
> >   add_to_page_cache()
> > radix_tree_insert()
> >   radix_tree_node_alloc()
> > kmem_cache_alloc()
> > 
> > which is an atomic callsite.
> > 
> > Which leaves us in a situation where we can load pages, because there is
> > free memory, but can't manage to allocate memory to track them.. 
> 
> Ah, I found a boot log of one of these sessions, it's also full of
> order-2 OOMs.. :-/

oom-killings, or page allocation failures?  The latter, one hopes.


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Peter Zijlstra

On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:

> Ah, right, that was the detail... all this lumpy reclaim is useless for
> atomic allocations. And with SLUB using higher order pages, atomic !0
> order allocations will be very very common.
> 
> One I can remember was:
> 
>   add_to_page_cache()
> radix_tree_insert()
>   radix_tree_node_alloc()
> kmem_cache_alloc()
> 
> which is an atomic callsite.
> 
> Which leaves us in a situation where we can load pages, because there is
> free memory, but can't manage to allocate memory to track them.. 

Ah, I found a boot log of one of these sessions, it's also full of
order-2 OOMs.. :-/



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Peter Zijlstra

On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote:
> On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > 
> > On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> > 
> > > > start 2 processes that each mmap a separate 64M file, and which do
> > > > sequential writes on them. start a 3rd process that does the same with
> > > > 64M anonymous.
> > > > 
> > > > wait for a while, and you'll see order=1 failures.
> > > 
> > > Really? That means we can no longer even allocate stacks for forking.
> > > 
> > > It's surprising that neither lumpy reclaim nor the mobility patches can 
> > > deal with it? Lumpy reclaim should be able to free neighboring pages to 
> > > avoid the order-1 failure unless there are lots of pinned pages.
> > > 
> > > I guess then that lots of pages are pinned through I/O?
> > 
> > memory got massively fragmented, as anti-frag gets easily defeated.
> > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> > order blocks to stay available, so we don't mix types. however 12M on
> > 128M is rather a lot.
> > 
> > it's still on my todo list to look at it further..
> > 
> 
> That would be really really bad (as in: patch-dropping time) if those
> order-1 allocations are not atomic.
> 
> What's the callsite? 

Ah, right, that was the detail... all this lumpy reclaim is useless for
atomic allocations. And with SLUB using higher order pages, atomic !0
order allocations will be very very common.

One I can remember was:

  add_to_page_cache()
radix_tree_insert()
  radix_tree_node_alloc()
kmem_cache_alloc()

which is an atomic callsite.

Which leaves us in a situation where we can load pages, because there is
free memory, but can't manage to allocate memory to track them.. 
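
For reference, a sketch of that callsite, shaped after the 2.6-era
add_to_page_cache() (paraphrased from memory, not the exact source):

	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
	if (error == 0) {
		write_lock_irq(&mapping->tree_lock);	/* atomic from here */
		error = radix_tree_insert(&mapping->page_tree, offset, page);
		if (!error) {
			page_cache_get(page);
			SetPageLocked(page);
			page->mapping = mapping;
			page->index = offset;
			mapping->nrpages++;
		}
		write_unlock_irq(&mapping->tree_lock);
		radix_tree_preload_end();
	}

	/* the preload runs in sleeping context, but radix_tree_insert()
	 * runs under the tree lock, so any node allocation it still needs
	 * is atomic -- and with SLUB using higher order slabs, that means
	 * an atomic order-1+ page allocation */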



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Peter Zijlstra

On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:

> Really? That means we can no longer even allocate stacks for forking.

I think I'm running with 4k stacks...
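
(That detail matters here: with the default 8k stacks every fork needs an
order-1 block, while 4k stacks need only order 0. A tiny check, assuming
4k pages:)

#include <stdio.h>

/* smallest buddy order whose block covers 'bytes', assuming 4k pages */
static unsigned order_for(unsigned bytes)
{
	unsigned order = 0, size = 4096;

	while (size < bytes) {
		size <<= 1;
		order++;
	}
	return order;
}

int main(void)
{
	printf("8k kernel stacks -> order %u per fork\n", order_for(8192));
	printf("4k kernel stacks -> order %u per fork\n", order_for(4096));
	return 0;
}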



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> 
> On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> 
> > > start 2 processes that each mmap a separate 64M file, and which do
> > > sequential writes on them. start a 3rd process that does the same with
> > > 64M anonymous.
> > > 
> > > wait for a while, and you'll see order=1 failures.
> > 
> > Really? That means we can no longer even allocate stacks for forking.
> > 
> > It's surprising that neither lumpy reclaim nor the mobility patches can 
> > deal with it? Lumpy reclaim should be able to free neighboring pages to 
> > avoid the order-1 failure unless there are lots of pinned pages.
> > 
> > I guess then that lots of pages are pinned through I/O?
> 
> memory got massively fragmented, as anti-frag gets easily defeated.
> setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> order blocks to stay available, so we don't mix types. however 12M on
> 128M is rather a lot.
> 
> it's still on my todo list to look at it further..
> 

That would be really really bad (as in: patch-dropping time) if those
order-1 allocations are not atomic.

What's the callsite? 
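
As a back-of-the-envelope check on the 12M figure quoted above (assuming
4k pages and the default MAX_ORDER of 11):

#include <stdio.h>

int main(void)
{
	unsigned block_kb = (1u << 10) * 4;	/* one MAX_ORDER-1 block: 4096 kB */
	unsigned min_free_kb = 12 * 1024;	/* min_free_kbytes = 12M */
	unsigned mem_kb = 128 * 1024;		/* the 128M machine */

	printf("max-order block: %u kB\n", block_kb);
	printf("12M covers %u such blocks\n", min_free_kb / block_kb);
	printf("held back: %.1f%% of RAM\n", 100.0 * min_free_kb / mem_kb);
	return 0;
}

/* 12M spans three 4M blocks, of which roughly two stay whole once the
 * low/high watermarks take their share -- hence "2 max order blocks",
 * and nearly 10% of a 128M machine, hence "rather a lot". */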