Re: migratepage failures on reiserfs

2007-11-07 Thread Mel Gorman
On (05/11/07 14:46), Christoph Lameter didst pronounce:
> On Mon, 5 Nov 2007, Mel Gorman wrote:
> 
> > The grow_dev_page() pages should be reclaimable even though migration
> > is not supported for those pages? They were marked movable as it was
> > useful for lumpy reclaim taking back pages for hugepage allocations and
> > the like. Would it make sense for memory unremove to attempt migration
> > first and reclaim second?
> 
> Note that a page is still movable even if there is no file system method 
> for migration available. In that case the page needs to be cleaned before 
> it can be moved.
> 

Badari, do you know if the pages failed to migrate because they were
dirty or because the filesystem simply had ownership of the pages and
wouldn't let them go?

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: migratepage failures on reiserfs

2007-11-05 Thread Mel Gorman
On (01/11/07 10:10), Badari Pulavarty didst pronounce:
> On Thu, 2007-11-01 at 11:51 -0400, Chris Mason wrote:
> > On Thu, 01 Nov 2007 08:38:57 -0800
> > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > 
> > > On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote:
> > > > On Wed, 31 Oct 2007 08:14:21 -0800
> > > > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > > > > 
> > > > > I tried data=writeback mode and it didn't help :(
> > > > 
> > > > Ouch, so much for the easy way out.
> > > > 
> > > > > 
> > > > > unable to release the page 262070
> > > > > bh c000211b9408 flags 110029 count 1 private 0
> > > > > unable to release the page 262098
> > > > > bh c00020ec9198 flags 110029 count 1 private 0
> > > > > memory offlining 3f000 to 4 failed
> > > > > 
> > > > 
> > > > The only other special thing reiserfs does with the page cache is
> > > > file tails.  I don't suppose all of these pages are index zero in
> > > > files smaller than 4k?
> > > 
> > > Ah !! I am so blind :(
> > > 
> > > I have been suspecting reiserfs all along, since its executing
> > > fallback_migrate_page(). Actually, these buffer heads are
> > > backing blockdev. I guess these are metadata buffers :( 
> > > I am not sure we can do much with these..
> > 
> > Hmpf, my first reply had a paragraph about the block device inode
> > pages, I noticed the phrase file data pages and deleted it ;)
> > 
> > But, for the metadata buffers there's not much we can do.  They are
> > included in a bunch of different lists and the patch would
> > be non-trivial.
> 
> Unfortunately, these buffer pages are spread all around making
> those sections of memory non-removable. Of course, one can use
> ZONE_MOVABLE to make sure to guarantee the remove. But I am
> hoping we could easily group all these allocations and minimize
> spreading them around. Mel ?

The grow_dev_page() pages should be reclaimable even though migration
is not supported for those pages? They were marked movable as it was
useful for lumpy reclaim taking back pages for hugepage allocations and
the like. Would it make sense for memory unremove to attempt migration
first and reclaim second?

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Mel Gorman
On (28/09/07 11:41), Christoph Lameter didst pronounce:
> On Fri, 28 Sep 2007, Peter Zijlstra wrote:
> 
> > memory got massively fragemented, as anti-frag gets easily defeated.
> > setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
> > order blocks to stay available, so we don't mix types. however 12M on
> > 128M is rather a lot.
> 
> Yes, strict ordering would be much better. On NUMA it may be possible to 
> completely forbid merging.

Forbidding merging is trivial and the code is isolated to one function,
__rmqueue_fallback(). We don't do it because the decision at development
time was that it was better to allow fragmentation than to take a reclaim
step, for example[1], and slow things down. This was based on my initial
assumption that anti-frag is mainly of interest to hugepages, which are
happy to wait long periods during startup or to fail.

> We can fall back to other nodes if necessary. 
> 12M is not much on a NUMA system.
> 
> But this shows that (unsurprisingly) we may have issues on systems with a 
> small amounts of memory and we may not want to use higher orders on such 
> systems.
> 

This is another option if you want to use a higher order for SLUB by
default: use order-0 unless you are sure there is enough memory. At boot,
if there is plenty of memory, set the higher order and raise min_free_kbytes
on each node to reduce mixing[2]. We can test with Peter's uber-hostile
case to see if it works[3].
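
As a rough illustration of that boot-time heuristic, something like the
following userspace sketch would do (the 1GB threshold and the target value
are made-up numbers for illustration, not values from any patch):

/* Sketch: if the machine has plenty of RAM, raise min_free_kbytes.
 * The threshold and target below are arbitrary examples. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];
    long memtotal_kb = 0;

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "MemTotal: %ld kB", &memtotal_kb) == 1)
            break;
    }
    fclose(f);

    if (memtotal_kb >= 1024 * 1024) {       /* >= 1GB: "loads of memory" */
        f = fopen("/proc/sys/vm/min_free_kbytes", "w");
        if (!f)
            return 1;
        fprintf(f, "%d\n", 16384);          /* example value only */
        fclose(f);
    }
    return 0;
}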

> The case you got may be good to use as a testcase for the virtual 
> fallback. H...

For sure.

> Maybe it is possible to allocate the stack as a virtual 
> compound page. Got some script/code to produce that problem?
> 

[1] It might be tunnel vision but I still keep hugepages in mind as the
principal user of anti-frag. Andy used to have patches that force evicted
pages of the "foreign" type when mixing occured so the end result was
no mixing. We never fully completed them because it was too costly
for hugepages.

[2] This would require the identification of mixed blocks to be a
statistic available in mainline. Right now, it's only available in -mm
when PAGE_OWNER is set

[3] The definition of working in this case being that order-0
allocations fail which he has produced

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Mel Gorman
On (28/09/07 10:33), Christoph Lameter didst pronounce:
> On Fri, 28 Sep 2007, Nick Piggin wrote:
> 
> > On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
> > > SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
> > > available then the conservative settings for higher order allocations are
> > > overridden. We then request an order that can accomodate at mininum
> > > 100 objects. The size of an individual slab allocation is allowed to reach
> > > up to 256k (order 6 on i386, order 4 on IA64).
> > 
> > How come SLUB wants such a big amount of objects? I thought the
> > unqueued nature of it made it better than slab because it minimised
> > the amount of cache hot memory lying around in slabs...
> 
> The more objects in a page the more the fast path runs. The more the fast 
> path runs the lower the cache footprint and the faster the overall 
> allocations etc.
> 
> SLAB can be configured for large queues holdings lots of objects. 
> SLUB can only reach the same through large pages because it does not 
> have queues.

Large pages, flood gates etc. Be wary.

SLUB has to run 100% reliably or things go whoops. SLUB regularly depends on
atomic allocations and cannot take the necessary steps to get contiguous
pages if it gets into trouble. This means that something like lumpy reclaim
cannot help you in its current state.

We currently do not take pre-emptive steps with kswapd to ensure that
high-order pages are free. Nor do we do something like have users that
can sleep keep the watermarks high. I had considered the possibility but
didn't have the justification for the complexity.

Minimally, SLUB by default should continue to use order-0 pages. Peter has
managed to bust order-1 allocations with mem=128MB. Admittedly, it was a
really hostile workload but the point remains. It was artificially worked
around with min_free_kbytes (a value set based on pageblock_order; it could
also have been artificially worked around by dropping pageblock_order) and he
eventually caused order-0 failures, so the workload is pretty damn hostile to
everything.
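
For reference, the workload in question (described elsewhere in this thread)
was roughly: two processes each mmap() and sequentially dirty a separate 64MB
file, while a third does the same to 64MB of anonymous memory, all on a
mem=128MB machine. A crude userspace approximation, with placeholder sizes
and no error reporting, looks like:

/* Dirty a 64MB mapping sequentially, forever. Run two instances on
 * separate files plus one instance with no argument (anonymous memory)
 * on a small-memory machine. Sizes are placeholders. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (64UL << 20)

int main(int argc, char **argv)
{
    char *map;

    if (argc > 1) {
        int fd = open(argv[1], O_RDWR | O_CREAT, 0644);

        if (fd < 0 || ftruncate(fd, SZ) < 0)
            return 1;
        map = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    } else {
        map = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (map == MAP_FAILED)
        return 1;

    for (;;) {
        memset(map, 0xaa, SZ);              /* sequential writes */
        msync(map, SZ, MS_ASYNC);
    }
}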

> One could add the ability to manage pools of cpu slabs but 
> that would be adding yet another layer to compensate for the problem of 
> the small pages.

A compromise may be to have per-cpu lists for higher-order pages in the page
allocator itself, as they can be easily drained, unlike the SLAB queues. The
thing to watch for would be excessive IPI calls, which would offset any
performance gained by SLUB using larger pages.
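
To make the drain concern concrete, the idea is roughly the following (a toy
userspace model, not page allocator code; all names and numbers here are
invented for illustration):

/* Toy model of per-cpu caches of higher-order blocks. Frees and
 * allocations on the local "cpu" are cheap and lock-free; draining
 * every cache is the operation that, in the kernel, would need IPIs
 * or per-cpu work, which is the cost being worried about. */
#include <stdio.h>

#define NR_CPUS 4

static int pcp_blocks[NR_CPUS];     /* cached order-N blocks per cpu */

static void pcp_free_block(int cpu)
{
    pcp_blocks[cpu]++;              /* fast path: no global lock */
}

static int pcp_alloc_block(int cpu)
{
    if (pcp_blocks[cpu] > 0) {
        pcp_blocks[cpu]--;
        return 1;                   /* served from the local cache */
    }
    return 0;                       /* would fall back to the buddy lists */
}

static int drain_all_caches(void)
{
    int cpu, drained = 0;

    /* The expensive step: every remote cpu has to give its blocks back. */
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        drained += pcp_blocks[cpu];
        pcp_blocks[cpu] = 0;
    }
    return drained;
}

int main(void)
{
    pcp_free_block(0);
    pcp_free_block(1);
    if (!pcp_alloc_block(2))
        printf("cpu2 empty; drained %d blocks from the other cpus\n",
               drain_all_caches());
    return 0;
}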

> Reliable large page allocations means that we can get rid 
> of these layers and the many workarounds that we have in place right now.
> 

They are not reliable yet, particularly for atomic allocs.

> The unqueued nature of SLUB reduces memory requirements and in general the 
> more efficient code paths of SLUB offset the advantage that SLAB can reach 
> by being able to put more objects onto its queues. SLAB necessarily 
> introduces complexity and cache line use through the need to manage those 
> queues.
> 
> > vmalloc is incredibly slow and unscalable at the moment. I'm still working
> > on making it more scalable and faster -- hopefully to a point where it would
> > actually be usable for this... but you still get moved off large TLBs, and
> > also have to inevitably do tlb flushing.
> 
> Again I have not seen any fallbacks to vmalloc in my testing. What we are 
> doing here is mainly to address your theoretical cases that we so far have 
> never seen to be a problem and increase the reliability of allocations of
> page orders larger than 3 to a usable level. So far I have so far not 
> dared to enable orders larger than 3 by default.
> 
> AFAICT The performance of vmalloc is not really relevant. If this would 
> become an issue then it would be possible to reduce the orders used to 
> avoid fallbacks.
> 

If we're falling back to vmalloc ever, there is a danger that the
problem is postponed until vmalloc space is consumed. More an issue for
32 bit.

> > Or do you have SLUB at a point where performance is comparable to SLAB,
> > and this is just a possible idea for more performance?
> 
> AFAICT SLUBs performance is superior to SLAB in most cases and it was like 
> that from the beginning. I am still concerned about several corner cases 
> though (I think most of them are going to be addressed by the per cpu 
> patches in mm). Having a comparable or larger amount of per cpu objects as 
> SLAB is something that also could address some of these concerns and could 
> increase performance much further.
> 

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Mel Gorman
On (28/09/07 20:25), Peter Zijlstra didst pronounce:
> 
> On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
> 
> > > start 2 processes that each mmap a separate 64M file, and which does
> > > sequential writes on them. start a 3th process that does the same with
> > > 64M anonymous.
> > > 
> > > wait for a while, and you'll see order=1 failures.
> > 
> > Really? That means we can no longer even allocate stacks for forking.
> > 
> > Its surprising that neither lumpy reclaim nor the mobility patches can 
> > deal with it? Lumpy reclaim should be able to free neighboring pages to 
> > avoid the order 1 failure unless there are lots of pinned pages.
> > 
> > I guess then that lots of pages are pinned through I/O?
> 
> memory got massively fragemented, as anti-frag gets easily defeated.
> setting min_free_kbytes to 12M does seem to solve it - it forces 2 max

The 12MB is related to the size of pageblock_order. I strongly suspect
that if you forced pageblock_order to be something like 4 or 5, the
min_free_kbytes would not need to be raised. The current values are
selected based on the hugepage size.
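
A back-of-the-envelope version of that relationship (the calculation in the
patches is not exactly this, but the proportions are): keeping roughly one
pageblock free per migrate type in play needs about
nr_types * 2^pageblock_order pages. With three types and 4K pages:

/* Rough reserve needed to keep one block per migrate type free.
 * Only illustrates why the figure scales with pageblock_order;
 * the real calculation differs in detail. */
#include <stdio.h>

int main(void)
{
    const long page_kb = 4;     /* 4K pages */
    const int nr_types = 3;     /* e.g. movable, unmovable, reclaimable */
    int order;

    for (order = 4; order <= 10; order += 3) {
        long block_kb = (1L << order) * page_kb;
        printf("pageblock_order %2d: block %5ldkB, reserve ~%ldkB\n",
               order, block_kb, nr_types * block_kb);
    }
    return 0;
}

At pageblock_order 10 (4MB blocks) that comes to about 12MB, which is the
ballpark Peter ended up at; at order 4 or 5 it is a few hundred kilobytes.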

> order blocks to stay available, so we don't mix types. however 12M on
> 128M is rather a lot.
> 
> its still on my todo list to look at it further..
> 

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Mel Gorman
On (17/09/07 15:00), Christoph Lameter didst pronounce:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
> 
> > I don't know how it would prevent fragmentation from building up
> > anyway. It's commonly the case that potentially unmovable objects
> > are allowed to fill up all of ram (dentries, inodes, etc).
> 
> Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from 
> ZONE_MOVABLE and thus the memory that can be allocated for them is 
> limited.
> 

As Nick points out, having to configure something makes it a #2
solution. However, I at least am ok with that. ZONE_MOVABLE is a get-out
clause for controlling fragmentation no matter what the workload is,
as it gives hard guarantees. Even when ZONE_MOVABLE is replaced by some
mechanism in grouping pages by mobility that forces a number of blocks to be
MIGRATE_MOVABLE_ONLY, the emergency option will still exist.

We still lack data on what sort of workloads really benefit from large
blocks (assuming there are any that cannot also be solved by improving
order-0). With Christoph's approach + grouping pages by mobility +
ZONE_MOVABLE-if-it-screws-up, people can start collecting that data over the
course of the next few months while we're waiting for fsblock or software
pagesize to mature.

Do we really need to keep discussing this when no new point has been made in
a while? Can we at least take out the non-contentious parts of Christoph's
patches, such as the page cache macros, and do something with them?

-- 
Mel "tired of typing" Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Mel Gorman
On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
> On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
> > The 16MB is the size of a hugepage, the size of interest as far as I am
> > concerned. Your idea makes sense for large block support, but much less
> > for huge pages because you are incurring a cost in the general case for
> > something that may not be used.
> 
> Sorry for the misunderstanding, I totally agree!
> 

Great. It's clear that we had different use cases in mind when we were
poking holes in each approach.

> > There is nothing to say that both can't be done. Raise the size of
> > order-0 for large block support and continue trying to group the block
> > to make hugepage allocations likely succeed during the lifetime of the
> > system.
> 
> Sure, I completely agree.
> 
> > At the risk of repeating, your approach will be adding a new and
> > significant dimension to the internal fragmentation problem where a
> > kernel allocation may fail because the larger order-0 pages are all
> > being pinned by userspace pages.
> 
> This is exactly correct, some memory will be wasted. It'll reach 0
> free memory more quickly depending on which kind of applications are
> being run.
> 

I look forward to seeing how you deal with it. When/if you get to trying
to move pages out of slabs, I suggest you take a look at the Memory
Compaction patches or the memory unplug patches for simple examples of
how to use page migration.

> > It improves the probabilty of hugepage allocations working because the
> > blocks with slab pages can be targetted and cleared if necessary.
> 
> Agreed.
> 
> > That suggestion was aimed at the large block support more than
> > hugepages. It helps large blocks because we'll be allocating and freeing
> > as more or less the same size. It certainly is easier to set
> > slub_min_order to the same size as what is needed for large blocks in
> > the system than introducing the necessary mechanics to allocate
> > pagetable pages and userspace pages from slab.
> 
> Allocating userpages from slab in 4k chunks with a 64k PAGE_SIZE is
> really complex indeed. I'm not planning for that in the short term but
> it remains a possibility to make the kernel more generic. Perhaps it
> could worth it...
> 

Perhaps.

> Allocating ptes from slab is fairly simple but I think it would be
> better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
> nearby ptes in the per-task local pagetable tree, to reduce the number
> of locks taken and not to enter the slab at all for that.

It runs the risk of pinning up to 60K of data per task that is unusable for
any other purpose. On average, it'll be more like 32K but worth keeping
in mind.

> Infact we
> could allocate the 4 levels (or anyway more than one level) in one
> single alloc_pages(0) and track the leftovers in the mm (or similar).
> 
> > I'm not sure what you are getting at here. I think it would make more
> > sense if you said "when you read /proc/buddyinfo, you know the order-0
> > pages are really free for use with large blocks" with your approach.
> 
> I'm unsure who reads /proc/buddyinfo (that can change a lot and that
> is not very significant information if the vm can defrag well inside
> the reclaim code),

I read it, although as you say, it's difficult to know what will happen if
you try to reclaim memory. That's why there is also a /proc/pagetypeinfo, so
one can see the number of movable blocks that exist. That leads to better
guessing. In -mm, you can also see the number of mixed blocks, but that will
not be available in mainline.
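
For anyone following along, the sort of guessing I mean is just summing the
buddy lists, e.g. (quick and dirty, and it assumes the usual
"Node N, zone NAME count0 count1 ..." layout of /proc/buddyinfo):

/* Sum the free memory sitting in blocks of at least MIN_ORDER
 * according to /proc/buddyinfo. Assumes 4K pages. */
#include <stdio.h>
#include <string.h>

#define MIN_ORDER 5     /* e.g. 128kB blocks with 4K pages */

int main(void)
{
    FILE *f = fopen("/proc/buddyinfo", "r");
    char line[512];
    long total_kb = 0;

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        char *p = strstr(line, "zone");
        long count;
        int order = 0, n;

        if (!p || !(p = strchr(p, ' ')))
            continue;
        while (*p == ' ')               /* skip spaces after "zone" */
            p++;
        while (*p && *p != ' ')         /* skip the zone name itself */
            p++;
        for (; sscanf(p, "%ld%n", &count, &n) == 1; p += n, order++) {
            if (order >= MIN_ORDER)
                total_kb += count * (4L << order);
        }
    }
    fclose(f);
    printf("%ld kB free in order >= %d blocks\n", total_kb, MIN_ORDER);
    return 0;
}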

> but it's not much different and it's more about
> knowing the real meaning of /proc/meminfo, freeable (unmapped) cache,
> anon ram, and free memory.
> 
> The idea is that to succeed an mmap over a large xfs file with
> mlockall being invoked, those largepages must become available or
> it'll be oom despite there are still 512M free... I'm quite sure
> admins will gets confused if they get oom killer invoked with lots of
> ram still free.
> 
> The overcommit feature will also break, just to make an example (so
> much for overcommit 2 guaranteeing -ENOMEM retvals instead of oom
> killage ;).
> 
> > All this aside, there is nothing mutually exclusive with what you are 
> > proposing
> > and what grouping pages by mobility does. Your stuff can exist even if 
> > grouping
> > pages by mobility is in place. In fact, it'll help us by giving an important
> > comparison point as grouping pages by mobility can be trivially disabled 
> > with
> > a one-liner for the purposes of testing. If your ap

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Mel Gorman
On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
> [EMAIL PROTECTED] (Mel Gorman) writes:
> 
> > On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
> >> Andrew Morton <[EMAIL PROTECTED]> writes:
> >> 
> >> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> While I agree with your concern, those numbers are quite silly.  The
> >> >> chances of 99.8% of pages being free and the remaining 0.2% being
> >> >> perfectly spread across all 2MB large_pages are lower than those of SHA1
> >> >> creating a collision.
> >> >
> >> > Actually it'd be pretty easy to craft an application which allocates 
> >> > seven
> >> > pages for pagecache, then one for , then seven for pagecache, 
> >> > then
> >> > one for , etc.
> >> >
> >> > I've had test apps which do that sort of thing accidentally.  The result
> >> > wasn't pretty.
> >> 
> >> Except that the applications 7 pages are movable and the 
> >> would have to be unmovable. And then they should not share the same
> >> memory region. At least they should never be allowed to interleave in
> >> such a pattern on a larger scale.
> >> 
> >
> > It is actually really easy to force regions to never share. At the
> > moment, there is a fallback list that determines a preference for what
> > block to mix.
> >
> > The reason why this isn't enforced is the cost of moving. On x86 and
> > x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
> > those pages to prevent any mixing would be bad enough. On PowerPC, it's
> > potentially 16MB. On IA64, it's 1GB.
> >
> > As this was fragmentation avoidance, not guarantees, the decision was
> > made to not strictly enforce the types of pages within a block as the
> > cost cannot be made back unless the system was making agressive use of
> > large pages. This is not the case with Linux.
> 
> I don't say the group should never be mixed. The movable objects could
> be moved out on demand. If 64k get allocated then up to 64k get
> moved.

This type of action makes sense in the context of Andrea's approach and
large blocks. I don't think it makes sense today to do it in the general
case, at least not yet.

> That would reduce the impact as the kernel does not hang while
> it moves 2MB or even 1GB. It also allows objects to be freed and the
> space reused in the unmovable and mixed groups. There could also be a
> certain number or percentage of mixed groupd be allowed to further
> increase the chance of movable objects freeing themself from mixed
> groups.
> 
> But when you already have say 10% of the ram in mixed groups then it
> is a sign the external fragmentation happens and some time should be
> spend on moving movable objects.
> 

I'll play around with it on the side and see what sort of results I get.
I won't be pushing anything any time soon in relation to this though.
For now, I don't intend to fiddle more with grouping pages by mobility
for something that may or may not be of benefit to a feature that hasn't
been widely tested with what exists today.

> >> The only way a fragmentation catastroph can be (proovable) avoided is
> >> by having so few unmovable objects that size + max waste << ram
> >> size. The smaller the better. Allowing movable and unmovable objects
> >> to mix means that max waste goes way up. In your example waste would
> >> be 7*size. With 2MB uper order limit it would be 511*size.
> >> 
> >> I keep coming back to the fact that movable objects should be moved
> >> out of the way for unmovable ones. Anything else just allows
> >> fragmentation to build up.
> >> 
> >
> > This is easily achieved, just really really expensive because of the
> > amount of copying that would have to take place. It would also compel
> > that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
> > MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
> > a lot of free memory to keep around which is why fragmentation avoidance
> > doesn't do it.
> 
> In your sample graphics you had 1152 groups. Reserving a few of those
> doesnt sound too bad.

No, it doesn't, which is why on those systems I would suggest setting
min_free_kbytes to a higher value. That doesn't work as well on IA-64.

> And how many migrate types do we talk about.
> So
> far we only had movable and unmovable.

Movable, unmovable, reclaimable and reserve in the

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Mel Gorman
On (17/09/07 00:48), Goswin von Brederlow didst pronounce:
> [EMAIL PROTECTED] (Mel Gorman) writes:
> 
> > On (16/09/07 17:08), Andrea Arcangeli didst pronounce:
> >> zooming in I see red pixels all over the squares mized with green
> >> pixels in the same square. This is exactly what happens with the
> >> variable order page cache and that's why it provides zero guarantees
> >> in terms of how much ram is really "free" (free as in "available").
> >> 
> >
> > This picture is not grouping pages by mobility so that is hardly a
> > suprise. This picture is not running grouping pages by mobility. This is
> > what the normal kernel looks like. Look at the videos in
> > http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how list-based
> > compares to vanilla. These are from February when there was less control
> > over mixing blocks than there is today.
> >
> > In the current version mixing occurs in the lower blocks as much as possible
> > not the upper ones. So there are a number of mixed blocks but the number is
> > kept to a minimum.
> >
> > The number of mixed blocks could have been enforced as 0, but I felt it was
> > better in the general case to fragment rather than regress performance.
> > That may be different for large blocks where you will want to take the
> > enforcement steps.
> 
> I agree that 0 is a bad value. But so is infinity. There should be
> some mixing but not a lot. You say "kept to a minimum". Is that
> actively done or already happens by itself. Hopefully the later which
> would be just splendid.
> 

It happens by itself, due to biasing mixed blocks towards lower PFNs. The
exact number is unknown; we used to track it a long time ago but not any more.

> >> With config-page-shift mmap works on 4k chunks but it's always backed
> >> by 64k or any other largesize that you choosed at compile time. And if
> 
> But would mapping a random 4K page out of a file then consume 64k?
> That sounds like an awfull lot of internal fragmentation. I hope the
> unaligned bits and pices get put into a slab or something as you
> suggested previously.
> 

This is a possibility but Andrea seems confident he can handle it.

> >> the virtual alignment of mmap matches the physical alignment of the
> >> physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
> >> could use the 62nd bit of the pte to use a 64k tlb (if future cpus
> >> will allow that). Nick also suggested to still set all ptes equal to
> >> make life easier for the tlb miss microcode.
> 
> It is too bad that existing amd64 CPUs only allow such large physical
> pages. But it kind of makes sense to cut away a full level or page
> tables for the next bigger size each.
> 

Yep on both counts.

> >> > big you can make it. I don't think my system with 1GB ram would work
> >> > so well with 2MB order 0 pages. But I wasn't refering to that but to
> >> > the picture.
> >> 
> >> Sure! 2M is sure way excessive for a 1G system, 64k most certainly
> >> too, of course unless you're running a db or a multimedia streaming
> >> service, in which case it should be ideal.
> 
> rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the
> ocasional mplayer.
> 
> I would mostly be concerned how rtorrents totaly random access of
> mmapped files negatively impacts such a 64k page system.
> 

For what it's worth, the last allocation failure that occurred with
grouping pages by mobility was an order-1 atomic failure for a wireless
network card while bittorrent was running. You're likely right in that
torrents will be an interesting workload in terms of fragmentation.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-17 Thread Mel Gorman
On (17/09/07 00:38), Goswin von Brederlow didst pronounce:
> [EMAIL PROTECTED] (Mel Gorman) writes:
> 
> > On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
> >> Mel Gorman <[EMAIL PROTECTED]> writes:
> >> 
> >> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
> >> >> Nick Piggin <[EMAIL PROTECTED]> writes:
> >> >> 
> >> >> > In my attack, I cause the kernel to allocate lots of unmovable 
> >> >> > allocations
> >> >> > and deplete movable groups. I theoretically then only need to keep a
> >> >> > small number (1/2^N) of these allocations around in order to DoS a
> >> >> > page allocation of order N.
> >> >> 
> >> >> I'm assuming that when an unmovable allocation hijacks a movable group
> >> >> any further unmovable alloc will evict movable objects out of that
> >> >> group before hijacking another one. right?
> >> >> 
> >> >
> >> > No eviction takes place. If an unmovable allocation gets placed in a
> >> > movable group, then steps are taken to ensure that future unmovable
> >> > allocations will take place in the same range (these decisions take
> >> > place in __rmqueue_fallback()). When choosing a movable block to
> >> > pollute, it will also choose the lowest possible block in PFN terms to
> >> > steal so that fragmentation pollution will be as confined as possible.
> >> > Evicting the unmovable pages would be one of those expensive steps that
> >> > have been avoided to date.
> >> 
> >> But then you can have all blocks filled with movable data, free 4K in
> >> one group, allocate 4K unmovable to take over the group, free 4k in
> >> the next group, take that group and so on. You can end with 4k
> >> unmovable in every 64k easily by accident.
> >> 
> >
> > As the mixing takes place at the lowest possible block, it's
> > exceptionally difficult to trigger this. Possible, but exceptionally
> > difficult.
> 
> Why is it difficult?
> 

Unless mlock() is being used, it is difficult to place the pages in the
way you suggest. Feel free to put together a test program that forces an
adverse fragmentation situation; it'll be useful in the future for comparing
the reliability of any large block solution.

> When user space allocates memory wouldn't it get it contiously?

Not unless it's using libhugetlbfs or it's very very early in the
lifetime of the system. Even then, another process faulting at the same
time will break it up.

> I mean
> that is one of the goals, to use larger continious allocations and map
> them with a single page table entry where possible, right?

It's a goal ultimately but not what we do right now. There have been
suggestions of allocating the contiguous pages optimistically in the
fault path and later promoting with an arch-specific hook but it's
vapourware right now.

> And then
> you can roughly predict where an munmap() would free a page.
> 
> Say the application does map a few GB of file, uses madvice to tell
> the kernel it needs a 2MB block (to get a continious 2MB chunk
> mapped), waits for it and then munmaps 4K in there. A 4k hole for some
> unmovable object to fill.

With grouping pages by mobility, that 4K hole would be on the movable
free lists. To get an unmovable allocation in there, the system needs to
be under considerable stress. Even just raising min_free_kbytes a bit
would make it considerably harder.

With the standard kernel, it would be easier to place as you suggest.

> If you can then trigger the creation of an
> unmovable object as well (stat some file?) and loop you will fill the
> ram quickly. Maybe it only works in 10% but then you just do it 10
> times as often.
> 
> Over long times it could occur naturally. This is just to demonstrate
> it with malice.
> 

Try writing such a program. I'd be interested in reading it.
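
In case it saves someone typing, the skeleton I have in mind is something
like the untested sketch below. The sizes are arbitrary, and stat()ing
non-existent paths is only one convenient way to force unmovable (dentry)
allocations:

/* Untested sketch of an adverse-fragmentation generator: map and fault
 * a large anonymous region, punch a 4K hole into every 64K, then force
 * unmovable kernel allocations in the hope that they land in the holes. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define REGION  (512UL << 20)       /* 512MB */
#define HOLE    4096UL
#define STRIDE  (64UL * 1024)

int main(void)
{
    char *map, path[64];
    unsigned long off;
    struct stat st;
    long i;

    map = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return 1;

    for (off = 0; off < REGION; off += HOLE)
        map[off] = 1;                   /* fault everything in */

    for (off = 0; off < REGION; off += STRIDE)
        munmap(map + off, HOLE);        /* leave a 4K hole every 64K */

    /* Generate unmovable allocations (negative dentries). The path is
     * just a placeholder. */
    for (i = 0; i < 10000000; i++) {
        snprintf(path, sizeof(path), "/tmp/frag-hole-%ld", i);
        stat(path, &st);
    }
    return 0;
}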

> > As I have stated repeatedly, the guarantees can be made but potential
> > hugepage allocation did not justify it. Large blocks might.
> >
> >> There should be a lot of preassure for movable objects to vacate a
> >> mixed group or you do get fragmentation catastrophs.
> >
> > We (Andy Whitcroft and I) did implement something like that. It hooked into
> > kswapd to clean mixed blocks. If the caller could do the cleaning, it
> > did the work instead of kswapd.
> 
> Do you have a graphic like
> http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
> for that case?
> 

Not at the moment. I don't

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Mel Gorman
On (16/09/07 19:53), Jörn Engel didst pronounce:
> On Sat, 15 September 2007 01:44:49 -0700, Andrew Morton wrote:
> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:
> > 
> > > While I agree with your concern, those numbers are quite silly.  The
> > > chances of 99.8% of pages being free and the remaining 0.2% being
> > > perfectly spread across all 2MB large_pages are lower than those of SHA1
> > > creating a collision.
> > 
> > Actually it'd be pretty easy to craft an application which allocates seven
> > pages for pagecache, then one for , then seven for pagecache, 
> > then
> > one for , etc.
> > 
> > I've had test apps which do that sort of thing accidentally.  The result
> > wasn't pretty.
> 
> I bet!  My (false) assumption was the same as Goswin's.  If non-movable
> pages are clearly seperated from movable ones and will evict movable
> ones before polluting further mixed superpages, Nick's scenario would be
> nearly infinitely impossible.
> 

It would be plain impossible from a fragmentation point of view, but you
meet interesting situations when a GFP_NOFS allocation has no kernel blocks
available to use. It can't reclaim; maybe it could move pages, but not with
current code (it should be able to with the Memory Compaction patches).

> Assumption doesn't reflect current code.  Enforcing this assumption
> would cost extra overhead.  The amount of effort to make Christoph's
> approach work reliably seems substantial and I have no idea whether it
> would be worth it.
> 

Current code doesn't reflect your assumptions simply because the costs are so
high. We'd need to be really sure it's worth it and if the answer is "yes",
then we are looking at Andrea's approach (more likely) or I can check out
evicting blocks of 16KB, 64KB or whatever the large block is.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Mel Gorman
On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
> Mel Gorman <[EMAIL PROTECTED]> writes:
> 
> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
> >> Nick Piggin <[EMAIL PROTECTED]> writes:
> >> 
> >> > In my attack, I cause the kernel to allocate lots of unmovable 
> >> > allocations
> >> > and deplete movable groups. I theoretically then only need to keep a
> >> > small number (1/2^N) of these allocations around in order to DoS a
> >> > page allocation of order N.
> >> 
> >> I'm assuming that when an unmovable allocation hijacks a movable group
> >> any further unmovable alloc will evict movable objects out of that
> >> group before hijacking another one. right?
> >> 
> >
> > No eviction takes place. If an unmovable allocation gets placed in a
> > movable group, then steps are taken to ensure that future unmovable
> > allocations will take place in the same range (these decisions take
> > place in __rmqueue_fallback()). When choosing a movable block to
> > pollute, it will also choose the lowest possible block in PFN terms to
> > steal so that fragmentation pollution will be as confined as possible.
> > Evicting the unmovable pages would be one of those expensive steps that
> > have been avoided to date.
> 
> But then you can have all blocks filled with movable data, free 4K in
> one group, allocate 4K unmovable to take over the group, free 4k in
> the next group, take that group and so on. You can end with 4k
> unmovable in every 64k easily by accident.
> 

As the mixing takes place at the lowest possible block, it's
exceptionally difficult to trigger this. Possible, but exceptionally
difficult.

As I have stated repeatedly, the guarantees can be made but potential
hugepage allocation did not justify it. Large blocks might.

> There should be a lot of preassure for movable objects to vacate a
> mixed group or you do get fragmentation catastrophs.

We (Andy Whitcroft and I) did implement something like that. It hooked into
kswapd to clean mixed blocks. If the caller could do the cleaning, it
did the work instead of kswapd.

> Looking at my
> little test program evicting movable objects from a mixed group should
> not be that expensive as it doesn't happen often.

It happens regularly if the size of the block you need to keep clean is
lower than min_free_kbytes. In the case of hugepages, that was always
the case.

> The cost of it
> should be freeing some pages (or finding free ones in a movable group)
> and then memcpy.

Freeing pages is not cheap. Copying pages is cheaper but not cheap.

> With my simplified simulation it never happens so I
> expect it to only happen when the work set changes.
> 
> >> > And it doesn't even have to be a DoS. The natural fragmentation
> >> > that occurs today in a kernel today has the possibility to slowly push 
> >> > out
> >> > the movable groups and give you the same situation.
> >> 
> >> How would you cause that? Say you do want to purposefully place one
> >> unmovable 4k page into every 64k compund page. So you allocate
> >> 4K. First 64k page locked. But now, to get 4K into the second 64K page
> >> you have to first use up all the rest of the first 64k page. Meaning
> >> one 4k chunk, one 8k chunk, one 16k cunk, one 32k chunk. Only then
> >> will a new 64k chunk be broken and become locked.
> >
> > It would be easier early in the boot to mmap a large area and fault it
> > in in virtual address order then mlock every a page every 64K. Early in
> > the systems lifetime, there will be a rough correlation between physical
> > and virtual memory.
> >
> > Without mlock(), the most successful attack will like mmap() a 60K
> > region and fault it in as an attempt to get pagetable pages placed in
> > every 64K region. This strategy would not work with grouping pages by
> > mobility though as it would group the pagetable pages together.
> 
> But even with mlock the virtual pages should still be movable.

They are. The Memory Compaction patches were able to do the job.

> So if
> you evict movable objects from mixed group when needed all the
> pagetable pages would end up in the same mixed group slowly taking it
> over completly. No fragmentation at all. See how essential that
> feature is. :)
> 

To move pages, there must be enough blocks free. That is where
min_free_kbytes had to come in. If you cared only about keeping 64KB
chunks free, it makes sense but it didn't in the context of hugepages.

> > Targetted attacks on grouping pages by mobility are not very easy and
> > 

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Mel Gorman
sly if userspace has a minimum order of 64k chunks then it will
> > never break any region smaller than 64k chunks and will never cause a
> > fragmentation catastroph. I know that is verry roughly your aproach
> > (make order 0 bigger), and I like it, but it has some limits as to how
> 
> Yep, exactly this is what happens, it avoids that trouble. But as far
> as fragmentation guarantees goes, it's really about keeping the
> unmovable out of our way (instead of spreading the unmovable all over
> the buddy randomly, or with ugly
> boot-time-fixed-numbers-memory-reservations) than to map largepages in
> userland. Infact as I said we could map kmalloced 4k entries in
> userland to save memory if we would really want to hurt the fast paths
> to make a generic kernel to use on smaller systems, but that would be
> very complex. Since those 4k entries would be 100% movable (not like
> the rest of the slab, like dentries and inodes etc..) that wouldn't
> make the design less reliable, it'd still be 100% reliable and
> performance would be ok because that memory is userland memory, we've
> to set the pte anyway, regardless if it's a 4k page or a largepage.
> 

Ok, get it implemented then and we'll try it out, because right now we're
just hand-waving and not actually producing anything to compare. It'll be
interesting to see how it works out for large blocks and hugepages (although
I expect the latter to fail unless grouping pages by mobility is in place).
Ideally, they'll complement each other nicely, with mixing only ever taking
place at the 64KB boundary. I have the testing setup necessary for checking
out hugepages at least, and I hope to put together something that tests
large blocks as well. Minimally, running the hugepage allocation tests
on a filesystem using large blocks would be a decent starting point.

> > big you can make it. I don't think my system with 1GB ram would work
> > so well with 2MB order 0 pages. But I wasn't refering to that but to
> > the picture.
> 
> Sure! 2M is sure way excessive for a 1G system, 64k most certainly
> too, of course unless you're running a db or a multimedia streaming
> service, in which case it should be ideal.
> 

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Mel Gorman
On (16/09/07 20:50), Andrea Arcangeli didst pronounce:
> On Sun, Sep 16, 2007 at 07:15:04PM +0100, Mel Gorman wrote:
> > Except now as I've repeatadly pointed out, you have internal fragmentation
> > problems. If we went with the SLAB, we would need 16MB slabs on PowerPC for
> > example to get the same sort of results and a lot of copying and moving when
> 
> Well not sure about the 16MB number, since I'm unsure what the size of
> the ram was.

The 16MB is the size of a hugepage, the size of interest as far as I am
concerned. Your idea makes sense for large block support, but much less
for huge pages because you are incurring a cost in the general case for
something that may not be used.

There is nothing to say that both can't be done: raise the size of
order-0 for large block support and continue trying to group pages by
mobility so that hugepage allocations are likely to succeed during the
lifetime of the system.

> But clearly I agree there are fragmentation issues in the
> slab too, there have always been, except they're much less severe, and
> the slab is meant to deal with that regardless of the PAGE_SIZE. That
> is not a new problem, you are introducing a new problem instead.
> 

At the risk of repeating, your approach will be adding a new and
significant dimension to the internal fragmentation problem where a
kernel allocation may fail because the larger order-0 pages are all
being pinned by userspace pages.

> We can do a lot better than slab currently does without requiring any
> defrag move-or-shrink at all.
> 
> slab is trying to defrag memory for small objects at nearly zero cost,
> by not giving pages away randomly. I thought you agreed that solving
> the slab fragmentation was going to provide better guarantees when in

It improves the probability of hugepage allocations working because the
blocks with slab pages can be targeted and cleared if necessary.

> another email you suggested that you could start allocating order > 0
> pages in the slab to reduce the fragmentation (to achieve most of the
> guarantee provided by config-page-shift, but while still keeping the
> order 0 at 4k for whatever reason I can't see).
> 

That suggestion was aimed at large block support more than hugepages. It
helps large blocks because we'll be allocating and freeing at more or less
the same size. It certainly is easier to set slub_min_order to match what is
needed for large blocks in the system than to introduce the mechanics
necessary to allocate pagetable pages and userspace pages from slab.

> > a suitable slab page was not available.
> 
> You ignore one other bit, when "/usr/bin/free" says 1G is free, with
> config-page-shift it's free no matter what, same goes for not mlocked
> cache. With variable order page cache, /usr/bin/free becomes mostly a
> lie as long as there's no 4k fallback (like fsblock).
> 

I'm not sure what you are getting at here. I think it would make more
sense if you said "when you read /proc/buddyinfo, you know the order-0
pages are really free for use with large blocks" with your approach.

> And most important you're only tackling on the pagecache and I/O
> performance with the inefficient I/O devices, the whole kernel has no
> cahnce to get a speedup, infact you're making the fast paths slower,
> just the opposite of config-page-shift and original Hugh's large
> PAGE_SIZE ;).
> 

As the kernel pages are all grouped together, fewer TLB entries are needed
to address the kernel's working set, so there is a general improvement,
although how much depends on the workload.

All this aside, there is nothing mutually exclusive between what you are
proposing and what grouping pages by mobility does. Your stuff can exist even if grouping
pages by mobility is in place. In fact, it'll help us by giving an important
comparison point as grouping pages by mobility can be trivially disabled with
a one-liner for the purposes of testing. If your approach is brought to being
a general solution that also helps hugepage allocation, then we can revisit
grouping pages by mobility by comparing kernels with it enabled and without.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Mel Gorman
On (15/09/07 17:51), Andrea Arcangeli didst pronounce:
> On Sat, Sep 15, 2007 at 02:14:42PM +0200, Goswin von Brederlow wrote:
> > I keep coming back to the fact that movable objects should be moved
> > out of the way for unmovable ones. Anything else just allows
> 
> That's incidentally exactly what the slab does, no need to reinvent
> the wheel for that, it's an old problem and there's room for
> optimization in the slab partial-reuse logic too.

Except now, as I've repeatedly pointed out, you have internal fragmentation
problems. If we went with the slab, we would need 16MB slabs on PowerPC, for
example, to get the same sort of results, plus a lot of copying and moving
when a suitable slab page was not available.

> Just boost the order
> 0 page size and use the slab to get the 4k chunks. The sgi/defrag
> design is backwards.
> 

Nothing stops you altering PAGE_SIZE so that large blocks work in the way
you envision while keeping grouping pages by mobility for huge page sizes.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Mel Gorman
On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:
> >
> >> While I agree with your concern, those numbers are quite silly.  The
> >> chances of 99.8% of pages being free and the remaining 0.2% being
> >> perfectly spread across all 2MB large_pages are lower than those of SHA1
> >> creating a collision.
> >
> > Actually it'd be pretty easy to craft an application which allocates seven
> > pages for pagecache, then one for , then seven for pagecache, 
> > then
> > one for , etc.
> >
> > I've had test apps which do that sort of thing accidentally.  The result
> > wasn't pretty.
> 
> Except that the applications 7 pages are movable and the 
> would have to be unmovable. And then they should not share the same
> memory region. At least they should never be allowed to interleave in
> such a pattern on a larger scale.
> 

It is actually really easy to force regions to never share. At the
moment, there is a fallback list that determines a preference for what
block to mix.

The reason why this isn't enforced is the cost of moving. On x86 and
x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
those pages to prevent any mixing would be bad enough. On PowerPC, it's
potentially 16MB. On IA64, it's 1GB.

As this was fragmentation avoidance, not a guarantee, the decision was
made not to strictly enforce the types of pages within a block, as the
cost could not be made back unless the system was making aggressive use of
large pages. This is not the case with Linux.

> The only way a fragmentation catastroph can be (proovable) avoided is
> by having so few unmovable objects that size + max waste << ram
> size. The smaller the better. Allowing movable and unmovable objects
> to mix means that max waste goes way up. In your example waste would
> be 7*size. With 2MB uper order limit it would be 511*size.
> 
> I keep coming back to the fact that movable objects should be moved
> out of the way for unmovable ones. Anything else just allows
> fragmentation to build up.
> 

This is easily achieved, just really, really expensive because of the
amount of copying that would have to take place. It would also require
min_free_kbytes to cover at least one free pageblock (PAGEBLOCK_NR_PAGES) and
likely MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
a lot of free memory to keep around, which is why fragmentation avoidance
doesn't do it.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Mel Gorman
;   (What I actualy do is throw 8 dice and sum them up and shift the
>result.)
> 

I doubt this is how the kernel behaves either.

> Display:
> I start with a white window. Every page allocation draws a black box
> from the address of the page and as wide as the page is big (-1 pixel to
> give a seperation to the next page). Every page free draws a yellow
> box in place of the black one. Yellow to show where a page was in use
> at one point while white means the page was never used.
> 
> As the time ticks the memory fills up. Quickly at first and then comes
> to a stop around 80% filled. And then something interesting
> happens. The yellow regions (previously used but now free) start
> drifting up. Small pages tend to end up in the lower addresses and big
> pages at the higher addresses. The memory defragments itself to some
> degree.
> 
> http://mrvn.homeip.net/fragment/
> 
> Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k,
> 295296 16k, 176647 32k and 59064 64k allocations you get this:
> http://mrvn.homeip.net/fragment/256mb.png
> 
> Simulating 1GB ram and after 5881185 ticks  and 2116671 4k, 1645957
> 8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this:
> http://mrvn.homeip.net/fragment/1gb.png
> 

These types of pictures feel somewhat familiar
(http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg).

-- 
Mel Gorman



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-13 Thread Mel Gorman
On (12/09/07 16:17), Christoph Lameter didst pronounce:
> On Wed, 12 Sep 2007, Nick Piggin wrote:
> 
> > I will still argue that my approach is the better technical solution for 
> > large
> > block support than yours, I don't think we made progress on that. And I'm
> > quite sure we agreed at the VM summit not to rely on your patches for
> > VM or IO scalability.
> 
> The approach has already been tried (see the XFS layer) and found lacking. 
> 
> Having a fake linear block through vmalloc means that a special software 
> layer must be introduced and we may face special casing in the block / fs 
> layer to check if we have one of these strange vmalloc blocks.
> 

One of Nick's points is that to have a 100% reliable solution, that is
what is required. We already have a layering between the VM and the FS
but my understanding is that fsblock replaces rather than adds to it.

Surely, we'll be able to detect the situation where the memory is really
contiguous as a fast path and have a slower path where fragmentation was
a problem.

> > But you just showed in two emails that you don't understand what the
> > problem is. To reiterate: lumpy reclaim does *not* invalidate my formulae;
> > and antifrag does *not* isolate the issue.
> 
> I do understand what the problem is. I just do not get what your problem 
> with this is and why you have this drive to demand perfection. We are 
> working a variety of approaches on the (potential) issue but you 
> categorically state that it cannot be solved.
> 

This is going in circles.

His point is that we also cannot prove it is 100% correct in all
situations. Without taking additional (expensive) steps, there will be a
workload that fragments physical memory. He doesn't know what it is and
neither do we, but that does not mean someone else won't find it. He also
has a point about the slow degradation caused by fragmentation, which is
woefully difficult to reproduce. We've had this provability-of-correctness
problem before.

His initial problem was not with the patches as such but with the fact that
they seemed to be presented as a 1st-class feature that we fully support and
a potential solution for some VM and IO scalability problems. This is not
the case; we have to treat it as a 2nd-class feature until we *know* no
situation exists where it breaks down. These patches on their own would have
to run for months, if not a year or so, before we could be really sure about
it.

The only implementation question about these patches that hasn't been
addressed is the mmap() support: what's wrong with it in its current form,
and whether it can be fixed or is fundamentally screwed. That has fallen by
the wayside.

> > But what do you say about viable alternatives that do not have to
> > worry about these "unlikely scenarios", full stop? So, why should we
> > not use fs block for higher order page support?
> 
> Because it has already been rejected in another form and adds more 
> layering to the filesystem and more checking for special cases in which 
> we only have virtual linearity? It does not reduce the number of page 
> structs that have to be handled by the lower layers etc.
> 

Unless callers always use an iterator for blocks that is optimised in the
physically linear case to be a simple array offset and, when not physically
linear, either walks chains (complex) or uses vmap (which must deal with TLB
flushes among other things). If it optimistically uses physically contiguous
memory, we may find a way to use only one page struct as well.
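
As a userspace analogy of the kind of iterator I mean (names are invented
and have nothing to do with fsblock's actual API):

/* Userspace analogy only: a "block" is either one contiguous buffer
 * (fast path, plain pointer arithmetic) or an array of scattered
 * chunks (slow path). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct big_block {
    size_t chunk_size;
    size_t nr_chunks;
    void *contig;           /* non-NULL if physically linear */
    void **chunks;          /* otherwise, one pointer per chunk */
};

static void *block_chunk(struct big_block *b, size_t i)
{
    if (b->contig)                          /* fast path: array offset */
        return (char *)b->contig + i * b->chunk_size;
    return b->chunks[i];                    /* slow path: chase pointers */
}

int main(void)
{
    struct big_block b = { .chunk_size = 4096, .nr_chunks = 16 };
    size_t i;

    b.contig = calloc(b.nr_chunks, b.chunk_size);   /* pretend it was contiguous */
    if (!b.contig)
        return 1;
    for (i = 0; i < b.nr_chunks; i++)
        memset(block_chunk(&b, i), 0, b.chunk_size);
    puts("walked 16 chunks via the fast path");
    free(b.contig);
    return 0;
}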

> Maybe we coud get to something like a hybrid that avoids some of these 
> issues?

Or, gee whiz, I don't know. Start with your patches as a strictly 2nd-class
citizen and build fsblock in, while trying to keep the use of physically
contiguous memory where it is possible and makes sense.

> Add support so something like a virtual compound page can be 
> handled transparently in the filesystem layer with special casing if 
> such a beast reaches the block layer?
> 
> > I didn't skip that. We have large page pools today. How does that give
> > first class of support to those allocations if you have to have memory
> > reserves?
> 
> See my other mail. That portion is not complete yet. Sorry.
> 

I am *very* wary of using reserve pools for anything other than
emergency situations. If nothing else pools == wasted memory + a sizing
problem. But hey, it is one option.

Are we going to agree on some sort of plan or are we just going to
handwave ourselves to death?

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Mel Gorman
On (11/09/07 14:48), Christoph Lameter didst pronounce:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> 
> > But that's not my place to say, and I'm actually not arguing that high
> > order pagecache does not have uses (especially as a practical,
> > shorter-term solution which is unintrusive to filesystems).
> > 
> > So no, I don't think I'm really going against the basics of what we agreed
> > in Cambridge. But it sounds like it's still being billed as first-order
> > support right off the bat here.
> 
> Well its seems that we have different interpretations of what was agreed 
> on. My understanding was that the large blocksize patchset was okay 
> provided that I supply an acceptable mmap implementation and put a 
> warning in.
> 

Warnings == a #2 citizen in my mind, with known potential failure cases. That
was the point, I thought.

> > But even so, you can just hold an open fd in order to pin the dentry you
> > want. My attack would go like this: get the page size and allocation group
> > size for the machine, then get the number of dentries required to fill a
> > slab. Then read in that many dentries and pin one of them. Repeat the
> > process. Even if there is other activity on the system, it seems possible
> > that such a thing will cause some headaches after not too long a time.
> > Some sources of pinned memory are going to be better than others for
> > this of course, so yeah maybe pagetables will be a bit easier (I don't 
> > know).
> 
> Well even without slab targeted reclaim: Mel's antifrag will sort the 
> dentries into separate blocks of memory and so isolate the issue.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Mel Gorman
On (11/09/07 11:44), Nick Piggin didst pronounce:
> On Wednesday 12 September 2007 01:36, Mel Gorman wrote:
> > On Tue, 2007-09-11 at 04:52 +1000, Nick Piggin wrote:
> > > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> > > > 5. VM scalability
> > > >Large block sizes mean less state keeping for the information being
> > > >transferred. For a 1TB file one needs to handle 256 million page
> > > >structs in the VM if one uses 4k page size. A 64k page size reduces
> > > >that amount to 16 million. If the limitation in existing filesystems
> > > >are removed then even higher reductions become possible. For very
> > > >large files like that a page size of 2 MB may be beneficial which
> > > >will reduce the number of page struct to handle to 512k. The
> > > > variable nature of the block size means that the size can be tuned at
> > > > file system creation time for the anticipated needs on a volume.
> > >
> > > There is a limitation in the VM. Fragmentation. You keep saying this
> > > is a solved issue and just assuming you'll be able to fix any cases
> > > that come up as they happen.
> > >
> > > I still don't get the feeling you realise that there is a fundamental
> > > fragmentation issue that is unsolvable with Mel's approach.
> >
> > I thought we had discussed this already at VM and reached something
> > resembling a conclusion. It was acknowledged that depending on
> > contiguous allocations to always succeed will get a caller into trouble
> > and they need to deal with fallback - whether the problem was
> > theoretical or not. It was also strongly pointed out that the large
> > block patches as presented would be vulnerable to that problem.
> 
> Well Christoph seems to still be spinning them as a solution for VM
> scalability and first class support for making contiguous IOs, large
> filesystem block sizes etc.
> 

Yeah, I can't argue with you there. I was under the impression that we
would be dealing with this strictly as a second class solution to see
what it bought to help steer the direction of fsblock.

> At the VM summit I think the conclusion was that grouping by
> mobility could be merged. I'm still not thrilled by that, but I was
> going to get steamrolled[*] anyway... and seeing as the userspace
> hugepages is a relatively demanded workload and can be
> implemented in this way with basically no other changes to the
> kernel and already must have fallbacks then that's actually a
> reasonable case for it.
> 

As you say, a difference is that if we fail to allocate a hugepage, the world
does not end. It's been a well known problem for years and grouping pages
by mobility is aimed at relaxing some of the more painful points. It has
other uses as well, but each of them is expected to deal with failures of
contiguous range allocation.

> The higher order pagecache, again I'm just going to get steamrolled
> on, and it actually isn't so intrusive minus the mmap changes, so I
> didn't have much to reasonably say there.
> 

If the mmap() change is bad, then it gets held up.

> And I would have kept quiet this time too, except for the worrying idea
> to use higher order pages to fix the SLUB vs SLAB regression, and if
> the rationale for this patchset was more realistic.
> 

I don't agree with using higher order pages to fix SLUB vs SLAB performance
issues either. SLUB has to be able to compete with SLAB on its own terms. If
SLUB gains x% over SLAB in specialised cases with high orders, then fair
enough, but minimally SLUB has to perform the same as SLAB at order-0. Like
you, I think if we depend on SLUB using high orders to match SLAB, we are
going to get kicked further down the line.

However, this discussion belongs more with the non-existent remove-slab patch.
Based on what we've seen since the summits, we need a thorough analysis
with benchmarks before making a final decision (kernbench, ebizzy, tbench
(netpipe if someone has the time/resources), hackbench and maybe sysbench
as well as something the filesystem people recommend to get good coverage
of the subsystems).

> [*] And I don't say steamrolled because I'm bitter and twisted :) I
> personally want the kernel to be perfect. But I realise it already isn't
> and for practical purposes people want these things, so I accept
> being overruled, no problem. The fact simply is -- I would have been
> steamrolled I think :P
> 

I'd rather not get side-tracked here. I regret you feel steamrolled, but I
think grouping pages by mobility is the right thing to do for better usage
of the TLB by the kernel.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Mel Gorman
On (11/09/07 15:17), Nick Piggin didst pronounce:
> On Wednesday 12 September 2007 06:01, Christoph Lameter wrote:
> > On Tue, 11 Sep 2007, Nick Piggin wrote:
> > > There is a limitation in the VM. Fragmentation. You keep saying this
> > > is a solved issue and just assuming you'll be able to fix any cases
> > > that come up as they happen.
> > >
> > > I still don't get the feeling you realise that there is a fundamental
> > > fragmentation issue that is unsolvable with Mel's approach.
> >
> > Well my problem first of all is that you did not read the full message. It
> > discusses that later and provides page pools to address the issue.
> >
> > Secondly you keep FUDding people with lots of theoretical concerns
> > assuming Mel's approaches must fail. If there is an issue (I guess there
> > must be right?) then please give us a concrete case of a failure that we
> > can work against.
> 
> And BTW, before you accuse me of FUD, I'm actually talking about the
> fragmentation issues on which Mel I think mostly agrees with me at this
> point.
> 

I'm half way between you two on this one. I agree with Christoph in that
it's currently very difficult to trigger a failure scenario and today we
don't have a way of dealing with it. I agree with Nick in that conceivably a
failure scenario does exist somewhere and the careful person (or paranoid if
you prefer) would deal with it pre-emptively. The fact is that no one knows
what a large block workload is going to look like to the allocator so we're
all hand-waving.

Right now, I can't trigger the worst failure scenarios, the ones that cannot be
dealt with for fragmentation, but that might change with large blocks. The
worst situation I can think of is a process that continuously dirties large
amounts of data on a large block filesystem while another set of processes
works with large amounts of anonymous data, with no swap space configured
and slub_min_order set somewhere between order-0 and the large block size.
Fragmentation-wise, that's just a kick in the pants and might produce
the failure scenario being looked for.

If it does fail, I don't think it should be used to beat Christoph with as
such, because it was meant to be a #2 solution. What sinks it is if the mmap()
change is unacceptable.

> Also have you really a rational reason why we should just up and accept
> all these big changes happening just because that, while there are lots
> of theoretical issues, the person pointing them out to you hasn't happened
> to give you a concrete failure case. Oh, and the actual performance
> benefit is actually not really even quantified yet, crappy hardware not
> withstanding, and neither has a proper evaluation of the alternatives.
> 

Performance figures would be nice. dbench is flaky as hell but can
comparison figures be generated on one filesystem with 4K blocks and one
with 64K? I guess we can do it ourselves too because this should work on
normal machines.

> So... would you drive over a bridge if the engineer had this mindset?
> 

If I had this bus that couldn't go below 50MPH, right.. never mind.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Mel Gorman
On (11/09/07 12:26), Nick Piggin didst pronounce:
> On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> > On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > > that increasing the pagesize like what Andrea suggested would lead to
> > > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> > >
> > > The config_page_shift guarantees that kernel stacks, or whatever other
> > > non-defragmentable allocation, go into the same 64k "not defragmentable"
> > > page. Not like with the SGI design, where an 8k kernel stack could be
> > > allocated in the first 64k page, and then another 8k stack could be
> > > allocated in the next 64k page, effectively pinning all 64k pages until
> > > Nick's worst case scenario triggers.
> >
> > In practice, it's pretty difficult to trigger. Buddy allocators always
> > try and use the smallest possible sized buddy to split. Once a 64K block is
> > split for a 4K or 8K allocation, the remainder of that block will be
> > used for other 4K, 8K, 16K, 32K allocations. The situation where
> > multiple 64K blocks get split does not occur.
> >
> > Now, the worst case scenario for your patch is that a hostile process
> > allocates a large amount of memory and mlocks() one 4K page per 64K chunk
> > (this is unlikely in practice I know). The end result is you have many
> > 64KB regions that are now unusable because 4K is pinned in each of them.
> > Your approach is not immune from problems either. To me, only Nick's
> > approach is bullet-proof in the long run.
> 
> One important thing I think in Andrea's case, the memory will be accounted
> for (eg. we can limit mlock, or work within various memory accounting things).
> 

For mlock()ed memory, sure. Not for pagetables, kmalloc slabs etc., though. It
might be a non-issue as well. Like the large block patches, there are
aspects of Andrea's case that we simply do not know.

> With fragmentation, I suspect it will be much more difficult to do this. It
> would be another layer of heuristics that will also inevitably go wrong
> at times if you try to limit how much "fragmentation" a process can do.
> Quite likely it is hard to make something even work reasonably well in
> most cases.

Regrettably, this is also woefully difficult to prove. For
fragmentation, I can look into having a more expensive version of
/proc/pagetypeinfo to give a detailed account of the current
fragmentation state, but it's a side-issue.
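
For what the existing file already provides, something as dumb as this
(userspace C; it assumes a kernel with the grouping-by-mobility patches
applied, which export /proc/pagetypeinfo) dumps the per-migratetype free
lists at a point in time:

/*
 * Trivial userspace dump of /proc/pagetypeinfo. Assumes a kernel with
 * the grouping-by-mobility patches, which provide that file.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        FILE *fp = fopen("/proc/pagetypeinfo", "r");
        char line[256];

        if (!fp) {
                perror("fopen /proc/pagetypeinfo");
                return EXIT_FAILURE;
        }

        /* Free pages per zone, migrate type and allocation order. */
        while (fgets(line, sizeof(line), fp))
                fputs(line, stdout);

        fclose(fp);
        return EXIT_SUCCESS;
}

A more expensive interface would walk the actual block boundaries rather than
just the free lists, but the above is enough to watch trends over time.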

> > >  We can still try to save some memory by
> > > defragging the slab a bit, but it's by far *not* required with
> > > config_page_shift. No defrag at all is required infact.
> >
> > You will need some sort of defragmentation to deal with internal
> > fragmentation. It's a very similar problem to blasting away at slab
> > pages and still not being able to free them because objects are in use.
> > Replace "slab" with "large page" and "object" with "4k page" and the
> > issues are similar.
> 
> Well yes and slab has issues today too with internal fragmentation,
> targetted reclaim and some (small) higher order allocations too today.
> But at least with config_page_shift, you don't introduce _new_ sources
> of problems (eg. coming from pagecache or other allocs).
> 

Well, we do extend the internal fragmentation problem. Previously, it was
inode, dcache and friends. Now we have to deal with internal
fragmentation related to page tables, per-cpu pages etc. Maybe they can
be solved too, but they are of similar difficulty to what Christoph
faces.

> Sure, there are some other things -- like pagecache can actually use
> up more memory instead -- but there are a number of other positives
> that Andrea's has as well. It is using order-0 pages, which are first class
> throughout the VM; they have per-cpu queues, and do not require any
> special reclaim code.

Being able to use the per-cpu queues is a big plus.

> They also *actually do* reduce the page
> management overhead in the general case, unlike higher order pcache.
> 
> So combined with the accounting issues, I think it is unfair to say that
> Andrea's is just moving the fragmentation to internal. It has a number
> of upsides. I have no idea how it will actually behave and perform, mind
> you ;)
> 

Neither do I. Andrea's suggestion definitely has upsides. I'm just saying
it's not going to cure cancer any better than the large block patchset ;)

> 
> > > Plus there's a cost in defragging and freeing cache... the more you
> 

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Mel Gorman
On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 

Hi,

> On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > that increasing the pagesize like what Andrea suggested would lead to
> > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> 
> The config_page_shift guarantees that kernel stacks, or whatever other
> non-defragmentable allocation, go into the same 64k "not defragmentable"
> page. Not like with the SGI design, where an 8k kernel stack could be
> allocated in the first 64k page, and then another 8k stack could be
> allocated in the next 64k page, effectively pinning all 64k pages until
> Nick's worst case scenario triggers.
> 

In practice, it's pretty difficult to trigger. Buddy allocators always
try and use the smallest possible sized buddy to split. Once a 64K block is
split for a 4K or 8K allocation, the remainder of that block will be
used for other 4K, 8K, 16K and 32K allocations. The situation where
multiple 64K blocks get split does not occur.
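
As a toy model of that split path (plain C, deliberately simplified; it is
not the kernel's __rmqueue_smallest()), note how a second 64K block is only
touched once every smaller free block is exhausted:

/*
 * Toy model of the buddy split path; free_blocks[o] is the number of
 * free blocks of size (4K << o).
 */
#include <stdio.h>

#define MAX_ORDER 11    /* order-0 (4K) up to order-10 (4M) */

static unsigned long free_blocks[MAX_ORDER];

/* Allocate one block of @order, splitting the smallest suitable buddy. */
static int alloc_block(unsigned int order)
{
        unsigned int o;

        /* Search upwards for the first non-empty free list... */
        for (o = order; o < MAX_ORDER; o++) {
                if (free_blocks[o]) {
                        free_blocks[o]--;
                        /*
                         * ...and split it, handing the unused halves back
                         * to the lower free lists.
                         */
                        while (o > order)
                                free_blocks[--o]++;
                        return 0;
                }
        }
        return -1;      /* nothing large enough is free */
}

int main(void)
{
        unsigned int o;

        free_blocks[4] = 1;     /* one free 64K block (order-4 on 4K pages) */

        alloc_block(0);         /* first 4K allocation splits the 64K block */
        alloc_block(0);         /* served from the 4K remainder, no new split */
        alloc_block(1);         /* served from the 8K remainder, no new split */

        for (o = 0; o <= 4; o++)
                printf("order-%u free: %lu\n", o, free_blocks[o]);
        return 0;
}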

Now, the worst case scenario for your patch is that a hostile process
allocates a large amount of memory and mlocks() one 4K page per 64K chunk
(this is unlikely in practice, I know). The end result is that you have many
64KB regions that are now unusable because 4K is pinned in each of them.
Your approach is not immune from problems either. To me, only Nick's
approach is bullet-proof in the long run.
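
For reference, the pattern is trivial to write. A userspace sketch (it
assumes a 4K base page and an RLIMIT_MEMLOCK high enough to allow it;
whether each pinned page really lands in a distinct physical 64K block
depends on how the allocator placed the pages) looks like this:

/*
 * Sketch of the pinning pattern described above: mlock() one 4K page in
 * every 64K chunk of a large anonymous mapping.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define CHUNK   (64UL * 1024)           /* the large block size */
#define PAGE    (4UL * 1024)            /* base page size */
#define LEN     (256UL * 1024 * 1024)   /* 256MB region for the example */

int main(void)
{
        char *map = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long off;

        if (map == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /* Touch and pin the first 4K page of every 64K chunk. */
        for (off = 0; off < LEN; off += CHUNK) {
                memset(map + off, 1, PAGE);
                if (mlock(map + off, PAGE)) {
                        perror("mlock");
                        break;
                }
        }

        /*
         * Only 1/16th of the region is pinned. Whether that ends up as
         * one pinned 4K page per physical 64K block depends on placement.
         */
        printf("pinned %lu KB out of %lu KB\n",
               (off / CHUNK) * (PAGE / 1024), LEN / 1024);
        pause();        /* keep the pages pinned until the process is killed */
        return 0;
}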

> What I said at the VM summit is that your reclaim-defrag patch in the
> slub isn't necessarily entirely useless with config_page_shift,
> because the larger the software page_size, the more partial pages we
> could find in the slab, so to save some memory if there are tons of
> pages very partially used, we could free some of them.
> 

This is true. Slub targeted reclaim (Christoph's work) is useful
independently of this current problem.

> But the whole point is that with the config_page_shift, Nick's worst
> case scenario can't happen by design regardless of defrag or not
> defrag.

I agree with this. It's why I thought Nick's approach was where we were
going to finish up ultimately.

>  While it can _definitely_ happen with SGI design (regardless
> of any defrag thing).

I have never stated that the SGI design is immune from this problem.

>  We can still try to save some memory by
> defragging the slab a bit, but it's by far *not* required with
> config_page_shift. No defrag at all is required infact.
> 

You will need some sort of defragmentation to deal with internal
fragmentation. It's a very similar problem to blasting away at slab
pages and still not being able to free them because objects are in use.
Replace "slab" with "large page" and "object" with "4k page" and the
issues are similar.

> Plus there's a cost in defragging and freeing cache... the more you
> need defrag, the slower the kernel will be.
> 
> > approach in depth.
> 
> Well it wasn't my fault if we didn't discuss it in depth though.

If it's my fault, sorry about that. It wasn't my intention.

>  I
> tried to discuss it in all possible occasions where I was suggested to
> talk about it and where it was somewhat on topic.

Who said it was off-topic? Again, if this was me, sorry - you should
have chucked something at my head to shut me up.

>  Given I wasn't even
> invited at the KS, I felt it would not be appropriate for me to try to
> monopolize the VM summit according to my agenda. So I happily listened
> to what the top kernel developers are planning ;), while giving
> some hints on what I think the right direction is instead.
> 

Right, clearly we failed, or at least had sub-optimal results, discussing
this one at the VM Summit. Good job we have mail to pick up the stick with.

> > I *thought* that the end conclusion was that we would go with
> 
> Frankly I don't care what the end conclusion was.
> 

heh. Well, we need to come to some sort of conclusion here or this will
go around the merry-go-round till we're all bald.

> > Christoph's approach pending two things being resolved;
> > 
> > o mmap() support that we agreed on is good
> 
> Let's see how good the mmap support for variable order page size will
> work after the 2 weeks...
> 

Ok, I'm ok with that.

> > o A clear statement, with logging maybe for users that mounted a large 
> >   block filesystem that it might blow up and they get to keep both parts
> >   when it does. Basically, for now it's only suitable in specialised
> >   environments.
> 
> Yes, but perhaps you missed that such printk is needed exactly to
> provide proof that SGI design is the wrong way and it needs to be
> dumped. If that printk ev

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Mel Gorman
On Tue, 2007-09-11 at 04:52 +1000, Nick Piggin wrote:
> On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> 
> > 5. VM scalability
> >Large block sizes mean less state keeping for the information being
> >transferred. For a 1TB file one needs to handle 256 million page
> >structs in the VM if one uses 4k page size. A 64k page size reduces
> >that amount to 16 million. If the limitation in existing filesystems
> >are removed then even higher reductions become possible. For very
> >large files like that a page size of 2 MB may be beneficial which
> >will reduce the number of page struct to handle to 512k. The variable
> >nature of the block size means that the size can be tuned at file
> >system creation time for the anticipated needs on a volume.
> 
> There is a limitation in the VM. Fragmentation. You keep saying this
> is a solved issue and just assuming you'll be able to fix any cases
> that come up as they happen.
> 
> I still don't get the feeling you realise that there is a fundamental
> fragmentation issue that is unsolvable with Mel's approach.
> 

I thought we had discussed this already at VM and reached something
resembling a conclusion. It was acknowledged that depending on
contiguous allocations to always succeed will get a caller into trouble
and they need to deal with fallback - whether the problem was
theoretical or not. It was also strongly pointed out that the large
block patches as presented would be vulnerable to that problem.

The alternatives were fs-block and increasing the size of order-0. It
was felt that fs-block was far away because it's complex and I thought
that increasing the pagesize like what Andrea suggested would lead to
internal fragmentation problems. Regrettably we didn't discuss Andrea's
approach in depth.

I *thought* that the end conclusion was that we would go with
Christoph's approach pending two things being resolved;

o mmap() support that we agreed on is good
o A clear statement, with logging maybe for users that mounted a large 
  block filesystem that it might blow up and they get to keep both parts
  when it does. Basically, for now it's only suitable in specialised
  environments.

I also thought there was an acknowledgement that long-term, fs-block was
the way to go - possibly using contiguous pages optimistically instead
of virtual mapping the pages. At that point, it would be a general
solution and we could remove the warnings.

Basically, to start out with, this was going to be an SGI-only thing so
they get to rattle out the issues we expect to encounter with large
blocks and help steer the direction of the
more-complex-but-safer-overall fs-block.

> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit.

When that brushing occurred, I thought I made it very clear what the
expectations were and that without fallback they would be taking a risk.
I am not sure if that message actually sank in or not.

That said, the filesystem people can experiment to some extent against
Christoph's approach as long as they don't think they are 100% safe.
Again, their experimenting will help steer the direction of fs-block.

>
>  I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.
> 

That's the absolute worst case, but yes, in theory this can occur and
it's safest to assume the situation will occur somewhere for someone. It
would be difficult to craft an attack to do it, but conceivably a machine
running for a long enough time would trigger it, particularly if the
large block allocations are GFP_NOIO or GFP_NOFS.
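
For anyone who wants to check the arithmetic behind those percentages, it
falls straight out of assuming one pinned 4K page per large block; a
back-of-the-envelope calculation in C:

/*
 * Back-of-the-envelope numbers for the worst case quoted above: one
 * pinned 4K page per large block means large block allocations fail
 * even though almost all memory is free.
 */
#include <stdio.h>

static void worst_case(unsigned long long block_kb, unsigned long long mem_tb)
{
        unsigned long long page_kb = 4;
        unsigned long long blocks = mem_tb * 1024ULL * 1024 * 1024 / block_kb;
        double free_pct = 100.0 * (block_kb - page_kb) / block_kb;

        printf("%4lluK blocks, %lluTB machine: fail with %.2f%% free, %llu GB pinned\n",
               block_kb, mem_tb, free_pct,
               blocks * page_kb / (1024 * 1024));
}

int main(void)
{
        worst_case(64, 32);     /* ~93.75% free at failure */
        worst_case(2048, 32);   /* ~99.80% free, 64GB pinned on 32TB */
        return 0;
}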

> If you don't consider that is a problem because you don't care about
> theoretical issues or nobody has reported it from running -mm
> kernels, then I simply can't argue against that on a technical basis.

The -mm kernels have patches related to watermarking that will not be
making it to mainline for reasons we don't need to revisit right now.
The lack of the watermarking patches may turn out to be a non-issue but
the point is that what's in mainline is not exactly the same as -mm and
mainline will be running for longer periods of time in a different
environment.

Where we expected to see the use of this patchset was in specialised
environments *only*. The SGI people can mitigate their mixed
fragmentation problems somewhat by setting slub_min_order ==
large_block_order so that blocks get allocated and freed at the same
size. This is a partial way towards Andrea's solution of raising the size
of an order-0 allocation. The point of printing out the warnings at
mount time was not so much for a general user who may miss the logs but
for distributions that consider turning