Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-24 Thread Andrea Arcangeli
On Sun, Sep 23, 2007 at 08:56:39AM +0200, Goswin von Brederlow wrote:
> As a user I know it because I didn't put a kernel source into /tmp. A
> program can't reasonably know that.

Various apps require you (the admin/user) to tune the size of their
caches. Seems like you never tried to set up a database, oh well.

> Xen has its own memory pool and can quite aggressively reclaim memory
> from dom0 when needed. I just meant to say that the number in

The whole point is whether there's enough ram, of course... this is why
you should check.

> /proc/meminfo can change in a second so it is not much use knowing
> what it said last minute.

The numbers will change depending on what's running on your
system. It's up to you to know; besides, I normally keep vmstat running
in the background to see how the cache/free levels change over
time. Those numbers would be worthless if they could be fragmented...

> I would kill any program that does that to find out how much free ram
> the system has.

The admin should do that if he's unsure, not a program of course!


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-20 Thread Andrea Arcangeli
On Thu, Sep 20, 2007 at 11:38:21AM +1000, David Chinner wrote:
> Sure, and that's what I meant when I said VPC + large pages was
> a means to the end, not the only solution to the problem.

The whole point is that it's not the end; it's the end only from your
own fs-centric view (which is fair enough, sure), but I watch the whole
VM, not just the pagecache...

The same way the fs-centric view hopes to get this little bit of
further optimization from largepages to reach "the end", my VM-wide
view wants the same little bit of optimization for *everything*,
including tmpfs, anonymous memory, slab, etc.! This is clearly why
config-page-shift is better...

If you're ok not being on the edge and you want a generic rpm image
that runs quite optimally for any workload, then 4k+fsblock is just
fine of course. But if we go to the edge we should aim for the _very_
end for the whole VM, not just for "the end of the pagecache on
certain files". Especially when the complexity involved in the mmap
code is similar, and merging this not-very-end solution that only
reaches "the end" for the pagecache will generate heavy patch rejects.

> No, I don't like fsblock because it is inherently a "structure
> per filesystem block" construct, just like buggerheads. You
> still need to allocate millions of them when you have millions
> of dirty pages around. Rather than type it all out again, read
> the fsblocks thread from here:
> 
> http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2

Thanks for the pointer!

> FWIW, with Chris mason's extent-based block mapping (which btrfs
> is using and Christoph Hellwig is porting XFS over to) we completely
> remove buggerheads from XFS and so fsblock would be a pretty
> major step backwards for us if Chris's work goes into mainline.

I tend to agree: if we change it, fsblock should support extents, if
that's what you need on xfs to support range-locking etc... Whatever
happens in the vfs should please all existing fs without people needing
to go their own way again... Or replace fsblock with Chris's block
mapping. Frankly I haven't seen Chris's code so I cannot comment
further. But your complaints sound sensible. We certainly want to
avoid the lowlevel fs getting smarter than the vfs again. The brainy
stuff should be in the vfs!

> That's not in the filesystem, though. ;)
> 
> However, I agree that if you don't have mmap then it's not
> worthwhile and the changes for VPC aren't trivial.

Yep.

> 
> > >   3. avoiding the need for vmap() as it has great
> > >  overhead and does not scale
> > >   -> Nick is starting to work on that and has
> > >  already had good results.
> > 
> > Frankly I don't follow this vmap thing. Can you elaborate?
> 
> We current support metadata blocks larger than page size for
> certain types of metadata in XFS. e.g. directory blocks.
> This however, requires vmap()ing a bunch of individual,
> non-contiguous pages out of a block device address space
> in exactly the fashion that was proposed by Nick with fsblock
> originally.
> 
> vmap() has severe scalability problems - read this subthread
> of this discussion between Nick and myself:
> 
> http://lkml.org/lkml/2007/9/11/508

So the idea of vmap is that it's much simpler to have a contiguous
virtual address space for the large blocksize than to find the right
b_data[index] once you exceed PAGE_SIZE...

The global tlb flush with ipi would kill performance; you can forget
any global mapping here. The only chance to do this would be like we
do with kmap_atomic per-cpu on highmem, with preempt_disable (for the
enjoyment of the rt folks out there ;). What's the problem with having
it per-cpu? Is this what fsblock already does? You just have to
allocate a new virtual range of numberofentriesinvmap*blocksize
every time you mount a new fs. Then instead of calling kmap you call
vmap, and vunmap when you're finished. That should provide decent
performance, especially with physically indexed caches.
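
In sketch form, the per-cpu scheme could look like this (illustrative
only: every helper below except preempt_disable/preempt_enable is
hypothetical, standing in for the per-mount range allocation just
described):

/*
 * Hypothetical sketch of a per-cpu virtual window for large blocks,
 * in the spirit of kmap_atomic(). this_cpu_vmap_window(),
 * map_block_pages() and unmap_block() are invented names, not
 * existing kernel APIs.
 */
static void *vmap_block_atomic(struct page **pages, int nr_pages)
{
	void *window;

	preempt_disable();			/* stay on this cpu */
	window = this_cpu_vmap_window();	/* range reserved at mount time */
	map_block_pages(window, pages, nr_pages); /* local tlb flush only */
	return window;
}

static void vunmap_block_atomic(void *window)
{
	unmap_block(window);			/* no global ipi flush needed */
	preempt_enable();
}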

Anything more heavyweight than what I suggested is probably overkill,
even vmalloc_to_page.

> Hmm - so you'll need page cache tail packing as well in that case
> to prevent memory being wasted on small files. That means any way
> we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
> we've got some non-trivial VM  modifications to make. 

Hmm no, the point of config-page-shift is that if you really need to
reach "the very end", you probably don't care about wasting some
memory, because either your workload can't fit in cache, or it fits in
cache regardless, or you're not wasting memory because you work with
large files...

The only point of this largepage stuff is to go an extra mile to save
a bit more cpu vs a strict vmap-based solution (fsblock of course
will be smart enough that if it notices PAGE_SIZE >= blocksize
it doesn't need to run any vmap at all and can just use the direct
mapping, so vmap translates into only one branch checking the blocksize
variable; PAGE_SIZE is an immediate in the .text at compile time). But
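
To illustrate the branch just described (a sketch, not fsblock's
actual code: page_address() and vmap() are the real kernel
interfaces, the function itself is invented):

/* Sketch only: when the block fits in one page, use the direct
 * mapping and never call vmap(). PAGE_SIZE is a compile-time
 * immediate, so the common case costs one predictable branch. */
static void *fsblock_map(struct page **pages, unsigned int blocksize)
{
	if (blocksize <= PAGE_SIZE)
		return page_address(pages[0]);	/* direct mapping */
	return vmap(pages, blocksize >> PAGE_SHIFT, VM_MAP, PAGE_KERNEL);
}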

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Andrea Arcangeli
On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> Ok, let's step back for a moment and look at a basic, fundamental
> constraint of disks - seek capacity. A decade ago, a terabyte of
> filesystem had 30 disks behind it - a seek capacity of about
> 6000 seeks/s. Nowadays, that's a single disk with a seek
> capacity of about 200/s. We're going *rapidly* backwards in
> terms of seek capacity per terabyte of storage.
> 
> Now fill that terabyte of storage and index it in the most efficient
> way - let's say btrees are used because lots of filesystems use
> them. Hence the depth of the tree is roughly O((log n)/m) where m is
> a factor of the btree block size.  Effectively, btree depth = seek
> count on lookup of any object.

I agree. btrees will clearly benefit if the nodes are larger. We have
an excess of disk capacity and a huge gap between seek and contiguous
bandwidth.
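
To put rough numbers on that (purely illustrative arithmetic,
assuming ~16-byte index entries; compile with -lm):

/* Btree depth ~= seeks per lookup; it shrinks as node size grows. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double entries = 1e9;			/* objects indexed */
	int block;

	for (block = 4096; block <= 65536; block *= 4) {
		double fanout = block / 16.0;	/* entries per node */
		printf("%6d-byte nodes: depth %.0f\n",
		       block, ceil(log(entries) / log(fanout)));
	}
	return 0;
}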

You don't need largepages for this, fsblocks is enough.

Largepages for you are a further improvement, to reduce the number of
SG entries and potentially cut cpu utilization a bit (not much
though: only the pagecache works with largepages, and especially with
small-sized random I/O you'll be taking the radix tree lock the same
number of times...).

Plus of course you don't like fsblock because it requires work to
adapt a fs to it; I can't argue about that.

> Ok, so let's set the record straight. There were 3 justifications
> for using *large pages* to *support* large filesystem block sizes
> The justifications for the variable order page cache with large
> pages were:
> 
>   1. little code change needed in the filesystems
>   -> still true

Disagree: the mmap side is not a little change. If you do it just for
the non-mmapped I/O it truly is a hack, but then frankly I would
prefer only the read/write hack (without mmap) so it will not clash
heavily with my stuff and it'll be quicker to nuke it out of the
kernel later.

>   3. avoiding the need for vmap() as it has great
>  overhead and does not scale
>   -> Nick is starting to work on that and has
>  already had good results.

Frankly I don't follow this vmap thing. Can you elaborate? Is this
about allowing the blkdev pagecache for metadata to go into highmemory?
Is that the kmap thing? I think we can stick to a direct-mapped b_data
and avoid all the overhead of converting a struct page to a virtual
address. It takes the same 64bit size in ram anyway, and we avoid one
layer of indirection and many modifications. If we wanted to switch to
kmap for the blkdev pagecache we should have done it years ago; now
it's far too late to worry about it.

> Everyone seems to be focussing on #2 as the entire justification for
> large block sizes in filesystems and that this is an "SGI" problem.

I agree it's not an SGI problem, and this is why I want a design that
has a _slight chance_ to improve performance on x86-64 too. If the
variable order page cache provides any further improvement on top of
fsblock, it will only be because your I/O device isn't fast with
small sg entries.

For the I/O layout fsblock is more than enough, but I don't think
your variable order page cache will help in any significant way on
x86-64. Furthermore the complexity of handling page faults on largepages
is almost equivalent to the complexity of config-page-shift, but
config-page-shift gives you the whole cpu-saving benefit that you can
never remotely hope to achieve with the variable order page cache.

config-page-shift + fsblock IMHO is the way to go for x86-64, with one
additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
top of fsblock.

fsblock will provide the guarantee of "mounting" any fs anywhere no
matter which config-page-shift you selected at compile time, as well
as dvd writing. Then config-page-shift will provide the cpu
optimization on all fronts, not just for the pagecache I/O on
large-ram systems, without fragmentation issues and with 100%
reliability in the "free" numbers (not working by luck). That's all we
need as far as I can tell.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Andrea Arcangeli
On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
> When has free ever given any useful "free" number? I can perfectly
> fine allocate another gigabyte of memory despite free saying 25MB. But
> that is because I know that the buffer/cached are not locked in.

Well, as you said, you know that buffer/cached are not locked in. If
/proc/meminfo were rubbish, as you seem to imply in your first
line, why would we ever bother to export that information, and even
waste time writing a binary that parses it for admins?

> On the other hand 1GB can instantly vanish when I start a xen domain
> and anything relying on the free value would lose.

Actually, you'd better check meminfo or free before starting a 1G Xen domain!!

> The only sensible thing for an application concerned with swapping is
> to watch the swapping and then reduce itself. Not the amount
> free. Although I wish there were some kernel interface to get a
> pressure value of how valuable free pages would be right now. I would
> like that for fuse so a userspace filesystem can do caching without
> crippling the kernel.

Repeated drop_caches + free can help.
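
For instance, a minimal sketch (assumes a 2.6.16+ kernel with
/proc/sys/vm/drop_caches, and root privileges):

/* Drop the clean pagecache and slab, then read MemFree, to estimate
 * how much ram is genuinely reclaimable right now. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
	char line[128];

	if (f) {
		fputs("3\n", f);	/* 3 = pagecache + slab */
		fclose(f);
	}
	f = fopen("/proc/meminfo", "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "MemFree:", 8))
			fputs(line, stdout);
	fclose(f);
	return 0;
}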


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Andrea Arcangeli
On Tue, Sep 18, 2007 at 11:30:17AM -0700, Linus Torvalds wrote:
> The fact is, *none* of those things are true. The VM doesn't guarantee 
> anything, and is already very much about statistics in many places. You 

Many? I can't recall anything besides PF_MEMALLOC and the decision
that the VM is oom. Those are the only two gray areas... the safety
margin is large enough that nobody notices the lack of a
black-and-white solution.

So instead of working to provide guarantees for the above two gray
spots, we're making everything weaker; that's the wrong direction as
far as I can tell, especially if we're going to mess up the common
code big time, in a backwards way, only for those few users of those
few I/O devices out there.

In general, every time reliability gets a lower priority than
performance, I have a hard time enjoying it.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Andrea Arcangeli
On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
> The 16MB is the size of a hugepage, the size of interest as far as I am
> concerned. Your idea makes sense for large block support, but much less
> for huge pages because you are incurring a cost in the general case for
> something that may not be used.

Sorry for the misunderstanding, I totally agree!

> There is nothing to say that both can't be done. Raise the size of
> order-0 for large block support and continue trying to group the block
> to make hugepage allocations likely succeed during the lifetime of the
> system.

Sure, I completely agree.

> At the risk of repeating, your approach will be adding a new and
> significant dimension to the internal fragmentation problem where a
> kernel allocation may fail because the larger order-0 pages are all
> being pinned by userspace pages.

This is exactly correct: some memory will be wasted. It'll reach 0
free memory more quickly, depending on which kind of applications are
being run.

> It improves the probability of hugepage allocations working because the
> blocks with slab pages can be targeted and cleared if necessary.

Agreed.

> That suggestion was aimed at the large block support more than
> hugepages. It helps large blocks because we'll be allocating and freeing
> as more or less the same size. It certainly is easier to set
> slub_min_order to the same size as what is needed for large blocks in
> the system than introducing the necessary mechanics to allocate
> pagetable pages and userspace pages from slab.

Allocating userpages from slab in 4k chunks with a 64k PAGE_SIZE is
really complex indeed. I'm not planning for that in the short term, but
it remains a possibility to make the kernel more generic. Perhaps it
could be worth it...

Allocating ptes from slab is fairly simple, but I think it would be
better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
nearby ptes in the per-task local pagetable tree, to reduce the number
of locks taken and not enter the slab at all for that. In fact we
could allocate the 4 levels (or anyway more than one level) in one
single alloc_pages(0) and track the leftovers in the mm (or similar).
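
A very rough sketch of that idea (everything here is hypothetical:
the pt_spare fields do not exist in the real mm_struct, and locking
is omitted):

/* Assuming a 64k PAGE_SIZE: carve the pagetable levels a fault needs
 * out of one order-0 page and keep the remainder in the mm for the
 * next nearby fault, avoiding the slab entirely. */
static void *pt_alloc(struct mm_struct *mm, unsigned int size)
{
	void *p;

	if (!mm->pt_spare || mm->pt_spare_off + size > PAGE_SIZE) {
		mm->pt_spare = (void *)__get_free_pages(GFP_KERNEL, 0);
		if (!mm->pt_spare)
			return NULL;
		mm->pt_spare_off = 0;
	}
	p = (char *)mm->pt_spare + mm->pt_spare_off;
	mm->pt_spare_off += size;
	return p;
}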

> I'm not sure what you are getting at here. I think it would make more
> sense if you said "when you read /proc/buddyinfo, you know the order-0
> pages are really free for use with large blocks" with your approach.

I'm unsure who reads /proc/buddyinfo (it can change a lot, and it
is not very significant information if the vm can defrag well inside
the reclaim code), but it's not much different; it's more about
knowing the real meaning of /proc/meminfo: freeable (unmapped) cache,
anon ram, and free memory.

The idea is that for an mmap over a large xfs file with mlockall
invoked to succeed, those largepages must become available or
it'll be oom despite there still being 512M free... I'm quite sure
admins will get confused if the oom killer is invoked with lots of
ram still free.

The overcommit feature will also break, just to give an example (so
much for overcommit mode 2 guaranteeing -ENOMEM retvals instead of oom
killage ;).

> All this aside, there is nothing mutually exclusive with what you are 
> proposing
> and what grouping pages by mobility does. Your stuff can exist even if 
> grouping
> pages by mobility is in place. In fact, it'll help us by giving an important
> comparison point as grouping pages by mobility can be trivially disabled with
> a one-liner for the purposes of testing. If your approach is brought to being
> a general solution that also helps hugepage allocation, then we can revisit
> grouping pages by mobility by comparing kernels with it enabled and without.

Yes, I totally agree. It sounds worthwhile to have good defrag logic
in the VM. Even allocating the kernel stack in today's kernels should
be able to benefit from your work. It's just that comparing a fork()
failure (something that will happen with ulimit -n too, and that apps
must be able to deal with) with an I/O failure worries me a bit. I'm
quite sure a db failing I/O will not recover too nicely. If fork fails
that's most certainly ok... at worst a new client won't be able to
connect and can retry later. Plus order 1 isn't really a big deal;
you know the probability of success decreases exponentially with the
order.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Andrea Arcangeli
On Sun, Sep 16, 2007 at 07:15:04PM +0100, Mel Gorman wrote:
> Except now as I've repeatedly pointed out, you have internal fragmentation
> problems. If we went with the SLAB, we would need 16MB slabs on PowerPC for
> example to get the same sort of results and a lot of copying and moving when

Well, I'm not sure about the 16MB number, since I'm unsure what the
size of the ram was. But clearly I agree there are fragmentation issues
in the slab too; there always have been, except they're much less
severe, and the slab is meant to deal with that regardless of the
PAGE_SIZE. That is not a new problem; you are introducing a new problem
instead.

We can do a lot better than slab currently does without requiring any
defrag move-or-shrink at all.

slab tries to defrag memory for small objects at nearly zero cost,
by not giving pages away randomly. I thought you agreed that solving
the slab fragmentation was going to provide better guarantees, since in
another email you suggested that you could start allocating order > 0
pages in the slab to reduce the fragmentation (to achieve most of the
guarantee provided by config-page-shift, while still keeping
order 0 at 4k for whatever reason I can't see).

> a suitable slab page was not available.

You ignore one other bit: when "/usr/bin/free" says 1G is free, with
config-page-shift it's free no matter what; the same goes for
non-mlocked cache. With the variable order page cache, /usr/bin/free
becomes mostly a lie as long as there's no 4k fallback (like fsblock).

And most important, you're only tackling the pagecache and I/O
performance with the inefficient I/O devices; the whole kernel gets no
chance of a speedup, and in fact you're making the fast paths slower,
just the opposite of config-page-shift and Hugh's original large
PAGE_SIZE ;).


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Andrea Arcangeli
On Sun, Sep 16, 2007 at 03:54:56PM +0200, Goswin von Brederlow wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> writes:
> 
> > On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote:
> >> - Userspace allocates a lot of memory in those slabs.
> >
> > If with slabs you mean slab/slub, I can't follow, there has never been
> > a single byte of userland memory allocated there since ever the slab
> > existed in linux.
> 
> This and other comments in your reply show me that you completely
> misunderstood what I was talking about.
> 
> Look at
> http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg

What does the large square represent here? A "largepage"? If yes,
which order? There seem to be quite a few pixels in each square...

> The red dots (pinned) are dentries, page tables, kernel stacks,
> whatever kernel stuff, right?
> 
> The green dots (movable) are mostly userspace pages being mapped
> there, right?

If the largepage is the square, there can't be red pixels mixed with
green pixels with the config-page-shift design; this is the whole
difference...

Zooming in, I see red pixels all over the squares, mixed with green
pixels in the same square. This is exactly what happens with the
variable order page cache, and that's why it provides zero guarantees
in terms of how much ram is really "free" (free as in "available").

> What I was refering too is that because movable objects (green dots)
> aren't moved out of a mixed group (the boxes) when some unmovable
> object needs space all the groups become mixed over time. That means
> the unmovable objects are spread out over all the ram and the buddy
> system can't recombine regions when unmovable objects free them. There
> will nearly always be some movable objects in the other buddy. The
> system of having unmovable and movable groups breaks down and becomes
> useless.

If I understood correctly, here you agree that mixing movable and
unmovable objects in the same largepage is a bad thing, and that's
incidentally what config-page-shift prevents. It avoids it instead of
undoing the mixture later with defrag when it's far too late for
anything but updatedb.

> I'm assuming here that we want the possibility of larger order pages
> for unmovable objects (large continuous regions for DMA for example)
> than the smallest order user space gets (or any movable object). If
> mmap() still works on 4k page boundaries then those will fragment all
> regions into 4k chunks in the worst case.

With config-page-shift mmap works on 4k chunks, but it's always backed
by 64k or whatever other large size you chose at compile time. And if
the virtual alignment of mmap matches the physical alignment of the
physical largepage and is >= PAGE_SIZE (the software PAGE_SIZE I mean),
we could use the 62nd bit of the pte to use a 64k tlb (if future cpus
allow that). Nick also suggested still setting all ptes equal to
make life easier for the tlb-miss microcode.

> Obviously if userspace has a minimum order of 64k chunks then it will
> never break any region smaller than 64k chunks and will never cause a
> fragmentation catastrophe. I know that is very roughly your approach
> (make order 0 bigger), and I like it, but it has some limits as to how

Yep, exactly: this is what happens, it avoids that trouble. But as far
as fragmentation guarantees go, it's really about keeping the
unmovable out of our way (instead of spreading the unmovable all over
the buddy randomly, or with ugly
boot-time-fixed-number memory reservations) rather than about mapping
largepages in userland. In fact, as I said, we could map kmalloced 4k
entries in userland to save memory, if we really wanted to hurt the
fast paths to make a generic kernel for use on smaller systems, but
that would be very complex. Since those 4k entries would be 100%
movable (unlike the rest of the slab, like dentries and inodes etc..)
that wouldn't make the design less reliable; it'd still be 100%
reliable, and performance would be ok because that memory is userland
memory: we have to set the pte anyway, regardless of whether it's a 4k
page or a largepage.

> big you can make it. I don't think my system with 1GB ram would work
> so well with 2MB order 0 pages. But I wasn't refering to that but to
> the picture.

Sure! 2M is surely way excessive for a 1G system; 64k most certainly
too, unless of course you're running a db or a multimedia streaming
service, in which case it should be ideal.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-15 Thread Andrea Arcangeli
On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote:
> How does that help? Will slabs move objects around to combine two

1. It helps by providing a few guarantees: when you run "/usr/bin/free"
you won't get a random number, but a strong _guarantee_. That ram will
be available no matter what.

With a variable order page size you may run oom by mlocking some
half-free ram in pagecache backed by largepages. "free" becomes a fake
number provided by a weak design.

Apps and admins need to know for sure the ram that is available, to be
able to fine-tune the workload to avoid running into swap while still
using all available ram.

> partially filled slabs into nearly full one? If not consider this:

2. Yes, slab can indeed be freed to release an excessive number of 64k
   pages pinned by an insignificant number of small objects. I already
   told Mel, even at the VM summit, that slab defrag can pay off
   regardless; this is nothing new, since it will pay off even
   today with 2.6.23 with regard to kmalloc(32).

> - You create a slab for 4k objects based on 64k compound pages.
>   (first of all that wastes you a page already for the meta infos)

There's not just one 4k object in the system... The whole point is to
make sure all those 4k objects go into the same 64k page. This way,
for you to be able to reproduce Nick's worst case scenario you have to
allocate total_ram/4k objects of size 4k...

> - Something movable allocates 14 4k pages in there making the slab
>   partially filled.

Movable? I rather assume all slab allocations aren't movable. Then
slab defrag can try to tackle users like dcache and inodes. Keep in
mind that, with the exception of updatedb, those inodes/dentries will
be pinned and you won't move them, which is why I prefer to consider
them not movable too... since there's no guarantee they are.

> - Something unmovable allocates a 4k page making the slab mixed and
>   full.

The entire slab being full is a perfect scenario: it means zero memory
waste. It's actually the ideal scenario; I can't follow your logic...

> - Repeat until out of memory.

for(;;) kmalloc(32); is supposed to run oom, no breaking news here...

> - Userspace allocates a lot of memory in those slabs.

If by slabs you mean slab/slub, I can't follow: there has never been
a single byte of userland memory allocated there since the slab has
existed in linux.

> - Userspace frees one in every 15 4k chunks.

I guess you're confusing the config-page-shift design with the sgi
design, where userland memory gets mixed with slab entries in the same
64k page... Also, with config-page-shift the userland pages will all
be 64k.

Things will get more complicated if we later decide to allow
kmalloc(4k) pagecache to be mapped in userland instead of only being
available for reads. But then we can restrict that to one slab and
make it relocatable by following the ptes. That will complicate things
a lot.

But the whole point is that you don't need all that complexity,
and that as long as you're ok with losing some memory, you will get a
strong guarantee when "free" tells you 1G is free or available as
cache.

> - Userspace forks 1000 times causing an unmovable task structure to
>   appear in 1000 slabs. 

If 1000 kmem_cache_alloc(kernel_stack) calls in a row keep 1000 64k
slab pages pinned, it means there have previously been at least
64k/8k*1000 simultaneous tasks allocated at once, not just your 1000
forks.

Even if, when "free" says there's 1G free, it weren't a 100% strong
guarantee, and even if the slab didn't provide strong defrag-avoidance
guarantees by design, splitting pages down in the core and
then merging them up outside the core sounds less efficient than
keeping the pages large in the core and then splitting them outside
the core for the few non-performance-critical small users. We're not
talking about laptops here: if the major load happens on tiny things
and tiny objects, nobody should compile a kernel with a 64k page size,
which is why there need to be 2 rpms to get peak performance.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-15 Thread Andrea Arcangeli
On Sat, Sep 15, 2007 at 02:14:42PM +0200, Goswin von Brederlow wrote:
> I keep coming back to the fact that movable objects should be moved
> out of the way for unmovable ones. Anything else just allows

That's incidentally exactly what the slab does; no need to reinvent
the wheel for that. It's an old problem, and there's room for
optimization in the slab partial-reuse logic too. Just boost the order
0 page size and use the slab to get the 4k chunks. The sgi/defrag
design is backwards.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-12 Thread Andrea Arcangeli
On Tue, Sep 11, 2007 at 05:04:41PM -0700, Christoph Lameter wrote:
> I would think that your approach would be slower since you always have to 
> populate 1 << N ptes when mmapping a file? Plus there is a lot of wastage 

I don't have to populate them; I could just map one at a time. The only
reason I want to populate every possible pte that could map that page
(by checking vma ranges) is to _improve_ performance by decreasing the
number of page faults by an order of magnitude. Then, with the 62nd bit
after NX giving me a 64k tlb, I could decrease the frequency of the
tlb misses too.

> of memory because even a file with one character needs an order N page? So 
> there are less pages available for the same workload.

This is a known issue. The same is true for ppc64's 64k. If that really
is an issue, it may need some generic solution with tail packing.

> Then you are breaking mmap assumptions of applications because the order
> N kernel will no longer be able to map 4k pages.  You likely need a new
> binary format that has pages correctly aligned. I know that we would need
> one on IA64 if we go beyond the established page sizes.

No, you misunderstood the whole design. My patch will be 100% backwards
compatible in all respects. If I could break backwards compatibility,
70% of the complexity would go away...


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Andrea Arcangeli
On Tue, Sep 11, 2007 at 01:41:08PM -0700, Christoph Lameter wrote:
> The advantages of this approach over Andrea's is basically that the 4k
> filesystems still can be used as is. 4k is useful for binaries and for

If you mean that with my approach you can't use a 4k filesystem as is,
that's not correct. I even ran the (admittedly premature but
promising) benchmarks of my patch on a 4k-blocksize
filesystem... Guess what, you can even still mount a 1k fs on a 2.6
kernel.

The main advantage I can see in your patch is that distributions won't
need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be
slower).


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Andrea Arcangeli
Hi,

On Tue, Sep 11, 2007 at 07:31:01PM +0100, Mel Gorman wrote:
> Now, the worst case scenario for your patch is that a hostile process
> allocates large amount of memory and mlocks() one 4K page per 64K chunk
> (this is unlikely in practice I know). The end result is you have many
> 64KB regions that are now unusable because 4K is pinned in each of them.

Initially, 4k kmalloced tails aren't going to be mapped in
userland. But take the kernel stack, which would create the same
problem and which is clearly going to pin the whole 64k slab/slub
page.

What I think you're missing is that for Nick's worst case to trigger
with the config_page_shift design, you would need the _whole_ ram to
have been allocated _at_least_once_ entirely in kernel stacks. If 100%
of ram never goes allocated into slub as pure kernel stacks, such a
scenario can never materialize.

With the SGI design + defrag, Nick's scenario can instead happen with
only total_ram/64k kernel stacks allocated.
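
Back-of-the-envelope numbers for the two thresholds (illustrative
only; assumes 8k stacks and 64k blocks on an example 4G box):

#include <stdio.h>

int main(void)
{
	unsigned long long ram = 4ULL << 30;	/* 4G of ram */

	/* SGI design: one 8k stack suffices to pin each 64k block */
	printf("stacks to pin every 64k block: %llu\n", ram / (64 * 1024));
	/* config_page_shift: the slab packs stacks together, so the
	 * whole ram must have been stack-allocated at least once */
	printf("stacks to fill all of ram:     %llu\n", ram / (8 * 1024));
	return 0;
}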

The problem with slub fragmentation isn't a new problem; it happens
in today's kernels as well, and at least the slab is by design meant
to _defrag_ internally. So it's practically already solved, and it
provides some guarantee, unlike the buddy allocator.

> If it's my fault, sorry about that. It wasn't my intention.

It's not the fault of anyone; I simply didn't push too hard towards my
agenda, for the reasons I just gave, but I used any given opportunity
to discuss it.

By on-topic I meant not talking about it during the other topics,
like mmap_sem or RCU with the radix tree lock ;)

> heh. Well we need to come to some sort of conclusion here or this will
> go around the merry-go-round till we're all bald.

Well, I only meant I'm still free to disagree if I think there's a
better way. All SGI has provided so far is data showing that their I/O
subsystem is much faster if the data is physically contiguous in ram
(ask Linus if you want more details, or better, don't ask). That's not
very interesting data for my usage and my hardware, and I guess
it's more likely that config_page_shift will produce interesting
numbers than their patch in my possible usage cases, but we'll never
know until both are finished.

> heh, I suggested printing the warning because I knew it had this
> problem. The purpose in my mind was to see how far the design could be
> brought before fs-block had to fill in the holes.

Indeed!

> I am still failing to see what happens when there are pagetable pages,
> slab objects or mlocked 4k pages pinning the 64K pages and you need to
> allocate another 64K page for the filesystem. I *think* you deadlock in
> a similar fashion to Christoph's approach but the shape of the problem
> is different because we are dealing with internal instead of external
> fragmentation. Am I wrong?

pagetables aren't the issue. They should still be pre-allocated in
page_size chunks. The 4k entries with a 64k page-size are surely no
worse than a 32-byte kmalloc today; the slab by design defragments the
stuff. There's probably room for improvement in that area even without
freeing any object, by just ordering the list with an rbtree (or better
a heap, like CFS should also use!!) so as to always allocate new
objects from the most-full partial slab; that alone would probably
help a lot (not sure if slub does anything like that; I'm not fond of
slub yet).
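
As a toy model of that policy (illustrative only; a real
implementation would keep the partial list sorted, e.g. in an rbtree
or heap, instead of scanning):

#include <stdio.h>

struct partial_slab {
	int inuse;	/* objects allocated */
	int objects;	/* slots per slab */
};

/* Prefer the most-full partial slab so nearly-empty ones can drain
 * and be freed back to the buddy. */
static struct partial_slab *pick_partial(struct partial_slab *s, int n)
{
	struct partial_slab *best = NULL;
	int i;

	for (i = 0; i < n; i++)
		if (s[i].inuse < s[i].objects &&	/* has a free slot */
		    (!best || s[i].inuse > best->inuse))
			best = &s[i];
	return best;
}

int main(void)
{
	struct partial_slab slabs[] = { {1, 16}, {14, 16}, {7, 16} };
	struct partial_slab *s = pick_partial(slabs, 3);

	printf("allocate from the slab with %d/%d in use\n",
	       s->inuse, s->objects);
	return 0;
}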

> Yes. I just think you have a different worst case that is just as bad.

Disagree here...

> small files (you need something like Shaggy's page tail packing),
> pagetable pages, pte pages all have to be dealt with. These are the
> things I think will cause us internal fragmentation problems.

Also note that not all users will need to turn on tail packing. We're
talking here about features that not all users will need anyway... And
we're in the same boat as ppc64, no difference.

Thanks!


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Andrea Arcangeli
Hi Mel,

On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> that increasing the pagesize like what Andrea suggested would lead to
> internal fragmentation problems. Regrettably we didn't discuss Andrea's

config_page_shift guarantees that kernel stacks, or whatever other
non-defragmentable allocations, go into the same 64k "not
defragmentable" page. Unlike the SGI design, where an 8k kernel stack
could be allocated in the first 64k page, and then another 8k stack
could be allocated in the next 64k page, effectively pinning all 64k
pages until Nick's worst case scenario triggers.

What I said at the VM summit is that your reclaim-defrag patch for the
slub isn't necessarily entirely useless with config_page_shift,
because the larger the software page_size, the more partial pages we
may find in the slab; so, to save some memory when there are tons of
very partially used pages, we could free some of them.

But the whole point is that with config_page_shift, Nick's worst
case scenario can't happen by design, regardless of defrag or no
defrag. While it can _definitely_ happen with the SGI design
(regardless of any defrag thing). We can still try to save some memory
by defragging the slab a bit, but it's by far *not* required with
config_page_shift. In fact, no defrag at all is required.

Plus there's a cost in defragging and freeing cache... the more you
need defrag, the slower the kernel will be.

> approach in depth.

Well, it wasn't my fault that we didn't discuss it in depth, though. I
tried to discuss it on every occasion where it was suggested I talk
about it and where it was somewhat on topic. Given I wasn't even
invited to the KS, I felt it would not be appropriate for me to try to
monopolize the VM summit according to my agenda. So I happily listened
to what the top kernel developers are planning ;), while giving
some hints on what I think the right direction is instead.

> I *thought* that the end conclusion was that we would go with

Frankly I don't care what the end conclusion was.

> Christoph's approach pending two things being resolved;
> 
> o mmap() support that we agreed on is good

Let's see how well the mmap support for variable order page size
works after the two weeks...

> o A clear statement, with logging maybe for users that mounted a large 
>   block filesystem that it might blow up and they get to keep both parts
>   when it does. Basically, for now it's only suitable in specialised
>   environments.

Yes, but perhaps you missed that such a printk is needed exactly to
provide proof that the SGI design is the wrong way and needs to be
dumped. If that printk ever triggers, it means you were totally wrong.

> I also thought there was an acknowledgement that long-term, fs-block was
> the way to go - possibly using contiguous pages optimistically instead
> of virtual mapping the pages. At that point, it would be a general
> solution and we could remove the warnings.

fsblock should simply stack on top of config_page_shift. Both are
needed. You don't want to use 64k pages on a laptop, but you may want a
larger blocksize for the btrees etc... if you have a large harddisk and
not much ram.

> That's the absolute worst case but yes, in theory this can occur and
> it's safest to assume the situation will occur somewhere to someone. It

Do you agree this worst case can't happen with config_page_shift?

> Where we expected to see the use of this patchset was in specialised
> environments *only*. The SGI people can mitigate their mixed
> fragmentation problems somewhat by setting slub_min_order ==
> large_block_order so that blocks get allocated and freed at the same
> size. This is a partial way towards Andrea's solution of raising the size
> of an order-0 allocation. The point of printing out the warnings at

Except you don't get the full benefits of it...

Even if I could end up mapping 4k kmalloced entries in userland for
the tail packing, that IMHO would still be a preferable solution to
keeping the base page small and making a hard effort to create large
pages out of small pages. The approach I advocate keeps the base page
big and the fast path fast, and it rather does some work to split the
base pages outside the buddy for the small files.

All your defrag work is still good to have, like I said at the VM
summit if you remember, to grow hugetlbfs at runtime etc... I'd just
rather avoid depending on it to avoid I/O failures in the presence of
mlocked pagecache, for example.

> Independently of that, we would work on order-0 scalability,
> particularly readahead and batching operations on ranges of pages as
> much as possible.

That's pretty much unnecessary logic, if the order-0 pages become
larger.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Andrea Arcangeli
On Tue, Sep 11, 2007 at 04:52:19AM +1000, Nick Piggin wrote:
> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit. I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.
> 
> If you don't consider that is a problem because you don't care about
> theoretical issues or nobody has reported it from running -mm

-mm kernels also forbid mmap, so there's no chance the largepages are
mlocked etc... that's not the final thing that is being measured.

> kernels, then I simply can't argue against that on a technical basis.
> But I'm totally against introducing known big fundamental problems to
> the VM at this stage of the kernel. God knows how long it takes to ever
> fix them in future after they have become pervasive throughout the 
> kernel.

Seconded.

> IMO the only thing that higher order pagecache is good for is a quick
> hack for filesystems to support larger block sizes. And after seeing it
> is fairly ugly to support mmap, I'm not even really happy for it to do
> that.

Additionally, I feel the ones that will get the main advantage from the
quick hack are the crippled devices that are ~30% slower if the SG
tables are large.

> If VM scalability is a problem, then it needs to be addressed in other
> areas anyway for order-0 pages, and if contiguous pages helps IO
> scalability or crappy hardware, then there is nothing stopping us from

Yep.

> *attempting* to get contiguous memory in the current scheme.
> 
> Basically, if you're placing your hopes for VM and IO scalability on this,
> then I think that's a totally broken thing to do and will end up making
> the kernel worse in the years to come (except maybe on some poor
> configurations of bad hardware).

Agreed. For my part, I am really convinced the only sane way to
approach the VM scalability and larger-physically-contiguous-pages
problem is the CONFIG_PAGE_SHIFT patch (aka large PAGE_SIZE, from Hugh
for 2.4). I also have to say I always disliked the PAGE_CACHE_SIZE
definition too ;). I take it only as an attempt at documentation.

Furthermore all the issues with writeprotect faults over MAP_PRIVATE
regions will have to be addressed the same way with both approaches if
we want real 100% 4k-granular backwards compatibility.

On this topic I'm also going to suggest to the cpu vendors to add a 64k
tlb using the reserved 62nd bitflag in the pte (right after the NX
bit). So if alignment allows, we can map pagecache with a 64k large tlb
on x86 (with a PAGE_SIZE of 64k), mixing it with the 4k tlb in the
same address space if userland alignment forbids using the 64k tlb. If
we want to break backwards compatibility and force all alignments to
64k and get rid of any 4k tlb to simplify the page fault code, we can
do it later anyway... No idea if this is feasible to achieve at the
hardware level though; it's not my problem to judge this anyway ;). As
a constraint on the hardware interface it would be ok to require the
62nd 64k-tlb bitflag to be available only on the pte that would have
normally mapped a 64k naturally aligned physical address, and to
require all later overlapping 4k ptes to be set to 0. If you have
better ideas to achieve this than my interface, please let me know.
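
For concreteness, a sketch of the alignment constraint (entirely
hypothetical: no such pte bit exists in real x86 hardware, and
PTE_TLB64K is an invented name):

/* Proposed bit 62, right below NX (bit 63). */
#define PTE_TLB64K	(1ULL << 62)

static unsigned long long mk_pte64k(unsigned long long paddr,
				    unsigned long vaddr)
{
	unsigned long long pte = paddr;

	/* usable only when the virtual and physical addresses are
	 * both 64k naturally aligned, as required above */
	if (!(paddr & 0xffff) && !(vaddr & 0xffff))
		pte |= PTE_TLB64K;
	return pte;
}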

And if I'm terribly wrong and the variable order pagecache is the way
to go for the long run, the 64k tlb feature will fit in that model
very nicely too.

The reason for the 64k magic number is that this is the minimum unit of
contiguous I/O required to reach platter speed on most devices out
there. And it incidentally also matches ppc64 ;).


Re: [lvm-devel] *** ANNOUNCEMENT *** LVM 0.9.1 beta5 available at www.sistina.com

2001-02-20 Thread Andrea Arcangeli

On Tue, Feb 20, 2001 at 05:31:25PM -0700, Andreas Dilger wrote:
> The reason why the IOP was changed was because the VG_CREATE ioctl now
> depends on the vg_number in the supplied vg_t to determine which VG minor
> number to use.  The old interface used the minor number of the opened
> device inode, but for devfs the device inodes don't exist until the VG
> is created...  If you run an older kernel with new tools, you can only
> use the first VG.

Ah, I was reading the patch incidentally against the 2.2 patch where devfs
support is not included, so I wasn't thinking the devfs way ;). Thanks for the
explanation.

I assume it's not possible to mknod on top of devfs. So then we could use a
temporary device node in /var/tmp or wherever for that. However, those
workarounds tend to be ugly.

Probably the best way to preserve the IOP, which I recommend for beta6, is to
add a new ioctl to the VG chardevice. Rename VG_CREATE to VG_CREATE_OLD.
VG_CREATE_OLD is a wrapper that calculates the minor number from the inode and
then falls back into VG_CREATE, and the new VG_CREATE is the one that gets
the minor of the vg from userspace.
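
In sketch form (illustrative only; do_vg_create and the dispatch
around it stand in for the real LVM driver code):

/* VG_CREATE_OLD recovers the minor from the opened device inode and
 * then shares the new VG_CREATE path, so IOP 10 keeps working for
 * old tools. */
static int vg_create_ioctl(struct inode *inode, unsigned int cmd, vg_t *vg)
{
	switch (cmd) {
	case VG_CREATE_OLD:
		vg->vg_number = MINOR(inode->i_rdev);
		/* fall through */
	case VG_CREATE:
		return do_vg_create(vg);	/* hypothetical shared helper */
	}
	return -EINVAL;
}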

Either way we don't break backwards compatibility across the 0.9* cycle.

If there were a strong reason, and it would be a mess to provide backwards
compatibility, I would of course agree to move to IOP 11; but just to avoid a
few lines of code for a wrapper, or a temporary mknod in /tmp for a devfs-only
fix, I think it's worth preserving IOP 10.

Andrea



Re: [lvm-devel] *** ANNOUNCEMENT *** LVM 0.9.1 beta5 available at www.sistina.com

2001-02-20 Thread Andrea Arcangeli

On Tue, Feb 20, 2001 at 10:49:07PM +, Heinz Mauelshagen wrote:
> 
> Hi all,
> 
> a tarball of the Linux Logical Volume Manager 0.9.1 Beta 5 is available now at
> 
>
> 
> for download (Follow the "LVM download page" link).
> 
> This release fixes several bugs.
> See the CHANGELOG file contained in the tarball for further information.
> 
> A change in the i/o protocol version *forces* you to update
> the driver as well.
> Follow the instructions in PATCHES/README to achieve this please.
> 
> 
> Please help us to stabilize 0.9.1 ASAP and test it as much as possible!
> Feed back related information to <[EMAIL PROTECTED]>.

The bheads in the lv_t are the wrong way to go. I just wrote an alternate patch
for rawio that keeps the bhs inside the kiovec, not in the lv; this also
improves rawio performance in general (such an allocation/deallocation flood
was wasteful). Not a single change is required at the lvm layer; all the
changes live in the kiobuf layer. It's tested and it works for me.

diff -urN rawio-ref/fs/buffer.c rawio/fs/buffer.c
--- rawio-ref/fs/buffer.c   Tue Feb 20 23:17:10 2001
+++ rawio/fs/buffer.c   Tue Feb 20 23:17:27 2001
@@ -1240,6 +1240,29 @@
wake_up(&buffer_wait);
 }
 
+int alloc_kiobuf_bhs(struct kiobuf * kiobuf)
+{
+   int i, j;
+
+   for (i = 0; i < KIO_MAX_SECTORS; i++)
+   if (!(kiobuf->bh[i] = get_unused_buffer_head(0))) {
+   for (j = 0; j < i; j++)
+   put_unused_buffer_head(kiobuf->bh[j]);
+   wake_up(&buffer_wait);
+   return -ENOMEM;
+   }
+   return 0;
+}
+
+void free_kiobuf_bhs(struct kiobuf * kiobuf)
+{
+   int i;
+
+   for (i = 0; i < KIO_MAX_SECTORS; i++)
+   put_unused_buffer_head(kiobuf->bh[i]);
+   wake_up(&buffer_wait);
+}
+
 static void end_buffer_io_async(struct buffer_head * bh, int uptodate)
 {
unsigned long flags;
@@ -1333,10 +1356,8 @@
iosize = 0;
}

-   put_unused_buffer_head(tmp);
iosize += size;
}
-   wake_up(&buffer_wait);

dprintk ("do_kio end %d %d\n", iosize, err);

@@ -1390,7 +1411,7 @@
int i;
int bufind;
int pageind;
-   int bhind;
+   int bhind, kiobuf_bh_nr;
int offset;
unsigned long   blocknr;
struct kiobuf * iobuf = NULL;
@@ -1422,6 +1443,7 @@
 */
bufind = bhind = transferred = err = 0;
for (i = 0; i < nr; i++) {
+   kiobuf_bh_nr = 0;
iobuf = iovec[i];
err = setup_kiobuf_bounce_pages(iobuf, GFP_USER);
if (err) 
@@ -1444,12 +1466,8 @@
 
while (length > 0) {
blocknr = b[bufind++];
-   tmp = get_unused_buffer_head(0);
-   if (!tmp) {
-   err = -ENOMEM;
-   goto error;
-   }
-   
+   tmp = iobuf->bh[kiobuf_bh_nr++];
+
tmp->b_dev = B_FREE;
tmp->b_size = size;
tmp->b_data = (char *) (page + offset);
@@ -1460,7 +1478,8 @@
if (rw == WRITE) {
set_bit(BH_Uptodate, &tmp->b_state);
set_bit(BH_Dirty, &tmp->b_state);
-   }
+   } else
+   clear_bit(BH_Uptodate, &tmp->b_state);
 
dprintk ("buffer %d (%d) at %p\n", 
 bhind, tmp->b_blocknr, tmp->b_data);
@@ -1478,7 +1497,7 @@
transferred += err;
else
goto finished;
-   bhind = 0;
+   kiobuf_bh_nr = bhind = 0;
}

if (offset >= PAGE_SIZE) {
@@ -1506,17 +1525,6 @@
if (transferred)
return transferred;
return err;
-
- error:
-   /* We got an error allocation the bh'es.  Just free the current
-   buffer_heads and exit. */
-   for (i = 0; i < bhind; i++)
-   put_unused_buffer_head(bh[i]);
-   wake_up(&buffer_wait);
-
-   clear_kiobuf_bounce_pages(iobuf);
-
-   goto finished;
 }
 
 /*
diff -urN rawio-ref/fs/iobuf.c rawio/fs/iobuf.c
--- rawio-ref/fs/iobuf.c	Tue Feb 20 23:17:10 2001
+++ rawio/fs/iobuf.c	Tu

Re: [patch] O_SYNC patch 3/3, add inode dirty buffer list support to ext2

2000-11-24 Thread Andrea Arcangeli

On Thu, Nov 23, 2000 at 01:01:25PM -0700, Jeff V. Merkey wrote:
> On Thu, Nov 23, 2000 at 12:01:35PM +, Stephen C. Tweedie wrote:
> > Hi,
> > 
> > On Wed, Nov 22, 2000 at 11:54:24AM -0700, Jeff V. Merkey wrote:
> > > 
> > > I have not implemented O_SYNC in NWFS, but it looks like I need to add it 
> > > before posting the final patches.  This patch appears to force write-through 
> > > of only dirty inodes, and allow reads to continue from cache.  Is this
> > > assumption correct
> > 
> > Yes: O_SYNC is not required to force reads to be made from disk.
> > SingleUnix has an "O_RSYNC" option which does that, but O_SYNC and
> > O_DSYNC don't imply that.
> 
> Cool.  ORACLE is going to **SMOKE** on EXT2 with this change.

Note that this is nothing new: linux (say 2.2.18pre23) has always used the
O_SYNC semantics Stephen implemented in the 2.4.x O_SYNC showstopper bugfix.

Andrea



Re: What sets PG_dirty?

2000-11-20 Thread Andrea Arcangeli

On Mon, Nov 20, 2000 at 05:42:48PM +1100, David Gibson wrote:
> [..] What am I missing?

You should rename it to PG_protected.

Andrea



On Mon, Nov 06, 2000 at 04:54:16PM +, Stephen C. Tweedie wrote:
> [..] The
> one piece of that missing [..]

Ok, I was just looking at the context of your diff.

About the implementation of the missing VM infrastructure for handling page
dirtying at the physical pagecache layer, I'd suggest changing ramfs to use a
new PG_protected bitfield with the current semantics of PG_dirty, and using
PG_dirty for the stuff that we must flush to disk.

Andrea