Re: [RFC] Generic deferred file writing

2001-01-02 Thread Chris Mason



On Saturday, December 30, 2000 06:28:39 PM -0800 Linus Torvalds
<[EMAIL PROTECTED]> wrote:
 
> There are only two real advantages to deferred writing:
> 
>  - not having to do get_block() at all for temp-files, as we never have to
>do the allocation if we end up removing the file.
> 
>NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>right is quite possibly big enough that this advantage is possibly very
>questionable.  It's very possible that people could speed things up
>using this approach, but I also suspect that it is equally (if not
>more) possible to speed things up by just making sure that the
>low-level FS has a fast get_block().
> 
>  - Using "global" access patterns to do a better job of "get_block()", ie
>taking advantage of issues with journalling etc and deferring the write
>in order to get a bigger journal.
> 
> The second point is completely different, and THIS is where I think there
> are potentially real advantages. 

Absolutely.  I wrote reiserfs delayed allocation code back in October, and
kind of left it alone until the VM had the callbacks needed to make it
clean (err, less ugly).  I included a bunch of optimizations to
reiserfs_get_block, and the most effective one was a cache of block
pointers in the inode to avoid consecutive tree searches.  This was a
locking and an I/O win, for both reading and writing (reiserfs needs this
more than ext2 does).
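
As a rough userspace illustration of that kind of cache (none of these names
come from reiserfs or ext2, and the expensive tree search is faked), a
sequential writer only pays for the lookup once per indirect block:

#include <stdio.h>

#define ADDR_PER_BLOCK 256              /* pointers per indirect block (made up) */

struct toy_inode {
    long cached_chunk;                  /* which indirect block is cached, -1 = none */
    long cached_map[ADDR_PER_BLOCK];    /* its block pointers */
};

/* Stand-in for the expensive tree search / indirect-block read. */
static void slow_tree_search(long chunk, long *map)
{
    for (long i = 0; i < ADDR_PER_BLOCK; i++)
        map[i] = 100000 + chunk * ADDR_PER_BLOCK + i;   /* fake mapping */
}

static long toy_get_block(struct toy_inode *ip, long logical)
{
    long chunk = logical / ADDR_PER_BLOCK;

    if (chunk != ip->cached_chunk) {    /* miss: do the search once */
        slow_tree_search(chunk, ip->cached_map);
        ip->cached_chunk = chunk;
    }
    return ip->cached_map[logical % ADDR_PER_BLOCK];
}

int main(void)
{
    struct toy_inode ino = { .cached_chunk = -1 };

    for (long i = 0; i < 3; i++)        /* consecutive lookups hit the cache */
        printf("logical %ld -> physical %ld\n", i, toy_get_block(&ino, i));
    return 0;
}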

For growing the file, delayed allocation was a huge bonus.  For all the
reasons you've already discussed, and because writing a file went from this:
 
(reiserfs_get_block is starting/stopping the transaction)
while (bytes_to_write)
    start_transaction
    allocate block
    insert block pointer
    end_transaction
end

To this:

while (bytes_to_write)
    update counters
end

(delayed alloc routine is starting/stopping the transaction)
start_transaction
allocate X blocks
insert X block pointers
update counters
end_transaction

A big fat transaction is a happy one ;-)

Anyway, I'll return to the optimizations once things have settled down a
bit, and might give the generic delayed allocation (instead of reiserfs
only code) a try.

-chris




Re: [RFC] Generic deferred file writing

2001-01-02 Thread Roman Zippel

Hi,

On Tue, 2 Jan 2001, Alexander Viro wrote:

> Depends on a filesystem. Generally you don't want asynchronous operations
> to grab semaphores shared with something else. kswapd knows to skip the locked
> pages, but that's it - if writepage() is going to block on a semaphore you
> will not know what had hit you. And while buffer-cache operations will not
> trigger writepage() (grep for GFP_BUFFER and GFP_IO and you'll see) you have
> no such warranties for other sources of memory pressure. If one of them
> hits while you are holding such semaphore - you are toast.

I just checked that and you're right, sorry for causing confusion and
thanks for clearing this up.

> We probably could pull it off for ext2_truncate() vs. ext2_get_block()
> but it would not do us any good. It would give excessive exclusion for
> operations that can be done in parallel just fine (example: we have
> a hole from 100Kb to 200Kb. Pageouts in that area can be trivially
> done in parallel - current code will not even try to do unrolls. With
> your locking they will be serialized for no good reason). What for?

Let me come back to the three phases I mentioned earlier:
alloc_block: does only a read-only check of whether a block needs to be
allocated or not; this can be done in parallel and only needs the page
lock.
get_block: blocks are now really allocated, and this needs locking of the
bitmap.
commit_block: writes the allocated blocks into the inode; this would now
use an inode-specific semaphore to protect the updates of indirect blocks.

The only problem I see is truncate, but if we move the release of unneeded
indirect blocks to file close, only new indirect blocks can appear while
the file is open, and they won't change anymore, which would make lots of
the checks easier.
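
Read naively, those three phases could be modelled in userspace roughly like
this (every name below is invented and the page lock from phase one is only
mentioned in a comment; this is a sketch of the idea, not kernel code):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS 64

static unsigned char bitmap[NBLOCKS];                 /* toy block bitmap */
static pthread_mutex_t bitmap_lock = PTHREAD_MUTEX_INITIALIZER;

struct toy_inode {
    long block[16];                                   /* direct pointers  */
    pthread_mutex_t inode_sem;                        /* per-inode lock   */
};

/* Phase 1 (alloc_block): read-only check, needs only the page lock. */
static bool needs_alloc(struct toy_inode *ip, int idx)
{
    return ip->block[idx] == 0;
}

/* Phase 2 (get_block): really allocate, under the bitmap lock. */
static long grab_block(void)
{
    long blk = -1;

    pthread_mutex_lock(&bitmap_lock);
    for (long i = 1; i < NBLOCKS; i++)
        if (!bitmap[i]) { bitmap[i] = 1; blk = i; break; }
    pthread_mutex_unlock(&bitmap_lock);
    return blk;                                       /* -1 means ENOSPC  */
}

/* Phase 3 (commit_block): publish the pointer under the per-inode lock. */
static void commit_block(struct toy_inode *ip, int idx, long blk)
{
    pthread_mutex_lock(&ip->inode_sem);
    ip->block[idx] = blk;
    pthread_mutex_unlock(&ip->inode_sem);
}

int main(void)
{
    struct toy_inode ino;

    memset(&ino, 0, sizeof(ino));
    pthread_mutex_init(&ino.inode_sem, NULL);

    if (needs_alloc(&ino, 3)) {
        long blk = grab_block();
        if (blk >= 0)
            commit_block(&ino, 3, blk);
    }
    printf("logical block 3 -> physical block %ld\n", ino.block[3]);
    return 0;
}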

bye, Roman




Re: [RFC] Generic deferred file writing

2001-01-01 Thread Alexander Viro



On Tue, 2 Jan 2001, Roman Zippel wrote:

> Block allocation is not my problem right now (and even directory handling
> is not that difficult), but I will post something about this on fsdevel
> later.
> But one question is still open, I'd really like an answer for:
> Is it possible to use a per-inode-indirect-block-semaphore?

Depends on a filesystem. Generally you don't want asynchronous operations
to grab semaphores shared with something else. kswapd knows to skip the locked
pages, but that's it - if writepage() is going to block on a semaphore you
will not know what had hit you. And while buffer-cache operations will not
trigger writepage() (grep for GFP_BUFFER and GFP_IO and you'll see) you have
no such warranties for other sources of memory pressure. If one of them
hits while you are holding such semaphore - you are toast.

We probably could pull it off for ext2_truncate() vs. ext2_get_block()
but it would not do us any good. It would give excessive exclusion for
operations that can be done in parallel just fine (example: we have
a hole from 100Kb to 200Kb. Pageouts in that area can be trivially
done in parallel - current code will not even try to do unrolls. With
your locking they will be serialized for no good reason). What for?




Re: [RFC] Generic deferred file writing

2001-01-01 Thread Roman Zippel

Hi,

On Mon, 1 Jan 2001, Alexander Viro wrote:

> But... But with AFFS you _have_ exclusion between block-allocation and
> truncate(). It has no sparse files, so pageout will never allocate
> anything. I.e. all allocations come from write(2). And both write(2) and
> truncate(2) hold i_sem.
> 
> Problem with AFFS is on the directory side of that business and there it's
> really scary. Block allocation is trivial...

Block allocation is not my problem right now (and even directory handling
is not that difficult), but I will post something about this on fsdevel
later.
But one question is still open, I'd really like an answer for:
Is it possible to use a per-inode-indirect-block-semaphore?

bye, Roman




Re: [RFC] Generic deferred file writing

2001-01-01 Thread Daniel Phillips

Alexander Viro wrote:
> GFP_BUFFER _may_ become an issue if we move bitmaps into pagecache.
> Then we'll need a per-address_space gfp_mask. Again, patches exist
> and had been tested (not right now - I didn't port them to current
> tree yet). Bugger if I remember whether they were posted or not - they've
> definitely had been mentioned on linux-mm, but IIRC I had sent the
> modifications of VM code only to Jens. I can repost them.

Please, and I'll ask Rik to post them on the kernelnewbies.org patch
page.  (Rik?)  Putting bitmaps and group descriptors into the page cache
will allow the current ad hoc bitmap and groupdesc caching code to be
deleted - the for-real cache should have better LRU, be more efficient
to access, not need special locking, and won't have an arbitrary limit on
the number of cached bitmaps.  About 300 lines of spaghetti gone.  I suppose
the group descriptor pages still need to be locked in memory so we can
address them through a table instead of searching the hash.  OK, this
must be what you meant by a 'proper' fix to the ialloc group desc bug I
posted last month, which by the way is *still* there.  How about
applying my patch in the interim?  It's a real bug, it just doesn't
trigger often.

> Some pieces of balloc.c cleanup had been posted on fsdevel. Check the
> archives. They prepare the ground for killing lock_super() contention
> on ext2_new_inode(), but that part wasn't there back then.
> 
> I will start -bird (aka FS-CURRENT) branch as soon as Linus opens 2.4.
> Hopefully by the time of 2.5 it will be tested well enough. Right now
> it exists as a large patchset against more or less recent -test and
> I'm waiting for slowdown of the changes in main tree to put them all
> together.

It would be awfully nice to have those patches available via ftp.
Web-based mail archives don't make it because you generally can't
get the patches out intact - the tabs get expanded and other noise gets
inserted.

--
Daniel



Re: [RFC] Generic deferred file writing

2001-01-01 Thread Alexander Viro



On Mon, 1 Jan 2001, Roman Zippel wrote:

> The other reason for the question is that I'm currently reworking the block
> handling in affs, especially the extended block handling, where I'm
> implementing a new extended block cache that I would pretty much prefer
> to protect with a semaphore. I could probably do it without the
> semaphore and use a spinlock plus rechecking, but the semaphore would keep
> it so much simpler. (I can post more details about this part on fsdevel if
> needed / wanted.)

But... But with AFFS you _have_ exclusion between block-allocation and
truncate(). It has no sparse files, so pageout will never allocate
anything. I.e. all allocations come from write(2). And both write(2) and
truncate(2) hold i_sem.

Problem with AFFS is on the directory side of that business and there it's
really scary. Block allocation is trivial...




Re: [RFC] Generic deferred file writing

2001-01-01 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Alexander Viro wrote:

> Reread the original thread. GFP_BUFFER protects us from buffer cache
> allocations triggering pageouts. It has nothing to do with the deadlock scenario
> that would come from grabbing ->i_sem on pageout.

I don't want to grab i_sem. It was a very, very early idea... :)

> Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block()
> calls) is really, really not a problem. Normally it just gives you the
> straightforward path. All unrolls are for contention cases and they
> are precisely what we have to do there.

Maybe complexity is the wrong word; of course the logic in there is
straightforward (once one has understood it :) ).
Let me ask it differently, and it's now only about indirect block handling:
Is it possible to use a per-inode-indirect-block-semaphore?
The reason for the question is that I may be seeing a different sort of
contention here - livelocks. I don't mind the getting of resources and
rechecking whether everything went well. The problem is how many resources you
need to get (and to release, if something failed). Somewhere there is always a
point where two threads can't make any progress, or where one thread can stall
the progress of a second.
To get back to ext2_get_block: IMO such a scenario could happen in the
double or triple indirect block case, when two or more threads try to
allocate/truncate a block there. Maybe my concerns are baseless, but I'd
just like to know that there isn't a possibility for a DoS attack here.
(BTW, that's what I mean by complexity; it's less the logical complexity
and more the "runtime complexity".)

The other reason for the question is that I'm currently reworking the block
handling in affs, especially the extended block handling, where I'm
implementing a new extended block cache that I would pretty much prefer
to protect with a semaphore. I could probably do it without the
semaphore and use a spinlock plus rechecking, but the semaphore would keep
it so much simpler. (I can post more details about this part on fsdevel if
needed / wanted.)

bye, Roman




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Alexander Viro



On Mon, 1 Jan 2001, Roman Zippel wrote:

> I just rechecked that, but I don't see a superblock lock here; it uses
> the kernel_lock instead. Although Al could give the definitive answer for
> this, he wrote it. :)

No superblock lock in get_block() proper. Tons of it in the dungheap called
balloc.c. _That's_ where the bottleneck is. BTW, even BKL is easily removable
from get_block() - check /* Reader: */ and /* Writer: */ comments, they mark
the places to put spinlock in.

> > The way the Linux MM works, if the lower levels need to do buffer
> > allocations, they will use GFP_BUFFER (which "bread()" does internally),
> > which will mean that the VM layer will _not_ call down recursively to
> > write stuff out while it's already trying to write something else. This is
> > exactly so that filesystems don't have to release and re-try if they don't
> > want to.
> > 
> > In short, I don't see any of your arguments.
> 
> Then I must have misunderstood Al. Al?
> If you were right here, I would see absolutely no reason for the current
> complexity. (Me is a bit confused here.)

Reread the original thread. GFP_BUFFER protects us from buffer cache
allocations triggering pageouts. It has nothing to do with the deadlock scenario
that would come from grabbing ->i_sem on pageout.

Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block()
calls) is really, really not a problem. Normally it just gives you the
straightforward path. All unrolls are for contention cases and they
are precisely what we have to do there.

Again, scalability problems are in the block allocator, not in the
indirect blocks handling. It's completely independent from get_block().
We overlock. Big way. And the structures we are protecting excessively
are:
* cylinder group descriptors
* block bitmaps
* (to less extent) inode bitmaps and inode table.
That's what needs to be fixed. It has nothing to do with VFS or VM - purely
internal ext2 business.

Another ext2 issue is reducing the buffer cache pressure - mostly by
moving the directories into page cache. I've posted such patches on
fsdevel and they are applied to the kernel I'm running here. Works
like a charm and allows the rest of metadata stay in cache longer.

GFP_BUFFER _may_ become an issue if we move bitmaps into pagecache.
Then we'll need a per-address_space gfp_mask. Again, patches exist
and had been tested (not right now - I didn't port them to current
tree yet). Bugger if I remember whether they were posted or not - they've
definitely been mentioned on linux-mm, but IIRC I had sent the
modifications of VM code only to Jens. I can repost them.

Some pieces of balloc.c cleanup had been posted on fsdevel. Check the
archives. They prepare the ground for killing lock_super() contention
on ext2_new_inode(), but that part wasn't there back then.

I will start -bird (aka FS-CURRENT) branch as soon as Linus opens 2.4.
Hopefully by the time of 2.5 it will be tested well enough. Right now
it exists as a large patchset against more or less recent -test and
I'm waiting for slowdown of the changes in main tree to put them all
together.
Cheers,
Al




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

>   cached_allocation = NULL;
> 
>   repeat:
>   spin_lock();
>   result = try_to_find_existing();
>   if (!result) {
>           if (!cached_allocation) {
>                   spin_unlock();
>                   cached_allocation = allocate_block();
>                   goto repeat;
>           }
>           result = cached_allocation;
>           add_to_datastructures(result);
>   }
>   spin_unlock();
>   return result;
> 
> This is quite standard, and Linux does it in many places. It doesn't have
> to be complex or ugly.

No problem with that.
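
For reference, the pattern Linus sketches above, spelled out as a small
compilable userspace program. The helper names follow his sketch; the pthread
spinlock, the toy table and the final free() of a losing preallocation are
invented for illustration:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_spinlock_t lock;
static void *table[16];                        /* toy data structure */

static void *try_to_find_existing(int key)           { return table[key]; }
static void *allocate_block(void)                     { return malloc(64); }
static void  add_to_datastructures(int key, void *b)  { table[key] = b; }

/* Allocate outside the lock, retry the lookup, and only publish the
 * preallocated block if nobody else got there first. */
static void *get_or_create(int key)
{
    void *cached_allocation = NULL;
    void *result;

repeat:
    pthread_spin_lock(&lock);
    result = try_to_find_existing(key);
    if (!result) {
        if (!cached_allocation) {
            pthread_spin_unlock(&lock);
            cached_allocation = allocate_block();
            goto repeat;
        }
        result = cached_allocation;
        cached_allocation = NULL;
        add_to_datastructures(key, result);
    }
    pthread_spin_unlock(&lock);
    free(cached_allocation);                   /* lost the race: discard ours */
    return result;
}

int main(void)
{
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    printf("block for key 3: %p\n", get_or_create(3));
    printf("second lookup  : %p\n", get_or_create(3));
    return 0;
}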

> Also, I don't see why you claim the current get_block() is recursive and
> hard to use: it obviously isn't. If you look at the current ext2
> get_block(), the way it protects most of its data structures is by the
> super-block-global lock. That wouldn't work if your claims of recursive
> invocation were true. 

I just rechecked that, but I don't see a superblock lock here; it uses
the kernel_lock instead. Although Al could give the definitive answer for
this, he wrote it. :)

> The way the Linux MM works, if the lower levels need to do buffer
> allocations, they will use GFP_BUFFER (which "bread()" does internally),
> which will mean that the VM layer will _not_ call down recursively to
> write stuff out while it's already trying to write something else. This is
> exactly so that filesystems don't have to release and re-try if they don't
> want to.
> 
> In short, I don't see any of your arguments.

Then I must have misunderstood Al. Al?
If you were right here, I would see absolutely no reason for the current
complexity. (Me is a bit confused here.)

bye, Roman




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds



On Sun, 31 Dec 2000, Roman Zippel wrote:
> 
> On Sun, 31 Dec 2000, Linus Torvalds wrote:
> 
> > Let me repeat myself one more time:
> > 
> >  I do not believe that "get_block()" is as big of a problem as people make
> >  it out to be.
> 
> The real problem is that get_block() doesn't scale and it's very hard to
> do. A recursive per inode-semaphore might help, but it's still a pain to
> get it right.

Not true.

There's nothing unscalable in get_block() per se. The only lock we hold is
the per-page lock, which we must hold anyway. get_block() itself does not
need any real locking: you can do it with a simple per-inode spinlock if
you want to (and release the spinlock and try again if you need to fetch
non-cached data blocks).

Sure, the current ext2 _implementation_ sucks. Nobody has ever contested
that. 

Stop re-designing something just because you want to.

Linus




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> Let me repeat myself one more time:
> 
>  I do not believe that "get_block()" is as big of a problem as people make
>  it out to be.

The real problem is that get_block() doesn't scale and it's very hard to
do. A recursive per inode-semaphore might help, but it's still a pain to
get it right.

bye, Roman




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds

On Sun, 31 Dec 2000, Daniel Phillips wrote:

> Linus Torvalds wrote:
> >  I do not believe that "get_block()" is as big of a problem as people make
> >  it out to be.
>
> I didn't mention get_block - disk accesses obviously far outweigh
> filesystem cpu/cache usage in overall impact.  The question is, what
> happens to disk access patterns when we do the deferred allocation.

Note that the deferred allocation is only possible with a full page write.

Go and do statistics on a system of how often this happens, and what the
circumstances are. Just for fun.

I will bet you $5 USD that 99.9% of all such writes are to new files, at
the end of the file. I'm sure you can come up with other usage patterns,
but they'll be special (big databases etc, and I bet that they'll want to
have stuff cached all the time anyway for other reasons).

So I seriously doubt that you'll have much of an IO component to the
writing anyway - except for the "normal" deferred write of actually
writing the stuff out at all.

Now, this is where I agree with you, but I disagree with where most of the
discussion has been going: I _do_ believe that we may want to change block
allocation decisions at write-out time. That makes sense to me. But that
doesn't really impact "ENOSPC" - the write would not be really "deferred"
by the VM layer, and the filesystem would always be aware of the writer
synchronously.

> > One form of deferred writes I _do_ like is the mount-time-option form.
> > Because that one doesn't add complexity. Kind of like the "noatime" mount
> > option - it can be worth it under some circumstances, and sometimes it's
> > acceptable to not get 100% unix semantics - at which point deferred writes
> > have none of the disadvantages of trying to be clever.
>
> And the added attraction of requiring almost no effort.

Did I mention my belief in the true meaning of "intelligence"?

"Intelligence is the ability to avoid doing work, yet get the work done".

Lazy programmers are the best programmers. Think Tom Sawyer painting the
fence. That's intelligence.

Requiring almost no effort is a big plus in my book.

It's the "clever" programmer I'm afraid of. The one who isn't afraid of
generating complexity, because he has a Plan (capital "P"), and he knows
he can work out the details later.

Linus




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Daniel Phillips

Linus Torvalds wrote:
>  I do not believe that "get_block()" is as big of a problem as people make
>  it out to be.

I didn't mention get_block - disk accesses obviously far outweigh
filesystem cpu/cache usage in overall impact.  The question is, what
happens to disk access patterns when we do the deferred allocation.

> One form of deferred writes I _do_ like is the mount-time-option form.
> Because that one doesn't add complexity. Kind of like the "noatime" mount
> option - it can be worth it under some circumstances, and sometimes it's
> acceptable to not get 100% unix semantics - at which point deferred writes
> have none of the disadvantages of trying to be clever.

And the added attraction of requiring almost no effort.

--
Daniel



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds



On Sun, 31 Dec 2000, Daniel Phillips wrote:
> 
> It's not that hard or inefficient to return the ENOSPC from the usual
> point.  For example, make a gross overestimate of the space needed for
> the write, compare to a cached filesystem free space value less the
> amount deferred so far, and fail to take the optimization if it looks
> even close.

Let me repeat myself one more time:

 I do not believe that "get_block()" is as big of a problem as people make
 it out to be.

And more importantly:

 I strongly believe that trying to be clever is detrimental to your
 health. 

 The "clever" approach is to add tons of complexity, have various
 heuristics to try to not overflow, and then try to debug it considering
 that the ENOSPC case is actually rather rare.

 The "intelligent" approach is just to say that if get_block() shows up on
 the performance profiles, then it should be optimized.

I'd rather be intelligent than clever. Optimize get_block(), which in the
case of ext2 seems to be mostly ext2_new_block() and the balloc.c mess.

The argument that Andrea had is bogus: the common case for writes (and
writes is the only part that deferred writing would touch) is re-writing
the whole file, and the IO to look up the metadata is never an issue for
that case. Everything is basically cached and created on-the-fly. IO is
not the issue, being good about new block allocation _is_ the issue.

Don't get me wrong: I like the notion of deferred writes. But I'm also
very pragmatic: I have not heard of a really good argument that makes it
obvious that deferred writes is a major win performance-wise that would
make it worth the complexity.

One form of deferred writes I _do_ like is the mount-time-option form.
Because that one doesn't add complexity. Kind of like the "noatime" mount
option - it can be worth it under some circumstances, and sometimes it's
acceptable to not get 100% unix semantics - at which point deferred writes
have none of the disadvantages of trying to be clever.

Linus




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Daniel Phillips

Linus Torvalds wrote:
> There are only two real advantages to deferred writing:
> 
>  - not having to do get_block() at all for temp-files, as we never have to
>do the allocation if we end up removing the file.
> 
>NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>right is quite possibly big enough that this advantage is possibly very
>questionable.  It's very possible that people could speed things up
>using this approach, but I also suspect that it is equally (if not
>more) possible to speed things up by just making sure that the
>low-level FS has a fast get_block().

It's not that hard or inefficient to return the ENOSPC from the usual
point.  For example, make a gross overestimate of the space needed for
the write, compare to a cached filesystem free space value less the
amount deferred so far, and fail to take the optimization if it looks
even close.  Also, it's not necessarily bad to defer the ENOSPC to file
close time.  The same thing can happen with failed disk io, and that's
just a fact of life with asynchronous io.  A reliable program needs to
be able to deal with it.  (I think there was a long thread on this not
long ago, regarding filesystem errors returned after a program exits.)
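
A back-of-the-envelope sketch of that check, in plain C with invented names
and numbers: grossly overestimate the blocks the write could touch, and only
defer if the cached free count minus what has already been deferred still
leaves a wide margin.

#include <stdbool.h>
#include <stdio.h>

#define BLOCK_SIZE      4096L
#define SAFETY_FACTOR   4L          /* gross overestimate multiplier */

static long cached_free_blocks = 100000;  /* refreshed lazily from the fs      */
static long deferred_blocks    = 0;       /* blocks promised but not allocated */

/* Return true if we may defer allocation for a write of 'bytes' bytes. */
static bool may_defer_write(long bytes)
{
    /* +1 leaves slack for metadata such as indirect blocks. */
    long worst_case = SAFETY_FACTOR *
                      ((bytes + BLOCK_SIZE - 1) / BLOCK_SIZE + 1);

    if (cached_free_blocks - deferred_blocks < worst_case)
        return false;               /* looks close: allocate now, get ENOSPC now */
    deferred_blocks += worst_case;  /* reserve the overestimate */
    return true;
}

int main(void)
{
    printf("defer 1MB write? %s\n", may_defer_write(1L << 20) ? "yes" : "no");
    return 0;
}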

>  - Using "global" access patterns to do a better job of "get_block()", ie
>taking advantage of issues with journalling etc and deferring the write
>in order to get a bigger journal.

Another nicety is not having to bother the filesystem at all about
sequences of short writes.

> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..

You'd most likely be able to measure the overhead under load of doing
the allocation twice, due to the occasional need to read back a
swapped-out metadata block.

XFS does deferred allocation and the Reiser guys are talking about it -
it seems to be something worth doing.  For me the question is, if the
VFS can already do the deferring, is there a good reason for a
filesystem to duplicate that functionality?

--
Daniel



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds



On Sun, 31 Dec 2000, Alexander Viro wrote:
> 
> On Sun, 31 Dec 2000, Linus Torvalds wrote:
> 
> > The other thing is that one of the common cases for writing is consecutive
> > writing to the end of the file. Now, you figure it out: if get_block()
> > really is a bottle-neck, why not cache the last tree lookup? You'd get a
> > 99% hitrate for that common case.
> 
> Because it is not a bottleneck. The _real_ bottleneck is in ext2_new_block().
> Try to profile it and you'll see.
> 
> We could diddle with ext2_get_block(). No arguments. But the real source of
> PITA is balloc.c, not inode.c. Look at the group descriptor cache code and
> weep. That, and bitmaps handling.

I'm not surprised. Just doing pre-allocation 32 blocks at a time would
probably help. But that code really should be re-written, I think.

Linus




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Alexander Viro



On Sun, 31 Dec 2000, Linus Torvalds wrote:

> The other thing is that one of the common cases for writing is consecutive
> writing to the end of the file. Now, you figure it out: if get_block()
> really is a bottle-neck, why not cache the last tree lookup? You'd get a
> 99% hitrate for that common case.

Because it is not a bottleneck. The _real_ bottleneck is in ext2_new_block().
Try to profile it and you'll see.

We could diddle with ext2_get_block(). No arguments. But the real source of
PITA is balloc.c, not inode.c. Look at the group descriptor cache code and
weep. That, and bitmaps handling.




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Andrea Arcangeli

On Sun, Dec 31, 2000 at 08:33:01AM -0800, Linus Torvalds wrote:
> By doing a better job of caching stuff.

Caching can happen after we have been slow and waited for I/O synchronously
the first time (bread).

How can we optimize the first time (when the indirect blocks are out of buffer
cache) without changing the on-disk format? We can't, as far as I can see.

It's of course fine to optimize the address_space->physical_block resolver
algorithm, because it has to run anyway if we want to write such data to disk
eventually (whether it's asynchronous with allocate-on-flush, or synchronous
like now).  It's probably a more sensible optimization than the allocate
on flush thing. But still, being able to run the resolver in an asynchronous
manner, in parallel, only at the time we need to flush the page to disk, looks
like nicer behaviour to me.

Andrea



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds



On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> 
> get_block for large files can be improved using extents, but how can we
> implement a fast get_block without restructuring the on-disk format of the
> filesystem? (in turn using another filesystem instead of ext2?)

By doing a better job of caching stuff.

There are multiple levels of caching here. One issue is the question of
"allocate a new block". Go and look how the ext2 block pre-allocation
works, and cry. It should _not_ be a loop that sets one bit at a time, it
should be something that notices when the (u32 *) is zero and grabs the
whole 32 blocks in one go. Instead of defaulting to a pre-allocation of 8
blocks, doing that same expensive thing much too often.
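
As a rough illustration of the "grab 32 blocks in one go" idea (a userspace
toy, not ext2 code; all names and sizes here are invented):

#include <stdint.h>
#include <stdio.h>

#define BITMAP_WORDS 256                /* 256 * 32 = 8192 blocks per toy group */

static uint32_t bitmap[BITMAP_WORDS];   /* 0 bit = free block */

/*
 * Try to preallocate a run of 32 blocks: scan for a fully free word and
 * claim it with one store instead of setting 32 individual bits.
 * Returns the first block number of the run, or -1 if no free word exists.
 */
static long prealloc32(void)
{
    for (long w = 0; w < BITMAP_WORDS; w++) {
        if (bitmap[w] == 0) {
            bitmap[w] = 0xffffffffu;    /* grab all 32 blocks at once */
            return w * 32;
        }
    }
    return -1;
}

int main(void)
{
    bitmap[0] = 0x00000001u;            /* block 0 already in use */
    printf("preallocated run starts at block %ld\n", prealloc32());
    return 0;
}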

The other thing is that one of the common cases for writing is consecutive
writing to the end of the file. Now, you figure it out: if get_block()
really is a bottle-neck, why not cache the last tree lookup? You'd get a
99% hitrate for that common case.

You should realize that all the block allocation etc code was written for
a very different VFS layer. "get_block()" didn't even exist. We didn't
have SMP issues. We had very different access patterns (the virtual caches
in the page cache make the accesses to "get_block()" very different, as
the VFS layer keeps track of many mappings on its own).

And get_block() was basically tacked on top of the old code. Al Viro did a
good job of cleaning up some of the issues, but go and look at get_block()
and follow it all the way down into "ext2_new_block()" and back, and I bet
you'll ask yourself why it's so complex. And wonder if it couldn't just be
cleaned up and sped up a lot that way.

Linus




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Andrea Arcangeli

On Sat, Dec 30, 2000 at 06:28:39PM -0800, Linus Torvalds wrote:
> There are only two real advantages to deferred writing:
> 
>  - not having to do get_block() at all for temp-files, as we never have to
>do the allocation if we end up removing the file.
> 
>NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>right is quite possibly big enough that this advantage is possibly very
>questionable.  It's very possible that people could speed things up
>using this approach, but I also suspect that it is equally (if not
>more) possible to speed things up by just making sure that the
>low-level FS has a fast get_block().

get_block for large files can be improved using extents, but how can we
implement a fast get_block without restructuring the on-disk format of the
filesystem? (in turn using another filesystem instead of ext2?)

get_block needs to walk all levels of inode metadata indirection if they
exist. It has to map the logical page from its (inode) address space to the
physical blocks. If those indirection blocks aren't in cache it has to block
to read them. It doesn't matter how it is actually implemented in core. And
then later, as you say, those allocated blocks may never get written because
the file may be deleted in the meantime.

With allocate on flush we can run the slow get_block in parallel,
asynchronously, using a kernel daemon after the page flushtime timeout
triggers. That looks nicer to me. The in-core overhead of the reserved
blocks for delayed allocation should not be relevant at all (and it also
should not need the big kernel lock, making the whole write path big-lock
free).

Andrea



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel

Hi,

On Sat, 30 Dec 2000, Linus Torvalds wrote:

> In fact, in a properly designed filesystem just a bit of low-level caching
> would easily make the average "get_block()" be very fast indeed. The fact
> that right now ext2 has not been optimized for this is _not_ a reason to
> design the VFS layer around a slow get_block() operation.
> [..]
> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..
> 
> I really think that the #2 thing is the more interesting one, and that
> anybody looking at ext2 should look at just improving the locking and
> making the block allocation functions run faster. Which should not be all
> that difficult - the last time I looked at the thing it was quite
> horrible.

What makes the get_block business complicated now is that it can be called
recursively: get_block needs to allocate something, which might start new
I/O, which in turn calls get_block again.
Writing dirty pages should be a real asynchronous process, but it isn't
right now, as get_block is synchronous. Making get_block asynchronous is
almost impossible, so one usually does it in a separate thread.
So IMO something like this should happen: dirty pages should be put on a
separate list, and a thread takes these pages, allocates the buffers for
them and starts the I/O. This would have another advantage: get_block wouldn't
really need to do preallocation anymore; the get_block function could work
instead on a number of pages (preallocation would instead happen in the
page cache).
This could make the get_block function and the needed locking very simple,
e.g. one could use a simple semaphore instead of kernel_lock to protect the
getting of multiple blocks instead of only one. Also, splitting it into
several tasks can make it faster, so in one step we just do the resource
reservation to guarantee the write, and in a separate step we do the real
allocation. If this is done for several pages at once, it can be very
fast and simple.
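
A single-threaded userspace toy of the batching idea (a real implementation
would run the flush side from a kernel thread as described above; every name
and number here is invented):

#include <stdio.h>

#define MAX_PENDING 64

/* One entry per dirty page whose blocks have not been allocated yet. */
static long pending[MAX_PENDING];       /* logical page indexes */
static int  npending;

static void defer_page(long pageno)     /* called at write() time, cheap */
{
    if (npending < MAX_PENDING)
        pending[npending++] = pageno;
}

/*
 * What the flush side would do: take the whole batch, allocate blocks for
 * all pages in one pass (one locking round / one transaction), then start
 * the I/O. Here the allocation and I/O are just printouts.
 */
static void flush_pending(void)
{
    long next_free_block = 5000;        /* fake allocator cursor */

    printf("allocating %d pages in one go\n", npending);
    for (int i = 0; i < npending; i++)
        printf("  page %ld -> block %ld\n", pending[i], next_free_block++);
    npending = 0;
}

int main(void)
{
    for (long p = 10; p < 14; p++)
        defer_page(p);                  /* no get_block per write */
    flush_pending();                    /* expensive work batched once */
    return 0;
}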

bye, Roman




Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel

Hi,

On Sat, 30 Dec 2000, Linus Torvalds wrote:

 In fact, in a properly designed filesystem just a bit of low-level caching
 would easily make the average "get_block()" be very fast indeed. The fact
 that right now ext2 has not been optimized for this is _not_ a reason to
 design the VFS layer around a slow get_block() operation.
 [..]
 The second point is completely different, and THIS is where I think there
 are potentially real advantages. However, I also think that this is not
 actually about deferred writes at all: it's really a question of the
 filesystem having access to the page when the physical write is actually
 started so that the filesystem might choose to _change_ the allocation it
 did - it might have allocated a backing store block at "get_block()" time,
 but by the time it actually writes the stuff out to disk it may have
 allocated a bigger contiguous area somewhere else for the data..
 
 I really think that the #2 thing is the more interesting one, and that
 anybody looking at ext2 should look at just improving the locking and
 making the block allocation functions run faster. Which should not be all
 that difficult - the last time I looked at the thing it was quite
 horrible.

What makes get_block business complicated now, is that can be called
recursively: get_block needs to allocate something, what might start new
i/o which calls again get_block.
Writing dirty pages should be a real asynchronous process, but it isn't
right now, as get_block is synchronous. Making get_block asynchronous is
almost impossible, so one usually does it in a separate thread.
So IMO something like this should happen: dirty pages should be put on a
separate list and a thread takes these pages and allocates the buffers for
them and starts the i/o. This had another advantage: get_block wouldn't
really need to do preallocation anymore, the get_block function could work
instead on a number of pages (preallocation would instead happen in the
page cache).
This could make the get_block function and the needed locking very simple, 
e.g. one could use a simple semaphore instead of kernel_lock to protect
getting of multiple blocks instead of only one. Also splitting it into
several tasks can make it faster, so in one step we just do the resource
allocation to guarantee the write, in a separate step we do the real
allocation. If this is done for several pages at once, it can be very 
fast and simple.

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Andrea Arcangeli

On Sat, Dec 30, 2000 at 06:28:39PM -0800, Linus Torvalds wrote:
 There are only two real advantages to deferred writing:
 
  - not having to do get_block() at all for temp-files, as we never have to
do the allocation if we end up removing the file.
 
NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
right is quite possibly big enough that this advantage is possibly very
questionable.  It's very possible that people could speed things up
using this approach, but I also suspect that it is equally (if not
more) possible to speed things up by just making sure that the
low-level FS has a fast get_block().

get_block for large files can be improved using extents, but how can we
implement a fast get_block without restructuring the on-disk format of the
filesystem? (in turn using another filesystem instead of ext2?)

get_block needs to walk all level of inode metadata indirection if they
exists. It has to map the logical page from its (inode) address space to the
physical blocks. If those indirection blocks aren't in cache it has to block
to read them. It doesn't matter how it is actually implemented in core. And
then later as you say those allocated blocks could never get written because
the file may be deleted in the meantime.

With allocate on flush we can run the slow get_block in parallel
asynchronously using a kernel daemon after the page flushtime timeout
triggers. It looks nicer to me. The in-core overhead of the reserved
blocks for delayed allocation should be not relevant at all (and it also
should not need the big kernel lock making the whole write path big lock
free).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds



On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
 
 get_block for large files can be improved using extents, but how can we
 implement a fast get_block without restructuring the on-disk format of the
 filesystem? (in turn using another filesystem instead of ext2?)

By doing a better job of caching stuff.

There are multiple levels of caching here. One issue is the question of
"allocate a new block". Go and look how the ext2 block pre-allocation
works, and cry. It should _not_ be a loop that sets one bit at a time, it
should be something that notices when the (u32 *) is zero and grabs the
whole 32 blocks in one go. Instead of defaulting to a pre-allocation of 8
blocks, doing that same expensive thing much too often.

The other thing is that one of the common cases for writing is consecutive
writing to the end of the file. Now, you figure it out: if get_block()
really is a bottle-neck, why not cache the last tree lookup? You'd get a
99% hitrate for that common case.

You should realize that all the block allocation etc code was written for
a very different VFS layer. "get_block()" didn't even exist. We didn't
have SMP issues. We had very different access patterns (the virtual caches
in the page cache makes the accesses to "get_block()" very different, as
the VFS layer keeps track of man mappings on its own.

And get_block() was basically tacked on top of the old code. Al Viro did a
good job of cleaning up some of the issues, but go and look at get_block()
and follow it all the way down into "ext2_new_block()" and back, and I bet
you'll ask yourself why it's so complex. And wonder if it couldn't just be
cleaned up and sped up a lot that way.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Andrea Arcangeli

On Sun, Dec 31, 2000 at 08:33:01AM -0800, Linus Torvalds wrote:
 By doing a better job of caching stuff.

Caching can happen after we are been slow and we waited for I/O synchronously
the first time (bread).

How can we optimize the first time (when the indirect blocks are out of buffer
cache) without changing the on-disk format? We can't as far I can see.

It's of course fine to optimize the address_space-to-physical-block resolver
algorithm, because it has to run anyway if we want to write such data to disk
eventually (whether it's asynchronous with allocate on flush, or synchronous
like now).  Probably it's a more sensible optimization than the allocate
on flush thing. But still, being able to run the resolver in an asynchronous
manner, in parallel, only at the time we need to flush the page to disk, looks
like nicer behaviour to me.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Alexander Viro



On Sun, 31 Dec 2000, Linus Torvalds wrote:

 The other thing is that one of the common cases for writing is consecutive
 writing to the end of the file. Now, you figure it out: if get_block()
 really is a bottle-neck, why not cache the last tree lookup? You'd get a
 99% hitrate for that common case.

Because it is not a bottleneck. The _real_ bottleneck is in ext2_new_block().
Try to profile it and you'll see.

We could diddle with ext2_get_block(). No arguments. But the real source of
PITA is balloc.c, not inode.c. Look at the group descriptor cache code and
weep. That, and bitmaps handling.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds



On Sun, 31 Dec 2000, Alexander Viro wrote:
 
 On Sun, 31 Dec 2000, Linus Torvalds wrote:
 
  The other thing is that one of the common cases for writing is consecutive
  writing to the end of the file. Now, you figure it out: if get_block()
  really is a bottle-neck, why not cache the last tree lookup? You'd get a
  99% hitrate for that common case.
 
 Because it is not a bottleneck. The _real_ bottleneck is in ext2_new_block().
 Try to profile it and you'll see.
 
 We could diddle with ext2_get_block(). No arguments. But the real source of
 PITA is balloc.c, not inode.c. Look at the group descriptor cache code and
 weep. That, and bitmaps handling.

I'm not surprised. Just doing pre-allocation 32 blocks at a time would
probably help. But that code really should be re-written, I think.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Daniel Phillips

Linus Torvalds wrote:
 There are only two real advantages to deferred writing:
 
  - not having to do get_block() at all for temp-files, as we never have to
do the allocation if we end up removing the file.
 
NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
right is quite possibly big enough that this advantage is possibly very
questionable.  It's very possible that people could speed things up
using this approach, but I also suspect that it is equally (if not
more) possible to speed things up by just making sure that the
low-level FS has a fast get_block().

It's not that hard or inefficient to return the ENOSPC from the usual
point.  For example, make a gross overestimate of the space needed for
the write, compare to a cached filesystem free space value less the
amount deferred so far, and fail to take the optimization if it looks
even close.  Also, it's not necessarily bad to defer the ENOSPC to file
close time.  The same thing can happen with failed disk io, and that's
just a fact of life with asynchronous io.  A reliable program needs to
be able to deal with it.  (I think there was a long thread on this not
long ago, regarding filesystem errors returned after a program exits.)
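
Something along these lines, say (a sketch with made-up numbers; the factor
of two and the slack are deliberately pessimistic):

/* Defer the write only if even a gross overestimate of its cost fits
 * comfortably in the cached free count minus what is already deferred. */
static int can_defer_write(long write_bytes, long block_size,
                           long cached_free_blocks, long already_deferred)
{
        long data_blocks = (write_bytes + block_size - 1) / block_size;
        long worst_case  = data_blocks * 2 + 16;   /* data plus metadata slack */

        /* "fail to take the optimization if it looks even close" */
        return worst_case < cached_free_blocks - already_deferred;
}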

  - Using "global" access patterns to do a better job of "get_block()", ie
taking advantage of issues with journalling etc and deferring the write
in order to get a bigger journal.

Another nicety is not having to bother the filesystem at all about
sequences of short writes.

 The second point is completely different, and THIS is where I think there
 are potentially real advantages. However, I also think that this is not
 actually about deferred writes at all: it's really a question of the
 filesystem having access to the page when the physical write is actually
 started so that the filesystem might choose to _change_ the allocation it
 did - it might have allocated a backing store block at "get_block()" time,
 but by the time it actually writes the stuff out to disk it may have
 allocated a bigger contiguous area somewhere else for the data..

You'd most likely be able to measure the overhead under load of doing
the allocation twice, due to the occasional need to read back a
swapped-out metadata block.

XFS does deferred allocation and the Reiser guys are talking about it -
it seems to be something worth doing.  For me the question is, if the
VFS can already do the deferring, is there a good reason for a
filesystem to duplicate that functionality?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds



On Sun, 31 Dec 2000, Daniel Phillips wrote:
 
 It's not that hard or inefficient to return the ENOSPC from the usual
 point.  For example, make a gross overestimate of the space needed for
 the write, compare to a cached filesystem free space value less the
 amount deferred so far, and fail to take the optimization if it looks
 even close.

Let me repeat myself one more time:

 I do not believe that "get_block()" is as big of a problem as people make
 it out to be.

And more importantly:

 I strongly believe that trying to be clever is detrimental to your
 health. 

 The "clever" approach is to add tons of complexity, have various
 heuristics to try to not overflow, and then try to debug it considering
 that the ENOSPC case is actually rather rare.

 The "intelligent" approach is just to say that if get_block() shows up on
 the performance profiles, then it should be optimized.

I'd rather be intelligent than clever. Optimize get_block(), which in the
case of ext2 seems to be mostly ext2_new_block() and the balloc.c mess.

The argument that Andrea had is bogus: the common case for writes (and
writes is the only part that deferred writing would touch) is re-writing
the whole file, and the IO to look up the metadata is never an issue for
that case. Everything is basically cached and created on-the-fly. IO is
not the issue, being good about new block allocation _is_ the issue.

Don't get me wrong: I like the notion of deferred writes. But I'm also
very pragmatic: I have not heard of a really good argument that makes it
obvious that deferred writes is a major win performance-wise that would
make it worth the complexity.

One form of deferred writes I _do_ like is the mount-time-option form.
Because that one doesn't add complexity. Kind of like the "noatime" mount
option - it can be worth it under some circumstances, and sometimes it's
acceptable to not get 100% unix semantics - at which point deferred writes
have none of the disadvantages of trying to be clever.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Daniel Phillips

Linus Torvalds wrote:
  I do not believe that "get_block()" is as big of a problem as people make
  it out to be.

I didn't mention get_block - disk accesses obviously far outweigh
filesystem cpu/cache usage in overall impact.  The question is, what
happens to disk access patterns when we do the deferred allocation.

 One form of deferred writes I _do_ like is the mount-time-option form.
 Because that one doesn't add complexity. Kind of like the "noatime" mount
 option - it can be worth it under some circumstances, and sometimes it's
 acceptable to not get 100% unix semantics - at which point deferred writes
 have none of the disadvantages of trying to be clever.

And the added attraction of requiring almost no effort.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds

On Sun, 31 Dec 2000, Daniel Phillips wrote:

 Linus Torvalds wrote:
   I do not believe that "get_block()" is as big of a problem as people make
   it out to be.

 I didn't mention get_block - disk accesses obviously far outweigh
 filesystem cpu/cache usage in overall impact.  The question is, what
 happens to disk access patterns when we do the deferred allocation.

Note that the deferred allocation is only possible with a full page write.

Go and do statistics on a system of how often this happens, and what the
circumstances are. Just for fun.

I will bet you $5 USD that 99.9% of all such writes are to new files, at
the end of the file. I'm sure you can come up with other usage patterns,
but they'll be special (big databases etc, and I bet that they'll want to
have stuff cached all the time anyway for other reasons).

So I seriously doubt that you'll have much of an IO component to the
writing anyway - except for the "normal" deferred write of actually
writing the stuff out at all.

Now, this is where I agree with you, but I disagree with where most of the
discussion has been going: I _do_ believe that we may want to change block
allocation discussions at write-out-time. That makes sense to me. But that
doesn't really impact "ENOSPC" - the write would not be really "deferred"
by the VM layer, and the filesystem would always be aware of the writer
synchronously.

  One form of deferred writes I _do_ like is the mount-time-option form.
  Because that one doesn't add complexity. Kind of like the "noatime" mount
  option - it can be worth it under some circumstances, and sometimes it's
  acceptable to not get 100% unix semantics - at which point deferred writes
  have none of the disadvantages of trying to be clever.

 And the added attraction of requiring almost no effort.

Did I mention my belief in the true meaning of "intelligence"?

"Intelligence is the ability to avoid doing work, yet get the work done".

Lazy programmers are the best programmers. Think Tom Sawyer painting the
fence. That's intelligence.

Requiring almost no effort is a big plus in my book.

It's the "clever" programmer I'm afraid of. The one who isn't afraid of
generating complexity, because he has a Plan (capital "P"), and he knows
he can work out the details later.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

 Let me repeat myself one more time:
 
  I do not believe that "get_block()" is as big of a problem as people make
  it out to be.

The real problem is that get_block() doesn't scale and it's very hard to
do. A recursive per inode-semaphore might help, but it's still a pain to
get it right.

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Linus Torvalds



On Sun, 31 Dec 2000, Roman Zippel wrote:
 
 On Sun, 31 Dec 2000, Linus Torvalds wrote:
 
  Let me repeat myself one more time:
  
   I do not believe that "get_block()" is as big of a problem as people make
   it out to be.
 
 The real problem is that get_block() doesn't scale and it's very hard to
 do. A recursive per inode-semaphore might help, but it's still a pain to
 get it right.

Not true.

There's nothing unscalable in get_block() per se. The only lock we hold is
the per-page lock, which we must hold anyway. get_block() itself does not
need any real locking: you can do it with a simple per-inode spinlock if
you want to (and release the spinlock and try again if you need to fetch
non-cached data blocks).

Sure, the current ext2 _implementation_ sucks. Nobody has ever contested
that. 

Stop re-designing something just because you want to.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

   cached_allocation = NULL;
 
   repeat:
   spin_lock();
   result = try_to_find_existing();
   if (!result) {
   if (!cached_allocation) {
   spin_unlock();
   cached_allocation = allocate_block();
   goto repeat;
   }
   result = cached_allocation;
   add_to_datastructures(result);
   }
   spin_unlock();
   return result;
 
 This is quite standard, and Linux does it in many places. It doesn't have
 to be complex or ugly.

No problem with that.

 Also, I don't see why you claim the current get_block() is recursive and
 hard to use: it obviously isn't. If you look at the current ext2
 get_block(), the way it protects most of its data structures is by the
 super-block-global lock. That wouldn't work if your claims of recursive
 invocation were true. 

I just rechecked that, but I don't see a superblock lock here; it uses
the kernel_lock instead. Although Al could give the definitive answer for
this - he wrote it. :)

 The way the Linux MM works, if the lower levels need to do buffer
 allocations, they will use GFP_BUFFER (which "bread()" does internally),
 which will mean that the VM layer will _not_ call down recursively to
 write stuff out while it's already trying to write something else. This is
 exactly so that filesystems don't have to release and re-try if they don't
 want to.
 
 In short, I don't see any of your arguments.

Then I must have misunderstood Al. Al?
If you were right here, I would see absolutely no reason for the current
complexity. (Me is a bit confused here.)

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-31 Thread Alexander Viro



On Mon, 1 Jan 2001, Roman Zippel wrote:

 I just rechecked that, but I don't see no superblock lock here, it uses
 the kernel_lock instead. Although Al could give the definitive answer for
 this, he wrote it. :)

No superblock lock in get_block() proper. Tons of it in the dungheap called
balloc.c. _That's_ where the bottleneck is. BTW, even BKL is easily removable
from get_block() - check the /* Reader: */ and /* Writer: */ comments; they mark
the places to put the spinlock in.

  The way the Linux MM works, if the lower levels need to do buffer
  allocations, they will use GFP_BUFFER (which "bread()" does internally),
  which will mean that the VM layer will _not_ call down recursively to
  write stuff out while it's already trying to write something else. This is
  exactly so that filesystems don't have to release and re-try if they don't
  want to.
  
  In short, I don't see any of your arguments.
 
 Then I must have misunderstood Al. Al?
 If you were right here, I would see absolutely no reason for the current
 complexity. (Me is a bit confused here.)

Reread the original thread. GFP_BUFFER protects us from buffer cache
allocations triggering pageouts. It has nothing to do with the deadlock scenario
that would come from grabbing ->i_sem on pageout.

Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block()
calls) is really, really not a problem. Normally it just gives you the
straightforward path. All unrolls are for contention cases and they
are precisely what we have to do there.

Again, scalability problems are in the block allocator, not in the
indirect blocks handling. It's completely independent from get_block().
We overlock. Big way. And the structures we are protecting excessively
are:
* cylinder group descriptors
* block bitmaps
* (to less extent) inode bitmaps and inode table.
That's what needs to be fixed. It has nothing to do with VFS or VM - purely
internal ext2 business.

Another ext2 issue is reducing the buffer cache pressure - mostly by
moving the directories into page cache. I've posted such patches on
fsdevel and they are applied to the kernel I'm running here. Works
like a charm and allows the rest of metadata stay in cache longer.

GFP_BUFFER _may_ become an issue if we move bitmaps into pagecache.
Then we'll need a per-address_space gfp_mask. Again, patches exist
and had been tested (not right now - I didn't port them to current
tree yet). Bugger if I remember whether they were posted or not - they'd
definitely been mentioned on linux-mm, but IIRC I had sent the
modifications of VM code only to Jens. I can repost them.

Some pieces of balloc.c cleanup had been posted on fsdevel. Check the
archives. They prepare the ground for killing lock_super() contention
on ext2_new_inode(), but that part wasn't there back then.

I will start -bird (aka FS-CURRENT) branch as soon as Linus opens 2.4.
Hopefully by the time of 2.5 it will be tested well enough. Right now
it exists as a large patchset against more or less recent -testn and
I'm waiting for slowdown of the changes in main tree to put them all
together.
Cheers,
Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Andrea Arcangeli

On Sat, Dec 30, 2000 at 08:50:52PM -0500, Alexander Viro wrote:
> And its meaning for 2/3 of filesystems would be?

It should stay in the private part of the in-core superblock of course.

> I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should
> apply to ramfs. It should apply to helluva lot of filesystems that are not
> block-based. Pagecache doesn't (and shouldn't) know about blocks.

With pagecache I meant the library of pagecache methods in buffer.c. Even
if they are called by the lowlevel filesystem code and can be overridden
by lowlevel filesystem code, they aren't lowlevel filesystem code themselves -
they're in fact common code.  We can implement another version of them that,
besides knowing about get_block, also knows about another filesystem
callback, and when possible only reserves the space for a delayed allocation
later triggered (in parallel) by a future kupdate. They will know about this new
callback in the same way the current standard pagecache library methods know
about get_block_t. Filesystems implementing this callback will be able to use
those new pagecache library methods.
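
For instance, something shaped like this (hypothetical and simplified - not
the real get_block_t declaration or any existing kernel interface):

struct inode;
struct buffer_head;

typedef int (get_block_t)(struct inode *inode, long block,
                          struct buffer_head *bh, int create);
typedef int (reserve_block_t)(struct inode *inode, long block);

/* The new library method would use the reservation callback when the
 * filesystem provides one, and fall back to allocating right away. */
static int prepare_page_write(struct inode *inode, long block,
                              struct buffer_head *bh,
                              get_block_t *get_block,
                              reserve_block_t *reserve_block)
{
        if (reserve_block)
                return reserve_block(inode, block);    /* just account the space */
        return get_block(inode, block, bh, 1);         /* allocate now, as today */
}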

> it should use functions that do not expect such argument. That's it. No
> need to invent new methods or shoehorn all block filesystems into the same
> scheme.

Of course.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Linus Torvalds



On Sun, 31 Dec 2000, Roman Zippel wrote:
> 
> On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> 
> > > estimate than just the data blocks it should not be hard to add an
> > > extra callback to the filesystem.  
> > 
> > Yes, I was thinking at this callback too. Such a callback is nearly the only
> > support we need from the filesystem to provide allocate on flush.
> 
> Actually the getblock function could be split into 3 functions:
> - alloc_block: mostly just decrementing a counter (and quota)
> - get_block: allocating a block from the bitmap
> - commit_block: inserting the new block into the inode
> 
> This would be really useful for streaming, one could get as fast as
> possible the block number and the data could be very quickly written,
> while keeping the cache usage low. Or streaming directly from a device
> to disk also wants to get rid of the data as fast as possible.

Now, to insert a small note of sanity here: I think people are starting to
overdesign stuff.

The fact is that currently the "get_block()" interface that the page cache
helper functions use does NOT have to be very expensive at all.

In fact, in a properly designed filesystem just a bit of low-level caching
would easily make the average "get_block()" be very fast indeed. The fact
that right now ext2 has not been optimized for this is _not_ a reason to
design the VFS layer around a slow get_block() operation.

If you look at the generic block-based writing routines, they are actually
not all that expensive. Any kind of complication is only going to make
those functions more complex, and any performance gained could easily be
lost in extra complexity.

There are only two real advantages to deferred writing:

 - not having to do get_block() at all for temp-files, as we never have to
   do the allocation if we end up removing the file.

   NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
   right is quite possibly big enough that this advantage is possibly very
   questionable.  It's very possible that people could speed things up
   using this approach, but I also suspect that it is equally (if not
   more) possible to speed things up by just making sure that the
   low-level FS has a fast get_block().

 - Using "global" access patterns to do a better job of "get_block()", ie
   taking advantage of issues with journalling etc and deferring the write
   in order to get a bigger journal.

The second point is completely different, and THIS is where I think there
are potentially real advantages. However, I also think that this is not
actually about deferred writes at all: it's really a question of the
filesystem having access to the page when the physical write is actually
started so that the filesystem might choose to _change_ the allocation it
did - it might have allocated a backing store block at "get_block()" time,
but by the time it actually writes the stuff out to disk it may have
allocated a bigger contiguous area somewhere else for the data..

I really think that the #2 thing is the more interesting one, and that
anybody looking at ext2 should look at just improving the locking and
making the block allocation functions run faster. Which should not be all
that difficult - the last time I looked at the thing it was quite
horrible.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.  
> 
> Yes, I was thinking at this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush.

Actually the getblock function could be split into 3 functions:
- alloc_block: mostly just decrementing a counter (and quota)
- get_block: allocating a block from the bitmap
- commit_block: inserting the new block into the inode

This would be really useful for streaming, one could get as fast as
possible the block number and the data could be very quickly written,
while keeping the cache usage low. Or streaming directly from a device
to disk also wants to get rid of the data as fast as possible.
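
Roughly (invented types and names, just to make the split concrete):

struct inode;

struct block_alloc_ops {
        int  (*alloc_block)(struct inode *inode);              /* counters + quota */
        long (*get_block)(struct inode *inode, long logical);  /* bitmap work */
        int  (*commit_block)(struct inode *inode, long logical, long physical);
};

/* Streaming write: get the block number early so the data can leave the
 * cache quickly, and only then update the inode metadata. */
static int stream_one_block(struct block_alloc_ops *ops, struct inode *inode,
                            long logical, int (*submit_io)(long physical))
{
        long physical;
        int err = ops->alloc_block(inode);

        if (err)
                return err;                      /* e.g. ENOSPC or over quota */
        physical = ops->get_block(inode, logical);
        err = submit_io(physical);
        if (err)
                return err;
        return ops->commit_block(inode, logical, physical);
}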

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Alexander Viro



On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> > To get ENOSPC handling 99% correct all we need to do is decrement a counter,
> > that remembers how many disks blocks are free.  If we need a better
> 
> Yes, we need to add one field to the in-core superblock to do this accounting.

And its meaning for 2/3 of filesystems would be?

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.  
> 
> Yes, I was thinking at this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush. Allocate on
> flush is a pagecache issue, not really a filesystem issue. When a filesystem

I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should
apply to ramfs. It should apply to helluva lot of filesystems that are not
block-based. Pagecache doesn't (and shouldn't) know about blocks.

> doesn't implement such callback we can simply get_block(create) at pagecache
> creation time as usual.

Umm... You do realize that get_block is _not_ visible on pagecache level?
Sure thing, a filesystem should be free to use whatever functions it wants
for address_space methods. No arguments here. It should be able to use whatever
callbacks these functions expect. If a filesystem doesn't implement reservation
it should use functions that do not expect such an argument. That's it. No
need to invent new methods or shoehorn all block filesystems into the same
scheme.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Eric W. Biederman

Linus Torvalds <[EMAIL PROTECTED]> writes:

> On 30 Dec 2000, Eric W. Biederman wrote:
> > 
> > One other thing to think about for the VFS/MM layer is limiting the
> > total number of dirty pages in the system (to what disk pressure shows
> > the disk can handle), to keep system performance smooth when swapping.
> 
> This is a separate issue,  and I think that it is most closely tied in to
> the "RSS limit" kind of patches because of the memory mapping issues. If
> you've seen the RSS rlimit patch (it's been posted a few times this week),
> then you could think of that modified by a "Resident writable pages Set
> Size" approach. 

Building on the RSS limit approach sounds much simpler than the way
I was thinking.

> Not just for shared mappings - this is also an issue with
> limiting swapout.
> 
> (I actually don't think that RSS is all that interesting, it's really the
> "potentially dirty RSS" that counts for VM behaviour - everything else can
> be dropped easily enough)

Definitely.

Now the only tricky bit is how do we sense when we are overloading
the swap disks.  Well that is the next step.  I'll take a look
and see what it takes to keep statistics on dirty mapped pages.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Andrea Arcangeli

On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> To get ENOSPC handling 99% correct all we need to do is decrement a counter,
> that remembers how many disks blocks are free.  If we need a better

Yes, we need to add one field to the in-core superblock to do this accounting.

> estimate than just the data blocks it should not be hard to add an
> extra callback to the filesystem.  

Yes, I was thinking at this callback too. Such a callback is nearly the only
support we need from the filesystem to provide allocate on flush. Allocate on
flush is a pagecache issue, not really a filesystem issue. When a filesystem
doesn't implement such callback we can simply get_block(create) at pagecache
creation time as usual.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Daniel Phillips

On Sat, 30 Dec 2000, Alexander Viro wrote:
> Well, see above. I'm pretty nervous about breaking the ordering of metadata
> allocation. For pageout() we don't have such ordering. For write() we
> certainly do. Notice that reserving disk space upon write() and eating it
> later is _very_ messy job - you'll have to take care of situations when
> we reserve the space upon write() and get pageout do the real allocation.
> Not nice, since pageout has no way in hell to tell whether it is eating
> from a reserved area or just flushing the mmaped one. We could keep the
> per-bh "reserved" flag to fold that information into the pagecache, but
> IMO it's simply not worth the trouble. If some filesystems wants that -
> hey, it can do that right now. Just make ->prepare_write() do reservations
> and let ->commit_write() mark the page dirty. Then ->writepage() will
> eventually flush it.

This is a refinement of the idea and some abstraction like that is
clearly needed, and maybe that is exactly the right one.  For now I'm
interested in putting this on the table so that we can check the
stability and performance, maybe uncover some more bugs, then start
going after some of the things that need to be done to turn it into a
useful option.

P.S., I humbly apologize for writing (!offset && bytes == PAGE_SIZE)
when I could have just written (bytes == PAGE_SIZE).

-- 
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Linus Torvalds



On 30 Dec 2000, Eric W. Biederman wrote:
> 
> One other thing to think about for the VFS/MM layer is limiting the
> total number of dirty pages in the system (to what disk pressure shows
> the disk can handle), to keep system performance smooth when swapping.

This is a separate issue, and I think that it is most closely tied in to
the "RSS limit" kind of patches because of the memory mapping issues. If
you've seen the RSS rlimit patch (it's been posted a few times this week),
then you could think of that modified by a "Resident writable pages Set
Size" approach. Not just for shared mappings - this is also an issue with
limiting swapout.

(I actually don't think that RSS is all that interesting, it's really the
"potentially dirty RSS" that counts for VM behaviour - everything else can
be dropped easily enough)

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Eric W. Biederman

Linus Torvalds <[EMAIL PROTECTED]> writes:


> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

To get ENOSPC handling 99% correct all we need to do is decrement a counter
that remembers how many disk blocks are free.  If we need a better
estimate than just the data blocks it should not be hard to add an
extra callback to the filesystem.  

There look to be some interesting cases to handle when we fill up a
filesystem.  Before actually failing and returning ENOSPC the
filesystem might want to fsync itself and see how correct its
estimates were.  But that is the rare case and shouldn't affect
performance.


In the long term VFS support for deferred writes looks like a major
win.  Knowing how large a file is before we write it to disk allows
very efficient disk organization, and fast file access (esp combined
with an extent based fs).   Support for compressing files in real time
falls out naturally.  Support for filesystems that maintain coherency by
never writing the same block back to the same disk location also
appears.


One other thing to think about for the VFS/MM layer is limiting the
total number of dirty pages in the system (to what disk pressure shows
the disk can handle), to keep system performance smooth when swapping.
All cases except mmaped files are easy, and they can be handled by a
modified page fault handler that directly puts the dirty bit on the
struct page.  (Except that is buggy with respect to clearing the dirty
bit on the struct page.)  In reality we would have to create a queue
of pointers to dirty pte's from the page fault handler and depending
on a timer or memory pressure move the dirty bits to the actual page.
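
As a toy model of that queue (not kernel code - the pte and page structures
here are stand-ins):

#include <stddef.h>

struct toy_page { int dirty; };
struct toy_pte  { int dirty; struct toy_page *page; };

#define QUEUE_SIZE 128
static struct toy_pte *dirty_queue[QUEUE_SIZE];
static size_t dirty_count;

/* page fault handler: just remember which pte was dirtied */
static void note_dirty_pte(struct toy_pte *pte)
{
        if (dirty_count < QUEUE_SIZE)
                dirty_queue[dirty_count++] = pte;
}

/* timer or memory pressure: move the dirty bits onto the pages themselves */
static void propagate_dirty_bits(void)
{
        for (size_t i = 0; i < dirty_count; i++) {
                if (dirty_queue[i]->dirty) {
                        dirty_queue[i]->page->dirty = 1;
                        dirty_queue[i]->dirty = 0;
                }
        }
        dirty_count = 0;
}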

Combined with the code to make sync and fsync work on the page
cache, would msync become obsolete?

Of course the most important part is that when all of that is
working, the VFS/MM layer would be perfect.  World domination
would be achieved.  For who would be caught using an OS with an
imperfect VFS layer :)

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Alexander Viro



On Sat, 30 Dec 2000, Linus Torvalds wrote:

> 
> 
> On Sat, 30 Dec 2000, Alexander Viro wrote:
> > 
> > Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
> > no thanks.
> 
> No, Al, the file size is still updated inside i_sem.

Then we are screwed. Look: we call write(). Twice. The second call happens
to overflow the quota. Getting the second chunk of data written and the
first one ending up as a hole is the last thing you would expect, isn't it?

> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

Well, see above. I'm pretty nervous about breaking the ordering of metadata
allocation. For pageout() we don't have such ordering. For write() we
certainly do. Notice that reserving disk space upon write() and eating it
later is a _very_ messy job - you'll have to take care of situations where
we reserve the space upon write() and pageout does the real allocation.
Not nice, since pageout has no way in hell to tell whether it is eating
from a reserved area or just flushing the mmaped one. We could keep the
per-bh "reserved" flag to fold that information into the pagecache, but
IMO it's simply not worth the trouble. If some filesystems wants that -
hey, it can do that right now. Just make ->prepare_write() do reservations
and let ->commit_write() mark the page dirty. Then ->writepage() will
eventually flush it.
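
In other words, roughly this shape (a toy sketch with invented structures,
not the real address_space operations):

struct da_fs   { long free_blocks; long reserved; };
struct da_page { int dirty; int reserved; };

/* ->prepare_write(): only reserve, report ENOSPC synchronously to write() */
static int da_prepare_write(struct da_fs *fs, struct da_page *page)
{
        if (fs->free_blocks - fs->reserved < 1)
                return -1;
        fs->reserved++;
        page->reserved = 1;
        return 0;
}

/* ->commit_write(): no allocation yet, just mark the page dirty */
static void da_commit_write(struct da_page *page)
{
        page->dirty = 1;
}

/* ->writepage(): the real block allocation finally happens at flush time */
static void da_writepage(struct da_fs *fs, struct da_page *page)
{
        fs->free_blocks--;
        if (page->reserved) {
                fs->reserved--;          /* eat from the reservation */
                page->reserved = 0;
        }
        page->dirty = 0;
}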

Again, if one is willing to implement reservation on block level - fine,
there is no need to change anything in VFS or VM. I certainly don't want
to mess with that, but hey, if somebody is into masochism - let them.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Andreas Dilger

Linus writes:
> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

It should be relatively light-weight to call into the filesystem simply
to allocate a "count" of blocks needed for the current file.  It may
even be possible to do delayed inode allocation.  This would defer
the inode/block bitmap searching/allocation on ext2 until the file
was actually written - only the free_blocks/free_inodes count in the
superblock would be decremented, and we would get ENOSPC immediately
if we don't have enough space for the file.  On fsck, these values are
recalculated from the group descriptors on ext2, so it wouldn't be a
problem if the system crashed with pre-allocated blocks.

It would definitely be a win on ext2 and XFS, and if it isn't possible
on other filesystems, it should at least not be a loss.

We would need to ensure we also keep enough space for indirect blocks
and such, so we need to pass more information than just the number of
blocks added (i.e. how big the file already is).
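
The estimate itself can stay a back-of-the-envelope overestimate; for an
ext2-style layout (4k blocks, 1024 block pointers per indirect block - numbers
assumed here just for illustration) something like:

/* Worst-case cost of adding nr_new data blocks to a file that already has
 * old_blocks data blocks: the new data plus a pessimistic count of indirect
 * blocks.  Deliberately overestimates, which is fine for an ENOSPC pre-check. */
static long worst_case_blocks(long old_blocks, long nr_new)
{
        const long per_ind = 1024;              /* 4k block / 4-byte pointers */
        long end = old_blocks + nr_new;
        long indirect = (end + per_ind - 1) / per_ind + 4;  /* +4: tree tops */

        return nr_new + indirect;
}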

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Linus Torvalds



On Sat, 30 Dec 2000, Alexander Viro wrote:
> 
> Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
> no thanks.

No, Al, the file size is still updated inside i_sem.

Yes, it will do actual block allocation outside i_sem, but that is already
true of any mmap'ed writes, and has been true for a long long time. So if
we have a bug here (and I don't think we have one), it's not something
new. But the inode semaphore doesn't protect the balloc() data structures
anyway, as they are filesystem-global.

If you're nervous about the effects of "truncate()", then that should be
handled properly by truncate_inode_pages().

In short, I don't see _those_ kinds of issues. I do see error reporting as
a major issue, though. If we need to do proper low-level block allocation
in order to get correct ENOSPC handling, then the win from doing deferred
writes is not very big.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Alexander Viro



On Sat, 30 Dec 2000, Daniel Phillips wrote:

> When I saw you put in the if (PageDirty) -->writepage and related code
> over the last couple of weeks I was wondering if you realize how close
> we are to having generic deferred file writing in the VFS.  I took some
> time today to code this little hack and it comes awfully close to doing
> the job.  However, *** Warning, do not run this on a machine you care
> about, it will mess it up ***.
> 
> The advantages of deferred file writing are pretty obvious.  Right now
> we are deferring just the writeout of data to the disk, but we can also
> defer the disk mapping, so that metadata blocks don't have to stay
> around in cache waiting for data blocks to get mapped into them one at
> a time - a whole group can be done in one flush.

Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
no thanks.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Linus Torvalds



On Sat, 30 Dec 2000, Daniel Phillips wrote:
>
> When I saw you put in the if (PageDirty) -->writepage and related code
> over the last couple of weeks I was wondering if you realize how close
> we are to having generic deferred file writing in the VFS.

I'm very aware of it indeed. 

However, it does break various common assumptions, one of them being
proper error handling. Things like proper detection of quota overflows and
even simple "out of disk space" issues. 

One of the main advantages of deferred writing would be that we could do
temp-files without ever actually doing most of the low-level filesystem
block allocation, but in order to get that advantage we really need to
handle the out-of-disk case gracefully.

I considered doing something like this as a mount option, so that people
could decide on their own whether they want a safe filesystem, or whether
it's ok to do deferred writes. People might find it worth it for /tmp, but
might be unwilling to use it for /var/spool/mail, for example.

(Hmm.. It might be perfectly fine for /var/spool/mail - mail delivery
tends to be really careful about doing "fsync()" etc and actually pick up
the errors that way. HOWEVER, before doing that you should expand the
writepage logic to set the page "error" bit for when it fails to write out
a full page - right now we just lose the error completely).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Linus Torvalds



On Sat, 30 Dec 2000, Daniel Phillips wrote:

 When I saw you put in the if (PageDirty) --writepage and related code
 over the last couple of weeks I was wondering if you realize how close
 we are to having generic deferred file writing in the VFS.

I'm very aware of it indeed. 

However, it does break various common assumptions, one of them being
proper error handling. Things like proper detection of quota overflows and
even simple "out of disk space" issues. 

One of the main advantages of deferred writing would be that we could do
temp-files without ever actually doing most of the low-level filesystem
block allocation, but in order to get that advantage we really need to
handle the out-of-disk case gracefully.

I considered doing something like this as a mount option, so that people
could decide on their own whether they want a safe filesystem, or whether
it's ok to do deferred writes. People might find it worth it for /tmp, but
might be unwilling to use it for /var/spool/mail, for example.

(Hmm.. It might be perfectly fine for /vsr/spool/mail - mail delivery
tends to be really careful about doing "fsync()" etc and actually pick up
the errors that way. HOWEVER, before doing that you should expand the
writepage logic to set the page "error" bit for when it fails to write out
a full page - right now we just lose the error completely).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Alexander Viro



On Sat, 30 Dec 2000, Daniel Phillips wrote:

 When I saw you put in the if (PageDirty) --writepage and related code
 over the last couple of weeks I was wondering if you realize how close
 we are to having generic deferred file writing in the VFS.  I took some
 time today to code this little hack and it comes awfully close to doing
 the job.  However, *** Warning, do not run this on a machine you care
 about, it will mess it up ***.
 
 The advantages of deferred file writing are pretty obvious.  Right now
 we are deferring just the writeout of data to the disk, but we can also
 defer the disk mapping, so that metadata blocks don't have to stay
 around in cache waiting for data blocks to get mapped into them one at
 a time - a whole group can be done in one flush.

Except that we've got file-expanding writes outside of -i_sem. Thanks, but
no thanks.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Linus Torvalds



On Sat, 30 Dec 2000, Alexander Viro wrote:
 
 Except that we've got file-expanding writes outside of -i_sem. Thanks, but
 no thanks.

No, Al, the file size is still updated inside i_sem.

Yes, it will do actual block allocation outside i_sem, but that is already
true of any mmap'ed writes, and has been true for a long long time. So if
we have a bug here (and I don't think we have one), it's not something
new. But the inode semaphore doesn't protect the balloc() data structures
anyway, as they are filesystem-global.

If you're nervous about the effects of "truncate()", then that should be
handled properly by truncate_inode_pages().

In short, I don't see _those_ kinds of issues. I do see error reporting as
a major issue, though. If we need to do proper low-level block allocation
in order to get correct ENOSPC handling, then the win from doing deferred
writes is not very big.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Andreas Dilger

Linus writes:
 In short, I don't see _those_ kinds of issues. I do see error reporting as
 a major issue, though. If we need to do proper low-level block allocation
 in order to get correct ENOSPC handling, then the win from doing deferred
 writes is not very big.

It should be relatively light-weight to call into the filesystem simply
to allocate a "count" of blocks needed for the current file.  It may
even be possible to do delayed inode allocation.  This would defer
the inode/block bitmap searching/allocation on ext2 until the file
was actually written - only the free_blocks/free_inodes count in the
superblock would be decremented, and we would get ENOSPC immediately
if we don't have enough space for the file.  On fsck, these values are
recalculated from the group descriptors on ext2, so it wouldn't be a
problem if the system crashed with pre-allocated blocks.

It would definitely be a win on ext2 and XFS, and if it isn't possible
on other filesystems, it should at least not be a loss.

We would need to ensure we also keep enough space for indirect blocks
and such, so we need to pass more information than just the number of
blocks added (i.e. how big the file already is).

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Alexander Viro



On Sat, 30 Dec 2000, Linus Torvalds wrote:

 
 
 On Sat, 30 Dec 2000, Alexander Viro wrote:
  
  Except that we've got file-expanding writes outside of -i_sem. Thanks, but
  no thanks.
 
 No, Al, the file size is still updated inside i_sem.

Then we are screwed. Look: we call write(). Twice. The second call happens
to overflow the quota. Getting the second chunk of data written and the
first one ending up as a hole is the last thing you would expect, isn't it?

 In short, I don't see _those_ kinds of issues. I do see error reporting as
 a major issue, though. If we need to do proper low-level block allocation
 in order to get correct ENOSPC handling, then the win from doing deferred
 writes is not very big.

Well, see above. I'm pretty nervous about breaking the ordering of metadata
allocation. For pageout() we don't have such ordering. For write() we
certainly do. Notice that reserving disk space upon write() and eating it
later is _very_ messy job - you'll have to take care of situations when
we reserve the space upon write() and get pageout do the real allocation.
Not nice, since pageout has no way in hell to tell whether it is eating
from a reserved area or just flushing the mmaped one. We could keep the
per-bh "reserved" flag to fold that information into the pagecache, but
IMO it's simply not worth the trouble. If some filesystems wants that -
hey, it can do that right now. Just make -prepare_write() do reservations
and let -commit_write() mark the page dirty. Then -writepage() will
eventually flush it.

Again, if one is willing to implement reservation on block level - fine,
there is no need to change anything in VFS or VM. I certainly don't want
to mess with that, but hey, if somebody is into masochism - let them.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:


 In short, I don't see _those_ kinds of issues. I do see error reporting as
 a major issue, though. If we need to do proper low-level block allocation
 in order to get correct ENOSPC handling, then the win from doing deferred
 writes is not very big.

To get ENOSPC handling 99% correct all we need to do is decrement a counter,
that remembers how many disks blocks are free.  If we need a better
estimate than just the data blocks it should not be hard to add an
extra callback to the filesystem.  

There look to be some interesting cases to handle when we fill up a
filesystem.  Before actually failing and returning ENOSPC the
filesystem might want to fsync itself. And see how correct it's
estimates were.  But that is the rare case and shouldn't affect
performance.

rant
In the long term VFS support for deferred writes looks like a major
win.  Knowing how large a file is before we write it to disk allows
very efficient disk organization, and fast file access (esp combined
with an extent based fs).   Support for compressing files in real time
falls out naturally.  Support for filesystems maintain coherency by
never writing the same block back to the same disk location also
appears.
/rant

One other thing to think about for the VFS/MM layer is limiting the
total number of dirty pages in the system (to what disk pressure shows
the disk can handle), to keep system performance smooth when swapping.
All cases except mmaped files are easy, and they can be handled by a
modified page fault handler that directly puts the dirty bit on the
struct page.  (Except that is buggy with respect to clearing the dirty
bit on the struct page.)  In reality we would have to create a queue
of pointers to dirty pte's from the page fault handler and depending
on a timer or memory pressure move the dirty bits to the actual page.

Combined with the code to make sync and fsync to work on the page
cache we msync would be obsolete?

Of course the most important part is that when all of that is
working, the VFS/MM layer it would be perfect.  World domination
would be achieved.  For who would be caught using an OS with an
imperfect VFS layer :)

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Linus Torvalds



On 30 Dec 2000, Eric W. Biederman wrote:
 
 One other thing to think about for the VFS/MM layer is limiting the
 total number of dirty pages in the system (to what disk pressure shows
 the disk can handle), to keep system performance smooth when swapping.

This is a separate issue, and I think that it is most closely tied in to
the "RSS limit" kind of patches because of the memory mapping issues. If
you've seen the RSS rlimit patch (it's been posted a few times this week),
then you could think of that modified by a "Resident writable pages Set
Size" approach. Not just for shared mappings - this is also an issue with
limiting swapout.

(I actually don't think that RSS is all that interesting, it's really the
"potentially dirty RSS" that counts for VM behaviour - everything else can
be dropped easily enough)

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Daniel Phillips

On Sat, 30 Dec 2000, Alexander Viro wrote:
 Well, see above. I'm pretty nervous about breaking the ordering of metadata
 allocation. For pageout() we don't have such ordering. For write() we
 certainly do. Notice that reserving disk space upon write() and eating it
 later is _very_ messy job - you'll have to take care of situations when
 we reserve the space upon write() and get pageout do the real allocation.
 Not nice, since pageout has no way in hell to tell whether it is eating
 from a reserved area or just flushing the mmaped one. We could keep the
 per-bh "reserved" flag to fold that information into the pagecache, but
 IMO it's simply not worth the trouble. If some filesystems wants that -
 hey, it can do that right now. Just make -prepare_write() do reservations
 and let -commit_write() mark the page dirty. Then -writepage() will
 eventually flush it.

This is a refinement of the idea and some abstraction like that is
clearly needed, and maybe that is exactly the right one.  For now I'm
interested in putting this on the table so that we can check the
stability and performance, maybe uncover come more bugs, then start
going after some of the things that need to be done to turn it into a
useful option.

P.S., I humbly apologize for writing (!offset  bytes == PAGE_SIZE)
when I could have just written (bytes == PAGE_SIZE).

-- 
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Andrea Arcangeli

On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
 To get ENOSPC handling 99% correct all we need to do is decrement a counter,
 that remembers how many disks blocks are free.  If we need a better

Yes, we need to add one field to the in-core superblock to do this accounting.

 estimate than just the data blocks it should not be hard to add an
 extra callback to the filesystem.  

Yes, I was thinking at this callback too. Such a callback is nearly the only
support we need from the filesystem to provide allocate on flush. Allocate on
flush is a pagecache issue, not really a filesystem issue. When a filesystem
doesn't implement such callback we can simply get_block(create) at pagecache
creation time as usual.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:

 On 30 Dec 2000, Eric W. Biederman wrote:
  
  One other thing to think about for the VFS/MM layer is limiting the
  total number of dirty pages in the system (to what disk pressure shows
  the disk can handle), to keep system performance smooth when swapping.
 
 This is a separate issue,  and I think that it is most closely tied in to
 the "RSS limit" kind of patches because of the memory mapping issues. If
 you've seen the RSS rlimit patch (it's been posted a few times this week),
 then you could think of that modified by a "Resident writable pages Set
 Size" approach. 

Building on the RSS limit approach sounds much simpler then they way
I was thinking.

 Not just for shared mappings - this is also an issue with
 limiting swapout.
 
 (I actually don't think that RSS is all that interesting, it's really the
 "potentially dirty RSS" that counts for VM behaviour - everything else can
 be dropped easily enough)

Definitely.

Now the only tricky bit is how do we sense when we are overloading
the swap disks.  Well that is the next step.  I'll take a look
and see what it takes to keep statistics on dirty mapped pages.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Alexander Viro



On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

 On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
  To get ENOSPC handling 99% correct all we need to do is decrement a counter,
  that remembers how many disks blocks are free.  If we need a better
 
 Yes, we need to add one field to the in-core superblock to do this accounting.

And its meaning for 2/3 of filesystems would be?

  estimate than just the data blocks it should not be hard to add an
  extra callback to the filesystem.  
 
 Yes, I was thinking at this callback too. Such a callback is nearly the only
 support we need from the filesystem to provide allocate on flush. Allocate on
 flush is a pagecache issue, not really a filesystem issue. When a filesystem

I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should
apply to ramfs. It should apply to helluva lot of filesystems that are not
block-based. Pagecache doesn't (and shouldn't) know about blocks.

> doesn't implement such a callback we can simply get_block(create) at pagecache
> creation time as usual.

Umm... You do realize that get_block is _not_ visible at the pagecache level?
Sure thing, a filesystem should be free to use whatever functions it wants
for address_space methods. No arguments here. It should be able to use whatever
callbacks these functions expect. If a filesystem doesn't implement reservation
it should use functions that do not expect such an argument. That's it. No
need to invent new methods or shoehorn all block filesystems into the same
scheme.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.
> 
> Yes, I was thinking of this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush.

Actually the get_block function could be split into 3 functions:
- alloc_block: mostly just decrementing a counter (and quota)
- get_block: allocating a block from the bitmap
- commit_block: inserting the new block into the inode

This would be really useful for streaming: one could get the block number
as fast as possible and write the data very quickly, while keeping cache
usage low. Streaming directly from a device to disk also wants to get rid
of the data as fast as possible.
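
To make the three phases concrete, here is a hypothetical set of prototypes
(these functions do not exist anywhere; the names are only illustrative):

#include <linux/fs.h>

/* phase 1: cheap reservation -- decrement the free-block and quota
 * counters so ENOSPC can be reported immediately at write() time */
int demo_alloc_block(struct inode *inode, long block);

/* phase 2: pick a real block number from the bitmap, so the I/O can
 * be started as soon as the data arrives */
int demo_get_block(struct inode *inode, long block,
		   struct buffer_head *bh_result);

/* phase 3: once the data is safely on disk, insert the block pointer
 * into the inode (the only phase that has to touch the metadata) */
int demo_commit_block(struct inode *inode, long block, unsigned long phys);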

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Linus Torvalds



On Sun, 31 Dec 2000, Roman Zippel wrote:
 
 On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
 
   estimate than just the data blocks it should not be hard to add an
   extra callback to the filesystem.  
  
  Yes, I was thinking at this callback too. Such a callback is nearly the only
  support we need from the filesystem to provide allocate on flush.
 
 Actually the getblock function could be split into 3 functions:
 - alloc_block: mostly just decrementing a counter (and quota)
 - get_block: allocating a block from the bitmap
 - commit_block: inserting the new block into the inode
 
 This would be really useful for streaming, one could get as fast as
 possible the block number and the data could be very quickly written,
 while keeping the cache usage low. Or streaming directly from a device
 to disk also wants to get rid of the data as fast as possible.

Now, to insert a small note of sanity here: I think people are starting to
overdesign stuff.

The fact is that currently the "get_block()" interface that the page cache
helper functions use does NOT have to be very expensive at all.

In fact, in a properly designed filesystem just a bit of low-level caching
would easily make the average "get_block()" be very fast indeed. The fact
that right now ext2 has not been optimized for this is _not_ a reason to
design the VFS layer around a slow get_block() operation.
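
As a hypothetical illustration of that kind of low-level caching (all names
made up here, not taken from ext2): keep a few recently resolved
logical-to-physical mappings in the in-core inode, so repeated get_block()
calls for nearby data skip the indirect-block walk. Such a cache would of
course have to be flushed on truncate.

#include <linux/fs.h>

#define DEMO_CACHE_SLOTS 8

/* hypothetical per-inode private data */
struct demo_inode_info {
	long		c_logical[DEMO_CACHE_SLOTS];
	unsigned long	c_phys[DEMO_CACHE_SLOTS];
	int		c_next;				/* next slot to recycle */
};

/* the real lookup/allocation path of the filesystem (placeholder) */
extern int demo_get_block_slow(struct inode *, long, struct buffer_head *, int);

static int demo_get_block(struct inode *inode, long block,
			  struct buffer_head *bh, int create)
{
	struct demo_inode_info *ei = inode->u.generic_ip;
	int i, err;

	for (i = 0; i < DEMO_CACHE_SLOTS; i++) {
		if (ei->c_phys[i] && ei->c_logical[i] == block) {
			/* cache hit: fill in the mapping, no tree walk */
			bh->b_dev = inode->i_dev;
			bh->b_blocknr = ei->c_phys[i];
			bh->b_state |= (1UL << BH_Mapped);
			return 0;
		}
	}

	err = demo_get_block_slow(inode, block, bh, create);
	if (!err && buffer_mapped(bh)) {
		/* remember the mapping for the next lookup */
		i = ei->c_next;
		ei->c_next = (i + 1) % DEMO_CACHE_SLOTS;
		ei->c_logical[i] = block;
		ei->c_phys[i] = bh->b_blocknr;
	}
	return err;
}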

If you look at the generic block-based writing routines, they are actually
not all that expensive. Any kind of complication is only going to make
those functions more complex, and any performance gained could easily be
lost in extra complexity.

There are only two real advantages to deferred writing:

 - not having to do get_block() at all for temp-files, as we never have to
   do the allocation if we end up removing the file.

   NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
   right is quite possibly big enough that this advantage is possibly very
   questionable.  It's very possible that people could speed things up
   using this approach, but I also suspect that it is equally (if not
   more) possible to speed things up by just making sure that the
   low-level FS has a fast get_block().

 - Using "global" access patterns to do a better job of "get_block()", ie
   taking advantage of issues with journalling etc and deferring the write
   in order to get a bigger journal.

The second point is completely different, and THIS is where I think there
are potentially real advantages. However, I also think that this is not
actually about deferred writes at all: it's really a question of the
filesystem having access to the page when the physical write is actually
started so that the filesystem might choose to _change_ the allocation it
did - it might have allocated a backing store block at "get_block()" time,
but by the time it actually writes the stuff out to disk it may have
allocated a bigger contiguous area somewhere else for the data..
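
Schematically (hypothetical names again, this is not existing code), point #2
boils down to giving the filesystem a hook at writepage() time where it may
substitute a better block than the one it handed out earlier:

#include <linux/fs.h>
#include <linux/mm.h>

/* hypothetical fs hooks: return a better block number (0 = keep the
 * old one) and move the on-disk block pointer accordingly */
extern unsigned long demo_relocate_block(struct inode *, unsigned long);
extern void demo_move_block_pointer(struct inode *, unsigned long old,
				    unsigned long new);
extern int demo_get_block(struct inode *, long, struct buffer_head *, int);

static int demo_writepage(struct page *page)
{
	struct inode *inode = page->mapping->host;
	/* simplification: only the first buffer of the page is considered */
	struct buffer_head *bh = page->buffers;

	if (bh && buffer_mapped(bh)) {
		/* the fs may have found a larger contiguous area since
		 * get_block() time -- let it change its mind now */
		unsigned long better = demo_relocate_block(inode, bh->b_blocknr);
		if (better) {
			demo_move_block_pointer(inode, bh->b_blocknr, better);
			bh->b_blocknr = better;
		}
	}

	/* hand the page to the generic block I/O path as usual */
	return block_write_full_page(page, demo_get_block);
}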

I really think that the #2 thing is the more interesting one, and that
anybody looking at ext2 should look at just improving the locking and
making the block allocation functions run faster. Which should not be all
that difficult - the last time I looked at the thing it was quite
horrible.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Generic deferred file writing

2000-12-30 Thread Andrea Arcangeli

On Sat, Dec 30, 2000 at 08:50:52PM -0500, Alexander Viro wrote:
> And its meaning for 2/3 of filesystems would be?

It should stay in the private part of the in-core superblock of course.

> I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should
> apply to ramfs. It should apply to helluva lot of filesystems that are not
> block-based. Pagecache doesn't (and shouldn't) know about blocks.

With pagecache I meant the library of pagecache methods in buffer.c. Even
though they are called by the lowlevel filesystem code and can be overridden
by lowlevel filesystem code, they aren't lowlevel filesystem code but are in
fact common code.  We can implement another version of them that, instead of
knowing only about get_block, also knows about another filesystem callback
and, when possible, only reserves the space for a delayed allocation
triggered later (in parallel) by a future kupdate. They will know about this
new callback in the same way the current standard pagecache library methods
know about get_block_t. Filesystems implementing this callback will be able
to use those new pagecache library methods.
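
For illustration only, such an extra callback could be typed along these
lines next to get_block_t (reserve_block_t is made up here; nothing like it
exists):

/* forward declarations so the sketch stands alone */
struct inode;
struct buffer_head;

/* the interface the current generic buffer.c helpers know about
 * (shown for comparison; the real typedef lives in <linux/fs.h>) */
typedef int (get_block_t)(struct inode *inode, long iblock,
			  struct buffer_head *bh_result, int create);

/* hypothetical extra callback: don't allocate anything yet, just
 * reserve the space so ENOSPC and quota can be checked up front;
 * the real get_block() then happens at flush time */
typedef int (reserve_block_t)(struct inode *inode, long iblock);

A delayed-allocation variant of the generic write helpers would then take
both callbacks, calling the reservation one at write() time and the real
get_block() only when the page is actually flushed.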

> it should use functions that do not expect such an argument. That's it. No
> need to invent new methods or shoehorn all block filesystems into the same
> scheme.

Of course.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/