Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Werner Almesberger
Mingming Cao wrote:
> I agree delayed allocation makes much sense with multiblock allocation.
> But I still think it is worth the effort by itself, even without
> multiple block allocation.

On ABISS, we're currently also experimenting with delayed allocation.
There, the goal is less to improve overall performance than to move
the accesses out of the synchronous code path for write(2).

The code works quite nicely for FAT and ext2. On a fairly sluggish
machine, a write call writing new data takes at most about 4-6 ms
(plus about 2-4 ms for moving the playout point, which is a separate
operation in ABISS), even with eight competing best-effort writers,
each of which sees worst-case write latencies of some 8 seconds when
overwriting old data.

Of course, this fails horribly on ext3, because it doesn't do anything
useful with the journal. Another problem is error handling. Since FAT
and ext2 don't have any form of reservation, a full disk isn't detected
until it's far too late.

So, a VFS-level reservation function would indeed be nice to have.

I looked at ext3 delalloc briefly, and while it did indeed improve
performance quite nicely, by being tied to ext3 internals, it would
be difficult to use in the framework of ABISS, where the code paths
are different (e.g. the prepare/commit functions should be as close
to no-ops as possible, and leave all the work to the prefetcher
thread), and which tries to be relatively file system independent.

- Werner

-- 
  _
 / Werner Almesberger, Buenos Aires, Argentina [EMAIL PROTECTED] /
/_http://www.almesberger.net//
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Suparna Bhattacharya
On Mon, Mar 14, 2005 at 05:36:58AM -0300, Werner Almesberger wrote:
> So, a VFS-level reservation function would indeed be nice to have.
>
> I looked at ext3 delalloc briefly, and while it did indeed improve
> performance quite nicely, by being tied to ext3 internals, it would
> be difficult to use in the framework of ABISS, where the code paths
> are different (e.g. the prepare/commit functions should be as close
> to no-ops as possible, and leave all the work to the prefetcher
> thread), and which tries to be relatively file system independent.

I'm looking at whether we can do most of it at VFS level, with
ext3 only taking care of the additional journalling bit. That seems
quite feasible. There are two requirements: (1) reservation and
(2) changing mpage_writepages to use get_blocks(). Neither seems too
hard. ext3 ordered mode will need a bit more thought.

Of course, I haven't looked at how ABISS does delayed alloc -- 
do you have a patch snippet I can look at ?

Regards
Suparna

 

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Werner Almesberger
Suparna Bhattacharya wrote:
> I'm looking at whether we can do most of it at VFS level

Do you plan to reserve space as blocks, somewhere, or as these
specific on-disk locations ? In ABISS, we did something of the
latter kind (in order to make large contiguous allocations also on
FAT), and it turned out to be a big mess, because ABISS needed too
much support from the file system driver. So we just scrapped that
bit :-)

> Of course, I haven't looked at how ABISS does delayed alloc --
> do you have a patch snippet I can look at ?

I just made a release. The kernel patch is in
abiss-7/kernel/abiss.patch. It's all in one big patch, sorry.
The main purpose of this is to see what we can achieve, so it's
not very polished.

The main parts: we added a new page flag, PG_delalloc, which
basically tells everyone to stay away from that page. There are
two purposes: (a) to make sure no allocation happens unless
explicitly requested, and (b) prevent the page from being written
back while it is still in ABISS' playout buffer. The reason for
(b) is that the page gets locked during writeback, which could
cause delays if the ABISS-using application then decides to
access the page.

The hands-off code is mainly in fs/buffer.c, in the functions
__block_commit_write (set the page dirty, then go away),
cont_prepare_write (for FAT, do nothing),
block_prepare_write (for ext2, do nothing),
and then fs/mpage.c:mpage_writepages (skip pages marked for
delayed allocation).
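
A minimal user-space sketch of the "skip pages marked for delayed
allocation" idea described above. It is not taken from the ABISS patch:
struct page_model, PAGE_DIRTY, PAGE_DELALLOC, writeback_pass and
release_from_playout are hypothetical names that only model the
behaviour, assuming a page is fully described by two flag bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel page flags; names are illustrative. */
#define PAGE_DIRTY    (1u << 0)
#define PAGE_DELALLOC (1u << 1)

struct page_model {
	unsigned flags;
};

/*
 * One writeback pass over an array of pages. A dirty page is "written"
 * (its dirty bit is cleared) unless it is marked for delayed allocation,
 * in which case it is left untouched. Returns the number of pages written.
 */
static size_t writeback_pass(struct page_model *pages, size_t n)
{
	size_t written = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!(pages[i].flags & PAGE_DIRTY))
			continue;	/* clean: nothing to do */
		if (pages[i].flags & PAGE_DELALLOC)
			continue;	/* hands off: still in the playout buffer */
		pages[i].flags &= ~PAGE_DIRTY;
		written++;
	}
	return written;
}

/* Leaving the playout buffer makes the page eligible for writeback again. */
static void release_from_playout(struct page_model *page)
{
	page->flags &= ~PAGE_DELALLOC;
}
```

In this model a page that is both dirty and marked stays dirty across
any number of passes, and is written by the first pass after
release_from_playout() has been called on it.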

cont_prepare_write also needs to handle the special case where
it has to fill holes in a file. In this case, it simply overrides
delayed allocation. This bit will need more work.

Since ABISS prefetches pages, cont_prepare_write and
block_prepare_write may now see pages that are already up to date,
so they must not zero them.

The prefetching happens in fs/abiss/sched_lib.c:abiss_read_page,
and writeback in abiss_put_page. We also experimented with
leaving the writeback to MM, but that led to OOM far too often.
The current solution works quite smoothly even if we tax the
system hard.

In order to keep things simple, I didn't try to make delayed
allocation do anything for writers that don't use ABISS.

The life cycle of a page is about as follows: when an application
reads or writes a file, ABISS maintains a playout buffer for it,
that typically reaches a few hundred kB ahead of the current file
position. Pages are prefetched and locked in the playout buffer.
The playout buffer is dimensioned such that when file data enters
the playout buffer, there is enough time for the data to be in memory
by the time the application reaches it.

ABISS just calls readpage to get the data, which either causes it
to be read from disk, or the page to be zeroed, if we're beyond
EOF or at a hole.

The application accesses the page through the normal VFS functions,
so in the case of writing, the prepare/commit process happens.

Once the application has accessed the page, and moves the playout
buffer beyond it, the page is released and written back to disk.
Prefetching and writeback are done in a separate kernel thread, so
the application does not get delayed.

- Werner



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Alex Tomas
>>>>> Werner Almesberger (WA) writes:

 WA> Do you plan to reserve space as blocks, somewhere, or as these
 WA> specific on-disk locations ? In ABISS, we did something of the
 WA> latter kind (in order to make large contiguous allocations also on
 WA> FAT), and it turned out to be a big mess, because ABISS needed too
 WA> much support from the file system driver. So we just scrapped that
 WA> bit :-)

I see no reason to reserve specific blocks in ->prepare/->commit in
the delayed allocation case. We already do this with reservation.
The sole point of delayed allocation is to allocate many blocks at once:
to minimize fragmentation, to decrease allocator involvement, and to
avoid allocation altogether if the file gets truncated quickly.
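
A toy model of the three benefits listed above, assuming a trivial bump
allocator and a file of at most MAX_BLOCKS logical blocks. Everything
here (struct file_model, write_immediate, write_delayed, flush_delayed,
truncate_delayed, count_extents) is hypothetical and only meant to show
the effect; it is not how ext3 or the delalloc patch is structured.

```c
#include <assert.h>

#define MAX_BLOCKS 64

struct file_model {
	long phys[MAX_BLOCKS];    /* logical -> physical block, -1 if unmapped */
	int  pending[MAX_BLOCKS]; /* 1 if dirty but not yet allocated */
};

static long next_free; /* bump allocator standing in for the real one */

static void file_init(struct file_model *f)
{
	for (int i = 0; i < MAX_BLOCKS; i++) {
		f->phys[i] = -1;
		f->pending[i] = 0;
	}
}

/* Immediate allocation: every write picks a physical block right away. */
static void write_immediate(struct file_model *f, int logical)
{
	assert(logical >= 0 && logical < MAX_BLOCKS);
	if (f->phys[logical] < 0)
		f->phys[logical] = next_free++;
}

/* Delayed allocation: a write only records that the block is dirty. */
static void write_delayed(struct file_model *f, int logical)
{
	assert(logical >= 0 && logical < MAX_BLOCKS);
	if (f->phys[logical] < 0)
		f->pending[logical] = 1;
}

/* Truncate before writeback: pending blocks vanish without allocation.
 * (Blocks that are already mapped are ignored in this simplified model.) */
static void truncate_delayed(struct file_model *f, int new_len)
{
	assert(new_len >= 0);
	for (int i = new_len; i < MAX_BLOCKS; i++)
		f->pending[i] = 0;
}

/* Writeback: allocate all pending blocks in one go, in logical order.
 * Returns the number of blocks allocated. */
static int flush_delayed(struct file_model *f)
{
	int n = 0;

	for (int i = 0; i < MAX_BLOCKS; i++) {
		if (f->pending[i]) {
			f->phys[i] = next_free++;
			f->pending[i] = 0;
			n++;
		}
	}
	return n;
}

/* Number of physically contiguous runs seen when walking the mapped
 * blocks in logical order. */
static int count_extents(const struct file_model *f)
{
	int extents = 0;
	long prev = -2;

	for (int i = 0; i < MAX_BLOCKS; i++) {
		if (f->phys[i] < 0)
			continue;
		if (f->phys[i] != prev + 1)
			extents++;
		prev = f->phys[i];
	}
	return extents;
}
```

Writing logical blocks 3, 0, 2, 1 with write_immediate() leaves four
separate runs in this model, while the same writes through
write_delayed() followed by one flush_delayed() leave a single run and
call the allocator in one batch. A truncate_delayed() to length 0
before the flush means the allocator is never touched.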

 WA> The main parts: we added a new page flag, PG_delalloc, which
 WA> basically tells everyone to stay away from that page. There are
 WA> two purposes: (a) to make sure no allocation happens unless
 WA> explicitly requested, and (b) prevent the page from being written
 WA> back while it is still in ABISS' playout buffer. The reason for
 WA> (b) is that the page gets locked during writeback, which could
 WA> cause delays if the ABISS-using application then decides to
 WA> access the page.

locked during writeback? PG_writeback should be used instead of PG_locked.


thanks, Alex



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Werner Almesberger
Alex Tomas wrote:
> I see no reason to reserve specific blocks in ->prepare/->commit in
> the delayed allocation case. We already do this with reservation.

This seems like a sensible approach to me. Trying to reserve specific
blocks in an FS-independent way was what got us in trouble on ABISS.
So the plan B is to add this kind of reservation to where it is really
lacking (i.e. FAT).

Hmm, it's a bit confusing that we call both things reservation.
Well, airlines do this too, free seating.

> locked during writeback? PG_writeback should be used instead of PG_locked.

In mpage_writepages, writepage can also get called with the page just
PG_locked.

- Werner



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Alex Tomas
>>>>> Werner Almesberger (WA) writes:

 >> locked during writeback? PG_writeback should be used instead of PG_locked.

 WA> In mpage_writepages, writepage can also get called with the page just
 WA> PG_locked.

you can drop PG_locked right as you set PG_writeback, I think

thanks, Alex



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Bryan Henderson
> Hmm, it's a bit confusing that we call both things reservation.

I think reservation is wrong for one of them and anyone using it that 
way should stop.  I believe the common terminology is:

- choosing the blocks is placement.

- committing the required number of blocks from the resource pool for the 
instant use is reservation.

- the combination of reservation and placement is allocation.

Obviously, traditional filesystem drivers haven't split placement from 
reservation, so don't bother to use those terms.

Most delaying schemes delay the placement but not the reservation because 
they don't want to accept the possibility that a write would fail for lack 
of space after the write() system call succeeded.

Even in non-filesystem areas, allocate usually means to assign 
particular resources, while reserve just means to make arrangements so 
that a future allocate will succeed.  For example, if you know you need up 
to 10 blocks of memory to complete a task without deadlocking, but you 
don't know yet exactly how many, you would reserve 10 blocks and 
later, if necessary, allocate the actual blocks.
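
The terminology above can be sketched in a few lines of C. This is a
minimal illustration under stated assumptions, not code from any
filesystem: struct pool, pool_reserve, pool_unreserve and
pool_place_reserved are made-up names, and the pool is a fixed array of
16 blocks.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_BLOCKS 16

struct pool {
	bool in_use[POOL_BLOCKS]; /* placement state: which blocks are taken */
	int  free_count;          /* blocks neither placed nor reserved */
};

static void pool_init(struct pool *p)
{
	for (int i = 0; i < POOL_BLOCKS; i++)
		p->in_use[i] = false;
	p->free_count = POOL_BLOCKS;
}

/* Reservation: commit n blocks from the pool without choosing them.
 * This is the only step that can fail for lack of space. */
static bool pool_reserve(struct pool *p, int n)
{
	assert(n >= 0);
	if (n > p->free_count)
		return false;
	p->free_count -= n;
	return true;
}

/* Return the unused part of a reservation to the pool. */
static void pool_unreserve(struct pool *p, int n)
{
	assert(n >= 0);
	p->free_count += n;
	assert(p->free_count <= POOL_BLOCKS);
}

/* Placement against a reservation the caller already holds: choose a
 * concrete block. free_count is not touched because the reservation
 * already accounted for it. Returns the block index, or -1 if the
 * caller had no reservation to back the request. */
static int pool_place_reserved(struct pool *p)
{
	for (int i = 0; i < POOL_BLOCKS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}
	return -1;
}
```

In the 10-block example, the task would call pool_reserve(&p, 10) up
front, call pool_place_reserved() once per block it actually needs, and
hand the remainder back with pool_unreserve().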

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Werner Almesberger
Bryan Henderson wrote:
> I think reservation is wrong for one of them and anyone using it that
> way should stop.

Hehe, start with ext3 :-)

> I believe the common terminology is:

Sounds reasonable. The thing with reservation is that people use
it in daily life with all kinds of meanings, and often with the
object of the reservation, e.g. reserve a seat (typically a
specific seat), reserve some time (often not a specific interval),
or reserve a table (at a restaurant, you don't know which one,
but the restaurant staff does).

To muddy the issue further, reservations can be more or less firm.
E.g. if we reserve the next hundred blocks, so that allocation is
contiguous, we may want to be able to take them away if some other
file needs them. On the other hand, if storage is already committed,
but just not on disk yet, that reservation shouldn't be revocable.
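
One way to picture the two degrees of firmness, as a sketch only. It
assumes reservations are tracked as plain counts rather than as
specific block ranges (a simplification of the "next hundred blocks"
case), and all names (struct space, reserve_soft, reserve_firm) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct space {
	long total;     /* size of the device in blocks */
	long allocated; /* blocks already placed on disk */
	long firm;      /* promised to data already accepted; never revoked */
	long soft;      /* speculative, e.g. held for contiguity; revocable */
};

static long space_free(const struct space *s)
{
	return s->total - s->allocated - s->firm - s->soft;
}

/* A soft reservation is only granted out of truly free space. */
static bool reserve_soft(struct space *s, long n)
{
	assert(n >= 0);
	if (n > space_free(s))
		return false;
	s->soft += n;
	return true;
}

/* A firm reservation may revoke soft reservations to satisfy itself,
 * but never other firm ones. */
static bool reserve_firm(struct space *s, long n)
{
	long avail = space_free(s);

	assert(n >= 0);
	if (n > avail + s->soft)
		return false;
	if (n > avail)
		s->soft -= n - avail; /* revoke just enough soft space */
	s->firm += n;
	return true;
}
```

With 100 blocks in total and 80 of them softly reserved, a firm request
for 50 succeeds in this model by shrinking the soft reservation to 50,
while a following firm request for 60 fails because only soft space can
be taken away.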

- Werner



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-04 Thread Alex Tomas
On 03 Mar 2005 17:12:14 -0800
Badari Pulavarty [EMAIL PROTECTED] wrote:

> One more thing we need to keep in mind: we need to make sure
> that ordered mode is also improved, since all our test code
> focuses on writeback mode and the default mode is ordered :(
 

I've just cooked up a patch implementing ordered mode for the delayed
allocation path. Please take it:

ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch

Stephen, Andrew could you review it, please?

thanks, Alex


Index: linux-2.6.11/include/linux/jbd.h
===
--- linux-2.6.11.orig/include/linux/jbd.h	2005-03-02 20:49:13.0 +0300
+++ linux-2.6.11/include/linux/jbd.h	2005-03-04 17:03:52.0 +0300
@@ -486,6 +486,12 @@
 	struct journal_head	*t_sync_datalist;
 
 	/*
+	 * Number of BIO's submitted in context of the transaction we
+	 * want to complete before committing
+	 */
+	atomic_t		t_bios_in_flight;
+
+	/*
 	 * Doubly-linked circular list of all forget buffers (superseded
 	 * buffers which we can un-checkpoint once this transaction commits)
 	 * [j_list_lock]
@@ -678,6 +684,9 @@
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for all BIOs to complete */
+	wait_queue_head_t	j_wait_bios;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct semaphore	j_checkpoint_sem;
 
Index: linux-2.6.11/fs/jbd/commit.c
===
--- linux-2.6.11.orig/fs/jbd/commit.c	2005-03-02 20:49:09.0 +0300
+++ linux-2.6.11/fs/jbd/commit.c	2005-03-04 17:53:52.0 +0300
@@ -619,6 +620,13 @@
 	if (is_journal_aborted(journal))
 		goto skip_commit;
 
+	/*
+	 * Before the commit record, we have to wait for all bio's
+	 * ext3_wb_writepages() issued against newly-allocated blocks
+	 */
+	wait_event(journal->j_wait_bios,
+			atomic_read(&commit_transaction->t_bios_in_flight) == 0);
+
 	/* Done it all: now write the commit record.  We should have
 	 * cleaned up our previous buffers by now, so if we are in abort
 	 * mode we can now just skip the rest of the journal write
Index: linux-2.6.11/fs/jbd/transaction.c
===
--- linux-2.6.11.orig/fs/jbd/transaction.c	2005-03-02 20:49:09.0 +0300
+++ linux-2.6.11/fs/jbd/transaction.c	2005-03-04 17:05:28.0 +0300
@@ -51,6 +51,7 @@
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
+	atomic_set(&transaction->t_bios_in_flight, 0);
 
 	/* Set up the commit timer for the new transaction. */
 	journal->j_commit_timer->expires = transaction->t_expires;
Index: linux-2.6.11/fs/jbd/journal.c
===
--- linux-2.6.11.orig/fs/jbd/journal.c	2005-03-04 17:04:29.0 +0300
+++ linux-2.6.11/fs/jbd/journal.c	2005-03-04 17:04:40.0 +0300
@@ -671,6 +671,7 @@
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_bios);
 	init_MUTEX(&journal->j_barrier);
 	init_MUTEX(&journal->j_checkpoint_sem);
 	spin_lock_init(&journal->j_revoke_lock);
Index: linux-2.6.11/fs/ext3/writeback.c
===
--- linux-2.6.11.orig/fs/ext3/writeback.c	2005-03-04 15:10:01.0 +0300
+++ linux-2.6.11/fs/ext3/writeback.c	2005-03-04 17:33:05.0 +0300
@@ -145,6 +145,17 @@
 	if (bio->bi_size)
 		return 1;
 
+	if (bio->bi_private) {
+		transaction_t *transaction = bio->bi_private;
+
+		/*
+		 * journal_commit_transaction() may be awaiting
+		 * the bio to complete.
+		 */
+		if (atomic_dec_and_test(&transaction->t_bios_in_flight))
+			wake_up(&transaction->t_journal->j_wait_bios);
+	}
+
 	do {
 		struct page *page = bvec->bv_page;
 
@@ -162,6 +173,16 @@
 static struct bio *ext3_wb_bio_submit(struct bio *bio, handle_t *handle)
 {
 	bio->bi_end_io = ext3_wb_end_io;
+	if (handle) {
+		/*
+		 * In data=ordered we shouldn't commit the transaction
+		 * until all data related to the transaction get on a
+		 * platter.
+		 */
+		atomic_inc(&handle->h_transaction->t_bios_in_flight);
+		bio->bi_private = handle->h_transaction;
+	} else
+	
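
The counting scheme the patch relies on (increment at submit, decrement
at completion, commit only once the count is back at zero) can be
modelled in user space as below. This is a single-threaded sketch with
C11 atomics, assuming nothing about the rest of the patch: struct
txn_model, io_submit, io_complete and try_commit are hypothetical, and
where the kernel code sleeps in wait_event() the model merely reports
whether the wait condition holds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct txn_model {
	atomic_int bios_in_flight; /* plays the role of t_bios_in_flight */
	bool committed;
};

static void txn_init(struct txn_model *t)
{
	atomic_init(&t->bios_in_flight, 0);
	t->committed = false;
}

/* Submit path: count the I/O before it is issued. */
static void io_submit(struct txn_model *t)
{
	atomic_fetch_add(&t->bios_in_flight, 1);
}

/* Completion path: returns true if this was the last outstanding I/O,
 * i.e. the point at which a sleeping committer would be woken up. */
static bool io_complete(struct txn_model *t)
{
	int prev = atomic_fetch_sub(&t->bios_in_flight, 1);

	assert(prev > 0);
	return prev == 1;
}

/* Commit step: refuses to write the commit record while data I/O for
 * the transaction is still outstanding. */
static bool try_commit(struct txn_model *t)
{
	if (atomic_load(&t->bios_in_flight) != 0)
		return false;
	t->committed = true;
	return true;
}
```

With two I/Os submitted, try_commit() keeps failing until the second
io_complete() call, which is also the only one that returns true.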

Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-03 Thread Badari Pulavarty
On Thu, 2005-03-03 at 00:33, Suparna Bhattacharya wrote:
> Since the performance improvements seen so far are quite encouraging,
> and momentum is picking up so well, I started looking through the
> patches from Alex ... just a quick code walkthrough to get a hang
> of it and think about what kind of simplifications might be possible
> and what it might take for inclusion.
>
> I haven't had a chance to go too deep line by line yet,
> but thought I'd initiate some discussion with some first impressions
> and summary of what directions I hear several people converging
> towards to validate if I'm on the right track here.
>
> diffstat of the 3 patches : 22 files changed, 5920 insertions(+),
> 47 deletions. The largest is in the extents patch (2743), mballoc
> is 1968, and delalloc is 1209. To use delalloc, which gives us
> all the performance benefits, right now we need all the 3 patches
> to be used in conjunction. Supporting extent map btrees as well
> as traditional indexing and associated options for compatibility etc
> is perhaps the more invasive of changes. Given that keeping ext3
> stable and maintainable is a key concern (that is after all a major
> reason why a lot of users rely on ext3), a somewhat incremental
> approach is desirable.
>
> So, I'll start from the direction that has been suggested by
> some -- (1) delayed allocation without changing the
> on-disk format. And then later (2) go on to breaking format with
> all changes for scalability to larger files with full extents
> support (haven't thought enough about this yet - maybe in a
> separate mail)
 

Just doing delayed allocation without multiblock allocation
(with the current layout) is not really a useful thing, IMHO.
We will benefit in a few cases, but in general we have just moved
the block allocation overhead from prepare write to
writepages/writepage time. There is a little benefit from not doing
journaling twice etc., but I don't think it would be enough to
justify the effort. Would it?

So, maybe we should look at adding multiblock allocation +
delayed allocation to the current ext3 layout. Then we can evaluate
the benefits of having extents etc. and then break the layout?

One more thing we need to keep in mind: we need to make sure
that ordered mode is also improved, since all our test code
focuses on writeback mode and the default mode is ordered :(


Thanks,
Badari




Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-03 Thread Suparna Bhattacharya
On Thu, Mar 03, 2005 at 05:46:13PM -0800, Mingming Cao wrote:
> On Thu, 2005-03-03 at 17:12 -0800, Badari Pulavarty wrote:
> > Just doing delayed allocation without multiblock allocation
> > (with the current layout) is not really a useful thing, IMHO.
> > We will benefit in a few cases, but in general we have just moved
> > the block allocation overhead from prepare write to
> > writepages/writepage time. There is a little benefit from not doing
> > journaling twice etc., but I don't think it would be enough to
> > justify the effort. Would it?
>
> Hi Badari
>
> I agree delayed allocation makes much sense with multiblock allocation.
> But I still think it is worth the effort by itself, even without
> multiple block allocation. If we have a seeky random-write application,
> and the application later tries to fill those holes, we normally end
> up with a pretty ugly file layout. With delayed allocation, we have a
> better chance of getting contiguous blocks on disk for that file.
>
> I happened to find that Ted has mentioned this before:
> http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2
>
> > So, maybe we should look at adding multiblock allocation +
> > delayed allocation to the current ext3 layout. Then we can evaluate
> > the benefits of having extents etc. and then break the layout?
>
> The current reservation code could be improved to return how big the
> free chunk inside the window is, and we could use that to help make
> ext3_new_blocks()/ext3_get_blocks() happen.

Yup this is exactly what I was thinking.

It'll probably only be a step along the way ... but I am hoping that
this will give us a direction to merge these pieces in incrementally, 
a little at a time, each piece being very well-understood and with 
demonstrated performance improvements at every step. For example, 
the next step after the following could be to plug parts of mballoc
in to the above, etc ... 

Does that make sense ?

Regards
Suparna

