Re: [Ext2-devel] Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On Thu, Mar 03, 2005 at 02:40:21AM -0700, Andreas Dilger wrote:
> On Mar 03, 2005 14:03 +0530, Suparna Bhattacharya wrote:
> > diffstat of the 3 patches: 22 files changed, 5920 insertions(+), 47 deletions. The largest is in the extents patch (2743), mballoc is 1968, and delalloc is 1209. To use delalloc, which gives us all the performance benefits, right now we need all the 3 patches to be used in conjunction.
> >
> > Supporting extent map btrees as well as traditional indexing and associated options for compatibility etc is perhaps the more invasive of changes. Given that keeping ext3 stable and maintainable is a key concern (that is after all a major reason why a lot of users rely on ext3), a somewhat incremental approach is desirable.
> >
> > So, I'll start from the direction that has been suggested by some -- (1) delayed allocation without changing the on-disk format. And then later (2) go on to breaking format with all changes for scalability to larger files with full extents support (haven't thought enough about this yet - maybe in a separate mail)
>
> Well, for a starter, the extents format changes are not forced on users, only if they mount with -o extents and write files will it mark the superblock incompatible and start allocating files this way. I believe (though I have never tested) that even if extents are enabled, writes to a block-mapped file will continue to work and that file will not be converted to an extent file.

Files that are created with extents will not be viewable by an older kernel, though (I think) - which is where the format breakage comes in (is that correct?). But I don't see this as a major issue, since it can perhaps be taken care of through a little bit of migration tooling as Ted indicated.

So, compatibility in itself wasn't the main concern bothering me but how we could make it easier to assure stability and maintainability even with all the cool stuff.

For example, if we have both mballoc and regular balloc and similarly extents and regular indexing based on growth patterns (a nice idea, btw), does it multiply the scenarios to verify on the testing front? Or in dealing with changes in the future? I'm guessing that this might be one of the things (besides agreement on the disk layout) holding up inclusion of extents, despite the patches being around for a while now .. but then I could be wrong.

B-tree based extent maps were mentioned by sct way back in his 2000 paper! And of course every filesystem out there implements B-trees in its own way.

I can see arguments flying both ways ... at what point do we decide to break towards an ext4?

BTW, has anyone tried playing with the idea of ext4 as not a cp -r fs/ext3 fs/ext4 and edit, but if possible using some layered filesystem techniques to reuse much of ext3 directly, and just override a few operations (like get_blocks for extents etc) where there is a layout impact?

Alex, have you had a chance to prototype your idea of rooting extents in ea?

> > A few random things that come to mind for (1), going through the code:
> > - There might be possibilities for code reduction, by extending generic routines as far as possible, e.g. ext3_wb_writepages has a lot in common with generic writepages. That would also make it easier to maintain.
>
> I'm sure some support for this could be gotten from e.g. XFS as well, since their filesystem (on Irix at least) was all about delayed alloc (not sure what it does under Linux), and I believe ReiserFS/Reiser4 also desire the ability to have delayed allocation from the VFS (i.e. some sort of light-weight reserve space call for each page dirtied and then getting the actual file + offsets en masse later (if the VFS/VM doesn't discard the whole thing)).

*nod*

Regards
Suparna

> Cheers, Andreas
> --
> Andreas Dilger
> http://sourceforge.net/projects/ext2resize/
> http://members.shaw.ca/adilger/
> http://members.shaw.ca/golinux/

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
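
A minimal sketch of the reserve-then-allocate idea mentioned in this message (a light-weight space reservation per dirtied page, with the actual block placement done in bulk later). This is a simplified userspace model, not the ext3 or VFS interface: struct fs_space, reserve_block(), allocate_reserved() and release_reservation() are hypothetical names, and the "allocator" is just a counter that hands out the next contiguous run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the ext3 or VFS API. */
struct fs_space {
	unsigned long free_blocks;     /* blocks not yet allocated on disk */
	unsigned long reserved_blocks; /* promised to dirty pages, not yet placed */
	unsigned long next_block;      /* toy allocator: next free block number */
};

/*
 * Called when a page is dirtied: accounting only, no placement decision.
 * Returns 0 on success, -1 if the space cannot be promised.
 */
int reserve_block(struct fs_space *fs)
{
	if (fs->reserved_blocks >= fs->free_blocks)
		return -1;
	fs->reserved_blocks++;
	return 0;
}

/*
 * Called at writeback: turn `count` reservations into one contiguous run.
 * Returns the first block number of the run.
 */
unsigned long allocate_reserved(struct fs_space *fs, unsigned long count)
{
	unsigned long start;

	assert(count <= fs->reserved_blocks);
	start = fs->next_block;
	fs->next_block += count;
	fs->free_blocks -= count;
	fs->reserved_blocks -= count;
	return start;
}

/* Called if dirty pages are discarded before writeback (e.g. truncate). */
void release_reservation(struct fs_space *fs, unsigned long count)
{
	assert(count <= fs->reserved_blocks);
	fs->reserved_blocks -= count;
}
```

The point of the split is that reserve_block() is cheap enough to call for every dirtied page, while the allocator sees the whole batch at once in allocate_reserved() and can place it contiguously.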
Re: [Ext2-devel] Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On Fri, 4 Mar 2005 16:43:31 +0530 Suparna Bhattacharya [EMAIL PROTECTED] wrote:
> Alex, have you had a chance to prototype your idea of rooting extents in ea?

I think all you need for this are:
1) allocate EA in ext3_new_inode()
2) write a replacement for ext3_init_tree_desc() just few lines of code
3) write .get_write_access and .mark_buffer_dirty methods again few lines
4) use replacement of ext3_init_tree_desc() in few places

thanks, Alex
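
A rough sketch of how the four steps above might fit together, assuming the extent tree code reaches its root through a small descriptor with per-root callbacks. Only the names ext3_init_tree_desc(), .get_write_access and .mark_buffer_dirty come from the message; every toy_* type and function below is hypothetical, the 128-byte EA area is an assumed size, and step 1 (allocating the EA at inode creation) is modelled by a fixed array inside the inode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real patch's structures are not shown here. */
struct toy_inode {
	unsigned char i_block[60];   /* classic in-inode block map area */
	unsigned char ea_space[128]; /* assumed in-inode EA area (large inode) */
	int inode_dirty;             /* inode body needs writing */
	int ea_dirty;                /* EA area needs writing */
};

struct toy_tree;

/* Per-root callbacks, mirroring the two methods named in step 3. */
struct toy_root_ops {
	int (*get_write_access)(struct toy_tree *tree);
	int (*mark_buffer_dirty)(struct toy_tree *tree);
};

struct toy_tree {
	struct toy_inode *inode;
	void *root;       /* where the extent tree root lives */
	size_t root_len;  /* bytes available for the root */
	const struct toy_root_ops *ops;
};

static int toy_root_get_write_access(struct toy_tree *tree)
{
	(void)tree; /* nothing to journal in this toy model */
	return 0;
}

static int toy_inode_root_dirty(struct toy_tree *tree)
{
	tree->inode->inode_dirty = 1;
	return 0;
}

static int toy_ea_root_dirty(struct toy_tree *tree)
{
	tree->inode->ea_dirty = 1;
	return 0;
}

static const struct toy_root_ops toy_inode_root_ops = {
	.get_write_access = toy_root_get_write_access,
	.mark_buffer_dirty = toy_inode_root_dirty,
};

static const struct toy_root_ops toy_ea_root_ops = {
	.get_write_access = toy_root_get_write_access,
	.mark_buffer_dirty = toy_ea_root_dirty,
};

/* Default descriptor: root stored in i_block. */
void toy_init_tree_desc(struct toy_tree *tree, struct toy_inode *inode)
{
	assert(tree && inode);
	tree->inode = inode;
	tree->root = inode->i_block;
	tree->root_len = sizeof(inode->i_block);
	tree->ops = &toy_inode_root_ops;
}

/* Step 2: replacement descriptor with the root stored in the EA area. */
void toy_init_tree_desc_ea(struct toy_tree *tree, struct toy_inode *inode)
{
	assert(tree && inode);
	tree->inode = inode;
	tree->root = inode->ea_space;
	tree->root_len = sizeof(inode->ea_space);
	tree->ops = &toy_ea_root_ops;
}
```

With this shape, step 4 amounts to calling toy_init_tree_desc_ea() instead of toy_init_tree_desc() at the call sites; the tree code itself only sees root, root_len and ops.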
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On 03 Mar 2005 17:12:14 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote:
> One more thing we need to keep in mind: we need to make sure that ordered mode is also improved, since all our test code focuses on writeback mode and the default mode is ordered :(

I've just cooked up a patch to implement ordered mode for the delayed allocation path. Please take it:

ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch

Stephen, Andrew, could you review it, please?

thanks, Alex

Index: linux-2.6.11/include/linux/jbd.h
===================================================================
--- linux-2.6.11.orig/include/linux/jbd.h	2005-03-02 20:49:13.0 +0300
+++ linux-2.6.11/include/linux/jbd.h	2005-03-04 17:03:52.0 +0300
@@ -486,6 +486,12 @@
 	struct journal_head	*t_sync_datalist;
 
 	/*
+	 * Number of BIO's submitted in context of the transaction we
+	 * want to complete before committing
+	 */
+	atomic_t		t_bios_in_flight;
+
+	/*
 	 * Doubly-linked circular list of all forget buffers (superseded
 	 * buffers which we can un-checkpoint once this transaction commits)
 	 * [j_list_lock]
@@ -678,6 +684,9 @@
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for all BIOs to complete */
+	wait_queue_head_t	j_wait_bios;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct semaphore	j_checkpoint_sem;
 
Index: linux-2.6.11/fs/jbd/commit.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/commit.c	2005-03-02 20:49:09.0 +0300
+++ linux-2.6.11/fs/jbd/commit.c	2005-03-04 17:53:52.0 +0300
@@ -619,6 +620,13 @@
 	if (is_journal_aborted(journal))
 		goto skip_commit;
 
+	/*
+	 * Before the commit record, we have to wait for all bio's
+	 * ext3_wb_writepages() issued against newly-allocated blocks
+	 */
+	wait_event(journal->j_wait_bios,
+		   atomic_read(&commit_transaction->t_bios_in_flight) == 0);
+
 	/* Done it all: now write the commit record.  We should have
 	 * cleaned up our previous buffers by now, so if we are in abort
 	 * mode we can now just skip the rest of the journal write
Index: linux-2.6.11/fs/jbd/transaction.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/transaction.c	2005-03-02 20:49:09.0 +0300
+++ linux-2.6.11/fs/jbd/transaction.c	2005-03-04 17:05:28.0 +0300
@@ -51,6 +51,7 @@
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
+	atomic_set(&transaction->t_bios_in_flight, 0);
 
 	/* Set up the commit timer for the new transaction. */
 	journal->j_commit_timer->expires = transaction->t_expires;
Index: linux-2.6.11/fs/jbd/journal.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/journal.c	2005-03-04 17:04:29.0 +0300
+++ linux-2.6.11/fs/jbd/journal.c	2005-03-04 17:04:40.0 +0300
@@ -671,6 +671,7 @@
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_bios);
 	init_MUTEX(&journal->j_barrier);
 	init_MUTEX(&journal->j_checkpoint_sem);
 	spin_lock_init(&journal->j_revoke_lock);
Index: linux-2.6.11/fs/ext3/writeback.c
===================================================================
--- linux-2.6.11.orig/fs/ext3/writeback.c	2005-03-04 15:10:01.0 +0300
+++ linux-2.6.11/fs/ext3/writeback.c	2005-03-04 17:33:05.0 +0300
@@ -145,6 +145,17 @@
 	if (bio->bi_size)
 		return 1;
 
+	if (bio->bi_private) {
+		transaction_t *transaction = bio->bi_private;
+
+		/*
+		 * journal_commit_transaction() may be awaiting
+		 * the bio to complete.
+		 */
+		if (atomic_dec_and_test(&transaction->t_bios_in_flight))
+			wake_up(&transaction->t_journal->j_wait_bios);
+	}
+
 	do {
 		struct page *page = bvec->bv_page;
 
@@ -162,6 +173,16 @@
 static struct bio *ext3_wb_bio_submit(struct bio *bio, handle_t *handle)
 {
 	bio->bi_end_io = ext3_wb_end_io;
+	if (handle) {
+		/*
+		 * In data=ordered we shouldn't commit the transaction
+		 * until all data related to the transaction get on a
+		 * platter.
+		 */
+		atomic_inc(&handle->h_transaction->t_bios_in_flight);
+		bio->bi_private = handle->h_transaction;
+	} else
+
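
The ordering rule the patch enforces can be modelled outside the kernel as a per-transaction counter: incremented when a data write is submitted, decremented on completion, and checked by the committer before it writes the commit record. The sketch below is a single-threaded userspace model using C11 atomics, not the jbd API; struct toy_transaction and all toy_* functions are hypothetical, and a plain counter of wakeups stands in for wake_up() on the wait queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the counter; not the jbd API. */
struct toy_transaction {
	atomic_int bios_in_flight; /* data writes that must finish before commit */
	int wakeups;               /* times the committer would have been woken */
};

void toy_transaction_init(struct toy_transaction *t)
{
	atomic_init(&t->bios_in_flight, 0);
	t->wakeups = 0;
}

/* Submission path: account for the write before it is issued. */
void toy_submit_bio(struct toy_transaction *t)
{
	atomic_fetch_add(&t->bios_in_flight, 1);
}

/* Completion path: the last completion wakes the committer. */
void toy_end_io(struct toy_transaction *t)
{
	/* atomic_fetch_sub returns the old value, so 1 means we reached zero. */
	int old = atomic_fetch_sub(&t->bios_in_flight, 1);

	assert(old > 0); /* a completion without a submission is a bug */
	if (old == 1)
		t->wakeups++;
}

/* Commit path: the condition the committer would sleep on. */
bool toy_commit_may_proceed(struct toy_transaction *t)
{
	return atomic_load(&t->bios_in_flight) == 0;
}
```

Incrementing before submission matters: if the counter were bumped after the write was issued, a fast completion could drive it to zero (or below) while the committer is checking it.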
Re: [Ext2-devel] Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On Mar 04, 2005 15:29 +0300, Alex Tomas wrote:
> On Fri, 4 Mar 2005 16:43:31 +0530 Suparna Bhattacharya [EMAIL PROTECTED] wrote:
> > Alex, have you had a chance to prototype your idea of rooting extents in ea?
>
> I think all you need for this are:
> 1) allocate EA in ext3_new_inode()
> 2) write a replacement for ext3_init_tree_desc() just few lines of code
> 3) write .get_write_access and .mark_buffer_dirty methods again few lines
> 4) use replacement of ext3_init_tree_desc() in few places

This should of course only be done for large inodes. Also, at some point it will consume all of the EA space and we need to use an external block. It might help in some middle cases (i.e. files with more extents than can fit in i_block (60 bytes), but fewer than fit into the large inode space (128 or maybe 384 bytes)), but it might also hurt other things if we need to allocate an EA block for another EA...

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/
http://members.shaw.ca/golinux/
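
To put rough numbers on the "middle cases" above, the helper below counts how many extent records fit in a given root area. The 12-byte header and 12-byte record sizes are an assumption taken from the on-disk extent format later used by ext4; the thread does not state the sizes used by the patch under review, and the overhead of the EA header itself is ignored here. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes (ext4-style layout); not confirmed for the 2005 patch. */
#define TOY_EXTENT_HEADER_BYTES ((size_t)12)
#define TOY_EXTENT_RECORD_BYTES ((size_t)12)

/* Number of extent records that fit in `space` bytes after one header. */
size_t toy_extents_that_fit(size_t space, size_t header, size_t record)
{
	assert(record > 0); /* avoid division by zero */
	if (space < header)
		return 0;
	return (space - header) / record;
}
```

Under these assumptions, toy_extents_that_fit(60, TOY_EXTENT_HEADER_BYTES, TOY_EXTENT_RECORD_BYTES) gives 4 records for the 60-byte in-inode area, 128 bytes gives 9, and 384 bytes gives 31, so an EA-rooted tree would cover files with roughly 5 to 31 extents before an external block is needed.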