Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Mingming Cao wrote:
> I agree delayed allocation makes much sense with multiblock allocation.
> But I still think it is worth the effort by itself, even without multiple
> block allocation.

On ABISS, we're currently also experimenting with delayed allocation. There, the goal is less to improve overall performance than to move the accesses out of the synchronous code path for write(2).

The code works quite nicely for FAT and ext2, limiting the time it takes to make a write call writing new data to about 4-6 ms on a fairly sluggish machine (plus about 2-4 ms for moving the playout point, which is a separate operation in ABISS), and that with eight competing best-effort writers who each enjoy write latencies of some 8 seconds, worst-case, overwriting old data.

Of course, this fails horribly on ext3, because it doesn't do anything useful with the journal.

Another problem is error handling. Since FAT and ext2 don't have any form of reservation, a full disk isn't detected until it's far too late. So, a VFS-level reservation function would indeed be nice to have.

I looked at ext3 delalloc briefly, and while it did indeed improve performance quite nicely, it is tied to ext3 internals, so it would be difficult to use in the framework of ABISS, where the code paths are different (e.g. the prepare/commit functions should be as close to no-ops as possible, and leave all the work to the prefetcher thread), and which tries to be relatively file system independent.

- Werner

-- 
Werner Almesberger, Buenos Aires, Argentina  [EMAIL PROTECTED]
http://www.almesberger.net/
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On Mon, Mar 14, 2005 at 05:36:58AM -0300, Werner Almesberger wrote:
> Mingming Cao wrote:
> > I agree delayed allocation makes much sense with multiblock allocation.
> > But I still think it is worth the effort by itself, even without multiple
> > block allocation.
>
> On ABISS, we're currently also experimenting with delayed allocation. There, the goal is less to improve overall performance than to move the accesses out of the synchronous code path for write(2).
>
> The code works quite nicely for FAT and ext2, limiting the time it takes to make a write call writing new data to about 4-6 ms on a fairly sluggish machine (plus about 2-4 ms for moving the playout point, which is a separate operation in ABISS), and that with eight competing best-effort writers who each enjoy write latencies of some 8 seconds, worst-case, overwriting old data.
>
> Of course, this fails horribly on ext3, because it doesn't do anything useful with the journal.
>
> Another problem is error handling. Since FAT and ext2 don't have any form of reservation, a full disk isn't detected until it's far too late. So, a VFS-level reservation function would indeed be nice to have.
>
> I looked at ext3 delalloc briefly, and while it did indeed improve performance quite nicely, it is tied to ext3 internals, so it would be difficult to use in the framework of ABISS, where the code paths are different (e.g. the prepare/commit functions should be as close to no-ops as possible, and leave all the work to the prefetcher thread), and which tries to be relatively file system independent.

I'm looking at whether we can do most of it at VFS level ... with ext3 only taking care of the additional journalling bit - seems quite feasible. There are two requirements: (1) reservation, and (2) changing mpage_writepages to use get_blocks(). Neither seems too hard.

ext3 ordered mode will need a bit more thought.

Of course, I haven't looked at how ABISS does delayed alloc -- do you have a patch snippet I can look at ?

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
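For illustration only: one reason to move mpage_writepages to a get_blocks()-style interface is that a run of contiguous dirty pages can be mapped with a single request instead of one request per page. The sketch below is plain userspace C and not kernel code; struct demo_extent and demo_build_runs are hypothetical names, and it assumes the dirty page indices arrive sorted and without duplicates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one contiguous run of dirty pages. */
struct demo_extent {
    unsigned long start; /* first page index of the run */
    size_t len;          /* number of contiguous pages in the run */
};

/*
 * Group sorted, duplicate-free page indices into contiguous runs.
 * Each run is what a multi-block mapping call could handle in one go.
 * Returns the number of runs written to out (at most max_out).
 */
static size_t demo_build_runs(const unsigned long *idx, size_t n,
                              struct demo_extent *out, size_t max_out)
{
    size_t runs = 0;

    for (size_t i = 0; i < n; i++) {
        /* Does this page directly follow the end of the current run? */
        if (runs > 0 && idx[i] == out[runs - 1].start + out[runs - 1].len) {
            out[runs - 1].len++;
            continue;
        }
        if (runs == max_out)
            break; /* caller's array is full */
        out[runs].start = idx[i];
        out[runs].len = 1;
        runs++;
    }
    return runs;
}
```

With dirty pages 0, 1, 2, 5, 6 and 9, this yields three runs, so a writeback pass would issue three mapping requests rather than six.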
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Suparna Bhattacharya wrote:
> I'm looking at whether we can do most of it at VFS level

Do you plan to reserve space as blocks, somewhere, or as specific on-disk locations ? In ABISS, we did something of the latter kind (in order to make large contiguous allocations also on FAT), and it turned out to be a big mess, because ABISS needed too much support from the file system driver. So we just scrapped that bit :-)

> Of course, I haven't looked at how ABISS does delayed alloc -- do you have
> a patch snippet I can look at ?

I just made a release. The kernel patch is in abiss-7/kernel/abiss.patch

It's all in one big patch, sorry. The main purpose of this is to see what we can achieve, so it's not very polished.

The main parts: we added a new page flag, PG_delalloc, which basically tells everyone to stay away from that page. There are two purposes: (a) to make sure no allocation happens unless explicitly requested, and (b) to prevent the page from being written back while it is still in ABISS' playout buffer. The reason for (b) is that the page gets locked during writeback, which could cause delays if the ABISS-using application then decides to access the page.

The "hands off" code is mainly in fs/buffer.c, in the functions __block_commit_write (set the page dirty, then go away), cont_prepare_write (for FAT, do nothing), block_prepare_write (for ext2, do nothing), and then fs/mpage.c:mpage_writepages (skip pages marked for delayed allocation).

cont_prepare_write also needs to handle the special case where it has to fill holes in a file. In this case, it simply overrides delayed allocation. This bit will need more work.

Since ABISS prefetches pages, cont_prepare_write and block_prepare_write may now see pages that are already up to date, so they must not zero them.

The prefetching happens in fs/abiss/sched_lib.c:abiss_read_page, and writeback in abiss_put_page. We also experimented with leaving the writeback to MM, but that led to OOM far too often. The current solution works quite smoothly even if we tax the system hard.

In order to keep things simple, I didn't try to make delayed allocation do anything for writers that don't use ABISS.

The life cycle of a page is about as follows: when an application reads or writes a file, ABISS maintains a playout buffer for it, which typically reaches a few hundred kB ahead of the current file position. Pages are prefetched and locked in the playout buffer.

The playout buffer is dimensioned such that when file data enters the playout buffer, there is enough time for the data to be in memory by the time the application reaches it.

ABISS just calls readpage to get the data, which either causes it to be read from disk, or the page to be zeroed, if we're beyond EOF or at a hole.

The application accesses the page through the normal VFS functions, so in the case of writing, the prepare/commit process happens.

Once the application has accessed the page, and moves the playout buffer beyond it, the page is released and written back to disk.

Prefetching and writeback are done in a separate kernel thread, so the application does not get delayed.

- Werner
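To make the "hands off" idea concrete, here is a minimal userspace sketch of a writeback pass that skips pages carrying a delayed-allocation flag. It is a toy model, not the ABISS patch: struct demo_page, demo_writepages and demo_release are hypothetical names, and the real code deals with struct page, PG_delalloc and mpage_writepages inside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for kernel page state; not the real struct page. */
struct demo_page {
    bool dirty;    /* page holds data that is not on disk yet */
    bool delalloc; /* models PG_delalloc: writeback must leave it alone */
    bool written;  /* set once the toy writeback has handled the page */
};

/*
 * Models a writeback pass: dirty pages are written, except those still
 * marked for delayed allocation. Returns the number of pages written.
 */
static size_t demo_writepages(struct demo_page *pages, size_t n)
{
    size_t done = 0;

    for (size_t i = 0; i < n; i++) {
        if (!pages[i].dirty || pages[i].delalloc)
            continue; /* clean, or "hands off" */
        pages[i].dirty = false;
        pages[i].written = true;
        done++;
    }
    return done;
}

/* Models the playout buffer moving past a page: writeback may now proceed. */
static void demo_release(struct demo_page *page)
{
    page->delalloc = false;
}
```

In this model a dirty page with the flag set survives any number of writeback passes untouched, and is written on the first pass after demo_release() clears the flag.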
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Werner Almesberger (WA) writes:

 WA> Do you plan to reserve space as blocks, somewhere, or as these
 WA> specific on-disk locations ? In ABISS, we did something of the
 WA> latter kind (in order to make large contiguous allocations also on
 WA> FAT), and it turned out to be a big mess, because ABISS needed too
 WA> much support from the file system driver. So we just scrapped that
 WA> bit :-)

I see no reason to reserve specific blocks in ->prepare/->commit in the delayed allocation case. We already do this with reservation. The sole point of delayed allocation is to allocate many blocks at once: to minimize fragmentation, to decrease allocator involvement, to avoid allocation at all if the file gets truncated quickly.

 WA> The main parts: we added a new page flag, PG_delalloc, which
 WA> basically tells everyone to stay away from that page. There are
 WA> two purposes: (a) to make sure no allocation happens unless
 WA> explicitly requested, and (b) prevent the page from being written
 WA> back while it is still in ABISS' playout buffer. The reason for
 WA> (b) is that the page gets locked during writeback, which could
 WA> cause delays if the ABISS-using application then decides to
 WA> access the page.

locked during writeback? PG_writeback should be used instead of PG_locked.

thanks, Alex
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Alex Tomas wrote:
> I see no reason to reserve specific block in ->prepare/->commit in
> delayed allocation case. We already do this with reservation.

This seems like a sensible approach to me. Trying to reserve specific blocks in an FS-independent way was what got us in trouble on ABISS. So the plan B is to add this kind of reservation to where it is really lacking (i.e. FAT).

Hmm, it's a bit confusing that we call both things "reservation". Well, airlines do this too, free seating.

> locked during writeback? PG_writeback should be used instead of PG_locked.

In mpage_writepages, writepage can also get called with the page just PG_locked.

- Werner
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Werner Almesberger (WA) writes:

 >> locked during writeback? PG_writeback should be used instead of PG_locked.

 WA> In mpage_writepages, writepage can also get called with the page just
 WA> PG_locked.

you can drop PG_locked right as you set PG_writeback, I think

thanks, Alex
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
> Hmm, it's a bit confusing that we call both things "reservation".

I think "reservation" is wrong for one of them and anyone using it that way should stop. I believe the common terminology is:

- choosing the blocks is "placement."

- committing the required number of blocks from the resource pool for the instant use is "reservation."

- the combination of reservation and placement is "allocation."

Obviously, traditional filesystem drivers haven't split placement from reservation, so don't bother to use those terms. Most delaying schemes delay the placement but not the reservation because they don't want to accept the possibility that a write would fail for lack of space after the write() system call succeeded.

Even in non-filesystem areas, "allocate" usually means to assign particular resources, while "reserve" just means to make arrangements so that a future allocate will succeed. For example, if you know you need up to 10 blocks of memory to complete a task without deadlocking, but you don't know yet exactly how many, you would reserve 10 blocks and later, if necessary, allocate the actual blocks.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
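The three terms can be separated in a toy allocator, sketched below under loud assumptions: this is userspace C over a 16-block "disk", every demo_ name is hypothetical, and placement is plain first-fit. Reservation only moves a counter and can fail early (the point at which write() would return an out-of-space error), while placement picks concrete block numbers later and cannot run out as long as it stays within what was reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NBLOCKS 16

/* Hypothetical toy filesystem state. */
struct demo_fs {
    bool used[DEMO_NBLOCKS]; /* true once a block has been placed */
    size_t free_count;       /* blocks neither placed nor reserved */
};

static void demo_fs_init(struct demo_fs *fs)
{
    for (size_t b = 0; b < DEMO_NBLOCKS; b++)
        fs->used[b] = false;
    fs->free_count = DEMO_NBLOCKS;
}

/* Reservation: commit n blocks from the pool without choosing which ones. */
static bool demo_reserve(struct demo_fs *fs, size_t n)
{
    if (n > fs->free_count)
        return false; /* out of space is detected here, up front */
    fs->free_count -= n;
    return true;
}

/*
 * Placement: choose concrete blocks (first-fit) for n previously reserved
 * blocks and store their numbers in out. Returns how many were placed.
 */
static size_t demo_place(struct demo_fs *fs, size_t n, size_t *out)
{
    size_t placed = 0;

    for (size_t b = 0; b < DEMO_NBLOCKS && placed < n; b++) {
        if (!fs->used[b]) {
            fs->used[b] = true;
            out[placed++] = b;
        }
    }
    return placed;
}

/* Allocation: reservation and placement done together. */
static bool demo_allocate(struct demo_fs *fs, size_t n, size_t *out)
{
    if (!demo_reserve(fs, n))
        return false;
    return demo_place(fs, n, out) == n;
}
```

A delaying scheme in this model calls demo_reserve() at write time and demo_place() at writeback time; a traditional driver calls demo_allocate() at write time.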
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Bryan Henderson wrote:
> I think "reservation" is wrong for one of them and anyone using it that
> way should stop.

Hehe, start with ext3 :-)

> I believe the common terminology is:

Sounds reasonable. The thing with "reservation" is that people use it in daily life with all kinds of meanings, and often with the object of the reservation, e.g. "reserve a seat" (typically a specific seat), "reserve some time" (often not a specific interval), or "reserve a table" (at a restaurant, you don't know which one, but the restaurant staff does).

To muddy the issue further, reservations can be more or less firm. E.g. if we reserve the next hundred blocks, so that allocation is contiguous, we may want to be able to take them away if some other file needs them. On the other hand, if storage is already committed, but just not on disk yet, that reservation shouldn't be revocable.

- Werner
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On 03 Mar 2005 17:12:14 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote:
> One more thing, we need to keep in mind is - we need to make sure that
> ordered mode also improved - since all our testcode focuses on writeback
> mode and the default mode is ordered :(

I've just cooked the patch to implement ordered mode for delayed allocation path. please take it:

ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch

Stephen, Andrew could you review it, please?

thanks, Alex

Index: linux-2.6.11/include/linux/jbd.h
===================================================================
--- linux-2.6.11.orig/include/linux/jbd.h	2005-03-02 20:49:13.0 +0300
+++ linux-2.6.11/include/linux/jbd.h	2005-03-04 17:03:52.0 +0300
@@ -486,6 +486,12 @@
 	struct journal_head	*t_sync_datalist;
 
 	/*
+	 * Number of BIO's submited in context of the transaction we
+	 * want to complete before committing
+	 */
+	atomic_t		t_bios_in_flight;
+
+	/*
 	 * Doubly-linked circular list of all forget buffers (superseded
 	 * buffers which we can un-checkpoint once this transaction commits)
 	 * [j_list_lock]
@@ -678,6 +684,9 @@
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for all BIOs to complete */
+	wait_queue_head_t	j_wait_bios;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct semaphore	j_checkpoint_sem;
 
Index: linux-2.6.11/fs/jbd/commit.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/commit.c	2005-03-02 20:49:09.0 +0300
+++ linux-2.6.11/fs/jbd/commit.c	2005-03-04 17:53:52.0 +0300
@@ -619,6 +620,13 @@
 	if (is_journal_aborted(journal))
 		goto skip_commit;
 
+	/*
+	 * Before the commit record, we have to wait for all bio's
+	 * ext3_wb_writepages() issued against newly-allocated blocks
+	 */
+	wait_event(journal->j_wait_bios,
+		atomic_read(&commit_transaction->t_bios_in_flight) == 0);
+
 	/* Done it all: now write the commit record.  We should have
 	 * cleaned up our previous buffers by now, so if we are in abort
 	 * mode we can now just skip the rest of the journal write
Index: linux-2.6.11/fs/jbd/transaction.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/transaction.c	2005-03-02 20:49:09.0 +0300
+++ linux-2.6.11/fs/jbd/transaction.c	2005-03-04 17:05:28.0 +0300
@@ -51,6 +51,7 @@
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
+	atomic_set(&transaction->t_bios_in_flight, 0);
 
 	/* Set up the commit timer for the new transaction. */
 	journal->j_commit_timer->expires = transaction->t_expires;
Index: linux-2.6.11/fs/jbd/journal.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/journal.c	2005-03-04 17:04:29.0 +0300
+++ linux-2.6.11/fs/jbd/journal.c	2005-03-04 17:04:40.0 +0300
@@ -671,6 +671,7 @@
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_bios);
 	init_MUTEX(&journal->j_barrier);
 	init_MUTEX(&journal->j_checkpoint_sem);
 	spin_lock_init(&journal->j_revoke_lock);
Index: linux-2.6.11/fs/ext3/writeback.c
===================================================================
--- linux-2.6.11.orig/fs/ext3/writeback.c	2005-03-04 15:10:01.0 +0300
+++ linux-2.6.11/fs/ext3/writeback.c	2005-03-04 17:33:05.0 +0300
@@ -145,6 +145,17 @@
 	if (bio->bi_size)
 		return 1;
 
+	if (bio->bi_private) {
+		transaction_t *transaction = bio->bi_private;
+
+		/*
+		 * journal_commit_transaction() may be awaiting
+		 * the bio to complete.
+		 */
+		if (atomic_dec_and_test(&transaction->t_bios_in_flight))
+			wake_up(&transaction->t_journal->j_wait_bios);
+	}
+
 	do {
 		struct page *page = bvec->bv_page;
 
@@ -162,6 +173,16 @@
 static struct bio *ext3_wb_bio_submit(struct bio *bio, handle_t *handle)
 {
 	bio->bi_end_io = ext3_wb_end_io;
+	if (handle) {
+		/*
+		 * In data=ordered we shouldn't commit the transaction
+		 * until all data related to the transaction get on a
+		 * platter.
+		 */
+		atomic_inc(&handle->h_transaction->t_bios_in_flight);
+		bio->bi_private = handle->h_transaction;
+	} else
+
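The counting scheme in the patch above can be modelled in userspace C11, as a rough sketch only. It is single-threaded and replaces the wait queue with a plain counter of wakeups, so it shows the bookkeeping rather than the sleeping; struct demo_transaction and the demo_ functions are hypothetical names, with demo_submit standing in for the atomic_inc() in ext3_wb_bio_submit(), demo_end_io for the atomic_dec_and_test() in ext3_wb_end_io(), and demo_can_commit for the condition passed to wait_event().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the per-transaction state the patch adds. */
struct demo_transaction {
    atomic_int bios_in_flight; /* models t_bios_in_flight */
    int wakeups;               /* how often the committer would be woken */
};

static void demo_txn_init(struct demo_transaction *t)
{
    atomic_init(&t->bios_in_flight, 0);
    t->wakeups = 0;
}

/* Submission side: account for the I/O before it is issued. */
static void demo_submit(struct demo_transaction *t)
{
    atomic_fetch_add(&t->bios_in_flight, 1);
}

/* Completion side: only the last completion wakes the committer. */
static void demo_end_io(struct demo_transaction *t)
{
    /* atomic_fetch_sub returns the old value; 1 means we just hit zero. */
    if (atomic_fetch_sub(&t->bios_in_flight, 1) == 1)
        t->wakeups++;
}

/* Commit side: the commit record may be written only when this is true. */
static bool demo_can_commit(struct demo_transaction *t)
{
    return atomic_load(&t->bios_in_flight) == 0;
}
```

The property that matters for data=ordered is visible even in this toy: between the first demo_submit() and the matching last demo_end_io(), demo_can_commit() is false, so the commit record cannot overtake the data.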
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On Thu, 2005-03-03 at 00:33, Suparna Bhattacharya wrote:
> Since the performance improvements seen so far are quite encouraging, and momentum is picking up so well, I started looking through the patches from Alex ... just a quick code walkthrough to get a hang of it and think about what kind of simplifications might be possible and what it might take for inclusion. I haven't had a chance to go too deep line by line yet, but thought I'd initiate some discussion with some first impressions and summary of what directions I hear several people converging towards to validate if I'm on the right track here.
>
> diffstat of the 3 patches: 22 files changed, 5920 insertions(+), 47 deletions. The largest is in the extents patch (2743), mballoc is 1968, and delalloc is 1209. To use delalloc, which gives us all the performance benefits, right now we need all the 3 patches to be used in conjunction.
>
> Supporting extent map btrees as well as traditional indexing and associated options for compatibility etc is perhaps the more invasive of changes. Given that keeping ext3 stable and maintainable is a key concern (that is after all a major reason why a lot of users rely on ext3), a somewhat incremental approach is desirable.
>
> So, I'll start from the direction that has been suggested by some -- (1) delayed allocation without changing the on-disk format. And then later (2) go on to breaking format with all changes for scalability to larger files with full extents support (haven't thought enough about this yet - maybe in a separate mail)

Just doing delayed allocation without multiblock allocation (with the current layout) is not really a useful thing, IMHO. It will benefit a few cases, but in general we have just moved the block allocation overhead from prepare write to writepages/writepage time. There is a little benefit from not doing journaling twice etc., but I don't think it would be enough to justify the effort. Would it?

So, maybe we should look at adding multiblock allocation + delayed allocation to the current ext3 layout. Then we can evaluate the benefits of having extents etc. and then break the layout ?

One more thing we need to keep in mind: we need to make sure that ordered mode is also improved, since all our test code focuses on writeback mode and the default mode is ordered :(

Thanks,
Badari
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On Thu, Mar 03, 2005 at 05:46:13PM -0800, Mingming Cao wrote:
> On Thu, 2005-03-03 at 17:12 -0800, Badari Pulavarty wrote:
> > Just doing delayed allocation without multiblock allocation (with the current layout) is not really a useful thing, IMHO. It will benefit a few cases, but in general we have just moved the block allocation overhead from prepare write to writepages/writepage time. There is a little benefit from not doing journaling twice etc., but I don't think it would be enough to justify the effort. Would it?
>
> Hi Badari,
>
> I agree delayed allocation makes much sense with multiblock allocation. But I still think it is worth the effort by itself, even without multiple block allocation. If we have a seeky random write application, and the application later tries to fill those holes, we will normally end up with a pretty ugly file layout. With delayed allocation, we could have a better chance of getting contiguous blocks on disk for that file. I happened to find that Ted has mentioned this before:
> http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2
>
> > So, maybe we should look at adding multiblock allocation + delayed allocation to the current ext3 layout. Then we can evaluate the benefits of having extents etc. and then break the layout ?
>
> The current reservation code could be improved to return how big the free chunk inside the window is, and we could use that to help make ext3_new_blocks()/ext3_get_blocks() happen.

Yup this is exactly what I was thinking. It'll probably only be a step along the way ... but I am hoping that this will give us a direction to merge these pieces in incrementally, a little at a time, each piece being very well-understood and with demonstrated performance improvements at every step.

For example, the next step after the following could be to plug parts of mballoc in to the above, etc ...

Does that make sense ?

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
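As a rough illustration of Mingming's suggestion (having the reservation code report how large the free chunk inside the window is, so that a multi-block request can be sized from it), here is a userspace sketch. It is not ext3 code: the block bitmap is a plain bool array, the window is a half-open block range, and demo_free_run and demo_new_blocks are hypothetical names rather than the real ext3_new_blocks() interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Length of the contiguous free run starting at block start, not going
 * past win_end (exclusive). in_use[i] is true when block i is allocated.
 */
static size_t demo_free_run(const bool *in_use, size_t start, size_t win_end)
{
    size_t len = 0;

    while (start + len < win_end && !in_use[start + len])
        len++;
    return len;
}

/*
 * Hypothetical multi-block request: grant up to want blocks from the free
 * run at start, mark them in use, and return how many were granted.
 */
static size_t demo_new_blocks(bool *in_use, size_t start, size_t win_end,
                              size_t want)
{
    size_t run = demo_free_run(in_use, start, win_end);
    size_t got = want < run ? want : run;

    for (size_t i = 0; i < got; i++)
        in_use[start + i] = true;
    return got;
}
```

A caller that wants five blocks but finds only a three-block run gets three back in one call and can decide where to look for the rest, instead of asking for one block at a time.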