Re: [RFC][PATCH] fix journal overflow problem
= bh2jh(bh); BUFFER_TRACE(bh, entry); + + jbd_lock_bh_state(bh); + /* + * if nobody else has requested write access to this jh then set + * b_next_transaction to NULL to keep it from getting refiled onto + * this transaction + */ + if (!(--jh-b_nr_access)) { + + /* + * A journal_get_undo_access()+journal_release_buffer() will + * leave undo-committed data that we no longer need + */ + if (jh-b_next_transaction jh-b_committed_data) { + jbd_free(jh-b_committed_data, bh-b_size); + jh-b_committed_data = NULL; + } + + jh-b_next_transaction = NULL; + } + + jbd_unlock_bh_state(bh); } /** @@ -1245,6 +1307,11 @@ int journal_forget (handle_t *handle, struct buffer_head *bh) */ jh-b_modified = 0; + /* + * drop one of our write access credits + */ + jh-b_nr_access--; + if (jh-b_transaction == handle-h_transaction) { J_ASSERT_JH(jh, !jh-b_frozen_data); diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h index 8a62d1e..160f1d1 100644 --- a/include/linux/journal-head.h +++ b/include/linux/journal-head.h @@ -39,6 +39,13 @@ struct journal_head { unsigned b_modified; /* + * This is a counter of the number of things who have requested write + * access to this buffer by the current transaction + * [jbd_lock_bh_state()] + */ + unsigned b_nr_access; + + /* * Copy of the buffer data frozen for writing to the log. * [jbd_lock_bh_state()] */ -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] jbd: fix assertion failure in journal_next_log_block
On Tue 05-02-08 13:59:05, Josef Bacik wrote: On Tuesday 05 February 2008 12:27:31 pm Jan Kara wrote: Hello, Sorry for replying a bit late but I'm currently falling behind in maling-list reading... The way jbd tries to determine if there is enough space left on the journal in order to start a new transaction is looking at the space left in the journal and the space needed for the committing transaction, which is 1/4 of the journal + the number of t_outstanding_credits for that transaction. In this case its assumed that t_outstanding_credits accurately represents the number of credits Yes. waiting to be written to the journal, but this sometimes isn't the case. The transaction has two counters for buffers, t_outstanding_credits which is used in conjunction with handles that are added to the transaction, and t_nr_buffers which is only incremented/decremented when buffers are added/removed from the transaction and are actually destined to be journaled. Normally these two t_nr_buffers actually represents number of buffers on BJ_Metadata list and nobody uses it (except for the assertion in __journal_temp_unlink_buffer()). t_outstanding_credits is supposed to be *the* counter making sure we don't write more than we have credits for. counters are the same, however there are cases where the committing transaction can have buffers moved to the next running transaction, for example any buffers on the committing transactions t_reserved list would be moved to the next (running) transaction, and if it had been dirtied in the process it would immediately make it onto the t_updates list, which would increment t_nr_buffers You probably mean t_buffers list here... but not t_outstanding_credits. So you get into this situation where But which moving and dirtying do you mean? The caller which dirties the buffer must make sure that he has acquired enough credits for the transaction where the buffer ends up... So if there were not enough buffers in the running transaction where we refiled the buffer it is a bug in the caller which dirties the buffer. You know now that you say that I feel like an idiot, you are right the only way for something to actually end up on that list was if somebody dirtied it and if they did it would have had to been accounted for at some point on the running transaction. Never mind :) I also made several mistakes before I learned how JBD works... t_nr_buffers (the actual number of buffers that are on the transaction) is greater than the number of buffers accounted for via t_outstanding_credits. This presents a problem since as we loop through writting buffers to the journal, we decrement t_outstanding_credits, and if t_nr_buffers is more than t_outstanding_credits then we end up with a negative number for t_outstanding_credits, which means we start saying we need less than 1/4 of the journal for our committing transaction and allow more transactions than we can handle to start, and then bam we fail because journal_next_log_block doesn't have enough free blocks in order to handle the request. This has been tested and fixes the issue (which could not be reproduced by me but several other people could get it to reproduce using postmark), and although I couldn't reproduce the assertion, I could very easily reproduce the situation where t_outstanding_credits was than t_nr_buffers. I suppose you see the assertion J_ASSERT(journal-j_free 1); to fail, right? I don't see how your patch could help avoid that assertion. You've just removed accounting of t_outstanding_credits which has no impact on the real number of free blocks in the journal stored in j_free. Anyway, if you can reproduce t_outstanding_credits t_nr_buffers, then there's something fishy. Are you able to reproduce it also with a current kernel? Thanks for looking into the problem :) Well my patch helped avoid the assertion because t_outstanding_credits was going negative therefore we were letting transactions start when we shouldn't be, and eventually we would end up with too much of the journal in use and we'd assert. Course I can't reproduce where t_outstanding_credits t_nr_buffers upstream (again I feel like an idiot, should have tested that first). Thanks for looking at this Jan. Glad to hear we don't have a bug there :) Anyway, assertion that t_outstanding_credits doesn't go below zero would be a useful assertion to add... Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Fix circular locking dependency with migrate and rm.
On Wed 06-02-08 16:30:04, Aneesh Kumar K.V wrote: We now take inode-i_mutex lock to prevent any update of the inode i_data field. Before we switch the inode format we take i_data_sem to prevent parallel read. === [ INFO: possible circular locking dependency detected ] 2.6.24-rc8 #6 --- rm/2401 is trying to acquire lock: (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108 but task is already holding lock: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #1 (jbd2_handle){--..}: [c0143a5c] __lock_acquire+0xa31/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c01fc4ca] jbd2_journal_start+0xf5/0xff [c01e3539] ext4_journal_start_sb+0x48/0x4a [c01eb980] ext4_ext_migrate+0x7d/0x535 [c01df328] ext4_ioctl+0x528/0x56c [c0177700] do_ioctl+0x50/0x67 [c017794e] vfs_ioctl+0x237/0x24a [c0177992] sys_ioctl+0x31/0x4b [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x - #0 (ei-i_data_sem){}: [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x other info that might help us debug this: 3 locks held by rm/2401: #0: (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] do_unlinkat+0x5e/0x12c #1: (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] vfs_unlink+0x36/0x89 #2: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff stack backtrace: Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6 [c0106017] show_trace_log_lvl+0x1a/0x2f [c0106893] show_trace+0x12/0x14 [c0106b89] dump_stack+0x6c/0x72 [c0141b26] print_circular_bug_tail+0x5f/0x68 [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] Add Acked-by: Jan Kara [EMAIL PROTECTED] if you wish. Honza --- fs/ext4/migrate.c | 117 +--- 1 files changed, 74 insertions(+), 43 deletions(-) diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index 9ee1f7c..8c6c685 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -61,10 +61,9 @@ static int finish_range(handle_t *handle, struct inode *inode, retval = ext4_journal_restart(handle, needed); if (retval) goto err_out; - } - if (needed) { + } else if (needed) { retval = ext4_journal_extend(handle, needed); - if (retval != 0) { + if (retval) { /* * IF not able to extend the journal restart the journal */ @@ -220,6 +219,26 @@ static int update_tind_extent_range(handle_t *handle, struct inode *inode, } +static int extend_credit_for_blkdel(handle_t *handle, struct inode *inode) +{ + int retval = 0, needed; + + if (handle-h_buffer_credits EXT4_RESERVE_TRANS_BLOCKS) + return 0; + /* + * We are freeing a blocks. During this we touch + * superblock, group descriptor and block bitmap. + * So allocate a credit of 3. We may update + * quota (user and group). + */ + needed = 3 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb); + + if (ext4_journal_extend(handle, needed) != 0) + retval = ext4_journal_restart(handle, needed); + + return retval; +} + static int free_dind_blocks(handle_t *handle, struct inode *inode, __le32 i_data) { @@ -234,11 +253,14 @@ static int free_dind_blocks(handle_t *handle, tmp_idata = (__le32 *)bh-b_data; for (i = 0; i max_entries; i++) { - if (tmp_idata[i]) + if (tmp_idata[i]) { + extend_credit_for_blkdel(handle, inode); ext4_free_blocks(handle, inode, le32_to_cpu(tmp_idata[i]), 1, 1); + } } put_bh(bh
Re: [RESEND] [PATCH] ext3,4:fdatasync should skip metadata writeout when overwriting
Hi, Currently fdatasync is identical to fsync in ext3,4. I think fdatasync should skip journal flush in data=ordered and data=writeback mode when it overwrites to already-instantiated blocks on HDD. When I_DIRTY_DATASYNC flag is not set, fdatasync should skip journal writeout because this indicates only atime or/and mtime updates. Following patch is the same approach of ext2's fsync code(ext2_sync_file). I did a performance test using the sysbench. #sysbench --num-threads=128 --max-requests=5 --test=fileio --file-total-size=128G --file-test-mode=rndwr --file-fsync-mode=fdatasync run The result was: -2.6.24 Operations performed: 0 Read, 50080 Write, 59600 Other = 109680 Total Read 0b Written 782.5Mb Total transferred 782.5Mb (12.116Mb/sec) 775.45 Requests/sec executed Test execution summary: total time: 64.5814s total number of events: 50080 total time taken by event execution: 3713.9836 per-request statistics: min:0.s avg:0.0742s max:0.9375s approx. 95 percentile: 0.2901s Threads fairness: events (avg/stddev): 391.2500/23.26 execution time (avg/stddev): 29.0155/1.99 -2.6.24-patched Operations performed: 0 Read, 50009 Write, 61596 Other = 111605 Total Read 0b Written 781.39Mb Total transferred 781.39Mb (16.419Mb/sec) 1050.83 Requests/sec executed Test execution summary: total time: 47.5900s total number of events: 50009 total time taken by event execution: 2934.5768 per-request statistics: min:0.s avg:0.0587s max:0.8938s approx. 95 percentile: 0.1993s Threads fairness: events (avg/stddev): 390.6953/22.64 execution time (avg/stddev): 22.9264/1.17 Filesystem I/O throughput was improved. Thanks. Signed-off-by :Hisashi Hifumi [EMAIL PROTECTED] Yes, the patch looks fine. You can add Acked-by: Jan Kara [EMAIL PROTECTED] if you wish. Honza diff -Nrup linux-2.6.24.org/fs/ext3/fsync.c linux-2.6.24/fs/ext3/fsync.c --- linux-2.6.24.org/fs/ext3/fsync.c 2008-01-25 07:58:37.0 +0900 +++ linux-2.6.24/fs/ext3/fsync.c 2008-02-04 12:42:42.0 +0900 @@ -72,6 +72,9 @@ int ext3_sync_file(struct file * file, s goto out; } + if (datasync !(inode-i_state I_DIRTY_DATASYNC)) + goto out; + /* * The VFS has written the file data. If the inode is unaltered * then we need not start a commit. diff -Nrup linux-2.6.24.org/fs/ext4/fsync.c linux-2.6.24/fs/ext4/fsync.c --- linux-2.6.24.org/fs/ext4/fsync.c 2008-01-25 07:58:37.0 +0900 +++ linux-2.6.24/fs/ext4/fsync.c 2008-02-04 12:43:37.0 +0900 @@ -72,6 +72,9 @@ int ext4_sync_file(struct file * file, s goto out; } +if (datasync !(inode-i_state I_DIRTY_DATASYNC)) +goto out; + /* * The VFS has written the file data. If the inode is unaltered * then we need not start a commit. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Fix DIO locking
Hi, this patch fixes lock inversion in ext4 direct IO path. The similar patch has already been accepted for ext3. Mingming, will you put it into ext4 patch queue? Thanks. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- We cannot start transaction in ext4_direct_IO() and just let it last during the whole write because dio_get_page() acquires mmap_sem which ranks above transaction start (e.g. because we have dependency chain mmap_sem-PageLock-journal_start, or because we update atime while holding mmap_sem) and thus deadlocks could happen. We solve the problem by starting a transaction separately for each ext4_get_block() call. We *could* have a problem that we allocate a block and before its data are written out the machine crashes and thus we expose stale data. But that does not happen because for hole-filling generic code falls back to buffered writes and for file extension, we add inode to orphan list and thus in case of crash, journal replay will truncate inode back to the original size. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index bb717cb..1948b97 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -895,7 +895,16 @@ out: return err; } -#define DIO_CREDITS (EXT4_RESERVE_TRANS_BLOCKS + 32) +/* Maximum number of blocks we map for direct IO at once. */ +#define DIO_MAX_BLOCKS 4096 +/* + * Number of credits we need for writing DIO_MAX_BLOCKS: + * We need sb + group descriptor + bitmap + inode - 4 + * For B blocks with A block pointers per block we need: + * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect). + * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25. + */ +#define DIO_CREDITS 25 int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, unsigned long max_blocks, struct buffer_head *bh, @@ -942,49 +951,31 @@ static int ext4_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) { handle_t *handle = ext4_journal_current_handle(); - int ret = 0; + int ret = 0, started = 0; unsigned max_blocks = bh_result-b_size inode-i_blkbits; - if (!create) - goto get_block; /* A read */ - - if (max_blocks == 1) - goto get_block; /* A single block get */ - - if (handle-h_transaction-t_state == T_LOCKED) { - /* -* Huge direct-io writes can hold off commits for long -* periods of time. Let this commit run. -*/ - ext4_journal_stop(handle); - handle = ext4_journal_start(inode, DIO_CREDITS); - if (IS_ERR(handle)) + + if (create !handle) {/* Direct IO write... */ + if (max_blocks DIO_MAX_BLOCKS) + max_blocks = DIO_MAX_BLOCKS; + handle = ext4_journal_start(inode, DIO_CREDITS + + 2 * EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb)); + if (IS_ERR(handle)) { ret = PTR_ERR(handle); - goto get_block; - } - - if (handle-h_buffer_credits = EXT4_RESERVE_TRANS_BLOCKS) { - /* -* Getting low on buffer credits... -*/ - ret = ext4_journal_extend(handle, DIO_CREDITS); - if (ret 0) { - /* -* Couldn't extend the transaction. Start a new one. -*/ - ret = ext4_journal_restart(handle, DIO_CREDITS); + goto out; } + started = 1; } -get_block: - if (ret == 0) { - ret = ext4_get_blocks_wrap(handle, inode, iblock, + ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks, bh_result, create, 0); - if (ret 0) { - bh_result-b_size = (ret inode-i_blkbits); - ret = 0; - } + if (ret 0) { + bh_result-b_size = (ret inode-i_blkbits); + ret = 0; } + if (started) + ext4_journal_stop(handle); +out: return ret; } @@ -1674,7 +1665,8 @@ static int ext4_releasepage(struct page *page, gfp_t wait) * if the machine crashes during the write. * * If the O_DIRECT write is intantiating holes inside i_size and the machine - * crashes then stale disk data _may_ be exposed inside the file. + * crashes then stale disk data _may_ be exposed inside the file. But current + * VFS code falls back into buffered path in that case so we are safe. */ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, @@ -1683,7 +1675,7 @@ static
Re: jbd2_handle and i_data_sem circular locking dependency detected
On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote: On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote: Hi, On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote: This is with the new ext3 - ext4 migrate code added. The recently added lockdep for jbd2 helped to find this out. We want to hold the i_data_sem on the ext3 inode during migration to prevent walking the ext3 inode when it is being converted to ext4 format. Also we want to avoid file truncation and new blocks being added while converting to ext4. Also we dont want to reserve large number of credits for journal. Any idea how to fix this ? Hmm, while briefly looking at the code - why do you introduce i_data_sem and not use i_alloc_sem which is already in VFS inode? That is aimed exactly at the serialization of truncates, writes and similar users. That doesn't solve problems with lock ordering but I was just wondering... Another problem - ext4_fallocate() has the same lock ordering problem as the migration code and maybe there are others... One (stupid) solution to your problem is to make i_data_sem be always locked before the transaction is started. It could possibly have negative performance impact because you'd have to hold the semaphore for a longer time and thus a writer would block readers for longer time. So one would have to measure how big difference that would make. Another possibility is to start a single transaction for migration and extend it as long as you can (as truncate does it). And when you can't extend any more, you drop the i_data_sem and start a new transaction and acquire the semaphore again. This has the disadvantage that after dropping the semaphore you have to resync your original inode with the temporary one your are building which probably ends up being ugly as night... Hmm, but maybe we could get rid of this - hold i_mutex to protect against all writes (that ranks outside of transaction start so you can hold it for the whole migration time - maybe you even hold it if you are called from the write path...). After dropping i_data_sem you let some readers proceed but writers still wait on i_mutex so the file shouldn't change under you (but I suggest adding some BUG_ONs to verify that the file really doesn't change :). How about the patch below. I did the below testing a) migrate a file b) run fs_inode fsstres fsx_linux. The intention was to find out whether the new locking is breaking any of the other expected hierarchy. It seems to fine. I didn't get any lockdep warning. I think there's a problem in the patch. I don't think you can call free_ind_block() while readers are accessing the file. That could make them think the file contains holes when they aren't there (or even worse read freed blocks or so). So you need to lock i_data_sem before this call (that means you have to move journal_extend() as well). Actually, I don't quite get why ext4_journal_extend() is called in that function. You can just count with the 1 credit in ext4_ext_migrate() when you call ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh, probably because free_ind_block() could extend the transaction (which they don't do now as far as I can see). BTW: That freeing code should really take into account that it can modify bitmaps in different block groups. It's even not that hard to do. Just before each ext4_free_blocks() in free_ind_block() you check whether you have still enough credits in the handle (use h_buffer_credits) and if not, extend it by some amount. Maybe you could do some test like writing a big file with some data and then while migration is running read it in a loop and compare that MD5SUM is the same all the time. Also run some memory-pressure during this test so that data blocks aren't cached for the whole time of the test... That should reasonably stress the migration code. ext4: Fix circular locking dependency with migrate and rm. From: Aneesh Kumar K.V [EMAIL PROTECTED] We now take inode-i_mutex lock to prevent any update of the inode i_data field. Before we switch the inode format we take i_data_sem to prevent parallel read. === [ INFO: possible circular locking dependency detected ] 2.6.24-rc8 #6 --- rm/2401 is trying to acquire lock: (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108 but task is already holding lock: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #1 (jbd2_handle){--..}: [c0143a5c] __lock_acquire+0xa31/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c01fc4ca] jbd2_journal_start+0xf5/0xff [c01e3539] ext4_journal_start_sb+0x48/0x4a [c01eb980] ext4_ext_migrate+0x7d/0x535 [c01df328
Re: jbd2_handle and i_data_sem circular locking dependency detected
On Tue 05-02-08 21:57:03, Aneesh Kumar K.V wrote: On Tue, Feb 05, 2008 at 02:42:28PM +0100, Jan Kara wrote: On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote: How about the patch below. I did the below testing a) migrate a file b) run fs_inode fsstres fsx_linux. The intention was to find out whether the new locking is breaking any of the other expected hierarchy. It seems to fine. I didn't get any lockdep warning. I think there's a problem in the patch. I don't think you can call free_ind_block() while readers are accessing the file. That could make them think the file contains holes when they aren't there (or even worse read freed blocks or so). So you need to lock i_data_sem before this call (that means you have to move journal_extend() as well). Actually, I don't quite get why ext4_journal_extend() is called in that function. You can just count with the 1 credit in ext4_ext_migrate() when you call ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh, probably because free_ind_block() could extend the transaction (which they don't do now as far as I can see). ext4_journal_extend is called to extend the journal by one credit to take care of writing the block containing inode. I moved the journal extend before taking i_data_sem lock. diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index f97c993..dc6617a 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -302,10 +302,6 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode, struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode); - retval = free_ind_block(handle, inode); - if (retval) - goto err_out; - /* * One credit accounted for writing the * i_data field of the original inode @@ -318,6 +314,10 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode, } down_write(EXT4_I(inode)-i_data_sem); + retval = free_ind_block(handle, inode); + if (retval) + goto err_out; + /* * We have the extent map build with the tmp inode. * Now copy the i_data across The above change also make sure we don't fail after we free the indirect blocks. Yeah, now it looks fine. BTW: That freeing code should really take into account that it can modify bitmaps in different block groups. It's even not that hard to do. Just before each ext4_free_blocks() in free_ind_block() you check whether you have still enough credits in the handle (use h_buffer_credits) and if not, extend it by some amount. I have a FIXME at migrate.c:524 documenting exactly that. The difficult question was by how much we should extent the journal. ? But in reality we might have accumulated enough journal credits, I never really ran across a case where we are running out of the journal credit. Yes, I don't think it is likely to happen in reality but if somebody would trigger this, it would be almost impossible to track down so one should be quite careful with these things... And as I described, doing it failsafe is easy - just look how try_to_extend_transaction() in ext4/inode.c handles similar problems with truncate. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] jbd: fix assertion failure in journal_next_log_block
Hello, Sorry for replying a bit late but I'm currently falling behind in maling-list reading... The way jbd tries to determine if there is enough space left on the journal in order to start a new transaction is looking at the space left in the journal and the space needed for the committing transaction, which is 1/4 of the journal + the number of t_outstanding_credits for that transaction. In this case its assumed that t_outstanding_credits accurately represents the number of credits Yes. waiting to be written to the journal, but this sometimes isn't the case. The transaction has two counters for buffers, t_outstanding_credits which is used in conjunction with handles that are added to the transaction, and t_nr_buffers which is only incremented/decremented when buffers are added/removed from the transaction and are actually destined to be journaled. Normally these two t_nr_buffers actually represents number of buffers on BJ_Metadata list and nobody uses it (except for the assertion in __journal_temp_unlink_buffer()). t_outstanding_credits is supposed to be *the* counter making sure we don't write more than we have credits for. counters are the same, however there are cases where the committing transaction can have buffers moved to the next running transaction, for example any buffers on the committing transactions t_reserved list would be moved to the next (running) transaction, and if it had been dirtied in the process it would immediately make it onto the t_updates list, which would increment t_nr_buffers You probably mean t_buffers list here... but not t_outstanding_credits. So you get into this situation where But which moving and dirtying do you mean? The caller which dirties the buffer must make sure that he has acquired enough credits for the transaction where the buffer ends up... So if there were not enough buffers in the running transaction where we refiled the buffer it is a bug in the caller which dirties the buffer. t_nr_buffers (the actual number of buffers that are on the transaction) is greater than the number of buffers accounted for via t_outstanding_credits. This presents a problem since as we loop through writting buffers to the journal, we decrement t_outstanding_credits, and if t_nr_buffers is more than t_outstanding_credits then we end up with a negative number for t_outstanding_credits, which means we start saying we need less than 1/4 of the journal for our committing transaction and allow more transactions than we can handle to start, and then bam we fail because journal_next_log_block doesn't have enough free blocks in order to handle the request. This has been tested and fixes the issue (which could not be reproduced by me but several other people could get it to reproduce using postmark), and although I couldn't reproduce the assertion, I could very easily reproduce the situation where t_outstanding_credits was than t_nr_buffers. I suppose you see the assertion J_ASSERT(journal-j_free 1); to fail, right? I don't see how your patch could help avoid that assertion. You've just removed accounting of t_outstanding_credits which has no impact on the real number of free blocks in the journal stored in j_free. Anyway, if you can reproduce t_outstanding_credits t_nr_buffers, then there's something fishy. Are you able to reproduce it also with a current kernel? Thanks for looking into the problem :) Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jbd2_handle and i_data_sem circular locking dependency detected
On Mon 04-02-08 22:42:08, Aneesh Kumar K.V wrote: On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote: Hi, On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote: This is with the new ext3 - ext4 migrate code added. The recently added lockdep for jbd2 helped to find this out. We want to hold the i_data_sem on the ext3 inode during migration to prevent walking the ext3 inode when it is being converted to ext4 format. Also we want to avoid file truncation and new blocks being added while converting to ext4. Also we dont want to reserve large number of credits for journal. Any idea how to fix this ? Hmm, while briefly looking at the code - why do you introduce i_data_sem and not use i_alloc_sem which is already in VFS inode? That is aimed exactly at the serialization of truncates, writes and similar users. How about read ? We are changing the format of inode. We don't want even the read to go through. I just meant that the code could use i_alloc_sem instead of i_data_sem in all the places, remove i_data_sem and you safe some memory... It's just a suggestion for a cleanup. One (stupid) solution to your problem is to make i_data_sem be always locked before the transaction is started. It could possibly have negative performance impact because you'd have to hold the semaphore for a longer time and thus a writer would block readers for longer time. So one would have to measure how big difference that would make. Another possibility is to start a single transaction for migration and extend it as long as you can (as truncate does it). And when you can't extend any more, you drop the i_data_sem and start a new transaction and acquire the semaphore again. This has the disadvantage that after dropping the semaphore you have to resync your original inode with the temporary one your are building which probably ends up being ugly as night... Hmm, but maybe we could get rid of this - hold i_mutex to protect against all writes (that ranks outside of transaction start so you can hold it for the whole migration time - maybe you even hold it if you are called from the write path...). After dropping i_data_sem you let some readers proceed but writers still wait on i_mutex so the file shouldn't change under you (but I suggest adding some BUG_ONs to verify that the file really doesn't change :). A quick look says truncate can happen even when we hold i_mutex ?? No, it shouldn't happen - do_truncate() in fs/open.c acquires i_mutex before it calls notify_change(). Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jbd2_handle and i_data_sem circular locking dependency detected
Hi, On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote: This is with the new ext3 - ext4 migrate code added. The recently added lockdep for jbd2 helped to find this out. We want to hold the i_data_sem on the ext3 inode during migration to prevent walking the ext3 inode when it is being converted to ext4 format. Also we want to avoid file truncation and new blocks being added while converting to ext4. Also we dont want to reserve large number of credits for journal. Any idea how to fix this ? Hmm, while briefly looking at the code - why do you introduce i_data_sem and not use i_alloc_sem which is already in VFS inode? That is aimed exactly at the serialization of truncates, writes and similar users. That doesn't solve problems with lock ordering but I was just wondering... Another problem - ext4_fallocate() has the same lock ordering problem as the migration code and maybe there are others... One (stupid) solution to your problem is to make i_data_sem be always locked before the transaction is started. It could possibly have negative performance impact because you'd have to hold the semaphore for a longer time and thus a writer would block readers for longer time. So one would have to measure how big difference that would make. Another possibility is to start a single transaction for migration and extend it as long as you can (as truncate does it). And when you can't extend any more, you drop the i_data_sem and start a new transaction and acquire the semaphore again. This has the disadvantage that after dropping the semaphore you have to resync your original inode with the temporary one your are building which probably ends up being ugly as night... Hmm, but maybe we could get rid of this - hold i_mutex to protect against all writes (that ranks outside of transaction start so you can hold it for the whole migration time - maybe you even hold it if you are called from the write path...). After dropping i_data_sem you let some readers proceed but writers still wait on i_mutex so the file shouldn't change under you (but I suggest adding some BUG_ONs to verify that the file really doesn't change :). Honza === [ INFO: possible circular locking dependency detected ] 2.6.24-rc8 #6 --- rm/2401 is trying to acquire lock: (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108 but task is already holding lock: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #1 (jbd2_handle){--..}: [c0143a5c] __lock_acquire+0xa31/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c01fc4ca] jbd2_journal_start+0xf5/0xff [c01e3539] ext4_journal_start_sb+0x48/0x4a [c01eb980] ext4_ext_migrate+0x7d/0x535 [c01df328] ext4_ioctl+0x528/0x56c [c0177700] do_ioctl+0x50/0x67 [c017794e] vfs_ioctl+0x237/0x24a [c0177992] sys_ioctl+0x31/0x4b [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x - #0 (ei-i_data_sem){}: [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x other info that might help us debug this: 3 locks held by rm/2401: #0: (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] do_unlinkat+0x5e/0x12c #1: (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] vfs_unlink+0x36/0x89 #2: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff stack backtrace: Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6 [c0106017] show_trace_log_lvl+0x1a/0x2f [c0106893] show_trace+0x12/0x14 [c0106b89] dump_stack+0x6c/0x72 [c0141b26] print_circular_bug_tail+0x5f/0x68 [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND] jbd: Remove useless loop in journal_write_commit_record()
Hi, resending the patch just in case you've missed it. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Commit block was intended to have several copies of the header. But due to a bug it never had them and actually, nobody checks that. So just remove the useless loop. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 610264b..b54948f 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -104,7 +104,8 @@ static int journal_write_commit_record(journal_t *journal, { struct journal_head *descriptor; struct buffer_head *bh; - int i, ret; + journal_header_t *header; + int ret; int barrier_done = 0; if (is_journal_aborted(journal)) @@ -116,13 +117,10 @@ static int journal_write_commit_record(journal_t *journal, bh = jh2bh(descriptor); - /* AKPM: buglet - add `i' to tmp! */ - for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; - tmp-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); - tmp-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK); - tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); - } + header = (journal_header_t*)(bh-b_data); + header-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); + header-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK); + header-h_sequence = cpu_to_be32(commit_transaction-t_tid); JBUFFER_TRACE(descriptor, write commit block); set_buffer_dirty(bh); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH, RFC] Add new development flag to the ext4 filesystem
On Fri 25-01-08 03:50:04, Andreas Dilger wrote: On Jan 25, 2008 11:05 +0100, Jan Kara wrote: For example ext2 on fsync() just sync's a single inode (and has to use private_list to track metadata buffers associated with the inode) while ext3 flushes the whole journal. As for fsync(), we definitely need to preserve correct behaviour for the file itself, but there isn't a requirement that ext2 behave exactly like ext3 (it of course cannot). In the proposed ext4-mount-unjournaled-ext2 case, the superblock would be marked dirty as it is today and an e2fsck would need to be run at boot time. That is fine so long as the fsync() will cause the one file's data to be on disk before it returns. Well, you have to also make sure that all indirect blocks are on disk before fsync() returns. Otherwise there's not much point in the fact that data itself reached the disk. And for that you need something like private_list. In ext2, directory handling code is quite different. ext2 works in page cache of the directory while ext3 uses page cache of the underlying device via buffer heads - at least this second thing would be more or less mechanical thing to change and would make sence (we wouldn't have to reimplement readahead in ext3 directory handling code as we do now). I've looked at it once but then more urgent things came and ... you know it. I don't think it is a requirement that ext3 mounting a filesystem without a journal has to use page cache for directories. I wouldn't object to that being fixed. It definitely isn't a requirement for this to work, just an implementation difference. Yes, of course. I just wanted to point out that ext2 isn't a strict subset of ext3 so there is some non-trivial work to be done before you can safely mount ext2 as ext3-without-journal. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] jbd: Remove useless loop when writing commit record
Hi Andrew, here's the patch I wrote you about. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Commit block was intended to have several copies of the header. But due to a bug it never had them and actually, nobody checks that. So just remove the useless loop. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 610264b..b54948f 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -104,7 +104,8 @@ static int journal_write_commit_record(journal_t *journal, { struct journal_head *descriptor; struct buffer_head *bh; - int i, ret; + journal_header_t *header; + int ret; int barrier_done = 0; if (is_journal_aborted(journal)) @@ -116,13 +117,10 @@ static int journal_write_commit_record(journal_t *journal, bh = jh2bh(descriptor); - /* AKPM: buglet - add `i' to tmp! */ - for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; - tmp-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); - tmp-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK); - tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); - } + header = (journal_header_t*)(bh-b_data); + header-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); + header-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK); + header-h_sequence = cpu_to_be32(commit_transaction-t_tid); JBUFFER_TRACE(descriptor, write commit block); set_buffer_dirty(bh); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix commit block write in JBD
On Sat 26-01-08 22:02:07, Andrew Morton wrote: On Wed, 23 Jan 2008 20:09:43 +0100 Jan Kara [EMAIL PROTECTED] wrote: Commit block is expected to have several copies of the header. Fix the bug Andrew has spotted ages ago. ages indeed. diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 610264b..a69b240 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -116,9 +116,8 @@ static int journal_write_commit_record(journal_t *journal, bh = jh2bh(descriptor); - /* AKPM: buglet - add `i' to tmp! */ for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; + journal_header_t *tmp = (journal_header_t*)(bh-b_data+i); tmp-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); tmp-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK); tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); But I don't think we can _use_ this feature now. Because there are 1000 disks out there which didn't implement it. So why not just remove the loop and do a single write? Yes, but OTOH once the filesystem gets mounted with a new kernel, the journal gets quickly rewritten and we'll have correct commit blocks there. But since neither kernel nor e2fsprogs actually check for further sectors (they check for the header just in the beginning of a block), I agree that removing the loop completely is probably the best option. Nobody cared so far so I guess they won't care in future as well. I'll send you a replacement patch. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext3: Fix lock inversion in direct IO
Hi Andrew, the patch below fixes a lock inversion which someone reported recently. More details below in the changelog. Can you merge the patch please? I'll also write a similar patch for ext4 once we agree this is the way to go... Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- We cannot start transaction in ext3_direct_IO() and just let it last during the whole write because dio_get_page() acquires mmap_sem which ranks above transaction start (e.g. because we have dependency chain mmap_sem-PageLock-journal_start, or because we update atime while holding mmap_sem) and thus deadlocks could happen. We solve the problem by starting a transaction separately for each ext3_get_block() call. We *could* have a problem that we allocate a block and before its data are written out the machine crashes and thus we expose stale data. But that does not happen because for hole-filling generic code falls back to buffered writes and for file extension, we add inode to orphan list and thus in case of crash, journal replay will truncate inode back to the original size. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 9b162cd..5ab7c57 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -941,55 +941,45 @@ out: return err; } -#define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32) +/* Maximum number of blocks we map for direct IO at once. */ +#define DIO_MAX_BLOCKS 4096 +/* + * Number of credits we need for writing DIO_MAX_BLOCKS: + * We need sb + group descriptor + bitmap + inode - 4 + * For B blocks with A block pointers per block we need: + * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect). + * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25. + */ +#define DIO_CREDITS 25 static int ext3_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) { handle_t *handle = ext3_journal_current_handle(); - int ret = 0; + int ret = 0, started = 0; unsigned max_blocks = bh_result-b_size inode-i_blkbits; - if (!create) - goto get_block; /* A read */ - - if (max_blocks == 1) - goto get_block; /* A single block get */ - - if (handle-h_transaction-t_state == T_LOCKED) { - /* -* Huge direct-io writes can hold off commits for long -* periods of time. Let this commit run. -*/ - ext3_journal_stop(handle); - handle = ext3_journal_start(inode, DIO_CREDITS); - if (IS_ERR(handle)) + if (create !handle) {/* Direct IO write... */ + if (max_blocks DIO_MAX_BLOCKS) + max_blocks = DIO_MAX_BLOCKS; + handle = ext3_journal_start(inode, DIO_CREDITS + + 2 * EXT3_QUOTA_TRANS_BLOCKS(sb)); + if (IS_ERR(handle)) { ret = PTR_ERR(handle); - goto get_block; - } - - if (handle-h_buffer_credits = EXT3_RESERVE_TRANS_BLOCKS) { - /* -* Getting low on buffer credits... -*/ - ret = ext3_journal_extend(handle, DIO_CREDITS); - if (ret 0) { - /* -* Couldn't extend the transaction. Start a new one. -*/ - ret = ext3_journal_restart(handle, DIO_CREDITS); + goto out; } + started = 1; } -get_block: - if (ret == 0) { - ret = ext3_get_blocks_handle(handle, inode, iblock, + ret = ext3_get_blocks_handle(handle, inode, iblock, max_blocks, bh_result, create, 0); - if (ret 0) { - bh_result-b_size = (ret inode-i_blkbits); - ret = 0; - } + if (ret 0) { + bh_result-b_size = (ret inode-i_blkbits); + ret = 0; } + if (started) + ext3_journal_stop(handle); +out: return ret; } @@ -1680,7 +1670,8 @@ static int ext3_releasepage(struct page *page, gfp_t wait) * if the machine crashes during the write. * * If the O_DIRECT write is intantiating holes inside i_size and the machine - * crashes then stale disk data _may_ be exposed inside the file. + * crashes then stale disk data _may_ be exposed inside the file. But current + * VFS code falls back into buffered path in that case so we are safe. */ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, @@ -1689,7 +1680,7 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb, struct file *file = iocb-ki_filp; struct inode *inode
[PATCH] Fix commit block write in JBD
Hi, the patch below fixes preparation of commit block in journal_write_commit_record(). Obviously the bug doesn't really matter since nobody reported it so far but let's cleanup the code... Andrew, could you please queue it up? Thanks. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Commit block is expected to have several copies of the header. Fix the bug Andrew has spotted ages ago. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 610264b..a69b240 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -116,9 +116,8 @@ static int journal_write_commit_record(journal_t *journal, bh = jh2bh(descriptor); - /* AKPM: buglet - add `i' to tmp! */ for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; + journal_header_t *tmp = (journal_header_t*)(bh-b_data+i); tmp-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); tmp-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK); tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Fix commit block write in JBD2
On Wed 23-01-08 20:09:43, Jan Kara wrote: Hi, the patch below fixes preparation of commit block in journal_write_commit_record(). Obviously the bug doesn't really matter since nobody reported it so far but let's cleanup the code... Andrew, could you please queue it up? Thanks. And the same fix for JBD2. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Commit block is expected to have several copies of the header. Fix the bug Andrew has spotted ages ago. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 6986f33..ed61283 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -116,9 +116,8 @@ static int journal_write_commit_record(journal_t *journal, bh = jh2bh(descriptor); - /* AKPM: buglet - add `i' to tmp! */ for (i = 0; i bh-b_size; i += 512) { - journal_header_t *tmp = (journal_header_t*)bh-b_data; + journal_header_t *tmp = (journal_header_t*)(bh-b_data+i); tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Confused by journaling code in ext3_new_blocks()
Hello, I have the following question related to the journaling code in ext3_new_blocks() function of fs/ext3/balloc.c, any help in this regard will be greatly appreciated. The code snippet below is taken from 2.6.24-rc8-mm1 but this code has changed little for a long time so it's likely that any other recent kernel will also have very similar code. ext3_new_blocks() - ext3_try_to_allocate_with_rsv() - ext3_journal_dirty_metadata(handle, bitmap_bh) ... allocated: if ( /*sanity checking of newly allocated block numbers*/) { ext3_error(...); goto out; } out: ... brelse(bitmap_bh); return 0; In the above code snippet, ext3_try_to_allocate_with_rsv() is marking bitmap_bh as dirty metadata _before_ the sanity checks. I'm not worried about the sanity checks as such, but it seems to me that if the checks were to fail, we'd be committing the bitmap_bh updates to disk even though it violates metadata sanity, or am I missing something here ? Actually, it depends. Because if sanity checks fail, we call ext3_error() which remounts fs R/O or panics in most cases. Only if errors=continue is set, we really get to committing the wrong bitmap data. But I don't think this really is a problem - fsck has to be run anyway and it rebuilds the bitmaps from scratch. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix commit block write in JBD2
Hi, On Thu 24-01-08 02:48:41, Girish Shilamkar wrote: The loop was removed in journal checksum patch. There had been a discussion (11 Jul 2007: [EXT4 set 8][PATCH 1/1]Add journal checksums) about this part of code as checksum info was added to commit block. Ah, OK, thanks for explanation. So for JBD2 we don't need the patch anymore. But for JBD it's still needed. Honza On Wed, 2008-01-23 at 20:10 +0100, Jan Kara wrote: On Wed 23-01-08 20:09:43, Jan Kara wrote: Hi, the patch below fixes preparation of commit block in journal_write_commit_record(). Obviously the bug doesn't really matter since nobody reported it so far but let's cleanup the code... Andrew, could you please queue it up? Thanks. And the same fix for JBD2. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH resend] Fix reading of bitmaps from filesystem image
On Mon 14-01-08 10:57:18, Theodore Tso wrote: On Wed, Jan 09, 2008 at 08:54:35PM +0100, Jan Kara wrote: Reading of bitmaps from image file could never work with more than one group in a filesystem... Fix the loops so that we read appropriate number of blocks. OK, so I'm probably being dense, but what's the problem? You changed the loop from counting in bytes to inodes, but either method should work. At the beginning of read_bitmap() there is: unsigned int block_nbytes = EXT2_BLOCKS_PER_GROUP(fs-super) / 8; unsigned inode_nbytes = EXT2_INODES_PER_GROUP(fs-super) / 8; Therefore in the loop: while (inode_nbytes 0) we end up reading bitmap for just one group... Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Fix the soft lockup with multi block allocator.
On Wed 09-01-08 23:54:28, Aneesh Kumar K.V wrote: On Wed, Jan 09, 2008 at 01:10:41PM +0100, Jan Kara wrote: With the multi block allocator when we don't have prealloc space we discard @@ -3790,7 +3782,9 @@ repeat: /* if we still need more blocks and some PAs were used, try again */ if (free needed busy) { + busy = 0; ext4_unlock_group(sb, group); + schedule_timeout(HZ); goto repeat; } Hmm, wouldn't just schedule() be enough here? That would give a good chance to other processes to proceed and we would avoid this artificial wait of 1s which is quite ugly IMO. But then who will wake up the task ?. I have the below comment added to the patch in the patch queue. As far as I know, you don't have to wake-up the task explicitely. Scheduler will simply schedule the task sometime in future (it is a similar situation as if the task got preempted in the kernel). /* * We see this quiet rare. But if a particular workload is * effected by this we may need to add a waitqueue */ Yes, adding that comment is good in any case :). Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend] Fix reading of bitmaps from filesystem image
Hi Ted, I've sent you the attached fix some time ago but it didn't seem to get merged yet... Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Subject: Fix reading of bitmaps from filesystem image Reading of bitmaps from image file could never work with more than one group in a filesystem... Fix the loops so that we read appropriate number of blocks. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/lib/ext2fs/rw_bitmaps.c b/lib/ext2fs/rw_bitmaps.c index 1897ec3..abbe5e8 100644 --- a/lib/ext2fs/rw_bitmaps.c +++ b/lib/ext2fs/rw_bitmaps.c @@ -190,7 +190,7 @@ static errcode_t read_bitmaps(ext2_filsy if (fs-flags EXT2_FLAG_IMAGE_FILE) { blk = (fs-image_header-offset_inodemap / fs-blocksize); ino_cnt = fs-super-s_inodes_count; - while (inode_nbytes 0) { + while (do_inode ino_cnt 0) { retval = io_channel_read_blk(fs-image_io, blk++, 1, inode_bitmap); if (retval) @@ -202,15 +202,14 @@ static errcode_t read_bitmaps(ext2_filsy ino_itr, cnt, inode_bitmap); if (retval) goto cleanup; - ino_itr += fs-blocksize 3; - ino_cnt -= fs-blocksize 3; - inode_nbytes -= fs-blocksize; + ino_itr += cnt; + ino_cnt -= cnt; } blk = (fs-image_header-offset_blockmap / fs-blocksize); blk_cnt = EXT2_BLOCKS_PER_GROUP(fs-super) * fs-group_desc_count; - while (block_nbytes 0) { + while (do_block blk_cnt 0) { retval = io_channel_read_blk(fs-image_io, blk++, 1, block_bitmap); if (retval) @@ -222,9 +221,8 @@ static errcode_t read_bitmaps(ext2_filsy blk_itr, cnt, block_bitmap); if (retval) goto cleanup; - blk_itr += fs-blocksize 3; - blk_cnt -= fs-blocksize 3; - block_nbytes -= fs-blocksize; + blk_itr += cnt; + blk_cnt -= cnt; } goto success_cleanup; } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend] mke2fs: Make lost+found always have at least 2 blocks
Hi, this is a resend of Andreas' patch I've sent in the beginning of December. The patch modifies mke2fs to always create lost+found with at least two directory blocks. I think this could make sence not only for testing but as a sanity check that 64KB support works for general e2fsprogs if someone decides to use it... Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- From: Andreas Dilger [EMAIL PROTECTED] Make sure lost+found has always at least 2 disk blocks. This will provide at least elementary test that we have not screwed-up support for 64KB blocks since the second directory block will be empty. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/misc/mke2fs.c b/misc/mke2fs.c index 98a4957..6249cc2 100644 --- a/misc/mke2fs.c +++ b/misc/mke2fs.c @@ -534,7 +534,10 @@ static void create_lost_and_found(ext2_f } for (i=1; i EXT2_NDIR_BLOCKS; i++) { - if ((lpf_size += fs-blocksize) = 16*1024) + /* Ensure that lost+found is at least 2 blocks, so we always +* test large empty blocks for big-block filesystems. */ + if ((lpf_size += fs-blocksize) = 16*1024 + lpf_size = 2 * fs-blocksize) break; retval = ext2fs_expand_dir(fs, ino); if (retval) { - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered
. I didn't try with other journaling filesystems. I guess ext2 also doesn't exhibit the problem. Interesting that data=writeback helped this. You don't give a lot of details, but I assume that data=writeback made a large difference here? What I think could be happening: One process does data-intensive load. Thus in the ordered mode the transaction is tiny but has tons of data buffers attached. If commit happens, it takes a long time to sync all the data before the commit can proceed... In the writeback mode, we don't wait for data buffers, in the journal mode amount of data to be written is really limited by the maximum size of a transaction and so we write by much smaller chunks and better latency is thus ensured. I'll try some tests to verify whether my theory is correct :). Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered
(the program should have to access files, such as cookies, cache and so for konqueror). Then remount/tune the filesystem to use another data mode for ext3. I didn't try with other journaling filesystems. I guess ext2 also doesn't exhibit the problem. Interesting that data=writeback helped this. You don't give a lot of details, but I assume that data=writeback made a large difference here? What I think could be happening: One process does data-intensive load. Thus in the ordered mode the transaction is tiny but has tons of data buffers attached. If commit happens, it takes a long time to sync all the data before the commit can proceed... In the writeback mode, we don't wait for data buffers, in the journal mode amount of data to be written is really limited by the maximum size of a transaction and so we write by much smaller chunks and better latency is thus ensured. I'll try some tests to verify whether my theory is correct :). Hmm, I couldn't reproduce the bad behavior... But anyway, if my theory is correct, attached patch should help - can you try it please? It's a debugging quality only but shouldn't do anything bad to your data :) Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs Don't allow too much data buffers in a transaction. diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index 08ff6c7..e6f9dd6 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -163,7 +163,7 @@ repeat_locked: spin_lock(transaction-t_handle_lock); needed = transaction-t_outstanding_credits + nblocks; - if (needed journal-j_max_transaction_buffers) { + if (needed journal-j_max_transaction_buffers || atomic_read(transaction-t_data_buf_count) 32768) { /* * If the current transaction is already too large, then start * to commit it: we can then go back and attach this handle to @@ -1528,6 +1528,7 @@ static void __journal_temp_unlink_buffer(struct journal_head *jh) return; case BJ_SyncData: list = transaction-t_sync_datalist; + atomic_dec(transaction-t_data_buf_count); break; case BJ_Metadata: transaction-t_nr_buffers--; @@ -1989,6 +1990,7 @@ void __journal_file_buffer(struct journal_head *jh, return; case BJ_SyncData: list = transaction-t_sync_datalist; + atomic_inc(transaction-t_data_buf_count); break; case BJ_Metadata: transaction-t_nr_buffers++; diff --git a/include/linux/jbd.h b/include/linux/jbd.h index d9ecd13..6dd284a 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -541,6 +541,12 @@ struct transaction_s int t_outstanding_credits; /* + * Number of data buffers on t_sync_datalist attached to + * the transaction. + */ + atomic_t t_data_buf_count; + + /* * Forward and backward links for the circular list of all transactions * awaiting checkpoint. [j_list_lock] */
[PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize
Hi, attached is a new version of support for 64KB blocksize in e2fsprogs. The patch went through testing by a script I'll send in the following email so now the modifications should be correct. Ted, can you have a look at it when you have time? Thanks. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Subject: Support for 64KB blocksize in ext2-4 directories. When block size is 64KB, we have to take care that rec_len does not overflow. Kernel stores 0x in case 0x1 should be stored - perform appropriate conversion when processing directories. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/debugfs/htree.c b/debugfs/htree.c index d0e673e..a326241 100644 --- a/debugfs/htree.c +++ b/debugfs/htree.c @@ -40,6 +40,7 @@ static void htree_dump_leaf_node(ext2_fi blk_t pblk; ext2_dirhash_t hash; int hash_alg; + int rec_len; errcode = ext2fs_bmap(fs, ino, inode, buf, 0, blk, pblk); if (errcode) { @@ -61,10 +62,8 @@ static void htree_dump_leaf_node(ext2_fi while (offset fs-blocksize) { dirent = (struct ext2_dir_entry *) (buf + offset); - if (((offset + dirent-rec_len) fs-blocksize) || - (dirent-rec_len 8) || - ((dirent-rec_len % 4) != 0) || - (((dirent-name_len 0xFF)+8) dirent-rec_len)) { + rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); + if (ext2fs_validate_dirent(fs, offset, dirent) 0) { fprintf(pager, Corrupted directory block (%u)!\n, blk); break; } @@ -79,7 +78,7 @@ static void htree_dump_leaf_node(ext2_fi com_err(htree_dump_leaf_node, errcode, while calculating hash); sprintf(tmp, %u 0x%08x (%d) %s , dirent-inode, - hash, dirent-rec_len, name); + hash, rec_len, name); thislen = strlen(tmp); if (col + thislen 80) { fprintf(pager, \n); @@ -87,7 +86,7 @@ static void htree_dump_leaf_node(ext2_fi } fprintf(pager, %s, tmp); col += thislen; - offset += dirent-rec_len; + offset += rec_len; } fprintf(pager, \n); } @@ -389,7 +388,7 @@ static int search_dir_block(ext2_filsys printf(offset %u\n, offset); return BLOCK_ABORT; } - offset += dirent-rec_len; + offset += ext2fs_rec_len_from_disk(dirent-rec_len); } return 0; } diff --git a/debugfs/ls.c b/debugfs/ls.c index 52c7e34..1960c11 100644 --- a/debugfs/ls.c +++ b/debugfs/ls.c @@ -97,7 +97,7 @@ static int list_dir_proc(ext2_ino_t dir fprintf (ls-f, %s %s\n, datestr, name); } else { sprintf(tmp, %c%u%c (%d) %s , lbr, dirent-inode, rbr, - dirent-rec_len, name); + ext2fs_rec_len_from_disk(dirent-rec_len), name); thislen = strlen(tmp); if (ls-col + thislen 80) { diff --git a/e2fsck/message.c b/e2fsck/message.c index b2e3e0f..05b2e17 100644 --- a/e2fsck/message.c +++ b/e2fsck/message.c @@ -342,12 +342,13 @@ static _INLINE_ void expand_dirent_expre struct problem_context *ctx) { struct ext2_dir_entry *dirent; - int len; + int len, rec_len; if (!ctx || !ctx-dirent) goto no_dirent; dirent = ctx-dirent; + rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); switch (ch) { case 'i': @@ -357,12 +358,12 @@ static _INLINE_ void expand_dirent_expre len = dirent-name_len 0xFF; if (len EXT2_NAME_LEN) len = EXT2_NAME_LEN; - if (len dirent-rec_len) - len = dirent-rec_len; + if (len rec_len) + len = rec_len; safe_print(dirent-name, len); break; case 'r': - printf(%u, dirent-rec_len); + printf(%u, rec_len); break; case 'l': printf(%u, dirent-name_len 0xFF); diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c index ceb9c7f..fd2c7d0 100644 --- a/e2fsck/pass1.c +++ b/e2fsck/pass1.c @@ -379,6 +379,7 @@ static void check_is_really_dir(e2fsck_t errcode_t retval; blk_t blk; int i, not_device = 0; + int rec_len; if (LINUX_S_ISDIR(inode-i_mode) || LINUX_S_ISREG(inode-i_mode) || LINUX_S_ISLNK(inode-i_mode) || inode-i_block[0] == 0
[PATCH] mke2fs: Make lost+found always have at least two blocks
Hi, attached patch modifies mke2fs to always create lost+found with at least two directory blocks. I think this could make sence not only for testing but as a sanity check that 64KB support works for general e2fsprogs if someone decides to use it... Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- From: Andreas Dilger [EMAIL PROTECTED] Make sure lost+found has always at least 2 disk blocks. This will provide at least elementary test that we have not screwed-up support for 64KB blocks since the second directory block will be empty. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/misc/mke2fs.c b/misc/mke2fs.c index 98a4957..6249cc2 100644 --- a/misc/mke2fs.c +++ b/misc/mke2fs.c @@ -534,7 +534,10 @@ static void create_lost_and_found(ext2_f } for (i=1; i EXT2_NDIR_BLOCKS; i++) { - if ((lpf_size += fs-blocksize) = 16*1024) + /* Ensure that lost+found is at least 2 blocks, so we always +* test large empty blocks for big-block filesystems. */ + if ((lpf_size += fs-blocksize) = 16*1024 + lpf_size = 2 * fs-blocksize) break; retval = ext2fs_expand_dir(fs, ino); if (retval) { - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Fix reading of bitmaps from a filesystem image
Hi, reading of bitmaps from a filesystem image seems to be busted. The patch below fixes it for me. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Subject: Fix reading of bitmaps from filesystem image Reading of bitmaps from image file could never work with more than one group in a filesystem... Fix the loops so that we read appropriate number of blocks. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/lib/ext2fs/rw_bitmaps.c b/lib/ext2fs/rw_bitmaps.c index 1897ec3..abbe5e8 100644 --- a/lib/ext2fs/rw_bitmaps.c +++ b/lib/ext2fs/rw_bitmaps.c @@ -190,7 +190,7 @@ static errcode_t read_bitmaps(ext2_filsy if (fs-flags EXT2_FLAG_IMAGE_FILE) { blk = (fs-image_header-offset_inodemap / fs-blocksize); ino_cnt = fs-super-s_inodes_count; - while (inode_nbytes 0) { + while (do_inode ino_cnt 0) { retval = io_channel_read_blk(fs-image_io, blk++, 1, inode_bitmap); if (retval) @@ -202,15 +202,14 @@ static errcode_t read_bitmaps(ext2_filsy ino_itr, cnt, inode_bitmap); if (retval) goto cleanup; - ino_itr += fs-blocksize 3; - ino_cnt -= fs-blocksize 3; - inode_nbytes -= fs-blocksize; + ino_itr += cnt; + ino_cnt -= cnt; } blk = (fs-image_header-offset_blockmap / fs-blocksize); blk_cnt = EXT2_BLOCKS_PER_GROUP(fs-super) * fs-group_desc_count; - while (block_nbytes 0) { + while (do_block blk_cnt 0) { retval = io_channel_read_blk(fs-image_io, blk++, 1, block_bitmap); if (retval) @@ -222,9 +221,8 @@ static errcode_t read_bitmaps(ext2_filsy blk_itr, cnt, block_bitmap); if (retval) goto cleanup; - blk_itr += fs-blocksize 3; - blk_cnt -= fs-blocksize 3; - block_nbytes -= fs-blocksize; + blk_itr += cnt; + blk_cnt -= cnt; } goto success_cleanup; } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Fix assertion failure in fs/jbd/checkpoint.c
Hi, the patch below should fix an assertion failure in JBD checkpointing code. The patch survived some fsstress and similar runs on my test machine so it shouldn't be obviously wrong ;). Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Before we start committing a transaction, we call __journal_clean_checkpoint_list() to cleanup transaction's written-back buffers. If this call happens to remove all of them (and there were already some buffers), __journal_remove_checkpoint() will decide to free the transaction because it isn't (yet) a committing transaction and soon we fail some assertion - the transaction really isn't ready to be freed :). We change the check in __journal_remove_checkpoint() to free only a transaction in T_FINISHED state. The locking there is subtle though (as everywhere in JBD ;(). We use j_list_lock to protect the check and a subsequent call to __journal_drop_transaction() and do the same in the end of journal_commit_transaction() which is the only place where a transaction can get to T_FINISHED state. Probably I'm too paranoid here and such locking is not really necessary - checkpoint lists are processed only from log_do_checkpoint() where a transaction must be already committed to be processed or from __journal_clean_checkpoint_list() where kjournald itself calls it and thus transaction cannot change state either. Better be safe if something changes in future... Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c index 47552d4..0f69c41 100644 --- a/fs/jbd/checkpoint.c +++ b/fs/jbd/checkpoint.c @@ -602,15 +602,15 @@ int __journal_remove_checkpoint(struct journal_head *jh) /* * There is one special case to worry about: if we have just pulled the -* buffer off a committing transaction's forget list, then even if the -* checkpoint list is empty, the transaction obviously cannot be -* dropped! +* buffer off a running or committing transaction's checkpoing list, +* then even if the checkpoint list is empty, the transaction obviously +* cannot be dropped! * -* The locking here around j_committing_transaction is a bit sleazy. +* The locking here around t_state is a bit sleazy. * See the comment at the end of journal_commit_transaction(). */ - if (transaction == journal-j_committing_transaction) { - JBUFFER_TRACE(jh, belongs to committing transaction); + if (transaction-t_state != T_FINISHED) { + JBUFFER_TRACE(jh, belongs to running/committing transaction); goto out; } diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 8f1f2aa..610264b 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -858,10 +858,10 @@ restart_loop: } spin_unlock(journal-j_list_lock); /* -* This is a bit sleazy. We borrow j_list_lock to protect -* journal-j_committing_transaction in __journal_remove_checkpoint. -* Really, __journal_remove_checkpoint should be using j_state_lock but -* it's a bit hassle to hold that across __journal_remove_checkpoint +* This is a bit sleazy. We use j_list_lock to protect transition +* of a transaction into T_FINISHED state and calling +* __journal_drop_transaction(). Otherwise we could race with +* other checkpointing code processing the transaction... */ spin_lock(journal-j_state_lock); spin_lock(journal-j_list_lock); diff --git a/include/linux/jbd.h b/include/linux/jbd.h index 16e7ed8..d9ecd13 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -439,6 +439,8 @@ struct transaction_s /* * Transaction's current state * [no locking - only kjournald alters this] +* [j_list_lock] guards transition of a transaction into T_FINISHED +* state and subsequent call of __journal_drop_transaction() * FIXME: needs barriers * KLUDGE: [use j_state_lock] */ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: dir inode reservation V3
Hi Coly, finally I've found some time to have a look at a new version of your patch. 5, Performance number On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create 5 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result, normal ext4 ext4 with dir inode reservation mount options: -o data=writeback -o data=writeback,dir_ireserve=low Create dirs:real0m49.101s real2m59.703s Create files: real24m17.962s real21m8.161s Unlink all: real24m43.788s real17m29.862s Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating directory inodes in non-linear order will cause extra hard disk seeking and block I/O. Creating files with dir inode reservation is 13% faster than normal ext4. Unlink all the directories and files is 29.2% faster as expected. When number of directories is increased, the performance improvement will be more considerable. More benchmark result will be posted here if necessary, because I need more time to run more test cases. Maybe to get some better idea - could you compare how long does take untar of a kernel tree, find through the whole kernel tree and removal of the tree? Also measuring CPU load during your benchmarks would be useful so that we can see whether we don't increase too much the cost of searching in bitmaps... The patch is nicely short ;) diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 17b5df1..f838a72 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -10,6 +10,8 @@ * Stephen Tweedie ([EMAIL PROTECTED]), 1993 * Big-endian to little-endian byte-swapping/bitmaps by *David S. Miller ([EMAIL PROTECTED]), 1995 + * Directory inodes reservation by + *Coly Li ([EMAIL PROTECTED]), 2007 */ #include linux/time.h @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, return -1; } +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, + int mode, ext4_group_t *group, unsigned long *ino) +{ + struct super_block *sb; + struct ext4_sb_info *sbi; + struct ext4_group_desc *gdp = NULL; + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; + ext4_group_t ires_group = *group; + unsigned long ires_ino; + int i, bit; + + sb = dir-i_sb; + sbi = EXT4_SB(sb); + + /* if the inode number is not for directory, + * only try to allocate after directory's inode + */ + if (!S_ISDIR(mode)) { + *ino = dir-i_ino % EXT4_INODES_PER_GROUP(sb); + return 0; + } ^^^ You don't set a group here - is this intentional? It means we get the group from find_group_other() and there we start searching at a place corresponding to parent's inode number... That would be an interesting heuristic but I'm not sure if that's what you want. + + /* reserve inodes for new directory */ + for (i = 0; i sbi-s_groups_count; i++) { + gdp = ext4_get_group_desc(sb, ires_group, gdp_bh); + if (!gdp) + goto fail; + bit = 0; +try_same_group: + if (bit EXT4_INODES_PER_GROUP(sb)) { + brelse(bitmap_bh); + bitmap_bh = read_inode_bitmap(sb, ires_group); + if (!bitmap_bh) + goto fail; + + BUFFER_TRACE(bitmap_bh, get_write_access); + if (ext4_journal_get_write_access( + handle, bitmap_bh) != 0) + goto fail; + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), + bit, bitmap_bh-b_data)) { + /* we won it */ + BUFFER_TRACE(bitmap_bh, + call ext4_journal_dirty_metadata); + if (ext4_journal_dirty_metadata(handle, + bitmap_bh) != 0) + goto fail; + ires_ino = bit; + goto find; + } + /* we lost it */ + jbd2_journal_release_buffer(handle, bitmap_bh); + bit += sbi-s_dir_ireserve_nr; + goto try_same_group; + } So this above is just a while loop coded with goto... While loop would be IMO better. Honza -- Jan Kara
Re: [PATCH] ext4: dir inode reservation V3
On Wed 21-11-07 00:40:17, Coly Li wrote: Jan Kara wrote: diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 17b5df1..f838a72 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -10,6 +10,8 @@ * Stephen Tweedie ([EMAIL PROTECTED]), 1993 * Big-endian to little-endian byte-swapping/bitmaps by *David S. Miller ([EMAIL PROTECTED]), 1995 + * Directory inodes reservation by + *Coly Li ([EMAIL PROTECTED]), 2007 */ #include linux/time.h @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent, return -1; } +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir, +int mode, ext4_group_t *group, unsigned long *ino) +{ + struct super_block *sb; + struct ext4_sb_info *sbi; + struct ext4_group_desc *gdp = NULL; + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL; + ext4_group_t ires_group = *group; + unsigned long ires_ino; + int i, bit; + + sb = dir-i_sb; + sbi = EXT4_SB(sb); + + /* if the inode number is not for directory, + * only try to allocate after directory's inode + */ + if (!S_ISDIR(mode)) { + *ino = dir-i_ino % EXT4_INODES_PER_GROUP(sb); + return 0; + } ^^^ You don't set a group here - is this intentional? It means we get the group from find_group_other() and there we start searching at a place corresponding to parent's inode number... That would be an interesting heuristic but I'm not sure if that's what you want. Yes, if allocating a file inode, just return first inode offset in the reserved area of parent directory. In this case, group is decided by find_group_other() or find_group_orlov(), ext4_ino_from_ireserve() just tries to persuade linear inode allocator to search free inode slot after parent's inode. But what I mean is: Parent directory is in group 1, with inode number 10, now find_group_other will set group to 2 and you set inode number to 10 so linear allocator will start searching in group 2, inode number 10 which is *not* just after directory inode + + /* reserve inodes for new directory */ + for (i = 0; i sbi-s_groups_count; i++) { + gdp = ext4_get_group_desc(sb, ires_group, gdp_bh); + if (!gdp) + goto fail; + bit = 0; +try_same_group: + if (bit EXT4_INODES_PER_GROUP(sb)) { + brelse(bitmap_bh); + bitmap_bh = read_inode_bitmap(sb, ires_group); + if (!bitmap_bh) + goto fail; + + BUFFER_TRACE(bitmap_bh, get_write_access); + if (ext4_journal_get_write_access( + handle, bitmap_bh) != 0) + goto fail; + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group), + bit, bitmap_bh-b_data)) { + /* we won it */ + BUFFER_TRACE(bitmap_bh, + call ext4_journal_dirty_metadata); + if (ext4_journal_dirty_metadata(handle, + bitmap_bh) != 0) + goto fail; + ires_ino = bit; + goto find; + } + /* we lost it */ + jbd2_journal_release_buffer(handle, bitmap_bh); + bit += sbi-s_dir_ireserve_nr; + goto try_same_group; + } So this above is just a while loop coded with goto... While loop would be IMO better. The only reason for me to use a goto, is 80 column limitation :) BTW, goto does not hurt performance and readability here. IMHO, it's acceptable :-) But you could just remove goto try_same_group; and change 'if' to 'while'. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Oops 2.6.23.1 in ext3+jbd at journal_put_journal_head
Hello, A one-time event thus far, happened under very heavy I/O, Dell i9400 Core2Duo notebook w/3GB ram, single SATA drive with ext3. Had to cycle power to get it back and see this Oops in the syslog: : BUG: unable to handle kernel paging request at virtual address 430a7261 : printing eip: : c01a6605 : *pde = : Oops: 0002 [#1] : PREEMPT SMP : Modules linked in: nls_iso8859_1 vfat fat usb_storage ide_core libusual hci_usb ext2 loop nls_cp437 isofs zlib_inflate udf vmnet(P) vmblock(P) vmmon(P) binfmt_misc rfcomm l2cap bluetooth nfs nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc acpi_cpufreq cpufreq_stats cpufreq_userspace cpufreq_ondemand cpufreq_conservative freq_table cpufreq_powersave container fan firmware_class af_packet pciehp usbhid hid pci_hotplug visor usbserial fuse mousedev snd_hda_intel snd_pcm_oss snd_pcm snd_mixer_oss snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event serio_raw snd_seq snd_timer snd_seq_device sg thermal firewire_ohci snd pcspkr sr_mod cdrom psmouse firewire_core sdhci mmc_core b44 mii crc_itu_t ac uhci_hcd ehci_hcd intel_agp agpgart processor button soundcore snd_page_alloc usbcore battery unix : CPU:1 : EIP:0060:[journal_put_journal_head+64/209]Tainted: PVLI : EFLAGS: 00010202 (2.6.23.1-slab #15) : EIP is at journal_put_journal_head+0x40/0xd1 : eax: c2bf7000 ebx: 430a7261 ecx: edx: c24f4780 : esi: f000fea5 edi: c24f4780 ebp: f000fea5 esp: c2bf7e38 : ds: 007b es: 007b fs: 00d8 gs: ss: 0068 Hmm, your pointer to buffer_head in journal_head has been overwritten by some garbage - it actually looks like ASCII (C\n ra). I think your journal_head pointer is stored in EAX (at least if I compile SMP kernel for i386 it is) and that is 0xc2bd7000 - start of the page. So probably some driver went wild and overwritten a piece of memory which did not belong to it... I suggest turning on a few debugging options (like DEBUG_SLAB) to catch the offender. : Process kswapd0 (pid: 243, ti=c2bf7000 task=c29ec030 task.ti=c2bf7000) : Stack: d6216868 002a d4f52670 0034 f000fea5 c01a34b6 :0246 c003fe08 ef082d98 ef082d4c ef082d4c c29f00c0 c003fdb8 c0198a8f : 0002ad02 ef082d4c c0145b21 c24f4780 000b c014a5e0 000e : Call Trace: : [journal_try_to_free_buffers+299/383] journal_try_to_free_buffers+0x12b/0x17f : [ext3_releasepage+0/114] ext3_releasepage+0x0/0x72 : [try_to_release_page+48/66] try_to_release_page+0x30/0x42 : [__invalidate_mapping_pages+116/231] __invalidate_mapping_pages+0x74/0xe7 : [invalidate_mapping_pages+15/17] invalidate_mapping_pages+0xf/0x11 : [shrink_icache_memory+219/445] shrink_icache_memory+0xdb/0x1bd : [shrink_slab+217/338] shrink_slab+0xd9/0x152 : [kswapd+729/1069] kswapd+0x2d9/0x42d : [autoremove_wake_function+0/53] autoremove_wake_function+0x0/0x35 : [kswapd+0/1069] kswapd+0x0/0x42d : [kthread+56/95] kthread+0x38/0x5f : [kthread+0/95] kthread+0x0/0x5f : [kernel_thread_helper+7/16] kernel_thread_helper+0x7/0x10 : === : Code: 89 e0 25 00 f0 ff ff ff 48 14 8b 40 08 a8 04 74 05 e8 a6 d6 0e 00 f3 90 89 e0 25 00 f0 ff ff ff 40 14 8b 03 a9 00 00 40 00 75 d5 f0 0f ba 2b 16 19 c0 85 c0 75 ec 8b 46 04 85 c0 7f 30 c7 44 24 : EIP: [journal_put_journal_head+64/209] journal_put_journal_head+0x40/0xd1 SS:ESP 0068:c2bf7e38 : note: kswapd0[243] exited with preempt_count 2 Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize
On Sat 10-11-07 19:37:03, Theodore Tso wrote: On Wed, Nov 07, 2007 at 05:09:39PM +0100, Jan Kara wrote: Subject: Support for 64KB blocksize in ext2-4 directories. When block size is 64KB, we have to take care that rec_len does not overflow. Kernel stores 0x in case 0x1 should be stored - perform appropriate conversion when reading from / writing to disk. NACK. You can't do the conversion in the reader/writer routines because the fundamentally rec_len is only a 16 bit field. So when you read a directory block where the rec_len field is encoded as 0x, and you translate it to 0x1, when you assign it to dirent-rec_len, the 0x1 gets chopped off and rec_len gets a value of zero. Did you test this patch before submitting it? Argh, stupid me. I've just tested that I didn't break anything for normal block size and thought that I cannot make mistake in such a simple thing ;). The only way to do this is to find all of the places that reference rec_len, and do the check there. Yes.. Thanks for having look. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize
On Mon 12-11-07 09:58:23, Theodore Tso wrote: On Mon, Nov 12, 2007 at 10:52:45AM +0100, Jan Kara wrote: Did you test this patch before submitting it? Argh, stupid me. I've just tested that I didn't break anything for normal block size and thought that I cannot make mistake in such a simple thing ;). Could I ask you to perhaps include some 64k blocksize test cases that would exercise the new codepaths? Fair enough, will do. The only way to do this is to find all of the places that reference rec_len, and do the check there. Yes.. Thanks for having look. One suggestion is that instead of just creating an conversion function, and then doing a global search and replace, in some places it might be better to declare an integer variable, and then assign rec_len = ext2fs_rec_len_from_disk(dirent-rec_len). For example, that would make ext2fs_process_dir_block() more readable, where dirent-rec_len is used no less than eight times. OK, thanks for suggestion. Thanks, and my apologies for not having time to review the patch until now. At the moment things are a bit crazy since I am effectively doing two jobs, since I am in transition between two assignments, and me doing most of both of them at the moment. I should have substantially more time after the new year begins. I see. I'll be patient then :) Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Wed 07-11-07 19:20:38, Theodore Tso wrote: On Wed, Nov 07, 2007 at 03:36:05PM +0100, Jan Kara wrote: What if more than one application wants to use this facility? That should be fine - let's see: Each application keeps somewhere a time when it started a scan of a subtree (or it can actually remember a time when it set the flag for each directory), during the scan, it sets the flag on each directory. When it wakes up to recheck the subtree it just compares the rtime against the stored time - if rtime is greater, subtree has been modified since the last scan and we recurse in it and when we are finished with it we set the flag. Now notice that we don't care about the flag when we check for changes - we care only for rtime - so if there are several applications interested in the same subtree, the flag just gets set more often and thus the update of rtime happens more often but the same scheme still works fine. OK, so in this case you don't need to set rtime on the every single file inode, but only directory inode, right? Because you're only Yes, that's actually what I'm doing - sorry if I didn't make it clear earlier. using checking the rtime at the directory level, and not the flag. And it's just as easy for you to check the rtime flag for the file's containing directory (modulo magic vis-a-vis hard links) as the file's inode. Exactly. I'm just really wishing that rtime and the rtime flag didn't have live on disk, but could rather be in memory. If you only needed to save the directory flags and rtimes, that might actually be doable. I already gave some thought to this but there seemed to be some drawbacks. Query I want to support is: given a directory, tell me which of its subdirectories (arbitrarily deep below) have been modified since time T. That is what you need to support faster rsync, updatedb and similar loads. Also I want to allow a reboot to happen inbetween the modification and a query (handling a crash correctly would be nice too but honestly my current implementation is not completely reliable in this regard either) so some pernament storage is needed in any case. What I can imagine we could do is to report all modifications to userspace - that has a problem that there are *many* possible modifications but we are interested only whether there happened some since time T. We could improve this by an in-memory inode flag I'm not interested in modifications any further and reporting the change only if the parent directory does not have this flag set (note that this flag gets lost when we evict the inode from memory). But I would say that in the end all this message passing, climbing the tree from userspace and maintaining data structure in memory and on disk would cost use more than the current implementation... Also it has the disadvantage that we miss the modifications which happen before we start the userspace daemon catching the events. Doing this in kernel memory has a problem how to solve the persistency across reboots (dumping mod's to userspace on request?) and also on my system you'd have roughly a few MB of pinned memory for these purposes... Plausible but I don't really like it... Note by the way that since you need to own the file/directory to set flags, this means that only programs that are running as root or running as the uid who owns the entire subtree will be able to use this scheme. One advantage of doing in kernel memory is that you might be able to support watching a tree that is not owned by the watcher. Yes, that is the advantage. On the other hand we could allow setting that particular flag even without being an owner of the inode. In fact, I don't currently see use case where you won't be either root (rsync, updatedb) or an owner of the files (watching config file trees) but I guess people would find some :). I don't get it here - you need to scan the whole subtree and set the flag only during the initial scan. Later, you need to scan and set the flag only for directories in whose subtree something changed. Similarty rtime needs to be updated for each inode at most once after the scan. OK, so in the worst case every single file in a kernel source tree might change after doing an extreme git checkout. That means around 36k of files get updated. So if you have to set/clear the rtime flag during the checkout process 36k file inodes would have to have their rtime flag cleared, plus 2k worth of directory inodes; but those would probably be folded into other changes made to the inodes anyway. But Yes, here the impact is hardly measurable as I've written in the previous email. then when trackerd goes back and scans the subtree, if you are actually setting rtime flags for every single file inode, then that's 38k of indoes that need updating. If you only need to set the rtime flags for directories, that's only 2k worth of extra gratuitous inode updates. As I wrote above, the flag
Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Thu 08-11-07 09:37:59, Theodore Tso wrote: Ah, OK, so the two things that I didn't get from your patch description are: 1) the rtime flag and rtime field are only set on directories 2) the intended use is not trackerd and its ilk, but rsync and updatedb, so it is desirable that scan/queries be persistent across reboots But then the major hole in this scheme is still the issue of hard links. The rsync program is still going to have to scan the entire subtree looking for hard links, since an inode with multiple links into the directory tree can't guarantee that all of its parent directories will have their rtime field updated. Not really - initially rsync can scan a tree for hardlinks and remember where they are. If a hardlink to a file is created, an rtime update is sent up the tree via the path used to create the link. So during next scan, rsync will see the file is modified and finds out that its nlink is 1 and adds it to the list of hardlinked files. So for things like regular backups hardlinks can be dealt with efficiently. A program like updatedb which only cares about filenames will be OK, since that means it really only cares about knowing when directories have changed, and you can't have hard links to directories. The other problem, of course, is that this feature would become ext 2/3/4 specific, and I could see future filesystems possibly wanting this. So this raises the question of whether the interface should be at the VFS layer or not --- and if so, how to handle querying whether a particulra filesystem supports it, and what happens if you have a subtree which is covered by a filesystem that doesn't support rtime? So a program like rsync would need to scan /proc/self/mounts to see whether or not it would be safe to use this feature in the first Yes, being filesystem specific and thus requiring special handling of mount points is a disadvantage of this approach. place. And, of course, rsync would need to know whether it has write access to the tree in order to set flags in the directory, and what to do if some portion of the subtree isn't writeable by rsync. Yes, the cases where we cannot modify the flag in a tree would have to be handled (similarly as the cases where the filesystem simply does not support the feature). I don't think it wouldn't be too complicated but I have not the modification for rsync yet, so I can underestimate... On Thu, Nov 08, 2007 at 11:56:42AM +0100, Jan Kara wrote: Note by the way that since you need to own the file/directory to set flags, this means that only programs that are running as root or running as the uid who owns the entire subtree will be able to use this scheme. One advantage of doing in kernel memory is that you might be able to support watching a tree that is not owned by the watcher. Yes, that is the advantage. On the other hand we could allow setting that particular flag even without being an owner of the inode. In fact, I don't currently see use case where you won't be either root (rsync, updatedb) or an owner of the files (watching config file trees) but I guess people would find some :). Sometimes people like to use rsync to copy a subtree to which they have read access but not write access. (And here note that it's not enough to have write access, you actually need to *own* all of the directories in the subtree). Yes, so in such cases my feature won't be able to help. But I think there are still enough cases where it would help. Yes, it's safe to let any user *set* the rtime flag, but we couldn't let them clear the rtime flag, since then they would be able to hide a file modification from some other (potentially privileged) process. Good point. Speaking of security, I assume your patch will never allow rtime to go backwards (for example if the user attempts to backdate a file's mtime field using the utime() or utimes() system call)? No, the patch does not allow this. But anyway in case user has enough rights to change file's mtime, would it really be a security concern? I guess I'm convinced that updatedb could use this facility, but there are enough asteriks around it that I'm not sure that rsync could safely use this feature in production. I don't doubt that in a cold cache case, it would speed up rsync, but because it doesn't handle hard links, it's not reliable. Since rsync often gets used for backups, this is a big deal. There are also questions about what to do if rsync doesn't have write access to the filesystem, or if there is a non-rtime capable filesystem mounted in the subtree, etc., that can be worked around, but would add a lot of complexity and grottiness to the rsync source tree. Is the rsync maintainer really willing to add all of the necessary hair to support this rtime facility into their program? Hardlinks can be worked-around as I wrote above and there would have to be a fallback in case we cannot set
Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Tue 06-11-07 10:04:47, H. Peter Anvin wrote: Arjan van de Ven wrote: On Tue, 6 Nov 2007 18:19:45 +0100 Jan Kara [EMAIL PROTECTED] wrote: Implement recursive mtime (rtime) feature for ext3. The feature works as follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a user) whether rtime should be updated. In case a directory or a file in it is modified and when the flag is set, directory's rtime is updated, the flag is cleared, and we move to the parent. If the flag is set there, we clear it, update rtime and continue upwards upto the root of the filesystem. In case a regular file or symlink is modified, we pick arbitrary of its parents (actually the one that happens to be at the head of i_dentry list) and start the rtime update algorith there. Ok since mtime (and rtime) are part of the inode and not the dentry... how do you deal with hardlinks? And with cases of files that have been unlinked? (ok the later is a wash obviously other than not crashing) Unlinked files are easy - you just don't propagate the rtime anywhere. For hardlinks see below. There is only one possible answer... he only updates the directory path that was used to touch the particular file involved. Thus, the semantics gets grotty not just in the presence of hard links, but also in the presence of bind- and other non-root mounts. Update of recursive mtime does not pass filesystem boundaries (i.e. mountpoints) so bind mounts and such are non-issue (hmm, at least that was my original idea but as I'm looking now I don't handle bind mounts properly so that needs to be fixed). With hardlinks, you are right that the behaviour is undeterministic - I tried to argue in the text of the mail that this does not actually matter - there are not many hardlinks on usual system and so the application can check hardlinked files in the old way - i.e. look at mtime. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Tue 06-11-07 18:01:00, Al Viro wrote: On Tue, Nov 06, 2007 at 06:19:45PM +0100, Jan Kara wrote: Implement recursive mtime (rtime) feature for ext3. The feature works as follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a user) whether rtime should be updated. In case a directory or a file in it is modified and when the flag is set, directory's rtime is updated, the flag is cleared, and we move to the parent. If the flag is set there, we clear it, update rtime and continue upwards upto the root of the filesystem. In case a regular file or symlink is modified, we pick arbitrary of its parents (actually the one that happens to be at the head of i_dentry list) and start the rtime update algorith there. *e* Nothing like undeterministic behaviour, is there? Oh yes, there is :) But I tried to argue it does not really matter - application would have to handle hardlinks in a special way but I find that acceptable given how rare they are... Intended use case is that application which wants to watch any modification in a subtree scans the subtree and sets flags for all inodes there. Next time, it just needs to recurse in directories having rtime newer than the start of the previous scan. There it can handle modifications and set the flag again. It is up to application to watch out for hardlinked files. It can e.g. build their list and check their mtime separately (when a hardlink to a file is created its inode is modified and rtimes properly updated and thus any application has an effective way of finding new hardlinked files). You know, you can do that with aush^H^Hdit right now... Interesting idea, no I have not thought about this. I guess you mean watching all the VFS modification events and then do the checking and propagation from user space... My first feeling is that the performance penalty would be considerably higher (currently I am at 1% performance penalty for quite pessimistic test case) but in case the current patch would be considered unacceptable, I can try how large the penalty would be. Thanks for suggestion. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize
Hello, sorry for replying to myself but I've just found out that the patch I've sent was and old version of the patch which had some problems. Attached is a new version. On Tue 06-11-07 12:31:42, Jan Kara wrote: it seems attached patch still did not get your attention. It makes e2fsprogs properly handle filesystems with 64KB block size. Could you put it into e2fsprogs git? Thanks. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR Subject: Support for 64KB blocksize in ext2-4 directories. When block size is 64KB, we have to take care that rec_len does not overflow. Kernel stores 0x in case 0x1 should be stored - perform appropriate conversion when reading from / writing to disk. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c index fb20fa0..db73edd 100644 --- a/lib/ext2fs/dirblock.c +++ b/lib/ext2fs/dirblock.c @@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi dirent = (struct ext2_dir_entry *) p; #ifdef WORDS_BIGENDIAN dirent-inode = ext2fs_swab32(dirent-inode); - dirent-rec_len = ext2fs_swab16(dirent-rec_len); dirent-name_len = ext2fs_swab16(dirent-name_len); #endif + dirent-rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); name_len = dirent-name_len; #ifdef WORDS_BIGENDIAN if (flags EXT2_DIRBLOCK_V2_STRUCT) @@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block, void *inbuf, int flags EXT2FS_ATTR((unused))) { -#ifdef WORDS_BIGENDIAN errcode_t retval; char *p, *end; char *buf = 0; struct ext2_dir_entry *dirent; +#ifndef WORDS_BIGENDIAN + if (fs-blocksize EXT2_MAX_REC_LEN) + goto just_write; +#endif retval = ext2fs_get_mem(fs-blocksize, buf); if (retval) return retval; @@ -88,19 +91,18 @@ errcode_t ext2fs_write_dir_block2(ext2_f return (EXT2_ET_DIR_CORRUPTED); } p += dirent-rec_len; + dirent-rec_len = ext2fs_rec_len_to_disk(dirent-rec_len); +#ifdef WORDS_BIGENDIAN dirent-inode = ext2fs_swab32(dirent-inode); - dirent-rec_len = ext2fs_swab16(dirent-rec_len); - dirent-name_len = ext2fs_swab16(dirent-name_len); - - if (flags EXT2_DIRBLOCK_V2_STRUCT) + if (!(flags EXT2_DIRBLOCK_V2_STRUCT)) dirent-name_len = ext2fs_swab16(dirent-name_len); +#endif } retval = io_channel_write_blk(fs-io, block, 1, buf); ext2fs_free_mem(buf); return retval; -#else +just_write: return io_channel_write_blk(fs-io, block, 1, (char *) inbuf); -#endif } diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h index a316665..21747c2 100644 --- a/lib/ext2fs/ext2_fs.h +++ b/lib/ext2fs/ext2_fs.h @@ -717,6 +718,32 @@ struct ext2_dir_entry_2 { #define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1) #define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) \ ~EXT2_DIR_ROUND) +#define EXT2_MAX_REC_LEN ((116)-1) + +static inline unsigned ext2fs_rec_len_from_disk(unsigned len) +{ +#ifdef WORDS_BIGENDIAN + len = ext2fs_swab16(dlen); +#endif + if (len == EXT2_MAX_REC_LEN) + return 1 16; + return len; +} + +static inline unsigned ext2fs_rec_len_to_disk(unsigned len) +{ + if (len == (1 16)) +#ifdef WORDS_BIGENDIAN + return ext2fs_swab16(EXT2_MAX_REC_LEN); +#else + return EXT2_MAX_REC_LEN; +#endif +#ifdef WORDS_BIGENDIAN + return ext2fs_swab_16(len); +#else + return len; +#endif +} /* * This structure will be used for multiple mount protection. It will be diff --git a/misc/e2image.c b/misc/e2image.c index 1fbb267..4e2c9fb 100644 --- a/misc/e2image.c +++ b/misc/e2image.c @@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils end = buf + fs-blocksize; for (p = buf; p end-8; p += rec_len) { dirent = (struct ext2_dir_entry_2 *) p; - rec_len = dirent-rec_len; -#ifdef WORDS_BIGENDIAN - rec_len = ext2fs_swab16(rec_len); -#endif + rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); #if 0 printf(rec_len = %d, name_len = %d\n, rec_len, dirent-name_len); #endif @@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils bad rec_len (%d)\n, (unsigned long) blk, rec_len); rec_len = end - p; -#ifdef WORDS_BIGENDIAN -dirent-rec_len = ext2fs_swab16(rec_len); -#endif + dirent-rec_len = ext2fs_rec_len_to_disk(rec_len); continue; } if (dirent-name_len + 8 rec_len) {
[PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize (fwd)
Wrote wrong mailing list address so I'm resending it to a correct one. - Forwarded message from Jan Kara [EMAIL PROTECTED] - From: Jan Kara [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize Hello Ted, it seems attached patch still did not get your attention. It makes e2fsprogs properly handle filesystems with 64KB block size. Could you put it into e2fsprogs git? Thanks. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR Subject: Support for 64KB blocksize in ext2-4 directories. When block size is 64KB, we have to take care that rec_len does not overflow. Kernel stores 0x in case 0x1 should be stored - perform appropriate conversion when reading from / writing to disk. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c index fb20fa0..628bf93 100644 --- a/lib/ext2fs/dirblock.c +++ b/lib/ext2fs/dirblock.c @@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi dirent = (struct ext2_dir_entry *) p; #ifdef WORDS_BIGENDIAN dirent-inode = ext2fs_swab32(dirent-inode); - dirent-rec_len = ext2fs_swab16(dirent-rec_len); dirent-name_len = ext2fs_swab16(dirent-name_len); #endif + dirent-rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); name_len = dirent-name_len; #ifdef WORDS_BIGENDIAN if (flags EXT2_DIRBLOCK_V2_STRUCT) @@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block, void *inbuf, int flags EXT2FS_ATTR((unused))) { -#ifdef WORDS_BIGENDIAN errcode_t retval; char*p, *end; char*buf = 0; struct ext2_dir_entry *dirent; +#ifndef WORDS_BIGENDIAN + if (fs-blocksize EXT2_MAX_REC_LEN) + goto just_write; +#endif retval = ext2fs_get_mem(fs-blocksize, buf); if (retval) return retval; @@ -89,7 +92,7 @@ errcode_t ext2fs_write_dir_block2(ext2_f } p += dirent-rec_len; dirent-inode = ext2fs_swab32(dirent-inode); - dirent-rec_len = ext2fs_swab16(dirent-rec_len); + dirent-rec_len = ext2fs_rec_len_to_disk(dirent-rec_len); dirent-name_len = ext2fs_swab16(dirent-name_len); if (flags EXT2_DIRBLOCK_V2_STRUCT) @@ -98,9 +101,8 @@ errcode_t ext2fs_write_dir_block2(ext2_f retval = io_channel_write_blk(fs-io, block, 1, buf); ext2fs_free_mem(buf); return retval; -#else +just_write: return io_channel_write_blk(fs-io, block, 1, (char *) inbuf); -#endif } diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h index a316665..2041f0f 100644 --- a/lib/ext2fs/ext2_fs.h +++ b/lib/ext2fs/ext2_fs.h @@ -717,6 +717,32 @@ struct ext2_dir_entry_2 { #define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1) #define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) \ ~EXT2_DIR_ROUND) +#define EXT2_MAX_REC_LEN ((116)-1) + +static inline unsigned ext2fs_rec_len_from_disk(unsigned len) +{ +#ifdef WORDS_BIGENDIAN + len = ext2fs_swab16(dlen); +#endif + if (len == EXT2_MAX_REC_LEN) + return 1 16; + return len; +} + +static inline unsigned ext2fs_rec_len_to_disk(unsigned len) +{ + if (len == (1 16)) +#ifdef WORDS_BIGENDIAN + return ext2fs_swab16(EXT2_MAX_REC_LEN); +#else + return EXT2_MAX_REC_LEN; +#endif +#ifdef WORDS_BIGENDIAN + return ext2fs_swab_16(len); +#else + return len; +#endif +} /* * This structure will be used for multiple mount protection. It will be diff --git a/misc/e2image.c b/misc/e2image.c index 1fbb267..4e2c9fb 100644 --- a/misc/e2image.c +++ b/misc/e2image.c @@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils end = buf + fs-blocksize; for (p = buf; p end-8; p += rec_len) { dirent = (struct ext2_dir_entry_2 *) p; - rec_len = dirent-rec_len; -#ifdef WORDS_BIGENDIAN - rec_len = ext2fs_swab16(rec_len); -#endif + rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); #if 0 printf(rec_len = %d, name_len = %d\n, rec_len, dirent-name_len); #endif @@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils bad rec_len (%d)\n, (unsigned long) blk, rec_len); rec_len = end - p; -#ifdef WORDS_BIGENDIAN - dirent-rec_len = ext2fs_swab16(rec_len); -#endif + dirent-rec_len = ext2fs_rec_len_to_disk(rec_len); continue
[RFC] [PATCH 0/3] Recursive mtime for ext3
Hello, in following three patches is implemented recursive mtime feature for ext3. The first two patches are mostly clean-up patches, the third patch implements the feature itself. If somebody is interested in testing this (or even writing a support of this feature in rsync and similar), please contact me. Attached are sources of simple tools set_recmod, get_recmod for testing the feature and also a patch implementing basic support of the feature in e2fsprogs. Comments welcome. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR #include stdio.h #include fcntl.h #include errno.h #include string.h #include sys/ioctl.h #include linux/fs.h #include linux/ext2_fs.h #define EXT2_IOC_GETRTIME _IOR('f', 9, unsigned int) #define S_RECMOD 0x10 int main(int argc, char **argv) { int fd; long flags; if (argc 2) return 1; fd = open(argv[1], O_RDONLY); if (fd 0) { printf(Cannot open: %m\n); return 1; } if (ioctl(fd, EXT2_IOC_GETFLAGS, flags) 0) { printf(Cannot get flags: %m\n); return 1; } flags |= S_RECMOD; if (ioctl(fd, EXT2_IOC_SETFLAGS, flags) 0) { printf(Cannot set flags: %m\n); return 1; } return 0; } #include stdio.h #include fcntl.h #include errno.h #include string.h #include sys/ioctl.h #include linux/fs.h #include linux/ext2_fs.h #define EXT2_IOC_GETRTIME _IOR('f', 9, unsigned int) #define S_RECMOD 0x10 int main(int argc, char **argv) { int fd; long flags; unsigned time; if (argc 2) return 1; fd = open(argv[1], O_RDONLY); if (fd 0) { printf(Cannot open: %m\n); return 1; } if (ioctl(fd, EXT2_IOC_GETFLAGS, flags) 0) { printf(Cannot get flags: %m\n); return 1; } if (ioctl(fd, EXT2_IOC_GETRTIME, time) 0) { printf(Cannot get rtime: %s\n, strerror(errno)); return 1; } printf(RECMOD flags: %d, time %u\n, !!(flags S_RECMOD), time); return 0; } diff --git a/lib/e2p/feature.c b/lib/e2p/feature.c index fe7e65a..a12540b 100644 --- a/lib/e2p/feature.c +++ b/lib/e2p/feature.c @@ -37,6 +37,8 @@ static struct feature feature_list[] = { resize_inode }, { E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_LAZY_BG, lazy_bg }, + { E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_RTIME, + recursive_mtime }, { E2P_FEATURE_RO_INCOMPAT, EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER, sparse_super }, diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h index a316665..21747c2 100644 --- a/lib/ext2fs/ext2_fs.h +++ b/lib/ext2fs/ext2_fs.h @@ -623,6 +623,7 @@ struct ext2_super_block { #define EXT2_FEATURE_COMPAT_RESIZE_INODE 0x0010 #define EXT2_FEATURE_COMPAT_DIR_INDEX 0x0020 #define EXT2_FEATURE_COMPAT_LAZY_BG 0x0040 +#define EXT2_FEATURE_COMPAT_RTIME 0x0080 #define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001 #define EXT2_FEATURE_RO_COMPAT_LARGE_FILE 0x0002 diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h index 83a9091..64b6eb6 100644 --- a/lib/ext2fs/ext2fs.h +++ b/lib/ext2fs/ext2fs.h @@ -418,7 +418,8 @@ typedef struct ext2_icount *ext2_icount_ EXT2_FEATURE_COMPAT_RESIZE_INODE|\ EXT2_FEATURE_COMPAT_DIR_INDEX|\ EXT2_FEATURE_COMPAT_LAZY_BG|\ - EXT2_FEATURE_COMPAT_EXT_ATTR) + EXT2_FEATURE_COMPAT_EXT_ATTR|\ + EXT2_FEATURE_COMPAT_RTIME) /* This #ifdef is temporary until compression is fully supported */ #ifdef ENABLE_COMPRESSION diff --git a/misc/mke2fs.c b/misc/mke2fs.c index 4a6cace..5060db7 100644 --- a/misc/mke2fs.c +++ b/misc/mke2fs.c @@ -870,7 +870,8 @@ static __u32 ok_features[3] = { EXT3_FEATURE_COMPAT_HAS_JOURNAL | EXT2_FEATURE_COMPAT_RESIZE_INODE | EXT2_FEATURE_COMPAT_DIR_INDEX | - EXT2_FEATURE_COMPAT_LAZY_BG, /* Compat */ + EXT2_FEATURE_COMPAT_LAZY_BG | + EXT2_FEATURE_COMPAT_RTIME, /* Compat */ EXT2_FEATURE_INCOMPAT_FILETYPE| /* Incompat */ EXT3_FEATURE_INCOMPAT_JOURNAL_DEV| EXT2_FEATURE_INCOMPAT_META_BG, diff --git a/misc/tune2fs.c b/misc/tune2fs.c index 833b994..5195e40 100644 --- a/misc/tune2fs.c +++ b/misc/tune2fs.c @@ -96,7 +96,8 @@ static void usage(void) static __u32 ok_features[3] = { EXT3_FEATURE_COMPAT_HAS_JOURNAL | - EXT2_FEATURE_COMPAT_DIR_INDEX, /* Compat */ + EXT2_FEATURE_COMPAT_DIR_INDEX | + EXT2_FEATURE_COMPAT_RTIME, /* Compat */ EXT2_FEATURE_INCOMPAT_FILETYPE, /* Incompat */ EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER /* R/O compat */ };
[RFC] [PATCH 1/3] Recursive mtime for ext3
Hello, the following patch makes more lightweight handling of EXT3_I(inode)-i_flags possible. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Implement atomic updates of EXT3_I(inode)-i_flags. So far the i_flags access was guarded mostly by i_mutex but this is quite heavy-weight. We now use inode-i_lock to protect i_flags reading and updates in ext3. This patch introduces a bogus warning that jflag and oldflags may be uninitialized - anyone knows how to cleanly get rid of it? Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/dir.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c --- linux-2.6.23/fs/ext3/dir.c 2007-10-11 12:01:23.0 +0200 +++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c 2007-11-05 14:04:56.0 +0100 @@ -108,10 +108,10 @@ static int ext3_readdir(struct file * fi sb = inode-i_sb; #ifdef CONFIG_EXT3_INDEX - if (EXT3_HAS_COMPAT_FEATURE(inode-i_sb, - EXT3_FEATURE_COMPAT_DIR_INDEX) - ((EXT3_I(inode)-i_flags EXT3_INDEX_FL) || -((inode-i_size sb-s_blocksize_bits) == 1))) { + if (is_dx(inode) || + (EXT3_HAS_COMPAT_FEATURE(inode-i_sb, \ + EXT3_FEATURE_COMPAT_DIR_INDEX) +(inode-i_size sb-s_blocksize_bits) == 1)) { err = ext3_dx_readdir(filp, dirent, filldir); if (err != ERR_BAD_DX_DIR) { ret = err; @@ -121,7 +121,9 @@ static int ext3_readdir(struct file * fi * We don't set the inode dirty flag since it's not * critical that it get flushed back to the disk. */ + spin_lock(inode-i_lock); EXT3_I(filp-f_path.dentry-d_inode)-i_flags = ~EXT3_INDEX_FL; + spin_unlock(inode-i_lock); } #endif stored = 0; diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/ialloc.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c --- linux-2.6.23/fs/ext3/ialloc.c 2006-11-29 22:57:37.0 +0100 +++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c 2007-11-05 14:14:50.0 +0100 @@ -278,7 +278,7 @@ static int find_group_orlov(struct super ndirs = percpu_counter_read_positive(sbi-s_dirs_counter); if ((parent == sb-s_root-d_inode) || - (EXT3_I(parent)-i_flags EXT3_TOPDIR_FL)) { + ext3_test_inode_flags(parent, EXT3_TOPDIR_FL)) { int best_ndir = inodes_per_group; int best_group = -1; @@ -566,7 +566,11 @@ got: ei-i_dir_start_lookup = 0; ei-i_disksize = 0; + /* Guard reading of directory's i_flags, created inode is safe as +* noone has a reference to it yet */ + spin_lock(dir-i_lock); ei-i_flags = EXT3_I(dir)-i_flags ~EXT3_INDEX_FL; + spin_unlock(dir-i_lock); if (S_ISLNK(mode)) ei-i_flags = ~(EXT3_IMMUTABLE_FL|EXT3_APPEND_FL); /* dirsync only applies to directories */ diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/inode.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c --- linux-2.6.23/fs/ext3/inode.c2007-10-11 12:01:23.0 +0200 +++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c2007-11-05 14:24:39.0 +0100 @@ -2557,8 +2557,10 @@ int ext3_get_inode_loc(struct inode *ino void ext3_set_inode_flags(struct inode *inode) { - unsigned int flags = EXT3_I(inode)-i_flags; + unsigned int flags; + spin_lock(inode-i_lock); + flags = EXT3_I(inode)-i_flags; inode-i_flags = ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); if (flags EXT3_SYNC_FL) inode-i_flags |= S_SYNC; @@ -2570,13 +2572,16 @@ void ext3_set_inode_flags(struct inode * inode-i_flags |= S_NOATIME; if (flags EXT3_DIRSYNC_FL) inode-i_flags |= S_DIRSYNC; + spin_unlock(inode-i_lock); } /* Propagate flags from i_flags to EXT3_I(inode)-i_flags */ void ext3_get_inode_flags(struct ext3_inode_info *ei) { - unsigned int flags = ei-vfs_inode.i_flags; + unsigned int flags; + spin_lock(ei-vfs_inode.i_lock); + flags = ei-vfs_inode.i_flags; ei-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL| EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL); if (flags S_SYNC) @@ -2589,6 +2594,7 @@ void ext3_get_inode_flags(struct ext3_in ei-i_flags |= EXT3_NOATIME_FL; if (flags S_DIRSYNC) ei-i_flags |= EXT3_DIRSYNC_FL; + spin_unlock(ei-vfs_inode.i_lock); } void ext3_read_inode(struct inode * inode) @@ -2781,7 +2787,9 @@ static int ext3_do_update_inode(handle_t raw_inode-i_mtime = cpu_to_le32(inode-i_mtime.tv_sec); raw_inode-i_blocks = cpu_to_le32(inode-i_blocks); raw_inode-i_dtime
[RFC] [PATCH 2/3] Recursive mtime for ext3
Make space reserved for fragments as unused as they were never implemented. Remove also related initializations. We later use the space for recursive mtime. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c --- linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c 2007-11-05 14:14:50.0 +0100 +++ linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c 2007-11-05 14:37:33.0 +0100 @@ -576,11 +576,6 @@ got: /* dirsync only applies to directories */ if (!S_ISDIR(mode)) ei-i_flags = ~EXT3_DIRSYNC_FL; -#ifdef EXT3_FRAGMENTS - ei-i_faddr = 0; - ei-i_frag_no = 0; - ei-i_frag_size = 0; -#endif ei-i_file_acl = 0; ei-i_dir_acl = 0; ei-i_dtime = 0; diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c --- linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c2007-11-05 14:24:39.0 +0100 +++ linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c2007-11-05 14:38:05.0 +0100 @@ -2651,11 +2651,6 @@ void ext3_read_inode(struct inode * inod } inode-i_blocks = le32_to_cpu(raw_inode-i_blocks); ei-i_flags = le32_to_cpu(raw_inode-i_flags); -#ifdef EXT3_FRAGMENTS - ei-i_faddr = le32_to_cpu(raw_inode-i_faddr); - ei-i_frag_no = raw_inode-i_frag; - ei-i_frag_size = raw_inode-i_fsize; -#endif ei-i_file_acl = le32_to_cpu(raw_inode-i_file_acl); if (!S_ISREG(inode-i_mode)) { ei-i_dir_acl = le32_to_cpu(raw_inode-i_dir_acl); @@ -2790,11 +2785,6 @@ static int ext3_do_update_inode(handle_t spin_lock(inode-i_lock); raw_inode-i_flags = cpu_to_le32(ei-i_flags); spin_unlock(inode-i_lock); -#ifdef EXT3_FRAGMENTS - raw_inode-i_faddr = cpu_to_le32(ei-i_faddr); - raw_inode-i_frag = ei-i_frag_no; - raw_inode-i_fsize = ei-i_frag_size; -#endif raw_inode-i_file_acl = cpu_to_le32(ei-i_file_acl); if (!S_ISREG(inode-i_mode)) { raw_inode-i_dir_acl = cpu_to_le32(ei-i_dir_acl); diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/super.c linux-2.6.23-2-make_flags_unused/fs/ext3/super.c --- linux-2.6.23-1-i_flags_atomicity/fs/ext3/super.c2007-11-05 15:04:19.0 +0100 +++ linux-2.6.23-2-make_flags_unused/fs/ext3/super.c2007-11-05 15:01:37.0 +0100 @@ -1584,17 +1584,7 @@ static int ext3_fill_super (struct super goto failed_mount; } } - sbi-s_frag_size = EXT3_MIN_FRAG_SIZE - le32_to_cpu(es-s_log_frag_size); - if (blocksize != sbi-s_frag_size) { - printk(KERN_ERR - EXT3-fs: fragsize %lu != blocksize %u (unsupported)\n, - sbi-s_frag_size, blocksize); - goto failed_mount; - } - sbi-s_frags_per_block = 1; sbi-s_blocks_per_group = le32_to_cpu(es-s_blocks_per_group); - sbi-s_frags_per_group = le32_to_cpu(es-s_frags_per_group); sbi-s_inodes_per_group = le32_to_cpu(es-s_inodes_per_group); if (EXT3_INODE_SIZE(sb) == 0) goto cantfind_ext3; @@ -1618,12 +1608,6 @@ static int ext3_fill_super (struct super sbi-s_blocks_per_group); goto failed_mount; } - if (sbi-s_frags_per_group blocksize * 8) { - printk (KERN_ERR - EXT3-fs: #fragments per group too big: %lu\n, - sbi-s_frags_per_group); - goto failed_mount; - } if (sbi-s_inodes_per_group blocksize * 8) { printk (KERN_ERR EXT3-fs: #inodes per group too big: %lu\n, diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs.h linux-2.6.23-2-make_flags_unused/include/linux/ext3_fs.h --- linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs.h2007-11-05 14:31:44.0 +0100 +++ linux-2.6.23-2-make_flags_unused/include/linux/ext3_fs.h2007-11-05 14:37:33.0 +0100 @@ -291,27 +291,24 @@ struct ext3_inode { __le32 i_generation; /* File version (for NFS) */ __le32 i_file_acl; /* File ACL */ __le32 i_dir_acl; /* Directory ACL */ - __le32 i_faddr;/* Fragment address */ + __le32 i_obsolete_faddr; /* Unused */ union { struct { - __u8l_i_frag; /* Fragment number */ - __u8l_i_fsize; /* Fragment size */ + __u16 l_i_obsolete_frag; /* Unused */ __u16 i_pad1; __le16 l_i_uid_high; /* these 2 fields*/ __le16 l_i_gid_high
[RFC] [PATCH 3/3] Recursive mtime for ext3
Implement recursive mtime (rtime) feature for ext3. The feature works as follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a user) whether rtime should be updated. In case a directory or a file in it is modified and when the flag is set, directory's rtime is updated, the flag is cleared, and we move to the parent. If the flag is set there, we clear it, update rtime and continue upwards upto the root of the filesystem. In case a regular file or symlink is modified, we pick arbitrary of its parents (actually the one that happens to be at the head of i_dentry list) and start the rtime update algorith there. As the flag is always cleared after updating rtime and we don't climb up the tree if the flag is cleared, we have constant amortized complexity of rtime updates. That's for theoretical time consumption ;) Practically, there's no measurable performance impact for a test case like: touch every file in a kernel tree where every directory has RTIME flag set. Intended use case is that application which wants to watch any modification in a subtree scans the subtree and sets flags for all inodes there. Next time, it just needs to recurse in directories having rtime newer than the start of the previous scan. There it can handle modifications and set the flag again. It is up to application to watch out for hardlinked files. It can e.g. build their list and check their mtime separately (when a hardlink to a file is created its inode is modified and rtimes properly updated and thus any application has an effective way of finding new hardlinked files). Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c --- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c 2007-11-05 16:58:10.0 +0100 +++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c2007-11-05 16:58:53.0 +0100 @@ -569,7 +569,7 @@ got: /* Guard reading of directory's i_flags, created inode is safe as * noone has a reference to it yet */ spin_lock(dir-i_lock); - ei-i_flags = EXT3_I(dir)-i_flags ~EXT3_INDEX_FL; + ei-i_flags = EXT3_I(dir)-i_flags ~(EXT3_INDEX_FL | EXT3_RTIME_FL); spin_unlock(dir-i_lock); if (S_ISLNK(mode)) ei-i_flags = ~(EXT3_IMMUTABLE_FL|EXT3_APPEND_FL); @@ -579,6 +579,7 @@ got: ei-i_file_acl = 0; ei-i_dir_acl = 0; ei-i_dtime = 0; + ei-i_rtime = inode-i_mtime.tv_sec; ei-i_block_alloc_info = NULL; ei-i_block_group = group; diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/inode.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/inode.c --- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/inode.c 2007-11-05 16:58:10.0 +0100 +++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/inode.c 2007-11-06 16:13:50.0 +0100 @@ -1232,6 +1232,8 @@ static int ext3_ordered_commit_write(str ret2 = ext3_journal_stop(handle); if (!ret) ret = ret2; + if (!ret) + ext3_update_rtimes(inode); return ret; } @@ -1255,6 +1257,8 @@ static int ext3_writeback_commit_write(s ret2 = ext3_journal_stop(handle); if (!ret) ret = ret2; + if (!ret) + ext3_update_rtimes(inode); return ret; } @@ -1288,6 +1292,8 @@ static int ext3_journalled_commit_write( ret2 = ext3_journal_stop(handle); if (!ret) ret = ret2; + if (!ret) + ext3_update_rtimes(inode); return ret; } @@ -2386,6 +2392,10 @@ out_stop: ext3_orphan_del(handle, inode); ext3_journal_stop(handle); + /* We update time only for linked inodes. Unlinked ones already +* notified parent during unlink... */ + if (inode-i_nlink) + ext3_update_rtimes(inode); } static ext3_fsblk_t ext3_get_inode_block(struct super_block *sb, @@ -2628,6 +2638,8 @@ void ext3_read_inode(struct inode * inod inode-i_ctime.tv_sec = (signed)le32_to_cpu(raw_inode-i_ctime); inode-i_mtime.tv_sec = (signed)le32_to_cpu(raw_inode-i_mtime); inode-i_atime.tv_nsec = inode-i_ctime.tv_nsec = inode-i_mtime.tv_nsec = 0; + if (EXT3_HAS_COMPAT_FEATURE(inode-i_sb, EXT3_FEATURE_COMPAT_RTIME)) + ei-i_rtime = le32_to_cpu(raw_inode-i_rtime); ei-i_state = 0; ei-i_dir_start_lookup = 0; @@ -2780,6 +2792,8 @@ static int ext3_do_update_inode(handle_t raw_inode-i_atime = cpu_to_le32(inode-i_atime.tv_sec); raw_inode-i_ctime = cpu_to_le32(inode-i_ctime.tv_sec); raw_inode-i_mtime = cpu_to_le32(inode-i_mtime.tv_sec); + if (EXT3_HAS_COMPAT_FEATURE(inode-i_sb, EXT3_FEATURE_COMPAT_RTIME)) + raw_inode-i_rtime = cpu_to_le32(ei-i_rtime); raw_inode-i_blocks = cpu_to_le32(inode
[PATCH] Fix 64KB blocksize in ext3 directories
Hi Andrew, looking at the commit messages it seems the patch fixing overflow of rec_len in ext3 directories when using 64KB blocksize got lost (I'm not surprised given the churn of the patches in this area for 2.6.24-rc1 ;). Anyway, the patch is attached. It should make it into 2.6.24 (or nasty things would happen if somebody actually try 64KB blocksize in ext3). Thanks Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. The patch also converts some places to use ext3_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.24-rc1/fs/ext3/dir.c linux-2.6.24-rc1-1-ext3_64k_blocksize/fs/ext3/dir.c --- linux-2.6.24-rc1/fs/ext3/dir.c 2007-10-24 20:43:56.0 +0200 +++ linux-2.6.24-rc1-1-ext3_64k_blocksize/fs/ext3/dir.c 2007-10-24 20:45:01.0 +0200 @@ -67,7 +67,7 @@ int ext3_check_dir_entry (const char * f unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de-rec_len); + const int rlen = ext3_rec_len_from_disk(de-rec_len); if (rlen EXT3_DIR_REC_LEN(1)) error_msg = rec_len is smaller than minimal; @@ -173,10 +173,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below. */ - if (le16_to_cpu(de-rec_len) + if (ext3_rec_len_from_disk(de-rec_len) EXT3_DIR_REC_LEN(1)) break; - i += le16_to_cpu(de-rec_len); + i += ext3_rec_len_from_disk(de-rec_len); } offset = i; filp-f_pos = (filp-f_pos ~(sb-s_blocksize - 1)) @@ -197,7 +197,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de-rec_len); + offset += ext3_rec_len_from_disk(de-rec_len); if (le32_to_cpu(de-inode)) { /* We might block in the next section * if the data destination is @@ -219,7 +219,7 @@ revalidate: goto revalidate; stored ++; } - filp-f_pos += le16_to_cpu(de-rec_len); + filp-f_pos += ext3_rec_len_from_disk(de-rec_len); } offset = 0; brelse (bh); diff -rupX /home/jack/.kerndiffexclude linux-2.6.24-rc1/fs/ext3/namei.c linux-2.6.24-rc1-1-ext3_64k_blocksize/fs/ext3/namei.c --- linux-2.6.24-rc1/fs/ext3/namei.c2007-10-24 20:43:56.0 +0200 +++ linux-2.6.24-rc1-1-ext3_64k_blocksize/fs/ext3/namei.c 2007-10-24 20:45:58.0 +0200 @@ -177,6 +177,15 @@ static int ext3_dx_add_entry(handle_t *h struct inode *inode); /* + * p is at least 6 bytes before the end of page + */ +static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p) +{ + return (struct ext3_dir_entry_2 *)((char*)p + + ext3_rec_len_from_disk(p-rec_len)); +} + +/* * Future: use high four bits of block for coalesce-on-delete flags * Mask them off for now. */ @@ -280,7 +289,7 @@ static struct stats dx_show_leaf(struct space += EXT3_DIR_REC_LEN(de-name_len); names++; } - de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext3_next_entry(de); } printk((%i)\n, names); return (struct stats) { names, space, 1 }; @@ -547,14 +556,6 @@ static int ext3_htree_next_block(struct /* - * p is at least 6 bytes before the end of page - */ -static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p) -{ - return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p-rec_len)); -} - -/* * This function fills a red-black tree with information from a * directory block. It returns the number directory entries loaded * into the tree. If there is an error it is returned in err. @@ -720,7 +721,7 @@ static int dx_make_map (struct ext3_dir_ cond_resched(); } /* XXX: do we need to check rec_len == 0 case? -Chris
Re: a potential deadlock?
Hi, Here is a question to be confirmed. In ext3_ioctl() with cmd == EXT3_IOC_SETFLAGS, we firstly lock inode-i_mutex, start a handle with 1 journal-block by calling ext3_journal_start(). In ext3_new_blocks(), say QUOTA was enabled with vfsv0 format, we will call the function DQUOT_ALLOC_BLOCK(). The handle in ext3_new_blocks() was started by high-level functions, and DQUOT_ALLOC_BLOCK() will finally calles ext3_quota_write() in which it try to lock the i_mutex of the inode of a quota-file. At it happens, when we want to modify the inodes of quota-files via ext3_ioctl(cmd = EXT3_IOC_SETFLAGS) (say process-A), another guy try to execute ext3_quota_write() by calling DQUOT_ALLOC_BLOCK() (say process-B). I guess a potential deadlock between process-A and process-B would happen in such a executing sequence: (1) process-B got many journal-blocks, then came into ext3_new_blocks(), hung up (2) process-A locked i_mutex of the inode of a quota-file, then try to starts a handle. Unfortunately, there are no enough journal-blocks left for process-A. (3) process-B awakened, and came into DQUOT_ALLOC_BLOCK(), finally came into the function ext3_quota_write() who also wants to lock the i_mutex of the inode of a quota-file. But the i_mutex was locked by process-A. so process-B has no choice but to wait. (4) if the ext3-filesystem was too busy to release jounal-blocks for process-A, or a unexpected incident happened. Both the two situations would result in no journal-blocks for any other processes. Apparently, process-A have to wait for available journal-blocks. so process-A was hung-up with i_mutex of the inode of a quota-file locked. (5) process-B was blocked by the inode-i_mutex subsequently. Yes, that is a lock inversion between the journal lock and i_mutex on quota files which can indeed lead to a deadlock. Thanks for spotting it. Luckily it's not very likely you're going to hit it but we should fix it anyway. Currently I don't have much idea how - probably we'd have to get rid of the need to use i_mutex on quota files in quota_write but that's also non-trivial. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] e2fsprogs: handle rec_len in case of 64KB blocks
Hi Ted, I haven't got any answer from you regarding this patch implementing hadling of rec_len for 64KB block size. Is it still waiting in your queue to review or did it just get lost? Since all the kernel changes will be in 2.6.24, we should have tools for that too... Please write me if there are any problems with the patch. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR --- Subject: Support for 64KB blocksize in ext2-4 directories. When block size is 64KB, we have to take care that rec_len does not overflow. Kernel stores 0x in case 0x1 should be stored - perform appropriate conversion when reading from / writing to disk. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c index fb20fa0..628bf93 100644 --- a/lib/ext2fs/dirblock.c +++ b/lib/ext2fs/dirblock.c @@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi dirent = (struct ext2_dir_entry *) p; #ifdef WORDS_BIGENDIAN dirent-inode = ext2fs_swab32(dirent-inode); - dirent-rec_len = ext2fs_swab16(dirent-rec_len); dirent-name_len = ext2fs_swab16(dirent-name_len); #endif + dirent-rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); name_len = dirent-name_len; #ifdef WORDS_BIGENDIAN if (flags EXT2_DIRBLOCK_V2_STRUCT) @@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block, void *inbuf, int flags EXT2FS_ATTR((unused))) { -#ifdef WORDS_BIGENDIAN errcode_t retval; char*p, *end; char*buf = 0; struct ext2_dir_entry *dirent; +#ifndef WORDS_BIGENDIAN + if (fs-blocksize EXT2_MAX_REC_LEN) + goto just_write; +#endif retval = ext2fs_get_mem(fs-blocksize, buf); if (retval) return retval; @@ -89,7 +92,7 @@ errcode_t ext2fs_write_dir_block2(ext2_f } p += dirent-rec_len; dirent-inode = ext2fs_swab32(dirent-inode); - dirent-rec_len = ext2fs_swab16(dirent-rec_len); + dirent-rec_len = ext2fs_rec_len_to_disk(dirent-rec_len); dirent-name_len = ext2fs_swab16(dirent-name_len); if (flags EXT2_DIRBLOCK_V2_STRUCT) @@ -98,9 +101,8 @@ errcode_t ext2fs_write_dir_block2(ext2_f retval = io_channel_write_blk(fs-io, block, 1, buf); ext2fs_free_mem(buf); return retval; -#else +just_write: return io_channel_write_blk(fs-io, block, 1, (char *) inbuf); -#endif } diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h index a316665..2041f0f 100644 --- a/lib/ext2fs/ext2_fs.h +++ b/lib/ext2fs/ext2_fs.h @@ -717,6 +717,32 @@ struct ext2_dir_entry_2 { #define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1) #define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) \ ~EXT2_DIR_ROUND) +#define EXT2_MAX_REC_LEN ((116)-1) + +static inline unsigned ext2fs_rec_len_from_disk(unsigned len) +{ +#ifdef WORDS_BIGENDIAN + len = ext2fs_swab16(dlen); +#endif + if (len == EXT2_MAX_REC_LEN) + return 1 16; + return len; +} + +static inline unsigned ext2fs_rec_len_to_disk(unsigned len) +{ + if (len == (1 16)) +#ifdef WORDS_BIGENDIAN + return ext2fs_swab16(EXT2_MAX_REC_LEN); +#else + return EXT2_MAX_REC_LEN; +#endif +#ifdef WORDS_BIGENDIAN + return ext2fs_swab_16(len); +#else + return len; +#endif +} /* * This structure will be used for multiple mount protection. It will be diff --git a/misc/e2image.c b/misc/e2image.c index 1fbb267..4e2c9fb 100644 --- a/misc/e2image.c +++ b/misc/e2image.c @@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils end = buf + fs-blocksize; for (p = buf; p end-8; p += rec_len) { dirent = (struct ext2_dir_entry_2 *) p; - rec_len = dirent-rec_len; -#ifdef WORDS_BIGENDIAN - rec_len = ext2fs_swab16(rec_len); -#endif + rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); #if 0 printf(rec_len = %d, name_len = %d\n, rec_len, dirent-name_len); #endif @@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils bad rec_len (%d)\n, (unsigned long) blk, rec_len); rec_len = end - p; -#ifdef WORDS_BIGENDIAN - dirent-rec_len = ext2fs_swab16(rec_len); -#endif + dirent-rec_len = ext2fs_rec_len_to_disk(rec_len); continue; } if (dirent-name_len + 8 rec_len) { - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body
Re: + ext3-change-the-default-behaviour-on-error.patch added to -mm tree
The patch titled ext3: change the default behaviour on error has been added to the -mm tree. Its filename is ext3-change-the-default-behaviour-on-error.patch *** Remember to use Documentation/SubmitChecklist when testing your code *** See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this -- Subject: ext3: change the default behaviour on error From: Aneesh Kumar K.V [EMAIL PROTECTED] ext3 file system was by default ignoring errors and continuing. This is not a good default as continuing on error could lead to file system corruption. Change the default to mark the file system readonly. Debian and ubuntu already does this as the default in their fstab. The change is fine as such but looking at it I just wonder whether it would not make sence to write in /proc/mounts options corresponding to the real state of mount options and not what we guess user has specified... Or does anybody see a sane usecase where userspace would rather want to see the current output? ... maybe when user wants to inspect /proc/mounts by himself. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] Cc: linux-ext4@vger.kernel.org Cc: Eric Sandeen [EMAIL PROTECTED] Cc: Jan Kara [EMAIL PROTECTED] Cc: Dave Jones [EMAIL PROTECTED] Cc: Chuck Ebbert [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- fs/ext3/super.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff -puN fs/ext3/super.c~ext3-change-the-default-behaviour-on-error fs/ext3/super.c --- a/fs/ext3/super.c~ext3-change-the-default-behaviour-on-error +++ a/fs/ext3/super.c @@ -575,16 +575,16 @@ static int ext3_show_options(struct seq_ le16_to_cpu(es-s_def_resgid) != EXT3_DEF_RESGID) { seq_printf(seq, ,resgid=%u, sbi-s_resgid); } - if (test_opt(sb, ERRORS_CONT)) { + if (test_opt(sb, ERRORS_RO)) { int def_errors = le16_to_cpu(es-s_errors); if (def_errors == EXT3_ERRORS_PANIC || - def_errors == EXT3_ERRORS_RO) { - seq_puts(seq, ,errors=continue); + def_errors == EXT3_ERRORS_CONTINUE) { + seq_puts(seq, ,errors=remount-ro); } } - if (test_opt(sb, ERRORS_RO)) - seq_puts(seq, ,errors=remount-ro); + if (test_opt(sb, ERRORS_CONT)) + seq_puts(seq, ,errors=continue); if (test_opt(sb, ERRORS_PANIC)) seq_puts(seq, ,errors=panic); if (test_opt(sb, NO_UID32)) @@ -1559,10 +1559,10 @@ static int ext3_fill_super (struct super if (le16_to_cpu(sbi-s_es-s_errors) == EXT3_ERRORS_PANIC) set_opt(sbi-s_mount_opt, ERRORS_PANIC); - else if (le16_to_cpu(sbi-s_es-s_errors) == EXT3_ERRORS_RO) - set_opt(sbi-s_mount_opt, ERRORS_RO); - else + else if (le16_to_cpu(sbi-s_es-s_errors) == EXT3_ERRORS_CONTINUE) set_opt(sbi-s_mount_opt, ERRORS_CONT); + else + set_opt(sbi-s_mount_opt, ERRORS_RO); sbi-s_resuid = le16_to_cpu(es-s_def_resuid); sbi-s_resgid = le16_to_cpu(es-s_def_resgid); _ Patches currently in -mm which might be from [EMAIL PROTECTED] are ext2-return-after-ext2_error-in-case-of-failures.patch ext2-change-the-default-behaviour-on-error.patch ext4-return-after-ext4_error-in-case-of-failures.patch ext3-return-after-ext3_error-in-case-of-failures.patch ext3-change-the-default-behaviour-on-error.patch ext2-fix-the-max-file-size-for-ext2-file-system.patch ext3-fix-the-max-file-size-for-ext3-file-system.patch Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext4 devel interlock meeting minutes (October 22, 2007)
Hello, Sorry for not taking part in the call but I was just on my way from NY to Prague... Ext4 Developer Interlock Call October 22, 2007: Meeting Minutes Attendees: Mingming Cao, Andreas Dilger, Eric Sandeen, Dave Kleikamp, Jose Santos, Aneesh Veetil, Valerie Clement, Avantika Mathur - There has been discussion on linux-ext4 about overflow in ext2 with 64 KB block size. Discussed adding an incompat flag to this feature and concluded that it is too late to do this; it will be treated as a bug fix. Hmm, why adding an incompat flag? Obviously there's no harm in doing that but the large block size itself should be enough to protect old kernels/utils touching it, shouldn't it? E2fsprogs - Ted has added new branches the the e2fsprogs git tree - master: current stable version - next: patches expected to go into the next stable version - pu: proposed update; patches that are in preliminary review and test phase - E2fsprogs patches should probably be submitted against the 'next' branch of the git tree. BTW: We still don't have appropriate patches for large block size in e2fsprogs. I guess I should ping Ted about merging the patch once more. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Thu 04-10-07 16:11:21, Andrew Morton wrote: On Thu, 4 Oct 2007 16:40:44 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: On Oct 04, 2007 13:12 -0700, Andrew Morton wrote: On Mon, 01 Oct 2007 17:35:46 -0700 ext2: Avoid rec_len overflow with 64KB block size into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. This patch clashes in non-trivial ways with ext2-convert-to-new-aops-fix.patch and perhaps other things which are already queued for 2.6.24 inclusion, so I'll need to ask for an updated patch, please. If the rel_len overflow patch isn't going to make it, then we also need to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible to allow this to be up to 32768 w/o the rec_len overflow fix however. Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and ext2-support-large-blocksize-up-to-pagesize.patch. Sorry, for the delayed answer but I had some urgent bugs to fix... Why did you drom ext3-support-large-blocksize-up-to-pagesize.patch? As far as I understand your previous email (and also as I've checked against 2.6.23-rc8-mm2), the patch fixing rec_len overflow clashes only for ext2... I'll send you an updated patch for ext2 in a moment. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Thu 04-10-07 13:12:07, Andrew Morton wrote: On Mon, 01 Oct 2007 17:35:46 -0700 Mingming Cao [EMAIL PROTECTED] wrote: ext2: Avoid rec_len overflow with 64KB block size From: Jan Kara [EMAIL PROTECTED] With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. This patch clashes in non-trivial ways with ext2-convert-to-new-aops-fix.patch and perhaps other things which are already queued for 2.6.24 inclusion, so I'll need to ask for an updated patch, please. Also, I'm planing on merging the ext2 reservations code into 2.6.24, so if we're aiming for complete support of 64k blocksize in 2.6.24's ext2, additional testing and checking will be needed. OK, attached is a patch diffed against 2.6.23-rc9-mm2 - does that work fine for you? Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR -- With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-mm/fs/ext2/dir.c linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c --- linux-2.6.23-mm/fs/ext2/dir.c 2007-10-11 12:08:16.0 +0200 +++ linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c2007-10-11 12:14:24.0 +0200 @@ -28,6 +28,24 @@ typedef struct ext2_dir_entry_2 ext2_dirent; +static inline unsigned ext2_rec_len_from_disk(__le16 dlen) +{ + unsigned len = le16_to_cpu(dlen); + + if (len == EXT2_MAX_REC_LEN) + return 1 16; + return len; +} + +static inline __le16 ext2_rec_len_to_disk(unsigned len) +{ + if (len == (1 16)) + return cpu_to_le16(EXT2_MAX_REC_LEN); + else if (len (1 16)) + BUG(); + return cpu_to_le16(len); +} + /* * ext2 uses block-sized chunks. Arguably, sector-sized ones would be * more robust, but we have what we have @@ -106,7 +124,7 @@ static void ext2_check_page(struct page } for (offs = 0; offs = limit - EXT2_DIR_REC_LEN(1); offs += rec_len) { p = (ext2_dirent *)(kaddr + offs); - rec_len = le16_to_cpu(p-rec_len); + rec_len = ext2_rec_len_from_disk(p-rec_len); if (rec_len EXT2_DIR_REC_LEN(1)) goto Eshort; @@ -204,7 +222,8 @@ static inline int ext2_match (int len, c */ static inline ext2_dirent *ext2_next_entry(ext2_dirent *p) { - return (ext2_dirent *)((char*)p + le16_to_cpu(p-rec_len)); + return (ext2_dirent *)((char*)p + + ext2_rec_len_from_disk(p-rec_len)); } static inline unsigned @@ -316,7 +335,7 @@ ext2_readdir (struct file * filp, void * return 0; } } - filp-f_pos += le16_to_cpu(de-rec_len); + filp-f_pos += ext2_rec_len_from_disk(de-rec_len); } ext2_put_page(page); } @@ -425,7 +444,7 @@ void ext2_set_link(struct inode *dir, st { loff_t pos = page_offset(page) + (char *) de - (char *) page_address(page); - unsigned len = le16_to_cpu(de-rec_len); + unsigned len = ext2_rec_len_from_disk(de-rec_len); int err; lock_page(page); @@ -482,7 +501,7 @@ int ext2_add_link (struct dentry *dentry /* We hit i_size */ name_len = 0; rec_len = chunk_size; - de-rec_len = cpu_to_le16(chunk_size); + de-rec_len = ext2_rec_len_to_disk(chunk_size); de-inode = 0; goto got_it; } @@ -496,7 +515,7 @@ int ext2_add_link (struct dentry *dentry if (ext2_match (namelen, name, de)) goto out_unlock; name_len = EXT2_DIR_REC_LEN(de-name_len); - rec_len = le16_to_cpu(de-rec_len); + rec_len = ext2_rec_len_from_disk(de-rec_len); if (!de-inode rec_len = reclen) goto got_it; if (rec_len = name_len + reclen) @@ -518,8 +537,8 @@ got_it: goto out_unlock; if (de-inode) { ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len); - de1-rec_len = cpu_to_le16(rec_len - name_len); - de-rec_len = cpu_to_le16(name_len); + de1-rec_len = ext2_rec_len_to_disk(rec_len - name_len); + de-rec_len
Re: max file size for ext3
Hello, I am looking at ext4_max_size and was confused how the number upper_limit = 0x1ff7fffd000LL is arrived. The comment says the value is arrived looking at 4K. So i tried the below program. main() { unsigned long long upper_limit, meta_blocks; int bits = 12; /* total blocks in 512 bytes */ upper_limit = (1LL 32) - 1; /* total blocks in file system block size */ upper_limit = (bits - 9); meta_blocks = 1; /* double indirect blocks */ meta_blocks += 1 + 1LL (bits-2); /* tripple indirect blocks */ meta_blocks += 1 + 1LL (bits-2) + 1LL (2*(bits-2)); upper_limit -= meta_blocks; upper_limit = bits; printf(%x\n, upper_limit); } Can somebody help me to find out what is missing in the above ? You actually overestimate the number of triply indirect blocks in your program above (you count all possible but since the limit is around 2TB only roughly half of them is really needed). I also think hardcoding 4k block size is not correct. I have the below patch pending with large file size. It is incorrect for blocksize larger than 4096, that's right. Your patch looks fine - we loose a bit of filesize limit but I don't think it really matters. Honza - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Thu 04-10-07 13:12:07, Andrew Morton wrote: On Mon, 01 Oct 2007 17:35:46 -0700 Mingming Cao [EMAIL PROTECTED] wrote: ext2: Avoid rec_len overflow with 64KB block size From: Jan Kara [EMAIL PROTECTED] With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. This patch clashes in non-trivial ways with ext2-convert-to-new-aops-fix.patch and perhaps other things which are already queued for 2.6.24 inclusion, so I'll need to ask for an updated patch, please. Also, I'm planing on merging the ext2 reservations code into 2.6.24, so if we're aiming for complete support of 64k blocksize in 2.6.24's ext2, additional testing and checking will be needed. OK, I'll fixup those rejects and send a new patch. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix commit code to properly abort journal
And a similar patch for JBD2... Honza - We should really call journal_abort() and not __journal_abort_hard() in case of errors. The latter call does not record the error in the journal superblock and thus filesystem won't be marked as with errors later (and user could happily mount it without any warning). Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6-1-jbd_abort_fix/fs/jbd2/commit.c linux-2.6.23-rc6-2-jbd2_abort_fix/fs/jbd2/commit.c --- linux-2.6.23-rc6-1-jbd_abort_fix/fs/jbd2/commit.c 2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-2-jbd2_abort_fix/fs/jbd2/commit.c 2007-09-25 16:14:09.0 +0200 @@ -475,7 +475,7 @@ void jbd2_journal_commit_transaction(jou spin_unlock(journal-j_list_lock); if (err) - __jbd2_journal_abort_hard(journal); + jbd2_journal_abort(journal, err); jbd2_journal_write_revoke_records(journal, commit_transaction); @@ -533,7 +533,7 @@ void jbd2_journal_commit_transaction(jou descriptor = jbd2_journal_get_descriptor_buffer(journal); if (!descriptor) { - __jbd2_journal_abort_hard(journal); + jbd2_journal_abort(journal, -EIO); continue; } @@ -566,7 +566,7 @@ void jbd2_journal_commit_transaction(jou and repeat this loop: we'll fall into the refile-on-abort condition above. */ if (err) { - __jbd2_journal_abort_hard(journal); + jbd2_journal_abort(journal, err); continue; } @@ -757,7 +757,7 @@ wait_for_iobuf: err = -EIO; if (err) - __jbd2_journal_abort_hard(journal); + jbd2_journal_abort(journal, err); /* End of a transaction! Finally, we can do checkpoint processing: any buffers committed as a result of this - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jbd : config_jbd_debug cannot create /proc entry
On Tue, 25 Sep 2007 07:49:38 -0500 Jose R. Santos [EMAIL PROTECTED] wrote: On Tue, 25 Sep 2007 13:50:46 +0200 Jan Kara [EMAIL PROTECTED] wrote: Jan Kara wrote: -#define create_jbd_proc_entry() do {} while (0) -#define remove_jbd_proc_entry() do {} while (0) +static ctl_table fs_table[] = { + { +.ctl_name = -1, /* Don't want it */ shouldn't this be CTL_UNNUMBERED ? Oh, it should be. I didn't notice we have this :) Thanks for notifying me. Attached is a fixed version. This was fixed in JBD2 by moving the jbd-debug file to debugfs: http://lkml.org/lkml/2007/7/11/334 Since this code is already in the kernel, we should keep it consistent. OK. Here's a quick patch to fix this. Adapted from the JBD2 patch. Let me know what you think. Looks fine - exactly what I've just done here :). Honza commit 6cbd2ce05b7504514707ce825170a5d77abf6a6e Author: root [EMAIL PROTECTED] Date: Thu Jun 14 09:40:09 2007 -0500 The jbd-debug file used to be located in /proc/sys/fs/jbd-debug, but create_proc_entry() does not do lookups on file names that are more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. Instead of fixing this on procfs might as well move the jbd2-debug file to debugfs which would be the preferred location for this kind of tunable. The new location is now /sys/kernel/debug/jbd/jbd-debug. Signed-off-by: Jose R. Santos [EMAIL PROTECTED] You can add Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/fs/Kconfig b/fs/Kconfig index 58a0650..a8937a6 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -219,7 +219,7 @@ config JBD config JBD_DEBUG bool JBD (ext3) debugging support - depends on JBD + depends on JBD DEBUG_FS help If you are using the ext3 journaled file system (or potentially any other file system/device using JBD), this option allows you to @@ -228,10 +228,10 @@ config JBD_DEBUG debugging output will be turned off. If you select Y here, then you will be able to turn on debugging - with echo N /proc/sys/fs/jbd-debug, where N is a number between - 1 and 5, the higher the number, the more debugging output is - generated. To turn debugging off again, do - echo 0 /proc/sys/fs/jbd-debug. + with echo N /sys/kernel/debug/jbd/jbd-debug, where N is a + number between 1 and 5, the higher the number, the more debugging + output is generated. To turn debugging off again, do + echo 0 /sys/kernel/debug/jbd/jbd-debug. config JBD2 tristate diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c index 06ab3c1..3cad624 100644 --- a/fs/jbd/journal.c +++ b/fs/jbd/journal.c @@ -35,6 +35,7 @@ #include linux/kthread.h #include linux/poison.h #include linux/proc_fs.h +#include linux/debugfs.h #include asm/uaccess.h #include asm/page.h @@ -1939,63 +1940,38 @@ void journal_put_journal_head(struct journal_head *jh) } /* - * /proc tunables + * debugfs tunables */ #if defined(CONFIG_JBD_DEBUG) -int journal_enable_debug; +u8 journal_enable_debug; EXPORT_SYMBOL(journal_enable_debug); #endif -#if defined(CONFIG_JBD_DEBUG) defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD_DEBUG) defined(CONFIG_DEBUG_FS) -static struct proc_dir_entry *proc_jbd_debug; +struct dentry *jbd_debugfs_dir, *jbd_debug; -static int read_jbd_debug(char *page, char **start, off_t off, - int count, int *eof, void *data) +static void __init create_jbd_debugfs_entry(void) { - int ret; - - ret = sprintf(page + off, %d\n, journal_enable_debug); - *eof = 1; - return ret; -} - -static int write_jbd_debug(struct file *file, const char __user *buffer, -unsigned long count, void *data) -{ - char buf[32]; - - if (count ARRAY_SIZE(buf) - 1) - count = ARRAY_SIZE(buf) - 1; - if (copy_from_user(buf, buffer, count)) - return -EFAULT; - buf[ARRAY_SIZE(buf) - 1] = '\0'; - journal_enable_debug = simple_strtoul(buf, NULL, 10); - return count; -} - -#define JBD_PROC_NAME sys/fs/jbd-debug - -static void __init create_jbd_proc_entry(void) -{ - proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL); - if (proc_jbd_debug) { - /* Why is this so hard? */ - proc_jbd_debug-read_proc = read_jbd_debug; - proc_jbd_debug-write_proc = write_jbd_debug; - } + jbd_debugfs_dir = debugfs_create_dir(jbd, NULL); + if (jbd_debugfs_dir) + jbd_debug = debugfs_create_u8(jbd-debug, S_IRUGO, +jbd_debugfs_dir
[PATCH] Support for 64KB blocksize in e2fsprogs
Hi, attached patch implements support for 64KB blocksize in directories in e2fsprogs. This patch complements the kernel patches for ext2-4 I've sent yesterday. Ted, does the patch look fine? Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - Subject: Support for 64KB blocksize in ext2-4 directories. When block size is 64KB, we have to take care that rec_len does not overflow. Kernel stores 0x in case 0x1 should be stored - perform appropriate conversion when reading from / writing to disk. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c index fb20fa0..628bf93 100644 --- a/lib/ext2fs/dirblock.c +++ b/lib/ext2fs/dirblock.c @@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi dirent = (struct ext2_dir_entry *) p; #ifdef WORDS_BIGENDIAN dirent-inode = ext2fs_swab32(dirent-inode); - dirent-rec_len = ext2fs_swab16(dirent-rec_len); dirent-name_len = ext2fs_swab16(dirent-name_len); #endif + dirent-rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); name_len = dirent-name_len; #ifdef WORDS_BIGENDIAN if (flags EXT2_DIRBLOCK_V2_STRUCT) @@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block, void *inbuf, int flags EXT2FS_ATTR((unused))) { -#ifdef WORDS_BIGENDIAN errcode_t retval; char*p, *end; char*buf = 0; struct ext2_dir_entry *dirent; +#ifndef WORDS_BIGENDIAN + if (fs-blocksize EXT2_MAX_REC_LEN) + goto just_write; +#endif retval = ext2fs_get_mem(fs-blocksize, buf); if (retval) return retval; @@ -89,7 +92,7 @@ errcode_t ext2fs_write_dir_block2(ext2_f } p += dirent-rec_len; dirent-inode = ext2fs_swab32(dirent-inode); - dirent-rec_len = ext2fs_swab16(dirent-rec_len); + dirent-rec_len = ext2fs_rec_len_to_disk(dirent-rec_len); dirent-name_len = ext2fs_swab16(dirent-name_len); if (flags EXT2_DIRBLOCK_V2_STRUCT) @@ -98,9 +101,8 @@ errcode_t ext2fs_write_dir_block2(ext2_f retval = io_channel_write_blk(fs-io, block, 1, buf); ext2fs_free_mem(buf); return retval; -#else +just_write: return io_channel_write_blk(fs-io, block, 1, (char *) inbuf); -#endif } diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h index a316665..2041f0f 100644 --- a/lib/ext2fs/ext2_fs.h +++ b/lib/ext2fs/ext2_fs.h @@ -717,6 +717,32 @@ struct ext2_dir_entry_2 { #define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1) #define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) \ ~EXT2_DIR_ROUND) +#define EXT2_MAX_REC_LEN ((116)-1) + +static inline unsigned ext2fs_rec_len_from_disk(unsigned len) +{ +#ifdef WORDS_BIGENDIAN + len = ext2fs_swab16(dlen); +#endif + if (len == EXT2_MAX_REC_LEN) + return 1 16; + return len; +} + +static inline unsigned ext2fs_rec_len_to_disk(unsigned len) +{ + if (len == (1 16)) +#ifdef WORDS_BIGENDIAN + return ext2fs_swab16(EXT2_MAX_REC_LEN); +#else + return EXT2_MAX_REC_LEN; +#endif +#ifdef WORDS_BIGENDIAN + return ext2fs_swab_16(len); +#else + return len; +#endif +} /* * This structure will be used for multiple mount protection. It will be diff --git a/misc/e2image.c b/misc/e2image.c index 1fbb267..4e2c9fb 100644 --- a/misc/e2image.c +++ b/misc/e2image.c @@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils end = buf + fs-blocksize; for (p = buf; p end-8; p += rec_len) { dirent = (struct ext2_dir_entry_2 *) p; - rec_len = dirent-rec_len; -#ifdef WORDS_BIGENDIAN - rec_len = ext2fs_swab16(rec_len); -#endif + rec_len = ext2fs_rec_len_from_disk(dirent-rec_len); #if 0 printf(rec_len = %d, name_len = %d\n, rec_len, dirent-name_len); #endif @@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils bad rec_len (%d)\n, (unsigned long) blk, rec_len); rec_len = end - p; -#ifdef WORDS_BIGENDIAN - dirent-rec_len = ext2fs_swab16(rec_len); -#endif + dirent-rec_len = ext2fs_rec_len_to_disk(rec_len); continue; } if (dirent-name_len + 8 rec_len) { - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Avoid rec_len overflow with 64KB block size
Hi, In the following series I'm sending patches to ext2, ext3 and ext4 solving possible overflow of rec_len when using 64KB blocksize. Minming, would you add these to the patch queue? Thanks. Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] Avoid rec_len overflow with 64KB block size
With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext2/dir.c linux-2.6.23-rc6-1-ext2_64k_blocksize/fs/ext2/dir.c --- linux-2.6.23-rc6/fs/ext2/dir.c 2007-07-16 17:47:21.0 +0200 +++ linux-2.6.23-rc6-1-ext2_64k_blocksize/fs/ext2/dir.c 2007-09-24 19:26:13.0 +0200 @@ -26,6 +26,24 @@ typedef struct ext2_dir_entry_2 ext2_dirent; +static inline unsigned ext2_rec_len_from_disk(__le16 dlen) +{ + unsigned len = le16_to_cpu(dlen); + + if (len == EXT2_MAX_REC_LEN) + return 1 16; + return len; +} + +static inline __le16 ext2_rec_len_to_disk(unsigned len) +{ + if (len == (1 16)) + return cpu_to_le16(EXT2_MAX_REC_LEN); + else if (len (1 16)) + BUG(); + return cpu_to_le16(len); +} + /* * ext2 uses block-sized chunks. Arguably, sector-sized ones would be * more robust, but we have what we have @@ -95,7 +113,7 @@ static void ext2_check_page(struct page } for (offs = 0; offs = limit - EXT2_DIR_REC_LEN(1); offs += rec_len) { p = (ext2_dirent *)(kaddr + offs); - rec_len = le16_to_cpu(p-rec_len); + rec_len = ext2_rec_len_from_disk(p-rec_len); if (rec_len EXT2_DIR_REC_LEN(1)) goto Eshort; @@ -193,7 +211,8 @@ static inline int ext2_match (int len, c */ static inline ext2_dirent *ext2_next_entry(ext2_dirent *p) { - return (ext2_dirent *)((char*)p + le16_to_cpu(p-rec_len)); + return (ext2_dirent *)((char*)p + + ext2_rec_len_from_disk(p-rec_len)); } static inline unsigned @@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void * return 0; } } - filp-f_pos += le16_to_cpu(de-rec_len); + filp-f_pos += ext2_rec_len_from_disk(de-rec_len); } ext2_put_page(page); } @@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, st struct page *page, struct inode *inode) { unsigned from = (char *) de - (char *) page_address(page); - unsigned to = from + le16_to_cpu(de-rec_len); + unsigned to = from + ext2_rec_len_from_disk(de-rec_len); int err; lock_page(page); @@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry /* We hit i_size */ name_len = 0; rec_len = chunk_size; - de-rec_len = cpu_to_le16(chunk_size); + de-rec_len = ext2_rec_len_to_disk(chunk_size); de-inode = 0; goto got_it; } @@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry if (ext2_match (namelen, name, de)) goto out_unlock; name_len = EXT2_DIR_REC_LEN(de-name_len); - rec_len = le16_to_cpu(de-rec_len); + rec_len = ext2_rec_len_from_disk(de-rec_len); if (!de-inode rec_len = reclen) goto got_it; if (rec_len = name_len + reclen) @@ -504,8 +523,8 @@ got_it: goto out_unlock; if (de-inode) { ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len); - de1-rec_len = cpu_to_le16(rec_len - name_len); - de-rec_len = cpu_to_le16(name_len); + de1-rec_len = ext2_rec_len_to_disk(rec_len - name_len); + de-rec_len = ext2_rec_len_to_disk(name_len); de = de1; } de-name_len = namelen; @@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_e struct inode *inode = mapping-host; char *kaddr = page_address(page); unsigned from = ((char*)dir - kaddr) ~(ext2_chunk_size(inode)-1); - unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir-rec_len); + unsigned to = ((char*)dir - kaddr) + ext2_rec_len_from_disk(dir-rec_len); ext2_dirent * pde = NULL; ext2_dirent * de = (ext2_dirent *) (kaddr + from); int err; @@ -557,7 +576,7 @@ int ext2_delete_entry (struct ext2_dir_e err = mapping-a_ops-prepare_write(NULL, page, from, to); BUG_ON(err); if (pde) - pde-rec_len = cpu_to_le16(to-from); + pde-rec_len = ext2_rec_len_to_disk(to-from); dir-inode = 0; err = ext2_commit_chunk(page, from, to); inode-i_ctime = inode-i_mtime = CURRENT_TIME_SEC; @@ -591,14
Re: [PATCH 3/3] Avoid rec_len overflow with 64KB block size
With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. The patch also converts some places to use ext4_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/dir.c linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c --- linux-2.6.23-rc6/fs/ext4/dir.c 2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c 2007-09-24 18:30:13.0 +0200 @@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * f unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de-rec_len); + const int rlen = ext4_rec_len_from_disk(de-rec_len); if (rlen EXT4_DIR_REC_LEN(1)) error_msg = rec_len is smaller than minimal; @@ -176,10 +176,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below. */ - if (le16_to_cpu(de-rec_len) - EXT4_DIR_REC_LEN(1)) + if (ext4_rec_len_from_disk(de-rec_len) +EXT4_DIR_REC_LEN(1)) break; - i += le16_to_cpu(de-rec_len); + i += ext4_rec_len_from_disk(de-rec_len); } offset = i; filp-f_pos = (filp-f_pos ~(sb-s_blocksize - 1)) @@ -201,7 +201,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de-rec_len); + offset += ext4_rec_len_from_disk(de-rec_len); if (le32_to_cpu(de-inode)) { /* We might block in the next section * if the data destination is @@ -223,7 +223,7 @@ revalidate: goto revalidate; stored ++; } - filp-f_pos += le16_to_cpu(de-rec_len); + filp-f_pos += ext4_rec_len_from_disk(de-rec_len); } offset = 0; brelse (bh); diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/namei.c linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c --- linux-2.6.23-rc6/fs/ext4/namei.c2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c 2007-09-24 18:35:34.0 +0200 @@ -280,7 +280,7 @@ static struct stats dx_show_leaf(struct space += EXT4_DIR_REC_LEN(de-name_len); names++; } - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext4_next_entry(de); } printk((%i)\n, names); return (struct stats) { names, space, 1 }; @@ -525,7 +525,8 @@ static int ext4_htree_next_block(struct */ static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p) { - return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p-rec_len)); + return (struct ext4_dir_entry_2 *)((char*)p + + ext4_rec_len_from_disk(p-rec_len)); } /* @@ -689,7 +690,7 @@ static int dx_make_map (struct ext4_dir_ cond_resched(); } /* XXX: do we need to check rec_len == 0 case? -Chris */ - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext4_next_entry(de); } return count; } @@ -790,7 +791,7 @@ static inline int search_dirblock(struct return 1; } /* prevent looping on a bad block */ - de_len = le16_to_cpu(de-rec_len); + de_len = ext4_rec_len_from_disk(de-rec_len); if (de_len = 0) return -1; offset += de_len; @@ -1099,7 +1100,7 @@ dx_move_dirents(char *from, char *to, st rec_len = EXT4_DIR_REC_LEN(de-name_len); memcpy (to, de, rec_len); ((struct ext4_dir_entry_2 *) to)-rec_len = - cpu_to_le16(rec_len); + ext4_rec_len_to_disk(rec_len); de-inode = 0; map++; to += rec_len; @@ -1114,13 +1115,12 @@ static struct ext4_dir_entry_2* dx_pack_ prev = to = de; while ((char*)de base + size) { - next
Ext3 not marking filesystems as with errors
Hi, one friend has just pointed me to a following misbehaviour of ext3. If we stumble on some error in JBD (e.g. in commit code), we call __journal_abort_hard(). It just marks the journal as aborted but does nothing else. Later ext3 comes, finds journal aborted, calls ext3_abort() which remounts fs read-only and stops (it does not mark filesystem as having errors). It calls journal_abort(.., -EIO) but that does nothing because the journal is already aborted. If you then unmount the filesystem and mount it again, everything goes on happily as if there was no error - no suggestion for running fsck, nothing. I guess this is a bug but please correct me if you don't think so. There are two possibilities how to fix it - either we mark the filesystem as with errors in ext3_abort() or we could call some less lowlevel function from JBD to abort journal (as soon as j_errno is set, we are safe). Any feeling what is less hacky? Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Avoid rec_len overflow with 64KB block size
Hello, when converting ext4 directories to pagecache I just came over Takashi's patch preventing overflowing of rec_len. Looking over the patch - can't we do it more elegantly by using say 0x instead of 64K and perform conversion (using some helper) at the moment we read / store rec_len? That would be IMHO more transparent than current approach (at least it took me some time to understand what's going on with the current patch when I was looking at the code)... Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Thu 20-09-07 11:14:40, Theodore Tso wrote: On Thu, Sep 20, 2007 at 04:58:39PM +0200, Jan Kara wrote: Hmm, strange - I've just looked at my computer and dir_index is set just for 5 directories in my tree. I looked at a tree that had object files, which is probably why I had 8 directories; I'm guessing you probably just had kernel sources and no build files. If I try deleting just them, I also see some performance decrease but it's less than if I try deleting the whole tree (and that result seems to be quite consistent)... There's something fishy there. Maybe I could try seekwatcher or something similar to see what's really happening. That is very strange. Just a guess: Can't the culprit be the following test in ext3/4_readdir()? if (EXT4_HAS_COMPAT_FEATURE(inode-i_sb, EXT4_FEATURE_COMPAT_DIR_INDEX) ((EXT4_I(inode)-i_flags EXT4_INDEX_FL) || ((inode-i_size sb-s_blocksize_bits) == 1))) { error = ext4_dx_readdir(filp, dirent, filldir); if (error != ERR_BAD_DX_DIR) { ret = error; goto out; } /* * We don't set the inode dirty flag since it's not * critical that it get flushed back to the disk. */ EXT4_I(filp-f_path.dentry-d_inode)-i_flags = ~EXT4_INDEX_FL; } It calls ext4_dx_readdir() for *every* directory with 1 block (we have 1326 of them in the kernel tree). Now ext4_dx_readdir() calls ext4_htree_fill_tree() which finds out the directory is not h-tree and and calls htree_dirblock_to_tree(). So even for 4KB directories we end up deleting inodes in hash order! And as a bonus we burn some cycles building trees etc. What is the point of this? Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Avoid rec_len overflow with 64KB block size
when converting ext4 directories to pagecache I just came over Takashi's patch preventing overflowing of rec_len. Looking over the patch - can't we do it more elegantly by using say 0x instead of 64K and perform conversion (using some helper) at the moment we read / store rec_len? That would be IMHO more transparent than current approach (at least it took me some time to understand what's going on with the current patch when I was looking at the code)... Attached is a patch that does this for ext4. If you like this approach, I can cook up a similar patch for ext2 / ext3. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. The patch also converts some places to use ext4_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/dir.c linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c --- linux-2.6.23-rc6/fs/ext4/dir.c 2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c 2007-09-20 18:08:02.0 +0200 @@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * f unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de-rec_len); + const int rlen = ext4_get_rec_len(le16_to_cpu(de-rec_len)); if (rlen EXT4_DIR_REC_LEN(1)) error_msg = rec_len is smaller than minimal; @@ -176,10 +176,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below. */ -if (le16_to_cpu(de-rec_len) - EXT4_DIR_REC_LEN(1)) +if (ext4_get_rec_len(le16_to_cpu(de-rec_len)) + EXT4_DIR_REC_LEN(1)) break; -i += le16_to_cpu(de-rec_len); +i += ext4_get_rec_len(le16_to_cpu(de-rec_len)); } offset = i; filp-f_pos = (filp-f_pos ~(sb-s_blocksize - 1)) @@ -201,7 +201,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de-rec_len); + offset += ext4_get_rec_len(le16_to_cpu(de-rec_len)); if (le32_to_cpu(de-inode)) { /* We might block in the next section * if the data destination is @@ -223,7 +223,7 @@ revalidate: goto revalidate; stored ++; } - filp-f_pos += le16_to_cpu(de-rec_len); + filp-f_pos += ext4_get_rec_len(le16_to_cpu(de-rec_len)); } offset = 0; brelse (bh); diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/namei.c linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c --- linux-2.6.23-rc6/fs/ext4/namei.c 2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c 2007-09-20 18:29:29.0 +0200 @@ -280,7 +280,7 @@ static struct stats dx_show_leaf(struct space += EXT4_DIR_REC_LEN(de-name_len); names++; } - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext4_next_entry(de); } printk((%i)\n, names); return (struct stats) { names, space, 1 }; @@ -525,7 +525,8 @@ static int ext4_htree_next_block(struct */ static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p) { - return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p-rec_len)); + return (struct ext4_dir_entry_2 *)((char*)p + + ext4_get_rec_len(le16_to_cpu(p-rec_len))); } /* @@ -689,7 +690,7 @@ static int dx_make_map (struct ext4_dir_ cond_resched(); } /* XXX: do we need to check rec_len == 0 case? -Chris */ - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len)); + de = ext4_next_entry(de); } return count; } @@ -790,7 +791,7 @@ static inline int search_dirblock(struct return 1; } /* prevent looping on a bad block */ - de_len = le16_to_cpu(de-rec_len); + de_len = ext4_get_rec_len(le16_to_cpu(de-rec_len)); if (de_len = 0) return -1; offset += de_len; @@ -1099,7 +1100,7 @@ dx_move_dirents(char *from, char *to, st rec_len = EXT4_DIR_REC_LEN(de-name_len); memcpy (to, de, rec_len); ((struct ext4_dir_entry_2 *) to)-rec_len = -cpu_to_le16(rec_len); +cpu_to_le16(ext4_store_rec_len(rec_len)); de-inode = 0; map++; to += rec_len; @@ -1114,13 +1115,12 @@ static struct ext4_dir_entry_2* dx_pack_ prev = to = de; while ((char*)de base + size) { - next = (struct ext4_dir_entry_2 *) ((char *) de + - le16_to_cpu(de-rec_len)); + next = ext4_next_entry(de); if (de-inode de-name_len) { rec_len = EXT4_DIR_REC_LEN(de-name_len); if (de to) memmove(to, de, rec_len); - to-rec_len = cpu_to_le16(rec_len); + to-rec_len = cpu_to_le16(ext4_store_rec_len(rec_len)); prev = to; to = (struct ext4_dir_entry_2 *) (((char *) to) + rec_len); } @@ -1178,8 +1178,8 @@ static struct ext4_dir_entry_2 *do_split /* Fancy dance
Enabling h-trees too early?
Hi, I was just wondering: Currently we start to build h-tree in a directory already when the size of directory exceeds one block. But honestly, it does not seem to make much sence to use this feature until the directory is much larger (I'd say at least 16 or 32 KB). It actually slows down some operations like deleting the whole directory, etc. So what is the reason for starting building the tree so early? Just the simplicity of building it when the directory is just one block large? Honza -- Jan Kara [EMAIL PROTECTED] SUSE Labs, CR - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2/3] 2.6.23-rc6: known regressions v2
FS Subject : hanging ext3 dbench tests References : http://lkml.org/lkml/2007/9/11/176 Last known good : ? Submitter : Andy Whitcroft [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Status : under test -- unreproducible at present Yep... Hard to do anything until Andy is able to reproduce it at least once more to gather needed info. Subject : umount triggers a warning in jfs and takes almost a minute References : http://lkml.org/lkml/2007/9/4/73 Last known good : ? Submitter : Oliver Neukum [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Status : unknown I thought Shaggy asked Oliver about some details (and he did not answer so far) so I'd assume Shaggy is handling this. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] System calls for online defrag
Hello, I've finally got to writing up some proposal how could look system calls allowing for online filesystem defragmentation and generally moving file blocks around for improving performance. Comments are welcome. Honza int sys_movedata(int datafd, int spacefd, loff_t from, size_t len) The call takes blocks used to carry data starting at offset @from of length @len in @spacefd and places them instead of corresponding blocks in @datafd. Data is copied from @datafd to newly spliced data blocks. If @spacefd contains a hole in the specified interval, a hole is created also in @datafd in the corresponding place. A data block from @spacefd and also replace a hole in @datafd - zeros are copied to such data block. @from and @len should be multiples of filesystem block size (otherwise EINVAL is returned). Data blocks from @datafd in the interval are released, a hole is created in @spacefd. The call returns either 0 (success) or an error code. Another possibility would be to just replace data blocks without any copying of data (that would have to be done by the caller to before calling sys_movedata()). The problem here is how to avoid data loss if someone writes to the file after userspace has copied the data and before sys_movedata() is called. ssize_t sys_allocate(int fd, int mode, loff_t goal, ssize_t len) Allocate new space to file @fd at offset defined by file position. Both file offset and @len should be a multiple of filesystem block size. The whole interval must not contain any allocated blocks. If the interval extends past EOF, the file size is changed accordingly. @mode defines a way the filesystem will search for blocks. @mode is a bitwise OR of the following flags: ALLOC_FIXED_START - allocation must start at @goal; if not specified, @goal is just a hint where to start an allocation ALLOC_FIXED_LEN - allocate exactly space for @len; if not specified, upto @len bytes may be allocated. ALLOC_CONTINGUOUS - allocation must be one continguous run of blocks If the allocation succeeds, number of allocated bytes is returned. Otherwise an error code is returned. The following syscall may be also useful - although I'm not completely convinced this is the right way to go. But on the other hand, disk optimizer should have a way to find out about free space so that he can decide what and where is beneficial to move. int sys_get_free_blocks(const char *fs, loff_t start, loff_t end, int count, struct alloc_extent *space) Get a description of free space on a filesystem between @start and @end (in bytes, should be blocksize aligned). @fs is a path where the filesystem is mounted (I guess it's better than dev_t, isn't it?). @space is a pointer to an array of 'struct alloc_extent'. In each struct alloc_extent is stored description of one extent of free space. Upto @count extents are stored. struct alloc_extent { loff_t start; size_t len; }; Function returns a number of extents stored. Note that the result of the function is unreliable as the space can be already allocated by the time system call returns. -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix panic in jbd by adding locks
On Thu 16-08-07 12:37:45, Josef Bacik wrote: On Thu, Aug 16, 2007 at 06:08:35PM +0200, Jan Kara wrote: It is possible to panic the box by a race condition that exists in the journalling code where we do not take the j_revoke_lock when traversing the journal's revoked record list. This patch has been tested and we haven't seen the issue yet, its a rather straightforward and correct (at least I think so :) fix. Thank you, In principle, the patch looks fine. The only thing I'm wondering about is how that panic can happen... Journal write_revoke_records() is called from journal_commit_transaction() when revoke table for the committing transaction shouldn't see any further changes. So maybe the patch is masking a different problem. Do you have a way of reproducing the problem? Any stack trace available? Reproducing the problem is tricky as theres no sure way to make it happen, basically we run the box with alot of memory pressure while doing alot operations that require journalling. Here is the backtrace of the panic (note this is on a RHEL4 kernel so 2.6.9, but the same problem exists upstream) OK. 1Unable to handle kernel paging request at virtual address 602b1110 4kjournald[1310]: Oops 11012296146944 [1] 4Modules linked in: ltd(U) vfat fat dm_mod button uhci_hcd shpchp e1000 bond1(U) bond0(U) ext3 jbd hfcldd(U) hfcldd_conf(U) sd_mod scsi_mod 4 4Pid: 1310, CPU 0, comm:kjournald 4psr : 121008026018 ifs : 8c9c ip : [a00200045161] Tainted: P 4ip is at journal_write_revoke_records+0x221/0x4e0 [jbd] 4unat: pfs : 0c9c rsc : 0003 4rnat: bsps: pr : a681 4ldrs: ccv : fpsr: 0009804c8a70033f 4csd : ssd : 4b0 : a00200045270 b6 : a0020026a240 b7 : a001ee90 4f6 : 0fffbe38e38e381b23800 f7 : 0ffe9edc3d22c 4f8 : 1000e86fb f9 : 100029000 4f10 : 1000aeff71c71b9e6e61a f11 : 1003e0eff 4r1 : a00200234000 r2 : 048c r3 : e002791a7a90 4r8 : r9 : e002791a0400 r10 : 4r11 : e100 r12 : e002791a7b00 r13 : e002791a 4r14 : e0027b7ee6c0 r15 : e002791a7b00 r16 : e00272d48018 4r17 : r18 : r19 : 0009804c8a70033f 4r20 : 602b1118 r21 : a0010006ad70 r22 : 0019 4r23 : r24 : r25 : 0019 4r26 : r27 : r28 : 6a41 4r29 : r30 : r31 : e0027b7ee5a4 4 4Call Trace: 4 [a00100016da0] show_stack+0x80/0xa0 4sp=e002791a7690 bsp=e002791a1170 4 [a001000176b0] show_regs+0x890/0x8c0 4sp=e002791a7860 bsp=e002791a1128 4 [a0010003e910] die+0x150/0x240 4sp=e002791a7880 bsp=e002791a10e8 4 [a001000644c0] ia64_do_page_fault+0x8c0/0xbc0 4sp=e002791a7880 bsp=e002791a1080 4 [a001f600] ia64_leave_kernel+0x0/0x260 4sp=e002791a7930 bsp=e002791a1080 4 [a00200045160] journal_write_revoke_records+0x220/0x4e0 [jbd] 4sp=e002791a7b00 bsp=e002791a0f98 4 [a0020003d940] journal_commit_transaction+0xf80/0x3080 [jbd] 4sp=e002791a7b10 bsp=e002791a0ea0 4 [a002000458d0] kjournald+0x170/0x580 [jbd] 4sp=e002791a7d80 bsp=e002791a0e38 4 [a00100018c70] kernel_thread_helper+0x30/0x60 4sp=e002791a7e30 bsp=e002791a0e10 4 [a0018c60] start_kernel_thread+0x20/0x40 4sp=e002791a7e30 bsp=e002791a0e10 Do you know (or could you find out) where exactly in the code is journal_write_revoke_records+0x221/0x4e0? Yeah sorry 500 void journal_write_revoke_records(journal_t *journal, 501 transaction_t *transaction) 502 { 503 struct journal_head *descriptor; 504 struct jbd_revoke_record_s *record; 505 struct jbd_revoke_table_s *revoke; 506 struct list_head *hash_list; 507 int i, offset, count; 508 509 descriptor = NULL; 510 offset = 0; 511 count = 0; 512 513 /* select revoke table for committing transaction */ 514 revoke = journal-j_revoke == journal-j_revoke_table[0] ? 515
Re: [PATCH] fix panic in jbd by adding locks
Hello, It is possible to panic the box by a race condition that exists in the journalling code where we do not take the j_revoke_lock when traversing the journal's revoked record list. This patch has been tested and we haven't seen the issue yet, its a rather straightforward and correct (at least I think so :) fix. Thank you, In principle, the patch looks fine. The only thing I'm wondering about is how that panic can happen... Journal write_revoke_records() is called from journal_commit_transaction() when revoke table for the committing transaction shouldn't see any further changes. So maybe the patch is masking a different problem. Do you have a way of reproducing the problem? Any stack trace available? Reproducing the problem is tricky as theres no sure way to make it happen, basically we run the box with alot of memory pressure while doing alot operations that require journalling. Here is the backtrace of the panic (note this is on a RHEL4 kernel so 2.6.9, but the same problem exists upstream) OK. 1Unable to handle kernel paging request at virtual address 602b1110 4kjournald[1310]: Oops 11012296146944 [1] 4Modules linked in: ltd(U) vfat fat dm_mod button uhci_hcd shpchp e1000 bond1(U) bond0(U) ext3 jbd hfcldd(U) hfcldd_conf(U) sd_mod scsi_mod 4 4Pid: 1310, CPU 0, comm:kjournald 4psr : 121008026018 ifs : 8c9c ip : [a00200045161] Tainted: P 4ip is at journal_write_revoke_records+0x221/0x4e0 [jbd] 4unat: pfs : 0c9c rsc : 0003 4rnat: bsps: pr : a681 4ldrs: ccv : fpsr: 0009804c8a70033f 4csd : ssd : 4b0 : a00200045270 b6 : a0020026a240 b7 : a001ee90 4f6 : 0fffbe38e38e381b23800 f7 : 0ffe9edc3d22c 4f8 : 1000e86fb f9 : 100029000 4f10 : 1000aeff71c71b9e6e61a f11 : 1003e0eff 4r1 : a00200234000 r2 : 048c r3 : e002791a7a90 4r8 : r9 : e002791a0400 r10 : 4r11 : e100 r12 : e002791a7b00 r13 : e002791a 4r14 : e0027b7ee6c0 r15 : e002791a7b00 r16 : e00272d48018 4r17 : r18 : r19 : 0009804c8a70033f 4r20 : 602b1118 r21 : a0010006ad70 r22 : 0019 4r23 : r24 : r25 : 0019 4r26 : r27 : r28 : 6a41 4r29 : r30 : r31 : e0027b7ee5a4 4 4Call Trace: 4 [a00100016da0] show_stack+0x80/0xa0 4sp=e002791a7690 bsp=e002791a1170 4 [a001000176b0] show_regs+0x890/0x8c0 4sp=e002791a7860 bsp=e002791a1128 4 [a0010003e910] die+0x150/0x240 4sp=e002791a7880 bsp=e002791a10e8 4 [a001000644c0] ia64_do_page_fault+0x8c0/0xbc0 4sp=e002791a7880 bsp=e002791a1080 4 [a001f600] ia64_leave_kernel+0x0/0x260 4sp=e002791a7930 bsp=e002791a1080 4 [a00200045160] journal_write_revoke_records+0x220/0x4e0 [jbd] 4sp=e002791a7b00 bsp=e002791a0f98 4 [a0020003d940] journal_commit_transaction+0xf80/0x3080 [jbd] 4sp=e002791a7b10 bsp=e002791a0ea0 4 [a002000458d0] kjournald+0x170/0x580 [jbd] 4sp=e002791a7d80 bsp=e002791a0e38 4 [a00100018c70] kernel_thread_helper+0x30/0x60 4sp=e002791a7e30 bsp=e002791a0e10 4 [a0018c60] start_kernel_thread+0x20/0x40 4sp=e002791a7e30 bsp=e002791a0e10 Do you know (or could you find out) where exactly in the code is journal_write_revoke_records+0x221/0x4e0? While analyzing the problem, Hitachi came up with this explanation for the race condition PID: 31401 TASK: e0004fb3 CPU: 1 COMMAND: GET #0 [BSP:e0004fb314d8] context_switch at a0010006ab90 #1 [BSP:e0004fb313b8] schedule at a00100590f40 #2 [BSP:e0004fb31340] do_get_write_access at a002000388e0 #3 [BSP:e0004fb31300] journal_get_write_access at a00200039680 #4 [BSP:e0004fb312b8] ext3_reserve_inode_write at a0020013f180 #5 [BSP:e0004fb31290] ext3_mark_inode_dirty at a0020013f2a0 #6 [BSP:e0004fb31260] ext3_dirty_inode at a00200144310 #7 [BSP:e0004fb31210] __mark_inode_dirty at a00100178200 #8 [BSP:e0004fb311e8] update_atime at a00100165cc0 #9 [BSP:e0004fb31128] do_generic_mapping_read at a001000d40e0 #10 [BSP:e0004fb310c0] __generic_file_aio_read at a001000d8b40 #11 [BSP:e0004fb31088]
Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote: .. 5) ext3_write_end: Before write_begin/write_end patch set we have folowing locking order: stop_journal(handle); unlock_page(page); But now order is oposite: unlock_page(page); stop_journal(handle); Can we got any race condition now? I'm not sure is it actual problem, may be somebody cant describe this. Can we just change it to the original order? That would seem to be safest unless one of the ext3 devs explicitly acks it. Sorry, I've missed beginning of this thread. But what problems can exactly cause this ordering change? ext3_journal_stop has no need to be protected by the page lock - it can be even better that it's not protected as it can trigger commit and all that would happen unnecessarily under page lock... Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext2_discard_prealloc() called on each iput?
On Tue 29-05-07 14:10:52, Mingming Cao wrote: On Mon, 2007-05-28 at 18:04 +0200, Jan Kara wrote: On Wed 23-05-07 08:37:43, Theodore Tso wrote: On Tue, May 22, 2007 at 06:11:27PM +0200, Jan Kara wrote: while fixing some problems with preallocation in UDF, I had a look how ext2 solves similar problems. I found out that ext2_discard_prealloc() is called on every iput() from ext2_put_inode(). Is it really appropriate? I don't see a reason for doing so... I agree, it's probably not appropriate. It's been that way for a long time, though (since 2.4.20). It's not as horrible as it seems since unlike traditional Unix systems, we don't call iput() as often, since for example operations like close() end up calling dput(), which decrements the ref. count on dentry, not the inode. But it would probably be better to check to see if i_count is 1 before deciding to discard the preallocation. OK, but then you could move the code to drop_inode() which is called at exactly that moment... I've been thinking more about it when fixing UDF. I have tried to optimize ext2 discard preallocation code like ext3 discard reservation a while back: we only call ext2_discard_prealloc on the last iput(), i.e. ext2/3_clear_inode(). This patch actually made into mainline for a little while, then later it is being reversed back because of possible leak of preallocated blocks. Tt the unmount time, someone might still hold the reference of the inode, thus ext2_discard_prealloc() did not get a chance to be called. Since ext2 preallocation is doing pre-allocation on disk, this leads to leak of preallocated blocks, confused fsck later. http://lkml.org/lkml/2005/4/12/333 Interesting. I don't quite understand how it can happen that inode is not released at umount time... It must be released for umount to succeed. There is a slight problem that inode is not written after calling clear_inode() which could cause some problems (i_blocks being wrong) but blocks in bitmaps should be freed just right... anyway, so the preallocated region is less useful. OK, but still we could use e.g. i_writecount to check that we drop the last descriptor for writing... Yep, that is what ext3 does in ext3_release_file(), I forget why we didn't fix this for ext2. Hmm, probably we just forgot... Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext2_discard_prealloc() called on each iput?
On Wed 23-05-07 08:37:43, Theodore Tso wrote: On Tue, May 22, 2007 at 06:11:27PM +0200, Jan Kara wrote: while fixing some problems with preallocation in UDF, I had a look how ext2 solves similar problems. I found out that ext2_discard_prealloc() is called on every iput() from ext2_put_inode(). Is it really appropriate? I don't see a reason for doing so... I agree, it's probably not appropriate. It's been that way for a long time, though (since 2.4.20). It's not as horrible as it seems since unlike traditional Unix systems, we don't call iput() as often, since for example operations like close() end up calling dput(), which decrements the ref. count on dentry, not the inode. But it would probably be better to check to see if i_count is 1 before deciding to discard the preallocation. OK, but then you could move the code to drop_inode() which is called at exactly that moment... I've been thinking more about it when fixing UDF. Discarding prealloc at drop_inode() has the disadvantage that symlinks/directories will keep their preallocated blocks until inodes are evicted from memory. Which is probably why ext2 discards prealloc on iput(). Also I found slightly misleading the comment at ext2_release_file(). As far as I understand the code it isn't when /all/ files are closed but rather when all fd's for given filp are closed. I.e. if you open the same file two times, -release will get called once for each open. Am I right? Yep! If so, then also calling ext2_discard_prealloc() from ext2_release_file() is suboptimal, isn't it? Yes, although it's a bit better because only discaord the preallocation if the file descriptor was opened for writing. The file could be opened for writing by multiple file descriptors, true, but in that case it's likely that the write pattern will be a random access one anyway, so the preallocated region is less useful. OK, but still we could use e.g. i_writecount to check that we drop the last descriptor for writing... P.S. Note that ext3 and ext4 use a different preallocation scheme; Yes, I know that. That's why I looked at ext2 when searching for inspiration how to fix UDF :) still patches to fix the comments might not be a bad idea, since it might save confusion by others later on. Ok, will write one. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Online defragmentation and ext4migrate
On Mon, 2007-05-21 at 12:38 +0200, Jan Kara wrote: Yes. On the other hand I believe that some people would like to use defragmentation but stay with ext3. For them conversion to extents is no-go. [...] I've written a patch that defragments non-extent files but after discussion with XFS guys I've decided that the interfaces should be made more generic, so that XFS and other filesystems can use them too... I see no reason why the ioctl to convert a file to extents and then defragment it should be different from the ioctl to defragment a non-extent file. After all, whether a file's blocks are tracked as lists of blocks or a set of extents is just bookkeeping, right? The set of data blocks that make up the file and their order is the same regardless of whether the extent flag is set in the inode. I agree that at least part of the interface should be independent on the particular representation of data references - especially because I want it to be useful for more filesystems than just ext2/3/4. Currently I think that defragmenting data blocks itself can have fs-independent interface. Of course, when you decide to defragment metadata (i.e. indirect blocks, inodes, etc.) you have to have fs-specific interfaces, probably ioctls... If the user is running the ext2/3 driver or the ext4 driver with the noextents option, just defragment the file. If the user is running ext4 without the noextents option, convert to extents and then defragment. Defragmentation ioctl definitely should not touch the way the file is represented. I.e. if the file uses indirect blocks it should use indirect blocks after defragmentation. If it uses extents, it should use extents afterwards too. It should be the userspace utility which decides whether the file should be converted or not and uses appropriate call for that... The only problem that I can think of is that defragmenting metadata (including indirect block and/or whatever the equivalent is in extent-land) presumably has performance benefits too, so maybe a defragmenter in userspace would want to have some knowledge/control over this process. Yes, it has measurable benefit (especially for indirect blocks) so eventually we should do it. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
ext2_discard_prealloc() called on each iput?
Hello, while fixing some problems with preallocation in UDF, I had a look how ext2 solves similar problems. I found out that ext2_discard_prealloc() is called on every iput() from ext2_put_inode(). Is it really appropriate? I don't see a reason for doing so... Also I found slightly misleading the comment at ext2_release_file(). As far as I understand the code it isn't when /all/ files are closed but rather when all fd's for given filp are closed. I.e. if you open the same file two times, -release will get called once for each open. Am I right? If so, then also calling ext2_discard_prealloc() from ext2_release_file() is suboptimal, isn't it? Thanks for answer Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext2/3/4 online defrag
On Thu, 2007-05-17 at 18:11 +0200, Jan Kara wrote: But me (and several other people independently as I've learnt recently) have written some tools which should result in something useful. If you're interested, you can join [EMAIL PROTECTED] - it's led by one guy who is doing defrag and stuff as his google summer of code project. Is this different from the ext4/extent-based defrag patch that's been mentioned on this list? Yes, it is different. In particular, it's offline only tool so far... *An implementation of an ext* filesystem driver can work with any ext2/3/4 filesystem as long as it supports the necessary revision (GOOD_OLD_REV or DYNAMIC_REV) and feature flags set in the filesystem. Not sure what you mean here... The ext2 filesystem/ext3 filesystem/ext4 filesystem terminology was confusing to me when I first started reading about them. In my mind, it implied that those three filesystems were more different than they actually are. I think it would be more accurate to say that they are all essentially the same filesystem, and that any filesystem driver that can mount a given filesystem can mount any other ext2/3/4 filesystem of the same revision with the same feature flags set. I was asking for confirmation of this assumption, but I've since found a lot of really good documentation that has cleared up a lot of things. Yes, basically it's just a question of a feature set. But for example current online defrag from Takashi requires extents, which are not available for ext2 or ext3. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Online defragmentation and ext4migrate
Hello, While doing online defragmentation do we move the blocks corresponding to extent index ? The reason why i am asking this is to understand the usefulness of doing a ext4migrate followed by defrag. I understand that defragmentation in general will improve the performance. But with respect to ext4migrate we are not touching the data blocks. Instead we build the extent map and if that requires to have an extent index block then we allocate one. I am trying to understand what would be the performance impact of this and whether doing a defrag really improve the performance. I think converting a file to extents has the benefit for the performance of block searching. If we want to improve also the performance of reading file data, we have to run the defrag after that. Yes. On the other hand I believe that some people would like to use defragmentation but stay with ext3. For them conversion to extents is no-go. Also looking at the version 0.4 I see that defrag ioctl only work if we have EXT4_EXTENTS_FL flag set. What are the plans for making defrag work with indirect block map inode ? Unfortunately, my defrag doesn't support an indirect block file. But we can reduce fragments in the file with the defrag just after ext4migrate. In my opinion, to keep the ioctl simple and small is very important for ease of maintenance. So I would rather not support indirect block files in the ioctl. Instead, I can add the call of the migration ioctl to my defrag tool in order to defragment indirect block files. How do you think of it? Yes that could be useful but I don't think it's a complete solution for people that don't want to migrate. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Copy i_flags to ext3 inode flags on write
On Mon 16-04-07 12:14:21, Andreas Dilger wrote: On Apr 16, 2007 19:05 +0200, Jan Kara wrote: attached is a patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when inode is written to disk. The same thing is done on GETFLAGS ioctl. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem (especially, lsattr did not show them and users were wondering...). +/* Propagate flags from i_flags to EXT3_I(inode)-i_flags */ +void ext3_get_inode_flags(struct inode *inode) +{ + unsigned int flags = inode-i_flags; + + EXT3_I(inode)-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL| + EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL); + if (flags S_SYNC) + EXT3_I(inode)-i_flags |= EXT3_SYNC_FL; + if (flags S_APPEND) + EXT3_I(inode)-i_flags |= EXT3_APPEND_FL; + if (flags S_IMMUTABLE) + EXT3_I(inode)-i_flags |= EXT3_IMMUTABLE_FL; + if (flags S_NOATIME) + EXT3_I(inode)-i_flags |= EXT3_NOATIME_FL; + if (flags S_DIRSYNC) + EXT3_I(inode)-i_flags |= EXT3_DIRSYNC_FL; +} It probably makes sense to pass struct ext3_inode_info *ei to this function, which is available at both callsites and avoids some pointer math for each access. OK, makes sence. Will resend the patch in a while. void ext3_read_inode(struct inode * inode) { struct ext3_iloc iloc; @@ -2736,6 +2755,7 @@ static int ext3_do_update_inode(handle_t if (ei-i_state EXT3_STATE_NEW) memset(raw_inode, 0, EXT3_SB(inode-i_sb)-s_inode_size); + ext3_get_inode_flags(inode); raw_inode-i_mode = cpu_to_le16(inode-i_mode); if(!(test_opt(inode-i_sb, NO_UID32))) { raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid)); diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/ioctl.c linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c --- linux-2.6.21-rc6/fs/ext3/ioctl.c2007-02-07 12:03:23.0 +0100 +++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c2007-04-12 18:22:54.0 +0200 @@ -28,6 +28,7 @@ int ext3_ioctl (struct inode * inode, st switch (cmd) { case EXT3_IOC_GETFLAGS: + ext3_get_inode_flags(inode); flags = ei-i_flags EXT3_FL_USER_VISIBLE; return put_user(flags, (int __user *) arg); case EXT3_IOC_SETFLAGS: { Looks fine otherwise - there is already ext3_set_inode_flags() which sets the VFS inode flags properly in ext3_read_inode(). Yes, I know :). Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Copy i_flags to ext3 inode flags on write (version 2)
Hi, attached is a second version of a patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when inode is written to disk. The same thing is done on GETFLAGS ioctl. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem (especially, lsattr did not show them and users were wondering...). Andrew, could you please put the patch into your queue? Thanks. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext3-specific i_flags. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/inode.c linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/inode.c --- linux-2.6.21-rc6/fs/ext3/inode.c 2007-04-10 17:09:55.0 +0200 +++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/inode.c 2007-04-17 11:24:28.0 +0200 @@ -2581,6 +2581,25 @@ void ext3_set_inode_flags(struct inode * inode-i_flags |= S_DIRSYNC; } +/* Propagate flags from i_flags to EXT3_I(inode)-i_flags */ +void ext3_get_inode_flags(struct ext3_inode_info *ei) +{ + unsigned int flags = ei-vfs_inode.i_flags; + + ei-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL| + EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL); + if (flags S_SYNC) + ei-i_flags |= EXT3_SYNC_FL; + if (flags S_APPEND) + ei-i_flags |= EXT3_APPEND_FL; + if (flags S_IMMUTABLE) + ei-i_flags |= EXT3_IMMUTABLE_FL; + if (flags S_NOATIME) + ei-i_flags |= EXT3_NOATIME_FL; + if (flags S_DIRSYNC) + ei-i_flags |= EXT3_DIRSYNC_FL; +} + void ext3_read_inode(struct inode * inode) { struct ext3_iloc iloc; @@ -2736,6 +2755,7 @@ static int ext3_do_update_inode(handle_t if (ei-i_state EXT3_STATE_NEW) memset(raw_inode, 0, EXT3_SB(inode-i_sb)-s_inode_size); + ext3_get_inode_flags(ei); raw_inode-i_mode = cpu_to_le16(inode-i_mode); if(!(test_opt(inode-i_sb, NO_UID32))) { raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid)); diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/ioctl.c linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c --- linux-2.6.21-rc6/fs/ext3/ioctl.c 2007-02-07 12:03:23.0 +0100 +++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c 2007-04-17 11:24:39.0 +0200 @@ -28,6 +28,7 @@ int ext3_ioctl (struct inode * inode, st switch (cmd) { case EXT3_IOC_GETFLAGS: + ext3_get_inode_flags(ei); flags = ei-i_flags EXT3_FL_USER_VISIBLE; return put_user(flags, (int __user *) arg); case EXT3_IOC_SETFLAGS: { diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/include/linux/ext3_fs.h linux-2.6.21-rc6-1-ext3_flags_update/include/linux/ext3_fs.h --- linux-2.6.21-rc6/include/linux/ext3_fs.h 2007-04-10 17:09:58.0 +0200 +++ linux-2.6.21-rc6-1-ext3_flags_update/include/linux/ext3_fs.h 2007-04-17 11:25:23.0 +0200 @@ -824,6 +824,7 @@ extern int ext3_change_inode_journal_fla extern int ext3_get_inode_loc(struct inode *, struct ext3_iloc *); extern void ext3_truncate (struct inode *); extern void ext3_set_inode_flags(struct inode *); +extern void ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); /* ioctl.c */
[PATCH] Copy i_flags to ext4 inode flags on write
Hi, attached is a port of the previous patch for ext3 to ext4. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext4-specific i_flags. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6-1-ext3_flags_update/fs/ext4/inode.c linux-2.6.21-rc6-2-ext4_flags_update/fs/ext4/inode.c --- linux-2.6.21-rc6-1-ext3_flags_update/fs/ext4/inode.c 2007-04-10 17:09:55.0 +0200 +++ linux-2.6.21-rc6-2-ext4_flags_update/fs/ext4/inode.c 2007-04-17 12:08:54.0 +0200 @@ -2584,6 +2584,25 @@ void ext4_set_inode_flags(struct inode * inode-i_flags |= S_DIRSYNC; } +/* Propagate flags from i_flags to EXT4_I(inode)-i_flags */ +void ext4_get_inode_flags(struct ext4_inode_info *ei) +{ + unsigned int flags = ei-vfs_inode.i_flags; + + ei-i_flags = ~(EXT4_SYNC_FL|EXT4_APPEND_FL| + EXT4_IMMUTABLE_FL|EXT4_NOATIME_FL|EXT4_DIRSYNC_FL); + if (flags S_SYNC) + ei-i_flags |= EXT4_SYNC_FL; + if (flags S_APPEND) + ei-i_flags |= EXT4_APPEND_FL; + if (flags S_IMMUTABLE) + ei-i_flags |= EXT4_IMMUTABLE_FL; + if (flags S_NOATIME) + ei-i_flags |= EXT4_NOATIME_FL; + if (flags S_DIRSYNC) + ei-i_flags |= EXT4_DIRSYNC_FL; +} + void ext4_read_inode(struct inode * inode) { struct ext4_iloc iloc; @@ -2743,6 +2762,7 @@ static int ext4_do_update_inode(handle_t if (ei-i_state EXT4_STATE_NEW) memset(raw_inode, 0, EXT4_SB(inode-i_sb)-s_inode_size); + ext4_get_inode_flags(ei); raw_inode-i_mode = cpu_to_le16(inode-i_mode); if(!(test_opt(inode-i_sb, NO_UID32))) { raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid)); diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6-1-ext3_flags_update/fs/ext4/ioctl.c linux-2.6.21-rc6-2-ext4_flags_update/fs/ext4/ioctl.c --- linux-2.6.21-rc6-1-ext3_flags_update/fs/ext4/ioctl.c 2007-02-07 12:03:23.0 +0100 +++ linux-2.6.21-rc6-2-ext4_flags_update/fs/ext4/ioctl.c 2007-04-17 12:09:19.0 +0200 @@ -28,6 +28,7 @@ int ext4_ioctl (struct inode * inode, st switch (cmd) { case EXT4_IOC_GETFLAGS: + ext4_get_inode_flags(ei); flags = ei-i_flags EXT4_FL_USER_VISIBLE; return put_user(flags, (int __user *) arg); case EXT4_IOC_SETFLAGS: { diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6-1-ext3_flags_update/include/linux/ext4_fs.h linux-2.6.21-rc6-2-ext4_flags_update/include/linux/ext4_fs.h --- linux-2.6.21-rc6-1-ext3_flags_update/include/linux/ext4_fs.h 2007-04-10 17:09:58.0 +0200 +++ linux-2.6.21-rc6-2-ext4_flags_update/include/linux/ext4_fs.h 2007-04-17 12:10:02.0 +0200 @@ -855,6 +855,7 @@ extern int ext4_change_inode_journal_fla extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); extern void ext4_truncate (struct inode *); extern void ext4_set_inode_flags(struct inode *); +extern void ext4_get_inode_flags(struct ext4_inode_info *); extern void ext4_set_aops(struct inode *inode); extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_block_truncate_page(handle_t *handle, struct page *page,
[PATCH] Copy i_flags to ext3 inode flags on write
Hello, attached is a patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when inode is written to disk. The same thing is done on GETFLAGS ioctl. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem (especially, lsattr did not show them and users were wondering...). If people agree that this patch is fine, I can also rediff it for ext4. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext3-specific i_flags. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. Signed-off-by: Jan Kara [EMAIL PROTECTED] diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/inode.c linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/inode.c --- linux-2.6.21-rc6/fs/ext3/inode.c2007-04-10 17:09:55.0 +0200 +++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/inode.c2007-04-12 18:19:29.0 +0200 @@ -2581,6 +2581,25 @@ void ext3_set_inode_flags(struct inode * inode-i_flags |= S_DIRSYNC; } +/* Propagate flags from i_flags to EXT3_I(inode)-i_flags */ +void ext3_get_inode_flags(struct inode *inode) +{ + unsigned int flags = inode-i_flags; + + EXT3_I(inode)-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL| + EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL); + if (flags S_SYNC) + EXT3_I(inode)-i_flags |= EXT3_SYNC_FL; + if (flags S_APPEND) + EXT3_I(inode)-i_flags |= EXT3_APPEND_FL; + if (flags S_IMMUTABLE) + EXT3_I(inode)-i_flags |= EXT3_IMMUTABLE_FL; + if (flags S_NOATIME) + EXT3_I(inode)-i_flags |= EXT3_NOATIME_FL; + if (flags S_DIRSYNC) + EXT3_I(inode)-i_flags |= EXT3_DIRSYNC_FL; +} + void ext3_read_inode(struct inode * inode) { struct ext3_iloc iloc; @@ -2736,6 +2755,7 @@ static int ext3_do_update_inode(handle_t if (ei-i_state EXT3_STATE_NEW) memset(raw_inode, 0, EXT3_SB(inode-i_sb)-s_inode_size); + ext3_get_inode_flags(inode); raw_inode-i_mode = cpu_to_le16(inode-i_mode); if(!(test_opt(inode-i_sb, NO_UID32))) { raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid)); diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/ioctl.c linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c --- linux-2.6.21-rc6/fs/ext3/ioctl.c2007-02-07 12:03:23.0 +0100 +++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c2007-04-12 18:22:54.0 +0200 @@ -28,6 +28,7 @@ int ext3_ioctl (struct inode * inode, st switch (cmd) { case EXT3_IOC_GETFLAGS: + ext3_get_inode_flags(inode); flags = ei-i_flags EXT3_FL_USER_VISIBLE; return put_user(flags, (int __user *) arg); case EXT3_IOC_SETFLAGS: {
Re: 64bit inode number and dynamic inode table for ext4
, in case the original inode table file corrupted we will not lost the inodes for the entire block group. If you have checksums / magic numbers, you will be able to find blocks belonging to the inode table file. If you also implement the idea with the chunks of inode blocks (actually, it looks like a small inode table) with a header, you can even store all the information you need for reconstruction in the header... Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext3 behavior on power failure
On Wed 28-03-07 19:00:54, Ric Wheeler wrote: Jan Kara wrote: [EMAIL PROTECTED] wrote: Hi all, We are building a new system which is going to use ext3 FS. We would like to know more about the behavior of ext3 in the case of failure. But before I procede, I would like to share more information about our future system. * Our application always does an fsync on files * When symbolic links (more specifically fast symlink) are created, the host directory is also fsync'ed. * Our application is also going to front an EMC disk array configured using RAID5 or RAID6. * We will be using multipathing so that we can assume that no disk errors will be reported. In this context , we would like to know the following for recovery after a power outage: 1. When will an fsck have to be run (not counting the scheduled fsck every N-mounts)? 2. In the case of a crash, are the fsync-ed file contents and symbolic links safe no matter what? Thanks, This is an interesting twist on some of the discussion that we have had at the recent workshop and in other forums on hardening file system in order to prevent the need to fsck. The twist is that we have a disk that will not lose power without being able to write to platter all of the data that has been sent - this is the case for most mid-range or higher disk arrays. If the application can precisely use fsync() on files, directories and symlinks, it wants to know that all objects are safe on disk that have completed a successful fsync. It also wants to know that the file system will not need any recovery beyond replaying transactions after a power outage/reboot - simply mount, let the transactions get replayed and you should be good to go without the fsck. The hard part of the question is to understand when and how often we will fail to deliver this easy case. Also, does any of the hardening in ext4 help here. I'm probably misunderstanding something because the answer seems to be too obvious to me :) But anyway I'll write it so that you can correct me: Due to journalling guarantees you should get consistent FS whenever you replay the log (unless there are some software bugs or hardware problems which is why fsck is run once per several mounts anyway). If you fsync() your data, you are guaranteed that also your data are safely on disk when fsync returns. So what is the question here? Honza I think that the real question here is in practice, how often does this really hold to be true? When it fails, how long does it take to recover the file system? I see, thanks for explanation :). There are a lot of odd errors that can happen when you monitor a large enough number of file systems. In my experience, I would guess that disk errors are clearly the leading cause of issues, followed by software bugs (file system, firmware, etc) and then a group of errors caused by various occasional things (bad DRAM in the server/HBA/disk, bad cables/etc). Note that using a high end array does not eliminate errors, it just reduces the rate (hopefully by a large amount). What is really hard to predict is the rate of the failures that require fsck with our current file system (say for a specific hardware setup) and how changes like the checksumming in ext4 can help us ride through these errors without needing a full fsck. OK. All the features I've seen so far were more aiming to detecting that such an unexpected problem happened rather than trying to fix it or make fixing it faster. So currently it seems to me that any such unexpected failure requires fsck... Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext2/ext3 still under active maintenance?
Hello, I'm trying to look at the ext2/ext3 source for learning Linux FS development. I'm avoiding ext4 for now because it's under active development and it's too much to chew before I understand the previous versions. But am I going to get an outdated view of the right way to program filesystems if I look at those, or is the code just as shiny as ext4's? Definitely you won't get an outdated view of how a filesystem is written. ext4 uses ext3 code as its baseline and although there are new features flowing in, the basic things stay the same. Honza - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext3 behavior on power failure
If you fsync() your data, you are guaranteed that also your data are safely on disk when fsync returns. So what is the question here? Pardon a newbie's intrusion, but I do know this isn't true. There is a window of possible loss because of the multitude of layers of caching, especially within the drive itself. Unless there is a super_duper_fsync() that is able to actually poll the hardware and get a confirmation that the internal buffers are purged? OK :), to correct myself: After fsync() returns, all the data is acked from the disk (or at least it should be like that unless there's a bug somewhere). So if there are some caches in the hardware which the hardware is not able to flush on power failure, that's a bad luck... That's why you should turn off write caching on cheaper disks if you really care about data integrity. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ext3 behavior on power failure
On Wed 28-03-07 10:17:33, [EMAIL PROTECTED] wrote: In my case the disk cache is not a problem - We use an emc disk array the write cache is protected - Once the data has made over the disk array we can assume it is safe - Then if you are able to reproduce the situation that not all data is written after fsync(); poweroff; that is a bug worth reporting.. Honza -Original Message- From: John Anthony Kazos Jr. [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 28, 2007 9:17 AM To: Jan Kara Cc: wheeler, richard; armangau, philippe; [EMAIL PROTECTED]; linux-ext4@vger.kernel.org; [EMAIL PROTECTED] Subject: Re: Ext3 behavior on power failure If you fsync() your data, you are guaranteed that also your data are safely on disk when fsync returns. So what is the question here? Pardon a newbie's intrusion, but I do know this isn't true. There is a window of possible loss because of the multitude of layers of caching, especially within the drive itself. Unless there is a super_duper_fsync() that is able to actually poll the hardware and get a confirmation that the internal buffers are purged? -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Tue 06-03-07 12:23:22, Eric Sandeen wrote: Jan Kara wrote: On Tue 06-03-07 06:36:09, Ulrich Drepper wrote: Christoph Hellwig wrote: fallocate with the whence argument and flags is already quite complicated, I'd rather have another call for placement decisions, that would be called on an fd to do placement decissions for any further allocations (prealloc, write, etc) Yes, posix_fallocate shouldn't be made more complicated. But I don't understand why requesting linear layout of the blocks should be an option. It's always an advantage if the blocks requested this way are linear on disk. So, the kernel should always do its best to make this happen, without needing an additional option. Actually, it's not that simple. You want linear layout of blocks you are going to read. That is not necessary a linear layout of blocks in a single file - trace sometime a start of some complicated app like KDE. You find it's seeking like a hell because it needs a few blocks from a ton of distinct files (shared libs, config files, etc). As these files are mostly read only, it's advantageous to interleave them on disk or at least keep them close. At some point shouldn't the apps be fixed, rather than do crazy things with the filesystem? :) Yes :) That's basically what we told KDE developpers when they were complaining ;) But it's hard to fix it for them too (because of some desktop specs requiring lots of different text config files which can change anytime - don't ask me who designed it). Moreover for example for loading shared libraries from which you need just a few blocks scattered all over the place the problem is in ELF itself. I'll probably first write some userspace fs-reorganizer to find out how much these changes in layout are able to give you in performance (i.e. whether it's worth the effort of more complicated kernel online defragmenter). Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Tue 06-03-07 06:36:09, Ulrich Drepper wrote: Christoph Hellwig wrote: fallocate with the whence argument and flags is already quite complicated, I'd rather have another call for placement decisions, that would be called on an fd to do placement decissions for any further allocations (prealloc, write, etc) Yes, posix_fallocate shouldn't be made more complicated. But I don't understand why requesting linear layout of the blocks should be an option. It's always an advantage if the blocks requested this way are linear on disk. So, the kernel should always do its best to make this happen, without needing an additional option. Actually, it's not that simple. You want linear layout of blocks you are going to read. That is not necessary a linear layout of blocks in a single file - trace sometime a start of some complicated app like KDE. You find it's seeking like a hell because it needs a few blocks from a ton of distinct files (shared libs, config files, etc). As these files are mostly read only, it's advantageous to interleave them on disk or at least keep them close. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Fri, 02 Mar 2007 09:40:54 +1100 Nathan Scott [EMAIL PROTECTED] wrote: On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: On Fri, 2 Mar 2007 00:04:45 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); ... I'd agree with Eric on the command flag extension. Seems like a separate syscall would be better, command sounds a bit ioctl like, especially if that command is passed into the filesystems.. madvise, fadvise, lseek, etc seem to work OK. I get repeatedly traumatised by patch rejects whenever a new syscall gets added, so I'm biased. The advantage of a command flag is that we can add new modes in the future without causing lots of churn, waiting for arch maintainers to catch up, potentially adding new compat code, etc. Rename it to mode? ;) I am wondering if it is useful to add another mode to advise block allocation policy? Something like indicating which physical block/block group to allocate from (goal), and whether ask for strict contigous blocks. This will help preallocation or reservation to choose the right blocks for the file. Yes, I also think this would be useful so you can guide preallocation for things like defragmentation (e.g. preallocate space for the file being defragmented and move the file to it). Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
ext3 not writing inode flags
Hello, I've noticed that if e. g. S_IMMUTABLE flag is set on an inode, ext3 does not propagate it to its inode flags on write. This also leads to the situation that lsattr reports no flag set but the file is in fact immutable. Would a fix be accepted? Actually, the quota code is the only place where S_IMMUTABLE flag is set to an inode so this is not a big issue but still... Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Thu 08-02-07 01:45:29, Andrew Morton wrote: snip I though Andreas meant any write changes - i.e. you check that noone has open file descriptor for writing and block any new open for writing. That can be done quite easily. Anyway, I agree with you that userspace solution to a possible page cache pollution is preferable after thinking about it for a while. As I've been thinking about it, we could actually do the copying from user space. We could do something like: block any writes to file (as I described above) craft new inode with blocks allocated as we want (using preallocation, we should mostly have the kernel infrastructure we need) copy data using splice syscall call the kernel to switch data I don't think we need to block any writes to any file or anything. To move a page within a file: fd = open(file); p = mmap(fd); the_page_was_in_core = mincore(p, offset); munmap(p); ioctl(fd, ..., new_block); kernel read_cache_page(inode, offset); lock_page(page); if (try_to_free_buffers(page)) { relocate the page set_page_dirty(page); } unlock_page(page); if (the_page_was_in_core) { sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE| SYNC_FILE_RANGE_WRITE| SYNC_FILE_RANGE_WAIT_AFTER); fadvise(fd, offset, FADV_DONTNEED); } completely coherent with pagecache, quite safe in the presence of mmap, mlock, O_DIRECT, everything else. Also fully journallable in-kernel. Yes, this is the simple way. But I see two disadvantages: 1) You'd like to relocate metadata (indirect blocks) too. For that you need a different mechanism. In my approach, you can mostly assume you've got sanely laid out metadata and so the existence of such mechanism is not so important. 2) You'd like to allocate new blocks in big chunks. So your kernel function should rather take a range. Also when you fail in the middle of relocating a file (for example the block you'd like to use is already taken by someone else), I find it nice if you can return at least to the original state. But that's probably not important. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Thu 08-02-07 02:32:13, Andrew Morton wrote: On Thu, 8 Feb 2007 11:21:02 +0100 Jan Kara [EMAIL PROTECTED] wrote: On Thu 08-02-07 01:45:29, Andrew Morton wrote: snip I though Andreas meant any write changes - i.e. you check that noone has open file descriptor for writing and block any new open for writing. That can be done quite easily. Anyway, I agree with you that userspace solution to a possible page cache pollution is preferable after thinking about it for a while. As I've been thinking about it, we could actually do the copying from user space. We could do something like: block any writes to file (as I described above) craft new inode with blocks allocated as we want (using preallocation, we should mostly have the kernel infrastructure we need) copy data using splice syscall call the kernel to switch data I don't think we need to block any writes to any file or anything. To move a page within a file: fd = open(file); p = mmap(fd); the_page_was_in_core = mincore(p, offset); munmap(p); ioctl(fd, ..., new_block); kernel read_cache_page(inode, offset); lock_page(page); if (try_to_free_buffers(page)) { relocate the page set_page_dirty(page); } unlock_page(page); if (the_page_was_in_core) { sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE| SYNC_FILE_RANGE_WRITE| SYNC_FILE_RANGE_WAIT_AFTER); fadvise(fd, offset, FADV_DONTNEED); } completely coherent with pagecache, quite safe in the presence of mmap, mlock, O_DIRECT, everything else. Also fully journallable in-kernel. Yes, this is the simple way. But I see two disadvantages: 1) You'd like to relocate metadata (indirect blocks) too. Well. Do we really? Are we looking for a 100% solution here, or a 90% one? Umm, I think that for ext3 having data on one end of the disk and indirect blocks on the other end of the disk does not quite help (not mentioning that it can create bad free space fragmentation over the time). I have not measured it but I'd guess that it would erase the effect of moving data closer together. At least for sequential reads.. Relocating data is the main thing. After that, yeah, relocating metadata, inodes and directories is probably a second-order thing. For that you need a different mechanism. I suspect a similar approach will work there: load and lock the buffer_heads (or maybe just the top-level buffer_head) and then alter their contents. It could be that verify_chain() will just magically do the right thing there, but some changes might be needed. Yes, it could be done. I just wanted to point to the fact that things may not be as simple in your solution either... Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
; + + page = read_cache_page(inode-i_mapping, +org_offset, (filler_t *)inode-i_mapping-a_ops-readpage, +NULL); + + if (IS_ERR(page)) { + ret = PTR_ERR(page); + return ret; + } + + wait_on_page_locked(page); + lock_page(page); + /* release old bh and drop refs */ + try_to_release_page(page, 0); + ret = ext4_ext_replace_branches(inode, tmp_inode, org_offset, + dest_offset, 1, delete_start); + if (ret 0) + goto ERR; + + if (org_offset == ((inode-i_size - 1) PAGE_SHIFT)) + offset_in_page = (inode-i_size (PAGE_CACHE_SIZE - 1)); + + ret = mapping-a_ops-prepare_write(filp, page, + 0, offset_in_page); + if (ret) + goto ERR; + + ret = mapping-a_ops-commit_write(filp, page, +0, offset_in_page); +ERR: + unlock_page(page); + page_cache_release(page); + + return (ret 0 ? ret : 0); +} + +/** * ext4_ext_new_extent_tree - allocate contiguous blocks * @inode: inode of the original file * @tmp_inode: inode of the temporary file - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH] Fix kmalloc flags used in ext3 with an active journal handle
On Tue, 19 Dec 2006 18:22:03 -0800 Suzuki [EMAIL PROTECTED] wrote: Andrew Morton wrote: On Tue, 19 Dec 2006 17:58:12 -0800 Suzuki [EMAIL PROTECTED] wrote: * Fix the kmalloc flags used from within ext3, when we have an active journal handle If we do a kmalloc with GFP_KERNEL on system running low on memory, with an active journal handle, we might end up in cleaning up the fs cache flushing dirty inodes for some other filesystem. This would cause hitting a J_ASSERT() in : The change might be needed (haven't looked at it yet). But I'd like to see the full BUG trace, please. To see the callchain. Here is the call trace which was hit by one of our test teams. This was from fs/ext3/xattr.c. While looking for similar calls I found the others described in the patch. Assertion failure in journal_start() at fs/jbd/transaction.c:274: handle- h_transaction-t_journal == journal kernel BUG at fs/jbd/transaction.c:274! illegal operation: 0001 [#1] CPU:0Not tainted (2.6.5-7.282-s390x SLES9_SP3_BRANCH-20061031152356) Process dbench (pid: 14070, task: 025617f0, ksp: 01057630) Krnl PSW : 07018000 08837b38 (journal_start+0x90/0x15c [jbd]) Krnl GPRS: 00507fc0 002b 01056d80 08837b36 2885 08841da6 001bfaa0 03483d08 0002 07a8bda0 08833000 088a7d08 08837b36 01056e80 Krnl Code: 00 00 58 10 b0 0c a7 1a 00 01 b9 04 00 2b 50 10 b0 0c e3 40 Call Trace: [088a30fc] ext3_journal_start+0x8c/0xa4 [ext3] [08896822] ext3_dirty_inode+0x3a/0xe0 [ext3] [001ca362] __mark_inode_dirty+0x1ae/0x1c8 [001bfaa0] iput+0xbc/0xf0 [001bdcca] prune_dcache+0x29e/0x584 [001bdfe4] shrink_dcache_memory+0x34/0x54 [0017b100] shrink_slab+0x15c/0x250 [0017b6e4] try_to_free_pages+0x1c0/0x2a4 [00170276] __alloc_pages+0x2ba/0x4e0 [0017059a] __get_free_pages+0x4e/0x8c [00174ea2] cache_alloc_refill+0x2a6/0x868 [00175540] __kmalloc+0xdc/0xe0 [088a4e62] ext3_xattr_set_handle+0x114a/0x174c [ext3] [088a54e4] ext3_xattr_set+0x80/0xd0 [ext3] [088a6312] ext3_xattr_user_set+0xce/0xe4 [ext3] [088a5f1e] ext3_setxattr+0x17e/0x18c [ext3] [001c88e6] setxattr+0x14a/0x234 [001c8a80] sys_fsetxattr+0xb0/0x110 [0011fc10] sysc_noemu+0x10/0x16 How did we get from iput() into __mark_inode_dirty()? I can't see it in mainline, nor in 2.6.5 which you appear to be using... Hmm, I think it could happen at least via quota code (but then I would expect to see some entry in the backtrace about it). Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Defragmentation interface
On Fri, Nov 03, 2006 at 03:30:30PM +0100, Jan Kara wrote: BTW, does use of sysfs mean ASCII encoding of all the data passing between kernel and userspace? Not necessarify but mostly yes. At least I intend to have all the files I have proposed in ASCII. Ok - that's how you're looking to avoid 32/64bit compatibility issues? Yes. It will make the interface quite verbose, though, and entail significant encoding and decoding costs It would be verbose. On the other hand for most things it should not matter (not too much data goes through the interface and it's not too performance critical). meta/allocation/free_blocks - RO file - if you read from fpos F, you'll get a list of extents describing areas with free blocks (as many as fits into supplied buffer) starting from block F. Fpos of your file descriptor is shifted to the first unreported free block. - linear search properties == Bad. (think fs sizes of hundreds of terabytes - XFS is already deployed with filesystems of this size) OK, so what do you propose? You want syscall find_free_blocks() and my idea of it was that it will do basically the same think as my snip Right. More complicated requests are something that we need to support in XFS in the short-medium term. We _need_ an interface to XFS that allows complex, compound allocation policies to be accessible from userspace - and this is not just for defrag programs. I think a set of well defined allocation primitives suits a syscall interface far better than a per-filesystem sysfs interface. I'm only afraid of one thing: Once you define a syscall it's hard to change anything and for this kind of thing I'm not sure we are able to tell what we'll need in two years... That is basically my main concern with implementing this interface as a syscall. - every time you fail an allocation, you need to reread this file. Yes, that's the most serious disadvantage I see. Do you see any way out of it in any interface? I haven't really thought about solutions for this interface - the syscall interface doesn't have this problem because of the way you can specify where you want free blocks from But that does not solve the problem with having to repeat the search, does it? Only with the syscall interface filesystem can possibly search for free blocks more efficiently.. - stripe unit and stripe width need to be exposed so defrag too can make correct placement decisions. fs-specific thing... As Andreas said, this isn't fs-specific. XFS takes sunit and swidth as mkfs parameters so it can align both metadata and data optimally for RAID devices. Other fileystems have different methods of specifying this (ext2/3/4 use -E stride-size for this), but it would need to be exposed in some way I see. But then shouldn't we expose it regardless the interface (sysfs/syscall) we choose so that userspace can take it into account when picking where to allocate? meta/nodes/ident - this should be a directory containing things specific for a fs-object with identification ident. In case of ext3 these would be inode numbers, I guess this should be plausible also for XFS and others but I'm open to suggestions... - directory contains the following: alloc_goal - block number with current allocation goal The kernel has to store this across syscalls until you write into data/alloc? That sounds dangerous... This is persistent until kernel decides to remove inode from memory. So while you have the file open, you are guaranteed that kernel keeps the information. But the inode hangs around long after the file is closed. How do you guarantee that this gets cleared when it needs to be? It gets cleared (or rewritten) as soon as alloc_goal is used for allocation or when inode gets removed from memory. Ext3 currently has such thing (settable via ioctl()) and it seems to work reasonably well. I just don't like the principle of this interface when we are talking about moving data around online - it's inherently unsafe when you consider mutli-threaded or -process access to an inode. Yes, we certainly have to make sure we don't do something destructive in such case. On the other hand if several processes try to guide allocation in the same file, results are uncertain and that's IMHO ok. The major difference is that one implementation requires 3 new generically useful syscalls, and the other requires every filesystem to implement a metadata filesystem and require root priviledges to use. Yes. IMO the complexity of implementation is almost the same in the syscall case and in my sysfs case. What syscall would do is just do some basic checks and redirect everything into fs-specific call anyway... Sure, but you don't need to implement a new filesystem in every filesystem to support it But the cost of this meta
Re: [RFC] Defragmentation interface
Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] Defragmentation interface
Hi, from the thread after my patch implementing ext3 online defragmentation I found out that probably the only (and definitely the biggest) issue is the interface. Someone wants is common enough so that we can profit from common tools for several filesystems, others object that some applications, e.g. defragmenter, need to know something about ext3 internals to work reasonably well. Moreover ioctl() is ugly and has some compatibility issues, on the other hand ext2meta is too lowlevel, fs-specific and it would be hard to do any reasonable application crash-safe... So in this email I try to propose some interface which should hopefully address most of the concerns. The type of the interface is sysfs like (idea taken from ext2meta) - that has a few advantages: - no 32/64-bit compatibility issues - easily extensible - generally nice ;) Each filesystem willing to support this interface implements special filesystem (e.g. ext3meta, XFSmeta, ...) and admin/defrag-tool mounts it to some directory. There are parts of this interface which should be common for all filesystems (so that tools don't have to care about particular filesystem and still get some useful results), other parts are fs-specific. Here is basic structure I propose: meta/features - bitmap of features supported by the interface (ext2/3-like) so that the tool can verify whether it understands the interface and don't mess with it otherwise meta/allocation/free_blocks - RO file - if you read from fpos F, you'll get a list of extents describing areas with free blocks (as many as fits into supplied buffer) starting from block F. Fpos of your file descriptor is shifted to the first unreported free block. meta/super/blocksize - filesystem block size meta/super/id - filesystem ID (for paranoid tools to verify that they are accessing really the right meta-filesystem) meta/nodes/ident - this should be a directory containing things specific for a fs-object with identification ident. In case of ext3 these would be inode numbers, I guess this should be plausible also for XFS and others but I'm open to suggestions... - directory contains the following: alloc_goal - block number with current allocation goal data/extents - if you read from this file, you get a list of extents describing data blocks (and holes) of the file. The listing starts at logical block fpos. Fpos is shifted to the first unreported data block. data/alloc - you write there a number L and fs allocates L blocks to a file (preferable from alloc_goal) starting from file-block fpos. Fpos is shifted after the last block allocated in this call. data/reloc - you write there ident and relocation of data happens as follows: All blocks that are allocated both in original file and ident are relocated to ident. Write returns number of relocated blocks. metadata/ - this directory is fs-specific, contains fs block pointers and similar. Here I describe what I'd like to have for ext3. metadata/alloc - you write there number Level and a fs allocates an indirect block to a file for logical block fpos at level Level of the indirect tree. Parent indirect block must be already allocated. metadata/reloc - you write two numbers ident Level and an indirect block for logical offset fpos at level Level will be swapped with corresponding indirect block of ident. This is all that is needed for my purposes. Any comments welcome. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote: On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote: On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote: So how do you then get the generic interface to allocate blocks specified by userspace race free? As has been repeatedly stated, there is no generic. There MUST be filesystem-specific knowledge during these operations. What information? All we need to know is where the free disk space is, and have a method to attempt to allocate from it. That's _easy_ to abstract into a common interface via the VFS Further, in the case being discussed in this thread, ext2meta has already been proven a workable solution. Sure, but that's not a generic solution to a problem common to all filesystems You clearly don't know what I'm talking about. ext2meta is an example of a filesystem-specific metadata access method, applicable to tasks such as online optimization. I know exactly what ext2meta is. I said it's not a generic solution and you say its a filesystem specific solution. I think we're agreeing here. ;) We don't need to expose anything filesystem specific to userspace to implement this. Online data movement (i.e. the defrag mechanism) becomes something like: do { get_free_list(dst_fd, location, len, list) /* select extent to use */ Upto this point I can imagine we can be perfectly generic. alloc_from_list(dst_fd, list[X], off, len) } while (ENOALLOC) move_data(src_fd, dst_fd, off, len); With these two it's not clear how well can we do with just a generic interface. Every filesystem needs to have some additional metadata to keep list of data blocks. In case of ext2/ext3/reiserfs this is not a negligible amount of space and placement of these metadata is important for performance. So either we focus only on data blocks and let implementation of alloc_from_list() allocate metadata wherever it wants (but then we get suboptimal performace because there need not be space for indirect blocks close before our provided extent) or we allocate metadata from the provided list, but then we need some knowledge of fs to know how much should we expect to spend on metadata and where these metadata should be placed. For example if you know that indirect block for your interval is at block B, then you'd like to allocate somewhere close after this point or to relocate that indirect block (and all the data it references to). But for that you need to know you have something like indirect blocks = filesystem knowledge. So I think that to get this working, we also need some way to tell the program that if it wants to allocate some data, it also needs to count with this amount of metadata and some of it is already allocated in given blocks... I see substantial benefit moving forward from having filesystem independent interfaces. Many features that filesystems implement are common, and as time goes on the common feature set of the different filesystems gets larger. So why shouldn't we be trying to make common operations generic so that every filesystem can benefit from the latest and greatest tool? So you prefer to handle only data blocks part of the problem and let filesystem sort out metadata? Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Oct 24, 2006 15:44 -0400, Theodore Tso wrote: First of all, we would need a way of allowing userpsace to specify which blocks should be used in the preallocation. Presumably it could do this in the same way it will be specifying which blocks to relocate in the defragger - by passing an extent. You would be required to pass the file offset for which to preallocate, and optionally an extent for the on-disk allocation itself (if none is supplied the kernel will allocate the best extent it can). Secondly, we would need a way of marking blocks as preallocated but not pre-zeroed; otherwise we would have to zero out all of the blocks in order to assure security (don't want userspace programs seeing the previous contents of the data blocks), only to do the copy and the extents vector swap. This could be mitigated by having the preallocation be done (in the defragment case) against a temporary inode in the orphan list (as the initial patch did) so if there is a crash it will be released. The temporary inode will not be linked into the namespace so it cannot be read - only used to hold preallocation. If this was a write-only file handle then we should be OK? For defragger purposes this would need: - allocate new temporary inode (VFS + fs, returns write-only fh if fs can't properly handle uninitalized extents, or doesn't request full-extent zeroing) for each extent to defragment { - preallocate extents on temp inode (fs specific internals) - copy data from orig to temp at offset X (VFS, splice or e.g. sys_copyfile(src, dst, offset, count) which Linus agreed to at KS '05 for network filesystems) - migrate copied extent to original inode (fs specific internals) } - free temporary inode (just close of temp fh, frees unmigrated extents). Yes, this sounds feasible. We could split the defrag ioctl into two pieces (addition of given extent to a file and swapping of extents), which can have generic interface... I don't think this is much more work than implementing all of this functionality as part of a monolithic online defrag function, assuming we don't require full-file copies in order to do defrag. Yes, it's not more work than supporting swapping of extents in the middle of the file. I've just not yet decided how to handle indirect blocks in case of relocation in the middle of the file. Should they be relocated or shouldn't they? Probably they should be relocated at least in case they are fully contained in relocated interval or maybe better said when all the blocks they reference to are also in the interval (this handles also the case of EOF). But still if you would like to relocate the file by parts this is not quite what you want (you won't be able to relocate indirect blocks in the boundary of intervals) :(. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote: Yes, this sounds feasible. We could split the defrag ioctl into two pieces (addition of given extent to a file and swapping of extents), which can have generic interface... An ioctl is UGLY. Agreed. This was discussed years ago. Google for 'Alexander Viro' and 'ext2meta'. That's a clean, flexible, extensible way to access metadata online. No need for ioctl binary translation across 32bit-64bit, or any other ioctl issue. I've briefly looked at this and this kind of interface has some appeal. On the other hand it's not obvious to me, how to implement in this interface *atomic* operation copy data from file F to given set of blocks and rewrite pointers to original blocks with pointers to new blocks. Something like this is needed for what we want to do... Also if we'd like to implement operation like add this block to file F at position P we have to make sure that all the necessary updates (bitmap updates, inode updates, indirect block updates) go into one transaction. Which basically mean that either ext3meta has to have a way how to do this in a single operation, or we have to give userspace a way to start/stop transaction and that starts to be really a mess because of various deadlocks and so on. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 07:58:51PM +0200, Jan Kara wrote: I've briefly looked at this and this kind of interface has some appeal. On the other hand it's not obvious to me, how to implement in this interface *atomic* operation copy data from file F to given set of blocks and rewrite pointers to original blocks with pointers to new blocks. Something like this is needed for what we want to do... Also if we'd like to implement operation like add this block to file F at position P we have to make sure that all the necessary updates (bitmap updates, inode updates, indirect block updates) go into one transaction. Which basically mean that either ext3meta has to have a way how to do this in a single operation, or we have to give userspace a way to start/stop transaction and that starts to be really a mess because of various deadlocks and so on. Agreed, this issues exist. But these issues exist independent of whether an ioctl or ext3meta is used. It's all the responsibility of the implementor to define the interface. My contention is that ext3meta interface method would be much more robust than ioctl. It's a namespace inside which you can define any inodes/dirents you wish, for the operations you desire. I see. So you mean that in our ext3meta filesystem we'd have a file named add_this_extent_to_inode and a file reloc_inode_interval and they'd be fed essentially the same info as the current ioctl interface and do the same thing as we currently do. Hmm, I don't find it that nice any more but yes, this would work. Heck, according to my sf.net/projects/gkernel CVS log, you offered some helpful review comments to me when I was implementing ext2meta ;-) Looking at those mails it was already quite some time ago so I forgot about it ;) Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Oct 23, 2006 18:03 +0200, Jan Kara wrote: Andreas Dilger wrote: I would in fact go so far as to allow only a single extent to be specified per call. This is to avoid the passing of any pointers as part of the interface (hello ioctl police :-), and also makes the kernel code simpler. I don't think the syscall/ioctl overhead is significant compared to the journal and IO overhead. ...it makes it kind of harder to tell where indirect blocks would go - and it would be impossible for the defragmenter to force some unusual placement of indirect blocks... It would be possible to specify indirect block relocation in same manner as regular block relocation I think. Allocate a new block, copy contents, flush block from cache, fix up reference (inode, dindirect), commit. Yes, but there's a question of the interface to this operation. How to specify which indirect block I mean? Obviously we could introduce separate call for remapping indirect blocks but I find this solution kind of clumsy... Bye Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html