Re: [RFC][PATCH] fix journal overflow problem

2008-02-22 Thread Jan Kara
 = bh2jh(bh);
   BUFFER_TRACE(bh, entry);
 +
 + jbd_lock_bh_state(bh);
 + /*
 +  * if nobody else has requested write access to this jh then set
 +  * b_next_transaction to NULL to keep it from getting refiled onto
 +  * this transaction
 +  */
 + if (!(--jh-b_nr_access)) {
 +
 + /*
 +  * A journal_get_undo_access()+journal_release_buffer() will
 +  * leave undo-committed data that we no longer need
 +  */
 + if (jh-b_next_transaction  jh-b_committed_data) {
 + jbd_free(jh-b_committed_data, bh-b_size);
 + jh-b_committed_data = NULL;
 + }
 +
 + jh-b_next_transaction = NULL;
 + }
 +
 + jbd_unlock_bh_state(bh);
  }
  
  /**
 @@ -1245,6 +1307,11 @@ int journal_forget (handle_t *handle, struct 
 buffer_head 
 *bh)
*/
   jh-b_modified = 0;
  
 + /*
 +  * drop one of our write access credits
 +  */
 + jh-b_nr_access--;
 +
   if (jh-b_transaction == handle-h_transaction) {
   J_ASSERT_JH(jh, !jh-b_frozen_data);
  
 diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
 index 8a62d1e..160f1d1 100644
 --- a/include/linux/journal-head.h
 +++ b/include/linux/journal-head.h
 @@ -39,6 +39,13 @@ struct journal_head {
   unsigned b_modified;
  
   /*
 +  * This is a counter of the number of things who have requested write
 +  * access to this buffer by the current transaction
 +  * [jbd_lock_bh_state()]
 +  */
 + unsigned b_nr_access;
 +
 + /*
* Copy of the buffer data frozen for writing to the log.
* [jbd_lock_bh_state()]
*/
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] jbd: fix assertion failure in journal_next_log_block

2008-02-06 Thread Jan Kara
On Tue 05-02-08 13:59:05, Josef Bacik wrote:
 On Tuesday 05 February 2008 12:27:31 pm Jan Kara wrote:
Hello,
 
Sorry for replying a bit late but I'm currently falling behind in
  maling-list reading...
 
   The way jbd tries to determine if there is enough space left on the
   journal in order to start a new transaction is looking at the space left
   in the journal and the space needed for the committing transaction, which
   is 1/4 of the journal + the number of t_outstanding_credits for that
   transaction.  In this case its assumed that t_outstanding_credits
   accurately represents the number of credits
 
Yes.
 
   waiting to be written to the journal, but this sometimes isn't the case. 
   The transaction has two counters for buffers, t_outstanding_credits which
   is used in conjunction with handles that are added to the transaction,
   and t_nr_buffers which is only incremented/decremented when buffers are
   added/removed from the transaction and are actually destined to be
   journaled.  Normally these two
 
t_nr_buffers actually represents number of buffers on BJ_Metadata list
  and nobody uses it (except for the assertion in
  __journal_temp_unlink_buffer()). t_outstanding_credits is supposed to be
  *the* counter making sure we don't write more than we have credits for.
 
   counters are the same, however there are cases where the committing
   transaction can have buffers moved to the next running transaction, for
   example any buffers on the committing transactions t_reserved list would
   be moved to the next (running) transaction, and if it had been dirtied in
   the process it would immediately make it onto the t_updates list, which
   would increment t_nr_buffers
 
You probably mean t_buffers list here...
 
   but not t_outstanding_credits.  So you get into this situation where
 
But which moving and dirtying do you mean? The caller which dirties
  the buffer must make sure that he has acquired enough credits for the
  transaction where the buffer ends up... So if there were not enough
  buffers in the running transaction where we refiled the buffer it is a
  bug in the caller which dirties the buffer.
 
 
 You know now that you say that I feel like an idiot, you are right the only 
 way 
 for something to actually end up on that list was if somebody dirtied it and 
 if 
 they did it would have had to been accounted for at some point on the running 
 transaction.
  Never mind :) I also made several mistakes before I learned how JBD
works...

   t_nr_buffers (the actual number of buffers that are on the transaction)
   is greater than the number of buffers accounted for via
   t_outstanding_credits. This presents a problem since as we loop through
   writting buffers to the journal, we decrement t_outstanding_credits, and
   if t_nr_buffers is more than t_outstanding_credits then we end up with a
   negative number for
   t_outstanding_credits, which means we start saying we need less than 1/4
   of the journal for our committing transaction and allow more transactions
   than we can handle to start, and then bam we fail because
   journal_next_log_block doesn't have enough free blocks in order to handle
   the request.  This has been tested and fixes the issue (which could not
   be reproduced by me but several other people could get it to reproduce
   using postmark), and although I couldn't reproduce the assertion, I could
   very easily reproduce the situation where t_outstanding_credits was 
   than t_nr_buffers.
 
I suppose you see the assertion J_ASSERT(journal-j_free  1); to
  fail, right? I don't see how your patch could help avoid that assertion.
  You've just removed accounting of t_outstanding_credits which has no
  impact on the real number of free blocks in the journal stored in
  j_free. Anyway, if you can reproduce t_outstanding_credits 
  t_nr_buffers, then there's something fishy. Are you able to reproduce it
  also with a current kernel?
Thanks for looking into the problem :)
 
 
 Well my patch helped avoid the assertion because t_outstanding_credits was 
 going 
 negative therefore we were letting transactions start when we shouldn't be, 
 and 
 eventually we would end up with too much of the journal in use and we'd 
 assert.  
 Course I can't reproduce where t_outstanding_credits  t_nr_buffers upstream 
 (again I feel like an idiot, should have tested that first).  Thanks for 
 looking at this Jan.
  Glad to hear we don't have a bug there :) Anyway, assertion that
t_outstanding_credits doesn't go below zero would be a useful assertion to
add...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Fix circular locking dependency with migrate and rm.

2008-02-06 Thread Jan Kara
On Wed 06-02-08 16:30:04, Aneesh Kumar K.V wrote:
 We now take inode-i_mutex lock to prevent any update of the inode i_data
 field. Before we switch the inode format we take i_data_sem to prevent
 parallel read.
 
 ===
 [ INFO: possible circular locking dependency detected ]
 2.6.24-rc8 #6
 ---
 rm/2401 is trying to acquire lock:
  (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108
 
 but task is already holding lock:
  (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 - #1 (jbd2_handle){--..}:
[c0143a5c] __lock_acquire+0xa31/0xc1a
[c0143cbf] lock_acquire+0x7a/0x94
[c01fc4ca] jbd2_journal_start+0xf5/0xff
[c01e3539] ext4_journal_start_sb+0x48/0x4a
[c01eb980] ext4_ext_migrate+0x7d/0x535
[c01df328] ext4_ioctl+0x528/0x56c
[c0177700] do_ioctl+0x50/0x67
[c017794e] vfs_ioctl+0x237/0x24a
[c0177992] sys_ioctl+0x31/0x4b
[c0104f8a] sysenter_past_esp+0x5f/0xa5
[] 0x
 
 - #0 (ei-i_data_sem){}:
[c014394c] __lock_acquire+0x921/0xc1a
[c0143cbf] lock_acquire+0x7a/0x94
[c044f247] down_read+0x42/0x79
[c01dca58] ext4_get_blocks_wrap+0x21/0x108
[c01dcba1] ext4_getblk+0x62/0x1c4
[c01e0de9] ext4_find_entry+0x350/0x5b7
[c01e200c] ext4_unlink+0x6e/0x1a4
[c017449e] vfs_unlink+0x49/0x89
[c0175f02] do_unlinkat+0x96/0x12c
[c0175fa8] sys_unlink+0x10/0x12
[c0104f8a] sysenter_past_esp+0x5f/0xa5
[] 0x
 
 other info that might help us debug this:
 
 3 locks held by rm/2401:
  #0:  (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] 
 do_unlinkat+0x5e/0x12c
  #1:  (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] 
 vfs_unlink+0x36/0x89
  #2:  (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff
 
 stack backtrace:
 Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
  [c0106017] show_trace_log_lvl+0x1a/0x2f
  [c0106893] show_trace+0x12/0x14
  [c0106b89] dump_stack+0x6c/0x72
  [c0141b26] print_circular_bug_tail+0x5f/0x68
  [c014394c] __lock_acquire+0x921/0xc1a
  [c0143cbf] lock_acquire+0x7a/0x94
  [c044f247] down_read+0x42/0x79
  [c01dca58] ext4_get_blocks_wrap+0x21/0x108
  [c01dcba1] ext4_getblk+0x62/0x1c4
  [c01e0de9] ext4_find_entry+0x350/0x5b7
  [c01e200c] ext4_unlink+0x6e/0x1a4
  [c017449e] vfs_unlink+0x49/0x89
  [c0175f02] do_unlinkat+0x96/0x12c
  [c0175fa8] sys_unlink+0x10/0x12
  [c0104f8a] sysenter_past_esp+0x5f/0xa5
 
 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
  Add Acked-by: Jan Kara [EMAIL PROTECTED] if you wish.

Honza
 ---
  fs/ext4/migrate.c |  117 +---
  1 files changed, 74 insertions(+), 43 deletions(-)
 
 diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
 index 9ee1f7c..8c6c685 100644
 --- a/fs/ext4/migrate.c
 +++ b/fs/ext4/migrate.c
 @@ -61,10 +61,9 @@ static int finish_range(handle_t *handle, struct inode 
 *inode,
   retval = ext4_journal_restart(handle, needed);
   if (retval)
   goto err_out;
 - }
 - if (needed) {
 + } else if (needed) {
   retval = ext4_journal_extend(handle, needed);
 - if (retval != 0) {
 + if (retval) {
   /*
* IF not able to extend the journal restart the journal
*/
 @@ -220,6 +219,26 @@ static int update_tind_extent_range(handle_t *handle, 
 struct inode *inode,
  
  }
  
 +static int extend_credit_for_blkdel(handle_t *handle, struct inode *inode)
 +{
 + int retval = 0, needed;
 +
 + if (handle-h_buffer_credits  EXT4_RESERVE_TRANS_BLOCKS)
 + return 0;
 + /*
 +  * We are freeing a blocks. During this we touch
 +  * superblock, group descriptor and block bitmap.
 +  * So allocate a credit of 3. We may update
 +  * quota (user and group).
 +  */
 + needed = 3 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb);
 +
 + if (ext4_journal_extend(handle, needed) != 0)
 + retval = ext4_journal_restart(handle, needed);
 +
 + return retval;
 +}
 +
  static int free_dind_blocks(handle_t *handle,
   struct inode *inode, __le32 i_data)
  {
 @@ -234,11 +253,14 @@ static int free_dind_blocks(handle_t *handle,
  
   tmp_idata = (__le32 *)bh-b_data;
   for (i = 0; i  max_entries; i++) {
 - if (tmp_idata[i])
 + if (tmp_idata[i]) {
 + extend_credit_for_blkdel(handle, inode);
   ext4_free_blocks(handle, inode,
   le32_to_cpu(tmp_idata[i]), 1, 1);
 + }
   }
   put_bh(bh

Re: [RESEND] [PATCH] ext3,4:fdatasync should skip metadata writeout when overwriting

2008-02-06 Thread Jan Kara
  Hi,

 Currently fdatasync is identical to fsync in ext3,4.
 I think fdatasync should skip journal flush in data=ordered and 
 data=writeback mode
 when it overwrites to already-instantiated blocks on HDD.
 When I_DIRTY_DATASYNC flag is not set, fdatasync should skip journal writeout
 because this indicates only atime or/and mtime updates.  
 
 Following patch is the same approach of ext2's fsync code(ext2_sync_file).
 
 I did a performance test using the sysbench.
 
 #sysbench --num-threads=128 --max-requests=5 --test=fileio 
 --file-total-size=128G 
 --file-test-mode=rndwr --file-fsync-mode=fdatasync run
 
 The result was:
 
   -2.6.24
   Operations performed:  0 Read, 50080 Write, 59600 Other = 109680 Total
   Read 0b  Written 782.5Mb  Total transferred 782.5Mb  (12.116Mb/sec)
 775.45 Requests/sec executed
 
   Test execution summary:
   total time:  64.5814s
   total number of events:  50080
   total time taken by event execution: 3713.9836
   per-request statistics:
min:0.s
avg:0.0742s
max:0.9375s
approx.  95 percentile: 0.2901s
 
   Threads fairness:
   events (avg/stddev):   391.2500/23.26
   execution time (avg/stddev):   29.0155/1.99
 
 
   -2.6.24-patched
   Operations performed:  0 Read, 50009 Write, 61596 Other = 111605 Total
   Read 0b  Written 781.39Mb  Total transferred 781.39Mb  (16.419Mb/sec)
1050.83 Requests/sec executed
 
   Test execution summary:
   total time:  47.5900s
   total number of events:  50009
   total time taken by event execution: 2934.5768
   per-request statistics:
min:0.s
avg:0.0587s
max:0.8938s
approx.  95 percentile: 0.1993s
 
   Threads fairness:
   events (avg/stddev):   390.6953/22.64
   execution time (avg/stddev):   22.9264/1.17
 
 
 Filesystem I/O throughput was improved.
 
 Thanks.
 
 Signed-off-by :Hisashi Hifumi [EMAIL PROTECTED]
  Yes, the patch looks fine. You can add
Acked-by: Jan Kara [EMAIL PROTECTED]
  if you wish.
Honza

 diff -Nrup linux-2.6.24.org/fs/ext3/fsync.c linux-2.6.24/fs/ext3/fsync.c
 --- linux-2.6.24.org/fs/ext3/fsync.c  2008-01-25 07:58:37.0 +0900
 +++ linux-2.6.24/fs/ext3/fsync.c  2008-02-04 12:42:42.0 +0900
 @@ -72,6 +72,9 @@ int ext3_sync_file(struct file * file, s
   goto out;
   }
  
 + if (datasync  !(inode-i_state  I_DIRTY_DATASYNC))
 + goto out;
 +
   /*
* The VFS has written the file data.  If the inode is unaltered
* then we need not start a commit.
 diff -Nrup linux-2.6.24.org/fs/ext4/fsync.c linux-2.6.24/fs/ext4/fsync.c
 --- linux-2.6.24.org/fs/ext4/fsync.c  2008-01-25 07:58:37.0 +0900
 +++ linux-2.6.24/fs/ext4/fsync.c  2008-02-04 12:43:37.0 +0900
 @@ -72,6 +72,9 @@ int ext4_sync_file(struct file * file, s
   goto out;
   }
  
 +if (datasync  !(inode-i_state  I_DIRTY_DATASYNC))
 +goto out;
 +
   /*
* The VFS has written the file data.  If the inode is unaltered
* then we need not start a commit.
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Fix DIO locking

2008-02-06 Thread Jan Kara
  Hi,

  this patch fixes lock inversion in ext4 direct IO path. The similar patch
has already been accepted for ext3. Mingming, will you put it into ext4
patch queue? Thanks.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

We cannot start transaction in ext4_direct_IO() and just let it last during the
whole write because dio_get_page() acquires mmap_sem which ranks above
transaction start (e.g. because we have dependency chain
mmap_sem-PageLock-journal_start, or because we update atime while holding
mmap_sem) and thus deadlocks could happen. We solve the problem by starting
a transaction separately for each ext4_get_block() call.

We *could* have a problem that we allocate a block and before its data are
written out the machine crashes and thus we expose stale data. But that
does not happen because for hole-filling generic code falls back to buffered
writes and for file extension, we add inode to orphan list and thus in
case of crash, journal replay will truncate inode back to the original size.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bb717cb..1948b97 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -895,7 +895,16 @@ out:
return err;
 }
 
-#define DIO_CREDITS (EXT4_RESERVE_TRANS_BLOCKS + 32)
+/* Maximum number of blocks we map for direct IO at once. */
+#define DIO_MAX_BLOCKS 4096
+/*
+ * Number of credits we need for writing DIO_MAX_BLOCKS:
+ * We need sb + group descriptor + bitmap + inode - 4
+ * For B blocks with A block pointers per block we need:
+ * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
+ * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
+ */
+#define DIO_CREDITS 25
 
 int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
unsigned long max_blocks, struct buffer_head *bh,
@@ -942,49 +951,31 @@ static int ext4_get_block(struct inode *inode, sector_t 
iblock,
struct buffer_head *bh_result, int create)
 {
handle_t *handle = ext4_journal_current_handle();
-   int ret = 0;
+   int ret = 0, started = 0;
unsigned max_blocks = bh_result-b_size  inode-i_blkbits;
 
-   if (!create)
-   goto get_block; /* A read */
-
-   if (max_blocks == 1)
-   goto get_block; /* A single block get */
-
-   if (handle-h_transaction-t_state == T_LOCKED) {
-   /*
-* Huge direct-io writes can hold off commits for long
-* periods of time.  Let this commit run.
-*/
-   ext4_journal_stop(handle);
-   handle = ext4_journal_start(inode, DIO_CREDITS);
-   if (IS_ERR(handle))
+   
+   if (create  !handle) {/* Direct IO write... */
+   if (max_blocks  DIO_MAX_BLOCKS)
+   max_blocks = DIO_MAX_BLOCKS;
+   handle = ext4_journal_start(inode, DIO_CREDITS +
+ 2 * EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb));
+   if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
-   goto get_block;
-   }
-
-   if (handle-h_buffer_credits = EXT4_RESERVE_TRANS_BLOCKS) {
-   /*
-* Getting low on buffer credits...
-*/
-   ret = ext4_journal_extend(handle, DIO_CREDITS);
-   if (ret  0) {
-   /*
-* Couldn't extend the transaction.  Start a new one.
-*/
-   ret = ext4_journal_restart(handle, DIO_CREDITS);
+   goto out;
}
+   started = 1;
}
 
-get_block:
-   if (ret == 0) {
-   ret = ext4_get_blocks_wrap(handle, inode, iblock,
+   ret = ext4_get_blocks_wrap(handle, inode, iblock,
max_blocks, bh_result, create, 0);
-   if (ret  0) {
-   bh_result-b_size = (ret  inode-i_blkbits);
-   ret = 0;
-   }
+   if (ret  0) {
+   bh_result-b_size = (ret  inode-i_blkbits);
+   ret = 0;
}
+   if (started)
+   ext4_journal_stop(handle);
+out:
return ret;
 }
 
@@ -1674,7 +1665,8 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
  * if the machine crashes during the write.
  *
  * If the O_DIRECT write is intantiating holes inside i_size and the machine
- * crashes then stale disk data _may_ be exposed inside the file.
+ * crashes then stale disk data _may_ be exposed inside the file. But current
+ * VFS code falls back into buffered path in that case so we are safe.
  */
 static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
const struct iovec *iov, loff_t offset,
@@ -1683,7 +1675,7 @@ static

Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-05 Thread Jan Kara
On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote:
 On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
Hi,
  
  On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
   This is with the new ext3 - ext4 migrate code added. The recently added
   lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
   on the ext3 inode during migration to prevent walking the ext3 inode
   when it is being converted to ext4 format. Also we want to avoid 
   file truncation and new blocks being added while converting to ext4.
   Also we dont want to reserve large number of credits for journal.
   Any idea how to fix this ?
Hmm, while briefly looking at the code - why do you introduce i_data_sem
  and not use i_alloc_sem which is already in VFS inode? That is aimed
  exactly at the serialization of truncates, writes and similar users.
  That doesn't solve problems with lock ordering but I was just wondering...
Another problem - ext4_fallocate() has the same lock ordering problem as
  the migration code and maybe there are others...
One (stupid) solution to your problem is to make i_data_sem be
  always locked before the transaction is started. It could possibly have
  negative performance impact because you'd have to hold the semaphore for
  a longer time and thus a writer would block readers for longer time. So one
  would have to measure how big difference that would make.
Another possibility is to start a single transaction for migration and
  extend it as long as you can (as truncate does it). And when you can't
  extend any more, you drop the i_data_sem and start a new transaction and
  acquire the semaphore again. This has the disadvantage that after dropping
  the semaphore you have to resync your original inode with the temporary
  one your are building which probably ends up being ugly as night... Hmm,
  but maybe we could get rid of this - hold i_mutex to protect against all
  writes (that ranks outside of transaction start so you can hold it for the
  whole migration time - maybe you even hold it if you are called from the
  write path...). After dropping i_data_sem you let some readers proceed
  but writers still wait on i_mutex so the file shouldn't change under you
  (but I suggest adding some BUG_ONs to verify that the file really doesn't
  change :).
  
 
 How about the patch below. I did the below testing
 a) migrate a file
 b) run fs_inode fsstres fsx_linux.
 
 The intention was to find out whether the new locking is breaking any of
 the other expected hierarchy. It seems to fine. I didn't get any lockdep
 warning.
  I think there's a problem in the patch. I don't think you can call
free_ind_block() while readers are accessing the file. That could make them
think the file contains holes when they aren't there (or even worse read
freed blocks or so). So you need to lock i_data_sem before this call (that
means you have to move journal_extend() as well). Actually, I don't quite
get why ext4_journal_extend() is called in that function. You can just
count with the 1 credit in ext4_ext_migrate() when you call
ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh,
probably because free_ind_block() could extend the transaction (which they
don't do now as far as I can see).
  BTW: That freeing code should really take into account that it can modify
bitmaps in different block groups. It's even not that hard to do. Just
before each ext4_free_blocks() in free_ind_block() you check whether you
have still enough credits in the handle (use h_buffer_credits) and if not,
extend it by some amount.
  Maybe you could do some test like writing a big file with some data and then
while migration is running read it in a loop and compare that MD5SUM is the
same all the time. Also run some memory-pressure during this test so that
data blocks aren't cached for the whole time of the test... That should
reasonably stress the migration code.

 ext4: Fix circular locking dependency with migrate and rm.
 
 From: Aneesh Kumar K.V [EMAIL PROTECTED]
 
 We now take inode-i_mutex lock to prevent any update of the inode i_data
 field. Before we switch the inode format we take i_data_sem to prevent
 parallel read.
 
 ===
 [ INFO: possible circular locking dependency detected ]
 2.6.24-rc8 #6
 ---
 rm/2401 is trying to acquire lock:
  (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108
 
 but task is already holding lock:
  (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 - #1 (jbd2_handle){--..}:
[c0143a5c] __lock_acquire+0xa31/0xc1a
[c0143cbf] lock_acquire+0x7a/0x94
[c01fc4ca] jbd2_journal_start+0xf5/0xff
[c01e3539] ext4_journal_start_sb+0x48/0x4a
[c01eb980] ext4_ext_migrate+0x7d/0x535
[c01df328

Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-05 Thread Jan Kara
On Tue 05-02-08 21:57:03, Aneesh Kumar K.V wrote:
 On Tue, Feb 05, 2008 at 02:42:28PM +0100, Jan Kara wrote:
  On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote:
   
   How about the patch below. I did the below testing
   a) migrate a file
   b) run fs_inode fsstres fsx_linux.
   
   The intention was to find out whether the new locking is breaking any of
   the other expected hierarchy. It seems to fine. I didn't get any lockdep
   warning.
I think there's a problem in the patch. I don't think you can call
  free_ind_block() while readers are accessing the file. That could make them
  think the file contains holes when they aren't there (or even worse read
  freed blocks or so). So you need to lock i_data_sem before this call (that
  means you have to move journal_extend() as well). Actually, I don't quite
  get why ext4_journal_extend() is called in that function. You can just
  count with the 1 credit in ext4_ext_migrate() when you call
  ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh,
  probably because free_ind_block() could extend the transaction (which they
  don't do now as far as I can see).
 
 ext4_journal_extend is called to extend the journal by one credit to
 take care of writing the block containing inode. I moved the journal
 extend before taking i_data_sem lock.
 
 diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
 index f97c993..dc6617a 100644
 --- a/fs/ext4/migrate.c
 +++ b/fs/ext4/migrate.c
 @@ -302,10 +302,6 @@ static int ext4_ext_swap_inode_data(handle_t *handle, 
 struct inode *inode,
   struct ext4_inode_info *ei = EXT4_I(inode);
   struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);
  
 - retval = free_ind_block(handle, inode);
 - if (retval)
 - goto err_out;
 -
   /*
* One credit accounted for writing the
* i_data field of the original inode
 @@ -318,6 +314,10 @@ static int ext4_ext_swap_inode_data(handle_t *handle, 
 struct inode *inode,
   }
  
   down_write(EXT4_I(inode)-i_data_sem);
 + retval = free_ind_block(handle, inode);
 + if (retval)
 + goto err_out;
 +
   /*
* We have the extent map build with the tmp inode.
* Now copy the i_data across
 
 The above change also make sure we don't fail after we free the indirect
 blocks.
  Yeah, now it looks fine.

BTW: That freeing code should really take into account that it can modify
  bitmaps in different block groups. It's even not that hard to do. Just
  before each ext4_free_blocks() in free_ind_block() you check whether you
  have still enough credits in the handle (use h_buffer_credits) and if not,
  extend it by some amount.
 
 
 I have a FIXME at migrate.c:524 documenting exactly that. The
 difficult question was by how much we should extent the journal. ? But
 in reality we might have accumulated enough journal credits, I never
 really ran across a case where we are running out of the journal credit.
  Yes, I don't think it is likely to happen in reality but if somebody
would trigger this, it would be almost impossible to track down so one
should be quite careful with these things...
  And as I described, doing it failsafe is easy - just look how
try_to_extend_transaction() in ext4/inode.c handles similar problems with
truncate.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] jbd: fix assertion failure in journal_next_log_block

2008-02-05 Thread Jan Kara
  Hello,

  Sorry for replying a bit late but I'm currently falling behind in
maling-list reading...

 The way jbd tries to determine if there is enough space left on the journal in
 order to start a new transaction is looking at the space left in the journal 
 and
 the space needed for the committing transaction, which is 1/4 of the journal +
 the number of t_outstanding_credits for that transaction.  In this case its
 assumed that t_outstanding_credits accurately represents the number of credits
  Yes.

 waiting to be written to the journal, but this sometimes isn't the case.  The
 transaction has two counters for buffers, t_outstanding_credits which is used 
 in
 conjunction with handles that are added to the transaction, and t_nr_buffers
 which is only incremented/decremented when buffers are added/removed from the
 transaction and are actually destined to be journaled.  Normally these two
  t_nr_buffers actually represents number of buffers on BJ_Metadata list
and nobody uses it (except for the assertion in
__journal_temp_unlink_buffer()). t_outstanding_credits is supposed to be *the*
counter making sure we don't write more than we have credits for.

 counters are the same, however there are cases where the committing 
 transaction
 can have buffers moved to the next running transaction, for example any 
 buffers
 on the committing transactions t_reserved list would be moved to the next
 (running) transaction, and if it had been dirtied in the process it would
 immediately make it onto the t_updates list, which would increment 
 t_nr_buffers
  You probably mean t_buffers list here...
 but not t_outstanding_credits.  So you get into this situation where
  But which moving and dirtying do you mean? The caller which dirties
the buffer must make sure that he has acquired enough credits for the
transaction where the buffer ends up... So if there were not enough
buffers in the running transaction where we refiled the buffer it is a
bug in the caller which dirties the buffer.

 t_nr_buffers (the actual number of buffers that are on the transaction) is
 greater than the number of buffers accounted for via t_outstanding_credits.
 This presents a problem since as we loop through writting buffers to the
 journal, we decrement t_outstanding_credits, and if t_nr_buffers is more than
 t_outstanding_credits then we end up with a negative number for
 t_outstanding_credits, which means we start saying we need less than 1/4 of 
 the
 journal for our committing transaction and allow more transactions than we can
 handle to start, and then bam we fail because journal_next_log_block doesn't
 have enough free blocks in order to handle the request.  This has been tested
 and fixes the issue (which could not be reproduced by me but several other
 people could get it to reproduce using postmark), and although I couldn't
 reproduce the assertion, I could very easily reproduce the situation where
 t_outstanding_credits was  than t_nr_buffers.
  I suppose you see the assertion J_ASSERT(journal-j_free  1); to
fail, right? I don't see how your patch could help avoid that assertion.
You've just removed accounting of t_outstanding_credits which has no
impact on the real number of free blocks in the journal stored in
j_free. Anyway, if you can reproduce t_outstanding_credits 
t_nr_buffers, then there's something fishy. Are you able to reproduce it
also with a current kernel?
  Thanks for looking into the problem :)

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-04 Thread Jan Kara
On Mon 04-02-08 22:42:08, Aneesh Kumar K.V wrote:
 On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
Hi,
  
  On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
   This is with the new ext3 - ext4 migrate code added. The recently added
   lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
   on the ext3 inode during migration to prevent walking the ext3 inode
   when it is being converted to ext4 format. Also we want to avoid 
   file truncation and new blocks being added while converting to ext4.
   Also we dont want to reserve large number of credits for journal.
   Any idea how to fix this ?
Hmm, while briefly looking at the code - why do you introduce i_data_sem
  and not use i_alloc_sem which is already in VFS inode? That is aimed
  exactly at the serialization of truncates, writes and similar users.
 
 How about read ? We are changing the format of inode. We don't want even
 the read to go through.
  I just meant that the code could use i_alloc_sem instead of i_data_sem in
all the places, remove i_data_sem and you safe some memory... It's just a
suggestion for a cleanup.

One (stupid) solution to your problem is to make i_data_sem be
  always locked before the transaction is started. It could possibly have
  negative performance impact because you'd have to hold the semaphore for
  a longer time and thus a writer would block readers for longer time. So one
  would have to measure how big difference that would make.
Another possibility is to start a single transaction for migration and
  extend it as long as you can (as truncate does it). And when you can't
  extend any more, you drop the i_data_sem and start a new transaction and
  acquire the semaphore again. This has the disadvantage that after dropping
  the semaphore you have to resync your original inode with the temporary
  one your are building which probably ends up being ugly as night... Hmm,
  but maybe we could get rid of this - hold i_mutex to protect against all
  writes (that ranks outside of transaction start so you can hold it for the
  whole migration time - maybe you even hold it if you are called from the
  write path...). After dropping i_data_sem you let some readers proceed
  but writers still wait on i_mutex so the file shouldn't change under you
  (but I suggest adding some BUG_ONs to verify that the file really doesn't
  change :).
 
 A quick look says truncate can happen even when we hold i_mutex ??
  No, it shouldn't happen - do_truncate() in fs/open.c acquires i_mutex
before it calls notify_change().

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-04 Thread Jan Kara
  Hi,

On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
 This is with the new ext3 - ext4 migrate code added. The recently added
 lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
 on the ext3 inode during migration to prevent walking the ext3 inode
 when it is being converted to ext4 format. Also we want to avoid 
 file truncation and new blocks being added while converting to ext4.
 Also we dont want to reserve large number of credits for journal.
 Any idea how to fix this ?
  Hmm, while briefly looking at the code - why do you introduce i_data_sem
and not use i_alloc_sem which is already in VFS inode? That is aimed
exactly at the serialization of truncates, writes and similar users.
That doesn't solve problems with lock ordering but I was just wondering...
  Another problem - ext4_fallocate() has the same lock ordering problem as
the migration code and maybe there are others...
  One (stupid) solution to your problem is to make i_data_sem be
always locked before the transaction is started. It could possibly have
negative performance impact because you'd have to hold the semaphore for
a longer time and thus a writer would block readers for longer time. So one
would have to measure how big difference that would make.
  Another possibility is to start a single transaction for migration and
extend it as long as you can (as truncate does it). And when you can't
extend any more, you drop the i_data_sem and start a new transaction and
acquire the semaphore again. This has the disadvantage that after dropping
the semaphore you have to resync your original inode with the temporary
one your are building which probably ends up being ugly as night... Hmm,
but maybe we could get rid of this - hold i_mutex to protect against all
writes (that ranks outside of transaction start so you can hold it for the
whole migration time - maybe you even hold it if you are called from the
write path...). After dropping i_data_sem you let some readers proceed
but writers still wait on i_mutex so the file shouldn't change under you
(but I suggest adding some BUG_ONs to verify that the file really doesn't
change :).

Honza

 ===
 [ INFO: possible circular locking dependency detected ]
 2.6.24-rc8 #6
 ---
 rm/2401 is trying to acquire lock:
  (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108
 
 but task is already holding lock:
  (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 - #1 (jbd2_handle){--..}:
[c0143a5c] __lock_acquire+0xa31/0xc1a
[c0143cbf] lock_acquire+0x7a/0x94
[c01fc4ca] jbd2_journal_start+0xf5/0xff
[c01e3539] ext4_journal_start_sb+0x48/0x4a
[c01eb980] ext4_ext_migrate+0x7d/0x535
[c01df328] ext4_ioctl+0x528/0x56c
[c0177700] do_ioctl+0x50/0x67
[c017794e] vfs_ioctl+0x237/0x24a
[c0177992] sys_ioctl+0x31/0x4b
[c0104f8a] sysenter_past_esp+0x5f/0xa5
[] 0x
 
 - #0 (ei-i_data_sem){}:
[c014394c] __lock_acquire+0x921/0xc1a
[c0143cbf] lock_acquire+0x7a/0x94
[c044f247] down_read+0x42/0x79
[c01dca58] ext4_get_blocks_wrap+0x21/0x108
[c01dcba1] ext4_getblk+0x62/0x1c4
[c01e0de9] ext4_find_entry+0x350/0x5b7
[c01e200c] ext4_unlink+0x6e/0x1a4
[c017449e] vfs_unlink+0x49/0x89
[c0175f02] do_unlinkat+0x96/0x12c
[c0175fa8] sys_unlink+0x10/0x12
[c0104f8a] sysenter_past_esp+0x5f/0xa5
[] 0x
 
 other info that might help us debug this:
 
 3 locks held by rm/2401:
  #0:  (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] 
 do_unlinkat+0x5e/0x12c
  #1:  (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] 
 vfs_unlink+0x36/0x89
  #2:  (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff
 
 stack backtrace:
 Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
  [c0106017] show_trace_log_lvl+0x1a/0x2f
  [c0106893] show_trace+0x12/0x14
  [c0106b89] dump_stack+0x6c/0x72
  [c0141b26] print_circular_bug_tail+0x5f/0x68
  [c014394c] __lock_acquire+0x921/0xc1a
  [c0143cbf] lock_acquire+0x7a/0x94
  [c044f247] down_read+0x42/0x79
  [c01dca58] ext4_get_blocks_wrap+0x21/0x108
  [c01dcba1] ext4_getblk+0x62/0x1c4
  [c01e0de9] ext4_find_entry+0x350/0x5b7
  [c01e200c] ext4_unlink+0x6e/0x1a4
  [c017449e] vfs_unlink+0x49/0x89
  [c0175f02] do_unlinkat+0x96/0x12c
  [c0175fa8] sys_unlink+0x10/0x12
  [c0104f8a] sysenter_past_esp+0x5f/0xa5
  
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RESEND] jbd: Remove useless loop in journal_write_commit_record()

2008-01-30 Thread Jan Kara
  Hi,

  resending the patch just in case you've missed it.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Commit block was intended to have several copies of the header. But
due to a bug it never had them and actually, nobody checks that. So
just remove the useless loop.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 610264b..b54948f 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -104,7 +104,8 @@ static int journal_write_commit_record(journal_t *journal,
 {
struct journal_head *descriptor;
struct buffer_head *bh;
-   int i, ret;
+   journal_header_t *header;
+   int ret;
int barrier_done = 0;
 
if (is_journal_aborted(journal))
@@ -116,13 +117,10 @@ static int journal_write_commit_record(journal_t *journal,
 
bh = jh2bh(descriptor);
 
-   /* AKPM: buglet - add `i' to tmp! */
-   for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
-   tmp-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER);
-   tmp-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK);
-   tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid);
-   }
+   header = (journal_header_t*)(bh-b_data);
+   header-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER);
+   header-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK);
+   header-h_sequence = cpu_to_be32(commit_transaction-t_tid);
 
JBUFFER_TRACE(descriptor, write commit block);
set_buffer_dirty(bh);
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH, RFC] Add new development flag to the ext4 filesystem

2008-01-28 Thread Jan Kara
On Fri 25-01-08 03:50:04, Andreas Dilger wrote:
 On Jan 25, 2008  11:05 +0100, Jan Kara wrote:
  For example ext2 on fsync() just sync's a single inode
  (and has to use private_list to track metadata buffers associated with
  the inode) while ext3 flushes the whole journal.
 
 As for fsync(), we definitely need to preserve correct behaviour for the
 file itself, but there isn't a requirement that ext2 behave exactly like
 ext3 (it of course cannot).  In the proposed ext4-mount-unjournaled-ext2
 case, the superblock would be marked dirty as it is today and an e2fsck
 would need to be run at boot time.  That is fine so long as the fsync()
 will cause the one file's data to be on disk before it returns.
  Well, you have to also make sure that all indirect blocks are on disk
before fsync() returns. Otherwise there's not much point in the fact that
data itself reached the disk. And for that you need something like
private_list.

  In ext2, directory
  handling code is quite different. ext2 works in page cache of the
  directory while ext3 uses page cache of the underlying device via buffer
  heads - at least this second thing would be more or less mechanical
  thing to change and would make sence (we wouldn't have to reimplement
  readahead in ext3 directory handling code as we do now). I've looked at
  it once but then more urgent things came and ... you know it.
 
 I don't think it is a requirement that ext3 mounting a filesystem without
 a journal has to use page cache for directories.  I wouldn't object to
 that being fixed.  It definitely isn't a requirement for this to work,
 just an implementation difference.
  Yes, of course. I just wanted to point out that ext2 isn't a strict
subset of ext3 so there is some non-trivial work to be done before you can
safely mount ext2 as ext3-without-journal.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] jbd: Remove useless loop when writing commit record

2008-01-28 Thread Jan Kara
  Hi Andrew,

  here's the patch I wrote you about.
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Commit block was intended to have several copies of the header. But
due to a bug it never had them and actually, nobody checks that. So
just remove the useless loop.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 610264b..b54948f 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -104,7 +104,8 @@ static int journal_write_commit_record(journal_t *journal,
 {
struct journal_head *descriptor;
struct buffer_head *bh;
-   int i, ret;
+   journal_header_t *header;
+   int ret;
int barrier_done = 0;
 
if (is_journal_aborted(journal))
@@ -116,13 +117,10 @@ static int journal_write_commit_record(journal_t *journal,
 
bh = jh2bh(descriptor);
 
-   /* AKPM: buglet - add `i' to tmp! */
-   for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
-   tmp-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER);
-   tmp-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK);
-   tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid);
-   }
+   header = (journal_header_t*)(bh-b_data);
+   header-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER);
+   header-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK);
+   header-h_sequence = cpu_to_be32(commit_transaction-t_tid);
 
JBUFFER_TRACE(descriptor, write commit block);
set_buffer_dirty(bh);
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix commit block write in JBD

2008-01-28 Thread Jan Kara
On Sat 26-01-08 22:02:07, Andrew Morton wrote:
  On Wed, 23 Jan 2008 20:09:43 +0100 Jan Kara [EMAIL PROTECTED] wrote:
  
  Commit block is expected to have several copies of the header. Fix the
  bug Andrew has spotted ages ago.
  
 
 ages indeed.
 
  
  diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
  index 610264b..a69b240 100644
  --- a/fs/jbd/commit.c
  +++ b/fs/jbd/commit.c
  @@ -116,9 +116,8 @@ static int journal_write_commit_record(journal_t 
  *journal,
   
  bh = jh2bh(descriptor);
   
  -   /* AKPM: buglet - add `i' to tmp! */
  for (i = 0; i  bh-b_size; i += 512) {
  -   journal_header_t *tmp = (journal_header_t*)bh-b_data;
  +   journal_header_t *tmp = (journal_header_t*)(bh-b_data+i);
  tmp-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER);
  tmp-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK);
  tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid);
 
 But I don't think we can _use_ this feature now.  Because there are
 1000 disks out there which didn't implement it.
 So why not just remove the loop and do a single write?
  Yes, but OTOH once the filesystem gets mounted with a new kernel, the
journal gets quickly rewritten and we'll have correct commit blocks there.
But since neither kernel nor e2fsprogs actually check for further sectors
(they check for the header just in the beginning of a block), I agree that
removing the loop completely is probably the best option. Nobody cared so
far so I guess they won't care in future as well. I'll send you a
replacement patch.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext3: Fix lock inversion in direct IO

2008-01-28 Thread Jan Kara
  Hi Andrew,

  the patch below fixes a lock inversion which someone reported recently.
More details below in the changelog. Can you merge the patch please? I'll
also write a similar patch for ext4 once we agree this is the way to go...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

We cannot start transaction in ext3_direct_IO() and just let it last during the
whole write because dio_get_page() acquires mmap_sem which ranks above
transaction start (e.g. because we have dependency chain
mmap_sem-PageLock-journal_start, or because we update atime while holding
mmap_sem) and thus deadlocks could happen. We solve the problem by starting
a transaction separately for each ext3_get_block() call.

We *could* have a problem that we allocate a block and before its data are
written out the machine crashes and thus we expose stale data. But that
does not happen because for hole-filling generic code falls back to buffered
writes and for file extension, we add inode to orphan list and thus in
case of crash, journal replay will truncate inode back to the original size.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 9b162cd..5ab7c57 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -941,55 +941,45 @@ out:
return err;
 }
 
-#define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
+/* Maximum number of blocks we map for direct IO at once. */
+#define DIO_MAX_BLOCKS 4096
+/*
+ * Number of credits we need for writing DIO_MAX_BLOCKS:
+ * We need sb + group descriptor + bitmap + inode - 4
+ * For B blocks with A block pointers per block we need:
+ * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
+ * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
+ */
+#define DIO_CREDITS 25
 
 static int ext3_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
 {
handle_t *handle = ext3_journal_current_handle();
-   int ret = 0;
+   int ret = 0, started = 0;
unsigned max_blocks = bh_result-b_size  inode-i_blkbits;
 
-   if (!create)
-   goto get_block; /* A read */
-
-   if (max_blocks == 1)
-   goto get_block; /* A single block get */
-
-   if (handle-h_transaction-t_state == T_LOCKED) {
-   /*
-* Huge direct-io writes can hold off commits for long
-* periods of time.  Let this commit run.
-*/
-   ext3_journal_stop(handle);
-   handle = ext3_journal_start(inode, DIO_CREDITS);
-   if (IS_ERR(handle))
+   if (create  !handle) {/* Direct IO write... */
+   if (max_blocks  DIO_MAX_BLOCKS)
+   max_blocks = DIO_MAX_BLOCKS;
+   handle = ext3_journal_start(inode, DIO_CREDITS +
+   2 * EXT3_QUOTA_TRANS_BLOCKS(sb));
+   if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
-   goto get_block;
-   }
-
-   if (handle-h_buffer_credits = EXT3_RESERVE_TRANS_BLOCKS) {
-   /*
-* Getting low on buffer credits...
-*/
-   ret = ext3_journal_extend(handle, DIO_CREDITS);
-   if (ret  0) {
-   /*
-* Couldn't extend the transaction.  Start a new one.
-*/
-   ret = ext3_journal_restart(handle, DIO_CREDITS);
+   goto out;
}
+   started = 1;
}
 
-get_block:
-   if (ret == 0) {
-   ret = ext3_get_blocks_handle(handle, inode, iblock,
+   ret = ext3_get_blocks_handle(handle, inode, iblock,
max_blocks, bh_result, create, 0);
-   if (ret  0) {
-   bh_result-b_size = (ret  inode-i_blkbits);
-   ret = 0;
-   }
+   if (ret  0) {
+   bh_result-b_size = (ret  inode-i_blkbits);
+   ret = 0;
}
+   if (started)
+   ext3_journal_stop(handle);
+out:
return ret;
 }
 
@@ -1680,7 +1670,8 @@ static int ext3_releasepage(struct page *page, gfp_t wait)
  * if the machine crashes during the write.
  *
  * If the O_DIRECT write is intantiating holes inside i_size and the machine
- * crashes then stale disk data _may_ be exposed inside the file.
+ * crashes then stale disk data _may_ be exposed inside the file. But current
+ * VFS code falls back into buffered path in that case so we are safe.
  */
 static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
const struct iovec *iov, loff_t offset,
@@ -1689,7 +1680,7 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
struct file *file = iocb-ki_filp;
struct inode *inode

[PATCH] Fix commit block write in JBD

2008-01-23 Thread Jan Kara
  Hi,

  the patch below fixes preparation of commit block in
journal_write_commit_record(). Obviously the bug doesn't really matter
since nobody reported it so far but let's cleanup the code... Andrew, could
you please queue it up? Thanks.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Commit block is expected to have several copies of the header. Fix the
bug Andrew has spotted ages ago.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 610264b..a69b240 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -116,9 +116,8 @@ static int journal_write_commit_record(journal_t *journal,
 
bh = jh2bh(descriptor);
 
-   /* AKPM: buglet - add `i' to tmp! */
for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
+   journal_header_t *tmp = (journal_header_t*)(bh-b_data+i);
tmp-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER);
tmp-h_blocktype = cpu_to_be32(JFS_COMMIT_BLOCK);
tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid);
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Fix commit block write in JBD2

2008-01-23 Thread Jan Kara
On Wed 23-01-08 20:09:43, Jan Kara wrote:
   Hi,
 
   the patch below fixes preparation of commit block in
 journal_write_commit_record(). Obviously the bug doesn't really matter
 since nobody reported it so far but let's cleanup the code... Andrew, could
 you please queue it up? Thanks.
  And the same fix for JBD2.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Commit block is expected to have several copies of the header. Fix the
bug Andrew has spotted ages ago.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6986f33..ed61283 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -116,9 +116,8 @@ static int journal_write_commit_record(journal_t *journal,
 
bh = jh2bh(descriptor);
 
-   /* AKPM: buglet - add `i' to tmp! */
for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
+   journal_header_t *tmp = (journal_header_t*)(bh-b_data+i);
tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid);
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Confused by journaling code in ext3_new_blocks()

2008-01-23 Thread Jan Kara
  Hello,

 I have the following question related to the journaling code in
 ext3_new_blocks() function of fs/ext3/balloc.c, any help in this
 regard will be greatly appreciated. The code snippet below is taken
 from 2.6.24-rc8-mm1 but this code has changed little for a long time
 so it's likely that any other recent kernel will also have very
 similar code.
 
 ext3_new_blocks()
 - ext3_try_to_allocate_with_rsv()
 - ext3_journal_dirty_metadata(handle, bitmap_bh)
 ...
 allocated:
 if ( /*sanity checking of newly allocated block numbers*/) {
 ext3_error(...);
 goto out;
 }
 
 out:
 ...
 brelse(bitmap_bh);
 return 0;
 
 
 In the above code snippet, ext3_try_to_allocate_with_rsv() is marking
 bitmap_bh as dirty metadata _before_ the sanity checks. I'm not
 worried about the sanity checks as such, but it seems to me that if
 the checks were to fail, we'd be committing the bitmap_bh updates to
 disk even though it violates metadata sanity, or am I missing
 something here ?
  Actually, it depends. Because if sanity checks fail, we call
ext3_error() which remounts fs R/O or panics in most cases. Only if
errors=continue is set, we really get to committing the wrong bitmap
data. But I don't think this really is a problem - fsck has to be run
anyway and it rebuilds the bitmaps from scratch.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix commit block write in JBD2

2008-01-23 Thread Jan Kara
  Hi,

On Thu 24-01-08 02:48:41, Girish Shilamkar wrote:
   The loop was removed in journal checksum patch. There had been a
 discussion (11 Jul 2007: [EXT4 set 8][PATCH 1/1]Add journal checksums)
 about this part of code as checksum info was added to commit block. 
  Ah, OK, thanks for explanation. So for JBD2 we don't need the patch
anymore. But for JBD it's still needed.

Honza

 On Wed, 2008-01-23 at 20:10 +0100, Jan Kara wrote:
  On Wed 23-01-08 20:09:43, Jan Kara wrote:
 Hi,
   
 the patch below fixes preparation of commit block in
   journal_write_commit_record(). Obviously the bug doesn't really matter
   since nobody reported it so far but let's cleanup the code... Andrew, 
   could
   you please queue it up? Thanks.
And the same fix for JBD2.
  
  Honza
 
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH resend] Fix reading of bitmaps from filesystem image

2008-01-14 Thread Jan Kara
On Mon 14-01-08 10:57:18, Theodore Tso wrote:
 On Wed, Jan 09, 2008 at 08:54:35PM +0100, Jan Kara wrote:
  
  Reading of bitmaps from image file could never work with more than one
  group in a filesystem... Fix the loops so that we read appropriate number
  of blocks.
 
 OK, so I'm probably being dense, but what's the problem?
 
 You changed the loop from counting in bytes to inodes, but either
 method should work.
  At the beginning of read_bitmap() there is:
unsigned int block_nbytes = EXT2_BLOCKS_PER_GROUP(fs-super) / 8;
unsigned inode_nbytes = EXT2_INODES_PER_GROUP(fs-super) / 8;

  Therefore in the loop:
 while (inode_nbytes  0)

  we end up reading bitmap for just one group...


Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Fix the soft lockup with multi block allocator.

2008-01-09 Thread Jan Kara
On Wed 09-01-08 23:54:28, Aneesh Kumar K.V wrote:
 On Wed, Jan 09, 2008 at 01:10:41PM +0100, Jan Kara wrote:
   With the multi block allocator when we don't have prealloc space we 
   discard
   @@ -3790,7 +3782,9 @@ repeat:

 /* if we still need more blocks and some PAs were used, try again */
 if (free  needed  busy) {
   + busy = 0;
 ext4_unlock_group(sb, group);
   + schedule_timeout(HZ);
 goto repeat;
 }
Hmm, wouldn't just schedule() be enough here? That would give a good
  chance to other processes to proceed and we would avoid this artificial
  wait of 1s which is quite ugly IMO.
  
 
 But then who will wake up the task ?. I have the below comment added to
 the patch in the patch queue.
  As far as I know, you don't have to wake-up the task explicitely.
Scheduler will simply schedule the task sometime in future (it is a similar
situation as if the task got preempted in the kernel).

 /*
  * We see this quiet rare. But if a particular workload is
  * effected by this we may need to add a waitqueue
  */
  Yes, adding that comment is good in any case :).

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend] Fix reading of bitmaps from filesystem image

2008-01-09 Thread Jan Kara
  Hi Ted,

  I've sent you the attached fix some time ago but it didn't seem to get
merged yet...
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Subject: Fix reading of bitmaps from filesystem image

Reading of bitmaps from image file could never work with more than one
group in a filesystem... Fix the loops so that we read appropriate number
of blocks.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/lib/ext2fs/rw_bitmaps.c b/lib/ext2fs/rw_bitmaps.c
index 1897ec3..abbe5e8 100644
--- a/lib/ext2fs/rw_bitmaps.c
+++ b/lib/ext2fs/rw_bitmaps.c
@@ -190,7 +190,7 @@ static errcode_t read_bitmaps(ext2_filsy
if (fs-flags  EXT2_FLAG_IMAGE_FILE) {
blk = (fs-image_header-offset_inodemap / fs-blocksize);
ino_cnt = fs-super-s_inodes_count;
-   while (inode_nbytes  0) {
+   while (do_inode  ino_cnt  0) {
retval = io_channel_read_blk(fs-image_io, blk++,
 1, inode_bitmap);
if (retval)
@@ -202,15 +202,14 @@ static errcode_t read_bitmaps(ext2_filsy
   ino_itr, cnt, inode_bitmap);
if (retval)
goto cleanup;
-   ino_itr += fs-blocksize  3;
-   ino_cnt -= fs-blocksize  3;
-   inode_nbytes -= fs-blocksize;
+   ino_itr += cnt;
+   ino_cnt -= cnt;
}
blk = (fs-image_header-offset_blockmap /
   fs-blocksize);
blk_cnt = EXT2_BLOCKS_PER_GROUP(fs-super) * 
fs-group_desc_count;
-   while (block_nbytes  0) {
+   while (do_block  blk_cnt  0) {
retval = io_channel_read_blk(fs-image_io, blk++,
 1, block_bitmap);
if (retval)
@@ -222,9 +221,8 @@ static errcode_t read_bitmaps(ext2_filsy
   blk_itr, cnt, block_bitmap);
if (retval)
goto cleanup;
-   blk_itr += fs-blocksize  3;
-   blk_cnt -= fs-blocksize  3;
-   block_nbytes -= fs-blocksize;
+   blk_itr += cnt;
+   blk_cnt -= cnt;
}
goto success_cleanup;
}
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend] mke2fs: Make lost+found always have at least 2 blocks

2008-01-09 Thread Jan Kara
  Hi,

  this is a resend of Andreas' patch I've sent in the beginning of
December. The patch modifies mke2fs to always create lost+found with at
least two directory blocks. I think this could make sence not only for
testing but as a sanity check that 64KB support works for general e2fsprogs
if someone decides to use it...

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

From: Andreas Dilger [EMAIL PROTECTED]

Make sure lost+found has always at least 2 disk blocks. This will provide at
least elementary test that we have not screwed-up support for 64KB blocks since
the second directory block will be empty.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 98a4957..6249cc2 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -534,7 +534,10 @@ static void create_lost_and_found(ext2_f
}

for (i=1; i  EXT2_NDIR_BLOCKS; i++) {
-   if ((lpf_size += fs-blocksize) = 16*1024)
+   /* Ensure that lost+found is at least 2 blocks, so we always
+* test large empty blocks for big-block filesystems.  */
+   if ((lpf_size += fs-blocksize) = 16*1024 
+   lpf_size = 2 * fs-blocksize)
break;
retval = ext2fs_expand_dir(fs, ino);
if (retval) {
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered

2007-12-12 Thread Jan Kara
.
  
  I didn't try with other journaling filesystems. I guess ext2 also doesn't
  exhibit the problem.
  
 
 Interesting that data=writeback helped this.  You don't give a lot of
 details, but I assume that data=writeback made a large difference here?
  What I think could be happening:
One process does data-intensive load. Thus in the ordered mode the
transaction is tiny but has tons of data buffers attached. If commit
happens, it takes a long time to sync all the data before the commit
can proceed... In the writeback mode, we don't wait for data buffers, in
the journal mode amount of data to be written is really limited by the
maximum size of a transaction and so we write by much smaller chunks
and better latency is thus ensured.
  I'll try some tests to verify whether my theory is correct :).

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 9546] New: Huge latency in concurrent I/O when using data=ordered

2007-12-12 Thread Jan Kara
 (the program should have to access files, such as cookies, 
   cache and
   so for konqueror). Then remount/tune the filesystem to use another data 
   mode
   for ext3.
   
   I didn't try with other journaling filesystems. I guess ext2 also doesn't
   exhibit the problem.
   
  
  Interesting that data=writeback helped this.  You don't give a lot of
  details, but I assume that data=writeback made a large difference here?
   What I think could be happening:
 One process does data-intensive load. Thus in the ordered mode the
 transaction is tiny but has tons of data buffers attached. If commit
 happens, it takes a long time to sync all the data before the commit
 can proceed... In the writeback mode, we don't wait for data buffers, in
 the journal mode amount of data to be written is really limited by the
 maximum size of a transaction and so we write by much smaller chunks
 and better latency is thus ensured.
   I'll try some tests to verify whether my theory is correct :).
  Hmm, I couldn't reproduce the bad behavior... But anyway, if my theory
is correct, attached patch should help - can you try it please? It's a
debugging quality only but shouldn't do anything bad to your data :)

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Don't allow too much data buffers in a transaction.

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 08ff6c7..e6f9dd6 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -163,7 +163,7 @@ repeat_locked:
 	spin_lock(transaction-t_handle_lock);
 	needed = transaction-t_outstanding_credits + nblocks;
 
-	if (needed  journal-j_max_transaction_buffers) {
+	if (needed  journal-j_max_transaction_buffers || atomic_read(transaction-t_data_buf_count)  32768) {
 		/*
 		 * If the current transaction is already too large, then start
 		 * to commit it: we can then go back and attach this handle to
@@ -1528,6 +1528,7 @@ static void __journal_temp_unlink_buffer(struct journal_head *jh)
 		return;
 	case BJ_SyncData:
 		list = transaction-t_sync_datalist;
+		atomic_dec(transaction-t_data_buf_count);
 		break;
 	case BJ_Metadata:
 		transaction-t_nr_buffers--;
@@ -1989,6 +1990,7 @@ void __journal_file_buffer(struct journal_head *jh,
 		return;
 	case BJ_SyncData:
 		list = transaction-t_sync_datalist;
+		atomic_inc(transaction-t_data_buf_count);
 		break;
 	case BJ_Metadata:
 		transaction-t_nr_buffers++;
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index d9ecd13..6dd284a 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -541,6 +541,12 @@ struct transaction_s
 	int			t_outstanding_credits;
 
 	/*
+	 * Number of data buffers on t_sync_datalist attached to
+	 * the transaction.
+	 */
+	atomic_t		t_data_buf_count;
+
+	/*
 	 * Forward and backward links for the circular list of all transactions
 	 * awaiting checkpoint. [j_list_lock]
 	 */


[PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize

2007-12-06 Thread Jan Kara
  Hi,

  attached is a new version of support for 64KB blocksize in e2fsprogs. The
patch went through testing by a script I'll send in the following email so
now the modifications should be correct. Ted, can you have a look at it
when you have time? Thanks.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR

---

Subject: Support for 64KB blocksize in ext2-4 directories.

When block size is 64KB, we have to take care that rec_len does not overflow.
Kernel stores 0x in case 0x1 should be stored - perform appropriate
conversion when processing directories.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/debugfs/htree.c b/debugfs/htree.c
index d0e673e..a326241 100644
--- a/debugfs/htree.c
+++ b/debugfs/htree.c
@@ -40,6 +40,7 @@ static void htree_dump_leaf_node(ext2_fi
blk_t   pblk;
ext2_dirhash_t  hash;
int hash_alg;
+   int rec_len;

errcode = ext2fs_bmap(fs, ino, inode, buf, 0, blk, pblk);
if (errcode) {
@@ -61,10 +62,8 @@ static void htree_dump_leaf_node(ext2_fi
 
while (offset  fs-blocksize) {
dirent = (struct ext2_dir_entry *) (buf + offset);
-   if (((offset + dirent-rec_len)  fs-blocksize) ||
-   (dirent-rec_len  8) ||
-   ((dirent-rec_len % 4) != 0) ||
-   (((dirent-name_len  0xFF)+8)  dirent-rec_len)) {
+   rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
+   if (ext2fs_validate_dirent(fs, offset, dirent)  0) {
fprintf(pager, Corrupted directory block (%u)!\n, 
blk);
break;
}
@@ -79,7 +78,7 @@ static void htree_dump_leaf_node(ext2_fi
com_err(htree_dump_leaf_node, errcode,
while calculating hash);
sprintf(tmp, %u 0x%08x (%d) %s   , dirent-inode,
-   hash, dirent-rec_len, name);
+   hash, rec_len, name);
thislen = strlen(tmp);
if (col + thislen  80) {
fprintf(pager, \n);
@@ -87,7 +86,7 @@ static void htree_dump_leaf_node(ext2_fi
}
fprintf(pager, %s, tmp);
col += thislen;
-   offset += dirent-rec_len;
+   offset += rec_len;
}
fprintf(pager, \n);
 }
@@ -389,7 +388,7 @@ static int search_dir_block(ext2_filsys
printf(offset %u\n, offset);
return BLOCK_ABORT;
}
-   offset += dirent-rec_len;
+   offset += ext2fs_rec_len_from_disk(dirent-rec_len);
}
return 0;
 }
diff --git a/debugfs/ls.c b/debugfs/ls.c
index 52c7e34..1960c11 100644
--- a/debugfs/ls.c
+++ b/debugfs/ls.c
@@ -97,7 +97,7 @@ static int list_dir_proc(ext2_ino_t dir
fprintf (ls-f,  %s %s\n, datestr, name);
} else {
sprintf(tmp, %c%u%c (%d) %s   , lbr, dirent-inode, rbr,
-   dirent-rec_len, name);
+   ext2fs_rec_len_from_disk(dirent-rec_len), name);
thislen = strlen(tmp);
 
if (ls-col + thislen  80) {
diff --git a/e2fsck/message.c b/e2fsck/message.c
index b2e3e0f..05b2e17 100644
--- a/e2fsck/message.c
+++ b/e2fsck/message.c
@@ -342,12 +342,13 @@ static _INLINE_ void expand_dirent_expre
  struct problem_context *ctx)
 {
struct ext2_dir_entry   *dirent;
-   int len;
+   int len, rec_len;

if (!ctx || !ctx-dirent)
goto no_dirent;

dirent = ctx-dirent;
+   rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);

switch (ch) {
case 'i':
@@ -357,12 +358,12 @@ static _INLINE_ void expand_dirent_expre
len = dirent-name_len  0xFF;
if (len  EXT2_NAME_LEN)
len = EXT2_NAME_LEN;
-   if (len  dirent-rec_len)
-   len = dirent-rec_len;
+   if (len  rec_len)
+   len = rec_len;
safe_print(dirent-name, len);
break;
case 'r':
-   printf(%u, dirent-rec_len);
+   printf(%u, rec_len);
break;
case 'l':
printf(%u, dirent-name_len  0xFF);
diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
index ceb9c7f..fd2c7d0 100644
--- a/e2fsck/pass1.c
+++ b/e2fsck/pass1.c
@@ -379,6 +379,7 @@ static void check_is_really_dir(e2fsck_t
errcode_t   retval;
blk_t   blk;
int i, not_device = 0;
+   int rec_len;
 
if (LINUX_S_ISDIR(inode-i_mode) || LINUX_S_ISREG(inode-i_mode) ||
LINUX_S_ISLNK(inode-i_mode) || inode-i_block[0] == 0

[PATCH] mke2fs: Make lost+found always have at least two blocks

2007-12-06 Thread Jan Kara
  Hi,

  attached patch modifies mke2fs to always create lost+found with at least
two directory blocks. I think this could make sence not only for testing
but as a sanity check that 64KB support works for general e2fsprogs if
someone decides to use it...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

From: Andreas Dilger [EMAIL PROTECTED]

Make sure lost+found has always at least 2 disk blocks. This will provide at
least elementary test that we have not screwed-up support for 64KB blocks since
the second directory block will be empty.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 98a4957..6249cc2 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -534,7 +534,10 @@ static void create_lost_and_found(ext2_f
}

for (i=1; i  EXT2_NDIR_BLOCKS; i++) {
-   if ((lpf_size += fs-blocksize) = 16*1024)
+   /* Ensure that lost+found is at least 2 blocks, so we always
+* test large empty blocks for big-block filesystems.  */
+   if ((lpf_size += fs-blocksize) = 16*1024 
+   lpf_size = 2 * fs-blocksize)
break;
retval = ext2fs_expand_dir(fs, ino);
if (retval) {
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Fix reading of bitmaps from a filesystem image

2007-11-29 Thread Jan Kara
  Hi,

  reading of bitmaps from a filesystem image seems to be busted. The patch
below fixes it for me.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Subject: Fix reading of bitmaps from filesystem image

Reading of bitmaps from image file could never work with more than one
group in a filesystem... Fix the loops so that we read appropriate number
of blocks.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/lib/ext2fs/rw_bitmaps.c b/lib/ext2fs/rw_bitmaps.c
index 1897ec3..abbe5e8 100644
--- a/lib/ext2fs/rw_bitmaps.c
+++ b/lib/ext2fs/rw_bitmaps.c
@@ -190,7 +190,7 @@ static errcode_t read_bitmaps(ext2_filsy
if (fs-flags  EXT2_FLAG_IMAGE_FILE) {
blk = (fs-image_header-offset_inodemap / fs-blocksize);
ino_cnt = fs-super-s_inodes_count;
-   while (inode_nbytes  0) {
+   while (do_inode  ino_cnt  0) {
retval = io_channel_read_blk(fs-image_io, blk++,
 1, inode_bitmap);
if (retval)
@@ -202,15 +202,14 @@ static errcode_t read_bitmaps(ext2_filsy
   ino_itr, cnt, inode_bitmap);
if (retval)
goto cleanup;
-   ino_itr += fs-blocksize  3;
-   ino_cnt -= fs-blocksize  3;
-   inode_nbytes -= fs-blocksize;
+   ino_itr += cnt;
+   ino_cnt -= cnt;
}
blk = (fs-image_header-offset_blockmap /
   fs-blocksize);
blk_cnt = EXT2_BLOCKS_PER_GROUP(fs-super) * 
fs-group_desc_count;
-   while (block_nbytes  0) {
+   while (do_block  blk_cnt  0) {
retval = io_channel_read_blk(fs-image_io, blk++,
 1, block_bitmap);
if (retval)
@@ -222,9 +221,8 @@ static errcode_t read_bitmaps(ext2_filsy
   blk_itr, cnt, block_bitmap);
if (retval)
goto cleanup;
-   blk_itr += fs-blocksize  3;
-   blk_cnt -= fs-blocksize  3;
-   block_nbytes -= fs-blocksize;
+   blk_itr += cnt;
+   blk_cnt -= cnt;
}
goto success_cleanup;
}
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Fix assertion failure in fs/jbd/checkpoint.c

2007-11-28 Thread Jan Kara
  Hi,

  the patch below should fix an assertion failure in JBD checkpointing
code. The patch survived some fsstress and similar runs on my test machine
so it shouldn't be obviously wrong ;).

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Before we start committing a transaction, we call
__journal_clean_checkpoint_list() to cleanup transaction's written-back
buffers. If this call happens to remove all of them (and there were already
some buffers), __journal_remove_checkpoint() will decide to free the
transaction because it isn't (yet) a committing transaction and soon we fail
some assertion - the transaction really isn't ready to be freed :).
We change the check in __journal_remove_checkpoint() to free only a transaction
in T_FINISHED state. The locking there is subtle though (as everywhere in
JBD ;(). We use j_list_lock to protect the check and a subsequent call to
__journal_drop_transaction() and do the same in the end of
journal_commit_transaction() which is the only place where a transaction can
get to T_FINISHED state. Probably I'm too paranoid here and such locking is
not really necessary - checkpoint lists are processed only from
log_do_checkpoint() where a transaction must be already committed to be
processed or from __journal_clean_checkpoint_list() where kjournald itself
calls it and thus transaction cannot change state either. Better be safe if
something changes in future...

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index 47552d4..0f69c41 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -602,15 +602,15 @@ int __journal_remove_checkpoint(struct journal_head *jh)
 
/*
 * There is one special case to worry about: if we have just pulled the
-* buffer off a committing transaction's forget list, then even if the
-* checkpoint list is empty, the transaction obviously cannot be
-* dropped!
+* buffer off a running or committing transaction's checkpoing list,
+* then even if the checkpoint list is empty, the transaction obviously
+* cannot be dropped!
 *
-* The locking here around j_committing_transaction is a bit sleazy.
+* The locking here around t_state is a bit sleazy.
 * See the comment at the end of journal_commit_transaction().
 */
-   if (transaction == journal-j_committing_transaction) {
-   JBUFFER_TRACE(jh, belongs to committing transaction);
+   if (transaction-t_state != T_FINISHED) {
+   JBUFFER_TRACE(jh, belongs to running/committing transaction);
goto out;
}
 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 8f1f2aa..610264b 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -858,10 +858,10 @@ restart_loop:
}
spin_unlock(journal-j_list_lock);
/*
-* This is a bit sleazy.  We borrow j_list_lock to protect
-* journal-j_committing_transaction in __journal_remove_checkpoint.
-* Really, __journal_remove_checkpoint should be using j_state_lock but
-* it's a bit hassle to hold that across __journal_remove_checkpoint
+* This is a bit sleazy.  We use j_list_lock to protect transition
+* of a transaction into T_FINISHED state and calling
+* __journal_drop_transaction(). Otherwise we could race with
+* other checkpointing code processing the transaction...
 */
spin_lock(journal-j_state_lock);
spin_lock(journal-j_list_lock);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 16e7ed8..d9ecd13 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -439,6 +439,8 @@ struct transaction_s
/*
 * Transaction's current state
 * [no locking - only kjournald alters this]
+* [j_list_lock] guards transition of a transaction into T_FINISHED
+* state and subsequent call of __journal_drop_transaction()
 * FIXME: needs barriers
 * KLUDGE: [use j_state_lock]
 */
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: dir inode reservation V3

2007-11-20 Thread Jan Kara
  Hi Coly,

  finally I've found some time to have a look at a new version of your
patch.

 5, Performance number
 On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
 partition, and tried to create 5 directories, and create 15 (1KB)
 files in each directory alternatively. After a remount, I tried to
 remove all the directories and files recursively by a 'rm -rf'. Bellow
 is the benchmark result,
   normal ext4 ext4 with dir inode 
 reservation
   mount options:  -o data=writeback   -o 
 data=writeback,dir_ireserve=low
   Create dirs:real0m49.101s   real2m59.703s
   Create files:   real24m17.962s  real21m8.161s
   Unlink all: real24m43.788s  real17m29.862s
 Creating dirs with dir inode reservation is slower than normal ext4 as
 predicted, because allocating directory inodes in non-linear order
 will cause extra hard disk seeking and block I/O. Creating files with
 dir inode reservation is 13% faster than normal ext4. Unlink all the
 directories and files is 29.2% faster as expected.  When number of
 directories is increased, the performance improvement will be more
 considerable. More benchmark result will be posted here if necessary,
 because I need more time to run more test cases.
  Maybe to get some better idea - could you compare how long does take
untar of a kernel tree, find through the whole kernel tree and removal
of the tree? Also measuring CPU load during your benchmarks would be
useful so that we can see whether we don't increase too much the cost of
searching in bitmaps...

  The patch is nicely short ;)

 diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
 index 17b5df1..f838a72 100644
 --- a/fs/ext4/ialloc.c
 +++ b/fs/ext4/ialloc.c
 @@ -10,6 +10,8 @@
   *  Stephen Tweedie ([EMAIL PROTECTED]), 1993
   *  Big-endian to little-endian byte-swapping/bitmaps by
   *David S. Miller ([EMAIL PROTECTED]), 1995
 + *  Directory inodes reservation by
 + *Coly Li ([EMAIL PROTECTED]), 2007
   */
 
  #include linux/time.h
 @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, 
 struct inode *parent,
   return -1;
  }
 
 +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
 +   int mode, ext4_group_t *group, unsigned long *ino)
 +{
 + struct super_block *sb;
 + struct ext4_sb_info *sbi;
 + struct ext4_group_desc *gdp = NULL;
 + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
 + ext4_group_t ires_group = *group;
 + unsigned long ires_ino;
 + int i, bit;
 +
 + sb = dir-i_sb;
 + sbi = EXT4_SB(sb);
 +
 + /* if the inode number is not for directory,
 +  * only try to allocate after directory's inode
 +  */
 + if (!S_ISDIR(mode)) {
 + *ino = dir-i_ino % EXT4_INODES_PER_GROUP(sb);
 + return 0;
 + }
  ^^^ You don't set a group here - is this intentional? It means we get
the group from find_group_other() and there we start searching at a
place corresponding to parent's inode number... That would be an
interesting heuristic but I'm not sure if that's what you want.

 +
 + /* reserve inodes for new directory */
 + for (i = 0; i  sbi-s_groups_count; i++) {
 + gdp = ext4_get_group_desc(sb, ires_group, gdp_bh);
 + if (!gdp)
 + goto fail;
 + bit = 0;
 +try_same_group:
 + if (bit  EXT4_INODES_PER_GROUP(sb)) {
 + brelse(bitmap_bh);
 + bitmap_bh = read_inode_bitmap(sb, ires_group);
 + if (!bitmap_bh)
 + goto fail;
 +
 + BUFFER_TRACE(bitmap_bh, get_write_access);
 + if (ext4_journal_get_write_access(
 + handle, bitmap_bh) != 0)
 + goto fail;
 + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
 + bit, bitmap_bh-b_data)) {
 + /* we won it */
 + BUFFER_TRACE(bitmap_bh,
 + call ext4_journal_dirty_metadata);
 + if (ext4_journal_dirty_metadata(handle,
 + bitmap_bh) != 0)
 + goto fail;
 + ires_ino = bit;
 + goto find;
 + }
 + /* we lost it */
 + jbd2_journal_release_buffer(handle, bitmap_bh);
 + bit += sbi-s_dir_ireserve_nr;
 + goto try_same_group;
 + }
 So this above is just a while loop coded with goto... While loop
would be IMO better.

Honza
-- 
Jan Kara

Re: [PATCH] ext4: dir inode reservation V3

2007-11-20 Thread Jan Kara
On Wed 21-11-07 00:40:17, Coly Li wrote:
 Jan Kara wrote:
  diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
  index 17b5df1..f838a72 100644
  --- a/fs/ext4/ialloc.c
  +++ b/fs/ext4/ialloc.c
  @@ -10,6 +10,8 @@
*  Stephen Tweedie ([EMAIL PROTECTED]), 1993
*  Big-endian to little-endian byte-swapping/bitmaps by
*David S. Miller ([EMAIL PROTECTED]), 1995
  + *  Directory inodes reservation by
  + *Coly Li ([EMAIL PROTECTED]), 2007
*/
 
   #include linux/time.h
  @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, 
  struct inode *parent,
 return -1;
   }
 
  +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
  +int mode, ext4_group_t *group, unsigned long *ino)
  +{
  +  struct super_block *sb;
  +  struct ext4_sb_info *sbi;
  +  struct ext4_group_desc *gdp = NULL;
  +  struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
  +  ext4_group_t ires_group = *group;
  +  unsigned long ires_ino;
  +  int i, bit;
  +
  +  sb = dir-i_sb;
  +  sbi = EXT4_SB(sb);
  +
  +  /* if the inode number is not for directory,
  +   * only try to allocate after directory's inode
  +   */
  +  if (!S_ISDIR(mode)) {
  +  *ino = dir-i_ino % EXT4_INODES_PER_GROUP(sb);
  +  return 0;
  +  }
^^^ You don't set a group here - is this intentional? It means we get
  the group from find_group_other() and there we start searching at a
  place corresponding to parent's inode number... That would be an
  interesting heuristic but I'm not sure if that's what you want.
 Yes, if allocating a file inode, just return first inode offset in the 
 reserved area of parent
 directory. In this case, group is decided by find_group_other() or 
 find_group_orlov(),
 ext4_ino_from_ireserve() just tries to persuade linear inode allocator to 
 search free inode slot
 after parent's inode.
  But what I mean is: Parent directory is in group 1, with inode number 10, now
find_group_other will set group to 2 and you set inode number to 10 so
linear allocator will start searching in group 2, inode number 10 which is
*not* just after directory inode

  +
  +  /* reserve inodes for new directory */
  +  for (i = 0; i  sbi-s_groups_count; i++) {
  +  gdp = ext4_get_group_desc(sb, ires_group, gdp_bh);
  +  if (!gdp)
  +  goto fail;
  +  bit = 0;
  +try_same_group:
  +  if (bit  EXT4_INODES_PER_GROUP(sb)) {
  +  brelse(bitmap_bh);
  +  bitmap_bh = read_inode_bitmap(sb, ires_group);
  +  if (!bitmap_bh)
  +  goto fail;
  +
  +  BUFFER_TRACE(bitmap_bh, get_write_access);
  +  if (ext4_journal_get_write_access(
  +  handle, bitmap_bh) != 0)
  +  goto fail;
  +  if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
  +  bit, bitmap_bh-b_data)) {
  +  /* we won it */
  +  BUFFER_TRACE(bitmap_bh,
  +  call ext4_journal_dirty_metadata);
  +  if (ext4_journal_dirty_metadata(handle,
  +  bitmap_bh) != 0)
  +  goto fail;
  +  ires_ino = bit;
  +  goto find;
  +  }
  +  /* we lost it */
  +  jbd2_journal_release_buffer(handle, bitmap_bh);
  +  bit += sbi-s_dir_ireserve_nr;
  +  goto try_same_group;
  +  }
   So this above is just a while loop coded with goto... While loop
  would be IMO better.
 
 The only reason for me to use a goto, is 80 column limitation :) BTW,
 goto does not hurt performance and readability here. IMHO, it's
 acceptable :-)
  But you could just remove goto try_same_group; and change 'if' to 'while'.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Oops 2.6.23.1 in ext3+jbd at journal_put_journal_head

2007-11-13 Thread Jan Kara
  Hello,

 A one-time event thus far, happened under very heavy I/O,
 Dell i9400 Core2Duo notebook w/3GB ram, single SATA drive with ext3.
 Had to cycle power to get it back and see this Oops in the syslog:
 
 : BUG: unable to handle kernel paging request at virtual address 430a7261
 :  printing eip:
 : c01a6605
 : *pde = 
 : Oops: 0002 [#1]
 : PREEMPT SMP
 : Modules linked in: nls_iso8859_1 vfat fat usb_storage ide_core libusual 
 hci_usb ext2 loop nls_cp437 isofs zlib_inflate udf vmnet(P) vmblock(P) 
 vmmon(P) binfmt_misc rfcomm l2cap bluetooth nfs nfsd exportfs lockd nfs_acl 
 auth_rpcgss sunrpc acpi_cpufreq cpufreq_stats cpufreq_userspace 
 cpufreq_ondemand cpufreq_conservative freq_table cpufreq_powersave 
 container fan firmware_class af_packet pciehp usbhid hid pci_hotplug visor 
 usbserial fuse mousedev snd_hda_intel snd_pcm_oss snd_pcm snd_mixer_oss 
 snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event 
 serio_raw snd_seq snd_timer snd_seq_device sg thermal firewire_ohci snd 
 pcspkr sr_mod cdrom psmouse firewire_core sdhci mmc_core b44 mii crc_itu_t 
 ac uhci_hcd ehci_hcd intel_agp agpgart processor button soundcore 
 snd_page_alloc usbcore battery unix
 : CPU:1
 : EIP:0060:[journal_put_journal_head+64/209]Tainted: PVLI
 : EFLAGS: 00010202   (2.6.23.1-slab #15)
 : EIP is at journal_put_journal_head+0x40/0xd1
 : eax: c2bf7000   ebx: 430a7261   ecx:    edx: c24f4780
 : esi: f000fea5   edi: c24f4780   ebp: f000fea5   esp: c2bf7e38
 : ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
  Hmm, your pointer to buffer_head in journal_head has been overwritten
by some garbage - it actually looks like ASCII (C\n ra). I think your
journal_head pointer is stored in EAX (at least if I compile SMP kernel
for i386 it is) and that is 0xc2bd7000 - start of the page. So probably
some driver went wild and overwritten a piece of memory which did not
belong to it... I suggest turning on a few debugging options (like
DEBUG_SLAB) to catch the offender.

 : Process kswapd0 (pid: 243, ti=c2bf7000 task=c29ec030 task.ti=c2bf7000)
 : Stack:  d6216868 002a d4f52670 0034 f000fea5  
 c01a34b6
 :0246 c003fe08 ef082d98 ef082d4c ef082d4c c29f00c0 c003fdb8 
 c0198a8f
 : 0002ad02 ef082d4c c0145b21 c24f4780 000b c014a5e0 
 000e
 : Call Trace:
 :  [journal_try_to_free_buffers+299/383] 
 journal_try_to_free_buffers+0x12b/0x17f
 :  [ext3_releasepage+0/114] ext3_releasepage+0x0/0x72
 :  [try_to_release_page+48/66] try_to_release_page+0x30/0x42
 :  [__invalidate_mapping_pages+116/231] __invalidate_mapping_pages+0x74/0xe7
 :  [invalidate_mapping_pages+15/17] invalidate_mapping_pages+0xf/0x11
 :  [shrink_icache_memory+219/445] shrink_icache_memory+0xdb/0x1bd
 :  [shrink_slab+217/338] shrink_slab+0xd9/0x152
 :  [kswapd+729/1069] kswapd+0x2d9/0x42d
 :  [autoremove_wake_function+0/53] autoremove_wake_function+0x0/0x35
 :  [kswapd+0/1069] kswapd+0x0/0x42d
 :  [kthread+56/95] kthread+0x38/0x5f
 :  [kthread+0/95] kthread+0x0/0x5f
 :  [kernel_thread_helper+7/16] kernel_thread_helper+0x7/0x10
 :  ===
 : Code: 89 e0 25 00 f0 ff ff ff 48 14 8b 40 08 a8 04 74 05 e8 a6 d6 0e 00 
 f3 90 89 e0 25 00 f0 ff ff ff 40 14 8b 03 a9 00 00 40 00 75 d5 f0 0f ba 
 2b 16 19 c0 85 c0 75 ec 8b 46 04 85 c0 7f 30 c7 44 24
 : EIP: [journal_put_journal_head+64/209] journal_put_journal_head+0x40/0xd1 
 SS:ESP 0068:c2bf7e38
 : note: kswapd0[243] exited with preempt_count 2

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize

2007-11-12 Thread Jan Kara
On Sat 10-11-07 19:37:03, Theodore Tso wrote:
 On Wed, Nov 07, 2007 at 05:09:39PM +0100, Jan Kara wrote:
 
  Subject: Support for 64KB blocksize in ext2-4 directories.
  
  When block size is 64KB, we have to take care that rec_len does not 
  overflow.
  Kernel stores 0x in case 0x1 should be stored - perform appropriate
  conversion when reading from / writing to disk.
 
 NACK.  You can't do the conversion in the reader/writer routines
 because the fundamentally rec_len is only a 16 bit field.  So when you
 read a directory block where the rec_len field is encoded as 0x,
 and you translate it to 0x1, when you assign it to
 dirent-rec_len, the 0x1 gets chopped off and rec_len gets a value
 of zero.  Did you test this patch before submitting it?
  Argh, stupid me. I've just tested that I didn't break anything for normal
block size and thought that I cannot make mistake in such a simple thing
;).

 The only way to do this is to find all of the places that reference
 rec_len, and do the check there.
  Yes.. Thanks for having look.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize

2007-11-12 Thread Jan Kara
On Mon 12-11-07 09:58:23, Theodore Tso wrote:
 On Mon, Nov 12, 2007 at 10:52:45AM +0100, Jan Kara wrote:
   Did you test this patch before submitting it?
 
Argh, stupid me. I've just tested that I didn't break anything for normal
  block size and thought that I cannot make mistake in such a simple thing
  ;).
 
 Could I ask you to perhaps include some 64k blocksize test cases that
 would exercise the new codepaths?
  Fair enough, will do.

   The only way to do this is to find all of the places that reference
   rec_len, and do the check there.
Yes.. Thanks for having look.
 
 One suggestion is that instead of just creating an conversion
 function, and then doing a global search and replace, in some places
 it might be better to declare an integer variable, and then assign
 rec_len = ext2fs_rec_len_from_disk(dirent-rec_len).  For example,
 that would make ext2fs_process_dir_block() more readable, where
 dirent-rec_len is used no less than eight times.
  OK, thanks for suggestion.

 Thanks, and my apologies for not having time to review the patch until
 now.  At the moment things are a bit crazy since I am effectively
 doing two jobs, since I am in transition between two assignments, and
 me doing most of both of them at the moment.  I should have
 substantially more time after the new year begins.
  I see. I'll be patient then :)

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

2007-11-08 Thread Jan Kara
On Wed 07-11-07 19:20:38, Theodore Tso wrote:
 On Wed, Nov 07, 2007 at 03:36:05PM +0100, Jan Kara wrote:
   What if more than one application wants to use this facility?
 
That should be fine - let's see: Each application keeps somewhere a time 
  when
  it started a scan of a subtree (or it can actually remember a time when it
  set the flag for each directory), during the scan, it sets the flag on
  each directory. When it wakes up to recheck the subtree it just compares
  the rtime against the stored time - if rtime is greater, subtree has been
  modified since the last scan and we recurse in it and when we are finished
  with it we set the flag. Now notice that we don't care about the flag when
  we check for changes - we care only for rtime - so if there are several
  applications interested in the same subtree, the flag just gets set more
  often and thus the update of rtime happens more often but the same scheme
  still works fine.
 
 OK, so in this case you don't need to set rtime on the every single
 file inode, but only directory inode, right?  Because you're only
  Yes, that's actually what I'm doing - sorry if I didn't make it clear
earlier.

 using checking the rtime at the directory level, and not the flag.
 And it's just as easy for you to check the rtime flag for the file's
 containing directory (modulo magic vis-a-vis hard links) as the file's
 inode.
  Exactly.

 I'm just really wishing that rtime and the rtime flag didn't have live
 on disk, but could rather be in memory.  If you only needed to save
 the directory flags and rtimes, that might actually be doable.
  I already gave some thought to this but there seemed to be some
drawbacks. Query I want to support is: given a directory, tell me which of
its subdirectories (arbitrarily deep below) have been modified since time
T.  That is what you need to support faster rsync, updatedb and similar
loads.  Also I want to allow a reboot to happen inbetween the modification
and a query (handling a crash correctly would be nice too but honestly my
current implementation is not completely reliable in this regard either) so
some pernament storage is needed in any case. What I can imagine we could
do is to report all modifications to userspace - that has a problem that
there are *many* possible modifications but we are interested only whether
there happened some since time T. We could improve this by an in-memory
inode flag I'm not interested in modifications any further and reporting
the change only if the parent directory does not have this flag set (note
that this flag gets lost when we evict the inode from memory). But I would
say that in the end all this message passing, climbing the tree from
userspace and maintaining data structure in memory and on disk would cost
use more than the current implementation... Also it has the disadvantage
that we miss the modifications which happen before we start the userspace
daemon catching the events.
  Doing this in kernel memory has a problem how to solve the persistency
across reboots (dumping mod's to userspace on request?) and also on my
system you'd have roughly a few MB of pinned memory for these purposes...
Plausible but I don't really like it...

 Note by the way that since you need to own the file/directory to set
 flags, this means that only programs that are running as root or
 running as the uid who owns the entire subtree will be able to use
 this scheme.  One advantage of doing in kernel memory is that you
 might be able to support watching a tree that is not owned by the
 watcher.
  Yes, that is the advantage. On the other hand we could allow setting that
particular flag even without being an owner of the inode. In fact, I
don't currently see use case where you won't be either root (rsync,
updatedb) or an owner of the files (watching config file trees) but I guess
people would find some :).

I don't get it here - you need to scan the whole subtree and set the flag
  only during the initial scan. Later, you need to scan and set the flag only
  for directories in whose subtree something changed. Similarty rtime needs
  to be updated for each inode at most once after the scan. 
 
 OK, so in the worst case every single file in a kernel source tree
 might change after doing an extreme git checkout.  That means around
 36k of files get updated.  So if you have to set/clear the rtime flag
 during the checkout process 36k file inodes would have to have their
 rtime flag cleared, plus 2k worth of directory inodes; but those would
 probably be folded into other changes made to the inodes anyway.  But
  Yes, here the impact is hardly measurable as I've written in the previous
email.

 then when trackerd goes back and scans the subtree, if you are
 actually setting rtime flags for every single file inode, then that's
 38k of indoes that need updating.  If you only need to set the rtime
 flags for directories, that's only 2k worth of extra gratuitous inode
 updates.
  As I wrote above, the flag

Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

2007-11-08 Thread Jan Kara
On Thu 08-11-07 09:37:59, Theodore Tso wrote:
 Ah, OK, so the two things that I didn't get from your patch
 description are:
 
 1) the rtime flag and rtime field are only set on directories
 2) the intended use is not trackerd and its ilk, but rsync and updatedb,
so it is desirable that scan/queries be persistent across reboots
 
 But then the major hole in this scheme is still the issue of hard
 links.  The rsync program is still going to have to scan the entire
 subtree looking for hard links, since an inode with multiple links
 into the directory tree can't guarantee that all of its parent
 directories will have their rtime field updated.
  Not really - initially rsync can scan a tree for hardlinks and remember
where they are. If a hardlink to a file is created, an rtime update is
sent up the tree via the path used to create the link. So during next scan,
rsync will see the file is modified and finds out that its nlink is  1
and adds it to the list of hardlinked files.
  So for things like regular backups hardlinks can be dealt with
efficiently.

 A program like updatedb which only cares about filenames will be OK,
 since that means it really only cares about knowing when directories
 have changed, and you can't have hard links to directories.
 
 The other problem, of course, is that this feature would become ext
 2/3/4 specific, and I could see future filesystems possibly wanting
 this.  So this raises the question of whether the interface should be
 at the VFS layer or not --- and if so, how to handle querying whether
 a particulra filesystem supports it, and what happens if you have a
 subtree which is covered by a filesystem that doesn't support rtime?
 
 So a program like rsync would need to scan /proc/self/mounts to see
 whether or not it would be safe to use this feature in the first
  Yes, being filesystem specific and thus requiring special handling of
mount points is a disadvantage of this approach.

 place.  And, of course, rsync would need to know whether it has write
 access to the tree in order to set flags in the directory, and what to
 do if some portion of the subtree isn't writeable by rsync.
  Yes, the cases where we cannot modify the flag in a tree would have to be
handled (similarly as the cases where the filesystem simply does not
support the feature). I don't think it wouldn't be too complicated but I have
not the modification for rsync yet, so I can underestimate...

 On Thu, Nov 08, 2007 at 11:56:42AM +0100, Jan Kara wrote:
   Note by the way that since you need to own the file/directory to set
   flags, this means that only programs that are running as root or
   running as the uid who owns the entire subtree will be able to use
   this scheme.  One advantage of doing in kernel memory is that you
   might be able to support watching a tree that is not owned by the
   watcher.
Yes, that is the advantage. On the other hand we could allow setting that
  particular flag even without being an owner of the inode. In fact, I
  don't currently see use case where you won't be either root (rsync,
  updatedb) or an owner of the files (watching config file trees) but I guess
  people would find some :).
 
 Sometimes people like to use rsync to copy a subtree to which they
 have read access but not write access.  (And here note that it's not
 enough to have write access, you actually need to *own* all of the
 directories in the subtree).
  Yes, so in such cases my feature won't be able to help. But I think
there are still enough cases where it would help.

 Yes, it's safe to let any user *set* the rtime flag, but we couldn't
 let them clear the rtime flag, since then they would be able to hide a
 file modification from some other (potentially privileged) process.
  Good point.

 Speaking of security, I assume your patch will never allow rtime to go
 backwards (for example if the user attempts to backdate a file's mtime
 field using the utime() or utimes() system call)?
  No, the patch does not allow this. But anyway in case user has enough
rights to change file's mtime, would it really be a security concern?

 I guess I'm convinced that updatedb could use this facility, but there
 are enough asteriks around it that I'm not sure that rsync could
 safely use this feature in production.  I don't doubt that in a cold
 cache case, it would speed up rsync, but because it doesn't handle
 hard links, it's not reliable.  Since rsync often gets used for
 backups, this is a big deal.  There are also questions about what to
 do if rsync doesn't have write access to the filesystem, or if there
 is a non-rtime capable filesystem mounted in the subtree, etc., that
 can be worked around, but would add a lot of complexity and grottiness
 to the rsync source tree.  Is the rsync maintainer really willing to
 add all of the necessary hair to support this rtime facility into
 their program?
  Hardlinks can be worked-around as I wrote above and there would have to
be a fallback in case we cannot set

Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

2007-11-07 Thread Jan Kara
On Tue 06-11-07 10:04:47, H. Peter Anvin wrote:
 Arjan van de Ven wrote:
 On Tue, 6 Nov 2007 18:19:45 +0100
 Jan Kara [EMAIL PROTECTED] wrote:
 
 Implement recursive mtime (rtime) feature for ext3. The feature works
 as follows: In each directory we keep a flag EXT3_RTIME_FL
 (modifiable by a user) whether rtime should be updated. In case a
 directory or a file in it is modified and when the flag is set,
 directory's rtime is updated, the flag is cleared, and we move to the
 parent. If the flag is set there, we clear it, update rtime and
 continue upwards upto the root of the filesystem. In case a regular
 file or symlink is modified, we pick arbitrary of its parents
 (actually the one that happens to be at the head of i_dentry list)
 and start the rtime update algorith there.
 
 Ok since mtime (and rtime) are part of the inode and not the dentry...
 how do you deal with hardlinks? And with cases of files that have been
 unlinked? (ok the later is a wash obviously other than not crashing)
  Unlinked files are easy - you just don't propagate the rtime anywhere.
For hardlinks see below.

 There is only one possible answer... he only updates the directory path 
 that was used to touch the particular file involved.  Thus, the 
 semantics gets grotty not just in the presence of hard links, but also 
 in the presence of bind- and other non-root mounts.
  Update of recursive mtime does not pass filesystem boundaries (i.e.
mountpoints) so bind mounts and such are non-issue (hmm, at least that was
my original idea but as I'm looking now I don't handle bind mounts properly
so that needs to be fixed). With hardlinks, you are right that the
behaviour is undeterministic - I tried to argue in the text of the mail
that this does not actually matter - there are not many hardlinks on usual
system and so the application can check hardlinked files in the old way -
i.e. look at mtime.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

2007-11-07 Thread Jan Kara
On Tue 06-11-07 18:01:00, Al Viro wrote:
 On Tue, Nov 06, 2007 at 06:19:45PM +0100, Jan Kara wrote:
  Implement recursive mtime (rtime) feature for ext3. The feature works as
  follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a 
  user)
  whether rtime should be updated. In case a directory or a file in it is
  modified and when the flag is set, directory's rtime is updated, the flag is
  cleared, and we move to the parent. If the flag is set there, we clear it,
  update rtime and continue upwards upto the root of the filesystem. In case a
  regular file or symlink is modified, we pick arbitrary of its parents 
  (actually
  the one that happens to be at the head of i_dentry list) and start the rtime
  update algorith there.
 
 *e*
 
 Nothing like undeterministic behaviour, is there?
  Oh yes, there is :) But I tried to argue it does not really matter -
application would have to handle hardlinks in a special way but I find that
acceptable given how rare they are...

  Intended use case is that application which wants to watch any modification 
  in
  a subtree scans the subtree and sets flags for all inodes there. Next time, 
  it
  just needs to recurse in directories having rtime newer than the start of 
  the
  previous scan. There it can handle modifications and set the flag again. It 
  is
  up to application to watch out for hardlinked files. It can e.g.  build 
  their
  list and check their mtime separately (when a hardlink to a file is created 
  its
  inode is modified and rtimes properly updated and thus any application has 
  an
  effective way of finding new hardlinked files).
 
 You know, you can do that with aush^H^Hdit right now...
  Interesting idea, no I have not thought about this. I guess you mean
watching all the VFS modification events and then do the checking and 
propagation
from user space... My first feeling is that the performance penalty would be
considerably higher (currently I am at 1% performance penalty for quite
pessimistic test case) but in case the current patch would be considered
unacceptable, I can try how large the penalty would be. Thanks for
suggestion.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize

2007-11-07 Thread Jan Kara
  Hello,

  sorry for replying to myself but I've just found out that the patch I've
sent was and old version of the patch which had some problems. Attached is
a new version.

On Tue 06-11-07 12:31:42, Jan Kara wrote:
   it seems attached patch still did not get your attention. It makes
 e2fsprogs properly handle filesystems with 64KB block size. Could you put
 it into e2fsprogs git? Thanks.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
Subject: Support for 64KB blocksize in ext2-4 directories.

When block size is 64KB, we have to take care that rec_len does not overflow.
Kernel stores 0x in case 0x1 should be stored - perform appropriate
conversion when reading from / writing to disk.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c
index fb20fa0..db73edd 100644
--- a/lib/ext2fs/dirblock.c
+++ b/lib/ext2fs/dirblock.c
@@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi
 		dirent = (struct ext2_dir_entry *) p;
 #ifdef WORDS_BIGENDIAN
 		dirent-inode = ext2fs_swab32(dirent-inode);
-		dirent-rec_len = ext2fs_swab16(dirent-rec_len);
 		dirent-name_len = ext2fs_swab16(dirent-name_len);
 #endif
+		dirent-rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
 		name_len = dirent-name_len;
 #ifdef WORDS_BIGENDIAN
 		if (flags  EXT2_DIRBLOCK_V2_STRUCT)
@@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil
 errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block,
   void *inbuf, int flags EXT2FS_ATTR((unused)))
 {
-#ifdef WORDS_BIGENDIAN
 	errcode_t	retval;
 	char		*p, *end;
 	char		*buf = 0;
 	struct ext2_dir_entry *dirent;
 
+#ifndef WORDS_BIGENDIAN
+	if (fs-blocksize  EXT2_MAX_REC_LEN)
+		goto just_write;
+#endif
 	retval = ext2fs_get_mem(fs-blocksize, buf);
 	if (retval)
 		return retval;
@@ -88,19 +91,18 @@ errcode_t ext2fs_write_dir_block2(ext2_f
 			return (EXT2_ET_DIR_CORRUPTED);
 		}
 		p += dirent-rec_len;
+		dirent-rec_len = ext2fs_rec_len_to_disk(dirent-rec_len);
+#ifdef WORDS_BIGENDIAN
 		dirent-inode = ext2fs_swab32(dirent-inode);
-		dirent-rec_len = ext2fs_swab16(dirent-rec_len);
-		dirent-name_len = ext2fs_swab16(dirent-name_len);
-
-		if (flags  EXT2_DIRBLOCK_V2_STRUCT)
+		if (!(flags  EXT2_DIRBLOCK_V2_STRUCT))
 			dirent-name_len = ext2fs_swab16(dirent-name_len);
+#endif
 	}
  	retval = io_channel_write_blk(fs-io, block, 1, buf);
 	ext2fs_free_mem(buf);
 	return retval;
-#else
+just_write:
  	return io_channel_write_blk(fs-io, block, 1, (char *) inbuf);
-#endif
 }
 
 
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index a316665..21747c2 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -717,6 +718,32 @@ struct ext2_dir_entry_2 {
 #define EXT2_DIR_ROUND			(EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT2_DIR_ROUND)  \
 	 ~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN		((116)-1)
+
+static inline unsigned ext2fs_rec_len_from_disk(unsigned len)
+{
+#ifdef WORDS_BIGENDIAN
+	len = ext2fs_swab16(dlen);
+#endif
+	if (len == EXT2_MAX_REC_LEN)
+		return 1  16;
+	return len;
+}
+
+static inline unsigned ext2fs_rec_len_to_disk(unsigned len)
+{
+	if (len == (1  16))
+#ifdef WORDS_BIGENDIAN
+		return ext2fs_swab16(EXT2_MAX_REC_LEN);
+#else
+		return EXT2_MAX_REC_LEN;
+#endif
+#ifdef WORDS_BIGENDIAN
+	return ext2fs_swab_16(len);
+#else
+	return len;
+#endif
+}
 
 /*
  * This structure will be used for multiple mount protection. It will be
diff --git a/misc/e2image.c b/misc/e2image.c
index 1fbb267..4e2c9fb 100644
--- a/misc/e2image.c
+++ b/misc/e2image.c
@@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils
 	end = buf + fs-blocksize;
 	for (p = buf; p  end-8; p += rec_len) {
 		dirent = (struct ext2_dir_entry_2 *) p;
-		rec_len = dirent-rec_len;
-#ifdef WORDS_BIGENDIAN
-		rec_len = ext2fs_swab16(rec_len);
-#endif
+		rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
 #if 0
 		printf(rec_len = %d, name_len = %d\n, rec_len, dirent-name_len);
 #endif
@@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils
 			   bad rec_len (%d)\n, (unsigned long) blk, 
 			   rec_len);
 			rec_len = end - p;
-#ifdef WORDS_BIGENDIAN
-dirent-rec_len = ext2fs_swab16(rec_len);
-#endif
+			dirent-rec_len = ext2fs_rec_len_to_disk(rec_len);
 			continue;
 		}
 		if (dirent-name_len + 8  rec_len) {


[PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize (fwd)

2007-11-06 Thread Jan Kara
  Wrote wrong mailing list address so I'm resending it to a correct one.

- Forwarded message from Jan Kara [EMAIL PROTECTED] -

From: Jan Kara [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: [PATCH] e2fsprogs: Handle rec_len correctly for 64KB blocksize

  Hello Ted,

  it seems attached patch still did not get your attention. It makes
e2fsprogs properly handle filesystems with 64KB block size. Could you put
it into e2fsprogs git? Thanks.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR

Subject: Support for 64KB blocksize in ext2-4 directories.

When block size is 64KB, we have to take care that rec_len does not overflow.
Kernel stores 0x in case 0x1 should be stored - perform appropriate
conversion when reading from / writing to disk.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c
index fb20fa0..628bf93 100644
--- a/lib/ext2fs/dirblock.c
+++ b/lib/ext2fs/dirblock.c
@@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi
dirent = (struct ext2_dir_entry *) p;
 #ifdef WORDS_BIGENDIAN
dirent-inode = ext2fs_swab32(dirent-inode);
-   dirent-rec_len = ext2fs_swab16(dirent-rec_len);
dirent-name_len = ext2fs_swab16(dirent-name_len);
 #endif
+   dirent-rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
name_len = dirent-name_len;
 #ifdef WORDS_BIGENDIAN
if (flags  EXT2_DIRBLOCK_V2_STRUCT)
@@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil
 errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block,
  void *inbuf, int flags EXT2FS_ATTR((unused)))
 {
-#ifdef WORDS_BIGENDIAN
errcode_t   retval;
char*p, *end;
char*buf = 0;
struct ext2_dir_entry *dirent;
 
+#ifndef WORDS_BIGENDIAN
+   if (fs-blocksize  EXT2_MAX_REC_LEN)
+   goto just_write;
+#endif
retval = ext2fs_get_mem(fs-blocksize, buf);
if (retval)
return retval;
@@ -89,7 +92,7 @@ errcode_t ext2fs_write_dir_block2(ext2_f
}
p += dirent-rec_len;
dirent-inode = ext2fs_swab32(dirent-inode);
-   dirent-rec_len = ext2fs_swab16(dirent-rec_len);
+   dirent-rec_len = ext2fs_rec_len_to_disk(dirent-rec_len);
dirent-name_len = ext2fs_swab16(dirent-name_len);
 
if (flags  EXT2_DIRBLOCK_V2_STRUCT)
@@ -98,9 +101,8 @@ errcode_t ext2fs_write_dir_block2(ext2_f
retval = io_channel_write_blk(fs-io, block, 1, buf);
ext2fs_free_mem(buf);
return retval;
-#else
+just_write:
return io_channel_write_blk(fs-io, block, 1, (char *) inbuf);
-#endif
 }
 
 
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index a316665..2041f0f 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -717,6 +717,32 @@ struct ext2_dir_entry_2 {
 #define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND)  \
 ~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN   ((116)-1)
+
+static inline unsigned ext2fs_rec_len_from_disk(unsigned len)
+{
+#ifdef WORDS_BIGENDIAN
+   len = ext2fs_swab16(dlen);
+#endif
+   if (len == EXT2_MAX_REC_LEN)
+   return 1  16;
+   return len;
+}
+
+static inline unsigned ext2fs_rec_len_to_disk(unsigned len)
+{
+   if (len == (1  16))
+#ifdef WORDS_BIGENDIAN
+   return ext2fs_swab16(EXT2_MAX_REC_LEN);
+#else
+   return EXT2_MAX_REC_LEN;
+#endif
+#ifdef WORDS_BIGENDIAN
+   return ext2fs_swab_16(len);
+#else
+   return len;
+#endif
+}
 
 /*
  * This structure will be used for multiple mount protection. It will be
diff --git a/misc/e2image.c b/misc/e2image.c
index 1fbb267..4e2c9fb 100644
--- a/misc/e2image.c
+++ b/misc/e2image.c
@@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils
end = buf + fs-blocksize;
for (p = buf; p  end-8; p += rec_len) {
dirent = (struct ext2_dir_entry_2 *) p;
-   rec_len = dirent-rec_len;
-#ifdef WORDS_BIGENDIAN
-   rec_len = ext2fs_swab16(rec_len);
-#endif
+   rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
 #if 0
printf(rec_len = %d, name_len = %d\n, rec_len, 
dirent-name_len);
 #endif
@@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils
   bad rec_len (%d)\n, (unsigned long) blk, 
   rec_len);
rec_len = end - p;
-#ifdef WORDS_BIGENDIAN
-   dirent-rec_len = ext2fs_swab16(rec_len);
-#endif
+   dirent-rec_len = ext2fs_rec_len_to_disk(rec_len);
continue

[RFC] [PATCH 0/3] Recursive mtime for ext3

2007-11-06 Thread Jan Kara
  Hello,

  in following three patches is implemented recursive mtime feature for
ext3. The first two patches are mostly clean-up patches, the third patch
implements the feature itself. If somebody is interested in testing this
(or even writing a support of this feature in rsync and similar), please
contact me. Attached are sources of simple tools set_recmod, get_recmod
for testing the feature and also a patch implementing basic support of
the feature in e2fsprogs. Comments welcome.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
#include stdio.h
#include fcntl.h
#include errno.h
#include string.h

#include sys/ioctl.h
#include linux/fs.h
#include linux/ext2_fs.h

#define EXT2_IOC_GETRTIME _IOR('f', 9, unsigned int)

#define S_RECMOD 0x10

int main(int argc, char **argv)
{
	int fd;
	long flags;

	if (argc  2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd  0) {
		printf(Cannot open: %m\n);
		return 1;
	}
	if (ioctl(fd, EXT2_IOC_GETFLAGS, flags)  0) {
		printf(Cannot get flags: %m\n);
		return 1;
	}
	flags |= S_RECMOD;
	if (ioctl(fd, EXT2_IOC_SETFLAGS, flags)  0) {
		printf(Cannot set flags: %m\n);
		return 1;
	}

	return 0;
}
#include stdio.h
#include fcntl.h
#include errno.h
#include string.h

#include sys/ioctl.h
#include linux/fs.h
#include linux/ext2_fs.h

#define EXT2_IOC_GETRTIME _IOR('f', 9, unsigned int)

#define S_RECMOD 0x10

int main(int argc, char **argv)
{
	int fd;
	long flags;
	unsigned time;

	if (argc  2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd  0) {
		printf(Cannot open: %m\n);
		return 1;
	}
	if (ioctl(fd, EXT2_IOC_GETFLAGS, flags)  0) {
		printf(Cannot get flags: %m\n);
		return 1;
	}
	if (ioctl(fd, EXT2_IOC_GETRTIME, time)  0) {
		printf(Cannot get rtime: %s\n, strerror(errno));
		return 1;
	}
	printf(RECMOD flags: %d, time %u\n, !!(flags  S_RECMOD), time);

	return 0;
}
diff --git a/lib/e2p/feature.c b/lib/e2p/feature.c
index fe7e65a..a12540b 100644
--- a/lib/e2p/feature.c
+++ b/lib/e2p/feature.c
@@ -37,6 +37,8 @@ static struct feature feature_list[] = {
 			resize_inode },
 	{	E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_LAZY_BG,
 			lazy_bg },
+	{	E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_RTIME,
+			recursive_mtime },
 
 	{	E2P_FEATURE_RO_INCOMPAT, EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER,
 			sparse_super },
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index a316665..21747c2 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -623,6 +623,7 @@ struct ext2_super_block {
 #define EXT2_FEATURE_COMPAT_RESIZE_INODE	0x0010
 #define EXT2_FEATURE_COMPAT_DIR_INDEX		0x0020
 #define EXT2_FEATURE_COMPAT_LAZY_BG		0x0040
+#define EXT2_FEATURE_COMPAT_RTIME		0x0080
 
 #define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
 #define EXT2_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 83a9091..64b6eb6 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -418,7 +418,8 @@ typedef struct ext2_icount *ext2_icount_
 	 EXT2_FEATURE_COMPAT_RESIZE_INODE|\
 	 EXT2_FEATURE_COMPAT_DIR_INDEX|\
 	 EXT2_FEATURE_COMPAT_LAZY_BG|\
-	 EXT2_FEATURE_COMPAT_EXT_ATTR)
+	 EXT2_FEATURE_COMPAT_EXT_ATTR|\
+	 EXT2_FEATURE_COMPAT_RTIME)
 
 /* This #ifdef is temporary until compression is fully supported */
 #ifdef ENABLE_COMPRESSION
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 4a6cace..5060db7 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -870,7 +870,8 @@ static __u32 ok_features[3] = {
 	EXT3_FEATURE_COMPAT_HAS_JOURNAL |
 		EXT2_FEATURE_COMPAT_RESIZE_INODE |
 		EXT2_FEATURE_COMPAT_DIR_INDEX |
-		EXT2_FEATURE_COMPAT_LAZY_BG,	/* Compat */
+		EXT2_FEATURE_COMPAT_LAZY_BG |
+		EXT2_FEATURE_COMPAT_RTIME,	/* Compat */
 	EXT2_FEATURE_INCOMPAT_FILETYPE|		/* Incompat */
 		EXT3_FEATURE_INCOMPAT_JOURNAL_DEV|
 		EXT2_FEATURE_INCOMPAT_META_BG,
diff --git a/misc/tune2fs.c b/misc/tune2fs.c
index 833b994..5195e40 100644
--- a/misc/tune2fs.c
+++ b/misc/tune2fs.c
@@ -96,7 +96,8 @@ static void usage(void)
 
 static __u32 ok_features[3] = {
 	EXT3_FEATURE_COMPAT_HAS_JOURNAL |
-		EXT2_FEATURE_COMPAT_DIR_INDEX,	/* Compat */
+		EXT2_FEATURE_COMPAT_DIR_INDEX |
+		EXT2_FEATURE_COMPAT_RTIME,	/* Compat */
 	EXT2_FEATURE_INCOMPAT_FILETYPE,		/* Incompat */
 	EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER	/* R/O compat */
 };


[RFC] [PATCH 1/3] Recursive mtime for ext3

2007-11-06 Thread Jan Kara
  Hello,

  the following patch makes more lightweight handling of
EXT3_I(inode)-i_flags possible.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Implement atomic updates of EXT3_I(inode)-i_flags. So far the i_flags access
was guarded mostly by i_mutex but this is quite heavy-weight. We now use
inode-i_lock to protect i_flags reading and updates in ext3. This patch
introduces a bogus warning that jflag and oldflags may be uninitialized -
anyone knows how to cleanly get rid of it?

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/dir.c 
linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c
--- linux-2.6.23/fs/ext3/dir.c  2007-10-11 12:01:23.0 +0200
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c  2007-11-05 
14:04:56.0 +0100
@@ -108,10 +108,10 @@ static int ext3_readdir(struct file * fi
sb = inode-i_sb;
 
 #ifdef CONFIG_EXT3_INDEX
-   if (EXT3_HAS_COMPAT_FEATURE(inode-i_sb,
-   EXT3_FEATURE_COMPAT_DIR_INDEX) 
-   ((EXT3_I(inode)-i_flags  EXT3_INDEX_FL) ||
-((inode-i_size  sb-s_blocksize_bits) == 1))) {
+   if (is_dx(inode) ||
+   (EXT3_HAS_COMPAT_FEATURE(inode-i_sb, \
+   EXT3_FEATURE_COMPAT_DIR_INDEX) 
+(inode-i_size  sb-s_blocksize_bits) == 1)) {
err = ext3_dx_readdir(filp, dirent, filldir);
if (err != ERR_BAD_DX_DIR) {
ret = err;
@@ -121,7 +121,9 @@ static int ext3_readdir(struct file * fi
 * We don't set the inode dirty flag since it's not
 * critical that it get flushed back to the disk.
 */
+   spin_lock(inode-i_lock);
EXT3_I(filp-f_path.dentry-d_inode)-i_flags = ~EXT3_INDEX_FL;
+   spin_unlock(inode-i_lock);
}
 #endif
stored = 0;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/ialloc.c 
linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c
--- linux-2.6.23/fs/ext3/ialloc.c   2006-11-29 22:57:37.0 +0100
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c   2007-11-05 
14:14:50.0 +0100
@@ -278,7 +278,7 @@ static int find_group_orlov(struct super
ndirs = percpu_counter_read_positive(sbi-s_dirs_counter);
 
if ((parent == sb-s_root-d_inode) ||
-   (EXT3_I(parent)-i_flags  EXT3_TOPDIR_FL)) {
+   ext3_test_inode_flags(parent, EXT3_TOPDIR_FL)) {
int best_ndir = inodes_per_group;
int best_group = -1;
 
@@ -566,7 +566,11 @@ got:
ei-i_dir_start_lookup = 0;
ei-i_disksize = 0;
 
+   /* Guard reading of directory's i_flags, created inode is safe as
+* noone has a reference to it yet */
+   spin_lock(dir-i_lock);
ei-i_flags = EXT3_I(dir)-i_flags  ~EXT3_INDEX_FL;
+   spin_unlock(dir-i_lock);
if (S_ISLNK(mode))
ei-i_flags = ~(EXT3_IMMUTABLE_FL|EXT3_APPEND_FL);
/* dirsync only applies to directories */
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/inode.c 
linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c
--- linux-2.6.23/fs/ext3/inode.c2007-10-11 12:01:23.0 +0200
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c2007-11-05 
14:24:39.0 +0100
@@ -2557,8 +2557,10 @@ int ext3_get_inode_loc(struct inode *ino
 
 void ext3_set_inode_flags(struct inode *inode)
 {
-   unsigned int flags = EXT3_I(inode)-i_flags;
+   unsigned int flags;
 
+   spin_lock(inode-i_lock);
+   flags = EXT3_I(inode)-i_flags;
inode-i_flags = ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
if (flags  EXT3_SYNC_FL)
inode-i_flags |= S_SYNC;
@@ -2570,13 +2572,16 @@ void ext3_set_inode_flags(struct inode *
inode-i_flags |= S_NOATIME;
if (flags  EXT3_DIRSYNC_FL)
inode-i_flags |= S_DIRSYNC;
+   spin_unlock(inode-i_lock);
 }
 
 /* Propagate flags from i_flags to EXT3_I(inode)-i_flags */
 void ext3_get_inode_flags(struct ext3_inode_info *ei)
 {
-   unsigned int flags = ei-vfs_inode.i_flags;
+   unsigned int flags;
 
+   spin_lock(ei-vfs_inode.i_lock);
+   flags = ei-vfs_inode.i_flags;
ei-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL|
EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL);
if (flags  S_SYNC)
@@ -2589,6 +2594,7 @@ void ext3_get_inode_flags(struct ext3_in
ei-i_flags |= EXT3_NOATIME_FL;
if (flags  S_DIRSYNC)
ei-i_flags |= EXT3_DIRSYNC_FL;
+   spin_unlock(ei-vfs_inode.i_lock);
 }
 
 void ext3_read_inode(struct inode * inode)
@@ -2781,7 +2787,9 @@ static int ext3_do_update_inode(handle_t
raw_inode-i_mtime = cpu_to_le32(inode-i_mtime.tv_sec);
raw_inode-i_blocks = cpu_to_le32(inode-i_blocks);
raw_inode-i_dtime

[RFC] [PATCH 2/3] Recursive mtime for ext3

2007-11-06 Thread Jan Kara
Make space reserved for fragments as unused as they were never implemented.
Remove also related initializations. We later use the space for recursive
mtime.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude 
linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c 
linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c   2007-11-05 
14:14:50.0 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c   2007-11-05 
14:37:33.0 +0100
@@ -576,11 +576,6 @@ got:
/* dirsync only applies to directories */
if (!S_ISDIR(mode))
ei-i_flags = ~EXT3_DIRSYNC_FL;
-#ifdef EXT3_FRAGMENTS
-   ei-i_faddr = 0;
-   ei-i_frag_no = 0;
-   ei-i_frag_size = 0;
-#endif
ei-i_file_acl = 0;
ei-i_dir_acl = 0;
ei-i_dtime = 0;
diff -rupX /home/jack/.kerndiffexclude 
linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c 
linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c2007-11-05 
14:24:39.0 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c2007-11-05 
14:38:05.0 +0100
@@ -2651,11 +2651,6 @@ void ext3_read_inode(struct inode * inod
}
inode-i_blocks = le32_to_cpu(raw_inode-i_blocks);
ei-i_flags = le32_to_cpu(raw_inode-i_flags);
-#ifdef EXT3_FRAGMENTS
-   ei-i_faddr = le32_to_cpu(raw_inode-i_faddr);
-   ei-i_frag_no = raw_inode-i_frag;
-   ei-i_frag_size = raw_inode-i_fsize;
-#endif
ei-i_file_acl = le32_to_cpu(raw_inode-i_file_acl);
if (!S_ISREG(inode-i_mode)) {
ei-i_dir_acl = le32_to_cpu(raw_inode-i_dir_acl);
@@ -2790,11 +2785,6 @@ static int ext3_do_update_inode(handle_t
spin_lock(inode-i_lock);
raw_inode-i_flags = cpu_to_le32(ei-i_flags);
spin_unlock(inode-i_lock);
-#ifdef EXT3_FRAGMENTS
-   raw_inode-i_faddr = cpu_to_le32(ei-i_faddr);
-   raw_inode-i_frag = ei-i_frag_no;
-   raw_inode-i_fsize = ei-i_frag_size;
-#endif
raw_inode-i_file_acl = cpu_to_le32(ei-i_file_acl);
if (!S_ISREG(inode-i_mode)) {
raw_inode-i_dir_acl = cpu_to_le32(ei-i_dir_acl);
diff -rupX /home/jack/.kerndiffexclude 
linux-2.6.23-1-i_flags_atomicity/fs/ext3/super.c 
linux-2.6.23-2-make_flags_unused/fs/ext3/super.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/super.c2007-11-05 
15:04:19.0 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/super.c2007-11-05 
15:01:37.0 +0100
@@ -1584,17 +1584,7 @@ static int ext3_fill_super (struct super
goto failed_mount;
}
}
-   sbi-s_frag_size = EXT3_MIN_FRAG_SIZE 
-  le32_to_cpu(es-s_log_frag_size);
-   if (blocksize != sbi-s_frag_size) {
-   printk(KERN_ERR
-  EXT3-fs: fragsize %lu != blocksize %u (unsupported)\n,
-  sbi-s_frag_size, blocksize);
-   goto failed_mount;
-   }
-   sbi-s_frags_per_block = 1;
sbi-s_blocks_per_group = le32_to_cpu(es-s_blocks_per_group);
-   sbi-s_frags_per_group = le32_to_cpu(es-s_frags_per_group);
sbi-s_inodes_per_group = le32_to_cpu(es-s_inodes_per_group);
if (EXT3_INODE_SIZE(sb) == 0)
goto cantfind_ext3;
@@ -1618,12 +1608,6 @@ static int ext3_fill_super (struct super
sbi-s_blocks_per_group);
goto failed_mount;
}
-   if (sbi-s_frags_per_group  blocksize * 8) {
-   printk (KERN_ERR
-   EXT3-fs: #fragments per group too big: %lu\n,
-   sbi-s_frags_per_group);
-   goto failed_mount;
-   }
if (sbi-s_inodes_per_group  blocksize * 8) {
printk (KERN_ERR
EXT3-fs: #inodes per group too big: %lu\n,
diff -rupX /home/jack/.kerndiffexclude 
linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs.h 
linux-2.6.23-2-make_flags_unused/include/linux/ext3_fs.h
--- linux-2.6.23-1-i_flags_atomicity/include/linux/ext3_fs.h2007-11-05 
14:31:44.0 +0100
+++ linux-2.6.23-2-make_flags_unused/include/linux/ext3_fs.h2007-11-05 
14:37:33.0 +0100
@@ -291,27 +291,24 @@ struct ext3_inode {
__le32  i_generation;   /* File version (for NFS) */
__le32  i_file_acl; /* File ACL */
__le32  i_dir_acl;  /* Directory ACL */
-   __le32  i_faddr;/* Fragment address */
+   __le32  i_obsolete_faddr;   /* Unused */
union {
struct {
-   __u8l_i_frag;   /* Fragment number */
-   __u8l_i_fsize;  /* Fragment size */
+   __u16   l_i_obsolete_frag;  /* Unused */
__u16   i_pad1;
__le16  l_i_uid_high;   /* these 2 fields*/
__le16  l_i_gid_high

[RFC] [PATCH 3/3] Recursive mtime for ext3

2007-11-06 Thread Jan Kara
Implement recursive mtime (rtime) feature for ext3. The feature works as
follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a user)
whether rtime should be updated. In case a directory or a file in it is
modified and when the flag is set, directory's rtime is updated, the flag is
cleared, and we move to the parent. If the flag is set there, we clear it,
update rtime and continue upwards upto the root of the filesystem. In case a
regular file or symlink is modified, we pick arbitrary of its parents (actually
the one that happens to be at the head of i_dentry list) and start the rtime
update algorith there.

As the flag is always cleared after updating rtime and we don't climb up the
tree if the flag is cleared, we have constant amortized complexity of rtime
updates. That's for theoretical time consumption ;) Practically, there's no
measurable performance impact for a test case like: touch every file in a
kernel tree where every directory has RTIME flag set.

Intended use case is that application which wants to watch any modification in
a subtree scans the subtree and sets flags for all inodes there. Next time, it
just needs to recurse in directories having rtime newer than the start of the
previous scan. There it can handle modifications and set the flag again. It is
up to application to watch out for hardlinked files. It can e.g.  build their
list and check their mtime separately (when a hardlink to a file is created its
inode is modified and rtimes properly updated and thus any application has an
effective way of finding new hardlinked files).

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude 
linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c 
linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c  2007-11-05 
16:58:10.0 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c2007-11-05 
16:58:53.0 +0100
@@ -569,7 +569,7 @@ got:
/* Guard reading of directory's i_flags, created inode is safe as
 * noone has a reference to it yet */
spin_lock(dir-i_lock);
-   ei-i_flags = EXT3_I(dir)-i_flags  ~EXT3_INDEX_FL;
+   ei-i_flags = EXT3_I(dir)-i_flags  ~(EXT3_INDEX_FL | EXT3_RTIME_FL);
spin_unlock(dir-i_lock);
if (S_ISLNK(mode))
ei-i_flags = ~(EXT3_IMMUTABLE_FL|EXT3_APPEND_FL);
@@ -579,6 +579,7 @@ got:
ei-i_file_acl = 0;
ei-i_dir_acl = 0;
ei-i_dtime = 0;
+   ei-i_rtime = inode-i_mtime.tv_sec;
ei-i_block_alloc_info = NULL;
ei-i_block_group = group;
 
diff -rupX /home/jack/.kerndiffexclude 
linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/inode.c 
linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/inode.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/inode.c   2007-11-05 
16:58:10.0 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/inode.c 2007-11-06 
16:13:50.0 +0100
@@ -1232,6 +1232,8 @@ static int ext3_ordered_commit_write(str
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
+   if (!ret)
+   ext3_update_rtimes(inode);
return ret;
 }
 
@@ -1255,6 +1257,8 @@ static int ext3_writeback_commit_write(s
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
+   if (!ret)
+   ext3_update_rtimes(inode);
return ret;
 }
 
@@ -1288,6 +1292,8 @@ static int ext3_journalled_commit_write(
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
+   if (!ret)
+   ext3_update_rtimes(inode);
return ret;
 }
 
@@ -2386,6 +2392,10 @@ out_stop:
ext3_orphan_del(handle, inode);
 
ext3_journal_stop(handle);
+   /* We update time only for linked inodes. Unlinked ones already
+* notified parent during unlink... */
+   if (inode-i_nlink)
+   ext3_update_rtimes(inode);
 }
 
 static ext3_fsblk_t ext3_get_inode_block(struct super_block *sb,
@@ -2628,6 +2638,8 @@ void ext3_read_inode(struct inode * inod
inode-i_ctime.tv_sec = (signed)le32_to_cpu(raw_inode-i_ctime);
inode-i_mtime.tv_sec = (signed)le32_to_cpu(raw_inode-i_mtime);
inode-i_atime.tv_nsec = inode-i_ctime.tv_nsec = 
inode-i_mtime.tv_nsec = 0;
+   if (EXT3_HAS_COMPAT_FEATURE(inode-i_sb, EXT3_FEATURE_COMPAT_RTIME))
+   ei-i_rtime = le32_to_cpu(raw_inode-i_rtime);
 
ei-i_state = 0;
ei-i_dir_start_lookup = 0;
@@ -2780,6 +2792,8 @@ static int ext3_do_update_inode(handle_t
raw_inode-i_atime = cpu_to_le32(inode-i_atime.tv_sec);
raw_inode-i_ctime = cpu_to_le32(inode-i_ctime.tv_sec);
raw_inode-i_mtime = cpu_to_le32(inode-i_mtime.tv_sec);
+   if (EXT3_HAS_COMPAT_FEATURE(inode-i_sb, EXT3_FEATURE_COMPAT_RTIME))
+   raw_inode-i_rtime = cpu_to_le32(ei-i_rtime);
raw_inode-i_blocks = cpu_to_le32(inode

[PATCH] Fix 64KB blocksize in ext3 directories

2007-10-31 Thread Jan Kara
  Hi Andrew,

  looking at the commit messages it seems the patch fixing overflow of
rec_len in ext3 directories when using 64KB blocksize got lost (I'm not
surprised given the churn of the patches in this area for 2.6.24-rc1 ;).
Anyway, the patch is attached. It should make it into 2.6.24 (or nasty
things would happen if somebody actually try 64KB blocksize in ext3).

Thanks
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0x instead and convert
value when read from / written to disk. The patch also converts some places
to use ext3_next_entry() when we are changing them anyway.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.24-rc1/fs/ext3/dir.c 
linux-2.6.24-rc1-1-ext3_64k_blocksize/fs/ext3/dir.c
--- linux-2.6.24-rc1/fs/ext3/dir.c  2007-10-24 20:43:56.0 +0200
+++ linux-2.6.24-rc1-1-ext3_64k_blocksize/fs/ext3/dir.c 2007-10-24 
20:45:01.0 +0200
@@ -67,7 +67,7 @@ int ext3_check_dir_entry (const char * f
  unsigned long offset)
 {
const char * error_msg = NULL;
-   const int rlen = le16_to_cpu(de-rec_len);
+   const int rlen = ext3_rec_len_from_disk(de-rec_len);
 
if (rlen  EXT3_DIR_REC_LEN(1))
error_msg = rec_len is smaller than minimal;
@@ -173,10 +173,10 @@ revalidate:
 * least that it is non-zero.  A
 * failure will be detected in the
 * dirent test below. */
-   if (le16_to_cpu(de-rec_len) 
+   if (ext3_rec_len_from_disk(de-rec_len) 
EXT3_DIR_REC_LEN(1))
break;
-   i += le16_to_cpu(de-rec_len);
+   i += ext3_rec_len_from_disk(de-rec_len);
}
offset = i;
filp-f_pos = (filp-f_pos  ~(sb-s_blocksize - 1))
@@ -197,7 +197,7 @@ revalidate:
ret = stored;
goto out;
}
-   offset += le16_to_cpu(de-rec_len);
+   offset += ext3_rec_len_from_disk(de-rec_len);
if (le32_to_cpu(de-inode)) {
/* We might block in the next section
 * if the data destination is
@@ -219,7 +219,7 @@ revalidate:
goto revalidate;
stored ++;
}
-   filp-f_pos += le16_to_cpu(de-rec_len);
+   filp-f_pos += ext3_rec_len_from_disk(de-rec_len);
}
offset = 0;
brelse (bh);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.24-rc1/fs/ext3/namei.c 
linux-2.6.24-rc1-1-ext3_64k_blocksize/fs/ext3/namei.c
--- linux-2.6.24-rc1/fs/ext3/namei.c2007-10-24 20:43:56.0 +0200
+++ linux-2.6.24-rc1-1-ext3_64k_blocksize/fs/ext3/namei.c   2007-10-24 
20:45:58.0 +0200
@@ -177,6 +177,15 @@ static int ext3_dx_add_entry(handle_t *h
 struct inode *inode);
 
 /*
+ * p is at least 6 bytes before the end of page
+ */
+static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 
*p)
+{
+   return (struct ext3_dir_entry_2 *)((char*)p +
+   ext3_rec_len_from_disk(p-rec_len));
+}
+
+/*
  * Future: use high four bits of block for coalesce-on-delete flags
  * Mask them off for now.
  */
@@ -280,7 +289,7 @@ static struct stats dx_show_leaf(struct 
space += EXT3_DIR_REC_LEN(de-name_len);
names++;
}
-   de = (struct ext3_dir_entry_2 *) ((char *) de + 
le16_to_cpu(de-rec_len));
+   de = ext3_next_entry(de);
}
printk((%i)\n, names);
return (struct stats) { names, space, 1 };
@@ -547,14 +556,6 @@ static int ext3_htree_next_block(struct 
 
 
 /*
- * p is at least 6 bytes before the end of page
- */
-static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 
*p)
-{
-   return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p-rec_len));
-}
-
-/*
  * This function fills a red-black tree with information from a
  * directory block.  It returns the number directory entries loaded
  * into the tree.  If there is an error it is returned in err.
@@ -720,7 +721,7 @@ static int dx_make_map (struct ext3_dir_
cond_resched();
}
/* XXX: do we need to check rec_len == 0 case? -Chris

Re: a potential deadlock?

2007-10-31 Thread Jan Kara
  Hi,

 Here is a question to be confirmed.
 
 In ext3_ioctl() with cmd == EXT3_IOC_SETFLAGS, we firstly lock
 inode-i_mutex, start a handle with 1 journal-block by calling
 ext3_journal_start(). In ext3_new_blocks(), say QUOTA was enabled with vfsv0
 format, we will call the function DQUOT_ALLOC_BLOCK(). The handle in
 ext3_new_blocks()  was started by high-level functions, and
 DQUOT_ALLOC_BLOCK() will finally calles ext3_quota_write() in which it try
 to lock the i_mutex of the inode of a quota-file. 
 
 At it happens, when we want to modify the inodes of quota-files via
 ext3_ioctl(cmd = EXT3_IOC_SETFLAGS) (say process-A), another guy try to
 execute ext3_quota_write() by calling DQUOT_ALLOC_BLOCK() (say process-B). I
 guess a potential deadlock between process-A and process-B would happen in
 such a executing sequence:
 
 (1) process-B got many journal-blocks, then came into ext3_new_blocks(),
 hung up
 (2) process-A locked i_mutex of the inode of a quota-file, then try to
 starts a handle. Unfortunately, there are no enough journal-blocks left for
 process-A.
 (3) process-B awakened, and came into DQUOT_ALLOC_BLOCK(), finally came into
 the function ext3_quota_write() who also wants to lock the i_mutex of the
 inode of a quota-file. But the i_mutex was locked by process-A. so process-B
 has no choice but to wait.
 (4) if the ext3-filesystem was  too busy to release jounal-blocks for
 process-A, or a unexpected incident happened. Both  the two situations would
 result in no journal-blocks for any other processes. Apparently, process-A
 have to wait for available journal-blocks. so process-A was hung-up with
 i_mutex of the inode of a quota-file locked.
 (5) process-B was blocked by the inode-i_mutex subsequently.
  Yes, that is a lock inversion between the journal lock and i_mutex on
quota files which can indeed lead to a deadlock. Thanks for spotting it.
Luckily it's not very likely you're going to hit it but we should fix it
anyway. Currently I don't have much idea how - probably we'd have to get
rid of the need to use i_mutex on quota files in quota_write but that's
also non-trivial.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] e2fsprogs: handle rec_len in case of 64KB blocks

2007-10-31 Thread Jan Kara
  Hi Ted,

  I haven't got any answer from you regarding this patch implementing
hadling of rec_len for 64KB block size. Is it still waiting in your queue
to review or did it just get lost? Since all the kernel changes will be in
2.6.24, we should have tools for that too...  Please write me if there are
any problems with the patch.
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---


Subject: Support for 64KB blocksize in ext2-4 directories.

When block size is 64KB, we have to take care that rec_len does not overflow.
Kernel stores 0x in case 0x1 should be stored - perform appropriate
conversion when reading from / writing to disk.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c
index fb20fa0..628bf93 100644
--- a/lib/ext2fs/dirblock.c
+++ b/lib/ext2fs/dirblock.c
@@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi
dirent = (struct ext2_dir_entry *) p;
 #ifdef WORDS_BIGENDIAN
dirent-inode = ext2fs_swab32(dirent-inode);
-   dirent-rec_len = ext2fs_swab16(dirent-rec_len);
dirent-name_len = ext2fs_swab16(dirent-name_len);
 #endif
+   dirent-rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
name_len = dirent-name_len;
 #ifdef WORDS_BIGENDIAN
if (flags  EXT2_DIRBLOCK_V2_STRUCT)
@@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil
 errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block,
  void *inbuf, int flags EXT2FS_ATTR((unused)))
 {
-#ifdef WORDS_BIGENDIAN
errcode_t   retval;
char*p, *end;
char*buf = 0;
struct ext2_dir_entry *dirent;
 
+#ifndef WORDS_BIGENDIAN
+   if (fs-blocksize  EXT2_MAX_REC_LEN)
+   goto just_write;
+#endif
retval = ext2fs_get_mem(fs-blocksize, buf);
if (retval)
return retval;
@@ -89,7 +92,7 @@ errcode_t ext2fs_write_dir_block2(ext2_f
}
p += dirent-rec_len;
dirent-inode = ext2fs_swab32(dirent-inode);
-   dirent-rec_len = ext2fs_swab16(dirent-rec_len);
+   dirent-rec_len = ext2fs_rec_len_to_disk(dirent-rec_len);
dirent-name_len = ext2fs_swab16(dirent-name_len);
 
if (flags  EXT2_DIRBLOCK_V2_STRUCT)
@@ -98,9 +101,8 @@ errcode_t ext2fs_write_dir_block2(ext2_f
retval = io_channel_write_blk(fs-io, block, 1, buf);
ext2fs_free_mem(buf);
return retval;
-#else
+just_write:
return io_channel_write_blk(fs-io, block, 1, (char *) inbuf);
-#endif
 }
 
 
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index a316665..2041f0f 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -717,6 +717,32 @@ struct ext2_dir_entry_2 {
 #define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND)  \
 ~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN   ((116)-1)
+
+static inline unsigned ext2fs_rec_len_from_disk(unsigned len)
+{
+#ifdef WORDS_BIGENDIAN
+   len = ext2fs_swab16(dlen);
+#endif
+   if (len == EXT2_MAX_REC_LEN)
+   return 1  16;
+   return len;
+}
+
+static inline unsigned ext2fs_rec_len_to_disk(unsigned len)
+{
+   if (len == (1  16))
+#ifdef WORDS_BIGENDIAN
+   return ext2fs_swab16(EXT2_MAX_REC_LEN);
+#else
+   return EXT2_MAX_REC_LEN;
+#endif
+#ifdef WORDS_BIGENDIAN
+   return ext2fs_swab_16(len);
+#else
+   return len;
+#endif
+}
 
 /*
  * This structure will be used for multiple mount protection. It will be
diff --git a/misc/e2image.c b/misc/e2image.c
index 1fbb267..4e2c9fb 100644
--- a/misc/e2image.c
+++ b/misc/e2image.c
@@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils
end = buf + fs-blocksize;
for (p = buf; p  end-8; p += rec_len) {
dirent = (struct ext2_dir_entry_2 *) p;
-   rec_len = dirent-rec_len;
-#ifdef WORDS_BIGENDIAN
-   rec_len = ext2fs_swab16(rec_len);
-#endif
+   rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
 #if 0
printf(rec_len = %d, name_len = %d\n, rec_len, 
dirent-name_len);
 #endif
@@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils
   bad rec_len (%d)\n, (unsigned long) blk, 
   rec_len);
rec_len = end - p;
-#ifdef WORDS_BIGENDIAN
-   dirent-rec_len = ext2fs_swab16(rec_len);
-#endif
+   dirent-rec_len = ext2fs_rec_len_to_disk(rec_len);
continue;
}
if (dirent-name_len + 8  rec_len) {
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body

Re: + ext3-change-the-default-behaviour-on-error.patch added to -mm tree

2007-10-25 Thread Jan Kara
 
 The patch titled
  ext3: change the default behaviour on error
 has been added to the -mm tree.  Its filename is
  ext3-change-the-default-behaviour-on-error.patch
 
 *** Remember to use Documentation/SubmitChecklist when testing your code ***
 
 See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
 out what to do about this
 
 --
 Subject: ext3: change the default behaviour on error
 From: Aneesh Kumar K.V [EMAIL PROTECTED]
 
 ext3 file system was by default ignoring errors and continuing.  This is
 not a good default as continuing on error could lead to file system
 corruption.  Change the default to mark the file system readonly.  Debian
 and ubuntu already does this as the default in their fstab.
  The change is fine as such but looking at it I just wonder
whether it would not make sence to write in /proc/mounts options
corresponding to the real state of mount options and not what we
guess user has specified... Or does anybody see a sane usecase where
userspace would rather want to see the current output? ... maybe
when user wants to inspect /proc/mounts by himself.

 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
 Cc: linux-ext4@vger.kernel.org
 Cc: Eric Sandeen [EMAIL PROTECTED]
 Cc: Jan Kara [EMAIL PROTECTED]
 Cc: Dave Jones [EMAIL PROTECTED]
 Cc: Chuck Ebbert [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]
 ---
 
  fs/ext3/super.c |   16 
  1 file changed, 8 insertions(+), 8 deletions(-)
 
 diff -puN fs/ext3/super.c~ext3-change-the-default-behaviour-on-error 
 fs/ext3/super.c
 --- a/fs/ext3/super.c~ext3-change-the-default-behaviour-on-error
 +++ a/fs/ext3/super.c
 @@ -575,16 +575,16 @@ static int ext3_show_options(struct seq_
   le16_to_cpu(es-s_def_resgid) != EXT3_DEF_RESGID) {
   seq_printf(seq, ,resgid=%u, sbi-s_resgid);
   }
 - if (test_opt(sb, ERRORS_CONT)) {
 + if (test_opt(sb, ERRORS_RO)) {
   int def_errors = le16_to_cpu(es-s_errors);
  
   if (def_errors == EXT3_ERRORS_PANIC ||
 - def_errors == EXT3_ERRORS_RO) {
 - seq_puts(seq, ,errors=continue);
 + def_errors == EXT3_ERRORS_CONTINUE) {
 + seq_puts(seq, ,errors=remount-ro);
   }
   }
 - if (test_opt(sb, ERRORS_RO))
 - seq_puts(seq, ,errors=remount-ro);
 + if (test_opt(sb, ERRORS_CONT))
 + seq_puts(seq, ,errors=continue);
   if (test_opt(sb, ERRORS_PANIC))
   seq_puts(seq, ,errors=panic);
   if (test_opt(sb, NO_UID32))
 @@ -1559,10 +1559,10 @@ static int ext3_fill_super (struct super
  
   if (le16_to_cpu(sbi-s_es-s_errors) == EXT3_ERRORS_PANIC)
   set_opt(sbi-s_mount_opt, ERRORS_PANIC);
 - else if (le16_to_cpu(sbi-s_es-s_errors) == EXT3_ERRORS_RO)
 - set_opt(sbi-s_mount_opt, ERRORS_RO);
 - else
 + else if (le16_to_cpu(sbi-s_es-s_errors) == EXT3_ERRORS_CONTINUE)
   set_opt(sbi-s_mount_opt, ERRORS_CONT);
 + else
 + set_opt(sbi-s_mount_opt, ERRORS_RO);
  
   sbi-s_resuid = le16_to_cpu(es-s_def_resuid);
   sbi-s_resgid = le16_to_cpu(es-s_def_resgid);
 _
 
 Patches currently in -mm which might be from [EMAIL PROTECTED] are
 
 ext2-return-after-ext2_error-in-case-of-failures.patch
 ext2-change-the-default-behaviour-on-error.patch
 ext4-return-after-ext4_error-in-case-of-failures.patch
 ext3-return-after-ext3_error-in-case-of-failures.patch
 ext3-change-the-default-behaviour-on-error.patch
 ext2-fix-the-max-file-size-for-ext2-file-system.patch
 ext3-fix-the-max-file-size-for-ext3-file-system.patch

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext4 devel interlock meeting minutes (October 22, 2007)

2007-10-24 Thread Jan Kara
  Hello,

  Sorry for not taking part in the call but I was just on my way from NY
to Prague...

 Ext4 Developer Interlock Call
 October 22, 2007: Meeting Minutes
 
 Attendees: Mingming Cao, Andreas Dilger, Eric Sandeen, Dave Kleikamp, 
 Jose Santos, Aneesh Veetil, Valerie Clement, Avantika Mathur
 
 - There has been discussion on linux-ext4 about overflow in ext2 with 64 
 KB block size.  Discussed adding an incompat flag to this feature and 
 concluded that it is too late to do this; it will be treated as a bug fix.
  Hmm, why adding an incompat flag? Obviously there's no harm in doing
that but the large block size itself should be enough to protect old
kernels/utils touching it, shouldn't it?

 E2fsprogs
 - Ted has added new branches the the e2fsprogs git tree
- master: current stable version
- next: patches expected to go into the next stable version
- pu: proposed update; patches that are in preliminary review and 
   test phase
 - E2fsprogs patches should probably be submitted against the 'next' 
 branch of the git tree.
  BTW: We still don't have appropriate patches for large block size in
e2fsprogs. I guess I should ping Ted about merging the patch once more.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-11 Thread Jan Kara
On Thu 04-10-07 16:11:21, Andrew Morton wrote:
 On Thu, 4 Oct 2007 16:40:44 -0600
 Andreas Dilger [EMAIL PROTECTED] wrote:
 
  On Oct 04, 2007  13:12 -0700, Andrew Morton wrote:
   On Mon, 01 Oct 2007 17:35:46 -0700
ext2: Avoid rec_len overflow with 64KB block size

into 16 bits we have for entry lenght. So we store 0x instead and
convert value when read from / written to disk.
   
   This patch clashes in non-trivial ways with
   ext2-convert-to-new-aops-fix.patch and perhaps other things which are
   already queued for 2.6.24 inclusion, so I'll need to ask for an updated
   patch, please.
  
  If the rel_len overflow patch isn't going to make it, then we also need
  to revert the EXT*_MAX_BLOCK_SIZE change to 65536.  It would be possible
  to allow this to be up to 32768 w/o the rec_len overflow fix however.
  
 
 Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
 ext2-support-large-blocksize-up-to-pagesize.patch.
  Sorry, for the delayed answer but I had some urgent bugs to fix...
Why did you drom ext3-support-large-blocksize-up-to-pagesize.patch? As far
as I understand your previous email (and also as I've checked against
2.6.23-rc8-mm2), the patch fixing rec_len overflow clashes only for ext2...
  I'll send you an updated patch for ext2 in a moment.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-11 Thread Jan Kara
On Thu 04-10-07 13:12:07, Andrew Morton wrote:
 On Mon, 01 Oct 2007 17:35:46 -0700
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  ext2: Avoid rec_len overflow with 64KB block size
  
  From: Jan Kara [EMAIL PROTECTED]
  
  With 64KB blocksize, a directory entry can have size 64KB which does not fit
  into 16 bits we have for entry lenght. So we store 0x instead and 
  convert
  value when read from / written to disk.
 
 This patch clashes in non-trivial ways with
 ext2-convert-to-new-aops-fix.patch and perhaps other things which are
 already queued for 2.6.24 inclusion, so I'll need to ask for an updated
 patch, please.
 
 Also, I'm planing on merging the ext2 reservations code into 2.6.24, so if
 we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
 additional testing and checking will be needed.
  OK, attached is a patch diffed against 2.6.23-rc9-mm2 - does that work
fine for you?

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR

--

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0x instead and convert
value when read from / written to disk.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-mm/fs/ext2/dir.c 
linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c
--- linux-2.6.23-mm/fs/ext2/dir.c   2007-10-11 12:08:16.0 +0200
+++ linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c2007-10-11 
12:14:24.0 +0200
@@ -28,6 +28,24 @@
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+   unsigned len = le16_to_cpu(dlen);
+
+   if (len == EXT2_MAX_REC_LEN)
+   return 1  16;
+   return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+   if (len == (1  16))
+   return cpu_to_le16(EXT2_MAX_REC_LEN);
+   else if (len  (1  16))
+   BUG();
+   return cpu_to_le16(len);
+}
+
 /*
  * ext2 uses block-sized chunks. Arguably, sector-sized ones would be
  * more robust, but we have what we have
@@ -106,7 +124,7 @@ static void ext2_check_page(struct page 
}
for (offs = 0; offs = limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
p = (ext2_dirent *)(kaddr + offs);
-   rec_len = le16_to_cpu(p-rec_len);
+   rec_len = ext2_rec_len_from_disk(p-rec_len);
 
if (rec_len  EXT2_DIR_REC_LEN(1))
goto Eshort;
@@ -204,7 +222,8 @@ static inline int ext2_match (int len, c
  */
 static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
 {
-   return (ext2_dirent *)((char*)p + le16_to_cpu(p-rec_len));
+   return (ext2_dirent *)((char*)p +
+   ext2_rec_len_from_disk(p-rec_len));
 }
 
 static inline unsigned 
@@ -316,7 +335,7 @@ ext2_readdir (struct file * filp, void *
return 0;
}
}
-   filp-f_pos += le16_to_cpu(de-rec_len);
+   filp-f_pos += ext2_rec_len_from_disk(de-rec_len);
}
ext2_put_page(page);
}
@@ -425,7 +444,7 @@ void ext2_set_link(struct inode *dir, st
 {
loff_t pos = page_offset(page) +
(char *) de - (char *) page_address(page);
-   unsigned len = le16_to_cpu(de-rec_len);
+   unsigned len = ext2_rec_len_from_disk(de-rec_len);
int err;
 
lock_page(page);
@@ -482,7 +501,7 @@ int ext2_add_link (struct dentry *dentry
/* We hit i_size */
name_len = 0;
rec_len = chunk_size;
-   de-rec_len = cpu_to_le16(chunk_size);
+   de-rec_len = ext2_rec_len_to_disk(chunk_size);
de-inode = 0;
goto got_it;
}
@@ -496,7 +515,7 @@ int ext2_add_link (struct dentry *dentry
if (ext2_match (namelen, name, de))
goto out_unlock;
name_len = EXT2_DIR_REC_LEN(de-name_len);
-   rec_len = le16_to_cpu(de-rec_len);
+   rec_len = ext2_rec_len_from_disk(de-rec_len);
if (!de-inode  rec_len = reclen)
goto got_it;
if (rec_len = name_len + reclen)
@@ -518,8 +537,8 @@ got_it:
goto out_unlock;
if (de-inode) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
-   de1-rec_len = cpu_to_le16(rec_len - name_len);
-   de-rec_len = cpu_to_le16(name_len);
+   de1-rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+   de-rec_len

Re: max file size for ext3

2007-10-10 Thread Jan Kara
  Hello,

 I am looking at ext4_max_size and was confused how the
 number upper_limit = 0x1ff7fffd000LL is arrived. 
 The comment says the value is arrived looking at 4K.
 So i tried the below program. 
 
 main()
 {
   unsigned long long upper_limit, meta_blocks;
   int bits = 12;
 
   /* total blocks in 512 bytes */
   upper_limit = (1LL  32) - 1;
   /* total blocks in file system block size */
   upper_limit = (bits - 9);
 
 
   meta_blocks = 1;
   /* double indirect blocks */
   meta_blocks += 1 + 1LL  (bits-2);
   /* tripple indirect blocks */
   meta_blocks += 1 + 1LL  (bits-2) + 1LL  (2*(bits-2));
 
   upper_limit -= meta_blocks;
   upper_limit = bits;
 
   printf(%x\n, upper_limit);
 }
 
 Can somebody help me to find out what is missing in the above ?
  You actually overestimate the number of triply indirect blocks in your
program above (you count all possible but since the limit is around 2TB
only roughly half of them is really needed).

 I also think hardcoding 4k block size is not correct. I have the below
 patch pending with large file size.
  It is incorrect for blocksize larger than 4096, that's right. Your
patch looks fine - we loose a bit of filesize limit but I don't think it
really matters.

Honza
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

2007-10-08 Thread Jan Kara
On Thu 04-10-07 13:12:07, Andrew Morton wrote:
 On Mon, 01 Oct 2007 17:35:46 -0700
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  ext2: Avoid rec_len overflow with 64KB block size
  
  From: Jan Kara [EMAIL PROTECTED]
  
  With 64KB blocksize, a directory entry can have size 64KB which does not fit
  into 16 bits we have for entry lenght. So we store 0x instead and 
  convert
  value when read from / written to disk.
 
 This patch clashes in non-trivial ways with
 ext2-convert-to-new-aops-fix.patch and perhaps other things which are
 already queued for 2.6.24 inclusion, so I'll need to ask for an updated
 patch, please.
 
 Also, I'm planing on merging the ext2 reservations code into 2.6.24, so if
 we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
 additional testing and checking will be needed.
  OK, I'll fixup those rejects and send a new patch.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix commit code to properly abort journal

2007-09-25 Thread Jan Kara
  And a similar patch for JBD2...

Honza
-

We should really call journal_abort() and not __journal_abort_hard() in case
of errors. The latter call does not record the error in the journal superblock 
and
thus filesystem won't be marked as with errors later (and user could happily 
mount
it without any warning).

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude 
linux-2.6.23-rc6-1-jbd_abort_fix/fs/jbd2/commit.c 
linux-2.6.23-rc6-2-jbd2_abort_fix/fs/jbd2/commit.c
--- linux-2.6.23-rc6-1-jbd_abort_fix/fs/jbd2/commit.c   2007-09-18 
19:22:28.0 +0200
+++ linux-2.6.23-rc6-2-jbd2_abort_fix/fs/jbd2/commit.c  2007-09-25 
16:14:09.0 +0200
@@ -475,7 +475,7 @@ void jbd2_journal_commit_transaction(jou
spin_unlock(journal-j_list_lock);
 
if (err)
-   __jbd2_journal_abort_hard(journal);
+   jbd2_journal_abort(journal, err);
 
jbd2_journal_write_revoke_records(journal, commit_transaction);
 
@@ -533,7 +533,7 @@ void jbd2_journal_commit_transaction(jou
 
descriptor = 
jbd2_journal_get_descriptor_buffer(journal);
if (!descriptor) {
-   __jbd2_journal_abort_hard(journal);
+   jbd2_journal_abort(journal, -EIO);
continue;
}
 
@@ -566,7 +566,7 @@ void jbd2_journal_commit_transaction(jou
   and repeat this loop: we'll fall into the
   refile-on-abort condition above. */
if (err) {
-   __jbd2_journal_abort_hard(journal);
+   jbd2_journal_abort(journal, err);
continue;
}
 
@@ -757,7 +757,7 @@ wait_for_iobuf:
err = -EIO;
 
if (err)
-   __jbd2_journal_abort_hard(journal);
+   jbd2_journal_abort(journal, err);
 
/* End of a transaction!  Finally, we can do checkpoint
processing: any buffers committed as a result of this

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jbd : config_jbd_debug cannot create /proc entry

2007-09-25 Thread Jan Kara
 On Tue, 25 Sep 2007 07:49:38 -0500
 Jose R. Santos [EMAIL PROTECTED] wrote:
 
  On Tue, 25 Sep 2007 13:50:46 +0200
  Jan Kara [EMAIL PROTECTED] wrote:
Jan Kara wrote:

-#define create_jbd_proc_entry() do {} while (0)
-#define remove_jbd_proc_entry() do {} while (0)
+static ctl_table fs_table[] = {
+  {
+.ctl_name   = -1, /* Don't want it */



shouldn't this be CTL_UNNUMBERED ?
 Oh, it should be. I didn't notice we have this :) Thanks for notifying
   me. Attached is a fixed version.
  
  This was fixed in JBD2 by moving the jbd-debug file to debugfs:
  http://lkml.org/lkml/2007/7/11/334
  
  Since this code is already in the kernel, we should keep it consistent. 
  
 
 OK.  Here's a quick patch to fix this.  Adapted from the JBD2 patch.
 Let me know what you think.
  Looks fine - exactly what I've just done here :).

Honza

 commit 6cbd2ce05b7504514707ce825170a5d77abf6a6e
 Author: root [EMAIL PROTECTED]
 Date:   Thu Jun 14 09:40:09 2007 -0500
 
 The jbd-debug file used to be located in /proc/sys/fs/jbd-debug, but
 create_proc_entry() does not do lookups on file names that are more that 
 one
 directory deep.  This causes the entry creation to fail and hence, no proc
 file is created.
 
 Instead of fixing this on procfs might as well move the jbd2-debug file to
 debugfs which would be the preferred location for this kind of tunable.  
 The
 new location is now /sys/kernel/debug/jbd/jbd-debug.
 
 
 Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
  You can add Signed-off-by: Jan Kara [EMAIL PROTECTED]

 
 diff --git a/fs/Kconfig b/fs/Kconfig
 index 58a0650..a8937a6 100644
 --- a/fs/Kconfig
 +++ b/fs/Kconfig
 @@ -219,7 +219,7 @@ config JBD
  
  config JBD_DEBUG
   bool JBD (ext3) debugging support
 - depends on JBD
 + depends on JBD  DEBUG_FS
   help
 If you are using the ext3 journaled file system (or potentially any
 other file system/device using JBD), this option allows you to
 @@ -228,10 +228,10 @@ config JBD_DEBUG
 debugging output will be turned off.
  
 If you select Y here, then you will be able to turn on debugging
 -   with echo N  /proc/sys/fs/jbd-debug, where N is a number between
 -   1 and 5, the higher the number, the more debugging output is
 -   generated.  To turn debugging off again, do
 -   echo 0  /proc/sys/fs/jbd-debug.
 +   with echo N  /sys/kernel/debug/jbd/jbd-debug, where N is a 
 +   number between 1 and 5, the higher the number, the more debugging 
 +   output is generated.  To turn debugging off again, do
 +   echo 0  /sys/kernel/debug/jbd/jbd-debug.
  
  config JBD2
   tristate
 diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
 index 06ab3c1..3cad624 100644
 --- a/fs/jbd/journal.c
 +++ b/fs/jbd/journal.c
 @@ -35,6 +35,7 @@
  #include linux/kthread.h
  #include linux/poison.h
  #include linux/proc_fs.h
 +#include linux/debugfs.h
  
  #include asm/uaccess.h
  #include asm/page.h
 @@ -1939,63 +1940,38 @@ void journal_put_journal_head(struct journal_head *jh)
  }
  
  /*
 - * /proc tunables
 + * debugfs tunables
   */
  #if defined(CONFIG_JBD_DEBUG)
 -int journal_enable_debug;
 +u8 journal_enable_debug;
  EXPORT_SYMBOL(journal_enable_debug);
  #endif
  
 -#if defined(CONFIG_JBD_DEBUG)  defined(CONFIG_PROC_FS)
 +#if defined(CONFIG_JBD_DEBUG)  defined(CONFIG_DEBUG_FS)
  
 -static struct proc_dir_entry *proc_jbd_debug;
 +struct dentry  *jbd_debugfs_dir, *jbd_debug;
  
 -static int read_jbd_debug(char *page, char **start, off_t off,
 -   int count, int *eof, void *data)
 +static void __init create_jbd_debugfs_entry(void)
  {
 - int ret;
 -
 - ret = sprintf(page + off, %d\n, journal_enable_debug);
 - *eof = 1;
 - return ret;
 -}
 -
 -static int write_jbd_debug(struct file *file, const char __user *buffer,
 -unsigned long count, void *data)
 -{
 - char buf[32];
 -
 - if (count  ARRAY_SIZE(buf) - 1)
 - count = ARRAY_SIZE(buf) - 1;
 - if (copy_from_user(buf, buffer, count))
 - return -EFAULT;
 - buf[ARRAY_SIZE(buf) - 1] = '\0';
 - journal_enable_debug = simple_strtoul(buf, NULL, 10);
 - return count;
 -}
 -
 -#define JBD_PROC_NAME sys/fs/jbd-debug
 -
 -static void __init create_jbd_proc_entry(void)
 -{
 - proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL);
 - if (proc_jbd_debug) {
 - /* Why is this so hard? */
 - proc_jbd_debug-read_proc = read_jbd_debug;
 - proc_jbd_debug-write_proc = write_jbd_debug;
 - }
 + jbd_debugfs_dir = debugfs_create_dir(jbd, NULL);
 + if (jbd_debugfs_dir)
 + jbd_debug = debugfs_create_u8(jbd-debug, S_IRUGO,
 +jbd_debugfs_dir

[PATCH] Support for 64KB blocksize in e2fsprogs

2007-09-25 Thread Jan Kara
  Hi,

  attached patch implements support for 64KB blocksize in directories in
e2fsprogs. This patch complements the kernel patches for ext2-4 I've sent
yesterday. Ted, does the patch look fine?
Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR

-

Subject: Support for 64KB blocksize in ext2-4 directories.

When block size is 64KB, we have to take care that rec_len does not overflow.
Kernel stores 0x in case 0x1 should be stored - perform appropriate
conversion when reading from / writing to disk.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/lib/ext2fs/dirblock.c b/lib/ext2fs/dirblock.c
index fb20fa0..628bf93 100644
--- a/lib/ext2fs/dirblock.c
+++ b/lib/ext2fs/dirblock.c
@@ -38,9 +38,9 @@ errcode_t ext2fs_read_dir_block2(ext2_fi
dirent = (struct ext2_dir_entry *) p;
 #ifdef WORDS_BIGENDIAN
dirent-inode = ext2fs_swab32(dirent-inode);
-   dirent-rec_len = ext2fs_swab16(dirent-rec_len);
dirent-name_len = ext2fs_swab16(dirent-name_len);
 #endif
+   dirent-rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
name_len = dirent-name_len;
 #ifdef WORDS_BIGENDIAN
if (flags  EXT2_DIRBLOCK_V2_STRUCT)
@@ -68,12 +68,15 @@ errcode_t ext2fs_read_dir_block(ext2_fil
 errcode_t ext2fs_write_dir_block2(ext2_filsys fs, blk_t block,
  void *inbuf, int flags EXT2FS_ATTR((unused)))
 {
-#ifdef WORDS_BIGENDIAN
errcode_t   retval;
char*p, *end;
char*buf = 0;
struct ext2_dir_entry *dirent;
 
+#ifndef WORDS_BIGENDIAN
+   if (fs-blocksize  EXT2_MAX_REC_LEN)
+   goto just_write;
+#endif
retval = ext2fs_get_mem(fs-blocksize, buf);
if (retval)
return retval;
@@ -89,7 +92,7 @@ errcode_t ext2fs_write_dir_block2(ext2_f
}
p += dirent-rec_len;
dirent-inode = ext2fs_swab32(dirent-inode);
-   dirent-rec_len = ext2fs_swab16(dirent-rec_len);
+   dirent-rec_len = ext2fs_rec_len_to_disk(dirent-rec_len);
dirent-name_len = ext2fs_swab16(dirent-name_len);
 
if (flags  EXT2_DIRBLOCK_V2_STRUCT)
@@ -98,9 +101,8 @@ errcode_t ext2fs_write_dir_block2(ext2_f
retval = io_channel_write_blk(fs-io, block, 1, buf);
ext2fs_free_mem(buf);
return retval;
-#else
+just_write:
return io_channel_write_blk(fs-io, block, 1, (char *) inbuf);
-#endif
 }
 
 
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index a316665..2041f0f 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -717,6 +717,32 @@ struct ext2_dir_entry_2 {
 #define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND)  \
 ~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN   ((116)-1)
+
+static inline unsigned ext2fs_rec_len_from_disk(unsigned len)
+{
+#ifdef WORDS_BIGENDIAN
+   len = ext2fs_swab16(dlen);
+#endif
+   if (len == EXT2_MAX_REC_LEN)
+   return 1  16;
+   return len;
+}
+
+static inline unsigned ext2fs_rec_len_to_disk(unsigned len)
+{
+   if (len == (1  16))
+#ifdef WORDS_BIGENDIAN
+   return ext2fs_swab16(EXT2_MAX_REC_LEN);
+#else
+   return EXT2_MAX_REC_LEN;
+#endif
+#ifdef WORDS_BIGENDIAN
+   return ext2fs_swab_16(len);
+#else
+   return len;
+#endif
+}
 
 /*
  * This structure will be used for multiple mount protection. It will be
diff --git a/misc/e2image.c b/misc/e2image.c
index 1fbb267..4e2c9fb 100644
--- a/misc/e2image.c
+++ b/misc/e2image.c
@@ -345,10 +345,7 @@ static void scramble_dir_block(ext2_fils
end = buf + fs-blocksize;
for (p = buf; p  end-8; p += rec_len) {
dirent = (struct ext2_dir_entry_2 *) p;
-   rec_len = dirent-rec_len;
-#ifdef WORDS_BIGENDIAN
-   rec_len = ext2fs_swab16(rec_len);
-#endif
+   rec_len = ext2fs_rec_len_from_disk(dirent-rec_len);
 #if 0
printf(rec_len = %d, name_len = %d\n, rec_len, 
dirent-name_len);
 #endif
@@ -358,9 +355,7 @@ static void scramble_dir_block(ext2_fils
   bad rec_len (%d)\n, (unsigned long) blk, 
   rec_len);
rec_len = end - p;
-#ifdef WORDS_BIGENDIAN
-   dirent-rec_len = ext2fs_swab16(rec_len);
-#endif
+   dirent-rec_len = ext2fs_rec_len_to_disk(rec_len);
continue;
}
if (dirent-name_len + 8  rec_len) {
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Avoid rec_len overflow with 64KB block size

2007-09-24 Thread Jan Kara
  Hi,

  In the following series I'm sending patches to ext2, ext3 and ext4
solving possible overflow of rec_len when using 64KB blocksize. Minming,
would you add these to the patch queue? Thanks.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] Avoid rec_len overflow with 64KB block size

2007-09-24 Thread Jan Kara

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0x instead and convert
value when read from / written to disk.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext2/dir.c 
linux-2.6.23-rc6-1-ext2_64k_blocksize/fs/ext2/dir.c
--- linux-2.6.23-rc6/fs/ext2/dir.c  2007-07-16 17:47:21.0 +0200
+++ linux-2.6.23-rc6-1-ext2_64k_blocksize/fs/ext2/dir.c 2007-09-24 
19:26:13.0 +0200
@@ -26,6 +26,24 @@
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+   unsigned len = le16_to_cpu(dlen);
+
+   if (len == EXT2_MAX_REC_LEN)
+   return 1  16;
+   return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+   if (len == (1  16))
+   return cpu_to_le16(EXT2_MAX_REC_LEN);
+   else if (len  (1  16))
+   BUG();
+   return cpu_to_le16(len);
+}
+
 /*
  * ext2 uses block-sized chunks. Arguably, sector-sized ones would be
  * more robust, but we have what we have
@@ -95,7 +113,7 @@ static void ext2_check_page(struct page 
}
for (offs = 0; offs = limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
p = (ext2_dirent *)(kaddr + offs);
-   rec_len = le16_to_cpu(p-rec_len);
+   rec_len = ext2_rec_len_from_disk(p-rec_len);
 
if (rec_len  EXT2_DIR_REC_LEN(1))
goto Eshort;
@@ -193,7 +211,8 @@ static inline int ext2_match (int len, c
  */
 static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
 {
-   return (ext2_dirent *)((char*)p + le16_to_cpu(p-rec_len));
+   return (ext2_dirent *)((char*)p +
+   ext2_rec_len_from_disk(p-rec_len));
 }
 
 static inline unsigned 
@@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void *
return 0;
}
}
-   filp-f_pos += le16_to_cpu(de-rec_len);
+   filp-f_pos += ext2_rec_len_from_disk(de-rec_len);
}
ext2_put_page(page);
}
@@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, st
struct page *page, struct inode *inode)
 {
unsigned from = (char *) de - (char *) page_address(page);
-   unsigned to = from + le16_to_cpu(de-rec_len);
+   unsigned to = from + ext2_rec_len_from_disk(de-rec_len);
int err;
 
lock_page(page);
@@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry
/* We hit i_size */
name_len = 0;
rec_len = chunk_size;
-   de-rec_len = cpu_to_le16(chunk_size);
+   de-rec_len = ext2_rec_len_to_disk(chunk_size);
de-inode = 0;
goto got_it;
}
@@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry
if (ext2_match (namelen, name, de))
goto out_unlock;
name_len = EXT2_DIR_REC_LEN(de-name_len);
-   rec_len = le16_to_cpu(de-rec_len);
+   rec_len = ext2_rec_len_from_disk(de-rec_len);
if (!de-inode  rec_len = reclen)
goto got_it;
if (rec_len = name_len + reclen)
@@ -504,8 +523,8 @@ got_it:
goto out_unlock;
if (de-inode) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
-   de1-rec_len = cpu_to_le16(rec_len - name_len);
-   de-rec_len = cpu_to_le16(name_len);
+   de1-rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+   de-rec_len = ext2_rec_len_to_disk(name_len);
de = de1;
}
de-name_len = namelen;
@@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_e
struct inode *inode = mapping-host;
char *kaddr = page_address(page);
unsigned from = ((char*)dir - kaddr)  ~(ext2_chunk_size(inode)-1);
-   unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir-rec_len);
+   unsigned to = ((char*)dir - kaddr) + 
ext2_rec_len_from_disk(dir-rec_len);
ext2_dirent * pde = NULL;
ext2_dirent * de = (ext2_dirent *) (kaddr + from);
int err;
@@ -557,7 +576,7 @@ int ext2_delete_entry (struct ext2_dir_e
err = mapping-a_ops-prepare_write(NULL, page, from, to);
BUG_ON(err);
if (pde)
-   pde-rec_len = cpu_to_le16(to-from);
+   pde-rec_len = ext2_rec_len_to_disk(to-from);
dir-inode = 0;
err = ext2_commit_chunk(page, from, to);
inode-i_ctime = inode-i_mtime = CURRENT_TIME_SEC;
@@ -591,14

Re: [PATCH 3/3] Avoid rec_len overflow with 64KB block size

2007-09-24 Thread Jan Kara

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0x instead and convert
value when read from / written to disk. The patch also converts some places
to use ext4_next_entry() when we are changing them anyway.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/dir.c 
linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c
--- linux-2.6.23-rc6/fs/ext4/dir.c  2007-09-18 19:22:28.0 +0200
+++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c 2007-09-24 
18:30:13.0 +0200
@@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * f
  unsigned long offset)
 {
const char * error_msg = NULL;
-   const int rlen = le16_to_cpu(de-rec_len);
+   const int rlen = ext4_rec_len_from_disk(de-rec_len);
 
if (rlen  EXT4_DIR_REC_LEN(1))
error_msg = rec_len is smaller than minimal;
@@ -176,10 +176,10 @@ revalidate:
 * least that it is non-zero.  A
 * failure will be detected in the
 * dirent test below. */
-   if (le16_to_cpu(de-rec_len) 
-   EXT4_DIR_REC_LEN(1))
+   if (ext4_rec_len_from_disk(de-rec_len)
+EXT4_DIR_REC_LEN(1))
break;
-   i += le16_to_cpu(de-rec_len);
+   i += ext4_rec_len_from_disk(de-rec_len);
}
offset = i;
filp-f_pos = (filp-f_pos  ~(sb-s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
ret = stored;
goto out;
}
-   offset += le16_to_cpu(de-rec_len);
+   offset += ext4_rec_len_from_disk(de-rec_len);
if (le32_to_cpu(de-inode)) {
/* We might block in the next section
 * if the data destination is
@@ -223,7 +223,7 @@ revalidate:
goto revalidate;
stored ++;
}
-   filp-f_pos += le16_to_cpu(de-rec_len);
+   filp-f_pos += ext4_rec_len_from_disk(de-rec_len);
}
offset = 0;
brelse (bh);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/namei.c 
linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c
--- linux-2.6.23-rc6/fs/ext4/namei.c2007-09-18 19:22:28.0 +0200
+++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c   2007-09-24 
18:35:34.0 +0200
@@ -280,7 +280,7 @@ static struct stats dx_show_leaf(struct 
space += EXT4_DIR_REC_LEN(de-name_len);
names++;
}
-   de = (struct ext4_dir_entry_2 *) ((char *) de + 
le16_to_cpu(de-rec_len));
+   de = ext4_next_entry(de);
}
printk((%i)\n, names);
return (struct stats) { names, space, 1 };
@@ -525,7 +525,8 @@ static int ext4_htree_next_block(struct 
  */
 static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 
*p)
 {
-   return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p-rec_len));
+   return (struct ext4_dir_entry_2 *)((char*)p +
+   ext4_rec_len_from_disk(p-rec_len));
 }
 
 /*
@@ -689,7 +690,7 @@ static int dx_make_map (struct ext4_dir_
cond_resched();
}
/* XXX: do we need to check rec_len == 0 case? -Chris */
-   de = (struct ext4_dir_entry_2 *) ((char *) de + 
le16_to_cpu(de-rec_len));
+   de = ext4_next_entry(de);
}
return count;
 }
@@ -790,7 +791,7 @@ static inline int search_dirblock(struct
return 1;
}
/* prevent looping on a bad block */
-   de_len = le16_to_cpu(de-rec_len);
+   de_len = ext4_rec_len_from_disk(de-rec_len);
if (de_len = 0)
return -1;
offset += de_len;
@@ -1099,7 +1100,7 @@ dx_move_dirents(char *from, char *to, st
rec_len = EXT4_DIR_REC_LEN(de-name_len);
memcpy (to, de, rec_len);
((struct ext4_dir_entry_2 *) to)-rec_len =
-   cpu_to_le16(rec_len);
+   ext4_rec_len_to_disk(rec_len);
de-inode = 0;
map++;
to += rec_len;
@@ -1114,13 +1115,12 @@ static struct ext4_dir_entry_2* dx_pack_
 
prev = to = de;
while ((char*)de  base + size) {
-   next

Ext3 not marking filesystems as with errors

2007-09-20 Thread Jan Kara
  Hi,

  one friend has just pointed me to a following misbehaviour of ext3. If we
stumble on some error in JBD (e.g. in commit code), we call
__journal_abort_hard(). It just marks the journal as aborted but does
nothing else. Later ext3 comes, finds journal aborted, calls ext3_abort()
which remounts fs read-only and stops (it does not mark filesystem as
having errors). It calls journal_abort(.., -EIO) but that does nothing
because the journal is already aborted. If you then unmount the filesystem
and mount it again, everything goes on happily as if there was no error -
no suggestion for running fsck, nothing.
  I guess this is a bug but please correct me if you don't think so. There
are two possibilities how to fix it - either we mark the filesystem as with
errors in ext3_abort() or we could call some less lowlevel function from
JBD to abort journal (as soon as j_errno is set, we are safe). Any feeling
what is less hacky?

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Avoid rec_len overflow with 64KB block size

2007-09-20 Thread Jan Kara
   Hello,

  when converting ext4 directories to pagecache I just came over
Takashi's patch preventing overflowing of rec_len. Looking over the
patch - can't we do it more elegantly by using say 0x instead of 64K
and perform conversion (using some helper) at the moment we read / store
rec_len? That would be IMHO more transparent than current approach (at
least it took me some time to understand what's going on with the
current patch when I was looking at the code)...

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Enabling h-trees too early?

2007-09-20 Thread Jan Kara
On Thu 20-09-07 11:14:40, Theodore Tso wrote:
 On Thu, Sep 20, 2007 at 04:58:39PM +0200, Jan Kara wrote:
Hmm, strange - I've just looked at my computer and dir_index is set
  just for 5 directories in my tree.
 
 I looked at a tree that had object files, which is probably why I had
 8 directories; I'm guessing you probably just had kernel sources and
 no build files.
 
  If I try deleting just them, I also
  see some performance decrease but it's less than if I try deleting the
  whole tree (and that result seems to be quite consistent)... There's 
  something
  fishy there. Maybe I could try seekwatcher or something similar to see
  what's really happening.
 
 That is very strange.
  Just a guess: Can't the culprit be the following test in ext3/4_readdir()?
if (EXT4_HAS_COMPAT_FEATURE(inode-i_sb, EXT4_FEATURE_COMPAT_DIR_INDEX) 
((EXT4_I(inode)-i_flags  EXT4_INDEX_FL) ||
 ((inode-i_size  sb-s_blocksize_bits) == 1))) {
error = ext4_dx_readdir(filp, dirent, filldir);
if (error != ERR_BAD_DX_DIR) {
ret = error;
goto out;
}
/*
 * We don't set the inode dirty flag since it's not
 * critical that it get flushed back to the disk.
 */
EXT4_I(filp-f_path.dentry-d_inode)-i_flags = ~EXT4_INDEX_FL;
}
  It calls ext4_dx_readdir() for *every* directory with 1 block (we have
1326 of them in the kernel tree). Now ext4_dx_readdir() calls
ext4_htree_fill_tree() which finds out the directory is not h-tree and
and calls htree_dirblock_to_tree(). So even for 4KB directories we end up
deleting inodes in hash order! And as a bonus we burn some cycles building
trees etc. What is the point of this?

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Avoid rec_len overflow with 64KB block size

2007-09-20 Thread Jan Kara
   when converting ext4 directories to pagecache I just came over
 Takashi's patch preventing overflowing of rec_len. Looking over the
 patch - can't we do it more elegantly by using say 0x instead of 64K
 and perform conversion (using some helper) at the moment we read / store
 rec_len? That would be IMHO more transparent than current approach (at
 least it took me some time to understand what's going on with the
 current patch when I was looking at the code)...
  Attached is a patch that does this for ext4. If you like this
approach, I can cook up a similar patch for ext2 / ext3.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0x instead and convert
value when read from / written to disk. The patch also converts some places
to use ext4_next_entry() when we are changing them anyway.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/dir.c linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c
--- linux-2.6.23-rc6/fs/ext4/dir.c	2007-09-18 19:22:28.0 +0200
+++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c	2007-09-20 18:08:02.0 +0200
@@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * f
 			  unsigned long offset)
 {
 	const char * error_msg = NULL;
-	const int rlen = le16_to_cpu(de-rec_len);
+	const int rlen = ext4_get_rec_len(le16_to_cpu(de-rec_len));
 
 	if (rlen  EXT4_DIR_REC_LEN(1))
 		error_msg = rec_len is smaller than minimal;
@@ -176,10 +176,10 @@ revalidate:
  * least that it is non-zero.  A
  * failure will be detected in the
  * dirent test below. */
-if (le16_to_cpu(de-rec_len) 
-		EXT4_DIR_REC_LEN(1))
+if (ext4_get_rec_len(le16_to_cpu(de-rec_len))
+		 EXT4_DIR_REC_LEN(1))
 	break;
-i += le16_to_cpu(de-rec_len);
+i += ext4_get_rec_len(le16_to_cpu(de-rec_len));
 			}
 			offset = i;
 			filp-f_pos = (filp-f_pos  ~(sb-s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
 ret = stored;
 goto out;
 			}
-			offset += le16_to_cpu(de-rec_len);
+			offset += ext4_get_rec_len(le16_to_cpu(de-rec_len));
 			if (le32_to_cpu(de-inode)) {
 /* We might block in the next section
  * if the data destination is
@@ -223,7 +223,7 @@ revalidate:
 	goto revalidate;
 stored ++;
 			}
-			filp-f_pos += le16_to_cpu(de-rec_len);
+			filp-f_pos += ext4_get_rec_len(le16_to_cpu(de-rec_len));
 		}
 		offset = 0;
 		brelse (bh);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/namei.c linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c
--- linux-2.6.23-rc6/fs/ext4/namei.c	2007-09-18 19:22:28.0 +0200
+++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c	2007-09-20 18:29:29.0 +0200
@@ -280,7 +280,7 @@ static struct stats dx_show_leaf(struct 
 			space += EXT4_DIR_REC_LEN(de-name_len);
 			names++;
 		}
-		de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len));
+		de = ext4_next_entry(de);
 	}
 	printk((%i)\n, names);
 	return (struct stats) { names, space, 1 };
@@ -525,7 +525,8 @@ static int ext4_htree_next_block(struct 
  */
 static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p)
 {
-	return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p-rec_len));
+	return (struct ext4_dir_entry_2 *)((char*)p +
+		ext4_get_rec_len(le16_to_cpu(p-rec_len)));
 }
 
 /*
@@ -689,7 +690,7 @@ static int dx_make_map (struct ext4_dir_
 			cond_resched();
 		}
 		/* XXX: do we need to check rec_len == 0 case? -Chris */
-		de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de-rec_len));
+		de = ext4_next_entry(de);
 	}
 	return count;
 }
@@ -790,7 +791,7 @@ static inline int search_dirblock(struct
 			return 1;
 		}
 		/* prevent looping on a bad block */
-		de_len = le16_to_cpu(de-rec_len);
+		de_len = ext4_get_rec_len(le16_to_cpu(de-rec_len));
 		if (de_len = 0)
 			return -1;
 		offset += de_len;
@@ -1099,7 +1100,7 @@ dx_move_dirents(char *from, char *to, st
 		rec_len = EXT4_DIR_REC_LEN(de-name_len);
 		memcpy (to, de, rec_len);
 		((struct ext4_dir_entry_2 *) to)-rec_len =
-cpu_to_le16(rec_len);
+cpu_to_le16(ext4_store_rec_len(rec_len));
 		de-inode = 0;
 		map++;
 		to += rec_len;
@@ -1114,13 +1115,12 @@ static struct ext4_dir_entry_2* dx_pack_
 
 	prev = to = de;
 	while ((char*)de  base + size) {
-		next = (struct ext4_dir_entry_2 *) ((char *) de +
-		le16_to_cpu(de-rec_len));
+		next = ext4_next_entry(de);
 		if (de-inode  de-name_len) {
 			rec_len = EXT4_DIR_REC_LEN(de-name_len);
 			if (de  to)
 memmove(to, de, rec_len);
-			to-rec_len = cpu_to_le16(rec_len);
+			to-rec_len = cpu_to_le16(ext4_store_rec_len(rec_len));
 			prev = to;
 			to = (struct ext4_dir_entry_2 *) (((char *) to) + rec_len);
 		}
@@ -1178,8 +1178,8 @@ static struct ext4_dir_entry_2 *do_split
 	/* Fancy dance

Enabling h-trees too early?

2007-09-19 Thread Jan Kara
  Hi,

  I was just wondering: Currently we start to build h-tree in a directory
already when the size of directory exceeds one block. But honestly, it does
not seem to make much sence to use this feature until the directory is much
larger (I'd say at least 16 or 32 KB). It actually slows down some
operations like deleting the whole directory, etc. So what is the reason
for starting building the tree so early? Just the simplicity of building it
when the directory is just one block large?

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2/3] 2.6.23-rc6: known regressions v2

2007-09-18 Thread Jan Kara
 FS
 
 Subject : hanging ext3 dbench tests
 References  : http://lkml.org/lkml/2007/9/11/176
 Last known good : ?
 Submitter   : Andy Whitcroft [EMAIL PROTECTED]
 Caused-By   : ?
 Handled-By  : ?
 Status  : under test -- unreproducible at present
  Yep... Hard to do anything until Andy is able to reproduce it at least
once more to gather needed info.

 Subject : umount triggers a warning in jfs and takes almost a minute
 References  : http://lkml.org/lkml/2007/9/4/73
 Last known good : ?
 Submitter   : Oliver Neukum [EMAIL PROTECTED]
 Caused-By   : ?
 Handled-By  : ?
 Status  : unknown
  I thought Shaggy asked Oliver about some details (and he did not
answer so far) so I'd assume Shaggy is handling this.


Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] System calls for online defrag

2007-09-03 Thread Jan Kara
  Hello,

  I've finally got to writing up some proposal how could look system calls
allowing for online filesystem defragmentation and generally moving file
blocks around for improving performance. Comments are welcome.

Honza

int sys_movedata(int datafd, int spacefd, loff_t from, size_t len)
   The call takes blocks used to carry data starting at offset @from of length
@len in @spacefd and places them instead of corresponding blocks in @datafd.
Data is copied from @datafd to newly spliced data blocks. If @spacefd contains
a hole in the specified interval, a hole is created also in @datafd in the
corresponding place. A data block from @spacefd and also replace a hole in
@datafd - zeros are copied to such data block. @from and @len should be
multiples of filesystem block size (otherwise EINVAL is returned). Data blocks
from @datafd in the interval are released, a hole is created in @spacefd. The
call returns either 0 (success) or an error code.
  Another possibility would be to just replace data blocks without any copying
of data (that would have to be done by the caller to before calling
sys_movedata()). The problem here is how to avoid data loss if someone writes
to the file after userspace has copied the data and before sys_movedata() is
called.



ssize_t sys_allocate(int fd, int mode, loff_t goal, ssize_t len)
  Allocate new space to file @fd at offset defined by file position.  Both file
offset and @len should be a multiple of filesystem block size. The whole
interval must not contain any allocated blocks. If the interval extends past
EOF, the file size is changed accordingly.  @mode defines a way the filesystem
will search for blocks. @mode is a bitwise OR of the following flags:
  ALLOC_FIXED_START - allocation must start at @goal; if not specified, @goal
is just a hint where to start an allocation
  ALLOC_FIXED_LEN - allocate exactly space for @len; if not specified, upto
@len bytes may be allocated.
  ALLOC_CONTINGUOUS - allocation must be one continguous run of blocks

  If the allocation succeeds, number of allocated bytes is returned. Otherwise
an error code is returned.



The following syscall may be also useful - although I'm not completely
convinced this is the right way to go. But on the other hand, disk optimizer
should have a way to find out about free space so that he can decide what and
where is beneficial to move.

int sys_get_free_blocks(const char *fs, loff_t start, loff_t end, int count,
  struct alloc_extent *space)

  Get a description of free space on a filesystem between @start and @end (in
bytes, should be blocksize aligned). @fs is a path where the filesystem is
mounted (I guess it's better than dev_t, isn't it?). @space is a pointer to an
array of 'struct alloc_extent'. In each struct alloc_extent is stored
description of one extent of free space. Upto @count extents are stored.

struct alloc_extent {
  loff_t start;
  size_t len;
};
  Function returns a number of extents stored. Note that the result of the
function is unreliable as the space can be already allocated by the time system
call returns.

-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix panic in jbd by adding locks

2007-08-20 Thread Jan Kara
On Thu 16-08-07 12:37:45, Josef Bacik wrote:
 On Thu, Aug 16, 2007 at 06:08:35PM +0200, Jan Kara wrote:
 It is possible to panic the box by a race condition that exists in the
 journalling code where we do not take the j_revoke_lock when 
 traversing the
 journal's revoked record list.  This patch has been tested and we 
 haven't seen
 the issue yet, its a rather straightforward and correct (at least I 
 think so :)
 fix.  Thank you,
  In principle, the patch looks fine. The only thing I'm wondering about
is how that panic can happen... Journal write_revoke_records() is called
from journal_commit_transaction() when revoke table for the committing
transaction shouldn't see any further changes. So maybe the patch is
masking a different problem.
  Do you have a way of reproducing the problem? Any stack trace
available?
   
   Reproducing the problem is tricky as theres no sure way to make it happen,
   basically we run the box with alot of memory pressure while doing alot
   operations that require journalling.  Here is the backtrace of the panic 
   (note
   this is on a RHEL4 kernel so 2.6.9, but the same problem exists upstream)
OK.
  
   1Unable to handle kernel paging request at virtual address 
   602b1110
   4kjournald[1310]: Oops 11012296146944 [1]
   4Modules linked in: ltd(U) vfat fat dm_mod button uhci_hcd shpchp e1000
   bond1(U) bond0(U) ext3 jbd hfcldd(U) hfcldd_conf(U) sd_mod scsi_mod
   4
   4Pid: 1310, CPU 0, comm:kjournald
   4psr : 121008026018 ifs : 8c9c ip  : 
   [a00200045161]
   Tainted: P 
   4ip is at journal_write_revoke_records+0x221/0x4e0 [jbd]
   4unat:  pfs : 0c9c rsc : 0003
   4rnat:  bsps:  pr  : a681
   4ldrs:  ccv :  fpsr: 0009804c8a70033f
   4csd :  ssd : 
   4b0  : a00200045270 b6  : a0020026a240 b7  : a001ee90
   4f6  : 0fffbe38e38e381b23800 f7  : 0ffe9edc3d22c
   4f8  : 1000e86fb f9  : 100029000
   4f10 : 1000aeff71c71b9e6e61a f11 : 1003e0eff
   4r1  : a00200234000 r2  : 048c r3  : e002791a7a90
   4r8  :  r9  : e002791a0400 r10 : 
   4r11 : e100 r12 : e002791a7b00 r13 : e002791a
   4r14 : e0027b7ee6c0 r15 : e002791a7b00 r16 : e00272d48018
   4r17 :  r18 :  r19 : 0009804c8a70033f
   4r20 : 602b1118 r21 : a0010006ad70 r22 : 0019
   4r23 :  r24 :  r25 : 0019
   4r26 :  r27 :  r28 : 6a41
   4r29 :  r30 :  r31 : e0027b7ee5a4
   4
   4Call Trace:
   4 [a00100016da0] show_stack+0x80/0xa0
   4sp=e002791a7690 
   bsp=e002791a1170
   4 [a001000176b0] show_regs+0x890/0x8c0
   4sp=e002791a7860 
   bsp=e002791a1128
   4 [a0010003e910] die+0x150/0x240
   4sp=e002791a7880 
   bsp=e002791a10e8
   4 [a001000644c0] ia64_do_page_fault+0x8c0/0xbc0
   4sp=e002791a7880 
   bsp=e002791a1080
   4 [a001f600] ia64_leave_kernel+0x0/0x260
   4sp=e002791a7930 
   bsp=e002791a1080
   4 [a00200045160] journal_write_revoke_records+0x220/0x4e0 [jbd]
   4sp=e002791a7b00 
   bsp=e002791a0f98
   4 [a0020003d940] journal_commit_transaction+0xf80/0x3080 [jbd]
   4sp=e002791a7b10 
   bsp=e002791a0ea0
   4 [a002000458d0] kjournald+0x170/0x580 [jbd]
   4sp=e002791a7d80 
   bsp=e002791a0e38
   4 [a00100018c70] kernel_thread_helper+0x30/0x60
   4sp=e002791a7e30 
   bsp=e002791a0e10
   4 [a0018c60] start_kernel_thread+0x20/0x40
   4sp=e002791a7e30 
   bsp=e002791a0e10
Do you know (or could you find out) where exactly in the code is
  journal_write_revoke_records+0x221/0x4e0?
  
 
 Yeah sorry
 
 500 void journal_write_revoke_records(journal_t *journal, 
  501   transaction_t *transaction)
  502 {
  503 struct journal_head *descriptor;
  504 struct jbd_revoke_record_s *record;
  505 struct jbd_revoke_table_s *revoke;
  506 struct list_head *hash_list;
  507 int i, offset, count;
  508 
  509 descriptor = NULL; 
  510 offset = 0;
  511 count = 0;
  512 
  513 /* select revoke table for committing transaction */
  514 revoke = journal-j_revoke == journal-j_revoke_table[0] ?
  515

Re: [PATCH] fix panic in jbd by adding locks

2007-08-16 Thread Jan Kara
  Hello,

   It is possible to panic the box by a race condition that exists in the
   journalling code where we do not take the j_revoke_lock when traversing 
   the
   journal's revoked record list.  This patch has been tested and we haven't 
   seen
   the issue yet, its a rather straightforward and correct (at least I think 
   so :)
   fix.  Thank you,
In principle, the patch looks fine. The only thing I'm wondering about
  is how that panic can happen... Journal write_revoke_records() is called
  from journal_commit_transaction() when revoke table for the committing
  transaction shouldn't see any further changes. So maybe the patch is
  masking a different problem.
Do you have a way of reproducing the problem? Any stack trace
  available?
 
 Reproducing the problem is tricky as theres no sure way to make it happen,
 basically we run the box with alot of memory pressure while doing alot
 operations that require journalling.  Here is the backtrace of the panic (note
 this is on a RHEL4 kernel so 2.6.9, but the same problem exists upstream)
  OK.

 1Unable to handle kernel paging request at virtual address 602b1110
 4kjournald[1310]: Oops 11012296146944 [1]
 4Modules linked in: ltd(U) vfat fat dm_mod button uhci_hcd shpchp e1000
 bond1(U) bond0(U) ext3 jbd hfcldd(U) hfcldd_conf(U) sd_mod scsi_mod
 4
 4Pid: 1310, CPU 0, comm:kjournald
 4psr : 121008026018 ifs : 8c9c ip  : [a00200045161]
 Tainted: P 
 4ip is at journal_write_revoke_records+0x221/0x4e0 [jbd]
 4unat:  pfs : 0c9c rsc : 0003
 4rnat:  bsps:  pr  : a681
 4ldrs:  ccv :  fpsr: 0009804c8a70033f
 4csd :  ssd : 
 4b0  : a00200045270 b6  : a0020026a240 b7  : a001ee90
 4f6  : 0fffbe38e38e381b23800 f7  : 0ffe9edc3d22c
 4f8  : 1000e86fb f9  : 100029000
 4f10 : 1000aeff71c71b9e6e61a f11 : 1003e0eff
 4r1  : a00200234000 r2  : 048c r3  : e002791a7a90
 4r8  :  r9  : e002791a0400 r10 : 
 4r11 : e100 r12 : e002791a7b00 r13 : e002791a
 4r14 : e0027b7ee6c0 r15 : e002791a7b00 r16 : e00272d48018
 4r17 :  r18 :  r19 : 0009804c8a70033f
 4r20 : 602b1118 r21 : a0010006ad70 r22 : 0019
 4r23 :  r24 :  r25 : 0019
 4r26 :  r27 :  r28 : 6a41
 4r29 :  r30 :  r31 : e0027b7ee5a4
 4
 4Call Trace:
 4 [a00100016da0] show_stack+0x80/0xa0
 4sp=e002791a7690 bsp=e002791a1170
 4 [a001000176b0] show_regs+0x890/0x8c0
 4sp=e002791a7860 bsp=e002791a1128
 4 [a0010003e910] die+0x150/0x240
 4sp=e002791a7880 bsp=e002791a10e8
 4 [a001000644c0] ia64_do_page_fault+0x8c0/0xbc0
 4sp=e002791a7880 bsp=e002791a1080
 4 [a001f600] ia64_leave_kernel+0x0/0x260
 4sp=e002791a7930 bsp=e002791a1080
 4 [a00200045160] journal_write_revoke_records+0x220/0x4e0 [jbd]
 4sp=e002791a7b00 bsp=e002791a0f98
 4 [a0020003d940] journal_commit_transaction+0xf80/0x3080 [jbd]
 4sp=e002791a7b10 bsp=e002791a0ea0
 4 [a002000458d0] kjournald+0x170/0x580 [jbd]
 4sp=e002791a7d80 bsp=e002791a0e38
 4 [a00100018c70] kernel_thread_helper+0x30/0x60
 4sp=e002791a7e30 bsp=e002791a0e10
 4 [a0018c60] start_kernel_thread+0x20/0x40
 4sp=e002791a7e30 bsp=e002791a0e10
  Do you know (or could you find out) where exactly in the code is
journal_write_revoke_records+0x221/0x4e0?

 While analyzing the problem, Hitachi came up with this explanation for the 
 race
 condition
 
 PID: 31401  TASK: e0004fb3  CPU: 1   COMMAND: GET
 #0 [BSP:e0004fb314d8] context_switch at a0010006ab90
 #1 [BSP:e0004fb313b8] schedule at a00100590f40
 #2 [BSP:e0004fb31340] do_get_write_access at a002000388e0
 #3 [BSP:e0004fb31300] journal_get_write_access at a00200039680
 #4 [BSP:e0004fb312b8] ext3_reserve_inode_write at a0020013f180
 #5 [BSP:e0004fb31290] ext3_mark_inode_dirty at a0020013f2a0
 #6 [BSP:e0004fb31260] ext3_dirty_inode at a00200144310
 #7 [BSP:e0004fb31210] __mark_inode_dirty at a00100178200
 #8 [BSP:e0004fb311e8] update_atime at a00100165cc0
 #9 [BSP:e0004fb31128] do_generic_mapping_read at a001000d40e0
 #10 [BSP:e0004fb310c0] __generic_file_aio_read at a001000d8b40
 #11 [BSP:e0004fb31088] 

Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

2007-06-14 Thread Jan Kara
 On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
 ..
   
   5) ext3_write_end:
 Before  write_begin/write_end patch set we have folowing locking
 order:
 stop_journal(handle);
 unlock_page(page);
 But now order is oposite:
 unlock_page(page);
 stop_journal(handle);
 Can we got any race condition now? I'm not sure is it actual problem,
 may be somebody cant describe this.
  
  Can we just change it to the original order? That would seem to be
  safest unless one of the ext3 devs explicitly acks it.
  Sorry, I've missed beginning of this thread. But what problems can
exactly cause this ordering change? ext3_journal_stop has no need to be
protected by the page lock - it can be even better that it's not
protected as it can trigger commit and all that would happen
unnecessarily under page lock...

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext2_discard_prealloc() called on each iput?

2007-05-30 Thread Jan Kara
On Tue 29-05-07 14:10:52, Mingming Cao wrote:
 On Mon, 2007-05-28 at 18:04 +0200, Jan Kara wrote:
  On Wed 23-05-07 08:37:43, Theodore Tso wrote:
   On Tue, May 22, 2007 at 06:11:27PM +0200, Jan Kara wrote:

  while fixing some problems with preallocation in UDF, I had a look how
ext2 solves similar problems. I found out that ext2_discard_prealloc() 
is
called on every iput() from ext2_put_inode(). Is it really appropriate? 
I
don't see a reason for doing so...
   
   I agree, it's probably not appropriate.  It's been that way for a long
   time, though (since 2.4.20).  It's not as horrible as it seems since
   unlike traditional Unix systems, we don't call iput() as often, since
   for example operations like close() end up calling dput(), which
   decrements the ref. count on dentry, not the inode.  But it would
   probably be better to check to see if i_count is 1 before deciding to
   discard the preallocation.
OK, but then you could move the code to drop_inode() which is called at
  exactly that moment... I've been thinking more about it when fixing UDF.
 
 I have tried to optimize ext2 discard preallocation code like ext3
 discard reservation a while back: we only call ext2_discard_prealloc on
 the last iput(), i.e. ext2/3_clear_inode().
 
 This patch actually made into mainline for a little while, then later it
 is being reversed back because of possible leak of preallocated blocks.
 
 Tt the unmount time, someone might still hold the reference of the
 inode, thus ext2_discard_prealloc() did not get a chance to be called.
 Since ext2 preallocation is doing pre-allocation on disk, this leads to
 leak of preallocated blocks, confused fsck later.
 
 http://lkml.org/lkml/2005/4/12/333
  Interesting. I don't quite understand how it can happen that inode is not
released at umount time... It must be released for umount to succeed. There
is a slight problem that inode is not written after calling clear_inode()
which could cause some problems (i_blocks being wrong) but blocks in
bitmaps should be freed just right...

   anyway, so the preallocated region is less useful.
OK, but still we could use e.g. i_writecount to check that we drop the
  last descriptor for writing...
  
 Yep, that is what ext3 does in ext3_release_file(), I forget why we
 didn't fix this for ext2.
  Hmm, probably we just forgot...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext2_discard_prealloc() called on each iput?

2007-05-28 Thread Jan Kara
On Wed 23-05-07 08:37:43, Theodore Tso wrote:
 On Tue, May 22, 2007 at 06:11:27PM +0200, Jan Kara wrote:
  
while fixing some problems with preallocation in UDF, I had a look how
  ext2 solves similar problems. I found out that ext2_discard_prealloc() is
  called on every iput() from ext2_put_inode(). Is it really appropriate? I
  don't see a reason for doing so...
 
 I agree, it's probably not appropriate.  It's been that way for a long
 time, though (since 2.4.20).  It's not as horrible as it seems since
 unlike traditional Unix systems, we don't call iput() as often, since
 for example operations like close() end up calling dput(), which
 decrements the ref. count on dentry, not the inode.  But it would
 probably be better to check to see if i_count is 1 before deciding to
 discard the preallocation.
  OK, but then you could move the code to drop_inode() which is called at
exactly that moment... I've been thinking more about it when fixing UDF.
Discarding prealloc at drop_inode() has the disadvantage that
symlinks/directories will keep their preallocated blocks until inodes are
evicted from memory. Which is probably why ext2 discards prealloc on
iput().

Also I found slightly misleading the comment at ext2_release_file().
  As far as I understand the code it isn't when /all/ files are closed but
  rather when all fd's for given filp are closed. I.e. if you open the same
  file two times, -release will get called once for each open. Am I right?
 
 Yep!
 
  If so, then also calling ext2_discard_prealloc() from ext2_release_file()
  is suboptimal, isn't it?
 
 Yes, although it's a bit better because only discaord the
 preallocation if the file descriptor was opened for writing.  The file
 could be opened for writing by multiple file descriptors, true, but in
 that case it's likely that the write pattern will be a random access one
 anyway, so the preallocated region is less useful.
  OK, but still we could use e.g. i_writecount to check that we drop the
last descriptor for writing...

 P.S.  Note that ext3 and ext4 use a different preallocation scheme;
  Yes, I know that. That's why I looked at ext2 when searching for
inspiration how to fix UDF :)

 still patches to fix the comments might not be a bad idea, since it
 might save confusion by others later on.
  Ok, will write one.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Online defragmentation and ext4migrate

2007-05-22 Thread Jan Kara
 On Mon, 2007-05-21 at 12:38 +0200, Jan Kara wrote:
Yes. On the other hand I believe that some people would like to use
  defragmentation but stay with ext3. For them conversion to extents is
  no-go.
  [...]
   I've written a patch that defragments non-extent files but after
  discussion with XFS guys I've decided that the interfaces should be made
  more generic, so that XFS and other filesystems can use them too...
 I see no reason why the ioctl to convert a file to extents and then
 defragment it should be different from the ioctl to defragment a
 non-extent file.

 After all, whether a file's blocks are tracked as lists of blocks or a
 set of extents is just bookkeeping, right? The set of data blocks that
 make up the file and their order is the same regardless of whether the
 extent flag is set in the inode.
  I agree that at least part of the interface should
be independent on the particular representation of data references -
especially because I want it to be useful for more filesystems than just
ext2/3/4. Currently I think that defragmenting data blocks itself can
have fs-independent interface. Of course, when you decide to defragment
metadata (i.e. indirect blocks, inodes, etc.) you have to have fs-specific
interfaces, probably ioctls...

 If the user is running the ext2/3 driver or the ext4 driver with the
 noextents option, just defragment the file. If the user is running ext4
 without the noextents option, convert to extents and then defragment.
  Defragmentation ioctl definitely should not touch the way the file is
represented. I.e. if the file uses indirect blocks it should use
indirect blocks after defragmentation. If it uses extents, it should use
extents afterwards too. It should be the userspace utility which decides
whether the file should be converted or not and uses appropriate call
for that...

 The only problem that I can think of is that defragmenting metadata
 (including indirect block and/or whatever the equivalent is in
 extent-land) presumably has performance benefits too, so maybe a
 defragmenter in userspace would want to have some knowledge/control over
 this process.
  Yes, it has measurable benefit (especially for indirect blocks) so
eventually we should do it.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ext2_discard_prealloc() called on each iput?

2007-05-22 Thread Jan Kara
  Hello,

  while fixing some problems with preallocation in UDF, I had a look how
ext2 solves similar problems. I found out that ext2_discard_prealloc() is
called on every iput() from ext2_put_inode(). Is it really appropriate? I
don't see a reason for doing so...
  Also I found slightly misleading the comment at ext2_release_file().
As far as I understand the code it isn't when /all/ files are closed but
rather when all fd's for given filp are closed. I.e. if you open the same
file two times, -release will get called once for each open. Am I right?
If so, then also calling ext2_discard_prealloc() from ext2_release_file()
is suboptimal, isn't it?
Thanks for answer
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext2/3/4 online defrag

2007-05-21 Thread Jan Kara
 On Thu, 2007-05-17 at 18:11 +0200, Jan Kara wrote:
  But me (and several other people
  independently as I've learnt recently) have written some tools which
  should result in something useful. If you're interested, you can join
  [EMAIL PROTECTED] - it's led by one guy who is doing
  defrag and stuff as his google summer of code project.
 
 Is this different from the ext4/extent-based defrag patch that's been
 mentioned on this list?
  Yes, it is different. In particular, it's offline only tool so far...

   *An implementation of an ext* filesystem driver can work with any
   ext2/3/4 filesystem as long as it supports the necessary revision
   (GOOD_OLD_REV or DYNAMIC_REV) and feature flags set in the filesystem.
Not sure what you mean here...
 
 The ext2 filesystem/ext3 filesystem/ext4 filesystem terminology
 was confusing to me when I first started reading about them. In my mind,
 it implied that those three filesystems were more different than they
 actually are.
 
 I think it would be more accurate to say that they are all essentially
 the same filesystem, and that any filesystem driver that can mount a
 given filesystem can mount any other ext2/3/4 filesystem of the same
 revision with the same feature flags set.
 
 I was asking for confirmation of this assumption, but I've since found a
 lot of really good documentation that has cleared up a lot of things.
  Yes, basically it's just a question of a feature set. But for example
current online defrag from Takashi requires extents, which are not
available for ext2 or ext3.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Online defragmentation and ext4migrate

2007-05-21 Thread Jan Kara
  Hello,

 While doing online defragmentation do we move the blocks corresponding to 
 extent index ? The reason why i am asking this is to understand the
 usefulness of doing a ext4migrate followed by defrag. I understand that 
 defragmentation in general will improve the performance. But with respect 
 to ext4migrate we are not touching the data blocks. Instead we build the 
 extent map and if that requires to have an extent index block then we 
 allocate one. I am trying to understand what would be the performance 
 impact of this and whether doing a defrag really improve the performance.
 
 I think converting a file to extents has the benefit for the performance of
 block searching. If we want to improve also the performance of  reading
 file data, we have to run the defrag after that.
  Yes. On the other hand I believe that some people would like to use
defragmentation but stay with ext3. For them conversion to extents is
no-go.

 Also looking at the version 0.4 I see that defrag ioctl only work if we 
 have EXT4_EXTENTS_FL flag set. What are the plans for making defrag work 
 with indirect block map inode ?
 
 Unfortunately, my defrag doesn't support an indirect block file.
 But we can reduce fragments in the file with the defrag just after
 ext4migrate.
 
 In my opinion, to keep the ioctl simple and small is very important
 for ease of maintenance.  So I would rather not support indirect block
 files in the ioctl.  Instead, I can add the call of the migration
 ioctl to my defrag tool in order to defragment indirect block files.
 How do you think of it?
  Yes that could be useful but I don't think it's a complete solution
for people that don't want to migrate.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Copy i_flags to ext3 inode flags on write

2007-04-17 Thread Jan Kara
On Mon 16-04-07 12:14:21, Andreas Dilger wrote:
 On Apr 16, 2007  19:05 +0200, Jan Kara wrote:
attached is a patch that stores inode flags such as S_IMMUTABLE,
  S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when inode is
  written to disk. The same thing is done on GETFLAGS ioctl.
Quota code changes these flags on quota files (to make it harder for
  sysadmin to screw himself) and these changes were not correctly
  propagated into the filesystem (especially, lsattr did not show them and
  users were wondering...).
   
  +/* Propagate flags from i_flags to EXT3_I(inode)-i_flags */
  +void ext3_get_inode_flags(struct inode *inode)
  +{
  +   unsigned int flags = inode-i_flags;
  +
  +   EXT3_I(inode)-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL|
  +   EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL);
  +   if (flags  S_SYNC)
  +   EXT3_I(inode)-i_flags |= EXT3_SYNC_FL;
  +   if (flags  S_APPEND)
  +   EXT3_I(inode)-i_flags |= EXT3_APPEND_FL;
  +   if (flags  S_IMMUTABLE)
  +   EXT3_I(inode)-i_flags |= EXT3_IMMUTABLE_FL;
  +   if (flags  S_NOATIME)
  +   EXT3_I(inode)-i_flags |= EXT3_NOATIME_FL;
  +   if (flags  S_DIRSYNC)
  +   EXT3_I(inode)-i_flags |= EXT3_DIRSYNC_FL;
  +}
 
 It probably makes sense to pass struct ext3_inode_info *ei to this
 function, which is available at both callsites and avoids some pointer
 math for each access.
  OK, makes sence. Will resend the patch in a while.

   void ext3_read_inode(struct inode * inode)
   {
  struct ext3_iloc iloc;
  @@ -2736,6 +2755,7 @@ static int ext3_do_update_inode(handle_t
  if (ei-i_state  EXT3_STATE_NEW)
  memset(raw_inode, 0, EXT3_SB(inode-i_sb)-s_inode_size);
   
  +   ext3_get_inode_flags(inode);
  raw_inode-i_mode = cpu_to_le16(inode-i_mode);
  if(!(test_opt(inode-i_sb, NO_UID32))) {
  raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid));
  diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/ioctl.c 
  linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c
  --- linux-2.6.21-rc6/fs/ext3/ioctl.c2007-02-07 12:03:23.0 
  +0100
  +++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c2007-04-12 
  18:22:54.0 +0200
  @@ -28,6 +28,7 @@ int ext3_ioctl (struct inode * inode, st
   
  switch (cmd) {
  case EXT3_IOC_GETFLAGS:
  +   ext3_get_inode_flags(inode);
  flags = ei-i_flags  EXT3_FL_USER_VISIBLE;
  return put_user(flags, (int __user *) arg);
  case EXT3_IOC_SETFLAGS: {
 
 Looks fine otherwise - there is already ext3_set_inode_flags() which sets the
 VFS inode flags properly in ext3_read_inode().
  Yes, I know :).

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Copy i_flags to ext3 inode flags on write (version 2)

2007-04-17 Thread Jan Kara
  Hi,

  attached is a second version of a patch that stores inode flags such as
S_IMMUTABLE, S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when
inode is written to disk. The same thing is done on GETFLAGS ioctl.
  Quota code changes these flags on quota files (to make it harder for
sysadmin to screw himself) and these changes were not correctly
propagated into the filesystem (especially, lsattr did not show them and
users were wondering...). Andrew, could you please put the patch into your
queue? Thanks.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
ext3-specific i_flags. Hence, when someone sets these flags via a different
interface than ioctl, they are stored correctly.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/inode.c linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/inode.c
--- linux-2.6.21-rc6/fs/ext3/inode.c	2007-04-10 17:09:55.0 +0200
+++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/inode.c	2007-04-17 11:24:28.0 +0200
@@ -2581,6 +2581,25 @@ void ext3_set_inode_flags(struct inode *
 		inode-i_flags |= S_DIRSYNC;
 }
 
+/* Propagate flags from i_flags to EXT3_I(inode)-i_flags */
+void ext3_get_inode_flags(struct ext3_inode_info *ei)
+{
+	unsigned int flags = ei-vfs_inode.i_flags;
+
+	ei-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL|
+			EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL);
+	if (flags  S_SYNC)
+		ei-i_flags |= EXT3_SYNC_FL;
+	if (flags  S_APPEND)
+		ei-i_flags |= EXT3_APPEND_FL;
+	if (flags  S_IMMUTABLE)
+		ei-i_flags |= EXT3_IMMUTABLE_FL;
+	if (flags  S_NOATIME)
+		ei-i_flags |= EXT3_NOATIME_FL;
+	if (flags  S_DIRSYNC)
+		ei-i_flags |= EXT3_DIRSYNC_FL;
+}
+
 void ext3_read_inode(struct inode * inode)
 {
 	struct ext3_iloc iloc;
@@ -2736,6 +2755,7 @@ static int ext3_do_update_inode(handle_t
 	if (ei-i_state  EXT3_STATE_NEW)
 		memset(raw_inode, 0, EXT3_SB(inode-i_sb)-s_inode_size);
 
+	ext3_get_inode_flags(ei);
 	raw_inode-i_mode = cpu_to_le16(inode-i_mode);
 	if(!(test_opt(inode-i_sb, NO_UID32))) {
 		raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid));
diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/ioctl.c linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c
--- linux-2.6.21-rc6/fs/ext3/ioctl.c	2007-02-07 12:03:23.0 +0100
+++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c	2007-04-17 11:24:39.0 +0200
@@ -28,6 +28,7 @@ int ext3_ioctl (struct inode * inode, st
 
 	switch (cmd) {
 	case EXT3_IOC_GETFLAGS:
+		ext3_get_inode_flags(ei);
 		flags = ei-i_flags  EXT3_FL_USER_VISIBLE;
 		return put_user(flags, (int __user *) arg);
 	case EXT3_IOC_SETFLAGS: {
diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/include/linux/ext3_fs.h linux-2.6.21-rc6-1-ext3_flags_update/include/linux/ext3_fs.h
--- linux-2.6.21-rc6/include/linux/ext3_fs.h	2007-04-10 17:09:58.0 +0200
+++ linux-2.6.21-rc6-1-ext3_flags_update/include/linux/ext3_fs.h	2007-04-17 11:25:23.0 +0200
@@ -824,6 +824,7 @@ extern int ext3_change_inode_journal_fla
 extern int ext3_get_inode_loc(struct inode *, struct ext3_iloc *);
 extern void ext3_truncate (struct inode *);
 extern void ext3_set_inode_flags(struct inode *);
+extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 
 /* ioctl.c */


[PATCH] Copy i_flags to ext4 inode flags on write

2007-04-17 Thread Jan Kara
  Hi,

  attached is a port of the previous patch for ext3 to ext4.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
ext4-specific i_flags. Hence, when someone sets these flags via a different
interface than ioctl, they are stored correctly.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6-1-ext3_flags_update/fs/ext4/inode.c linux-2.6.21-rc6-2-ext4_flags_update/fs/ext4/inode.c
--- linux-2.6.21-rc6-1-ext3_flags_update/fs/ext4/inode.c	2007-04-10 17:09:55.0 +0200
+++ linux-2.6.21-rc6-2-ext4_flags_update/fs/ext4/inode.c	2007-04-17 12:08:54.0 +0200
@@ -2584,6 +2584,25 @@ void ext4_set_inode_flags(struct inode *
 		inode-i_flags |= S_DIRSYNC;
 }
 
+/* Propagate flags from i_flags to EXT4_I(inode)-i_flags */
+void ext4_get_inode_flags(struct ext4_inode_info *ei)
+{
+	unsigned int flags = ei-vfs_inode.i_flags;
+
+	ei-i_flags = ~(EXT4_SYNC_FL|EXT4_APPEND_FL|
+			EXT4_IMMUTABLE_FL|EXT4_NOATIME_FL|EXT4_DIRSYNC_FL);
+	if (flags  S_SYNC)
+		ei-i_flags |= EXT4_SYNC_FL;
+	if (flags  S_APPEND)
+		ei-i_flags |= EXT4_APPEND_FL;
+	if (flags  S_IMMUTABLE)
+		ei-i_flags |= EXT4_IMMUTABLE_FL;
+	if (flags  S_NOATIME)
+		ei-i_flags |= EXT4_NOATIME_FL;
+	if (flags  S_DIRSYNC)
+		ei-i_flags |= EXT4_DIRSYNC_FL;
+}
+
 void ext4_read_inode(struct inode * inode)
 {
 	struct ext4_iloc iloc;
@@ -2743,6 +2762,7 @@ static int ext4_do_update_inode(handle_t
 	if (ei-i_state  EXT4_STATE_NEW)
 		memset(raw_inode, 0, EXT4_SB(inode-i_sb)-s_inode_size);
 
+	ext4_get_inode_flags(ei);
 	raw_inode-i_mode = cpu_to_le16(inode-i_mode);
 	if(!(test_opt(inode-i_sb, NO_UID32))) {
 		raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid));
diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6-1-ext3_flags_update/fs/ext4/ioctl.c linux-2.6.21-rc6-2-ext4_flags_update/fs/ext4/ioctl.c
--- linux-2.6.21-rc6-1-ext3_flags_update/fs/ext4/ioctl.c	2007-02-07 12:03:23.0 +0100
+++ linux-2.6.21-rc6-2-ext4_flags_update/fs/ext4/ioctl.c	2007-04-17 12:09:19.0 +0200
@@ -28,6 +28,7 @@ int ext4_ioctl (struct inode * inode, st
 
 	switch (cmd) {
 	case EXT4_IOC_GETFLAGS:
+ 		ext4_get_inode_flags(ei);
 		flags = ei-i_flags  EXT4_FL_USER_VISIBLE;
 		return put_user(flags, (int __user *) arg);
 	case EXT4_IOC_SETFLAGS: {
diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6-1-ext3_flags_update/include/linux/ext4_fs.h linux-2.6.21-rc6-2-ext4_flags_update/include/linux/ext4_fs.h
--- linux-2.6.21-rc6-1-ext3_flags_update/include/linux/ext4_fs.h	2007-04-10 17:09:58.0 +0200
+++ linux-2.6.21-rc6-2-ext4_flags_update/include/linux/ext4_fs.h	2007-04-17 12:10:02.0 +0200
@@ -855,6 +855,7 @@ extern int ext4_change_inode_journal_fla
 extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
 extern void ext4_truncate (struct inode *);
 extern void ext4_set_inode_flags(struct inode *);
+extern void ext4_get_inode_flags(struct ext4_inode_info *);
 extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_block_truncate_page(handle_t *handle, struct page *page,


[PATCH] Copy i_flags to ext3 inode flags on write

2007-04-16 Thread Jan Kara
  Hello,

  attached is a patch that stores inode flags such as S_IMMUTABLE,
S_APPEND, etc. from i_flags to EXT3_I(inode)-i_flags when inode is
written to disk. The same thing is done on GETFLAGS ioctl.
  Quota code changes these flags on quota files (to make it harder for
sysadmin to screw himself) and these changes were not correctly
propagated into the filesystem (especially, lsattr did not show them and
users were wondering...).
  If people agree that this patch is fine, I can also rediff it for
ext4.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
ext3-specific i_flags. Hence, when someone sets these flags via a different
interface than ioctl, they are stored correctly.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/inode.c 
linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/inode.c
--- linux-2.6.21-rc6/fs/ext3/inode.c2007-04-10 17:09:55.0 +0200
+++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/inode.c2007-04-12 
18:19:29.0 +0200
@@ -2581,6 +2581,25 @@ void ext3_set_inode_flags(struct inode *
inode-i_flags |= S_DIRSYNC;
 }
 
+/* Propagate flags from i_flags to EXT3_I(inode)-i_flags */
+void ext3_get_inode_flags(struct inode *inode)
+{
+   unsigned int flags = inode-i_flags;
+
+   EXT3_I(inode)-i_flags = ~(EXT3_SYNC_FL|EXT3_APPEND_FL|
+   EXT3_IMMUTABLE_FL|EXT3_NOATIME_FL|EXT3_DIRSYNC_FL);
+   if (flags  S_SYNC)
+   EXT3_I(inode)-i_flags |= EXT3_SYNC_FL;
+   if (flags  S_APPEND)
+   EXT3_I(inode)-i_flags |= EXT3_APPEND_FL;
+   if (flags  S_IMMUTABLE)
+   EXT3_I(inode)-i_flags |= EXT3_IMMUTABLE_FL;
+   if (flags  S_NOATIME)
+   EXT3_I(inode)-i_flags |= EXT3_NOATIME_FL;
+   if (flags  S_DIRSYNC)
+   EXT3_I(inode)-i_flags |= EXT3_DIRSYNC_FL;
+}
+
 void ext3_read_inode(struct inode * inode)
 {
struct ext3_iloc iloc;
@@ -2736,6 +2755,7 @@ static int ext3_do_update_inode(handle_t
if (ei-i_state  EXT3_STATE_NEW)
memset(raw_inode, 0, EXT3_SB(inode-i_sb)-s_inode_size);
 
+   ext3_get_inode_flags(inode);
raw_inode-i_mode = cpu_to_le16(inode-i_mode);
if(!(test_opt(inode-i_sb, NO_UID32))) {
raw_inode-i_uid_low = cpu_to_le16(low_16_bits(inode-i_uid));
diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/ext3/ioctl.c 
linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c
--- linux-2.6.21-rc6/fs/ext3/ioctl.c2007-02-07 12:03:23.0 +0100
+++ linux-2.6.21-rc6-1-ext3_flags_update/fs/ext3/ioctl.c2007-04-12 
18:22:54.0 +0200
@@ -28,6 +28,7 @@ int ext3_ioctl (struct inode * inode, st
 
switch (cmd) {
case EXT3_IOC_GETFLAGS:
+   ext3_get_inode_flags(inode);
flags = ei-i_flags  EXT3_FL_USER_VISIBLE;
return put_user(flags, (int __user *) arg);
case EXT3_IOC_SETFLAGS: {


Re: 64bit inode number and dynamic inode table for ext4

2007-04-02 Thread Jan Kara
, in case the original inode table file
 corrupted we will not lost the inodes for the entire block group.
  If you have checksums / magic numbers, you will be able to find blocks
belonging to the inode table file. If you also implement the idea with the
chunks of inode blocks (actually, it looks like a small inode table) with a
header, you can even store all the information you need for reconstruction
in the header...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext3 behavior on power failure

2007-03-29 Thread Jan Kara
On Wed 28-03-07 19:00:54, Ric Wheeler wrote:
 Jan Kara wrote:
 [EMAIL PROTECTED] wrote:
 Hi all,
 
 We are building a new system which is going to use ext3 FS. We would 
 like to know more about the behavior of ext3 in the case of failure.  
 But before I procede, I would like to share more information about our 
 future system. 
 *  Our application always does an fsync on files
 *  When symbolic links (more specifically fast symlink) are created, 
 the host directory is also fsync'ed. * Our application is also 
 going to front an EMC disk array configured using RAID5 or RAID6.
 *  We will be using multipathing  so that we can assume that no disk 
 errors will be reported. 
 In this context , we would like to know the following for recovery after 
 a power outage:
 
 1. When will an fsck have to be run (not counting  the scheduled fsck 
 every N-mounts)?
 2. In the case of a crash, are the fsync-ed file contents and symbolic 
 links safe no matter what?
 
 Thanks,
 This is an interesting twist on some of the discussion that we have had 
 at the recent workshop and in other forums on hardening  file system in 
 order to prevent the need to fsck.
 
 The twist is that we have a disk that will not lose power without being 
 able to write to platter all of the data that has been sent - this is 
 the case for most mid-range or higher disk arrays.
 
 If the application can precisely use fsync() on files, directories and 
 symlinks, it wants to know that all objects are safe on disk that have 
 completed a successful fsync. It also wants to know that the file system 
 will not need any recovery beyond replaying transactions after a power 
 outage/reboot - simply mount, let the transactions get replayed and you 
 should be good to go without the fsck.
 
 The hard part of the question is to understand when and how often we 
 will fail to deliver this easy case. Also, does any of the hardening in 
 ext4 help here.
   I'm probably misunderstanding something because the answer seems to be
 too obvious to me :) But anyway I'll write it so that you can correct
 me:
   Due to journalling guarantees you should get consistent FS whenever
 you replay the log (unless there are some software bugs or hardware
 problems which is why fsck is run once per several mounts anyway).
   If you fsync() your data, you are guaranteed that also your data are
 safely on disk when fsync returns. So what is the question here?
 
  Honza
 
 I think that the real question here is in practice, how often does this 
 really hold to be true? When it fails, how long does it take to recover the 
 file system?
  I see, thanks for explanation :).

 There are a lot of odd errors that can happen when you monitor a large 
 enough number of file systems. In my experience, I would guess that disk 
 errors are clearly the leading cause of issues, followed by software bugs 
 (file system, firmware, etc) and then a group of errors caused by various 
 occasional things (bad DRAM in the server/HBA/disk, bad cables/etc). Note 
 that using a high end array does not eliminate errors, it just reduces the 
 rate (hopefully by a large amount).
 
 What is really hard to predict is the rate of the failures that require 
 fsck with our current file system (say for a specific hardware setup) and 
 how changes like the checksumming in ext4 can help us ride through these 
 errors without needing a full fsck.
  OK. All the features I've seen so far were more aiming to detecting that
such an unexpected problem happened rather than trying to fix it or make
fixing it faster. So currently it seems to me that any such unexpected
failure requires fsck...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext2/ext3 still under active maintenance?

2007-03-28 Thread Jan Kara
  Hello,

 I'm trying to look at the ext2/ext3 source for learning Linux FS 
 development. I'm avoiding ext4 for now because it's under active 
 development and it's too much to chew before I understand the previous 
 versions. But am I going to get an outdated view of the right way to 
 program filesystems if I look at those, or is the code just as shiny as 
 ext4's?
  Definitely you won't get an outdated view of how a filesystem is written.
ext4 uses ext3 code as its baseline and although there are new features
flowing in, the basic things stay the same.

Honza
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext3 behavior on power failure

2007-03-28 Thread Jan Kara
  If you fsync() your data, you are guaranteed that also your data are
 safely on disk when fsync returns. So what is the question here?
 Pardon a newbie's intrusion, but I do know this isn't true. There is a 
 window of possible loss because of the multitude of layers of caching, 
 especially within the drive itself. Unless there is a super_duper_fsync() 
 that is able to actually poll the hardware and get a confirmation that the 
 internal buffers are purged?
  OK :), to correct myself: After fsync() returns, all the data is acked from
the disk (or at least it should be like that unless there's a bug
somewhere). So if there are some caches in the hardware which the hardware
is not able to flush on power failure, that's a bad luck... That's why
you should turn off write caching on cheaper disks if you really care
about data integrity.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext3 behavior on power failure

2007-03-28 Thread Jan Kara
On Wed 28-03-07 10:17:33, [EMAIL PROTECTED] wrote:
 In my case the disk cache is not a  problem - We use an emc disk array
 the write cache is protected - 
 Once the data has made over the disk array we can assume it is safe - 
  Then if you are able to reproduce the situation that not all data
is written after fsync(); poweroff; that is a bug worth reporting..

Honza
 
 -Original Message-
 From: John Anthony Kazos Jr. [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, March 28, 2007 9:17 AM
 To: Jan Kara
 Cc: wheeler, richard; armangau, philippe; [EMAIL PROTECTED];
 linux-ext4@vger.kernel.org; [EMAIL PROTECTED]
 Subject: Re: Ext3 behavior on power failure
 
   If you fsync() your data, you are guaranteed that also your data are
  safely on disk when fsync returns. So what is the question here?
 
 Pardon a newbie's intrusion, but I do know this isn't true. There is a 
 window of possible loss because of the multitude of layers of caching, 
 especially within the drive itself. Unless there is a
 super_duper_fsync() 
 that is able to actually poll the hardware and get a confirmation that
 the 
 internal buffers are purged?
 
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-07 Thread Jan Kara
On Tue 06-03-07 12:23:22, Eric Sandeen wrote:
 Jan Kara wrote:
  On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
  Christoph Hellwig wrote:
  fallocate with the whence argument and flags is already quite complicated,
  I'd rather have another call for placement decisions, that would
  be called on an fd to do placement decissions for any further allocations
  (prealloc, write, etc)
  Yes, posix_fallocate shouldn't be made more complicated.  But I don't
  understand why requesting linear layout of the blocks should be an
  option.  It's always an advantage if the blocks requested this way are
  linear on disk.  So, the kernel should always do its best to make this
  happen, without needing an additional option.
Actually, it's not that simple. You want linear layout of blocks you are
  going to read. That is not necessary a linear layout of blocks in a single
  file - trace sometime a start of some complicated app like KDE. You find
  it's seeking like a hell because it needs a few blocks from a ton of
  distinct files (shared libs, config files, etc). As these files are mostly
  read only, it's advantageous to interleave them on disk or at least keep
  them close.
 
 At some point shouldn't the apps be fixed, rather than do crazy things
 with the filesystem?  :)
  Yes :) That's basically what we told KDE developpers when they were
complaining ;) But it's hard to fix it for them too (because of some
desktop specs requiring lots of different text config files which can
change anytime - don't ask me who designed it). Moreover for example for
loading shared libraries from which you need just a few blocks scattered
all over the place the problem is in ELF itself.
  I'll probably first write some userspace fs-reorganizer to find out how
much these changes in layout are able to give you in performance (i.e.
whether it's worth the effort of more complicated kernel online
defragmenter).

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-06 Thread Jan Kara
On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
 Christoph Hellwig wrote:
  fallocate with the whence argument and flags is already quite complicated,
  I'd rather have another call for placement decisions, that would
  be called on an fd to do placement decissions for any further allocations
  (prealloc, write, etc)
 
 Yes, posix_fallocate shouldn't be made more complicated.  But I don't
 understand why requesting linear layout of the blocks should be an
 option.  It's always an advantage if the blocks requested this way are
 linear on disk.  So, the kernel should always do its best to make this
 happen, without needing an additional option.
  Actually, it's not that simple. You want linear layout of blocks you are
going to read. That is not necessary a linear layout of blocks in a single
file - trace sometime a start of some complicated app like KDE. You find
it's seeking like a hell because it needs a few blocks from a ton of
distinct files (shared libs, config files, etc). As these files are mostly
read only, it's advantageous to interleave them on disk or at least keep
them close.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jan Kara
 On Fri, 02 Mar 2007 09:40:54 +1100
 Nathan Scott [EMAIL PROTECTED] wrote:
 
 
 On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
 
 On Fri, 2 Mar 2007 00:04:45 +0530
 Amit K. Arora [EMAIL PROTECTED] wrote:
 
 
 This is to give a heads up on few patches that we will be soon coming up
 with. These patches implement a new system call sys_fallocate() and a
 new inode operation fallocate, for persistent preallocation. The new
 system call, as Andrew suggested, will look like:
 
  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
 
 ...
 
 I'd agree with Eric on the command flag extension.
 
 Seems like a separate syscall would be better, command sounds
 a bit ioctl like, especially if that command is passed into the
 filesystems..
 
 
 
 madvise, fadvise, lseek, etc seem to work OK.
 
 I get repeatedly traumatised by patch rejects whenever a new syscall gets
 added, so I'm biased.
 
 The advantage of a command flag is that we can add new modes in the future
 without causing lots of churn, waiting for arch maintainers to catch up,
 potentially adding new compat code, etc.
 
 Rename it to mode? ;)
 
 I am wondering if it is useful to add another mode to advise block 
 allocation policy? Something like indicating which physical block/block 
 group to allocate from (goal), and whether ask for strict contigous 
 blocks. This will help preallocation or reservation to choose the right 
 blocks for the file.
  Yes, I also think this would be useful so you can guide
preallocation for things like defragmentation (e.g. preallocate space
for the file being defragmented and move the file to it).

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ext3 not writing inode flags

2007-02-21 Thread Jan Kara
  Hello,

  I've noticed that if e. g. S_IMMUTABLE flag is set on an inode, ext3
does not propagate it to its inode flags on write. This also leads to
the situation that lsattr reports no flag set but the file is in fact
immutable. Would a fix be accepted? Actually, the quota code is the
only place where S_IMMUTABLE flag is set to an inode so this is not a
big issue but still...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-08 Thread Jan Kara
On Thu 08-02-07 01:45:29, Andrew Morton wrote:
 snip
I though Andreas meant any write changes - i.e. you check that noone
  has open file descriptor for writing and block any new open for writing.
  That can be done quite easily.
Anyway, I agree with you that userspace solution to a possible page
  cache pollution is preferable after thinking about it for a while.
  As I've been thinking about it, we could actually do the copying
  from user space. We could do something like:
block any writes to file (as I described above)
craft new inode with blocks allocated as we want (using preallocation,
  we should mostly have the kernel infrastructure we need)
copy data using splice syscall
call the kernel to switch data
  
 
 I don't think we need to block any writes to any file or anything.
 
 To move a page within a file:
 
   fd = open(file);
   p = mmap(fd);
   the_page_was_in_core = mincore(p, offset);
   munmap(p);
   ioctl(fd, ..., new_block);
 
   kernel
   read_cache_page(inode, offset);
   lock_page(page);
   if (try_to_free_buffers(page)) {
   relocate the page
   set_page_dirty(page);
   }
   unlock_page(page);
 
   if (the_page_was_in_core) {
   sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE|
   SYNC_FILE_RANGE_WRITE|
   SYNC_FILE_RANGE_WAIT_AFTER);
   fadvise(fd, offset, FADV_DONTNEED);
   }
 
 completely coherent with pagecache, quite safe in the presence of mmap,
 mlock, O_DIRECT, everything else.  Also fully journallable in-kernel.
  Yes, this is the simple way. But I see two disadvantages:
1) You'd like to relocate metadata (indirect blocks) too. For that you need
   a different mechanism. In my approach, you can mostly assume you've got
   sanely laid out metadata and so the existence of such mechanism is not
   so important.
2) You'd like to allocate new blocks in big chunks. So your kernel function
   should rather take a range. Also when you fail in the middle of
   relocating a file (for example the block you'd like to use is already
   taken by someone else), I find it nice if you can return at least to the
   original state. But that's probably not important.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-08 Thread Jan Kara
On Thu 08-02-07 02:32:13, Andrew Morton wrote:
 On Thu, 8 Feb 2007 11:21:02 +0100 Jan Kara [EMAIL PROTECTED] wrote:
 
  On Thu 08-02-07 01:45:29, Andrew Morton wrote:
   snip
  I though Andreas meant any write changes - i.e. you check that noone
has open file descriptor for writing and block any new open for writing.
That can be done quite easily.
  Anyway, I agree with you that userspace solution to a possible page
cache pollution is preferable after thinking about it for a while.
As I've been thinking about it, we could actually do the copying
from user space. We could do something like:
  block any writes to file (as I described above)
  craft new inode with blocks allocated as we want (using preallocation,
we should mostly have the kernel infrastructure we need)
  copy data using splice syscall
  call the kernel to switch data

   
   I don't think we need to block any writes to any file or anything.
   
   To move a page within a file:
   
 fd = open(file);
 p = mmap(fd);
 the_page_was_in_core = mincore(p, offset);
 munmap(p);
 ioctl(fd, ..., new_block);
   
 kernel
 read_cache_page(inode, offset);
 lock_page(page);
 if (try_to_free_buffers(page)) {
 relocate the page
 set_page_dirty(page);
 }
 unlock_page(page);
   
 if (the_page_was_in_core) {
 sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE|
 SYNC_FILE_RANGE_WRITE|
 SYNC_FILE_RANGE_WAIT_AFTER);
 fadvise(fd, offset, FADV_DONTNEED);
 }
   
   completely coherent with pagecache, quite safe in the presence of mmap,
   mlock, O_DIRECT, everything else.  Also fully journallable in-kernel.
Yes, this is the simple way. But I see two disadvantages:
  1) You'd like to relocate metadata (indirect blocks) too.
 
 Well.  Do we really?  Are we looking for a 100% solution here, or a 90% one?
  Umm, I think that for ext3 having data on one end of the disk and
indirect blocks on the other end of the disk does not quite help (not
mentioning that it can create bad free space fragmentation over the time).
I have not measured it but I'd guess that it would erase the effect of
moving data closer together. At least for sequential reads..

 Relocating data is the main thing.  After that, yeah, relocating metadata,
 inodes and directories is probably a second-order thing.
 
  For that you need
 a different mechanism.
 
 I suspect a similar approach will work there: load and lock the
 buffer_heads (or maybe just the top-level buffer_head) and then alter their
 contents.  It could be that verify_chain() will just magically do the right
 thing there, but some changes might be needed.
  Yes, it could be done. I just wanted to point to the fact that things may
not be as simple in your solution either...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-05 Thread Jan Kara
;
 +
 + page = read_cache_page(inode-i_mapping,
 +org_offset, (filler_t 
 *)inode-i_mapping-a_ops-readpage,
 +NULL);
 +
 + if (IS_ERR(page)) {
 + ret = PTR_ERR(page);
 + return ret;
 + }
 +
 + wait_on_page_locked(page);
 + lock_page(page);
 + /* release old bh and drop refs */
 + try_to_release_page(page, 0);
 + ret = ext4_ext_replace_branches(inode, tmp_inode, org_offset, 
 + dest_offset, 1, delete_start);
 + if (ret  0)
 + goto ERR;
 +
 + if (org_offset == ((inode-i_size - 1)  PAGE_SHIFT))
 + offset_in_page = (inode-i_size  (PAGE_CACHE_SIZE - 1));
 +
 + ret = mapping-a_ops-prepare_write(filp, page,
 + 0, offset_in_page);
 + if (ret)
 + goto ERR;
 +
 + ret = mapping-a_ops-commit_write(filp, page,
 +0, offset_in_page);
 +ERR:
 + unlock_page(page);
 + page_cache_release(page);
 +
 + return (ret  0 ? ret : 0);
 +}
 +
 +/**
   * ext4_ext_new_extent_tree -  allocate contiguous blocks
   * @inode:   inode of the original file
   * @tmp_inode:   inode of the temporary file
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Fix kmalloc flags used in ext3 with an active journal handle

2006-12-22 Thread Jan Kara
 On Tue, 19 Dec 2006 18:22:03 -0800
 Suzuki [EMAIL PROTECTED] wrote:
 
  
  Andrew Morton wrote:
   On Tue, 19 Dec 2006 17:58:12 -0800
   Suzuki [EMAIL PROTECTED] wrote:
   
   
  * Fix the kmalloc flags used from within ext3, when we have an active 
  journal handle
  
If we do a kmalloc with GFP_KERNEL on system running low on memory, 
   with an active journal handle, we might end up in cleaning up the fs 
   cache flushing dirty inodes for some other filesystem. This would cause 
   hitting a J_ASSERT() in :
   
   
   The change might be needed (haven't looked at it yet).  But I'd like to 
   see
   the full BUG trace, please.  To see the callchain.
  
  Here is the call trace which was hit by one of our test teams. This was 
  from fs/ext3/xattr.c. While looking for similar calls I found the others 
  described in the patch.
  
  Assertion failure in journal_start() at fs/jbd/transaction.c:274: handle-
   h_transaction-t_journal == journal
  kernel BUG at fs/jbd/transaction.c:274!
  illegal operation: 0001 [#1]
  CPU:0Not tainted (2.6.5-7.282-s390x SLES9_SP3_BRANCH-20061031152356)
  Process dbench (pid: 14070, task: 025617f0, ksp: 01057630)
  Krnl PSW : 07018000 08837b38 (journal_start+0x90/0x15c 
  [jbd])
  Krnl GPRS:  00507fc0 002b 
  01056d80
  08837b36 2885 08841da6 
  
  001bfaa0 03483d08 0002 
  07a8bda0
  08833000 088a7d08 08837b36 
  01056e80
  Krnl Code: 00 00 58 10 b0 0c a7 1a 00 01 b9 04 00 2b 50 10 b0 0c e3 40
  Call Trace:
[088a30fc] ext3_journal_start+0x8c/0xa4 [ext3]
[08896822] ext3_dirty_inode+0x3a/0xe0 [ext3]
[001ca362] __mark_inode_dirty+0x1ae/0x1c8
[001bfaa0] iput+0xbc/0xf0
[001bdcca] prune_dcache+0x29e/0x584
[001bdfe4] shrink_dcache_memory+0x34/0x54
[0017b100] shrink_slab+0x15c/0x250
[0017b6e4] try_to_free_pages+0x1c0/0x2a4
[00170276] __alloc_pages+0x2ba/0x4e0
[0017059a] __get_free_pages+0x4e/0x8c
[00174ea2] cache_alloc_refill+0x2a6/0x868
[00175540] __kmalloc+0xdc/0xe0
[088a4e62] ext3_xattr_set_handle+0x114a/0x174c [ext3]
[088a54e4] ext3_xattr_set+0x80/0xd0 [ext3]
[088a6312] ext3_xattr_user_set+0xce/0xe4 [ext3]
[088a5f1e] ext3_setxattr+0x17e/0x18c [ext3]
[001c88e6] setxattr+0x14a/0x234
[001c8a80] sys_fsetxattr+0xb0/0x110
[0011fc10] sysc_noemu+0x10/0x16
 
 How did we get from iput() into __mark_inode_dirty()?  I can't see it in
 mainline, nor in 2.6.5 which you appear to be using...
  Hmm, I think it could happen at least via quota code (but then I would expect
to see some entry in the backtrace about it).

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Defragmentation interface

2006-11-06 Thread Jan Kara
 On Fri, Nov 03, 2006 at 03:30:30PM +0100, Jan Kara wrote:
   BTW, does use of sysfs mean ASCII encoding of all the data
   passing between kernel and userspace?
Not necessarify but mostly yes. At least I intend to have all the
  files I have proposed in ASCII.
 
 Ok - that's how you're looking to avoid 32/64bit compatibility issues?
  Yes.

 It will make the interface quite verbose, though, and entail significant
 encoding and decoding costs
  It would be verbose. On the other hand for most things it should not
matter (not too much data goes through the interface and it's not too
performance critical).

meta/allocation/free_blocks
  - RO file - if you read from fpos F, you'll get a list of extents
describing areas with free blocks (as many as fits into supplied
buffer) starting from block F. Fpos of your file descriptor is
shifted to the first unreported free block.
   
   - linear search properties == Bad. (think fs sizes of hundreds of
 terabytes - XFS is already deployed with filesystems of this size)
OK, so what do you propose? You want syscall find_free_blocks() and
  my idea of it was that it will do basically the same think as my
 snip

 Right. More complicated requests are something that we need to
 support in XFS in the short-medium term. We _need_ an interface to
 XFS that allows complex, compound allocation policies to be
 accessible from userspace - and this is not just for defrag
 programs.
 
 I think a set of well defined allocation primitives suits a syscall
 interface far better than a per-filesystem sysfs interface.
  I'm only afraid of one thing: Once you define a syscall it's hard to
change anything and for this kind of thing I'm not sure we are able to
tell what we'll need in two years... That is basically my main
concern with implementing this interface as a syscall.

   - every time you fail an allocation, you need to reread this file.
Yes, that's the most serious disadvantage I see. Do you see any way
  out of it in any interface?
 
 I haven't really thought about solutions for this interface - the
 syscall interface doesn't have this problem because of the way you
 can specify where you want free blocks from
  But that does not solve the problem with having to repeat the search,
does it? Only with the syscall interface filesystem can possibly search
for free blocks more efficiently..

   - stripe unit and stripe width need to be exposed so defrag too
 can make correct placement decisions.
fs-specific thing...
 
 As Andreas said, this isn't fs-specific. XFS takes sunit and swidth
 as mkfs parameters so it can align both metadata and data optimally
 for RAID devices. Other fileystems have different methods of
 specifying this (ext2/3/4 use -E stride-size for this), but it would
 need to be exposed in some way
  I see. But then shouldn't we expose it regardless the interface
(sysfs/syscall) we choose so that userspace can take it into account
when picking where to allocate?

meta/nodes/ident
  - this should be a directory containing things specific for a 
fs-object
with identification ident. In case of ext3 these would be inode
numbers, I guess this should be plausible also for XFS and others
but I'm open to suggestions...
  - directory contains the following:
  alloc_goal
- block number with current allocation goal
   
   The kernel has to store this across syscalls until you write into
   data/alloc? That sounds dangerous...
This is persistent until kernel decides to remove inode from memory.
  So while you have the file open, you are guaranteed that kernel keeps
  the information.
 
 But the inode hangs around long after the file is closed. How
 do you guarantee that this gets cleared when it needs to be?
  It gets cleared (or rewritten) as soon as alloc_goal is used for
allocation or when inode gets removed from memory. Ext3 currently has
such thing (settable via ioctl()) and it seems to work reasonably well.

 I just don't like the principle of this interface when we are
 talking about moving data around online - it's inherently unsafe
 when you consider mutli-threaded or -process access to an inode.
  Yes, we certainly have to make sure we don't do something destructive
in such case. On the other hand if several processes try to guide
allocation in the same file, results are uncertain and that's IMHO ok.
 
   The major difference is that one implementation requires 3 new
   generically useful syscalls, and the other requires every filesystem
   to implement a metadata filesystem and require root priviledges
   to use.
Yes. IMO the complexity of implementation is almost the same in the
  syscall case and in my sysfs case. What syscall would do is just do some
  basic checks and redirect everything into fs-specific call anyway...
 
 Sure, but you don't need to implement a new filesystem in every
 filesystem to support it
  But the cost of this meta

Re: [RFC] Defragmentation interface

2006-11-03 Thread Jan Kara
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] Defragmentation interface

2006-11-02 Thread Jan Kara
  Hi,

  from the thread after my patch implementing ext3 online
defragmentation I found out that probably the only (and definitely the
biggest) issue is the interface. Someone wants is common enough so that
we can profit from common tools for several filesystems, others object
that some applications, e.g. defragmenter, need to know something about
ext3 internals to work reasonably well. Moreover ioctl() is ugly and has
some compatibility issues, on the other hand ext2meta is too lowlevel,
fs-specific and it would be hard to do any reasonable application
crash-safe...
  So in this email I try to propose some interface which should hopefully
address most of the concerns. The type of the interface is sysfs like
(idea taken from ext2meta) - that has a few advantages:
 - no 32/64-bit compatibility issues
 - easily extensible
 - generally nice ;)

  Each filesystem willing to support this interface implements special
filesystem (e.g. ext3meta, XFSmeta, ...) and admin/defrag-tool mounts it
to some directory. There are parts of this interface which should be
common for all filesystems (so that tools don't have to care about
particular filesystem and still get some useful results), other parts
are fs-specific. Here is basic structure I propose:

meta/features
  - bitmap of features supported by the interface (ext2/3-like) so that
the tool can verify whether it understands the interface and don't
mess with it otherwise
meta/allocation/free_blocks
  - RO file - if you read from fpos F, you'll get a list of extents
describing areas with free blocks (as many as fits into supplied
buffer) starting from block F. Fpos of your file descriptor is
shifted to the first unreported free block.
meta/super/blocksize
  - filesystem block size
meta/super/id
  - filesystem ID (for paranoid tools to verify that they are accessing
really the right meta-filesystem)
meta/nodes/ident
  - this should be a directory containing things specific for a fs-object
with identification ident. In case of ext3 these would be inode
numbers, I guess this should be plausible also for XFS and others
but I'm open to suggestions...
  - directory contains the following:
  alloc_goal
- block number with current allocation goal
  data/extents
- if you read from this file, you get a list of extents describing
  data blocks (and holes) of the file. The listing starts at logical
  block fpos. Fpos is shifted to the first unreported data block.
  data/alloc
- you write there a number L and fs allocates L blocks to a file
  (preferable from alloc_goal) starting from file-block fpos. Fpos
  is shifted after the last block allocated in this call.
  data/reloc
- you write there ident and relocation of data happens as follows:
  All blocks that are allocated both in original file and ident
  are relocated to ident. Write returns number of relocated
  blocks.
  metadata/
- this directory is fs-specific, contains fs block pointers and
  similar. Here I describe what I'd like to have for ext3.
  metadata/alloc
- you write there number Level and a fs allocates an indirect block
  to a file for logical block fpos at level Level of the indirect
  tree. Parent indirect block must be already allocated.
  metadata/reloc
- you write two numbers ident Level and an indirect block for
  logical offset fpos at level Level will be swapped with
  corresponding indirect block of ident.

  This is all that is needed for my purposes. Any comments welcome.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-26 Thread Jan Kara
 On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
  On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
   On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
   So how do you then get the generic interface to allocate blocks
   specified by userspace race free?
  
  As has been repeatedly stated, there is no generic.  There MUST be
  filesystem-specific knowledge during these operations.
 
 What information? All we need to know is where the free disk space
 is, and have a method to attempt to allocate from it. That's _easy_
 to abstract into a common interface via the VFS
 
Further, in the case being discussed in this thread, ext2meta has
already been proven a workable solution.
   
   Sure, but that's not a generic solution to a problem common to
   all filesystems
  
  You clearly don't know what I'm talking about.  ext2meta is an example
  of a filesystem-specific metadata access method, applicable to tasks
  such as online optimization.
 
 I know exactly what ext2meta is. I said it's not a generic solution
 and you say its a filesystem specific solution.  I think we're
 agreeing here. ;)
 
 We don't need to expose anything filesystem specific to userspace to
 implement this.  Online data movement (i.e. the defrag mechanism)
 becomes something like:
 
   do {
   get_free_list(dst_fd, location, len, list)
   /* select extent to use */
  Upto this point I can imagine we can be perfectly generic.

   alloc_from_list(dst_fd, list[X], off, len)
   } while (ENOALLOC)
   move_data(src_fd, dst_fd, off, len);
  With these two it's not clear how well can we do with just a generic
interface. Every filesystem needs to have some additional metadata to
keep list of data blocks. In case of ext2/ext3/reiserfs this is not
a negligible amount of space and placement of these metadata is important
for performance. So either we focus only on data blocks and let
implementation of alloc_from_list() allocate metadata wherever it wants
(but then we get suboptimal performace because there need not be space
for indirect blocks close before our provided extent) or we allocate
metadata from the provided list, but then we need some knowledge of fs
to know how much should we expect to spend on metadata and where these
metadata should be placed. For example if you know that indirect block
for your interval is at block B, then you'd like to allocate somewhere
close after this point or to relocate that indirect block (and all the
data it references to). But for that you need to know you have something
like indirect blocks = filesystem knowledge.
  So I think that to get this working, we also need some way to tell
the program that if it wants to allocate some data, it also needs to
count with this amount of metadata and some of it is already allocated
in given blocks...

 I see substantial benefit moving forward from having filesystem
 independent interfaces. Many features that  filesystems implement
 are common, and as time goes on the common feature set of the
 different filesystems gets larger. So why shouldn't we be
 trying to make common operations generic so that every filesystem
 can benefit from the latest and greatest tool?
  So you prefer to handle only data blocks part of the problem and let
filesystem sort out metadata?

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Oct 24, 2006  15:44 -0400, Theodore Tso wrote:
  First of all, we would need a way of allowing userpsace to specify
  which blocks should be used in the preallocation.
 
 Presumably it could do this in the same way it will be specifying
 which blocks to relocate in the defragger - by passing an extent.
 You would be required to pass the file offset for which to preallocate,
 and optionally an extent for the on-disk allocation itself (if none is
 supplied the kernel will allocate the best extent it can).
 
  Secondly, we would need a way of marking blocks as preallocated but
  not pre-zeroed; otherwise we would have to zero out all of the blocks
  in order to assure security (don't want userspace programs seeing the
  previous contents of the data blocks), only to do the copy and the
  extents vector swap.
 
 This could be mitigated by having the preallocation be done (in the
 defragment case) against a temporary inode in the orphan list (as
 the initial patch did) so if there is a crash it will be released.
 The temporary inode will not be linked into the namespace so it cannot
 be read - only used to hold preallocation.  If this was a write-only
 file handle then we should be OK?
 
 For defragger purposes this would need:
 
 - allocate new temporary inode (VFS + fs, returns write-only fh if
fs can't properly handle uninitalized extents, or doesn't request
full-extent zeroing)
 
for each extent to defragment {
   - preallocate extents on temp inode (fs specific internals)
   - copy data from orig to temp at offset X (VFS, splice or
  e.g. sys_copyfile(src, dst, offset, count) which Linus agreed
  to at KS '05 for network filesystems)
   - migrate copied extent to original inode (fs specific internals)
}
 
 - free temporary inode (just close of temp fh, frees unmigrated extents).
  Yes, this sounds feasible. We could split the defrag ioctl into two
pieces (addition of given extent to a file and swapping of extents), which
can have generic interface... 

 I don't think this is much more work than implementing all of this
 functionality as part of a monolithic online defrag function, assuming
 we don't require full-file copies in order to do defrag.
  Yes, it's not more work than supporting swapping of extents in the
middle of the file. I've just not yet decided how to handle indirect
blocks in case of relocation in the middle of the file. Should they be
relocated or shouldn't they? Probably they should be relocated at least
in case they are fully contained in relocated interval or maybe better
said when all the blocks they reference to are also in the interval
(this handles also the case of EOF). But still if you would like to
relocate the file by parts this is not quite what you want (you won't be
able to relocate indirect blocks in the boundary of intervals) :(.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
Yes, this sounds feasible. We could split the defrag ioctl into two
  pieces (addition of given extent to a file and swapping of extents), which
  can have generic interface... 
 
 An ioctl is UGLY.
  Agreed.

 This was discussed years ago.  Google for 'Alexander Viro' and
 'ext2meta'.  That's a clean, flexible, extensible way to access metadata
 online.  No need for ioctl binary translation across 32bit-64bit, or
 any other ioctl issue.
  I've briefly looked at this and this kind of interface has some
appeal. On the other hand it's not obvious to me, how to implement in
this interface *atomic* operation copy data from file F to given set of
blocks and rewrite pointers to original blocks with pointers to new
blocks. Something like this is needed for what we want to do...
Also if we'd like to implement operation like add this block to file F
at position P we have to make sure that all the necessary updates
(bitmap updates, inode updates, indirect block updates) go into one
transaction. Which basically mean that either ext3meta has to have a way
how to do this in a single operation, or we have to give userspace a way
to start/stop transaction and that starts to be really a mess because of
various deadlocks and so on.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Wed, Oct 25, 2006 at 07:58:51PM +0200, Jan Kara wrote:
I've briefly looked at this and this kind of interface has some
  appeal. On the other hand it's not obvious to me, how to implement in
  this interface *atomic* operation copy data from file F to given set of
  blocks and rewrite pointers to original blocks with pointers to new
  blocks. Something like this is needed for what we want to do...
  Also if we'd like to implement operation like add this block to file F
  at position P we have to make sure that all the necessary updates
  (bitmap updates, inode updates, indirect block updates) go into one
  transaction. Which basically mean that either ext3meta has to have a way
  how to do this in a single operation, or we have to give userspace a way
  to start/stop transaction and that starts to be really a mess because of
  various deadlocks and so on.
 
 Agreed, this issues exist.  But these issues exist independent of
 whether an ioctl or ext3meta is used.  It's all the responsibility
 of the implementor to define the interface.
 
 My contention is that ext3meta interface method would be much more
 robust than ioctl.  It's a namespace inside which you can define any
 inodes/dirents you wish, for the operations you desire.
  I see. So you mean that in our ext3meta filesystem we'd have a file
named add_this_extent_to_inode and a file reloc_inode_interval and
they'd be fed essentially the same info as the current ioctl interface and
do the same thing as we currently do. Hmm, I don't find it that nice any
more but yes, this would work.

 Heck, according to my sf.net/projects/gkernel CVS log, you offered
 some helpful review comments to me when I was implementing ext2meta ;-)
  Looking at those mails it was already quite some time ago so I
forgot about it  ;)
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Oct 23, 2006  18:03 +0200, Jan Kara wrote:
  Andreas Dilger wrote:
   I would in fact go so far as to allow only a single extent to be specified
   per call.  This is to avoid the passing of any pointers as part of the
   interface (hello ioctl police :-), and also makes the kernel code simpler.
   I don't think the syscall/ioctl overhead is significant compared to the
   journal and IO overhead.
 
  ...it makes it kind of
  harder to tell where indirect blocks would go - and it would be
  impossible for the defragmenter to force some unusual placement of
  indirect blocks...
 
 It would be possible to specify indirect block relocation in same manner
 as regular block relocation I think.  Allocate a new block, copy contents,
 flush block from cache, fix up reference (inode, dindirect), commit.
  Yes, but there's a question of the interface to this operation. How to
specify which indirect block I mean? Obviously we could introduce
separate call for remapping indirect blocks but I find this solution
kind of clumsy...

Bye
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >