Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails

2007-12-14 Thread Badari Pulavarty
On Tue, 2007-12-11 at 17:00 -0800, Zach Brown wrote:
 Hisashi Hifumi wrote:
  Hi.
  
  Current dio has some problems:
  1, In ext3 ordered, dio write can return with EIO because of the race
  between invalidation of
  a page and jbd. jbd pins the bhs while committing journal so
  try_to_release_page fails when jbd
  is committing the transaction.
 
 Yeah.  It sure would be fantastic if some ext3 expert could stop this
 from happening somehow.  But that hasn't happened in.. uh.. Badari, for
 how many years has this been on the radar? :)

I used to have a test case that would reproduce the problem some what
consistently. But with invalidate_range() introduction and Jan Kara's
re-write of journal commit handling, I can't reproduce the problem
anymore. So I gave up fixing the problem which I can't reproduce.

If anyone has a testcase - I can take a look at the problem again.

Thanks,
Badari



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-11-02 Thread Badari Pulavarty
On Thu, 2007-11-01 at 10:10 -0800, Badari Pulavarty wrote:
 On Thu, 2007-11-01 at 11:51 -0400, Chris Mason wrote:
  On Thu, 01 Nov 2007 08:38:57 -0800
  Badari Pulavarty [EMAIL PROTECTED] wrote:
  
   On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote:
On Wed, 31 Oct 2007 08:14:21 -0800
Badari Pulavarty [EMAIL PROTECTED] wrote:
 
 I tried data=writeback mode and it didn't help :(

Ouch, so much for the easy way out.

 
 unable to release the page 262070
 bh c000211b9408 flags 110029 count 1 private 0
 unable to release the page 262098
 bh c00020ec9198 flags 110029 count 1 private 0
 memory offlining 3f000 to 4 failed
 

The only other special thing reiserfs does with the page cache is
file tails.  I don't suppose all of these pages are index zero in
files smaller than 4k?
   
   Ah !! I am so blind :(
   
   I have been suspecting reiserfs all along, since its executing
   fallback_migrate_page(). Actually, these buffer heads are
   backing blockdev. I guess these are metadata buffers :( 
   I am not sure we can do much with these..
  
  Hmpf, my first reply had a paragraph about the block device inode
  pages, I noticed the phrase file data pages and deleted it ;)
  
  But, for the metadata buffers there's not much we can do.  They are
  included in a bunch of different lists and the patch would
  be non-trivial.
 
 Unfortunately, these buffer pages are spread all around making
 those sections of memory non-removable. Of course, one can use
 ZONE_MOVABLE to make sure to guarantee the remove. But I am
 hoping we could easily group all these allocations and minimize
 spreading them around. Mel ?
 


BTW, I am having better luck with being able to offline sections
of memory on x86-64, if I take out __GFP_MOVABLE flag for 
blockdev pages. (in grow_dev_page()).

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-11-01 Thread Badari Pulavarty
On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote:
 On Wed, 31 Oct 2007 08:14:21 -0800
 Badari Pulavarty [EMAIL PROTECTED] wrote:
  
  I tried data=writeback mode and it didn't help :(
 
 Ouch, so much for the easy way out.
 
  
  unable to release the page 262070
  bh c000211b9408 flags 110029 count 1 private 0
  unable to release the page 262098
  bh c00020ec9198 flags 110029 count 1 private 0
  memory offlining 3f000 to 4 failed
  
 
 The only other special thing reiserfs does with the page cache is file
 tails.  I don't suppose all of these pages are index zero in files
 smaller than 4k?

Ah !! I am so blind :(

I have been suspecting reiserfs all along, since its executing
fallback_migrate_page(). Actually, these buffer heads are
backing blockdev. I guess these are metadata buffers :( 
I am not sure we can do much with these..

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-11-01 Thread Badari Pulavarty
On Thu, 2007-11-01 at 11:51 -0400, Chris Mason wrote:
 On Thu, 01 Nov 2007 08:38:57 -0800
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote:
   On Wed, 31 Oct 2007 08:14:21 -0800
   Badari Pulavarty [EMAIL PROTECTED] wrote:

I tried data=writeback mode and it didn't help :(
   
   Ouch, so much for the easy way out.
   

unable to release the page 262070
bh c000211b9408 flags 110029 count 1 private 0
unable to release the page 262098
bh c00020ec9198 flags 110029 count 1 private 0
memory offlining 3f000 to 4 failed

   
   The only other special thing reiserfs does with the page cache is
   file tails.  I don't suppose all of these pages are index zero in
   files smaller than 4k?
  
  Ah !! I am so blind :(
  
  I have been suspecting reiserfs all along, since its executing
  fallback_migrate_page(). Actually, these buffer heads are
  backing blockdev. I guess these are metadata buffers :( 
  I am not sure we can do much with these..
 
 Hmpf, my first reply had a paragraph about the block device inode
 pages, I noticed the phrase file data pages and deleted it ;)
 
 But, for the metadata buffers there's not much we can do.  They are
 included in a bunch of different lists and the patch would
 be non-trivial.

Unfortunately, these buffer pages are spread all around making
those sections of memory non-removable. Of course, one can use
ZONE_MOVABLE to make sure to guarantee the remove. But I am
hoping we could easily group all these allocations and minimize
spreading them around. Mel ?

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-10-31 Thread Badari Pulavarty
On Tue, 2007-10-30 at 18:58 -0400, Chris Mason wrote:
 On Tue, 30 Oct 2007 13:54:05 -0800
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote:
   On Tue, 30 Oct 2007 10:27:04 -0800
   Badari Pulavarty [EMAIL PROTECTED] wrote:
   
Hi,

While testing hotplug memory remove, I ran into this issue. Given
a range of pages hotplug memory remove tries to migrate those
pages.

migrate_pages() keeps failing to migrate pages containing
pagecache pages for reiserfs files. I noticed that reiserfs
doesn't have -migratepage() ops. So, fallback_migrate_page()
code tries to do try_to_release_page(). try_to_release_page()
fails to drop_buffers() since b_count == 1. Here is what my debug
shows:

migrate pages failed pfn 258111/flags 3f801
bh cb53f6e0 flags 110029 count 1

Any one know why the b_count == 1 and not getting dropped to
zero ? 
   
   If these are file data pages, the count is probably elevated as
   part of the data=ordered tracking.  You can verify this via
   b_private, or just mount data=writeback to double check.
  
  
  Chris,
  
  That was my first assumption. But after looking at
  reiserfs_releasepage (), realized that it would do reiserfs_free_jh()
  and clears the b_private. I couldn't easily find out who has the ref.
  against this bh.
  
  bh cbdaaf00 flags 110029 count 1 private 0
  
 
 If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps
 it is currently under IO?
 
 The page isn't locked, but data=ordered does IO directly on the buffer
 heads, without taking the page lock.
 
 The easy way to narrow our search is to try without data=ordered, it is
 certainly complicating things.

I tried data=writeback mode and it didn't help :(

unable to release the page 262070
bh c000211b9408 flags 110029 count 1 private 0
unable to release the page 262098
bh c00020ec9198 flags 110029 count 1 private 0
memory offlining 3f000 to 4 failed

# cat /etc/mtab
/dev/sda3 / reiserfs rw,data=writeback 0 0
proc /proc proc rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,mode=0620,gid=5 0 0

Thanks,
Badari


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


migratepage failures on reiserfs

2007-10-30 Thread Badari Pulavarty
Hi,

While testing hotplug memory remove, I ran into this issue. Given a
range of pages hotplug memory remove tries to migrate those pages.

migrate_pages() keeps failing to migrate pages containing pagecache
pages for reiserfs files. I noticed that reiserfs doesn't have 
-migratepage() ops. So, fallback_migrate_page() code tries to
do try_to_release_page(). try_to_release_page() fails to drop_buffers()
since b_count == 1. Here is what my debug shows:

migrate pages failed pfn 258111/flags 3f801
bh cb53f6e0 flags 110029 count 1

Any one know why the b_count == 1 and not getting dropped to zero ? 

Thanks,
Badari 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-10-30 Thread Badari Pulavarty
On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote:
 On Tue, 30 Oct 2007 10:27:04 -0800
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  Hi,
  
  While testing hotplug memory remove, I ran into this issue. Given a
  range of pages hotplug memory remove tries to migrate those pages.
  
  migrate_pages() keeps failing to migrate pages containing pagecache
  pages for reiserfs files. I noticed that reiserfs doesn't have 
  -migratepage() ops. So, fallback_migrate_page() code tries to
  do try_to_release_page(). try_to_release_page() fails to
  drop_buffers() since b_count == 1. Here is what my debug shows:
  
  migrate pages failed pfn 258111/flags 3f801
  bh cb53f6e0 flags 110029 count 1
  
  Any one know why the b_count == 1 and not getting dropped to zero ? 
 
 If these are file data pages, the count is probably elevated as part of
 the data=ordered tracking.  You can verify this via b_private, or just
 mount data=writeback to double check.


Chris,

That was my first assumption. But after looking at reiserfs_releasepage
(), realized that it would do reiserfs_free_jh() and clears the
b_private. I couldn't easily find out who has the ref. against this
bh.

bh cbdaaf00 flags 110029 count 1 private 0

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-17 Thread Badari Pulavarty
On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
 On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
  jbd/jbd2: Replace slab allocations with page cache allocations
  
  From: Christoph Lameter [EMAIL PROTECTED]
  
  JBD should not pass slab pages down to the block layer.
  Use page allocator pages instead. This will also prepare
  JBD for the large blocksize patchset.
  
 
 Currently memory allocation for committed_data(and frozen_buffer) for
 bufferhead is done through jbd slab management, as Christoph Hellwig
 pointed out that this is broken as jbd should not pass slab pages down
 to IO layer. and suggested to use get_free_pages() directly.
 
 The problem with this patch, as Andreas Dilger pointed today in ext4
 interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
 1/3-1/2 page space. 
 
 What was the originally intention to set up slabs for committed_data(and
 frozen_buffer) in JBD? Why not using kmalloc?
 
 Mingming

Looks good. Small suggestion is to get rid of all kmalloc() usages and
consistently use jbd_kmalloc() or jbd2_kmalloc().

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] dio: remove bogus refcounting BUG_ON

2007-07-05 Thread Badari Pulavarty
On Thu, 2007-07-05 at 10:11 -0700, Zach Brown wrote:
  the BUG_ON(). But unfortunately, our perf. team is able reproduce the
  problem.
 
 What are they doing to reproduce it?  How much setup does it take?

Huge OLTP run :(

 
  Debug indicated that, the ret2 == 1 :(
 
 That could be consistent with the theory that we're racing with the  
 dio struct being freed and reused before it's tested in the BUG_ON()  
 condition.  Suparna's suggestion to sample dio-is_async before  
 releasing the refcount and using that in the BUG_ON condition is a  
 good one.

I will ask them to try that.

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] dio: remove bogus refcounting BUG_ON

2007-07-04 Thread Badari Pulavarty
On Tue, 2007-07-03 at 15:28 -0700, Zach Brown wrote:
 Linus, Andrew, please apply the bug fix patch at the end of this reply
 for .22.
 
  One of our perf. team ran into this while doing some runs.
  I didn't see anything obvious - it looks like we converted
  async IO to synchronous one. I didn't spend much time digging
  around.
 
 OK, I think this BUG_ON() is just broken.  I wasn't able to find any
 obvious bugs from reading the code which would cause the BUG_ON() to
 fire.  If it's reproducible I'd love to hear what the recipe is.
 
 I did notice that this BUG_ON() is evaluating dio after having dropped
 it's ref :/.  So it's not completely absurd to fear that it's a race
 with the dio's memory being reused, but that'd be a pretty tight race.
 
 Let's remove this stupid BUG_ON and see if that test box still has
 trouble.  It might just hit the valid BUG_ON a few lines down, but this
 unsafe BUG_ON needs to go.

I went through the code multiple times, I can't find how we can trigger
the BUG_ON(). But unfortunately, our perf. team is able reproduce the
problem. Debug indicated that, the ret2 == 1 :(

Not sure how that can happen. Ideas ?

Thanks,
Badari

 
 ---
 
 dio: remove bogus refcounting BUG_ON
 
 Badari Pulavarty reported a case of this BUG_ON is triggering during
 testing.  It's completely bogus and should be removed.
 
 It's trying to notice if we left references to the dio hanging around in
 the sync case.  They should have been dropped as IO completed while this
 path was in dio_await_completion().  This condition will also be
 checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
 lines lower.  So to start this BUG_ON() is redundant.
 
 More fatally, it's dereferencing dio- after having dropped its
 reference.  It's only safe to dereference the dio after releasing the
 lock if the final reference was just dropped.  Another CPU might free
 the dio in bio completion and reuse the memory after this path drops the
 dio lock but before the BUG_ON() is evaluated.
 
 This patch passed aio+dio regression unit tests and aio-stress on ext3.
 
 Signed-off-by: Zach Brown [EMAIL PROTECTED]
 Cc: Badari Pulavarty [EMAIL PROTECTED]
 
 diff -r 509ce354ae1b fs/direct-io.c
 --- a/fs/direct-io.c  Sun Jul 01 22:00:49 2007 +
 +++ b/fs/direct-io.c  Tue Jul 03 14:56:41 2007 -0700
 @@ -1106,7 +1106,7 @@ direct_io_worker(int rw, struct kiocb *i
   spin_lock_irqsave(dio-bio_lock, flags);
   ret2 = --dio-refcount;
   spin_unlock_irqrestore(dio-bio_lock, flags);
 - BUG_ON(!dio-is_async  ret2 != 0);
 +
   if (ret2 == 0) {
   ret = dio_complete(dio, offset, ret);
   kfree(dio);

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


DIO panic on 2.6.21.5

2007-06-27 Thread Badari Pulavarty

Hi Zach,

One of our perf. team ran into this while doing some runs.
I didn't see anything obvious - it looks like we converted
async IO to synchronous one. I didn't spend much time digging
around.

Is this a known issue ? Any ideas ?

Thanks,
Badari

[ cut here ]
kernel BUG at fs/direct-io.c:1113!
invalid opcode:  [1] SMP
CPU 11
Modules linked in: oprofile raw capability commoncap qla2xxx ipv6 button 
battery ac loop dm_mod tg3 ext3 jbd edd fan thermal processor scsi_transport_fc 
sg aic94xx libsas firmware_class scsi_transport_sas serverworks sd_mod scsePid: 
9709, comm: db2sysc Not tainted 2.6.21.5-smp #1
RIP: 0010:[802a7a1a]  [802a7a1a] 
__blockdev_direct_IO+0x96c/0xa4a

RSP: 0018:810c3d3efc08  EFLAGS: 00010202
RAX: 0246 RBX: 810041768b24 RCX: 80406918
RDX:  RSI: 0246 RDI: 810041768b24
RBP: 810041768800 R08: 0086 R09: 81003f0a2370
R10: 810c3f3ee9d8 R11: 802e9716 R12: 0001
R13: fdef R14: 810c5beb42b0 R15: 
FS:  2b3d14ffe940() GS:810c444c95c0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2b3e84f57000 CR3: 4511c000 CR4: 06e0
Process db2sysc (pid: 9709, threadinfo 810c3d3ee000, task 810041fee760)
Stack:  810c42f0fbc0 4c1f4000 811836306d48 0011803e7f95
0009  0008 
810041fee760 00090001 0001 810024ac9948
Call Trace:
[802a67c7] blkdev_direct_IO+0x45/0x4a
[802a66ec] blkdev_get_blocks+0x0/0x96
[80260108] generic_file_direct_IO+0xbd/0xfb
[802601a6] generic_file_direct_write+0x60/0xf5
[80260b70] __generic_file_aio_write_nolock+0x2e7/0x410
[8029a0ce] aio_read_evt+0x9a/0x108
[80260d8e] generic_file_aio_write_nolock+0x34/0x80
[80260d5a] generic_file_aio_write_nolock+0x0/0x80
[80299d23] aio_rw_vect_retry+0x72/0x14a
[8029a63f] aio_run_iocb+0x9a/0x134
[8029b10a] io_submit_one+0x311/0x35e
[8029b678] sys_io_submit+0x9b/0x101
[80229986] default_wake_function+0x0/0xe
[80209b8e] system_call+0x7e/0x83


Code: 0f 0b eb fe 4d 85 e4 75 1d 48 8b 74 24 08 44 89 ea 48 89 ef
RIP  [802a7a1a] __blockdev_direct_IO+0x96c/0xa4a
RSP 810c3d3efc08



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 13/14] ext3 whiteout support

2007-05-14 Thread Badari Pulavarty
On Mon, 2007-05-14 at 15:14 +0530, Bharata B Rao wrote:
 From: Bharata B Rao [EMAIL PROTECTED]
 Subject: ext3 whiteout support
 
 Introduce whiteout support for ext3.
 
 Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
 Signed-off-by: Jan Blunck [EMAIL PROTECTED]
 ---
  fs/ext3/dir.c   |2 -
  fs/ext3/namei.c |   62 
 
  fs/ext3/super.c |   11 +++-
  include/linux/ext3_fs.h |5 +++
  4 files changed, 72 insertions(+), 8 deletions(-)
 
 --- a/fs/ext3/dir.c
 +++ b/fs/ext3/dir.c
 @@ -29,7 +29,7 @@
  #include linux/rbtree.h
  
  static unsigned char ext3_filetype_table[] = {
 - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
 + DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK, 
 DT_WHT
  };
  
  static int ext3_readdir(struct file *, void *, filldir_t);
 --- a/fs/ext3/namei.c
 +++ b/fs/ext3/namei.c
 @@ -1071,6 +1071,7 @@ static unsigned char ext3_type_by_mode[S
   [S_IFIFO  S_SHIFT]= EXT3_FT_FIFO,
   [S_IFSOCK  S_SHIFT]   = EXT3_FT_SOCK,
   [S_IFLNK  S_SHIFT]= EXT3_FT_SYMLINK,
 + [S_IFWHT  S_SHIFT]= EXT3_FT_WHT,
  };
  
  static inline void ext3_set_de_type(struct super_block *sb,
 @@ -1786,7 +1787,7 @@ out_stop:
  /*
   * routine to check that the specified directory is empty (for rmdir)
   */
 -static int empty_dir (struct inode * inode)
 +static int empty_dir (handle_t *handle, struct inode * inode)

Is there a reason for passing the handle ? Why couldn't you get it from
journal_current_handle() if needed to do the delete the whiteout ?

  {
   unsigned long offset;
   struct buffer_head * bh;
 @@ -1848,8 +1849,28 @@ static int empty_dir (struct inode * ino
   continue;
   }
   if (le32_to_cpu(de-inode)) {
 - brelse (bh);
 - return 0;
 + /* If this is a whiteout, remove it */
 + if (de-file_type == EXT3_FT_WHT) {
 + unsigned long ino = le32_to_cpu(de-inode);
 + struct inode *tmp_inode = iget(inode-i_sb, 
 ino);
 + if (!tmp_inode) {
 + brelse (bh);
 + return 0;
 + }
 +
 + if (ext3_delete_entry(handle, inode, de, bh)) {
 + iput(tmp_inode);
 + brelse (bh);
 + return 0;
 + }
 +
 + tmp_inode-i_ctime = inode-i_ctime;
 + tmp_inode-i_nlink--;
 + iput(tmp_inode);
 + } else {
 + brelse (bh);
 + return 0;
 + }
   }
   offset += le16_to_cpu(de-rec_len);
   de = (struct ext3_dir_entry_2 *)
 @@ -2031,7 +2052,7 @@ static int ext3_rmdir (struct inode * di
   goto end_rmdir;
  
   retval = -ENOTEMPTY;
 - if (!empty_dir (inode))
 + if (!empty_dir (handle, inode))
   goto end_rmdir;
  
   retval = ext3_delete_entry(handle, dir, de, bh);
 @@ -2060,6 +2081,36 @@ end_rmdir:
   return retval;
  }
  
 +static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
 +{
 + struct inode * inode;
 + int err, retries = 0;
 + handle_t *handle;
 +
 +retry:
 + handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir-i_sb) +
 + EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
 + 2*EXT3_QUOTA_INIT_BLOCKS(dir-i_sb));
 + if (IS_ERR(handle))
 + return PTR_ERR(handle);
 +
 + if (IS_DIRSYNC(dir))
 + handle-h_sync = 1;
 +
 + inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
 + err = PTR_ERR(inode);
 + if (IS_ERR(inode))
 + goto out_stop;

Don't you need to call init_special_inode() here ?
Or this is handled somewhere else ?

 +
 + err = ext3_add_nondir(handle, dentry, inode);
 +
 +out_stop:
 + ext3_journal_stop(handle);
 + if (err == -ENOSPC  ext3_should_retry_alloc(dir-i_sb, retries))
 + goto retry;
 + return err;
 +}
 +

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 5/14] Introduce union stack

2007-05-14 Thread Badari Pulavarty
On Mon, 2007-05-14 at 15:10 +0530, Bharata B Rao wrote:
 From: Jan Blunck [EMAIL PROTECTED]
 Subject: Introduce union stack.
 
 Adds union stack infrastructure to the dentry structure and provides
 locking routines to walk the union stack.
 
 Signed-off-by: Jan Blunck [EMAIL PROTECTED]
 Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
...

 +/*
 + * This is a *I can't get no sleep* helper which is called when we try
 + * to access the struct fs_struct *fs field of a struct task_struct.
 + *
 + * Yes, this is possibly starving but we have to change root, altroot
 + * or pwd in the frequency of this while loop. Don't think that this
 + * happens really often ;)
 + *
 + * This is called while holding the rwlock_t fs-lock
 + *
 + * TODO: Unlocking side of union_lock_fs() needs 3 union_unlock()s.
 + * May be introduce union_unlock_fs().
 + *
 + * FIXME: This routine is used when the caller wants to dget one or
 + * more of fs-[root, altroot, pwd]. When the caller doesn't want to
 + * dget _all_ of these, it is strictly not necessary to get union_locks
 + * on all of these. Check.
 + */
 +static inline void union_lock_fs(struct fs_struct *fs)
 +{
 + int locked;
 +
 + while (fs) {
 + locked = union_trylock(fs-root);
 + if (!locked)
 + goto loop1;
 + locked = union_trylock(fs-altroot);
 + if (!locked)
 + goto loop2;
 + locked = union_trylock(fs-pwd);
 + if (!locked)
 + goto loop3;
 + break;
 + loop3:
 + union_unlock(fs-altroot);
 + loop2:
 + union_unlock(fs-root);
 + loop1:
 + read_unlock(fs-lock);
 + UM_DEBUG_LOCK(Failed to get all semaphores in fs_struct!\n);
 + cpu_relax();
 + read_lock(fs-lock);
 + continue;

Nit.. why continue ?

 + }
 + BUG_ON(!fs);

Whats the use of BUG_ON() here ? Top of the function would be more
useful.

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 5/14] Introduce union stack

2007-05-14 Thread Badari Pulavarty
On Mon, 2007-05-14 at 15:10 +0530, Bharata B Rao wrote:
 From: Jan Blunck [EMAIL PROTECTED]
 Subject: Introduce union stack.
 
 Adds union stack infrastructure to the dentry structure and provides
 locking routines to walk the union stack.
...

 --- /dev/null
 +++ b/include/linux/dcache_union.h
 @@ -0,0 +1,248 @@
 +/*
 + * VFS based union mount for Linux
 + *
 + * Copyright © 2004-2007 IBM Corporation
 + *   Author(s): Jan Blunck ([EMAIL PROTECTED])
 + *
 + * This program is free software; you can redistribute it and/or modify it
 + * under the terms of the GNU General Public License as published by the Free
 + * Software Foundation; either version 2 of the License, or (at your option)
 + * any later version.
 + *
 + */
 +#ifndef __LINUX_DCACHE_UNION_H
 +#define __LINUX_DCACHE_UNION_H
 +#ifdef __KERNEL__
 +
 +#include linux/union_debug.h
 +#include linux/fs_struct.h
 +#include asm/atomic.h
 +#include asm/semaphore.h
 +
 +#ifdef CONFIG_UNION_MOUNT
 +
 +/*
 + * This is the union info object, that describes general information about 
 this
 + * union directory
 + *
 + * u_mutex protects the union stack against modification. You can reach it
 + * through the d_union field in struct dentry. Hold it when you are walking
 + * or modifing the union stack !
 + */
 +struct union_info {
 + atomic_t u_count;
 + struct mutex u_mutex;
 +};
 +
 +/* allocate/de-allocate */
 +extern struct union_info *union_alloc(void);
 +extern struct union_info *union_get(struct union_info *);
 +extern void union_put(struct union_info *);
 +
 +/*
 + * These are the functions for locking a dentry's union. When one
 + * want to acquire a denties union lock, use:
 + *
 + * - union_lock() when you can sleep,
 + * - union_lock_spinlock() when you are holding a spinlock (that
 + *   you CAN savely give up and reacquire again)
 + * - union_lock_readlock() when you are holding a readlock (that
 + *   you CAN savely give up and reacquire again)
 + *
 + * Otherwise get the union lock early before you enter your
 + * no sleeping here code.
 + *
 + * NOTES: union_info structure is reference counted using u_count member.
 + * union_get() and union_put() which get and put references on union_info
 + * should be done under union_info's u_mutex. Since the last union_put() 
 frees
 + * the union_info structure itself it can't obviously be done under u_mutex.
 + * union_release() should be used in such cases (Eg. dput(), umount()) where
 + * union_info is disassociated from the dentries, and it becomes safe
 + * to free the union_info.
 + */
 +static inline void __union_lock(struct union_info *uinfo)
 +{
 + BUG_ON(!atomic_read(uinfo-u_count));
 + mutex_lock(uinfo-u_mutex);
 +}
 +
 +static inline void union_lock(struct dentry *dentry)
 +{
 + if (unlikely(dentry  dentry-d_union)) {
 + struct union_info *ui = dentry-d_union;
 +
 + UM_DEBUG_LOCK(\%s\ locking %p (count=%d)\n,
 +   dentry-d_name.name, ui,
 +   atomic_read(ui-u_count));
 + __union_lock(dentry-d_union);
 + }
 +}
 +
 +static inline void __union_unlock(struct union_info *uinfo)
 +{
 + BUG_ON(!atomic_read(uinfo-u_count));
 + mutex_unlock(uinfo-u_mutex);
 +}
 +
 +static inline void union_unlock(struct dentry *dentry)
 +{
 + if (unlikely(dentry  dentry-d_union)) {
 + struct union_info *ui = dentry-d_union;
 +
 + UM_DEBUG_LOCK(\%s\ unlocking %p (count=%d)\n,
 +   dentry-d_name.name, ui,
 +   atomic_read(ui-u_count));
 + __union_unlock(dentry-d_union);
 + }
 +}
 +
 +static inline void union_alloc_dentry(struct dentry *dentry)
 +{
 + spin_lock(dentry-d_lock);
 + if (!dentry-d_union) {
 + dentry-d_union = union_alloc();
 + spin_unlock(dentry-d_lock);
 + } else {
 + spin_unlock(dentry-d_lock);
 + union_lock(dentry);
 + }
 +}
 +
 +static inline struct union_info *union_lock_and_get(struct dentry *dentry)
 +{
 + union_lock(dentry);
 + return union_get(dentry-d_union);
 +}
 +
 +/* Shouldn't be called with last reference to union_info */
 +static inline void union_put_and_unlock(struct union_info *uinfo)
 +{
 + union_put(uinfo);
 + __union_unlock(uinfo-u_mutex);
   ^^^

It should be

__union_unlock(uinfo);

Thanks,
Badari



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.21-rc6 new aops patchset

2007-04-12 Thread Badari Pulavarty
On Thu, 2007-04-12 at 06:48 +0200, Nick Piggin wrote:
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 2.6.21-rc6-new-aops*
 
 New aops patchset against 2.6.21-rc6.


  Building modules, stage 2.
  MODPOST 558 modules
WARNING: .cont_prepare_write [fs/hfsplus/hfsplus.ko] undefined!
WARNING: .iov_iter_advance [fs/fuse/fuse.ko] undefined!
WARNING: .iov_iter_copy_from_user_atomic [fs/fuse/fuse.ko] undefined!
WARNING: .iov_iter_fault_in_readable [fs/fuse/fuse.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-05 Thread Badari Pulavarty
On Thu, 2007-04-05 at 04:08 +0200, Nick Piggin wrote:
 On Wed, Apr 04, 2007 at 04:32:24PM -0700, Badari Pulavarty wrote:
  On Wed, 2007-04-04 at 16:17 -0700, Mark Fasheh wrote:
   On Wed, Apr 04, 2007 at 04:05:19PM -0700, Badari Pulavarty wrote:
Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE
are ocf2 and gfs2. Now that both of them are switched to have
write_begin()/write_end(), why do we need this code to handle
AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke
all the AOP_TRUNCATED_PAGE handling ?
   
   We don't - I'm pretty sure that fs-no-AOP_TRUNCATED_PAGE.patch gets rid of
   them.
 --Mark
  
  It didn't, completely get rid of them :(
 
 -readpage can still return AOP_TRUNCATED_PAGE. Were there any from
 prepare_write or commit_write still around?
 
 

Not a big deal. But trying to understand it better.


int pagecache_write_begin()
{
 
if (aops-write_begin) {
return aops-write_begin(file, mapping, pos, len, flags,
pagep, fsdata);
} else {
.
ret = aops-readpage(file, page);
page_cache_release(page);
if (ret) {
if (ret == AOP_TRUNCATED_PAGE)
goto again;
return ret;
}
goto again;

  
}
}

filesystems (ocfs2, gfs2) which can return AOP_TRUNCATED_PAGE for
prepare_write or readpage would never come to this case. They
have write_begin() method set. Isn't it ? Why this check ?

Thanks,
Badari


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-04 Thread Badari Pulavarty
On Wed, 2007-04-04 at 15:10 -0700, Mark Fasheh wrote:
 On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
  Updated aops patchset against 2.6.21-rc5.
  
  http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 Thanks again for pushing this stuff out Nick.
 
 Fwiw, I had merge conflicts working against the latest git code from Linus -
 it looks like Jens pushed some splice changes (some from you). A refreshed
 fs-new-write-aops.patch is attached.
 
 The ext3 patch also caused me some problems, but I just temporarily dropped
 it for my immediate work.


Could you elaborate on issues ext3 caused ? Its pretty stable for
me (after bunch of simple fixes, I posted earlier). Anything I need
to be aware ?

Thanks,
Badari


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-04 Thread Badari Pulavarty
On Wed, 2007-04-04 at 16:17 -0700, Mark Fasheh wrote:
 On Wed, Apr 04, 2007 at 04:05:19PM -0700, Badari Pulavarty wrote:
  Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE
  are ocf2 and gfs2. Now that both of them are switched to have
  write_begin()/write_end(), why do we need this code to handle
  AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke
  all the AOP_TRUNCATED_PAGE handling ?
 
 We don't - I'm pretty sure that fs-no-AOP_TRUNCATED_PAGE.patch gets rid of
 them.
   --Mark

It didn't, completely get rid of them :(

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-03 Thread Badari Pulavarty
On Tue, 2007-04-03 at 01:49 +0200, Nick Piggin wrote:
 On Mon, Apr 02, 2007 at 04:14:59PM -0700, Badari Pulavarty wrote:
  On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
   Updated aops patchset against 2.6.21-rc5.
   
   http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
   
   Files/dirs are 2.6.21-rc5-new-aops*
  
  Baaah !! You took away ext3 -nobh option :(
 
 Ahh, just the person I wanted to ask! ;) How useful is it, out of curiosity?
 What sort of users use it, and what sort of improvements do they get?

Well, at the time it helped lowmem issue on x86. Bufferheads end up
using lots of lowmem and caused bunch of issues. We used it heavily
on our local machines. Not sure if any customers use it (my guess
is not).

I wanted to completely kill bufferhead usage from ext3. writeback
mode is simple and easy. Ordered mode, need attach the buffers
to transaction - which needed complete rework. So never gotten
around to doing it.

 
  Do you have plans to support nobh versions of block_write_begin/end ?
 
 At the moment I'm looking at doing it another way. Having the seperate
 nobh path is quite annoying -- there are still bugs in it and it is
 simply less tested (not that the bh path is bug-free either, but it
 is better to be able to concentrate on one). So I hope to merge them
 at some point to restore that functionality. 
 
 
  BTW, I don't see how block_write_end() can ever return  0.
  If so, here is the cleanup fix for ext3 (no unnecessay checks).
 
 Shouldn't we allow for the possibility?

Well, if there is a (remote) possibility of failure - yes
we should handle it. Otherwise its dead code, unnecessary
checks in hotpath and makes code ugly by confusing non-possible
error cases from possible ones. 

Even today - ext3 code has checks to handle  generic_commit_write()
failures. But it never returns failure. I wanted to clean up ext3 code,
but I left it for allowing for possibility.

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-03 Thread Badari Pulavarty
On Tue, 2007-04-03 at 01:58 +0200, Nick Piggin wrote:
 On Mon, Apr 02, 2007 at 01:44:59PM -0700, Badari Pulavarty wrote:
  On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
   Updated aops patchset against 2.6.21-rc5.
   
   http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
   
   Files/dirs are 2.6.21-rc5-new-aops*
   
   Contains numerous fixes from Mark and myself -- I'd say the core code is
   getting reasonably stable at this point.
  
  
  ext3_write_failure() conversion is NOT quite correct.
  Old code was returing failure of do_journal_get_write_access(),
  where as your changes will return journal_stop(). 
 
 Good catch. This still requires a journal_stop, however?

Yes.
 
 How's this look?

Looks good. I am little worried about the page-lock vs journal
start/stop order. I remember running into deadlock while trying
to do writepages() for ordered mode. I think we are okay here.
I will take a closer look.

Thanks,
Badari

 
 --
 Index: linux-2.6/fs/ext3/inode.c
 ===
 --- linux-2.6.orig/fs/ext3/inode.c
 +++ linux-2.6/fs/ext3/inode.c
 @@ -1167,11 +1167,13 @@ static int ext3_write_failure(struct fil
   handle_t *handle = ext3_journal_current_handle();
  
   if (ext3_should_writeback_data(mapping-host)) {
 +skip_and_stop:
   /* optimization: no constraints about data */
 + ret = ext3_journal_stop(handle);
  skip:
   unlock_page(page);
   page_cache_release(page);
 - return ext3_journal_stop(handle);
 + return ret;
   }
  
   from = pos  (PAGE_CACHE_SIZE - 1);
 @@ -1196,8 +1198,10 @@ skip:
   break;
   if (ext3_should_journal_data(mapping-host)) {
   ret = do_journal_get_write_access(handle, bh);
 - if (ret)
 + if (ret) {
 + ext3_journal_stop(handle);
   goto skip;
 + }
   }
   /*
* block_start here becomes the first block where the current iteration
 @@ -1205,7 +1209,7 @@ skip:
*/
   }
   if (block_start = from)
 - goto skip;
 + goto skip_and_stop;
  
   /* commit allocated and zeroed buffers */
   return mapping-a_ops-write_end(file, mapping, pos, len,

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-03 Thread Badari Pulavarty
On Tue, 2007-04-03 at 11:31 +0200, Nick Piggin wrote:
 On Tue, Apr 03, 2007 at 02:08:53AM +0200, Nick Piggin wrote:
   BTW, I will take a shot at ext4 tomorrow.
  
  Thanks, so long as you think ext3 is looking OK?
  
  BTW. is it a known issue that ext3 fails fsx-linux? (I tried 2.6.21-rc3
  IIRC, and ordered and writeback both eventually failed I think). ext2
  does not.
 
 Well I just tested, and it is not fixed by the recent patch to revert
 ext3_prepare_failure
 
 Is this a known issue? Is it an fsx-linux shortcoming? It is fairly
 surprising because it basically makes it impossible to test ext3
 changes with that nice tool :( I can submit the traces if anyone is
 interested, however I can reproduce in UML on an ext3 writeback
 filesystem with no arguments (except the filename).
 

I haven't seen an fsx failure recently. I ran 4 copies of fsx
on 2.6.21-rc5 and 2.6.21-rc5+aops, without any problems for
12+ hours. What am I missing ?

Mingming, do you know of any fsx failures recently ?

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-03 Thread Badari Pulavarty
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
 Updated aops patchset against 2.6.21-rc5.
 
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 Files/dirs are 2.6.21-rc5-new-aops*

Here is the ext4 support for it. This is a simple port from 
ext3 code. Ran fsx without any problems :)

BTW, I never clearly understood what exactly the problem these
new interfaces are solving and how :( I can dig through the
archives and try to figure out. Would you care to put a 
small description of the *actual* problem and how these
new aops are needed (vs hacking the existing methods).

Thanks,
Badari

---
 fs/ext4/inode.c |  171 +---
 1 file changed, 113 insertions(+), 58 deletions(-)

Index: linux-2.6.21-rc5/fs/ext4/inode.c
===
--- linux-2.6.21-rc5.orig/fs/ext4/inode.c   2007-03-25 15:56:23.0 
-0700
+++ linux-2.6.21-rc5/fs/ext4/inode.c2007-04-03 10:01:34.0 -0700
@@ -1154,23 +1154,30 @@ static int do_journal_get_write_access(h
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
  */
-static int ext4_prepare_failure(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+static int ext4_write_failure(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len,
+   struct page *page, void *fsdata)
 {
-   struct address_space *mapping;
struct buffer_head *bh, *head, *next;
unsigned block_start, block_end;
unsigned blocksize;
+   unsigned from, to;
int ret;
handle_t *handle = ext4_journal_current_handle();
 
-   mapping = page-mapping;
if (ext4_should_writeback_data(mapping-host)) {
/* optimization: no constraints about data */
+skip_and_stop:
+   ret = ext4_journal_stop(handle);
 skip:
-   return ext4_journal_stop(handle);
+   unlock_page(page);
+   page_cache_release(page);
+   return ret;
}
 
+   from = pos  (PAGE_CACHE_SIZE - 1);
+   to = from + len;
+
head = page_buffers(page);
blocksize = head-b_size;
for (   bh = head, block_start = 0;
@@ -1192,7 +1199,7 @@ skip:
ret = do_journal_get_write_access(handle, bh);
if (ret) {
ext4_journal_stop(handle);
-   return ret;
+   goto skip;
}
}
/*
@@ -1201,43 +1208,64 @@ skip:
 */
}
if (block_start = from)
-   goto skip;
+   goto skip_and_stop;
 
/* commit allocated and zeroed buffers */
-   return mapping-a_ops-commit_write(file, page, from, block_start);
+   return mapping-a_ops-write_end(file, mapping, pos, len,
+   block_start - from, page, fsdata);
 }
 
-static int ext4_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page-mapping-host;
-   int ret, ret2;
+   struct inode *inode = mapping-host;
int needed_blocks = ext4_writepage_trans_blocks(inode);
+   int ret, ret2;
handle_t *handle;
int retries = 0;
+   struct page *page;
+   pgoff_t index;
+   unsigned start, end;
+
+   index = pos  PAGE_CACHE_SHIFT;
+   start = pos * (PAGE_CACHE_SIZE - 1);
+   end = start + len;
 
 retry:
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
handle = ext4_journal_start(inode, needed_blocks);
-   if (IS_ERR(handle))
+   if (IS_ERR(handle)) {
+   unlock_page(page);
+   page_cache_release(page);
return PTR_ERR(handle);
-   if (test_opt(inode-i_sb, NOBH)  ext4_should_writeback_data(inode))
-   ret = nobh_prepare_write(page, from, to, ext4_get_block);
-   else
-   ret = block_prepare_write(page, from, to, ext4_get_block);
+   }
+   ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext4_get_block);
if (ret)
goto failure;
 
if (ext4_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
-   from, to, NULL, do_journal_get_write_access);
-   if (ret)
+   start, end, NULL, 

Re: [ANNOUNCE] new new aops patchset

2007-04-02 Thread Badari Pulavarty
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
 Updated aops patchset against 2.6.21-rc5.
 
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 Files/dirs are 2.6.21-rc5-new-aops*
 
 Contains numerous fixes from Mark and myself -- I'd say the core code is
 getting reasonably stable at this point.
 
 New stuff: patches from Mark and Steve for the cluster filesystems.
 
 (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,
 but I have tried to put some effort in, so if maintainers would kindly
 take a look... ext3 still needs review and porting to ext4, which I hope
 someone will do eventually (I presume ext3 is still maintained)... does NFS
 have a lock_page vs lock_kernel deadlock in its commit_write? (if yes, this
 is fixable with the new write aops).

In case of error, write_begin() is *supposed* to unlock and release the
page it locked. Right ?

If so, ext3 error handling is not quite right :(
Here is the fix.

Thanks,
Badari

---
 fs/ext3/inode.c |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc5.aop/fs/ext3/inode.c
===
--- linux-2.6.21-rc5.aop.orig/fs/ext3/inode.c   2007-04-02 10:21:44.0 
-0700
+++ linux-2.6.21-rc5.aop/fs/ext3/inode.c2007-04-02 13:08:41.0 
-0700
@@ -1236,8 +1236,11 @@ retry:
*pagep = page;
 
handle = ext3_journal_start(inode, needed_blocks);
-   if (IS_ERR(handle))
+   if (IS_ERR(handle)) {
+   unlock_page(page);
+   page_cache_release(page);
return PTR_ERR(handle);
+   }
ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
ext3_get_block);
if (ret)
@@ -1246,9 +1249,12 @@ retry:
if (ext3_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
start, end, NULL, do_journal_get_write_access);
-   if (ret)
+   if (ret) {
/* fatal error, just put the handle and return */
+   unlock_page(page);
+   page_cache_release(page);
journal_stop(handle);
+   }
}
return ret;
 


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-02 Thread Badari Pulavarty
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
 Updated aops patchset against 2.6.21-rc5.
 
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 Files/dirs are 2.6.21-rc5-new-aops*
 
 Contains numerous fixes from Mark and myself -- I'd say the core code is
 getting reasonably stable at this point.


ext3_write_failure() conversion is NOT quite correct.
Old code was returing failure of do_journal_get_write_access(),
where as your changes will return journal_stop(). 

Thanks,
Badari

---
 fs/ext3/inode.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc5.aop/fs/ext3/inode.c
===
--- linux-2.6.21-rc5.aop.orig/fs/ext3/inode.c   2007-04-02 13:37:16.0 
-0700
+++ linux-2.6.21-rc5.aop/fs/ext3/inode.c2007-04-02 13:37:35.0 
-0700
@@ -1196,8 +1196,11 @@ skip:
break;
if (ext3_should_journal_data(mapping-host)) {
ret = do_journal_get_write_access(handle, bh);
-   if (ret)
-   goto skip;
+   if (ret) {
+   unlock_page(page);
+   page_cache_release(page);
+   return ret;
+   }
}
/*
 * block_start here becomes the first block where the current iteration


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-02 Thread Badari Pulavarty
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
 Updated aops patchset against 2.6.21-rc5.
 
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 Files/dirs are 2.6.21-rc5-new-aops*


[EMAIL PROTECTED] linux-2.6.21-rc5]# make -j8 modules
  CHK include/linux/version.h
  CHK include/linux/utsrelease.h
  Building modules, stage 2.
  MODPOST 1281 modules
WARNING: cont_write_begin [fs/fat/fat.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2


Here is the fix.

Thanks,
Badari

---
 fs/buffer.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.21-rc5/fs/buffer.c
===
--- linux-2.6.21-rc5.orig/fs/buffer.c   2007-04-02 13:45:56.0 -0700
+++ linux-2.6.21-rc5/fs/buffer.c2007-04-02 14:33:20.0 -0700
@@ -3128,6 +3128,7 @@ EXPORT_SYMBOL(block_sync_page);
 EXPORT_SYMBOL(block_truncate_page);
 EXPORT_SYMBOL(block_write_full_page);
 EXPORT_SYMBOL(cont_prepare_write);
+EXPORT_SYMBOL(cont_write_begin);
 EXPORT_SYMBOL(end_buffer_read_sync);
 EXPORT_SYMBOL(end_buffer_write_sync);
 EXPORT_SYMBOL(file_fsync);


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-02 Thread Badari Pulavarty
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
 Updated aops patchset against 2.6.21-rc5.
 
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 

Sorry to send you so many silly fixes, but I though it would
be easy to review individual ones.

Thanks,
Badari

---
 mm/filemap.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.21-rc5/mm/filemap.c
===
--- linux-2.6.21-rc5.orig/mm/filemap.c  2007-04-02 13:45:56.0 -0700
+++ linux-2.6.21-rc5/mm/filemap.c   2007-04-02 16:23:48.0 -0700
@@ -2107,7 +2107,7 @@ int pagecache_write_end(struct file *fil
const struct address_space_operations *aops = mapping-a_ops;
int ret;
 
-   if (aops-write_begin) {
+   if (aops-write_end) {
ret = aops-write_end(file, mapping, pos, len, copied,
page, fsdata);
} else {




-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-02 Thread Badari Pulavarty
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
 Updated aops patchset against 2.6.21-rc5.
 
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 Files/dirs are 2.6.21-rc5-new-aops*

ext3_write_begin() is computing start incorrectly.

Thanks,
Badari

---
 fs/ext3/inode.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.21-rc5/fs/ext3/inode.c
===
--- linux-2.6.21-rc5.orig/fs/ext3/inode.c   2007-04-02 16:06:39.0 
-0700
+++ linux-2.6.21-rc5/fs/ext3/inode.c2007-04-02 16:49:57.0 -0700
@@ -1229,7 +1229,7 @@ static int ext3_write_begin(struct file 
unsigned start, end;
 
index = pos  PAGE_CACHE_SHIFT;
-   start = pos * (PAGE_CACHE_SIZE - 1);
+   start = pos  (PAGE_CACHE_SIZE - 1);
end = start + len;
 
 retry:


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] new new aops patchset

2007-04-02 Thread Badari Pulavarty
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
 Updated aops patchset against 2.6.21-rc5.
 
 http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
 Files/dirs are 2.6.21-rc5-new-aops*

One more cleanup. index is unused.

BTW, I will take a shot at ext4 tomorrow.

(If you want me to roll all my fixes and send them to you,u,

5.)


1c link

5.orig


h2
(

 and send them to you,
let me know).

Thanks,
Badari

---
 mm/filemap.c |2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.21-rc5/mm/filemap.c
===
--- linux-2.6.21-rc5.orig/mm/filemap.c  2007-04-02 16:23:48.0 -0700
+++ linux-2.6.21-rc5/mm/filemap.c   2007-04-02 16:53:13.0 -0700
@@ -2383,14 +2383,12 @@ static ssize_t generic_perform_write(str
 
do {
struct page *page;
-   pgoff_t index;  /* Pagecache index for current page */
unsigned long offset;   /* Offset into pagecache page */
unsigned long bytes;/* Bytes to write to page */
size_t copied;  /* Bytes copied from user */
void *fsdata;
 
offset = (pos  (PAGE_CACHE_SIZE - 1));
-   index = pos  PAGE_CACHE_SHIFT;
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_count(i));
 




-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Badari Pulavarty
On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote:
 Badari Pulavarty wrote:
  
  Amit K. Arora wrote:
  
  This is to give a heads up on few patches that we will be soon coming up
  with. These patches implement a new system call sys_fallocate() and a
  new inode operation fallocate, for persistent preallocation. The new
  system call, as Andrew suggested, will look like:
 
   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
 
  I am wondering about return values from this syscall ? Is it supposed to 
  return the
  number of bytes allocated ? What about partial allocations ? 
 
 If you don't have enough blocks to cover the request, you should 
 probably just return -ENOSPC, not a partial allocation.

That could be challenging, when multiple writers are working in
parallel. You may not be able to return -ENOSPC, till you fail the
allocation (for filesystems which alllocates a block at a time).

 
  What about 
  if the
  blocks already exists ? What would be return values in those cases ?
 
 0 on success, other normal errors oetherwise..
 
 If asked for a range that includes already-allocated blocks, you just 
 allocate any non-allocated blocks in the range, I think.

Yes. What I was trying to figure out is, if there is a requirement that
interface need to return exact number of bytes it *really* allocated
(like write() or read()). I can't think of any, but just wanted to
through it out..

BTW, what is the interface for finding out what is the size of the
pre-allocated file ? 

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Badari Pulavarty


Amit K. Arora wrote:


This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation fallocate, for persistent preallocation. The new
system call, as Andrew suggested, will look like:

 asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

I am wondering about return values from this syscall ? Is it supposed to 
return the
number of bytes allocated ? What about partial allocations ? What about 
if the

blocks already exists ? What would be return values in those cases ?

Just curious .. What does posix_fallocate() return ?

Thanks,
Badari







-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 2/4]delayed allocation for ext3

2005-07-26 Thread Badari Pulavarty
On Tue, 2005-07-26 at 15:52 -0700, Andrew Morton wrote:
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Here is the updated patch from Badari for delayed allocation for ext3.
  Delayed allocation defers block allocation from prepare-write time to
  page writeout time. 
 
 For data=writeback only, yes?

Yes.

 
  ...
  --- linux-2.6.12/fs/ext3/inode.c~ext3-delalloc  2005-07-14 
  23:15:34.866752480 -0700
  +++ linux-2.6.12-ming/fs/ext3/inode.c   2005-07-14 23:15:34.889748984 
  -0700
  @@ -1340,6 +1340,9 @@ static int ext3_prepare_write(struct fil
  handle_t *handle;
  int retries = 0;
   
  +
  +   if (test_opt(inode-i_sb, DELAYED_ALLOC))
  +   return __nobh_prepare_write(page, from, to, ext3_get_block, 0);
 
 Rather than performing this test on each -prepare_write(), would it not be
 better to set up a new set of address_space_operations for this mode?
 
 __nobh_prepare_write() seems like a poor choice of name?

You are correct. I was trying to minimize the changes to interfaces.
Once we get it working, I will do it as part of cleanups.

 
   retry:
  handle = ext3_journal_start(inode, needed_blocks);
  if (IS_ERR(handle)) {
  @@ -1439,6 +1442,9 @@ static int ext3_writeback_commit_write(s
  else
  ret = generic_commit_write(file, page, from, to);
   
  +   if (test_opt(inode-i_sb, DELAYED_ALLOC))
  +   return ret;
  +
 
 Here too, perhaps.
 
  +   }
  +   }
  /*
   * The journal_load will have done any necessary log recovery,
   * so we can safely mount the rest of the filesystem now.
  diff -puN fs/buffer.c~ext3-delalloc fs/buffer.c
  --- linux-2.6.12/fs/buffer.c~ext3-delalloc  2005-07-14 23:15:34.875751112 
  -0700
  +++ linux-2.6.12-ming/fs/buffer.c   2005-07-14 23:15:34.903746856 -0700
  @@ -2337,8 +2337,8 @@ static void end_buffer_read_nobh(struct 
* On entry, the page is fully not uptodate.
* On exit the page is fully uptodate in the areas outside (from,to)
*/
  -int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
  -   get_block_t *get_block)
  +int __nobh_prepare_write(struct page *page, unsigned from, unsigned to,
  +   get_block_t *get_block, int create)
 
 Suggest you make this static and update the comment.
 

Sure.

   {
  struct inode *inode = page-mapping-host;
  const unsigned blkbits = inode-i_blkbits;
  @@ -2370,10 +2370,8 @@ int nobh_prepare_write(struct page *page
block_start  PAGE_CACHE_SIZE;
block_in_page++, block_start += blocksize) {
  unsigned block_end = block_start + blocksize;
  -   int create;
   
  map_bh.b_state = 0;
  -   create = 1;
  if (block_start = to)
  create = 0;
  ret = get_block(inode, block_in_file + block_in_page,
 
 What's going on here?  Seems that we'll call get_block() with `create=0'. 
 Is there any point in doing that?  For delayed allocation we shuld be able
 to skip get_block() altogether here and, err, delay it.

For delayed allocation, I need to delay the block allocation - but I
still need to do get_block(READ) and read the data from the block - if
the block already exists.

 
  +int nobh_prepare_write(struct page *page, unsigned from, unsigned
  +   get_block_t *get_block)
  +{
  +   return __nobh_prepare_write(page, from, to, get_block, 1);
  +}
  +
  EXPORT_SYMBOL(nobh_prepare_write);
 
 Here you add nobh_dalloc_prepare_write() and remember to export it to
 modules this time ;)
 

Will do.

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: get_blocks_t semantics

2005-07-21 Thread Badari Pulavarty

Nir Tzachar wrote:


On Wed, 2005-07-20 at 20:14 -0700, Badari Pulavarty wrote:


Nir Tzachar wrote:



hello list.

can someone please explain the exact semantics the get_block_t function 
(which is passed to mpage_readpage(s)) should implement??

i could not find any documentation, and existing code kind of baffled
me...

my current understanding goes like this:
if the block is present, call map_bh on the bh, with the physical block
number.
else, if create is set, allocate a new block for the inode, and again
update the new physical block number.

is this all?? 


thanks.



For a given file offset, get_block() function is supposed to return the
physical disk block# of the block. (and size of the block). If create
is one, it needs to allocate the block (if it doesn't already exist). 
Makes sense ?



from fs.h:
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 struct buffer_head *bh_result, int create);

so, iblock is the block number in the file, or the offset in the file??


iblock = fileoffset in represented sectors. (for example,
file offset 4K = 8)


furthermore, what  set_buffer_mapped (used in map_bh) does?? i could not
find it's definition anywhere??


When the filesystems map the fileoffset to a disk block, it returns the
block# in the bh-b_blocknr and sets the flag to indicate that the
buffer is mapped (to a disk block).

If get_block() returned sucess, but if the buffer mapped is not set, 
means thats there is a hole.




about create, allocate a block means just deciding which block should
be allocated, update the fs control structures, and return that block #,
right??



yep.

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: get_blocks_t semantics

2005-07-20 Thread Badari Pulavarty

Nir Tzachar wrote:


hello list.

can someone please explain the exact semantics the get_block_t function 
(which is passed to mpage_readpage(s)) should implement??

i could not find any documentation, and existing code kind of baffled
me...

my current understanding goes like this:
if the block is present, call map_bh on the bh, with the physical block
number.
else, if create is set, allocate a new block for the inode, and again
update the new physical block number.

is this all?? 


thanks.



For a given file offset, get_block() function is supposed to return the
physical disk block# of the block. (and size of the block). If create
is one, it needs to allocate the block (if it doesn't already exist). 
Makes sense ?


Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] [RFC] [PATCH 2/4]delayed allocation for ext3

2005-07-18 Thread Badari Pulavarty
On Sun, 2005-07-17 at 19:47 -0600, Andreas Dilger wrote:
 On Jul 17, 2005  10:40 -0700, Mingming Cao wrote:
  @@ -373,6 +373,7 @@ struct ext3_inode {
   #define EXT3_MOUNT_BARRIER 0x2 /* Use block barriers */
   #define EXT3_MOUNT_NOBH0x4 /* No bufferheads */
   #define EXT3_MOUNT_QUOTA   0x8 /* Some quota option set */
  + #define EXT3_MOUNT_DELAYED_ALLOC  0xC /* Delayed Allocation */
 
 This doesn't make sense.  DELAYED_ALLOC == QUOTA | NOBH?

My fault. I will fix it.

 
  + {Opt_delayed_alloc, delalloc},
 
 Is this a replacement for Alex's delalloc code?  We also use delalloc for
 that code and if they are not interchangeable it will cause confusion
 about which one is in use.
 

Well, basically delalloc concept is same - whether we use it on
current ext3 layout or with new extent layout doesn't matter.


  + if (test_opt(sb, DELAYED_ALLOC)) {
  + if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) 
  {
  + printk(KERN_WARNING EXT3-fs: Ignoring delall option 
  - 
  + its supported only with writeback mode\n);
 
 Should be ignoring delalloc option.

Yep.

Thanks,
Badari



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lazy block allocation and block_prepare_write?

2005-04-20 Thread Badari Pulavarty
On Tue, 2005-04-19 at 13:41, Martin Jambor wrote:

  
  1) Currently none of the generic helper routines can handle this.
  We need to add support to do these, but still somehow make the
  routines generic enough for every ones use.
 
 I'm quite happy about most of them. I can't see how we could use any
 generic form of writepage(s) as we write stuff in a quite different
 way from almost anybody else but all the others except
 block_prepare_write do  pretty much exactly what we need (if I have
 not missed something).

Yep. You need to have your own block_prepare_write() which doesn't
allocate a block - instead it reserves a block.

 
  2) There is no easy way to find out if we reserved a block or
  not in writepage() correctly. There are 2 paths to writepage().
  
  sys_write() - prepare/commit()
  and later sync()  writepage()
  
  mmap() - touch a page()
  and later -- writepage()
  
  In order to do the correct accounting, we need to mark a page
  to indicate if we reserved a block or not. One way to do this,
  to use page-private to indicate this. But then, all the generic
  routines will fail - since they assume that page-private represents
  bufferheads. So we need a better way to do this.
 
 I didn't hope for a special bit in struct page so I wanted to simply
 fake the page/buffer mapping somehow. Since we don't really care
 whether a page is mapped or reserved as long as it is at least one of
 these when actually writing it (we write stuff to different places
 from where we have read it from), the PG_mappedtodisk is fine for us
 as long as no other kernel code thinks that having it set means we
 also have buffers which point to meaningful positions on the device
 because we don't. Is that the case?
 
 Of course, having a PG_RESERVED flag would be a nice and clean thing
 to use and we would be more than happy to do so.
 
  3) We need add hooks into filesystem specific calls from these
  generic routines to handle journaling mode requirements
 
 Our fs is basically one big journal so we don't need any of these. Or
 at least I don't see any need for it at the moment.
 
  So, what are your requirements ?  I am looking for a common
  way to combine all the requirements and come out with a
  saner generic routines to handle these.
 
 I'm happy with most generic functions. we need to implement
 writepage(s) ourselves no matter what, the only problem is
 block_prepare_write and I can currently only see two options for us:
 
 1) Implement it ourselves and use a flag in the struct page to mark it 
 reserved.
 
 2) Use block_prepare_write but enable the get_block function to mark
 an individual buffer as reserved so that it is trated as mapped (can
 be dirty and stuff) but no code assumes it is located somewhere on the
 disk (for example block_prepare_write would not call
 unmap_underlying_metadata).
 
 I think we'll go for the first method, but the second would make life
 easier for filesystems which can have pages consisting of both mapped
 and reserved blocks.

I guess for now, you can do all this in your filesystem specific code.
Too bad, every one has to write their own. Hopefully, we cleanup all
these interfaces and provide a generic *enough* interfaces to do this -
so everyone can use it.

As you can see, lots of issues need to be worked out + I don't have
bandwidth to undertake such an effort, unless Nikita and Bryan
are willing to signup for this :)

Thanks,
Badari


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Badari Pulavarty
On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
 Badari Pulavarty [EMAIL PROTECTED] writes:
 
 [...]
 
 
  Yes. Its possible to do what you want to. I am currently working on
  adding delayed allocation support to ext3. As part of that, We
 
 As you most likely already know, Alex Thomas already implemented delayed
 block allocation for ext3.

Yep. I reviewed Alex Thomas patches for delayed allocation. He handled
all the cases in his code and did NOT use any mpage* routines to do
the work. I was hoping to change the mpage infrastructure to handle
these, so that every filesystem doesn't have to do their thing.


 
 
  In order to do the correct accounting, we need to mark a page
  to indicate if we reserved a block or not. One way to do this,
  to use page-private to indicate this. But then, all the generic
 
 I believe one can use PG_mappedtodisk bit in page-flags for this
 purpose. There was old Andrew Morton's patch that introduced new bit
 (PG_delalloc?) for this purpose.

That would be good. But I don't feel like asking for a bit in page
if there is a way to get around it.

 
  routines will fail - since they assume that page-private represents
  bufferheads. So we need a better way to do this.
 
 They are not generic then. Some file systems store things completely
 different from buffer head ring in page-private.

Yep. Instead of changing the whole world, I was hoping to come up with
few common interfaces (which doesn't assume anything about bufferheads
etc..) which are useful for more than one filesystem.


 
  3) We need add hooks into filesystem specific calls from these
  generic routines to handle journaling mode requirements
  (for ext3 and may be others).
 
 Please don't. There is no such thing as generic
 journalling. Traditional WAL used by ext3, phase-trees of Tux2, and
 wandering logs of reiser4 are so much different that there is no hope
 for a single API to accommodate them all. Adding such API will only
 force more workarounds and hacks in non-ext3 file systems.
 
 What _is_ common to all journalling file systems on the other hand, is
 the notion of transaction as the natural unit of caching and
 write-out. Currently in Linux, write-out is inode-based
 (-writepages()). Reiser4 already has a patch that replaces
 sync_sb_inodes() function with super-block operation. In reiser4 case,
 this operation scans the list of transactions (instead of the list of
 inodes) and writes some of them out, which is natural thing to do for a
 journalled file system.
 
 Similarly, transaction is a unit of caching: it's often necessary to
 scan all pages of a given transaction, all dirty pages of a given
 transaction, or to check whether given page belongs to a given
 transaction. That is, transaction plays role similar to struct
 address_space. But currently there is 1-to-1 relation between inodes and
 address_spaces, and this forces file system to implement additional data
 structures to duplicate functionality already present in address_space.
 
  So, what are your requirements ?  I am looking for a common
  way to combine all the requirements and come out with a
  saner generic routines to handle these.
 
 
 I think that one reasonable way to add generic support for journalling
 is to split struct address_space into two objects: lower layer that
 represents file (say, struct vm_file), in which pages are linearly
 ordered, and on top of this vm_cache (representing transaction) that
 keeps track of pages from various vm_file's. vm_file is embedded into
 inode, and vm_cache has a pointer to (the analog of) struct
 address_space_operations.
 
 vm_cache's are created by file system back-end as necessary (can be
 embedded into inode for non-journalled file systems). VM scanner and
 balance_dirty_pages() call vm_cache operations to do write-out.

Need to think some more. I guess you thought about this more than you
do :)

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Badari Pulavarty
On Tue, 2005-04-19 at 08:04, Alex Tomas wrote:
  Badari Pulavarty (BP) writes:
  
   you can introduce one more bit to page-flags
 
  BP Agreed. I was hoping to avoid it as much as I can.
 
 well, you're gonna modify mpage api anyway ...

Okay, I will give a serious look then. Last time, I tried to
go near page-flags I got slapped :( This time we have a
valid reason, I guess :)

 
  BP What I meant by jounalling mode is that - after the pages are submitted
  BP for IO, we need some way of waiting for the IOs to finish inorder to
  BP guarantee the ordering ? Is this not needed for anything other than
  BP ext3 ?
 
 1) i'm not sure anyone else supports this

Fair enough. If no one needs it - lets keep the interface simple.

 2) Andrew proposed the excelent solution

Well, I wasn't sure how heavy thats going to be. He was recommending
that we flush all dirty pages from all inodes for each transaction
commit. Isn't it ?

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Badari Pulavarty
Alex Tomas wrote:
Nikita Danilov (ND) writes:

   In order to do the correct accounting, we need to mark a page
   to indicate if we reserved a block or not. One way to do this,
   to use page-private to indicate this. But then, all the generic
  
  I believe one can use PG_mappedtodisk bit in page-flags for this
  purpose. There was old Andrew Morton's patch that introduced new bit
  (PG_delalloc?) for this purpose.
  
  That would be good. But I don't feel like asking for a bit in page
  if there is a way to get around it.

 ND Clarification: PG_mappedtodisk is already here, it seems you can reuse
 ND this already existing bit to implement delayed allocation support.
I think we need another one, because mappedtodisk != reserved. we could use
mappedtodisk, but this means in -commit_write() we'd need to check that one
more time (first time in -prepare_write())
Yep. We need one more to indicate the we reserved a block for this page.
Other option I was thinking on how to avoid is, by reserving a block
when a mapped page changes from read - write. Andrew's -mm tree has
patch to give us a notification when it happens.
Thanks,
Badari
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lazy block allocation and block_prepare_write?

2005-04-18 Thread Badari Pulavarty
Martin Jambor wrote:
Hi all,
I am a member of a group that implements a filesystem that allocates
disk blocks to in-memory blocks lazily, that means, the decision is
made just before the data are actually sent to disk. Moreover, when
cached pages are modified, the data can be (and almost certainly will
be) written to a different place to from where it was read.
I was wondering, whether we could use the generic function
block_prepare_write at all. The function checks every buffer of the
page and if it is not mapped, it calls a fs supplied function that is
supposed to map the buffer, i.e. assign it a block on the device and
set its mapped flag.
This is where we would like to give an error if there is not enough
free disk space left but we cannot give a specific device block number
yet. Can we make one up, such as -1? What would that do to such dark
functions as unmap_underlying_metadata or any other? Would some other
part of kernel break if there was a bunch of buffers assigned to the
same spot on the disk?
On the other hand, if I understand buffer flags correctly, I need to
be able to emulate mapping of buffers to set them dirty, or em I
wrong?
Thanks for any insight or thoughts,
Yes. Its possible to do what you want to. I am currently working on
adding delayed allocation support to ext3. As part of that, We
are modifying generic helper routines to delay the allocation from
prepare time to actual writeout time. (writepage).
Here is the basic idea:
===
The idea is to reserve a block at the prepare/commit write instead
of allocating the block. Do the actual allocation in writepage().
Sounds simple :)
Here are the issues:

1) Currently none of the generic helper routines can handle this.
We need to add support to do these, but still somehow make the
routines generic enough for every ones use.
2) There is no easy way to find out if we reserved a block or
not in writepage() correctly. There are 2 paths to writepage().
sys_write() - prepare/commit()
and later sync()  writepage()
mmap() - touch a page()
and later -- writepage()
In order to do the correct accounting, we need to mark a page
to indicate if we reserved a block or not. One way to do this,
to use page-private to indicate this. But then, all the generic
routines will fail - since they assume that page-private represents
bufferheads. So we need a better way to do this.
3) We need add hooks into filesystem specific calls from these
generic routines to handle journaling mode requirements
(for ext3 and may be others).
So, what are your requirements ?  I am looking for a common
way to combine all the requirements and come out with a
saner generic routines to handle these.
Thanks,
Badari
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Delayed alloc for ordered-mode

2005-03-13 Thread Badari Pulavarty
I think adding support to JBD to deal with bios  would be
a valuable generic extention. Anyway its dealing with bhs
now.
Thanks,
Badari
Suparna Bhattacharya wrote:
What would be really nice is if we could do this in a way that
enables reuse of generic paths even for ordered mode. One thought
that comes to mind is journal commit waiting for writeback to 
complete on the data pages which need to be flushed to disk before 
meta-data can be committed, much like we do for O_SYNC. 

I realise that JBD is intended to work at a level of abstraction
where it has no awareness of filesystems - hence the correspondence
with buffer heads all through. So would the above be a complete
no-no ?
Regards
Suparna
On Fri, Mar 04, 2005 at 06:02:35PM +0300, Alex Tomas wrote:
On 03 Mar 2005 17:12:14 -0800
Badari Pulavarty [EMAIL PROTECTED] wrote:

One more thing, we need to keep in mind is - we need to make sure
that ordered mode also improved - since all our testcode 
focuses on writeback mode and the default mode is ordered :(

I've just cooked the patch to implement ordered mode for delayed
allocation path. please take it:
ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch
Stephen, Andrew could you review it, please?
thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-03 Thread Badari Pulavarty
On Thu, 2005-03-03 at 00:33, Suparna Bhattacharya wrote:
 Since the performance improvements seen so far are quite encouraging, 
 and momentum is picking up so well, I started looking through the
 patches from Alex ... just a quick code walkthrough to get a hang
 of it and think about what kind of simplifications might be possible
 and what it might take for inclusion.
 
 I haven't had a chance to go too deep line by line yet,
 but thought I'd initiate some discussion with some first impressions
 and summary of what directions I hear several people converging
 towards to validate if I'm on the right track here.
 
 diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 
 47 deletions. The largest is in the extents patch (2743), mballoc 
 is 1968, and delalloc is 1209. To use delalloc, which gives us
 all the performance benefits, right now we need all the 3 patches
 to be used in conjunction. Supporting extent map btrees as well 
 as traditional indexing and associated options for compatibility etc
 is perhaps the more invasive of changes. Given that keeping ext3 
 stable and maintainable is a key concern (that is after all a major 
 reason why a lot of users rely on ext3), a somewhat incremental 
 approach is desirable. 
 
 So, I'll start from the direction that has been suggested by
 some -- (1) delayed allocation without changing the
 on-disk format. And then later (2) go on to breaking format with 
 all changes for scalability to larger files with full extents 
 support (haven't thought enough about this yet - maybe in a
 separate mail)
 

Just doing delayed allocation without multiblock allocation
(with the current layout) is not really a useful thing, IMHO.
We will benifit few cases, but in general - we moved the
block allocation overhead from prepare write to writepages/writepage
time. There is a little benifit of not doing journaling twice etc..
but I don't think it would be enough to justify the effort. 
Isn't it ?

So, may be we should look at adding multiblock allocation +
delayed allocation to current ext3 layout. Then we can evaluate
the benifits of having extents etc and then break the layout ?

One more thing, we need to keep in mind is - we need to make sure
that ordered mode also improved - since all our testcode 
focuses on writeback mode and the default mode is ordered :(


Thanks,
Badari


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Generic mpage_writepage() support

2005-02-16 Thread Badari Pulavarty
On Wed, 2005-02-16 at 11:09, Dave Kleikamp wrote:
 On Wed, 2005-02-16 at 10:37 -0800, Badari Pulavarty wrote:
 
  Yes. page-private is assumed for the bufferhead usage. Do you really
  need for handling page-private for non-bufferhead usage ?
 
 For what it's worth, I'm working on some changes to jfs that will use
 page-private for non-bufferhead usage for metadata, but I won't be
 using a generic writepage, so it's not an issue for me.

Nope. it would be an issue for you, since jfs uses mpage_writepages()
which uses the same code - which thinks page-private as bufferhead.

 
 mpage.c already assumes page-private implies bufferheads, so it's not
 completely generic.  Would implementing this as nobh_write_full_page, to
 complement block_write_full_page, make sense?

I guess, it can be done. So to really deal with this, we need to come
up with generic writepage/writepages interfaces which doesn't deal
with bufferheads.

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] Re: [RFC] [PATCH] Generic mpage_writepage() support

2005-02-16 Thread Badari Pulavarty
On Wed, 2005-02-16 at 11:43, Dave Kleikamp wrote:
 On Wed, 2005-02-16 at 11:28 -0800, Badari Pulavarty wrote:
  On Wed, 2005-02-16 at 11:09, Dave Kleikamp wrote:
   On Wed, 2005-02-16 at 10:37 -0800, Badari Pulavarty wrote:
   
Yes. page-private is assumed for the bufferhead usage. Do you really
need for handling page-private for non-bufferhead usage ?
   
   For what it's worth, I'm working on some changes to jfs that will use
   page-private for non-bufferhead usage for metadata, but I won't be
   using a generic writepage, so it's not an issue for me.
  
  Nope. it would be an issue for you, since jfs uses mpage_writepages()
  which uses the same code - which thinks page-private as bufferhead.
 
 The patch I am working on will call mpage_writepages() for metadata, but
 will use my own writepage() rather than mpage_writepage(), and nothing
 in mpage_writepages() will use page-private.

Okay, that will work. Basically, you plan to call mpage_writepages()
with NULL as get_block(). Isn't it ?

 For normal data, page-private, if used at all, will be bufferheads.

Good.

 
  
   mpage.c already assumes page-private implies bufferheads, so it's not
   completely generic.  Would implementing this as nobh_write_full_page, to
   complement block_write_full_page, make sense?
  
  I guess, it can be done. So to really deal with this, we need to come
  up with generic writepage/writepages interfaces which doesn't deal
  with bufferheads.
 
 I'm not sure how useful that would be.  Are there any users of a
 non-bufferhead page-private that want to call a generic writepage(s)?
 In other words, if a generic function is sufficient, you probably
 wouldn't be using page-private anyway.

This goes back to original question, is there a point in creating
new interfaces for writepage  writepages which doesn't assuming
page-private. So far, only users are ext2+nobh and JFS. But both
of them will not use page-private for anything else other than
bufferheads (if at all used). For both of these, the mpage_writepage()
I cooked up earlier would be good enough.

I guess, we will do it ONLY if someone really needs it. 

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bufferheads page-cache reference

2005-02-15 Thread Badari Pulavarty
On Mon, 2005-02-14 at 18:57, Andrew Morton wrote:
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
   Most of DB2 customers use filesystem for their database. Under the load,
   they complain that entire memory in the system is used by filesystem
   pagecache, freememory is very low and system starts swapping crazy OR
   see lots of memory allocation failures and OOM killer kills db2.
   slabinfo shows lots of bufferheads and VM folks claim that, bufferheads
   are holding a ref. on the pages, so they can't use them. So, I want
   to find the truth in the story and findout what exactly happening here
   and which one to blame (VM or FS or IO problems) ?
  
   BTW, all these on 2.4 kernels and I don't have a reproducible testcase
   :(
  
   Feb 7 05:35:17 nmcopsu41 kernel: ENOMEM in do_get_write_access,
   retrying.
 
 Do these machines have a large amount of highmem?
 
 If so, yes, you can oom because lots of highmem pages have buffer_heads
 attached and you've run out of lowmem.  The 2.4 VM will go off looking for
 lowmem pages to reclaim and will ignore the highmem pages because there's
 no highmem shortage.  Consequently those buffer_heads don't get freed up
 and we're unable to reclaim any lowmem - oom.
 
 Andrea did a patch along time ago (it'll be in suse 2.4 kernels) which,
 under these circumstances, strip the buffers from those highmem pages when
 they're encountered on the LRU.  From a quick read it seems that that patch
 is not in current 2.4 kernels.
 
 It's harder to do that in 2.6 because we have a separate LR per zone.

Our DB2 folks *claims* to have seen this problem both on ia32 and AMD64
customers.  So, I am not sure if its really only highmem related. Only
workaround seems to be configure DB2 to not to use more than 1.5GB on a
8GB RAM system :(

I have nothing much to go on, other than looking data from a sick 
machine. What should I be looking at, to narrow down the problem
some more ?

BTW, none of these BIG customers will take a patch to figure out
whats happening (since its on their production system) :(


Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bufferheads page-cache reference

2005-02-15 Thread Badari Pulavarty
On Tue, 2005-02-15 at 09:54, Andrew Morton wrote:
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  Yep. nobh_prepare_write() doesn't add any bufferheads. But
   we call block_write_full_page() even for nobh case, which 
   does create bufferheads, attaches to the page and operates
   on them..
 
 hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -

Thanks. I just wanted to make sure I am not confused.

 when a dirty page falls off the end of the LRU.  There's no particular
 reason why we cannot have a real mpage_writepage() which doesn't use bh's
 and employ that.
 
 I coulda sworn we used to have one.

Okay, I will cook it up. 

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bufferheads page-cache reference

2005-02-15 Thread Badari Pulavarty
On Tue, 2005-02-15 at 11:07, Nikita Danilov wrote:
 Andrew Morton [EMAIL PROTECTED] writes:
 
  Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  Yep. nobh_prepare_write() doesn't add any bufferheads. But
   we call block_write_full_page() even for nobh case, which 
   does create bufferheads, attaches to the page and operates
   on them..
 
  hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -
  when a dirty page falls off the end of the LRU.  There's no particular
 
 Maybe DB2 dirties pages through mmap?
 
 Nikita.

It does. Most databases dirty data using shared memory segments, but
they do write to filesystem directly also.

I am not looking at DB2 issue here. I am looking at bufferheads usage
in general. 

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] Re: [RFC] ext3 writepages for writeback mode

2005-02-14 Thread Badari Pulavarty
On Sat, 2005-02-12 at 15:29, Alex Tomas wrote:
  Badari Pulavarty (BP) writes:
 
   UP, after:
   [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
   real0m21.452s user0m0.026s sys 0m6.105s
   [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
   real0m23.317s user0m0.027s sys 0m4.206s
   [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
   real0m25.190s user0m0.022s sys 0m4.026s
   [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
   real0m25.024s user0m0.021s sys 0m4.103s
   UP, before:
   [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
   real0m19.138s user0m0.021s sys 0m5.980s
   [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
   real0m23.123s user0m0.026s sys 0m4.083s
   [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
   real0m25.152s user0m0.017s sys 0m3.836s
   [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
   real0m25.217s user0m0.026s sys 0m3.827s
 
  BP Hmm.. On UP, the patch made sys.time little worse ?
 
 g. sorry, that's a typo. exchange 'after' and 'before' please.

Thank you. I was worried for a minute.

 btw, want to see how delayed allocation manages this?

Sure. I think it will improve the allocation case. 
Non-allocation case, should be pretty much same, provided
I got contiguous layout on the disk.

Isn't it ?

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bufferheads page-cache reference

2005-02-14 Thread Badari Pulavarty
On Mon, 2005-02-14 at 14:50, William Lee Irwin III wrote:
 On Mon, Feb 14, 2005 at 01:40:58PM -0800, Andrew Morton wrote:
  Seems about right.  There's also the buffer_heads_over_limit logic in
  mm/vmscan.c and fs/buffer.c.  That logic has a hole in that it requires
  that there be a highmem shortage before we start to reclaim the lowmem
  buffer_heads, but it is somewhat helpful.
 
 William Lee Irwin III [EMAIL PROTECTED] wrote:
  It would be beneficial to close the hole in that logic.
 
 On Mon, Feb 14, 2005 at 02:31:42PM -0800, Andrew Morton wrote:
  Really?  Who's hurting?
 
 Apart from the fact that buffer_head proliferation has been a perennial
 problem, there's little to go on. By and large 2.6.x production usage
 is not yet very significant where I can see it, where the production
 usage is largely more stressful on account of very long durations and
 the variety of unusual situations to which the kernel is subjected.
 

Now that we are on the subject of bufferheads and filesystem pagecache,
the reason I am looking thro the code closely is ..

Most of DB2 customers use filesystem for their database. Under the load,
they complain that entire memory in the system is used by filesystem
pagecache, freememory is very low and system starts swapping crazy OR
see lots of memory allocation failures and OOM killer kills db2.
slabinfo shows lots of bufferheads and VM folks claim that, bufferheads
are holding a ref. on the pages, so they can't use them. So, I want
to find the truth in the story and findout what exactly happening here
and which one to blame (VM or FS or IO problems) ?

BTW, all these on 2.4 kernels and I don't have a reproducible testcase
:(

Feb 7 05:35:17 nmcopsu41 kernel: ENOMEM in do_get_write_access,
retrying.
Feb 7 05:35:18 nmcopsu41 kernel: Out of Memory: Killed process 18517
(db2sysc).
Feb 7 05:35:25 nmcopsu41 kernel: Out of Memory: Killed process 18660
(db2sysc).
Feb 7 05:35:29 nmcopsu41 kernel: Out of Memory: Killed process 18873
(db2sysc).


total used free shared buffers cached
Mem: 16304560 16284152 20408 0 228428 15093736
-/+ buffers/cache: 961988 15342572
Swap: 35655616 24448 35631168
Total: 51960176 16308600 35651576


Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bufferheads page-cache reference

2005-02-14 Thread Badari Pulavarty
On Mon, 2005-02-14 at 13:40, Andrew Morton wrote:
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  I see that as part of bufferheads to page association, we get a
   ref. on the page.
  
 create_empty_buffers() - attach_page_buffers() - page_cache_get()
  
   I also see that this reference get dropped by ..
  
 shrink_list() - try_to_release_page() -
   try_to_free_buffers() - drop_buffers() -
__clear_page_buffers()- page_cache_release();
  
   So, it looks like we drop the reference on the page and disassociate
   bufferheads from the page when VM wants to re-use the page. Only other
   path, I see this can happen is through invalidate_mapping_pages().
   Is this true ?
  
   If I do fsync(), we flush the data - still leave the page  bufferhead
   association. If I see lots of bufferheads even after fsync() is normal.
   Correct ?
 
 Seems about right.  There's also the buffer_heads_over_limit logic in
 mm/vmscan.c and fs/buffer.c.  That logic has a hole in that it requires
 that there be a highmem shortage before we start to reclaim the lowmem
 buffer_heads, but it is somewhat helpful.
 

Is there anything wrong, if we tear down bufferheads after the
writepage/writepages is complete ? may be -nobh option for ext3 ?

Even for ext2 with -nobh and JFS - we seem to attach buffer heads
to page in __block_write_full_page() and leave them around. I was
thinking, they gets tossed out after the write-out completes. No ?

These bufferheads are driving me crazy :)

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] Re: [RFC] ext3 writepages for writeback mode

2005-02-12 Thread Badari Pulavarty
Alex Tomas wrote:
Badari Pulavarty (BP) writes:

 BP Test: writes 10,000 blocks of 64k and does fdatasync().
The patch doesn't apply ;) after a minor correction
I've tested the patch too:
SMP, before:
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m17.748s user0m0.032s sys 0m6.031s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m16.016s user0m0.015s sys 0m3.819s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m19.074s user0m0.014s sys 0m3.761s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m22.011s user0m0.010s sys 0m3.773s
SMP, after:
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m20.676s user0m0.023s sys 0m5.643s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m13.188s user0m0.016s sys 0m3.511s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m23.725s user0m0.017s sys 0m3.399s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m23.684s user0m0.013s sys 0m3.421s 


UP, after:
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m21.452s user0m0.026s sys 0m6.105s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m23.317s user0m0.027s sys 0m4.206s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m25.190s user0m0.022s sys 0m4.026s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m25.024s user0m0.021s sys 0m4.103s
UP, before:
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m19.138s user0m0.021s sys 0m5.980s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m23.123s user0m0.026s sys 0m4.083s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m25.152s user0m0.017s sys 0m3.836s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m25.217s user0m0.026s sys 0m3.827s

AFAICS, the patch makes sense, but in my setup it doesn't
reduce sys.time so dramatically. all my boxes are P3-based.
thanks, Alex

Thank you for testing and confirming the results.
Is this on a simple single disk configuration ?
Thanks,
Badari
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] Re: [RFC] ext3 writepages for writeback mode

2005-02-12 Thread Badari Pulavarty
Alex Tomas wrote:
Badari Pulavarty (BP) writes:

 BP Test: writes 10,000 blocks of 64k and does fdatasync().
The patch doesn't apply ;) after a minor correction
I've tested the patch too:
...
UP, after:
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m21.452s user0m0.026s sys 0m6.105s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m23.317s user0m0.027s sys 0m4.206s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m25.190s user0m0.022s sys 0m4.026s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m25.024s user0m0.021s sys 0m4.103s
UP, before:
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m19.138s user0m0.021s sys 0m5.980s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m23.123s user0m0.026s sys 0m4.083s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m25.152s user0m0.017s sys 0m3.836s
[EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1
real0m25.217s user0m0.026s sys 0m3.827s
Hmm.. On UP, the patch made sys.time little worse ?
- Badari
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext3 writepages for writeback mode

2005-02-11 Thread Badari Pulavarty
On Fri, 2005-02-11 at 15:58, Andrew Morton wrote:
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  Due to lack of interesting suggestions to solve 
  mpage_writepages() - ext3_writeback_writepage() problem,
  I fixed it in the dumbest possible way.
 
 I've actually forgotten what the problem was.  It was 100 patches ago :(

The problem was 
ext3_writeback_writepages() - mpage_writepages() could
call back ext3_writeback_writepage() in the confused case.
ext3_writeback_writepage() could end up doing nothing, since the
we already have a journal handle.

I added a writepage_helper to handle this case.

 
  Let me know, what you think.
 
 
 If it works, let's get some benchmark numbers so we can decide whether it
 justifies more development?
 

Yep. I will get some numbers to see ..

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext3 writepages for writeback mode

2005-02-11 Thread Badari Pulavarty
On Fri, 2005-02-11 at 15:58, Andrew Morton wrote:
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  Due to lack of interesting suggestions to solve 
  mpage_writepages() - ext3_writeback_writepage() problem,
  I fixed it in the dumbest possible way.
 
 I've actually forgotten what the problem was.  It was 100 patches ago :(
 
  Let me know, what you think.
 
 
 If it works, let's get some benchmark numbers so we can decide whether it
 justifies more development?

Okay, here is a quick data I collected. More to follow..

Test: writes 10,000 blocks of 64k and does fdatasync().

Thanks,
Badari

BEFORE: (without writepages support)

   
elm3b29:/mnt # touch file
elm3b29:/mnt # time /tmp/writer file
real0m23.746s user0m0.000s sys 0m5.020s (allocation)
elm3b29:/mnt # time /tmp/writer file
real0m20.950s user0m0.001s sys 0m2.278s (no alloc)
elm3b29:/mnt # time /tmp/writer file
real0m21.030s user0m0.001s sys 0m2.254s (no alloc)
elm3b29:/mnt # time /tmp/writer file
real0m20.577s user0m0.001s sys 0m2.184s (no alloc)

   

AFTER: (with writepages support)

   
elm3b29:/mnt # touch file
elm3b29:/mnt # time /tmp/writer file
real0m23.230s user0m0.001s sys 0m4.132s (allocation)
elm3b29:/mnt # time /tmp/writer file
real0m20.175s user0m0.004s sys 0m1.756s (no alloc)
elm3b29:/mnt # time /tmp/writer file
real0m20.368s user0m0.001s sys 0m1.696s (no alloc)
elm3b29:/mnt # time /tmp/writer file
real0m20.626s user0m0.002s sys 0m1.763s (no alloc)



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3 writepages ?

2005-02-10 Thread Badari Pulavarty
On Wed, 2005-02-09 at 18:05, Bryan Henderson wrote:
 I see much larger IO chunks and better throughput. So, I guess its
 worth doing it
 
 I hate to see something like this go ahead based on empirical results 
 without theory.  It might make things worse somewhere else.
 
 Do you have an explanation for why the IO chunks are larger?  Is the I/O 
 scheduler not building large I/Os out of small requests?  Is the queue 
 running dry while the device is actually busy?
 

Bryan,

I would like to find out what theory you are looking for.

Don't you think, filesystems submitting biggest chunks of IO
possible is better than submitting 1k-4k chunks and hoping that
IO schedulers do the perfect job ? 

BTW, writepages() is being used for other filesystems like JFS.

We all learnt thro 2.4 RAW code about the overhead of doing 512bytes
IO and making the elevator merge all the peices together. Thats
one reason why 2.6 DIO/RAW code is completely written from
scratch to submit the biggest possible IO chunks.

Well, I agree that we should have theory behind the results.
We are just playing with prototypes for now.

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3 writepages ?

2005-02-10 Thread Badari Pulavarty
On Thu, 2005-02-10 at 10:00, Bryan Henderson wrote:
 Don't you think, filesystems submitting biggest chunks of IO
 possible is better than submitting 1k-4k chunks and hoping that
 IO schedulers do the perfect job ? 
 
 No, I don't see why it would better.  In fact intuitively, I think the I/O 
 scheduler, being closer to the device, should do a better job of deciding 
 in what packages I/O should go to the device.  After all, there exist 
 block devices that don't process big chunks faster than small ones.  But 
 
 So this starts to look like something where you withhold data from the I/O 
 scheduler in order to prevent it from scheduling the I/O wrongly because 
 you (the pager/filesystem driver) know better.  That shouldn't be the 
 architecture.
 
 So I'd like still like to see a theory that explains why submitting the 
 I/O a little at a time (i.e. including the bio_submit() in the loop that 
 assembles the I/O) causes the device to be idle more.
 
 We all learnt thro 2.4 RAW code about the overhead of doing 512bytes
 IO and making the elevator merge all the peices together.
 
 That was CPU time, right?  In the present case, the numbers say it takes 
 the same amount of CPU time to assemble the I/O above the I/O scheduler as 
 inside it.

One clear distinction between submitting smaller chunks vs larger
ones is - number of call backs we get and the processing we need to
do.

I don't think we have enough numbers here to get to bottom of this.
CPU utilization remains same in both cases, doesn't mean that - the
test took exactly same amount of time. I don't even think that we
are doing a fixed number of IOs. Its possible that by doing larger
IOs we save CPU and use that CPU to push more data ?



Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] Re: journal start/stop in ext3_writeback_writepage()

2005-02-10 Thread Badari Pulavarty
On Thu, 2005-02-10 at 15:12, Stephen C. Tweedie wrote:
 Hi,
 
 On Thu, 2005-02-10 at 20:21, Andrew Morton wrote:
 
   But I still don't understand why this can't happen
thro original code ..
 
what am i missing ?
  
  presumably there are never any dirty pages or inodes when we run
  journal_destroy().
 
 I assume so, yes.  If there is no a_ops-writepages(), then we default
 to generic_writepages() which is a noop if there are no dirty pages.  If
 your new ext3-specific writepages code tries to do a journal_start() in
 that case, then yes, it is likely to blow up spectacularly during
 journal_destroy!
 
 --Stephen

Yep. I found this hardway that exactly whats happening.
generic_writepages() is clever enough to do nothing, if there are no
dirty pages. But I am being stupid in my writepages(). 

I need to teach writepages() to nothing in case of no dirty pages. 
Is there a easy way like checking a count somewhere than doing all the
stuff mpage_writepages() is doing to figure this out, like ..


while (!done  (index = end) 
(nr_pages = pagevec_lookup_tag(pvec, mapping,
index,
PAGECACHE_TAG_DIRTY,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)))
...

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] ext3 writepages for writeback mode

2005-02-10 Thread Badari Pulavarty
Hi,

Here is my first cut at adding writepages() support for
ext3 writeback mode.

I have not done any performance analysis on the patch, 
so try it at your own risk.

Please let me know, if I am completely off or its a
stupid idea.

Thanks,
Badari


--- linux-2.6.10.org/fs/ext3/inode.c2004-12-06 11:45:49.0 -0800
+++ linux-2.6.10/fs/ext3/inode.c2005-02-10 18:14:17.987263744 -0800
@@ -856,6 +856,12 @@
return ret;
 }
 
+static int ext3_writepages_get_block(struct inode *inode, sector_t iblock,
+   struct buffer_head *bh, int create)
+{
+   return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create);
+}
+
 /*
  * `handle' can be NULL if create is zero
  */
@@ -1321,6 +1327,37 @@
return ret;
 }
 
+static int
+ext3_writeback_writepages(struct address_space *mapping, 
+   struct writeback_control *wbc)
+{
+   struct inode *inode = mapping-host;
+   handle_t *handle = NULL;
+   int err, ret = 0;
+
+   if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+   return ret;
+
+   handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
+   if (IS_ERR(handle)) {
+   ret = PTR_ERR(handle);
+   return ret;
+   }
+
+ret = mpage_writepages(mapping, wbc, ext3_writepages_get_block);
+
+   /*
+* Need to reaquire the handle since ext3_writepages_get_block()
+* can restart the handle
+*/
+   handle = journal_current_handle();
+
+   err = ext3_journal_stop(handle);
+   if (!ret)
+   ret = err;
+   return ret;
+}
+
 static int ext3_writeback_writepage(struct page *page,
struct writeback_control *wbc)
 {
@@ -1552,6 +1589,7 @@
.readpage   = ext3_readpage,
.readpages  = ext3_readpages,
.writepage  = ext3_writeback_writepage,
+   .writepages = ext3_writeback_writepages,
.sync_page  = block_sync_page,
.prepare_write  = ext3_prepare_write,
.commit_write   = ext3_writeback_commit_write,


Re: [RFC] ext3 writepages for writeback mode

2005-02-10 Thread Badari Pulavarty
Andrew Morton wrote:
Badari Pulavarty [EMAIL PROTECTED] wrote:
Here is my first cut at adding writepages() support for
ext3 writeback mode.

Looks sane from a brief scan.
Well, not really..
mpage_writepages() could end up calling ext3_writeback_writepage()
in confused case thro ..
*ret = page-mapping-a_ops-writepage(page, wbc);
which ends up doing nothing and leaves the page dirty, since there is 
journal handle started :(

if (ext3_journal_current_handle())
goto out_fail;
Ideas ?
Thanks,
Badari
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: journal start/stop in ext3_writeback_writepage()

2005-02-09 Thread Badari Pulavarty
On Wed, 2005-02-09 at 08:37, Stephen C. Tweedie wrote:
 Hi,
 
 On Wed, 2005-02-09 at 16:18, Badari Pulavarty wrote:
 
  I am trying to understand journaling code in ext3. 
  Can some one enlighten me, why we need journal start
  and stop in ext3_writeback_writepage() ? The block
  allocation is already made in prepare_write().
 
 prepare_write()/commit_write() are used for write(2) writes: the data is
 dirtied, but not immediately queued for IO (unless you're using O_SYNC).
  
 writepage is used when you want to write the page's data to disk
 *immediately* --- it's used when the VM is swapping out an mmaped file,
 or for msync().
 
 So when writepage comes in, there's no guarantee that we've had a
 previous prepare.  You can, for example, use ftruncate() to create a
 large hole in a file, and then mmap() it; if you then dirty a page, then
 the allocation occurs in the writepage().  So a transaction handle is
 necessary.
 
 --Stephen

Thank you. It make sense now. 

I was under the impression that writepage() could also be used to
flush the data even for write(2) writes. Is it not true ? 

I was trying to add writepages() interface for ext3. I am wondering
if I need to do journaling for that case too.

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mpage writepage question

2005-02-03 Thread Badari Pulavarty
Hi Andrew,

I was wondering why mpage_writepage() is only static ?

Is the expectation that, filesystems use

.writepage == block_full_write_page
.writepages == mpage_writepages 

? I am little confused on why we have 2 different ways to
do things ? block_full_write_page() seems to be creating
buffer heads, where as mpage_writepages() can do directly
bios. Shouldn't they be using mpage_writepage() instead of
block_full_write_page() ?

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3 writepages ?

2005-02-03 Thread Badari Pulavarty
On Thu, 2005-02-03 at 09:00, Sonny Rao wrote:

 
 Well it seems to work, here's my (rather ugly) patch.
 I'm doing some performance comparisons now.
 
 Sonny

Interesting.. Why did you create a nobh_prepare_write() ?
mpage_writepages() can handle  pages with buffer heads
attached.

And also, are you sure you don't need to journal start/stop
in writepages() ?

Thanks,
Badari



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html