Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
On Tue, 2007-12-11 at 17:00 -0800, Zach Brown wrote: Hisashi Hifumi wrote: Hi. Current dio has some problems: 1, In ext3 ordered, dio write can return with EIO because of the race between invalidation of a page and jbd. jbd pins the bhs while committing journal so try_to_release_page fails when jbd is committing the transaction. Yeah. It sure would be fantastic if some ext3 expert could stop this from happening somehow. But that hasn't happened in.. uh.. Badari, for how many years has this been on the radar? :) I used to have a test case that would reproduce the problem some what consistently. But with invalidate_range() introduction and Jan Kara's re-write of journal commit handling, I can't reproduce the problem anymore. So I gave up fixing the problem which I can't reproduce. If anyone has a testcase - I can take a look at the problem again. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Thu, 2007-11-01 at 10:10 -0800, Badari Pulavarty wrote: On Thu, 2007-11-01 at 11:51 -0400, Chris Mason wrote: On Thu, 01 Nov 2007 08:38:57 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote: On Wed, 31 Oct 2007 08:14:21 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: I tried data=writeback mode and it didn't help :( Ouch, so much for the easy way out. unable to release the page 262070 bh c000211b9408 flags 110029 count 1 private 0 unable to release the page 262098 bh c00020ec9198 flags 110029 count 1 private 0 memory offlining 3f000 to 4 failed The only other special thing reiserfs does with the page cache is file tails. I don't suppose all of these pages are index zero in files smaller than 4k? Ah !! I am so blind :( I have been suspecting reiserfs all along, since its executing fallback_migrate_page(). Actually, these buffer heads are backing blockdev. I guess these are metadata buffers :( I am not sure we can do much with these.. Hmpf, my first reply had a paragraph about the block device inode pages, I noticed the phrase file data pages and deleted it ;) But, for the metadata buffers there's not much we can do. They are included in a bunch of different lists and the patch would be non-trivial. Unfortunately, these buffer pages are spread all around making those sections of memory non-removable. Of course, one can use ZONE_MOVABLE to make sure to guarantee the remove. But I am hoping we could easily group all these allocations and minimize spreading them around. Mel ? BTW, I am having better luck with being able to offline sections of memory on x86-64, if I take out __GFP_MOVABLE flag for blockdev pages. (in grow_dev_page()). Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote: On Wed, 31 Oct 2007 08:14:21 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: I tried data=writeback mode and it didn't help :( Ouch, so much for the easy way out. unable to release the page 262070 bh c000211b9408 flags 110029 count 1 private 0 unable to release the page 262098 bh c00020ec9198 flags 110029 count 1 private 0 memory offlining 3f000 to 4 failed The only other special thing reiserfs does with the page cache is file tails. I don't suppose all of these pages are index zero in files smaller than 4k? Ah !! I am so blind :( I have been suspecting reiserfs all along, since its executing fallback_migrate_page(). Actually, these buffer heads are backing blockdev. I guess these are metadata buffers :( I am not sure we can do much with these.. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Thu, 2007-11-01 at 11:51 -0400, Chris Mason wrote: On Thu, 01 Nov 2007 08:38:57 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote: On Wed, 31 Oct 2007 08:14:21 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: I tried data=writeback mode and it didn't help :( Ouch, so much for the easy way out. unable to release the page 262070 bh c000211b9408 flags 110029 count 1 private 0 unable to release the page 262098 bh c00020ec9198 flags 110029 count 1 private 0 memory offlining 3f000 to 4 failed The only other special thing reiserfs does with the page cache is file tails. I don't suppose all of these pages are index zero in files smaller than 4k? Ah !! I am so blind :( I have been suspecting reiserfs all along, since its executing fallback_migrate_page(). Actually, these buffer heads are backing blockdev. I guess these are metadata buffers :( I am not sure we can do much with these.. Hmpf, my first reply had a paragraph about the block device inode pages, I noticed the phrase file data pages and deleted it ;) But, for the metadata buffers there's not much we can do. They are included in a bunch of different lists and the patch would be non-trivial. Unfortunately, these buffer pages are spread all around making those sections of memory non-removable. Of course, one can use ZONE_MOVABLE to make sure to guarantee the remove. But I am hoping we could easily group all these allocations and minimize spreading them around. Mel ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Tue, 2007-10-30 at 18:58 -0400, Chris Mason wrote: On Tue, 30 Oct 2007 13:54:05 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote: On Tue, 30 Oct 2007 10:27:04 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have -migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Any one know why the b_count == 1 and not getting dropped to zero ? If these are file data pages, the count is probably elevated as part of the data=ordered tracking. You can verify this via b_private, or just mount data=writeback to double check. Chris, That was my first assumption. But after looking at reiserfs_releasepage (), realized that it would do reiserfs_free_jh() and clears the b_private. I couldn't easily find out who has the ref. against this bh. bh cbdaaf00 flags 110029 count 1 private 0 If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps it is currently under IO? The page isn't locked, but data=ordered does IO directly on the buffer heads, without taking the page lock. The easy way to narrow our search is to try without data=ordered, it is certainly complicating things. I tried data=writeback mode and it didn't help :( unable to release the page 262070 bh c000211b9408 flags 110029 count 1 private 0 unable to release the page 262098 bh c00020ec9198 flags 110029 count 1 private 0 memory offlining 3f000 to 4 failed # cat /etc/mtab /dev/sda3 / reiserfs rw,data=writeback 0 0 proc /proc proc rw 0 0 tmpfs /dev/shm tmpfs rw 0 0 devpts /dev/pts devpts rw,mode=0620,gid=5 0 0 Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
migratepage failures on reiserfs
Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have -migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Any one know why the b_count == 1 and not getting dropped to zero ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote: On Tue, 30 Oct 2007 10:27:04 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have -migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Any one know why the b_count == 1 and not getting dropped to zero ? If these are file data pages, the count is probably elevated as part of the data=ordered tracking. You can verify this via b_private, or just mount data=writeback to double check. Chris, That was my first assumption. But after looking at reiserfs_releasepage (), realized that it would do reiserfs_free_jh() and clears the b_private. I couldn't easily find out who has the ref. against this bh. bh cbdaaf00 flags 110029 count 1 private 0 Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote: On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote: jbd/jbd2: Replace slab allocations with page cache allocations From: Christoph Lameter [EMAIL PROTECTED] JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Currently memory allocation for committed_data(and frozen_buffer) for bufferhead is done through jbd slab management, as Christoph Hellwig pointed out that this is broken as jbd should not pass slab pages down to IO layer. and suggested to use get_free_pages() directly. The problem with this patch, as Andreas Dilger pointed today in ext4 interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste 1/3-1/2 page space. What was the originally intention to set up slabs for committed_data(and frozen_buffer) in JBD? Why not using kmalloc? Mingming Looks good. Small suggestion is to get rid of all kmalloc() usages and consistently use jbd_kmalloc() or jbd2_kmalloc(). Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] dio: remove bogus refcounting BUG_ON
On Thu, 2007-07-05 at 10:11 -0700, Zach Brown wrote: the BUG_ON(). But unfortunately, our perf. team is able reproduce the problem. What are they doing to reproduce it? How much setup does it take? Huge OLTP run :( Debug indicated that, the ret2 == 1 :( That could be consistent with the theory that we're racing with the dio struct being freed and reused before it's tested in the BUG_ON() condition. Suparna's suggestion to sample dio-is_async before releasing the refcount and using that in the BUG_ON condition is a good one. I will ask them to try that. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] dio: remove bogus refcounting BUG_ON
On Tue, 2007-07-03 at 15:28 -0700, Zach Brown wrote: Linus, Andrew, please apply the bug fix patch at the end of this reply for .22. One of our perf. team ran into this while doing some runs. I didn't see anything obvious - it looks like we converted async IO to synchronous one. I didn't spend much time digging around. OK, I think this BUG_ON() is just broken. I wasn't able to find any obvious bugs from reading the code which would cause the BUG_ON() to fire. If it's reproducible I'd love to hear what the recipe is. I did notice that this BUG_ON() is evaluating dio after having dropped it's ref :/. So it's not completely absurd to fear that it's a race with the dio's memory being reused, but that'd be a pretty tight race. Let's remove this stupid BUG_ON and see if that test box still has trouble. It might just hit the valid BUG_ON a few lines down, but this unsafe BUG_ON needs to go. I went through the code multiple times, I can't find how we can trigger the BUG_ON(). But unfortunately, our perf. team is able reproduce the problem. Debug indicated that, the ret2 == 1 :( Not sure how that can happen. Ideas ? Thanks, Badari --- dio: remove bogus refcounting BUG_ON Badari Pulavarty reported a case of this BUG_ON is triggering during testing. It's completely bogus and should be removed. It's trying to notice if we left references to the dio hanging around in the sync case. They should have been dropped as IO completed while this path was in dio_await_completion(). This condition will also be checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few lines lower. So to start this BUG_ON() is redundant. More fatally, it's dereferencing dio- after having dropped its reference. It's only safe to dereference the dio after releasing the lock if the final reference was just dropped. Another CPU might free the dio in bio completion and reuse the memory after this path drops the dio lock but before the BUG_ON() is evaluated. This patch passed aio+dio regression unit tests and aio-stress on ext3. Signed-off-by: Zach Brown [EMAIL PROTECTED] Cc: Badari Pulavarty [EMAIL PROTECTED] diff -r 509ce354ae1b fs/direct-io.c --- a/fs/direct-io.c Sun Jul 01 22:00:49 2007 + +++ b/fs/direct-io.c Tue Jul 03 14:56:41 2007 -0700 @@ -1106,7 +1106,7 @@ direct_io_worker(int rw, struct kiocb *i spin_lock_irqsave(dio-bio_lock, flags); ret2 = --dio-refcount; spin_unlock_irqrestore(dio-bio_lock, flags); - BUG_ON(!dio-is_async ret2 != 0); + if (ret2 == 0) { ret = dio_complete(dio, offset, ret); kfree(dio); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
DIO panic on 2.6.21.5
Hi Zach, One of our perf. team ran into this while doing some runs. I didn't see anything obvious - it looks like we converted async IO to synchronous one. I didn't spend much time digging around. Is this a known issue ? Any ideas ? Thanks, Badari [ cut here ] kernel BUG at fs/direct-io.c:1113! invalid opcode: [1] SMP CPU 11 Modules linked in: oprofile raw capability commoncap qla2xxx ipv6 button battery ac loop dm_mod tg3 ext3 jbd edd fan thermal processor scsi_transport_fc sg aic94xx libsas firmware_class scsi_transport_sas serverworks sd_mod scsePid: 9709, comm: db2sysc Not tainted 2.6.21.5-smp #1 RIP: 0010:[802a7a1a] [802a7a1a] __blockdev_direct_IO+0x96c/0xa4a RSP: 0018:810c3d3efc08 EFLAGS: 00010202 RAX: 0246 RBX: 810041768b24 RCX: 80406918 RDX: RSI: 0246 RDI: 810041768b24 RBP: 810041768800 R08: 0086 R09: 81003f0a2370 R10: 810c3f3ee9d8 R11: 802e9716 R12: 0001 R13: fdef R14: 810c5beb42b0 R15: FS: 2b3d14ffe940() GS:810c444c95c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 2b3e84f57000 CR3: 4511c000 CR4: 06e0 Process db2sysc (pid: 9709, threadinfo 810c3d3ee000, task 810041fee760) Stack: 810c42f0fbc0 4c1f4000 811836306d48 0011803e7f95 0009 0008 810041fee760 00090001 0001 810024ac9948 Call Trace: [802a67c7] blkdev_direct_IO+0x45/0x4a [802a66ec] blkdev_get_blocks+0x0/0x96 [80260108] generic_file_direct_IO+0xbd/0xfb [802601a6] generic_file_direct_write+0x60/0xf5 [80260b70] __generic_file_aio_write_nolock+0x2e7/0x410 [8029a0ce] aio_read_evt+0x9a/0x108 [80260d8e] generic_file_aio_write_nolock+0x34/0x80 [80260d5a] generic_file_aio_write_nolock+0x0/0x80 [80299d23] aio_rw_vect_retry+0x72/0x14a [8029a63f] aio_run_iocb+0x9a/0x134 [8029b10a] io_submit_one+0x311/0x35e [8029b678] sys_io_submit+0x9b/0x101 [80229986] default_wake_function+0x0/0xe [80209b8e] system_call+0x7e/0x83 Code: 0f 0b eb fe 4d 85 e4 75 1d 48 8b 74 24 08 44 89 ea 48 89 ef RIP [802a7a1a] __blockdev_direct_IO+0x96c/0xa4a RSP 810c3d3efc08 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 13/14] ext3 whiteout support
On Mon, 2007-05-14 at 15:14 +0530, Bharata B Rao wrote: From: Bharata B Rao [EMAIL PROTECTED] Subject: ext3 whiteout support Introduce whiteout support for ext3. Signed-off-by: Bharata B Rao [EMAIL PROTECTED] Signed-off-by: Jan Blunck [EMAIL PROTECTED] --- fs/ext3/dir.c |2 - fs/ext3/namei.c | 62 fs/ext3/super.c | 11 +++- include/linux/ext3_fs.h |5 +++ 4 files changed, 72 insertions(+), 8 deletions(-) --- a/fs/ext3/dir.c +++ b/fs/ext3/dir.c @@ -29,7 +29,7 @@ #include linux/rbtree.h static unsigned char ext3_filetype_table[] = { - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK + DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK, DT_WHT }; static int ext3_readdir(struct file *, void *, filldir_t); --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -1071,6 +1071,7 @@ static unsigned char ext3_type_by_mode[S [S_IFIFO S_SHIFT]= EXT3_FT_FIFO, [S_IFSOCK S_SHIFT] = EXT3_FT_SOCK, [S_IFLNK S_SHIFT]= EXT3_FT_SYMLINK, + [S_IFWHT S_SHIFT]= EXT3_FT_WHT, }; static inline void ext3_set_de_type(struct super_block *sb, @@ -1786,7 +1787,7 @@ out_stop: /* * routine to check that the specified directory is empty (for rmdir) */ -static int empty_dir (struct inode * inode) +static int empty_dir (handle_t *handle, struct inode * inode) Is there a reason for passing the handle ? Why couldn't you get it from journal_current_handle() if needed to do the delete the whiteout ? { unsigned long offset; struct buffer_head * bh; @@ -1848,8 +1849,28 @@ static int empty_dir (struct inode * ino continue; } if (le32_to_cpu(de-inode)) { - brelse (bh); - return 0; + /* If this is a whiteout, remove it */ + if (de-file_type == EXT3_FT_WHT) { + unsigned long ino = le32_to_cpu(de-inode); + struct inode *tmp_inode = iget(inode-i_sb, ino); + if (!tmp_inode) { + brelse (bh); + return 0; + } + + if (ext3_delete_entry(handle, inode, de, bh)) { + iput(tmp_inode); + brelse (bh); + return 0; + } + + tmp_inode-i_ctime = inode-i_ctime; + tmp_inode-i_nlink--; + iput(tmp_inode); + } else { + brelse (bh); + return 0; + } } offset += le16_to_cpu(de-rec_len); de = (struct ext3_dir_entry_2 *) @@ -2031,7 +2052,7 @@ static int ext3_rmdir (struct inode * di goto end_rmdir; retval = -ENOTEMPTY; - if (!empty_dir (inode)) + if (!empty_dir (handle, inode)) goto end_rmdir; retval = ext3_delete_entry(handle, dir, de, bh); @@ -2060,6 +2081,36 @@ end_rmdir: return retval; } +static int ext3_whiteout(struct inode *dir, struct dentry *dentry) +{ + struct inode * inode; + int err, retries = 0; + handle_t *handle; + +retry: + handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir-i_sb) + + EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 + + 2*EXT3_QUOTA_INIT_BLOCKS(dir-i_sb)); + if (IS_ERR(handle)) + return PTR_ERR(handle); + + if (IS_DIRSYNC(dir)) + handle-h_sync = 1; + + inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO); + err = PTR_ERR(inode); + if (IS_ERR(inode)) + goto out_stop; Don't you need to call init_special_inode() here ? Or this is handled somewhere else ? + + err = ext3_add_nondir(handle, dentry, inode); + +out_stop: + ext3_journal_stop(handle); + if (err == -ENOSPC ext3_should_retry_alloc(dir-i_sb, retries)) + goto retry; + return err; +} + Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 5/14] Introduce union stack
On Mon, 2007-05-14 at 15:10 +0530, Bharata B Rao wrote: From: Jan Blunck [EMAIL PROTECTED] Subject: Introduce union stack. Adds union stack infrastructure to the dentry structure and provides locking routines to walk the union stack. Signed-off-by: Jan Blunck [EMAIL PROTECTED] Signed-off-by: Bharata B Rao [EMAIL PROTECTED] ... +/* + * This is a *I can't get no sleep* helper which is called when we try + * to access the struct fs_struct *fs field of a struct task_struct. + * + * Yes, this is possibly starving but we have to change root, altroot + * or pwd in the frequency of this while loop. Don't think that this + * happens really often ;) + * + * This is called while holding the rwlock_t fs-lock + * + * TODO: Unlocking side of union_lock_fs() needs 3 union_unlock()s. + * May be introduce union_unlock_fs(). + * + * FIXME: This routine is used when the caller wants to dget one or + * more of fs-[root, altroot, pwd]. When the caller doesn't want to + * dget _all_ of these, it is strictly not necessary to get union_locks + * on all of these. Check. + */ +static inline void union_lock_fs(struct fs_struct *fs) +{ + int locked; + + while (fs) { + locked = union_trylock(fs-root); + if (!locked) + goto loop1; + locked = union_trylock(fs-altroot); + if (!locked) + goto loop2; + locked = union_trylock(fs-pwd); + if (!locked) + goto loop3; + break; + loop3: + union_unlock(fs-altroot); + loop2: + union_unlock(fs-root); + loop1: + read_unlock(fs-lock); + UM_DEBUG_LOCK(Failed to get all semaphores in fs_struct!\n); + cpu_relax(); + read_lock(fs-lock); + continue; Nit.. why continue ? + } + BUG_ON(!fs); Whats the use of BUG_ON() here ? Top of the function would be more useful. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 5/14] Introduce union stack
On Mon, 2007-05-14 at 15:10 +0530, Bharata B Rao wrote: From: Jan Blunck [EMAIL PROTECTED] Subject: Introduce union stack. Adds union stack infrastructure to the dentry structure and provides locking routines to walk the union stack. ... --- /dev/null +++ b/include/linux/dcache_union.h @@ -0,0 +1,248 @@ +/* + * VFS based union mount for Linux + * + * Copyright © 2004-2007 IBM Corporation + * Author(s): Jan Blunck ([EMAIL PROTECTED]) + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + */ +#ifndef __LINUX_DCACHE_UNION_H +#define __LINUX_DCACHE_UNION_H +#ifdef __KERNEL__ + +#include linux/union_debug.h +#include linux/fs_struct.h +#include asm/atomic.h +#include asm/semaphore.h + +#ifdef CONFIG_UNION_MOUNT + +/* + * This is the union info object, that describes general information about this + * union directory + * + * u_mutex protects the union stack against modification. You can reach it + * through the d_union field in struct dentry. Hold it when you are walking + * or modifing the union stack ! + */ +struct union_info { + atomic_t u_count; + struct mutex u_mutex; +}; + +/* allocate/de-allocate */ +extern struct union_info *union_alloc(void); +extern struct union_info *union_get(struct union_info *); +extern void union_put(struct union_info *); + +/* + * These are the functions for locking a dentry's union. When one + * want to acquire a denties union lock, use: + * + * - union_lock() when you can sleep, + * - union_lock_spinlock() when you are holding a spinlock (that + * you CAN savely give up and reacquire again) + * - union_lock_readlock() when you are holding a readlock (that + * you CAN savely give up and reacquire again) + * + * Otherwise get the union lock early before you enter your + * no sleeping here code. + * + * NOTES: union_info structure is reference counted using u_count member. + * union_get() and union_put() which get and put references on union_info + * should be done under union_info's u_mutex. Since the last union_put() frees + * the union_info structure itself it can't obviously be done under u_mutex. + * union_release() should be used in such cases (Eg. dput(), umount()) where + * union_info is disassociated from the dentries, and it becomes safe + * to free the union_info. + */ +static inline void __union_lock(struct union_info *uinfo) +{ + BUG_ON(!atomic_read(uinfo-u_count)); + mutex_lock(uinfo-u_mutex); +} + +static inline void union_lock(struct dentry *dentry) +{ + if (unlikely(dentry dentry-d_union)) { + struct union_info *ui = dentry-d_union; + + UM_DEBUG_LOCK(\%s\ locking %p (count=%d)\n, + dentry-d_name.name, ui, + atomic_read(ui-u_count)); + __union_lock(dentry-d_union); + } +} + +static inline void __union_unlock(struct union_info *uinfo) +{ + BUG_ON(!atomic_read(uinfo-u_count)); + mutex_unlock(uinfo-u_mutex); +} + +static inline void union_unlock(struct dentry *dentry) +{ + if (unlikely(dentry dentry-d_union)) { + struct union_info *ui = dentry-d_union; + + UM_DEBUG_LOCK(\%s\ unlocking %p (count=%d)\n, + dentry-d_name.name, ui, + atomic_read(ui-u_count)); + __union_unlock(dentry-d_union); + } +} + +static inline void union_alloc_dentry(struct dentry *dentry) +{ + spin_lock(dentry-d_lock); + if (!dentry-d_union) { + dentry-d_union = union_alloc(); + spin_unlock(dentry-d_lock); + } else { + spin_unlock(dentry-d_lock); + union_lock(dentry); + } +} + +static inline struct union_info *union_lock_and_get(struct dentry *dentry) +{ + union_lock(dentry); + return union_get(dentry-d_union); +} + +/* Shouldn't be called with last reference to union_info */ +static inline void union_put_and_unlock(struct union_info *uinfo) +{ + union_put(uinfo); + __union_unlock(uinfo-u_mutex); ^^^ It should be __union_unlock(uinfo); Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.21-rc6 new aops patchset
On Thu, 2007-04-12 at 06:48 +0200, Nick Piggin wrote: http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ 2.6.21-rc6-new-aops* New aops patchset against 2.6.21-rc6. Building modules, stage 2. MODPOST 558 modules WARNING: .cont_prepare_write [fs/hfsplus/hfsplus.ko] undefined! WARNING: .iov_iter_advance [fs/fuse/fuse.ko] undefined! WARNING: .iov_iter_copy_from_user_atomic [fs/fuse/fuse.ko] undefined! WARNING: .iov_iter_fault_in_readable [fs/fuse/fuse.ko] undefined! make[1]: *** [__modpost] Error 1 make: *** [modules] Error 2 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Thu, 2007-04-05 at 04:08 +0200, Nick Piggin wrote: On Wed, Apr 04, 2007 at 04:32:24PM -0700, Badari Pulavarty wrote: On Wed, 2007-04-04 at 16:17 -0700, Mark Fasheh wrote: On Wed, Apr 04, 2007 at 04:05:19PM -0700, Badari Pulavarty wrote: Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE are ocf2 and gfs2. Now that both of them are switched to have write_begin()/write_end(), why do we need this code to handle AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke all the AOP_TRUNCATED_PAGE handling ? We don't - I'm pretty sure that fs-no-AOP_TRUNCATED_PAGE.patch gets rid of them. --Mark It didn't, completely get rid of them :( -readpage can still return AOP_TRUNCATED_PAGE. Were there any from prepare_write or commit_write still around? Not a big deal. But trying to understand it better. int pagecache_write_begin() { if (aops-write_begin) { return aops-write_begin(file, mapping, pos, len, flags, pagep, fsdata); } else { . ret = aops-readpage(file, page); page_cache_release(page); if (ret) { if (ret == AOP_TRUNCATED_PAGE) goto again; return ret; } goto again; } } filesystems (ocfs2, gfs2) which can return AOP_TRUNCATED_PAGE for prepare_write or readpage would never come to this case. They have write_begin() method set. Isn't it ? Why this check ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Wed, 2007-04-04 at 15:10 -0700, Mark Fasheh wrote: On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Thanks again for pushing this stuff out Nick. Fwiw, I had merge conflicts working against the latest git code from Linus - it looks like Jens pushed some splice changes (some from you). A refreshed fs-new-write-aops.patch is attached. The ext3 patch also caused me some problems, but I just temporarily dropped it for my immediate work. Could you elaborate on issues ext3 caused ? Its pretty stable for me (after bunch of simple fixes, I posted earlier). Anything I need to be aware ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Wed, 2007-04-04 at 16:17 -0700, Mark Fasheh wrote: On Wed, Apr 04, 2007 at 04:05:19PM -0700, Badari Pulavarty wrote: Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE are ocf2 and gfs2. Now that both of them are switched to have write_begin()/write_end(), why do we need this code to handle AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke all the AOP_TRUNCATED_PAGE handling ? We don't - I'm pretty sure that fs-no-AOP_TRUNCATED_PAGE.patch gets rid of them. --Mark It didn't, completely get rid of them :( Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Tue, 2007-04-03 at 01:49 +0200, Nick Piggin wrote: On Mon, Apr 02, 2007 at 04:14:59PM -0700, Badari Pulavarty wrote: On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Files/dirs are 2.6.21-rc5-new-aops* Baaah !! You took away ext3 -nobh option :( Ahh, just the person I wanted to ask! ;) How useful is it, out of curiosity? What sort of users use it, and what sort of improvements do they get? Well, at the time it helped lowmem issue on x86. Bufferheads end up using lots of lowmem and caused bunch of issues. We used it heavily on our local machines. Not sure if any customers use it (my guess is not). I wanted to completely kill bufferhead usage from ext3. writeback mode is simple and easy. Ordered mode, need attach the buffers to transaction - which needed complete rework. So never gotten around to doing it. Do you have plans to support nobh versions of block_write_begin/end ? At the moment I'm looking at doing it another way. Having the seperate nobh path is quite annoying -- there are still bugs in it and it is simply less tested (not that the bh path is bug-free either, but it is better to be able to concentrate on one). So I hope to merge them at some point to restore that functionality. BTW, I don't see how block_write_end() can ever return 0. If so, here is the cleanup fix for ext3 (no unnecessay checks). Shouldn't we allow for the possibility? Well, if there is a (remote) possibility of failure - yes we should handle it. Otherwise its dead code, unnecessary checks in hotpath and makes code ugly by confusing non-possible error cases from possible ones. Even today - ext3 code has checks to handle generic_commit_write() failures. But it never returns failure. I wanted to clean up ext3 code, but I left it for allowing for possibility. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Tue, 2007-04-03 at 01:58 +0200, Nick Piggin wrote: On Mon, Apr 02, 2007 at 01:44:59PM -0700, Badari Pulavarty wrote: On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Files/dirs are 2.6.21-rc5-new-aops* Contains numerous fixes from Mark and myself -- I'd say the core code is getting reasonably stable at this point. ext3_write_failure() conversion is NOT quite correct. Old code was returing failure of do_journal_get_write_access(), where as your changes will return journal_stop(). Good catch. This still requires a journal_stop, however? Yes. How's this look? Looks good. I am little worried about the page-lock vs journal start/stop order. I remember running into deadlock while trying to do writepages() for ordered mode. I think we are okay here. I will take a closer look. Thanks, Badari -- Index: linux-2.6/fs/ext3/inode.c === --- linux-2.6.orig/fs/ext3/inode.c +++ linux-2.6/fs/ext3/inode.c @@ -1167,11 +1167,13 @@ static int ext3_write_failure(struct fil handle_t *handle = ext3_journal_current_handle(); if (ext3_should_writeback_data(mapping-host)) { +skip_and_stop: /* optimization: no constraints about data */ + ret = ext3_journal_stop(handle); skip: unlock_page(page); page_cache_release(page); - return ext3_journal_stop(handle); + return ret; } from = pos (PAGE_CACHE_SIZE - 1); @@ -1196,8 +1198,10 @@ skip: break; if (ext3_should_journal_data(mapping-host)) { ret = do_journal_get_write_access(handle, bh); - if (ret) + if (ret) { + ext3_journal_stop(handle); goto skip; + } } /* * block_start here becomes the first block where the current iteration @@ -1205,7 +1209,7 @@ skip: */ } if (block_start = from) - goto skip; + goto skip_and_stop; /* commit allocated and zeroed buffers */ return mapping-a_ops-write_end(file, mapping, pos, len, - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Tue, 2007-04-03 at 11:31 +0200, Nick Piggin wrote: On Tue, Apr 03, 2007 at 02:08:53AM +0200, Nick Piggin wrote: BTW, I will take a shot at ext4 tomorrow. Thanks, so long as you think ext3 is looking OK? BTW. is it a known issue that ext3 fails fsx-linux? (I tried 2.6.21-rc3 IIRC, and ordered and writeback both eventually failed I think). ext2 does not. Well I just tested, and it is not fixed by the recent patch to revert ext3_prepare_failure Is this a known issue? Is it an fsx-linux shortcoming? It is fairly surprising because it basically makes it impossible to test ext3 changes with that nice tool :( I can submit the traces if anyone is interested, however I can reproduce in UML on an ext3 writeback filesystem with no arguments (except the filename). I haven't seen an fsx failure recently. I ran 4 copies of fsx on 2.6.21-rc5 and 2.6.21-rc5+aops, without any problems for 12+ hours. What am I missing ? Mingming, do you know of any fsx failures recently ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Files/dirs are 2.6.21-rc5-new-aops* Here is the ext4 support for it. This is a simple port from ext3 code. Ran fsx without any problems :) BTW, I never clearly understood what exactly the problem these new interfaces are solving and how :( I can dig through the archives and try to figure out. Would you care to put a small description of the *actual* problem and how these new aops are needed (vs hacking the existing methods). Thanks, Badari --- fs/ext4/inode.c | 171 +--- 1 file changed, 113 insertions(+), 58 deletions(-) Index: linux-2.6.21-rc5/fs/ext4/inode.c === --- linux-2.6.21-rc5.orig/fs/ext4/inode.c 2007-03-25 15:56:23.0 -0700 +++ linux-2.6.21-rc5/fs/ext4/inode.c2007-04-03 10:01:34.0 -0700 @@ -1154,23 +1154,30 @@ static int do_journal_get_write_access(h * This content is expected to be set to zeroes by block_prepare_write(). * 2006/10/14 SAW */ -static int ext4_prepare_failure(struct file *file, struct page *page, - unsigned from, unsigned to) +static int ext4_write_failure(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, + struct page *page, void *fsdata) { - struct address_space *mapping; struct buffer_head *bh, *head, *next; unsigned block_start, block_end; unsigned blocksize; + unsigned from, to; int ret; handle_t *handle = ext4_journal_current_handle(); - mapping = page-mapping; if (ext4_should_writeback_data(mapping-host)) { /* optimization: no constraints about data */ +skip_and_stop: + ret = ext4_journal_stop(handle); skip: - return ext4_journal_stop(handle); + unlock_page(page); + page_cache_release(page); + return ret; } + from = pos (PAGE_CACHE_SIZE - 1); + to = from + len; + head = page_buffers(page); blocksize = head-b_size; for ( bh = head, block_start = 0; @@ -1192,7 +1199,7 @@ skip: ret = do_journal_get_write_access(handle, bh); if (ret) { ext4_journal_stop(handle); - return ret; + goto skip; } } /* @@ -1201,43 +1208,64 @@ skip: */ } if (block_start = from) - goto skip; + goto skip_and_stop; /* commit allocated and zeroed buffers */ - return mapping-a_ops-commit_write(file, page, from, block_start); + return mapping-a_ops-write_end(file, mapping, pos, len, + block_start - from, page, fsdata); } -static int ext4_prepare_write(struct file *file, struct page *page, - unsigned from, unsigned to) +static int ext4_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - struct inode *inode = page-mapping-host; - int ret, ret2; + struct inode *inode = mapping-host; int needed_blocks = ext4_writepage_trans_blocks(inode); + int ret, ret2; handle_t *handle; int retries = 0; + struct page *page; + pgoff_t index; + unsigned start, end; + + index = pos PAGE_CACHE_SHIFT; + start = pos * (PAGE_CACHE_SIZE - 1); + end = start + len; retry: + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + *pagep = page; + handle = ext4_journal_start(inode, needed_blocks); - if (IS_ERR(handle)) + if (IS_ERR(handle)) { + unlock_page(page); + page_cache_release(page); return PTR_ERR(handle); - if (test_opt(inode-i_sb, NOBH) ext4_should_writeback_data(inode)) - ret = nobh_prepare_write(page, from, to, ext4_get_block); - else - ret = block_prepare_write(page, from, to, ext4_get_block); + } + ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + ext4_get_block); if (ret) goto failure; if (ext4_should_journal_data(inode)) { ret = walk_page_buffers(handle, page_buffers(page), - from, to, NULL, do_journal_get_write_access); - if (ret) + start, end, NULL,
Re: [ANNOUNCE] new new aops patchset
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Files/dirs are 2.6.21-rc5-new-aops* Contains numerous fixes from Mark and myself -- I'd say the core code is getting reasonably stable at this point. New stuff: patches from Mark and Steve for the cluster filesystems. (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested, but I have tried to put some effort in, so if maintainers would kindly take a look... ext3 still needs review and porting to ext4, which I hope someone will do eventually (I presume ext3 is still maintained)... does NFS have a lock_page vs lock_kernel deadlock in its commit_write? (if yes, this is fixable with the new write aops). In case of error, write_begin() is *supposed* to unlock and release the page it locked. Right ? If so, ext3 error handling is not quite right :( Here is the fix. Thanks, Badari --- fs/ext3/inode.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) Index: linux-2.6.21-rc5.aop/fs/ext3/inode.c === --- linux-2.6.21-rc5.aop.orig/fs/ext3/inode.c 2007-04-02 10:21:44.0 -0700 +++ linux-2.6.21-rc5.aop/fs/ext3/inode.c2007-04-02 13:08:41.0 -0700 @@ -1236,8 +1236,11 @@ retry: *pagep = page; handle = ext3_journal_start(inode, needed_blocks); - if (IS_ERR(handle)) + if (IS_ERR(handle)) { + unlock_page(page); + page_cache_release(page); return PTR_ERR(handle); + } ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, ext3_get_block); if (ret) @@ -1246,9 +1249,12 @@ retry: if (ext3_should_journal_data(inode)) { ret = walk_page_buffers(handle, page_buffers(page), start, end, NULL, do_journal_get_write_access); - if (ret) + if (ret) { /* fatal error, just put the handle and return */ + unlock_page(page); + page_cache_release(page); journal_stop(handle); + } } return ret; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Files/dirs are 2.6.21-rc5-new-aops* Contains numerous fixes from Mark and myself -- I'd say the core code is getting reasonably stable at this point. ext3_write_failure() conversion is NOT quite correct. Old code was returing failure of do_journal_get_write_access(), where as your changes will return journal_stop(). Thanks, Badari --- fs/ext3/inode.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) Index: linux-2.6.21-rc5.aop/fs/ext3/inode.c === --- linux-2.6.21-rc5.aop.orig/fs/ext3/inode.c 2007-04-02 13:37:16.0 -0700 +++ linux-2.6.21-rc5.aop/fs/ext3/inode.c2007-04-02 13:37:35.0 -0700 @@ -1196,8 +1196,11 @@ skip: break; if (ext3_should_journal_data(mapping-host)) { ret = do_journal_get_write_access(handle, bh); - if (ret) - goto skip; + if (ret) { + unlock_page(page); + page_cache_release(page); + return ret; + } } /* * block_start here becomes the first block where the current iteration - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Files/dirs are 2.6.21-rc5-new-aops* [EMAIL PROTECTED] linux-2.6.21-rc5]# make -j8 modules CHK include/linux/version.h CHK include/linux/utsrelease.h Building modules, stage 2. MODPOST 1281 modules WARNING: cont_write_begin [fs/fat/fat.ko] undefined! make[1]: *** [__modpost] Error 1 make: *** [modules] Error 2 Here is the fix. Thanks, Badari --- fs/buffer.c |1 + 1 file changed, 1 insertion(+) Index: linux-2.6.21-rc5/fs/buffer.c === --- linux-2.6.21-rc5.orig/fs/buffer.c 2007-04-02 13:45:56.0 -0700 +++ linux-2.6.21-rc5/fs/buffer.c2007-04-02 14:33:20.0 -0700 @@ -3128,6 +3128,7 @@ EXPORT_SYMBOL(block_sync_page); EXPORT_SYMBOL(block_truncate_page); EXPORT_SYMBOL(block_write_full_page); EXPORT_SYMBOL(cont_prepare_write); +EXPORT_SYMBOL(cont_write_begin); EXPORT_SYMBOL(end_buffer_read_sync); EXPORT_SYMBOL(end_buffer_write_sync); EXPORT_SYMBOL(file_fsync); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Sorry to send you so many silly fixes, but I though it would be easy to review individual ones. Thanks, Badari --- mm/filemap.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.21-rc5/mm/filemap.c === --- linux-2.6.21-rc5.orig/mm/filemap.c 2007-04-02 13:45:56.0 -0700 +++ linux-2.6.21-rc5/mm/filemap.c 2007-04-02 16:23:48.0 -0700 @@ -2107,7 +2107,7 @@ int pagecache_write_end(struct file *fil const struct address_space_operations *aops = mapping-a_ops; int ret; - if (aops-write_begin) { + if (aops-write_end) { ret = aops-write_end(file, mapping, pos, len, copied, page, fsdata); } else { - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Files/dirs are 2.6.21-rc5-new-aops* ext3_write_begin() is computing start incorrectly. Thanks, Badari --- fs/ext3/inode.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.21-rc5/fs/ext3/inode.c === --- linux-2.6.21-rc5.orig/fs/ext3/inode.c 2007-04-02 16:06:39.0 -0700 +++ linux-2.6.21-rc5/fs/ext3/inode.c2007-04-02 16:49:57.0 -0700 @@ -1229,7 +1229,7 @@ static int ext3_write_begin(struct file unsigned start, end; index = pos PAGE_CACHE_SHIFT; - start = pos * (PAGE_CACHE_SIZE - 1); + start = pos (PAGE_CACHE_SIZE - 1); end = start + len; retry: - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] new new aops patchset
On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote: Updated aops patchset against 2.6.21-rc5. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ Files/dirs are 2.6.21-rc5-new-aops* One more cleanup. index is unused. BTW, I will take a shot at ext4 tomorrow. (If you want me to roll all my fixes and send them to you,u, 5.) 1c link 5.orig h2 ( and send them to you, let me know). Thanks, Badari --- mm/filemap.c |2 -- 1 file changed, 2 deletions(-) Index: linux-2.6.21-rc5/mm/filemap.c === --- linux-2.6.21-rc5.orig/mm/filemap.c 2007-04-02 16:23:48.0 -0700 +++ linux-2.6.21-rc5/mm/filemap.c 2007-04-02 16:53:13.0 -0700 @@ -2383,14 +2383,12 @@ static ssize_t generic_perform_write(str do { struct page *page; - pgoff_t index; /* Pagecache index for current page */ unsigned long offset; /* Offset into pagecache page */ unsigned long bytes;/* Bytes to write to page */ size_t copied; /* Bytes copied from user */ void *fsdata; offset = (pos (PAGE_CACHE_SIZE - 1)); - index = pos PAGE_CACHE_SHIFT; bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, iov_iter_count(i)); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote: Badari Pulavarty wrote: Amit K. Arora wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); I am wondering about return values from this syscall ? Is it supposed to return the number of bytes allocated ? What about partial allocations ? If you don't have enough blocks to cover the request, you should probably just return -ENOSPC, not a partial allocation. That could be challenging, when multiple writers are working in parallel. You may not be able to return -ENOSPC, till you fail the allocation (for filesystems which alllocates a block at a time). What about if the blocks already exists ? What would be return values in those cases ? 0 on success, other normal errors oetherwise.. If asked for a range that includes already-allocated blocks, you just allocate any non-allocated blocks in the range, I think. Yes. What I was trying to figure out is, if there is a requirement that interface need to return exact number of bytes it *really* allocated (like write() or read()). I can't think of any, but just wanted to through it out.. BTW, what is the interface for finding out what is the size of the pre-allocated file ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); I am wondering about return values from this syscall ? Is it supposed to return the number of bytes allocated ? What about partial allocations ? What about if the blocks already exists ? What would be return values in those cases ? Just curious .. What does posix_fallocate() return ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH 2/4]delayed allocation for ext3
On Tue, 2005-07-26 at 15:52 -0700, Andrew Morton wrote: Mingming Cao [EMAIL PROTECTED] wrote: Here is the updated patch from Badari for delayed allocation for ext3. Delayed allocation defers block allocation from prepare-write time to page writeout time. For data=writeback only, yes? Yes. ... --- linux-2.6.12/fs/ext3/inode.c~ext3-delalloc 2005-07-14 23:15:34.866752480 -0700 +++ linux-2.6.12-ming/fs/ext3/inode.c 2005-07-14 23:15:34.889748984 -0700 @@ -1340,6 +1340,9 @@ static int ext3_prepare_write(struct fil handle_t *handle; int retries = 0; + + if (test_opt(inode-i_sb, DELAYED_ALLOC)) + return __nobh_prepare_write(page, from, to, ext3_get_block, 0); Rather than performing this test on each -prepare_write(), would it not be better to set up a new set of address_space_operations for this mode? __nobh_prepare_write() seems like a poor choice of name? You are correct. I was trying to minimize the changes to interfaces. Once we get it working, I will do it as part of cleanups. retry: handle = ext3_journal_start(inode, needed_blocks); if (IS_ERR(handle)) { @@ -1439,6 +1442,9 @@ static int ext3_writeback_commit_write(s else ret = generic_commit_write(file, page, from, to); + if (test_opt(inode-i_sb, DELAYED_ALLOC)) + return ret; + Here too, perhaps. + } + } /* * The journal_load will have done any necessary log recovery, * so we can safely mount the rest of the filesystem now. diff -puN fs/buffer.c~ext3-delalloc fs/buffer.c --- linux-2.6.12/fs/buffer.c~ext3-delalloc 2005-07-14 23:15:34.875751112 -0700 +++ linux-2.6.12-ming/fs/buffer.c 2005-07-14 23:15:34.903746856 -0700 @@ -2337,8 +2337,8 @@ static void end_buffer_read_nobh(struct * On entry, the page is fully not uptodate. * On exit the page is fully uptodate in the areas outside (from,to) */ -int nobh_prepare_write(struct page *page, unsigned from, unsigned to, - get_block_t *get_block) +int __nobh_prepare_write(struct page *page, unsigned from, unsigned to, + get_block_t *get_block, int create) Suggest you make this static and update the comment. Sure. { struct inode *inode = page-mapping-host; const unsigned blkbits = inode-i_blkbits; @@ -2370,10 +2370,8 @@ int nobh_prepare_write(struct page *page block_start PAGE_CACHE_SIZE; block_in_page++, block_start += blocksize) { unsigned block_end = block_start + blocksize; - int create; map_bh.b_state = 0; - create = 1; if (block_start = to) create = 0; ret = get_block(inode, block_in_file + block_in_page, What's going on here? Seems that we'll call get_block() with `create=0'. Is there any point in doing that? For delayed allocation we shuld be able to skip get_block() altogether here and, err, delay it. For delayed allocation, I need to delay the block allocation - but I still need to do get_block(READ) and read the data from the block - if the block already exists. +int nobh_prepare_write(struct page *page, unsigned from, unsigned + get_block_t *get_block) +{ + return __nobh_prepare_write(page, from, to, get_block, 1); +} + EXPORT_SYMBOL(nobh_prepare_write); Here you add nobh_dalloc_prepare_write() and remember to export it to modules this time ;) Will do. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: get_blocks_t semantics
Nir Tzachar wrote: On Wed, 2005-07-20 at 20:14 -0700, Badari Pulavarty wrote: Nir Tzachar wrote: hello list. can someone please explain the exact semantics the get_block_t function (which is passed to mpage_readpage(s)) should implement?? i could not find any documentation, and existing code kind of baffled me... my current understanding goes like this: if the block is present, call map_bh on the bh, with the physical block number. else, if create is set, allocate a new block for the inode, and again update the new physical block number. is this all?? thanks. For a given file offset, get_block() function is supposed to return the physical disk block# of the block. (and size of the block). If create is one, it needs to allocate the block (if it doesn't already exist). Makes sense ? from fs.h: typedef int (get_block_t)(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create); so, iblock is the block number in the file, or the offset in the file?? iblock = fileoffset in represented sectors. (for example, file offset 4K = 8) furthermore, what set_buffer_mapped (used in map_bh) does?? i could not find it's definition anywhere?? When the filesystems map the fileoffset to a disk block, it returns the block# in the bh-b_blocknr and sets the flag to indicate that the buffer is mapped (to a disk block). If get_block() returned sucess, but if the buffer mapped is not set, means thats there is a hole. about create, allocate a block means just deciding which block should be allocated, update the fs control structures, and return that block #, right?? yep. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: get_blocks_t semantics
Nir Tzachar wrote: hello list. can someone please explain the exact semantics the get_block_t function (which is passed to mpage_readpage(s)) should implement?? i could not find any documentation, and existing code kind of baffled me... my current understanding goes like this: if the block is present, call map_bh on the bh, with the physical block number. else, if create is set, allocate a new block for the inode, and again update the new physical block number. is this all?? thanks. For a given file offset, get_block() function is supposed to return the physical disk block# of the block. (and size of the block). If create is one, it needs to allocate the block (if it doesn't already exist). Makes sense ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] [RFC] [PATCH 2/4]delayed allocation for ext3
On Sun, 2005-07-17 at 19:47 -0600, Andreas Dilger wrote: On Jul 17, 2005 10:40 -0700, Mingming Cao wrote: @@ -373,6 +373,7 @@ struct ext3_inode { #define EXT3_MOUNT_BARRIER 0x2 /* Use block barriers */ #define EXT3_MOUNT_NOBH0x4 /* No bufferheads */ #define EXT3_MOUNT_QUOTA 0x8 /* Some quota option set */ + #define EXT3_MOUNT_DELAYED_ALLOC 0xC /* Delayed Allocation */ This doesn't make sense. DELAYED_ALLOC == QUOTA | NOBH? My fault. I will fix it. + {Opt_delayed_alloc, delalloc}, Is this a replacement for Alex's delalloc code? We also use delalloc for that code and if they are not interchangeable it will cause confusion about which one is in use. Well, basically delalloc concept is same - whether we use it on current ext3 layout or with new extent layout doesn't matter. + if (test_opt(sb, DELAYED_ALLOC)) { + if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) { + printk(KERN_WARNING EXT3-fs: Ignoring delall option - + its supported only with writeback mode\n); Should be ignoring delalloc option. Yep. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Lazy block allocation and block_prepare_write?
On Tue, 2005-04-19 at 13:41, Martin Jambor wrote: 1) Currently none of the generic helper routines can handle this. We need to add support to do these, but still somehow make the routines generic enough for every ones use. I'm quite happy about most of them. I can't see how we could use any generic form of writepage(s) as we write stuff in a quite different way from almost anybody else but all the others except block_prepare_write do pretty much exactly what we need (if I have not missed something). Yep. You need to have your own block_prepare_write() which doesn't allocate a block - instead it reserves a block. 2) There is no easy way to find out if we reserved a block or not in writepage() correctly. There are 2 paths to writepage(). sys_write() - prepare/commit() and later sync() writepage() mmap() - touch a page() and later -- writepage() In order to do the correct accounting, we need to mark a page to indicate if we reserved a block or not. One way to do this, to use page-private to indicate this. But then, all the generic routines will fail - since they assume that page-private represents bufferheads. So we need a better way to do this. I didn't hope for a special bit in struct page so I wanted to simply fake the page/buffer mapping somehow. Since we don't really care whether a page is mapped or reserved as long as it is at least one of these when actually writing it (we write stuff to different places from where we have read it from), the PG_mappedtodisk is fine for us as long as no other kernel code thinks that having it set means we also have buffers which point to meaningful positions on the device because we don't. Is that the case? Of course, having a PG_RESERVED flag would be a nice and clean thing to use and we would be more than happy to do so. 3) We need add hooks into filesystem specific calls from these generic routines to handle journaling mode requirements Our fs is basically one big journal so we don't need any of these. Or at least I don't see any need for it at the moment. So, what are your requirements ? I am looking for a common way to combine all the requirements and come out with a saner generic routines to handle these. I'm happy with most generic functions. we need to implement writepage(s) ourselves no matter what, the only problem is block_prepare_write and I can currently only see two options for us: 1) Implement it ourselves and use a flag in the struct page to mark it reserved. 2) Use block_prepare_write but enable the get_block function to mark an individual buffer as reserved so that it is trated as mapped (can be dirty and stuff) but no code assumes it is located somewhere on the disk (for example block_prepare_write would not call unmap_underlying_metadata). I think we'll go for the first method, but the second would make life easier for filesystems which can have pages consisting of both mapped and reserved blocks. I guess for now, you can do all this in your filesystem specific code. Too bad, every one has to write their own. Hopefully, we cleanup all these interfaces and provide a generic *enough* interfaces to do this - so everyone can use it. As you can see, lots of issues need to be worked out + I don't have bandwidth to undertake such an effort, unless Nikita and Bryan are willing to signup for this :) Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Lazy block allocation and block_prepare_write?
On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote: Badari Pulavarty [EMAIL PROTECTED] writes: [...] Yes. Its possible to do what you want to. I am currently working on adding delayed allocation support to ext3. As part of that, We As you most likely already know, Alex Thomas already implemented delayed block allocation for ext3. Yep. I reviewed Alex Thomas patches for delayed allocation. He handled all the cases in his code and did NOT use any mpage* routines to do the work. I was hoping to change the mpage infrastructure to handle these, so that every filesystem doesn't have to do their thing. In order to do the correct accounting, we need to mark a page to indicate if we reserved a block or not. One way to do this, to use page-private to indicate this. But then, all the generic I believe one can use PG_mappedtodisk bit in page-flags for this purpose. There was old Andrew Morton's patch that introduced new bit (PG_delalloc?) for this purpose. That would be good. But I don't feel like asking for a bit in page if there is a way to get around it. routines will fail - since they assume that page-private represents bufferheads. So we need a better way to do this. They are not generic then. Some file systems store things completely different from buffer head ring in page-private. Yep. Instead of changing the whole world, I was hoping to come up with few common interfaces (which doesn't assume anything about bufferheads etc..) which are useful for more than one filesystem. 3) We need add hooks into filesystem specific calls from these generic routines to handle journaling mode requirements (for ext3 and may be others). Please don't. There is no such thing as generic journalling. Traditional WAL used by ext3, phase-trees of Tux2, and wandering logs of reiser4 are so much different that there is no hope for a single API to accommodate them all. Adding such API will only force more workarounds and hacks in non-ext3 file systems. What _is_ common to all journalling file systems on the other hand, is the notion of transaction as the natural unit of caching and write-out. Currently in Linux, write-out is inode-based (-writepages()). Reiser4 already has a patch that replaces sync_sb_inodes() function with super-block operation. In reiser4 case, this operation scans the list of transactions (instead of the list of inodes) and writes some of them out, which is natural thing to do for a journalled file system. Similarly, transaction is a unit of caching: it's often necessary to scan all pages of a given transaction, all dirty pages of a given transaction, or to check whether given page belongs to a given transaction. That is, transaction plays role similar to struct address_space. But currently there is 1-to-1 relation between inodes and address_spaces, and this forces file system to implement additional data structures to duplicate functionality already present in address_space. So, what are your requirements ? I am looking for a common way to combine all the requirements and come out with a saner generic routines to handle these. I think that one reasonable way to add generic support for journalling is to split struct address_space into two objects: lower layer that represents file (say, struct vm_file), in which pages are linearly ordered, and on top of this vm_cache (representing transaction) that keeps track of pages from various vm_file's. vm_file is embedded into inode, and vm_cache has a pointer to (the analog of) struct address_space_operations. vm_cache's are created by file system back-end as necessary (can be embedded into inode for non-journalled file systems). VM scanner and balance_dirty_pages() call vm_cache operations to do write-out. Need to think some more. I guess you thought about this more than you do :) Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Lazy block allocation and block_prepare_write?
On Tue, 2005-04-19 at 08:04, Alex Tomas wrote: Badari Pulavarty (BP) writes: you can introduce one more bit to page-flags BP Agreed. I was hoping to avoid it as much as I can. well, you're gonna modify mpage api anyway ... Okay, I will give a serious look then. Last time, I tried to go near page-flags I got slapped :( This time we have a valid reason, I guess :) BP What I meant by jounalling mode is that - after the pages are submitted BP for IO, we need some way of waiting for the IOs to finish inorder to BP guarantee the ordering ? Is this not needed for anything other than BP ext3 ? 1) i'm not sure anyone else supports this Fair enough. If no one needs it - lets keep the interface simple. 2) Andrew proposed the excelent solution Well, I wasn't sure how heavy thats going to be. He was recommending that we flush all dirty pages from all inodes for each transaction commit. Isn't it ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Lazy block allocation and block_prepare_write?
Alex Tomas wrote: Nikita Danilov (ND) writes: In order to do the correct accounting, we need to mark a page to indicate if we reserved a block or not. One way to do this, to use page-private to indicate this. But then, all the generic I believe one can use PG_mappedtodisk bit in page-flags for this purpose. There was old Andrew Morton's patch that introduced new bit (PG_delalloc?) for this purpose. That would be good. But I don't feel like asking for a bit in page if there is a way to get around it. ND Clarification: PG_mappedtodisk is already here, it seems you can reuse ND this already existing bit to implement delayed allocation support. I think we need another one, because mappedtodisk != reserved. we could use mappedtodisk, but this means in -commit_write() we'd need to check that one more time (first time in -prepare_write()) Yep. We need one more to indicate the we reserved a block for this page. Other option I was thinking on how to avoid is, by reserving a block when a mapped page changes from read - write. Andrew's -mm tree has patch to give us a notification when it happens. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Lazy block allocation and block_prepare_write?
Martin Jambor wrote: Hi all, I am a member of a group that implements a filesystem that allocates disk blocks to in-memory blocks lazily, that means, the decision is made just before the data are actually sent to disk. Moreover, when cached pages are modified, the data can be (and almost certainly will be) written to a different place to from where it was read. I was wondering, whether we could use the generic function block_prepare_write at all. The function checks every buffer of the page and if it is not mapped, it calls a fs supplied function that is supposed to map the buffer, i.e. assign it a block on the device and set its mapped flag. This is where we would like to give an error if there is not enough free disk space left but we cannot give a specific device block number yet. Can we make one up, such as -1? What would that do to such dark functions as unmap_underlying_metadata or any other? Would some other part of kernel break if there was a bunch of buffers assigned to the same spot on the disk? On the other hand, if I understand buffer flags correctly, I need to be able to emulate mapping of buffers to set them dirty, or em I wrong? Thanks for any insight or thoughts, Yes. Its possible to do what you want to. I am currently working on adding delayed allocation support to ext3. As part of that, We are modifying generic helper routines to delay the allocation from prepare time to actual writeout time. (writepage). Here is the basic idea: === The idea is to reserve a block at the prepare/commit write instead of allocating the block. Do the actual allocation in writepage(). Sounds simple :) Here are the issues: 1) Currently none of the generic helper routines can handle this. We need to add support to do these, but still somehow make the routines generic enough for every ones use. 2) There is no easy way to find out if we reserved a block or not in writepage() correctly. There are 2 paths to writepage(). sys_write() - prepare/commit() and later sync() writepage() mmap() - touch a page() and later -- writepage() In order to do the correct accounting, we need to mark a page to indicate if we reserved a block or not. One way to do this, to use page-private to indicate this. But then, all the generic routines will fail - since they assume that page-private represents bufferheads. So we need a better way to do this. 3) We need add hooks into filesystem specific calls from these generic routines to handle journaling mode requirements (for ext3 and may be others). So, what are your requirements ? I am looking for a common way to combine all the requirements and come out with a saner generic routines to handle these. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Delayed alloc for ordered-mode
I think adding support to JBD to deal with bios would be a valuable generic extention. Anyway its dealing with bhs now. Thanks, Badari Suparna Bhattacharya wrote: What would be really nice is if we could do this in a way that enables reuse of generic paths even for ordered mode. One thought that comes to mind is journal commit waiting for writeback to complete on the data pages which need to be flushed to disk before meta-data can be committed, much like we do for O_SYNC. I realise that JBD is intended to work at a level of abstraction where it has no awareness of filesystems - hence the correspondence with buffer heads all through. So would the above be a complete no-no ? Regards Suparna On Fri, Mar 04, 2005 at 06:02:35PM +0300, Alex Tomas wrote: On 03 Mar 2005 17:12:14 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: One more thing, we need to keep in mind is - we need to make sure that ordered mode also improved - since all our testcode focuses on writeback mode and the default mode is ordered :( I've just cooked the patch to implement ordered mode for delayed allocation path. please take it: ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch Stephen, Andrew could you review it, please? thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On Thu, 2005-03-03 at 00:33, Suparna Bhattacharya wrote: Since the performance improvements seen so far are quite encouraging, and momentum is picking up so well, I started looking through the patches from Alex ... just a quick code walkthrough to get a hang of it and think about what kind of simplifications might be possible and what it might take for inclusion. I haven't had a chance to go too deep line by line yet, but thought I'd initiate some discussion with some first impressions and summary of what directions I hear several people converging towards to validate if I'm on the right track here. diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 47 deletions. The largest is in the extents patch (2743), mballoc is 1968, and delalloc is 1209. To use delalloc, which gives us all the performance benefits, right now we need all the 3 patches to be used in conjunction. Supporting extent map btrees as well as traditional indexing and associated options for compatibility etc is perhaps the more invasive of changes. Given that keeping ext3 stable and maintainable is a key concern (that is after all a major reason why a lot of users rely on ext3), a somewhat incremental approach is desirable. So, I'll start from the direction that has been suggested by some -- (1) delayed allocation without changing the on-disk format. And then later (2) go on to breaking format with all changes for scalability to larger files with full extents support (haven't thought enough about this yet - maybe in a separate mail) Just doing delayed allocation without multiblock allocation (with the current layout) is not really a useful thing, IMHO. We will benifit few cases, but in general - we moved the block allocation overhead from prepare write to writepages/writepage time. There is a little benifit of not doing journaling twice etc.. but I don't think it would be enough to justify the effort. Isn't it ? So, may be we should look at adding multiblock allocation + delayed allocation to current ext3 layout. Then we can evaluate the benifits of having extents etc and then break the layout ? One more thing, we need to keep in mind is - we need to make sure that ordered mode also improved - since all our testcode focuses on writeback mode and the default mode is ordered :( Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] [PATCH] Generic mpage_writepage() support
On Wed, 2005-02-16 at 11:09, Dave Kleikamp wrote: On Wed, 2005-02-16 at 10:37 -0800, Badari Pulavarty wrote: Yes. page-private is assumed for the bufferhead usage. Do you really need for handling page-private for non-bufferhead usage ? For what it's worth, I'm working on some changes to jfs that will use page-private for non-bufferhead usage for metadata, but I won't be using a generic writepage, so it's not an issue for me. Nope. it would be an issue for you, since jfs uses mpage_writepages() which uses the same code - which thinks page-private as bufferhead. mpage.c already assumes page-private implies bufferheads, so it's not completely generic. Would implementing this as nobh_write_full_page, to complement block_write_full_page, make sense? I guess, it can be done. So to really deal with this, we need to come up with generic writepage/writepages interfaces which doesn't deal with bufferheads. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] Re: [RFC] [PATCH] Generic mpage_writepage() support
On Wed, 2005-02-16 at 11:43, Dave Kleikamp wrote: On Wed, 2005-02-16 at 11:28 -0800, Badari Pulavarty wrote: On Wed, 2005-02-16 at 11:09, Dave Kleikamp wrote: On Wed, 2005-02-16 at 10:37 -0800, Badari Pulavarty wrote: Yes. page-private is assumed for the bufferhead usage. Do you really need for handling page-private for non-bufferhead usage ? For what it's worth, I'm working on some changes to jfs that will use page-private for non-bufferhead usage for metadata, but I won't be using a generic writepage, so it's not an issue for me. Nope. it would be an issue for you, since jfs uses mpage_writepages() which uses the same code - which thinks page-private as bufferhead. The patch I am working on will call mpage_writepages() for metadata, but will use my own writepage() rather than mpage_writepage(), and nothing in mpage_writepages() will use page-private. Okay, that will work. Basically, you plan to call mpage_writepages() with NULL as get_block(). Isn't it ? For normal data, page-private, if used at all, will be bufferheads. Good. mpage.c already assumes page-private implies bufferheads, so it's not completely generic. Would implementing this as nobh_write_full_page, to complement block_write_full_page, make sense? I guess, it can be done. So to really deal with this, we need to come up with generic writepage/writepages interfaces which doesn't deal with bufferheads. I'm not sure how useful that would be. Are there any users of a non-bufferhead page-private that want to call a generic writepage(s)? In other words, if a generic function is sufficient, you probably wouldn't be using page-private anyway. This goes back to original question, is there a point in creating new interfaces for writepage writepages which doesn't assuming page-private. So far, only users are ext2+nobh and JFS. But both of them will not use page-private for anything else other than bufferheads (if at all used). For both of these, the mpage_writepage() I cooked up earlier would be good enough. I guess, we will do it ONLY if someone really needs it. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bufferheads page-cache reference
On Mon, 2005-02-14 at 18:57, Andrew Morton wrote: Badari Pulavarty [EMAIL PROTECTED] wrote: Most of DB2 customers use filesystem for their database. Under the load, they complain that entire memory in the system is used by filesystem pagecache, freememory is very low and system starts swapping crazy OR see lots of memory allocation failures and OOM killer kills db2. slabinfo shows lots of bufferheads and VM folks claim that, bufferheads are holding a ref. on the pages, so they can't use them. So, I want to find the truth in the story and findout what exactly happening here and which one to blame (VM or FS or IO problems) ? BTW, all these on 2.4 kernels and I don't have a reproducible testcase :( Feb 7 05:35:17 nmcopsu41 kernel: ENOMEM in do_get_write_access, retrying. Do these machines have a large amount of highmem? If so, yes, you can oom because lots of highmem pages have buffer_heads attached and you've run out of lowmem. The 2.4 VM will go off looking for lowmem pages to reclaim and will ignore the highmem pages because there's no highmem shortage. Consequently those buffer_heads don't get freed up and we're unable to reclaim any lowmem - oom. Andrea did a patch along time ago (it'll be in suse 2.4 kernels) which, under these circumstances, strip the buffers from those highmem pages when they're encountered on the LRU. From a quick read it seems that that patch is not in current 2.4 kernels. It's harder to do that in 2.6 because we have a separate LR per zone. Our DB2 folks *claims* to have seen this problem both on ia32 and AMD64 customers. So, I am not sure if its really only highmem related. Only workaround seems to be configure DB2 to not to use more than 1.5GB on a 8GB RAM system :( I have nothing much to go on, other than looking data from a sick machine. What should I be looking at, to narrow down the problem some more ? BTW, none of these BIG customers will take a patch to figure out whats happening (since its on their production system) :( Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bufferheads page-cache reference
On Tue, 2005-02-15 at 09:54, Andrew Morton wrote: Badari Pulavarty [EMAIL PROTECTED] wrote: Yep. nobh_prepare_write() doesn't add any bufferheads. But we call block_write_full_page() even for nobh case, which does create bufferheads, attaches to the page and operates on them.. hmm, yeah, OK, we'll attach bh's in that case. It's a rare case though - Thanks. I just wanted to make sure I am not confused. when a dirty page falls off the end of the LRU. There's no particular reason why we cannot have a real mpage_writepage() which doesn't use bh's and employ that. I coulda sworn we used to have one. Okay, I will cook it up. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bufferheads page-cache reference
On Tue, 2005-02-15 at 11:07, Nikita Danilov wrote: Andrew Morton [EMAIL PROTECTED] writes: Badari Pulavarty [EMAIL PROTECTED] wrote: Yep. nobh_prepare_write() doesn't add any bufferheads. But we call block_write_full_page() even for nobh case, which does create bufferheads, attaches to the page and operates on them.. hmm, yeah, OK, we'll attach bh's in that case. It's a rare case though - when a dirty page falls off the end of the LRU. There's no particular Maybe DB2 dirties pages through mmap? Nikita. It does. Most databases dirty data using shared memory segments, but they do write to filesystem directly also. I am not looking at DB2 issue here. I am looking at bufferheads usage in general. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] Re: [RFC] ext3 writepages for writeback mode
On Sat, 2005-02-12 at 15:29, Alex Tomas wrote: Badari Pulavarty (BP) writes: UP, after: [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m21.452s user0m0.026s sys 0m6.105s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m23.317s user0m0.027s sys 0m4.206s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.190s user0m0.022s sys 0m4.026s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.024s user0m0.021s sys 0m4.103s UP, before: [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m19.138s user0m0.021s sys 0m5.980s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m23.123s user0m0.026s sys 0m4.083s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.152s user0m0.017s sys 0m3.836s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.217s user0m0.026s sys 0m3.827s BP Hmm.. On UP, the patch made sys.time little worse ? g. sorry, that's a typo. exchange 'after' and 'before' please. Thank you. I was worried for a minute. btw, want to see how delayed allocation manages this? Sure. I think it will improve the allocation case. Non-allocation case, should be pretty much same, provided I got contiguous layout on the disk. Isn't it ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bufferheads page-cache reference
On Mon, 2005-02-14 at 14:50, William Lee Irwin III wrote: On Mon, Feb 14, 2005 at 01:40:58PM -0800, Andrew Morton wrote: Seems about right. There's also the buffer_heads_over_limit logic in mm/vmscan.c and fs/buffer.c. That logic has a hole in that it requires that there be a highmem shortage before we start to reclaim the lowmem buffer_heads, but it is somewhat helpful. William Lee Irwin III [EMAIL PROTECTED] wrote: It would be beneficial to close the hole in that logic. On Mon, Feb 14, 2005 at 02:31:42PM -0800, Andrew Morton wrote: Really? Who's hurting? Apart from the fact that buffer_head proliferation has been a perennial problem, there's little to go on. By and large 2.6.x production usage is not yet very significant where I can see it, where the production usage is largely more stressful on account of very long durations and the variety of unusual situations to which the kernel is subjected. Now that we are on the subject of bufferheads and filesystem pagecache, the reason I am looking thro the code closely is .. Most of DB2 customers use filesystem for their database. Under the load, they complain that entire memory in the system is used by filesystem pagecache, freememory is very low and system starts swapping crazy OR see lots of memory allocation failures and OOM killer kills db2. slabinfo shows lots of bufferheads and VM folks claim that, bufferheads are holding a ref. on the pages, so they can't use them. So, I want to find the truth in the story and findout what exactly happening here and which one to blame (VM or FS or IO problems) ? BTW, all these on 2.4 kernels and I don't have a reproducible testcase :( Feb 7 05:35:17 nmcopsu41 kernel: ENOMEM in do_get_write_access, retrying. Feb 7 05:35:18 nmcopsu41 kernel: Out of Memory: Killed process 18517 (db2sysc). Feb 7 05:35:25 nmcopsu41 kernel: Out of Memory: Killed process 18660 (db2sysc). Feb 7 05:35:29 nmcopsu41 kernel: Out of Memory: Killed process 18873 (db2sysc). total used free shared buffers cached Mem: 16304560 16284152 20408 0 228428 15093736 -/+ buffers/cache: 961988 15342572 Swap: 35655616 24448 35631168 Total: 51960176 16308600 35651576 Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bufferheads page-cache reference
On Mon, 2005-02-14 at 13:40, Andrew Morton wrote: Badari Pulavarty [EMAIL PROTECTED] wrote: I see that as part of bufferheads to page association, we get a ref. on the page. create_empty_buffers() - attach_page_buffers() - page_cache_get() I also see that this reference get dropped by .. shrink_list() - try_to_release_page() - try_to_free_buffers() - drop_buffers() - __clear_page_buffers()- page_cache_release(); So, it looks like we drop the reference on the page and disassociate bufferheads from the page when VM wants to re-use the page. Only other path, I see this can happen is through invalidate_mapping_pages(). Is this true ? If I do fsync(), we flush the data - still leave the page bufferhead association. If I see lots of bufferheads even after fsync() is normal. Correct ? Seems about right. There's also the buffer_heads_over_limit logic in mm/vmscan.c and fs/buffer.c. That logic has a hole in that it requires that there be a highmem shortage before we start to reclaim the lowmem buffer_heads, but it is somewhat helpful. Is there anything wrong, if we tear down bufferheads after the writepage/writepages is complete ? may be -nobh option for ext3 ? Even for ext2 with -nobh and JFS - we seem to attach buffer heads to page in __block_write_full_page() and leave them around. I was thinking, they gets tossed out after the write-out completes. No ? These bufferheads are driving me crazy :) Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] Re: [RFC] ext3 writepages for writeback mode
Alex Tomas wrote: Badari Pulavarty (BP) writes: BP Test: writes 10,000 blocks of 64k and does fdatasync(). The patch doesn't apply ;) after a minor correction I've tested the patch too: SMP, before: [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m17.748s user0m0.032s sys 0m6.031s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m16.016s user0m0.015s sys 0m3.819s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m19.074s user0m0.014s sys 0m3.761s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m22.011s user0m0.010s sys 0m3.773s SMP, after: [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m20.676s user0m0.023s sys 0m5.643s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m13.188s user0m0.016s sys 0m3.511s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m23.725s user0m0.017s sys 0m3.399s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m23.684s user0m0.013s sys 0m3.421s UP, after: [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m21.452s user0m0.026s sys 0m6.105s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m23.317s user0m0.027s sys 0m4.206s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.190s user0m0.022s sys 0m4.026s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.024s user0m0.021s sys 0m4.103s UP, before: [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m19.138s user0m0.021s sys 0m5.980s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m23.123s user0m0.026s sys 0m4.083s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.152s user0m0.017s sys 0m3.836s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.217s user0m0.026s sys 0m3.827s AFAICS, the patch makes sense, but in my setup it doesn't reduce sys.time so dramatically. all my boxes are P3-based. thanks, Alex Thank you for testing and confirming the results. Is this on a simple single disk configuration ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] Re: [RFC] ext3 writepages for writeback mode
Alex Tomas wrote: Badari Pulavarty (BP) writes: BP Test: writes 10,000 blocks of 64k and does fdatasync(). The patch doesn't apply ;) after a minor correction I've tested the patch too: ... UP, after: [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m21.452s user0m0.026s sys 0m6.105s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m23.317s user0m0.027s sys 0m4.206s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.190s user0m0.022s sys 0m4.026s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.024s user0m0.021s sys 0m4.103s UP, before: [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m19.138s user0m0.021s sys 0m5.980s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m23.123s user0m0.026s sys 0m4.083s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.152s user0m0.017s sys 0m3.836s [EMAIL PROTECTED] root]# time /work/tests/fwrite /test/fff 64 1 real0m25.217s user0m0.026s sys 0m3.827s Hmm.. On UP, the patch made sys.time little worse ? - Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] ext3 writepages for writeback mode
On Fri, 2005-02-11 at 15:58, Andrew Morton wrote: Badari Pulavarty [EMAIL PROTECTED] wrote: Due to lack of interesting suggestions to solve mpage_writepages() - ext3_writeback_writepage() problem, I fixed it in the dumbest possible way. I've actually forgotten what the problem was. It was 100 patches ago :( The problem was ext3_writeback_writepages() - mpage_writepages() could call back ext3_writeback_writepage() in the confused case. ext3_writeback_writepage() could end up doing nothing, since the we already have a journal handle. I added a writepage_helper to handle this case. Let me know, what you think. If it works, let's get some benchmark numbers so we can decide whether it justifies more development? Yep. I will get some numbers to see .. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] ext3 writepages for writeback mode
On Fri, 2005-02-11 at 15:58, Andrew Morton wrote: Badari Pulavarty [EMAIL PROTECTED] wrote: Due to lack of interesting suggestions to solve mpage_writepages() - ext3_writeback_writepage() problem, I fixed it in the dumbest possible way. I've actually forgotten what the problem was. It was 100 patches ago :( Let me know, what you think. If it works, let's get some benchmark numbers so we can decide whether it justifies more development? Okay, here is a quick data I collected. More to follow.. Test: writes 10,000 blocks of 64k and does fdatasync(). Thanks, Badari BEFORE: (without writepages support) elm3b29:/mnt # touch file elm3b29:/mnt # time /tmp/writer file real0m23.746s user0m0.000s sys 0m5.020s (allocation) elm3b29:/mnt # time /tmp/writer file real0m20.950s user0m0.001s sys 0m2.278s (no alloc) elm3b29:/mnt # time /tmp/writer file real0m21.030s user0m0.001s sys 0m2.254s (no alloc) elm3b29:/mnt # time /tmp/writer file real0m20.577s user0m0.001s sys 0m2.184s (no alloc) AFTER: (with writepages support) elm3b29:/mnt # touch file elm3b29:/mnt # time /tmp/writer file real0m23.230s user0m0.001s sys 0m4.132s (allocation) elm3b29:/mnt # time /tmp/writer file real0m20.175s user0m0.004s sys 0m1.756s (no alloc) elm3b29:/mnt # time /tmp/writer file real0m20.368s user0m0.001s sys 0m1.696s (no alloc) elm3b29:/mnt # time /tmp/writer file real0m20.626s user0m0.002s sys 0m1.763s (no alloc) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 writepages ?
On Wed, 2005-02-09 at 18:05, Bryan Henderson wrote: I see much larger IO chunks and better throughput. So, I guess its worth doing it I hate to see something like this go ahead based on empirical results without theory. It might make things worse somewhere else. Do you have an explanation for why the IO chunks are larger? Is the I/O scheduler not building large I/Os out of small requests? Is the queue running dry while the device is actually busy? Bryan, I would like to find out what theory you are looking for. Don't you think, filesystems submitting biggest chunks of IO possible is better than submitting 1k-4k chunks and hoping that IO schedulers do the perfect job ? BTW, writepages() is being used for other filesystems like JFS. We all learnt thro 2.4 RAW code about the overhead of doing 512bytes IO and making the elevator merge all the peices together. Thats one reason why 2.6 DIO/RAW code is completely written from scratch to submit the biggest possible IO chunks. Well, I agree that we should have theory behind the results. We are just playing with prototypes for now. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 writepages ?
On Thu, 2005-02-10 at 10:00, Bryan Henderson wrote: Don't you think, filesystems submitting biggest chunks of IO possible is better than submitting 1k-4k chunks and hoping that IO schedulers do the perfect job ? No, I don't see why it would better. In fact intuitively, I think the I/O scheduler, being closer to the device, should do a better job of deciding in what packages I/O should go to the device. After all, there exist block devices that don't process big chunks faster than small ones. But So this starts to look like something where you withhold data from the I/O scheduler in order to prevent it from scheduling the I/O wrongly because you (the pager/filesystem driver) know better. That shouldn't be the architecture. So I'd like still like to see a theory that explains why submitting the I/O a little at a time (i.e. including the bio_submit() in the loop that assembles the I/O) causes the device to be idle more. We all learnt thro 2.4 RAW code about the overhead of doing 512bytes IO and making the elevator merge all the peices together. That was CPU time, right? In the present case, the numbers say it takes the same amount of CPU time to assemble the I/O above the I/O scheduler as inside it. One clear distinction between submitting smaller chunks vs larger ones is - number of call backs we get and the processing we need to do. I don't think we have enough numbers here to get to bottom of this. CPU utilization remains same in both cases, doesn't mean that - the test took exactly same amount of time. I don't even think that we are doing a fixed number of IOs. Its possible that by doing larger IOs we save CPU and use that CPU to push more data ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ext2-devel] Re: journal start/stop in ext3_writeback_writepage()
On Thu, 2005-02-10 at 15:12, Stephen C. Tweedie wrote: Hi, On Thu, 2005-02-10 at 20:21, Andrew Morton wrote: But I still don't understand why this can't happen thro original code .. what am i missing ? presumably there are never any dirty pages or inodes when we run journal_destroy(). I assume so, yes. If there is no a_ops-writepages(), then we default to generic_writepages() which is a noop if there are no dirty pages. If your new ext3-specific writepages code tries to do a journal_start() in that case, then yes, it is likely to blow up spectacularly during journal_destroy! --Stephen Yep. I found this hardway that exactly whats happening. generic_writepages() is clever enough to do nothing, if there are no dirty pages. But I am being stupid in my writepages(). I need to teach writepages() to nothing in case of no dirty pages. Is there a easy way like checking a count somewhere than doing all the stuff mpage_writepages() is doing to figure this out, like .. while (!done (index = end) (nr_pages = pagevec_lookup_tag(pvec, mapping, index, PAGECACHE_TAG_DIRTY, min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) ... Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] ext3 writepages for writeback mode
Hi, Here is my first cut at adding writepages() support for ext3 writeback mode. I have not done any performance analysis on the patch, so try it at your own risk. Please let me know, if I am completely off or its a stupid idea. Thanks, Badari --- linux-2.6.10.org/fs/ext3/inode.c2004-12-06 11:45:49.0 -0800 +++ linux-2.6.10/fs/ext3/inode.c2005-02-10 18:14:17.987263744 -0800 @@ -856,6 +856,12 @@ return ret; } +static int ext3_writepages_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh, int create) +{ + return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create); +} + /* * `handle' can be NULL if create is zero */ @@ -1321,6 +1327,37 @@ return ret; } +static int +ext3_writeback_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *inode = mapping-host; + handle_t *handle = NULL; + int err, ret = 0; + + if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) + return ret; + + handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + return ret; + } + +ret = mpage_writepages(mapping, wbc, ext3_writepages_get_block); + + /* +* Need to reaquire the handle since ext3_writepages_get_block() +* can restart the handle +*/ + handle = journal_current_handle(); + + err = ext3_journal_stop(handle); + if (!ret) + ret = err; + return ret; +} + static int ext3_writeback_writepage(struct page *page, struct writeback_control *wbc) { @@ -1552,6 +1589,7 @@ .readpage = ext3_readpage, .readpages = ext3_readpages, .writepage = ext3_writeback_writepage, + .writepages = ext3_writeback_writepages, .sync_page = block_sync_page, .prepare_write = ext3_prepare_write, .commit_write = ext3_writeback_commit_write,
Re: [RFC] ext3 writepages for writeback mode
Andrew Morton wrote: Badari Pulavarty [EMAIL PROTECTED] wrote: Here is my first cut at adding writepages() support for ext3 writeback mode. Looks sane from a brief scan. Well, not really.. mpage_writepages() could end up calling ext3_writeback_writepage() in confused case thro .. *ret = page-mapping-a_ops-writepage(page, wbc); which ends up doing nothing and leaves the page dirty, since there is journal handle started :( if (ext3_journal_current_handle()) goto out_fail; Ideas ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: journal start/stop in ext3_writeback_writepage()
On Wed, 2005-02-09 at 08:37, Stephen C. Tweedie wrote: Hi, On Wed, 2005-02-09 at 16:18, Badari Pulavarty wrote: I am trying to understand journaling code in ext3. Can some one enlighten me, why we need journal start and stop in ext3_writeback_writepage() ? The block allocation is already made in prepare_write(). prepare_write()/commit_write() are used for write(2) writes: the data is dirtied, but not immediately queued for IO (unless you're using O_SYNC). writepage is used when you want to write the page's data to disk *immediately* --- it's used when the VM is swapping out an mmaped file, or for msync(). So when writepage comes in, there's no guarantee that we've had a previous prepare. You can, for example, use ftruncate() to create a large hole in a file, and then mmap() it; if you then dirty a page, then the allocation occurs in the writepage(). So a transaction handle is necessary. --Stephen Thank you. It make sense now. I was under the impression that writepage() could also be used to flush the data even for write(2) writes. Is it not true ? I was trying to add writepages() interface for ext3. I am wondering if I need to do journaling for that case too. Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
mpage writepage question
Hi Andrew, I was wondering why mpage_writepage() is only static ? Is the expectation that, filesystems use .writepage == block_full_write_page .writepages == mpage_writepages ? I am little confused on why we have 2 different ways to do things ? block_full_write_page() seems to be creating buffer heads, where as mpage_writepages() can do directly bios. Shouldn't they be using mpage_writepage() instead of block_full_write_page() ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 writepages ?
On Thu, 2005-02-03 at 09:00, Sonny Rao wrote: Well it seems to work, here's my (rather ugly) patch. I'm doing some performance comparisons now. Sonny Interesting.. Why did you create a nobh_prepare_write() ? mpage_writepages() can handle pages with buffer heads attached. And also, are you sure you don't need to journal start/stop in writepages() ? Thanks, Badari - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html