Re: How Inactive may be much greater than cached?
Hi,

On Thursday 18 October 2007 16:24, Vasily Averin wrote:
> Hi all,
>
> could anybody explain how inactive may be much greater than cached?
> A stress test (http://weather.ou.edu/~apw/projects/stress/) that writes
> into removed files in a cycle puts the node into the following state:
>
> MemTotal:     16401648 kB
> MemFree:        636644 kB
> Buffers:       1122556 kB
> Cached:         362880 kB
> SwapCached:        700 kB
> Active:        1604180 kB
> Inactive:     13609828 kB
>
> At first glance, memory should be freed when the file is closed: nobody
> refers to the file, and ext3_delete_inode() truncates the inode. We can
> see that the memory goes away from cached; however, could somebody
> explain why it becomes inactive instead of being freed? Who holds the
> references to these pages?

Buffers, swap cache, and anonymous.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How Inactive may be much greater than cached?
On Thursday 18 October 2007 17:14, Vasily Averin wrote:
> Nick Piggin wrote:
> > Hi,
> >
> > On Thursday 18 October 2007 16:24, Vasily Averin wrote:
> > > Hi all,
> > >
> > > could anybody explain how inactive may be much greater than cached?
> > > A stress test (http://weather.ou.edu/~apw/projects/stress/) that
> > > writes into removed files in a cycle puts the node into the
> > > following state:
> > >
> > > MemTotal:     16401648 kB
> > > MemFree:        636644 kB
> > > Buffers:       1122556 kB
> > > Cached:         362880 kB
> > > SwapCached:        700 kB
> > > Active:        1604180 kB
> > > Inactive:     13609828 kB
> > >
> > > At first glance, memory should be freed when the file is closed:
> > > nobody refers to the file, and ext3_delete_inode() truncates the
> > > inode. We can see that the memory goes away from cached; however,
> > > could somebody explain why it becomes inactive instead of being
> > > freed? Who holds the references to these pages?
> >
> > Buffers, swap cache, and anonymous.
>
> But buffers and swap cache are low (1.1 GB and 700 kB in this example),
> and anonymous should go away when the process finishes.

Ah, I didn't see it was an order of magnitude out.

Some filesystems, including, I believe, ext3 with data=ordered, can leave
orphaned pages around after they have been truncated out of the pagecache.
These pages get left on the LRU, and vmscan reclaims them pretty easily.

Try ext3 data=writeback, or even ext2.
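The size of the gap discussed above can be checked mechanically. What follows is a minimal user-space sketch, not something posted in the thread: the helper names `meminfo_field_kb` and `inactive_unaccounted_kb` are hypothetical, and the sample text only repeats the figures quoted in the report. It subtracts Buffers, Cached and SwapCached from Inactive to estimate how much of the inactive list those counters cannot explain. Since some of those pages may sit on the active list instead, the result is best read as a lower bound.

```c
#include <stdio.h>
#include <string.h>

/* Figures quoted in the thread, laid out like /proc/meminfo. */
static const char SAMPLE_MEMINFO[] =
    "MemTotal:     16401648 kB\n"
    "MemFree:        636644 kB\n"
    "Buffers:       1122556 kB\n"
    "Cached:         362880 kB\n"
    "SwapCached:        700 kB\n"
    "Active:        1604180 kB\n"
    "Inactive:     13609828 kB\n";

/* Return the value in kB of the "Key: value kB" line, or -1 if absent. */
long meminfo_field_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match the key only at the start of a line and require the
         * colon, so "Cached" does not match inside "SwapCached". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long val;

            if (sscanf(line + klen + 1, "%ld", &val) == 1)
                return val;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Inactive memory not explained by Buffers + Cached + SwapCached. */
long inactive_unaccounted_kb(const char *text)
{
    long inactive = meminfo_field_kb(text, "Inactive");
    long buffers = meminfo_field_kb(text, "Buffers");
    long cached = meminfo_field_kb(text, "Cached");
    long swapcached = meminfo_field_kb(text, "SwapCached");

    if (inactive < 0 || buffers < 0 || cached < 0 || swapcached < 0)
        return -1;
    return inactive - (buffers + cached + swapcached);
}
```

Calling `inactive_unaccounted_kb(SAMPLE_MEMINFO)` on the reported figures gives 12123692 kB, which is the "order of magnitude" discrepancy that the orphaned-page explanation is meant to account for.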
Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
On Thu, Jun 14, 2007 at 11:52:49AM +0200, Jan Kara wrote:
> On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
> > ..
> > > 5) ext3_write_end:
> > > Before the write_begin/write_end patch set we had the following
> > > locking order:
> > >         stop_journal(handle);
> > >         unlock_page(page);
> > > But now the order is the opposite:
> > >         unlock_page(page);
> > >         stop_journal(handle);
> > > Can we get any race condition now? I'm not sure whether it is an
> > > actual problem; maybe somebody can describe this.
> >
> > Can we just change it to the original order? That would seem to be
> > safest unless one of the ext3 devs explicitly acks it.
>
> Sorry, I've missed the beginning of this thread. But what problems
> exactly can this ordering change cause? ext3_journal_stop has no need to
> be protected by the page lock - it can be even better that it's not
> protected, as it can trigger a commit, and all that would happen
> unnecessarily under the page lock...

Sure, if you think it is safe. I would rather it be done in a different
patch though.
Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
On Wed, Jun 13, 2007 at 04:07:01PM -0700, Badari Pulavarty wrote:
> On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
> > ..
> > > 5) ext3_write_end:
> > > Before the write_begin/write_end patch set we had the following
> > > locking order:
> > >         stop_journal(handle);
> > >         unlock_page(page);
> > > But now the order is the opposite:
> > >         unlock_page(page);
> > >         stop_journal(handle);
> > > Can we get any race condition now? I'm not sure whether it is an
> > > actual problem; maybe somebody can describe this.
> >
> > Can we just change it to the original order? That would seem to be
> > safest unless one of the ext3 devs explicitly acks it.
>
> It would be nice to go back to the original order, but it's not that
> simple with the current structure of the code. With Nick's patches
> unlock_page() happens in generic_write_end(). journal_stop() needs to
> happen after generic_write_end(). :(

Well we could use block_write_end?

> Mingming, can you take a look at the current proposed order? I ran into
> a bunch of races when I tried to change the order for ->writepages()
> support earlier :(

OK, it sounds like we probably want to revert to the original order at
least for this patchset. If the new order is proven safe then that could
be introduced later to simplify things...

Thanks,
Nick
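The two orderings under discussion can be laid out side by side. This is a user-space mock under stated assumptions, not kernel code: every `mock_*` function and the `trace` buffer are hypothetical stand-ins that only record the sequence of calls. It relies on two points made in the thread: generic_write_end() unlocks the page itself, and calling block_write_end() directly leaves the unlock to the caller, which is what would allow the journal handle to be stopped first.

```c
#include <string.h>

/* Mock event log; stands in for real page-lock and journal state. */
static char trace[128];

static void log_event(const char *ev)
{
    if (trace[0] != '\0')
        strncat(trace, ",", sizeof(trace) - strlen(trace) - 1);
    strncat(trace, ev, sizeof(trace) - strlen(trace) - 1);
}

/* Hypothetical stand-ins: they only log, they do no real work. */
static void mock_block_write_end(void) { log_event("commit_buffers"); }
static void mock_unlock_page(void)     { log_event("unlock_page"); }
static void mock_journal_stop(void)    { log_event("journal_stop"); }

/* The generic helper commits the buffers and unlocks the page itself. */
static void mock_generic_write_end(void)
{
    mock_block_write_end();
    mock_unlock_page();
}

/* Order in the new-aops patch set: page unlocked before journal stop. */
const char *write_end_new_order(void)
{
    trace[0] = '\0';
    mock_generic_write_end();
    mock_journal_stop();
    return trace;
}

/* Original order, reachable by using the block-level helper directly. */
const char *write_end_original_order(void)
{
    trace[0] = '\0';
    mock_block_write_end();
    mock_journal_stop();
    mock_unlock_page();
    return trace;
}
```

The mock says nothing about which order is actually safe; it only shows why switching helpers is enough to restore the earlier sequence without restructuring the generic code.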
Re: [BUG] fs/buffer.c:1821 in 2.6.22-rc4-mm2
Andrew Morton wrote:
> On Sun, 10 Jun 2007 17:57:14 +0200 Eric Sesterhenn / Snakebyte [EMAIL PROTECTED] wrote:
> > hi,
> >
> > i got the following BUG while running the syscalls.sh
> > from ltp-full-20070531 on an ext3 partition, it is easily
> > reproducible for me
> >
> > [ 476.338068] [ cut here ]
> > [ 476.338223] kernel BUG at fs/buffer.c:1821!
> > [ 476.338324] invalid opcode: [#1]
> > [ 476.338423] PREEMPT
> > [ 476.338665] Modules linked in:
> > [ 476.338833] CPU:0
> > [ 476.338836] EIP:0060:[c01a1914]Not tainted VLI
> > [ 476.338840] EFLAGS: 00010202 (2.6.22-rc4-mm2 #1)
> > [ 476.339206] EIP is at __block_prepare_write+0x64/0x410
> > [ 476.339311] eax: 0001 ebx: c136fbb8 ecx: c07faf28 edx: 0001
> > [ 476.339417] esi: c1dc9040 edi: c32d2dfc ebp: c3733db8 esp: c3733d50
> > [ 476.339584] ds: 007b es: 007b fs: gs: 0033 ss: 0068
> > [ 476.339690] Process vmsplice01 (pid: 7680, ti=c3733000 task=c351ed60 task.ti=c3733000)
> > [ 476.339796] Stack: c3733d70 c0143e76 c1a0eab0 0046 c2509d64 0cd8 c136fbb8
> > [ 476.340675] c32d2dfc 0296 c02313b6 c1086088 0050 c02313b6 c1dc9040 c2509d50
> > [ 476.341491] c1dc9054 c3733dc4 c02313e9 c3733dbc c015728d c32d2f0c c136fbb8
> > [ 476.342371] Call Trace:
> > [ 476.342565] [c01a1d83] block_write_begin+0x83/0xf0
> > [ 476.342804] [c0207778] ext3_write_begin+0xc8/0x1c0
> > [ 476.342987] [c01595bf] pagecache_write_begin+0x4f/0x150
> > [ 476.343243] [c019db3b] pipe_to_file+0x9b/0x170
> > [ 476.343418] [c019d4b0] __splice_from_pipe+0x70/0x260
> > [ 476.343654] [c019d6e8] splice_from_pipe+0x48/0x70
> > [ 476.343828] [c019d9f8] generic_file_splice_write+0x88/0x130
> > [ 476.344066] [c019d267] do_splice_from+0xb7/0xc0
> > [ 476.344240] [c019ea51] sys_splice+0x1a1/0x230
> > [ 476.344474] [c01043be] sysenter_past_esp+0x5f/0x99
> > [ 476.344656] [e410] 0xe410
> > [ 476.344882] ===
> > [ 476.344984] INFO: lockdep is turned off.
> > [ 476.345084] Code: 00 0f 97 c2 e8 ee 2f 22 00 85 c0 74 04 0f 0b eb fe 31 d2 b8 28 af 7f c0 81 7d 08 00 10 00 00 0f 97 c2 e8 d0 2f 22 00 85 c0 74 04 0f 0b eb fe 8b 55 08 39 55 b0 0f 97 c0 0f b6 d0 b8 0c af 7f c0
> > [ 476.350365] EIP: [c01a1914] __block_prepare_write+0x64/0x410 SS:ESP 0068:c3733d50
>
> Yep, vmsplice01 is not supported on -mm kernels ;)
>
> Nick has a protofix but I don't think it's been tested yet.

Yeah, sorry I didn't catch that after you merged :P

This should be the correct bugfix attached -- it is just a typo.

--
SUSE Labs, Novell Inc.

Index: linux-2.6/fs/splice.c
===
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -570,7 +570,7 @@ static int pipe_to_file(struct pipe_inod
 	if (this_len + offset > PAGE_CACHE_SIZE)
 		this_len = PAGE_CACHE_SIZE - offset;
-	ret = pagecache_write_begin(file, mapping, sd->pos, sd->len,
+	ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
 				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
 	if (unlikely(ret))
 		goto out;
@@ -583,11 +583,12 @@ static int pipe_to_file(struct pipe_inod
 		char *dst = kmap_atomic(page, KM_USER1);
 		memcpy(dst + offset, src + buf->offset, this_len);
+		flush_dcache_page(page);
 		kunmap_atomic(dst, KM_USER1);
 		buf->ops->unmap(pipe, buf, src);
 	}
-	ret = pagecache_write_end(file, mapping, sd->pos, sd->len, sd->len, page, fsdata);
+	ret = pagecache_write_end(file, mapping, sd->pos, this_len, this_len, page, fsdata);
 out:
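Judging from the patch, the typo was that the whole remaining length (sd->len) was handed to the page-cache helpers instead of the per-page length (this_len), so a write crossing a page boundary asked for more than one page at once. The clamping step itself can be sketched in user space as follows. The function name `chunk_len_in_page` and the 4096-byte `SKETCH_PAGE_SIZE` are assumptions made for illustration, not kernel definitions.

```c
/* Assumed page size for the sketch; the kernel uses PAGE_CACHE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Return how many of the len bytes starting at file position pos fit
 * into the page that contains pos.
 */
unsigned chunk_len_in_page(unsigned long long pos, unsigned len)
{
    /* Offset of pos within its page; always below SKETCH_PAGE_SIZE. */
    unsigned offset = (unsigned)(pos & (SKETCH_PAGE_SIZE - 1));
    unsigned this_len = len;

    /* Written as a subtraction so that len + offset cannot overflow. */
    if (this_len > SKETCH_PAGE_SIZE - offset)
        this_len = SKETCH_PAGE_SIZE - offset;
    return this_len;
}
```

For example, a 200-byte write at position 4000 is clamped to the 96 bytes left in the first page. Passing the clamped value to both the begin and the end call is what the two changed lines in the patch do.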
Re: [patch 17/41] ext2 convert to new aops.
On Mon, May 14, 2007 at 04:06:36PM +1000, [EMAIL PROTECTED] wrote:
> Cc: linux-ext4@vger.kernel.org
> Cc: Linux Filesystems [EMAIL PROTECTED]
> Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Found a problem in ext2 pagecache directory handling. Trivial fix follows.
Longer-term, it might be better to rework these things a bit so they can
directly use the pagecache_write_begin/pagecache_write_end accessors.

---
Index: linux-2.6/fs/ext2/dir.c
===
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -70,10 +70,18 @@ static int ext2_commit_chunk(struct page
 	dir->i_version++;
 	block_write_end(NULL, mapping, pos, len, len, page, NULL);
+
+	if (pos+len > dir->i_size) {
+		i_size_write(dir, pos+len);
+		mark_inode_dirty(dir);
+	}
+
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
 		unlock_page(page);
+	mark_page_accessed(page);
+
 	return err;
 }
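The size update added by this fix can be modelled in a few lines. This is a hedged user-space sketch: `struct mock_dir` and `mock_commit_chunk` are hypothetical names, and the reading that the caller has to extend the size itself after block_write_end() is inferred from the patch rather than stated in the message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the directory inode fields involved. */
struct mock_dir {
    long long i_size;   /* current directory size in bytes */
    bool dirty;         /* set when the inode would need writing back */
};

/*
 * Model of the added check: if the chunk written at [pos, pos + len)
 * ends beyond the current size, grow the size and mark the inode dirty.
 */
void mock_commit_chunk(struct mock_dir *dir, long long pos, unsigned len)
{
    long long end = pos + (long long)len;

    if (end > dir->i_size) {
        dir->i_size = end;
        dir->dirty = true;
    }
}
```

A chunk written entirely inside the existing size leaves both fields untouched, which matches the conditional form of the patch.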
Re: [PATCH] resolve duplicate flag no for PG_lazyfree
Andrew Morton wrote:
> On Mon, 14 May 2007 14:06:19 -0400 Theodore Tso [EMAIL PROTECTED] wrote:
> > On Sun, May 13, 2007 at 10:46:30PM -0700, Andrew Morton wrote:
> > > otoh, the intersection between pages which are PageBooked() and
> > > pages which are PageLazyFree() should be zero, so it'd be good to
> > > actually formalise this reuse within the ext4 patches.
> > >
> > > otoh2, PageLazyFree() could have reused PG_owner_priv_1.
> > >
> > > Rik, Ted: any thoughts? We do need to scrimp on page flags: when we
> > > finally run out, we're screwed.
> >
> > It makes sense to me. PG_lazyfree is currently only in -mm, right?
>
> Ah, yes, I got confused, sorry.
>
> > I don't see it in my git tree. It would probably be a good idea to
> > add some sanity checking code, if it isn't there already, that
> > PG_lazyfree isn't already set when we try to set PG_lazyfree (just in
> > case there is a bug in the future which causes the should-never-happen
> > case of trying to lazy free a PageBooked page).
>
> Actually, I think the current status of
> lazy-freeing-of-memory-through-madv_free.patch is that it might not be
> needed. I _think_ we've determined that
> 0a27a14a62921b438bb6f33772690d345a089be6 sufficiently fixed the
> performance problems we had in there?

I think so far we've found that it fixes the MySQL scalability problem,
yes.

I couldn't see any statistically significant difference with MySQL in my
tests with MADV_FREE (versus MADV_DONTNEED). ebizzy is improved a bit at
low concurrency but drops off slightly at higher concurrency.

But basically, I don't think we've found a good reason to use a page flag
and introduce the potential performance regressions that the MADV_FREE
patch has.

--
SUSE Labs, Novell Inc.
[patch 19/44] ext4 convert to new aops
Cc: linux-ext4@vger.kernel.org
Cc: Linux Filesystems [EMAIL PROTECTED]

Convert ext4 to use write_begin()/write_end() methods.

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext4/inode.c | 147 +++-
 1 file changed, 93 insertions(+), 54 deletions(-)

Index: linux-2.6/fs/ext4/inode.c
===
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1146,34 +1146,50 @@ static int do_journal_get_write_access(h
 	return ext4_journal_get_write_access(handle, bh);
 }
-static int ext4_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
 	handle_t *handle;
 	int retries = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned from, to;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
 retry:
-	handle = ext4_journal_start(inode, needed_blocks);
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto out;
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
+	handle = ext4_journal_start(inode, needed_blocks);
+	if (IS_ERR(handle)) {
+		unlock_page(page);
+		page_cache_release(page);
+		ret = PTR_ERR(handle);
+		goto out;
 	}
-	if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-		ret = nobh_prepare_write(page, from, to, ext4_get_block);
-	else
-		ret = block_prepare_write(page, from, to, ext4_get_block);
-	if (ret)
-		goto prepare_write_failed;
-	if (ext4_should_journal_data(inode)) {
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				ext4_get_block);
+
+	if (!ret && ext4_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
 				from, to, NULL, do_journal_get_write_access);
 	}
-prepare_write_failed:
-	if (ret)
+
+	if (ret) {
 		ext4_journal_stop(handle);
+		unlock_page(page);
+		page_cache_release(page);
+	}
+
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
@@ -1185,12 +1201,12 @@ int ext4_journal_dirty_data(handle_t *ha
 	int err = jbd2_journal_dirty_data(handle, bh);
 	if (err)
 		ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-						bh, handle,err);
+						bh, handle, err);
 	return err;
 }
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
 	if (!buffer_mapped(bh) || buffer_freed(bh))
 		return 0;
@@ -1205,78 +1221,100 @@ static int commit_write_fn(handle_t *han
  * ext4 never places buffers on inode->i_mapping->private_list. metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext4_ordered_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
+	unsigned from, to;
 	int ret = 0, ret2;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
 	ret = walk_page_buffers(handle, page_buffers(page),
 		from, to, NULL, ext4_journal_dirty_data);
 	if (ret == 0) {
 		/*
-		 * generic_commit_write() will run mark_inode_dirty() if i_size
+		 * generic_write_end() will run mark_inode_dirty() if i_size
 		 * changes. So let's piggyback the i_disksize mark_inode_dirty
 		 * into that.
 		 */
 		loff_t new_i_size;
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
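The conversion at the top of the new write_begin path, from a byte position and length to a page index plus an in-page byte range, can be tried out in user space. This is a minimal sketch assuming 4096-byte pages; `SKETCH_PAGE_SHIFT`, `struct page_range` and `pos_to_page_range` are hypothetical names that mirror the index/from/to computation in the patch rather than any kernel interface.

```c
/* Assumed page geometry for the sketch (4096-byte pages). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Page index and byte range within that page, old prepare_write style. */
struct page_range {
    unsigned long long index;  /* which page of the file */
    unsigned from;             /* first byte within the page */
    unsigned to;               /* one past the last byte within the page */
};

/* Translate a (pos, len) pair, new-aops style, into (index, from, to). */
struct page_range pos_to_page_range(unsigned long long pos, unsigned len)
{
    struct page_range r;

    r.index = pos >> SKETCH_PAGE_SHIFT;
    r.from = (unsigned)(pos & (SKETCH_PAGE_SIZE - 1));
    r.to = r.from + len;
    return r;
}
```

The sketch assumes, as the patch appears to, that the caller has already limited len so that the range stays within one page; it does no clamping of its own.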
Announce: new-aops-1 for 2.6.21-rc3
OK, I've gone through and fixed several bugs until the thing actually
survives fsx-linux for both ext2 and ext3 ordered and writeback (both when
using the new aops, and the legacy prepare_write path). Actually ext3
sometimes breaks, but it does in unpatched kernels anyway.

At 15 patches (including the initial buffered write deadlock fixes), it is
too much to keep posting -- not much has fundamentally changed, so I'll
just post occasionally if we make big changes. The quilt format is
probably easier for someone wishing to work on it anyway.

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
(excludes the OCFS2 patch that Mark sent, in anticipation of an update)

It would be really nice if filesystem developers could take a look at the
new interfaces some time, because otherwise they might get stuck with
it :)

So I'm cc'ing a few filesystems that come to mind, that I haven't heard
anything from.

Thanks,
Nick
Re: Announce: new-aops-1 for 2.6.21-rc3
On Thu, Mar 15, 2007 at 12:32:45PM -0700, Joel Becker wrote:
> On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> > At 15 patches (including the initial buffered write deadlock fixes),
> > it is too much to keep posting -- not much has fundamentally changed,
> > so I'll just post occasionally if we make big changes. The quilt
> > format is probably easier for someone wishing to work on it anyway.
> >
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
>
> For future drops, can you provide the unpacked patches too, so lazy
> people like me can read them in the browser? Thanks.

Sorry, I did intend to unpack that, but forgot. It's done now, the new
directory containing the patches is under the same URL as above.

Thanks,
Nick
Re: Announce: new-aops-1 for 2.6.21-rc3
On Thu, Mar 15, 2007 at 12:53:51PM -0700, Mark Fasheh wrote:
> On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> > OK, I've gone through and fixed several bugs until the thing actually
> > survives fsx-linux for both ext2 and ext3 ordered and writeback (both
> > when using the new aops, and the legacy prepare_write path). Actually
> > ext3 sometimes breaks, but it does in unpatched kernels anyway.
> >
> > At 15 patches (including the initial buffered write deadlock fixes),
> > it is too much to keep posting -- not much has fundamentally changed,
> > so I'll just post occasionally if we make big changes. The quilt
> > format is probably easier for someone wishing to work on it anyway.
>
> Hmm, we still left out some exports...

Thanks, applied.