Re: How Inactive may be much greather than cached?

2007-10-18 Thread Nick Piggin
Hi,

On Thursday 18 October 2007 16:24, Vasily Averin wrote:
 Hi all,

 Could anybody explain how Inactive can be much greater than Cached? A
 stress test (http://weather.ou.edu/~apw/projects/stress/) that writes into
 removed files in a loop puts the node into the following state:

 MemTotal: 16401648 kB
 MemFree: 636644 kB
 Buffers: 1122556 kB
 Cached: 362880 kB
 SwapCached: 700 kB
 Active: 1604180 kB
 Inactive: 13609828 kB

 At first glance the memory should be freed when the file is closed: nobody
 refers to the file, and ext3_delete_inode() truncates the inode. We can see
 that the memory goes away from Cached; however, could somebody explain why
 it becomes inactive instead of being freed? Who holds the references to
 these pages?

Buffers, swap cache, and anonymous.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How Inactive may be much greather than cached?

2007-10-18 Thread Nick Piggin
On Thursday 18 October 2007 17:14, Vasily Averin wrote:
 Nick Piggin wrote:
  Hi,
 
  On Thursday 18 October 2007 16:24, Vasily Averin wrote:
  Hi all,
 
  Could anybody explain how Inactive can be much greater than Cached? A
  stress test (http://weather.ou.edu/~apw/projects/stress/) that writes
  into removed files in a loop puts the node into the following state:
 
  MemTotal: 16401648 kB
  MemFree: 636644 kB
  Buffers: 1122556 kB
  Cached: 362880 kB
  SwapCached: 700 kB
  Active: 1604180 kB
  Inactive: 13609828 kB
 
  At first glance the memory should be freed when the file is closed:
  nobody refers to the file, and ext3_delete_inode() truncates the inode.
  We can see that the memory goes away from Cached; however, could somebody
  explain why it becomes inactive instead of being freed? Who holds the
  references to these pages?
 
  Buffers, swap cache, and anonymous.

 But buffers and swap cache are low (1.1 GB and 700 kB in this example), and
 anonymous memory should go away when the process has finished.

Ah, I didn't see it was an order of magnitude out.

Some filesystems, including, I believe, ext3 with data=ordered,
can leave orphaned pages around after they have been truncated
out of the pagecache. These pages are left on the LRU, and vmscan
reclaims them pretty easily.

Try ext3 data=writeback, or even ext2.
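
For reference, the arithmetic behind "an order of magnitude out" can be sketched in a few lines of user-space C. This is only an illustration built from the /proc/meminfo figures quoted above; the struct and function names are made up for the example and are not kernel interfaces. It assumes, generously, that all of Buffers, Cached and SwapCached sit on the inactive list, and reports how much of Inactive is still unexplained.

```c
#include <assert.h>

/* Hypothetical container for the /proc/meminfo fields used here (all in kB). */
struct meminfo_sample {
	unsigned long buffers_kb;
	unsigned long cached_kb;
	unsigned long swap_cached_kb;
	unsigned long inactive_kb;
};

/*
 * Return the part of Inactive that cannot be explained even if every
 * buffer, page cache and swap cache page were on the inactive list.
 */
unsigned long unexplained_inactive_kb(const struct meminfo_sample *m)
{
	unsigned long accounted;

	assert(m);
	accounted = m->buffers_kb + m->cached_kb + m->swap_cached_kb;
	if (m->inactive_kb <= accounted)
		return 0;
	return m->inactive_kb - accounted;
}
```

With the numbers from the report (Buffers 1122556, Cached 362880, SwapCached 700, Inactive 13609828), roughly 12 GB of inactive memory is left over, which is the gap the orphaned pages would have to account for.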


Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

2007-06-14 Thread Nick Piggin
On Thu, Jun 14, 2007 at 11:52:49AM +0200, Jan Kara wrote:
  On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
  ..

5) ext3_write_end:
Before the write_begin/write_end patch set we had the following locking
order:
stop_journal(handle);
unlock_page(page);
But now the order is the opposite:
unlock_page(page);
stop_journal(handle);
Can we get a race condition now? I'm not sure whether it is an actual
problem; maybe somebody can describe this.
   
   Can we just change it to the original order? That would seem to be
   safest unless one of the ext3 devs explicitly acks it.
   Sorry, I missed the beginning of this thread. But what problems exactly
 can this ordering change cause? ext3_journal_stop() does not need to be
 protected by the page lock; it may even be better that it is not, as it
 can trigger a commit, and all of that would happen unnecessarily under the
 page lock...

Sure, if you think it is safe. I would rather it be done in a
different patch though.



Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

2007-06-13 Thread Nick Piggin
On Wed, Jun 13, 2007 at 04:07:01PM -0700, Badari Pulavarty wrote:
 On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
 ..
   
   5) ext3_write_end:
 Before the write_begin/write_end patch set we had the following locking
 order:
 stop_journal(handle);
 unlock_page(page);
 But now the order is the opposite:
 unlock_page(page);
 stop_journal(handle);
 Can we get a race condition now? I'm not sure whether it is an actual
 problem; maybe somebody can describe this.
  
  Can we just change it to the original order? That would seem to be
  safest unless one of the ext3 devs explicitly acks it.
 
 It would be nice to go back to the original order, but it's not that
 simple with the current structure of the code. With Nick's patches,
 unlock_page() happens in generic_write_end(). journal_stop()
 needs to happen after generic_write_end(). :(

Well we could use block_write_end?

 
 Mingming, can you take a look at the currently proposed order?
 I ran into a bunch of races when I tried to change the order for
 ->writepages() support earlier :(

OK, it sounds like we probably want to revert to the original
order at least for this patchset. If the new order is proven
safe then that could be introduced later to simplify things...

Thanks,
Nick
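
The two orderings discussed in this thread can be pictured with a small user-space mock. This is a sketch only: none of these functions are the kernel's, the `mock_` and `write_end_` names are invented for illustration, and the event log stands in for real locking. It assumes, as described above, that the generic helper unlocks the page itself while the lower-level block helper leaves the page locked for the caller.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical event log standing in for real kernel calls. */
#define MAX_EVENTS 8

struct event_log {
	const char *events[MAX_EVENTS];
	int count;
};

void log_event(struct event_log *log, const char *name)
{
	assert(log->count < MAX_EVENTS);
	log->events[log->count++] = name;
}

/* Mock: commits the write but leaves the page locked. */
void mock_block_write_end(struct event_log *log) { log_event(log, "block_write_end"); }
void mock_journal_stop(struct event_log *log) { log_event(log, "journal_stop"); }
void mock_unlock_page(struct event_log *log) { log_event(log, "unlock_page"); }

/* Mock: generic helper that commits the write and unlocks the page itself. */
void mock_generic_write_end(struct event_log *log)
{
	mock_block_write_end(log);
	mock_unlock_page(log);
}

/* New order: the page is unlocked before the journal handle is stopped. */
void write_end_new_order(struct event_log *log)
{
	mock_generic_write_end(log);
	mock_journal_stop(log);
}

/* Original order, recovered by calling the lower-level helper directly. */
void write_end_original_order(struct event_log *log)
{
	mock_block_write_end(log);
	mock_journal_stop(log);
	mock_unlock_page(log);
}

/* Return the position of an event in the log, or -1 if it never happened. */
int event_index(const struct event_log *log, const char *name)
{
	for (int i = 0; i < log->count; i++)
		if (strcmp(log->events[i], name) == 0)
			return i;
	return -1;
}
```

Running both variants against an empty log and comparing `event_index()` for "journal_stop" and "unlock_page" shows the inversion directly. Whether the inverted order is actually unsafe in ext3 is exactly the open question in the thread; the mock says nothing about that.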



Re: [BUG] fs/buffer.c:1821 in 2.6.22-rc4-mm2

2007-06-11 Thread Nick Piggin

Andrew Morton wrote:

On Sun, 10 Jun 2007 17:57:14 +0200 Eric Sesterhenn / Snakebyte [EMAIL PROTECTED] wrote:



hi,

i got the following BUG while running the syscalls.sh
from ltp-full-20070531 on an ext3 partition, it is easily reproducible
for me

[  476.338068] [ cut here ]
[  476.338223] kernel BUG at fs/buffer.c:1821!
[  476.338324] invalid opcode:  [#1]
[  476.338423] PREEMPT 
[  476.338665] Modules linked in:

[  476.338833] CPU:0
[  476.338836] EIP:0060:[c01a1914]Not tainted VLI
[  476.338840] EFLAGS: 00010202   (2.6.22-rc4-mm2 #1)
[  476.339206] EIP is at __block_prepare_write+0x64/0x410
[  476.339311] eax: 0001   ebx: c136fbb8   ecx: c07faf28   edx: 0001
[  476.339417] esi: c1dc9040   edi: c32d2dfc   ebp: c3733db8   esp: c3733d50
[  476.339584] ds: 007b   es: 007b   fs:   gs: 0033  ss: 0068
[  476.339690] Process vmsplice01 (pid: 7680, ti=c3733000 task=c351ed60 task.ti=c3733000)
[  476.339796] Stack: c3733d70 c0143e76 c1a0eab0 0046 c2509d64 0cd8 c136fbb8
[  476.340675] c32d2dfc 0296 c02313b6 c1086088 0050 c02313b6 c1dc9040 c2509d50
[  476.341491] c1dc9054 c3733dc4 c02313e9 c3733dbc c015728d c32d2f0c  c136fbb8
[  476.342371] Call Trace:

[  476.342565]  [c01a1d83] block_write_begin+0x83/0xf0
[  476.342804]  [c0207778] ext3_write_begin+0xc8/0x1c0
[  476.342987]  [c01595bf] pagecache_write_begin+0x4f/0x150
[  476.343243]  [c019db3b] pipe_to_file+0x9b/0x170
[  476.343418]  [c019d4b0] __splice_from_pipe+0x70/0x260
[  476.343654]  [c019d6e8] splice_from_pipe+0x48/0x70
[  476.343828]  [c019d9f8] generic_file_splice_write+0x88/0x130
[  476.344066]  [c019d267] do_splice_from+0xb7/0xc0
[  476.344240]  [c019ea51] sys_splice+0x1a1/0x230
[  476.344474]  [c01043be] sysenter_past_esp+0x5f/0x99
[  476.344656]  [e410] 0xe410
[  476.344882]  ===
[  476.344984] INFO: lockdep is turned off.
[  476.345084] Code: 00 0f 97 c2 e8 ee 2f 22 00 85 c0 74 04 0f 0b eb fe 31 d2 b8 28 af 7f c0 81 7d 08 00 10 00 00 0f 97 c2 e8 d0 2f 22 00 85 c0 74 04 0f 0b eb fe 8b 55 08 39 55 b0 0f 97 c0 0f b6 d0 b8 0c af 7f c0
[  476.350365] EIP: [c01a1914] __block_prepare_write+0x64/0x410 SS:ESP 0068:c3733d50



Yep, vmsplice01 is not supported on -mm kernels ;)

Nick has a protofix but I don't think it's been tested yet.


Yeah, sorry I didn't catch that after you merged :P
This should be the correct bugfix attached -- it is just a typo.

--
SUSE Labs, Novell Inc.
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -570,7 +570,7 @@ static int pipe_to_file(struct pipe_inod
 	if (this_len + offset > PAGE_CACHE_SIZE)
 		this_len = PAGE_CACHE_SIZE - offset;
 
-	ret = pagecache_write_begin(file, mapping, sd->pos, sd->len,
+	ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
 				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
 	if (unlikely(ret))
 		goto out;
@@ -583,11 +583,12 @@ static int pipe_to_file(struct pipe_inod
 		char *dst = kmap_atomic(page, KM_USER1);
 
 		memcpy(dst + offset, src + buf->offset, this_len);
+		flush_dcache_page(page);
 		kunmap_atomic(dst, KM_USER1);
 		buf->ops->unmap(pipe, buf, src);
 	}
 
-	ret = pagecache_write_end(file, mapping, sd->pos, sd->len, sd->len, page, fsdata);
+	ret = pagecache_write_end(file, mapping, sd->pos, this_len, this_len, page, fsdata);
 
 out:
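
The clamping that the fix relies on can be tried out in user space. The following is a minimal sketch, not the kernel code: the page size is assumed to be 4096 bytes, and `clamp_len_to_page` is a hypothetical name. It shows why passing the clamped `this_len` rather than the full requested length keeps the range inside a single page, which is presumably the condition the BUG check in __block_prepare_write was tripping over.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, for illustration only */

/*
 * Return how many of `len` bytes starting at file position `pos` fit
 * into the page that contains `pos`.
 */
unsigned long clamp_len_to_page(unsigned long long pos, unsigned long len)
{
	unsigned long offset = (unsigned long)(pos & (SKETCH_PAGE_SIZE - 1));
	unsigned long this_len = len;

	/* Same test as "this_len + offset > PAGE_SIZE", written to avoid overflow. */
	if (this_len > SKETCH_PAGE_SIZE - offset)
		this_len = SKETCH_PAGE_SIZE - offset;

	/* The range handed on must never cross the end of the page. */
	assert(offset + this_len <= SKETCH_PAGE_SIZE);
	return this_len;
}
```

For example, a 200-byte request at position 4000 is cut down to the 96 bytes that remain in the first page; the rest would be handled by a later call for the next page.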
 


Re: [patch 17/41] ext2 convert to new aops.

2007-05-16 Thread Nick Piggin
On Mon, May 14, 2007 at 04:06:36PM +1000, [EMAIL PROTECTED] wrote:
 Cc: linux-ext4@vger.kernel.org
 Cc: Linux Filesystems [EMAIL PROTECTED]
 Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Found a problem in ext2 pagecache directory handling. Trivial fix follows.
Longer-term, it might be better to rework these things a bit so they can
directly use the pagecache_write_begin/pagecache_write_end accessors.
---
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -70,10 +70,18 @@ static int ext2_commit_chunk(struct page
 
 	dir->i_version++;
 	block_write_end(NULL, mapping, pos, len, len, page, NULL);
+
+	if (pos+len > dir->i_size) {
+		i_size_write(dir, pos+len);
+		mark_inode_dirty(dir);
+	}
+
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
 		unlock_page(page);
+	mark_page_accessed(page);
+
 	return err;
 }
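
The size update added by this patch follows a simple rule: after committing a chunk, grow the recorded directory size only if the chunk ends beyond it, and mark the inode dirty in that case. A user-space sketch of just that rule is below. The `sketch_inode` struct and `sketch_commit_chunk` function are hypothetical stand-ins, not ext2 code, and the locking and writeback around the real call are left out.

```c
#include <assert.h>

/* Hypothetical stand-in for the few inode fields the sketch needs. */
struct sketch_inode {
	long long i_size;	/* current size in bytes */
	int dirty;		/* set when the inode needs writing back */
};

/*
 * After the chunk [pos, pos + len) has been committed, extend the
 * recorded size if the chunk ends past it. Returns 1 if the size grew.
 */
int sketch_commit_chunk(struct sketch_inode *dir, long long pos, unsigned len)
{
	assert(dir && pos >= 0);

	if (pos + len > dir->i_size) {
		dir->i_size = pos + len;
		dir->dirty = 1;
		return 1;
	}
	return 0;
}
```

A rewrite of an existing chunk leaves the size and the dirty flag alone, while appending a new block to a 4096-byte directory at position 4096 grows it to 8192 bytes.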
 


Re: [PATCH] resolve duplicate flag no for PG_lazyfree

2007-05-15 Thread Nick Piggin

Andrew Morton wrote:

On Mon, 14 May 2007 14:06:19 -0400
Theodore Tso [EMAIL PROTECTED] wrote:



On Sun, May 13, 2007 at 10:46:30PM -0700, Andrew Morton wrote:


otoh, the intersection between pages which are PageBooked() and pages which
are PageLazyFree() should be zero, so it'd be good to actually formalise
this reuse within the ext4 patches.

otoh2, PageLazyFree() could have reused PG_owner_priv_1.

Rik, Ted: any thoughts?  We do need to scrimp on page flags: when we
finally run out, we're screwed.


It makes sense to me.  PG_lazyfree is currently only in -mm, right?



Ah, yes, I got confused, sorry.



I don't see it in my git tree.  It would probably be a good idea to add
some sanity-checking code, if it isn't there already, to make sure that
PG_lazyfree isn't already set when we try to set PG_lazyfree (just in
case there is a bug in the future which causes the should-never-happen
case of trying to lazy-free a PageBooked page).




Actually, I think the current status of
lazy-freeing-of-memory-through-madv_free.patch is that it might not be
needed.  I _think_ we've determined that
0a27a14a62921b438bb6f33772690d345a089be6 sufficiently fixed the
performance problems we had in there?


I think so far we've found that it fixes the MySQL scalability problem,
yes. I couldn't see any statistically significant difference with MySQL
in my tests with MADV_FREE (versus MADV_DONTNEED).

ebizzy is improved a bit at low concurrency but drops off slightly at
higher concurrency.

But basically, I don't think we've found a good reason to use a page
flag and introduce the potential performance regressions that the
MADV_FREE patch has.
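
For what it's worth, the kind of sanity check Ted suggests above for a shared flag bit could look roughly like the following user-space sketch. Everything in it is hypothetical: the struct, the bit number and the function names are invented, and the test-and-set here is a plain read-modify-write rather than the atomic bit operation kernel code would use.

```c
#include <assert.h>

/* Hypothetical flag bit shared by two users that must never overlap. */
#define SKETCH_PG_SHARED_BIT 0

struct sketch_page {
	unsigned long flags;
};

/* Non-atomic test-and-set: sets the bit and returns its previous value. */
int sketch_test_and_set_flag(struct sketch_page *page, int bit)
{
	unsigned long mask = 1UL << bit;
	int was_set = (page->flags & mask) != 0;

	page->flags |= mask;
	return was_set;
}

/*
 * Mark a page with the shared bit. Returns 0 on success, or -1 if the
 * bit was already set (the should-never-happen case).
 */
int sketch_set_lazyfree(struct sketch_page *page)
{
	assert(page);
	if (sketch_test_and_set_flag(page, SKETCH_PG_SHARED_BIT))
		return -1;
	return 0;
}
```

The point of returning the old value is that the caller can warn or bail out as soon as a page already owned by the other user of the bit shows up, instead of silently sharing the flag.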

--
SUSE Labs, Novell Inc.


[patch 19/44] ext4 convert to new aops

2007-04-23 Thread Nick Piggin
Cc: linux-ext4@vger.kernel.org
Cc: Linux Filesystems [EMAIL PROTECTED]
Convert ext4 to use write_begin()/write_end() methods.

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext4/inode.c |  147 +++-
 1 file changed, 93 insertions(+), 54 deletions(-)

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1146,34 +1146,50 @@ static int do_journal_get_write_access(h
 	return ext4_journal_get_write_access(handle, bh);
 }
 
-static int ext4_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
 	handle_t *handle;
 	int retries = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned from, to;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
 
 retry:
-	handle = ext4_journal_start(inode, needed_blocks);
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto out;
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
+	handle = ext4_journal_start(inode, needed_blocks);
+	if (IS_ERR(handle)) {
+		unlock_page(page);
+		page_cache_release(page);
+		ret = PTR_ERR(handle);
+		goto out;
 	}
-	if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-		ret = nobh_prepare_write(page, from, to, ext4_get_block);
-	else
-		ret = block_prepare_write(page, from, to, ext4_get_block);
-	if (ret)
-		goto prepare_write_failed;
 
-	if (ext4_should_journal_data(inode)) {
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				ext4_get_block);
+
+	if (!ret && ext4_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
 				from, to, NULL, do_journal_get_write_access);
 	}
-prepare_write_failed:
-	if (ret)
+
+	if (ret) {
 		ext4_journal_stop(handle);
+		unlock_page(page);
+		page_cache_release(page);
+	}
+
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
@@ -1185,12 +1201,12 @@ int ext4_journal_dirty_data(handle_t *ha
 	int err = jbd2_journal_dirty_data(handle, bh);
 	if (err)
 		ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-						bh, handle,err);
+						bh, handle, err);
 	return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
 	if (!buffer_mapped(bh) || buffer_freed(bh))
 		return 0;
@@ -1205,78 +1221,100 @@ static int commit_write_fn(handle_t *han
  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext4_ordered_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
+	unsigned from, to;
 	int ret = 0, ret2;
 
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
 	ret = walk_page_buffers(handle, page_buffers(page),
 		from, to, NULL, ext4_journal_dirty_data);
 
 	if (ret == 0) {
 		/*
-		 * generic_commit_write() will run mark_inode_dirty() if i_size
+		 * generic_write_end() will run mark_inode_dirty() if i_size
 		 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 		 * into that.
 		 */
 		loff_t new_i_size;
 
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
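
The conversion at the heart of the new interface is from a byte position and length to a page index plus a byte range within that page. Here is a small user-space sketch of that arithmetic, assuming 4096-byte pages; the `SKETCH_` macros, `page_range` struct and function names are invented for the example and only mirror the shifts and masks in the patch above.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12			/* assumed: 4096-byte pages */
#define SKETCH_PAGE_SIZE (1UL << SKETCH_PAGE_SHIFT)

struct page_range {
	unsigned long index;	/* page number within the file */
	unsigned from;		/* first byte within the page */
	unsigned to;		/* one past the last byte within the page */
};

/* Split (pos, len) into a page index and an in-page [from, to) range. */
struct page_range pos_to_page_range(long long pos, unsigned len)
{
	struct page_range r;

	assert(pos >= 0);
	r.index = (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
	r.from = (unsigned)(pos & (SKETCH_PAGE_SIZE - 1));
	r.to = r.from + len;
	return r;
}

/* Inverse direction: file position just past the end of the range. */
long long page_range_end_pos(const struct page_range *r)
{
	return ((long long)r->index << SKETCH_PAGE_SHIFT) + r->to;
}
```

A 200-byte write at position 3 * 4096 + 100 lands in page 3, bytes 100 to 300, and the end position computed back from the page index matches pos + len. The old prepare_write/commit_write interface passed the page and the in-page range directly; the new one passes the position and length and lets the filesystem derive the rest.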

Announce: new-aops-1 for 2.6.21-rc3

2007-03-15 Thread Nick Piggin
OK, I've gone through and fixed several bugs until the thing actually
survives fsx-linux for both ext2 and ext3 ordered and writeback (both
when using the new aops, and the legacy prepare_write path). Actually
ext3 sometimes breaks, but it does in unpatched kernels anyway.

At 15 patches (including the initial buffered write deadlock fixes),
it is too much to keep posting -- not much has fundamentally changed,
so I'll just post occasionally if we make big changes. The quilt
format is probably easier for someone wishing to work on it anyway.

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/

(excludes the OCFS2 patch that Mark sent, in anticipation of an update)

It would be really nice if filesystem developers could take a look
at the new interfaces some time, because otherwise they might get stuck
with it :) So I'm cc'ing a few filesystems that come to mind, that I 
haven't heard anything from. 

Thanks,
Nick


Re: Announce: new-aops-1 for 2.6.21-rc3

2007-03-15 Thread Nick Piggin
On Thu, Mar 15, 2007 at 12:32:45PM -0700, Joel Becker wrote:
 On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
  At 15 patches (including the initial buffered write deadlock fixes),
  it is too much to keep posting -- not much has fundamentally changed,
  so I'll just post occasionally if we make big changes. The quilt
  format is probably easier for someone wishing to work on it anyway.
  
  http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
 
   For future drops, can you provide the unpacked patches too, so
 lazy people like me can read them in the browser?  Thanks.

Sorry, I did intend to unpack that, but forgot. It's done now, the
new directory containing the patches is under the same URL as above.

Thanks,
Nick


Re: Announce: new-aops-1 for 2.6.21-rc3

2007-03-15 Thread Nick Piggin
On Thu, Mar 15, 2007 at 12:53:51PM -0700, Mark Fasheh wrote:
 On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
  OK, I've gone through and fixed several bugs until the thing actually
  survives fsx-linux for both ext2 and ext3 ordered and writeback (both
  when using the new aops, and the legacy prepare_write path). Actually
  ext3 sometimes breaks, but it does in unpatched kernels anyway.
  
  At 15 patches (including the initial buffered write deadlock fixes),
  it is too much to keep posting -- not much has fundamentally changed,
  so I'll just post occasionally if we make big changes. The quilt
  format is probably easier for someone wishing to work on it anyway.
 
 Hmm, we still left out some exports...

Thanks, applied.