Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-14 Thread Hugh Dickins
On Sat, 13 Oct 2007, Pekka Enberg wrote:
 On 10/12/07, Hugh Dickins [EMAIL PROTECTED] wrote:
  But I keep suspecting that the answer might be the patch below (which
  rather follows what drivers/block/rd.c is doing).  I'm especially
  worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned
  to userspace, bad enough in itself, you might be liable to hit that
  BUG_ON(page_mapped(page)).  shmem_writepage does not expect to be
  called by anyone outside mm/vmscan.c, but unionfs can now get to it?
 
 Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
 without unionfs even?

I believe not.  Please do double-check my assertions, I've always found
the _writepages paths rather twisty; but my belief (supported by the
fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page))
in five years) is that tmpfs/shmem opts out of all of that with its
.capabilities   = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
in shmem_backing_dev_info, which avoids all those _writepages avenues
(e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is
just a subfunction of the _writepages.

So, while I don't disagree with your patch to write_cache_pages (though
it wasn't clear to me whether it should break from or continue the loop
if it ever does meet an AOP_WRITEPAGE_ACTIVATE), I don't think that's
really the root of the problem.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-14 Thread Pekka Enberg
Hi Hugh,

On Sat, 13 Oct 2007, Pekka Enberg wrote:
 Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
 without unionfs even?

On 10/14/07, Hugh Dickins [EMAIL PROTECTED] wrote:
 I believe not.  Please do double-check my assertions, I've always found
 the _writepages paths rather twisty; but my belief (supported by the
 fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page))
 in five years) is that tmpfs/shmem opts out of all of that with its
 .capabilities   = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
 in shmem_backing_dev_info, which avoids all those _writepages avenues
 (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is
 just a subfunction of the _writepages.

Thanks for the explanation, you're obviously correct.

However, I don't think the mapping_cap_writeback_dirty() check in
__filemap_fdatawrite_range() works as expected when tmpfs is a lower
mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability
for unionfs mappings so do_fsync() will call write_cache_pages() that
unconditionally invokes shmem_writepage() via unionfs_writepage().
Unless, of course, there's some other unionfs magic I am missing.

   Pekka
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-14 Thread Erez Zadok
In message [EMAIL PROTECTED], Pekka Enberg writes:
 Hi Hugh,
 
 On Sat, 13 Oct 2007, Pekka Enberg wrote:
  Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
  without unionfs even?
 
 On 10/14/07, Hugh Dickins [EMAIL PROTECTED] wrote:
  I believe not.  Please do double-check my assertions, I've always found
  the _writepages paths rather twisty; but my belief (supported by the
  fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page))
  in five years) is that tmpfs/shmem opts out of all of that with its
  .capabilities   = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
  in shmem_backing_dev_info, which avoids all those _writepages avenues
  (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is
  just a subfunction of the _writepages.
 
 Thanks for the explanation, you're obviously correct.
 
 However, I don't think the mapping_cap_writeback_dirty() check in
 __filemap_fdatawrite_range() works as expected when tmpfs is a lower
 mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability
 for unionfs mappings so do_fsync() will call write_cache_pages() that
 unconditionally invokes shmem_writepage() via unionfs_writepage().
 Unless, of course, there's some other unionfs magic I am missing.
 
Pekka

In unionfs_writepage() I tried to emulate as best possible what the lower
f/s will have returned to the VFS.  Since tmpfs's -writepage can return
AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.

Should I be doing something different when unionfs stacks on top of tmpfs?
(BTW, this is probably also relevant to ecryptfs.)

Thanks,
Erez.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-14 Thread Pekka J Enberg
Hi Erez,

On Sun, 14 Oct 2007, Erez Zadok wrote:
 In unionfs_writepage() I tried to emulate as best possible what the lower
 f/s will have returned to the VFS.  Since tmpfs's -writepage can return
 AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
 unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.
 
 Should I be doing something different when unionfs stacks on top of tmpfs?
 (BTW, this is probably also relevant to ecryptfs.)

Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be 
calling unionfs_writepage() _at all_ if the lower mapping has 
BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally 
untested patch below?

Pekka

---
 fs/unionfs/mmap.c |   17 +
 1 file changed, 17 insertions(+)

Index: linux-2.6.23-rc8/fs/unionfs/mmap.c
===
--- linux-2.6.23-rc8.orig/fs/unionfs/mmap.c
+++ linux-2.6.23-rc8/fs/unionfs/mmap.c
@@ -17,6 +17,7 @@
  * published by the Free Software Foundation.
  */
 
+#include linux/backing-dev.h
 #include union.h
 
 /*
@@ -144,6 +145,21 @@ out:
return err;
 }
 
+static int unionfs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+   struct inode *lower_inode;
+   struct inode *inode;
+
+   inode = mapping-host;
+   lower_inode = unionfs_lower_inode(inode);
+
+   if (!mapping_cap_writeback_dirty(lower_inode-i_mapping))
+   return 0;
+
+   return generic_writepages(mapping, wbc);
+}
+
 /*
  * readpage is called from generic_page_read and the fault handler.
  * If your file system uses generic_page_read for the read op, it
@@ -371,6 +387,7 @@ out:
 
 struct address_space_operations unionfs_aops = {
.writepage  = unionfs_writepage,
+   .writepages = unionfs_writepages,
.readpage   = unionfs_readpage,
.prepare_write  = unionfs_prepare_write,
.commit_write   = unionfs_commit_write,
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-14 Thread Erez Zadok
In message [EMAIL PROTECTED], Pekka J Enberg writes:
 Hi Erez,
 
 On Sun, 14 Oct 2007, Erez Zadok wrote:
  In unionfs_writepage() I tried to emulate as best possible what the lower
  f/s will have returned to the VFS.  Since tmpfs's -writepage can return
  AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
  unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.
  
  Should I be doing something different when unionfs stacks on top of tmpfs?
  (BTW, this is probably also relevant to ecryptfs.)
 
 Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be 
 calling unionfs_writepage() _at all_ if the lower mapping has 
 BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally 
 untested patch below?
 
   Pekka
[...]

Pekka, with a small change to your patch (to handle time-based cache
coherency), your patch worked well and passed all my tests.  Thanks.

So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE
from being returned to userland.  I guess we still need it, b/c even with
your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to
the VFS and we need to ensure that doesn't leak outside the kernel.

Erez.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-mm1: BUG in reiserfs_delete_xattrs

2007-10-14 Thread Laurent Riffard
Le 12.10.2007 06:31, Andrew Morton a écrit :
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23/2.6.23-mm1/

/home is mounted with the following options:
   /dev/mapper/vglinux1-lvhome on /home type reiserfs 
(rw,noatime,nodiratime,user_xattr)

I guess that beagled (the Beagle desktop search daemon) has populated user
xattrs on almost all files. Now, when I delete a file, two BUGs occur
and the system hangs. Here is the stack for the first BUG (the second
one is very similar):

[partially hand copied stack]
_fput
fput
reiserfs_delete_xattrs
reiserfs_delete_inode
generic_delete_inode
generic_drop_inode
iput
do_unlinkat
sys_unlink
sys_enter_past_esp

I reported a similar BUG in 2.6.22-rc8-mm2 (see
http://lkml.org/lkml/2007/9/27/235). Dave Hansen sent a patch for it, I
tested it and it was OK for 2.6.22-rc8-mm2.

I tried this patch on 2.6.23-mm1, and it fixed the BUGs here too.


From: Dave Hansen [EMAIL PROTECTED]

The bug is caused by reiserfs creating a special 'struct file' with a
NULL vfsmount.  

/* Opens a file pointer to the attribute associated with inode */
static struct file *open_xa_file(const struct inode *inode, const char
*name,
 int flags)
{
...
fp = dentry_open(xafile, NULL, O_RDWR);
/* dentry_open dputs the dentry if it fails */


As Christoph just said, this is somewhat of a bandaid.  But, it
shouldn't hurt anything.

---

 lxc-dave/fs/file_table.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/open.c~fix-reiserfs-oops fs/open.c
diff -puN fs/file_table.c~fix-reiserfs-oops fs/file_table.c
--- lxc/fs/file_table.c~fix-reiserfs-oops   2007-09-27 13:32:20.0 
-0700
+++ lxc-dave/fs/file_table.c2007-09-27 13:33:11.0 -0700
@@ -236,7 +236,7 @@ void fastcall __fput(struct file *file)
fops_put(file-f_op);
if (file-f_mode  FMODE_WRITE) {
put_write_access(inode);
-   if (!special_file(inode-i_mode))
+   if (!special_file(inode-i_mode)  mnt)
mnt_drop_write(mnt);
}
put_pid(file-f_owner.pid);
diff -puN include/linux/mount.h~fix-reiserfs-oops include/linux/mount.h
_


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: XFS regression?

2007-10-14 Thread David Chinner
On Sat, Oct 13, 2007 at 07:05:17PM +0530, Bhagi rathi wrote:
 David, Can you let me know the use after free problem? I want to understand
 how the life cycle of linux inode
 and xfs inode are related to log flush.

Log I/O completion:

  - xfs_trans_commited
   - xfs_iunpin(xfs inode)
 get linux inode from xfs inode
  - mark_inode_dirty_sync(linux inode)

Freeing the linux inode:

clear_inode(linux_inode)
 - xfs_inactive()
- xfs_trans_commit() (e.g. freeing data associated with unlinked inode)
   - xfs_ipin()
(link between xfs and linux inode broken)
  linux inode freed 
 
So, in log I/O completion, we can be completing a previous
transaction at the same time clear_inode() is running, and
hence in xfs_iunpin() we can race with the freeing of the
linux inode as xfs_iunpin does not hold any locks.

 Any pointer is also of great help.

/me points at the code.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: XFS regression?

2007-10-14 Thread David Chinner
On Fri, Oct 12, 2007 at 12:36:01PM +0100, Andrew Clayton wrote:
 On Fri, 12 Oct 2007 10:26:13 +1000, David Chinner wrote:
  
  You can breath again. Here's a test patch (warning - may harm
 
 heh
 
  kittens - not fully tested or verified) that solves both
  the use-after-free issue (by avoiding it altogether) as well the
  unlink/create latency because the log force is no longer there.
  
  (yay! xfsqa test 016 passes again ;)
  
  It does have other possible side effects triggering extra
  log forces elsewhere on inode writeback and affects sync behaviour
  so it's only a proof of concept at this point.
 
 What kernel is that against?. I got rejects with 2.6.23 

The xfs-dev tree - i.e. the XFS that will be in 2.6.25 ;)

 However I tried a 2.6.18 on the file server and ran my test, it didn't
 show the problem. I then made a 2.6.23 but with the patch from my git
 bisect reverted.
 
 Doing the test with that kernel, while writing a 1GB file I saw only
 one  1 second latency (1.2) and only a few ~ 0.5 second latencies.
 
 However over the longer term I'm still seeing latencies  1 second.

Sure - you've got a busy disk. If the truncate has to flush the log
and wait for space, then it's going to take some time for I/Os
to complete. Full queue + busy disk == unpredictable latency for all
operations.

 Just leaving my strace test running (no dd) on the raid filesystem I see
 the
 latencies come when the raid5 stripe cache fills up. So I think I'm
 perhaps seeing another problem here.

Software raid isn't good for latency, either ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html