Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sat, 13 Oct 2007, Pekka Enberg wrote: On 10/12/07, Hugh Dickins [EMAIL PROTECTED] wrote: But I keep suspecting that the answer might be the patch below (which rather follows what drivers/block/rd.c is doing). I'm especially worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned to userspace, bad enough in itself, you might be liable to hit that BUG_ON(page_mapped(page)). shmem_writepage does not expect to be called by anyone outside mm/vmscan.c, but unionfs can now get to it? Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() without unionfs even? I believe not. Please do double-check my assertions, I've always found the _writepages paths rather twisty; but my belief (supported by the fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) in five years) is that tmpfs/shmem opts out of all of that with its .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, in shmem_backing_dev_info, which avoids all those _writepages avenues (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is just a subfunction of the _writepages. So, while I don't disagree with your patch to write_cache_pages (though it wasn't clear to me whether it should break from or continue the loop if it ever does meet an AOP_WRITEPAGE_ACTIVATE), I don't think that's really the root of the problem. Hugh - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On Sat, 13 Oct 2007, Pekka Enberg wrote: Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() without unionfs even? On 10/14/07, Hugh Dickins [EMAIL PROTECTED] wrote: I believe not. Please do double-check my assertions, I've always found the _writepages paths rather twisty; but my belief (supported by the fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) in five years) is that tmpfs/shmem opts out of all of that with its .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, in shmem_backing_dev_info, which avoids all those _writepages avenues (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is just a subfunction of the _writepages. Thanks for the explanation, you're obviously correct. However, I don't think the mapping_cap_writeback_dirty() check in __filemap_fdatawrite_range() works as expected when tmpfs is a lower mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability for unionfs mappings so do_fsync() will call write_cache_pages() that unconditionally invokes shmem_writepage() via unionfs_writepage(). Unless, of course, there's some other unionfs magic I am missing. Pekka - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message [EMAIL PROTECTED], Pekka Enberg writes: Hi Hugh, On Sat, 13 Oct 2007, Pekka Enberg wrote: Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() without unionfs even? On 10/14/07, Hugh Dickins [EMAIL PROTECTED] wrote: I believe not. Please do double-check my assertions, I've always found the _writepages paths rather twisty; but my belief (supported by the fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) in five years) is that tmpfs/shmem opts out of all of that with its .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, in shmem_backing_dev_info, which avoids all those _writepages avenues (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is just a subfunction of the _writepages. Thanks for the explanation, you're obviously correct. However, I don't think the mapping_cap_writeback_dirty() check in __filemap_fdatawrite_range() works as expected when tmpfs is a lower mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability for unionfs mappings so do_fsync() will call write_cache_pages() that unconditionally invokes shmem_writepage() via unionfs_writepage(). Unless, of course, there's some other unionfs magic I am missing. Pekka In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's -writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. Should I be doing something different when unionfs stacks on top of tmpfs? (BTW, this is probably also relevant to ecryptfs.) Thanks, Erez. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Erez, On Sun, 14 Oct 2007, Erez Zadok wrote: In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's -writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. Should I be doing something different when unionfs stacks on top of tmpfs? (BTW, this is probably also relevant to ecryptfs.) Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be calling unionfs_writepage() _at all_ if the lower mapping has BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally untested patch below? Pekka --- fs/unionfs/mmap.c | 17 + 1 file changed, 17 insertions(+) Index: linux-2.6.23-rc8/fs/unionfs/mmap.c === --- linux-2.6.23-rc8.orig/fs/unionfs/mmap.c +++ linux-2.6.23-rc8/fs/unionfs/mmap.c @@ -17,6 +17,7 @@ * published by the Free Software Foundation. */ +#include linux/backing-dev.h #include union.h /* @@ -144,6 +145,21 @@ out: return err; } +static int unionfs_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *lower_inode; + struct inode *inode; + + inode = mapping-host; + lower_inode = unionfs_lower_inode(inode); + + if (!mapping_cap_writeback_dirty(lower_inode-i_mapping)) + return 0; + + return generic_writepages(mapping, wbc); +} + /* * readpage is called from generic_page_read and the fault handler. * If your file system uses generic_page_read for the read op, it @@ -371,6 +387,7 @@ out: struct address_space_operations unionfs_aops = { .writepage = unionfs_writepage, + .writepages = unionfs_writepages, .readpage = unionfs_readpage, .prepare_write = unionfs_prepare_write, .commit_write = unionfs_commit_write, - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message [EMAIL PROTECTED], Pekka J Enberg writes: Hi Erez, On Sun, 14 Oct 2007, Erez Zadok wrote: In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's -writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. Should I be doing something different when unionfs stacks on top of tmpfs? (BTW, this is probably also relevant to ecryptfs.) Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be calling unionfs_writepage() _at all_ if the lower mapping has BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally untested patch below? Pekka [...] Pekka, with a small change to your patch (to handle time-based cache coherency), your patch worked well and passed all my tests. Thanks. So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE from being returned to userland. I guess we still need it, b/c even with your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to the VFS and we need to ensure that doesn't leak outside the kernel. Erez. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-mm1: BUG in reiserfs_delete_xattrs
Le 12.10.2007 06:31, Andrew Morton a écrit : ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23/2.6.23-mm1/ /home is mounted with the following options: /dev/mapper/vglinux1-lvhome on /home type reiserfs (rw,noatime,nodiratime,user_xattr) I guess that beagled (the Beagle desktop search daemon) has populated user xattrs on almost all files. Now, when I delete a file, two BUGs occur and the system hangs. Here is the stack for the first BUG (the second one is very similar): [partially hand copied stack] _fput fput reiserfs_delete_xattrs reiserfs_delete_inode generic_delete_inode generic_drop_inode iput do_unlinkat sys_unlink sys_enter_past_esp I reported a similar BUG in 2.6.22-rc8-mm2 (see http://lkml.org/lkml/2007/9/27/235). Dave Hansen sent a patch for it, I tested it and it was OK for 2.6.22-rc8-mm2. I tried this patch on 2.6.23-mm1, and it fixed the BUGs here too. From: Dave Hansen [EMAIL PROTECTED] The bug is caused by reiserfs creating a special 'struct file' with a NULL vfsmount. /* Opens a file pointer to the attribute associated with inode */ static struct file *open_xa_file(const struct inode *inode, const char *name, int flags) { ... fp = dentry_open(xafile, NULL, O_RDWR); /* dentry_open dputs the dentry if it fails */ As Christoph just said, this is somewhat of a bandaid. But, it shouldn't hurt anything. --- lxc-dave/fs/file_table.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN fs/open.c~fix-reiserfs-oops fs/open.c diff -puN fs/file_table.c~fix-reiserfs-oops fs/file_table.c --- lxc/fs/file_table.c~fix-reiserfs-oops 2007-09-27 13:32:20.0 -0700 +++ lxc-dave/fs/file_table.c2007-09-27 13:33:11.0 -0700 @@ -236,7 +236,7 @@ void fastcall __fput(struct file *file) fops_put(file-f_op); if (file-f_mode FMODE_WRITE) { put_write_access(inode); - if (!special_file(inode-i_mode)) + if (!special_file(inode-i_mode) mnt) mnt_drop_write(mnt); } put_pid(file-f_owner.pid); diff -puN include/linux/mount.h~fix-reiserfs-oops include/linux/mount.h _ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: XFS regression?
On Sat, Oct 13, 2007 at 07:05:17PM +0530, Bhagi rathi wrote: David, Can you let me know the use after free problem? I want to understand how the life cycle of linux inode and xfs inode are related to log flush. Log I/O completion: - xfs_trans_commited - xfs_iunpin(xfs inode) get linux inode from xfs inode - mark_inode_dirty_sync(linux inode) Freeing the linux inode: clear_inode(linux_inode) - xfs_inactive() - xfs_trans_commit() (e.g. freeing data associated with unlinked inode) - xfs_ipin() (link between xfs and linux inode broken) linux inode freed So, in log I/O completion, we can be completing a previous transaction at the same time clear_inode() is running, and hence in xfs_iunpin() we can race with the freeing of the linux inode as xfs_iunpin does not hold any locks. Any pointer is also of great help. /me points at the code. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: XFS regression?
On Fri, Oct 12, 2007 at 12:36:01PM +0100, Andrew Clayton wrote: On Fri, 12 Oct 2007 10:26:13 +1000, David Chinner wrote: You can breath again. Here's a test patch (warning - may harm heh kittens - not fully tested or verified) that solves both the use-after-free issue (by avoiding it altogether) as well the unlink/create latency because the log force is no longer there. (yay! xfsqa test 016 passes again ;) It does have other possible side effects triggering extra log forces elsewhere on inode writeback and affects sync behaviour so it's only a proof of concept at this point. What kernel is that against?. I got rejects with 2.6.23 The xfs-dev tree - i.e. the XFS that will be in 2.6.25 ;) However I tried a 2.6.18 on the file server and ran my test, it didn't show the problem. I then made a 2.6.23 but with the patch from my git bisect reverted. Doing the test with that kernel, while writing a 1GB file I saw only one 1 second latency (1.2) and only a few ~ 0.5 second latencies. However over the longer term I'm still seeing latencies 1 second. Sure - you've got a busy disk. If the truncate has to flush the log and wait for space, then it's going to take some time for I/Os to complete. Full queue + busy disk == unpredictable latency for all operations. Just leaving my strace test running (no dd) on the raid filesystem I see the latencies come when the raid5 stripe cache fills up. So I think I'm perhaps seeing another problem here. Software raid isn't good for latency, either ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html