[patch 0/3] 2.6.20 fix for PageUptodate memorder problem
Still no independent confirmation as to whether this is a problem or not. I think it is, so I'll propose this patchset to fix it. Patch 1/3 has a reasonable description of the problem. Thanks, Nick -- SuSE Labs - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 1/3] mm: fix PageUptodate memorder
After running SetPageUptodate, preceeding stores to the page contents to actually bring it uptodate may not be ordered with the store to set the page uptodate. Therefore, another CPU which checks PageUptodate is true, then reads the page contents can get stale data. Fix this by ensuring SetPageUptodate is always called with the page locked (except in the case of a new page that cannot be visible to other CPUs), and requiring PageUptodate be checked only when the page is locked. To facilitate lockless checks, SetPageUptodate contains an smp_wmb to order preceeding stores before the store to page flags, and a new PageUptodate_NoLock is introduced, which issues a smp_rmb after the page flags are loaded for the test. I'm still not sure that a DMA memory barrier is not required, however I think the logical place to put such a barrier would be in the IO completion routines, when they come back to tell us that they have succeeded. (Help? Anyone?) One thing I like about it is that it unifies the anonymous page handling with the rest of the page management, by marking anon pages as uptodate when they _are_ uptodate, rather than when our implementation requires that they be marked as such. Doing this let me get rid of the smp_wmb's in the page copying functions, which were specially added for anonymous pages for a closely related issue, always vaguely troubled me. Convert core code and some filesystems to use PageUptodate_NoLock, just for reference (a more complete patch follows). Signed-off-by: Nick Piggin [EMAIL PROTECTED] Index: linux-2.6/include/linux/highmem.h === --- linux-2.6.orig/include/linux/highmem.h +++ linux-2.6/include/linux/highmem.h @@ -57,8 +57,6 @@ static inline void clear_user_highpage(s void *addr = kmap_atomic(page, KM_USER0); clear_user_page(addr, vaddr, page); kunmap_atomic(addr, KM_USER0); - /* Make sure this page is cleared on other CPU's too before using it */ - smp_wmb(); } #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE @@ -108,8 +106,6 @@ static inline void copy_user_highpage(st copy_user_page(vto, vfrom, vaddr, to); kunmap_atomic(vfrom, KM_USER0); kunmap_atomic(vto, KM_USER1); - /* Make sure this page is cleared on other CPU's too before using it */ - smp_wmb(); } #endif Index: linux-2.6/include/linux/page-flags.h === --- linux-2.6.orig/include/linux/page-flags.h +++ linux-2.6/include/linux/page-flags.h @@ -126,16 +126,62 @@ #define ClearPageReferenced(page) clear_bit(PG_referenced, (page)-flags) #define TestClearPageReferenced(page) test_and_clear_bit(PG_referenced, (page)-flags) -#define PageUptodate(page) test_bit(PG_uptodate, (page)-flags) -#ifdef CONFIG_S390 -static inline void SetPageUptodate(struct page *page) +static inline int PageUptodate(struct page *page) +{ + WARN_ON(!PageLocked(page)); + return test_bit(PG_uptodate, (page)-flags); +} + +/* + * PageUptodate to be used when not holding the page lock. + */ +static inline int PageUptodate_NoLock(struct page *page) { + int ret = test_bit(PG_uptodate, (page)-flags); + + /* +* Must ensure that the data we read out of the page is loaded +* _after_ we've loaded page-flags to check for PageUptodate. +* See SetPageUptodate() for the other side of the story. +*/ + smp_rmb(); + + return ret; +} + +static inline void __SetPageUptodate(struct page *page) +{ +#ifdef CONFIG_S390 if (!test_and_set_bit(PG_uptodate, page-flags)) page_test_and_clear_dirty(page); -} #else -#define SetPageUptodate(page) set_bit(PG_uptodate, (page)-flags) + /* +* Memory barrier must be issued before setting the PG_uptodate bit, +* so all previous writes that served to bring the page uptodate are +* visible before PageUptodate becomes true. +* +* S390 is guaranteed to have a barrier in the test_and_set operation +* (see Documentation/atomic_ops.txt). +* +* XXX: does this memory barrier need to be anything special to +* handle things like DMA writes into the page? +*/ + smp_wmb(); + set_bit(PG_uptodate, (page)-flags); #endif +} + +static inline void SetPageUptodate(struct page *page) +{ + WARN_ON(!PageLocked(page)); + __SetPageUptodate(page); +} + +static inline void SetNewPageUptodate(struct page *page) +{ + __SetPageUptodate(page); +} + #define ClearPageUptodate(page)clear_bit(PG_uptodate, (page)-flags) #define PageDirty(page)test_bit(PG_dirty, (page)-flags) Index: linux-2.6/mm/hugetlb.c === --- linux-2.6.orig/mm/hugetlb.c +++ linux-2.6/mm/hugetlb.c @@ -443,6 +443,7 @@ static int hugetlb_cow(struct mm_struct
[patch 2/3] fs: buffer don't PageUptodate without page locked
__block_write_full_page is calling SetPageUptodate without the page locked. This is unusual, but not incorrect, as PG_writeback is still set. However with the previous patch, this is now a problem: so don't bother setting the page uptodate in this case (it is weird that the write path does such a thing anyway). Instead just leave it to the read side to bring the page uptodate when it notices that all buffers are uptodate. Signed-off-by: Nick Piggin [EMAIL PROTECTED] Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -1679,6 +1679,7 @@ static int __block_write_full_page(struc */ BUG_ON(PageWriteback(page)); set_page_writeback(page); + unlock_page(page); do { struct buffer_head *next = bh-b_this_page; @@ -1688,7 +1689,6 @@ static int __block_write_full_page(struc } bh = next; } while (bh != head); - unlock_page(page); err = 0; done: @@ -1698,17 +1698,8 @@ done: * clean. Someone wrote them back by hand with * ll_rw_block/submit_bh. A rare case. */ - int uptodate = 1; - do { - if (!buffer_uptodate(bh)) { - uptodate = 0; - break; - } - bh = bh-b_this_page; - } while (bh != head); - if (uptodate) - SetPageUptodate(page); end_page_writeback(page); + /* * The page and buffer_heads can be released at any time from * here on. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 3/3] mm: make read_cache_page synchronous
Ensure pages are uptodate after returning from read_cache_page, which allows us to cut out most of the filesystem-internal PageUptodate_NoLock calls. I didn't have a great look down the call chains, but this appears to fixes 7 possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd. All depending on whether the filler is async and/or can return with a !uptodate page. Also, a memory leak in sys_swapon(). Signed-off-by: Nick Piggin [EMAIL PROTECTED] Index: linux-2.6/fs/afs/dir.c === --- linux-2.6.orig/fs/afs/dir.c +++ linux-2.6/fs/afs/dir.c @@ -187,10 +187,7 @@ static struct page *afs_dir_get_page(str page = read_mapping_page(dir-i_mapping, index, NULL); if (!IS_ERR(page)) { - wait_on_page_locked(page); kmap(page); - if (!PageUptodate(page)) - goto fail; if (!PageChecked(page)) afs_dir_check_page(dir, page); if (PageError(page)) Index: linux-2.6/fs/afs/mntpt.c === --- linux-2.6.orig/fs/afs/mntpt.c +++ linux-2.6/fs/afs/mntpt.c @@ -77,13 +77,11 @@ int afs_mntpt_check_symlink(struct afs_v } ret = -EIO; - wait_on_page_locked(page); - buf = kmap(page); - if (!PageUptodate(page)) - goto out_free; if (PageError(page)) goto out_free; + buf = kmap(page); + /* examine the symlink's contents */ size = vnode-status.size; _debug(symlink to %*.*s, size, (int) size, buf); @@ -100,8 +98,8 @@ int afs_mntpt_check_symlink(struct afs_v ret = 0; - out_free: kunmap(page); + out_free: page_cache_release(page); out: _leave( = %d, ret); @@ -184,8 +182,7 @@ static struct vfsmount *afs_mntpt_do_aut } ret = -EIO; - wait_on_page_locked(page); - if (!PageUptodate(page) || PageError(page)) + if (PageError(page)) goto error; buf = kmap(page); Index: linux-2.6/fs/cramfs/inode.c === --- linux-2.6.orig/fs/cramfs/inode.c +++ linux-2.6/fs/cramfs/inode.c @@ -180,7 +180,8 @@ static void *cramfs_read(struct super_bl struct page *page = NULL; if (blocknr + i devsize) { - page = read_mapping_page(mapping, blocknr + i, NULL); + page = read_mapping_page_async(mapping, blocknr + i, + NULL); /* synchronous error? */ if (IS_ERR(page)) page = NULL; Index: linux-2.6/fs/ext2/dir.c === --- linux-2.6.orig/fs/ext2/dir.c +++ linux-2.6/fs/ext2/dir.c @@ -161,10 +161,7 @@ static struct page * ext2_get_page(struc struct address_space *mapping = dir-i_mapping; struct page *page = read_mapping_page(mapping, n, NULL); if (!IS_ERR(page)) { - wait_on_page_locked(page); kmap(page); - if (!PageUptodate_NoLock(page)) - goto fail; if (!PageChecked(page)) ext2_check_page(page); if (PageError(page)) Index: linux-2.6/fs/freevxfs/vxfs_subr.c === --- linux-2.6.orig/fs/freevxfs/vxfs_subr.c +++ linux-2.6/fs/freevxfs/vxfs_subr.c @@ -74,10 +74,7 @@ vxfs_get_page(struct address_space *mapp pp = read_mapping_page(mapping, n, NULL); if (!IS_ERR(pp)) { - wait_on_page_locked(pp); kmap(pp); - if (!PageUptodate(pp)) - goto fail; /** if (!PageChecked(pp)) **/ /** vxfs_check_page(pp); **/ if (PageError(pp)) Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1756,7 +1756,7 @@ int generic_file_readonly_mmap(struct fi EXPORT_SYMBOL(generic_file_mmap); EXPORT_SYMBOL(generic_file_readonly_mmap); -static inline struct page *__read_cache_page(struct address_space *mapping, +static struct page *__read_cache_page(struct address_space *mapping, unsigned long index, int (*filler)(void *,struct page*), void *data) @@ -1793,17 +1793,11 @@ repeat: return page; } -/** - * read_cache_page - read into page cache, fill it if needed - * @mapping: the page's address_space - * @index: the page index - * @filler:function
Re: [patch 2/3] fs: buffer don't PageUptodate without page locked
On Tue, 6 Feb 2007 09:02:23 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote: __block_write_full_page is calling SetPageUptodate without the page locked. This is unusual, but not incorrect, as PG_writeback is still set. However with the previous patch, this is now a problem: so don't bother setting the page uptodate in this case (it is weird that the write path does such a thing anyway). Instead just leave it to the read side to bring the page uptodate when it notices that all buffers are uptodate. Signed-off-by: Nick Piggin [EMAIL PROTECTED] Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -1679,6 +1679,7 @@ static int __block_write_full_page(struc */ BUG_ON(PageWriteback(page)); set_page_writeback(page); + unlock_page(page); do { struct buffer_head *next = bh-b_this_page; @@ -1688,7 +1689,6 @@ static int __block_write_full_page(struc } bh = next; } while (bh != head); - unlock_page(page); err = 0; done: Why this change? Without looking at it too hard, it seems that if submit_bh() completes synchronously, this thread can end up playing with the buffers on a non-locked, non-PageWriteback page. Someone else could whip the buffers away and oops? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/3] mm: fix PageUptodate memorder
On Tue, 6 Feb 2007 09:02:11 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote: +static inline void __SetPageUptodate(struct page *page) +{ +#ifdef CONFIG_S390 if (!test_and_set_bit(PG_uptodate, page-flags)) page_test_and_clear_dirty(page); -} #else -#define SetPageUptodate(page)set_bit(PG_uptodate, (page)-flags) + /* + * Memory barrier must be issued before setting the PG_uptodate bit, + * so all previous writes that served to bring the page uptodate are + * visible before PageUptodate becomes true. + * + * S390 is guaranteed to have a barrier in the test_and_set operation + * (see Documentation/atomic_ops.txt). + * + * XXX: does this memory barrier need to be anything special to + * handle things like DMA writes into the page? + */ + smp_wmb(); + set_bit(PG_uptodate, (page)-flags); #endif +} + +static inline void SetPageUptodate(struct page *page) +{ + WARN_ON(!PageLocked(page)); + __SetPageUptodate(page); +} + +static inline void SetNewPageUptodate(struct page *page) +{ + __SetPageUptodate(page); +} I was panicing for a minute when I saw that __SetPageUptodate() in there. Conventionally the __SetPageFoo namespace is for nonatomic updates to page-flags. Can we call this something different? What a fugly patchset :( - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/3] fs: buffer don't PageUptodate without page locked
On Tue, Feb 06, 2007 at 12:21:40AM -0800, Andrew Morton wrote: On Tue, 6 Feb 2007 09:02:23 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote: __block_write_full_page is calling SetPageUptodate without the page locked. This is unusual, but not incorrect, as PG_writeback is still set. However with the previous patch, this is now a problem: so don't bother setting the page uptodate in this case (it is weird that the write path does such a thing anyway). Instead just leave it to the read side to bring the page uptodate when it notices that all buffers are uptodate. Signed-off-by: Nick Piggin [EMAIL PROTECTED] Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -1679,6 +1679,7 @@ static int __block_write_full_page(struc */ BUG_ON(PageWriteback(page)); set_page_writeback(page); + unlock_page(page); do { struct buffer_head *next = bh-b_this_page; @@ -1688,7 +1689,6 @@ static int __block_write_full_page(struc } bh = next; } while (bh != head); - unlock_page(page); err = 0; done: Why this change? Without looking at it too hard, it seems that if submit_bh() completes synchronously, this thread can end up playing with the buffers on a non-locked, non-PageWriteback page. Someone else could whip the buffers away and oops? Hmm, it definitely shouldn't be there, it leaked in from another patch to bring partiy with the error handling... Here is an updated patch. -- __block_write_full_page is calling SetPageUptodate without the page locked. This is unusual, but not incorrect, as PG_writeback is still set. However with the previous patch, this is now a problem: so don't bother setting the page uptodate in this case (it is weird that the write path does such a thing anyway). Instead just leave it to the read side to bring the page uptodate when it notices that all buffers are uptodate. Signed-off-by: Nick Piggin [EMAIL PROTECTED] Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -1698,17 +1698,8 @@ done: * clean. Someone wrote them back by hand with * ll_rw_block/submit_bh. A rare case. */ - int uptodate = 1; - do { - if (!buffer_uptodate(bh)) { - uptodate = 0; - break; - } - bh = bh-b_this_page; - } while (bh != head); - if (uptodate) - SetPageUptodate(page); end_page_writeback(page); + /* * The page and buffer_heads can be released at any time from * here on. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/1][RFC] mm: prepare_write positive return value
Almost all read/write operation handles data with chunks(segments or pages) and result has integral behaviour for folowing scenario: for_each_chunk() { res = op(); if(IS_ERROR(res)) return progress ? progress : res; progress += res; } prepare_write may has integral behaviour in case of blksize pgsize, for example we successfully allocated/read some blocks, but not all of them, and than some error happend. Currently we eliminate this progress by doing vmtrunate() after prepare_has failed. It is good to have ability to signal about this progress. Interprete positive prepare_write() ret code as bytes num that fs ready to handle at this moment. I've ask SAW, he think it is sane. This size always less than PAGE_CACHE_SIZE so it less than AOP_TRUNCATED_PAGE too. BTH: This approach was used in OpenVZ 2.6.9 kernel in order to make FS with delayed allocation more correct, and its works well. I think not everybody will happy about this, but let's discuss all advantages and disadvantages of this change. Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED] - diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 62632f5..b4f6eac 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -239,7 +239,11 @@ static int do_lo_send_aops(struct loop_device *lo, struct bio_vec *bvec, page_cache_release(page); continue; } - goto unlock; + if (ret 0 ret PAGE_CACHE_SIZE) + /* prepare_write demands limit size of bytes. */ + size = min(size, (unsigned)ret); + else + goto unlock; } transfer_result = lo_do_transfer(lo, WRITE, page, offset, bvec-bv_page, bv_offs, size, IV); diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c index 1e5d2ba..b5eebe8 100644 --- a/fs/ecryptfs/mmap.c +++ b/fs/ecryptfs/mmap.c @@ -454,6 +454,17 @@ static int ecryptfs_write_inode_size_to_header(struct file *lower_file, } lower_a_ops = lower_inode-i_mapping-a_ops; rc = lower_a_ops-prepare_write(lower_file, header_page, 0, 8); + if (unlikely(rc 0 rc PAGE_CACHE_SIZE)) { + /* +* prepare_write can handle less bytes whan we need. This is not likely +* to happend realy because usualy we need only one block. In order to +* preserve prepare/commit balanced invoke commit end fail. +*/ + int ret; + ret = lower_a_ops-commit_write(lower_file, header_page, 0, rc); + rc = ret ? ret : -ENOSPC; + goto unlock; + } file_size = (u64)i_size_read(inode); ecryptfs_printk(KERN_DEBUG, Writing size: [0x%.16x]\n, file_size); file_size = cpu_to_be64(file_size); @@ -462,6 +473,7 @@ static int ecryptfs_write_inode_size_to_header(struct file *lower_file, kunmap_atomic(header_virt, KM_USER0); flush_dcache_page(header_page); rc = lower_a_ops-commit_write(lower_file, header_page, 0, 8); +unlock: if (rc 0) ecryptfs_printk(KERN_ERR, Error commiting header page write\n); diff --git a/fs/namei.c b/fs/namei.c index b305589..723db81 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2704,6 +2704,16 @@ retry: page_cache_release(page); goto retry; } + if (unlikely(err 0 err PAGE_CACHE_SIZE)) { + /* +* prepare_write can handle less bytes whan we need. This is not likely +* to happend realy because usualy we need only one block. In order to +* preserve prepare/commit balanced invoke commit end fail. +*/ + int ret; + ret = mapping-a_ops-commit_write(NULL, page, 0, err); + err = ret ? ret : -ENOSPC; + } if (err) goto fail_map; kaddr = kmap_atomic(page, KM_USER0); diff --git a/fs/splice.c b/fs/splice.c index 2fca6eb..d2b92bf 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -652,6 +652,17 @@ find_page: if (unlikely(ret)) { loff_t isize = i_size_read(mapping-host); + if (ret 0 ret PAGE_CACHE_SIZE) { + /* +* prepare_write demands limit size of bytes. In order to +* preserve prepare/commit balanced invoke commit end fail. +* Initial i_size saved, so vmtruncate safely restore it later. +*/ + int ret2; + ret2 = mapping-a_ops-commit_write(file, page, offset, + offset + ret); + ret = ret2 ? ret2 : -ENOSPC; + } if (ret != AOP_TRUNCATED_PAGE) unlock_page(page);
[PATCH 1/1][RFC] mm: prepare_write positive return value
This patch solve ext3/4 retry loop issue Issue description: What we can do if block_prepare_write fail inside ext3_prepare_write ? a) Stop transaction and do retry if possible, but what happend if reboot comes after journal_force_commit, but before we exhaust all retry attempts and generic_file_buffered_write call trancate? We may get allocated blocks outside i_size. Witch is bad. b) Commit newly allocated blocks. This approach was used after Andrey's patch. But if reboot comes, or error happend, user will be surprised that i_size increased but blocks are zero filed. This is internal write operation state becames visiable to user. Witch is also bad. c) Just return as match bytes as we can deal with rigth now back to caller, and let's caller handles it and than call commit. In this scenario we never stop journal in the midle of some internal fs operation. If reboot comes before commit trunsaction was't closed so it will be droped while journal replay. Only (c) tend to garantie attomic data operation. Also fix some issues introduced by retries-in-ext3_prepare_write-violate-ordering-requirements: i_size may increase after error happend. possible data corruption caused commiting non uptodate bh. Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED] - diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index dba6dd2..4c5e9f7 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1154,6 +1154,18 @@ static int do_journal_get_write_access(handle_t *handle, * transaction must include the content of the newly allocated blocks. * This content is expected to be set to zeroes by block_prepare_write(). * 2006/10/14 SAW + * What we can do if some blocks was allocated? + * a) Stop transaction and do retry if possible, but what happend if + *reboot comes after journal_force_commit, but before we exhaust + *all retry attempts and caller call trancate? + *We may get allocated blocks outside i_size. Witch is bad. + * b) Commit newly allocated blocks. And if reboot comes, user will be + *surprised that i_size increased but blocks are zero filed. Witch is + *also bad. + * c) Just return as match bytes as we can deal with rigth now back to + *caller, and if reboot comes trunsaction was't closed so it will + *be droped while journal replay. + * Only (c) tend to garantie attomic data operation. */ static int ext3_prepare_failure(struct file *file, struct page *page, unsigned from, unsigned to) @@ -1186,7 +1198,7 @@ skip: block_start = to; break; } - if (!buffer_mapped(bh)) + if (!buffer_mapped(bh) || !buffer_uptodate(bh)) /* prepare_write failed on this bh */ break; if (ext3_should_journal_data(mapping-host)) { @@ -1204,8 +1216,8 @@ skip: if (block_start = from) goto skip; - /* commit allocated and zeroed buffers */ - return mapping-a_ops-commit_write(file, page, from, block_start); + /* return number of bytes till last mapped uptodate block */ + return block_start - from; } static int ext3_prepare_write(struct file *file, struct page *page, @@ -1239,7 +1251,8 @@ retry: failure: ret2 = ext3_prepare_failure(file, page, from, to); - if (ret2 0) + if (ret2) + /* ready to partial write, or fatal error */ return ret2; if (ret == -ENOSPC ext3_should_retry_alloc(inode-i_sb, retries)) goto retry; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 806eee1..da39f80 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1153,6 +1153,18 @@ static int do_journal_get_write_access(handle_t *handle, * transaction must include the content of the newly allocated blocks. * This content is expected to be set to zeroes by block_prepare_write(). * 2006/10/14 SAW + * What we can do if some blocks was allocated? + * a) Stop transaction and do retry if possible, but what happend if + *reboot comes after journal_force_commit, but before we exhaust + *all retry attempts and caller call trancate? + *We may get allocated blocks outside i_size. Witch is bad. + * b) Commit newly allocated blocks. And if reboot comes, user will be + *surprised that i_size increased but blocks are zero filed. Witch is + *also bad. + * c) Just return as match bytes as we can deal with rigth now back to + *caller, and if reboot comes trunsaction was't closed so it will + *be droped while journal replay. + * Only (c) tend to garantie attomic data operation. */ static int ext4_prepare_failure(struct file *file, struct page *page, unsigned from, unsigned to) @@ -1185,7 +1197,7 @@ skip: block_start = to; break; } - if (!buffer_mapped(bh)) +
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
On Mon, 2007-02-05 at 19:20 -0800, Andreas Gruenbacher wrote: On Monday 05 February 2007 11:02, Christoph Hellwig wrote: On Mon, Feb 05, 2007 at 10:58:26AM -0800, Trond Myklebust wrote: On Mon, 2007-02-05 at 18:44 +, Christoph Hellwig wrote: Just FYI: Al was very opposed to the idea of passing the vfsmount to the vfs_ helpers, so you should discuss this with him. Looking at the actual patches I see you're lazy in a lot of places. Please make sure that when you introduce a vfsmount argument somewhere that it is _always_ passed and not just when it's conveniant. Yes, that's more work, but then again if you're not consistant anyone half-serious will laught at a security model using this infrasturcture. nfsd in particular tends to be a bit lazy about passing around vfsmount info. Forcing it to do so should not be hard since the vfsmount is already cached in the struct export (which can be found using the filehandle). It will take a bit of re-engineering in order to pass that information around inside the nfsd code, though. I actually have a patch to fix that. It's part of a bigger series that's not quite ready, but I hope to finish all of it this month. It's actually not hard to fix, and nfsd would look a little less weird. But what would this add, what do pathnames mean in the context of nfsd, and would nfsd actually become less weird? On the wire, nfs transmits file handles, not filenames. The file handle usually contains the inode numbers of the file and the containing directory. The code in fs/exportfs/expfs.c:find_exported_dentry() shows that fh_verify() should return a dentry that may be connected or not, depending on the export options and whether it's a file or directory. Together with the vfsmount from the svc_export, we could compute a pathname. But there is no way to tell different hardlinks to the same inode in the same directory from each other (both the file and directory inode are the same), and depending on the export options, we may or may not be able to distinguish different hardlinks across directories. Who cares? There is no way to export a partial directory, and in any case the subtree_check crap is borken beyond repair (see cross-directory renames which lead to actual changes to the filehandle - broken, broken, broken). If the nohide or crossmnt export options are used, we might run into similar aliasing issues with mounts (I'm not sure about this). Wrong. Those create new export points since you are crossing into a different filesystem. So we could compute pathnames, but they wouldn't be very meaningful. Instead of going that way, it seems more reasonable to not do pathname based access control for the kernel kernel nfsd at all. Passing NULL as the vfsmounts does that. With a properly configured nfsd, the areas that remote users could access would still be well confined. Huh? Even if you don't pass in a vfsmount, you _still_ need to pass in a super_block. Inodes are only unique per filesystem. In fact, on an ideal NFS export, there is no ambiguity between the two (see above comment about subtree_check) since the entire filesystem will be exported. The focus with this whole pathname based security stuff really is to be able to better confine local processes. The kernel nfsd is a kernel thread, and as such, confining it from a security point of view cannot really work, anyway. Note also that it might be nice to enforce the vfsmount argument by replacing the existing dentry parameters with a struct path instead of adding an extra reference to the vfsmount to existing functions. That definitly sounds like a good idea, independent of whether we want to pass the vfsmount in more places or not. Note that you can still pass a NULL vfsmount in a struct path. ...but we will have Dick Cheney track you down and shoot you if you do. The point is that you only have to check _one_ argument (the initialisation of the struct path) instead of having a check for every function argument in the vfs. Trond - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 3/3] mm: make read_cache_page synchronous
Andrew Morton wrote: On Tue, 6 Feb 2007 09:02:33 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote: Ensure pages are uptodate after returning from read_cache_page, which allows us to cut out most of the filesystem-internal PageUptodate_NoLock calls. Normally it's good to rename functions when we change their behaviour, but I guess any missed (or out-of-tree) filesystems will just end up doing a pointless wait_on_page_locked() and will continue to work OK, yes? Yeah. I didn't have a great look down the call chains, but this appears to fixes 7 possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd. All depending on whether the filler is async and/or can return with a !uptodate page. Also, a memory leak in sys_swapon(). Separate patch? Well its fixed by virtue of read_cache_page now correctly dropping the page refcount if it finds the page !uptodate, rather than any special logic I added. I can do another patch though. No problem, I'll be resending the series after this round of feedback. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/3] mm: fix PageUptodate memorder
Andrew Morton wrote: On Tue, 6 Feb 2007 09:02:11 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote: +static inline void __SetPageUptodate(struct page *page) +{ +#ifdef CONFIG_S390 if (!test_and_set_bit(PG_uptodate, page-flags)) page_test_and_clear_dirty(page); -} #else -#define SetPageUptodate(page) set_bit(PG_uptodate, (page)-flags) + /* +* Memory barrier must be issued before setting the PG_uptodate bit, +* so all previous writes that served to bring the page uptodate are +* visible before PageUptodate becomes true. +* +* S390 is guaranteed to have a barrier in the test_and_set operation +* (see Documentation/atomic_ops.txt). +* +* XXX: does this memory barrier need to be anything special to +* handle things like DMA writes into the page? +*/ + smp_wmb(); + set_bit(PG_uptodate, (page)-flags); #endif +} + +static inline void SetPageUptodate(struct page *page) +{ + WARN_ON(!PageLocked(page)); + __SetPageUptodate(page); +} + +static inline void SetNewPageUptodate(struct page *page) +{ + __SetPageUptodate(page); +} I was panicing for a minute when I saw that __SetPageUptodate() in there. Conventionally the __SetPageFoo namespace is for nonatomic updates to page-flags. Can we call this something different? Duh, of course, sorry. What a fugly patchset :( Fugly problem. One could fix it by always locking the page, but I was worried about Hugh flaming me if I tried that ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH[RFC] kill sysrq-u (emergency remount r/o)
On Mon, Feb 05, 2007 at 10:17:44PM -0500, Theodore Tso wrote: sysrq+u is helpful. It is like \( sysrq+s make sure no further writes go to disk \). I agree it is useful, but if we're going to do it we really should do it right. We should have real revoke() functionality on file descriptors, which revokes all of the mmap()'s (any attempt to write into a previously read/write mmap will cause a SEGV) as well as changing f_mode, and then use that to implement emergency read-only remount. Revoke is only part of it. What we really need is proper forced unmount support. That means revoking any kind of userspace access, blocking new access and making sure the ondisk image is coherent. This would definitly be a useful feature, but it's a lot of work. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
On Tue, Feb 06, 2007 at 12:51:52AM -0800, Trond Myklebust wrote: Who cares? There is no way to export a partial directory, and in any case the subtree_check crap is borken beyond repair (see cross-directory renames which lead to actual changes to the filehandle - broken, broken, broken). It's getting a little oftopic for this thread, but is there a chance we could just kill that crap after a resonable deprecation period? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
On Mon, Feb 05, 2007 at 06:13:26PM -0800, Andreas Gruenbacher wrote: On Monday 05 February 2007 10:44, Christoph Hellwig wrote: Looking at the actual patches I see you're lazy in a lot of places. Please make sure that when you introduce a vfsmount argument somewhere that it is _always_ passed and not just when it's conveniant. Yes, that's more work, but then again if you're not consistant anyone half-serious will laught at a security model using this infrasturcture. It may appear like laziness, but it's not. Let's look at where we're passing NULL at the moment: You know, I've tracked a lot of this down previously when I submitted patches to add vfsmount arguments to the vfs_ helpers, just to get tought by Al that this is a bad idea :) fs/hpfs/namei.c In hpfs_unlink, hpfs truncates one of its own inodes through notify_change(). You definitely don't want any lsms to interfere here, pathname based or not; hpfs should probably truncate its inode itself instead. But given that hpfs goes via the vfs, we at least pass NULL to indicate that this file really has no meaningful paths to it anymore. (In addition, we don't really have a vfsmount at this point anymore, and neither would it make sense to pass it there.) To play more nicely with other lsms, hpfs could mark the inode as private before attempting the truncate. The right fix would be to not go through the vfs. Or even better to remove the utter hack. See my previous patch on this subject and the discussion started on it. fs/reiserfs/xattr.c The directories an files that reiserfs uses internally to store xattrs are hanging off .reiserfs_priv/xattrs in the filesystem. This part of the namespace is not accessible or visible from user space though except through the xattr syscalls. Reiserfs should probably just mark all its xattr inodes as private in order to play nicely with other lsms. As far as pathname based lsms are concerned, pathnames to those fs-internal objects are meaningless though, and so we pass NULL here. Well, the 100% correct fix would be to rip out the braindead reiserfs xattr code :) Of course that's not an option, but reiserfs would be much better of stop using vfs_rmdir. It really doesn't need most of the checks in there and should just call into reiserfs_rmdir instead of doing this useless trip into vfs land. fs/sysfs/file.c fs/nfsd/vfs.c and fs/nfsd/nfs4recover.c For nfsd, the concept of pathnames is a bit peculiar. I'll try to respond to that separately. the actually nfsd filehandle code is conceptually trivially fixable, although with quite a bit of code churn. As mention I'll soon submit a patch for it. nfs4recover.c is broken beyond repair and should just be deleted from the tree. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
On Tuesday February 6, [EMAIL PROTECTED] wrote: On Mon, Feb 05, 2007 at 07:20:35PM -0800, Andreas Gruenbacher wrote: It's actually not hard to fix, and nfsd would look a little less weird. But what would this add, what do pathnames mean in the context of nfsd, and would nfsd actually become less weird? It's not actually a pathname we care about, but a vfsmount + dentry combo. That one means as much in nfsd as elsewhere. We want nfsd to obey r/o or noatime mount flags if /export/foo is exported with them but /foo not. Even better would be to change nfsd so it creates it's own non-visible vfsmount for the filesystems it exports.. What would be the benefit of having private non-visible vfsmounts? Sounds like a recipe for confusion? It is possible that mountd might start doing bind-mounts to create the 'pseudo filesystem' thing for NFSv4, but they would be very visible (under /var/lib/nfs/v4root or something). So having it's own vfsmount might make sense, but I don't get 'non-visible'. NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
On Tuesday February 6, [EMAIL PROTECTED] wrote: On Tue, Feb 06, 2007 at 12:51:52AM -0800, Trond Myklebust wrote: Who cares? There is no way to export a partial directory, and in any case the subtree_check crap is borken beyond repair (see cross-directory renames which lead to actual changes to the filehandle - broken, broken, broken). It's getting a little oftopic for this thread, but is there a chance we could just kill that crap after a resonable deprecation period? Probably. A couple of years? nfs-utils 1.1.0 is due to stop subtree_check from being the default. After that is released, the next kernel can start printing warnings that it might stop working in a couple of years. NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
On Tue, Feb 06, 2007 at 09:26:14PM +1100, Neil Brown wrote: What would be the benefit of having private non-visible vfsmounts? Sounds like a recipe for confusion? It is possible that mountd might start doing bind-mounts to create the 'pseudo filesystem' thing for NFSv4, but they would be very visible (under /var/lib/nfs/v4root or something). So having it's own vfsmount might make sense, but I don't get 'non-visible'. It would allow creating an exported tree without interferance with the local visible tree. Note that the local visible tree isn't global anymore either, and this allows to adjust what's exported through nfsd throug a specific interface instead of needing to get into nfsd namespace through some way. Think of listing the actually exported devices in /etc/exports instead of the indirection through fstab aswell. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] move remove_dquot_ref to dqout.c
On Mon 05-02-07 19:05:27, Christoph Hellwig wrote: Remove_dquot_ref can move to dqout.c instead of beeing in inode.c under #ifdef CONFIG_QUOTA. Also clean the resulting code up Yes, this was because at the time the code was written, inode_lock was not exported from inode.c. a tiny little bit by testing sb-dq_op earlier - it's constant over a filesystems lifetime. Looks fine. Signed-off-by: Christoph Hellwig [EMAIL PROTECTED] Signed-off-by: Jan Kara [EMAIL PROTECTED] Index: linux-2.6/fs/dquot.c === --- linux-2.6.orig/fs/dquot.c 2007-02-05 18:49:41.0 +0100 +++ linux-2.6/fs/dquot.c 2007-02-05 18:53:49.0 +0100 @@ -79,6 +79,7 @@ #include linux/buffer_head.h #include linux/capability.h #include linux/quotaops.h +#include linux/writeback.h /* for inode_lock, oddly enough.. */ #include asm/uaccess.h @@ -756,15 +757,30 @@ } } +static void remove_dquot_ref(struct super_block *sb, int type, + struct list_head *tofree_head) +{ + struct inode *inode; + + spin_lock(inode_lock); + list_for_each_entry(inode, sb-s_inodes, i_sb_list) { + if (!IS_NOQUOTA(inode)) + remove_inode_dquot_ref(inode, type, tofree_head); + } + spin_unlock(inode_lock); +} + /* Gather all references from inodes and drop them */ static void drop_dquot_ref(struct super_block *sb, int type) { LIST_HEAD(tofree_head); - down_write(sb_dqopt(sb)-dqptr_sem); - remove_dquot_ref(sb, type, tofree_head); - up_write(sb_dqopt(sb)-dqptr_sem); - put_dquot_list(tofree_head); + if (sb-dq_op) { + down_write(sb_dqopt(sb)-dqptr_sem); + remove_dquot_ref(sb, type, tofree_head); + up_write(sb_dqopt(sb)-dqptr_sem); + put_dquot_list(tofree_head); + } } static inline void dquot_incr_inodes(struct dquot *dquot, unsigned long number) Index: linux-2.6/fs/inode.c === --- linux-2.6.orig/fs/inode.c 2007-02-05 18:49:41.0 +0100 +++ linux-2.6/fs/inode.c 2007-02-05 18:49:56.0 +0100 @@ -1252,33 +1252,6 @@ EXPORT_SYMBOL(inode_needs_sync); -/* - * Quota functions that want to walk the inode lists.. - */ -#ifdef CONFIG_QUOTA - -void remove_dquot_ref(struct super_block *sb, int type, - struct list_head *tofree_head) -{ - struct inode *inode; - - if (!sb-dq_op) - return; /* nothing to do */ - spin_lock(inode_lock); /* This lock is for inodes code */ - - /* - * We don't have to lock against quota code - test IS_QUOTAINIT is - * just for speedup... - */ - list_for_each_entry(inode, sb-s_inodes, i_sb_list) - if (!IS_NOQUOTA(inode)) - remove_inode_dquot_ref(inode, type, tofree_head); - - spin_unlock(inode_lock); -} - -#endif - int inode_wait(void *word) { schedule(); Index: linux-2.6/include/linux/fs.h === --- linux-2.6.orig/include/linux/fs.h 2007-02-05 18:51:48.0 +0100 +++ linux-2.6/include/linux/fs.h 2007-02-05 18:51:52.0 +0100 @@ -1681,7 +1681,6 @@ extern int __remove_suid(struct dentry *, int); extern int should_remove_suid(struct dentry *); extern int remove_suid(struct dentry *); -extern void remove_dquot_ref(struct super_block *, int, struct list_head *); extern void __insert_inode_hash(struct inode *, unsigned long hashval); extern void remove_inode_hash(struct inode *); Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] remove sb-s_files and file_list_lock usage in dquot.c
Iterate over sb-s_inodes instead of sb-s_files in add_dquot_ref. This reduces list search and lock hold time aswell as getting rid of one of the few uses of file_list_lock which Ingo identified as a scalability problem. Previously we called dq_op-initialize for every inode handing of a writeable file that wasn't initialized before. Now we're calling it for every inode that has a non-zero i_writecount, aka a writeable file descriptor refering to it. Thanks a lot to Jan Kara for running this patch through his quota test harness. Note: this is ontop of '[PATCH] move remove_dquot_ref to dqout.c' which I sent out yesterday. Signed-off-by: Christoph Hellwig [EMAIL PROTECTED] Index: linux-2.6/fs/dquot.c === --- linux-2.6.orig/fs/dquot.c 2007-02-05 18:54:36.0 +0100 +++ linux-2.6/fs/dquot.c2007-02-05 18:59:48.0 +0100 @@ -689,23 +689,27 @@ /* This routine is guarded by dqonoff_mutex mutex */ static void add_dquot_ref(struct super_block *sb, int type) { - struct list_head *p; + struct inode *inode; restart: - file_list_lock(); - list_for_each(p, sb-s_files) { - struct file *filp = list_entry(p, struct file, f_u.fu_list); - struct inode *inode = filp-f_path.dentry-d_inode; - if (filp-f_mode FMODE_WRITE dqinit_needed(inode, type)) { - struct dentry *dentry = dget(filp-f_path.dentry); - file_list_unlock(); - sb-dq_op-initialize(inode, type); - dput(dentry); - /* As we may have blocked we had better restart... */ - goto restart; - } + spin_lock(inode_lock); + list_for_each_entry(inode, sb-s_inodes, i_sb_list) { + if (!atomic_read(inode-i_writecount)) + continue; + if (!dqinit_needed(inode, type)) + continue; + if (inode-i_state (I_FREEING|I_WILL_FREE)) + continue; + + __iget(inode); + spin_unlock(inode_lock); + + sb-dq_op-initialize(inode, type); + iput(inode); + /* As we may have blocked we had better restart... */ + goto restart; } - file_list_unlock(); + spin_unlock(inode_lock); } /* Return 0 if dqput() won't block (note that 1 doesn't necessarily mean blocking) */ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] remove sb-s_files and file_list_lock usage in dquot.c
On Tue 06-02-07 14:23:33, Christoph Hellwig wrote: Iterate over sb-s_inodes instead of sb-s_files in add_dquot_ref. This reduces list search and lock hold time aswell as getting rid of one of the few uses of file_list_lock which Ingo identified as a scalability problem. Previously we called dq_op-initialize for every inode handing of a writeable file that wasn't initialized before. Now we're calling it for every inode that has a non-zero i_writecount, aka a writeable file descriptor refering to it. Thanks a lot to Jan Kara for running this patch through his quota test harness. Note: this is ontop of '[PATCH] move remove_dquot_ref to dqout.c' which I sent out yesterday. Signed-off-by: Christoph Hellwig [EMAIL PROTECTED] Signed-off-by: Jan Kara [EMAIL PROTECTED] Index: linux-2.6/fs/dquot.c === --- linux-2.6.orig/fs/dquot.c 2007-02-05 18:54:36.0 +0100 +++ linux-2.6/fs/dquot.c 2007-02-05 18:59:48.0 +0100 @@ -689,23 +689,27 @@ /* This routine is guarded by dqonoff_mutex mutex */ static void add_dquot_ref(struct super_block *sb, int type) { - struct list_head *p; + struct inode *inode; restart: - file_list_lock(); - list_for_each(p, sb-s_files) { - struct file *filp = list_entry(p, struct file, f_u.fu_list); - struct inode *inode = filp-f_path.dentry-d_inode; - if (filp-f_mode FMODE_WRITE dqinit_needed(inode, type)) { - struct dentry *dentry = dget(filp-f_path.dentry); - file_list_unlock(); - sb-dq_op-initialize(inode, type); - dput(dentry); - /* As we may have blocked we had better restart... */ - goto restart; - } + spin_lock(inode_lock); + list_for_each_entry(inode, sb-s_inodes, i_sb_list) { + if (!atomic_read(inode-i_writecount)) + continue; + if (!dqinit_needed(inode, type)) + continue; + if (inode-i_state (I_FREEING|I_WILL_FREE)) + continue; + + __iget(inode); + spin_unlock(inode_lock); + + sb-dq_op-initialize(inode, type); + iput(inode); + /* As we may have blocked we had better restart... */ + goto restart; } - file_list_unlock(); + spin_unlock(inode_lock); } /* Return 0 if dqput() won't block (note that 1 doesn't necessarily mean blocking) */ Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
Tony Jones wrote: The following are a set of patches the goal of which is to pass vfsmounts through select portions of the VFS layer sufficient to be visible to the LSM inode operation hooks. I was looking forward to these patches for so long. Chris Wright wrote: This kind of change (or perhaps straight to struct path) is definitely needed from AA. Not only AppArmor, but also TOMOYO Linux needs these patches. TOMOYO Linux is a pathname based access control patch like AppArmor. http://lwn.net/Articles/165132/ I have been asked Why not use LSM? and the answer is always I can't, for VFS helper functions and LSM functions don't receive vfsmount. and I am manually patching locations that call VFS helper functions. But if these Tony's patches are accepted in upstream, TOMOYO Linux would be able to use LSM. I think these patches are also useful for auditing functions, for auditing logs will be able to include absolute pathname instead of partial pathname. I think most people want access logs in the form of pathnames rather than security labels. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] Implement -page_mkwrite for XFS
Folks, I'm not sure of the exact locking rules and constraints for -page_mkwrite(), so I thought I better fish around for comments. With XFS, we need to hook pages being dirtied by mmap writes so that we can attach buffers of the correct state tothe pages. This means that when we write them back, the correct thing happens. For example, if you mmap an unwritten extent (preallocated), currently your data will get written to disk but the extent will not get converted to a written extent. IOWs, you lose the data because when you read it back it will seen as unwritten and treated as a hole. AFAICT, it is safe to lock the page during -page_mkwrite and that it is safe to issue I/O (e.g. metadata reads) to determine the current state of the file. I am also assuming that, at this point, we are not allowed to change the file size and so we have to be careful in -page_mkwrite we don't do that. What else have I missed here? IOWs, I've basically treated -page_mkwrite() as wrapper for block_prepare_write/block_commit_write because they do all the buffer mapping and state manipulation I think is necessary. Is it safe to call these functions, or are there some other constraints we have to work under here? Patch below. Comments? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_file.c | 34 ++ 1 file changed, 34 insertions(+) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_file.c 2007-01-16 10:54:15.0 +1100 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c 2007-02-07 09:49:00.508017483 +1100 @@ -446,6 +446,38 @@ xfs_file_open_exec( } #endif /* HAVE_FOP_OPEN_EXEC */ +/* + * mmap()d file has taken write protection fault and is being made + * writable. We can set the page state up correctly for a writable + * page, which means we can do correct delalloc accounting (ENOSPC + * checking!) and unwritten extent mapping. + */ +STATIC int +xfs_vm_page_mkwrite( + struct vm_area_struct *vma, + struct page *page) +{ + struct inode*inode = vma-vm_file-f_path.dentry-d_inode; + unsigned long end; + int ret = 0; + + end = page-index + 1; + end = PAGE_CACHE_SHIFT; + if (end i_size_read(inode)) + end = i_size_read(inode) ~PAGE_CACHE_MASK; + else + end = PAGE_CACHE_SIZE; + + lock_page(page); + ret = block_prepare_write(page, 0, end, xfs_get_blocks); + if (!ret) + ret = block_commit_write(page, 0, end); + unlock_page(page); + + return ret; +} + + const struct file_operations xfs_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -503,12 +535,14 @@ const struct file_operations xfs_dir_fil static struct vm_operations_struct xfs_file_vm_ops = { .nopage = filemap_nopage, .populate = filemap_populate, + .page_mkwrite = xfs_vm_page_mkwrite, }; #ifdef HAVE_DMAPI static struct vm_operations_struct xfs_dmapi_file_vm_ops = { .nopage = xfs_vm_nopage, .populate = filemap_populate, + .page_mkwrite = xfs_vm_page_mkwrite, #ifdef HAVE_VMOP_MPROTECT .mprotect = xfs_vm_mprotect, #endif - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Tue, 16 Jan 2007 21:05:20 +0900 [EMAIL PROTECTED] wrote: Move the blocks on the temporary inode to the original inode by a page. 1. Read the file data from the old blocks to the page 2. Move the block on the temporary inode to the original inode 3. Write the file data on the page into the new blocks ... + +/** + * ext4_ext_replace_branches - replace original extents with new extents. + * @org_inodeOriginal inode + * @dest_inode temporary inode + * @from_pagePage offset + * @count_page Page count to be replaced + * @delete_start block offset for deletion + * + * This function returns 0 if succeed, otherwise returns error value. + * Replace extents for blocks from from to from+count-1. + */ +static int +ext4_ext_replace_branches(struct inode *org_inode, struct inode *dest_inode, + pgoff_t from_page, pgoff_t dest_from_page, + pgoff_t count_page, unsigned long *delete_start) +{ + struct ext4_ext_path *org_path = NULL; + struct ext4_ext_path *dest_path = NULL; + struct ext4_extent *oext, *dext; + struct ext4_extent tmp_ext; + int err = 0; + int depth; + unsigned long from, count, dest_off, diff, replaced_count = 0; These should be sector_t, shouldn't they? + handle_t *handle = NULL; + unsigned jnum; + + from = from_page (PAGE_CACHE_SHIFT - dest_inode-i_blkbits); In which case one needs to be very careful to avoid overflows in expressions such as this one. + wait_on_page_locked(page); + lock_page(page); The wait_on_page_locked() is unneeded here. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Mon, 5 Feb 2007 14:12:04 +0100 Jan Kara [EMAIL PROTECTED] wrote: Move the blocks on the temporary inode to the original inode by a page. 1. Read the file data from the old blocks to the page 2. Move the block on the temporary inode to the original inode 3. Write the file data on the page into the new blocks I have one thing - it's probably not good to use page cache for defragmentation. Then it is no longer online defragmentation. The issues with maintaining correctness and coherency with ongoing VFS activity would be truly ghastly. If we're worried about pagecache pollution then it would be better to control that from userspace via fadvise(). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/3] 2.6.20 fix for PageUptodate memorder problem
On Tue, Feb 06, 2007 at 09:02:01AM +0100, Nick Piggin wrote: Still no independent confirmation as to whether this is a problem or not. I think it is, so I'll propose this patchset to fix it. Patch 1/3 has a reasonable description of the problem. Nick, can you include a diffstat at the head of your patches? Makes it much easier to see what the scope of the changes are when you are changing code in several filesystems Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/1][RFC] mm: prepare_write positive return value
On Tue, 06 Feb 2007 11:33:46 +0300 Dmitriy Monakhov [EMAIL PROTECTED] wrote: Almost all read/write operation handles data with chunks(segments or pages) and result has integral behaviour for folowing scenario: for_each_chunk() { res = op(); if(IS_ERROR(res)) return progress ? progress : res; progress += res; } prepare_write may has integral behaviour in case of blksize pgsize, for example we successfully allocated/read some blocks, but not all of them, and than some error happend. Currently we eliminate this progress by doing vmtrunate() after prepare_has failed. It is good to have ability to signal about this progress. Interprete positive prepare_write() ret code as bytes num that fs ready to handle at this moment. I've ask SAW, he think it is sane. This size always less than PAGE_CACHE_SIZE so it less than AOP_TRUNCATED_PAGE too. BTH: This approach was used in OpenVZ 2.6.9 kernel in order to make FS with delayed allocation more correct, and its works well. I think not everybody will happy about this, but let's discuss all advantages and disadvantages of this change. That seems to be a logical change. We'd need to review all the callers and callees to make sure that they handle this change correctly. Your changes deviate quite a lot from standard kernel coding style. Please fix that. Please cc linux-ext4@vger.kernel.org on the next version of these patches. I'm seriously running out of bandwidth over here and ext4 has other maintainers. Thanks. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] remove sb-s_files and file_list_lock usage in dquot.c
On Tue, 6 Feb 2007 14:23:33 +0100 Christoph Hellwig [EMAIL PROTECTED] wrote: static void add_dquot_ref(struct super_block *sb, int type) { - struct list_head *p; + struct inode *inode; restart: - file_list_lock(); - list_for_each(p, sb-s_files) { - struct file *filp = list_entry(p, struct file, f_u.fu_list); - struct inode *inode = filp-f_path.dentry-d_inode; - if (filp-f_mode FMODE_WRITE dqinit_needed(inode, type)) { - struct dentry *dentry = dget(filp-f_path.dentry); - file_list_unlock(); - sb-dq_op-initialize(inode, type); - dput(dentry); - /* As we may have blocked we had better restart... */ - goto restart; - } + spin_lock(inode_lock); + list_for_each_entry(inode, sb-s_inodes, i_sb_list) { + if (!atomic_read(inode-i_writecount)) + continue; + if (!dqinit_needed(inode, type)) + continue; + if (inode-i_state (I_FREEING|I_WILL_FREE)) + continue; + + __iget(inode); + spin_unlock(inode_lock); + + sb-dq_op-initialize(inode, type); + iput(inode); + /* As we may have blocked we had better restart... */ + goto restart; } - file_list_unlock(); + spin_unlock(inode_lock); } That loop has (and had) up to O(n^n) operations. Is there something which prevents this from going insane? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
Andrew Morton wrote: On Tue, 16 Jan 2007 21:05:20 +0900 [EMAIL PROTECTED] wrote: ... +ext4_ext_replace_branches(struct inode *org_inode, struct inode *dest_inode, + pgoff_t from_page, pgoff_t dest_from_page, + pgoff_t count_page, unsigned long *delete_start) +{ + struct ext4_ext_path *org_path = NULL; + struct ext4_ext_path *dest_path = NULL; + struct ext4_extent *oext, *dext; + struct ext4_extent tmp_ext; + int err = 0; + int depth; + unsigned long from, count, dest_off, diff, replaced_count = 0; These should be sector_t, shouldn't they? At some point should we start using blkcnt_t properly? (block-in[-large]-file, not block-in[-large]-device?) I think that's what it was introduced for, although it's not in wide use at this point. I guess the type really isn't used anywhere else; just in the inode's i_blocks. Hmm. -Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/3] 2.6.20 fix for PageUptodate memorder problem
On Wed, Feb 07, 2007 at 09:58:57AM +1100, David Chinner wrote: On Tue, Feb 06, 2007 at 09:02:01AM +0100, Nick Piggin wrote: Still no independent confirmation as to whether this is a problem or not. I think it is, so I'll propose this patchset to fix it. Patch 1/3 has a reasonable description of the problem. Nick, can you include a diffstat at the head of your patches? Makes it much easier to see what the scope of the changes are when you are changing code in several filesystems Good idea, I'll do that. Thanks, Nick - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] remove sb-s_files and file_list_lock usage in dquot.c
On Tue, Feb 06, 2007 at 03:50:01PM -0800, Andrew Morton wrote: On Tue, 6 Feb 2007 14:23:33 +0100 Christoph Hellwig [EMAIL PROTECTED] wrote: static void add_dquot_ref(struct super_block *sb, int type) { - struct list_head *p; + struct inode *inode; restart: - file_list_lock(); - list_for_each(p, sb-s_files) { - struct file *filp = list_entry(p, struct file, f_u.fu_list); - struct inode *inode = filp-f_path.dentry-d_inode; - if (filp-f_mode FMODE_WRITE dqinit_needed(inode, type)) { - struct dentry *dentry = dget(filp-f_path.dentry); - file_list_unlock(); - sb-dq_op-initialize(inode, type); - dput(dentry); - /* As we may have blocked we had better restart... */ - goto restart; - } + spin_lock(inode_lock); + list_for_each_entry(inode, sb-s_inodes, i_sb_list) { + if (!atomic_read(inode-i_writecount)) + continue; + if (!dqinit_needed(inode, type)) + continue; + if (inode-i_state (I_FREEING|I_WILL_FREE)) + continue; + + __iget(inode); + spin_unlock(inode_lock); + + sb-dq_op-initialize(inode, type); + iput(inode); + /* As we may have blocked we had better restart... */ + goto restart; } - file_list_unlock(); + spin_unlock(inode_lock); } That loop has (and had) up to O(n^n) operations. Is there something which prevents this from going insane? I don't think so. Then again it's only called when you call quotaon on a mounted filesystem, and normally you don't have that many inodes instanciated at that time. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Testing ext4 persistent preallocation patches for 64 bit features
I plan to test the persistent preallocation patches on a huge sparse device, to know if 32 bit physical block numbers (upto 48bit) behave as expected. I have following questions for this and will appreciate suggestions here: a) What should be the sparse device size which I should use for testing? Should a size of 8TB (say, 100 TB) be enough ? The physical device (backing store device) size I can have is upto 70GB. b) How do I test allocation of 32 bit physical block numbers ? I can not fill 8TB, since the physical storage available with me is just 70GB. c) Do I need to put some hack in the filesystem code for above (to allocate 32 bit physical block numbers) ? Any further ideas on how to test this will help. Thanks! -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html