[patch 0/3] 2.6.20 fix for PageUptodate memorder problem

2007-02-06 Thread Nick Piggin
Still no independent confirmation as to whether this is a problem or not.
I think it is, so I'll propose this patchset to fix it. Patch 1/3 has a
reasonable description of the problem.

Thanks,
Nick

--
SuSE Labs

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/3] mm: fix PageUptodate memorder

2007-02-06 Thread Nick Piggin
After running SetPageUptodate, preceeding stores to the page contents to
actually bring it uptodate may not be ordered with the store to set the page
uptodate.

Therefore, another CPU which checks PageUptodate is true, then reads the
page contents can get stale data.

Fix this by ensuring SetPageUptodate is always called with the page locked
(except in the case of a new page that cannot be visible to other CPUs), and
requiring PageUptodate be checked only when the page is locked.

To facilitate lockless checks, SetPageUptodate contains an smp_wmb to order
preceeding stores before the store to page flags, and a new PageUptodate_NoLock
is introduced, which issues a smp_rmb after the page flags are loaded for the
test.

I'm still not sure that a DMA memory barrier is not required, however I think
the logical place to put such a barrier would be in the IO completion routines,
when they come back to tell us that they have succeeded. (Help? Anyone?)

One thing I like about it is that it unifies the anonymous page handling
with the rest of the page management, by marking anon pages as uptodate
when they _are_ uptodate, rather than when our implementation requires
that they be marked as such. Doing this let me get rid of the smp_wmb's
in the page copying functions, which were specially added for anonymous
pages for a closely related issue, always vaguely troubled me.

Convert core code and some filesystems to use PageUptodate_NoLock, just for
reference (a more complete patch follows).

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/include/linux/highmem.h
===
--- linux-2.6.orig/include/linux/highmem.h
+++ linux-2.6/include/linux/highmem.h
@@ -57,8 +57,6 @@ static inline void clear_user_highpage(s
void *addr = kmap_atomic(page, KM_USER0);
clear_user_page(addr, vaddr, page);
kunmap_atomic(addr, KM_USER0);
-   /* Make sure this page is cleared on other CPU's too before using it */
-   smp_wmb();
 }
 
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
@@ -108,8 +106,6 @@ static inline void copy_user_highpage(st
copy_user_page(vto, vfrom, vaddr, to);
kunmap_atomic(vfrom, KM_USER0);
kunmap_atomic(vto, KM_USER1);
-   /* Make sure this page is cleared on other CPU's too before using it */
-   smp_wmb();
 }
 
 #endif
Index: linux-2.6/include/linux/page-flags.h
===
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -126,16 +126,62 @@
 #define ClearPageReferenced(page)  clear_bit(PG_referenced, (page)-flags)
 #define TestClearPageReferenced(page) test_and_clear_bit(PG_referenced, 
(page)-flags)
 
-#define PageUptodate(page) test_bit(PG_uptodate, (page)-flags)
-#ifdef CONFIG_S390
-static inline void SetPageUptodate(struct page *page)
+static inline int PageUptodate(struct page *page)
+{
+   WARN_ON(!PageLocked(page));
+   return test_bit(PG_uptodate, (page)-flags);
+}
+
+/*
+ * PageUptodate to be used when not holding the page lock.
+ */
+static inline int PageUptodate_NoLock(struct page *page)
 {
+   int ret = test_bit(PG_uptodate, (page)-flags);
+
+   /*
+* Must ensure that the data we read out of the page is loaded
+* _after_ we've loaded page-flags to check for PageUptodate.
+* See SetPageUptodate() for the other side of the story.
+*/
+   smp_rmb();
+
+   return ret;
+}
+
+static inline void __SetPageUptodate(struct page *page)
+{
+#ifdef CONFIG_S390
if (!test_and_set_bit(PG_uptodate, page-flags))
page_test_and_clear_dirty(page);
-}
 #else
-#define SetPageUptodate(page)  set_bit(PG_uptodate, (page)-flags)
+   /*
+* Memory barrier must be issued before setting the PG_uptodate bit,
+* so all previous writes that served to bring the page uptodate are
+* visible before PageUptodate becomes true.
+*
+* S390 is guaranteed to have a barrier in the test_and_set operation
+* (see Documentation/atomic_ops.txt).
+*
+* XXX: does this memory barrier need to be anything special to
+* handle things like DMA writes into the page?
+*/
+   smp_wmb();
+   set_bit(PG_uptodate, (page)-flags);
 #endif
+}
+
+static inline void SetPageUptodate(struct page *page)
+{
+   WARN_ON(!PageLocked(page));
+   __SetPageUptodate(page);
+}
+
+static inline void SetNewPageUptodate(struct page *page)
+{
+   __SetPageUptodate(page);
+}
+
 #define ClearPageUptodate(page)clear_bit(PG_uptodate, (page)-flags)
 
 #define PageDirty(page)test_bit(PG_dirty, (page)-flags)
Index: linux-2.6/mm/hugetlb.c
===
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -443,6 +443,7 @@ static int hugetlb_cow(struct mm_struct 
 

[patch 2/3] fs: buffer don't PageUptodate without page locked

2007-02-06 Thread Nick Piggin
__block_write_full_page is calling SetPageUptodate without the page locked.
This is unusual, but not incorrect, as PG_writeback is still set.

However with the previous patch, this is now a problem: so don't bother
setting the page uptodate in this case (it is weird that the write path
does such a thing anyway). Instead just leave it to the read side to bring
the page uptodate when it notices that all buffers are uptodate.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1679,6 +1679,7 @@ static int __block_write_full_page(struc
 */
BUG_ON(PageWriteback(page));
set_page_writeback(page);
+   unlock_page(page);
 
do {
struct buffer_head *next = bh-b_this_page;
@@ -1688,7 +1689,6 @@ static int __block_write_full_page(struc
}
bh = next;
} while (bh != head);
-   unlock_page(page);
 
err = 0;
 done:
@@ -1698,17 +1698,8 @@ done:
 * clean.  Someone wrote them back by hand with
 * ll_rw_block/submit_bh.  A rare case.
 */
-   int uptodate = 1;
-   do {
-   if (!buffer_uptodate(bh)) {
-   uptodate = 0;
-   break;
-   }
-   bh = bh-b_this_page;
-   } while (bh != head);
-   if (uptodate)
-   SetPageUptodate(page);
end_page_writeback(page);
+
/*
 * The page and buffer_heads can be released at any time from
 * here on.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 3/3] mm: make read_cache_page synchronous

2007-02-06 Thread Nick Piggin
Ensure pages are uptodate after returning from read_cache_page, which allows
us to cut out most of the filesystem-internal PageUptodate_NoLock calls.

I didn't have a great look down the call chains, but this appears to fixes 7
possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs,
1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd.
All depending on whether the filler is async and/or can return with a !uptodate
page.

Also, a memory leak in sys_swapon().

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/fs/afs/dir.c
===
--- linux-2.6.orig/fs/afs/dir.c
+++ linux-2.6/fs/afs/dir.c
@@ -187,10 +187,7 @@ static struct page *afs_dir_get_page(str
 
page = read_mapping_page(dir-i_mapping, index, NULL);
if (!IS_ERR(page)) {
-   wait_on_page_locked(page);
kmap(page);
-   if (!PageUptodate(page))
-   goto fail;
if (!PageChecked(page))
afs_dir_check_page(dir, page);
if (PageError(page))
Index: linux-2.6/fs/afs/mntpt.c
===
--- linux-2.6.orig/fs/afs/mntpt.c
+++ linux-2.6/fs/afs/mntpt.c
@@ -77,13 +77,11 @@ int afs_mntpt_check_symlink(struct afs_v
}
 
ret = -EIO;
-   wait_on_page_locked(page);
-   buf = kmap(page);
-   if (!PageUptodate(page))
-   goto out_free;
if (PageError(page))
goto out_free;
 
+   buf = kmap(page);
+
/* examine the symlink's contents */
size = vnode-status.size;
_debug(symlink to %*.*s, size, (int) size, buf);
@@ -100,8 +98,8 @@ int afs_mntpt_check_symlink(struct afs_v
 
ret = 0;
 
- out_free:
kunmap(page);
+ out_free:
page_cache_release(page);
  out:
_leave( = %d, ret);
@@ -184,8 +182,7 @@ static struct vfsmount *afs_mntpt_do_aut
}
 
ret = -EIO;
-   wait_on_page_locked(page);
-   if (!PageUptodate(page) || PageError(page))
+   if (PageError(page))
goto error;
 
buf = kmap(page);
Index: linux-2.6/fs/cramfs/inode.c
===
--- linux-2.6.orig/fs/cramfs/inode.c
+++ linux-2.6/fs/cramfs/inode.c
@@ -180,7 +180,8 @@ static void *cramfs_read(struct super_bl
struct page *page = NULL;
 
if (blocknr + i  devsize) {
-   page = read_mapping_page(mapping, blocknr + i, NULL);
+   page = read_mapping_page_async(mapping, blocknr + i,
+   NULL);
/* synchronous error? */
if (IS_ERR(page))
page = NULL;
Index: linux-2.6/fs/ext2/dir.c
===
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -161,10 +161,7 @@ static struct page * ext2_get_page(struc
struct address_space *mapping = dir-i_mapping;
struct page *page = read_mapping_page(mapping, n, NULL);
if (!IS_ERR(page)) {
-   wait_on_page_locked(page);
kmap(page);
-   if (!PageUptodate_NoLock(page))
-   goto fail;
if (!PageChecked(page))
ext2_check_page(page);
if (PageError(page))
Index: linux-2.6/fs/freevxfs/vxfs_subr.c
===
--- linux-2.6.orig/fs/freevxfs/vxfs_subr.c
+++ linux-2.6/fs/freevxfs/vxfs_subr.c
@@ -74,10 +74,7 @@ vxfs_get_page(struct address_space *mapp
pp = read_mapping_page(mapping, n, NULL);
 
if (!IS_ERR(pp)) {
-   wait_on_page_locked(pp);
kmap(pp);
-   if (!PageUptodate(pp))
-   goto fail;
/** if (!PageChecked(pp)) **/
/** vxfs_check_page(pp); **/
if (PageError(pp))
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1756,7 +1756,7 @@ int generic_file_readonly_mmap(struct fi
 EXPORT_SYMBOL(generic_file_mmap);
 EXPORT_SYMBOL(generic_file_readonly_mmap);
 
-static inline struct page *__read_cache_page(struct address_space *mapping,
+static struct page *__read_cache_page(struct address_space *mapping,
unsigned long index,
int (*filler)(void *,struct page*),
void *data)
@@ -1793,17 +1793,11 @@ repeat:
return page;
 }
 
-/**
- * read_cache_page - read into page cache, fill it if needed
- * @mapping:   the page's address_space
- * @index: the page index
- * @filler:function 

Re: [patch 2/3] fs: buffer don't PageUptodate without page locked

2007-02-06 Thread Andrew Morton
On Tue,  6 Feb 2007 09:02:23 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote:

 __block_write_full_page is calling SetPageUptodate without the page locked.
 This is unusual, but not incorrect, as PG_writeback is still set.
 
 However with the previous patch, this is now a problem: so don't bother
 setting the page uptodate in this case (it is weird that the write path
 does such a thing anyway). Instead just leave it to the read side to bring
 the page uptodate when it notices that all buffers are uptodate.
 
 Signed-off-by: Nick Piggin [EMAIL PROTECTED]
 
 Index: linux-2.6/fs/buffer.c
 ===
 --- linux-2.6.orig/fs/buffer.c
 +++ linux-2.6/fs/buffer.c
 @@ -1679,6 +1679,7 @@ static int __block_write_full_page(struc
*/
   BUG_ON(PageWriteback(page));
   set_page_writeback(page);
 + unlock_page(page);
  
   do {
   struct buffer_head *next = bh-b_this_page;
 @@ -1688,7 +1689,6 @@ static int __block_write_full_page(struc
   }
   bh = next;
   } while (bh != head);
 - unlock_page(page);
  
   err = 0;
  done:

Why this change?  Without looking at it too hard, it seems that if
submit_bh() completes synchronously, this thread can end up playing with
the buffers on a non-locked, non-PageWriteback page.  Someone else could
whip the buffers away and oops?

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/3] mm: fix PageUptodate memorder

2007-02-06 Thread Andrew Morton
On Tue,  6 Feb 2007 09:02:11 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote:

 +static inline void __SetPageUptodate(struct page *page)
 +{
 +#ifdef CONFIG_S390
   if (!test_and_set_bit(PG_uptodate, page-flags))
   page_test_and_clear_dirty(page);
 -}
  #else
 -#define SetPageUptodate(page)set_bit(PG_uptodate, (page)-flags)
 + /*
 +  * Memory barrier must be issued before setting the PG_uptodate bit,
 +  * so all previous writes that served to bring the page uptodate are
 +  * visible before PageUptodate becomes true.
 +  *
 +  * S390 is guaranteed to have a barrier in the test_and_set operation
 +  * (see Documentation/atomic_ops.txt).
 +  *
 +  * XXX: does this memory barrier need to be anything special to
 +  * handle things like DMA writes into the page?
 +  */
 + smp_wmb();
 + set_bit(PG_uptodate, (page)-flags);
  #endif
 +}
 +
 +static inline void SetPageUptodate(struct page *page)
 +{
 + WARN_ON(!PageLocked(page));
 + __SetPageUptodate(page);
 +}
 +
 +static inline void SetNewPageUptodate(struct page *page)
 +{
 + __SetPageUptodate(page);
 +}

I was panicing for a minute when I saw that __SetPageUptodate() in there.

Conventionally the __SetPageFoo namespace is for nonatomic updates to
page-flags.  Can we call this something different?


What a fugly patchset :(
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/3] fs: buffer don't PageUptodate without page locked

2007-02-06 Thread Nick Piggin
On Tue, Feb 06, 2007 at 12:21:40AM -0800, Andrew Morton wrote:
 On Tue,  6 Feb 2007 09:02:23 +0100 (CET) Nick Piggin [EMAIL PROTECTED] 
 wrote:
 
  __block_write_full_page is calling SetPageUptodate without the page locked.
  This is unusual, but not incorrect, as PG_writeback is still set.
  
  However with the previous patch, this is now a problem: so don't bother
  setting the page uptodate in this case (it is weird that the write path
  does such a thing anyway). Instead just leave it to the read side to bring
  the page uptodate when it notices that all buffers are uptodate.
  
  Signed-off-by: Nick Piggin [EMAIL PROTECTED]
  
  Index: linux-2.6/fs/buffer.c
  ===
  --- linux-2.6.orig/fs/buffer.c
  +++ linux-2.6/fs/buffer.c
  @@ -1679,6 +1679,7 @@ static int __block_write_full_page(struc
   */
  BUG_ON(PageWriteback(page));
  set_page_writeback(page);
  +   unlock_page(page);
   
  do {
  struct buffer_head *next = bh-b_this_page;
  @@ -1688,7 +1689,6 @@ static int __block_write_full_page(struc
  }
  bh = next;
  } while (bh != head);
  -   unlock_page(page);
   
  err = 0;
   done:
 
 Why this change?  Without looking at it too hard, it seems that if
 submit_bh() completes synchronously, this thread can end up playing with
 the buffers on a non-locked, non-PageWriteback page.  Someone else could
 whip the buffers away and oops?

Hmm, it definitely shouldn't be there, it leaked in from another patch
to bring partiy with the error handling...

Here is an updated patch.

--

__block_write_full_page is calling SetPageUptodate without the page locked.
This is unusual, but not incorrect, as PG_writeback is still set.

However with the previous patch, this is now a problem: so don't bother
setting the page uptodate in this case (it is weird that the write path
does such a thing anyway). Instead just leave it to the read side to bring
the page uptodate when it notices that all buffers are uptodate.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1698,17 +1698,8 @@ done:
 * clean.  Someone wrote them back by hand with
 * ll_rw_block/submit_bh.  A rare case.
 */
-   int uptodate = 1;
-   do {
-   if (!buffer_uptodate(bh)) {
-   uptodate = 0;
-   break;
-   }
-   bh = bh-b_this_page;
-   } while (bh != head);
-   if (uptodate)
-   SetPageUptodate(page);
end_page_writeback(page);
+
/*
 * The page and buffer_heads can be released at any time from
 * here on.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/1][RFC] mm: prepare_write positive return value

2007-02-06 Thread Dmitriy Monakhov
Almost all read/write operation handles data with chunks(segments or pages)
and result has integral behaviour for folowing scenario: 
for_each_chunk() {
 res = op();
 if(IS_ERROR(res))
   return progress ? progress : res;
 progress += res;
}
prepare_write may has integral behaviour in case of blksize  pgsize,
for example we successfully allocated/read some blocks, but not all of them,
and than some error happend. Currently we eliminate this progress by doing
vmtrunate() after prepare_has failed.
It is good to have ability to signal about this progress. Interprete positive
prepare_write() ret code as bytes num that fs ready to handle at this moment.
I've ask SAW, he think it is sane. This size always less than PAGE_CACHE_SIZE
so it less than AOP_TRUNCATED_PAGE too.
 
BTH: This approach was used in OpenVZ 2.6.9 kernel in order to make FS with 
delayed allocation more correct, and its works well.
I think not everybody will happy about this,  but let's discuss all advantages
and disadvantages of this change.

Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED]
-
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 62632f5..b4f6eac 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -239,7 +239,11 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
page_cache_release(page);
continue;
}
-   goto unlock;
+   if (ret  0  ret  PAGE_CACHE_SIZE)
+   /* prepare_write demands limit size of bytes. */
+   size = min(size, (unsigned)ret);
+   else
+   goto unlock;
}
transfer_result = lo_do_transfer(lo, WRITE, page, offset,
bvec-bv_page, bv_offs, size, IV);
diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 1e5d2ba..b5eebe8 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -454,6 +454,17 @@ static int ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
}
lower_a_ops = lower_inode-i_mapping-a_ops;
rc = lower_a_ops-prepare_write(lower_file, header_page, 0, 8);
+   if (unlikely(rc  0  rc  PAGE_CACHE_SIZE)) {
+   /* 
+* prepare_write can handle less bytes whan we need. This is not likely
+* to happend realy because usualy we need only one block. In order to
+* preserve prepare/commit balanced invoke commit end fail.
+*/
+   int ret;
+   ret = lower_a_ops-commit_write(lower_file, header_page, 0, rc);
+   rc = ret ? ret : -ENOSPC;
+   goto unlock;
+   }
file_size = (u64)i_size_read(inode);
ecryptfs_printk(KERN_DEBUG, Writing size: [0x%.16x]\n, file_size);
file_size = cpu_to_be64(file_size);
@@ -462,6 +473,7 @@ static int ecryptfs_write_inode_size_to_header(struct file 
*lower_file,
kunmap_atomic(header_virt, KM_USER0);
flush_dcache_page(header_page);
rc = lower_a_ops-commit_write(lower_file, header_page, 0, 8);
+unlock:
if (rc  0)
ecryptfs_printk(KERN_ERR, Error commiting header page 
write\n);
diff --git a/fs/namei.c b/fs/namei.c
index b305589..723db81 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2704,6 +2704,16 @@ retry:
page_cache_release(page);
goto retry;
}
+   if (unlikely(err  0  err  PAGE_CACHE_SIZE)) {
+   /* 
+* prepare_write can handle less bytes whan we need. This is not likely
+* to happend realy because usualy we need only one block. In order to
+* preserve prepare/commit balanced invoke commit end fail.
+*/
+   int ret;
+   ret = mapping-a_ops-commit_write(NULL, page, 0, err);
+   err = ret ? ret : -ENOSPC;
+   }
if (err)
goto fail_map;
kaddr = kmap_atomic(page, KM_USER0);
diff --git a/fs/splice.c b/fs/splice.c
index 2fca6eb..d2b92bf 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -652,6 +652,17 @@ find_page:
if (unlikely(ret)) {
loff_t isize = i_size_read(mapping-host);
 
+   if (ret  0  ret  PAGE_CACHE_SIZE) {
+   /* 
+* prepare_write demands limit size of bytes. In order to
+* preserve prepare/commit balanced invoke commit end fail. 
+* Initial i_size saved, so vmtruncate safely restore it later.
+*/
+   int ret2;
+   ret2 = mapping-a_ops-commit_write(file, page, offset,
+   offset + ret);
+   ret = ret2 ? ret2 : -ENOSPC;
+   }
if (ret != AOP_TRUNCATED_PAGE)
unlock_page(page);

[PATCH 1/1][RFC] mm: prepare_write positive return value

2007-02-06 Thread Dmitriy Monakhov
This patch solve ext3/4 retry loop issue

Issue description:
What we can do if block_prepare_write fail inside ext3_prepare_write ?
a) Stop transaction and do retry if possible, but what happend if 
   reboot comes after journal_force_commit, but before we exhaust
   all retry attempts and generic_file_buffered_write call trancate? 
   We may get allocated blocks outside i_size. Witch is bad.

b) Commit newly allocated blocks. This approach was used after Andrey's patch.
   But if reboot comes, or error happend, user will be surprised that i_size
   increased but blocks are zero filed. This is internal write operation state
   becames visiable to user. Witch is also bad. 
   
c) Just return as match bytes as we can deal with rigth now back to 
   caller, and let's caller handles it and than call commit. In this scenario
   we never stop journal in the midle of some internal fs operation.
   If reboot comes before commit trunsaction was't closed so it will 
   be droped while journal replay.

Only (c) tend to garantie attomic data operation.

Also fix some issues introduced by 
retries-in-ext3_prepare_write-violate-ordering-requirements:
  i_size may increase after error happend.
  possible data corruption caused commiting non uptodate bh.

Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED]
-
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index dba6dd2..4c5e9f7 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1154,6 +1154,18 @@ static int do_journal_get_write_access(handle_t *handle,
  * transaction must include the content of the newly allocated blocks.
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
+ * What we can do if some blocks was allocated?
+ * a) Stop transaction and do retry if possible, but what happend if
+ *reboot comes after journal_force_commit, but before we exhaust
+ *all retry attempts and caller call trancate?
+ *We may get allocated blocks outside i_size. Witch is bad.
+ * b) Commit newly allocated blocks. And if reboot comes, user will be
+ *surprised that i_size increased but blocks are zero filed. Witch is
+ *also bad.
+ * c) Just return as match bytes as we can deal with rigth now back to
+ *caller, and if reboot comes trunsaction was't closed so it will
+ *be droped while journal replay.
+ * Only (c) tend to garantie attomic data operation.
  */
 static int ext3_prepare_failure(struct file *file, struct page *page,
unsigned from, unsigned to)
@@ -1186,7 +1198,7 @@ skip:
block_start = to;
break;
}
-   if (!buffer_mapped(bh))
+   if (!buffer_mapped(bh) || !buffer_uptodate(bh))
/* prepare_write failed on this bh */
break;
if (ext3_should_journal_data(mapping-host)) {
@@ -1204,8 +1216,8 @@ skip:
if (block_start = from)
goto skip;
 
-   /* commit allocated and zeroed buffers */
-   return mapping-a_ops-commit_write(file, page, from, block_start);
+   /* return number of bytes till last mapped  uptodate block */
+   return block_start - from;
 }
 
 static int ext3_prepare_write(struct file *file, struct page *page,
@@ -1239,7 +1251,8 @@ retry:
 
 failure:
ret2 = ext3_prepare_failure(file, page, from, to);
-   if (ret2  0)
+   if (ret2)
+   /* ready to partial write, or fatal error */
return ret2;
if (ret == -ENOSPC  ext3_should_retry_alloc(inode-i_sb, retries))
goto retry;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 806eee1..da39f80 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1153,6 +1153,18 @@ static int do_journal_get_write_access(handle_t *handle,
  * transaction must include the content of the newly allocated blocks.
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
+ * What we can do if some blocks was allocated?
+ * a) Stop transaction and do retry if possible, but what happend if
+ *reboot comes after journal_force_commit, but before we exhaust
+ *all retry attempts and caller call trancate?
+ *We may get allocated blocks outside i_size. Witch is bad.
+ * b) Commit newly allocated blocks. And if reboot comes, user will be
+ *surprised that i_size increased but blocks are zero filed. Witch is
+ *also bad.
+ * c) Just return as match bytes as we can deal with rigth now back to
+ *caller, and if reboot comes trunsaction was't closed so it will
+ *be droped while journal replay.
+ * Only (c) tend to garantie attomic data operation.
  */
 static int ext4_prepare_failure(struct file *file, struct page *page,
unsigned from, unsigned to)
@@ -1185,7 +1197,7 @@ skip:
block_start = to;
break;
}
-   if (!buffer_mapped(bh))
+  

Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Trond Myklebust
On Mon, 2007-02-05 at 19:20 -0800, Andreas Gruenbacher wrote:
 On Monday 05 February 2007 11:02, Christoph Hellwig wrote:
  On Mon, Feb 05, 2007 at 10:58:26AM -0800, Trond Myklebust wrote:
   On Mon, 2007-02-05 at 18:44 +, Christoph Hellwig wrote:
Just FYI:  Al was very opposed to the idea of passing the vfsmount to
the vfs_ helpers, so you should discuss this with him.
   
Looking at the actual patches I see you're lazy in a lot of places.
Please make sure that when you introduce a vfsmount argument somewhere
that it is _always_ passed and not just when it's conveniant.  Yes,
that's more work, but then again if you're not consistant anyone
half-serious will laught at a security model using this infrasturcture.
  
   nfsd in particular tends to be a bit lazy about passing around vfsmount
   info. Forcing it to do so should not be hard since the vfsmount is
   already cached in the struct export (which can be found using the
   filehandle). It will take a bit of re-engineering in order to pass that
   information around inside the nfsd code, though.
 
  I actually have a patch to fix that.  It's part of a bigger series
  that's not quite ready, but I hope to finish all of it this month.
 
 It's actually not hard to fix, and nfsd would look a little less weird. But 
 what would this add, what do pathnames mean in the context of nfsd, and would 
 nfsd actually become less weird?
 
 On the wire, nfs transmits file handles, not filenames. The file handle 
 usually contains the inode numbers of the file and the containing directory.
 
 The code in fs/exportfs/expfs.c:find_exported_dentry() shows that fh_verify() 
 should return a dentry that may be connected or not, depending on the export 
 options and whether it's a file or directory. Together with the vfsmount from 
 the svc_export, we could compute a pathname.
 
 But there is no way to tell different hardlinks to the same inode in the same 
 directory from each other (both the file and directory inode are the same), 
 and depending on the export options, we may or may not be able to distinguish 
 different hardlinks across directories.

Who cares? There is no way to export a partial directory, and in any
case the subtree_check crap is borken beyond repair (see cross-directory
renames which lead to actual changes to the filehandle - broken, broken,
broken).

 If the nohide or crossmnt export options are used, we might run into similar 
 aliasing issues with mounts (I'm not sure about this).

Wrong. Those create new export points since you are crossing into a
different filesystem.

 So we could compute pathnames, but they wouldn't be very meaningful. Instead 
 of going that way, it seems more reasonable to not do pathname based access 
 control for the kernel kernel nfsd at all. Passing NULL as the vfsmounts does 
 that. With a properly configured nfsd, the areas that remote users could 
 access would still be well confined.

Huh? Even if you don't pass in a vfsmount, you _still_ need to pass in a
super_block. Inodes are only unique per filesystem.
In fact, on an ideal NFS export, there is no ambiguity between the two
(see above comment about subtree_check) since the entire filesystem will
be exported.

 The focus with this whole pathname based security stuff really is to be able 
 to better confine local processes. The kernel nfsd is a kernel thread, and as 
 such, confining it from a security point of view cannot really work, anyway.
 
   Note also that it might be nice to enforce the vfsmount argument by
   replacing the existing dentry parameters with a struct path instead of
   adding an extra reference to the vfsmount to existing functions.
 
  That definitly sounds like a good idea, independent of whether we want
  to pass the vfsmount in more places or not.
 
 Note that you can still pass a NULL vfsmount in a struct path.

...but we will have Dick Cheney track you down and shoot you if you do.
The point is that you only have to check _one_ argument (the
initialisation of the struct path) instead of having a check for every
function argument in the vfs.

Trond

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/3] mm: make read_cache_page synchronous

2007-02-06 Thread Nick Piggin

Andrew Morton wrote:

On Tue,  6 Feb 2007 09:02:33 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote:



Ensure pages are uptodate after returning from read_cache_page, which allows
us to cut out most of the filesystem-internal PageUptodate_NoLock calls.



Normally it's good to rename functions when we change their behaviour, but
I guess any missed (or out-of-tree) filesystems will just end up doing a
pointless wait_on_page_locked() and will continue to work OK, yes?


Yeah.





I didn't have a great look down the call chains, but this appears to fixes 7
possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs,
1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd.
All depending on whether the filler is async and/or can return with a !uptodate
page.

Also, a memory leak in sys_swapon().



Separate patch?


Well its fixed by virtue of read_cache_page now correctly dropping the page
refcount if it finds the page !uptodate, rather than any special logic I
added.

I can do another patch though. No problem, I'll be resending the series after
this round of feedback.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/3] mm: fix PageUptodate memorder

2007-02-06 Thread Nick Piggin

Andrew Morton wrote:

On Tue,  6 Feb 2007 09:02:11 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote:



+static inline void __SetPageUptodate(struct page *page)
+{
+#ifdef CONFIG_S390
if (!test_and_set_bit(PG_uptodate, page-flags))
page_test_and_clear_dirty(page);
-}
#else
-#define SetPageUptodate(page)  set_bit(PG_uptodate, (page)-flags)
+   /*
+* Memory barrier must be issued before setting the PG_uptodate bit,
+* so all previous writes that served to bring the page uptodate are
+* visible before PageUptodate becomes true.
+*
+* S390 is guaranteed to have a barrier in the test_and_set operation
+* (see Documentation/atomic_ops.txt).
+*
+* XXX: does this memory barrier need to be anything special to
+* handle things like DMA writes into the page?
+*/
+   smp_wmb();
+   set_bit(PG_uptodate, (page)-flags);
#endif
+}
+
+static inline void SetPageUptodate(struct page *page)
+{
+   WARN_ON(!PageLocked(page));
+   __SetPageUptodate(page);
+}
+
+static inline void SetNewPageUptodate(struct page *page)
+{
+   __SetPageUptodate(page);
+}



I was panicing for a minute when I saw that __SetPageUptodate() in there.

Conventionally the __SetPageFoo namespace is for nonatomic updates to
page-flags.  Can we call this something different?


Duh, of course, sorry.


What a fugly patchset :(


Fugly problem. One could fix it by always locking the page, but I was
worried about Hugh flaming me if I tried that ;)

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH[RFC] kill sysrq-u (emergency remount r/o)

2007-02-06 Thread Christoph Hellwig
On Mon, Feb 05, 2007 at 10:17:44PM -0500, Theodore Tso wrote:
  sysrq+u is helpful. It is like \( sysrq+s  make sure no further writes
  go to disk \).
 
 I agree it is useful, but if we're going to do it we really should do
 it right.  We should have real revoke() functionality on file
 descriptors, which revokes all of the mmap()'s (any attempt to write
 into a previously read/write mmap will cause a SEGV) as well as
 changing f_mode, and then use that to implement emergency read-only
 remount.

Revoke is only part of it.  What we really need is proper forced unmount
support.  That means revoking any kind of userspace access, blocking new
access and making sure the ondisk image is coherent.  This would definitly
be a useful feature, but it's a lot of work.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Christoph Hellwig
On Tue, Feb 06, 2007 at 12:51:52AM -0800, Trond Myklebust wrote:
 Who cares? There is no way to export a partial directory, and in any
 case the subtree_check crap is borken beyond repair (see cross-directory
 renames which lead to actual changes to the filehandle - broken, broken,
 broken).

It's getting a little oftopic for this thread, but is there a chance we
could just kill that crap after a resonable deprecation period?

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Christoph Hellwig
On Mon, Feb 05, 2007 at 06:13:26PM -0800, Andreas Gruenbacher wrote:
 On Monday 05 February 2007 10:44, Christoph Hellwig wrote:
  Looking at the actual patches I see you're lazy in a lot of places.
  Please make sure that when you introduce a vfsmount argument somewhere
  that it is _always_ passed and not just when it's conveniant.  Yes, that's
  more work, but then again if you're not consistant anyone half-serious
  will laught at a security model using this infrasturcture.
 
 It may appear like laziness, but it's not. Let's look at where we're passing 
 NULL at the moment:

You know, I've tracked a lot of this down previously when I submitted patches
to add vfsmount arguments to the vfs_ helpers, just to get tought by Al
that this is a bad idea :)

 fs/hpfs/namei.c
 
   In hpfs_unlink, hpfs truncates one of its own inodes through
   notify_change(). You definitely don't want any lsms to interfere here,
   pathname based or not; hpfs should probably truncate its inode itself
   instead. But given that hpfs goes via the vfs, we at least pass NULL
   to indicate that this file really has no meaningful paths to it
   anymore. (In addition, we don't really have a vfsmount at this
   point anymore, and neither would it make sense to pass it there.)
 
   To play more nicely with other lsms, hpfs could mark the inode as
   private before attempting the truncate.

The right fix would be to not go through the vfs.  Or even better to
remove the utter hack.  See my previous patch on this subject and the
discussion started on it.

 fs/reiserfs/xattr.c
 
   The directories an files that reiserfs uses internally to store xattrs
   are hanging off .reiserfs_priv/xattrs in the filesystem. This part
   of the namespace is not accessible or visible from user space though
   except through the xattr syscalls.
 
   Reiserfs should probably just mark all its xattr inodes as private in 
 order
   to play nicely with other lsms. As far as pathname based lsms are 
 concerned,
   pathnames to those fs-internal objects are meaningless though, and so we
   pass NULL here.

Well, the 100% correct fix would be to rip out the braindead reiserfs
xattr code :)
Of course that's not an option, but reiserfs would be much better of
stop using vfs_rmdir.  It really doesn't need most of the checks in there
and should just call into reiserfs_rmdir instead of doing this useless
trip into vfs land.

 fs/sysfs/file.c

 fs/nfsd/vfs.c and fs/nfsd/nfs4recover.c
 
   For nfsd, the concept of pathnames is a bit peculiar. I'll try to 
 respond to
   that separately.

the actually nfsd filehandle code is conceptually trivially fixable,
although with quite a bit of code churn.  As mention I'll soon submit
a patch for it.  nfs4recover.c is broken beyond repair and should just
be deleted from the tree.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Neil Brown
On Tuesday February 6, [EMAIL PROTECTED] wrote:
 On Mon, Feb 05, 2007 at 07:20:35PM -0800, Andreas Gruenbacher wrote:
  It's actually not hard to fix, and nfsd would look a little less weird. 
  But 
  what would this add, what do pathnames mean in the context of nfsd, and 
  would 
  nfsd actually become less weird?
 
 It's not actually a pathname we care about, but a vfsmount + dentry
 combo.  That one means as much in nfsd as elsewhere.  We want nfsd
 to obey r/o or noatime mount flags if /export/foo is exported with them
 but /foo not.  Even better would be to change nfsd so it creates it's
 own non-visible vfsmount for the filesystems it exports..

What would be the benefit of having private non-visible vfsmounts?
Sounds like a recipe for confusion?

It is possible that mountd might start doing bind-mounts to create the
'pseudo filesystem' thing for NFSv4, but they would be very visible
(under /var/lib/nfs/v4root or something).  So having it's own vfsmount
might make sense, but I don't get 'non-visible'.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Neil Brown
On Tuesday February 6, [EMAIL PROTECTED] wrote:
 On Tue, Feb 06, 2007 at 12:51:52AM -0800, Trond Myklebust wrote:
  Who cares? There is no way to export a partial directory, and in any
  case the subtree_check crap is borken beyond repair (see cross-directory
  renames which lead to actual changes to the filehandle - broken, broken,
  broken).
 
 It's getting a little oftopic for this thread, but is there a chance we
 could just kill that crap after a resonable deprecation period?

Probably.  A couple of years?
nfs-utils 1.1.0 is due to stop subtree_check from being the default.
After that is released, the next kernel can start printing warnings
that it might stop working in a couple of years.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Christoph Hellwig
On Tue, Feb 06, 2007 at 09:26:14PM +1100, Neil Brown wrote:
 What would be the benefit of having private non-visible vfsmounts?
 Sounds like a recipe for confusion?
 
 It is possible that mountd might start doing bind-mounts to create the
 'pseudo filesystem' thing for NFSv4, but they would be very visible
 (under /var/lib/nfs/v4root or something).  So having it's own vfsmount
 might make sense, but I don't get 'non-visible'.

It would allow creating an exported tree without interferance with
the local visible tree.  Note that the local visible tree isn't
global anymore either, and this allows to adjust what's exported
through nfsd throug a specific interface instead of needing to
get into nfsd namespace through some way.  Think of listing the
actually exported devices in /etc/exports instead of the indirection
through fstab aswell.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] move remove_dquot_ref to dqout.c

2007-02-06 Thread Jan Kara
On Mon 05-02-07 19:05:27, Christoph Hellwig wrote:
 Remove_dquot_ref can move to dqout.c instead of beeing in inode.c
 under #ifdef CONFIG_QUOTA.  Also clean the resulting code up
  Yes, this was because at the time the code was written, inode_lock was not 
exported
from inode.c.

 a tiny little bit by testing sb-dq_op earlier - it's constant
 over a filesystems lifetime.
  Looks fine.

 Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]
  Signed-off-by: Jan Kara [EMAIL PROTECTED]

 
 Index: linux-2.6/fs/dquot.c
 ===
 --- linux-2.6.orig/fs/dquot.c 2007-02-05 18:49:41.0 +0100
 +++ linux-2.6/fs/dquot.c  2007-02-05 18:53:49.0 +0100
 @@ -79,6 +79,7 @@
  #include linux/buffer_head.h
  #include linux/capability.h
  #include linux/quotaops.h
 +#include linux/writeback.h /* for inode_lock, oddly enough.. */
  
  #include asm/uaccess.h
  
 @@ -756,15 +757,30 @@
   }
  }
  
 +static void remove_dquot_ref(struct super_block *sb, int type,
 + struct list_head *tofree_head)
 +{
 + struct inode *inode;
 +
 + spin_lock(inode_lock);
 + list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
 + if (!IS_NOQUOTA(inode))
 + remove_inode_dquot_ref(inode, type, tofree_head);
 + }
 + spin_unlock(inode_lock);
 +}
 +
  /* Gather all references from inodes and drop them */
  static void drop_dquot_ref(struct super_block *sb, int type)
  {
   LIST_HEAD(tofree_head);
  
 - down_write(sb_dqopt(sb)-dqptr_sem);
 - remove_dquot_ref(sb, type, tofree_head);
 - up_write(sb_dqopt(sb)-dqptr_sem);
 - put_dquot_list(tofree_head);
 + if (sb-dq_op) {
 + down_write(sb_dqopt(sb)-dqptr_sem);
 + remove_dquot_ref(sb, type, tofree_head);
 + up_write(sb_dqopt(sb)-dqptr_sem);
 + put_dquot_list(tofree_head);
 + }
  }
  
  static inline void dquot_incr_inodes(struct dquot *dquot, unsigned long 
 number)
 Index: linux-2.6/fs/inode.c
 ===
 --- linux-2.6.orig/fs/inode.c 2007-02-05 18:49:41.0 +0100
 +++ linux-2.6/fs/inode.c  2007-02-05 18:49:56.0 +0100
 @@ -1252,33 +1252,6 @@
  
  EXPORT_SYMBOL(inode_needs_sync);
  
 -/*
 - *   Quota functions that want to walk the inode lists..
 - */
 -#ifdef CONFIG_QUOTA
 -
 -void remove_dquot_ref(struct super_block *sb, int type,
 - struct list_head *tofree_head)
 -{
 - struct inode *inode;
 -
 - if (!sb-dq_op)
 - return; /* nothing to do */
 - spin_lock(inode_lock); /* This lock is for inodes code */
 -
 - /*
 -  * We don't have to lock against quota code - test IS_QUOTAINIT is
 -  * just for speedup...
 -  */
 - list_for_each_entry(inode, sb-s_inodes, i_sb_list)
 - if (!IS_NOQUOTA(inode))
 - remove_inode_dquot_ref(inode, type, tofree_head);
 -
 - spin_unlock(inode_lock);
 -}
 -
 -#endif
 -
  int inode_wait(void *word)
  {
   schedule();
 Index: linux-2.6/include/linux/fs.h
 ===
 --- linux-2.6.orig/include/linux/fs.h 2007-02-05 18:51:48.0 +0100
 +++ linux-2.6/include/linux/fs.h  2007-02-05 18:51:52.0 +0100
 @@ -1681,7 +1681,6 @@
  extern int __remove_suid(struct dentry *, int);
  extern int should_remove_suid(struct dentry *);
  extern int remove_suid(struct dentry *);
 -extern void remove_dquot_ref(struct super_block *, int, struct list_head *);
  
  extern void __insert_inode_hash(struct inode *, unsigned long hashval);
  extern void remove_inode_hash(struct inode *);


Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] remove sb-s_files and file_list_lock usage in dquot.c

2007-02-06 Thread Christoph Hellwig
Iterate over sb-s_inodes instead of sb-s_files in add_dquot_ref.
This reduces list search and lock hold time aswell as getting rid of
one of the few uses of file_list_lock which Ingo identified as a
scalability problem.

Previously we called dq_op-initialize for every inode handing of
a writeable file that wasn't initialized before.  Now we're calling
it for every inode that has a non-zero i_writecount, aka a writeable
file descriptor refering to it.

Thanks a lot to Jan Kara for running this patch through his quota
test harness.

Note:  this is ontop of '[PATCH] move remove_dquot_ref to dqout.c'
which I sent out yesterday.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/dquot.c
===
--- linux-2.6.orig/fs/dquot.c   2007-02-05 18:54:36.0 +0100
+++ linux-2.6/fs/dquot.c2007-02-05 18:59:48.0 +0100
@@ -689,23 +689,27 @@
 /* This routine is guarded by dqonoff_mutex mutex */
 static void add_dquot_ref(struct super_block *sb, int type)
 {
-   struct list_head *p;
+   struct inode *inode;
 
 restart:
-   file_list_lock();
-   list_for_each(p, sb-s_files) {
-   struct file *filp = list_entry(p, struct file, f_u.fu_list);
-   struct inode *inode = filp-f_path.dentry-d_inode;
-   if (filp-f_mode  FMODE_WRITE  dqinit_needed(inode, type)) {
-   struct dentry *dentry = dget(filp-f_path.dentry);
-   file_list_unlock();
-   sb-dq_op-initialize(inode, type);
-   dput(dentry);
-   /* As we may have blocked we had better restart... */
-   goto restart;
-   }
+   spin_lock(inode_lock);
+   list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
+   if (!atomic_read(inode-i_writecount))
+   continue;
+   if (!dqinit_needed(inode, type))
+   continue;
+   if (inode-i_state  (I_FREEING|I_WILL_FREE))
+   continue;
+
+   __iget(inode);
+   spin_unlock(inode_lock);
+
+   sb-dq_op-initialize(inode, type);
+   iput(inode);
+   /* As we may have blocked we had better restart... */
+   goto restart;
}
-   file_list_unlock();
+   spin_unlock(inode_lock);
 }
 
 /* Return 0 if dqput() won't block (note that 1 doesn't necessarily mean 
blocking) */
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] remove sb-s_files and file_list_lock usage in dquot.c

2007-02-06 Thread Jan Kara
On Tue 06-02-07 14:23:33, Christoph Hellwig wrote:
 Iterate over sb-s_inodes instead of sb-s_files in add_dquot_ref.
 This reduces list search and lock hold time aswell as getting rid of
 one of the few uses of file_list_lock which Ingo identified as a
 scalability problem.
 
 Previously we called dq_op-initialize for every inode handing of
 a writeable file that wasn't initialized before.  Now we're calling
 it for every inode that has a non-zero i_writecount, aka a writeable
 file descriptor refering to it.
 
 Thanks a lot to Jan Kara for running this patch through his quota
 test harness.
 
 Note:  this is ontop of '[PATCH] move remove_dquot_ref to dqout.c'
 which I sent out yesterday.
 
 
 Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]
  Signed-off-by: Jan Kara [EMAIL PROTECTED]

 Index: linux-2.6/fs/dquot.c
 ===
 --- linux-2.6.orig/fs/dquot.c 2007-02-05 18:54:36.0 +0100
 +++ linux-2.6/fs/dquot.c  2007-02-05 18:59:48.0 +0100
 @@ -689,23 +689,27 @@
  /* This routine is guarded by dqonoff_mutex mutex */
  static void add_dquot_ref(struct super_block *sb, int type)
  {
 - struct list_head *p;
 + struct inode *inode;
  
  restart:
 - file_list_lock();
 - list_for_each(p, sb-s_files) {
 - struct file *filp = list_entry(p, struct file, f_u.fu_list);
 - struct inode *inode = filp-f_path.dentry-d_inode;
 - if (filp-f_mode  FMODE_WRITE  dqinit_needed(inode, type)) {
 - struct dentry *dentry = dget(filp-f_path.dentry);
 - file_list_unlock();
 - sb-dq_op-initialize(inode, type);
 - dput(dentry);
 - /* As we may have blocked we had better restart... */
 - goto restart;
 - }
 + spin_lock(inode_lock);
 + list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
 + if (!atomic_read(inode-i_writecount))
 + continue;
 + if (!dqinit_needed(inode, type))
 + continue;
 + if (inode-i_state  (I_FREEING|I_WILL_FREE))
 + continue;
 +
 + __iget(inode);
 + spin_unlock(inode_lock);
 +
 + sb-dq_op-initialize(inode, type);
 + iput(inode);
 + /* As we may have blocked we had better restart... */
 + goto restart;
   }
 - file_list_unlock();
 + spin_unlock(inode_lock);
  }
  
  /* Return 0 if dqput() won't block (note that 1 doesn't necessarily mean 
 blocking) */

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Tetsuo Handa
Tony Jones wrote:
 The following are a set of patches the goal of which is to pass vfsmounts
 through select portions of the VFS layer sufficient to be visible to the LSM
 inode operation hooks.
I was looking forward to these patches for so long.

Chris Wright wrote:
 This kind of change (or perhaps
 straight to struct path) is definitely
 needed from AA.
Not only AppArmor, but also TOMOYO Linux needs these patches.

TOMOYO Linux is a pathname based access control patch like AppArmor.
http://lwn.net/Articles/165132/
I have been asked Why not use LSM? and the answer is always
I can't, for VFS helper functions and LSM functions don't receive vfsmount.
and I am manually patching locations that call VFS helper functions.

But if these Tony's patches are accepted in upstream,
TOMOYO Linux would be able to use LSM.
I think these patches are also useful for auditing functions, for
auditing logs will be able to include absolute pathname
instead of partial pathname.
I think most people want access logs in the form of pathnames
rather than security labels.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] Implement -page_mkwrite for XFS

2007-02-06 Thread David Chinner
Folks,

I'm not sure of the exact locking rules and constraints for
-page_mkwrite(), so I thought I better fish around for comments.

With XFS, we need to hook pages being dirtied by mmap writes so that
we can attach buffers of the correct state tothe pages.  This means
that when we write them back, the correct thing happens.

For example, if you mmap an unwritten extent (preallocated),
currently your data will get written to disk but the extent will not
get converted to a written extent. IOWs, you lose the data because
when you read it back it will seen as unwritten and treated as a
hole.

AFAICT, it is safe to lock the page during -page_mkwrite and that
it is safe to issue I/O (e.g. metadata reads) to determine the
current state of the file.  I am also assuming that, at this point,
we are not allowed to change the file size and so we have to be
careful in -page_mkwrite we don't do that. What else have I missed
here?

IOWs, I've basically treated -page_mkwrite() as wrapper for
block_prepare_write/block_commit_write because they do all the
buffer mapping and state manipulation I think is necessary.  Is it
safe to call these functions, or are there some other constraints we
have to work under here?

Patch below. Comments?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


---
 fs/xfs/linux-2.6/xfs_file.c |   34 ++
 1 file changed, 34 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_file.c  2007-01-16 
10:54:15.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c   2007-02-07 09:49:00.508017483 
+1100
@@ -446,6 +446,38 @@ xfs_file_open_exec(
 }
 #endif /* HAVE_FOP_OPEN_EXEC */
 
+/*
+ * mmap()d file has taken write protection fault and is being made
+ * writable. We can set the page state up correctly for a writable
+ * page, which means we can do correct delalloc accounting (ENOSPC
+ * checking!) and unwritten extent mapping.
+ */
+STATIC int
+xfs_vm_page_mkwrite(
+   struct vm_area_struct   *vma,
+   struct page *page)
+{
+   struct inode*inode = vma-vm_file-f_path.dentry-d_inode;
+   unsigned long   end;
+   int ret = 0;
+
+   end = page-index + 1;
+   end = PAGE_CACHE_SHIFT;
+   if (end  i_size_read(inode))
+   end = i_size_read(inode)  ~PAGE_CACHE_MASK;
+   else
+   end = PAGE_CACHE_SIZE;
+
+   lock_page(page);
+   ret = block_prepare_write(page, 0, end, xfs_get_blocks);
+   if (!ret)
+   ret = block_commit_write(page, 0, end);
+   unlock_page(page);
+
+   return ret;
+}
+
+
 const struct file_operations xfs_file_operations = {
.llseek = generic_file_llseek,
.read   = do_sync_read,
@@ -503,12 +535,14 @@ const struct file_operations xfs_dir_fil
 static struct vm_operations_struct xfs_file_vm_ops = {
.nopage = filemap_nopage,
.populate   = filemap_populate,
+   .page_mkwrite   = xfs_vm_page_mkwrite,
 };
 
 #ifdef HAVE_DMAPI
 static struct vm_operations_struct xfs_dmapi_file_vm_ops = {
.nopage = xfs_vm_nopage,
.populate   = filemap_populate,
+   .page_mkwrite   = xfs_vm_page_mkwrite,
 #ifdef HAVE_VMOP_MPROTECT
.mprotect   = xfs_vm_mprotect,
 #endif
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-06 Thread Andrew Morton
On Tue, 16 Jan 2007 21:05:20 +0900
[EMAIL PROTECTED] wrote:

 Move the blocks on the temporary inode to the original inode
 by a page.
 1. Read the file data from the old blocks to the page
 2. Move the block on the temporary inode to the original inode
 3. Write the file data on the page into the new blocks
 
 ...

 +
 +/**
 + * ext4_ext_replace_branches - replace original extents with new extents.
 + * @org_inodeOriginal inode
 + * @dest_inode   temporary inode
 + * @from_pagePage offset
 + * @count_page   Page count to be replaced
 + * @delete_start block offset for deletion
 + *
 + * This function returns 0 if succeed, otherwise returns error value.
 + * Replace extents for blocks from from to from+count-1.
 + */
 +static int
 +ext4_ext_replace_branches(struct inode *org_inode, struct inode *dest_inode,
 + pgoff_t from_page,  pgoff_t dest_from_page,
 + pgoff_t count_page, unsigned long *delete_start) 
 +{
 + struct ext4_ext_path *org_path = NULL;
 + struct ext4_ext_path *dest_path = NULL;
 + struct ext4_extent   *oext, *dext;
 + struct ext4_extent   tmp_ext;
 + int err = 0;
 + int depth;
 + unsigned long from, count, dest_off, diff, replaced_count = 0;

These should be sector_t, shouldn't they?

 + handle_t *handle = NULL;
 + unsigned jnum;
 +
 + from = from_page  (PAGE_CACHE_SHIFT - dest_inode-i_blkbits);

In which case one needs to be very careful to avoid overflows in
expressions such as this one.

 + wait_on_page_locked(page);
 + lock_page(page);

The wait_on_page_locked() is unneeded here.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-06 Thread Andrew Morton
On Mon, 5 Feb 2007 14:12:04 +0100
Jan Kara [EMAIL PROTECTED] wrote:

  Move the blocks on the temporary inode to the original inode
  by a page.
  1. Read the file data from the old blocks to the page
  2. Move the block on the temporary inode to the original inode
  3. Write the file data on the page into the new blocks
   I have one thing - it's probably not good to use page cache for
 defragmentation.

Then it is no longer online defragmentation.  The issues with maintaining
correctness and coherency with ongoing VFS activity would be truly ghastly.

If we're worried about pagecache pollution then it would be better to control
that from userspace via fadvise().
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/3] 2.6.20 fix for PageUptodate memorder problem

2007-02-06 Thread David Chinner
On Tue, Feb 06, 2007 at 09:02:01AM +0100, Nick Piggin wrote:
 Still no independent confirmation as to whether this is a problem or not.
 I think it is, so I'll propose this patchset to fix it. Patch 1/3 has a
 reasonable description of the problem.
 

Nick, can you include a diffstat at the head of your patches? Makes
it much easier to see what the scope of the changes are when you
are changing code in several filesystems

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1][RFC] mm: prepare_write positive return value

2007-02-06 Thread Andrew Morton
On Tue, 06 Feb 2007 11:33:46 +0300
Dmitriy Monakhov [EMAIL PROTECTED] wrote:

 Almost all read/write operation handles data with chunks(segments or pages)
 and result has integral behaviour for folowing scenario: 
 for_each_chunk() {
  res = op();
  if(IS_ERROR(res))
return progress ? progress : res;
  progress += res;
 }
 prepare_write may has integral behaviour in case of blksize  pgsize,
 for example we successfully allocated/read some blocks, but not all of them,
 and than some error happend. Currently we eliminate this progress by doing
 vmtrunate() after prepare_has failed.
 It is good to have ability to signal about this progress. Interprete positive
 prepare_write() ret code as bytes num that fs ready to handle at this moment.
 I've ask SAW, he think it is sane. This size always less than PAGE_CACHE_SIZE
 so it less than AOP_TRUNCATED_PAGE too.
  
 BTH: This approach was used in OpenVZ 2.6.9 kernel in order to make FS with 
 delayed allocation more correct, and its works well.
 I think not everybody will happy about this,  but let's discuss all advantages
 and disadvantages of this change.

That seems to be a logical change.  We'd need to review all the callers and
callees to make sure that they handle this change correctly.

Your changes deviate quite a lot from standard kernel coding style.  Please fix
that.

Please cc linux-ext4@vger.kernel.org on the next version of these patches.  I'm
seriously running out of bandwidth over here and ext4 has other maintainers.

Thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] remove sb-s_files and file_list_lock usage in dquot.c

2007-02-06 Thread Andrew Morton
On Tue, 6 Feb 2007 14:23:33 +0100
Christoph Hellwig [EMAIL PROTECTED] wrote:

  static void add_dquot_ref(struct super_block *sb, int type)
  {
 - struct list_head *p;
 + struct inode *inode;
  
  restart:
 - file_list_lock();
 - list_for_each(p, sb-s_files) {
 - struct file *filp = list_entry(p, struct file, f_u.fu_list);
 - struct inode *inode = filp-f_path.dentry-d_inode;
 - if (filp-f_mode  FMODE_WRITE  dqinit_needed(inode, type)) {
 - struct dentry *dentry = dget(filp-f_path.dentry);
 - file_list_unlock();
 - sb-dq_op-initialize(inode, type);
 - dput(dentry);
 - /* As we may have blocked we had better restart... */
 - goto restart;
 - }
 + spin_lock(inode_lock);
 + list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
 + if (!atomic_read(inode-i_writecount))
 + continue;
 + if (!dqinit_needed(inode, type))
 + continue;
 + if (inode-i_state  (I_FREEING|I_WILL_FREE))
 + continue;
 +
 + __iget(inode);
 + spin_unlock(inode_lock);
 +
 + sb-dq_op-initialize(inode, type);
 + iput(inode);
 + /* As we may have blocked we had better restart... */
 + goto restart;
   }
 - file_list_unlock();
 + spin_unlock(inode_lock);
  }

That loop has (and had) up to O(n^n) operations.  Is there something which
prevents this from going insane?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-06 Thread Eric Sandeen

Andrew Morton wrote:

On Tue, 16 Jan 2007 21:05:20 +0900
[EMAIL PROTECTED] wrote:


...


+ext4_ext_replace_branches(struct inode *org_inode, struct inode *dest_inode,
+   pgoff_t from_page,  pgoff_t dest_from_page,
+			pgoff_t count_page, unsigned long *delete_start) 
+{

+   struct ext4_ext_path *org_path = NULL;
+   struct ext4_ext_path *dest_path = NULL;
+   struct ext4_extent   *oext, *dext;
+   struct ext4_extent   tmp_ext;
+   int err = 0;
+   int depth;
+   unsigned long from, count, dest_off, diff, replaced_count = 0;


These should be sector_t, shouldn't they?


At some point should we start using blkcnt_t properly? 
(block-in[-large]-file, not block-in[-large]-device?)  I think that's 
what it was introduced for, although it's not in wide use at this point.


I guess the type really isn't used anywhere else; just in the inode's 
i_blocks.  Hmm.


-Eric
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/3] 2.6.20 fix for PageUptodate memorder problem

2007-02-06 Thread Nick Piggin
On Wed, Feb 07, 2007 at 09:58:57AM +1100, David Chinner wrote:
 On Tue, Feb 06, 2007 at 09:02:01AM +0100, Nick Piggin wrote:
  Still no independent confirmation as to whether this is a problem or not.
  I think it is, so I'll propose this patchset to fix it. Patch 1/3 has a
  reasonable description of the problem.
  
 
 Nick, can you include a diffstat at the head of your patches? Makes
 it much easier to see what the scope of the changes are when you
 are changing code in several filesystems

Good idea, I'll do that.

Thanks,
Nick
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] remove sb-s_files and file_list_lock usage in dquot.c

2007-02-06 Thread Christoph Hellwig
On Tue, Feb 06, 2007 at 03:50:01PM -0800, Andrew Morton wrote:
 On Tue, 6 Feb 2007 14:23:33 +0100
 Christoph Hellwig [EMAIL PROTECTED] wrote:
 
   static void add_dquot_ref(struct super_block *sb, int type)
   {
  -   struct list_head *p;
  +   struct inode *inode;
   
   restart:
  -   file_list_lock();
  -   list_for_each(p, sb-s_files) {
  -   struct file *filp = list_entry(p, struct file, f_u.fu_list);
  -   struct inode *inode = filp-f_path.dentry-d_inode;
  -   if (filp-f_mode  FMODE_WRITE  dqinit_needed(inode, type)) {
  -   struct dentry *dentry = dget(filp-f_path.dentry);
  -   file_list_unlock();
  -   sb-dq_op-initialize(inode, type);
  -   dput(dentry);
  -   /* As we may have blocked we had better restart... */
  -   goto restart;
  -   }
  +   spin_lock(inode_lock);
  +   list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
  +   if (!atomic_read(inode-i_writecount))
  +   continue;
  +   if (!dqinit_needed(inode, type))
  +   continue;
  +   if (inode-i_state  (I_FREEING|I_WILL_FREE))
  +   continue;
  +
  +   __iget(inode);
  +   spin_unlock(inode_lock);
  +
  +   sb-dq_op-initialize(inode, type);
  +   iput(inode);
  +   /* As we may have blocked we had better restart... */
  +   goto restart;
  }
  -   file_list_unlock();
  +   spin_unlock(inode_lock);
   }
 
 That loop has (and had) up to O(n^n) operations.  Is there something which
 prevents this from going insane?

I don't think so.  Then again it's only called when you call quotaon on
a mounted filesystem, and normally you don't have that many inodes
instanciated at that time.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Testing ext4 persistent preallocation patches for 64 bit features

2007-02-06 Thread Amit K. Arora
I plan to test the persistent preallocation patches on a huge sparse
device, to know if 32 bit physical block numbers (upto 48bit) behave as
expected. I have following questions for this and will appreciate
suggestions here:

a) What should be the sparse device size which I should use for testing?
Should a size of  8TB (say, 100 TB) be enough ?
The physical device (backing store device) size I can have is upto 70GB.

b) How do I test allocation of 32 bit physical block numbers ? I can
not fill  8TB, since the physical storage available with me is just
70GB.

c) Do I need to put some hack in the filesystem code for above (to
allocate 32 bit physical block numbers) ?

Any further ideas on how to test this will help. Thanks!

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html