date:20070514

[patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6

2007-05-14 Thread npiggin

From: Andrew Morton [EMAIL PROTECTED]

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |9 +
 mm/filemap.h |4 ++--
 2 files changed, 3 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1957,12 +1957,6 @@ generic_file_buffered_write(struct kiocb
break;
}
 
-   if (unlikely(bytes == 0)) {
-   status = 0;
-   copied = 0;
-   goto zero_length_segment;
-   }
-
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
loff_t isize = i_size_read(inode);
@@ -1992,8 +1986,7 @@ generic_file_buffered_write(struct kiocb
page_cache_release(page);
continue;
}
-zero_length_segment:
-   if (likely(copied >= 0)) {
+   if (likely(copied > 0)) {
if (!status)
status = copied;
 
Index: linux-2.6/mm/filemap.h
===
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
const struct iovec *iov = *iovp;
size_t base = *basep;
 
-   do {
+   while (bytes) {
int copy = min(bytes, iov->iov_len - base);
 
bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
iov++;
base = 0;
}
-   } while (bytes);
+   }
*iovp = iov;
*basep = base;
 }
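For readers comparing the two loop forms in the filemap.h hunk: with `do { } while (bytes)` the body always runs once, so a call with `bytes == 0` still steps the cursor off a segment whose end it has reached (for example a zero-length segment), presumably so that the reverted zero_length_segment path could skip empty segments. With `while (bytes)` such a call leaves the cursor where it is. The userspace sketch below assumes POSIX `struct iovec` from `<sys/uio.h>`; `advance_while()` and `advance_do()` are hypothetical names, and the body is modelled on filemap_set_next_iovec() rather than copied from the tree.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Restored form: advance (*iovp, *basep) by `bytes` using while (bytes). */
static void advance_while(const struct iovec **iovp, size_t *basep, size_t bytes)
{
    const struct iovec *iov = *iovp;
    size_t base = *basep;

    while (bytes) {
        size_t left = iov->iov_len - base;
        size_t copy = bytes < left ? bytes : left;

        bytes -= copy;
        base += copy;
        if (iov->iov_len == base) {   /* segment exhausted: move to the next */
            iov++;
            base = 0;
        }
    }
    *iovp = iov;
    *basep = base;
}

/* Reverted form: do { } while (bytes), so the body runs at least once. */
static void advance_do(const struct iovec **iovp, size_t *basep, size_t bytes)
{
    const struct iovec *iov = *iovp;
    size_t base = *basep;

    do {
        size_t left = iov->iov_len - base;
        size_t copy = bytes < left ? bytes : left;

        bytes -= copy;
        base += copy;
        if (iov->iov_len == base) {   /* also true for an empty segment */
            iov++;
            base = 0;
        }
    } while (bytes);
    *iovp = iov;
    *basep = base;
}
```

With a zero-length first segment, `advance_do(&p, &off, 0)` moves `p` to the second segment, while `advance_while(&p, &off, 0)` leaves it on the first.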

-- 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 00/41] Buffered write deadlock fix and new aops for 2.6.21-mm2

2007-05-14 Thread npiggin

-- 
Here is an update against 2.6.21-mm2. Unfortunately UML broke for me, so
test coverage isn't as good as the last time I posted the series. Also,
several filesystems had significant clashes. Considering the amount of
time it took to get them working, I won't fix them again. They aren't
_broken_ as such, they'll just run slowly (but without the deadlock).

The OCFS2 patch seemed to have some clashes too, so I've left that out.
I'm sure Mark will take a look at that quickly if this patchset gets
merged.

Thanks to Neil for some documentation suggestions and catching a bug, and
to Vladimir for the reiserfs implementation (not 100% done yet, but it is
a good start).




[patch 05/41] mm: debug write deadlocks

2007-05-14 Thread npiggin


Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
This makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.
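As a rough picture of what this patch compiles out: assuming the 2.6-era fault_in_pages_readable() behaviour (touch the first byte of the range, and the last byte too when it lies on a different page), the pages it would touch can be computed as below. `prefault_pages()` is a hypothetical userspace helper that only does the address arithmetic; it faults nothing in.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: which pages would a fault_in_pages_readable()-style
 * prefault touch for [addr, addr + size)? The first byte is always touched;
 * the last byte too if it lies on a different page. Like the kernel callers,
 * this assumes size <= page_size (so at most two pages) and a power-of-two
 * page_size. The size == 0 guard is this sketch's own addition.
 */
static int prefault_pages(uintptr_t addr, size_t size, uintptr_t page_size,
                          uintptr_t out[2])
{
    uintptr_t mask = ~(page_size - 1);
    uintptr_t end;
    int n = 0;

    if (size == 0)
        return 0;
    end = addr + size - 1;               /* last byte of the range */
    out[n++] = addr & mask;
    if ((addr & mask) != (end & mask))
        out[n++] = end & mask;
    return n;
}
```

With CONFIG_DEBUG_VM set, neither page is touched before the copy, so the copy itself has to take the fault, which is exactly the window the patch wants to widen.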

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1940,6 +1940,7 @@ generic_file_buffered_write(struct kiocb
if (maxlen > bytes)
maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
@@ -1947,6 +1948,7 @@ generic_file_buffered_write(struct kiocb
 * up-to-date.
 */
fault_in_pages_readable(buf, maxlen);
+#endif
 
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {

-- 


[patch 03/41] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

2007-05-14 Thread npiggin

From: Andrew Morton [EMAIL PROTECTED]

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the first segment of the iovec.  If the second or
  a subsequent segment described a mmapping of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it doesn't fix all the
locking problems.
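The 1024-versus-1 figure follows from how the copy loop chunks a write. A sketch, assuming a 4096-byte page and a hypothetical helper `count_write_chunks()` that only counts loop iterations (each iteration standing for one prepare_write()/commit_write() pair): clamping every chunk to the current iovec segment turns a one-page write made of 1024 four-byte segments into 1024 pairs instead of one.

```c
#include <stddef.h>

/*
 * Count the per-chunk prepare/commit pairs needed to write the segments
 * seg_len[0..nr_segs) at file offset pos, one page-bounded chunk at a time.
 * clamp_to_segment != 0 models the reverted fix: each chunk also stops at the
 * end of the current iovec segment.
 */
static size_t count_write_chunks(const size_t *seg_len, size_t nr_segs,
                                 unsigned long long pos, size_t page_size,
                                 int clamp_to_segment)
{
    size_t total = 0, seg = 0, seg_off = 0, chunks = 0;

    for (size_t i = 0; i < nr_segs; i++)
        total += seg_len[i];

    while (total) {
        size_t offset = (size_t)(pos % page_size);   /* offset within page */
        size_t bytes = page_size - offset;           /* room left in page */

        if (bytes > total)
            bytes = total;
        if (clamp_to_segment) {
            while (seg_len[seg] == seg_off) {        /* skip used/empty segments */
                seg++;
                seg_off = 0;
            }
            if (bytes > seg_len[seg] - seg_off)
                bytes = seg_len[seg] - seg_off;
            seg_off += bytes;
        }
        pos += bytes;
        total -= bytes;
        chunks++;
    }
    return chunks;
}
```

A single large segment is unaffected by the clamp; the cost only appears when segments are smaller than a page.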


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1927,21 +1927,14 @@ generic_file_buffered_write(struct kiocb
do {
unsigned long index;
unsigned long offset;
+   unsigned long maxlen;
size_t copied;
 
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
-
-   /* Limit the size of the copy to the caller's write size */
-   bytes = min(bytes, count);
-
-   /*
-* Limit the size of the copy to that of the current segment,
-* because fault_in_pages_readable() doesn't know how to walk
-* segments.
-*/
-   bytes = min(bytes, cur_iov->iov_len - iov_base);
+   if (bytes > count)
+   bytes = count;
 
/*
 * Bring in the user page that we will copy from _first_.
@@ -1949,7 +1942,10 @@ generic_file_buffered_write(struct kiocb
 * same page as we're writing to, without it being marked
 * up-to-date.
 */
-   fault_in_pages_readable(buf, bytes);
+   maxlen = cur_iov->iov_len - iov_base;
+   if (maxlen > bytes)
+   maxlen = bytes;
+   fault_in_pages_readable(buf, maxlen);
 
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {

-- 


[patch 04/41] mm: clean up buffered write code

2007-05-14 Thread npiggin

From: Andrew Morton [EMAIL PROTECTED]

Rename some variables and fix some types.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   35 ++-
 1 file changed, 18 insertions(+), 17 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1900,16 +1900,15 @@ generic_file_buffered_write(struct kiocb
size_t count, ssize_t written)
 {
struct file *file = iocb->ki_filp;
-   struct address_space * mapping = file->f_mapping;
+   struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
struct page *page;
struct page *cached_page = NULL;
-   size_t  bytes;
struct pagevec  lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
-   size_t  iov_base = 0;  /* offset in the current iovec */
+   size_t  iov_offset = 0; /* offset in the current iovec */
char __user *buf;
 
pagevec_init(&lru_pvec, 0);
@@ -1920,31 +1919,33 @@ generic_file_buffered_write(struct kiocb
if (likely(nr_segs == 1))
buf = iov->iov_base + written;
else {
-   filemap_set_next_iovec(&cur_iov, &iov_base, written);
-   buf = cur_iov->iov_base + iov_base;
+   filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+   buf = cur_iov->iov_base + iov_offset;
}
 
do {
-   unsigned long index;
-   unsigned long offset;
-   unsigned long maxlen;
-   size_t copied;
+   pgoff_t index;  /* Pagecache index for current page */
+   unsigned long offset;   /* Offset into pagecache page */
+   unsigned long maxlen;   /* Bytes remaining in current iovec */
+   size_t bytes;   /* Bytes to write to page */
+   size_t copied;  /* Bytes copied from user */
 
-   offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+   offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;
 
+   maxlen = cur_iov->iov_len - iov_offset;
+   if (maxlen > bytes)
+   maxlen = bytes;
+
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
 * same page as we're writing to, without it being marked
 * up-to-date.
 */
-   maxlen = cur_iov->iov_len - iov_base;
-   if (maxlen > bytes)
-   maxlen = bytes;
fault_in_pages_readable(buf, maxlen);
 
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1975,7 +1976,7 @@ generic_file_buffered_write(struct kiocb
buf, bytes);
else
copied = filemap_copy_from_user_iovec(page, offset,
-   cur_iov, iov_base, bytes);
+   cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
@@ -1993,12 +1994,12 @@ generic_file_buffered_write(struct kiocb
buf += status;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
-   &iov_base, status);
+   &iov_offset, status);
if (count)
buf = cur_iov->iov_base +
-   iov_base;
+   iov_offset;
} else {
-   iov_base += status;
+   iov_offset += status;
}
}
}

-- 


[patch 11/41] fs: fix data-loss on error

2007-05-14 Thread npiggin


New buffers against uptodate pages are simply marked uptodate, while the
buffer_new bit remains set. This causes error-case code to zero out parts
of those buffers because it thinks they contain stale data: wrong, they
are actually uptodate, so this is a data loss situation.

Fix this by actually clearing buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.
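A minimal model of why the stale buffer_new bit loses data, assuming a hypothetical `struct toy_bh` that carries only the three state bits this patch touches. The error path is simplified to "zero any buffer still flagged new"; the real __block_prepare_write() error handling also restricts itself to the written range.

```c
#include <string.h>

/* Hypothetical, simplified buffer state: only the bits this patch touches. */
enum { BH_UPTODATE = 1u << 0, BH_NEW = 1u << 1, BH_DIRTY = 1u << 2 };

struct toy_bh {
    unsigned state;
    char data[8];
};

/* The PageUptodate() branch for a newly mapped buffer, before and after the fix. */
static void new_buffer_on_uptodate_page(struct toy_bh *bh, int fixed)
{
    if (fixed)
        bh->state &= ~(unsigned)BH_NEW;  /* clear_buffer_new() */
    bh->state |= BH_UPTODATE;            /* set_buffer_uptodate() */
    if (fixed)
        bh->state |= BH_DIRTY;           /* mark_buffer_dirty() */
}

/* Simplified error path: anything still flagged new is assumed to be stale. */
static void error_path_zero_new(struct toy_bh *bh)
{
    if (bh->state & BH_NEW)
        memset(bh->data, 0, sizeof(bh->data));
}
```

Without the fix the error path wipes a buffer whose contents were valid; with it, the buffer keeps its data and ends up uptodate and dirty.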

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/buffer.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1793,7 +1793,9 @@ static int __block_prepare_write(struct 
unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
if (PageUptodate(page)) {
+   clear_buffer_new(bh);
set_buffer_uptodate(bh);
+   mark_buffer_dirty(bh);
continue;
}
if (block_end > to || block_start < from) {

-- 


[patch 07/41] mm: buffered write cleanup

2007-05-14 Thread npiggin


Quite a bit of code is used in maintaining these cached pages, which are
probably pretty unlikely to get used. Using one would require a narrow race
where the page is inserted concurrently while this process is allocating a
page in order to create the spare page, followed by a multi-page write into
an uncached part of the file.
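The simplified find_or_create_page() pattern in this patch can be sketched in userspace: allocate, try to insert, and on -EEXIST free our page and look again, instead of keeping it as a spare. Everything here is hypothetical: a fixed-size array stands in for the pagecache, and `race_hook` stands in for another CPU inserting the same index between our allocation and our insertion.

```c
#include <errno.h>
#include <stdlib.h>

#define NSLOTS 16   /* toy cache size; callers must pass index < NSLOTS */

struct toy_page { unsigned long index; };

static struct toy_page *cache[NSLOTS];

/* If set, runs once between our allocation and our insertion. */
static void (*race_hook)(unsigned long index);

static struct toy_page rival_page;

/* Simulated concurrent inserter, for use as race_hook. */
static void simulate_concurrent_insert(unsigned long index)
{
    rival_page.index = index;
    cache[index] = &rival_page;
}

static int cache_insert(struct toy_page *page, unsigned long index)
{
    if (cache[index])
        return -EEXIST;
    cache[index] = page;
    return 0;
}

/* Find-or-create without a spare page carried between attempts. */
static struct toy_page *toy_find_or_create(unsigned long index)
{
    struct toy_page *page;
    int err;

repeat:
    page = cache[index];
    if (page)
        return page;
    page = malloc(sizeof(*page));
    if (!page)
        return NULL;
    page->index = index;
    if (race_hook) {
        void (*hook)(unsigned long) = race_hook;

        race_hook = NULL;           /* fire only once */
        hook(index);
    }
    err = cache_insert(page, index);
    if (err) {
        free(page);                 /* drop ours rather than keep a spare */
        if (err == -EEXIST)
            goto repeat;            /* someone else won: go find theirs */
        return NULL;
    }
    return page;
}
```

Losing the race costs one extra allocation and free, which is the trade the patch makes for deleting the spare-page bookkeeping.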

Next, the buffered write path (and others) uses its own LRU pagevec when it
should just be using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).
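The batching effect behind the lru_lock reduction can be modelled by counting drains. The sketch assumes a batch size of 14 (PAGEVEC_SIZE in kernels of this era), one lock acquisition per non-empty drain, and nothing else contending for the lock; all names are hypothetical, and it illustrates the mechanism rather than reproducing the 7.3x measurement.

```c
#include <stddef.h>

#define BATCH 14   /* assumed to match PAGEVEC_SIZE */

struct toy_pagevec {
    size_t nr;                        /* pages currently batched */
    unsigned long lock_acquisitions;  /* times the "lru_lock" was taken */
};

/* Move all batched pages to the LRU under a single lock acquisition. */
static void pagevec_drain(struct toy_pagevec *pv)
{
    if (pv->nr) {
        pv->lock_acquisitions++;
        pv->nr = 0;
    }
}

static void pagevec_add_page(struct toy_pagevec *pv)
{
    if (++pv->nr == BATCH)
        pagevec_drain(pv);
}

/*
 * Simulate nr_writes write() calls, each adding pages_per_write new pages.
 * private_per_call != 0: a pagevec local to each call, drained on return
 * (the old buffered-write behaviour). Otherwise one persistent pagevec,
 * standing in for the per-CPU one, drained only when full.
 */
static unsigned long lru_locks_for_writes(size_t nr_writes, size_t pages_per_write,
                                          int private_per_call)
{
    struct toy_pagevec persistent = { 0, 0 };
    unsigned long locks = 0;

    for (size_t w = 0; w < nr_writes; w++) {
        struct toy_pagevec local = { 0, 0 };
        struct toy_pagevec *pv = private_per_call ? &local : &persistent;

        for (size_t p = 0; p < pages_per_write; p++)
            pagevec_add_page(pv);
        if (private_per_call) {
            pagevec_drain(&local);
            locks += local.lock_acquisitions;
        }
    }
    pagevec_drain(&persistent);
    return locks + persistent.lock_acquisitions;
}
```

With one new page per 4K write, the private pagevec takes the lock once per write, while the persistent one takes it once per full batch.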

[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
 to clashes in -mm. What put them there, and why? ]

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/mpage.c |   12 
 mm/filemap.c   |  144 ++---
 mm/readahead.c |   28 +++
 3 files changed, 66 insertions(+), 118 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -666,26 +666,22 @@ EXPORT_SYMBOL(find_lock_page);
 struct page *find_or_create_page(struct address_space *mapping,
unsigned long index, gfp_t gfp_mask)
 {
-   struct page *page, *cached_page = NULL;
+   struct page *page;
int err;
 repeat:
page = find_lock_page(mapping, index);
if (!page) {
-   if (!cached_page) {
-   cached_page = alloc_page(gfp_mask);
-   if (!cached_page)
-   return NULL;
-   }
-   err = add_to_page_cache_lru(cached_page, mapping,
-   index, gfp_mask);
-   if (!err) {
-   page = cached_page;
-   cached_page = NULL;
-   } else if (err == -EEXIST)
-   goto repeat;
+   page = alloc_page(gfp_mask);
+   if (!page)
+   return NULL;
+   err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+   if (unlikely(err)) {
+   page_cache_release(page);
+   page = NULL;
+   if (err == -EEXIST)
+   goto repeat;
+   }
}
-   if (cached_page)
-   page_cache_release(cached_page);
return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -882,11 +878,9 @@ void do_generic_mapping_read(struct addr
unsigned long prev_index;
unsigned int prev_offset;
loff_t isize;
-   struct page *cached_page;
int error;
struct file_ra_state ra = *_ra;
 
-   cached_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_index;
@@ -1053,23 +1047,20 @@ no_cached_page:
 * Ok, it wasn't cached, so we need to create a new
 * page..
 */
-   if (!cached_page) {
-   cached_page = page_cache_alloc_cold(mapping);
-   if (!cached_page) {
desc->error = -ENOMEM;
-   goto out;
-   }
+   page = page_cache_alloc_cold(mapping);
+   if (!page) {
desc->error = -ENOMEM;
+   goto out;
}
-   error = add_to_page_cache_lru(cached_page, mapping,
+   error = add_to_page_cache_lru(page, mapping,
index, GFP_KERNEL);
if (error) {
+   page_cache_release(page);
if (error == -EEXIST)
goto find_page;
desc->error = error;
goto out;
}
-   page = cached_page;
-   cached_page = NULL;
goto readpage;
}
 
@@ -1077,8 +1068,6 @@ out:
*_ra = ra;
 
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
-   if (cached_page)
-   page_cache_release(cached_page);
if (filp)
file_accessed(filp);
 }
@@ -1561,35 +1550,28 @@ static struct page *__read_cache_page(st
int (*filler)(void *,struct page*),
void *data)
 {
-   struct page *page, *cached_page = NULL;
+   struct page *page;
int err;
 repeat:
page = find_get_page(mapping, index);
if (!page) {
-

[patch 09/41] mm: fix pagecache write deadlocks

2007-05-14 Thread npiggin


Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.prepare_write
4. unlock_page+vmtruncate
5. copy_from_user
6.  mmap_sem(r)
7.   handle_mm_fault
8.lock_page (filemap_nopage)
9.commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.get_user_pages
e. handle_mm_fault
f.  lock_page (filemap_nopage)

2,8 - recursive deadlock if page is same
2,8;2,8 - ABBA deadlock if page is different
2,6;b,f - ABBA deadlock if page is same
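The three cases above can be checked mechanically by writing each call chain as the list of locks it holds at once. The sketch is a toy checker (not lockdep) with hypothetical names; it treats mmap_sem as an ordinary exclusive lock, which overstates conflicts between two readers.

```c
#include <stddef.h>
#include <string.h>

/*
 * A path is the list of locks taken in order, each one while all earlier
 * ones are still held. Lock names are plain strings.
 */

/* Recursive deadlock: the same path takes one lock twice. */
static int self_deadlock(const char *const *path, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (strcmp(path[i], path[j]) == 0)
                return 1;
    return 0;
}

/* ABBA: some pair is taken A-then-B on one path and B-then-A on the other. */
static int abba_deadlock(const char *const *p, size_t np,
                         const char *const *q, size_t nq)
{
    for (size_t i = 0; i < np; i++)
        for (size_t j = i + 1; j < np; j++)
            for (size_t k = 0; k < nq; k++)
                for (size_t l = k + 1; l < nq; l++)
                    if (strcmp(p[i], q[l]) == 0 && strcmp(p[j], q[k]) == 0)
                        return 1;
    return 0;
}
```

Steps 2, 6, 8 become `{ "page", "mmap_sem", "page" }` and steps b, f become `{ "mmap_sem", "page" }`. If the write path never faults while holding lock_page(), as the solution in this patch arranges, it reduces to `{ "page" }` and neither check fires.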

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
atomic usercopies which do not take pagefaults and do not zero the uncopied
tail of the destination. The destination is already uptodate, so we can
commit_write the full length even if there was a partial copy: it does not
matter that the tail was not modified, because if it is dirtied and written
back to disk it will not cause any problems (uptodate *means* that the
destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
information, because atomic usercopies cannot distinguish between
non-present pages in a readable mapping and the lack of a readable mapping.

2.  If we find the destination page is not uptodate, unlock it (this could be
made slightly more optimal), then allocate a temporary page to copy the
source data into. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
from the pinned temporary page via the kernel address space.
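The two copy strategies can be modelled in a single-threaded sketch. All names are hypothetical: `copy_atomic()` stands in for a non-faulting usercopy by treating only the first `resident` source bytes as mapped, and a plain local buffer plays the temporary page from point 2. Locking is only described in comments.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE 16   /* toy page size, small enough to inspect by eye */

/*
 * Stand-in for an atomic usercopy: copies at most `resident` bytes and
 * leaves the uncopied tail of dst as it was (it is not zeroed).
 */
static size_t copy_atomic(char *dst, const char *src, size_t len, size_t resident)
{
    size_t n = len < resident ? len : resident;

    memcpy(dst, src, n);
    return n;
}

/*
 * Case 1: destination uptodate. Copy atomically under the page lock; even
 * after a short copy the whole range may be committed, because the untouched
 * tail still holds valid data. Returns the bytes actually copied.
 */
static size_t write_uptodate(char *page, size_t off, const char *src, size_t len,
                             size_t resident, size_t *commit_len)
{
    *commit_len = 0;
    if (off > TOY_PAGE || len > TOY_PAGE - off)
        return 0;
    *commit_len = len;
    return copy_atomic(page + off, src, len, resident);
}

/*
 * Case 2: destination not uptodate. With the destination unlocked, copy the
 * user data into a temporary page (a fault here is harmless), then relock
 * the destination and copy kernel-to-kernel, which cannot fault.
 */
static size_t write_not_uptodate(char *page, size_t off, const char *src, size_t len)
{
    char tmp[TOY_PAGE];

    if (off > TOY_PAGE || len > TOY_PAGE - off)
        return 0;
    memcpy(tmp, src, len);          /* may fault: no page lock held */
    memcpy(page + off, tmp, len);   /* from the pinned temporary page */
    return len;
}
```

The second case does two copies instead of one, which is where the extra CPU/memory copy cost mentioned below comes from.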

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 include/linux/pagemap.h |   11 +++-
 mm/filemap.c|  114 
 2 files changed, 104 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1889,11 +1889,12 @@ generic_file_buffered_write(struct kiocb
filemap_set_next_iovec(cur_iov, nr_segs, iov_offset, written);
 
do {
+   struct page *src_page;
struct page *page;
pgoff_t index;  /* Pagecache index for current page */
unsigned long offset;   /* Offset into pagecache page */
-   unsigned long maxlen;   /* Bytes remaining in current iovec */
-   size_t bytes;   /* Bytes to write to page */
+   unsigned long seglen;   /* Bytes remaining in current iovec */
+   unsigned long bytes;/* Bytes to write to page */
size_t copied;  /* Bytes copied from user */
 
buf = cur_iov->iov_base + iov_offset;
@@ -1903,20 +1904,30 @@ generic_file_buffered_write(struct kiocb
if (bytes > count)
bytes = count;
 
-   maxlen = cur_iov->iov_len - iov_offset;
-   if (maxlen > bytes)
-   maxlen = bytes;
+   /*
+* a non-NULL src_page indicates that we're doing the
+* copy via get_user_pages and kmap.
+*/
+   src_page = NULL;
+
+   seglen = cur_iov->iov_len - iov_offset;
+   if (seglen > bytes)
+   seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
 * same page as we're writing to, without it being marked
 * up-to-date.
+*
+* Not only is this an optimisation, but it is also required
+* to check that the address is actually valid, when atomic
+* usercopies are used, below.
 */
-   fault_in_pages_readable(buf, maxlen);
-#endif
-
+   if (unlikely(fault_in_pages_readable(buf, seglen))) {
+   status = -EFAULT;
+   break;
+   }
 
page = __grab_cache_page(mapping, index);
if (!page) {
@@ -1924,32 +1935,104 @@ generic_file_buffered_write(struct kiocb
break;
}
 
+   /*
+* non-uptodate pages

[patch 20/41] xfs convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/xfs/linux-2.6/xfs_aops.c |   19 ---
 fs/xfs/linux-2.6/xfs_lrw.c  |   35 ---
 2 files changed, 24 insertions(+), 30 deletions(-)

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
@@ -1479,13 +1479,18 @@ xfs_vm_direct_IO(
 }
 
 STATIC int
-xfs_vm_prepare_write(
+xfs_vm_write_begin(
struct file *file,
-   struct page *page,
-   unsigned int from,
-   unsigned int to)
+   struct address_space *mapping,
+   loff_t pos,
+   unsigned len,
+   unsigned flags,
+   struct page **pagep,
+   void **fsdata)
 {
-   return block_prepare_write(page, from, to, xfs_get_blocks);
+   *pagep = NULL;
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   xfs_get_blocks);
 }
 
 STATIC sector_t
@@ -1539,8 +1544,8 @@ const struct address_space_operations xf
.sync_page  = block_sync_page,
.releasepage= xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
-   .prepare_write  = xfs_vm_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= xfs_vm_write_begin,
+   .write_end  = generic_write_end,
.bmap   = xfs_vm_bmap,
.direct_IO  = xfs_vm_direct_IO,
.migratepage= buffer_migrate_page,
Index: linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_lrw.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
@@ -134,45 +134,34 @@ xfs_iozero(
loff_t  pos,    /* offset in file   */
size_t  count)  /* size of data to zero */
 {
-   unsigned bytes;
struct page *page;
struct address_space *mapping;
int status;
 
mapping = ip->i_mapping;
do {
-   unsigned long index, offset;
+   unsigned offset, bytes;
+   void *fsdata;
 
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-   index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;
 
-   status = -ENOMEM;
-   page = grab_cache_page(mapping, index);
-   if (!page)
-   break;
-
-   status = mapping->a_ops->prepare_write(NULL, page, offset,
-   offset + bytes);
+   status = pagecache_write_begin(NULL, mapping, pos, bytes,
+   AOP_FLAG_UNINTERRUPTIBLE,
+   &page, &fsdata);
if (status)
-   goto unlock;
+   break;
 
zero_user_page(page, offset, bytes, KM_USER0);
 
-   status = mapping->a_ops->commit_write(NULL, page, offset,
-   offset + bytes);
-   if (!status) {
-   pos += bytes;
-   count -= bytes;
-   }
-
-unlock:
-   unlock_page(page);
-   page_cache_release(page);
-   if (status)
-   break;
+   status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
+   page, fsdata);
+   WARN_ON(status <= 0); /* can't return less than zero! */
+   pos += bytes;
+   count -= bytes;
+   status = 0;
} while (count);
 
return (-status);
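The converted xfs_iozero() loop still walks the range one page-bounded chunk at a time: the offset within the page, then at most the rest of that page, with `pos` and `count` advanced after each pagecache_write_begin()/zero/pagecache_write_end() round. A sketch of just the chunking arithmetic, assuming a 4096-byte page; `split_by_page()` is a hypothetical helper that records the chunks instead of zeroing them.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096ULL   /* assumed page size */

/*
 * Split [pos, pos + count) into per-page chunks. Records up to max chunks
 * in chunk_pos[]/chunk_len[] and returns the total number of chunks.
 */
static size_t split_by_page(unsigned long long pos, size_t count,
                            unsigned long long chunk_pos[], size_t chunk_len[],
                            size_t max)
{
    size_t n = 0;

    while (count) {
        size_t offset = (size_t)(pos & (TOY_PAGE_SIZE - 1));  /* within page */
        size_t bytes = (size_t)(TOY_PAGE_SIZE - offset);

        if (bytes > count)
            bytes = count;
        if (n < max) {
            chunk_pos[n] = pos;
            chunk_len[n] = bytes;
        }
        n++;
        /* xfs_iozero would write_begin, zero_user_page and write_end here */
        pos += bytes;
        count -= bytes;
    }
    return n;
}
```

Zeroing 9000 bytes from offset 1000 yields a 3096-byte head, one full page, and an 1808-byte tail.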

-- 


[patch 14/41] implement simple fs aops

2007-05-14 Thread npiggin

Implement new aops for some of the simpler filesystems.
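The calling convention these conversions adopt can be sketched with a toy in-memory filesystem: write_begin() is given a file position and length and hands back the page through a pointer; the caller copies; write_end() is told how many bytes were copied, extends i_size when the copy went past it (as shmem_write_end() does), and returns that count. All `toy_*` names are hypothetical, and locking, reference counting and dirty tracking are left out; errors are simplified to -1.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_MAX_PAGES 8

/* Hypothetical in-memory file: a tiny pagecache plus i_size. */
struct toy_mapping {
    char *pages[TOY_MAX_PAGES];
    unsigned long long i_size;
};

/* write_begin: find or create the page covering pos, return it via *pagep. */
static int toy_write_begin(struct toy_mapping *m, unsigned long long pos,
                           unsigned len, char **pagep)
{
    unsigned long index = (unsigned long)(pos / TOY_PAGE_SIZE);

    (void)len;
    if (index >= TOY_MAX_PAGES)
        return -1;
    if (!m->pages[index]) {
        m->pages[index] = calloc(1, TOY_PAGE_SIZE);   /* new page is zeroed */
        if (!m->pages[index])
            return -1;
    }
    *pagep = m->pages[index];
    return 0;
}

/* write_end: extend i_size if the copy went past it; return bytes copied. */
static int toy_write_end(struct toy_mapping *m, unsigned long long pos,
                         unsigned len, unsigned copied, char *page)
{
    (void)len;
    (void)page;
    if (pos + copied > m->i_size)
        m->i_size = pos + copied;
    return (int)copied;
}

/* Caller side: the per-page loop a generic write path drives. */
static long toy_write(struct toy_mapping *m, unsigned long long pos,
                      const char *buf, size_t count)
{
    long written = 0;

    while (count) {
        unsigned offset = (unsigned)(pos % TOY_PAGE_SIZE);
        unsigned bytes = (unsigned)(TOY_PAGE_SIZE - offset);
        char *page;
        int ret;

        if (bytes > count)
            bytes = (unsigned)count;
        ret = toy_write_begin(m, pos, bytes, &page);
        if (ret)
            return written ? written : ret;
        memcpy(page + offset, buf, bytes);
        ret = toy_write_end(m, pos, bytes, bytes, page);
        pos += ret;
        buf += ret;
        count -= ret;
        written += ret;
    }
    return written;
}
```

Because write_end() receives both `len` and `copied`, a filesystem can tell a short copy from a full one, which prepare_write()/commit_write() could not express.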

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/configfs/inode.c   |4 ++--
 fs/hugetlbfs/inode.c  |   16 ++--
 fs/ramfs/file-mmu.c   |4 ++--
 fs/ramfs/file-nommu.c |4 ++--
 fs/sysfs/inode.c  |4 ++--
 mm/shmem.c|   35 ---
 6 files changed, 46 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/shmem.c
===
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1109,7 +1109,7 @@ static int shmem_getpage(struct inode *i
 * Normally, filepage is NULL on entry, and either found
 * uptodate immediately, or allocated and zeroed, or read
 * in under swappage, which is then assigned to filepage.
-* But shmem_prepare_write passes in a locked filepage,
+* But shmem_write_begin passes in a locked filepage,
 * which may be found not uptodate by other callers too,
 * and may need to be copied from the swappage read in.
 */
@@ -1454,14 +1454,35 @@ static const struct inode_operations shm
 static const struct inode_operations shmem_symlink_inline_operations;
 
 /*
- * Normally tmpfs makes no use of shmem_prepare_write, but it
+ * Normally tmpfs makes no use of shmem_write_begin, but it
  * lets a tmpfs file be used read-write below the loop driver.
  */
 static int
-shmem_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+shmem_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct inode *inode = mapping->host;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   *pagep = NULL;
+   return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+}
+
+static int
+shmem_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   struct inode *inode = page->mapping->host;
-   return shmem_getpage(inode, page->index, &page, SGP_WRITE, NULL);
+   struct inode *inode = mapping->host;
+
+   set_page_dirty(page);
+   mark_page_accessed(page);
+   page_cache_release(page);
+
+   if (pos+copied > inode->i_size)
+   i_size_write(inode, pos+copied);
+
+   return copied;
 }
 
 static ssize_t
@@ -2357,8 +2378,8 @@ static const struct address_space_operat
.writepage  = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS
-   .prepare_write  = shmem_prepare_write,
-   .commit_write   = simple_commit_write,
+   .write_begin= shmem_write_begin,
+   .write_end  = shmem_write_end,
 #endif
.migratepage= migrate_page,
 };
Index: linux-2.6/fs/configfs/inode.c
===
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -40,8 +40,8 @@ extern struct super_block * configfs_sb;
 
 static const struct address_space_operations configfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
 };
 
 static struct backing_dev_info configfs_backing_dev_info = {
Index: linux-2.6/fs/sysfs/inode.c
===
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -20,8 +20,8 @@ extern struct super_block * sysfs_sb;
 
 static const struct address_space_operations sysfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
 };
 
 static struct backing_dev_info sysfs_backing_dev_info = {
Index: linux-2.6/fs/ramfs/file-mmu.c
===
--- linux-2.6.orig/fs/ramfs/file-mmu.c
+++ linux-2.6/fs/ramfs/file-mmu.c
@@ -29,8 +29,8 @@
 
 const struct address_space_operations ramfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write,
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
.set_page_dirty = __set_page_dirty_no_writeback,
 };
 
Index: linux-2.6/fs/ramfs/file-nommu.c
===
--- linux-2.6.orig/fs/ramfs/file-nommu.c
+++ linux-2.6/fs/ramfs/file-nommu.c
@@ -29,8 +29,8 @@ static int ramfs_nommu_setattr(struct de
 
 const struct address_space_operations

[patch 26/41] hpfs convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hpfs/file.c |   20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/hpfs/file.c
===
--- linux-2.6.orig/fs/hpfs/file.c
+++ linux-2.6/fs/hpfs/file.c
@@ -86,25 +86,33 @@ static int hpfs_writepage(struct page *p
 {
return block_write_full_page(page,hpfs_get_block, wbc);
 }
+
 static int hpfs_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,hpfs_get_block);
 }
-static int hpfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   return cont_prepare_write(page,from,to,hpfs_get_block,
-   &hpfs_i(page->mapping->host)->mmu_private);
+
+static int hpfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   hpfs_get_block,
+   &hpfs_i(mapping->host)->mmu_private);
 }
+
 static sector_t _hpfs_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,hpfs_get_block);
 }
+
 const struct address_space_operations hpfs_aops = {
.readpage = hpfs_readpage,
.writepage = hpfs_writepage,
.sync_page = block_sync_page,
-   .prepare_write = hpfs_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = hpfs_write_begin,
+   .write_end = generic_write_end,
.bmap = _hpfs_bmap
 };
 

-- 


[patch 17/41] ext2 convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ext2/dir.c   |   47 +--
 fs/ext2/ext2.h  |3 +++
 fs/ext2/inode.c |   24 +---
 3 files changed, 45 insertions(+), 29 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -726,18 +726,21 @@ ext2_readpages(struct file *file, struct
return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
 }
 
-static int
-ext2_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+int __ext2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,ext2_get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext2_get_block);
 }
 
 static int
-ext2_nobh_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+ext2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return nobh_prepare_write(page,from,to,ext2_get_block);
+   *pagep = NULL;
+   return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
 }
 
 static int ext2_nobh_writepage(struct page *page,
@@ -773,8 +776,8 @@ const struct address_space_operations ex
.readpages  = ext2_readpages,
.writepage  = ext2_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = ext2_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= ext2_write_begin,
+   .write_end  = generic_write_end,
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
@@ -791,8 +794,7 @@ const struct address_space_operations ex
.readpages  = ext2_readpages,
.writepage  = ext2_nobh_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = ext2_nobh_prepare_write,
-   .commit_write   = nobh_commit_write,
+   /* XXX: todo */
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
Index: linux-2.6/fs/ext2/dir.c
===
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -22,6 +22,7 @@
  */
 
 #include "ext2.h"
+#include <linux/buffer_head.h>
 #include <linux/pagemap.h>
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
@@ -61,12 +62,14 @@ ext2_last_byte(struct inode *inode, unsi
return last_byte;
 }
 
-static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
+
dir->i_version++;
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -412,16 +415,18 @@ ino_t ext2_inode_by_name(struct inode * 
 void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
struct page *page, struct inode *inode)
 {
-   unsigned from = (char *) de - (char *) page_address(page);
-   unsigned to = from + le16_to_cpu(de->rec_len);
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char *) de - (char *) page_address(page);
+   unsigned len = le16_to_cpu(de->rec_len);
int err;
 
lock_page(page);
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __ext2_write_begin(NULL, page->mapping, pos, len,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
BUG_ON(err);
de->inode = cpu_to_le32(inode->i_ino);
-   ext2_set_de_type (de, inode);
-   err = ext2_commit_chunk(page, from, to);
+   ext2_set_de_type(de, inode);
+   err = ext2_commit_chunk(page, pos, len);
ext2_put_page(page);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
@@ -444,7 +449,7 @@ int ext2_add_link (struct dentry *dentry
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
-

[patch 19/41] ext4 convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Convert ext4 to use write_begin()/write_end() methods.

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext4/inode.c |  147 +++-
 1 file changed, 93 insertions(+), 54 deletions(-)

Index: linux-2.6/fs/ext4/inode.c
===
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1146,34 +1146,50 @@ static int do_journal_get_write_access(h
return ext4_journal_get_write_access(handle, bh);
 }
 
-static int ext4_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = mapping->host;
int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
handle_t *handle;
int retries = 0;
+   struct page *page;
+   pgoff_t index;
+   unsigned from, to;
+
+   index = pos >> PAGE_CACHE_SHIFT;
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
 
 retry:
-   handle = ext4_journal_start(inode, needed_blocks);
-   if (IS_ERR(handle)) {
-   ret = PTR_ERR(handle);
-   goto out;
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
+   handle = ext4_journal_start(inode, needed_blocks);
+   if (IS_ERR(handle)) {
+   unlock_page(page);
+   page_cache_release(page);
+   ret = PTR_ERR(handle);
+   goto out;
}
-   if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-   ret = nobh_prepare_write(page, from, to, ext4_get_block);
-   else
-   ret = block_prepare_write(page, from, to, ext4_get_block);
-   if (ret)
-   goto prepare_write_failed;
 
-   if (ext4_should_journal_data(inode)) {
+   ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext4_get_block);
+
+   if (!ret && ext4_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
-prepare_write_failed:
-   if (ret)
+
+   if (ret) {
ext4_journal_stop(handle);
+   unlock_page(page);
+   page_cache_release(page);
+   }
+
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, retries))
goto retry;
 out:
@@ -1185,12 +1201,12 @@ int ext4_journal_dirty_data(handle_t *ha
int err = jbd2_journal_dirty_data(handle, bh);
if (err)
ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-   bh, handle,err);
+   bh, handle, err);
return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
if (!buffer_mapped(bh) || buffer_freed(bh))
return 0;
@@ -1205,78 +1221,100 @@ static int commit_write_fn(handle_t *han
  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_commit_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int ext4_ordered_write_end(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
handle_t *handle = ext4_journal_current_handle();
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = file->f_mapping->host;
+   unsigned from, to;
int ret = 0, ret2;
 
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
+
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext4_journal_dirty_data);
 
if (ret == 0) {
/*
-* generic_commit_write() will run mark_inode_dirty() if i_size
+* generic_write_end() will run mark_inode_dirty() if i_size
 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 * into that.
 */
loff_t new_i_size;
 
-   new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+

[patch 18/41] ext3 convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]


Various fixes and improvements

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext3/inode.c |  136 
 1 file changed, 88 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/ext3/inode.c
===
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1147,51 +1147,68 @@ static int do_journal_get_write_access(h
return ext3_journal_get_write_access(handle, bh);
 }
 
-static int ext3_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ext3_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = mapping->host;
int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
handle_t *handle;
int retries = 0;
+   struct page *page;
+   pgoff_t index;
+   unsigned from, to;
+
+   index = pos >> PAGE_CACHE_SHIFT;
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
 
 retry:
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
handle = ext3_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
+   unlock_page(page);
+   page_cache_release(page);
ret = PTR_ERR(handle);
goto out;
}
-   if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
-   ret = nobh_prepare_write(page, from, to, ext3_get_block);
-   else
-   ret = block_prepare_write(page, from, to, ext3_get_block);
+   ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext3_get_block);
if (ret)
-   goto prepare_write_failed;
+   goto write_begin_failed;
 
if (ext3_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
-prepare_write_failed:
-   if (ret)
+write_begin_failed:
+   if (ret) {
ext3_journal_stop(handle);
+   unlock_page(page);
+   page_cache_release(page);
+   }
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, retries))
goto retry;
 out:
return ret;
 }
 
+
 int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
 {
int err = journal_dirty_data(handle, bh);
if (err)
ext3_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-   bh, handle,err);
+   bh, handle, err);
return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
if (!buffer_mapped(bh) || buffer_freed(bh))
return 0;
@@ -1206,78 +1223,100 @@ static int commit_write_fn(handle_t *han
  * ext3 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext3_ordered_commit_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int ext3_ordered_write_end(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
handle_t *handle = ext3_journal_current_handle();
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = file->f_mapping->host;
+   unsigned from, to;
int ret = 0, ret2;
 
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
+
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext3_journal_dirty_data);
 
if (ret == 0) {
/*
-* generic_commit_write() will run mark_inode_dirty() if i_size
+* generic_write_end() will run mark_inode_dirty() if i_size
 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 * into that.
 */
loff_t new_i_size;
 
-   new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+   new_i_size = pos + copied;
if (new_i_size > EXT3_I(inode)->i_disksize)

[patch 16/41] rd convert to new aops.

2007-05-14 Thread npiggin

Also clean up various little things.

I've removed akpm's comment: now that make_page_uptodate is only called
from two places, it is easy to see that the buffers are in an uptodate
state at the time of the call. It was actually fine before this patch as
well, because the memset is equivalent to a read from disk; the difference
is that it is now explicit where the updates come from.

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 drivers/block/rd.c |  125 ++---
 1 file changed, 73 insertions(+), 52 deletions(-)

Index: linux-2.6/drivers/block/rd.c
===
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -104,50 +104,60 @@ static void make_page_uptodate(struct pa
struct buffer_head *head = bh;
 
do {
-   if (!buffer_uptodate(bh)) {
-   memset(bh->b_data, 0, bh->b_size);
-   /*
-* akpm: I'm totally undecided about this.  The
-* buffer has just been magically brought up to
-* date, but nobody should want to be reading
-* it anyway, because it hasn't been used for
-* anything yet.  It is still in a not read
-* from disk yet state.
-*
-* But non-uptodate buffers against an uptodate
-* page are against the rules.  So do it anyway.
-*/
+   if (!buffer_uptodate(bh))
 set_buffer_uptodate(bh);
-   }
} while ((bh = bh->b_this_page) != head);
-   } else {
-   memset(page_address(page), 0, PAGE_CACHE_SIZE);
}
-   flush_dcache_page(page);
SetPageUptodate(page);
 }
 
 static int ramdisk_readpage(struct file *file, struct page *page)
 {
-   if (!PageUptodate(page))
+   if (!PageUptodate(page)) {
+   memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
make_page_uptodate(page);
+   }
unlock_page(page);
return 0;
 }
 
-static int ramdisk_prepare_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
-{
-   if (!PageUptodate(page))
-   make_page_uptodate(page);
+static int ramdisk_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct page *page;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
return 0;
 }
 
-static int ramdisk_commit_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
-{
+static int ramdisk_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
+{
+   if (!PageUptodate(page)) {
+   if (copied != PAGE_CACHE_SIZE) {
+   void *dst;
+   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+   unsigned to = from + copied;
+
+   dst = kmap_atomic(page, KM_USER0);
+   memset(dst, 0, from);
+   memset(dst + to, 0, PAGE_CACHE_SIZE - to);
+   flush_dcache_page(page);
+   kunmap_atomic(dst, KM_USER0);
+   }
+   make_page_uptodate(page);
+   }
+
set_page_dirty(page);
-   return 0;
+   unlock_page(page);
+   page_cache_release(page);
+   return copied;
 }
 
 /*
@@ -191,8 +201,8 @@ static int ramdisk_set_page_dirty(struct
 
 static const struct address_space_operations ramdisk_aops = {
.readpage   = ramdisk_readpage,
-   .prepare_write  = ramdisk_prepare_write,
-   .commit_write   = ramdisk_commit_write,
+   .write_begin= ramdisk_write_begin,
+   .write_end  = ramdisk_write_end,
.writepage  = ramdisk_writepage,
.set_page_dirty = ramdisk_set_page_dirty,
.writepages = ramdisk_writepages,
@@ -201,13 +211,14 @@ static const struct address_space_operat
 static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
struct address_space *mapping)
 {
-   pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9);
+   loff_t pos = sector << 9;
unsigned int vec_offset =

[patch 08/41] mm: write iovec cleanup

2007-05-14 Thread npiggin


Move some of the open-coded nr_segs tests into the iovec helpers. This
simplifies generic_file_buffered_write, which becomes more complex in the
next patch.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   36 +--
 mm/filemap.h |  104 +++
 mm/filemap_xip.c |   17 +++-
 3 files changed, 69 insertions(+), 88 deletions(-)

Index: linux-2.6/mm/filemap.h
===
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic(
 
 /*
  * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero).  A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * were successfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
  */
 static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
-   const char __user *buf, unsigned bytes)
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
+   const struct iovec *iov, unsigned long nr_segs,
+   size_t base, size_t bytes)
 {
char *kaddr;
-   int left;
+   size_t copied;
 
kaddr = kmap_atomic(page, KM_USER0);
-   left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
+   if (likely(nr_segs == 1)) {
+   int left;
+   char __user *buf = iov->iov_base + base;
+   left = __copy_from_user_inatomic_nocache(kaddr + offset,
+   buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+   iov, base, bytes);
+   }
kunmap_atomic(kaddr, KM_USER0);
 
-   if (left != 0) {
-   /* Do it the slow way */
-   kaddr = kmap(page);
-   left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
-   kunmap(page);
-   }
-   return bytes - left;
+   return copied;
 }
 
 /*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
+ * This has the same side effects and return value as
+ * filemap_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
  */
 static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
-   const struct iovec *iov, size_t base, size_t bytes)
+filemap_copy_from_user(struct page *page, unsigned long offset,
+   const struct iovec *iov, unsigned long nr_segs,
+size_t base, size_t bytes)
 {
char *kaddr;
size_t copied;
 
-   kaddr = kmap_atomic(page, KM_USER0);
-   copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-base, bytes);
-   kunmap_atomic(kaddr, KM_USER0);
-   if (copied != bytes) {
-   kaddr = kmap(page);
-   copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-base, bytes);
-   if (bytes - copied)
-   memset(kaddr + offset + copied, 0, bytes - copied);
-   kunmap(page);
+   kaddr = kmap(page);
+   if (likely(nr_segs == 1)) {
+   int left;
+   char __user *buf = iov->iov_base + base;
+   left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+   iov, base, bytes);
}
+   kunmap(page);
return copied;
 }
 
 static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)

[patch 10/41] mm: buffered write iterator

2007-05-14 Thread npiggin


Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 include/linux/fs.h |   33 
 mm/filemap.c   |  144 +++--
 mm/filemap.h   |  103 -
 3 files changed, 150 insertions(+), 130 deletions(-)

Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -404,6 +404,39 @@ struct page;
 struct address_space;
 struct writeback_control;
 
+struct iov_iter {
+   const struct iovec *iov;
+   unsigned long nr_segs;
+   size_t iov_offset;
+   size_t count;
+};
+
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_from_user(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes);
+void iov_iter_advance(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i);
+size_t iov_iter_single_seg_count(struct iov_iter *i);
+
+static inline void iov_iter_init(struct iov_iter *i,
+   const struct iovec *iov, unsigned long nr_segs,
+   size_t count, size_t written)
+{
+   i->iov = iov;
+   i->nr_segs = nr_segs;
+   i->iov_offset = 0;
+   i->count = count + written;
+
+   iov_iter_advance(i, written);
+}
+
+static inline size_t iov_iter_count(struct iov_iter *i)
+{
+   return i->count;
+}
+
+
 struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
-#include "filemap.h"
+#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include "internal.h"
 
 /*
@@ -1696,8 +1696,7 @@ int remove_suid(struct dentry *dentry)
 }
 EXPORT_SYMBOL(remove_suid);
 
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
 {
size_t copied = 0, left = 0;
@@ -1720,6 +1719,110 @@ __filemap_copy_from_user_iovec_inatomic(
 }
 
 /*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were successfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+   char *kaddr;
+   size_t copied;
+
+   BUG_ON(!in_atomic());
+   kaddr = kmap_atomic(page, KM_USER0);
+   if (likely(i->nr_segs == 1)) {
+   int left;
+   char __user *buf = i->iov->iov_base + i->iov_offset;
+   left = __copy_from_user_inatomic_nocache(kaddr + offset,
+   buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+   i->iov, i->iov_offset, bytes);
+   }
+   kunmap_atomic(kaddr, KM_USER0);
+
+   return copied;
+}
+
+/*
+ * This has the same side effects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+   char *kaddr;
+   size_t copied;
+
+   kaddr = kmap(page);
+   if (likely(i->nr_segs == 1)) {
+   int left;
+   char __user *buf = i->iov->iov_base + i->iov_offset;
+   left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+   i->iov, i->iov_offset, bytes);
+   }
+   kunmap(page);
+   return copied;
+}
+
+static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
+{
+   if (likely(i->nr_segs == 1)) {
+   i->iov_offset += bytes;
+   } else {
+   const struct iovec *iov = i->iov;
+   size_t base = i->iov_offset;
+
+   while (bytes) {
+   int copy = min(bytes, iov->iov_len - base);
+
+   bytes -= copy;
+

[patch 32/41] smb convert to new aops.

2007-05-14 Thread npiggin

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/smbfs/file.c |   34 +-
 1 file changed, 25 insertions(+), 9 deletions(-)

Index: linux-2.6/fs/smbfs/file.c
===
--- linux-2.6.orig/fs/smbfs/file.c
+++ linux-2.6/fs/smbfs/file.c
@@ -290,29 +290,45 @@ out:
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int smb_prepare_write(struct file *file, struct page *page, 
-unsigned offset, unsigned to)
-{
+static int smb_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
return 0;
 }
 
-static int smb_commit_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
+static int smb_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
int status;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 
-   status = -EFAULT;
lock_kernel();
-   status = smb_updatepage(file, page, offset, to-offset);
+   status = smb_updatepage(file, page, offset, copied);
unlock_kernel();
+
+   if (!status) {
+   if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+   SetPageUptodate(page);
+   status = copied;
+   }
+
+   unlock_page(page);
+   page_cache_release(page);
+
return status;
 }
 
 const struct address_space_operations smb_file_aops = {
.readpage = smb_readpage,
.writepage = smb_writepage,
-   .prepare_write = smb_prepare_write,
-   .commit_write = smb_commit_write
+   .write_begin = smb_write_begin,
+   .write_end = smb_write_end,
 };
 
 /* 

-- 


[patch 33/41] GFS2 convert to new aops.

2007-05-14 Thread npiggin

From: Steven Whitehouse [EMAIL PROTECTED]

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Steven Whitehouse [EMAIL PROTECTED]

 fs/gfs2/ops_address.c |  209 +-
 1 file changed, 125 insertions(+), 84 deletions(-)

Index: linux-2.6/fs/gfs2/ops_address.c
===
--- linux-2.6.orig/fs/gfs2/ops_address.c
+++ linux-2.6/fs/gfs2/ops_address.c
@@ -17,6 +17,7 @@
 #include <linux/mpage.h>
 #include <linux/fs.h>
 #include <linux/writeback.h>
+#include <linux/swap.h>
 #include <linux/gfs2_ondisk.h>
 #include <linux/lm_interface.h>
 
@@ -348,45 +349,49 @@ out_unlock:
 }
 
 /**
- * gfs2_prepare_write - Prepare to write a page to a file
+ * gfs2_write_begin - Begin to write to a file
  * @file: The file to write to
- * @page: The page which is to be prepared for writing
- * @from: From (byte range within page)
- * @to: To (byte range within page)
+ * @mapping: The mapping in which to write
+ * @pos: The file offset at which to start writing
+ * @len: Length of the write
+ * @flags: Various flags
+ * @pagep: Pointer to return the page
+ * @fsdata: Pointer to return fs data (unused by GFS2)
  *
  * Returns: errno
  */
 
-static int gfs2_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int gfs2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct gfs2_inode *ip = GFS2_I(page->mapping->host);
-   struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
+   struct gfs2_inode *ip = GFS2_I(mapping->host);
+   struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
unsigned int data_blocks, ind_blocks, rblocks;
int alloc_required;
int error = 0;
-   loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + from;
-   loff_t end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
struct gfs2_alloc *al;
-   unsigned int write_len = to - from;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+   unsigned to = from + len;
+   struct page *page;
 
-
-   gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME|LM_FLAG_TRY_1CB, &ip->i_gh);
+   gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME, &ip->i_gh);
error = gfs2_glock_nq_atime(&ip->i_gh);
-   if (unlikely(error)) {
-   if (error == GLR_TRYFAILED) {
-   unlock_page(page);
-   error = AOP_TRUNCATED_PAGE;
-   yield();
-   }
+   if (unlikely(error))
goto out_uninit;
-   }
 
-   gfs2_write_calc_reserv(ip, write_len, &data_blocks, &ind_blocks);
+   error = -ENOMEM;
+   page = __grab_cache_page(mapping, index);
+   *pagep = page;
+   if (!page)
+   goto out_unlock;
+
+   gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks);
 
-   error = gfs2_write_alloc_required(ip, pos, write_len, &alloc_required);
+   error = gfs2_write_alloc_required(ip, pos, len, &alloc_required);
if (error)
-   goto out_unlock;
+   goto out_putpage;
 
 
ip->i_alloc.al_requested = 0;
@@ -418,7 +423,7 @@ static int gfs2_prepare_write(struct fil
goto out;
 
if (gfs2_is_stuffed(ip)) {
-   if (end > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
+   if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
error = gfs2_unstuff_dinode(ip, page);
if (error == 0)
goto prepare_write;
@@ -440,6 +445,10 @@ out_qunlock:
 out_alloc_put:
gfs2_alloc_put(ip);
}
+out_putpage:
+   page_cache_release(page);
+   if (pos + len > ip->i_inode.i_size)
+   vmtruncate(&ip->i_inode, ip->i_inode.i_size);
 out_unlock:
gfs2_glock_dq_m(1, &ip->i_gh);
 out_uninit:
@@ -450,96 +459,128 @@ out_uninit:
 }
 
 /**
- * gfs2_commit_write - Commit write to a file
+ * gfs2_stuffed_write_end - Write end for stuffed files
+ * @inode: The inode
+ * @dibh: The buffer_head containing the on-disk inode
+ * @pos: The file position
+ * @len: The length of the write
+ * @copied: How much was actually copied by the VFS
+ * @page: The page
+ *
+ * This copies the data from the page into the inode block after
+ * the inode data structure itself.
+ *
+ * Returns: errno
+ */
+static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page)
+{
+   struct gfs2_inode *ip = GFS2_I(inode);
+   struct gfs2_sbd *sdp = GFS2_SB(inode);
+   u64 to = pos + copied;
+

[patch 37/41] ufs convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ufs/dir.c   |   50 +++---
 fs/ufs/inode.c |   23 +++
 2 files changed, 50 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/ufs/inode.c
===
--- linux-2.6.orig/fs/ufs/inode.c
+++ linux-2.6/fs/ufs/inode.c
@@ -558,24 +558,39 @@ static int ufs_writepage(struct page *pa
 {
return block_write_full_page(page,ufs_getfrag_block,wbc);
 }
+
 static int ufs_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,ufs_getfrag_block);
 }
-static int ufs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __ufs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,ufs_getfrag_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ufs_getfrag_block);
 }
+
+static int ufs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return __ufs_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t ufs_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,ufs_getfrag_block);
 }
+
 const struct address_space_operations ufs_aops = {
.readpage = ufs_readpage,
.writepage = ufs_writepage,
.sync_page = block_sync_page,
-   .prepare_write = ufs_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = ufs_write_begin,
+   .write_end = generic_write_end,
.bmap = ufs_bmap
 };
 
Index: linux-2.6/fs/ufs/dir.c
===
--- linux-2.6.orig/fs/ufs/dir.c
+++ linux-2.6/fs/ufs/dir.c
@@ -38,12 +38,14 @@ static inline int ufs_match(struct super
return !memcmp(name, de->d_name, len);
 }
 
-static int ufs_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
+
dir->i_version++;
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -81,16 +83,20 @@ ino_t ufs_inode_by_name(struct inode *di
 void ufs_set_link(struct inode *dir, struct ufs_dir_entry *de,
  struct page *page, struct inode *inode)
 {
-   unsigned from = (char *) de - (char *) page_address(page);
-   unsigned to = from + fs16_to_cpu(dir->i_sb, de->d_reclen);
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char *) de - (char *) page_address(page);
+   unsigned len = fs16_to_cpu(dir->i_sb, de->d_reclen);
int err;
 
lock_page(page);
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __ufs_write_begin(NULL, page->mapping, pos, len,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
BUG_ON(err);
+
de->d_ino = cpu_to_fs32(dir->i_sb, inode->i_ino);
ufs_set_de_type(dir->i_sb, de, inode->i_mode);
-   err = ufs_commit_chunk(page, from, to);
+
+   err = ufs_commit_chunk(page, pos, len);
ufs_put_page(page);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
@@ -312,7 +318,7 @@ int ufs_add_link(struct dentry *dentry, 
unsigned long npages = ufs_dir_pages(dir);
unsigned long n;
char *kaddr;
-   unsigned from, to;
+   loff_t pos;
int err;
 
UFSD("ENTER, name %s, namelen %u\n", name, namelen);
@@ -367,9 +373,10 @@ int ufs_add_link(struct dentry *dentry, 
return -EINVAL;
 
 got_it:
-   from = (char*)de - (char*)page_address(page);
-   to = from + rec_len;
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char*)de - (char*)page_address(page);
+   err = __ufs_write_begin(NULL, page->mapping, pos, rec_len,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err)
goto out_unlock;
if (de->d_ino) {
@@ -386,7 +393,7 @@ got_it:
de->d_ino = cpu_to_fs32(sb, inode->i_ino);
ufs_set_de_type(sb, de, inode->i_mode);
 
-   err = ufs_commit_chunk(page, from,

[patch 40/41] minix convert to new aops.

2007-05-14 Thread npiggin

Cc: Andries Brouwer [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/minix/dir.c   |   43 +--
 fs/minix/inode.c |   23 +++
 2 files changed, 44 insertions(+), 22 deletions(-)

Index: linux-2.6/fs/minix/inode.c
===
--- linux-2.6.orig/fs/minix/inode.c
+++ linux-2.6/fs/minix/inode.c
@@ -347,24 +347,39 @@ static int minix_writepage(struct page *
 {
return block_write_full_page(page, minix_get_block, wbc);
 }
+
 static int minix_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,minix_get_block);
 }
-static int minix_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __minix_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,minix_get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   minix_get_block);
 }
+
+static int minix_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return __minix_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t minix_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,minix_get_block);
 }
+
 static const struct address_space_operations minix_aops = {
.readpage = minix_readpage,
.writepage = minix_writepage,
.sync_page = block_sync_page,
-   .prepare_write = minix_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = minix_write_begin,
+   .write_end = generic_write_end,
.bmap = minix_bmap
 };
 
Index: linux-2.6/fs/minix/dir.c
===
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -9,6 +9,7 @@
  */
 
 #include "minix.h"
+#include <linux/buffer_head.h>
 #include <linux/highmem.h>
 #include <linux/smp_lock.h>
 
@@ -48,11 +49,12 @@ static inline unsigned long dir_pages(st
return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = (struct inode *)page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -220,7 +222,7 @@ int minix_add_link(struct dentry *dentry
char *kaddr, *p;
minix_dirent *de;
minix3_dirent *de3;
-   unsigned from, to;
+   loff_t pos;
int err;
char *namx = NULL;
__u32 inumber;
@@ -272,9 +274,9 @@ int minix_add_link(struct dentry *dentry
return -EINVAL;
 
 got_it:
-   from = p - (char*)page_address(page);
-   to = from + sbi->s_dirsize;
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   pos = (page->index << PAGE_CACHE_SHIFT) + p - (char*)page_address(page);
+   err = __minix_write_begin(NULL, page->mapping, pos, sbi->s_dirsize,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err)
goto out_unlock;
memcpy (namx, name, namelen);
@@ -285,7 +287,7 @@ got_it:
memset (namx + namelen, 0, sbi->s_dirsize - namelen - 2);
de->inode = inode->i_ino;
}
-   err = dir_commit_chunk(page, from, to);
+   err = dir_commit_chunk(page, pos, sbi->s_dirsize);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
 out_put:
@@ -302,15 +304,16 @@ int minix_delete_entry(struct minix_dir_
struct address_space *mapping = page->mapping;
struct inode *inode = (struct inode*)mapping->host;
char *kaddr = page_address(page);
-   unsigned from = (char*)de - kaddr;
-   unsigned to = from + minix_sb(inode->i_sb)->s_dirsize;
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char*)de - kaddr;
+   unsigned len = minix_sb(inode->i_sb)->s_dirsize;
int err;
 
lock_page(page);
-   err = mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __minix_write_begin(NULL, mapping, pos, len,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err == 0) {
de->inode = 0;
-

[patch 28/41] qnx4 convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/qnx4/inode.c |   21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/qnx4/inode.c
===
--- linux-2.6.orig/fs/qnx4/inode.c
+++ linux-2.6/fs/qnx4/inode.c
@@ -433,16 +433,21 @@ static int qnx4_writepage(struct page *p
 {
return block_write_full_page(page,qnx4_get_block, wbc);
 }
+
 static int qnx4_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,qnx4_get_block);
 }
-static int qnx4_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
-{
-   struct qnx4_inode_info *qnx4_inode = qnx4_i(page->mapping->host);
-   return cont_prepare_write(page, from, to, qnx4_get_block,
- &qnx4_inode->mmu_private);
+
+static int qnx4_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct qnx4_inode_info *qnx4_inode = qnx4_i(mapping->host);
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   qnx4_get_block,
+   &qnx4_inode->mmu_private);
 }
 static sector_t qnx4_bmap(struct address_space *mapping, sector_t block)
 {
@@ -452,8 +457,8 @@ static const struct address_space_operat
.readpage   = qnx4_readpage,
.writepage  = qnx4_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = qnx4_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= qnx4_write_begin,
+   .write_end  = generic_write_end,
.bmap   = qnx4_bmap
 };
 

[patch 35/41] hostfs convert to new aops.

2007-05-14 Thread npiggin

This also gets rid of a lot of useless read_file stuff, and optimises
the full-page write case by marking a !uptodate page uptodate.

Cc: Jeff Dike [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hostfs/hostfs_kern.c |   70 +++-
 1 file changed, 28 insertions(+), 42 deletions(-)

Index: linux-2.6/fs/hostfs/hostfs_kern.c
===
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c
+++ linux-2.6/fs/hostfs/hostfs_kern.c
@@ -466,56 +466,42 @@ int hostfs_readpage(struct file *file, s
return err;
 }
 
-int hostfs_prepare_write(struct file *file, struct page *page,
-unsigned int from, unsigned int to)
+int hostfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   char *buffer;
-   long long start, tmp;
-   int err;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-   start = (long long) page->index << PAGE_CACHE_SHIFT;
-   buffer = kmap(page);
-   if(from != 0){
-   tmp = start;
-   err = read_file(FILE_HOSTFS_I(file)->fd, &tmp, buffer,
-   from);
-   if(err < 0) goto out;
-   }
-   if(to != PAGE_CACHE_SIZE){
-   start += to;
-   err = read_file(FILE_HOSTFS_I(file)->fd, &start, buffer + to,
-   PAGE_CACHE_SIZE - to);
-   if(err < 0) goto out;
-   }
-   err = 0;
- out:
-   kunmap(page);
-   return err;
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
+   return 0;
 }
 
-int hostfs_commit_write(struct file *file, struct page *page, unsigned from,
-unsigned to)
+int hostfs_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
-   char *buffer;
-   long long start;
-   int err = 0;
+   void *buffer;
+   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+   int err;
 
-   start = (((long long) page->index) << PAGE_CACHE_SHIFT) + from;
buffer = kmap(page);
-   err = write_file(FILE_HOSTFS_I(file)->fd, &start, buffer + from,
-to - from);
-   if(err > 0) err = 0;
-
-   /* Actually, if !err, write_file has added to-from to start, so, despite
-* the appearance, we are comparing i_size against the _last_ written
-* location, as we should. */
+   err = write_file(FILE_HOSTFS_I(file)->fd, &pos, buffer + from, copied);
+   kunmap(page);
 
-   if(!err && (start > inode->i_size))
-   inode->i_size = start;
+   if (!PageUptodate(page) && err == PAGE_CACHE_SIZE)
+   SetPageUptodate(page);
+   unlock_page(page);
+   page_cache_release(page);
+
+   /* If err > 0, write_file has added err to pos, so we are comparing
+* i_size against the last byte written.
+*/
+   if (err > 0 && (pos > inode->i_size))
+   inode->i_size = pos;
 
-   kunmap(page);
return err;
 }
 
@@ -523,8 +509,8 @@ static const struct address_space_operat
.writepage  = hostfs_writepage,
.readpage   = hostfs_readpage,
.set_page_dirty = __set_page_dirty_nobuffers,
-   .prepare_write  = hostfs_prepare_write,
-   .commit_write   = hostfs_commit_write
+   .write_begin= hostfs_write_begin,
+   .write_end  = hostfs_write_end,
 };
 
 static int init_inode(struct inode *inode, struct dentry *dentry)

[patch 22/41] fat convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/fat/inode.c |   27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c
+++ linux-2.6/fs/fat/inode.c
@@ -140,19 +140,24 @@ static int fat_readpages(struct file *fi
return mpage_readpages(mapping, pages, nr_pages, fat_get_block);
 }
 
-static int fat_prepare_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int fat_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return cont_prepare_write(page, from, to, fat_get_block,
- &MSDOS_I(page->mapping->host)->mmu_private);
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   fat_get_block,
+   &MSDOS_I(mapping->host)->mmu_private);
 }
 
-static int fat_commit_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+static int fat_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *pagep, void *fsdata)
 {
-   struct inode *inode = page->mapping->host;
-   int err = generic_commit_write(file, page, from, to);
-   if (!err && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) {
+   struct inode *inode = mapping->host;
+   int err;
+   err = generic_write_end(file, mapping, pos, len, copied, pagep, fsdata);
+   if (!(err < 0) && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) {
inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
MSDOS_I(inode)->i_attrs |= ATTR_ARCH;
mark_inode_dirty(inode);
@@ -201,8 +206,8 @@ static const struct address_space_operat
.writepage  = fat_writepage,
.writepages = fat_writepages,
.sync_page  = block_sync_page,
-   .prepare_write  = fat_prepare_write,
-   .commit_write   = fat_commit_write,
+   .write_begin= fat_write_begin,
+   .write_end  = fat_write_end,
.direct_IO  = fat_direct_IO,
.bmap   = _fat_bmap
 };

[patch 21/41] fs: new cont helpers

2007-05-14 Thread npiggin

Rework the generic block cont routines to handle the new aops.
Supporting cont_prepare_write would take quite a lot of code, so remove
it instead (later patches in this series convert all of its users to
the new cont_write_begin helper).

write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
generic_cont_expand, so filesystems can avoid the old hacks they used.

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/buffer.c |  204 +---
 include/linux/buffer_head.h |5 -
 include/linux/fs.h  |1 
 mm/filemap.c|5 +
 4 files changed, 110 insertions(+), 105 deletions(-)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2133,14 +2133,14 @@ int block_read_full_page(struct page *pa
 }
 
 /* utility function for filesystems that need to do work on expanding
- * truncates.  Uses prepare/commit_write to allow the filesystem to
+ * truncates.  Uses filesystem pagecache writes to allow the filesystem to
  * deal with the hole.  
  */
-static int __generic_cont_expand(struct inode *inode, loff_t size,
-pgoff_t index, unsigned int offset)
+int generic_cont_expand_simple(struct inode *inode, loff_t size)
 {
struct address_space *mapping = inode->i_mapping;
struct page *page;
+   void *fsdata;
unsigned long limit;
int err;
 
@@ -2153,140 +2153,134 @@ static int __generic_cont_expand(struct 
if (size > inode->i_sb->s_maxbytes)
goto out;
 
-   err = -ENOMEM;
-   page = grab_cache_page(mapping, index);
-   if (!page)
-   goto out;
-   err = mapping->a_ops->prepare_write(NULL, page, offset, offset);
-   if (err) {
-   /*
-* ->prepare_write() may have instantiated a few blocks
-* outside i_size.  Trim these off again.
-*/
-   unlock_page(page);
-   page_cache_release(page);
-   vmtruncate(inode, inode->i_size);
+   err = pagecache_write_begin(NULL, mapping, size, 0,
+   AOP_FLAG_UNINTERRUPTIBLE|AOP_FLAG_CONT_EXPAND,
+   &page, &fsdata);
+   if (err)
goto out;
-   }
 
-   err = mapping->a_ops->commit_write(NULL, page, offset, offset);
+   err = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
+   BUG_ON(err > 0);
 
-   unlock_page(page);
-   page_cache_release(page);
-   if (err > 0)
-   err = 0;
 out:
return err;
 }
 
 int generic_cont_expand(struct inode *inode, loff_t size)
 {
-   pgoff_t index;
unsigned int offset;
 
offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
 
/* ugh.  in prepare/commit_write, if from==to==start of block, we
-   ** skip the prepare.  make sure we never send an offset for the start
-   ** of a block
-   */
+* skip the prepare.  make sure we never send an offset for the start
+* of a block.
+* XXX: actually, this should be handled in those filesystems by
+* checking for the AOP_FLAG_CONT_EXPAND flag.
+*/
if ((offset & (inode->i_sb->s_blocksize - 1)) == 0) {
/* caller must handle this extra byte. */
-   offset++;
+   size++;
}
-   index = size >> PAGE_CACHE_SHIFT;
-
-   return __generic_cont_expand(inode, size, index, offset);
-}
-
-int generic_cont_expand_simple(struct inode *inode, loff_t size)
-{
-   loff_t pos = size - 1;
-   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-   unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
-
-   /* prepare/commit_write can handle even if from==to==start of block. */
-   return __generic_cont_expand(inode, size, index, offset);
+   return generic_cont_expand_simple(inode, size);
 }
 
-/*
- * For moronic filesystems that do not allow holes in file.
- * We may have to extend the file.
- */
-
-int cont_prepare_write(struct page *page, unsigned offset,
-   unsigned to, get_block_t *get_block, loff_t *bytes)
+int cont_expand_zero(struct file *file, struct address_space *mapping,
+   loff_t pos, loff_t *bytes)
 {
-   struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
-   struct page *new_page;
-   pgoff_t pgpos;
-   long status;
-   unsigned zerofrom;
unsigned blocksize = 1 << inode->i_blkbits;
+   struct page *page;
+   void *fsdata;
+   pgoff_t index, curidx;
+   loff_t curpos;
+   unsigned zerofrom, offset, len;
+   int err = 0;
 
-   while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
-   status = -ENOMEM;
-   new_page = grab_cache_page(mapping, pgpos);
-

[patch 25/41] hfsplus convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hfsplus/extents.c |   21 +
 fs/hfsplus/inode.c   |   20 
 2 files changed, 21 insertions(+), 20 deletions(-)

Index: linux-2.6/fs/hfsplus/inode.c
===
--- linux-2.6.orig/fs/hfsplus/inode.c
+++ linux-2.6/fs/hfsplus/inode.c
@@ -26,10 +26,14 @@ static int hfsplus_writepage(struct page
return block_write_full_page(page, hfsplus_get_block, wbc);
 }
 
-static int hfsplus_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   return cont_prepare_write(page, from, to, hfsplus_get_block,
-   &HFSPLUS_I(page->mapping->host).phys_size);
+static int hfsplus_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   hfsplus_get_block,
+   &HFSPLUS_I(mapping->host).phys_size);
 }
 
 static sector_t hfsplus_bmap(struct address_space *mapping, sector_t block)
@@ -113,8 +117,8 @@ const struct address_space_operations hf
.readpage   = hfsplus_readpage,
.writepage  = hfsplus_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfsplus_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfsplus_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfsplus_bmap,
.releasepage= hfsplus_releasepage,
 };
@@ -123,8 +127,8 @@ const struct address_space_operations hf
.readpage   = hfsplus_readpage,
.writepage  = hfsplus_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfsplus_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfsplus_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfsplus_bmap,
.direct_IO  = hfsplus_direct_IO,
.writepages = hfsplus_writepages,
Index: linux-2.6/fs/hfsplus/extents.c
===
--- linux-2.6.orig/fs/hfsplus/extents.c
+++ linux-2.6/fs/hfsplus/extents.c
@@ -443,21 +443,18 @@ void hfsplus_file_truncate(struct inode 
if (inode->i_size > HFSPLUS_I(inode).phys_size) {
struct address_space *mapping = inode->i_mapping;
struct page *page;
-   u32 size = inode->i_size - 1;
+   void *fsdata;
+   u32 size = inode->i_size;
int res;
 
-   page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-   if (!page)
-   return;
-   size &= PAGE_CACHE_SIZE - 1;
-   size++;
-   res = mapping->a_ops->prepare_write(NULL, page, size, size);
-   if (!res)
-   res = mapping->a_ops->commit_write(NULL, page, size, size);
+   res = pagecache_write_begin(NULL, mapping, size, 0,
+   AOP_FLAG_UNINTERRUPTIBLE,
+   &page, &fsdata);
if (res)
-   inode->i_size = HFSPLUS_I(inode).phys_size;
-   unlock_page(page);
-   page_cache_release(page);
+   return;
+   res = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
+   if (res < 0)
+   return;
mark_inode_dirty(inode);
return;
} else if (inode->i_size == HFSPLUS_I(inode).phys_size)

[patch 24/41] hfs convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hfs/extent.c |   19 ---
 fs/hfs/inode.c  |   20 
 2 files changed, 20 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/hfs/inode.c
===
--- linux-2.6.orig/fs/hfs/inode.c
+++ linux-2.6/fs/hfs/inode.c
@@ -34,10 +34,14 @@ static int hfs_readpage(struct file *fil
return block_read_full_page(page, hfs_get_block);
 }
 
-static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   return cont_prepare_write(page, from, to, hfs_get_block,
- &HFS_I(page->mapping->host)->phys_size);
+static int hfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   hfs_get_block,
+   &HFS_I(mapping->host)->phys_size);
 }
 
 static sector_t hfs_bmap(struct address_space *mapping, sector_t block)
@@ -118,8 +122,8 @@ const struct address_space_operations hf
.readpage   = hfs_readpage,
.writepage  = hfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfs_bmap,
.releasepage= hfs_releasepage,
 };
@@ -128,8 +132,8 @@ const struct address_space_operations hf
.readpage   = hfs_readpage,
.writepage  = hfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfs_bmap,
.direct_IO  = hfs_direct_IO,
.writepages = hfs_writepages,
Index: linux-2.6/fs/hfs/extent.c
===
--- linux-2.6.orig/fs/hfs/extent.c
+++ linux-2.6/fs/hfs/extent.c
@@ -464,23 +464,20 @@ void hfs_file_truncate(struct inode *ino
  (long long)HFS_I(inode)->phys_size, inode->i_size);
if (inode->i_size > HFS_I(inode)->phys_size) {
struct address_space *mapping = inode->i_mapping;
+   void *fsdata;
struct page *page;
int res;
 
+   /* XXX: Can use generic_cont_expand? */
size = inode->i_size - 1;
-   page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-   if (!page)
-   return;
-   size &= PAGE_CACHE_SIZE - 1;
-   size++;
-   res = mapping->a_ops->prepare_write(NULL, page, size, size);
-   if (!res)
-   res = mapping->a_ops->commit_write(NULL, page, size, size);
+   res = pagecache_write_begin(NULL, mapping, size+1, 0,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+   if (!res) {
+   res = pagecache_write_end(NULL, mapping, size+1, 0, 0,
+   page, fsdata);
+   }
if (res)
inode->i_size = HFS_I(inode)->phys_size;
-   unlock_page(page);
-   page_cache_release(page);
-   mark_inode_dirty(inode);
return;
} else if (inode->i_size == HFS_I(inode)->phys_size)
return;

[patch 39/41] sysv convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/sysv/dir.c   |   45 +
 fs/sysv/itree.c |   23 +++
 2 files changed, 44 insertions(+), 24 deletions(-)

Index: linux-2.6/fs/sysv/itree.c
===
--- linux-2.6.orig/fs/sysv/itree.c
+++ linux-2.6/fs/sysv/itree.c
@@ -453,23 +453,38 @@ static int sysv_writepage(struct page *p
 {
return block_write_full_page(page,get_block,wbc);
 }
+
 static int sysv_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,get_block);
 }
-static int sysv_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __sysv_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   get_block);
 }
+
+static int sysv_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return __sysv_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t sysv_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,get_block);
 }
+
 const struct address_space_operations sysv_aops = {
.readpage = sysv_readpage,
.writepage = sysv_writepage,
.sync_page = block_sync_page,
-   .prepare_write = sysv_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = sysv_write_begin,
+   .write_end = generic_write_end,
.bmap = sysv_bmap
 };
Index: linux-2.6/fs/sysv/dir.c
===
--- linux-2.6.orig/fs/sysv/dir.c
+++ linux-2.6/fs/sysv/dir.c
@@ -37,12 +37,13 @@ static inline unsigned long dir_pages(st
return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = (struct inode *)page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
 
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -186,7 +187,7 @@ int sysv_add_link(struct dentry *dentry,
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
-   unsigned from, to;
+   loff_t pos;
int err;
 
/* We take care of directory expansion in the same loop */
@@ -212,16 +213,17 @@ int sysv_add_link(struct dentry *dentry,
return -EINVAL;
 
 got_it:
-   from = (char*)de - (char*)page_address(page);
-   to = from + SYSV_DIRSIZE;
+   pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char*)de - (char*)page_address(page);
lock_page(page);
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __sysv_write_begin(NULL, page->mapping, pos, SYSV_DIRSIZE,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err)
goto out_unlock;
memcpy (de->name, name, namelen);
memset (de->name + namelen, 0, SYSV_DIRSIZE - namelen - 2);
de->inode = cpu_to_fs16(SYSV_SB(inode->i_sb), inode->i_ino);
-   err = dir_commit_chunk(page, from, to);
+   err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
 out_page:
@@ -238,15 +240,15 @@ int sysv_delete_entry(struct sysv_dir_en
struct address_space *mapping = page->mapping;
struct inode *inode = (struct inode*)mapping->host;
char *kaddr = (char*)page_address(page);
-   unsigned from = (char*)de - kaddr;
-   unsigned to = from + SYSV_DIRSIZE;
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char *)de - kaddr;
int err;
 
lock_page(page);
-   err = mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __sysv_write_begin(NULL, mapping, pos, SYSV_DIRSIZE,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
BUG_ON(err);
de->inode = 0;
-   err = dir_commit_chunk(page, from, to);
+   err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
dir_put_page(page);
inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;

[patch 23/41] adfs convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/adfs/inode.c |   14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6/fs/adfs/inode.c
===
--- linux-2.6.orig/fs/adfs/inode.c
+++ linux-2.6/fs/adfs/inode.c
@@ -61,10 +61,14 @@ static int adfs_readpage(struct file *fi
return block_read_full_page(page, adfs_get_block);
 }
 
-static int adfs_prepare_write(struct file *file, struct page *page, unsigned int from, unsigned int to)
+static int adfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return cont_prepare_write(page, from, to, adfs_get_block,
-   &ADFS_I(page->mapping->host)->mmu_private);
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   adfs_get_block,
+   &ADFS_I(mapping->host)->mmu_private);
 }
 
 static sector_t _adfs_bmap(struct address_space *mapping, sector_t block)
@@ -76,8 +80,8 @@ static const struct address_space_operat
.readpage   = adfs_readpage,
.writepage  = adfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = adfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= adfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = _adfs_bmap
 };
 

[patch 36/41] jffs2 convert to new aops.

2007-05-14 Thread npiggin

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/jffs2/file.c |  105 +++-
 1 file changed, 66 insertions(+), 39 deletions(-)

Index: linux-2.6/fs/jffs2/file.c
===
--- linux-2.6.orig/fs/jffs2/file.c
+++ linux-2.6/fs/jffs2/file.c
@@ -19,10 +19,12 @@
 #include <linux/jffs2.h>
 #include "nodelist.h"
 
-static int jffs2_commit_write (struct file *filp, struct page *pg,
-  unsigned start, unsigned end);
-static int jffs2_prepare_write (struct file *filp, struct page *pg,
-   unsigned start, unsigned end);
+static int jffs2_write_end(struct file *filp, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *pg, void *fsdata);
+static int jffs2_write_begin(struct file *filp, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
 static int jffs2_readpage (struct file *filp, struct page *pg);
 
 int jffs2_fsync(struct file *filp, struct dentry *dentry, int datasync)
@@ -65,8 +67,8 @@ const struct inode_operations jffs2_file
 const struct address_space_operations jffs2_file_address_operations =
 {
.readpage = jffs2_readpage,
-   .prepare_write =jffs2_prepare_write,
-   .commit_write = jffs2_commit_write
+   .write_begin =  jffs2_write_begin,
+   .write_end =jffs2_write_end,
 };
 
 static int jffs2_do_readpage_nolock (struct inode *inode, struct page *pg)
@@ -119,15 +121,23 @@ static int jffs2_readpage (struct file *
return ret;
 }
 
-static int jffs2_prepare_write (struct file *filp, struct page *pg,
-   unsigned start, unsigned end)
+static int jffs2_write_begin(struct file *filp, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = pg->mapping->host;
+   struct page *pg;
+   struct inode *inode = mapping->host;
struct jffs2_inode_info *f = JFFS2_INODE_INFO(inode);
-   uint32_t pageofs = pg->index << PAGE_CACHE_SHIFT;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   uint32_t pageofs = pos & (PAGE_CACHE_SIZE - 1);
int ret = 0;
 
-   D1(printk(KERN_DEBUG "jffs2_prepare_write()\n"));
+   pg = __grab_cache_page(mapping, index);
+   if (!pg)
+   return -ENOMEM;
+   *pagep = pg;
+
+   D1(printk(KERN_DEBUG "jffs2_write_begin()\n"));
 
if (pageofs > inode->i_size) {
/* Make new hole frag from old EOF to new page */
@@ -142,7 +152,7 @@ static int jffs2_prepare_write (struct f
ret = jffs2_reserve_space(c, sizeof(ri), &alloc_len,
  ALLOC_NORMAL, JFFS2_SUMMARY_INODE_SIZE);
if (ret)
-   return ret;
+   goto out_page;
 
down(&f->sem);
memset(&ri, 0, sizeof(ri));
@@ -172,7 +182,7 @@ static int jffs2_prepare_write (struct f
ret = PTR_ERR(fn);
jffs2_complete_reservation(c);
up(&f->sem);
-   return ret;
+   goto out_page;
}
ret = jffs2_add_full_dnode_to_inode(c, f, fn);
if (f->metadata) {
@@ -181,65 +191,79 @@ static int jffs2_prepare_write (struct f
f->metadata = NULL;
}
if (ret) {
-   D1(printk(KERN_DEBUG "Eep. add_full_dnode_to_inode() failed in prepare_write, returned %d\n", ret));
+   D1(printk(KERN_DEBUG "Eep. add_full_dnode_to_inode() failed in write_begin, returned %d\n", ret));
jffs2_mark_node_obsolete(c, fn->raw);
jffs2_free_full_dnode(fn);
jffs2_complete_reservation(c);
up(&f->sem);
-   return ret;
+   goto out_page;
}
jffs2_complete_reservation(c);
inode->i_size = pageofs;
up(&f->sem);
}
 
-   /* Read in the page if it wasn't already present, unless it's a whole page */
-   if (!PageUptodate(pg) && (start || end < PAGE_CACHE_SIZE)) {
+   /*
+* Read in the page if it wasn't already present. Cannot optimize away
+* the whole page write case until jffs2_write_end can handle the
+* case of a short-copy.
+*/
+   if (!PageUptodate(pg)) {
down(&f->sem);
ret = jffs2_do_readpage_nolock(inode, pg);
up(&f->sem);
+   if (ret)
+

[RFC][PATCH 1/14] Add union mount documentation

2007-05-14 Thread Bharata B Rao

From: Bharata B Rao [EMAIL PROTECTED]
Subject: Add union mount documentation.

This is an attempt to document some of the implementation details
and issues of union mount.

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 Documentation/union-mounts.txt |  538 +
 1 files changed, 538 insertions(+)

--- /dev/null
+++ b/Documentation/union-mounts.txt
@@ -0,0 +1,538 @@
+VFS BASED UNION MOUNT
+=====================
+
+1. Overview
+2. Union stack
+3. Lookup
+4. Readdir
+   4.1 Duplicate elimination
+   4.2 Preserving state
+   4.3 File offset problem
+   4.4 Altered lseek behaviour
+   4.5 TODO
+5. Copyup
+6. Whiteout
+   6.1. Creation and deletion
+   6.2. Whiteout filetype support
+   6.3. Directory renaming
+7. Usage
+8. State of the code
+9. Extracted (old)mail comments
+
+1. Overview
+-----------
+Union mount allows mounting two or more filesystems transparently on
+a single mount point. The contents (files or directories) of all the
+filesystems become visible at the mount point after a union mount. If
+files of the same name exist in multiple layers, only the topmost one remains
+visible in a union mount. However, (currently) directories with a common name
+are themselves unioned to present a unified view at the subdirectory level.
+
+In this approach to unioning filesystems, the layering information of
+the different components of the union mount is maintained at the VFS layer.
+Hence we call this a VFS based union mount.
+
+2. Union stack
+--------------
+The union stack reflects the stacking of two or more filesystems of the
+union mount. The stacking (layering) information is maintained
+as part of the dentry structures of the mountpoint and mount root.
+
+The union stack information in the dentry structure looks like this:
+
+struct dentry {
+   ...
+
+#ifdef CONFIG_UNION_MOUNT
+   struct dentry *d_overlaid;  /* overlaid directory */
+   struct dentry *d_topmost;   /* topmost directory */
+   struct union_info *d_union; /* union stack info */
+#endif
+   ...
+};
+
+struct union_info {
+   struct mutex u_mutex;
+   atomic_t u_count;
+};
+
+There is one union_info shared by all dentries that are part of
+a union, and its u_count member holds the number of references to the
+union stack. When this reaches zero, the union stack ceases to exist and
+the union_info is freed.
+
+The union stack is essentially a singly linked list of the dentries of the
+union, with d_topmost as the head of the list and d_overlaid pointing
+to the next member of the stack. Walking the union stack is guarded by
+the u_mutex member.
+
+dget() references every dentry of the overlaid union stack to make sure
+that no dentry of the stack is discarded from memory while others are
+still in use. Since walking the union stack is protected by a mutex,
+dget() can now sleep.
+
+dput() also walks the union stack and releases references to all the
+dentries that are part of the union. If a dentry's reference count
+in a union stack reaches zero, it implies that the dentries above it
+in the stack must also be unused, and the union stack can be safely
+destroyed at this point.
+
+Since dget() can sleep with union mount, many callers of dget() have to
+be fixed to drop any spinlocks they hold, acquire the union lock (a
+mutex), and then re-acquire the spinlocks.
+
+3. Lookup
+---------
+With union mount, it becomes necessary to look up pathnames not only
+in the topmost filesystem but also in the underlying filesystems.
+
+When looking up a filename, the lookup routines as a rule return
+the match from the topmost layer. However, if the file is not found
+in the topmost layer, the lookup routines have been modified to
+find the file in the underlying filesystems of the union stack.
+
+When looking up a directory under a union mount point, the lookup
+code has been modified to build a union stack (if necessary).
+
+When looking up a name in a union directory, it is necessary to
+guarantee that the returned union stack remains valid. Hence
+concurrent lookups are prevented by holding the mutex lock for the
+duration of the lookup.
+
+4. Readdir
+----------
+The core functionality of union mount, namely the merged view of
+multiple directories, is provided by the readdir()/getdents() routines.
+This is achieved by reading the contents of every directory of the union
+stack and merging the results.
+
+4.1 Duplicate elimination
+
+The directory entries are read starting from the top layer and are
+maintained in a cache. Subsequently, when the entries from the bottom layers
+of the union stack are read, they are checked for duplicates (in the cache)
+before being passed out to user space. Since there can be multiple
+readdir()/getdents() calls to read a single directory, the cache must
+persist across these calls, and so must the associated state.
+
+4.2

[RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: Add a new mount flag (MNT_UNION) for union mount.

Introduce the MNT_UNION, MS_UNION and FS_WHT flags. These are the flags needed
for doing

mount /dev/hda3 /mnt -o union

You need additional patches for util-linux for that to work.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/namespace.c|   14 +-
 include/linux/fs.h|2 ++
 include/linux/mount.h |1 +
 3 files changed, 16 insertions(+), 1 deletion(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -442,6 +442,7 @@ static int show_vfsmnt(struct seq_file *
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
{ MNT_NOMNT, ",nomnt" },
+   { MNT_UNION, ",union" },
{ 0, NULL }
};
struct proc_fs_info *fs_infop;
@@ -1256,6 +1257,14 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;
 
+   /* Unions couldn't be writable if the filesystem
+* doesn't know about whiteouts */
+   err = -ENOTSUPP;
+   if ((mnt_flags & MNT_UNION) &&
+   !(newmnt->mnt_sb->s_flags & MS_RDONLY) &&
+   !(newmnt->mnt_sb->s_type->fs_flags & FS_WHT))
+   goto unlock;
+
/* some flags may have been set earlier */
newmnt-mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
@@ -1562,9 +1571,12 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_RELATIME;
if (flags & MS_NOMNT)
mnt_flags |= MNT_NOMNT;
+   if (flags & MS_UNION)
+   mnt_flags |= MNT_UNION;
 
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-  MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT);
+  MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT |
+  MS_UNION);
 
/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -97,6 +97,7 @@ extern int dir_notify_enable;
 #define FS_BINARY_MOUNTDATA 2
 #define FS_HAS_SUBTYPE 4
 #define FS_SAFE 8  /* Safe to mount by unprivileged users */
+#define FS_WHT 16
 #define FS_REVAL_DOT   16384   /* Check the paths ., .. for staleness */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move()
 * during rename() internally.
@@ -113,6 +114,7 @@ extern int dir_notify_enable;
 #define MS_REMOUNT 32  /* Alter flags of a mounted FS */
 #define MS_MANDLOCK 64  /* Allow mandatory locks on an FS */
 #define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_UNION   256 /* Union mount */
 #define MS_NOATIME 1024 /* Do not update access times. */
 #define MS_NODIRATIME  2048 /* Do not update directory access times */
 #define MS_BIND 4096
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -36,6 +36,7 @@ struct mnt_namespace;
 #define MNT_SHARED 0x1000  /* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE 0x2000  /* if the vfsmount is a unbindable mount */
 #define MNT_PNODE_MASK 0x3000  /* propogation flag mask */
+#define MNT_UNION  0x4000  /* if the vfsmount is a union mount */
 
 struct vfsmount {
struct list_head mnt_hash;

[RFC][PATCH 3/14] Add the whiteout file type

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: Add the whiteout file type

A white-out stops the VFS from further lookups of the white-out's name and
returns -ENOENT. This is the same behaviour as if the filename isn't
found. This can be used in combination with union mounts to virtually
delete (white-out) files by creating a file with this file type.
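From the point of view of lookup, a whiteout acts as a stop marker. Layers are searched from the top down, and a whiteout entry ends the search with -ENOENT even if a lower layer still holds a real file of that name. The sketch below models this in user space under a few assumptions:

- Each layer is reduced to the mode of the looked-up name, with 0 meaning absent.
- The UM_S_* macros are local stand-ins that mirror S_IFMT and the S_IFWHT value from this patch.
- resolve_name() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local stand-ins: UM_S_IFWHT mirrors the S_IFWHT value added by this patch. */
#define UM_S_IFMT     0170000u
#define UM_S_IFREG    0100000u
#define UM_S_IFWHT    0160000u
#define UM_S_ISWHT(m) (((m) & UM_S_IFMT) == UM_S_IFWHT)

/*
 * modes[i] is the mode of the looked-up name in layer i (0 = absent),
 * topmost layer first. Returns the index of the layer that supplies the
 * name, or -ENOENT if no layer has it or a whiteout is reached first.
 */
int resolve_name(const unsigned int *modes, size_t nlayers)
{
	assert(modes != NULL || nlayers == 0);
	for (size_t i = 0; i < nlayers; i++) {
		if (modes[i] == 0)
			continue;          /* not in this layer: keep descending */
		if (UM_S_ISWHT(modes[i]))
			return -ENOENT;    /* whited out: behave as if not found */
		return (int)i;         /* the topmost real entry wins */
	}
	return -ENOENT;
}
```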

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 include/linux/stat.h |2 ++
 1 files changed, 2 insertions(+)

--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -10,6 +10,7 @@
 #if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)
 
 #define S_IFMT  00170000
+#define S_IFWHT  0160000   /* whiteout */
 #define S_IFSOCK 0140000
 #define S_IFLNK 0120000
 #define S_IFREG  0100000
@@ -28,6 +29,7 @@
 #define S_ISBLK(m) (((m) & S_IFMT) == S_IFBLK)
 #define S_ISFIFO(m) (((m) & S_IFMT) == S_IFIFO)
 #define S_ISSOCK(m) (((m) & S_IFMT) == S_IFSOCK)
+#define S_ISWHT(m) (((m) & S_IFMT) == S_IFWHT)
 
 #define S_IRWXU 00700
 #define S_IRUSR 00400

[RFC][PATCH 4/14] Add config options for union mount

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: Add config options for union mount

Introduces two new config options for union mount:

CONFIG_UNION_MOUNT - Enables union mount
CONFIG_UNION_MOUNT_DEBUG - Enables debugging support for union mount.

Also adds debugging routines.

FIXME: this needs some work. printk'ing isn't the right method for getting
good debugging output.
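The debugging macros added below follow a common kernel pattern:

- A do { } while (0) body makes each macro act as a single statement.
- A plain if () on a compile-time constant replaces #if, so the arguments are still type-checked while the disabled branch is optimized away.

Here is a small user-space sketch of that pattern. The names DEMO_DEBUG, DEMO_DBG() and demo_log are hypothetical. The macro writes into a buffer instead of calling printk().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compile-time switch, playing the role of UNION_MOUNT_DEBUG. */
#ifndef DEMO_DEBUG
#define DEMO_DEBUG 1
#endif

/* Holds the last message so the behaviour can be checked without a console. */
char demo_log[256];

/*
 * Same shape as UM_DEBUG(): "%s: " is pasted in front of the caller's
 * format and __func__ fills it in. The ", ##__VA_ARGS__" form is a GNU
 * extension (accepted by GCC and Clang) that tolerates zero extra args.
 */
#define DEMO_DBG(fmt, ...)						\
do {									\
	if (DEMO_DEBUG)							\
		snprintf(demo_log, sizeof(demo_log), "%s: " fmt,	\
			 __func__, ##__VA_ARGS__);			\
} while (0)

void demo_lock(int count)
{
	DEMO_DBG("get union (count=%d)\n", count);
}
```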

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/Kconfig  |   16 +
 include/linux/union_debug.h |   76 
 2 files changed, 92 insertions(+)

--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -551,6 +551,22 @@ config INOTIFY_USER
 
  If unsure, say Y.
 
+config UNION_MOUNT
+   bool "Union mount support (EXPERIMENTAL)"
+   depends on EXPERIMENTAL
+   ---help---
+ If you say Y here, you will be able to mount file systems as
+ union mount stacks. This is a VFS based implementation and
+ should work with all file systems. If unsure, say N.
+
+config UNION_MOUNT_DEBUG
+   bool "Union mount debugging output"
+   depends on UNION_MOUNT
+   ---help---
+ If you say Y here, the union mount debugging code will be
+ compiled in. You have to activate the appropriate UNION_MOUNT_DEBUG
+ flags in file:include/linux/union.h, too.
+
 config QUOTA
bool "Quota support"
help
--- /dev/null
+++ b/include/linux/union_debug.h
@@ -0,0 +1,76 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright © 2004-2007 IBM Corporation
+ *   Author(s): Jan Blunck ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_DEBUG_H
+#define __LINUX_UNION_DEBUG_H
+
+#ifdef __KERNEL__
+
+#ifdef CONFIG_UNION_MOUNT_DEBUG
+
+#include <linux/sched.h>
+
+#ifndef UNION_MOUNT_DEBUG
+#define UNION_MOUNT_DEBUG 0
+#endif /* UNION_MOUNT_DEBUG */
+#ifndef UNION_MOUNT_DEBUG_DCACHE
+#define UNION_MOUNT_DEBUG_DCACHE 0
+#endif /* UNION_MOUNT_DEBUG_DCACHE */
+#ifndef UNION_MOUNT_DEBUG_LOCK
+#define UNION_MOUNT_DEBUG_LOCK 0
+#endif /* UNION_MOUNT_DEBUG_LOCK */
+#ifndef UNION_MOUNT_DEBUG_READDIR
+#define UNION_MOUNT_DEBUG_READDIR 0
+#endif /* UNION_MOUNT_DEBUG_READDIR */
+
+/*
+ * The really excessive debugging output is triggered by
+ * the user id () which is accessing the union stack
+ */
+#define UM_DEBUG(fmt, args...) \
+do {   \
+   if (UNION_MOUNT_DEBUG)  \
+   printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);   \
+} while (0)
+#define UM_DEBUG_UID(fmt, args...) \
+do {   \
+   if (UNION_MOUNT_DEBUG && (current->uid == ))\
+   printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);   \
+} while (0)
+#define UM_DEBUG_DCACHE(fmt, args...)  \
+do {   \
+   if (UNION_MOUNT_DEBUG_DCACHE && (current->uid == )) \
+   printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);   \
+} while (0)
+#define UM_DEBUG_LOCK(fmt, args...)\
+do {   \
+   if (UNION_MOUNT_DEBUG_LOCK && (current->uid == ))   \
+   printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);   \
+} while (0)
+#define UM_DEBUG_READDIR(fmt, args...) \
+do {   \
+   if (UNION_MOUNT_DEBUG_READDIR && (current->uid == ))\
+   printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);   \
+} while (0)
+
+#else  /* CONFIG_UNION_MOUNT_DEBUG */
+
+#define UM_DEBUG(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_UID(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_LOCK(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_READDIR(fmt, args...) do { /* empty */ } while (0)
+
+#endif /* CONFIG_UNION_MOUNT_DEBUG */
+
+#endif /* __KERNEL__ */
+#endif /*  __LINUX_UNION_DEBUG_H */

[RFC][PATCH 5/14] Introduce union stack

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: Introduce union stack.

Adds union stack infrastructure to the dentry structure and provides
locking routines to walk the union stack.
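Conceptually, the union stack is a singly linked list that runs from d_topmost through the d_overlaid pointers. A union-aware dget() or dput() takes or drops a reference on every layer in that list. The following is a simplified single-threaded model of that walk. struct layer, stack_get(), stack_put() and stack_depth() are hypothetical names. The u_mutex that the kernel holds during the walk is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model: one node per layer of a union stack. */
struct layer {
	const char *fs;          /* filesystem providing this layer */
	struct layer *overlaid;  /* next lower layer, like d_overlaid */
	int refcount;            /* stands in for the dentry's reference count */
};

/* Number of layers reachable from the topmost one. */
size_t stack_depth(const struct layer *topmost)
{
	size_t n = 0;

	for (; topmost != NULL; topmost = topmost->overlaid)
		n++;
	return n;
}

/* Reference every layer, top to bottom (the union-aware dget()).
 * The kernel holds u_mutex here; this sketch omits the lock. */
void stack_get(struct layer *topmost)
{
	for (struct layer *l = topmost; l != NULL; l = l->overlaid)
		l->refcount++;
}

/* Drop one reference on every layer (the union-aware dput()).
 * Returns 1 once the topmost layer is unused, i.e. the stack could
 * be torn down, and 0 otherwise. */
int stack_put(struct layer *topmost)
{
	assert(topmost != NULL && topmost->refcount > 0);
	for (struct layer *l = topmost; l != NULL; l = l->overlaid)
		l->refcount--;
	return topmost->refcount == 0;
}
```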

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/Makefile  |2 
 fs/dcache.c  |5 
 fs/union.c   |   53 +
 include/linux/dcache.h   |6 +
 include/linux/dcache_union.h |  248 +++
 5 files changed, 314 insertions(+)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -49,6 +49,8 @@ obj-$(CONFIG_FS_POSIX_ACL)+= posix_acl.
 obj-$(CONFIG_NFS_COMMON)   += nfs_common/
 obj-$(CONFIG_GENERIC_ACL)  += generic_acl.o
 
+obj-$(CONFIG_UNION_MOUNT)  += union.o
+
 obj-$(CONFIG_QUOTA)+= dquot.o
 obj-$(CONFIG_QFMT_V1)  += quota_v1.o
 obj-$(CONFIG_QFMT_V2)  += quota_v2.o
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -936,6 +936,11 @@ struct dentry *d_alloc(struct dentry * p
 #ifdef CONFIG_PROFILING
dentry->d_cookie = NULL;
 #endif
+#ifdef CONFIG_UNION_MOUNT
+   dentry->d_overlaid = NULL;
+   dentry->d_topmost = NULL;
+   dentry->d_union = NULL;
+#endif
INIT_HLIST_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,53 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright © 2004-2007 IBM Corporation
+ *   Author(s): Jan Blunck ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/fs.h>
+
+struct union_info * union_alloc(void)
+{
+   struct union_info *info;
+
+   info = kmalloc(sizeof(*info), GFP_ATOMIC);
+   if (!info)
+   return NULL;
+
+   mutex_init(&info->u_mutex);
+   mutex_lock(&info->u_mutex);
+   atomic_set(&info->u_count, 1);
+   UM_DEBUG_LOCK("allocate union %p\n", info);
+   return info;
+}
+
+struct union_info * union_get(struct union_info *info)
+{
+   BUG_ON(!info);
+   BUG_ON(!atomic_read(&info->u_count));
+   atomic_inc(&info->u_count);
+   UM_DEBUG_LOCK("get union %p (count=%d)\n", info,
+ atomic_read(&info->u_count));
+   return info;
+}
+
+void union_put(struct union_info *info)
+{
+   BUG_ON(!info);
+   UM_DEBUG_LOCK("put union %p (count=%d)\n", info,
+ atomic_read(&info->u_count));
+   atomic_dec(&info->u_count);
+
+   if (!atomic_read(&info->u_count)) {
+   UM_DEBUG_LOCK("free union %p\n", info);
+   kfree(info);
+   }
+
+   return;
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -93,6 +93,12 @@ struct dentry {
struct dentry *d_parent;/* parent directory */
struct qstr d_name;
 
+#ifdef CONFIG_UNION_MOUNT
+   struct dentry *d_overlaid;  /* overlaid directory */
+   struct dentry *d_topmost;   /* topmost directory */
+   struct union_info *d_union; /* union directory info */
+#endif
+
struct list_head d_lru; /* LRU list */
/*
 * d_child and d_rcu can share memory
--- /dev/null
+++ b/include/linux/dcache_union.h
@@ -0,0 +1,248 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright © 2004-2007 IBM Corporation
+ *   Author(s): Jan Blunck ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_DCACHE_UNION_H
+#define __LINUX_DCACHE_UNION_H
+#ifdef __KERNEL__
+
+#include <linux/union_debug.h>
+#include <linux/fs_struct.h>
+#include <asm/atomic.h>
+#include <asm/semaphore.h>
+
+#ifdef CONFIG_UNION_MOUNT
+
+/*
+ * This is the union info object, which describes general information about this
+ * union directory
+ *
+ * u_mutex protects the union stack against modification. You can reach it
+ * through the d_union field in struct dentry. Hold it when you are walking
+ * or modifying the union stack!
+ */
+struct union_info {
+   atomic_t u_count;
+   struct mutex u_mutex;
+};
+
+/* allocate/de-allocate */
+extern struct union_info *union_alloc(void);
+extern struct union_info *union_get(struct union_info *);
+extern void union_put(struct union_info *);
+
+/*
+ * These are the functions for locking a dentry's union. When you
+ * want to acquire a dentry's union lock, use:
+ *
+ * - union_lock() when you can sleep,
+ * - union_lock_spinlock() when you are holding a spinlock (that
+ *   you CAN safely give up and reacquire again)
+ * - union_lock_readlock() when you are holding a readlock (that
+ *   you CAN safely give up and

[RFC][PATCH 7/14] Union-mount mounting

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: Union-mount mounting

Adds union mount support to mount() and umount() system calls.
Sets up the union stack during mount and destroys it during unmount.

TODO: bind and move mounts aren't yet supported with union mounts.
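On unmount, the union_info shared by the stack is released only by whoever drops the last reference. The sketch below shows that get/put lifetime with C11 atomics. struct uinfo and the uinfo_*() helpers are hypothetical stand-ins for union_alloc(), union_get() and union_put(), and the mutex is omitted.

The put path uses the value returned by atomic_fetch_sub(), so the decrement and the last-reference test happen in one atomic step. The kernel's atomic_dec_and_test() provides the same property. A separate decrement followed by a read leaves a window in which two concurrent callers can both observe zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct union_info (mutex omitted). */
struct uinfo {
	atomic_int count;
};

struct uinfo *uinfo_alloc(void)
{
	struct uinfo *u = malloc(sizeof(*u));

	if (u)
		atomic_init(&u->count, 1);  /* the creator holds one reference */
	return u;
}

struct uinfo *uinfo_get(struct uinfo *u)
{
	assert(u != NULL && atomic_load(&u->count) > 0);
	atomic_fetch_add(&u->count, 1);
	return u;
}

/* Drop a reference; returns true if this call freed the object.
 * atomic_fetch_sub() returns the value before the decrement. */
bool uinfo_put(struct uinfo *u)
{
	assert(u != NULL);
	if (atomic_fetch_sub(&u->count, 1) == 1) {
		free(u);
		return true;
	}
	return false;
}
```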

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/namespace.c|   90 ++
 fs/union.c|   71 +++
 include/linux/fs.h|3 +
 include/linux/union.h |   33 ++
 4 files changed, 190 insertions(+), 7 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -169,7 +169,7 @@ void mnt_set_mountpoint(struct vfsmount 
struct vfsmount *child_mnt)
 {
child_mnt->mnt_parent = mntget(mnt);
-   child_mnt->mnt_mountpoint = dget(dentry);
+   child_mnt->mnt_mountpoint = __dget(dentry);
dentry->d_mounted++;
 }
 
@@ -294,6 +294,10 @@ static struct vfsmount *clone_mnt(struct
if (!mnt)
goto alloc_failed;
 
+   /*
+* As of now, cloning of union mounted mnt isn't permitted.
+*/
+   BUG_ON(mnt->mnt_flags & MNT_UNION);
mnt->mnt_flags = old->mnt_flags;
atomic_inc(&sb->s_active);
mnt->mnt_sb = sb;
@@ -579,16 +583,20 @@ void release_mounts(struct list_head *he
mnt = list_first_entry(head, struct vfsmount, mnt_hash);
list_del_init(&mnt->mnt_hash);
if (mnt->mnt_parent != mnt) {
-   struct dentry *dentry;
-   struct vfsmount *m;
+   struct path old_nd;
spin_lock(&vfsmount_lock);
-   dentry = mnt->mnt_mountpoint;
-   m = mnt->mnt_parent;
+   old_nd.dentry = mnt->mnt_mountpoint;
+   old_nd.mnt = mnt->mnt_parent;
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
+   detach_mnt_union(mnt, &old_nd);
spin_unlock(&vfsmount_lock);
-   dput(dentry);
-   mntput(m);
+   if (mnt->mnt_flags & MNT_UNION) {
+   UM_DEBUG("shrink the mountpoint's dcache\n");
+   shrink_dcache_sb(old_nd.dentry->d_sb);
+   }
+   __dput(old_nd.dentry);
+   mntput(old_nd.mnt);
}
mntput(mnt);
}
@@ -621,6 +629,9 @@ static int do_umount(struct vfsmount *mn
struct super_block *sb = mnt->mnt_sb;
int retval;
LIST_HEAD(umount_list);
+#ifdef CONFIG_UNION_MOUNT
+   struct union_info *uinfo = NULL;
+#endif
 
retval = security_sb_umount(mnt, flags);
if (retval)
@@ -685,6 +696,14 @@ static int do_umount(struct vfsmount *mn
}
 
down_write(&namespace_sem);
+#ifdef CONFIG_UNION_MOUNT
+   /*
+* Grab a reference to the union_info which gets detached
+* from the dentries in release_mounts().
+*/
+   if (mnt->mnt_flags & MNT_UNION)
+   uinfo = union_lock_and_get(mnt->mnt_root);
+#endif
spin_lock(&vfsmount_lock);
event++;
 
@@ -699,6 +718,15 @@ static int do_umount(struct vfsmount *mn
security_sb_umount_busy(mnt);
up_write(&namespace_sem);
release_mounts(&umount_list);
+#ifdef CONFIG_UNION_MOUNT
+   if (uinfo) {
+   if (atomic_read(&uinfo->u_count) == 1)
+   /* We are the last user of this union_info */
+   union_release(uinfo);
+   else
+   union_put_and_unlock(uinfo);
+   }
+#endif
return retval;
 }
 
@@ -941,6 +969,9 @@ static int attach_recursive_mnt(struct v
set_mnt_shared(p);
}
 
+   if (source_mnt->mnt_flags & MNT_UNION)
+   union_alloc_dentry(nd->dentry);
+
spin_lock(&vfsmount_lock);
if (parent_nd) {
detach_mnt(source_mnt, parent_nd);
@@ -948,6 +979,7 @@ static int attach_recursive_mnt(struct v
touch_mnt_namespace(current->nsproxy->mnt_ns);
} else {
mnt_set_mountpoint(dest_mnt, dest_dentry, source_mnt);
+   attach_mnt_union(source_mnt, nd);
commit_tree(source_mnt);
}
 
@@ -956,6 +988,7 @@ static int attach_recursive_mnt(struct v
commit_tree(child);
}
spin_unlock(&vfsmount_lock);
+   union_unlock(nd->dentry);
return 0;
 }
 
@@ -1003,6 +1036,12 @@ static int do_change_type(struct nameida
if (nd->dentry != nd->mnt->mnt_root)
return -EINVAL;
 
+   /*
+* Don't change the type of union mounts
+*/
+   if (nd->mnt->mnt_flags & MNT_UNION)
+   return -EINVAL;
+

[RFC][PATCH 8/14] Union-mount lookup

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: Union-mount lookup

Modifies the vfs lookup routines to work with union mounted directories.

The existing lookup routines generally look up a pathname only in the
topmost or given directory. The changed versions of the lookup routines
search for the pathname in the entire union mounted stack. They have also been
modified to set up the union stack during lookups from the dcache and from
real_lookup().
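The key change is the search order. Instead of consulting only the topmost directory, a lookup walks the whole union stack, and the first layer that contains the name wins. Below is a hypothetical user-space model of that order. Each layer is reduced to a NULL-terminated list of names, and struct layer and union_lookup() are illustrative names only. Dcache handling, revalidation and whiteouts are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: each layer is a NULL-terminated list of names. */
struct layer {
	const char *const *names;
};

static int layer_has(const struct layer *l, const char *name)
{
	for (const char *const *p = l->names; *p != NULL; p++)
		if (strcmp(*p, name) == 0)
			return 1;
	return 0;
}

/* Search layers[0] (topmost) downwards; return the index of the first
 * layer containing the name, or -1 if no layer has it. */
int union_lookup(const struct layer *layers, size_t nlayers, const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < nlayers; i++)
		if (layer_has(&layers[i], name))
			return (int)i;   /* topmost match wins */
	return -1;
}
```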

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/dcache.c|   16 +
 fs/namei.c |   78 +-
 fs/namespace.c |   35 ++
 fs/union.c |  598 +
 include/linux/dcache.h |   17 +
 include/linux/namei.h  |4 
 include/linux/union.h  |   49 
 7 files changed, 786 insertions(+), 11 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1286,7 +1286,7 @@ struct dentry * d_lookup(struct dentry *
return dentry;
 }
 
-struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
+struct dentry * __d_lookup_single(struct dentry *parent, struct qstr *name)
 {
unsigned int len = name->len;
unsigned int hash = name->hash;
@@ -1371,6 +1371,20 @@ out:
return dentry;
 }
 
+struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
+{
+   struct dentry *dentry;
+   unsigned long seq;
+
+   do {
+   seq = read_seqbegin(&rename_lock);
+   dentry = __d_lookup_single(parent, name);
+   if (dentry)
+   break;
+   } while (read_seqretry(&rename_lock, seq));
+   return dentry;
+}
+
 /**
  * d_validate - verify dentry provided from insecure source
  * @dentry: The dentry alleged to be valid child of @dparent
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -374,6 +374,33 @@ void release_open_intent(struct nameidat
 }
 
 static inline struct dentry *
+do_revalidate_single(struct dentry *dentry, struct nameidata *nd)
+{
+   int status = dentry->d_op->d_revalidate(dentry, nd);
+   if (unlikely(status <= 0)) {
+   /*
+* The dentry failed validation.
+* If d_revalidate returned 0 attempt to invalidate
+* the dentry otherwise d_revalidate is asking us
+* to return a fail status.
+*/
+   if (!status) {
+   if (!d_invalidate(dentry)) {
+   __dput_single(dentry);
+   dentry = NULL;
+   }
+   } else {
+   __dput_single(dentry);
+   dentry = ERR_PTR(status);
+   }
+   }
+   return dentry;
+}
+
+/*
+ * FIXME: We need a union aware revalidate here!
+ */
+static inline struct dentry *
 do_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
int status = dentry->d_op->d_revalidate(dentry, nd);
@@ -403,16 +430,16 @@ do_revalidate(struct dentry *dentry, str
  */
 static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
 {
-   struct dentry * dentry = __d_lookup(parent, name);
+   struct dentry *dentry = __d_lookup_single(parent, name);
 
/* lockess __d_lookup may fail due to concurrent d_move() 
 * in some unrelated directory, so try with d_lookup
 */
if (!dentry)
-   dentry = d_lookup(parent, name);
+   dentry = d_lookup_single(parent, name);
 
if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
-   dentry = do_revalidate(dentry, nd);
+   dentry = do_revalidate_single(dentry, nd);
 
return dentry;
 }
@@ -465,7 +492,7 @@ ok:
  * make sure that nobody added the entry to the dcache in the meantime..
  * SMP-safe
  */
-static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
+struct dentry * real_lookup_single(struct dentry *parent, struct qstr *name, struct nameidata *nd)
 {
struct dentry * result;
struct inode *dir = parent->d_inode;
@@ -485,7 +512,7 @@ static struct dentry * real_lookup(struc
 *
 * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
 */
-   result = d_lookup(parent, name);
+   result = d_lookup_single(parent, name);
if (!result) {
struct dentry * dentry = d_alloc(parent, name);
result = ERR_PTR(-ENOMEM);
@@ -506,7 +533,7 @@ static struct dentry * real_lookup(struc
 */
mutex_unlock(&dir->i_mutex);
if (result->d_op && result->d_op->d_revalidate) {
-   result = do_revalidate(result, nd);
+   result = do_revalidate_single(result, nd);
if (!result)
result = ERR_PTR(-ENOENT);
}
@@ -699,7 +726,7 @@ static int __follow_mount(struct path *p
return res;
 }
 
-static

[RFC][PATCH 9/14] Union-mount readdir

2007-05-14 Thread Bharata B Rao

From: Bharata B Rao [EMAIL PROTECTED]
Subject: Union mount readdir

This modifies the readdir()/getdents() routines to read directory
entries from the top-level and the lower directories of a union and present
a merged view.

The directory entries are read starting from the top layer and they
are maintained in a cache. Subsequently, when the entries from the bottom layers
of the union stack are read, they are checked for duplicates (in the cache)
before being passed out to user space. There can be multiple calls
to the readdir/getdents routines for reading the entries of a single directory,
and the union directory cache is maintained across these calls.
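Duplicate elimination can be sketched as a filter with a persistent set of names already seen. Layers are fed in topmost first, and any name that was already returned is dropped. This is a simplified user-space model, not the fs/union.c code, and it rests on these assumptions:

- struct rdcache, rdcache_filter() and the fixed RDCACHE_MAX capacity are made up for illustration.
- A linear search stands in for whatever lookup structure the real cache uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RDCACHE_MAX 128

/* Names already handed to user space for one open directory. It must
 * outlive a single call, like the cache kept across getdents() calls. */
struct rdcache {
	const char *seen[RDCACHE_MAX];
	size_t n;
};

static int rdcache_has(const struct rdcache *c, const char *name)
{
	for (size_t i = 0; i < c->n; i++)
		if (strcmp(c->seen[i], name) == 0)
			return 1;
	return 0;
}

/* Feed one layer's entries (topmost layer first). Unseen names are
 * copied to out[] and remembered; duplicates are dropped. Returns the
 * number of names emitted. */
size_t rdcache_filter(struct rdcache *c, const char *const *entries,
		      size_t count, const char **out)
{
	size_t emitted = 0;

	for (size_t i = 0; i < count; i++) {
		if (rdcache_has(c, entries[i]))
			continue;                 /* shadowed by a higher layer */
		assert(c->n < RDCACHE_MAX);      /* a real cache would grow */
		c->seen[c->n++] = entries[i];
		out[emitted++] = entries[i];
	}
	return emitted;
}
```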

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/aio.c |8 
 fs/file_table.c  |   14 -
 fs/read_write.c  |7 
 fs/readdir.c |2 
 fs/union.c   |  404 +++
 include/linux/dcache_union.h |   27 ++
 include/linux/union.h|   22 ++
 7 files changed, 475 insertions(+), 9 deletions(-)

--- a/fs/aio.c
+++ b/fs/aio.c
@@ -21,6 +21,7 @@
 
 #include <linux/sched.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/file.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
@@ -486,6 +487,13 @@ static void aio_fput_routine(struct work
/* Complete the fput */
__fput(req->ki_filp);
 
+   /*
+* __fput no longer releases the dentry and vfsmnt, thanks to
+* union mount. Hence do this manually.
+*/
+   dput(req->ki_filp->f_path.dentry);
+   mntput(req->ki_filp->f_path.mnt);
+
/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
really_put_req(ctx, req);
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -141,8 +141,14 @@ EXPORT_SYMBOL(get_empty_filp);
 
 void fastcall fput(struct file *file)
 {
-   if (atomic_dec_and_test(&file->f_count))
+   struct dentry *dentry = file->f_path.dentry;
+   struct vfsmount *mnt = file->f_path.mnt;
+
+   if (atomic_dec_and_test(&file->f_count)) {
__fput(file);
+   dput(dentry);
+   mntput(mnt);
+   }
 }
 
 EXPORT_SYMBOL(fput);
@@ -152,9 +158,7 @@ EXPORT_SYMBOL(fput);
  */
 void fastcall __fput(struct file *file)
 {
-   struct dentry *dentry = file->f_path.dentry;
-   struct vfsmount *mnt = file->f_path.mnt;
-   struct inode *inode = dentry->d_inode;
+   struct inode *inode = file->f_path.dentry->d_inode;
 
might_sleep();
 
@@ -180,8 +184,6 @@ void fastcall __fput(struct file *file)
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
file_free(file);
-   dput(dentry);
-   mntput(mnt);
 }
 
 struct file fastcall *fget(unsigned int fd)
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
+#include <linux/union.h>
 #include "read_write.h"
 
 #include <asm/uaccess.h>
@@ -123,6 +124,12 @@ loff_t vfs_llseek(struct file *file, lof
if (file->f_op && file->f_op->llseek)
fn = file->f_op->llseek;
}
+
+#ifdef CONFIG_UNION_MOUNT
+   if (S_ISDIR(file->f_path.dentry->d_inode->i_mode) &&
+   unlikely(file->f_path.dentry->d_overlaid))
+   return union_dir_llseek(file, offset, origin);
+#endif
return fn(file, offset, origin);
 }
 EXPORT_SYMBOL(vfs_llseek);
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -33,7 +33,7 @@ int vfs_readdir(struct file *file, filld
mutex_lock(&inode->i_mutex);
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
-   res = file->f_op->readdir(file, buf, filler);
+   res = do_readdir(file, buf, filler);
file_accessed(file);
}
mutex_unlock(&inode->i_mutex);
--- a/fs/union.c
+++ b/fs/union.c
@@ -14,6 +14,7 @@
 #include <linux/namei.h>
 #include <linux/module.h>
 #include <linux/mount.h>
+#include <linux/file.h>
 
 struct union_info * union_alloc(void)
 {
@@ -26,6 +27,8 @@ struct union_info * union_alloc(void)
mutex_init(&info->u_mutex);
mutex_lock(&info->u_mutex);
atomic_set(&info->u_count, 1);
+   INIT_LIST_HEAD(&info->u_rdcache);
+   info->u_cookie = 0;
UM_DEBUG_LOCK("allocate union %p\n", info);
return info;
 }
@@ -40,6 +43,7 @@ struct union_info * union_get(struct uni
return info;
 }
 
+static void release_rdstates(struct union_info *info);
 void union_put(struct union_info *info)
 {
BUG_ON(!info);
@@ -49,6 +53,7 @@ void union_put(struct union_info *info)
 
if (!atomic_read(&info->u_count)) {
UM_DEBUG_LOCK("free union %p\n", info);
+   release_rdstates(info);
kfree(info);
}
 
@@ -968,3 +973,402 @@ int follow_union_mount(struct vfsmount *
 
return res;
 }
+
+/*
+ *

[RFC][PATCH 10/14] In-kernel file copy between union mounted filesystems

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: In-kernel file copy between union mounted filesystems

This patch introduces in-kernel file copy between union mounted
filesystems. When a file is opened for writing but resides on a lower (thus
read-only) layer of the union stack, it is first copied up to the topmost union
layer.

This patch uses do_splice_direct() for the in-kernel file copy.
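Two pieces of this patch can be modelled in user space. The sketch below assumes the following:

- The write test: open_namei() checks flag & 0x2 because, as we understand the 2.6-era open path, do_filp_open() increments the O_ACCMODE bits by one before calling it. After that shift, bit 0 means read and bit 1 means write.
- The copy: a plain data copy from the lower file to the new top-layer file.
- Names: opened_for_write() and copy_up_data() are hypothetical helpers.
- Mechanism: a read()/write() loop stands in for do_splice_direct(), which exists only inside the kernel.
- Scope: metadata (owner, mode, timestamps) is not handled.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* open(2) flags -> "namei" flags: bump the access mode so that
 * O_RDONLY -> 1, O_WRONLY -> 2, O_RDWR -> 3, then test the write bit. */
int opened_for_write(int open_flags)
{
	int namei_flags = open_flags;

	if ((namei_flags + 1) & O_ACCMODE)
		namei_flags++;
	return (namei_flags & 0x2) != 0;
}

/* Copy all bytes from src_fd to dst_fd. Returns 0 or -errno. */
int copy_up_data(int src_fd, int dst_fd)
{
	char buf[4096];

	assert(src_fd >= 0 && dst_fd >= 0);
	for (;;) {
		ssize_t n = read(src_fd, buf, sizeof(buf));

		if (n == 0)
			return 0;                /* end of the lower file */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		for (ssize_t off = 0; off < n; ) {  /* handle short writes */
			ssize_t w = write(dst_fd, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -errno;
			}
			off += w;
		}
	}
}
```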

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namei.c|   46 +
 fs/union.c|  415 ++
 include/linux/namei.h |2 
 include/linux/union.h |   14 +
 4 files changed, 476 insertions(+), 1 deletion(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -830,8 +830,17 @@ done:
path->mnt = mnt;
path->dentry = dentry;
 
-   if (nd->dentry->d_sb != dentry->d_sb)
+   /*
+* This should be checked after the following of unions.
+* Otherwise we might run into trouble creating directories
+* on mountpoints. :(
+* But maybe we shouldn't set the LAST_LOWLEVEL flag here
+* at all ... */
+   if (nd->dentry->d_sb != dentry->d_sb) {
path->mnt = find_mnt(dentry);
+   UM_DEBUG_UID("Setting LAST_LOWLEVEL for %s\n", name->name);
+   nd->um_flags |= LAST_LOWLEVEL;
+   }
 
__follow_mount(path);
follow_union_mount(path-mnt, path-dentry);
@@ -950,6 +959,14 @@ static fastcall int __link_path_walk(con
if (err)
break;
 
+   if ((nd->flags & LOOKUP_TOPMOST) &&
+   (nd->um_flags & LAST_LOWLEVEL)) {
+   err = union_create_topdir(nd,next.dentry,next.mnt);
+   if (err)
+   goto out_dput;
+   nd->um_flags &= ~LAST_LOWLEVEL;
+   }
+
err = -ENOENT;
inode = next.dentry->d_inode;
if (!inode)
@@ -1005,6 +1022,15 @@ last_component:
err = do_lookup(nd, this, next);
if (err)
break;
+
+   if ((nd->flags & LOOKUP_TOPMOST) &&
+   (nd->um_flags & LAST_LOWLEVEL)) {
+   err = union_create_topdir(nd,next.dentry,next.mnt);
+   if (err)
+   goto out_dput;
+   nd->um_flags &= ~LAST_LOWLEVEL;
+   }
+
inode = next.dentry->d_inode;
if ((lookup_flags & LOOKUP_FOLLOW)
 && inode && inode->i_op && inode->i_op->follow_link) {
@@ -1177,6 +1203,7 @@ static int fastcall do_path_lookup(int d
 
nd-last_type = LAST_ROOT; /* if there are only slashes... */
nd-flags = flags;
+   nd->um_flags = 0;
nd-depth = 0;
 
if (*name=='/') {
@@ -1756,9 +1783,18 @@ int open_namei(int dfd, const char *path
 nd, flag);
if (error)
return error;
+   /* test for WRONLY and RDWR - flag's special lower bits */
+   if (flag & 0x2) {
+   UM_DEBUG_UID("\"%s\" opened for writing\n", pathname);
+   error = union_copyup(nd, flag);
+   if (error)
+   return error;
+   }
goto ok;
}
 
+   UM_DEBUG_UID("open called with O_CREATE\n");
+
/*
 * Create - we need to know the parent.
 */
@@ -1775,6 +1811,8 @@ int open_namei(int dfd, const char *path
if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
goto exit;
 
+   UM_DEBUG_UID("do_last now\n");
+
dir = nd->dentry;
nd->flags &= ~LOOKUP_PARENT;
mutex_lock(&dir->d_inode->i_mutex);
@@ -1828,6 +1866,12 @@ do_last:
error = -EISDIR;
if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
goto exit;
+
+   if (flag & 0x2) {
+   error = union_copyup(nd, flag);
+   if (error)
+   goto exit;
+   }
 ok:
error = may_open(nd, acc_mode, flag);
if (error)
--- a/fs/union.c
+++ b/fs/union.c
@@ -15,6 +15,11 @@
 #include <linux/module.h>
 #include <linux/mount.h>
 #include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/quotaops.h>
+#include <linux/dnotify.h>
+#include <linux/security.h>
+#include <linux/pipe_fs_i.h>
 
 struct union_info * union_alloc(void)
 {
@@ -305,6 +310,53 @@ void __dput_union(struct dentry *dentry)
return;
 }
 
+/*
+ * union_relookup_topmost - lookup and create the topmost path to dentry
+ * @nd: pointer to nameidata
+ * @flags: lookup flags
+ */
+int union_relookup_topmost(struct nameidata *nd, int flags)
+{
+   int err;
+   char *kbuf, *name;
+   struct nameidata this;
+
+   UM_DEBUG_UID("relookup the topmost dir for %s\n",
+nd->dentry->d_name.name);
+
+

[RFC][PATCH 11/14] VFS whiteout handling

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: VFS whiteout handling

Introduce white-out handling in the VFS.
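In rough terms, the VFS side of this patch makes a whiteout look like a missing entry during lookup while still letting a new file be created on top of it. A minimal user-space model of the two checks involved (the `!inode || S_ISWHT(...)` test in __link_path_walk() and the relaxed -EEXIST test in may_create()). The S_IFWHT value is an assumption borrowed from BSD, since the definition in include/linux/fs.h is not part of this excerpt, and all model_* names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* File-type mask and regular-file type, as in <sys/stat.h>. */
#define MODEL_S_IFMT	0170000
#define MODEL_S_IFREG	0100000
/* ASSUMPTION: whiteout type borrowed from BSD's S_IFWHT; the value this
 * patch series uses is not shown in this excerpt. */
#define MODEL_S_IFWHT	0160000
#define MODEL_S_ISWHT(m)	(((m) & MODEL_S_IFMT) == MODEL_S_IFWHT)

struct model_inode {
	unsigned int i_mode;
};

/* Lookup: a negative dentry (no inode) and a whiteout both give -ENOENT. */
static int model_lookup(const struct model_inode *inode)
{
	if (inode == NULL || MODEL_S_ISWHT(inode->i_mode))
		return -ENOENT;
	return 0;
}

/* may_create(): an existing entry blocks creation unless it is a whiteout. */
static int model_may_create(const struct model_inode *child)
{
	if (child != NULL && !MODEL_S_ISWHT(child->i_mode))
		return -EEXIST;
	return 0;
}
```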

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/inode.c|   17 +
 fs/namei.c|  476 --
 fs/readdir.c  |   10 +
 fs/union.c|  104 ++
 include/linux/fs.h|4 
 include/linux/union.h |6 
 6 files changed, 605 insertions(+), 12 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1421,6 +1421,21 @@ void __init inode_init(unsigned long mem
INIT_HLIST_HEAD(inode_hashtable[loop]);
 }
 
+/*
+ * Dummy default file-operations:
+ * Never open a whiteout. This is always a bug.
+ */
+static int whiteout_no_open(struct inode *irrelevant, struct file *dontcare)
+{
+   printk("Attempt to open a whiteout!\n");
+   WARN_ON(1);
+   return -ENXIO;
+}
+
+static struct file_operations def_wht_fops = {
+   .open   = whiteout_no_open,
+};
+
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
 {
inode->i_mode = mode;
@@ -1434,6 +1449,8 @@ void init_special_inode(struct inode *in
inode->i_fop = &def_fifo_fops;
else if (S_ISSOCK(mode))
inode->i_fop = &bad_sock_fops;
+   else if (S_ISWHT(mode))
+   inode->i_fop = &def_wht_fops;
else
printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o)\n",
   mode);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -969,7 +969,7 @@ static fastcall int __link_path_walk(con
 
err = -ENOENT;
inode = next.dentry->d_inode;
-   if (!inode)
+   if (!inode || S_ISWHT(inode->i_mode))
goto out_dput;
err = -ENOTDIR; 
if (!inode->i_op)
@@ -1043,6 +1043,12 @@ last_component:
err = -ENOENT;
if (!inode)
break;
+   if (S_ISWHT(inode->i_mode)) {
+   UM_DEBUG_UID("found a whiteout\n");
+   break;
+   //if (!(nd->flags & LOOKUP_WHT))
+   //break;
+   }
if (lookup_flags & LOOKUP_DIRECTORY) {
err = -ENOTDIR; 
if (!inode->i_op || !inode->i_op->lookup)
@@ -1556,7 +1562,7 @@ static int may_delete(struct inode *dir,
 static inline int may_create(struct inode *dir, struct dentry *child,
 struct nameidata *nd)
 {
-   if (child->d_inode)
+   if (child->d_inode && !S_ISWHT(child->d_inode->i_mode))
return -EEXIST;
if (IS_DEADDIR(dir))
return -ENOENT;
@@ -1623,6 +1629,82 @@ void unlock_rename(struct dentry *p1, st
}
 }
 
+/*
+ * __vfs_unlink_whiteout - Unlink a single whiteout from the system
+ * @dir: parent directory
+ * @dentry: the whiteout itself
+ *
+ * This is for unlinking a single whiteout. Don't use vfs_unlink() because we
+ * don't want any notification stuff etc. but basically it is the same stuff.
+ */
+static int
+__vfs_unlink_whiteout(struct inode *dir, struct dentry *dentry)
+{
+   int error = may_delete(dir, dentry, 0);
+
+   if (error)
+   return error;
+
+   if (!dir->i_op || !dir->i_op->unlink)
+   return -EPERM;
+
+   DQUOT_INIT(dir);
+
+   mutex_lock(&dentry->d_inode->i_mutex);
+   if (d_mountpoint(dentry))
+   error = -EBUSY;
+   else {
+   error = security_inode_unlink(dir, dentry);
+   if (!error)
+   error = dir->i_op->unlink(dir, dentry);
+   }
+   mutex_unlock(&dentry->d_inode->i_mutex);
+
+   /* We don't d_delete() NFS sillyrenamed files--they still exist. */
+   if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
+   d_delete(dentry);
+   //inode_dir_notify(dir, DN_DELETE);
+   }
+   return error;
+}
+
+/*
+ * vfs_unlink_whiteout - unlink and relookup the whiteout
+ *
+ * This is what you want to call from vfs_* functions to remove a whiteout. It
+ * unlinks the whiteout dentry and relookups it afterwards.
+ */
+static int
+vfs_unlink_whiteout(struct inode *dir, struct dentry **dp)
+{
+   struct dentry *dentry = *dp;
+   struct dentry *parent = dentry->d_parent;
+   struct qstr name;
+   int error;
+
+   BUG_ON(dir != parent->d_inode);
+
+   error = -ENOMEM;
+   name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
+   if (!name.name)
+   goto out;
+   strncpy((char *)name.name, dentry->d_name.name, dentry->d_name.len);
+   name.len = dentry->d_name.len;
+   name.hash = dentry->d_name.hash;
+
+   error = __vfs_unlink_whiteout(dir, dentry);
+   if (error)
+   goto out_freename;
+
+   __dput_single(dentry);
+   *dp = __lookup_hash_single(&name, parent, NULL);
+

[RFC][PATCH 12/14] ext2 whiteout support

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: ext2 whiteout support

Introduce whiteout support to ext2.
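The ext2_fill_super() hunk refuses to leave a writable union mount on a filesystem that lacks the whiteout feature bit, and falls back to read-only instead. A sketch of just that decision, with placeholder flag values (MS_UNION is introduced by this series and its value is not shown here). Only EXT2_FEATURE_INCOMPAT_WHITEOUT = 0x0020 is taken from the patch; the model_* names are hypothetical.

```c
#include <stdint.h>

/* Placeholder stand-ins for MS_RDONLY and MS_UNION (values assumed). */
#define MODEL_MS_RDONLY		0x1u
#define MODEL_MS_UNION		0x2u
/* Value of EXT2_FEATURE_INCOMPAT_WHITEOUT as defined in this patch. */
#define MODEL_INCOMPAT_WHITEOUT	0x0020u

/* A writable union mount of a filesystem without the whiteout feature is
 * downgraded to read-only; every other combination is left untouched. */
static unsigned int model_fill_super_flags(unsigned int s_flags,
					   uint32_t incompat)
{
	if ((s_flags & MODEL_MS_UNION) && !(s_flags & MODEL_MS_RDONLY) &&
	    !(incompat & MODEL_INCOMPAT_WHITEOUT))
		s_flags |= MODEL_MS_RDONLY;
	return s_flags;
}
```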

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/ext2/dir.c   |2 ++
 fs/ext2/namei.c |   17 +
 fs/ext2/super.c |   11 ++-
 include/linux/ext2_fs.h |4 
 4 files changed, 33 insertions(+), 1 deletion(-)

--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -218,6 +218,7 @@ static unsigned char ext2_filetype_table
[EXT2_FT_FIFO]  = DT_FIFO,
[EXT2_FT_SOCK]  = DT_SOCK,
[EXT2_FT_SYMLINK]   = DT_LNK,
+   [EXT2_FT_WHT]   = DT_WHT,
 };
 
 #define S_SHIFT 12
@@ -229,6 +230,7 @@ static unsigned char ext2_type_by_mode[S
[S_IFIFO >> S_SHIFT]   = EXT2_FT_FIFO,
[S_IFSOCK >> S_SHIFT]  = EXT2_FT_SOCK,
[S_IFLNK >> S_SHIFT]   = EXT2_FT_SYMLINK,
+   [S_IFWHT >> S_SHIFT]   = EXT2_FT_WHT,
 };
 
 static inline void ext2_set_de_type(ext2_dirent *de, struct inode *inode)
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -288,6 +288,22 @@ static int ext2_rmdir (struct inode * di
return err;
 }
 
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry)
+{
+   struct inode *inode;
+   int err;
+
+   inode = ext2_new_inode (dir, S_IFWHT | S_IRUGO);
+   err = PTR_ERR(inode);
+   if (IS_ERR(inode))
+   goto out;
+
+   mark_inode_dirty(inode);
+   err = ext2_add_nondir(dentry, inode);
+out:
+   return err;
+}
+
 static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
 {
@@ -382,6 +398,7 @@ const struct inode_operations ext2_dir_i
.mkdir  = ext2_mkdir,
.rmdir  = ext2_rmdir,
.mknod  = ext2_mknod,
+   .whiteout   = ext2_whiteout,
.rename = ext2_rename,
 #ifdef CONFIG_EXT2_FS_XATTR
.setxattr   = generic_setxattr,
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -754,6 +754,15 @@ static int ext2_fill_super(struct super_
ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
EXT2_MOUNT_XIP if not */
 
+   if ((sb->s_flags & MS_UNION) && !(sb->s_flags & MS_RDONLY)) {
+   if (!EXT2_HAS_INCOMPAT_FEATURE(sb,
+   EXT2_FEATURE_INCOMPAT_WHITEOUT)) {
+   sb->s_flags |= MS_RDONLY;
+   ext2_warning(sb, __FUNCTION__,
+   "no whiteout support, mounting filesystem read-only");
+   }
+   }
+
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -1292,7 +1301,7 @@ static struct file_system_type ext2_fs_t
.name   = "ext2",
.get_sb = ext2_get_sb,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV | FS_WHT,
 };
 
 static int __init init_ext2_fs(void)
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -61,6 +61,7 @@
 #define EXT2_ROOT_INO   2  /* Root inode */
 #define EXT2_BOOT_LOADER_INO    5  /* Boot loader inode */
 #define EXT2_UNDEL_DIR_INO  6  /* Undelete directory inode */
+#define EXT2_WHT_INO    7  /* Whiteout inode */
 
 /* First non-reserved inode for old ext2 filesystems */
 #define EXT2_GOOD_OLD_FIRST_INO 11
@@ -479,10 +480,12 @@ struct ext2_super_block {
 #define EXT3_FEATURE_INCOMPAT_RECOVER  0x0004
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV  0x0008
 #define EXT2_FEATURE_INCOMPAT_META_BG  0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
 #define EXT2_FEATURE_INCOMPAT_ANY  0x
 
 #define EXT2_FEATURE_COMPAT_SUPP   EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+EXT2_FEATURE_INCOMPAT_WHITEOUT| \
 EXT2_FEATURE_INCOMPAT_META_BG)
 #define EXT2_FEATURE_RO_COMPAT_SUPP(EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -549,6 +552,7 @@ enum {
EXT2_FT_FIFO,
EXT2_FT_SOCK,
EXT2_FT_SYMLINK,
+   EXT2_FT_WHT,
EXT2_FT_MAX
 };
 

[RFC][PATCH 13/14] ext3 whiteout support

2007-05-14 Thread Bharata B Rao

From: Bharata B Rao [EMAIL PROTECTED]
Subject: ext3 whiteout support

Introduce whiteout support for ext3.
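With this patch, ext3's empty_dir() no longer counts a whiteout entry as content: it deletes the whiteout and keeps scanning, so a directory that holds only whiteouts can be removed by rmdir(). Below is a simplified user-space model of that rule. It ignores the on-disk block walk and recognises "." and ".." by name. The whiteout type code 8 follows from DT_WHT being appended after DT_LNK in ext3_filetype_table[], and the model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* File type codes in ext3_filetype_table[] order; EXT3_FT_WHT follows
 * EXT3_FT_SYMLINK (7) with this patch. */
enum model_ft { MODEL_FT_REG = 1, MODEL_FT_DIR = 2, MODEL_FT_WHT = 8 };

struct model_dirent {
	unsigned int inode;	/* 0 means the slot is unused */
	int file_type;
	const char *name;
};

/* True when the directory holds nothing but ".", "..", unused slots and
 * whiteouts.  *whiteouts counts the whiteouts met before the scan ended;
 * the patched kernel code deletes each of those as it goes. */
static bool model_empty_dir(const struct model_dirent *de, size_t n,
			    size_t *whiteouts)
{
	size_t i;

	*whiteouts = 0;
	for (i = 0; i < n; i++) {
		if (de[i].inode == 0)
			continue;
		if (strcmp(de[i].name, ".") == 0 || strcmp(de[i].name, "..") == 0)
			continue;
		if (de[i].file_type != MODEL_FT_WHT)
			return false;	/* real content: not empty */
		(*whiteouts)++;
	}
	return true;
}
```

As in the patch, whiteouts met before a real entry are still counted (in the kernel they are already deleted) even when the directory turns out not to be empty.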

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/ext3/dir.c   |2 -
 fs/ext3/namei.c |   62 
 fs/ext3/super.c |   11 +++-
 include/linux/ext3_fs.h |5 +++
 4 files changed, 72 insertions(+), 8 deletions(-)

--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -29,7 +29,7 @@
 #include linux/rbtree.h
 
 static unsigned char ext3_filetype_table[] = {
-   DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
+   DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK, DT_WHT
 };
 
 static int ext3_readdir(struct file *, void *, filldir_t);
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1071,6 +1071,7 @@ static unsigned char ext3_type_by_mode[S
[S_IFIFO >> S_SHIFT]   = EXT3_FT_FIFO,
[S_IFSOCK >> S_SHIFT]  = EXT3_FT_SOCK,
[S_IFLNK >> S_SHIFT]   = EXT3_FT_SYMLINK,
+   [S_IFWHT >> S_SHIFT]   = EXT3_FT_WHT,
 };
 
 static inline void ext3_set_de_type(struct super_block *sb,
@@ -1786,7 +1787,7 @@ out_stop:
 /*
  * routine to check that the specified directory is empty (for rmdir)
  */
-static int empty_dir (struct inode * inode)
+static int empty_dir (handle_t *handle, struct inode * inode)
 {
unsigned long offset;
struct buffer_head * bh;
@@ -1848,8 +1849,28 @@ static int empty_dir (struct inode * ino
continue;
}
if (le32_to_cpu(de->inode)) {
-   brelse (bh);
-   return 0;
+   /* If this is a whiteout, remove it */
+   if (de->file_type == EXT3_FT_WHT) {
+   unsigned long ino = le32_to_cpu(de->inode);
+   struct inode *tmp_inode = iget(inode->i_sb, ino);
+   if (!tmp_inode) {
+   brelse (bh);
+   return 0;
+   }
+
+   if (ext3_delete_entry(handle, inode, de, bh)) {
+   iput(tmp_inode);
+   brelse (bh);
+   return 0;
+   }
+
+   tmp_inode->i_ctime = inode->i_ctime;
+   tmp_inode->i_nlink--;
+   iput(tmp_inode);
+   } else {
+   brelse (bh);
+   return 0;
+   }
}
offset += le16_to_cpu(de->rec_len);
de = (struct ext3_dir_entry_2 *)
@@ -2031,7 +2052,7 @@ static int ext3_rmdir (struct inode * di
goto end_rmdir;
 
retval = -ENOTEMPTY;
-   if (!empty_dir (inode))
+   if (!empty_dir (handle, inode))
goto end_rmdir;
 
retval = ext3_delete_entry(handle, dir, de, bh);
@@ -2060,6 +2081,36 @@ end_rmdir:
return retval;
 }
 
+static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
+{
+   struct inode * inode;
+   int err, retries = 0;
+   handle_t *handle;
+
+retry:
+   handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
+   EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+   2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
+   if (IS_ERR(handle))
+   return PTR_ERR(handle);
+
+   if (IS_DIRSYNC(dir))
+   handle->h_sync = 1;
+
+   inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
+   err = PTR_ERR(inode);
+   if (IS_ERR(inode))
+   goto out_stop;
+
+   err = ext3_add_nondir(handle, dentry, inode);
+
+out_stop:
+   ext3_journal_stop(handle);
+   if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
+   goto retry;
+   return err;
+}
+
 static int ext3_unlink(struct inode * dir, struct dentry *dentry)
 {
int retval;
@@ -2261,7 +2312,7 @@ static int ext3_rename (struct inode * o
if (S_ISDIR(old_inode->i_mode)) {
if (new_inode) {
retval = -ENOTEMPTY;
-   if (!empty_dir (new_inode))
+   if (!empty_dir (handle, new_inode))
goto end_rename;
}
retval = -EIO;
@@ -2377,6 +2428,7 @@ const struct inode_operations ext3_dir_i
.mkdir  = ext3_mkdir,
.rmdir  = ext3_rmdir,
.mknod  = ext3_mknod,
+   .whiteout   = ext3_whiteout,
.rename = ext3_rename,
.setattr= ext3_setattr,
 #ifdef CONFIG_EXT3_FS_XATTR
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1492,6

[RFC][PATCH 14/14] tmpfs whiteout support

2007-05-14 Thread Bharata B Rao

From: Jan Blunck [EMAIL PROTECTED]
Subject: tmpfs whiteout support

Introduce whiteout support to tmpfs.
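tmpfs has no on-disk directory format, so shmem accounts for directory size with a fixed BOGO_DIRENT_SIZE per entry. A new directory starts at 2 * BOGO_DIRENT_SIZE, and each create or unlink adjusts the parent's i_size by that amount. The patch lowers the constant from 20 to 1, which makes i_size equal to the entry count (including "." and ".."). A toy model of that bookkeeping, with hypothetical names:

```c
/* Value after this patch; it was 20 before. */
#define MODEL_BOGO_DIRENT_SIZE 1

struct model_dir {
	long long i_size;
};

/* New directory: room for "." and "..". */
static void model_dir_init(struct model_dir *d)
{
	d->i_size = 2 * MODEL_BOGO_DIRENT_SIZE;
}

/* Creating an entry (file, whiteout, ...) grows the parent. */
static void model_add_entry(struct model_dir *d)
{
	d->i_size += MODEL_BOGO_DIRENT_SIZE;
}

/* Unlinking an entry shrinks the parent again. */
static void model_remove_entry(struct model_dir *d)
{
	d->i_size -= MODEL_BOGO_DIRENT_SIZE;
}
```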

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 mm/shmem.c |9 -
 1 files changed, 8 insertions(+), 1 deletion(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -74,7 +74,7 @@
 #define LATENCY_LIMIT   64
 
 /* Pretend that each entry is of this size in directory's i_size */
-#define BOGO_DIRENT_SIZE 20
+#define BOGO_DIRENT_SIZE 1
 
 /* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
 enum sgp_type {
@@ -1772,6 +1772,11 @@ static int shmem_create(struct inode *di
return shmem_mknod(dir, dentry, mode | S_IFREG, 0);
 }
 
+static int shmem_whiteout(struct inode *dir, struct dentry *dentry)
+{
+   return shmem_mknod(dir, dentry, S_IRUGO | S_IWUGO | S_IFWHT, 0);
+}
+
 /*
  * Link a file..
  */
@@ -2399,6 +2404,7 @@ static const struct inode_operations shm
.rmdir  = shmem_rmdir,
.mknod  = shmem_mknod,
.rename = shmem_rename,
+   .whiteout   = shmem_whiteout,
 #endif
 #ifdef CONFIG_TMPFS_POSIX_ACL
.setattr= shmem_notify_change,
@@ -2453,6 +2459,7 @@ static struct file_system_type tmpfs_fs_
.name   = tmpfs,
.get_sb = shmem_get_sb,
.kill_sb= kill_litter_super,
+   .fs_flags   = FS_WHT,
 };
 static struct vfsmount *shm_mnt;
 

Re: [RFC][PATCH 9/14] Union-mount readdir

2007-05-14 Thread Carsten Otte


On 5/14/07, Bharata B Rao [EMAIL PROTECTED] wrote:

+/* This is a copy from fs/readdir.c */
+struct getdents_callback {
+   struct linux_dirent __user *current_dir;
+   struct linux_dirent __user *previous;
+   int count;
+   int error;
+};

This should go into a header file.


+static int union_cache_find_entry(struct list_head *uc_list,
+ const char *name, int namelen)
+{
+   struct union_cache_entry *p;
+   int ret = 0;
+
+   list_for_each_entry(p, uc_list, list) {
+   if (p->name.len != namelen)
+   continue;
+   if (strncmp(p->name.name, name, namelen) == 0) {
+   ret = 1;
+   break;
+   }
+   }
+   return ret;
+}

Why not use strlen instead of having both string and length as parameter?


+static struct file * __dentry_open_read(struct dentry *dentry,
+   struct vfsmount *mnt, int flags)
+{
+   struct file *f;
+   struct inode *inode;
+   int error;
+
+   error = -ENFILE;
+   f = get_empty_filp();
+   if (!f)
+   goto out;

This is the only case where error is not explicitly set to a different
value before hitting out/cleanup => consider setting conditionally.

so long,
Carsten

Re: [RFC][PATCH 9/14] Union-mount readdir

2007-05-14 Thread Bharata B Rao

On Mon, May 14, 2007 at 12:43:43PM +0200, Carsten Otte wrote:
> On 5/14/07, Bharata B Rao [EMAIL PROTECTED] wrote:
> +/* This is a copy from fs/readdir.c */
> +struct getdents_callback {
> +   struct linux_dirent __user *current_dir;
> +   struct linux_dirent __user *previous;
> +   int count;
> +   int error;
> +};
> This should go into a header file.

Yes ideally. As the comment above says, it is copied from fs/readdir.c and
we should be using the definition from there. But that needs touching
additional files and we wanted to avoid that for this initial RFC post.

 
> +static int union_cache_find_entry(struct list_head *uc_list,
> + const char *name, int namelen)
> +{
> +   struct union_cache_entry *p;
> +   int ret = 0;
> +
> +   list_for_each_entry(p, uc_list, list) {
> +   if (p->name.len != namelen)
> +   continue;
> +   if (strncmp(p->name.name, name, namelen) == 0) {
> +   ret = 1;
> +   break;
> +   }
> +   }
> +   return ret;
> +}
> Why not use strlen instead of having both string and length as parameter?
 

All generic filldir routines in fs/readdir.c (filldir, fillonedir and
filldir64) don't depend on dirent->d_name being NUL-terminated; they put
a 0 at the end themselves. Hence we don't depend on the name string
being NUL-terminated either.
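To illustrate: when every name carries an explicit length, matching can compare the lengths first and then exactly that many bytes, so neither side ever needs a terminating NUL. Here is a user-space sketch of the same rule as union_cache_find_entry(). It uses memcmp() where the patch uses strncmp(); once the lengths are known to be equal, the two agree for names without embedded NUL bytes. The struct is an illustrative stand-in for union_cache_entry.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for struct union_cache_entry: the name is stored
 * with an explicit length and need not be NUL-terminated. */
struct model_cache_entry {
	const char *name;
	int len;
};

/* Returns 1 if some entry matches name[0..namelen), else 0. */
static int model_find_entry(const struct model_cache_entry *e, size_t n,
			    const char *name, int namelen)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (e[i].len != namelen)
			continue;	/* cheap length check first */
		if (memcmp(e[i].name, name, (size_t)namelen) == 0)
			return 1;
	}
	return 0;
}
```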

> +static struct file * __dentry_open_read(struct dentry *dentry,
> +   struct vfsmount *mnt, int flags)
> +{
> +   struct file *f;
> +   struct inode *inode;
> +   int error;
> +
> +   error = -ENFILE;
> +   f = get_empty_filp();
> +   if (!f)
> +   goto out;
> This is the only case where error is not explicitly set to a different
> value before hitting out/cleanup => consider setting conditionally.

Sure can be done. Again this routine is copied from dentry_open() and
hence it is like that atm.

Thanks for your review.

Regards,
Bharata.

[AppArmor 45/45] Fix file_permission()

2007-05-14 Thread jjohansen

We cannot easily switch from file_permission() to vfs_permission()
everywhere, so fix file_permission() to not use a NULL nameidata
for the remaining users.

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 fs/namei.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -296,7 +296,13 @@ int vfs_permission(struct nameidata *nd,
  */
 int file_permission(struct file *file, int mask)
 {
-   return permission(file->f_path.dentry->d_inode, mask, NULL);
+   struct nameidata nd;
+
+   nd.dentry = file->f_path.dentry;
+   nd.mnt = file->f_path.mnt;
+   nd.flags = LOOKUP_ACCESS;
+
+   return permission(nd.dentry->d_inode, mask, &nd);
 }
 
 /*


[AppArmor 24/45] Pass struct vfsmount to the inode_getxattr LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/xattr.c   |2 +-
 include/linux/security.h |   13 -
 security/dummy.c |3 ++-
 security/selinux/hooks.c |3 ++-
 4 files changed, 13 insertions(+), 8 deletions(-)

--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -116,7 +116,7 @@ vfs_getxattr(struct dentry *dentry, stru
if (error)
return error;
 
-   error = security_inode_getxattr(dentry, name);
+   error = security_inode_getxattr(dentry, mnt, name);
if (error)
return error;
 
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -391,7 +391,7 @@ struct request_sock;
  * @value identified by @name for @dentry and @mnt.
  * @inode_getxattr:
  * Check permission before obtaining the extended attributes
- * identified by @name for @dentry.
+ * identified by @name for @dentry and @mnt.
  * Return 0 if permission is granted.
  * @inode_listxattr:
  * Check permission before obtaining the list of extended attribute 
@@ -1248,7 +1248,8 @@ struct security_operations {
 struct vfsmount *mnt,
 char *name, void *value,
 size_t size, int flags);
-   int (*inode_getxattr) (struct dentry *dentry, char *name);
+   int (*inode_getxattr) (struct dentry *dentry, struct vfsmount *mnt,
+  char *name);
int (*inode_listxattr) (struct dentry *dentry);
int (*inode_removexattr) (struct dentry *dentry, char *name);
const char *(*inode_xattr_getsuffix) (void);
@@ -1782,11 +1783,12 @@ static inline void security_inode_post_s
security_ops->inode_post_setxattr (dentry, mnt, name, value, size, flags);
 }
 
-static inline int security_inode_getxattr (struct dentry *dentry, char *name)
+static inline int security_inode_getxattr (struct dentry *dentry,
+   struct vfsmount *mnt, char *name)
 {
if (unlikely (IS_PRIVATE (dentry->d_inode)))
return 0;
-   return security_ops->inode_getxattr (dentry, name);
+   return security_ops->inode_getxattr (dentry, mnt, name);
 }
 
 static inline int security_inode_listxattr (struct dentry *dentry)
@@ -2487,7 +2489,8 @@ static inline void security_inode_post_s
 int flags)
 { }
 
-static inline int security_inode_getxattr (struct dentry *dentry, char *name)
+static inline int security_inode_getxattr (struct dentry *dentry,
+   struct vfsmount *mnt, char *name)
 {
return 0;
 }
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -368,7 +368,8 @@ static void dummy_inode_post_setxattr (s
 {
 }
 
-static int dummy_inode_getxattr (struct dentry *dentry, char *name)
+static int dummy_inode_getxattr (struct dentry *dentry,
+ struct vfsmount *mnt, char *name)
 {
return 0;
 }
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2393,7 +2393,8 @@ static void selinux_inode_post_setxattr(
return;
 }
 
-static int selinux_inode_getxattr (struct dentry *dentry, char *name)
+static int selinux_inode_getxattr (struct dentry *dentry, struct vfsmount *mnt,
+  char *name)
 {
return dentry_has_perm(current, NULL, dentry, FILE__GETATTR);
 }


[AppArmor 42/45] AppArmor: add lock subtyping so lockdep does not report false dependencies

2007-05-14 Thread jjohansen

AppArmor uses lock subtyping to avoid false positives from lockdep.  The
profile lock is often taken nested, but always in a safe lock order and
never twice on the same lock, so this is safe.

A third lock type (aa_lock_task_release) is given to the profile lock
when it is taken in soft irq context during task release (aa_release).
This avoids a false positive between the task lock and the profile
lock: in task context the profile lock wraps the task lock with irqs
off, but the kernel elsewhere takes the task lock with irqs enabled.
This can never result in a deadlock because aa_release does not need
to take the task lock of the dead task being released.
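The nested case relies on a stable lock order: whenever two profile locks are needed, every caller must take them in the same order, and a profile that appears twice must be locked only once. lock_both_profiles() orders the two profiles by comparing their addresses; the first lock is taken with aa_lock_normal and the second with aa_lock_nested. A user-space sketch of only the ordering decision follows. model_lock_order() is hypothetical and does no locking itself.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fills order[] with the locks to take, first to last, and returns how
 * many there are.  Two distinct locks are ordered by descending address,
 * so any two callers that need the same pair agree on the order.  A lock
 * passed twice, or a NULL argument, yields a single acquisition.
 */
static size_t model_lock_order(void *a, void *b, void *order[2])
{
	if (a == NULL || a == b) {
		order[0] = b;
		return b != NULL ? 1 : 0;
	}
	if (b == NULL) {
		order[0] = a;
		return 1;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		order[0] = a;	/* would be locked as aa_lock_normal */
		order[1] = b;	/* would be locked as aa_lock_nested */
	} else {
		order[0] = b;
		order[1] = a;
	}
	return 2;
}
```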

Signed-off-by: John Johansen [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Cc: Ingo Molnar [EMAIL PROTECTED]

---
 security/apparmor/apparmor.h  |7 +++
 security/apparmor/inline.h|   25 ++---
 security/apparmor/locking.txt |   21 +++--
 security/apparmor/main.c  |6 +++---
 4 files changed, 43 insertions(+), 16 deletions(-)

--- a/security/apparmor/apparmor.h
+++ b/security/apparmor/apparmor.h
@@ -185,6 +185,13 @@ struct aa_audit {
 #define AA_CHECK_DIR   2  /* file type is directory */
 #define AA_CHECK_MANGLE    4  /* leave extra room for name mangling */
 
+/* lock subtypes so lockdep does not raise false dependencies */
+enum aa_lock_class {
+   aa_lock_normal,
+   aa_lock_nested,
+   aa_lock_task_release
+};
+
 /* main.c */
 extern int alloc_null_complain_profile(void);
 extern void free_null_complain_profile(void);
--- a/security/apparmor/inline.h
+++ b/security/apparmor/inline.h
@@ -99,7 +99,8 @@ static inline void aa_free_task_context(
  * While the profile is locked, local interrupts are disabled. This also
  * gives us RCU reader safety.
  */
-static inline void lock_profile(struct aa_profile *profile)
+static inline void lock_profile_nested(struct aa_profile *profile,
+  enum aa_lock_class lock_class)
 {
/* We always lock top-level profiles instead of children. */
if (profile)
@@ -112,7 +113,13 @@ static inline void lock_profile(struct a
 * the task_free_security hook, which may run in RCU context.
 */
if (profile)
-   spin_lock_irqsave(&profile->lock, profile->int_flags);
+   spin_lock_irqsave_nested(&profile->lock, profile->int_flags,
+lock_class);
+}
+
+static inline void lock_profile(struct aa_profile *profile)
+{
+   lock_profile_nested(profile, aa_lock_normal);
 }
 
 /**
@@ -161,17 +168,21 @@ static inline void lock_both_profiles(st
 */
if (!profile1 || profile1 == profile2) {
if (profile2)
-   spin_lock_irqsave(&profile2->lock, profile2->int_flags);
+   spin_lock_irqsave_nested(&profile2->lock,
+profile2->int_flags,
+aa_lock_normal);
} else if (profile1 > profile2) {
/* profile1 cannot be NULL here. */
-   spin_lock_irqsave(&profile1->lock, profile1->int_flags);
+   spin_lock_irqsave_nested(&profile1->lock, profile1->int_flags,
+aa_lock_normal);
if (profile2)
-   spin_lock(&profile2->lock);
+   spin_lock_nested(&profile2->lock, aa_lock_nested);
 
} else {
/* profile2 cannot be NULL here. */
-   spin_lock_irqsave(&profile2->lock, profile2->int_flags);
-   spin_lock(&profile1->lock);
+   spin_lock_irqsave_nested(&profile2->lock, profile2->int_flags,
+aa_lock_normal);
+   spin_lock_nested(&profile1->lock, aa_lock_nested);
}
 }
 
--- a/security/apparmor/locking.txt
+++ b/security/apparmor/locking.txt
@@ -51,9 +51,18 @@ list, and can sleep. This ensures that p
 won't race with itself. We release the profile_list_lock as soon as
 possible to avoid stalling exec during profile loading/replacement/removal.
 
-lock_dep reports a false 'possible irq lock inversion dependency detected'
-when the profile lock is taken in aa_release.  This is due to that the
-task_lock is often taken inside the profile lock but other kernel code
-takes the task_lock with interrupts enabled.  A deadlock will not actually
-occur because apparmor does not take the task_lock in hard_irq or soft_irq
-context.
+AppArmor uses lock subtyping to avoid false positives from lockdep.  The
+profile lock is often taken nested, but it is guaranteed to be in a lock
+safe order and not the same lock when done, so it is safe.
+
+A third lock type (aa_lock_task_release) is given to the profile lock
+when it is taken in soft irq context during task release (aa_release).
+This is to avoid a false positive between the task lock and the profile

[RFD Patch 3/4] Dont use a NULL nameidata in xattr_permission()

2007-05-14 Thread jjohansen

Create a nameidata2 struct in xattr_permission() so that it does not pass
a NULL nameidata to permission().

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 fs/xattr.c |   18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -25,8 +25,16 @@
  * because different namespaces have very different rules.
  */
 static int
-xattr_permission(struct inode *inode, const char *name, int mask)
+xattr_permission(struct dentry *dentry, struct vfsmount *mnt, const char *name,
+int mask)
 {
+   struct inode *inode = dentry->d_inode;
+   struct nameidata2 nd;
+
+   nd.dentry = dentry;
+   nd.mnt = mnt;
+   nd.flags = 0;
+
/*
 * We can never set or remove an extended attribute on a read-only
 * filesystem  or on an immutable / append-only inode.
@@ -65,7 +73,7 @@ xattr_permission(struct inode *inode, co
return -EPERM;
}
 
-   return permission(inode, mask, NULL);
+   return permission(inode, mask, nd);
 }
 
 int
@@ -75,7 +83,7 @@ vfs_setxattr(struct dentry *dentry, stru
struct inode *inode = dentry-d_inode;
int error;
 
-   error = xattr_permission(inode, name, MAY_WRITE);
+   error = xattr_permission(dentry, mnt, name, MAY_WRITE);
if (error)
return error;
 
@@ -112,7 +120,7 @@ vfs_getxattr(struct dentry *dentry, stru
struct inode *inode = dentry-d_inode;
int error;
 
-   error = xattr_permission(inode, name, MAY_READ);
+   error = xattr_permission(dentry, mnt, name, MAY_READ);
if (error)
return error;
 
@@ -174,7 +182,7 @@ vfs_removexattr(struct dentry *dentry, s
if (!inode-i_op-removexattr)
return -EOPNOTSUPP;
 
-   error = xattr_permission(inode, name, MAY_WRITE);
+   error = xattr_permission(dentry, mnt, name, MAY_WRITE);
if (error)
return error;
 


[AppArmor 32/45] Enable LSM hooks to distinguish operations on file descriptors from operations on pathnames

2007-05-14 Thread jjohansen

Struct iattr already contains ia_file since commit cc4e69de from 
Miklos (which is related to commit befc649c). Use this to pass
struct file down the setattr hooks. This allows LSMs to distinguish
operations on file descriptors from operations on paths.
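On the hook side, this means a setattr LSM hook can tell fchmod()/fchown() apart from chmod()/chown(). With this patch, sys_fchmod() and sys_fchown() set ATTR_FILE and fill in ia_file, while the path-based callers pass NULL and leave the bit clear. A sketch of that test with stand-in types follows. The numeric bit values are placeholders. The check is on the ATTR_FILE bit rather than on ia_file, as the struct iattr comment in include/linux/fs.h asks.

```c
#include <stddef.h>

/* Placeholder bit values; only the bit test matters here. */
#define MODEL_ATTR_MODE	(1u << 0)
#define MODEL_ATTR_FILE	(1u << 13)

struct model_file {
	int fd;			/* stand-in for struct file */
};

struct model_iattr {
	unsigned int ia_valid;
	struct model_file *ia_file;
};

/* Returns the file an operation came through, or NULL for a path-based
 * operation.  ia_file is trusted only when ATTR_FILE is set. */
static const struct model_file *model_setattr_file(const struct model_iattr *ia)
{
	return (ia->ia_valid & MODEL_ATTR_FILE) ? ia->ia_file : NULL;
}
```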

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]
Cc: Miklos Szeredi [EMAIL PROTECTED]

---
 fs/nfsd/vfs.c  |   12 +++-
 fs/open.c  |   16 +++-
 include/linux/fs.h |3 +++
 3 files changed, 21 insertions(+), 10 deletions(-)

--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -383,7 +383,7 @@ static ssize_t nfsd_getxattr(struct dent
 {
ssize_t buflen;
 
-   buflen = vfs_getxattr(dentry, mnt, key, NULL, 0);
+   buflen = vfs_getxattr(dentry, mnt, key, NULL, 0, NULL);
if (buflen <= 0)
return buflen;
 
@@ -391,7 +391,7 @@ static ssize_t nfsd_getxattr(struct dent
if (!*buf)
return -ENOMEM;
 
-   return vfs_getxattr(dentry, mnt, key, *buf, buflen);
+   return vfs_getxattr(dentry, mnt, key, *buf, buflen, NULL);
 }
 #endif
 
@@ -417,7 +417,7 @@ set_nfsv4_acl_one(struct dentry *dentry,
goto out;
}
 
-   error = vfs_setxattr(dentry, mnt, key, buf, len, 0);
+   error = vfs_setxattr(dentry, mnt, key, buf, len, 0, NULL);
 out:
kfree(buf);
return error;
@@ -1992,12 +1992,14 @@ nfsd_set_posix_acl(struct svc_fh *fhp, i
 
mnt = fhp->fh_export->ex_mnt;
if (size)
-   error = vfs_setxattr(fhp->fh_dentry, mnt, name, value, size,0);
+   error = vfs_setxattr(fhp->fh_dentry, mnt, name, value, size, 0,
+NULL);
else {
if (!S_ISDIR(inode->i_mode) && type == ACL_TYPE_DEFAULT)
error = 0;
else {
-   error = vfs_removexattr(fhp->fh_dentry, mnt, name);
+   error = vfs_removexattr(fhp->fh_dentry, mnt, name,
+   NULL);
if (error == -ENODATA)
error = 0;
}
--- a/fs/open.c
+++ b/fs/open.c
@@ -522,6 +522,8 @@ asmlinkage long sys_fchmod(unsigned int 
mode = inode->i_mode;
newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
+   newattrs.ia_valid |= ATTR_FILE;
+   newattrs.ia_file = file;
err = notify_change(dentry, file->f_path.mnt, &newattrs);
mutex_unlock(&inode->i_mutex);
 
@@ -572,7 +574,7 @@ asmlinkage long sys_chmod(const char __u
 }
 
 static int chown_common(struct dentry * dentry, struct vfsmount *mnt,
-   uid_t user, gid_t group)
+   uid_t user, gid_t group, struct file *file)
 {
struct inode * inode;
int error;
@@ -600,6 +602,10 @@ static int chown_common(struct dentry * 
}
if (!S_ISDIR(inode->i_mode))
newattrs.ia_valid |= ATTR_KILL_SUID|ATTR_KILL_SGID;
+   if (file) {
+   newattrs.ia_file = file;
+   newattrs.ia_valid |= ATTR_FILE;
+   }
mutex_lock(&inode->i_mutex);
error = notify_change(dentry, mnt, &newattrs);
mutex_unlock(&inode->i_mutex);
@@ -615,7 +621,7 @@ asmlinkage long sys_chown(const char __u
error = user_path_walk(filename, &nd);
if (error)
goto out;
-   error = chown_common(nd.dentry, nd.mnt, user, group);
+   error = chown_common(nd.dentry, nd.mnt, user, group, NULL);
path_release(&nd);
 out:
return error;
@@ -635,7 +641,7 @@ asmlinkage long sys_fchownat(int dfd, co
error = __user_walk_fd(dfd, filename, follow, &nd);
if (error)
goto out;
-   error = chown_common(nd.dentry, nd.mnt, user, group);
+   error = chown_common(nd.dentry, nd.mnt, user, group, NULL);
path_release(&nd);
 out:
return error;
@@ -649,7 +655,7 @@ asmlinkage long sys_lchown(const char __
error = user_path_walk_link(filename, &nd);
if (error)
goto out;
-   error = chown_common(nd.dentry, nd.mnt, user, group);
+   error = chown_common(nd.dentry, nd.mnt, user, group, NULL);
path_release(&nd);
 out:
return error;
@@ -668,7 +674,7 @@ asmlinkage long sys_fchown(unsigned int 
 
dentry = file->f_path.dentry;
audit_inode(NULL, dentry->d_inode);
-   error = chown_common(dentry, file->f_path.mnt, user, group);
+   error = chown_common(dentry, file->f_path.mnt, user, group, file);
fput(file);
 out:
return error;
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -351,6 +351,9 @@ struct iattr {
 * Not an attribute, but an auxilary info for filesystems wanting to
 * implement an ftruncate() like method.  NOTE: filesystem should
 * check for (ia_valid & ATTR_FILE), and not for

[AppArmor 06/45] Pass struct vfsmount to the inode_mkdir LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.
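A dentry is not enough on its own because the same dentry can be reachable through several mounts (bind mounts, for example), and only the (vfsmount, dentry) pair pins down one pathname. The toy model below is loosely patterned on __d_path() but is only a sketch under stated assumptions: the structs are hypothetical, heavily reduced stand-ins, and there is no locking, no chroot root, and no handling of deleted dentries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins; the real structures carry far more state. */
struct dentry {
	const char *name;           /* last path component */
	struct dentry *parent;      /* a filesystem root points to itself */
};

struct vfsmount {
	struct dentry *mnt_root;        /* root dentry of the mounted fs */
	struct dentry *mnt_mountpoint;  /* where it sits in the parent mount */
	struct vfsmount *mnt_parent;    /* the global root points to itself */
};

/*
 * Build the absolute path of (mnt, dentry), filling buf from the end the
 * way __d_path() does. Returns a pointer into buf, or NULL if it is too small.
 */
static char *toy_d_path(struct vfsmount *mnt, struct dentry *dentry,
			char *buf, size_t buflen)
{
	char *p;

	assert(buf != NULL);
	if (buflen < 2)
		return NULL;
	p = buf + buflen - 1;
	*p = '\0';

	for (;;) {
		if (dentry == mnt->mnt_root || dentry->parent == dentry) {
			if (mnt->mnt_parent == mnt)
				break;                   /* reached the global root */
			dentry = mnt->mnt_mountpoint;    /* cross into parent mount */
			mnt = mnt->mnt_parent;
			continue;
		}
		size_t len = strlen(dentry->name);
		if ((size_t)(p - buf) < len + 1)
			return NULL;
		p -= len;
		memcpy(p, dentry->name, len);
		*--p = '/';
		dentry = dentry->parent;
	}
	if (*p == '\0')                  /* the object is the root itself */
		*--p = '/';
	return p;
}
```

Mounting one filesystem at both /mnt and /backup gives a single `data` dentry two different names, and only the vfsmount argument tells them apart.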

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/namei.c   |2 +-
 include/linux/security.h |8 ++--
 security/dummy.c |2 +-
 security/selinux/hooks.c |3 ++-
 4 files changed, 10 insertions(+), 5 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1946,7 +1946,7 @@ int vfs_mkdir(struct inode *dir, struct 
return -EPERM;
 
mode &= (S_IRWXUGO|S_ISVTX);
-   error = security_inode_mkdir(dir, dentry, mode);
+   error = security_inode_mkdir(dir, dentry, mnt, mode);
if (error)
return error;
 
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -308,6 +308,7 @@ struct request_sock;
  * associated with inode strcture @dir. 
  * @dir containst the inode structure of parent of the directory to be created.
  * @dentry contains the dentry structure of new directory.
+ * @mnt is the vfsmount corresponding to @dentry (may be NULL).
  * @mode contains the mode of new directory.
  * Return 0 if permission is granted.
  * @inode_rmdir:
@@ -1213,7 +1214,8 @@ struct security_operations {
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir,
  struct dentry *dentry, const char *old_name);
-   int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
+   int (*inode_mkdir) (struct inode *dir, struct dentry *dentry,
+   struct vfsmount *mnt, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
int mode, dev_t dev);
@@ -1650,11 +1652,12 @@ static inline int security_inode_symlink
 
 static inline int security_inode_mkdir (struct inode *dir,
struct dentry *dentry,
+   struct vfsmount *mnt,
int mode)
 {
if (unlikely (IS_PRIVATE (dir)))
return 0;
-   return security_ops->inode_mkdir (dir, dentry, mode);
+   return security_ops->inode_mkdir (dir, dentry, mnt, mode);
 }
 
 static inline int security_inode_rmdir (struct inode *dir,
@@ -2371,6 +2374,7 @@ static inline int security_inode_symlink
 
 static inline int security_inode_mkdir (struct inode *dir,
struct dentry *dentry,
+   struct vfsmount *mnt,
int mode)
 {
return 0;
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -288,7 +288,7 @@ static int dummy_inode_symlink (struct i
 }
 
 static int dummy_inode_mkdir (struct inode *inode, struct dentry *dentry,
- int mask)
+ struct vfsmount *mnt, int mask)
 {
return 0;
 }
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2207,7 +2207,8 @@ static int selinux_inode_symlink(struct 
return may_create(dir, dentry, SECCLASS_LNK_FILE);
 }
 
-static int selinux_inode_mkdir(struct inode *dir, struct dentry *dentry, int mask)
+static int selinux_inode_mkdir(struct inode *dir, struct dentry *dentry,
+  struct vfsmount *mnt, int mask)
 {
return may_create(dir, dentry, SECCLASS_DIR);
 }


[AppArmor 20/45] Pass struct vfsmount to the inode_rename LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/namei.c   |6 --
 include/linux/security.h |   18 +-
 security/dummy.c |4 +++-
 security/selinux/hooks.c |8 ++--
 4 files changed, 26 insertions(+), 10 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2417,7 +2417,8 @@ static int vfs_rename_dir(struct inode *
return error;
}
 
-   error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
+   error = security_inode_rename(old_dir, old_dentry, old_mnt,
+ new_dir, new_dentry, new_mnt);
if (error)
return error;
 
@@ -2451,7 +2452,8 @@ static int vfs_rename_other(struct inode
struct inode *target;
int error;
 
-   error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
+   error = security_inode_rename(old_dir, old_dentry, old_mnt,
+ new_dir, new_dentry, new_mnt);
if (error)
return error;
 
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -336,8 +336,10 @@ struct request_sock;
  * Check for permission to rename a file or directory.
  * @old_dir contains the inode structure for parent of the old link.
  * @old_dentry contains the dentry structure of the old link.
+ * @old_mnt is the vfsmount corresponding to @old_dentry (may be NULL).
  * @new_dir contains the inode structure for parent of the new link.
  * @new_dentry contains the dentry structure of the new link.
+ * @new_mnt is the vfsmount corresponding to @new_dentry (may be NULL).
  * Return 0 if permission is granted.
  * @inode_readlink:
  * Check the permission to read the symbolic link.
@@ -1230,7 +1232,9 @@ struct security_operations {
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
struct vfsmount *mnt, int mode, dev_t dev);
int (*inode_rename) (struct inode *old_dir, struct dentry *old_dentry,
-struct inode *new_dir, struct dentry *new_dentry);
+struct vfsmount *old_mnt,
+struct inode *new_dir, struct dentry *new_dentry,
+struct vfsmount *new_mnt);
int (*inode_readlink) (struct dentry *dentry, struct vfsmount *mnt);
int (*inode_follow_link) (struct dentry *dentry, struct nameidata *nd);
int (*inode_permission) (struct inode *inode, int mask, struct nameidata *nd);
@@ -1696,14 +1700,16 @@ static inline int security_inode_mknod (
 
 static inline int security_inode_rename (struct inode *old_dir,
 struct dentry *old_dentry,
+struct vfsmount *old_mnt,
 struct inode *new_dir,
-struct dentry *new_dentry)
+struct dentry *new_dentry,
+struct vfsmount *new_mnt)
 {
 if (unlikely (IS_PRIVATE (old_dentry->d_inode) ||
 (new_dentry->d_inode && IS_PRIVATE (new_dentry->d_inode))))
return 0;
-   return security_ops->inode_rename (old_dir, old_dentry,
-  new_dir, new_dentry);
+   return security_ops->inode_rename (old_dir, old_dentry, old_mnt,
+  new_dir, new_dentry, new_mnt);
 }
 
 static inline int security_inode_readlink (struct dentry *dentry,
@@ -2419,8 +2425,10 @@ static inline int security_inode_mknod (
 
 static inline int security_inode_rename (struct inode *old_dir,
 struct dentry *old_dentry,
+struct vfsmount *old_mnt,
 struct inode *new_dir,
-struct dentry *new_dentry)
+struct dentry *new_dentry,
+struct vfsmount *new_mnt)
 {
return 0;
 }
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -310,8 +310,10 @@ static int dummy_inode_mknod (struct ino
 
 static int dummy_inode_rename (struct inode *old_inode,
   struct dentry *old_dentry,
+  struct vfsmount *old_mnt,
   struct inode *new_inode,
-  struct dentry *new_dentry)
+  struct dentry *new_dentry,
+  struct vfsmount *new_mnt)
 {
return 0;
 }
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2238,8 +2238,12 @@ static int selinux_inode_mknod(struct in

[AppArmor 14/45] Add a struct vfsmount parameter to vfs_rmdir()

2007-05-14 Thread jjohansen

The vfsmount will be passed down to the LSM hook so that LSMs can compute
pathnames.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/ecryptfs/inode.c   |4 +++-
 fs/namei.c|4 ++--
 fs/nfsd/nfs4recover.c |2 +-
 fs/nfsd/vfs.c |8 +---
 fs/reiserfs/xattr.c   |2 +-
 include/linux/fs.h|2 +-
 6 files changed, 13 insertions(+), 9 deletions(-)

--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -542,14 +542,16 @@ out:
 static int ecryptfs_rmdir(struct inode *dir, struct dentry *dentry)
 {
struct dentry *lower_dentry;
+   struct vfsmount *lower_mnt;
struct dentry *lower_dir_dentry;
int rc;
 
lower_dentry = ecryptfs_dentry_to_lower(dentry);
+   lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
dget(dentry);
lower_dir_dentry = lock_parent(lower_dentry);
dget(lower_dentry);
-   rc = vfs_rmdir(lower_dir_dentry->d_inode, lower_dentry);
+   rc = vfs_rmdir(lower_dir_dentry->d_inode, lower_dentry, lower_mnt);
dput(lower_dentry);
if (!rc)
d_delete(lower_dentry);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2024,7 +2024,7 @@ void dentry_unhash(struct dentry *dentry
spin_unlock(&dcache_lock);
 }
 
-int vfs_rmdir(struct inode *dir, struct dentry *dentry)
+int vfs_rmdir(struct inode *dir, struct dentry *dentry,struct vfsmount *mnt)
 {
int error = may_delete(dir, dentry, 1);
 
@@ -2088,7 +2088,7 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
-   error = vfs_rmdir(nd.dentry->d_inode, dentry);
+   error = vfs_rmdir(nd.dentry->d_inode, dentry, nd.mnt);
dput(dentry);
 exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -276,7 +276,7 @@ nfsd4_clear_clid_dir(struct dentry *dir,
 * a kernel from the future */
nfsd4_list_rec_dir(dentry, nfsd4_remove_clid_file);
mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
-   status = vfs_rmdir(dir->d_inode, dentry);
+   status = vfs_rmdir(dir->d_inode, dentry, rec_dir.mnt);
mutex_unlock(&dir->d_inode->i_mutex);
return status;
 }
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1666,6 +1666,7 @@ nfsd_unlink(struct svc_rqst *rqstp, stru
char *fname, int flen)
 {
struct dentry   *dentry, *rdentry;
+   struct svc_export *exp;
struct inode*dirp;
__be32  err;
int host_err;
@@ -1680,6 +1681,7 @@ nfsd_unlink(struct svc_rqst *rqstp, stru
fh_lock_nested(fhp, I_MUTEX_PARENT);
dentry = fhp->fh_dentry;
dirp = dentry->d_inode;
+   exp = fhp->fh_export;
 
rdentry = lookup_one_len(fname, dentry, flen);
host_err = PTR_ERR(rdentry);
@@ -1697,21 +1699,21 @@ nfsd_unlink(struct svc_rqst *rqstp, stru
 
if (type != S_IFDIR) { /* It's UNLINK */
 #ifdef MSNFS
-   if ((fhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+   if ((exp->ex_flags & NFSEXP_MSNFS) &&
(atomic_read(&rdentry->d_count) > 1)) {
host_err = -EPERM;
} else
 #endif
host_err = vfs_unlink(dirp, rdentry);
} else { /* It's RMDIR */
-   host_err = vfs_rmdir(dirp, rdentry);
+   host_err = vfs_rmdir(dirp, rdentry, exp->ex_mnt);
}
 
dput(rdentry);
 
if (host_err)
goto out_nfserr;
-   if (EX_ISSYNC(fhp->fh_export))
+   if (EX_ISSYNC(exp))
host_err = nfsd_sync_dir(dentry);
 
 out_nfserr:
--- a/fs/reiserfs/xattr.c
+++ b/fs/reiserfs/xattr.c
@@ -775,7 +775,7 @@ int reiserfs_delete_xattrs(struct inode 
if (dir->d_inode->i_nlink <= 2) {
root = get_xa_root(inode->i_sb, XATTR_REPLACE);
reiserfs_write_lock_xattrs(inode->i_sb);
-   err = vfs_rmdir(root->d_inode, dir);
+   err = vfs_rmdir(root->d_inode, dir, NULL);
reiserfs_write_unlock_xattrs(inode->i_sb);
dput(root);
} else {
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -996,7 +996,7 @@ extern int vfs_mkdir(struct inode *, str
 extern int vfs_mknod(struct inode *, struct dentry *, struct vfsmount *, int, dev_t);
 extern int vfs_symlink(struct inode *, struct dentry *, struct vfsmount *, const char *, int);
 extern int vfs_link(struct dentry *, struct vfsmount *, struct inode *, struct dentry *, struct vfsmount *);
-extern int vfs_rmdir(struct inode *, struct dentry *);
+extern int vfs_rmdir(struct inode *, struct dentry *, struct vfsmount *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
 


[AppArmor 22/45] Pass struct vfsmount to the inode_setxattr LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/xattr.c   |4 ++--
 include/linux/security.h |   40 +---
 security/commoncap.c |4 ++--
 security/dummy.c |9 ++---
 security/selinux/hooks.c |8 ++--
 5 files changed, 41 insertions(+), 24 deletions(-)

--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -80,7 +80,7 @@ vfs_setxattr(struct dentry *dentry, stru
return error;
 
mutex_lock(&inode->i_mutex);
-   error = security_inode_setxattr(dentry, name, value, size, flags);
+   error = security_inode_setxattr(dentry, mnt, name, value, size, flags);
if (error)
goto out;
error = -EOPNOTSUPP;
@@ -88,7 +88,7 @@ vfs_setxattr(struct dentry *dentry, stru
error = inode->i_op->setxattr(dentry, name, value, size, flags);
if (!error) {
fsnotify_xattr(dentry);
-   security_inode_post_setxattr(dentry, name, value,
+   security_inode_post_setxattr(dentry, mnt, name, value,
 size, flags);
}
} else if (!strncmp(name, XATTR_SECURITY_PREFIX,
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -49,7 +49,7 @@ extern void cap_capset_set (struct task_
 extern int cap_bprm_set_security (struct linux_binprm *bprm);
 extern void cap_bprm_apply_creds (struct linux_binprm *bprm, int unsafe);
 extern int cap_bprm_secureexec(struct linux_binprm *bprm);
-extern int cap_inode_setxattr(struct dentry *dentry, char *name, void *value, size_t size, int flags);
+extern int cap_inode_setxattr(struct dentry *dentry, struct vfsmount *mnt, char *name, void *value, size_t size, int flags);
 extern int cap_inode_removexattr(struct dentry *dentry, char *name);
 extern int cap_task_post_setuid (uid_t old_ruid, uid_t old_euid, uid_t old_suid, int flags);
 extern void cap_task_reparent_to_init (struct task_struct *p);
@@ -384,11 +384,11 @@ struct request_sock;
  * inode.
  * @inode_setxattr:
  * Check permission before setting the extended attributes
- * @value identified by @name for @dentry.
+ * @value identified by @name for @dentry and @mnt.
  * Return 0 if permission is granted.
  * @inode_post_setxattr:
  * Update inode security field after successful setxattr operation.
- * @value identified by @name for @dentry.
+ * @value identified by @name for @dentry and @mnt.
  * @inode_getxattr:
  * Check permission before obtaining the extended attributes
  * identified by @name for @dentry.
@@ -1242,9 +1242,11 @@ struct security_operations {
  struct iattr *attr);
int (*inode_getattr) (struct vfsmount *mnt, struct dentry *dentry);
 void (*inode_delete) (struct inode *inode);
-   int (*inode_setxattr) (struct dentry *dentry, char *name, void *value,
-  size_t size, int flags);
-   void (*inode_post_setxattr) (struct dentry *dentry, char *name, void *value,
+   int (*inode_setxattr) (struct dentry *dentry, struct vfsmount *mnt,
+  char *name, void *value, size_t size, int flags);
+   void (*inode_post_setxattr) (struct dentry *dentry,
+struct vfsmount *mnt,
+char *name, void *value,
 size_t size, int flags);
int (*inode_getxattr) (struct dentry *dentry, char *name);
int (*inode_listxattr) (struct dentry *dentry);
@@ -1760,20 +1762,24 @@ static inline void security_inode_delete
security_ops->inode_delete (inode);
 }
 
-static inline int security_inode_setxattr (struct dentry *dentry, char *name,
+static inline int security_inode_setxattr (struct dentry *dentry,
+  struct vfsmount *mnt, char *name,
   void *value, size_t size, int flags)
 {
if (unlikely (IS_PRIVATE (dentry->d_inode)))
return 0;
-   return security_ops->inode_setxattr (dentry, name, value, size, flags);
+   return security_ops->inode_setxattr (dentry, mnt, name, value, size,
+flags);
 }
 
-static inline void security_inode_post_setxattr (struct dentry *dentry, char *name,
-   void *value, size_t size, int flags)
+static inline void security_inode_post_setxattr (struct dentry *dentry,
+struct vfsmount *mnt,
+char *name, void *value,
+size_t size, int flags)
 {
if (unlikely (IS_PRIVATE

[AppArmor 19/45] Add struct vfsmount parameters to vfs_rename()

2007-05-14 Thread jjohansen

The vfsmount will be passed down to the LSM hook so that LSMs can compute
pathnames.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/ecryptfs/inode.c |7 ++-
 fs/namei.c  |   19 ---
 fs/nfsd/vfs.c   |3 ++-
 include/linux/fs.h  |2 +-
 4 files changed, 21 insertions(+), 10 deletions(-)

--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -598,19 +598,24 @@ ecryptfs_rename(struct inode *old_dir, s
 {
int rc;
struct dentry *lower_old_dentry;
+   struct vfsmount *lower_old_mnt;
struct dentry *lower_new_dentry;
+   struct vfsmount *lower_new_mnt;
struct dentry *lower_old_dir_dentry;
struct dentry *lower_new_dir_dentry;
 
lower_old_dentry = ecryptfs_dentry_to_lower(old_dentry);
+   lower_old_mnt = ecryptfs_dentry_to_lower_mnt(old_dentry);
lower_new_dentry = ecryptfs_dentry_to_lower(new_dentry);
+   lower_new_mnt = ecryptfs_dentry_to_lower_mnt(new_dentry);
dget(lower_old_dentry);
dget(lower_new_dentry);
lower_old_dir_dentry = dget_parent(lower_old_dentry);
lower_new_dir_dentry = dget_parent(lower_new_dentry);
lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
rc = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
-   lower_new_dir_dentry->d_inode, lower_new_dentry);
+   lower_old_mnt, lower_new_dir_dentry->d_inode,
+   lower_new_dentry, lower_new_mnt);
if (rc)
goto out_lock;
fsstack_copy_attr_all(new_dir, lower_new_dir_dentry->d_inode, NULL);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2401,7 +2401,8 @@ asmlinkage long sys_link(const char __us
  *locking].
  */
 static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry)
+ struct vfsmount *old_mnt, struct inode *new_dir,
+ struct dentry *new_dentry, struct vfsmount *new_mnt)
 {
int error = 0;
struct inode *target;
@@ -2444,7 +2445,8 @@ static int vfs_rename_dir(struct inode *
 }
 
 static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
-   struct inode *new_dir, struct dentry *new_dentry)
+   struct vfsmount *old_mnt, struct inode *new_dir,
+   struct dentry *new_dentry, struct vfsmount *new_mnt)
 {
struct inode *target;
int error;
@@ -2472,7 +2474,8 @@ static int vfs_rename_other(struct inode
 }
 
 int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
-  struct inode *new_dir, struct dentry *new_dentry)
+   struct vfsmount *old_mnt, struct inode *new_dir,
+   struct dentry *new_dentry, struct vfsmount *new_mnt)
 {
int error;
int is_dir = S_ISDIR(old_dentry->d_inode->i_mode);
@@ -2501,9 +2504,11 @@ int vfs_rename(struct inode *old_dir, st
old_name = fsnotify_oldname_init(old_dentry->d_name.name);
 
if (is_dir)
-   error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry);
+   error = vfs_rename_dir(old_dir, old_dentry, old_mnt,
+  new_dir, new_dentry, new_mnt);
else
-   error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry);
+   error = vfs_rename_other(old_dir, old_dentry, old_mnt,
+new_dir, new_dentry, new_mnt);
if (!error) {
const char *new_name = old_dentry->d_name.name;
fsnotify_move(old_dir, new_dir, old_name, new_name, is_dir,
@@ -2575,8 +2580,8 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;
 
-   error = vfs_rename(old_dir->d_inode, old_dentry,
-  new_dir->d_inode, new_dentry);
+   error = vfs_rename(old_dir->d_inode, old_dentry, oldnd.mnt,
+  new_dir->d_inode, new_dentry, newnd.mnt);
 exit5:
dput(new_dentry);
 exit4:
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1630,7 +1630,8 @@ nfsd_rename(struct svc_rqst *rqstp, stru
host_err = -EPERM;
} else
 #endif
-   host_err = vfs_rename(fdir, odentry, tdir, ndentry);
-   host_err = vfs_rename(fdir, odentry, tdir, ndentry);
+   host_err = vfs_rename(fdir, odentry, ffhp->fh_export->ex_mnt,
+ tdir, ndentry, tfhp->fh_export->ex_mnt);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
host_err = nfsd_sync_dir(tdentry);
if (!host_err)
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -998,7 +998,7 @@ extern int vfs_symlink(struct inode *, s
 extern int vfs_link(struct dentry *, struct vfsmount *, struct inode *, struct dentry *, struct vfsmount *);
 extern int

[RFD Patch 0/4] AppArmor - Don't pass NULL nameidata to vfs_create/lookup/permission IOPs

2007-05-14 Thread jjohansen

lkml-explanatory.txt


[AppArmor 16/45] Call lsm hook before unhashing dentry in vfs_rmdir()

2007-05-14 Thread jjohansen

If we unhash the dentry before calling the security_inode_rmdir hook,
we cannot compute the file's pathname in the hook anymore. AppArmor
needs to know the filename in order to decide whether a file may be
deleted, though.
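The ordering problem can be sketched with a toy model. All of the following are hypothetical assumptions, not kernel or AppArmor APIs: a dentry carries a precomputed path and a `hashed` flag, path lookup fails for unhashed dentries (loosely mimicking a d_path() call with fail_deleted set), and the policy in `toy_rmdir_hook()` simply refuses removals under /etc/.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Toy dentry: a precomputed pathname plus whether it is still hashed. */
struct toy_dentry {
	const char *path;
	bool hashed;
};

/* Unhashed dentries yield no usable name in this model. */
static const char *toy_path(const struct toy_dentry *d)
{
	return d->hashed ? d->path : NULL;
}

/* Hypothetical path-based policy: refuse to remove anything under /etc/. */
static int toy_rmdir_hook(const struct toy_dentry *d)
{
	const char *p = toy_path(d);

	if (!p)
		return -ENOENT;         /* cannot mediate a nameless object */
	if (strncmp(p, "/etc/", 5) == 0)
		return -EACCES;
	return 0;
}

/* Old ordering: unhash first, then consult the hook. */
static int toy_vfs_rmdir_old(struct toy_dentry *d)
{
	d->hashed = false;              /* models dentry_unhash() */
	return toy_rmdir_hook(d);
}

/* Patched ordering: consult the hook while the name is still reachable. */
static int toy_vfs_rmdir_new(struct toy_dentry *d)
{
	int error = toy_rmdir_hook(d);

	if (error)
		return error;
	d->hashed = false;              /* models dentry_unhash() */
	return 0;
}
```

With the old ordering the hook never sees the name, so the policy decision cannot be based on it; with the new ordering the denial comes back before anything is unhashed.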

Signed-off-by: John Johansen [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 fs/namei.c |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2034,6 +2034,10 @@ int vfs_rmdir(struct inode *dir, struct 
if (!dir->i_op || !dir->i_op->rmdir)
return -EPERM;
 
+   error = security_inode_rmdir(dir, dentry, mnt);
+   if (error)
+   return error;
+
DQUOT_INIT(dir);
 
mutex_lock(&dentry->d_inode->i_mutex);
@@ -2041,12 +2045,9 @@ int vfs_rmdir(struct inode *dir, struct 
if (d_mountpoint(dentry))
error = -EBUSY;
else {
-   error = security_inode_rmdir(dir, dentry, mnt);
-   if (!error) {
-   error = dir->i_op->rmdir(dir, dentry);
-   if (!error)
-   dentry->d_inode->i_flags |= S_DEAD;
-   }
+   error = dir->i_op->rmdir(dir, dentry);
+   if (!error)
+   dentry->d_inode->i_flags |= S_DEAD;
}
mutex_unlock(&dentry->d_inode->i_mutex);
if (!error) {


[AppArmor 30/45] Make d_path() consistent across mount operations

2007-05-14 Thread jjohansen

The path that __d_path() computes can become slightly inconsistent when it
races with mount operations: it grabs the vfsmount_lock when traversing mount
points but immediately drops it again, only to re-grab it when it reaches the
next mount point.  The result is that the filename computed is not always
consistent, and the file may never have had that name. (This is unlikely, but
still possible.)

Fix this by grabbing the vfsmount_lock when the first mount point is reached,
and holding onto it until the d_cache lookup is completed.
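The locking shape of the fix (take the lock lazily at the first mount crossing, release it once at the common exit) can be sketched in user space. This is an illustration only: a pthread mutex stands in for the vfsmount_lock spinlock, the walk is reduced to a string of steps ('d' for a d_parent step, 'm' for a mount crossing), and `toy_lock_acquisitions` is instrumentation added for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for vfsmount_lock: a mutex instead of a kernel spinlock. */
static pthread_mutex_t toy_vfsmount_lock = PTHREAD_MUTEX_INITIALIZER;
static int toy_lock_acquisitions;       /* instrumentation for the example */

/*
 * Walk a sequence of upward steps and return the number of mount
 * crossings. The lock is taken at the first crossing and held until the
 * single exit point, so all mount-tree reads in one walk see one state.
 */
static int toy_walk(const char *steps)
{
	bool vfsmount_locked = false;
	int crossings = 0;

	for (const char *s = steps; *s; s++) {
		if (*s != 'm')
			continue;               /* ordinary d_parent step */
		if (!vfsmount_locked) {
			pthread_mutex_lock(&toy_vfsmount_lock);
			toy_lock_acquisitions++;
			vfsmount_locked = true;
		}
		crossings++;
	}
	/* Single exit: drop the lock only if this walk took it. */
	if (vfsmount_locked)
		pthread_mutex_unlock(&toy_vfsmount_lock);
	return crossings;
}
```

A single-threaded test cannot show the race itself; what it can check is that one walk takes the lock at most once, which is the property that keeps every mount-tree read within that walk consistent.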

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 fs/dcache.c |   14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1783,7 +1783,7 @@ static char *__d_path(struct dentry *den
  struct dentry *root, struct vfsmount *rootmnt,
  char *buffer, int buflen, int fail_deleted)
 {
-   int namelen, is_slash;
+   int namelen, is_slash, vfsmount_locked = 0;
 
if (buflen < 2)
return ERR_PTR(-ENAMETOOLONG);
@@ -1806,14 +1806,14 @@ static char *__d_path(struct dentry *den
struct dentry * parent;
 
if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
-   spin_lock(&vfsmount_lock);
-   if (vfsmnt->mnt_parent == vfsmnt) {
-   spin_unlock(&vfsmount_lock);
-   goto global_root;
+   if (!vfsmount_locked) {
+   spin_lock(&vfsmount_lock);
+   vfsmount_locked = 1;
}
+   if (vfsmnt->mnt_parent == vfsmnt)
+   goto global_root;
dentry = vfsmnt->mnt_mountpoint;
vfsmnt = vfsmnt->mnt_parent;
-   spin_unlock(&vfsmount_lock);
continue;
}
parent = dentry->d_parent;
@@ -1832,6 +1832,8 @@ static char *__d_path(struct dentry *den
*--buffer = '/';
 
 out:
+   if (vfsmount_locked)
+   spin_unlock(&vfsmount_lock);
spin_unlock(&dcache_lock);
return buffer;
 


[AppArmor 05/45] Add struct vfsmount parameter to vfs_mkdir()

2007-05-14 Thread jjohansen

The vfsmount will be passed down to the LSM hook so that LSMs can compute
pathnames.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/ecryptfs/inode.c   |5 -
 fs/namei.c|5 +++--
 fs/nfsd/nfs4recover.c |3 ++-
 fs/nfsd/vfs.c |8 +---
 include/linux/fs.h|2 +-
 5 files changed, 15 insertions(+), 8 deletions(-)

--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -509,11 +509,14 @@ static int ecryptfs_mkdir(struct inode *
 {
int rc;
struct dentry *lower_dentry;
+   struct vfsmount *lower_mnt;
struct dentry *lower_dir_dentry;
 
lower_dentry = ecryptfs_dentry_to_lower(dentry);
+   lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
lower_dir_dentry = lock_parent(lower_dentry);
-   rc = vfs_mkdir(lower_dir_dentry->d_inode, lower_dentry, mode);
+   rc = vfs_mkdir(lower_dir_dentry->d_inode, lower_dentry, lower_mnt,
+  mode);
if (rc || !lower_dentry->d_inode)
goto out;
rc = ecryptfs_interpose(lower_dentry, dentry, dir->i_sb, 0);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1934,7 +1934,8 @@ asmlinkage long sys_mknod(const char __u
return sys_mknodat(AT_FDCWD, filename, mode, dev);
 }
 
-int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+int vfs_mkdir(struct inode *dir, struct dentry *dentry, struct vfsmount *mnt,
+ int mode)
 {
int error = may_create(dir, dentry, NULL);
 
@@ -1978,7 +1979,7 @@ asmlinkage long sys_mkdirat(int dfd, con
 
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
-   error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+   error = vfs_mkdir(nd.dentry->d_inode, dentry, nd.mnt, mode);
dput(dentry);
 out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -156,7 +156,8 @@ nfsd4_create_clid_dir(struct nfs4_client
dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n");
goto out_put;
}
-   status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, S_IRWXU);
+   status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, rec_dir.mnt,
+  S_IRWXU);
 out_put:
dput(dentry);
 out_unlock:
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1115,6 +1115,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
int type, dev_t rdev, struct svc_fh *resfhp)
 {
struct dentry   *dentry, *dchild = NULL;
+   struct svc_export *exp;
struct inode*dirp;
__be32  err;
int host_err;
@@ -1131,6 +1132,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
goto out;
 
dentry = fhp->fh_dentry;
+   exp = fhp->fh_export;
dirp = dentry->d_inode;
 
err = nfserr_notdir;
@@ -1147,7 +1149,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
host_err = PTR_ERR(dchild);
if (IS_ERR(dchild))
goto out_nfserr;
-   err = fh_compose(resfhp, fhp->fh_export, dchild, fhp);
+   err = fh_compose(resfhp, exp, dchild, fhp);
if (err)
goto out;
} else {
@@ -1186,7 +1188,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
host_err = vfs_create(dirp, dchild, iap->ia_mode, NULL);
break;
case S_IFDIR:
-   host_err = vfs_mkdir(dirp, dchild, iap->ia_mode);
+   host_err = vfs_mkdir(dirp, dchild, exp->ex_mnt, iap->ia_mode);
break;
case S_IFCHR:
case S_IFBLK:
@@ -1201,7 +1203,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
if (host_err < 0)
goto out_nfserr;
 
-   if (EX_ISSYNC(fhp->fh_export)) {
+   if (EX_ISSYNC(exp)) {
err = nfserrno(nfsd_sync_dir(dentry));
write_inode_now(dchild->d_inode, 1);
}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -992,7 +992,7 @@ extern void unlock_super(struct super_bl
  */
 extern int vfs_permission(struct nameidata *, int);
 extern int vfs_create(struct inode *, struct dentry *, int, struct nameidata *);
-extern int vfs_mkdir(struct inode *, struct dentry *, int);
+extern int vfs_mkdir(struct inode *, struct dentry *, struct vfsmount *, int);
 extern int vfs_mknod(struct inode *, struct dentry *, int, dev_t);
 extern int vfs_symlink(struct inode *, struct dentry *, const char *, int);
 extern int vfs_link(struct dentry *, struct inode *, struct dentry *);


[AppArmor 15/45] Pass struct vfsmount to the inode_rmdir LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/namei.c   |2 +-
 include/linux/security.h |   12 
 security/dummy.c |3 ++-
 security/selinux/hooks.c |3 ++-
 4 files changed, 13 insertions(+), 7 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2041,7 +2041,7 @@ int vfs_rmdir(struct inode *dir, struct 
if (d_mountpoint(dentry))
error = -EBUSY;
else {
-   error = security_inode_rmdir(dir, dentry);
+   error = security_inode_rmdir(dir, dentry, mnt);
if (!error) {
error = dir->i_op->rmdir(dir, dentry);
if (!error)
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -318,6 +318,7 @@ struct request_sock;
  * Check the permission to remove a directory.
  * @dir contains the inode structure of parent of the directory to be removed.
  * @dentry contains the dentry structure of directory to be removed.
+ * @mnt is the vfsmount corresponding to @dentry (may be NULL).
  * Return 0 if permission is granted.
  * @inode_mknod:
  * Check permissions when creating a special file (or a socket or a fifo
@@ -1222,7 +1223,8 @@ struct security_operations {
  struct vfsmount *mnt, const char *old_name);
int (*inode_mkdir) (struct inode *dir, struct dentry *dentry,
struct vfsmount *mnt, int mode);
-   int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
+   int (*inode_rmdir) (struct inode *dir, struct dentry *dentry,
+   struct vfsmount *mnt);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
struct vfsmount *mnt, int mode, dev_t dev);
int (*inode_rename) (struct inode *old_dir, struct dentry *old_dentry,
@@ -1671,11 +1673,12 @@ static inline int security_inode_mkdir (
 }
 
 static inline int security_inode_rmdir (struct inode *dir,
-   struct dentry *dentry)
+   struct dentry *dentry,
+   struct vfsmount *mnt)
 {
if (unlikely (IS_PRIVATE (dentry->d_inode)))
return 0;
-   return security_ops->inode_rmdir (dir, dentry);
+   return security_ops->inode_rmdir (dir, dentry, mnt);
 }
 
 static inline int security_inode_mknod (struct inode *dir,
@@ -2396,7 +2399,8 @@ static inline int security_inode_mkdir (
 }
 
 static inline int security_inode_rmdir (struct inode *dir,
-   struct dentry *dentry)
+   struct dentry *dentry,
+   struct vfsmount *mnt)
 {
return 0;
 }
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -295,7 +295,8 @@ static int dummy_inode_mkdir (struct ino
return 0;
 }
 
-static int dummy_inode_rmdir (struct inode *inode, struct dentry *dentry)
+static int dummy_inode_rmdir (struct inode *inode, struct dentry *dentry,
+ struct vfsmount *mnt)
 {
return 0;
 }
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2219,7 +2219,8 @@ static int selinux_inode_mkdir(struct in
return may_create(dir, dentry, SECCLASS_DIR);
 }
 
-static int selinux_inode_rmdir(struct inode *dir, struct dentry *dentry)
+static int selinux_inode_rmdir(struct inode *dir, struct dentry *dentry,
+  struct vfsmount *mnt)
 {
return may_link(dir, dentry, MAY_RMDIR);
 }

-- 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[AppArmor 01/45] Pass struct vfsmount to the inode_create LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/namei.c   |2 +-
 include/linux/security.h |9 ++---
 security/dummy.c |2 +-
 security/selinux/hooks.c |3 ++-
 4 files changed, 10 insertions(+), 6 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1521,7 +1521,7 @@ int vfs_create(struct inode *dir, struct
return -EACCES; /* shouldn't it be ENOSYS? */
mode &= S_IALLUGO;
mode |= S_IFREG;
-   error = security_inode_create(dir, dentry, mode);
+   error = security_inode_create(dir, dentry, nd ? nd->mnt : NULL, mode);
if (error)
return error;
DQUOT_INIT(dir);
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -283,6 +283,7 @@ struct request_sock;
  * Check permission to create a regular file.
  * @dir contains inode structure of the parent of the new file.
  * @dentry contains the dentry structure for the file to be created.
+ * @mnt is the vfsmount corresponding to @dentry (may be NULL).
  * @mode contains the file mode of the file to be created.
  * Return 0 if permission is granted.
  * @inode_link:
@@ -1204,8 +1205,8 @@ struct security_operations {
void (*inode_free_security) (struct inode *inode);
int (*inode_init_security) (struct inode *inode, struct inode *dir,
char **name, void **value, size_t *len);
-   int (*inode_create) (struct inode *dir,
-struct dentry *dentry, int mode);
+   int (*inode_create) (struct inode *dir, struct dentry *dentry,
+struct vfsmount *mnt, int mode);
int (*inode_link) (struct dentry *old_dentry,
   struct inode *dir, struct dentry *new_dentry);
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
@@ -1611,11 +1612,12 @@ static inline int security_inode_init_se

 static inline int security_inode_create (struct inode *dir,
 struct dentry *dentry,
+struct vfsmount *mnt,
 int mode)
 {
if (unlikely (IS_PRIVATE (dir)))
return 0;
-   return security_ops->inode_create (dir, dentry, mode);
+   return security_ops->inode_create (dir, dentry, mnt, mode);
 }
 
 static inline int security_inode_link (struct dentry *old_dentry,
@@ -2338,6 +2340,7 @@ static inline int security_inode_init_se

 static inline int security_inode_create (struct inode *dir,
 struct dentry *dentry,
+struct vfsmount *mnt,
 int mode)
 {
return 0;
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -265,7 +265,7 @@ static int dummy_inode_init_security (st
 }
 
 static int dummy_inode_create (struct inode *inode, struct dentry *dentry,
-  int mask)
+  struct vfsmount *mnt, int mask)
 {
return 0;
 }
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2176,7 +2176,8 @@ static int selinux_inode_init_security(s
return 0;
 }
 
-static int selinux_inode_create(struct inode *dir, struct dentry *dentry, int mask)
+static int selinux_inode_create(struct inode *dir, struct dentry *dentry,
+struct vfsmount *mnt, int mask)
 {
return may_create(dir, dentry, SECCLASS_FILE);
 }


[AppArmor 33/45] Pass struct file down the inode_*xattr security LSM hooks

2007-05-14 Thread jjohansen

This allows LSMs to also distinguish between file descriptor and path
access for the xattr operations. (The other relevant operations are
covered by the setattr hook.)
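As a rough illustration of what the extra argument gives a security module, the userspace sketch below branches on whether `file` is NULL. Only the NULL-versus-non-NULL convention comes from this patch (sys_setxattr() and sys_lsetxattr() pass NULL, sys_fsetxattr() passes the open file); the `struct file` stand-in, its `opened_for_write` field, the `example_inode_setxattr()` name and the "/home/" rule are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct file: only the one
 * field this example needs. */
struct file {
	int opened_for_write;
};

/*
 * Hypothetical policy for a setxattr hook that receives @file.
 * file == NULL: access by pathname (setxattr/lsetxattr).
 * file != NULL: access through an open descriptor (fsetxattr).
 */
static int example_inode_setxattr(const char *pathname,
				  const struct file *file)
{
	/* Descriptor access: judge the file that was already opened. */
	if (file != NULL)
		return file->opened_for_write ? 0 : -EACCES;

	/* Path access: a pathname-based module would match the name. */
	assert(pathname != NULL);
	return strncmp(pathname, "/home/", 6) == 0 ? 0 : -EACCES;
}
```

The same pathname can thus get different answers depending on how the xattr call reached the hook, which is the distinction the patch makes possible.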

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/xattr.c   |   58 ---
 include/linux/security.h |   53 +-
 include/linux/xattr.h|8 +++---
 security/commoncap.c |4 +--
 security/dummy.c |   10 
 security/selinux/hooks.c |   10 
 6 files changed, 80 insertions(+), 63 deletions(-)

--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -70,7 +70,7 @@ xattr_permission(struct inode *inode, co
 
 int
 vfs_setxattr(struct dentry *dentry, struct vfsmount *mnt, char *name,
-void *value, size_t size, int flags)
+void *value, size_t size, int flags, struct file *file)
 {
struct inode *inode = dentry->d_inode;
int error;
@@ -80,7 +80,7 @@ vfs_setxattr(struct dentry *dentry, stru
return error;
 
mutex_lock(&inode->i_mutex);
-   error = security_inode_setxattr(dentry, mnt, name, value, size, flags);
+   error = security_inode_setxattr(dentry, mnt, name, value, size, flags, file);
if (error)
goto out;
error = -EOPNOTSUPP;
@@ -107,7 +107,7 @@ EXPORT_SYMBOL_GPL(vfs_setxattr);
 
 ssize_t
 vfs_getxattr(struct dentry *dentry, struct vfsmount *mnt, char *name,
-void *value, size_t size)
+void *value, size_t size, struct file *file)
 {
struct inode *inode = dentry->d_inode;
int error;
@@ -116,7 +116,7 @@ vfs_getxattr(struct dentry *dentry, stru
if (error)
return error;
 
-   error = security_inode_getxattr(dentry, mnt, name);
+   error = security_inode_getxattr(dentry, mnt, name, file);
if (error)
return error;
 
@@ -144,12 +144,12 @@ EXPORT_SYMBOL_GPL(vfs_getxattr);
 
 ssize_t
 vfs_listxattr(struct dentry *dentry, struct vfsmount *mnt, char *list,
- size_t size)
+ size_t size, struct file *file)
 {
struct inode *inode = dentry->d_inode;
ssize_t error;
 
-   error = security_inode_listxattr(dentry, mnt);
+   error = security_inode_listxattr(dentry, mnt, file);
if (error)
return error;
error = -EOPNOTSUPP;
@@ -165,7 +165,8 @@ vfs_listxattr(struct dentry *dentry, str
 EXPORT_SYMBOL_GPL(vfs_listxattr);
 
 int
-vfs_removexattr(struct dentry *dentry, struct vfsmount *mnt, char *name)
+vfs_removexattr(struct dentry *dentry, struct vfsmount *mnt, char *name,
+   struct file *file)
 {
struct inode *inode = dentry->d_inode;
int error;
@@ -177,7 +178,7 @@ vfs_removexattr(struct dentry *dentry, s
if (error)
return error;
 
-   error = security_inode_removexattr(dentry, mnt, name);
+   error = security_inode_removexattr(dentry, mnt, name, file);
if (error)
return error;
 
@@ -197,7 +198,7 @@ EXPORT_SYMBOL_GPL(vfs_removexattr);
  */
 static long
 setxattr(struct dentry *dentry, struct vfsmount *mnt, char __user *name,
-void __user *value, size_t size, int flags)
+void __user *value, size_t size, int flags, struct file *file)
 {
int error;
void *kvalue = NULL;
@@ -224,7 +225,7 @@ setxattr(struct dentry *dentry, struct v
}
}
 
-   error = vfs_setxattr(dentry, mnt, kname, kvalue, size, flags);
+   error = vfs_setxattr(dentry, mnt, kname, kvalue, size, flags, file);
kfree(kvalue);
return error;
 }
@@ -239,7 +240,7 @@ sys_setxattr(char __user *path, char __u
error = user_path_walk(path, &nd);
if (error)
return error;
-   error = setxattr(nd.dentry, nd.mnt, name, value, size, flags);
+   error = setxattr(nd.dentry, nd.mnt, name, value, size, flags, NULL);
path_release(&nd);
return error;
 }
@@ -254,7 +255,7 @@ sys_lsetxattr(char __user *path, char __
error = user_path_walk_link(path, &nd);
if (error)
return error;
-   error = setxattr(nd.dentry, nd.mnt, name, value, size, flags);
+   error = setxattr(nd.dentry, nd.mnt, name, value, size, flags, NULL);
path_release(&nd);
return error;
 }
@@ -272,7 +273,7 @@ sys_fsetxattr(int fd, char __user *name,
return error;
dentry = f->f_path.dentry;
audit_inode(NULL, dentry->d_inode);
-   error = setxattr(dentry, f->f_vfsmnt, name, value, size, flags);
+   error = setxattr(dentry, f->f_vfsmnt, name, value, size, flags, f);
fput(f);
return error;
 }
@@ -282,7 +283,7 @@ sys_fsetxattr(int fd, char __user *name,
  */
 static ssize_t
 getxattr(struct dentry *dentry, struct vfsmount *mnt, char __user *name,
-void __user *value, size_t

[AppArmor 00/45] AppArmor security module overview

2007-05-14 Thread jjohansen

lkml-explanatory.txt


[AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching

2007-05-14 Thread jjohansen

Pathname matching, transition table loading, profile loading and
manipulation.
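The match.c hunks below unpack and validate flex-style transition tables (ACCEPT, BASE, DEF, NXT, CHK and the equivalence-class table EC); the matcher itself is not part of this excerpt. As a hedged illustration of how such tables are usually walked, here is a self-contained userspace sketch. The tiny DFA for `ab*c`, start state 1, dead state 0 and the uncompressed packing (`base[s] = s * NCLASSES`) are assumptions chosen for readability, not AppArmor's generated tables.

```c
#include <assert.h>

/*
 * Hypothetical hand-built tables for the pattern  a b* c
 * Equivalence classes: 0 = any other byte, 1 = 'a', 2 = 'b', 3 = 'c'.
 * States: 0 = dead, 1 = start, 2 = seen "ab*", 3 = accepting.
 */
#define NSTATES  4
#define NCLASSES 4
#define NTRANS   (NSTATES * NCLASSES)
#define NOCHECK  0xFFFFu	/* marks an unused next/check slot */

static const unsigned char ec[256] = { ['a'] = 1, ['b'] = 2, ['c'] = 3 };

/* Uncompressed packing: state s owns slots s*NCLASSES .. s*NCLASSES+3. */
static const unsigned int base[NSTATES] = { 0, 4, 8, 12 };
static const unsigned short def[NSTATES] = { 0, 0, 0, 0 };
static const unsigned short accept_tbl[NSTATES] = { 0, 0, 0, 1 };

static const unsigned short next[NTRANS] = {
	0, 0, 0, 0,   0, 2, 0, 0,   0, 0, 2, 3,   0, 0, 0, 0,
};
static const unsigned short check[NTRANS] = {
	NOCHECK, NOCHECK, NOCHECK, NOCHECK,
	NOCHECK, 1,       NOCHECK, NOCHECK,
	NOCHECK, NOCHECK, 2,       2,
	NOCHECK, NOCHECK, NOCHECK, NOCHECK,
};

/* Bounds check in the spirit of verify_dfa(): every state's slot range
 * and every stored target must lie inside the tables. 0 means sane. */
static int dfa_verify(void)
{
	for (unsigned int s = 0; s < NSTATES; s++)
		if (def[s] >= NSTATES || base[s] + NCLASSES > NTRANS)
			return -1;
	for (unsigned int i = 0; i < NTRANS; i++)
		if (check[i] != NOCHECK &&
		    (check[i] >= NSTATES || next[i] >= NSTATES))
			return -1;
	return 0;
}

/* Walk the tables one byte at a time and return the accept value of
 * the final state (non-zero means the whole string matched). */
static unsigned int dfa_match(const char *str)
{
	unsigned int state = 1;

	while (*str) {
		unsigned int pos = base[state] + ec[(unsigned char)*str++];

		assert(pos < NTRANS);	/* holds once dfa_verify() passed */
		if (check[pos] == state)
			state = next[pos];	/* explicit transition */
		else
			state = def[state];	/* fall back to default */
	}
	return accept_tbl[state];
}
```

In real generated tables the rows overlap (that is what CHK is for), so a slot owned by another state fails the `check[pos] == state` test and the walk falls back to DEF.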

Signed-off-by: John Johansen [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 security/apparmor/match.c|  232 
 security/apparmor/match.h|   83 
 security/apparmor/module_interface.c |  643 +++
 3 files changed, 958 insertions(+)

--- /dev/null
+++ b/security/apparmor/match.c
@@ -0,0 +1,232 @@
+/*
+ * Copyright (C) 2007 Novell/SUSE
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ * Regular expression transition table matching
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include "match.h"
+
+static struct table_header *unpack_table(void *blob, size_t bsize)
+{
+   struct table_header *table = NULL;
+   struct table_header th;
+   size_t tsize;
+
+   if (bsize < sizeof(struct table_header))
+   goto out;
+
+   th.td_id = be16_to_cpu(*(u16 *) (blob));
+   th.td_flags = be16_to_cpu(*(u16 *) (blob + 2));
+   th.td_lolen = be32_to_cpu(*(u32 *) (blob + 8));
+   blob += sizeof(struct table_header);
+
+   if (!(th.td_flags == YYTD_DATA16 || th.td_flags == YYTD_DATA32 ||
+   th.td_flags == YYTD_DATA8))
+   goto out;
+
+   tsize = table_size(th.td_lolen, th.td_flags);
+   if (bsize < tsize)
+   goto out;
+
+   table = kmalloc(tsize, GFP_KERNEL);
+   if (table) {
+   *table = th;
+   if (th.td_flags == YYTD_DATA8)
+   UNPACK_ARRAY(table->td_data, blob, th.td_lolen,
+u8, byte_to_byte);
+   else if (th.td_flags == YYTD_DATA16)
+   UNPACK_ARRAY(table->td_data, blob, th.td_lolen,
+u16, be16_to_cpu);
+   else
+   UNPACK_ARRAY(table->td_data, blob, th.td_lolen,
+u32, be32_to_cpu);
+   }
+
+out:
+   return table;
+}
+
+int unpack_dfa(struct aa_dfa *dfa, void *blob, size_t size)
+{
+   int hsize, i;
+   int error = -ENOMEM;
+
+   /* get dfa table set header */
+   if (size < sizeof(struct table_set_header))
+   goto fail;
+
+   if (ntohl(*(u32 *)blob) != YYTH_MAGIC)
+   goto fail;
+
+   hsize = ntohl(*(u32 *)(blob + 4));
+   if (size < hsize)
+   goto fail;
+
+   blob += hsize;
+   size -= hsize;
+
+   error = -EPROTO;
+   while (size > 0) {
+   struct table_header *table;
+   table = unpack_table(blob, size);
+   if (!table)
+   goto fail;
+
+   switch(table->td_id) {
+   case YYTD_ID_ACCEPT:
+   case YYTD_ID_BASE:
+   dfa->tables[table->td_id - 1] = table;
+   if (table->td_flags != YYTD_DATA32)
+   goto fail;
+   break;
+   case YYTD_ID_DEF:
+   case YYTD_ID_NXT:
+   case YYTD_ID_CHK:
+   dfa->tables[table->td_id - 1] = table;
+   if (table->td_flags != YYTD_DATA16)
+   goto fail;
+   break;
+   case YYTD_ID_EC:
+   dfa->tables[table->td_id - 1] = table;
+   if (table->td_flags != YYTD_DATA8)
+   goto fail;
+   break;
+   default:
+   kfree(table);
+   goto fail;
+   }
+
+   blob += table_size(table->td_lolen, table->td_flags);
+   size -= table_size(table->td_lolen, table->td_flags);
+   }
+
+   return 0;
+
+fail:
+   for (i = 0; i < ARRAY_SIZE(dfa->tables); i++) {
+   if (dfa->tables[i]) {
+   kfree(dfa->tables[i]);
+   dfa->tables[i] = NULL;
+   }
+   }
+   return error;
+}
+
+/**
+ * verify_dfa - verify that all the transitions and states in the dfa tables
+ *  are in bounds.
+ * @dfa: dfa to test
+ *
+ * assumes dfa has gone through the verification done by unpacking
+ */
+int verify_dfa(struct aa_dfa *dfa)
+{
+   size_t i, state_count, trans_count;
+   int error = -EPROTO;
+
+   /* check that required tables exist */
+   if (!(dfa->tables[YYTD_ID_ACCEPT -1 ] &&
+ dfa->tables[YYTD_ID_DEF - 1] &&
+ dfa->tables[YYTD_ID_BASE - 1] &&
+ dfa->tables[YYTD_ID_NXT - 1] &&
+ dfa->tables[YYTD_ID_CHK - 1]))
+   goto out;
+
+   /* accept.size == default.size == base.size */
+   state_count =

[AppArmor 29/45] Fix __d_path() for lazy unmounts and make it unambiguous

2007-05-14 Thread jjohansen

First, when __d_path() hits a lazily unmounted mount point, it tries to prepend
the name of the lazily unmounted dentry to the path name.  It gets this wrong,
and also overwrites the slash that separates the name from the following
pathname component. This patch fixes that; if a process was in directory
/foo/bar and /foo got lazily unmounted, the old result was ``foobar'' (note the
missing slash), while the new result with this patch is ``foo/bar''.

Second, it isn't always possible to tell from the __d_path() result whether the
specified root and rootmnt (i.e., the chroot) was reached.  We need an
unambiguous result for AppArmor at least though, so we make sure that paths
will only start with a slash if the path leads all the way up to the root.

We also add a @fail_deleted argument, which allows us to get rid of some of the
mess in sys_getcwd().

This patch leaves getcwd() and d_path() as they were before for everything
except for bind-mounted directories; for them, it reports ``/foo/bar'' instead
of ``foobar'' in the example described above.
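The buffer discipline described above (write the terminating NUL at the end of the buffer, prepend "/name" for each component while walking towards the root, and only produce a leading slash when the requested root was actually reached) can be tried outside the kernel with the sketch below. The `struct node` stand-in and `build_path()` are hypothetical; the sketch leaves out mount crossing, locking, the " (deleted)" suffix and the special handling the real code has for a global root named "/" and for pseudo filesystems.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dentry: a name and a parent pointer.
 * A node whose parent is NULL is the top of a (possibly detached) tree. */
struct node {
	const char *name;
	const struct node *parent;
};

/*
 * Build the path of @n relative to @root by filling @buf from the end.
 * Returns a pointer into @buf, or NULL if the result does not fit.
 * If the walk runs out of parents before reaching @root (for example in
 * a lazily unmounted tree), the top node's name is prepended without a
 * slash, so the result is relative and cannot be mistaken for a path
 * below @root.
 */
static const char *build_path(const struct node *n, const struct node *root,
			      char *buf, size_t buflen)
{
	char *p;
	size_t len;

	assert(n != NULL && buf != NULL);
	if (buflen < 2)
		return NULL;
	p = buf + --buflen;	/* invariant: p - buf == buflen */
	*p = '\0';

	while (n != root) {
		len = strlen(n->name);
		if (n->parent == NULL) {
			/* Disconnected: "foo" + "/bar" gives "foo/bar". */
			if (buflen < len)
				return NULL;
			p -= len;
			memcpy(p, n->name, len);
			return p;
		}
		if (buflen < len + 1)
			return NULL;
		buflen -= len + 1;
		p -= len;
		memcpy(p, n->name, len);
		*--p = '/';
		n = n->parent;
	}
	/* @n was @root itself: buflen >= 1 still leaves room for "/". */
	if (*p != '/')
		*--p = '/';
	return p;
}
```

With /foo lazily unmounted, the walk from bar runs out of parents at foo, so the sketch yields the relative `foo/bar` rather than `/foo/bar`, mirroring the unambiguous result the patch aims for.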

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Acked-by: Alan Cox [EMAIL PROTECTED]

---
 fs/dcache.c |  169 ++--
 1 file changed, 98 insertions(+), 71 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1761,52 +1761,51 @@ shouldnt_be_hashed:
 }
 
 /**
- * d_path - return the path of a dentry
+ * __d_path - return the path of a dentry
  * @dentry: dentry to report
  * @vfsmnt: vfsmnt to which the dentry belongs
  * @root: root dentry
  * @rootmnt: vfsmnt to which the root dentry belongs
  * @buffer: buffer to return value in
  * @buflen: buffer length
+ * @fail_deleted: what to return for deleted files
  *
- * Convert a dentry into an ASCII path name. If the entry has been deleted
+ * Convert a dentry into an ASCII path name. If the entry has been deleted,
+ * then if @fail_deleted is true, ERR_PTR(-ENOENT) is returned. Otherwise,
  * the string " (deleted)" is appended. Note that this is ambiguous.
  *
- * Returns the buffer or an error code if the path was too long.
+ * If @dentry is not connected to @root, the path returned will be relative
+ * (i.e., it will not start with a slash).
  *
- * buflen should be positive. Caller holds the dcache_lock.
+ * Returns the buffer or an error code.
  */
-static char * __d_path( struct dentry *dentry, struct vfsmount *vfsmnt,
-   struct dentry *root, struct vfsmount *rootmnt,
-   char *buffer, int buflen)
-{
-   char * end = buffer+buflen;
-   char * retval;
-   int namelen;
+static char *__d_path(struct dentry *dentry, struct vfsmount *vfsmnt,
+ struct dentry *root, struct vfsmount *rootmnt,
+ char *buffer, int buflen, int fail_deleted)
+{
+   int namelen, is_slash;
+
+   if (buflen < 2)
+   return ERR_PTR(-ENAMETOOLONG);
+   buffer += --buflen;
+   *buffer = '\0';
 
-   *--end = '\0';
-   buflen--;
+   spin_lock(&dcache_lock);
if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
-   buflen -= 10;
-   end -= 10;
-   if (buflen < 0)
+   if (fail_deleted) {
+   buffer = ERR_PTR(-ENOENT);
+   goto out;
+   }
+   if (buflen < 10)
goto Elong;
-   memcpy(end, " (deleted)", 10);
+   buflen -= 10;
+   buffer -= 10;
+   memcpy(buffer, " (deleted)", 10);
}
-
-   if (buflen < 1)
-   goto Elong;
-   /* Get '/' right */
-   retval = end-1;
-   *retval = '/';
-
-   for (;;) {
+   while (dentry != root || vfsmnt != rootmnt) {
struct dentry * parent;
 
-   if (dentry == root && vfsmnt == rootmnt)
-   break;
if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
-   /* Global root? */
spin_lock(&vfsmount_lock);
if (vfsmnt->mnt_parent == vfsmnt) {
spin_unlock(&vfsmount_lock);
@@ -1820,33 +1819,72 @@ static char * __d_path( struct dentry *d
parent = dentry->d_parent;
prefetch(parent);
namelen = dentry->d_name.len;
-   buflen -= namelen + 1;
-   if (buflen < 0)
+   if (buflen < namelen + 1)
goto Elong;
-   end -= namelen;
-   memcpy(end, dentry->d_name.name, namelen);
-   *--end = '/';
-   retval = end;
+   buflen -= namelen + 1;
+   buffer -= namelen;
+   memcpy(buffer, dentry->d_name.name, namelen);
+   *--buffer = '/';
dentry = parent;
}
+   /* Get '/' right. */
+   if (*buffer != '/')
+   *--buffer = '/';
 
-   return retval;
+out:
+

[AppArmor 02/45] Pass struct path down to remove_suid and children

2007-05-14 Thread jjohansen

Required by a later patch that adds a struct vfsmount parameter to
notify_change().

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/ntfs/file.c |2 +-
 fs/reiserfs/file.c |2 +-
 fs/splice.c|4 ++--
 fs/xfs/linux-2.6/xfs_lrw.c |2 +-
 include/linux/fs.h |4 ++--
 mm/filemap.c   |   12 ++--
 mm/filemap_xip.c   |2 +-
 mm/shmem.c |2 +-
 8 files changed, 15 insertions(+), 15 deletions(-)

--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2121,7 +2121,7 @@ static ssize_t ntfs_file_aio_write_noloc
goto out;
if (!count)
goto out;
-   err = remove_suid(file->f_path.dentry);
+   err = remove_suid(&file->f_path);
if (err)
goto out;
file_update_time(file);
--- a/fs/reiserfs/file.c
+++ b/fs/reiserfs/file.c
@@ -1335,7 +1335,7 @@ static ssize_t reiserfs_file_write(struc
if (count == 0)
goto out;
 
-   res = remove_suid(file->f_path.dentry);
+   res = remove_suid(&file->f_path);
if (res)
goto out;
 
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -796,7 +796,7 @@ generic_file_splice_write_nolock(struct 
ssize_t ret;
int err;
 
-   err = remove_suid(out->f_path.dentry);
+   err = remove_suid(&out->f_path);
if (unlikely(err))
return err;
 
@@ -845,7 +845,7 @@ generic_file_splice_write(struct pipe_in
err = should_remove_suid(out->f_path.dentry);
if (unlikely(err)) {
mutex_lock(&inode->i_mutex);
-   err = __remove_suid(out->f_path.dentry, err);
+   err = __remove_suid(&out->f_path, err);
mutex_unlock(&inode->i_mutex);
if (err)
return err;
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -798,7 +798,7 @@ start:
 !capable(CAP_FSETID)) {
error = xfs_write_clear_setuid(xip);
if (likely(!error))
-   error = -remove_suid(file->f_path.dentry);
+   error = -remove_suid(&file->f_path);
if (unlikely(error)) {
goto out_unlock_internal;
}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1697,9 +1697,9 @@ extern void __iget(struct inode * inode)
 extern void clear_inode(struct inode *);
 extern void destroy_inode(struct inode *);
 extern struct inode *new_inode(struct super_block *);
-extern int __remove_suid(struct dentry *, int);
+extern int __remove_suid(struct path *, int);
 extern int should_remove_suid(struct dentry *);
-extern int remove_suid(struct dentry *);
+extern int remove_suid(struct path *);
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
 extern void remove_inode_hash(struct inode *);
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1905,20 +1905,20 @@ int should_remove_suid(struct dentry *de
 }
 EXPORT_SYMBOL(should_remove_suid);
 
-int __remove_suid(struct dentry *dentry, int kill)
+int __remove_suid(struct path *path, int kill)
 {
struct iattr newattrs;
 
newattrs.ia_valid = ATTR_FORCE | kill;
-   return notify_change(dentry, &newattrs);
+   return notify_change(path->dentry, &newattrs);
 }
 
-int remove_suid(struct dentry *dentry)
+int remove_suid(struct path *path)
 {
-   int kill = should_remove_suid(dentry);
+   int kill = should_remove_suid(path->dentry);
 
if (unlikely(kill))
-   return __remove_suid(dentry, kill);
+   return __remove_suid(path, kill);
 
return 0;
 }
@@ -2269,7 +2269,7 @@ __generic_file_aio_write_nolock(struct k
if (count == 0)
goto out;
 
-   err = remove_suid(file->f_path.dentry);
+   err = remove_suid(&file->f_path);
if (err)
goto out;
 
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -405,7 +405,7 @@ xip_file_write(struct file *filp, const 
if (count == 0)
goto out_backing;
 
-   ret = remove_suid(filp->f_path.dentry);
+   ret = remove_suid(&filp->f_path);
if (ret)
goto out_backing;
 
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1516,7 +1516,7 @@ shmem_file_write(struct file *file, cons
if (err || !count)
goto out;
 
-   err = remove_suid(file->f_path.dentry);
+   err = remove_suid(&file->f_path);
if (err)
goto out;
 


[RFD Patch 2/4] Never pass a NULL nameidata to vfs_create()

2007-05-14 Thread jjohansen

Create a nameidata2 struct in nfsd and mqueue so that vfs_create does
not need to conditionally pass the vfsmnt.

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 fs/namei.c|2 +-
 fs/nfsd/vfs.c |   42 +-
 ipc/mqueue.c  |7 ++-
 3 files changed, 32 insertions(+), 19 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1532,7 +1532,7 @@ int vfs_create(struct inode *dir, struct
return -EACCES; /* shouldn't it be ENOSYS? */
mode &= S_IALLUGO;
mode |= S_IFREG;
-   error = security_inode_create(dir, dentry, nd ? nd->mnt : NULL, mode);
+   error = security_inode_create(dir, dentry, nd->mnt, mode);
if (error)
return error;
DQUOT_INIT(dir);
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1121,7 +1121,8 @@ nfsd_create(struct svc_rqst *rqstp, stru
char *fname, int flen, struct iattr *iap,
int type, dev_t rdev, struct svc_fh *resfhp)
 {
-   struct dentry   *dentry, *dchild = NULL;
+   struct nameidata2 nd;
+   struct dentry   *dchild = NULL;
struct svc_export *exp;
struct inode*dirp;
__be32  err;
@@ -1138,9 +1139,11 @@ nfsd_create(struct svc_rqst *rqstp, stru
if (err)
goto out;
 
-   dentry = fhp->fh_dentry;
+   nd.dentry = fhp->fh_dentry;
exp = fhp->fh_export;
-   dirp = dentry->d_inode;
+   nd.mnt = exp->ex_mnt;
+   nd.flags = 0;
+   dirp = nd.dentry->d_inode;
 
err = nfserr_notdir;
if(!dirp->i_op || !dirp->i_op->lookup)
@@ -1152,7 +1155,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
if (!resfhp->fh_dentry) {
/* called from nfsd_proc_mkdir, or possibly nfsd3_proc_create */
fh_lock_nested(fhp, I_MUTEX_PARENT);
-   dchild = lookup_one_len(fname, dentry, flen);
+   dchild = lookup_one_len(fname, nd.dentry, flen);
host_err = PTR_ERR(dchild);
if (IS_ERR(dchild))
goto out_nfserr;
@@ -1166,8 +1169,8 @@ nfsd_create(struct svc_rqst *rqstp, stru
/* not actually possible */
printk(KERN_ERR
nfsd_create: parent %s/%s not locked!\n,
-   dentry-d_parent-d_name.name,
-   dentry-d_name.name);
+   nd.dentry-d_parent-d_name.name,
+   nd.dentry-d_name.name);
err = nfserr_io;
goto out;
}
@@ -1178,7 +1181,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
err = nfserr_exist;
if (dchild->d_inode) {
dprintk("nfsd_create: dentry %s/%s not negative!\n",
-   dentry->d_name.name, dchild->d_name.name);
+   nd.dentry->d_name.name, dchild->d_name.name);
goto out; 
}
 
@@ -1192,7 +1195,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
err = 0;
switch (type) {
case S_IFREG:
-   host_err = vfs_create(dirp, dchild, iap->ia_mode, NULL);
+   host_err = vfs_create(nd.dentry->d_inode, dchild, iap->ia_mode, &nd);
break;
case S_IFDIR:
host_err = vfs_mkdir(dirp, dchild, exp->ex_mnt, iap->ia_mode);
@@ -1212,7 +1215,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
goto out_nfserr;
 
if (EX_ISSYNC(exp)) {
-   err = nfserrno(nfsd_sync_dir(dentry));
+   err = nfserrno(nfsd_sync_dir(nd.dentry));
write_inode_now(dchild->d_inode, 1);
}
 
@@ -1252,7 +1255,9 @@ nfsd_create_v3(struct svc_rqst *rqstp, s
struct svc_fh *resfhp, int createmode, u32 *verifier,
int *truncp, int *created)
 {
-   struct dentry   *dentry, *dchild = NULL;
+   struct nameidata2 nd;
+   struct dentry   *dchild = NULL;
+   struct svc_export *exp;
struct inode*dirp;
__be32  err;
int host_err;
@@ -1270,8 +1275,11 @@ nfsd_create_v3(struct svc_rqst *rqstp, s
if (err)
goto out;
 
-   dentry = fhp->fh_dentry;
-   dirp = dentry->d_inode;
+   nd.dentry = fhp->fh_dentry;
+   exp = fhp->fh_export;
+   nd.mnt = exp->ex_mnt;
+   nd.flags = 0;
+   dirp = nd.dentry->d_inode;
 
/* Get all the sanity checks out of the way before
 * we lock the parent. */
@@ -1283,12 +1291,12 @@ nfsd_create_v3(struct svc_rqst *rqstp, s
/*
 * Compose the response file handle.
 */
-   dchild = lookup_one_len(fname, dentry, flen);
+   dchild = lookup_one_len(fname, nd.dentry, flen);
host_err = PTR_ERR(dchild);
if (IS_ERR(dchild))
goto out_nfserr;
 
-   err = fh_compose(resfhp, fhp->fh_export, dchild, fhp);
+   err = fh_compose(resfhp, exp, dchild,

[AppArmor 34/45] Factor out sysctl pathname code

2007-05-14 Thread jjohansen

Convert the selinux sysctl pathname computation code into a standalone
function.
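To see what the factored-out helper produces, here is a userspace port of the sysctl_pathname() added by this patch. The logic follows the kernel hunk line for line; `struct ctl_table_stub` and the `sysctl_pathname_stub()` name are hypothetical stand-ins so the code compiles outside the kernel, and the ip_forward chain in the usage is only an example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ctl_table: only the fields the walk uses. */
struct ctl_table_stub {
	const char *procname;
	const struct ctl_table_stub *parent;
};

/*
 * Fill @buffer from the end with "/name" for each table up to the top
 * of the hierarchy, then prepend "/sys". Returns a pointer into
 * @buffer, or NULL if @buflen is too small.
 */
static char *sysctl_pathname_stub(const struct ctl_table_stub *table,
				  char *buffer, int buflen)
{
	assert(buffer != NULL);
	if (buflen < 1)
		return NULL;
	buffer += --buflen;
	*buffer = '\0';

	while (table) {
		int namelen = (int)strlen(table->procname);

		if (buflen < namelen + 1)
			return NULL;
		buflen -= namelen + 1;
		buffer -= namelen;
		memcpy(buffer, table->procname, namelen);
		*--buffer = '/';
		table = table->parent;
	}
	if (buflen < 4)
		return NULL;
	buffer -= 4;
	memcpy(buffer, "/sys", 4);
	return buffer;
}
```

For a leaf `ip_forward` under `ipv4` under `net`, the result is "/sys/net/ipv4/ip_forward": a path relative to the root of the proc filesystem, which is how selinux_sysctl_get_sid() hands it to security_genfs_sid() together with "proc".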

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 include/linux/sysctl.h   |2 ++
 kernel/sysctl.c  |   27 +++
 security/selinux/hooks.c |   35 +--
 3 files changed, 34 insertions(+), 30 deletions(-)

--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -963,6 +963,8 @@ extern int proc_doulongvec_minmax(ctl_ta
 extern int proc_doulongvec_ms_jiffies_minmax(ctl_table *table, int,
  struct file *, void __user *, size_t *, loff_t *);
 
+extern char *sysctl_pathname(ctl_table *, char *, int);
+
 extern int do_sysctl (int __user *name, int nlen,
  void __user *oldval, size_t __user *oldlenp,
  void __user *newval, size_t newlen);
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1110,6 +1110,33 @@ struct ctl_table_header *sysctl_head_nex
return NULL;
 }
 
+char *sysctl_pathname(ctl_table *table, char *buffer, int buflen)
+{
+   if (buflen < 1)
+   return NULL;
+   buffer += --buflen;
+   *buffer = '\0';
+
+   while (table) {
+   int namelen = strlen(table->procname);
+
+   if (buflen < namelen + 1)
+   return NULL;
+   buflen -= namelen + 1;
+   buffer -= namelen;
+   memcpy(buffer, table->procname, namelen);
+   *--buffer = '/';
+   table = table->parent;
+   }
+   if (buflen < 4)
+   return NULL;
+   buffer -= 4;
+   memcpy(buffer, "/sys", 4);
+
+   return buffer;
+}
+EXPORT_SYMBOL(sysctl_pathname);
+
 #ifdef CONFIG_SYSCTL_SYSCALL
 int do_sysctl(int __user *name, int nlen, void __user *oldval, size_t __user *oldlenp,
   void __user *newval, size_t newlen)
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1427,40 +1427,15 @@ static int selinux_capable(struct task_s
 
 static int selinux_sysctl_get_sid(ctl_table *table, u16 tclass, u32 *sid)
 {
-   int buflen, rc;
-   char *buffer, *path, *end;
+   char *buffer, *path;
+   int rc = -ENOMEM;
 
-   rc = -ENOMEM;
buffer = (char*)__get_free_page(GFP_KERNEL);
if (!buffer)
goto out;
-
-   buflen = PAGE_SIZE;
-   end = buffer+buflen;
-   *--end = '\0';
-   buflen--;
-   path = end-1;
-   *path = '/';
-   while (table) {
-   const char *name = table->procname;
-   size_t namelen = strlen(name);
-   buflen -= namelen + 1;
-   if (buflen < 0)
-   goto out_free;
-   end -= namelen;
-   memcpy(end, name, namelen);
-   *--end = '/';
-   path = end;
-   table = table->parent;
-   }
-   buflen -= 4;
-   if (buflen < 0)
-   goto out_free;
-   end -= 4;
-   memcpy(end, "/sys", 4);
-   path = end;
-   rc = security_genfs_sid("proc", path, tclass, sid);
-out_free:
+   path = sysctl_pathname(table, buffer, PAGE_SIZE);
+   if (path)
+   rc = security_genfs_sid("proc", path, tclass, sid);
free_page((unsigned long)buffer);
 out:
return rc;


[AppArmor 23/45] Add a struct vfsmount parameter to vfs_getxattr()

2007-05-14 Thread jjohansen

The vfsmount will be passed down to the LSM hook so that LSMs can compute
pathnames.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/nfsd/nfs4xdr.c |2 +-
 fs/nfsd/vfs.c |   21 -
 fs/xattr.c|   14 --
 include/linux/nfsd/nfsd.h |3 ++-
 include/linux/xattr.h |3 ++-
 5 files changed, 25 insertions(+), 18 deletions(-)

--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1469,7 +1469,7 @@ nfsd4_encode_fattr(struct svc_fh *fhp, s
}
if (bmval0 & (FATTR4_WORD0_ACL | FATTR4_WORD0_ACLSUPPORT
| FATTR4_WORD0_SUPPORTED_ATTRS)) {
-   err = nfsd4_get_nfs4_acl(rqstp, dentry, &acl);
+   err = nfsd4_get_nfs4_acl(rqstp, dentry, exp->ex_mnt, &acl);
aclsupport = (err == 0);
if (bmval0  FATTR4_WORD0_ACL) {
if (err == -EOPNOTSUPP)
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -378,11 +378,12 @@ out_nfserr:
 #if defined(CONFIG_NFSD_V2_ACL) || \
 defined(CONFIG_NFSD_V3_ACL) || \
 defined(CONFIG_NFSD_V4)
-static ssize_t nfsd_getxattr(struct dentry *dentry, char *key, void **buf)
+static ssize_t nfsd_getxattr(struct dentry *dentry, struct vfsmount *mnt,
+char *key, void **buf)
 {
ssize_t buflen;
 
-   buflen = vfs_getxattr(dentry, key, NULL, 0);
+   buflen = vfs_getxattr(dentry, mnt, key, NULL, 0);
if (buflen <= 0)
return buflen;
 
@@ -390,7 +391,7 @@ static ssize_t nfsd_getxattr(struct dent
if (!*buf)
return -ENOMEM;
 
-   return vfs_getxattr(dentry, key, *buf, buflen);
+   return vfs_getxattr(dentry, mnt, key, *buf, buflen);
 }
 #endif
 
@@ -479,13 +480,13 @@ out_nfserr:
 }
 
 static struct posix_acl *
-_get_posix_acl(struct dentry *dentry, char *key)
+_get_posix_acl(struct dentry *dentry, struct vfsmount *mnt, char *key)
 {
void *buf = NULL;
struct posix_acl *pacl = NULL;
int buflen;
 
-   buflen = nfsd_getxattr(dentry, key, &buf);
+   buflen = nfsd_getxattr(dentry, mnt, key, &buf);
if (!buflen)
buflen = -ENODATA;
if (buflen <= 0)
@@ -497,14 +498,15 @@ _get_posix_acl(struct dentry *dentry, ch
 }
 
 int
-nfsd4_get_nfs4_acl(struct svc_rqst *rqstp, struct dentry *dentry, struct nfs4_acl **acl)
+nfsd4_get_nfs4_acl(struct svc_rqst *rqstp, struct dentry *dentry,
+  struct vfsmount *mnt, struct nfs4_acl **acl)
 {
struct inode *inode = dentry->d_inode;
int error = 0;
struct posix_acl *pacl = NULL, *dpacl = NULL;
unsigned int flags = 0;
 
-   pacl = _get_posix_acl(dentry, POSIX_ACL_XATTR_ACCESS);
+   pacl = _get_posix_acl(dentry, mnt, POSIX_ACL_XATTR_ACCESS);
if (IS_ERR(pacl) && PTR_ERR(pacl) == -ENODATA)
pacl = posix_acl_from_mode(inode->i_mode, GFP_KERNEL);
if (IS_ERR(pacl)) {
@@ -514,7 +516,7 @@ nfsd4_get_nfs4_acl(struct svc_rqst *rqst
}
 
if (S_ISDIR(inode->i_mode)) {
-   dpacl = _get_posix_acl(dentry, POSIX_ACL_XATTR_DEFAULT);
+   dpacl = _get_posix_acl(dentry, mnt, POSIX_ACL_XATTR_DEFAULT);
if (IS_ERR(dpacl) && PTR_ERR(dpacl) == -ENODATA)
dpacl = NULL;
else if (IS_ERR(dpacl)) {
@@ -1942,7 +1944,8 @@ nfsd_get_posix_acl(struct svc_fh *fhp, i
return ERR_PTR(-EOPNOTSUPP);
}
 
-   size = nfsd_getxattr(fhp->fh_dentry, name, &value);
+   size = nfsd_getxattr(fhp->fh_dentry, fhp->fh_export->ex_mnt, name,
+&value);
if (size < 0)
return ERR_PTR(size);
 
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -106,7 +106,8 @@ out:
 EXPORT_SYMBOL_GPL(vfs_setxattr);
 
 ssize_t
-vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size)
+vfs_getxattr(struct dentry *dentry, struct vfsmount *mnt, char *name,
+void *value, size_t size)
 {
struct inode *inode = dentry->d_inode;
int error;
@@ -278,7 +279,8 @@ sys_fsetxattr(int fd, char __user *name,
  * Extended attribute GET operations
  */
 static ssize_t
-getxattr(struct dentry *d, char __user *name, void __user *value, size_t size)
+getxattr(struct dentry *dentry, struct vfsmount *mnt, char __user *name,
+void __user *value, size_t size)
 {
ssize_t error;
void *kvalue = NULL;
@@ -298,7 +300,7 @@ getxattr(struct dentry *d, char __user *
return -ENOMEM;
}
 
-   error = vfs_getxattr(d, kname, kvalue, size);
+   error = vfs_getxattr(dentry, mnt, kname, kvalue, size);
if (error > 0) {
if (size && copy_to_user(value, kvalue, error))
error = -EFAULT;
@@ -321,7 +323,7 @@ sys_getxattr(char __user *path, char __u
error = user_path_walk(path, &nd);
if

[AppArmor 28/45] Pass struct vfsmount to the inode_removexattr LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.
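
Why a dentry alone is not enough: one filesystem can be mounted (or bind-mounted) in several places, so the same dentry may be reachable under different pathnames, and only the vfsmount tells which one the caller used. The following user-space sketch illustrates this under simplifying assumptions: `toy_dentry`, `toy_mount` and `toy_path()` are hypothetical stand-ins, not the kernel's structures. The path is built right-to-left from the end of the buffer, crossing into the parent mount whenever a mount root is reached.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-ins for dentry / vfsmount. */
struct toy_dentry {
	const char *name;           /* last path component */
	struct toy_dentry *parent;  /* points to itself at a filesystem root */
};

struct toy_mount {
	struct toy_dentry *root;        /* root dentry of the mounted fs */
	struct toy_dentry *mountpoint;  /* dentry this mount sits on */
	struct toy_mount *parent;       /* points to itself for the root mount */
};

/*
 * Build the pathname of (mnt, d) right-to-left at the end of buf.
 * Returns a pointer into buf, or NULL if buf is too small or the
 * dentry is not below mnt->root.
 */
static const char *toy_path(const struct toy_dentry *d,
			    const struct toy_mount *mnt,
			    char *buf, size_t len)
{
	char *p = buf + len;

	if (len == 0)
		return NULL;
	*--p = '\0';
	for (;;) {
		if (d == mnt->root) {
			if (mnt->parent == mnt)
				break;          /* reached the namespace root */
			d = mnt->mountpoint;    /* continue in the parent mount */
			mnt = mnt->parent;
			continue;
		}
		if (d->parent == d)
			return NULL;            /* malformed toy tree */
		size_t n = strlen(d->name);
		if ((size_t)(p - buf) < n + 1)
			return NULL;
		p -= n;
		memcpy(p, d->name, n);
		*--p = '/';
		d = d->parent;
	}
	if (*p == '\0') {               /* the path is "/" itself */
		if (p == buf)
			return NULL;
		*--p = '/';
	}
	return p;
}
```

With one toy filesystem mounted on both /home and /mnt, toy_path() returns "/home/alice" or "/mnt/alice" for the very same dentry, depending only on which mount is passed in.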

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/xattr.c   |2 +-
 include/linux/security.h |   15 +--
 security/commoncap.c |3 ++-
 security/dummy.c |3 ++-
 security/selinux/hooks.c |3 ++-
 5 files changed, 16 insertions(+), 10 deletions(-)

--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -177,7 +177,7 @@ vfs_removexattr(struct dentry *dentry, s
if (error)
return error;
 
-   error = security_inode_removexattr(dentry, name);
+   error = security_inode_removexattr(dentry, mnt, name);
if (error)
return error;
 
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -50,7 +50,7 @@ extern int cap_bprm_set_security (struct
 extern void cap_bprm_apply_creds (struct linux_binprm *bprm, int unsafe);
 extern int cap_bprm_secureexec(struct linux_binprm *bprm);
 extern int cap_inode_setxattr(struct dentry *dentry, struct vfsmount *mnt, char *name, void *value, size_t size, int flags);
-extern int cap_inode_removexattr(struct dentry *dentry, char *name);
+extern int cap_inode_removexattr(struct dentry *dentry, struct vfsmount *mnt, char *name);
 extern int cap_task_post_setuid (uid_t old_ruid, uid_t old_euid, uid_t old_suid, int flags);
 extern void cap_task_reparent_to_init (struct task_struct *p);
 extern int cap_syslog (int type);
@@ -1251,7 +1251,8 @@ struct security_operations {
int (*inode_getxattr) (struct dentry *dentry, struct vfsmount *mnt,
   char *name);
int (*inode_listxattr) (struct dentry *dentry, struct vfsmount *mnt);
-   int (*inode_removexattr) (struct dentry *dentry, char *name);
+   int (*inode_removexattr) (struct dentry *dentry, struct vfsmount *mnt,
+ char *name);
const char *(*inode_xattr_getsuffix) (void);
int (*inode_getsecurity)(const struct inode *inode, const char *name, void *buffer, size_t size, int err);
int (*inode_setsecurity)(struct inode *inode, const char *name, const void *value, size_t size, int flags);
@@ -1799,11 +1800,12 @@ static inline int security_inode_listxat
return security_ops->inode_listxattr (dentry, mnt);
 }
 
-static inline int security_inode_removexattr (struct dentry *dentry, char *name)
+static inline int security_inode_removexattr (struct dentry *dentry,
+ struct vfsmount *mnt, char *name)
 {
if (unlikely (IS_PRIVATE (dentry->d_inode)))
return 0;
-   return security_ops->inode_removexattr (dentry, name);
+   return security_ops->inode_removexattr (dentry, mnt, name);
 }
 
 static inline const char *security_inode_xattr_getsuffix(void)
@@ -2502,9 +2504,10 @@ static inline int security_inode_listxat
return 0;
 }
 
-static inline int security_inode_removexattr (struct dentry *dentry, char *name)
+static inline int security_inode_removexattr (struct dentry *dentry,
+ struct vfsmount *mnt, char *name)
 {
-   return cap_inode_removexattr(dentry, name);
+   return cap_inode_removexattr(dentry, mnt, name);
 }
 
 static inline const char *security_inode_xattr_getsuffix (void)
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -200,7 +200,8 @@ int cap_inode_setxattr(struct dentry *de
return 0;
 }
 
-int cap_inode_removexattr(struct dentry *dentry, char *name)
+int cap_inode_removexattr(struct dentry *dentry, struct vfsmount *mnt,
+ char *name)
 {
if (!strncmp(name, XATTR_SECURITY_PREFIX,
 sizeof(XATTR_SECURITY_PREFIX) - 1)  &&
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -379,7 +379,8 @@ static int dummy_inode_listxattr (struct
return 0;
 }
 
-static int dummy_inode_removexattr (struct dentry *dentry, char *name)
+static int dummy_inode_removexattr (struct dentry *dentry, struct vfsmount *mnt,
+   char *name)
 {
if (!strncmp(name, XATTR_SECURITY_PREFIX,
 sizeof(XATTR_SECURITY_PREFIX) - 1) &&
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2404,7 +2404,8 @@ static int selinux_inode_listxattr (stru
return dentry_has_perm(current, NULL, dentry, FILE__GETATTR);
 }
 
-static int selinux_inode_removexattr (struct dentry *dentry, char *name)
+static int selinux_inode_removexattr (struct dentry *dentry,
+ struct vfsmount *mnt, char *name)
 {
if (strcmp(name, XATTR_NAME_SELINUX)) {
if (!strncmp(name, XATTR_SECURITY_PREFIX,


[AppArmor 25/45] Add a struct vfsmount parameter to vfs_listxattr()

2007-05-14 Thread jjohansen

The vfsmount will be passed down to the LSM hook so that LSMs can compute
pathnames.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/xattr.c|   25 ++---
 include/linux/xattr.h |3 ++-
 2 files changed, 16 insertions(+), 12 deletions(-)

--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -143,18 +143,20 @@ vfs_getxattr(struct dentry *dentry, stru
 EXPORT_SYMBOL_GPL(vfs_getxattr);
 
 ssize_t
-vfs_listxattr(struct dentry *d, char *list, size_t size)
+vfs_listxattr(struct dentry *dentry, struct vfsmount *mnt, char *list,
+ size_t size)
 {
+   struct inode *inode = dentry->d_inode;
ssize_t error;
 
-   error = security_inode_listxattr(d);
+   error = security_inode_listxattr(dentry);
if (error)
return error;
error = -EOPNOTSUPP;
-   if (d->d_inode->i_op && d->d_inode->i_op->listxattr) {
-   error = d->d_inode->i_op->listxattr(d, list, size);
-   } else {
-   error = security_inode_listsecurity(d->d_inode, list, size);
+   if (inode->i_op && inode->i_op->listxattr)
+   error = inode->i_op->listxattr(dentry, list, size);
+   else {
+   error = security_inode_listsecurity(inode, list, size);
if (size && error > size)
error = -ERANGE;
}
@@ -362,7 +364,8 @@ sys_fgetxattr(int fd, char __user *name,
  * Extended attribute LIST operations
  */
 static ssize_t
-listxattr(struct dentry *d, char __user *list, size_t size)
+listxattr(struct dentry *dentry, struct vfsmount *mnt, char __user *list,
+ size_t size)
 {
ssize_t error;
char *klist = NULL;
@@ -375,7 +378,7 @@ listxattr(struct dentry *d, char __user 
return -ENOMEM;
}
 
-   error = vfs_listxattr(d, klist, size);
+   error = vfs_listxattr(dentry, mnt, klist, size);
if (error > 0) {
if (size && copy_to_user(list, klist, error))
error = -EFAULT;
@@ -397,7 +400,7 @@ sys_listxattr(char __user *path, char __
error = user_path_walk(path, &nd);
if (error)
return error;
-   error = listxattr(nd.dentry, list, size);
+   error = listxattr(nd.dentry, nd.mnt, list, size);
path_release(&nd);
return error;
 }
@@ -411,7 +414,7 @@ sys_llistxattr(char __user *path, char _
error = user_path_walk_link(path, &nd);
if (error)
return error;
-   error = listxattr(nd.dentry, list, size);
+   error = listxattr(nd.dentry, nd.mnt, list, size);
path_release(&nd);
return error;
 }
@@ -426,7 +429,7 @@ sys_flistxattr(int fd, char __user *list
if (!f)
return error;
audit_inode(NULL, f->f_path.dentry->d_inode);
-   error = listxattr(f->f_path.dentry, list, size);
+   error = listxattr(f->f_path.dentry, f->f_path.mnt, list, size);
fput(f);
return error;
 }
--- a/include/linux/xattr.h
+++ b/include/linux/xattr.h
@@ -48,7 +48,8 @@ struct xattr_handler {
 
 ssize_t vfs_getxattr(struct dentry *, struct vfsmount *, char *, void *,
 size_t);
-ssize_t vfs_listxattr(struct dentry *d, char *list, size_t size);
+ssize_t vfs_listxattr(struct dentry *d, struct vfsmount *, char *list,
+ size_t size);
 int vfs_setxattr(struct dentry *, struct vfsmount *, char *, void *, size_t,
 int);
 int vfs_removexattr(struct dentry *, char *);
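
In its security_inode_listsecurity() fallback, vfs_listxattr() above follows the usual xattr sizing convention: a call with size 0 reports how many bytes the name list needs, and a non-zero size that is too small yields -ERANGE (the `size && error > size` check). A minimal user-space sketch of that two-pass protocol; `toy_listxattr()`, `toy_fetch_list()` and the fixed attribute list are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical names, stored as consecutive NUL-terminated strings. */
static const char toy_names[] = "user.a\0security.b"; /* 7 + 11 = 18 bytes */

/*
 * Mimics the listxattr size convention: size == 0 returns the needed
 * length, a too-small buffer returns -ERANGE, otherwise the list is
 * copied and its length returned.
 */
static ssize_t toy_listxattr(char *list, size_t size)
{
	ssize_t needed = (ssize_t)sizeof(toy_names);

	if (size == 0)
		return needed;
	if ((size_t)needed > size)
		return -ERANGE;
	memcpy(list, toy_names, (size_t)needed);
	return needed;
}

/* Two-pass caller: probe for the size, then allocate and fetch. */
static char *toy_fetch_list(ssize_t *len_out)
{
	ssize_t len = toy_listxattr(NULL, 0);
	char *buf;

	if (len <= 0)
		return NULL;
	buf = malloc((size_t)len);
	if (!buf)
		return NULL;
	len = toy_listxattr(buf, (size_t)len);
	if (len < 0) {
		free(buf);
		return NULL;
	}
	*len_out = len;
	return buf;
}
```

Real callers usually retry on -ERANGE, since the list can grow between the probing call and the fetching call.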


[AppArmor 38/45] AppArmor: Module and LSM hooks

2007-05-14 Thread jjohansen

Module parameters, LSM hooks, initialization and teardown.
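
The aabool/aauint parameter types in this patch wrap the stock param_set_bool()/param_get_bool() (and uint) helpers so that a task confined by an AppArmor profile gets -EPERM instead of reading or flipping the global switches. A self-contained sketch of that gating pattern for the setter; `toy_task`, its `confined` flag (standing in for aa_task_context(current) being non-NULL) and both functions are hypothetical, and the parser is only a simplified imitation of param_set_bool().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical task model: confined != 0 plays the role of a task context. */
struct toy_task { int confined; };

/* Simplified boolean parser in the spirit of param_set_bool(). */
static int toy_param_set_bool(const char *val, int *out)
{
	if (!val)
		val = "1";              /* a bare flag means "on" */
	switch (val[0]) {
	case 'y': case 'Y': case '1':
		*out = 1;
		return 0;
	case 'n': case 'N': case '0':
		*out = 0;
		return 0;
	}
	return -EINVAL;
}

/* Gated setter: confined tasks are refused before any parsing happens. */
static int toy_param_set_aabool(const struct toy_task *task, const char *val,
				int *out)
{
	if (task->confined)
		return -EPERM;
	return toy_param_set_bool(val, out);
}
```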

Signed-off-by: John Johansen [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

Index: b/security/apparmor/lsm.c
===
--- /dev/null
+++ b/security/apparmor/lsm.c
@@ -0,0 +1,790 @@
+/*
+ * Copyright (C) 1998-2007 Novell/SUSE
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ * AppArmor LSM interface
+ */
+
+#include <linux/security.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/ctype.h>
+#include <linux/sysctl.h>
+
+#include "apparmor.h"
+#include "inline.h"
+
+static int param_set_aabool(const char *val, struct kernel_param *kp);
+static int param_get_aabool(char *buffer, struct kernel_param *kp);
+#define param_check_aabool(name, p) __param_check(name, p, int)
+
+static int param_set_aauint(const char *val, struct kernel_param *kp);
+static int param_get_aauint(char *buffer, struct kernel_param *kp);
+#define param_check_aauint(name, p) __param_check(name, p, int)
+
+/* Flag values, also controllable via /sys/module/apparmor/parameters
+ * We define special types as we want to do additional mediation.
+ *
+ * Complain mode -- in complain mode access failures result in auditing only
+ * and task is allowed access.  audit events are processed by userspace to
+ * generate policy.  Default is 'enforce' (0).
+ * Value is also togglable per profile and referenced when global value is
+ * enforce.
+ */
+int apparmor_complain = 0;
+module_param_named(complain, apparmor_complain, aabool, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(apparmor_complain, "Toggle AppArmor complain mode");
+
+/* Debug mode */
+int apparmor_debug = 0;
+module_param_named(debug, apparmor_debug, aabool, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(apparmor_debug, "Toggle AppArmor debug mode");
+
+/* Audit mode */
+int apparmor_audit = 0;
+module_param_named(audit, apparmor_audit, aabool, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(apparmor_audit, "Toggle AppArmor audit mode");
+
+/* Syscall logging mode */
+int apparmor_logsyscall = 0;
+module_param_named(logsyscall, apparmor_logsyscall, aabool, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(apparmor_logsyscall, "Toggle AppArmor logsyscall mode");
+
+/* Maximum pathname length before accesses will start getting rejected */
+unsigned int apparmor_path_max = 2 * PATH_MAX;
+module_param_named(path_max, apparmor_path_max, aauint, S_IRUSR | S_IWUSR);
+MODULE_PARM_DESC(apparmor_path_max, "Maximum pathname length allowed");
+
+static int param_set_aabool(const char *val, struct kernel_param *kp)
+{
+   if (aa_task_context(current))
+   return -EPERM;
+   return param_set_bool(val, kp);
+}
+
+static int param_get_aabool(char *buffer, struct kernel_param *kp)
+{
+   if (aa_task_context(current))
+   return -EPERM;
+   return param_get_bool(buffer, kp);
+}
+
+static int param_set_aauint(const char *val, struct kernel_param *kp)
+{
+   if (aa_task_context(current))
+   return -EPERM;
+   return param_set_uint(val, kp);
+}
+
+static int param_get_aauint(char *buffer, struct kernel_param *kp)
+{
+   if (aa_task_context(current))
+   return -EPERM;
+   return param_get_uint(buffer, kp);
+}
+
+static int aa_reject_syscall(struct task_struct *task, gfp_t flags,
+const char *name)
+{
+   struct aa_profile *profile = aa_get_profile(task);
+   int error = 0;
+
+   if (profile) {
+   error = aa_audit_syscallreject(profile, flags, name);
+   aa_put_profile(profile);
+   }
+
+   return error;
+}
+
+static int apparmor_ptrace(struct task_struct *parent,
+  struct task_struct *child)
+{
+   struct aa_task_context *cxt;
+   struct aa_task_context *child_cxt;
+   struct aa_profile *child_profile;
+   int error = 0;
+
+   /*
+* parent can ptrace child when
+* - parent is unconfined
+*   - parent && child are in the same namespace
+*   - parent is in complain mode
+*   - parent and child are confined by the same profile
+*   - parent profile has CAP_SYS_PTRACE
+*/
+
+   rcu_read_lock();
+   cxt = aa_task_context(parent);
+   child_cxt = aa_task_context(child);
+   child_profile = child_cxt ? child_cxt->profile : NULL;
+   if (cxt && (parent->nsproxy != child->nsproxy)) {
+   aa_audit_message(NULL, GFP_ATOMIC, "REJECTING ptrace across "
+"namespace of %d by %d",
+parent->pid, child->pid);
+   error = -EPERM;
+   } else {
+   error = aa_may_ptrace(cxt, child_profile);
+   if (cxt
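
The rule list in the comment at the top of apparmor_ptrace() can be read as a small decision function: an unconfined parent may always trace; a confined parent must share the child's namespace and, in addition, be in complain mode, share the child's profile, or hold CAP_SYS_PTRACE. The function body (and aa_may_ptrace()) is cut off in this archive, so the following is only a hypothetical restatement of that comment with made-up types (`toy_profile`, `toy_task`, `toy_may_ptrace()`); auditing and complain-mode logging are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_profile {
	bool complain;          /* profile is in complain (learning) mode */
	bool cap_sys_ptrace;    /* profile grants CAP_SYS_PTRACE */
};

struct toy_task {
	const struct toy_profile *profile;  /* NULL means unconfined */
	int ns;                             /* namespace identifier */
};

/* Returns 0 if parent may trace child, -EPERM otherwise. */
static int toy_may_ptrace(const struct toy_task *parent,
			  const struct toy_task *child)
{
	const struct toy_profile *p = parent->profile;

	if (!p)
		return 0;                   /* unconfined parent */
	if (parent->ns != child->ns)
		return -EPERM;              /* confined: never across namespaces */
	if (p->complain || p == child->profile || p->cap_sys_ptrace)
		return 0;
	return -EPERM;
}
```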

[AppArmor 40/45] AppArmor: all the rest

2007-05-14 Thread jjohansen

All the things that didn't nicely fit in a category on their own: kbuild
code, declarations and inline functions, /sys/kernel/security/apparmor
filesystem for controlling apparmor from user space, profile list
functions, locking documentation, /proc/$pid/task/$tid/attr/current
access.
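
The kbuild code in this patch generates capability_names.h with sed: each `#define CAP_<NAME> <number>` line of include/linux/capability.h (except CAP_FS_MASK) becomes a lower-cased designated-initializer entry such as `[0]  = "chown",`, which main.c includes into its capability_names[] table. A rough C rendering of that per-line transformation, for illustration only; `toy_cap_line()` is hypothetical and uses sscanf() rather than the sed regular expression, so its matching is only approximately the same.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Convert one capability.h line into an entry such as [0]  = "chown",
 * Returns 1 on success, 0 if the line does not match (or out is too small).
 */
static int toy_cap_line(const char *line, char *out, size_t outlen)
{
	char name[64];
	unsigned int num;
	char extra;

	if (strstr(line, "CAP_FS_MASK"))
		return 0;               /* mirrors the /CAP_FS_MASK/d expression */
	/* "#define CAP_<NAME> <digits>" with nothing after the number */
	if (sscanf(line,
		   "#define CAP_%63[ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] %u %c",
		   name, &num, &extra) != 2)
		return 0;
	for (char *s = name; *s; s++)
		*s = (char)tolower((unsigned char)*s);   /* like tr A-Z a-z */
	return snprintf(out, outlen, "[%u]  = \"%s\",", num, name) < (int)outlen;
}
```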

Signed-off-by: John Johansen [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 security/apparmor/Kconfig  |9 +
 security/apparmor/Makefile |   13 ++
 security/apparmor/apparmor.h   |  259 +
 security/apparmor/apparmorfs.c |  250 +++
 security/apparmor/inline.h |  219 ++
 security/apparmor/list.c   |   94 ++
 security/apparmor/locking.txt  |   59 +
 security/apparmor/procattr.c   |  138 +
 8 files changed, 1041 insertions(+)

--- /dev/null
+++ b/security/apparmor/Kconfig
@@ -0,0 +1,9 @@
+config SECURITY_APPARMOR
+   tristate "AppArmor support"
+   depends on SECURITY!=n
+   help
+ This enables the AppArmor security module.
+ Required userspace tools (if they are not included in your
+ distribution) and further information may be found at
+ http://forge.novell.com/modules/xfmod/project/?apparmor
+ If you are unsure how to answer this question, answer N.
--- /dev/null
+++ b/security/apparmor/Makefile
@@ -0,0 +1,13 @@
+# Makefile for AppArmor Linux Security Module
+#
+obj-$(CONFIG_SECURITY_APPARMOR) += apparmor.o
+
+apparmor-y := main.o list.o procattr.o lsm.o apparmorfs.o \
+ module_interface.o match.o
+
+quiet_cmd_make-caps = GEN $@
+cmd_make-caps = sed -n -e "/CAP_FS_MASK/d" -e "s/^\#define[ \\t]\\+CAP_\\([A-Z0-9_]\\+\\)[ \\t]\\+\\([0-9]\\+\\)\$$/[\\2]  = \"\\1\",/p" $< | tr A-Z a-z > $@
+
+$(obj)/main.o : $(obj)/capability_names.h
+$(obj)/capability_names.h : $(srctree)/include/linux/capability.h
+   $(call cmd,make-caps)
--- /dev/null
+++ b/security/apparmor/apparmor.h
@@ -0,0 +1,259 @@
+/*
+ * Copyright (C) 1998-2007 Novell/SUSE
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ * AppArmor internal prototypes
+ */
+
+#ifndef __APPARMOR_H
+#define __APPARMOR_H
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/binfmts.h>
+#include <linux/rcupdate.h>
+
+/*
+ * We use MAY_READ, MAY_WRITE, MAY_EXEC, and the following flags for
+ * profile permissions (we don't use MAY_APPEND):
+ */
+#define AA_MAY_LINK		0x0010
+#define AA_EXEC_INHERIT		0x0020
+#define AA_EXEC_UNCONFINED	0x0040
+#define AA_EXEC_PROFILE		0x0080
+#define AA_EXEC_MMAP		0x0100
+#define AA_EXEC_UNSAFE		0x0200
+
+#define AA_EXEC_MODIFIERS  (AA_EXEC_INHERIT | \
+AA_EXEC_UNCONFINED | \
+AA_EXEC_PROFILE)
+
+#define AA_SECURE_EXEC_NEEDED  1
+
+/* Control parameters (0 or 1), settable thru module/boot flags or
+ * via /sys/kernel/security/apparmor/control */
+extern int apparmor_complain;
+extern int apparmor_debug;
+extern int apparmor_audit;
+extern int apparmor_logsyscall;
+extern unsigned int apparmor_path_max;
+
+#define PROFILE_COMPLAIN(_profile) \
+   (apparmor_complain == 1 || ((_profile) && (_profile)->flags.complain))
+
+#define APPARMOR_COMPLAIN(_cxt) \
+   (apparmor_complain == 1 || \
+((_cxt) && (_cxt)->profile && (_cxt)->profile->flags.complain))
+
+#define PROFILE_AUDIT(_profile) \
+   (apparmor_audit == 1 || ((_profile) && (_profile)->flags.audit))
+
+#define APPARMOR_AUDIT(_cxt) \
+   (apparmor_audit == 1 || \
+((_cxt) && (_cxt)->profile && (_cxt)->profile->flags.audit))
+
+/*
+ * DEBUG remains global (no per profile flag) since it is mostly used in sysctl
+ * which is not related to profile accesses.
+ */
+
+#define AA_DEBUG(fmt, args...) \
+   do {\
+   if (apparmor_debug) \
+   printk(KERN_DEBUG "AppArmor: " fmt, ##args);\
+   } while (0)
+
+#define AA_ERROR(fmt, args...) printk(KERN_ERR "AppArmor: " fmt, ##args)
+
+/* struct aa_profile - basic confinement data
+ * @parent: non refcounted pointer to parent profile
+ * @name: the profiles name
+ * @file_rules: dfa containing the profiles file rules
+ * @list: list this profile is on
+ * @sub: profiles list of subprofiles (HATS)
+ * @flags: flags controlling profile behavior
+ * @null_profile: if needed per profile learning and null confinement profile
+ * @isstale: flag indicating if profile is stale
+ * @capabilities: capabilities granted by the process
+ *
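
PROFILE_COMPLAIN() and PROFILE_AUDIT() in apparmor.h combine a global switch with a per-profile flag: a global value of 1 wins, otherwise the profile's own flag decides, and a NULL profile counts as "not set". A small truth-table sketch of that logic, using a hypothetical `toy_profile` type and a `toy_complain` global in place of apparmor_complain; the macro has the same shape as PROFILE_COMPLAIN().

```c
#include <assert.h>
#include <stddef.h>

struct toy_profile {
	struct { unsigned int complain:1, audit:1; } flags;
};

static int toy_complain;   /* stands in for apparmor_complain */

/* Global override first, then the per-profile flag (NULL-safe). */
#define TOY_PROFILE_COMPLAIN(_profile) \
	(toy_complain == 1 || ((_profile) && (_profile)->flags.complain))
```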

[AppArmor 03/45] Add a vfsmount parameter to notify_change()

2007-05-14 Thread jjohansen

The vfsmount parameter must be set appropriately for files visible
outside the kernel. Files that are only used in a filesystem (e.g.,
reiserfs xattr files) will have a NULL vfsmount.
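
With the extra parameter a hook can distinguish kernel-internal files (NULL vfsmount, no meaningful pathname) from files reached through a mount. A minimal sketch of how a path-based check might treat the two cases, assuming (hypothetically) that internal objects are simply not mediated; `toy_vfsmount`, `toy_path_allowed()` and `toy_setattr_hook()` are illustrative names, not kernel or AppArmor functions, and the "only /tmp is writable" policy is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct toy_vfsmount { const char *mountpoint; };   /* hypothetical */

/* Hypothetical policy: only paths under /tmp may have attributes changed. */
static int toy_path_allowed(const char *path)
{
	return strncmp(path, "/tmp/", 5) == 0;
}

/*
 * Hypothetical setattr hook. With mnt == NULL (e.g. reiserfs xattr files
 * used only inside the filesystem) there is no pathname to mediate, so
 * this sketch lets the operation through.
 */
static int toy_setattr_hook(const struct toy_vfsmount *mnt, const char *name)
{
	char path[256];

	if (!mnt)
		return 0;
	if (snprintf(path, sizeof path, "%s/%s", mnt->mountpoint, name) >=
	    (int)sizeof path)
		return -ENAMETOOLONG;
	return toy_path_allowed(path) ? 0 : -EACCES;
}
```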

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/attr.c   |3 ++-
 fs/ecryptfs/inode.c |4 +++-
 fs/exec.c   |3 ++-
 fs/fat/file.c   |2 +-
 fs/hpfs/namei.c |2 +-
 fs/namei.c  |3 ++-
 fs/nfsd/vfs.c   |8 
 fs/open.c   |   28 +++-
 fs/reiserfs/xattr.c |6 +++---
 fs/sysfs/file.c |2 +-
 fs/utimes.c |   11 ++-
 include/linux/fs.h  |6 +++---
 mm/filemap.c|2 +-
 mm/tiny-shmem.c |2 +-
 14 files changed, 45 insertions(+), 37 deletions(-)

--- a/fs/attr.c
+++ b/fs/attr.c
@@ -100,7 +100,8 @@ int inode_setattr(struct inode * inode, 
 }
 EXPORT_SYMBOL(inode_setattr);
 
-int notify_change(struct dentry * dentry, struct iattr * attr)
+int notify_change(struct dentry *dentry, struct vfsmount *mnt,
+ struct iattr *attr)
 {
struct inode *inode = dentry->d_inode;
mode_t mode;
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -870,12 +870,14 @@ static int ecryptfs_setattr(struct dentr
 {
int rc = 0;
struct dentry *lower_dentry;
+   struct vfsmount *lower_mnt;
struct inode *inode;
struct inode *lower_inode;
struct ecryptfs_crypt_stat *crypt_stat;
 
crypt_stat = &ecryptfs_inode_to_private(dentry->d_inode)->crypt_stat;
lower_dentry = ecryptfs_dentry_to_lower(dentry);
+   lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
inode = dentry-d_inode;
lower_inode = ecryptfs_inode_to_lower(inode);
if (ia->ia_valid & ATTR_SIZE) {
@@ -890,7 +892,7 @@ static int ecryptfs_setattr(struct dentr
if (rc < 0)
goto out;
}
-   rc = notify_change(lower_dentry, ia);
+   rc = notify_change(lower_dentry, lower_mnt, ia);
 out:
fsstack_copy_attr_all(inode, lower_inode, NULL);
return rc;
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1564,7 +1564,8 @@ int do_coredump(long signr, int exit_cod
goto close_fail;
if (!file->f_op->write)
goto close_fail;
-   if (!ispipe && do_truncate(file->f_path.dentry, 0, 0, file) != 0)
+   if (!ispipe &&
+   do_truncate(file->f_path.dentry, file->f_path.mnt, 0, 0, file) != 0)
goto close_fail;
 
retval = binfmt-core_dump(signr, regs, file);
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -92,7 +92,7 @@ int fat_generic_ioctl(struct inode *inod
}
 
/* This MUST be done before doing anything irreversible... */
-   err = notify_change(filp->f_path.dentry, &ia);
+   err = notify_change(filp->f_path.dentry, filp->f_path.mnt, &ia);
if (err)
goto up;
 
--- a/fs/hpfs/namei.c
+++ b/fs/hpfs/namei.c
@@ -426,7 +426,7 @@ again:
/*printk("HPFS: truncating file before delete.\n");*/
newattrs.ia_size = 0;
newattrs.ia_valid = ATTR_SIZE | ATTR_CTIME;
-   err = notify_change(dentry, &newattrs);
+   err = notify_change(dentry, NULL, &newattrs);
put_write_access(inode);
if (!err)
goto again;
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1598,7 +1598,8 @@ int may_open(struct nameidata *nd, int a
if (!error) {
DQUOT_INIT(inode);

-   error = do_truncate(dentry, 0, ATTR_MTIME|ATTR_CTIME, NULL);
+   error = do_truncate(dentry, nd->mnt, 0,
+   ATTR_MTIME|ATTR_CTIME, NULL);
}
put_write_access(inode);
if (error)
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -358,7 +358,7 @@ nfsd_setattr(struct svc_rqst *rqstp, str
err = nfserr_notsync;
if (!check_guard || guardtime == inode->i_ctime.tv_sec) {
fh_lock(fhp);
-   host_err = notify_change(dentry, iap);
+   host_err = notify_change(dentry, fhp->fh_export->ex_mnt, iap);
err = nfserrno(host_err);
fh_unlock(fhp);
}
@@ -893,13 +893,13 @@ out:
return err;
 }
 
-static void kill_suid(struct dentry *dentry)
+static void kill_suid(struct dentry *dentry, struct vfsmount *mnt)
 {
struct iattr ia;
ia.ia_valid = ATTR_KILL_SUID | ATTR_KILL_SGID;
 
mutex_lock(&dentry->d_inode->i_mutex);
-   notify_change(dentry, &ia);
+   notify_change(dentry, mnt, &ia);
mutex_unlock(&dentry->d_inode->i_mutex);
 }
 
@@ -958,7 +958,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, s

[AppArmor 44/45] Switch to vfs_permission() in sys_fchdir()

2007-05-14 Thread jjohansen

Switch from file_permission() to vfs_permission() in sys_fchdir(): this
avoids calling permission() with a NULL nameidata here.

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 fs/open.c |   16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

--- a/fs/open.c
+++ b/fs/open.c
@@ -440,10 +440,8 @@ out:
 
 asmlinkage long sys_fchdir(unsigned int fd)
 {
+   struct nameidata nd;
struct file *file;
-   struct dentry *dentry;
-   struct inode *inode;
-   struct vfsmount *mnt;
int error;
 
error = -EBADF;
@@ -451,17 +449,17 @@ asmlinkage long sys_fchdir(unsigned int 
if (!file)
goto out;
 
-   dentry = file->f_path.dentry;
-   mnt = file->f_path.mnt;
-   inode = dentry->d_inode;
+   nd.dentry = file->f_path.dentry;
+   nd.mnt = file->f_path.mnt;
+   nd.flags = 0;
 
error = -ENOTDIR;
-   if (!S_ISDIR(inode->i_mode))
+   if (!S_ISDIR(nd.dentry->d_inode->i_mode))
goto out_putf;
 
-   error = file_permission(file, MAY_EXEC);
+   error = vfs_permission(&nd, MAY_EXEC);
if (!error)
-   set_fs_pwd(current->fs, mnt, dentry);
+   set_fs_pwd(current->fs, nd.mnt, nd.dentry);
 out_putf:
fput(file);
 out:


[AppArmor 37/45] AppArmor: Main Part

2007-05-14 Thread jjohansen

The underlying functions by which the AppArmor LSM hooks are implemented.
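
aa_link_denied() in this patch grants a hard link only if the link path has 'l', at least one of r/w/x/m, no r/w/x/m bit that the target path lacks, and, when 'x' is present, the same execute-modifier and unsafe bits as the target. A user-space restatement of that condition, using the AA_* values from apparmor.h; MAY_READ/MAY_WRITE/MAY_EXEC are given their 2.6.21 values (4, 2, 1) here, and the hypothetical `toy_link_denied()` takes the two already-matched modes instead of calling aa_match().

```c
#include <assert.h>

#define MAY_EXEC		1
#define MAY_WRITE		2
#define MAY_READ		4

#define AA_MAY_LINK		0x0010
#define AA_EXEC_INHERIT		0x0020
#define AA_EXEC_UNCONFINED	0x0040
#define AA_EXEC_PROFILE		0x0080
#define AA_EXEC_MMAP		0x0100
#define AA_EXEC_UNSAFE		0x0200
#define AA_EXEC_MODIFIERS	(AA_EXEC_INHERIT | AA_EXEC_UNCONFINED | \
				 AA_EXEC_PROFILE)
#define RWXM	(MAY_READ | MAY_WRITE | MAY_EXEC | AA_EXEC_MMAP)

/* Returns 0 if the link is allowed, AA_MAY_LINK otherwise. */
static int toy_link_denied(int l_mode, int t_mode)
{
	int has_link = l_mode & AA_MAY_LINK;
	/* non-empty r/w/x/m set on the link, all of it also on the target */
	int subset = (l_mode & RWXM) && !(l_mode & ~t_mode & RWXM);
	/* with 'x', the exec modifiers and unsafe bit must match exactly */
	int exec_ok = !(l_mode & MAY_EXEC) ||
		((l_mode & AA_EXEC_MODIFIERS) == (t_mode & AA_EXEC_MODIFIERS) &&
		 (l_mode & AA_EXEC_UNSAFE) == (t_mode & AA_EXEC_UNSAFE));

	return (has_link && subset && exec_ok) ? 0 : AA_MAY_LINK;
}
```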

Signed-off-by: John Johansen [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

Index: b/security/apparmor/main.c
===
--- /dev/null
+++ b/security/apparmor/main.c
@@ -0,0 +1,1399 @@
+/*
+ * Copyright (C) 2002-2007 Novell/SUSE
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ * AppArmor Core
+ */
+
+#include <linux/security.h>
+#include <linux/namei.h>
+#include <linux/audit.h>
+#include <linux/mount.h>
+#include <linux/ptrace.h>
+
+#include "apparmor.h"
+
+#include "inline.h"
+
+/*
+ * Table of capability names: we generate it from capability.h.
+ */
+static const char *capability_names[] = {
+#include "capability_names.h"
+};
+
+/* NULL complain profile
+ *
+ * Used when in complain mode, to emit "Permitting" messages for non-existent
+ * profiles and hats.  This is necessary because of selective mode, in which
+ * case we need a complain null_profile and enforce null_profile
+ *
+ * The null_complain_profile cannot be statically allocated, because it
+ * can be associated to files which keep their reference even if apparmor is
+ * unloaded
+ */
+struct aa_profile *null_complain_profile;
+
+static inline void aa_permerror2result(int perm_result, struct aa_audit *sa)
+{
+   if (perm_result == 0) { /* success */
+   sa->result = 1;
+   sa->error_code = 0;
+   } else { /* -ve internal error code or +ve mask of denied perms */
+   sa->result = 0;
+   sa->error_code = perm_result;
+   }
+}
+
+/**
+ * aa_file_denied - check for @mask access on a file
+ * @profile: profile to check against
+ * @name: pathname of file
+ * @mask: permission mask requested for file
+ *
+ * Return %0 on success, or else the permissions in @mask that the
+ * profile denies.
+ */
+static int aa_file_denied(struct aa_profile *profile, const char *name,
+ int mask)
+{
+   return (mask & ~aa_match(profile->file_rules, name));
+}
+
+/**
+ * aa_link_denied - check for permission to link a file
+ * @profile: profile to check against
+ * @link: pathname of link being created
+ * @target: pathname of target to be linked to
+ *
+ * Return %0 on success, or else the permissions that the profile denies.
+ */
+static int aa_link_denied(struct aa_profile *profile, const char *link,
+ const char *target)
+{
+   int l_mode, t_mode;
+
+   l_mode = aa_match(profile->file_rules, link);
+   t_mode = aa_match(profile->file_rules, target);
+
+   /* Link always requires 'l' on the link, a subset of the
+* target's 'r', 'w', 'x', and 'm' permissions on the link, and
+* if the link has 'x', an exact match of all the execute flags
+* ('i', 'u', 'U', 'p', 'P').
+*/
+#define RWXM (MAY_READ | MAY_WRITE | MAY_EXEC | AA_EXEC_MMAP)
+   if ((l_mode & AA_MAY_LINK) &&
+   (l_mode & RWXM) && !(l_mode & ~t_mode & RWXM) &&
+   (!(l_mode & MAY_EXEC) ||
+((l_mode & AA_EXEC_MODIFIERS) == (t_mode & AA_EXEC_MODIFIERS) &&
+ (l_mode & AA_EXEC_UNSAFE) == (t_mode & AA_EXEC_UNSAFE))))
+   return 0;
+#undef RWXM
+   /* FIXME: There currently is no way to report which permissions
+* we expect in t_mode, so linking could fail even after learning
+* the required l_mode.
+*/
+   return AA_MAY_LINK;
+}
+
+/**
+ * mangle -- escape special characters in str
+ * @str: string to escape
+ * @buffer: buffer containing str
+ *
+ * Escape special characters in @str, which is contained in @buffer. @str must
+ * be aligned to the end of the buffer, and the space between @buffer and @str
+ * may be used for escaping.
+ *
+ * Returns @str if no escaping was necessary, a pointer to the beginning of the
+ * escaped string, or NULL if there was not enough space in @buffer.  When
+ * called with a NULL buffer, the return value tells whether any escaping is
+ * necessary.
+ */
+static const char *mangle(const char *str, char *buffer)
+{
+   static const char c_escape[] = {
+   ['\a'] = 'a',   ['\b'] = 'b',
+   ['\f'] = 'f',   ['\n'] = 'n',
+   ['\r'] = 'r',   ['\t'] = 't',
+   ['\v'] = 'v',
+   [' '] = ' ',['\\'] = '\\',
+   };
+   const char *s;
+   char *t, c;
+
+#define mangle_escape(c)   \
+   unlikely((unsigned char)(c) < ARRAY_SIZE(c_escape) && \
+c_escape[(unsigned char)c])
+
+   for (s = (char *)str; (c = *s) != '\0'; s++)
+   if (mangle_escape(c))
+   goto escape;
+   return str;
+
+escape:
+   if (!buffer)
+   return NULL;
+   for (s =
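
mangle() in the patch above (cut off in this archive) escapes spaces, backslashes and the C control escapes \a \b \f \n \r \t \v in pathnames, reusing the free space in front of @str. The sketch below applies the same escape table forwards into a separate output buffer, assuming (this part is not visible above) that each special character becomes a backslash followed by its table entry; `toy_mangle()` is hypothetical and does not reproduce the in-place, end-aligned buffer handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same table as mangle(): character -> letter written after a backslash. */
static const char c_escape[] = {
	['\a'] = 'a', ['\b'] = 'b',
	['\f'] = 'f', ['\n'] = 'n',
	['\r'] = 'r', ['\t'] = 't',
	['\v'] = 'v',
	[' '] = ' ', ['\\'] = '\\',
};

#define ESCAPED(c) \
	((unsigned char)(c) < sizeof(c_escape) && c_escape[(unsigned char)(c)])

/*
 * Copy str into out, writing "\<letter>" for each special character.
 * Returns out, or NULL if out (of size outlen) is too small.
 */
static char *toy_mangle(const char *str, char *out, size_t outlen)
{
	size_t n = 0;

	for (const char *s = str; *s; s++) {
		size_t need = ESCAPED(*s) ? 2 : 1;

		if (n + need >= outlen)     /* keep room for the final NUL */
			return NULL;
		if (need == 2) {
			out[n++] = '\\';
			out[n++] = c_escape[(unsigned char)*s];
		} else {
			out[n++] = *s;
		}
	}
	if (n >= outlen)
		return NULL;
	out[n] = '\0';
	return out;
}
```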

[AppArmor 36/45] Export audit subsystem for use by modules

2007-05-14 Thread jjohansen

Adds necessary export symbols for audit subsystem routines.
Changes audit_log_vformat to be externally visible (analogous to vprintf).
Patch is not in mainline -- pending AppArmor code submission to lkml
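
Exporting the va_list variant matters for the same reason vprintf() exists alongside printf(): a module with its own variadic logging helper (aa_audit_message() is called with a format string and arguments in the lsm.c patch) can only pass those arguments on as a va_list. A user-space sketch of that forwarding pattern, with vsnprintf() standing in for audit_log_vformat(); `toy_log()`, `toy_log_v()` and `toy_log_buf` are hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static char toy_log_buf[256];

/* v-variant: consumes an already-started va_list (cf. audit_log_vformat). */
static int toy_log_v(const char *prefix, const char *fmt, va_list args)
{
	int n = snprintf(toy_log_buf, sizeof toy_log_buf, "%s", prefix);

	if (n < 0 || (size_t)n >= sizeof toy_log_buf)
		return n;
	return n + vsnprintf(toy_log_buf + n, sizeof toy_log_buf - n, fmt, args);
}

/* Variadic front end: it can only hand its arguments on as a va_list. */
static int toy_log(const char *fmt, ...)
	__attribute__((format(printf, 1, 2)));

static int toy_log(const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = toy_log_v("AppArmor: ", fmt, args);
	va_end(args);
	return n;
}
```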

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 include/linux/audit.h |5 +
 kernel/audit.c|6 --
 2 files changed, 9 insertions(+), 2 deletions(-)

--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -114,6 +114,8 @@
 #define AUDIT_ANOM_PROMISCUOUS  1700 /* Device changed promiscuous mode */
 #define AUDIT_ANOM_ABEND1701 /* Process ended abnormally */
 
+#define AUDIT_APPARMOR 1500/* AppArmor audit */
+
 #define AUDIT_KERNEL   2000/* Asynchronous audit record. NOT A REQUEST. */
 
 /* Rule flags */
@@ -499,6 +501,9 @@ extern void audit_log(struct audit_
  __attribute__((format(printf,4,5)));
 
 extern struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, int type);
+extern void audit_log_vformat(struct audit_buffer *ab,
+ const char *fmt, va_list args)
+   __attribute__((format(printf,2,0)));
 extern void audit_log_format(struct audit_buffer *ab,
 const char *fmt, ...)
__attribute__((format(printf,2,3)));
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1054,8 +1054,7 @@ static inline int audit_expand(struct au
  * will be called a second time.  Currently, we assume that a printk
  * can't format message larger than 1024 bytes, so we don't either.
  */
-static void audit_log_vformat(struct audit_buffer *ab, const char *fmt,
- va_list args)
+void audit_log_vformat(struct audit_buffer *ab, const char *fmt, va_list args)
 {
int len, avail;
struct sk_buff *skb;
@@ -1311,3 +1310,6 @@ EXPORT_SYMBOL(audit_log_start);
 EXPORT_SYMBOL(audit_log_end);
 EXPORT_SYMBOL(audit_log_format);
 EXPORT_SYMBOL(audit_log);
+EXPORT_SYMBOL_GPL(audit_log_vformat);
+EXPORT_SYMBOL_GPL(audit_log_untrustedstring);
+EXPORT_SYMBOL_GPL(audit_log_d_path);


[AppArmor 17/45] Add a struct vfsmount parameter to vfs_unlink()

2007-05-14 Thread jjohansen

The vfsmount will be passed down to the LSM hook so that LSMs can compute
pathnames.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/ecryptfs/inode.c   |3 ++-
 fs/namei.c|4 ++--
 fs/nfsd/nfs4recover.c |2 +-
 fs/nfsd/vfs.c |2 +-
 include/linux/fs.h|2 +-
 ipc/mqueue.c  |2 +-
 6 files changed, 8 insertions(+), 7 deletions(-)

--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -453,10 +453,11 @@ static int ecryptfs_unlink(struct inode 
 {
int rc = 0;
struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
+   struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir);
 
lock_parent(lower_dentry);
-   rc = vfs_unlink(lower_dir_inode, lower_dentry);
+   rc = vfs_unlink(lower_dir_inode, lower_dentry, lower_mnt);
if (rc) {
printk(KERN_ERR "Error in vfs_unlink; rc = [%d]\n", rc);
goto out_unlock;
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2105,7 +2105,7 @@ asmlinkage long sys_rmdir(const char __u
return do_rmdir(AT_FDCWD, pathname);
 }
 
-int vfs_unlink(struct inode *dir, struct dentry *dentry)
+int vfs_unlink(struct inode *dir, struct dentry *dentry, struct vfsmount *mnt)
 {
int error = may_delete(dir, dentry, 0);
 
@@ -2169,7 +2169,7 @@ static long do_unlinkat(int dfd, const c
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
-   error = vfs_unlink(nd.dentry->d_inode, dentry);
+   error = vfs_unlink(nd.dentry->d_inode, dentry, nd.mnt);
exit2:
dput(dentry);
}
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -261,7 +261,7 @@ nfsd4_remove_clid_file(struct dentry *di
return -EINVAL;
}
mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
-   status = vfs_unlink(dir->d_inode, dentry);
+   status = vfs_unlink(dir->d_inode, dentry, rec_dir.mnt);
mutex_unlock(&dir->d_inode->i_mutex);
return status;
 }
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1704,7 +1704,7 @@ nfsd_unlink(struct svc_rqst *rqstp, stru
host_err = -EPERM;
} else
 #endif
-   host_err = vfs_unlink(dirp, rdentry);
+   host_err = vfs_unlink(dirp, rdentry, exp->ex_mnt);
} else { /* It's RMDIR */
host_err = vfs_rmdir(dirp, rdentry, exp->ex_mnt);
}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -997,7 +997,7 @@ extern int vfs_mknod(struct inode *, str
 extern int vfs_symlink(struct inode *, struct dentry *, struct vfsmount *, const char *, int);
 extern int vfs_link(struct dentry *, struct vfsmount *, struct inode *, struct dentry *, struct vfsmount *);
 extern int vfs_rmdir(struct inode *, struct dentry *, struct vfsmount *);
-extern int vfs_unlink(struct inode *, struct dentry *);
+extern int vfs_unlink(struct inode *, struct dentry *, struct vfsmount *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
 
 /*
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -749,7 +749,7 @@ asmlinkage long sys_mq_unlink(const char
if (inode)
atomic_inc(&inode->i_count);
 
-   err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+   err = vfs_unlink(dentry->d_parent->d_inode, dentry, mqueue_mnt);
 out_err:
dput(dentry);
 

[AppArmor 13/45] Pass the struct vfsmounts to the inode_link LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/namei.c   |3 ++-
 include/linux/security.h |   18 +-
 security/dummy.c |6 --
 security/selinux/hooks.c |9 +++--
 4 files changed, 26 insertions(+), 10 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2293,7 +2293,8 @@ int vfs_link(struct dentry *old_dentry, 
if (S_ISDIR(old_dentry->d_inode->i_mode))
return -EPERM;
 
-   error = security_inode_link(old_dentry, dir, new_dentry);
+   error = security_inode_link(old_dentry, old_mnt, dir, new_dentry,
+   new_mnt);
if (error)
return error;
 
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -289,8 +289,10 @@ struct request_sock;
  * @inode_link:
  * Check permission before creating a new hard link to a file.
  * @old_dentry contains the dentry structure for an existing link to the file.
+ * @old_mnt is the vfsmount corresponding to @old_dentry (may be NULL).
  * @dir contains the inode structure of the parent directory of the new link.
  * @new_dentry contains the dentry structure for the new link.
+ * @new_mnt is the vfsmount corresponding to @new_dentry (may be NULL).
  * Return 0 if permission is granted.
  * @inode_unlink:
  * Check the permission to remove a hard link to a file. 
@@ -1212,8 +1214,9 @@ struct security_operations {
char **name, void **value, size_t *len);
int (*inode_create) (struct inode *dir, struct dentry *dentry,
 struct vfsmount *mnt, int mode);
-   int (*inode_link) (struct dentry *old_dentry,
-  struct inode *dir, struct dentry *new_dentry);
+   int (*inode_link) (struct dentry *old_dentry, struct vfsmount *old_mnt,
+  struct inode *dir, struct dentry *new_dentry,
+  struct vfsmount *new_mnt);
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir, struct dentry *dentry,
  struct vfsmount *mnt, const char *old_name);
@@ -1628,12 +1631,15 @@ static inline int security_inode_create 
 }
 
 static inline int security_inode_link (struct dentry *old_dentry,
+  struct vfsmount *old_mnt,
   struct inode *dir,
-  struct dentry *new_dentry)
+  struct dentry *new_dentry,
+  struct vfsmount *new_mnt)
 {
if (unlikely (IS_PRIVATE (old_dentry->d_inode)))
return 0;
-   return security_ops->inode_link (old_dentry, dir, new_dentry);
+   return security_ops->inode_link (old_dentry, old_mnt, dir,
+new_dentry, new_mnt);
 }
 
 static inline int security_inode_unlink (struct inode *dir,
@@ -2359,8 +2365,10 @@ static inline int security_inode_create 
 }
 
 static inline int security_inode_link (struct dentry *old_dentry,
+  struct vfsmount *old_mnt,
   struct inode *dir,
-  struct dentry *new_dentry)
+  struct dentry *new_dentry,
+  struct vfsmount *new_mnt)
 {
return 0;
 }
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -270,8 +270,10 @@ static int dummy_inode_create (struct in
return 0;
 }
 
-static int dummy_inode_link (struct dentry *old_dentry, struct inode *inode,
-struct dentry *new_dentry)
+static int dummy_inode_link (struct dentry *old_dentry,
+struct vfsmount *old_mnt, struct inode *inode,
+struct dentry *new_dentry,
+struct vfsmount *new_mnt)
 {
return 0;
 }
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2182,11 +2182,16 @@ static int selinux_inode_create(struct i
return may_create(dir, dentry, SECCLASS_FILE);
 }
 
-static int selinux_inode_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+static int selinux_inode_link(struct dentry *old_dentry,
+ struct vfsmount *old_mnt,
+ struct inode *dir,
+ struct dentry *new_dentry,
+ struct vfsmount *new_mnt)
 {
int rc;
 
-   rc = secondary_ops->inode_link(old_dentry,dir,new_dentry);
+   rc = secondary_ops->inode_link(old_dentry, old_mnt, dir, new_dentry,
+  new_mnt);
if (rc)

[AppArmor 43/45] Switch to vfs_permission() in do_path_lookup()

2007-05-14 Thread jjohansen

Switch from file_permission() to vfs_permission() in do_path_lookup():
this avoids calling permission() with a NULL nameidata here.

Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

---
 fs/namei.c |   13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1130,25 +1130,24 @@ static int fastcall do_path_lookup(int d
nd->dentry = dget(fs->pwd);
read_unlock(&fs->lock);
} else {
-   struct dentry *dentry;
-
file = fget_light(dfd, &fput_needed);
retval = -EBADF;
if (!file)
goto out_fail;
 
-   dentry = file->f_path.dentry;
+   nd->dentry = file->f_path.dentry;
+   nd->mnt = file->f_path.mnt;
 
retval = -ENOTDIR;
-   if (!S_ISDIR(dentry->d_inode->i_mode))
+   if (!S_ISDIR(nd->dentry->d_inode->i_mode))
goto fput_fail;
 
-   retval = file_permission(file, MAY_EXEC);
+   retval = vfs_permission(nd, MAY_EXEC);
if (retval)
goto fput_fail;
 
-   nd->mnt = mntget(file->f_path.mnt);
-   nd->dentry = dget(dentry);
+   mntget(nd->mnt);
+   dget(nd->dentry);
 
fput_light(file, fput_needed);
}

[AppArmor 08/45] Pass struct vfsmount to the inode_mknod LSM hook

2007-05-14 Thread jjohansen

This is needed for computing pathnames in the AppArmor LSM.

Signed-off-by: Tony Jones [EMAIL PROTECTED]
Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
Signed-off-by: John Johansen [EMAIL PROTECTED]

---
 fs/namei.c   |2 +-
 include/linux/security.h |7 +--
 security/dummy.c |2 +-
 security/selinux/hooks.c |5 +++--
 4 files changed, 10 insertions(+), 6 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1869,7 +1869,7 @@ int vfs_mknod(struct inode *dir, struct 
if (!dir->i_op || !dir->i_op->mknod)
return -EPERM;
 
-   error = security_inode_mknod(dir, dentry, mode, dev);
+   error = security_inode_mknod(dir, dentry, mnt, mode, dev);
if (error)
return error;
 
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -323,6 +323,7 @@ struct request_sock;
  * and not this hook.
  * @dir contains the inode structure of parent of the new file.
  * @dentry contains the dentry structure of the new file.
+ * @mnt is the vfsmount corresponding to @dentry (may be NULL).
  * @mode contains the mode of the new file.
  * @dev contains the device number.
  * Return 0 if permission is granted.
@@ -1218,7 +1219,7 @@ struct security_operations {
struct vfsmount *mnt, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
-   int mode, dev_t dev);
+   struct vfsmount *mnt, int mode, dev_t dev);
int (*inode_rename) (struct inode *old_dir, struct dentry *old_dentry,
 struct inode *new_dir, struct dentry *new_dentry);
int (*inode_readlink) (struct dentry *dentry);
@@ -1670,11 +1671,12 @@ static inline int security_inode_rmdir (
 
 static inline int security_inode_mknod (struct inode *dir,
struct dentry *dentry,
+   struct vfsmount *mnt,
int mode, dev_t dev)
 {
if (unlikely (IS_PRIVATE (dir)))
return 0;
-   return security_ops->inode_mknod (dir, dentry, mode, dev);
+   return security_ops->inode_mknod (dir, dentry, mnt, mode, dev);
 }
 
 static inline int security_inode_rename (struct inode *old_dir,
@@ -2388,6 +2390,7 @@ static inline int security_inode_rmdir (
 
 static inline int security_inode_mknod (struct inode *dir,
struct dentry *dentry,
+   struct vfsmount *mnt,
int mode, dev_t dev)
 {
return 0;
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -299,7 +299,7 @@ static int dummy_inode_rmdir (struct ino
 }
 
 static int dummy_inode_mknod (struct inode *inode, struct dentry *dentry,
- int mode, dev_t dev)
+ struct vfsmount *mnt, int mode, dev_t dev)
 {
return 0;
 }
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2218,11 +2218,12 @@ static int selinux_inode_rmdir(struct in
return may_link(dir, dentry, MAY_RMDIR);
 }
 
-static int selinux_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
+static int selinux_inode_mknod(struct inode *dir, struct dentry *dentry,
+  struct vfsmount *mnt, int mode, dev_t dev)
 {
int rc;
 
-   rc = secondary_ops->inode_mknod(dir, dentry, mode, dev);
+   rc = secondary_ops->inode_mknod(dir, dentry, mnt, mode, dev);
if (rc)
return rc;
 

[PATCH] AF_RXRPC: AF_RXRPC depends on IPv4

2007-05-14 Thread David Howells

Add a dependency for CONFIG_AF_RXRPC on CONFIG_INET.  This fixes this error:

net/built-in.o: In function `rxrpc_get_peer':
(.text+0x42824): undefined reference to `ip_route_output_key'

Signed-off-by: David Howells [EMAIL PROTECTED]
---

 net/rxrpc/Kconfig |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/rxrpc/Kconfig b/net/rxrpc/Kconfig
index 91b3d52..e662f1d 100644
--- a/net/rxrpc/Kconfig
+++ b/net/rxrpc/Kconfig
@@ -4,7 +4,7 @@
 
 config AF_RXRPC
tristate "RxRPC session sockets"
-   depends on EXPERIMENTAL
+   depends on INET && EXPERIMENTAL
select KEYS
help
  Say Y or M here to include support for RxRPC session sockets (just

[PATCH] AF_RXRPC: Make call state names available if CONFIG_PROC_FS=n

2007-05-14 Thread David Howells

Make the call state names array available even if CONFIG_PROC_FS is disabled
as it's used in other places (such as debugging statements) too.

Signed-off-by: David Howells [EMAIL PROTECTED]
---

 net/rxrpc/ar-call.c |   19 +++
 net/rxrpc/ar-proc.c |   19 ---
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/net/rxrpc/ar-call.c b/net/rxrpc/ar-call.c
index 4d92d88..3c04b00 100644
--- a/net/rxrpc/ar-call.c
+++ b/net/rxrpc/ar-call.c
@@ -15,6 +15,25 @@
 #include <net/af_rxrpc.h>
 #include "ar-internal.h"
 
+const char *rxrpc_call_states[] = {
+   [RXRPC_CALL_CLIENT_SEND_REQUEST] = "ClSndReq",
+   [RXRPC_CALL_CLIENT_AWAIT_REPLY] = "ClAwtRpl",
+   [RXRPC_CALL_CLIENT_RECV_REPLY] = "ClRcvRpl",
+   [RXRPC_CALL_CLIENT_FINAL_ACK] = "ClFnlACK",
+   [RXRPC_CALL_SERVER_SECURING] = "SvSecure",
+   [RXRPC_CALL_SERVER_ACCEPTING] = "SvAccept",
+   [RXRPC_CALL_SERVER_RECV_REQUEST] = "SvRcvReq",
+   [RXRPC_CALL_SERVER_ACK_REQUEST] = "SvAckReq",
+   [RXRPC_CALL_SERVER_SEND_REPLY] = "SvSndRpl",
+   [RXRPC_CALL_SERVER_AWAIT_ACK] = "SvAwtACK",
+   [RXRPC_CALL_COMPLETE] = "Complete",
+   [RXRPC_CALL_SERVER_BUSY] = "SvBusy  ",
+   [RXRPC_CALL_REMOTELY_ABORTED] = "RmtAbort",
+   [RXRPC_CALL_LOCALLY_ABORTED] = "LocAbort",
+   [RXRPC_CALL_NETWORK_ERROR] = "NetError",
+   [RXRPC_CALL_DEAD] = "Dead",
+};
+
 struct kmem_cache *rxrpc_call_jar;
 LIST_HEAD(rxrpc_calls);
 DEFINE_RWLOCK(rxrpc_call_lock);
diff --git a/net/rxrpc/ar-proc.c b/net/rxrpc/ar-proc.c
index 58f4b4e..1c0be0e 100644
--- a/net/rxrpc/ar-proc.c
+++ b/net/rxrpc/ar-proc.c
@@ -25,25 +25,6 @@ static const char *rxrpc_conn_states[] = {
[RXRPC_CONN_NETWORK_ERROR] = "NetError",
 };
 
-const char *rxrpc_call_states[] = {
-   [RXRPC_CALL_CLIENT_SEND_REQUEST] = "ClSndReq",
-   [RXRPC_CALL_CLIENT_AWAIT_REPLY] = "ClAwtRpl",
-   [RXRPC_CALL_CLIENT_RECV_REPLY] = "ClRcvRpl",
-   [RXRPC_CALL_CLIENT_FINAL_ACK] = "ClFnlACK",
-   [RXRPC_CALL_SERVER_SECURING] = "SvSecure",
-   [RXRPC_CALL_SERVER_ACCEPTING] = "SvAccept",
-   [RXRPC_CALL_SERVER_RECV_REQUEST] = "SvRcvReq",
-   [RXRPC_CALL_SERVER_ACK_REQUEST] = "SvAckReq",
-   [RXRPC_CALL_SERVER_SEND_REPLY] = "SvSndRpl",
-   [RXRPC_CALL_SERVER_AWAIT_ACK] = "SvAwtACK",
-   [RXRPC_CALL_COMPLETE] = "Complete",
-   [RXRPC_CALL_SERVER_BUSY] = "SvBusy  ",
-   [RXRPC_CALL_REMOTELY_ABORTED] = "RmtAbort",
-   [RXRPC_CALL_LOCALLY_ABORTED] = "LocAbort",
-   [RXRPC_CALL_NETWORK_ERROR] = "NetError",
-   [RXRPC_CALL_DEAD] = "Dead",
-};
-
 /*
  * generate a list of extant and dead calls in /proc/net/rxrpc_calls
  */

[PATCH 0/5][TAKE2] fallocate system call

2007-05-14 Thread Amit K. Arora

This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).

Description:
---
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to reduce fragmentation to a certain
degree and thus get faster access. With preallocation, applications
also get a guarantee of space for particular file(s), even if the
file system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Although it has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, to fall
back to the current implementation of writing zeroes to the new blocks.
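
The fallback order described above can be sketched in user space as
follows. This is not glibc's code: try_kernel_fallocate() is a
hypothetical stub standing in for the proposed system call (here it
always reports -ENOSYS), and the fallback writes one zero byte per
block-sized step of the range, skipping bytes inside the file that
already hold non-zero data, so the file system has to allocate each block.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the proposed fallocate() system call.
 * It always reports "not implemented" so that the fallback path runs.
 */
static int try_kernel_fallocate(int fd, off_t offset, off_t len)
{
	(void)fd;
	(void)offset;
	(void)len;
	return -ENOSYS;
}

/*
 * posix_fallocate()-style helper (sketch): ask the kernel first and, if
 * preallocation is unsupported, write one zero byte per block-sized step
 * of [offset, offset + len). Returns 0 or a positive errno value.
 * Overflow of offset + len is not checked here.
 */
static int sketch_posix_fallocate(int fd, off_t offset, off_t len)
{
	struct stat st;
	off_t blk, pos, end;
	int ret;

	if (offset < 0 || len <= 0)
		return EINVAL;

	ret = try_kernel_fallocate(fd, offset, len);
	if (ret == 0)
		return 0;
	if (ret != -ENOSYS && ret != -EOPNOTSUPP)
		return -ret;

	if (fstat(fd, &st) != 0)
		return errno;
	blk = st.st_blksize > 0 ? st.st_blksize : 512;
	end = offset + len;

	/* Step through the range so that the last byte written is end - 1. */
	for (pos = offset + (len - 1) % blk; pos < end; pos += blk) {
		/* Leave existing non-zero data inside the file untouched. */
		if (pos < st.st_size) {
			unsigned char c;

			if (pread(fd, &c, 1, pos) == 1 && c != 0)
				continue;
		}
		if (pwrite(fd, "", 1, pos) != 1)
			return errno;
	}
	return 0;
}
```

Because the last byte written is always at offset + len - 1, the file
size grows to cover the whole requested range when it extends past EOF.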

Interface:
----------
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime/mtime in the inode of the corresponding file, marking a
successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This may also change the file size
and the ctime/mtime.
* New modes might be added in the future. One such new mode, which is
  already under discussion, is FA_PREALLOCATE, which will
  preallocate space without changing the file size or [cm]time.
  Since the semantics of this new mode are not yet clear or agreed upon,
  this patchset does not implement it.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
  

sys_fallocate() on s390:
---
The s390 ABI has a problem with implementing sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
However, this seems to be the best solution so far.

Known Problem:
--------------
mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite(), and a
generic block_page_mkwrite() implementation has already been posted, we
can implement this later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
------
1 Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2 A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3 Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
----------
Each post will have an individual changelog for the particular patch.
The patches follow in separate posts:

Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora
Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-14 Thread Jan Kara

> On Mon, 7 May 2007 05:37:54 -0600
> 
> Does the proposed implementation handle quotas correctly, btw?  Has that
> been tested?
  It seems to handle quotas fine - the block allocation itself does not
differ from the usual case, just the extents in the tree are marked as
uninitialized...
  The only question is whether DQUOT_PREALLOC_BLOCK() shouldn't be
called instead of DQUOT_ALLOC_BLOCK(). Then fallocate() wouldn't be able to
allocate anything after the softlimit has been reached, which makes some
sense, but the current behavior is probably less surprising.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Re: [AppArmor 00/45] AppArmor security module overview

2007-05-14 Thread John Johansen

and with the actual introductory text this time

This post contains patches to include the AppArmor application security
framework, with a request for inclusion.  It addresses almost all of
the feedback received on the previous post.  A second follow-up
posting will address passing NULL nameidata.

Changes since previous post:

 - Refactor d_path() patches: Separate changes to d_path(), getcwd(),
   and /proc/mounts from __d_path() cleanups.

 - Switch from file_permission() to vfs_permission() in do_path_lookup()
   and sys_fchdir(): this avoids calling permission() with a NULL nameidata
   there.

 - Fix file_permission() to not use NULL nameidata for its remaining users:
   it makes little sense to replace file_permission() with vfs_permission()
   everywhere.

 - Remove special casing for access to /proc/self/attr/current by adding
   rules to policy user side.

 - Remove redundant functions in lsm.c by calling the cap_* functions
   directly from the security operations vector.

 - Disallow ptracing a process in a different namespace.

 - Use beX_to_cpu instead of ntoX in the dfa unpack code.

 - Fix potential overflow in unpack bounds checking.

 - Limit profile recursion depth to 1 level.

 - Factor out sysctl pathname code from selinux to add generic
   sysctl_pathname() function in kernel/sysctl.c. Replace special casing of
   sysctl write with finer grained mediation using sysctl_pathname() function
   to provide pathname for sysctl mediation.

 - Escape special characters in pathnames when used in audit messages.

 - Remove use of task->comm from audit messages.  The use of task->comm was
   incorrect and only used as a human readable hint.

 - Some structural cleanups of AppArmor's audit code paths.

 - Set LOOKUP_CONTINUE flag when checking parent permissions.  This allows
   permission functions to distinguish between parent and leaf checks. Check for
   (LOOKUP_PARENT | LOOKUP_CONTINUE) in the inode_permission apparmor hook.

 - Drop rejection of CLONE_NEWNS since the kernel already requires
   CAP_SYS_ADMIN.

 - Add a missing dput() in apparmorfs_detry_refcount().

 - Remove kernel-doc style comment headers from comments that are not
   in kernel-doc format.

 - Use lock subtyping to address lockdep reporting a possible irq lock
   inversion.


The patch series consists of five areas:

 (1) Pass struct vfsmount through to LSM hooks.

 (2) Fixes and improvements to __d_path():

 (a) make it unambiguous and exclude unreachable paths from
 /proc/mounts,

 (b) make its result consistent in the face of remounts,

 (c) introduce d_namespace_path(), a variant of d_path that goes up
 to the namespace root instead of the chroot.

 (d) the behavior of d_path() and getcwd() remains unchanged, and
 there is no hiding of unreachable paths in /proc/mounts.  The
 patches addressing these have been separated from the AppArmor
 submission and will be introduced at a later date.
 
 Part (a) has been in the -mm tree for a while; this series includes
 an updated copy of the -mm patch. Parts (b) and (c) shouldn't be too
 controversial.

 (3) Be able to distinguish file descriptor access from access by name
 in LSM hooks.

 Applications expect different behavior from file descriptor
 accesses and accesses by name in some cases. We need to pass this
 information down the LSM hooks to allow AppArmor to tell which is
 which.

 (4) Convert the selinux sysctl pathname computation code into a standalone
 function.

 (5) The AppArmor LSM itself.

 (See below.)

A tarball of the kernel patches, base user-space utilities, example
profiles, and technical documentation (including a walk-through) is
available at:

  http://forgeftp.novell.com//apparmor/LKML_Submission-May_07/

Explaining the AppArmor design in detail would take far too much
space here, so let me refer you to the technical documentation for that.
Included is a low-level walk-through of the system and basic tools, and
some examples.  The manual pages included in the apparmor-parser package
are worth a read as well.


Re: [RFD Patch 0/4] AppArmor - Don't pass NULL nameidata to vfs_create/lookup/permission IOPs

2007-05-14 Thread John Johansen

sigh, and with the introductory text attached

This post is a request for discussion on creating a second, minimal
nameidata struct to eliminate the conditional passing of vfsmounts
to the LSM.

It contains a series of patches that apply on top of the AppArmor
patch series.  A previous version of these patches was posted by
Andreas Gruenbacher on April 16, and the issues raised then have been
addressed.

To remove the conditional passing of vfsmounts to the LSM, a nameidata
struct can be instantiated in the nfsd and mqueue filesystems.  This,
however, results in useless information being passed down, as not
all fields in the nameidata struct will be meaningful.  The nameidata
struct is therefore split, creating struct nameidata2, which contains only
the fields that will carry meaningful information.

The creation of the nameidata2 struct raises the possibility of
replacing the current dentry, vfsmount argument pairs in the
vfs and lsm patches with a single nameidata2 argument although these
patches do not currently do this.

A tarball of these patches and the AppArmor kernel patches is
available at:

  http://forgeftp.novell.com//apparmor/LKML_Submission-May_07/


[PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc

2007-05-14 Thread Amit K. Arora

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
----------
Following changes were made to the previous version:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len=0 (with the new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len = 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE.
 4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)

Here is the new patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 arch/i386/kernel/syscall_table.S |1 
 arch/powerpc/kernel/sys_ppc32.c  |7 +++
 arch/x86_64/kernel/functionlist  |1 
 fs/open.c|   89 +++
 include/asm-i386/unistd.h|3 -
 include/asm-powerpc/systbl.h |1 
 include/asm-powerpc/unistd.h |3 -
 include/asm-x86_64/unistd.h  |4 +
 include/linux/fs.h   |   13 +
 include/linux/syscalls.h |1 
 10 files changed, 120 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_fallocate /* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -351,6 +351,95 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *   FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On SUCCESS a value of zero is returned.
+ * error   : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * TBD Generic fallocate to be added for file systems that do not
+ *  support fallocate.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (offset < 0 || len <= 0)
+   goto out;
+
+   /* Return error if mode is not supported */
+   ret = -EOPNOTSUPP;
+   if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+   goto out;
+
+   ret = -EBADF;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   if (!(file->f_mode & FMODE_WRITE))
+   goto out_fput;
+
+   inode = file->f_path.dentry->d_inode;
+
+   ret = -ESPIPE;
+   if (S_ISFIFO(inode->i_mode))
+   goto out_fput;
+
+   ret = -ENODEV;
+   /*
+* Let individual file system decide if it supports preallocation
+* for directories or not.
+*/
+   if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+   goto out_fput;
+
+   ret = -EFBIG;
+   /* Check for wrap through zero too */
+   if (((offset +

[PATCH 2/5][TAKE2] fallocate() on s390

2007-05-14 Thread Amit K. Arora

This is the patch suggested by Martin Schwidefsky. Here are the comments
and patch from him.

-
From: Martin Schwidefsky [EMAIL PROTECTED]

This patch implements support for the fallocate system call on the s390(x)
platform. A wrapper is added to work around the problem the s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky [EMAIL PROTECTED]
---

 arch/s390/kernel/compat_wrapper.S |   10 ++
 arch/s390/kernel/sys_s390.c   |   29 +
 arch/s390/kernel/syscalls.S   |1 +
 include/asm-s390/unistd.h |3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
===
--- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
llgtr   %r2,%r2 # char *
llgtr   %r3,%r3 # struct compat_timeval *
jg  compat_sys_utimes
+
+   .globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+   lgfr%r2,%r2 # int
+   lgfr%r3,%r3 # int
+   sllg%r4,%r4,32  # get high word of 64bit loff_t
+   lr  %r4,%r5 # get low word of 64bit loff_t
+   sllg%r5,%r6,32  # get high word of 64bit loff_t
+   l   %r5,164(%r15)   # get low word of 64bit loff_t
+   jg  sys_fallocate
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL  /* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
===
--- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.21/arch/s390/kernel/sys_s390.c
@@ -286,3 +286,32 @@ int kernel_execve(const char *filename, 
  d (__arg3) : memory);
return __svcres;
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument len is split into the upper and lower 32 bits. The
+ * system call wrapper in user space loads the value into %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+  u32 len_high, u32 len_low)
+{
+   union {
+   u64 len;
+   struct {
+   u32 high;
+   u32 low;
+   };
+   } cv;
+   cv.high = len_high;
+   cv.low = len_low;
+   return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.21/include/asm-s390/unistd.h
===
--- linux-2.6.21.orig/include/asm-s390/unistd.h
+++ linux-2.6.21/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu311
 #define __NR_epoll_pwait   312
 #define __NR_utimes313
+#define __NR_fallocate 314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some

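The union in s390_fallocate() relies on s390 being big-endian, so that
the high member overlays the upper 32 bits of len. A portable sketch of
the same reassembly uses shifts instead; join_len() and split_len() are
hypothetical helper names, and split_len() only models, conceptually,
how a 64-bit len ends up as two 32-bit register values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Rebuild a 64-bit value from its upper and lower 32-bit halves.
 * Shifting gives the same result on any byte order, unlike a union
 * whose struct layout assumes a big-endian machine such as s390.
 */
static uint64_t join_len(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}

/* The inverse: split a 64-bit len into two 32-bit halves. */
static void split_len(uint64_t len, uint32_t *high, uint32_t *low)
{
	assert(high != NULL && low != NULL);
	*high = (uint32_t)(len >> 32);
	*low = (uint32_t)len;
}
```

On s390 the union and the shift produce the same value; the shift form
simply does not depend on byte order.
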
[PATCH 3/5][TAKE2] ext4: Extent overlap bugfix

2007-05-14 Thread Amit K. Arora

This patch adds a check for overlap of extents and cuts short the
new extent to be inserted, if there is a chance of overlap.
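
The trimming rule can be sketched in isolation with plain integers. This
is a simplified model, not the patch's code: the caller is assumed to
have already found next_b, the first block of the next allocated extent
at or after b1 (SKETCH_EXT_MAX_BLOCK if there is none, in which case
nothing is trimmed, as in the patch), and 0xffffffff is assumed as the
maximum logical block number.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: the largest 32-bit logical block number. */
#define SKETCH_EXT_MAX_BLOCK 0xffffffffu

/*
 * Trim the new extent [b1, b1 + *len1) so that it neither wraps through
 * zero nor runs into the next allocated extent starting at next_b.
 * Returns 1 if *len1 was shortened, 0 otherwise.
 */
static int sketch_check_overlap(uint32_t b1, uint32_t *len1, uint32_t next_b)
{
	int ret = 0;

	assert(next_b >= b1);

	/* No later extent: nothing to trim (mirrors the patch's early exit). */
	if (next_b == SKETCH_EXT_MAX_BLOCK)
		return 0;

	/* Wrap through zero: the unsigned 32-bit sum overflowed. */
	if ((uint32_t)(b1 + *len1) < b1) {
		*len1 = SKETCH_EXT_MAX_BLOCK - b1;
		ret = 1;
	}

	/* Overlap: cut the new extent short at the next allocated block. */
	if (b1 + *len1 > next_b) {
		*len1 = next_b - b1;
		ret = 1;
	}
	return ret;
}
```

For example, a 10-block extent starting at block 100 with the next
extent at block 105 is cut down to 5 blocks.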

Changelog:
----------
As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |   60 ++--
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1129,6 +1129,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the newext extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+				    struct ext4_extent *newext,
+				    struct ext4_ext_path *path)
+{
+	unsigned long b1, b2;
+	unsigned int depth, len1;
+	unsigned int ret = 0;
+
+	b1 = le32_to_cpu(newext->ee_block);
+	len1 = le16_to_cpu(newext->ee_len);
+	depth = ext_depth(inode);
+	if (!path[depth].p_ext)
+		goto out;
+	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+	/*
+	 * get the next allocated block if the extent in the path
+	 * is before the requested block(s) 
+	 */
+	if (b2 < b1) {
+		b2 = ext4_ext_next_allocated_block(path);
+		if (b2 == EXT_MAX_BLOCK)
+			goto out;
+	}
+
+	/* check for wrap through zero */
+	if (b1 + len1 < b1) {
+		len1 = EXT_MAX_BLOCK - b1;
+		newext->ee_len = cpu_to_le16(len1);
+		ret = 1;
+	}
+
+	/* check for overlap */
+	if (b1 + len1 > b2) {
+		newext->ee_len = cpu_to_le16(b2 - b1);
+		ret = 1;
+	}
+out:
+	return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2032,7 +2081,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
-	allocated = max_blocks;
+
+	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+	newex.ee_block = cpu_to_le32(iblock);
+	newex.ee_len = cpu_to_le16(max_blocks);
+	err = ext4_ext_check_overlap(inode, &newex, path);
+	if (err)
+		allocated = le16_to_cpu(newex.ee_len);
+	else
+		allocated = max_blocks;
 	newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
 	if (!newblock)
 		goto out2;
@@ -2040,7 +2097,6 @@ int ext4_ext_get_blocks(handle_t *handle
 			goal, newblock, allocated);
 
 	/* try to insert new extent into found leaf and return */
-	newex.ee_block = cpu_to_le32(iblock);
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);

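The logic of ext4_ext_check_overlap() can be exercised outside the kernel. The sketch below is a simplified user-space model, not the patch itself, and several of its names are hypothetical:

- struct demo_extent models an extent.
- DEMO_MAX_BLOCK is assumed to mirror EXT_MAX_BLOCK as 0xffffffff.
- path_start stands in for the extent found in the path.
- next_alloc stands in for the result of ext4_ext_next_allocated_block().

Block numbers are held in 32-bit unsigned integers so that the wrap-through-zero test can actually trigger.

```c
#include <stdint.h>

#define DEMO_MAX_BLOCK 0xffffffffu /* assumed stand-in for EXT_MAX_BLOCK */

struct demo_extent {
	uint32_t block; /* first logical block */
	uint16_t len;   /* number of blocks */
};

/*
 * Trim 'newext' so that it neither wraps past the end of the 32-bit block
 * space nor runs into the following extent. Returns 1 if the length was
 * changed, 0 otherwise, following the same steps as the patch.
 */
static int demo_check_overlap(struct demo_extent *newext,
			      uint32_t path_start, uint32_t next_alloc)
{
	uint32_t b1 = newext->block;
	uint32_t len1 = newext->len;
	uint32_t b2 = path_start;
	int ret = 0;

	/* The extent found in the path lies before us: use the next one. */
	if (b2 < b1) {
		b2 = next_alloc;
		if (b2 == DEMO_MAX_BLOCK)
			return 0;
	}

	/* Wrap through zero: clamp at the end of the block space. */
	if ((uint32_t)(b1 + len1) < b1) {
		len1 = DEMO_MAX_BLOCK - b1;
		newext->len = (uint16_t)len1;
		ret = 1;
	}

	/* Overlap with the following extent: stop just before it. */
	if ((uint32_t)(b1 + len1) > b2) {
		newext->len = (uint16_t)(b2 - b1);
		ret = 1;
	}
	return ret;
}
```

For example, a request for blocks 100..149 with the next allocated extent starting at block 120 is cut down to 20 blocks.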
[PATCH 4/5][TAKE2] ext4: fallocate support in ext4

2007-05-14 Thread Amit K. Arora

This patch implements the ->fallocate() inode operation in ext4. With this
patch, users of ext4 file systems will be able to use the fallocate() system
call for persistent preallocation.

The current implementation supports preallocation only for regular files
(directories are not supported yet) that use extent maps. Block-mapped
files are not supported.

Only FA_ALLOCATE mode is supported for now. Supporting FA_DEALLOCATE
mode is a to-do item.

Changelog:
-
Here are the changes from the previous post:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
functions.

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  241 +---
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |8 +
 include/linux/ext4_fs_extents.h |   12 +
 4 files changed, 221 insertions(+), 41 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 			le32_to_cpu(path->p_ext->ee_block),
 			ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
 			ext_debug("move %d:%llu:%d in new leaf %llu\n",
 					le32_to_cpu(path[depth].p_ext->ee_block),
 					ext_pblock(path[depth].p_ext),
-					le16_to_cpu(path[depth].p_ext->ee_len),
+					ext4_ext_get_actual_len(path[depth].p_ext),
 					newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
 
 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1192,8 +1204,9 @@ int

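The hunks above stop reading ee_len directly and go through ext4_ext_get_actual_len(). In addition, ext4_can_extents_be_merged() now refuses to merge an initialized extent with an uninitialized one. The helper definitions are in a part of the patch not shown here. The sketch therefore only assumes an encoding consistent with the comment about "the top bit of ee_len": bit 15 flags an uninitialized extent and the low 15 bits hold the length. All demo_* names are hypothetical, and this is not necessarily how the patch defines its helpers.

```c
#include <stdint.h>

/* Assumed encoding for this sketch only. */
#define DEMO_UNINIT_FLAG 0x8000u /* top bit: extent is uninitialized */
#define DEMO_MAX_LEN     0x7fffu /* low 15 bits: block count */

struct demo_extent {
	uint32_t lblk;    /* first logical block */
	uint64_t pblk;    /* first physical block */
	uint16_t raw_len; /* length including the flag bit */
};

static int demo_is_uninit(const struct demo_extent *ex)
{
	return (ex->raw_len & DEMO_UNINIT_FLAG) != 0;
}

static unsigned int demo_actual_len(const struct demo_extent *ex)
{
	return ex->raw_len & DEMO_MAX_LEN;
}

static void demo_mark_uninit(struct demo_extent *ex)
{
	ex->raw_len |= DEMO_UNINIT_FLAG;
}

/* Mirrors the merge rules in the hunk: same initialization state,
 * logically and physically contiguous, combined length within limit. */
static int demo_can_merge(const struct demo_extent *a,
			  const struct demo_extent *b)
{
	unsigned int la = demo_actual_len(a), lb = demo_actual_len(b);

	if (demo_is_uninit(a) != demo_is_uninit(b))
		return 0;
	if (a->lblk + la != b->lblk)
		return 0;
	if (la + lb > DEMO_MAX_LEN)
		return 0;
	return a->pblk + la == b->pblk;
}
```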
[PATCH 5/5][TAKE2] ext4: write support for preallocated blocks

2007-05-14 Thread Amit K. Arora

This patch adds write support for the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting an extent into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.
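The three outcomes are no split, a split in two, or a split in three. Each comes down to interval arithmetic on the uninitialized extent and the range being written. The minimal sketch below assumes that the write lies entirely inside the uninitialized extent. Its names (struct demo_piece, demo_split_for_write) are hypothetical and do not come from the patch.

```c
#include <stdint.h>

struct demo_piece {
	uint32_t start;  /* first logical block */
	uint32_t len;    /* number of blocks */
	int initialized; /* 1 for the written part, 0 otherwise */
};

/*
 * Split the uninitialized extent [ex_start, ex_start + ex_len) around a
 * write to [w_start, w_start + w_len), assumed to lie inside it. Fills
 * 'out' (room for 3 pieces) and returns the number of pieces:
 * 1 (whole extent written), 2 (write at one end), 3 (write in the middle).
 */
static int demo_split_for_write(uint32_t ex_start, uint32_t ex_len,
				uint32_t w_start, uint32_t w_len,
				struct demo_piece out[3])
{
	uint32_t ex_end = ex_start + ex_len;
	uint32_t w_end = w_start + w_len;
	int n = 0;

	if (w_start > ex_start) { /* uninitialized part before the write */
		out[n].start = ex_start;
		out[n].len = w_start - ex_start;
		out[n].initialized = 0;
		n++;
	}
	out[n].start = w_start; /* the part being written becomes initialized */
	out[n].len = w_len;
	out[n].initialized = 1;
	n++;
	if (w_end < ex_end) { /* uninitialized part after the write */
		out[n].start = w_end;
		out[n].len = ex_end - w_end;
		out[n].initialized = 0;
		n++;
	}
	return n;
}
```

The real conversion must also update the extent tree and try to merge the resulting pieces with their neighbours, which this model leaves out.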

Changelog:
-
 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) if((a=foo())).. was broken into a=foo(); if(a)..
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  234 +++-
 include/linux/ext4_fs_extents.h |3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1141,6 +1141,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+			  struct ext4_ext_path *path,
+			  struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh))
+	{
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the newext extent overlaps with an
  * existing extent.
  *
@@ -1328,25 +1376,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2012,15 +2042,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+

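ext4_ext_try_to_merge() walks rightwards through the leaf. It folds each mergeable neighbour into ex and closes the gap with memmove(). The sketch below performs the same array manipulation on a plain in-memory array. The types and names are hypothetical, and the uninitialized-flag and maximum-length checks are left out for brevity.

```c
#include <stdint.h>
#include <string.h>

struct demo_extent {
	uint32_t lblk; /* first logical block */
	uint64_t pblk; /* first physical block */
	uint16_t len;  /* number of blocks */
};

/* Logically and physically contiguous neighbours can be merged. */
static int demo_contiguous(const struct demo_extent *a,
			   const struct demo_extent *b)
{
	return a->lblk + a->len == b->lblk && a->pblk + a->len == b->pblk;
}

/*
 * Merge ext[i] with its right-hand neighbours while they are contiguous.
 * '*count' is the number of valid entries and drops by one per merge.
 * Returns 1 if anything was merged. Passing i - 1 merges leftwards,
 * as the comment in the patch suggests.
 */
static int demo_try_to_merge(struct demo_extent *ext, unsigned int *count,
			     unsigned int i)
{
	int merged = 0;

	while (i + 1 < *count && demo_contiguous(&ext[i], &ext[i + 1])) {
		ext[i].len += ext[i + 1].len;
		/* shift the entries after i + 1 one slot to the left */
		if (i + 2 < *count)
			memmove(&ext[i + 1], &ext[i + 2],
				(*count - i - 2) * sizeof(ext[0]));
		(*count)--;
		merged = 1;
	}
	return merged;
}
```

The condition i + 2 < *count corresponds to ex + 1 < EXT_LAST_EXTENT(eh) in the patch, and the memmove() length corresponds to EXT_LAST_EXTENT(eh) - ex - 1 entries.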
Re: [RFC][PATCH 14/14] tmpfs whiteout support

2007-05-14 Thread Hugh Dickins

On Mon, 14 May 2007, Bharata B Rao wrote:
> From: Jan Blunck [EMAIL PROTECTED]
> Subject: tmpfs whiteout support
>
> Introduce whiteout support to tmpfs.
>
> Signed-off-by: Jan Blunck [EMAIL PROTECTED]
> Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
> ---
>  mm/shmem.c |    9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletion(-)
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -74,7 +74,7 @@
>  #define LATENCY_LIMIT 64
>  
>  /* Pretend that each entry is of this size in directory's i_size */
> -#define BOGO_DIRENT_SIZE 20
> +#define BOGO_DIRENT_SIZE 1

Why would that change be needed for whiteout support?

Hugh

Re: [PATCH 2/2] file capabilities: accomodate 32 bit capabilities

2007-05-14 Thread Serge E. Hallyn

Quoting Suparna Bhattacharya ([EMAIL PROTECTED]):
> On Thu, May 10, 2007 at 01:01:27PM -0700, Andreas Dilger wrote:
> > On May 08, 2007  16:49 -0500, Serge E. Hallyn wrote:
> > > Quoting Andreas Dilger ([EMAIL PROTECTED]):
> > > > One of the important use cases I can see today is the ability to
> > > > split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more fine
> > > > grained attributes.
> > >
> > > Sounds plausible, though it suffers from both making capabilities far
> > > more cumbersome (i.e. finding the right capability for what you wanted
> > > to do) and backward compatibility.  Perhaps at that point we should
> > > introduce security.capabilityv2 xattrs.  A binary can then carry
> > > security.capability=CAP_SYS_ADMIN=p, and
> > > security.capabilityv2=cap_may_clone_mntns=p.
> >
> > Well, the overhead of each EA is non-trivial (16 bytes/EA) for storing
> > 12 bytes worth of data, so it is probably just better to keep extending
> > the original capability fields as was in the proposal.
> >
> > > > What we definitely do NOT want to happen is an application that needs
> > > > privileged access (e.g. e2fsck, mount) to stop running because the
> > > > new capabilities _would_ have been granted by the new kernel and are
> > > > not by the old kernel and STRICTXATTR is used.
> > > >
> > > > To me it would seem that having extra capabilities on an old kernel
> > > > is relatively harmless if the old kernel doesn't know what they are.
> > > > It's like having a key to a door that you don't know where it is.
> > >
> > > If we ditch the STRICTXATTR option do the semantics seem sane to you?
> >
> > Seems reasonable.
>
> It would simplify the code as well, which is good.
>
> This does mean no sanity checking of fcaps, am not sure if that matters,
> I'm guessing it should be similar to the case for other security attributes.

which is to trust the xattr...

So here is a new consolidated patch without the STRICTXATTR config
option.

-serge

From: Serge E. Hallyn [EMAIL PROTECTED]
Subject: [PATCH] Implement file posix capabilities

Implement file posix capabilities.  This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.

This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php.  For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.

Changelog:
May 14:
Remove STRICTXATTR support which could make newer binaries
unusable on older kernels, and combine the two patches
into one.

[recent]:
1. Enable the CONFIG_SECURITY_FS_CAPABILITIES option
when CONFIG_SECURITY=n.
2. Rename CONFIG_SECURITY_FS_CAPABILITIES to
CONFIG_SECURITY_FILE_CAPABILITIES
3. To accommodate 64-bit caps, specify that capabilities are
stored as
u32 version; u32 eff0; u32 perm0; u32 inh0;
u32 eff1; u32 perm1; u32 inh1; (etc)

Nov 27:
Incorporate fixes from Andrew Morton
(security-introduce-file-caps-tweaks and
security-introduce-file-caps-warning-fix)
Fix Kconfig dependency.
Fix change signaling behavior when file caps are not compiled in.

Nov 13:
Integrate comments from Alexey: Remove CONFIG_ ifdef from
capability.h, and use %zd for printing a size_t.

Nov 13:
Fix endianness warnings by sparse as suggested by Alexey
Dobriyan.

Nov 09:
Address warnings of unused variables at cap_bprm_set_security
when file capabilities are disabled, and simultaneously clean
up the code a little, by pulling the new code into a helper
function.

Nov 08:
For pointers to required userspace tools and how to use
them, see http://www.friedhoff.org/fscaps.html.

Nov 07:
Fix the calculation of the highest bit checked in
check_cap_sanity().

Nov 07:
Allow file caps to be enabled without CONFIG_SECURITY, since
capabilities are the default.
Hook cap_task_setscheduler when !CONFIG_SECURITY.
Move capable(TASK_KILL) to end of cap_task_kill to reduce
audit messages.

Nov 05:
Add secondary calls in selinux/hooks.c to task_setioprio and
task_setscheduler so that selinux and capabilities with file
cap support can be stacked.

Sep 05:
As Seth Arnold points out, uid checks are out of place
for capability code.

Sep 01:
Define task_setscheduler, task_setioprio, cap_task_kill, and
task_setnice to make sure a user cannot affect a process in which
they called a program with some fscaps.

One remaining question is the note under task_setscheduler: are we
ok with CAP_SYS_NICE being sufficient to confine a process to a
cpuset?

It is a semantic change, as without fscaps, attach_task doesn't

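The changelog specifies the xattr layout as a version word followed by one effective, one permitted and one inheritable 32-bit word for each group of 32 capabilities. The sketch below looks up a single capability bit under that layout. The little-endian byte order and all demo_* names are assumptions for illustration and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

enum demo_cap_set { DEMO_EFF = 0, DEMO_PERM = 1, DEMO_INH = 2 };

/* Read a 32-bit little-endian word (the byte order is an assumption). */
static uint32_t demo_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Layout: u32 version; then (eff, perm, inh) for caps 0-31,
 * (eff, perm, inh) for caps 32-63, and so on.
 * Returns 1 if 'cap' is set in 'set', 0 if it is clear, and -1 if the
 * xattr is too short to contain the group that 'cap' belongs to.
 */
static int demo_cap_is_set(const unsigned char *xattr, size_t size,
			   unsigned int cap, enum demo_cap_set set)
{
	size_t group = cap / 32;
	size_t off = 4 + (group * 3 + (size_t)set) * 4;

	if (off + 4 > size)
		return -1;
	return (int)((demo_le32(xattr + off) >> (cap % 32)) & 1);
}
```

A result of -1 means the stored xattr does not cover that capability, for example one written with only the first group of 32.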
Re: [2.6.21] circular locking dependency found in QUOTA OFF

2007-05-14 Thread Michal Piotrowski


[adding Jan and fsdevel to CC]

Hi Folkert,

On 14/05/07, Folkert van Heusden [EMAIL PROTECTED] wrote:

> Hi,
>
> When I cleanly reboot my pc running 2.6.21 on a P4 with HT and 2GB of ram
> and system on an 1-filesystem IDE disk, I get the following circular
> locking dependency error:
>
> [330961.226405] ===
> [330961.226489] [ INFO: possible circular locking dependency detected ]
> [330961.226531] 2.6.21 #5
> [330961.226569] ---
> [330961.226611] quotaoff/12249 is trying to acquire lock:
> [330961.226652]  (&sb->s_type->i_mutex_key#4){--..}, at: [<c120e2a1>] mutex_lock+0x8/0xa
> [330961.226861]
> [330961.226862] but task is already holding lock:
> [330961.226938]  (&s->s_dquot.dqonoff_mutex){--..}, at: [<c120e2a1>] mutex_lock+0x8/0xa
> [330961.227111]
> [330961.227111] which lock already depends on the new lock.
> [330961.227112]
> [330961.227225]
> [330961.227225] the existing dependency chain (in reverse order) is:
> [330961.227303]
> [330961.227303] -> #1 (&s->s_dquot.dqonoff_mutex){--..}:
> [330961.227473]        [<c1039b02>] check_prev_add+0x15b/0x281
> [330961.227766]        [<c1039cb3>] check_prevs_add+0x8b/0xe8
> [330961.228056]        [<c103b683>] __lock_acquire+0x692/0xb81
> [330961.228353]        [<c103bfda>] lock_acquire+0x62/0x81
> [330961.228643]        [<c120e322>] __mutex_lock_slowpath+0x75/0x28c
> [330961.228934]        [<c120e2a1>] mutex_lock+0x8/0xa
> [330961.229221]        [<c109fbbe>] vfs_quota_on_inode+0xc1/0x25f
> [330961.229513]        [<c109fdd1>] vfs_quota_on+0x75/0x79
> [330961.229803]        [<c10bc92d>] ext3_quota_on+0x95/0xb0
> [330961.230093]        [<c10a1eb2>] do_quotactl+0xc9/0x2dd
> [330961.230384]        [<c10a214a>] sys_quotactl+0x84/0xd6
> [330961.230673]        [<c1003f74>] syscall_call+0x7/0xb
> [330961.230963]        [] 0x
> [330961.231268]
> [330961.231268] -> #0 (&sb->s_type->i_mutex_key#4){--..}:
> [330961.231469]        [<c10399db>] check_prev_add+0x34/0x281
> [330961.231759]        [<c1039cb3>] check_prevs_add+0x8b/0xe8
> [330961.232049]        [<c103b683>] __lock_acquire+0x692/0xb81
> [330961.232344]        [<c103bfda>] lock_acquire+0x62/0x81
> [330961.232632]        [<c120e322>] __mutex_lock_slowpath+0x75/0x28c
> [330961.232923]        [<c120e2a1>] mutex_lock+0x8/0xa
> [330961.233211]        [<c109fa6c>] vfs_quota_off+0x1cf/0x260
> [330961.233500]        [<c10a2088>] do_quotactl+0x29f/0x2dd
> [330961.233792]        [<c10a214a>] sys_quotactl+0x84/0xd6
> [330961.234081]        [<c1003f74>] syscall_call+0x7/0xb
> [330961.234503]        [] 0x
> [330961.234795]
> [330961.234795] other info that might help us debug this:
> [330961.234796]
> [330961.234908] 2 locks held by quotaoff/12249:
> [330961.234947]  #0:  (&type->s_umount_key#15){}, at: [<c1070b5d>] get_super+0x53/0x94
> [330961.235183]  #1:  (&s->s_dquot.dqonoff_mutex){--..}, at: [<c120e2a1>] mutex_lock+0x8/0xa
> [330961.235386]
> [330961.235387] stack backtrace:
> [330961.235462]  [<c1004d53>] show_trace_log_lvl+0x1a/0x30
> [330961.235535]  [<c1004d7b>] show_trace+0x12/0x14
> [330961.235606]  [<c1004e75>] dump_stack+0x16/0x18
> [330961.235679]  [<c1039352>] print_circular_bug_tail+0x6f/0x71
> [330961.235753]  [<c10399db>] check_prev_add+0x34/0x281
> [330961.235825]  [<c1039cb3>] check_prevs_add+0x8b/0xe8
> [330961.235897]  [<c103b683>] __lock_acquire+0x692/0xb81
> [330961.235969]  [<c103bfda>] lock_acquire+0x62/0x81
> [330961.236041]  [<c120e322>] __mutex_lock_slowpath+0x75/0x28c
> [330961.236113]  [<c120e2a1>] mutex_lock+0x8/0xa
> [330961.236185]  [<c109fa6c>] vfs_quota_off+0x1cf/0x260
> [330961.236257]  [<c10a2088>] do_quotactl+0x29f/0x2dd
> [330961.236330]  [<c10a214a>] sys_quotactl+0x84/0xd6
> [330961.236402]  [<c1003f74>] syscall_call+0x7/0xb
> [330961.236473]  ===

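The report appears to show the classic ABBA pattern. Judging by chain #1, quotaon acquired dqonoff_mutex while the quota file's i_mutex was already held, whereas quotaoff holds dqonoff_mutex and then asks for i_mutex. Below is a minimal user-space sketch of the usual remedy, which is for every path to acquire the two locks in one agreed order, written with POSIX threads. lock_a and lock_b are hypothetical stand-ins, and the sketch makes no claim about how the quota code should actually be fixed.

```c
#include <pthread.h>

/* Hypothetical stand-ins: lock_a plays the i_mutex role,
 * lock_b the dqonoff_mutex role. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

/*
 * Every path takes lock_a before lock_b. If one thread held lock_b and
 * waited for lock_a while another held lock_a and waited for lock_b,
 * both could block forever; a single global order rules that out.
 */
static void lock_in_order(void)
{
	pthread_mutex_lock(&lock_a);
	pthread_mutex_lock(&lock_b);
}

static void unlock_in_order(void)
{
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
}

/* Stand-in for both the "quota on" and the "quota off" paths. */
static void *worker(void *arg)
{
	int iterations = *(const int *)arg;

	for (int i = 0; i < iterations; i++) {
		lock_in_order();
		shared_counter++;
		unlock_in_order();
	}
	return NULL;
}

/* Run two workers concurrently and return the final counter value. */
static long run_two_workers(int iterations)
{
	pthread_t on_path, off_path;

	shared_counter = 0;
	pthread_create(&on_path, NULL, worker, &iterations);
	pthread_create(&off_path, NULL, worker, &iterations);
	pthread_join(on_path, NULL);
	pthread_join(off_path, NULL);
	return shared_counter;
}
```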


Is this a 2.6.21 regression?

Regards,
Michal

--
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)

Re: [2.6.21] circular locking dependency found in QUOTA OFF

2007-05-14 Thread Folkert van Heusden

> [adding Jan and fsdevel to CC]
> Hi Folkert,
> When I cleanly reboot my pc running 2.6.21 on a P4 with HT and 2GB of ram
> and system on an 1-filesystem IDE disk, I get the following circular
> locking dependency error:
>
> [330961.226405] ===
> [330961.226489] [ INFO: possible circular locking dependency detected ]
> [330961.226531] 2.6.21 #5
...
> [330961.236402]  [<c1003f74>] syscall_call+0x7/0xb
> [330961.236473]  ===
>
> Is this a 2.6.21 regression?

This is new for 2.6.21, yes.


Folkert van Heusden

-- 
MultiTail is a flexible tool for following logfiles and executing
commands. Filter, colorize, merge, 'diff-view', etc.
http://www.vanheusden.com/multitail/
--
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

Re: [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6

2007-05-14 Thread Dave Jones

On Mon, May 14, 2007 at 04:06:21PM +1000, [EMAIL PROTECTED] wrote:
> This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
> also revert.

changes like this play havoc with git-bisect.  If you must revert stuff
before patching new code in, revert it all in a single diff.

Dave

-- 
http://www.codemonkey.org.uk

Re: [RFC][PATCH 14/14] tmpfs whiteout support

2007-05-14 Thread Jan Blunck

On 5/14/07, Hugh Dickins [EMAIL PROTECTED] wrote:

> On Mon, 14 May 2007, Bharata B Rao wrote:
> > From: Jan Blunck [EMAIL PROTECTED]
> > Subject: tmpfs whiteout support
> >
> > Introduce whiteout support to tmpfs.
> >
> > Signed-off-by: Jan Blunck [EMAIL PROTECTED]
> > Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
> > ---
> >  mm/shmem.c |    9 ++++++++-
> >  1 files changed, 8 insertions(+), 1 deletion(-)
> >
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -74,7 +74,7 @@
> >  #define LATENCY_LIMIT 64
> >
> >  /* Pretend that each entry is of this size in directory's i_size */
> > -#define BOGO_DIRENT_SIZE 20
> > +#define BOGO_DIRENT_SIZE 1
>
> Why would that change be needed for whiteout support?

Good question. It seems that this a survivor of the changes necessary
for union readdir. This isn't necessary for white-outs.

BTW: Why do we claim this to be 20??? Is there any meaning behind this?

Cheers,
Jan

Re: [RFC][PATCH 14/14] tmpfs whiteout support

2007-05-14 Thread Hugh Dickins

On Mon, 14 May 2007, Jan Blunck wrote:
> On 5/14/07, Hugh Dickins [EMAIL PROTECTED] wrote:
> >
> > >  /* Pretend that each entry is of this size in directory's i_size */
> > > -#define BOGO_DIRENT_SIZE 20
> > > +#define BOGO_DIRENT_SIZE 1
> >
> > Why would that change be needed for whiteout support?
>
> Good question. It seems that this a survivor of the changes necessary
> for union readdir.

(I'd be asking the same question in that case, but don't worry about it!)

> This isn't necessary for white-outs.

Phew, thanks, please drop that hunk.

> BTW: Why do we claim this to be 20??? Is there any meaning behind this?

No great meaning, hence BOGO.  I put that in when hpa (IIRC) found
tmpfs directory size 0 didn't suit some apps.  I thought it would be
nice to have a size which indicates the current number of entries
(which your 1 would do), looks plausible (for short filenames),
and easy to make sense of in an ls -l.  Bogus, yes; but I'd
resist changing it after all this time, without very good reason.
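The description above translates into simple arithmetic. Each directory entry adds BOGO_DIRENT_SIZE (20) bytes to the directory's i_size, so dividing the size by 20 gives the entry count. The toy model below is hypothetical in its demo_* names. Its starting size of two entries for "." and ".." is an assumption about tmpfs rather than something stated in this thread.

```c
/* Value of BOGO_DIRENT_SIZE in mm/shmem.c, per this thread. */
#define DEMO_BOGO_DIRENT_SIZE 20

struct demo_dir {
	long i_size; /* bogus directory size as shown by ls -l */
};

/* Assumed: a new directory starts with room for "." and "..". */
static void demo_dir_init(struct demo_dir *d)
{
	d->i_size = 2 * DEMO_BOGO_DIRENT_SIZE;
}

static void demo_dir_add(struct demo_dir *d)
{
	d->i_size += DEMO_BOGO_DIRENT_SIZE;
}

static void demo_dir_remove(struct demo_dir *d)
{
	d->i_size -= DEMO_BOGO_DIRENT_SIZE;
}

static long demo_dir_entries(const struct demo_dir *d)
{
	return d->i_size / DEMO_BOGO_DIRENT_SIZE;
}
```

Under these assumptions, a directory holding three files would show a size of 100 in ls -l.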

Hugh

Re: [PATCH 0/2] file capabilities: Introduction

2007-05-14 Thread Pavel Machek

Hi!

> Serge E. Hallyn [EMAIL PROTECTED] wrote:
>
> > Following are two patches which have been sitting for some time in -mm.
>
> Where some time == nearly six months.
>
> We need help considering, reviewing and testing this code, please.

I did quick scan, and it looks ok. Plus, it means we can finally start
using that old capabilities subsystem... so I think we should do it.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html