[PATCH, RESEND] Teach do_mpage_readpage() about unwritten buffers
Teach do_mpage_readpage() about unwritten extents so we can always map them in get_blocks rather than they are are holes on read. Allows setup_swap_extents() to use preallocated files on XFS filesystems for swap files without ever needing to convert them. Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- fs/mpage.c |5 +++-- fs/xfs/linux-2.6/xfs_aops.c | 13 +++-- 2 files changed, 6 insertions(+), 12 deletions(-) Index: 2.6.x-xfs-new/fs/mpage.c === --- 2.6.x-xfs-new.orig/fs/mpage.c 2007-05-29 16:17:59.0 +1000 +++ 2.6.x-xfs-new/fs/mpage.c2007-06-27 22:39:35.568852270 +1000 @@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc * Map blocks using the result from the previous get_blocks call first. */ nblocks = map_bh->b_size >> blkbits; - if (buffer_mapped(map_bh) && block_in_file > *first_logical_block && + if (buffer_mapped(map_bh) && !buffer_unwritten(map_bh) && + block_in_file > *first_logical_block && block_in_file < (*first_logical_block + nblocks)) { unsigned map_offset = block_in_file - *first_logical_block; unsigned last = nblocks - map_offset; @@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc *first_logical_block = block_in_file; } - if (!buffer_mapped(map_bh)) { + if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) { fully_mapped = 0; if (first_hole == blocks_per_page) first_hole = page_block; Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-06-05 22:14:39.0 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-06-27 22:39:29.545636749 +1000 @@ -1340,16 +1340,9 @@ __xfs_get_blocks( return 0; if (iomap.iomap_bn != IOMAP_DADDR_NULL) { - /* -* For unwritten extents do not report a disk address on -* the read case (treat as if we're reading into a hole). -*/ - if (create || !(iomap.iomap_flags & IOMAP_UNWRITTEN)) { - xfs_map_buffer(bh_result, &iomap, offset, - inode->i_blkbits); - } - if (create && (iomap.iomap_flags & IOMAP_UNWRITTEN)) { - if (direct) + xfs_map_buffer(bh_result, &iomap, offset, inode->i_blkbits); + if (iomap.iomap_flags & IOMAP_UNWRITTEN) { + if (create && direct) bh_result->b_private = inode; set_buffer_unwritten(bh_result); } - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 11, 2007 16:04 -0400, J. Bruce Fields wrote: > A 32-bit i_version could in theory wrap pretty quickly, couldn't it? > That's not a problem in itself--the problem would only arise if two > subsequent client queries of the change attribute happened a multiple of > 2^32 i_version increments apart. > > This is more likely than the previous scenario, but still very unlikely. > I would have guessed that even in situations with a very high rate of > updates and a low rate of client revalidations, the chance of two > revalidations happening exactly 2^32 updates apart would still be no > more than 1 in 2^32. (Could odd characteristics of the workloads (like > updates that tend to happen in power-of-2 groups?) make it any more > likely?) > > I'd be happier if ext4 at least allowed the possibility of 64 bits in > the future. And there's always the chance someone would find a use for > an i_version that was nondecreasing, even if nfs didn't care. This would indeed be the case for ext3 filesystems updated to ext4. They will only be able to store the low 32 bits of the version, which is itself normally enough for NFSv4 because it only uses the inequality check. Having the full 64 bits available eliminates the risk of collisions, and given that the spec mandates a 64-bit version I'm sure someone will take full advantage of it in NFS at some point. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2 of 2] Make XFS use block_page_mkwrite
Implement ->page_mkwrite in XFS. Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_file.c | 16 1 file changed, 16 insertions(+) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_file.c 2007-02-07 23:00:10.0 +1100 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c 2007-02-07 23:15:20.170880823 +1100 @@ -446,6 +446,20 @@ xfs_file_open_exec( } #endif /* HAVE_FOP_OPEN_EXEC */ +/* + * mmap()d file has taken write protection fault and is being made + * writable. We can set the page state up correctly for a writable + * page, which means we can do correct delalloc accounting (ENOSPC + * checking!) and unwritten extent mapping. + */ +STATIC int +xfs_vm_page_mkwrite( + struct vm_area_struct *vma, + struct page *page) +{ + return block_page_mkwrite(vma, page, xfs_get_blocks); +} + const struct file_operations xfs_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -503,12 +517,14 @@ const struct file_operations xfs_dir_fil static struct vm_operations_struct xfs_file_vm_ops = { .nopage = filemap_nopage, .populate = filemap_populate, + .page_mkwrite = xfs_vm_page_mkwrite, }; #ifdef HAVE_DMAPI static struct vm_operations_struct xfs_dmapi_file_vm_ops = { .nopage = xfs_vm_nopage, .populate = filemap_populate, + .page_mkwrite = xfs_vm_page_mkwrite, #ifdef HAVE_VMOP_MPROTECT .mprotect = xfs_vm_mprotect, #endif - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1 of 2] block_page_mkwrite V2
Generic page_mkwrite functionality. Filesystems that make use of the VM ->page_mkwrite() callout will generally use the same core code to implement it. There are several tricky truncate-related issues that we need to deal with here as we cannot take the i_mutex as we normally would for these paths. These issues are not documented anywhere yet so block_page_mkwrite() seems like the best place to start. Version 2: - read inode size only once - more comments explaining implementation restrictions Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- fs/buffer.c | 47 include/linux/buffer_head.h |2 + 2 files changed, 49 insertions(+) Index: 2.6.x-xfs-new/fs/buffer.c === --- 2.6.x-xfs-new.orig/fs/buffer.c 2007-03-17 10:55:32.291414968 +1100 +++ 2.6.x-xfs-new/fs/buffer.c 2007-03-19 08:13:54.519909087 +1100 @@ -2194,6 +2194,52 @@ int generic_commit_write(struct file *fi return 0; } +/* + * block_page_mkwrite() is not allowed to change the file size as it gets + * called from a page fault handler when a page is first dirtied. Hence we must + * be careful to check for EOF conditions here. We set the page up correctly + * for a written page which means we get ENOSPC checking when writing into + * holes and correct delalloc and unwritten extent mapping on filesystems that + * support these features. + * + * We are not allowed to take the i_mutex here so we have to play games to + * protect against truncate races as the page could now be beyond EOF. Because + * vmtruncate() writes the inode size before removing pages, once we have the + * page lock we can determine safely if the page is beyond EOF. If it is not + * beyond EOF, then the page is guaranteed safe against truncation until we + * unlock the page. + */ +int +block_page_mkwrite(struct vm_area_struct *vma, struct page *page, + get_block_t get_block) +{ + struct inode *inode = vma->vm_file->f_path.dentry->d_inode; + unsigned long end; + loff_t size; + int ret = -EINVAL; + + lock_page(page); + size = i_size_read(inode); + if ((page->mapping != inode->i_mapping) || + ((page->index << PAGE_CACHE_SHIFT) > size)) { + /* page got truncated out from underneath us */ + goto out_unlock; + } + + /* page is wholly or partially inside EOF */ + if (((page->index + 1) << PAGE_CACHE_SHIFT) > size) + end = size & ~PAGE_CACHE_MASK; + else + end = PAGE_CACHE_SIZE; + + ret = block_prepare_write(page, 0, end, get_block); + if (!ret) + ret = block_commit_write(page, 0, end); + +out_unlock: + unlock_page(page); + return ret; +} /* * nobh_prepare_write()'s prereads are special: the buffer_heads are freed @@ -2997,6 +3043,7 @@ EXPORT_SYMBOL(__brelse); EXPORT_SYMBOL(__wait_on_buffer); EXPORT_SYMBOL(block_commit_write); EXPORT_SYMBOL(block_prepare_write); +EXPORT_SYMBOL(block_page_mkwrite); EXPORT_SYMBOL(block_read_full_page); EXPORT_SYMBOL(block_sync_page); EXPORT_SYMBOL(block_truncate_page); Index: 2.6.x-xfs-new/include/linux/buffer_head.h === --- 2.6.x-xfs-new.orig/include/linux/buffer_head.h 2007-03-17 10:55:32.135435539 +1100 +++ 2.6.x-xfs-new/include/linux/buffer_head.h 2007-03-17 10:55:32.567378573 +1100 @@ -206,6 +206,8 @@ int cont_prepare_write(struct page*, uns int generic_cont_expand(struct inode *inode, loff_t size); int generic_cont_expand_simple(struct inode *inode, loff_t size); int block_commit_write(struct page *page, unsigned from, unsigned to); +int block_page_mkwrite(struct vm_area_struct *vma, struct page *page, + get_block_t get_block); void block_sync_page(struct page *); sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *); int generic_commit_write(struct file *, struct page *, unsigned, unsigned); - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))
David Chinner wrote: On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote: Andrew Morton wrote: The fault-vs-invalidate race fix. I have belatedly learned that these need more work, so their state is uncertain. The more work may turn out being too much for you (although it is nothing exactly tricky that would introduce subtle bugs, it is a fair amont of churn). OK, so does that mean we can finally get the block_page_mkwrite patches merged? i.e.: http://marc.info/?l=linux-kernel&m=117426058311032&w=2 http://marc.info/?l=linux-kernel&m=11742607036&w=2 I've got up-to-date versions of them ready to go and they've been consistently tested thanks to the XFSQA test I wrote for the bug that it fixes. I've been holding them out-of-tree for months now because ->fault was supposed to supercede this interface. Yeah, as I've said, don't hold them back because of me. They are relatively simple enough that I don't see why they couldn't be merged in this window. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))
On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote: > Andrew Morton wrote: > > The fault-vs-invalidate race fix. I have belatedly learned that these > > need > > more work, so their state is uncertain. > > The more work may turn out being too much for you (although it is nothing > exactly tricky that would introduce subtle bugs, it is a fair amont of > churn). OK, so does that mean we can finally get the block_page_mkwrite patches merged? i.e.: http://marc.info/?l=linux-kernel&m=117426058311032&w=2 http://marc.info/?l=linux-kernel&m=11742607036&w=2 I've got up-to-date versions of them ready to go and they've been consistently tested thanks to the XFSQA test I wrote for the bug that it fixes. I've been holding them out-of-tree for months now because ->fault was supposed to supercede this interface. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups
On Tue, 2007-07-10 at 23:20 -0700, Andrew Morton wrote: > On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > > From: Dmitry Monakhov <[EMAIL PROTECTED]> > > Subject: ext4: extent macros cleanup > > > > - Replace math equation to it's macro equivalent > > s/it's/its/;) Okay. > > > - make ext4_ext_grow_indepth() indexes/leaf correct > > hm, what was wrong with it? > Looking at the code, ext4_ext_ext_grow_indepth() implements tree growing procedure. It allocates a new index block, moves the top-level data of the tree(root or leaf blocks in i_data) into the new block, initializes the new root, creating index that points to the just created index block The original top-level data (in i_data) could be extent tree root (index block) or extents (leaf block). The current code (without the patch) treats the top-level data always be the leaf block, which is incorrect. assumes when the tree is growing the extent structure pass in is always > > @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, > > struct inode *inode, > > curp->p_hdr->eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode)); > > curp->p_hdr->eh_entries = cpu_to_le16(1); > > curp->p_idx = EXT_FIRST_INDEX(curp->p_hdr); > > - /* FIXME: it works, but actually path[0] can be index */ > > - curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block; > > + > > + if (path[0].p_hdr->eh_depth) > > + curp->p_idx->ei_block = EXT_FIRST_INDEX(path[0].p_hdr)->ei_block; > > + else > > + curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block; > > whitespace bustage there. > > > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/23] filesystem helpers for custom 'struct file's
Christoph H. says this stands on its own and can go in before the rest of the r/o bind mount set. --- Some filesystems forego the vfs and may_open() and create their own 'struct file's. This patch creates a couple of helper functions which can be used by these filesystems, and will provide a unified place which the r/o bind mount code may patch. Also, rename an existing, static-scope init_file() to a less generic name. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/configfs/dir.c|5 +++-- lxc-dave/fs/file_table.c | 34 ++ lxc-dave/fs/hugetlbfs/inode.c | 22 +- lxc-dave/include/linux/file.h |9 + lxc-dave/ipc/shm.c| 13 + lxc-dave/mm/shmem.c |7 ++- lxc-dave/mm/tiny-shmem.c | 19 +++ lxc-dave/net/socket.c | 18 +- 8 files changed, 78 insertions(+), 49 deletions(-) diff -puN fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s fs/configfs/dir.c --- lxc/fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s 2007-07-10 12:46:03.0 -0700 +++ lxc-dave/fs/configfs/dir.c 2007-07-10 12:46:03.0 -0700 @@ -142,7 +142,7 @@ static int init_dir(struct inode * inode return 0; } -static int init_file(struct inode * inode) +static int configfs_init_file(struct inode * inode) { inode->i_size = PAGE_SIZE; inode->i_fop = &configfs_file_operations; @@ -283,7 +283,8 @@ static int configfs_attach_attr(struct c dentry->d_fsdata = configfs_get(sd); sd->s_dentry = dentry; - error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, init_file); + error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, + configfs_init_file); if (error) { configfs_put(sd); return error; diff -puN fs/file_table.c~filesystem-helpers-for-custom-struct-file-s fs/file_table.c --- lxc/fs/file_table.c~filesystem-helpers-for-custom-struct-file-s 2007-07-10 12:46:03.0 -0700 +++ lxc-dave/fs/file_table.c2007-07-10 12:46:03.0 -0700 @@ -138,6 +138,40 @@ fail: EXPORT_SYMBOL(get_empty_filp); +struct file *alloc_file(struct vfsmount *mnt, struct dentry *dentry, + mode_t mode, const struct file_operations *fop) +{ + struct file *file; + struct path; + + file = get_empty_filp(); + if (!file) + return NULL; + + init_file(file, mnt, dentry, mode, fop); + return file; +} +EXPORT_SYMBOL(alloc_file); + +/* + * Note: This is a crappy interface. It is here to make + * merging with the existing users of get_empty_filp() + * who have complex failure logic easier. All users + * of this should be moving to alloc_file(). + */ +int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry, + mode_t mode, const struct file_operations *fop) +{ + int error = 0; + file->f_path.dentry = dentry; + file->f_path.mnt = mntget(mnt); + file->f_mapping = dentry->d_inode->i_mapping; + file->f_mode = mode; + file->f_op = fop; + return error; +} +EXPORT_SYMBOL(init_file); + void fastcall fput(struct file *file) { if (atomic_dec_and_test(&file->f_count)) diff -puN fs/hugetlbfs/inode.c~filesystem-helpers-for-custom-struct-file-s fs/hugetlbfs/inode.c --- lxc/fs/hugetlbfs/inode.c~filesystem-helpers-for-custom-struct-file-s 2007-07-10 12:46:03.0 -0700 +++ lxc-dave/fs/hugetlbfs/inode.c 2007-07-10 12:46:03.0 -0700 @@ -761,16 +761,11 @@ struct file *hugetlb_file_setup(const ch if (!dentry) goto out_shm_unlock; - error = -ENFILE; - file = get_empty_filp(); - if (!file) - goto out_dentry; - error = -ENOSPC; inode = hugetlbfs_get_inode(root->d_sb, current->fsuid, current->fsgid, S_IFREG | S_IRWXUGO, 0); if (!inode) - goto out_file; + goto out_dentry; error = -ENOMEM; if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT)) @@ -779,17 +774,18 @@ struct file *hugetlb_file_setup(const ch d_instantiate(dentry, inode); inode->i_size = size; inode->i_nlink = 0; - file->f_path.mnt = mntget(hugetlbfs_vfsmount); - file->f_path.dentry = dentry; - file->f_mapping = inode->i_mapping; - file->f_op = &hugetlbfs_file_operations; - file->f_mode = FMODE_WRITE | FMODE_READ; + + error = -ENFILE; + file = alloc_file(hugetlbfs_vfsmount, dentry, + FMODE_WRITE | FMODE_READ, + &hugetlbfs_file_operations); + if (!file) + goto out_inode; + return file; out_inode: iput(inode); -out_file: - put_filp(file); out_dentry: dput(dentry);
[PATCH 05/23] elevate write count open()'d files
This is the first really tricky patch in the series. It elevates the writer count on a mount each time a non-special file is opened for write. This is not completely apparent in the patch because the two if() conditions in may_open() above the mnt_want_write() call are, combined, equivalent to special_file(). There is also an elevated count around the vfs_create() call in open_namei(). The count needs to be kept elevated all the way into the may_open() call. Otherwise, when the write is dropped, a ro->rw transisition could occur. This would lead to having rw access on the newly created file, while the vfsmount is ro. That is bad. Some filesystems forego the use of normal vfs calls to create struct files. Make sure that these users elevate the mnt writer count because they will get __fput(), and we need to make sure they're balanced. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/file_table.c |9 - lxc-dave/fs/namei.c | 20 lxc-dave/ipc/mqueue.c|3 +++ 3 files changed, 27 insertions(+), 5 deletions(-) diff -puN fs/file_table.c~tricky-elevate-write-count-files-are-open-ed fs/file_table.c --- lxc/fs/file_table.c~tricky-elevate-write-count-files-are-open-ed 2007-07-10 12:46:04.0 -0700 +++ lxc-dave/fs/file_table.c2007-07-10 12:46:04.0 -0700 @@ -168,6 +168,10 @@ int init_file(struct file *file, struct file->f_mapping = dentry->d_inode->i_mapping; file->f_mode = mode; file->f_op = fop; + if (mode & FMODE_WRITE) { + error = mnt_want_write(mnt); + WARN_ON(error); + } return error; } EXPORT_SYMBOL(init_file); @@ -205,8 +209,11 @@ void fastcall __fput(struct file *file) if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL)) cdev_put(inode->i_cdev); fops_put(file->f_op); - if (file->f_mode & FMODE_WRITE) + if (file->f_mode & FMODE_WRITE) { put_write_access(inode); + if (!special_file(inode->i_mode)) + mnt_drop_write(mnt); + } put_pid(file->f_owner.pid); file_kill(file); file->f_path.dentry = NULL; diff -puN fs/namei.c~tricky-elevate-write-count-files-are-open-ed fs/namei.c --- lxc/fs/namei.c~tricky-elevate-write-count-files-are-open-ed 2007-07-10 12:46:04.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:04.0 -0700 @@ -1562,8 +1562,15 @@ int may_open(struct nameidata *nd, int a return -EACCES; flag &= ~O_TRUNC; - } else if (IS_RDONLY(inode) && (flag & FMODE_WRITE)) - return -EROFS; + } else if (flag & FMODE_WRITE) { + /* +* effectively: !special_file() +* balanced by __fput() +*/ + error = mnt_want_write(nd->mnt); + if (error) + return error; + } error = vfs_permission(nd, acc_mode); if (error) @@ -1706,14 +1713,17 @@ do_last: } if (IS_ERR(nd->intent.open.file)) { - mutex_unlock(&dir->d_inode->i_mutex); error = PTR_ERR(nd->intent.open.file); - goto exit_dput; + goto exit_mutex_unlock; } /* Negative dentry, just create the file */ if (!path.dentry->d_inode) { + error = mnt_want_write(nd->mnt); + if (error) + goto exit_mutex_unlock; error = open_namei_create(nd, &path, flag, mode); + mnt_drop_write(nd->mnt); if (error) goto exit; return 0; @@ -1751,6 +1761,8 @@ ok: goto exit; return 0; +exit_mutex_unlock: + mutex_unlock(&dir->d_inode->i_mutex); exit_dput: dput_path(&path, nd); exit: diff -puN ipc/mqueue.c~tricky-elevate-write-count-files-are-open-ed ipc/mqueue.c --- lxc/ipc/mqueue.c~tricky-elevate-write-count-files-are-open-ed 2007-07-10 12:46:04.0 -0700 +++ lxc-dave/ipc/mqueue.c 2007-07-10 12:46:04.0 -0700 @@ -686,6 +686,9 @@ asmlinkage long sys_mq_open(const char _ goto out; filp = do_open(dentry, oflag); } else { + error = mnt_want_write(mqueue_mnt); + if (error) + goto out; filp = do_create(mqueue_mnt->mnt_root, dentry, oflag, mode, u_attr); } _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/23] elevate writer count for chown and friends
chown/chmod,etc... don't call permission in the same way that the normal "open for write" calls do. They still write to the filesystem, so bump the write count during these operations. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/open.c | 39 ++- 1 file changed, 30 insertions(+), 9 deletions(-) diff -puN fs/open.c~elevate-writer-count-for-chown-and-friends fs/open.c --- lxc/fs/open.c~elevate-writer-count-for-chown-and-friends2007-07-10 12:46:06.0 -0700 +++ lxc-dave/fs/open.c 2007-07-10 12:46:06.0 -0700 @@ -510,12 +510,12 @@ asmlinkage long sys_fchmod(unsigned int audit_inode(NULL, inode); - err = -EROFS; - if (IS_RDONLY(inode)) + err = mnt_want_write(file->f_vfsmnt); + if (err) goto out_putf; err = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto out_putf; + goto out_drop_write; mutex_lock(&inode->i_mutex); if (mode == (mode_t) -1) mode = inode->i_mode; @@ -524,6 +524,8 @@ asmlinkage long sys_fchmod(unsigned int err = notify_change(dentry, &newattrs); mutex_unlock(&inode->i_mutex); +out_drop_write: + mnt_drop_write(file->f_vfsmnt); out_putf: fput(file); out: @@ -543,13 +545,13 @@ asmlinkage long sys_fchmodat(int dfd, co goto out; inode = nd.dentry->d_inode; - error = -EROFS; - if (IS_RDONLY(inode)) + error = mnt_want_write(nd.mnt); + if (error) goto dput_and_out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto dput_and_out; + goto out_drop_write; mutex_lock(&inode->i_mutex); if (mode == (mode_t) -1) @@ -559,6 +561,8 @@ asmlinkage long sys_fchmodat(int dfd, co error = notify_change(nd.dentry, &newattrs); mutex_unlock(&inode->i_mutex); +out_drop_write: + mnt_drop_write(nd.mnt); dput_and_out: path_release(&nd); out: @@ -581,9 +585,6 @@ static int chown_common(struct dentry * printk(KERN_ERR "chown_common: NULL inode\n"); goto out; } - error = -EROFS; - if (IS_RDONLY(inode)) - goto out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) goto out; @@ -613,7 +614,12 @@ asmlinkage long sys_chown(const char __u error = user_path_walk(filename, &nd); if (error) goto out; + error = mnt_want_write(nd.mnt); + if (error) + goto out_release; error = chown_common(nd.dentry, user, group); + mnt_drop_write(nd.mnt); +out_release: path_release(&nd); out: return error; @@ -633,7 +639,12 @@ asmlinkage long sys_fchownat(int dfd, co error = __user_walk_fd(dfd, filename, follow, &nd); if (error) goto out; + error = mnt_want_write(nd.mnt); + if (error) + goto out_release; error = chown_common(nd.dentry, user, group); + mnt_drop_write(nd.mnt); +out_release: path_release(&nd); out: return error; @@ -647,7 +658,12 @@ asmlinkage long sys_lchown(const char __ error = user_path_walk_link(filename, &nd); if (error) goto out; + error = mnt_want_write(nd.mnt); + if (error) + goto out_release; error = chown_common(nd.dentry, user, group); + mnt_drop_write(nd.mnt); +out_release: path_release(&nd); out: return error; @@ -664,9 +680,14 @@ asmlinkage long sys_fchown(unsigned int if (!file) goto out; + error = mnt_want_write(file->f_vfsmnt); + if (error) + goto out_fput; dentry = file->f_path.dentry; audit_inode(NULL, dentry->d_inode); error = chown_common(dentry, user, group); + mnt_drop_write(file->f_vfsmnt); +out_fput: fput(file); out: return error; _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/23] r/o bind mounts: elevate write count for some ioctls
Some ioctl()s can cause writes to the filesystem. Take these, and make them use mnt_want/drop_write() instead. We need to pass the filp one layer deeper in XFS, but somebody _just_ pulled it out in February because nobody was using it, so I don't feel guilty for adding it back. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/ext2/ioctl.c | 46 +- lxc-dave/fs/ext3/ioctl.c | 100 +--- lxc-dave/fs/ext4/ioctl.c | 105 +- lxc-dave/fs/fat/file.c| 10 +-- lxc-dave/fs/hfsplus/ioctl.c | 39 +++- lxc-dave/fs/jfs/ioctl.c | 33 ++ lxc-dave/fs/ocfs2/ioctl.c | 11 +-- lxc-dave/fs/reiserfs/ioctl.c | 53 +++-- lxc-dave/fs/xfs/linux-2.6/xfs_ioctl.c | 15 +++- lxc-dave/fs/xfs/linux-2.6/xfs_iops.c |7 -- lxc-dave/fs/xfs/linux-2.6/xfs_lrw.c |9 ++ 11 files changed, 272 insertions(+), 156 deletions(-) diff -puN fs/ext2/ioctl.c~ioctl-mnt-takers fs/ext2/ioctl.c --- lxc/fs/ext2/ioctl.c~ioctl-mnt-takers2007-07-10 12:46:05.0 -0700 +++ lxc-dave/fs/ext2/ioctl.c2007-07-10 12:46:05.0 -0700 @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -22,6 +23,7 @@ int ext2_ioctl (struct inode * inode, st { struct ext2_inode_info *ei = EXT2_I(inode); unsigned int flags; + int ret; ext2_debug ("cmd = %u, arg = %lu\n", cmd, arg); @@ -33,14 +35,19 @@ int ext2_ioctl (struct inode * inode, st case EXT2_IOC_SETFLAGS: { unsigned int oldflags; - if (IS_RDONLY(inode)) - return -EROFS; - - if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) - return -EACCES; + ret = mnt_want_write(filp->f_vfsmnt); + if (ret) + return ret; + + if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) { + ret = -EACCES; + goto setflags_out; + } - if (get_user(flags, (int __user *) arg)) - return -EFAULT; + if (get_user(flags, (int __user *) arg)) { + ret = -EFAULT; + goto setflags_out; + } if (!S_ISDIR(inode->i_mode)) flags &= ~EXT2_DIRSYNC_FL; @@ -57,7 +64,8 @@ int ext2_ioctl (struct inode * inode, st if ((flags ^ oldflags) & (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) { if (!capable(CAP_LINUX_IMMUTABLE)) { mutex_unlock(&inode->i_mutex); - return -EPERM; + ret = -EPERM; + goto setflags_out; } } @@ -69,20 +77,26 @@ int ext2_ioctl (struct inode * inode, st ext2_set_inode_flags(inode); inode->i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(inode); - return 0; +setflags_out: + mnt_drop_write(filp->f_vfsmnt); + return ret; } case EXT2_IOC_GETVERSION: return put_user(inode->i_generation, (int __user *) arg); case EXT2_IOC_SETVERSION: if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER)) return -EPERM; - if (IS_RDONLY(inode)) - return -EROFS; - if (get_user(inode->i_generation, (int __user *) arg)) - return -EFAULT; - inode->i_ctime = CURRENT_TIME_SEC; - mark_inode_dirty(inode); - return 0; + ret = mnt_want_write(filp->f_vfsmnt); + if (ret) + return ret; + if (get_user(inode->i_generation, (int __user *) arg)) { + ret = -EFAULT; + } else { + inode->i_ctime = CURRENT_TIME_SEC; + mark_inode_dirty(inode); + } + mnt_drop_write(filp->f_vfsmnt); + return ret; default: return -ENOTTY; } diff -puN fs/ext3/ioctl.c~ioctl-mnt-takers fs/ext3/ioctl.c --- lxc/fs/ext3/ioctl.c~ioctl-mnt-takers2007-07-10 12:46:05.0 -0700 +++ lxc-dave/fs/ext3/ioctl.c2007-07-10 12:46:05.0 -0700 @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -38,14 +39,19 @@ int ext3_ioctl (struct inode * inode, st unsigned int oldflags; unsigned int jflag; - if (IS_RDONLY(inode)) - return -EROFS; + err = mnt_want_write(filp->f_vfsmnt); +
[PATCH 10/23] elevate write count during entire ncp_ioctl()
Some ioctls need write access, but others don't. Make a helper function to decide when write access is needed, and take it. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/ncpfs/ioctl.c | 55 +- 1 file changed, 54 insertions(+), 1 deletion(-) diff -puN fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl fs/ncpfs/ioctl.c --- lxc/fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl 2007-07-10 12:46:08.0 -0700 +++ lxc-dave/fs/ncpfs/ioctl.c 2007-07-10 12:46:08.0 -0700 @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv } #endif /* CONFIG_NCPFS_NLS */ -int ncp_ioctl(struct inode *inode, struct file *filp, +static int __ncp_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg) { struct ncp_server *server = NCP_SERVER(inode); @@ -822,6 +823,58 @@ outrel: return -EINVAL; } +static int ncp_ioctl_need_write(unsigned int cmd) +{ + switch (cmd) { + case NCP_IOC_GET_FS_INFO: + case NCP_IOC_GET_FS_INFO_V2: + case NCP_IOC_NCPREQUEST: + case NCP_IOC_SETDENTRYTTL: + case NCP_IOC_SIGN_INIT: + case NCP_IOC_LOCKUNLOCK: + case NCP_IOC_SET_SIGN_WANTED: + return 1; + case NCP_IOC_GETOBJECTNAME: + case NCP_IOC_SETOBJECTNAME: + case NCP_IOC_GETPRIVATEDATA: + case NCP_IOC_SETPRIVATEDATA: + case NCP_IOC_SETCHARSETS: + case NCP_IOC_GETCHARSETS: + case NCP_IOC_CONN_LOGGED_IN: + case NCP_IOC_GETDENTRYTTL: + case NCP_IOC_GETMOUNTUID2: + case NCP_IOC_SIGN_WANTED: + case NCP_IOC_GETROOT: + case NCP_IOC_SETROOT: + return 0; + default: + /* unkown IOCTL command, assume write */ + WARN_ON(1); + } + return 1; +} + +int ncp_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + int ret; + + if (ncp_ioctl_need_write(cmd)) { + /* +* inside the ioctl(), any failures which +* are because of file_permission() are +* -EACCESS, so it seems consistent to keep +* that here. +*/ + if (mnt_want_write(filp->f_vfsmnt)) + return -EACCES; + } + ret = __ncp_ioctl(inode, filp, cmd, arg); + if (ncp_ioctl_need_write(cmd)) + mnt_drop_write(filp->f_vfsmnt); + return ret; +} + #ifdef CONFIG_COMPAT long ncp_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/23] make access() use mnt check
It is OK to let access() go without using a mnt_want/drop_write() pair because it doesn't actually do writes to the filesystem, and it is inherently racy anyway. This is a rare case when it is OK to use __mnt_is_readonly() directly. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/open.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff -puN fs/open.c~make-access-use-helper fs/open.c --- lxc/fs/open.c~make-access-use-helper2007-07-10 12:46:07.0 -0700 +++ lxc-dave/fs/open.c 2007-07-10 12:46:07.0 -0700 @@ -396,8 +396,17 @@ asmlinkage long sys_faccessat(int dfd, c if(res || !(mode & S_IWOTH) || special_file(nd.dentry->d_inode->i_mode)) goto out_path_release; - - if(IS_RDONLY(nd.dentry->d_inode)) + /* +* This is a rare case where using __mnt_is_readonly() +* is OK without a mnt_want/drop_write() pair. Since +* no actual write to the fs is performed here, we do +* not need to telegraph to that to anyone. +* +* By doing this, we accept that this access is +* inherently racy and know that the fs may change +* state before we even see this result. +*/ + if (__mnt_is_readonly(nd.mnt)) res = -EROFS; out_path_release: _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 17/23] nfs: check mnt instead of superblock directly
If we depend on the inodes for writeability, we will not catch the r/o mounts when implemented. This patches uses __mnt_want_write(). It does not guarantee that the mount will stay writeable after the check. But, this is OK for one of the checks because it is just for a printk(). The other two are probably unnecessary and duplicate existing checks in the VFS. This won't make them better checks than before, but it will make them detect r/o mounts. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/nfs/dir.c |3 ++- lxc-dave/fs/nfsd/vfs.c |4 ++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff -puN fs/nfs/dir.c~nfs-check-mnt-instead-of-sb fs/nfs/dir.c --- lxc/fs/nfs/dir.c~nfs-check-mnt-instead-of-sb2007-07-10 12:46:13.0 -0700 +++ lxc-dave/fs/nfs/dir.c 2007-07-10 12:46:13.0 -0700 @@ -998,7 +998,8 @@ static int is_atomic_open(struct inode * if (nd->flags & LOOKUP_DIRECTORY) return 0; /* Are we trying to write to a read only partition? */ - if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE))) + if (__mnt_is_readonly(nd->mnt) && + (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE))) return 0; return 1; } diff -puN fs/nfsd/vfs.c~nfs-check-mnt-instead-of-sb fs/nfsd/vfs.c --- lxc/fs/nfsd/vfs.c~nfs-check-mnt-instead-of-sb 2007-07-10 12:46:13.0 -0700 +++ lxc-dave/fs/nfsd/vfs.c 2007-07-10 12:46:13.0 -0700 @@ -1810,7 +1810,7 @@ nfsd_permission(struct svc_export *exp, inode->i_mode, IS_IMMUTABLE(inode)?" immut" : "", IS_APPEND(inode)? " append" : "", - IS_RDONLY(inode)? " ro" : ""); + __mnt_is_readonly(exp->ex_mnt)? " ro" : ""); dprintk(" owner %d/%d user %d/%d\n", inode->i_uid, inode->i_gid, current->fsuid, current->fsgid); #endif @@ -1821,7 +1821,7 @@ nfsd_permission(struct svc_export *exp, */ if (!(acc & MAY_LOCAL_ACCESS)) if (acc & (MAY_WRITE | MAY_SATTR | MAY_TRUNC)) { - if (EX_RDONLY(exp) || IS_RDONLY(inode)) + if (EX_RDONLY(exp) || __mnt_is_readonly(exp->ex_mnt)) return nfserr_rofs; if (/* (acc & MAY_WRITE) && */ IS_IMMUTABLE(inode)) return nfserr_perm; _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 22/23] elevate mnt writers for vfs_unlink() callers
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/namei.c |4 lxc-dave/ipc/mqueue.c |5 - 2 files changed, 8 insertions(+), 1 deletion(-) diff -puN fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers fs/namei.c --- lxc/fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-07-10 12:46:16.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:16.0 -0700 @@ -2195,7 +2195,11 @@ static long do_unlinkat(int dfd, const c inode = dentry->d_inode; if (inode) atomic_inc(&inode->i_count); + error = mnt_want_write(nd.mnt); + if (error) + goto exit2; error = vfs_unlink(nd.dentry->d_inode, dentry); + mnt_drop_write(nd.mnt); exit2: dput(dentry); } diff -puN ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers ipc/mqueue.c --- lxc/ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-07-10 12:46:16.0 -0700 +++ lxc-dave/ipc/mqueue.c 2007-07-10 12:46:16.0 -0700 @@ -750,8 +750,11 @@ asmlinkage long sys_mq_unlink(const char inode = dentry->d_inode; if (inode) atomic_inc(&inode->i_count); - + err = mnt_want_write(mqueue_mnt); + if (err) + goto out_err; err = vfs_unlink(dentry->d_parent->d_inode, dentry); + mnt_drop_write(mqueue_mnt); out_err: dput(dentry); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 19/23] elevate write count for do_utimes()
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/utimes.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff -puN fs/utimes.c~elevate-write-count-for-do-utimes fs/utimes.c --- lxc/fs/utimes.c~elevate-write-count-for-do-utimes 2007-07-10 12:46:14.0 -0700 +++ lxc-dave/fs/utimes.c2007-07-10 12:46:14.0 -0700 @@ -2,6 +2,7 @@ #include #include #include +#include #include #include #include @@ -75,8 +76,8 @@ long do_utimes(int dfd, char __user *fil inode = dentry->d_inode; - error = -EROFS; - if (IS_RDONLY(inode)) + error = mnt_want_write(nd.mnt); + if (error) goto dput_and_out; /* Don't worry, the checks are done in inode_change_ok() */ @@ -84,7 +85,7 @@ long do_utimes(int dfd, char __user *fil if (times) { error = -EPERM; if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) -goto dput_and_out; + goto mnt_drop_write_and_out; if (times[0].tv_nsec == UTIME_OMIT) newattrs.ia_valid &= ~ATTR_ATIME; @@ -104,22 +105,24 @@ long do_utimes(int dfd, char __user *fil } else { error = -EACCES; if (IS_IMMUTABLE(inode)) -goto dput_and_out; + goto mnt_drop_write_and_out; if (current->fsuid != inode->i_uid) { if (f) { if (!(f->f_mode & FMODE_WRITE)) - goto dput_and_out; + goto mnt_drop_write_and_out; } else { error = vfs_permission(&nd, MAY_WRITE); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; } } } mutex_lock(&inode->i_mutex); error = notify_change(dentry, &newattrs); mutex_unlock(&inode->i_mutex); +mnt_drop_write_and_out: + mnt_drop_write(nd.mnt); dput_and_out: if (f) fput(f); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 21/23] sys_mknodat(): elevate write count for vfs_mknod/create()
This takes care of all of the direct callers of vfs_mknod(). Since a few of these cases also handle normal file creation as well, this also covers some calls to vfs_create(). So that we don't have to make three mnt_want/drop_write() calls inside of the switch statement, we move some of its logic outside of the switch. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/namei.c | 32 +--- lxc-dave/fs/nfsd/vfs.c |4 lxc-dave/net/unix/af_unix.c |4 3 files changed, 29 insertions(+), 11 deletions(-) diff -puN fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/namei.c --- lxc/fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-10 12:46:15.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-11 10:10:55.0 -0700 @@ -1912,12 +1912,25 @@ asmlinkage long sys_mknodat(int dfd, con if (error) goto out; dentry = lookup_create(&nd, 0); - error = PTR_ERR(dentry); - + if (IS_ERR(dentry)) { + error = PTR_ERR(dentry); + goto out_unlock; + } if (!IS_POSIXACL(nd.dentry->d_inode)) mode &= ~current->fs->umask; - if (!IS_ERR(dentry)) { - switch (mode & S_IFMT) { + if (S_ISDIR(mode)) { + error = -EPERM; + goto out_dput; + } + if (!S_ISREG(mode) && !S_ISCHR(mode) && !S_ISBLK(mode) && + !S_ISFIFO(mode) && !S_ISSOCK(mode) && mode != 0) { + error = -EINVAL; + goto out_dput; + } + error = mnt_want_write(nd.mnt); + if (error) + goto out_dput; + switch (mode & S_IFMT) { case 0: case S_IFREG: error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd); break; @@ -1928,14 +1941,11 @@ asmlinkage long sys_mknodat(int dfd, con case S_IFIFO: case S_IFSOCK: error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0); break; - case S_IFDIR: - error = -EPERM; - break; - default: - error = -EINVAL; - } - dput(dentry); } + mnt_drop_write(nd.mnt); +out_dput: + dput(dentry); +out_unlock: mutex_unlock(&nd.dentry->d_inode->i_mutex); path_release(&nd); out: diff -puN fs/nfsd/vfs.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/nfsd/vfs.c --- lxc/fs/nfsd/vfs.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-10 12:46:15.0 -0700 +++ lxc-dave/fs/nfsd/vfs.c 2007-07-10 12:46:15.0 -0700 @@ -1199,7 +1199,11 @@ nfsd_create(struct svc_rqst *rqstp, stru case S_IFBLK: case S_IFIFO: case S_IFSOCK: + host_err = mnt_want_write(fhp->fh_export->ex_mnt); + if (host_err) + break; host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev); + mnt_drop_write(fhp->fh_export->ex_mnt); break; default: printk("nfsd: bad file type %o in nfsd_create\n", type); diff -puN net/unix/af_unix.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create net/unix/af_unix.c --- lxc/net/unix/af_unix.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-07-10 12:46:15.0 -0700 +++ lxc-dave/net/unix/af_unix.c 2007-07-10 12:46:15.0 -0700 @@ -815,7 +815,11 @@ static int unix_bind(struct socket *sock */ mode = S_IFSOCK | (SOCK_INODE(sock)->i_mode & ~current->fs->umask); + err = mnt_want_write(nd.mnt); + if (err) + goto out_mknod_dput; err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0); + mnt_drop_write(nd.mnt); if (err) goto out_mknod_dput; mutex_unlock(&nd.dentry->d_inode->i_mutex); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 23/23] do_rmdir(): elevate write count
Elevate the write count during the vfs_rmdir() call. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/namei.c |5 + 1 file changed, 5 insertions(+) diff -puN fs/namei.c~do-rmdir-elevate-write-count fs/namei.c --- lxc/fs/namei.c~do-rmdir-elevate-write-count 2007-07-10 12:46:16.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:16.0 -0700 @@ -2115,7 +2115,12 @@ static long do_rmdir(int dfd, const char error = PTR_ERR(dentry); if (IS_ERR(dentry)) goto exit2; + error = mnt_want_write(nd.mnt); + if (error) + goto exit3; error = vfs_rmdir(nd.dentry->d_inode, dentry); + mnt_drop_write(nd.mnt); +exit3: dput(dentry); exit2: mutex_unlock(&nd.dentry->d_inode->i_mutex); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 18/23] elevate writer count for do_sys_truncate()
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/open.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff -puN fs/open.c~elevate-writer-count-for-do-sys-truncate fs/open.c --- lxc/fs/open.c~elevate-writer-count-for-do-sys-truncate 2007-07-10 12:46:13.0 -0700 +++ lxc-dave/fs/open.c 2007-07-10 12:46:13.0 -0700 @@ -243,28 +243,28 @@ static long do_sys_truncate(const char _ if (!S_ISREG(inode->i_mode)) goto dput_and_out; - error = vfs_permission(&nd, MAY_WRITE); + error = mnt_want_write(nd.mnt); if (error) goto dput_and_out; - error = -EROFS; - if (IS_RDONLY(inode)) - goto dput_and_out; + error = vfs_permission(&nd, MAY_WRITE); + if (error) + goto mnt_drop_write_and_out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto dput_and_out; + goto mnt_drop_write_and_out; /* * Make sure that there are no leases. */ error = break_lease(inode, FMODE_WRITE); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; error = get_write_access(inode); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; error = locks_verify_truncate(inode, NULL, length); if (!error) { @@ -273,6 +273,8 @@ static long do_sys_truncate(const char _ } put_write_access(inode); +mnt_drop_write_and_out: + mnt_drop_write(nd.mnt); dput_and_out: path_release(&nd); out: _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 20/23] elevate write count for do_sys_utime() and touch_atime()
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/inode.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff -puN fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime fs/inode.c --- lxc/fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime 2007-07-10 12:46:15.0 -0700 +++ lxc-dave/fs/inode.c 2007-07-10 12:46:15.0 -0700 @@ -1165,22 +1165,23 @@ void touch_atime(struct vfsmount *mnt, s struct inode *inode = dentry->d_inode; struct timespec now; - if (inode->i_flags & S_NOATIME) + if (mnt && mnt_want_write(mnt)) return; + if (inode->i_flags & S_NOATIME) + goto out; if (IS_NOATIME(inode)) - return; + goto out; if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode)) - return; + goto out; /* * We may have a NULL vfsmount when coming from NFSD */ if (mnt) { if (mnt->mnt_flags & MNT_NOATIME) - return; + goto out; if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)) - return; - + goto out; if (mnt->mnt_flags & MNT_RELATIME) { /* * With relative atime, only update atime if the @@ -1191,16 +1192,19 @@ void touch_atime(struct vfsmount *mnt, s &inode->i_atime) < 0 && timespec_compare(&inode->i_ctime, &inode->i_atime) < 0) - return; + goto out; } } now = current_fs_time(inode->i_sb); if (timespec_equal(&inode->i_atime, &now)) - return; + goto out; inode->i_atime = now; mark_inode_dirty_sync(inode); +out: + if (mnt) + mnt_drop_write(mnt); } EXPORT_SYMBOL(touch_atime); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 16/23] elevate write count over calls to vfs_rename()
This also uses the little helper in the NFS code to make an if() a little bit less ugly. We introduced the helper at the beginning of the series. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/namei.c|4 lxc-dave/fs/nfsd/vfs.c | 15 +++ 2 files changed, 15 insertions(+), 4 deletions(-) diff -puN fs/namei.c~elevate-write-count-over-calls-to-vfs-rename fs/namei.c --- lxc/fs/namei.c~elevate-write-count-over-calls-to-vfs-rename 2007-07-10 12:46:12.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:12.0 -0700 @@ -2597,8 +2597,12 @@ static int do_rename(int olddfd, const c if (new_dentry == trap) goto exit5; + error = mnt_want_write(oldnd.mnt); + if (error) + goto exit5; error = vfs_rename(old_dir->d_inode, old_dentry, new_dir->d_inode, new_dentry); + mnt_drop_write(oldnd.mnt); exit5: dput(new_dentry); exit4: diff -puN fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename fs/nfsd/vfs.c --- lxc/fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename 2007-07-10 12:46:12.0 -0700 +++ lxc-dave/fs/nfsd/vfs.c 2007-07-10 12:46:12.0 -0700 @@ -1622,13 +1622,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru if (ndentry == trap) goto out_dput_new; -#ifdef MSNFS - if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) && + if (svc_msnfs(ffhp) && ((atomic_read(&odentry->d_count) > 1) || (atomic_read(&ndentry->d_count) > 1))) { host_err = -EPERM; - } else -#endif + goto out_dput_new; + } + + host_err = -EXDEV; + if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt) + goto out_dput_new; + host_err = mnt_want_write(ffhp->fh_export->ex_mnt); + if (host_err) + goto out_dput_new; + host_err = vfs_rename(fdir, odentry, tdir, ndentry); if (!host_err && EX_ISSYNC(tfhp->fh_export)) { host_err = nfsd_sync_dir(tdentry); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 15/23] unix_find_other() elevate write count for touch_atime()
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/net/unix/af_unix.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff -puN net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime net/unix/af_unix.c --- lxc/net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime 2007-07-10 12:46:11.0 -0700 +++ lxc-dave/net/unix/af_unix.c 2007-07-10 12:46:11.0 -0700 @@ -702,21 +702,27 @@ static struct sock *unix_find_other(stru err = path_lookup(sunname->sun_path, LOOKUP_FOLLOW, &nd); if (err) goto fail; + + err = mnt_want_write(nd.mnt); + if (err) + goto put_path_fail; + err = vfs_permission(&nd, MAY_WRITE); if (err) - goto put_fail; + goto mnt_drop_write_fail; err = -ECONNREFUSED; if (!S_ISSOCK(nd.dentry->d_inode->i_mode)) - goto put_fail; + goto mnt_drop_write_fail; u=unix_find_socket_byinode(nd.dentry->d_inode); if (!u) - goto put_fail; + goto mnt_drop_write_fail; if (u->sk_type == type) touch_atime(nd.mnt, nd.dentry); path_release(&nd); + mnt_drop_write(nd.mnt); err=-EPROTOTYPE; if (u->sk_type != type) { @@ -736,7 +742,9 @@ static struct sock *unix_find_other(stru } return u; -put_fail: +mnt_drop_write_fail: + mnt_drop_write(nd.mnt); +put_path_fail: path_release(&nd); fail: *error=err; _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/23] elevate write count for file_update_time()
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/inode.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff -puN fs/inode.c~elevate-write-count-for-file_update_time fs/inode.c --- lxc/fs/inode.c~elevate-write-count-for-file_update_time 2007-07-10 12:46:10.0 -0700 +++ lxc-dave/fs/inode.c 2007-07-10 12:46:10.0 -0700 @@ -1221,10 +1221,19 @@ void file_update_time(struct file *file) struct inode *inode = file->f_path.dentry->d_inode; struct timespec now; int sync_it = 0; + int err = 0; if (IS_NOCMTIME(inode)) return; - if (IS_RDONLY(inode)) + /* +* Ideally, we want to guarantee that 'f_vfsmnt' +* is non-NULL here. But, NFS exports need to +* be fixed up before we can do that. So, check +* it for now. - Dave Hansen +*/ + if (file->f_vfsmnt) + err = mnt_want_write(file->f_vfsmnt); + if (err) return; now = current_fs_time(inode->i_sb); @@ -1240,6 +1249,8 @@ void file_update_time(struct file *file) if (sync_it) mark_inode_dirty_sync(inode); + if (file->f_vfsmnt) + mnt_drop_write(file->f_vfsmnt); } EXPORT_SYMBOL(file_update_time); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 14/23] mount_is_safe(): add comment
This area of code is currently #ifdef'd out, so add a comment for the time when it is actually used. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/namespace.c |4 1 file changed, 4 insertions(+) diff -puN fs/namespace.c~mount-is-safe-add-comment fs/namespace.c --- lxc/fs/namespace.c~mount-is-safe-add-comment2007-07-10 12:46:11.0 -0700 +++ lxc-dave/fs/namespace.c 2007-07-10 12:46:11.0 -0700 @@ -728,6 +728,10 @@ static int mount_is_safe(struct nameidat if (current->uid != nd->dentry->d_inode->i_uid) return -EPERM; } + /* +* We will eventually check for the mnt->writer_count here, +* but since the code is not used now, skip it - Dave Hansen +*/ if (vfs_permission(nd, MAY_WRITE)) return -EPERM; return 0; _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/23] elevate mount count for extended attributes
This basically audits the callers of xattr_permission(), which calls permission() and can perform writes to the filesystem. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/nfsd/nfs4proc.c |7 ++- lxc-dave/fs/xattr.c | 16 ++-- 2 files changed, 20 insertions(+), 3 deletions(-) diff -puN fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes fs/nfsd/nfs4proc.c --- lxc/fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes 2007-07-10 12:46:10.0 -0700 +++ lxc-dave/fs/nfsd/nfs4proc.c 2007-07-10 12:46:10.0 -0700 @@ -626,14 +626,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st return status; } } + status = mnt_want_write(cstate->current_fh.fh_export->ex_mnt); + if (status) + return status; status = nfs_ok; if (setattr->sa_acl != NULL) status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh, setattr->sa_acl); if (status) - return status; + goto out; status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr, 0, (time_t)0); +out: + mnt_drop_write(cstate->current_fh.fh_export->ex_mnt); return status; } diff -puN fs/xattr.c~elevate-mount-count-for-extended-attributes fs/xattr.c --- lxc/fs/xattr.c~elevate-mount-count-for-extended-attributes 2007-07-10 12:46:10.0 -0700 +++ lxc-dave/fs/xattr.c 2007-07-10 12:46:10.0 -0700 @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co * filesystem or on an immutable / append-only inode. */ if (mask & MAY_WRITE) { - if (IS_RDONLY(inode)) - return -EROFS; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) return -EPERM; } @@ -236,7 +235,11 @@ sys_setxattr(char __user *path, char __u error = user_path_walk(path, &nd); if (error) return error; + error = mnt_want_write(nd.mnt); + if (error) + return error; error = setxattr(nd.dentry, name, value, size, flags); + mnt_drop_write(nd.mnt); path_release(&nd); return error; } @@ -251,7 +254,11 @@ sys_lsetxattr(char __user *path, char __ error = user_path_walk_link(path, &nd); if (error) return error; + error = mnt_want_write(nd.mnt); + if (error) + return error; error = setxattr(nd.dentry, name, value, size, flags); + mnt_drop_write(nd.mnt); path_release(&nd); return error; } @@ -267,9 +274,14 @@ sys_fsetxattr(int fd, char __user *name, f = fget(fd); if (!f) return error; + error = mnt_want_write(f->f_vfsmnt); + if (error) + goto out_fput; dentry = f->f_path.dentry; audit_inode(NULL, dentry->d_inode); error = setxattr(dentry, name, value, size, flags); + mnt_drop_write(f->f_vfsmnt); +out_fput: fput(f); return error; } _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/23] elevate write count for link and symlink calls
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/namei.c | 10 ++ 1 file changed, 10 insertions(+) diff -puN fs/namei.c~elevate-write-count-for-link-and-symlink-calls fs/namei.c --- lxc/fs/namei.c~elevate-write-count-for-link-and-symlink-calls 2007-07-10 12:46:09.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:09.0 -0700 @@ -2266,7 +2266,12 @@ asmlinkage long sys_symlinkat(const char if (IS_ERR(dentry)) goto out_unlock; + error = mnt_want_write(nd.mnt); + if (error) + goto out_dput; error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO); + mnt_drop_write(nd.mnt); +out_dput: dput(dentry); out_unlock: mutex_unlock(&nd.dentry->d_inode->i_mutex); @@ -2361,7 +2366,12 @@ asmlinkage long sys_linkat(int olddfd, c error = PTR_ERR(new_dentry); if (IS_ERR(new_dentry)) goto out_unlock; + error = mnt_want_write(nd.mnt); + if (error) + goto out_dput; error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry); + mnt_drop_write(nd.mnt); +out_dput: dput(new_dentry); out_unlock: mutex_unlock(&nd.dentry->d_inode->i_mutex); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/23] r/o bind mounts: stub functions
This patch adds two function mnt_want_write() and mnt_drop_write(). These are used like a lock pair around and fs operations that might cause a write to the filesystem. Before these can become useful, we must first cover each place in the VFS where writes are performed with a want/drop pair. When that is complete, we can actually introduce code that will safely check the counts before allowing r/w<->r/o transitions to occur. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/namespace.c| 46 + lxc-dave/include/linux/mount.h |3 ++ 2 files changed, 49 insertions(+) diff -puN fs/namespace.c~add-vfsmount-writer-count fs/namespace.c --- lxc/fs/namespace.c~add-vfsmount-writer-count2007-07-10 12:46:04.0 -0700 +++ lxc-dave/fs/namespace.c 2007-07-10 12:46:04.0 -0700 @@ -76,6 +76,52 @@ struct vfsmount *alloc_vfsmnt(const char return mnt; } +/* + * Most r/o checks on a fs are for operations that take + * discrete amounts of time, like a write() or unlink(). + * We must keep track of when those operations start + * (for permission checks) and when they end, so that + * we can determine when writes are able to occur to + * a filesystem. + */ +/* + * This tells the low-level filesystem that a write is + * about to be performed to it, and makes sure that + * writes are allowed before returning success. When + * the write operation is finished, mnt_drop_write() + * must be called. This is effectively a refcount. + */ +int mnt_want_write(struct vfsmount *mnt) +{ + if (__mnt_is_readonly(mnt)) + return -EROFS; + return 0; +} +EXPORT_SYMBOL_GPL(mnt_want_write); + +/* + * Tells the low-level filesystem that we are done + * performing a write to it. Must be matched with + * mnt_want_write() call above. + */ +void mnt_drop_write(struct vfsmount *mnt) +{ +} +EXPORT_SYMBOL_GPL(mnt_drop_write); + +/* + * This shouldn't be used directly ouside of the VFS. + * It does not guarantee that the filesystem will stay + * r/w, just that it is right *now*e + * mnt_want/drop_write() will _keep_ the filesystem + * r/w. + */ +int __mnt_is_readonly(struct vfsmount *mnt) +{ + return (mnt->mnt_sb->s_flags & MS_RDONLY); +} +EXPORT_SYMBOL_GPL(__mnt_is_readonly); + int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb) { mnt->mnt_sb = sb; diff -puN include/linux/mount.h~add-vfsmount-writer-count include/linux/mount.h --- lxc/include/linux/mount.h~add-vfsmount-writer-count 2007-07-10 12:46:04.0 -0700 +++ lxc-dave/include/linux/mount.h 2007-07-10 12:46:04.0 -0700 @@ -70,9 +70,12 @@ static inline struct vfsmount *mntget(st return mnt; } +extern int mnt_want_write(struct vfsmount *mnt); +extern void mnt_drop_write(struct vfsmount *mnt); extern void mntput_no_expire(struct vfsmount *mnt); extern void mnt_pin(struct vfsmount *mnt); extern void mnt_unpin(struct vfsmount *mnt); +extern int __mnt_is_readonly(struct vfsmount *mnt); static inline void mntput(struct vfsmount *mnt) { _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/23] elevate mnt writers for callers of vfs_mkdir()
Pretty self-explanatory. Fits in with the rest of the series. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/namei.c|5 + lxc-dave/fs/nfsd/nfs4recover.c |4 2 files changed, 9 insertions(+) diff -puN fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/namei.c --- lxc/fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-07-10 12:46:07.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:07.0 -0700 @@ -1993,7 +1993,12 @@ asmlinkage long sys_mkdirat(int dfd, con if (!IS_POSIXACL(nd.dentry->d_inode)) mode &= ~current->fs->umask; + error = mnt_want_write(nd.mnt); + if (error) + goto out_dput; error = vfs_mkdir(nd.dentry->d_inode, dentry, mode); + mnt_drop_write(nd.mnt); +out_dput: dput(dentry); out_unlock: mutex_unlock(&nd.dentry->d_inode->i_mutex); diff -puN fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/nfsd/nfs4recover.c --- lxc/fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-07-10 12:46:07.0 -0700 +++ lxc-dave/fs/nfsd/nfs4recover.c 2007-07-10 12:46:07.0 -0700 @@ -156,7 +156,11 @@ nfsd4_create_clid_dir(struct nfs4_client dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n"); goto out_put; } + status = mnt_want_write(rec_dir.mnt); + if (status) + goto out_put; status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, S_IRWXU); + mnt_drop_write(rec_dir.mnt); out_put: dput(dentry); out_unlock: _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/23] rearrange may_open() to be r/o friendly
may_open() calls vfs_permission() before it does checks for IS_RDONLY(inode). It checks _again_ inside of vfs_permission(). The check inside of vfs_permission() is going away eventually. With the mnt_want/drop_write() functions, all of the r/o checks (except for this one) are consistently done before calling permission(). Because of this, I'd like to use permission() to hold a debugging check to make sure that the mnt_want/drop_write() calls are actually being made. So, to do this: 1. remove the IS_RDONLY() check from permission() 2. enforce that you must mnt_want_write() before even calling permission() 3. enable a debugging in permission() We need to rearrange may_open(). Here's the patch. --- lxc-dave/fs/namei.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff -puN fs/namei.c~rearrange-permission-and-ro-checks-in-may_open fs/namei.c --- lxc/fs/namei.c~rearrange-permission-and-ro-checks-in-may_open 2007-07-10 12:46:01.0 -0700 +++ lxc-dave/fs/namei.c 2007-07-10 12:46:01.0 -0700 @@ -228,6 +228,10 @@ int permission(struct inode *inode, int { umode_t mode = inode->i_mode; int retval, submask; + struct vfsmount *mnt = NULL; + + if (nd) + mnt = nd->mnt; if (mask & MAY_WRITE) { @@ -252,7 +256,7 @@ int permission(struct inode *inode, int * the fs is mounted with the "noexec" flag. */ if ((mask & MAY_EXEC) && S_ISREG(mode) && (!(mode & S_IXUGO) || - (nd && nd->mnt && (nd->mnt->mnt_flags & MNT_NOEXEC + (mnt && (mnt->mnt_flags & MNT_NOEXEC return -EACCES; /* Ordinary permission routines do not understand MAY_APPEND. */ @@ -1546,10 +1550,6 @@ int may_open(struct nameidata *nd, int a if (S_ISDIR(inode->i_mode) && (flag & FMODE_WRITE)) return -EISDIR; - error = vfs_permission(nd, acc_mode); - if (error) - return error; - /* * FIFO's, sockets and device files are special: they don't * actually live on the filesystem itself, and as such you @@ -1564,6 +1564,10 @@ int may_open(struct nameidata *nd, int a flag &= ~O_TRUNC; } else if (IS_RDONLY(inode) && (flag & FMODE_WRITE)) return -EROFS; + + error = vfs_permission(nd, acc_mode); + if (error) + return error; /* * An append-only file must be opened in append mode for writing. */ _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/23] create cleanup helper svc_msnfs()
I'm going to be modifying nfsd_rename() shortly to support read-only bind mounts. This #ifdef is around the area I'm patching, and it starts to get really ugly if I just try to add my new code by itself. Using this little helper makes things a lot cleaner to use. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> --- lxc-dave/fs/nfsd/vfs.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff -puN fs/nfsd/vfs.c~create-svc_msnfs-helper fs/nfsd/vfs.c --- lxc/fs/nfsd/vfs.c~create-svc_msnfs-helper 2007-07-10 12:46:02.0 -0700 +++ lxc-dave/fs/nfsd/vfs.c 2007-07-10 12:46:02.0 -0700 @@ -837,6 +837,15 @@ nfsd_read_actor(read_descriptor_t *desc, return size; } +static inline int svc_msnfs(struct svc_fh *ffhp) +{ +#ifdef MSNFS + return (ffhp->fh_export->ex_flags & NFSEXP_MSNFS); +#else + return 0; +#endif +} + static __be32 nfsd_vfs_read(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, loff_t offset, struct kvec *vec, int vlen, unsigned long *count) @@ -849,11 +858,9 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st err = nfserr_perm; inode = file->f_path.dentry->d_inode; -#ifdef MSNFS - if ((fhp->fh_export->ex_flags & NFSEXP_MSNFS) && - (!lock_may_read(inode, offset, *count))) + + if (svc_msnfs(fhp) && !lock_may_read(inode, offset, *count)) goto out; -#endif /* Get readahead parameters */ ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino); _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/23] Mount writer count API (read-only bind mounts prep)
The most contentious part of the r/o bind mount patches is actually implementing the count tracking. It has NUMA and SMP implications, and is going to need to have a whole discussion on that one patch. These patches, on the other hand, simply introduce a new API: mnt_want_write() and mnt_drop_write(). They do not functionally change the kernel, just alter the way in which we check filesystems for our ability to write to them. These functions should be used in place of IS_RDONLY(inode) as they explicitly spell out when a mount is expected to _stay_ r/w instead of simply checking at a single point in time. It should take a very small number (like 3) of small patches to actually implement read-only bind mounts on top of this new API. These apply to current -git (as of July 11th, 2007). --- Why do we need r/o bind mounts? This feature allows a read-only view into a read-write filesystem. In the process of doing that, it also provides infrastructure for keeping track of the number of writers to any given mount. This has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to somefilesystems. This also replaces patches that vserver has had out of the tree for several years. It allows security enhancement by making sure that parts of your filesystem read-only (such as when you don't trust your FTP server), when you don't want to have entire new filesystems mounted, or when you want atime selectively updated. I've been using the following script to test that the feature is working as desired. It takes a directory and makes a regular bind and a r/o bind mount of it. It then performs some normal filesystem operations on the three directories, including ones that are expected to fail, like creating a file on the r/o mount. Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] nfs: disable leases over NFS
On Wed, Jul 11, 2007 at 11:20:18AM +0100, Christoph Hellwig wrote: > On Thu, Jul 05, 2007 at 11:41:00AM -0400, J. Bruce Fields wrote: > > OK, after looking at this a little more, I'm less happy about the idea > > of erroring out by default: > > > > - There are a ton of filesystems that probably should allow > > leases, and only a few (network filesystems) that shouldn't, > > so leaving leases on by default seems simpler. > > But it gets you possible wrong behaviour by default. I'm not a big > fan of non-trivial default methods as you see :) Yeah, I do understand the attraction of doing it your way. With some quick grepping, what I found was: - about 28 on-disk filesystems, all of which I assume should support leases. - about 12 filesystems that either are network filesystems (9p, afs, cifs, coda, ncpfs, nfs, nfs4, ocfs2, smbfs) or that don't have control over all opens/data modifications for whatever reason (ecryptfs, fuse, hostfs), so shouldn't be giving out leases by default. - A bunch of synthetic filesystems (proc, sysfs...). The latter category being the strongest argument for your approach, since it's sort of ludicrous to allow leases on those filesystems, but currently we do just out of laziness. (Or, in any case, I don't see any reason the current code wouldn't allow them; I haven't actually tested). > > - We already fall back on the local method by default in the case > > of locks, and I don't see a reason to treat leases differently. > > - The patch to add > > .setlease = setlease, > > to all the file_operations is going to be a big patch that > > changes behavior in a way that might be easy to miss (because > > it changes behavior exactly on those filesystems it *doesn't* > > touch.) I think it'll be easier to get better review on the > > patch that adds a method just to those filesystems that we're > > disabling leases for. > > Anyway, feel free to go ahead with the simpler version for now, I'll do > the switchover for locks and leases when I get some time. That would be great. For now I think I'll also add another simple EINVAL-returning setlease() at least to cifs (partly just as an attempt to goad Steve French into following up on a promise at OLS to look into proper lease support for cifs) But I've appended my first attempt at your suggestion for leases, in case it's of use. (Untested.) --b. From: J. Bruce Fields <[EMAIL PROTECTED]> Subject: [PATCH] Turn off support for fcntl leases by default A lease enforces mutual exclusion with conflicting opens. On filesystems such as network filesystems where not all opens happen under the control of this kernel, the default setlease() operation, which can only check for local conflicts, is incorrect. So in most cases the correct thing is probably to disable leases for those filesystems until they can implement something more sophisticated. The only users of leases that I'm aware of (samba and nfsd) are actually using them primarly to get synchronous notification of changes to files so that they can update their caches. So, more generally, we should disable leases for any filesystem which might allow file contents to change without a local open for write occuring. That includes most of the synthetic filesystems (like proc), which the file servers probably don't want to export anyway. Previously we explicitly disabled leases for some network filesystems, but with this patch we disable by default and add an explicit setlease method for those filesystems on which leases will be allowed. Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]> --- fs/adfs/file.c |1 + fs/affs/file.c |1 + fs/bfs/file.c |1 + fs/ext2/file.c |2 ++ fs/ext3/file.c |1 + fs/ext4/file.c |1 + fs/fat/file.c |1 + fs/hfs/inode.c |1 + fs/hfsplus/inode.c |1 + fs/hppfs/hppfs_kern.c |1 + fs/jffs2/file.c |1 + fs/jfs/file.c |1 + fs/locks.c |8 fs/minix/file.c |1 + fs/nfs/file.c | 12 fs/ntfs/file.c |1 + fs/qnx4/file.c |1 + fs/ramfs/file-mmu.c |1 + fs/ramfs/file-nommu.c |1 + fs/read_write.c |1 + fs/reiserfs/file.c |1 + fs/sysv/file.c |1 + fs/udf/file.c |1 + fs/ufs/file.c |1 + fs/xfs/linux-2.6/xfs_file.c |2 ++ 25 files changed, 29 insertions(+), 16 deletions(-) diff --git a/fs/adfs/file.c b/fs/adfs/file.c index f544a28..4e99861 100644 --- a/fs/adfs/file.c +++ b/fs/adfs/file.c @@ -34,6 +34,7 @@ const struct file_operations adfs_file_operations = { .write = do_sync_write,
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[EMAIL PROTECTED] wrote: On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said: All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-) OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there a "I'm a high-end array" bit in the sense data that I'm unaware of? There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI has similar queries) to see what kind of device you are talking to. I am not sure it is worth the trouble to do any automatic detection/handling of this. In this specific case, it is more a case of when you attach a high end (or mid-tier) device to a server, you should configure it without barriers for its exported LUNs. ric - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, Jul 11, 2007 at 09:28:06AM -0500, Dave Kleikamp wrote: > On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote: > > It just occurred to me: > > > > If i_version is 64bit, then knfsd would need to be careful when > > reading it on a 32bit host. What are the locking rules? > > How does knfsd use i_version? I would think that if all it was doing > was to compare (i_version == previous_version) That's correct. (Though it's the client that's doing the comparison, actually--the server is just reporting the value.) > then locking wouldn't really matter. Well, theoretically, > previous_version could be 0x1, and i_version could be > 0x1, knfsd checks the high word, then ext4 updates i_version > to 0x2, then knfsd checks the low word, detecting no change. > How likely is this? The choice of upper word in your example is arbitrary, but other than that I believe your example is essentially the only one. So this would only happen when *both* - the read of the new value of the low word happens precisely 2^32 i_version updates after the word was read on the client's previous cache revalidation, and - the value of i_version itself is close enough to a 32-bit boundary that wraparound can happen between the reads of the high and low words. > (I don't understand why i_version even needs to be 64 bits in the > first place.) A 32-bit i_version could in theory wrap pretty quickly, couldn't it? That's not a problem in itself--the problem would only arise if two subsequent client queries of the change attribute happened a multiple of 2^32 i_version increments apart. This is more likely than the previous scenario, but still very unlikely. I would have guessed that even in situations with a very high rate of updates and a low rate of client revalidations, the chance of two revalidations happening exactly 2^32 updates apart would still be no more than 1 in 2^32. (Could odd characteristics of the workloads (like updates that tend to happen in power-of-2 groups?) make it any more likely?) I'd be happier if ext4 at least allowed the possibility of 64 bits in the future. And there's always the chance someone would find a use for an i_version that was nondecreasing, even if nfs didn't care. > > Presumably it is only updated under i_mutex protection, but having to > > get i_mutex to read it would seem a little heavy handed. > > How does knfsd protect itself from the inode changing after i_version is > checked? Is any locking being done otherwise? If the client always requests the change attribute before reading, and the i_version is always updated after data is modified, I think we're OK. Admittedly this is a little subtle. --b. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Wed, 2007-07-11 at 10:34 -0700, Andrew Morton wrote: > On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote: > > > On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: > > > > err = ext4_reserve_inode_write(handle, inode, &iloc); > > > > + if (EXT4_I(inode)->i_extra_isize < > > > > + EXT4_SB(inode->i_sb)->s_want_extra_isize && > > > > + !(EXT4_I(inode)->i_state & EXT4_STATE_NO_EXPAND)) { > > > > + /* We need extra buffer credits since we may write into > > > > EA block > > > > +* with this same handle */ > > > > + if ((jbd2_journal_extend(handle, > > > > +EXT4_DATA_TRANS_BLOCKS(inode->i_sb))) == > > > > 0) { > > > > + ret = ext4_expand_extra_isize(inode, > > > > + > > > > EXT4_SB(inode->i_sb)->s_want_extra_isize, > > > > + iloc, handle); > > > > + if (ret) { > > > > + EXT4_I(inode)->i_state |= > > > > EXT4_STATE_NO_EXPAND; > > > > + if (!expand_message) { > > > > + ext4_warning(inode->i_sb, > > > > __FUNCTION__, > > > > + "Unable to expand inode %lu. > > > > Delete" > > > > + " some EAs or run e2fsck.", > > > > + inode->i_ino); > > > > + expand_message = 1; > > > > + } > > > > + } > > > > + } > > > > + } > > > > > > Maybe that message should come out once per mount rather than once per > > > boot > > > (or once per modprobe)? > > > > Probably true. > > > > > What are the consequences of a jbd2_journal_extend() failure in here? > > > > Not fatal, just like every user of i_extra_isize. If the inode isn't a > > large inode, or it can't be expanded then there will be a minor loss of > > functionality on that inode. If the i_extra_isize is critical, then > > the sysadmin will have run e2fsck to force s_min_extra_isize large enough. > > > > Note that this is only applicable for filesystems which are upgraded. For > > new inodes (i.e. all inodes that exist in the filesystem if it was always > > run on a kernel with the currently understood extra fields) then this will > > never be invoked (until such a time new extra fields are added). > > I'd suggest that we get a comment in the code explaining this: this > unchecked error does rather stand out. > > > > > + if (EXT4_I(inode)->i_file_acl) { > > > > + bh = sb_bread(inode->i_sb, EXT4_I(inode)->i_file_acl); > > > > + error = -EIO; > > > > + if (!bh) > > > > + goto cleanup; > > > > + if (ext4_xattr_check_block(bh)) { > > > > + ext4_error(inode->i_sb, __FUNCTION__, > > > > + "inode %lu: bad block %llu", > > > > inode->i_ino, > > > > + EXT4_I(inode)->i_file_acl); > > > > + error = -EIO; > > > > + goto cleanup; > > > > + } > > > > + base = BHDR(bh); > > > > + first = BFIRST(bh); > > > > + end = bh->b_data + bh->b_size; > > > > + min_offs = end - base; > > > > + free = ext4_xattr_free_space(first, &min_offs, base, > > > > +&total_blk); > > > > + if (free < new_extra_isize) { > > > > + if (!tried_min_extra_isize && > > > > s_min_extra_isize) { > > > > + tried_min_extra_isize++; > > > > + new_extra_isize = s_min_extra_isize; > > > > > > Aren't we missing a brelse(bh) here? > > > > Seems likely, yes. > > OK - could we get a positive ack from someone indicating that this will get > looked at? Because I am about to forget about it. I will send a patch to add the comments and make the suggested corrections. Thanks, Kalpak. > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs
The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but create_proc_entry() does not do lookups on file names that are more that one directory deep. This causes the entry creation to fail and hence, no proc file is created. Instead of fixing this on procfs might as well move the jbd2-debug file to debugfs which would be the preferred location for this kind of tunable. The new location is now /sys/kernel/debug/jbd2/jbd2-debug. Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]> --- fs/Kconfig | 105 + 5 - 0 ! fs/jbd2/journal.c| 6727 +40 -0 ! include/linux/jbd2.h |21 + 1 - 0 ! 3 files changed, 33 insertions(+), 46 deletions(-) Index: linux-2.6/fs/jbd2/journal.c === --- linux-2.6.orig/fs/jbd2/journal.c2007-07-11 09:46:25.0 -0500 +++ linux-2.6/fs/jbd2/journal.c 2007-07-11 11:31:30.0 -0500 @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -1951,64 +1952,50 @@ void jbd2_journal_put_journal_head(struc } /* - * /proc tunables + * debugfs tunables */ #if defined(CONFIG_JBD2_DEBUG) -int jbd2_journal_enable_debug; +u8 jbd2_journal_enable_debug; EXPORT_SYMBOL(jbd2_journal_enable_debug); #endif -#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_PROC_FS) +#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_DEBUG_FS) -static struct proc_dir_entry *proc_jbd_debug; +#define JBD2_DEBUG_NAME "jbd2-debug" -static int read_jbd_debug(char *page, char **start, off_t off, - int count, int *eof, void *data) -{ - int ret; +struct dentry *jbd2_debugfs_dir, *jbd2_debug; - ret = sprintf(page + off, "%d\n", jbd2_journal_enable_debug); - *eof = 1; - return ret; +static void __init jbd2_create_debugfs_entry(void) +{ + jbd2_debugfs_dir = debugfs_create_dir("jbd2", NULL); + if (jbd2_debugfs_dir) + jbd2_debug = debugfs_create_u8(JBD2_DEBUG_NAME, S_IRUGO, + jbd2_debugfs_dir, + &jbd2_journal_enable_debug); } -static int write_jbd_debug(struct file *file, const char __user *buffer, - unsigned long count, void *data) +static void __exit jbd2_remove_debugfs_entry(void) { - char buf[32]; - - if (count > ARRAY_SIZE(buf) - 1) - count = ARRAY_SIZE(buf) - 1; - if (copy_from_user(buf, buffer, count)) - return -EFAULT; - buf[ARRAY_SIZE(buf) - 1] = '\0'; - jbd2_journal_enable_debug = simple_strtoul(buf, NULL, 10); - return count; + if (jbd2_debug) + debugfs_remove(jbd2_debug); + if (jbd2_debugfs_dir) + debugfs_remove(jbd2_debugfs_dir); } -#define JBD_PROC_NAME "sys/fs/jbd2-debug" +#else -static void __init create_jbd_proc_entry(void) +static void __init jbd2_create_debugfs_entry(void) { - proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL); - if (proc_jbd_debug) { - /* Why is this so hard? */ - proc_jbd_debug->read_proc = read_jbd_debug; - proc_jbd_debug->write_proc = write_jbd_debug; - } + do { + } while (0); } -static void __exit jbd2_remove_jbd_proc_entry(void) +static void __exit jbd2_remove_debugfs_entry(void) { - if (proc_jbd_debug) - remove_proc_entry(JBD_PROC_NAME, NULL); + do { + } while (0); } -#else - -#define create_jbd_proc_entry() do {} while (0) -#define jbd2_remove_jbd_proc_entry() do {} while (0) - #endif struct kmem_cache *jbd2_handle_cache; @@ -2067,7 +2054,7 @@ static int __init journal_init(void) ret = journal_init_caches(); if (ret != 0) jbd2_journal_destroy_caches(); - create_jbd_proc_entry(); + jbd2_create_debugfs_entry(); return ret; } @@ -2078,7 +2065,7 @@ static void __exit journal_exit(void) if (n) printk(KERN_EMERG "JBD: leaked %d journal_heads!\n", n); #endif - jbd2_remove_jbd_proc_entry(); + jbd2_remove_debugfs_entry(); jbd2_journal_destroy_caches(); } Index: linux-2.6/include/linux/jbd2.h === --- linux-2.6.orig/include/linux/jbd2.h 2007-07-11 09:46:25.0 -0500 +++ linux-2.6/include/linux/jbd2.h 2007-07-11 10:37:06.0 -0500 @@ -57,7 +57,7 @@ * CONFIG_JBD2_DEBUG is on. */ #define JBD_EXPENSIVE_CHECKING -extern int jbd2_journal_enable_debug; +extern u8 jbd2_journal_enable_debug; #define jbd_debug(n, f, a...) \ do {\ Index: linux-2.6/fs/Kconfig === --- linux-2.6.orig/fs/Kconfig 2007-06-22 09:45:51.0 -0500 +++ linux-2.6/fs/Kc
Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for >16TB ext4 fs
Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more than 32bit block sizes during mount time. This ensure proper record lenth when writing to the journal. Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]> --- fs/ext4/super.c |7 7 + 0 - 0 ! 1 file changed, 7 insertions(+) Index: linux-2.6/fs/ext4/super.c === --- linux-2.6.orig/fs/ext4/super.c 2007-07-11 09:36:00.0 -0500 +++ linux-2.6/fs/ext4/super.c 2007-07-11 09:36:20.0 -0500 @@ -1804,6 +1804,13 @@ static int ext4_fill_super (struct super goto failed_mount3; } + if (ext4_blocks_count(es) > 0xULL && + !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0, + JBD2_FEATURE_INCOMPAT_64BIT)) { + printk(KERN_ERR "ext4: Failed to set 64-bit journal feature\n"); + goto failed_mount4; + } + /* We have now updated the journal if required, so we can * validate the data journaling mode. */ switch (test_opt(sb, DATA_FLAGS)) { - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote: > On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: > > > err = ext4_reserve_inode_write(handle, inode, &iloc); > > > + if (EXT4_I(inode)->i_extra_isize < > > > + EXT4_SB(inode->i_sb)->s_want_extra_isize && > > > + !(EXT4_I(inode)->i_state & EXT4_STATE_NO_EXPAND)) { > > > + /* We need extra buffer credits since we may write into EA block > > > + * with this same handle */ > > > + if ((jbd2_journal_extend(handle, > > > + EXT4_DATA_TRANS_BLOCKS(inode->i_sb))) == 0) { > > > + ret = ext4_expand_extra_isize(inode, > > > + EXT4_SB(inode->i_sb)->s_want_extra_isize, > > > + iloc, handle); > > > + if (ret) { > > > + EXT4_I(inode)->i_state |= EXT4_STATE_NO_EXPAND; > > > + if (!expand_message) { > > > + ext4_warning(inode->i_sb, __FUNCTION__, > > > + "Unable to expand inode %lu. Delete" > > > + " some EAs or run e2fsck.", > > > + inode->i_ino); > > > + expand_message = 1; > > > + } > > > + } > > > + } > > > + } > > > > Maybe that message should come out once per mount rather than once per boot > > (or once per modprobe)? > > Probably true. > > > What are the consequences of a jbd2_journal_extend() failure in here? > > Not fatal, just like every user of i_extra_isize. If the inode isn't a > large inode, or it can't be expanded then there will be a minor loss of > functionality on that inode. If the i_extra_isize is critical, then > the sysadmin will have run e2fsck to force s_min_extra_isize large enough. > > Note that this is only applicable for filesystems which are upgraded. For > new inodes (i.e. all inodes that exist in the filesystem if it was always > run on a kernel with the currently understood extra fields) then this will > never be invoked (until such a time new extra fields are added). I'd suggest that we get a comment in the code explaining this: this unchecked error does rather stand out. > > > + if (EXT4_I(inode)->i_file_acl) { > > > + bh = sb_bread(inode->i_sb, EXT4_I(inode)->i_file_acl); > > > + error = -EIO; > > > + if (!bh) > > > + goto cleanup; > > > + if (ext4_xattr_check_block(bh)) { > > > + ext4_error(inode->i_sb, __FUNCTION__, > > > + "inode %lu: bad block %llu", inode->i_ino, > > > + EXT4_I(inode)->i_file_acl); > > > + error = -EIO; > > > + goto cleanup; > > > + } > > > + base = BHDR(bh); > > > + first = BFIRST(bh); > > > + end = bh->b_data + bh->b_size; > > > + min_offs = end - base; > > > + free = ext4_xattr_free_space(first, &min_offs, base, > > > + &total_blk); > > > + if (free < new_extra_isize) { > > > + if (!tried_min_extra_isize && s_min_extra_isize) { > > > + tried_min_extra_isize++; > > > + new_extra_isize = s_min_extra_isize; > > > > Aren't we missing a brelse(bh) here? > > Seems likely, yes. OK - could we get a positive ack from someone indicating that this will get looked at? Because I am about to forget about it. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Wed, 11 Jul 2007 07:01:08 -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote: > > > > - /* AKPM: buglet - add `i' to tmp! */ > > > > > > Damn. After, what, seven years, someone actually fixed it? > > > > > > > for (i = 0; i < bh->b_size; i += 512) { > > > > - journal_header_t *tmp = (journal_header_t*)bh->b_data; > > > > + struct commit_header *tmp = > > > > + (struct commit_header *)(bh->b_data + i); > > > > tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); > > > > tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); > > > > tmp->h_sequence = > > > > cpu_to_be32(commit_transaction->t_tid); > > > > + > > > > + if (JBD2_HAS_COMPAT_FEATURE(journal, > > > > + > > > > JBD2_FEATURE_COMPAT_CHECKSUM)) { > > > > + tmp->h_chksum_type = JBD2_CRC32_CHKSUM; > > > > + tmp->h_chksum_size = > > > > JBD2_CRC32_CHKSUM_SIZE; > > > > + tmp->h_chksum[0]= > > > > cpu_to_be32(crc32_sum); > > > > + } > > > > } > > > > > > And in doing so, changed the on-disk format of the journal commit blocks. > > > > > > Surely this was worth a mention in the changelog, if not a standalone > > > patch? > > > > > > I don't think this is worth doing, really. Why not just leave the format > > > as it was, take the loop out and run this code once rather than eight > > > times? > > Well, we aren't using the rest of the commit block in any case. I think > the original intention was that we'd get 8 copies of the commit block so > we would be sure to get a good one. > > I don't know whether we'd rather have 8 copies of the commit block, or > more potential to expand the commit block? I don't personally have any > preference, since the checksum should be a more robust way of checking > validity than having multiple copies, so we may as well remove the loop > and stick with a single copy for now. We've never altered any commit block sectors apart from the zeroeth one (eight times) due to the above bug. So I'd suggest that we should formalise the old bug and leave the format as-is. That'll leave lots of space spare in the commit block. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, Jul 11, 2007 at 01:21:55PM +1000, Neil Brown wrote: > And just by-the-way, the server doesn't really have the option of not > sending the attribute. If i_version isn't defined, it has to fake > something using mtime, and hope that is good enough. ctime, actually--the change attribute is also supposed to be updated on attribute updates. > Alternately we could mandate that i_version is always kept up-to-date > and if a filesystem doesn't have anything to load from storage, it > just sets it to the current time in nanoseconds. > > That would mean that a client would need to flush it's cache whenever > the inode fell out of cache on the server, but I don't think we can > reliably do better than that. > > I think I like that approach. > > So my vote is to increment i_version in common code every time any > change is made to the file, and alloc_inode should initialise it to > current time, which might be changed by the filesystem before it calls > unlock_new_inode. So the client would be invalidating its cache more often than necessary, rather than failing to invalidate it when it should. I agree that that's probably the better tradeoff, although I wish I had a better idea of the downside. I don't know, for example, whether users might see unpleasant results if every client has to reread its cached data on a reboot. The currently proposed change--just providing a model change attribute implementation for ext4 and leaving other filesystems untouched--is a more conservative step. So I'm inclined to just do this ext4 thing first, and then look into further change attribute experiments next time around --b. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: -mm merge plans for 2.6.23
On Wed, 11 Jul 2007 10:21:03 -0700 Andrew Morton wrote: > On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse <[EMAIL PROTECTED]> wrote: > > > On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: > > > On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > > > > romfs-printk-format-warnings.patch > > > > > > NACK on this one. > > > > The rest of it is nacked anyway, until we unify the point and > > get_unmapped_area methods of the MTD API. > > > > Methinks you meant > nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not > romfs-printk-format-warnings.patch. > > I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thamks. Thanks. I was certainly getting confused. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: -mm merge plans for 2.6.23
On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse <[EMAIL PROTECTED]> wrote: > On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: > > On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > > > romfs-printk-format-warnings.patch > > > > NACK on this one. > > The rest of it is nacked anyway, until we unify the point and > get_unmapped_area methods of the MTD API. > Methinks you meant nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not romfs-printk-format-warnings.patch. I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thamks. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Tue, Jul 10, 2007 at 08:19:16PM -0400, Mingming Cao wrote: > On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote: > > And how does the NFS server know that the filesystem implements i_version? > > Will a zero-value of i_version have special significance, telling the > > server to not send this attribute, perhaps? > > Bruce raised up this question a few days back when he reviewed this > patch, I think the solution is add a superblock flag for fs support > inode versioning, probably at VFS layer? Sounds fine. As long as it's something that's standard across filesystems, then I can just do something like if (sb->s_flags & MS_CHANGEATTR) /* return i_version to client as change attribute */ else /* return ctime to client as change attribute */ --b. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates
On Tue, Jul 10, 2007 at 04:31:44PM -0700, Andrew Morton wrote: > On Sun, 01 Jul 2007 03:37:53 -0400 > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > Add a "noversion" mount option to disable inode version updates. > > Why is this option being offered to our users? To reduce disk traffic, > like noatime? > > If so, what are the implications of this? What would the user lose? This has been removed in the latest patch set; it's needed only for Lustre, because they set the version field themselves. Lustre needs the inode version to be globally monotonically increasing, so it can order updates between two different files, so it does this itself. NFSv4 only uses i_version to detect changes, and so there's no need to use a global atomic counter for i_version. So the thinking was that there was no point doing the global atomic counter if it was not necessary. Since "noversion" is Lustre specific, we've dropped that from the list of patches that we'll push, and so the inode version will only have local per-inode significance, and not have any global ordering properties. We have not actually benchmarked whether or not doing the global ordering actually *matters* in terms of being actually noticeable. If it isn't noticeable, I wouldn't mind changing things so that we always make i_version globally significant (without a mount option), and make life a bit easier for the Lustre folks. Or if someone other distributed filesystem requests a globally significant i_version. But we can cross that bridge when we get to it - Ted - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree
On Wed, 2007-07-11 at 11:24 +0100, Christoph Hellwig wrote: > On Sat, Jun 30, 2007 at 08:56:02AM -0400, H. Peter Anvin wrote: > > Is that conjecture, or do you have evidence to that effect? Most users > > of this file are using it via the glibc interfaces, and there probably > > aren't all that many users of it in the first place. > > I have written parsers for personal projects that might not have been > happy to deal with additional fields myself for example.. I modified the patch to add fields towards the end of each line. i.e after 'freq, passno' fields. And symlinked /etc/mtab to /proc/mounts. mount,df and friends were all perfectly happy. I imagine your script may also be happy with the additional fields **towards the end**. I would like to avoid one more mount interface if we can help it. RP - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote: > It just occurred to me: > > If i_version is 64bit, then knfsd would need to be careful when > reading it on a 32bit host. What are the locking rules? How does knfsd use i_version? I would think that if all it was doing was to compare (i_version == previous_version), then locking wouldn't really matter. Well, theoretically, previous_version could be 0x1, and i_version could be 0x1, knfsd checks the high word, then ext4 updates i_version to 0x2, then knfsd checks the low word, detecting no change. How likely is this? (I don't understand why i_version even needs to be 64 bits in the first place.) > Presumably it is only updated under i_mutex protection, but having to > get i_mutex to read it would seem a little heavy handed. How does knfsd protect itself from the inode changing after i_version is checked? Is any locking being done otherwise? > Should it use a seqlock like i_size? > Could we use the same seqlock that i_size uses, or would we need a > separate one? > > NeilBrown -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Jul 11, 2007 17:16 +0530, Girish Shilamkar wrote: > > > + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) { > > > + jbd2_journal_set_features(sbi->s_journal, > > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0, > > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); > > > + } else if (test_opt(sb, JOURNAL_CHECKSUM)) { > > > + jbd2_journal_set_features(sbi->s_journal, > > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0); > > > + jbd2_journal_clear_features(sbi->s_journal, 0, 0, > > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); > > > + } else { > > > + jbd2_journal_clear_features(sbi->s_journal, > > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0, > > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); > > > + } > > > > Some discussion of the forward- and backward- compatibility design would be > > appropriate. Also a description of whether and how fsck can be used to fix > > up these feature flags. It is forward & backward compatible to enable COMPAT_CHECKSUM. That just means the commit blocks will have checksums in them, but older kernels will just ignore them. Hmm, I suppose there might be an issue with "upgrade, downgrade, upgrade" in that the commit blocks would not have checksums even though the superblock says they will... Does that mean we should accept a checksum == 0 as being valid (which is not very nice, given that 0 is an oft-hit bad value), or that we need a flag in every commit block which indicates if it actually has a checksum? The INCOMPAT_ASYNC_COMMIT can't be handled safely by older kernels, since they would assume "commit block == complete transaction", which isn't true if the commit block didn't wait for the rest of the blocks to make it to the disk. I don't think e2fsck can be used to individually clean up the feature flags, but it is always possible to remove and recreate the journal... > > > - /* AKPM: buglet - add `i' to tmp! */ > > > > Damn. After, what, seven years, someone actually fixed it? > > > > > for (i = 0; i < bh->b_size; i += 512) { > > > - journal_header_t *tmp = (journal_header_t*)bh->b_data; > > > + struct commit_header *tmp = > > > + (struct commit_header *)(bh->b_data + i); > > > tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER); > > > tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK); > > > tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid); > > > + > > > + if (JBD2_HAS_COMPAT_FEATURE(journal, > > > + JBD2_FEATURE_COMPAT_CHECKSUM)) { > > > + tmp->h_chksum_type = JBD2_CRC32_CHKSUM; > > > + tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE; > > > + tmp->h_chksum[0]= cpu_to_be32(crc32_sum); > > > + } > > > } > > > > And in doing so, changed the on-disk format of the journal commit blocks. > > > > Surely this was worth a mention in the changelog, if not a standalone patch? > > > > I don't think this is worth doing, really. Why not just leave the format > > as it was, take the loop out and run this code once rather than eight > > times? Well, we aren't using the rest of the commit block in any case. I think the original intention was that we'd get 8 copies of the commit block so we would be sure to get a good one. I don't know whether we'd rather have 8 copies of the commit block, or more potential to expand the commit block? I don't personally have any preference, since the checksum should be a more robust way of checking validity than having multiple copies, so we may as well remove the loop and stick with a single copy for now. > > > @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journa > > > unsigned intsequence; > > > int blocktype; > > > int tag_bytes = journal_tag_bytes(journal); > > > + __u32 crc32_sum = ~0; /* Transactional Checksums */ > > > > We normally use __u32 for visible-to-userspace stuff. Kernel code would > > use plain old u32. > Ok. Since the checksum is saved to disk, it seems more appropriate to use __u32 or maybe even __be32, though I'm not sure if the crc32 functions do that correctly or not. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
On Jul 10, 2007 22:40 -0700, Andrew Morton wrote: > On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if > > the subdir count for any directory crosses 65000. > > Would I be correct in assuming that a later fsck will clear > EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any >65000 subdir > directories? Correct. > If so, that is worth a mention in the changelog, perhaps? > > Please remind us what is the behaviour of an RO_COMPAT flag? It means that > old ext4, ext3 and ext2 can only mount this fs read-only, yes? Also correct. The COMPAT flag behaviour is described in detail in Documentation/filesystems/ext[234].txt > > +static inline void ext4_inc_count(handle_t *handle, struct inode *inode) > > +{ > > + inc_nlink(inode); > > + if (is_dx(inode) && inode->i_nlink > 1) { > > + /* limit is 16-bit i_links_count */ > > + if (inode->i_nlink >= EXT4_LINK_MAX || inode->i_nlink == 2) { > > + inode->i_nlink = 1; > > + EXT4_SET_RO_COMPAT_FEATURE(inode->i_sb, > > + EXT4_FEATURE_RO_COMPAT_DIR_NLINK); > > + } > > + } > > +} > > Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2? Because that means it was previously 1 (inc_nlink() was already called). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 8][PATCH 1/1]Add journal checksums
On Tue, 2007-07-10 at 23:16 -0700, Andrew Morton wrote: > On Sun, 01 Jul 2007 03:38:25 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote: > > > Journal checksum feature has been added to detect corruption of journal. > > That was brief. No description of what it does, how it does it, why it > does it, how one operates it, why (or why not) one would choose to enable > it nor of the costs of enabling it. Somehow the description was lost due to the way the original patchset was sent to ext4 list. Here is the original description: The journal checksum feature adds two new flags i.e JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM. JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the checksum for the blocks described by the descriptor blocks. Due to checksums, writing of the commit record no longer needs to be synchronous. Now commit record can be sent to disk without waiting for descriptor blocks to be written to disk. This behavior is controlled using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be able to recover the journal with _ASYNC_COMMIT hence it is made incompat. The commit header has been extended to hold the checksum along with the type of the checksum. For recovery in pass scan checksums are verified to ensure the sanity and completeness(in case of _ASYNC_COMMIT) of every transaction. > > > Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]> > > Signed-off-by: Girish Shilamkar <[EMAIL PROTECTED]> > > Signed-off-by: Dave Kleikamp <[EMAIL PROTECTED]> > > > > diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c > > --- linux024/fs/ext4/super.c2007-06-25 16:19:24.0 -0500 > > +++ linux/fs/ext4/super.c 2007-06-26 08:35:16.0 -0500 > > @@ -721,6 +721,7 @@ enum { > > Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, > > Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh, > > Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, > > + Opt_journal_checksum, Opt_journal_async_commit, > > A new user-visible feature should be accompanied by an update to > Documentation/filesystems/ext4.txt? Ok. > > Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, > > Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, > > Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, > > @@ -760,6 +761,8 @@ static match_table_t tokens = { > > {Opt_journal_update, "journal=update"}, > > {Opt_journal_inum, "journal=%u"}, > > {Opt_journal_dev, "journal_dev=%u"}, > > + {Opt_journal_checksum, "journal_checksum"}, > > + {Opt_journal_async_commit, "journal_async_commit"}, > > What's journal_async_commit? I hope the description has answered this question. > > {Opt_abort, "abort"}, > > {Opt_data_journal, "data=journal"}, > > {Opt_data_ordered, "data=ordered"}, > > @@ -948,6 +951,13 @@ static int parse_options (char *options, > > return 0; > > *journal_devnum = option; > > break; > > + case Opt_journal_checksum: > > + set_opt (sbi->s_mount_opt, JOURNAL_CHECKSUM); > > + break; > > + case Opt_journal_async_commit: > > + set_opt (sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT); > > + set_opt (sbi->s_mount_opt, JOURNAL_CHECKSUM); > > + break; > > case Opt_noload: > > set_opt (sbi->s_mount_opt, NOLOAD); > > break; > > @@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super > > goto failed_mount4; > > } > > > > + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) { > > + jbd2_journal_set_features(sbi->s_journal, > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0, > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); > > + } else if (test_opt(sb, JOURNAL_CHECKSUM)) { > > + jbd2_journal_set_features(sbi->s_journal, > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0); > > + jbd2_journal_clear_features(sbi->s_journal, 0, 0, > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); > > + } else { > > + jbd2_journal_clear_features(sbi->s_journal, > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0, > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); > > + } > > Some discussion of the forward- and backward- compatibility design would be > appropriate. Also a description of whether and how fsck can be used to fix > up these feature flags. > > /* We have now updated the journal if required, so we can > > * validate the data journaling mode. */ > > switch (test_opt(sb, DATA_FLAGS)) { > > diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c > > --- linux024/fs/jbd2/commit.c 2007-06-25 16:19:25.0 -0500 > > +++ linux/fs/jbd2/commit.c 2007-06-26 08:40:03
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
On Jul 10, 2007 16:32 -0700, Andrew Morton wrote: > > err = ext4_reserve_inode_write(handle, inode, &iloc); > > + if (EXT4_I(inode)->i_extra_isize < > > + EXT4_SB(inode->i_sb)->s_want_extra_isize && > > + !(EXT4_I(inode)->i_state & EXT4_STATE_NO_EXPAND)) { > > + /* We need extra buffer credits since we may write into EA block > > +* with this same handle */ > > + if ((jbd2_journal_extend(handle, > > +EXT4_DATA_TRANS_BLOCKS(inode->i_sb))) == 0) { > > + ret = ext4_expand_extra_isize(inode, > > + EXT4_SB(inode->i_sb)->s_want_extra_isize, > > + iloc, handle); > > + if (ret) { > > + EXT4_I(inode)->i_state |= EXT4_STATE_NO_EXPAND; > > + if (!expand_message) { > > + ext4_warning(inode->i_sb, __FUNCTION__, > > + "Unable to expand inode %lu. Delete" > > + " some EAs or run e2fsck.", > > + inode->i_ino); > > + expand_message = 1; > > + } > > + } > > + } > > + } > > Maybe that message should come out once per mount rather than once per boot > (or once per modprobe)? Probably true. > What are the consequences of a jbd2_journal_extend() failure in here? Not fatal, just like every user of i_extra_isize. If the inode isn't a large inode, or it can't be expanded then there will be a minor loss of functionality on that inode. If the i_extra_isize is critical, then the sysadmin will have run e2fsck to force s_min_extra_isize large enough. Note that this is only applicable for filesystems which are upgraded. For new inodes (i.e. all inodes that exist in the filesystem if it was always run on a kernel with the currently understood extra fields) then this will never be invoked (until such a time new extra fields are added). > > +static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry, > > +int value_offs_shift, void *to, > > +void *from, size_t n, int blocksize) > > +{ > > + /* Adjust the value offsets of the entries */ > > + for (; !IS_LAST_ENTRY(last); last = EXT4_XATTR_NEXT(last)) { > > + if (!last->e_value_block && last->e_value_size) { > > + new_offs = le16_to_cpu(last->e_value_offs) + > > + value_offs_shift; > > + BUG_ON(new_offs + le32_to_cpu(last->e_value_size) > > +> blocksize); > > + last->e_value_offs = cpu_to_le16(new_offs); > > + } > > + } > > + /* Shift the entries by n bytes */ > > + memmove(to, from, n); > > +} > > Under what circumstances will that new BUG_ON trigger? Can it be triggered > via incorrect disk contents? If so, it should not be there. Only under code defect or memory corruption. The needed extra space was already checked in ext4_expand_extra_isize_ea(), but I asked for this BUG_ON() because it would otherwise seem that there was a chance from reading just the above code that the shifted EA might overflow the buffer. > > +int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize, > > + struct ext4_inode *raw_inode, handle_t *handle) > > +{ > > + down_write(&EXT4_I(inode)->xattr_sem); > > +retry: > > + if (EXT4_I(inode)->i_extra_isize >= new_extra_isize) { > > + up_write(&EXT4_I(inode)->xattr_sem); > > + return 0; > > + } > > So.. xattr_sem is a lock which protects i_extra_isize? That seems a bit > strange to me - I'd have expected i_mutex. Well, since this is the only code that ever changes i_extra_isize, and it also needs to move the EAs around, it makes sense to use the EA lock for it. This is also a relatively low-contention lock, and it needs to be held to access any of the EAs (which are the only things beyond i_extra_isize). > > + if (EXT4_I(inode)->i_file_acl) { > > + bh = sb_bread(inode->i_sb, EXT4_I(inode)->i_file_acl); > > + error = -EIO; > > + if (!bh) > > + goto cleanup; > > + if (ext4_xattr_check_block(bh)) { > > + ext4_error(inode->i_sb, __FUNCTION__, > > + "inode %lu: bad block %llu", inode->i_ino, > > + EXT4_I(inode)->i_file_acl); > > + error = -EIO; > > + goto cleanup; > > + } > > + base = BHDR(bh); > > + first = BFIRST(bh); > > + end = bh->b_data + bh->b_size; > > + min_offs = end - base; > > + free = ext4_xattr_free_space(first, &min_offs, base, > > +&total_blk); > > +
Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates
On Wed, Jul 11, 2007 at 05:57:17AM -0600, Andreas Dilger wrote: > Ah, this is the patch to disable i_version updates for Lustre. I don't > think any normal user would use this mount option, so I don't know if > there is a need to document it. This is a reason to not merge it at all. If the only user of this is the out of tree lustre code there is no need to put this in. I should rather stay in clusterfs' patchkit. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Wed, Jul 11, 2007 at 05:52:24AM -0600, Andreas Dilger wrote: > On Jul 11, 2007 09:47 +0100, Christoph Hellwig wrote: > > On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: > > > This patch is on top of i_version_update_vfs. > > > The i_version field of the inode is set on inode creation and incremented > > > when the inode is being modified. > > > > Which is not what i_version is supposed to do. It'll get you tons of misses > > for NFSv3 filehandles that rely on the generation staying the same for the > > same file. Please add a new field for the NFSv4 sequence counter instead > > of making i_version unuseable. > > You are confusing i_generation (the instance of this inode number) with > i_version (whether this file has been modified)? Yes, sorry. Objection dropped. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
fallocate, Re: -mm merge plans for 2.6.23
> fallocate-implementation-on-i86-x86_64-and-powerpc.patch > fallocate-on-s390.patch > fallocate-on-ia64.patch > fallocate-on-ia64-fix.patch > > Merge. Hopefull this will be done during the 2.6.23 merge window, but right now it's not (yet). - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates
On Jul 10, 2007 16:31 -0700, Andrew Morton wrote: > On Sun, 01 Jul 2007 03:37:53 -0400 > Mingming Cao <[EMAIL PROTECTED]> wrote: > > Add a "noversion" mount option to disable inode version updates. > > Why is this option being offered to our users? To reduce disk traffic, > like noatime? > > If so, what are the implications of this? What would the user lose? Ah, this is the patch to disable i_version updates for Lustre. I don't think any normal user would use this mount option, so I don't know if there is a need to document it. There are no performance implications, unless we end up changing the mtime granularity JUST to update i_version, in which case we can avoid some overhead if not exporting with NFSv4. If we want to go in the direction of forcing extra inode updates just for this, then we might even consider making i_version updates on disk default to OFF unless NFSv4 has exported the filesystem at least once, and then it should set a persistent flag in the superblock indicating that i_version updates are needed. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Wed, 2007-07-11 at 09:47 +0100, Christoph Hellwig wrote: > On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: > > This patch is on top of i_version_update_vfs. > > The i_version field of the inode is set on inode creation and incremented > > when > > the inode is being modified. > > Which is not what i_version is supposed to do. It'll get you tons of misses > for NFSv3 filehandles that rely on the generation staying the same for the > same file. Please add a new field for the NFSv4 sequence counter instead > of making i_version unuseable. Aren't you confusing i_version and i_generation here? Those are two separate inode fields. Cheers Trond - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Jul 11, 2007 09:47 +0100, Christoph Hellwig wrote: > On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: > > This patch is on top of i_version_update_vfs. > > The i_version field of the inode is set on inode creation and incremented > > when the inode is being modified. > > Which is not what i_version is supposed to do. It'll get you tons of misses > for NFSv3 filehandles that rely on the generation staying the same for the > same file. Please add a new field for the NFSv4 sequence counter instead > of making i_version unuseable. You are confusing i_generation (the instance of this inode number) with i_version (whether this file has been modified)? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 3/5] i_version:ext4 inode version read/store
On Jul 10, 2007 16:31 -0700, Andrew Morton wrote: > > This patch adds 64-bit inode version support to ext4. The lower 32 bits > > are stored in the osd1.linux1.l_i_version field while the high 32 bits > > are stored in the i_version_hi field newly created in the ext4_inode. > > So reading the code here does serve to answer the question I raised against > the earlier patch. A bit. > > I'd have thought that this patch and the one which adds i_version_hi should > be folded into a single diff? It could be - the original patch to reserve i_version_hi was submitted to before the patches were ready to avoid that space being used by something else. > > + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { > > + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) > > + inode->i_version |= > > + (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32; > > > > > > > + } > > I don't quite see how the above two tests are sufficient to unambiguously > determine that the i_version_hi field is present on-disk. > > I guess we're implicitly assuming that if the on-disk inode is big enough > then it _must_ have i_version_hi in there? If so, why is the comparison > with EXT4_GOOD_OLD_INODE_SIZE needed? The GOOT_OLD_INODE_SIZE check is needed to know if i_extra_isize is even present or valid in the on-disk inode. > > @@ -2852,8 +2859,14 @@ static int ext4_do_update_inode(handle_t > > + raw_inode->i_disk_version = cpu_to_le32(inode->i_version); > > + if (ei->i_extra_isize) { > > + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) { > > There's no comparison with EXT4_GOOD_OLD_INODE_SIZE here... Because this is the in-memory version and it is always valid (set to zero if there is extra space in the on-disk inode). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: utimes() with vfat is problematic
Jan Engelhardt <[EMAIL PROTECTED]> wrote: > vfat does not know about ownership, hence the files are always owned by the > vfat mounter (or whatever the uid= option specified). Which brings > a problem to userspace programs trying to utime() but which do not > run as the same user as the vfat mounter, because: > > > fs/attr.c:53 > ret = -EPERM; > [...] > > /* Check for setting the inode time. */ > if (ia_valid & (ATTR_MTIME_SET | ATTR_ATIME_SET)) { > if (current->fsuid != inode->i_uid && !capable(CAP_FOWNER)) > goto error; > } > > > To trigger the problem: > # mount /somevfat -o umask=0,uid=root > $ touch -d "2005-05-05" /somevfat/myfile > > I am not sure how this could be dealt with besides passing -o quiet to > mount.vfat. Any ideas? Would it be possible to allow any user to modify the fs by adding "&& current->fsuid != -1"? I think it's commonly the desired behaviour. Off cause the default behaviour should stay the same. -- Those who hesitate under fire usually do not end up KIA or WIA. Friß, Spammer: [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode
On Jul 10, 2007 16:30 -0700, Andrew Morton wrote: > > Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> > > Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]> > > Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]> > > --- > > Index: linux-2.6.21/include/linux/ext4_fs.h > > === > > --- linux-2.6.21.orig/include/linux/ext4_fs.h > > +++ linux-2.6.21/include/linux/ext4_fs.h > > @@ -342,6 +342,7 @@ struct ext4_inode { > > __le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */ > > __le32 i_crtime; /* File Creation time */ > > __le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */ > > + __le32 i_version_hi; /* high 32 bits for 64-bit version */ > > }; > > Aren't there forward- backward-compatibility issues here? How does the > filesystem driver work out whether this field is present and valid? This uses the same EXT4_FITS_IN_INODE() check as any other, so the compatibility issues are handled. NFSv4 could live with 32-bit versions with only a small danger of overflow, so we can still export ext3 filesystems with 128-byte inodes that have been updated to ext4. For Lustre (which requires 64-bit versions), we will enforce that space is available with s_min_extra_isize and RO_COMPAT_EXTRA_ISIZE. In the case where an older ext3/ext4 filesystem with large inodes does not have enough space for i_version_hi the EAs that follow i_extra_isize will be shifted to make room for it (if possible, which is likely). There are no critical fields inside i_extra_isize so in the rare case of a failure to enlarge the i_extra_isize is not a cause for alarm. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 10, 2007 23:34 -0400, Trond Myklebust wrote: > On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote: > > So my vote is to increment i_version in common code every time any > > change is made to the file, and alloc_inode should initialise it to > > current time, which might be changed by the filesystem before it calls > > unlock_new_inode. > > ... but doesn't lustre want to control its i_version... so maybe not :-( > > If lustre wants to be exportable via pNFS (as Peter Braam has suggested > it should), then it had better be able to return a change attribute that > is compatible with the NFSv4.1 spec... The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre version can be used to compare the updates of two inodes. It is set to be the Lustre transaction number (which is sequentially incremented on a per filesystem basis), so that we can use the per-inode version to do replay of client operations even if they have been disconnected for a long time, which is why we want to be able to control the actual value. We don't want the version to be updated for e.g. file defragmentation or other similar internal-only changes which need ext4_mark_inode_dirty(). We had a patch to disable ext4 inode versioning by a flag the superblock, but we dropped it at the last minute because it needed some updates and we didn't want to wait on that for submitting these changes upstream. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: -mm merge plans for 2.6.23
On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: > On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > > romfs-printk-format-warnings.patch > > NACK on this one. The rest of it is nacked anyway, until we unify the point and get_unmapped_area methods of the MTD API. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: -mm merge plans for 2.6.23
On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > romfs-printk-format-warnings.patch NACK on this one. This bloats romfs by almost half of it's previous size to add mtd support to it. Given that romfs is a compltely trivial filesystem it's much better to have a separate filesystem driver handling the format on mtd instead of adding all these indirections. In addition to that argument the switch on the underlying subsystem is done horrible. There's lots of ifdefs instead of proper functions pointers, there's one file containing both block and mtd code instead of seaparate files, etc. And the get_unmapped_area method in a bare filesystem needs a _lot_ of explanation. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Jul 10, 2007 16:30 -0700, Andrew Morton wrote: > > +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field) \ > > + ((offsetof(typeof(*ext4_inode), field) +\ > > + sizeof((ext4_inode)->field)) \ > > + <= (EXT4_GOOD_OLD_INODE_SIZE + \ > > + (einode)->i_extra_isize)) \ > > Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers > under what circumstances something will not fit in an inode and what the > consequences of this are. /* Extended fields will fit into an inode if the filesystem was formatted * with large inodes (-I 256 or larger) and there are not currently EAs * consuming all of the available space. For new inodes we always reserve * enough space for the kernel's known extended fields, but for inodes * created with an old kernel this might not have been the case. None of * the extended inode fields is critical for correct filesystem operation. */ > > +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) > >\ > > +do { > >\ > > + (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime); \ > > + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ > > + ext4_decode_extra_time(&(inode)->xtime,\ > > + raw_inode->xtime ## _extra);\ > > +} while (0) > > Ugly. I expect these could be implemented as plain old C functions. > Caller could pass in the address of the ext4_inode field which the function > is to operate upon. We thought about that also, but then the caller needs to do all of the pointer gymnastics themselves like: ext4_inode_get_xtime(&inode->i_ctime, &inode->i_ctime_extra, &raw_inode->i_ctime, &raw_inode->i_ctime_extra) instead of the current: EXT4_INODE_GET_XTIME(ctime, inode, raw_inode); IMHO it is preferrable to make the multiple callsites more readable than the macros. > > #if defined(__KERNEL__) || defined(__linux__) > > (What's the __linux__ for?) > > > #define i_reserved1osd1.linux1.l_i_reserved1 > > #define i_frag osd2.linux2.l_i_frag This is actually unrelated to the current patch, just part of the context. AFAIK, this is historical, so that the kernel and e2fsprogs can use the same ext2_fs.h header. I don't think it is really needed, but such cleanup shouldn't be a part of this patch either. > > +static inline struct timespec ext4_current_time(struct inode *inode) > > +{ > > + return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ? > > + current_fs_time(inode->i_sb) : CURRENT_TIME_SEC; > > +} > > Now, I've forgotten how this works. Remind me, please. Can ext4 > filesystems ever have one-second timestamp granularity? If so, how does > one cause that to come about? Yes, this is possible if an ext2/3/4 filesystem is formatted with 128-byte inodes (which is the default for all but ext4) and this fs is mounted as ext4dev. The inodes can never hold the extra time information (FITS_IN_INODE check above) so the superblock limits the timestamp resolution to 1s in that case. > > @@ -153,6 +153,7 @@ > > > > unsigned long i_ext_generation; > > struct ext4_ext_cache i_cached_extent; > > + struct timespec i_crtime; > > }; > > It is unobvious what this field does. Please prefer to add commentary to > _all_ struct fields - it really helps. It is the inode creation time. This is useful for debug/forensic purposes, and at some point there will be an API so that Samba can use it also. > > #endif /* _LINUX_EXT4_FS_I */ > > Index: linux-2.6.22-rc4/include/linux/ext4_fs_sb.h > > === > > --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_sb.h2007-06-11 > > 17:28:15.0 -0700 > > +++ linux-2.6.22-rc4/include/linux/ext4_fs_sb.h 2007-06-11 > > 17:39:05.0 -0700 > > @@ -79,6 +79,7 @@ > > char *s_qf_names[MAXQUOTAS];/* Names of quota files with > > journalled quota */ > > int s_jquota_fmt; /* Format of quota to use */ > > #endif > > + unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */ > > OK, I can kind-of see how this is working, but some overall description of > how the inode sizing design operates would be helpful. It would certainly > make reviewing of this proposed change more fruitful. Perhaps that new > comment over EXT4_FITS_IN_INODE() would be a suitable place. Hmm, I'm sure there were emails on the topic, but they aren't attached to the patch. s_want_extra_isize is just an override for sizeof(ext4_inode) in case the sysadmin wants to reserve more fields in new inodes. There is also s_min_extra_isize which is what the kernel and e2fsck guarantee that will be available in all in-use inodes, if RO_COMPAT_EXTRA_
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Jul 10, 2007 22:00 -0400, Mingming Cao wrote: > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote: > > On Sun, 01 Jul 2007 03:36:56 -0400 > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > This patch is a spinoff of the old nanosecond patches. > > > > I don't know what the "old nanosecond patches" are. A link to a suitable > > changlog for those patches would do in a pinch. Preferable would be to > > write a proper changelog for this patch. > > > I found the original patch > http://marc.info/?l=linux-ext4&m=115091699809181&w=2 > > Andreas or Kalpak, is changelog from the original patch is accurate to > apply here? Mostly, yes, but the name of the feature flag has changed. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
Hi, On Wed, 11 Jul 2007, Al Viro wrote: > The fundamental issue here is that even if you do find struct file, > you can't blindly rip its ->f_mapping since it can be in the middle > of ->read(), ->write(), pageout, etc. And even if you do manage > that, you still have the ability to do fchmod() later. Then we would need to change the VFS and relevant parts so that we can take down ->f_mapping. I don't see how we could do that without affecting current hotpaths. Hmm. I suppose what we really need to do is cannibalize the actual inode (remove from inode cache, detach from dentry and take down the mapping) so that we don't have to touch existing struct file pointers at all. Pekka - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/1] VFS: Augment /proc/mount with subroot and shared-subtree
On Sat, Jun 30, 2007 at 08:56:02AM -0400, H. Peter Anvin wrote: > Is that conjecture, or do you have evidence to that effect? Most users > of this file are using it via the glibc interfaces, and there probably > aren't all that many users of it in the first place. I have written parsers for personal projects that might not have been happy to deal with additional fields myself for example.. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
On Thu, Jul 05, 2007 at 03:43:32PM -0700, Dave Hansen wrote: > This takes care of all of the direct callers of vfs_mknod(). > Since a few of these cases also handle normal file creation > as well, this also covers some calls to vfs_create(). > > So that we don't have to make three mnt_want/drop_write() > calls inside of the switch statement, we move some of its > logic outside of the switch. Looks good to me. > One thing I noticed: do we actually _need_ the first > S_ISDIR() check at the top of the function? The EPERM is required by Posix. > + if (!S_ISREG(mode) && !S_ISCHR(mode) && !S_ISBLK(mode) && > + !S_ISFIFO(mode) && !S_ISSOCK(mode) && (mode != 0)) { no need for braces around the mode != 0 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
On Sat, Jul 07, 2007 at 08:25:31PM +0200, Jan Engelhardt wrote: > Hm, I wonder why mknod(,S_IFDIR,) returns -EPERM rather than -EINVAL like the > others? Posix. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] nfs: disable leases over NFS
On Thu, Jul 05, 2007 at 11:41:00AM -0400, J. Bruce Fields wrote: > OK, after looking at this a little more, I'm less happy about the idea > of erroring out by default: > > - There are a ton of filesystems that probably should allow > leases, and only a few (network filesystems) that shouldn't, > so leaving leases on by default seems simpler. But it gets you possible wrong behaviour by default. I'm not a big fan of non-trivial default methods as you see :) > - We already fall back on the local method by default in the case > of locks, and I don't see a reason to treat leases differently. > - The patch to add > .setlease = setlease, > to all the file_operations is going to be a big patch that > changes behavior in a way that might be easy to miss (because > it changes behavior exactly on those filesystems it *doesn't* > touch.) I think it'll be easier to get better review on the > patch that adds a method just to those filesystems that we're > disabling leases for. Anyway, feel free to go ahead with the simpler version for now, I'll do the switchover for locks and leases when I get some time. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 01:01:07PM +0300, Pekka J Enberg wrote: > On Wed, 11 Jul 2007, Al Viro wrote: > > BTW, read() or write() in progress might get rather unhappy if your > > live replacement of ->f_mapping races with them... > > For writes, we (1) never start any new operations after we've cleaned up > the file descriptor tables so (2) after we're done with do_fsync() we > never touch ->f_mapping again. Er, no. do_fsync() won't hit the sys_write() that is yet to enter ->write(). And you can't get rid of new callers _anyway_ (see previous mail). - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 12:50:48PM +0300, Pekka J Enberg wrote: > Hi Al, > > On Wed, 11 Jul 2007, Al Viro wrote: > > Better: I have the only opened descriptor for foo. I send it to myself > > as described above. I close it. revoke() is called, finds no opened > > instances of foo in any descriptor tables and cheerfully does nothing. > > I call recvmsg() and I have completely undamaged opened file back. > > Uhm, nice. So, revoke() needs a proper inode -> struct files mapping > somewhere. Can we add a list of files to struct inode? Are there other > cases where a file can point to an inode but the file is not attached to > any file descriptor? Umm... Any number, really - it might be in the middle of syscall while another task sharing descriptor table has closed the descriptor. Then there's quota, then there's process accounting, then there's execve() in progress, then there's knfsd working with that struct file, etc. The fundamental issue here is that even if you do find struct file, you can't blindly rip its ->f_mapping since it can be in the middle of ->read(), ->write(), pageout, etc. And even if you do manage that, you still have the ability to do fchmod() later. I don't see how the ability to find all instances in SCM_RIGHTS datagrams (for example) will help you with the race I've described first. Original state: task B has the only reference to file. revoke() is called, passes task A. B sends datagram to A and closes file. A receives datagram. Now the only reference is in A's table and you've already passed that. So you can't avoid processes keeping pointers to struct file. If you could find all struct file over given inode (which, I suspect, will lead to interesting locking), you could call something on that struct file, but you'd have zero exclusion with processes calling methods on it. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, 11 Jul 2007, Al Viro wrote: > BTW, read() or write() in progress might get rather unhappy if your > live replacement of ->f_mapping races with them... For writes, we (1) never start any new operations after we've cleaned up the file descriptor tables so (2) after we're done with do_fsync() we never touch ->f_mapping again. But for reads, I think there's a problem if we're in do_generic_mapping_read() doing invalidate_inode_pages2() is not enough because we're hanging on to the real mapping. Hmm. Pekka - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
Hi Al, On Wed, 11 Jul 2007, Al Viro wrote: > Better: I have the only opened descriptor for foo. I send it to myself > as described above. I close it. revoke() is called, finds no opened > instances of foo in any descriptor tables and cheerfully does nothing. > I call recvmsg() and I have completely undamaged opened file back. Uhm, nice. So, revoke() needs a proper inode -> struct files mapping somewhere. Can we add a list of files to struct inode? Are there other cases where a file can point to an inode but the file is not attached to any file descriptor? Pekka - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 10:37:33AM +0100, Al Viro wrote: > On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote: > > From: Pekka Enberg <[EMAIL PROTECTED]> > > > > The revokeat(2) and frevoke(2) system calls invalidate open file descriptors > > and shared mappings of an inode. After an successful revocation, operations > > on file descriptors fail with the EBADF or ENXIO error code for regular and > > device files, respectively. Attempting to read from or write to a revoked > > mapping causes SIGBUS. > > > > The actual operation is done in two passes: > > > > 1. Revoke all file descriptors that point to the given inode. We do > > this under tasklist_lock so that after this pass, we don't need > > to worry about racing with close(2) or dup(2). > > How does that deal with the following: > > task A gets its descriptor table cleansed > task B sends a descriptor to task A via SCM_RIGHTS datagram > task B gets its descriptor table cleansed > task A receives the datagram and gets the descriptor installed > task A does any kind of IO on that descriptor > ->f_mapping gets replaced in the most inconvenient time. > > Come to think of that, what do you do if I create a socketpair, > stuff the descriptor into SCM_RIGHTS datagram and send it to > myself? Then receive it a day after you've called revoke() and > voila - I've got an almost undamaged struct file back... At > the very least, it allows me to fchmod(). Or fchdir() if that > had been a directory... > > BTW, read() or write() in progress might get rather unhappy if your > live replacement of ->f_mapping races with them... Better: I have the only opened descriptor for foo. I send it to myself as described above. I close it. revoke() is called, finds no opened instances of foo in any descriptor tables and cheerfully does nothing. I call recvmsg() and I have completely undamaged opened file back. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/5] revoke: core code
On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote: > From: Pekka Enberg <[EMAIL PROTECTED]> > > The revokeat(2) and frevoke(2) system calls invalidate open file descriptors > and shared mappings of an inode. After an successful revocation, operations > on file descriptors fail with the EBADF or ENXIO error code for regular and > device files, respectively. Attempting to read from or write to a revoked > mapping causes SIGBUS. > > The actual operation is done in two passes: > > 1. Revoke all file descriptors that point to the given inode. We do > this under tasklist_lock so that after this pass, we don't need > to worry about racing with close(2) or dup(2). How does that deal with the following: task A gets its descriptor table cleansed task B sends a descriptor to task A via SCM_RIGHTS datagram task B gets its descriptor table cleansed task A receives the datagram and gets the descriptor installed task A does any kind of IO on that descriptor ->f_mapping gets replaced in the most inconvenient time. Come to think of that, what do you do if I create a socketpair, stuff the descriptor into SCM_RIGHTS datagram and send it to myself? Then receive it a day after you've called revoke() and voila - I've got an almost undamaged struct file back... At the very least, it allows me to fchmod(). Or fchdir() if that had been a directory... BTW, read() or write() in progress might get rather unhappy if your live replacement of ->f_mapping races with them... - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] manpage for fallocate
On Wed, Jul 11, 2007 at 12:37:01AM +0300, Heikki Orsila wrote: > On Wed, Jul 11, 2007 at 01:48:20AM +0530, Amit K. Arora wrote: > > .BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len); > > Correction: "int syscall(int fd, int mode, ...)", Here, we have syscall() with first argument being the system call number - so what you suggested is not correct. But, yes, the synopsis should change at some time. Maybe to something like: #include long fallocate(int fd, int mode, loff_t offset, loff_t len); > > .TP > > .B ENOSPC > > There is not enough space left on the device containing the file > > referred to by > > .IR fd. > > .TP > > .B ESPIPE > > .I fd > > refers to a pipe of file descriptor. > > .B ENOSYS > > The filesystem underlying the file descriptor does not support this > > operation. > > EINTR? Will add following errors: EINTR A signal was caught during execution EIO An I/O error occurred while reading from or writing to a file system. EOPNOTSUPPThe mode is not supported on the file descriptor. and will update following : EINVALoffset was less than 0, or len was less than or equal to 0. -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote: > We use this capability in XFS at the moment. > I think this is mainly for DMF (HSM) but is done via the xfs handle > interface > (xfs_open_by_handle) AFAICT. > You're not :) You're using an O_INVIBLE equivalent (as described below), which would be a useful thing to have at the VFS level, but adding hacks to some system calls only wouldn't help any HSM system. It's just useless API clutter. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote: > Given the current behaviour for posix_fallocate() in glibc, I think > that retaining the same error semantic and punting the cleanup to > userspace (where the app will fail with ENOSPC anyway) is the only > sane thing we can do here. Trying to undo this in the kernel leads > to lots of extra rarely used code in error handling paths... Agreed, looks like we should stay with the user has to clean up behaviour. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote: > Well, if you see the modes proposed using above flags : > > #define FA_ALLOCATE 0 > #define FA_DEALLOCATE FA_FL_DEALLOC > #define FA_RESV_SPACE FA_FL_KEEP_SIZE > #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | > FA_FL_DEL_DATA) > > FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes > for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this > flag. Hence prealloction will never delete data. > This mode is required only for FA_UNRESV_SPACE, which is a deallocation > mode, to support any existing XFS aware applications/usage-scenarios. Sorry, but this doesn't make any sense. There is no need to put every feature in the XFS ioctls in the syscalls. The XFS ioctls will need to be supported forever anyway - as I suggested before they really should be moved to generic code. What needs to be supported is what makes sense as an interface. A punch a hole interface does make sense, but trying to hack this into a preallocation system call is just madness. We're not IRIX or windows that fit things into random subcall just because there was some space left to squeeze them in. > > > > FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data > > > > change) */ > > > > FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data > > > > change) */ > > > > NACK to these aswell. If i_size changes c/mtime need updates, if the size > > doesn't chamge they don't. No need to add more flags for this. > > This requirement was from the point of view of HSM applications. Hope > you saw Andreas previous post and are keeping that in mind. HSMs needs this basically for every system call, which screams for an open flag like O_INVISIBLE anyway. Adding this in a generic way is a good idea, but hacking bits and pieces that won't fit into the global design is completely wrong. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 5/5] revoke: add documentation
From: Pekka Enberg <[EMAIL PROTECTED]> This documents revoke file operation in Documentation/filesystems/vfs.txt. Signed-off-by: Pekka Enberg <[EMAIL PROTECTED]> --- Documentation/filesystems/vfs.txt |5 + 1 file changed, 5 insertions(+) Index: 2.6/Documentation/filesystems/vfs.txt === --- 2.6.orig/Documentation/filesystems/vfs.txt 2007-05-21 15:37:59.0 +0300 +++ 2.6/Documentation/filesystems/vfs.txt 2007-07-11 11:48:46.0 +0300 @@ -732,6 +732,7 @@ struct file_operations { int); ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); + int (*revoke)(struct file *); }; Again, all methods are called without any locks being held, unless @@ -805,6 +806,10 @@ otherwise noted. splice_read: called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call + revoke: called by revokeat(2) and frevoke(2) system calls to revoke access + to an open file. This method must ensure that all currently blocked + writes are flushed and reads will fail. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 4/5] revoke: support for ext2 and ext3
From: Pekka Enberg <[EMAIL PROTECTED]> Add revoke support to ext2, ext3 and ext4 by wiring f_ops->revoke with generic_file_revoke. Signed-off-by: Pekka Enberg <[EMAIL PROTECTED]> --- fs/ext2/file.c |1 + fs/ext3/file.c |1 + fs/ext4/file.c |1 + 3 files changed, 3 insertions(+) Index: 2.6/fs/ext2/file.c === --- 2.6.orig/fs/ext2/file.c 2007-05-04 13:49:05.0 +0300 +++ 2.6/fs/ext2/file.c 2007-07-11 11:48:43.0 +0300 @@ -56,6 +56,7 @@ const struct file_operations ext2_file_o .sendfile = generic_file_sendfile, .splice_read= generic_file_splice_read, .splice_write = generic_file_splice_write, + .revoke = generic_file_revoke, }; #ifdef CONFIG_EXT2_FS_XIP Index: 2.6/fs/ext3/file.c === --- 2.6.orig/fs/ext3/file.c 2007-05-04 13:49:05.0 +0300 +++ 2.6/fs/ext3/file.c 2007-07-11 11:48:43.0 +0300 @@ -123,6 +123,7 @@ const struct file_operations ext3_file_o .sendfile = generic_file_sendfile, .splice_read= generic_file_splice_read, .splice_write = generic_file_splice_write, + .revoke = generic_file_revoke, }; const struct inode_operations ext3_file_inode_operations = { Index: 2.6/fs/ext4/file.c === --- 2.6.orig/fs/ext4/file.c 2007-05-04 13:49:05.0 +0300 +++ 2.6/fs/ext4/file.c 2007-07-11 11:48:43.0 +0300 @@ -123,6 +123,7 @@ const struct file_operations ext4_file_o .sendfile = generic_file_sendfile, .splice_read= generic_file_splice_read, .splice_write = generic_file_splice_write, + .revoke = generic_file_revoke, }; const struct inode_operations ext4_file_inode_operations = { - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 3/5] revoke: wire up i386 system calls
From: Pekka Enberg <[EMAIL PROTECTED]> Make revokeat and frevoke system calls available to user-space on i386. Signed-off-by: Pekka Enberg <[EMAIL PROTECTED]> --- arch/i386/kernel/syscall_table.S |2 ++ arch/x86_64/ia32/ia32entry.S |2 ++ include/asm-i386/unistd.h|4 +++- 3 files changed, 7 insertions(+), 1 deletion(-) Index: 2.6/arch/i386/kernel/syscall_table.S === --- 2.6.orig/arch/i386/kernel/syscall_table.S 2007-05-21 15:38:00.0 +0300 +++ 2.6/arch/i386/kernel/syscall_table.S2007-07-11 11:48:39.0 +0300 @@ -323,3 +323,5 @@ .long sys_utimensat /* 320 */ .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_revokeat + .long sys_frevoke /* 325 */ Index: 2.6/include/asm-i386/unistd.h === --- 2.6.orig/include/asm-i386/unistd.h 2007-05-21 15:38:15.0 +0300 +++ 2.6/include/asm-i386/unistd.h 2007-07-11 11:48:39.0 +0300 @@ -329,10 +329,12 @@ #define __NR_utimensat320 #define __NR_signalfd 321 #define __NR_timerfd 322 #define __NR_eventfd 323 +#define __NR_revokeat 324 +#define __NR_frevoke 325 #ifdef __KERNEL__ -#define NR_syscalls 324 +#define NR_syscalls 326 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: 2.6/arch/x86_64/ia32/ia32entry.S === --- 2.6.orig/arch/x86_64/ia32/ia32entry.S 2007-07-06 10:19:44.0 +0300 +++ 2.6/arch/x86_64/ia32/ia32entry.S2007-07-11 11:48:40.0 +0300 @@ -719,4 +719,6 @@ .quad compat_sys_utimensat /* 320 */ .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys_revokeat + .quad sys_frevoke /* 325 */ ia32_syscall_end: - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 2/5] revoke: core code
From: Pekka Enberg <[EMAIL PROTECTED]> The revokeat(2) and frevoke(2) system calls invalidate open file descriptors and shared mappings of an inode. After an successful revocation, operations on file descriptors fail with the EBADF or ENXIO error code for regular and device files, respectively. Attempting to read from or write to a revoked mapping causes SIGBUS. The actual operation is done in two passes: 1. Revoke all file descriptors that point to the given inode. We do this under tasklist_lock so that after this pass, we don't need to worry about racing with close(2) or dup(2). 2. Take down shared memory mappings of the inode and close all file pointers. The file descriptors and memory mapping ranges are preserved until the owning task does close(2) and munmap(2), respectively. You use revoke() (with chown, for example) to gain exclusive access to an inode that might be in use by other processes. This means that we must mke sure that: - operations on opened file descriptors pointing to that inode fail - there are no shared mappings visible to other processes - in-progress system calls are either completed (writes) or abort (reads) After revoke() system call returns, you are guaranteed to have revoked access to an inode for any processes that had access to it when you started the operation. The caller is responsible for blocking any future open(2) calls that might occur while revoke() takes care of fork(2) and dup(2) during the operation. Signed-off-by: Pekka Enberg <[EMAIL PROTECTED]> --- fs/Makefile |1 fs/revoke.c | 777 +++ fs/revoked_inode.c | 417 +++ include/linux/fs.h |8 include/linux/magic.h|1 include/linux/mm.h |1 include/linux/revoked_fs_i.h | 18 include/linux/syscalls.h |3 mm/mmap.c| 11 9 files changed, 1237 insertions(+) Index: 2.6/fs/Makefile === --- 2.6.orig/fs/Makefile2007-05-21 15:38:14.0 +0300 +++ 2.6/fs/Makefile 2007-07-11 11:48:35.0 +0300 @@ -19,6 +19,7 @@ else obj-y += no-block.o endif +obj-$(CONFIG_MMU) += revoke.o revoked_inode.o obj-$(CONFIG_INOTIFY) += inotify.o obj-$(CONFIG_INOTIFY_USER) += inotify_user.o obj-$(CONFIG_EPOLL)+= eventpoll.o Index: 2.6/fs/revoke.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ 2.6/fs/revoke.c 2007-07-11 11:48:35.0 +0300 @@ -0,0 +1,777 @@ +/* + * fs/revoke.c - Invalidate all current open file descriptors of an inode. + * + * Copyright (C) 2006-2007 Pekka Enberg + * + * This file is released under the GPLv2. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * fileset - an array of file pointers. + * @files:the array of file pointers + * @nr: number of elements in the array + * @end: index to next unused file pointer + */ +struct fileset { + struct file **files; + unsigned long nr; + unsigned long end; +}; + +/** + * revoke_details - details of the revoke operation + * @inode:invalidate open file descriptors of this inode + * @fset: set of files that point to a revoked inode + * @restore_start:index to the first file pointer that is currently in + *use by a file descriptor but the real file has not + *been revoked + */ +struct revoke_details { + struct fileset *fset; + unsigned long restore_start; +}; + +static struct kmem_cache *revokefs_inode_cache; + +static inline bool fset_is_full(struct fileset *set) +{ + return set->nr == set->end; +} + +static inline struct file *fset_get_filp(struct fileset *set) +{ + return set->files[set->end++]; +} + +static struct fileset *alloc_fset(unsigned long size) +{ + struct fileset *fset; + + fset = kzalloc(sizeof *fset, GFP_KERNEL); + if (!fset) + return NULL; + + fset->files = kcalloc(size, sizeof(struct file *), GFP_KERNEL); + if (!fset->files) { + kfree(fset); + return NULL; + } + fset->nr = size; + return fset; +} + +static void free_fset(struct fileset *fset) +{ + int i; + + for (i = fset->end; i < fset->nr; i++) + fput(fset->files[i]); + + kfree(fset->files); + kfree(fset); +} + +/* + * Revoked file descriptors point to inodes in the revokefs filesystem. + */ +static struct vfsmount *revokefs_mnt; + +static struct file *get_revoked_file(void) +{ + struct dentry *dentry; + struct inode *inode; + struct file *filp; + struct qstr name; + + filp = get_empty_filp(); +
[RFC/PATCH 1/5] revoke: special mmap handling
From: Pekka Enberg <[EMAIL PROTECTED]> This adds special handling for revoked memory mappings. We want to raise SIGBUS when accessing revoked mappings and return ENODEV when trying to remap with mmap(2). Signed-off-by: Pekka Enberg <[EMAIL PROTECTED]> --- include/linux/mm.h |1 + mm/memory.c|3 +++ mm/mmap.c | 12 3 files changed, 12 insertions(+), 4 deletions(-) Index: 2.6/include/linux/mm.h === --- 2.6.orig/include/linux/mm.h 2007-07-06 10:19:51.0 +0300 +++ 2.6/include/linux/mm.h 2007-07-11 11:48:28.0 +0300 @@ -169,6 +169,7 @@ #define VM_NONLINEAR0x0080 /* Is no #define VM_MAPPED_COPY 0x0100 /* T if mapped copy of data (nommu mmap) */ #define VM_INSERTPAGE 0x0200 /* The vma has had "vm_insert_page()" done on it */ #define VM_ALWAYSDUMP 0x0400 /* Always include in core dumps */ +#define VM_REVOKED 0x0800 /* Mapping has been revoked */ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS Index: 2.6/mm/memory.c === --- 2.6.orig/mm/memory.c2007-07-06 10:19:53.0 +0300 +++ 2.6/mm/memory.c 2007-07-11 11:48:28.0 +0300 @@ -2597,6 +2597,9 @@ int __handle_mm_fault(struct mm_struct * if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, write_access); + if (unlikely(vma->vm_flags & VM_REVOKED)) + return VM_FAULT_SIGBUS; + pgd = pgd_offset(mm, address); pud = pud_alloc(mm, pgd, address); if (!pud) Index: 2.6/mm/mmap.c === --- 2.6.orig/mm/mmap.c 2007-07-06 10:19:53.0 +0300 +++ 2.6/mm/mmap.c 2007-07-11 11:48:28.0 +0300 @@ -1031,10 +1031,14 @@ accountable = 0; error = -ENOMEM; munmap_back: vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent); - if (vma && vma->vm_start < addr + len) { - if (do_munmap(mm, addr, len)) - return -ENOMEM; - goto munmap_back; + if (vma) { + if (unlikely(vma->vm_flags & VM_REVOKED)) + return -ENODEV; + if (vma->vm_start < addr + len) { + if (do_munmap(mm, addr, len)) + return -ENOMEM; + goto munmap_back; + } } /* Check against address space limit. */ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote: > This patch is on top of i_version_update_vfs. > The i_version field of the inode is set on inode creation and incremented when > the inode is being modified. Which is not what i_version is supposed to do. It'll get you tons of misses for NFSv3 filehandles that rely on the generation staying the same for the same file. Please add a new field for the NFSv4 sequence counter instead of making i_version unuseable. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc
On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote: > On Wed, 11 Jul 2007 01:50:00 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > > --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c > > +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c > > @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, > > return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo, > > len, advice); > > } > > + > > +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo, > > + unsigned offset_hi, unsigned len_lo, > > + unsigned len_hi) > > Please call this compat_sys_fallocate in line with the powerpc version - > it gives us a hint that maybe we should think about how to consolidate > them. I know other stuff in that file is called sys32_ ... but it is time > for a change :-) I think this can be handled as a separate patch once this patchset is in mainline. Since, anyhow we will need to do this for other sys32_ calls which are already there... -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html