Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Mon, 2008-01-28 at 12:33 -0600, Steve French wrote: On Jan 28, 2008 2:17 AM, Andi Kleen [EMAIL PROTECTED] wrote: I completely agree. If one thread writes A and another writes B then the kernel should record either A or B, not ((A 0x) | (B 0x)) The problem is pretty nasty unfortunately. To solve it properly I think the file_operations-read/write prototypes would need to be changed because otherwise it is not possible to do atomic relative updates of f_pos. Right now the actual update is burrowed deeply in the low level read/write implementation. But that would be a huge impact all over the tree :/ If there were a wrapper around reads and writes of f_pos as there is for i_size e.g. it would hit a lot of code, but not as many as I had originally thought. the most important ones are in the vfs itself, where there are only 59 uses of the field (not all need to be changed). ext3 has fewer (25), and cifs only 12 uses. Most of the uses in ext3 and cifs deal with a directory's f_pos in readdir, which is protected by i_mutex, so I don't think we need to worry about them at all. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [9/18] BKL-removal: Use unlocked_ioctl for jfs
On Sun, 2008-01-27 at 03:17 +0100, Andi Kleen wrote: Convert jfs_ioctl over to not use the BKL. The only potential race I could see was with two ioctls in parallel changing the flags and losing the updates. Use the i_mutex to protect against this. Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen [EMAIL PROTECTED] Added to the jfs git tree. Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [10/18] BKL-removal: Implement a compat_ioctl handler for JFS
On Sun, 2008-01-27 at 03:17 +0100, Andi Kleen wrote: The ioctls were already compatible except for the actual values so this was fairly easy to do. Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen [EMAIL PROTECTED] Added to the jfs git tree. Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 19/26] mount options: fix jfs
On Thu, 2008-01-24 at 20:34 +0100, Miklos Szeredi wrote: plain text document attachment (jfs_opts.patch) From: Miklos Szeredi [EMAIL PROTECTED] Add iocharset= and errors= options to /proc/mounts for jfs filesystems. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] Acked-by: Dave Kleikamp [EMAIL PROTECTED] Andrew, Would you like me to add this to the jfs git tree, or would you like to handle these patches as a set? Thanks, Shaggy --- Index: linux/fs/jfs/super.c === --- linux.orig/fs/jfs/super.c 2008-01-17 19:00:55.0 +0100 +++ linux/fs/jfs/super.c 2008-01-21 19:39:30.0 +0100 @@ -602,6 +602,12 @@ static int jfs_show_options(struct seq_f seq_printf(seq, ,umask=%03o, sbi-umask); if (sbi-flag JFS_NOINTEGRITY) seq_puts(seq, ,nointegrity); + if (sbi-nls_tab) + seq_printf(seq, ,iocharset=%s, sbi-nls_tab-charset); + if (sbi-flag JFS_ERR_CONTINUE) + seq_printf(seq, ,errors=continue); + if (sbi-flag JFS_ERR_PANIC) + seq_printf(seq, ,errors=panic); #ifdef CONFIG_QUOTA if (sbi-flag JFS_USRQUOTA) -- -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 19/26] mount options: fix jfs
On Thu, 2008-01-24 at 13:57 -0800, Andrew Morton wrote: My usual algorithm here is to 1: queue all the patches and send the ones which have a maintainer to that maintainer until he merges it. 2: If the patches have a dependency upon (say) a VFS patch then I'll merge the VFS patch and will then goto 1. I don't think this particular patch has a VFS depencency so sure, merge away. You'll probably see that I merged it anyway, but I'll drop it again when I see it turn up in your tree (I used to resync with the git trees at least daily, but I now do this far less frequently because it is such torture because everyone is paddling in everyone else's puddle). Merged. Thanks. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC:PATCH 09/09] VM tail statistics support
From: Luiz Fernando N. Capitulino [EMAIL PROTECTED] [PATCH]: VM tail statistics support This patch is a hack which introduces initial statistics support for the VM tail functionality. It uses debugfs and does accouting of: 1. Number of times vm_file_tail_pack() have been called 2. Number of times vm_file_tail_unpack() have been called 3. Total size of file tails allocations 4. Number of file tail allocations 5. Bytes saved Signed-off-by: Luiz Fernando N. Capitulino [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- mm/file_tail.c | 127 - 1 file changed, 126 insertions(+), 1 deletion(-) diff -Nurp linux008/mm/file_tail.c linux009/mm/file_tail.c --- linux008/mm/file_tail.c 2007-11-08 10:49:46.0 -0600 +++ linux009/mm/file_tail.c 2007-11-08 10:49:46.0 -0600 @@ -12,10 +12,51 @@ #include linux/buffer_head.h #include linux/fs.h +#include linux/init.h +#include linux/debugfs.h #include linux/hardirq.h #include linux/module.h +#include linux/spinlock.h #include linux/vm_file_tail.h +struct { + struct dentry *root_dir; + struct dentry *nr_tails; + struct dentry *tail_size; + struct dentry *saved_bytes; + struct dentry *pack_called; + struct dentry *unpack_called; + struct dentry *read_called; +} vm_tail_debugfs; + +struct { + u32 nr_tails; + u32 tail_size; + u32 saved_bytes; + u32 pack_called; + u32 unpack_called; + u32 read_called; + spinlock_t lock; +} vm_tail_stats = { .lock = __SPIN_LOCK_UNLOCKED(lock) }; + +static void vm_file_tail_stats_inc(int length) +{ + spin_lock(vm_tail_stats.lock); + vm_tail_stats.nr_tails++; + vm_tail_stats.tail_size += length; + vm_tail_stats.saved_bytes += (PAGE_SIZE - length); + spin_unlock(vm_tail_stats.lock); +} + +static void vm_file_tail_stats_dec(int length) +{ + spin_lock(vm_tail_stats.lock); + vm_tail_stats.nr_tails--; + vm_tail_stats.tail_size -= length; + vm_tail_stats.saved_bytes -= (PAGE_SIZE - length); + spin_unlock(vm_tail_stats.lock); +} + /* * Free the file tail * @@ -29,7 +70,10 @@ void __vm_file_tail_free(struct address_ spin_lock_irqsave(mapping-tail_lock, flags); tail = mapping-tail; - mapping-tail = NULL; + if (tail) { + vm_file_tail_stats_dec(vm_file_tail_length(mapping)); + mapping-tail = NULL; + } spin_unlock_irqrestore(mapping-tail_lock, flags); kfree(tail); } @@ -49,6 +93,8 @@ void vm_file_tail_unpack(struct address_ struct page *page; void *tail; + vm_tail_stats.unpack_called++; + if (!mapping-tail) return; @@ -85,6 +131,7 @@ void vm_file_tail_unpack(struct address_ add_to_page_cache_lru(page, mapping, index, gfp_mask); unlock_page(page); page_cache_release(page); + vm_file_tail_stats_dec(length); } else /* Free the tail */ __vm_file_tail_free(mapping); @@ -120,6 +167,8 @@ int vm_file_tail_pack(struct page *page) struct address_space *mapping; void *tail; + vm_tail_stats.pack_called++; + if (TestSetPageLocked(page)) return 0; @@ -163,12 +212,86 @@ int vm_file_tail_pack(struct page *page) remove_from_page_cache(page); page_cache_release(page); /* pagecache ref */ ret = 1; + vm_file_tail_stats_inc(length); out: unlock_page(page); return ret; } +static int __init create_debugfs_file(const char *name, struct dentry **dir, + u32 *var) +{ + *dir = debugfs_create_u32(name, S_IFREG|S_IRUGO, + vm_tail_debugfs.root_dir, var); + if (!*dir) { + printk(KERN_ERR ERROR: vm_tail: could not create %s\n, name); + return -ENOMEM; + } + return 0; +} + +static int __init vm_file_tail_init(void) +{ + int err; + + vm_tail_debugfs.root_dir = debugfs_create_dir(vm_tail, NULL); + if (!vm_tail_debugfs.root_dir) { + printk(KERN_ERR ERROR: %s Could not create root directory\n, + __FUNCTION__); + return -ENOMEM; + } + + err = create_debugfs_file(nr_tails, vm_tail_debugfs.nr_tails, + vm_tail_stats.nr_tails); + if (err) + goto out_err; + + err = create_debugfs_file(tail_size, vm_tail_debugfs.tail_size, + vm_tail_stats.tail_size); + if (err) + goto out_err1; + + err = create_debugfs_file(saved_bytes, vm_tail_debugfs.saved_bytes, + vm_tail_stats.saved_bytes); + if (err) + goto out_err2; + + err = create_debugfs_file
[RFC:PATCH 08/09] generic_file_aio_read can read directly from the tail. No need to unpack
generic_file_aio_read can read directly from the tail. No need to unpack Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- include/linux/vm_file_tail.h | 13 ++ mm/file_tail.c | 54 +++ mm/filemap.c |4 ++- 3 files changed, 70 insertions(+), 1 deletion(-) diff -Nurp linux007/include/linux/vm_file_tail.h linux008/include/linux/vm_file_tail.h --- linux007/include/linux/vm_file_tail.h 2007-11-08 10:49:46.0 -0600 +++ linux008/include/linux/vm_file_tail.h 2007-11-08 10:49:46.0 -0600 @@ -53,6 +53,18 @@ static inline void vm_file_tail_unpack_i vm_file_tail_unpack(mapping); } +extern int __vm_file_tail_read(struct file *, loff_t *, read_descriptor_t *); + +static inline int vm_file_tail_read(struct file *filp, loff_t *ppos, + read_descriptor_t *desc) +{ + struct address_space *mapping = filp-f_mapping; + unsigned long index = *ppos PAGE_CACHE_SHIFT; + + if (mapping-tail index == vm_file_tail_index(mapping)) + return __vm_file_tail_read(filp, ppos, desc); + return 0; +} #else /* !CONFIG_VM_FILE_TAILS */ #define vm_file_tail_packed(mapping) 0 @@ -60,6 +72,7 @@ static inline void vm_file_tail_unpack_i #define vm_file_tail_pack(page) 0 #define vm_file_tail_unpack(mapping) do {} while (0) #define vm_file_tail_unpack_index(mapping, index) do {} while (0) +#define vm_file_tail_read(filp, ppos, desc) 0 #endif /* CONFIG_VM_FILE_TAILS */ diff -Nurp linux007/mm/file_tail.c linux008/mm/file_tail.c --- linux007/mm/file_tail.c 2007-11-08 10:49:46.0 -0600 +++ linux008/mm/file_tail.c 2007-11-08 10:49:46.0 -0600 @@ -178,3 +178,57 @@ void __vm_file_tail_unpack_on_resize(str vm_file_tail_free(inode-i_mapping); } EXPORT_SYMBOL(__vm_file_tail_unpack_on_resize); + +/* + * Copy tail data to user buffer + * + * Returns 1 on success + */ +int __vm_file_tail_read(struct file *filp, loff_t *ppos, + read_descriptor_t *desc) +{ + unsigned long count = desc-count; + unsigned int flags; + unsigned long left; + struct address_space *mapping = filp-f_mapping; + unsigned long offset; + unsigned long index = *ppos PAGE_CACHE_SHIFT; + unsigned long size; + + if (fault_in_pages_writeable(desc-arg.buf, count)) + /* +* Keep this simple since this path is an optimization. Let +* the tricky stuff get handled in the fallback path. +*/ + return 0; + + spin_lock_irqsave(mapping-tail_lock, flags); + + offset = *ppos ~PAGE_CACHE_MASK; + if (!mapping-tail || index != vm_file_tail_index(mapping) || + offset = vm_file_tail_length(mapping)) { + spin_unlock_irqrestore(mapping-tail_lock, flags); + return 0; + } + + size = vm_file_tail_length(mapping) - offset; + if (size count) + size = count; + + left = __copy_to_user_inatomic(desc-arg.buf, + (char *)mapping-tail + offset, size); + + spin_unlock_irqrestore(mapping-tail_lock, flags); + + if (left) { + size -= left; + desc-error = -EFAULT; + } + desc-count = count - size; + desc-written += size; + desc-arg.buf += size; + *ppos += size; + file_accessed(filp); + + return 1; +} diff -Nurp linux007/mm/filemap.c linux008/mm/filemap.c --- linux007/mm/filemap.c 2007-11-08 10:49:46.0 -0600 +++ linux008/mm/filemap.c 2007-11-08 10:49:46.0 -0600 @@ -1195,7 +1195,9 @@ generic_file_aio_read(struct kiocb *iocb if (desc.count == 0) continue; desc.error = 0; - do_generic_file_read(filp,ppos,desc,file_read_actor); + if (!vm_file_tail_read(filp, ppos, desc)) + do_generic_file_read(filp, ppos, desc, +file_read_actor); retval += desc.written; if (desc.error) { retval = retval ?: desc.error; - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC:PATCH 05/09] find_get_page() and find_lock_page() need to unpack the tail
find_get_page() and find_lock_page() need to unpack the tail If the page being sought corresponds to the tail, and the tail is packed in the inode, the tail must be unpacked. Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- mm/filemap.c |3 +++ 1 file changed, 3 insertions(+) diff -Nurp linux004/mm/filemap.c linux005/mm/filemap.c --- linux004/mm/filemap.c 2007-11-07 08:14:01.0 -0600 +++ linux005/mm/filemap.c 2007-11-08 10:49:46.0 -0600 @@ -24,6 +24,7 @@ #include linux/file.h #include linux/uio.h #include linux/hash.h +#include linux/vm_file_tail.h #include linux/writeback.h #include linux/backing-dev.h #include linux/pagevec.h @@ -600,6 +601,7 @@ struct page * find_get_page(struct addre { struct page *page; + vm_file_tail_unpack_index(mapping, offset); read_lock_irq(mapping-tree_lock); page = radix_tree_lookup(mapping-page_tree, offset); if (page) @@ -624,6 +626,7 @@ struct page *find_lock_page(struct addre { struct page *page; + vm_file_tail_unpack_index(mapping, offset); repeat: read_lock_irq(mapping-tree_lock); page = radix_tree_lookup(mapping-page_tree, offset); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC:PATCH 01/09] Add tail to address space
Add tail to address space Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/inode.c |3 +++ include/linux/fs.h |4 mm/Kconfig |9 + 3 files changed, 16 insertions(+) diff -Nurp linux000/fs/inode.c linux001/fs/inode.c --- linux000/fs/inode.c 2007-11-07 08:13:54.0 -0600 +++ linux001/fs/inode.c 2007-11-08 10:49:46.0 -0600 @@ -213,6 +213,9 @@ void inode_init_once(struct inode *inode spin_lock_init(inode-i_data.i_mmap_lock); INIT_LIST_HEAD(inode-i_data.private_list); spin_lock_init(inode-i_data.private_lock); +#ifdef CONFIG_VM_FILE_TAILS + spin_lock_init(inode-i_data.tail_lock); +#endif INIT_RAW_PRIO_TREE_ROOT(inode-i_data.i_mmap); INIT_LIST_HEAD(inode-i_data.i_mmap_nonlinear); i_size_ordered_init(inode); diff -Nurp linux000/include/linux/fs.h linux001/include/linux/fs.h --- linux000/include/linux/fs.h 2007-11-07 08:13:59.0 -0600 +++ linux001/include/linux/fs.h 2007-11-08 10:49:46.0 -0600 @@ -511,6 +511,10 @@ struct address_space { spinlock_t private_lock; /* for use by the address_space */ struct list_headprivate_list; /* ditto */ struct address_space*assoc_mapping; /* ditto */ +#ifdef CONFIG_VM_FILE_TAILS + void*tail; /* file tail */ + spinlock_t tail_lock; /* protect tail */ +#endif } __attribute__((aligned(sizeof(long; /* * On most architectures that alignment is already the case; but diff -Nurp linux000/mm/Kconfig linux001/mm/Kconfig --- linux000/mm/Kconfig 2007-11-07 08:14:01.0 -0600 +++ linux001/mm/Kconfig 2007-11-08 10:49:46.0 -0600 @@ -194,3 +194,12 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config VM_FILE_TAILS + bool Store file tails in slab cache + def_bool n + help + If the data at the end of a file, or the entire file, is small, + the kernel will attempt to store that data in the slab cache, + rather than allocate an entire page in the page cache. + If unsure, say N here. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC:PATCH 03/09] Release tail when inode is freed
Release tail when inode is freed Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/inode.c |2 ++ 1 file changed, 2 insertions(+) diff -Nurp linux002/fs/inode.c linux003/fs/inode.c --- linux002/fs/inode.c 2007-11-08 10:49:46.0 -0600 +++ linux003/fs/inode.c 2007-11-08 10:49:46.0 -0600 @@ -10,6 +10,7 @@ #include linux/init.h #include linux/quotaops.h #include linux/slab.h +#include linux/vm_file_tail.h #include linux/writeback.h #include linux/module.h #include linux/backing-dev.h @@ -260,6 +261,7 @@ void __iget(struct inode * inode) void clear_inode(struct inode *inode) { might_sleep(); + vm_file_tail_free(inode-i_mapping); invalidate_inode_buffers(inode); BUG_ON(inode-i_data.nrpages); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC:PATCH 04/09] Unpack or remove file tail when inode is resized
Unpack or remove file tail when inode is resized If the inode size grows, we need to unpack the tail into a page. If the inode shrinks, such that the entire tail is beyond the end of the file, discard the tail. If the file shrinks, but part of the tail is still valid, just leave it. Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- include/linux/fs.h | 14 ++ mm/file_tail.c | 11 +++ 2 files changed, 25 insertions(+) diff -Nurp linux003/include/linux/fs.h linux004/include/linux/fs.h --- linux003/include/linux/fs.h 2007-11-08 10:49:46.0 -0600 +++ linux004/include/linux/fs.h 2007-11-08 10:49:46.0 -0600 @@ -715,6 +715,19 @@ static inline loff_t i_size_read(const s #endif } +#ifdef CONFIG_VM_FILE_TAILS +void __vm_file_tail_unpack_on_resize(struct inode *, loff_t); + +static inline void vm_file_tail_unpack_on_resize(struct inode *inode, +loff_t size) +{ + if (inode-i_mapping inode-i_mapping-tail) + __vm_file_tail_unpack_on_resize(inode, size); +} +#else +#define vm_file_tail_unpack_on_resize(mapping, new_size) do {} while (0) +#endif + /* * NOTE: unlike i_size_read(), i_size_write() does need locking around it * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount @@ -722,6 +735,7 @@ static inline loff_t i_size_read(const s */ static inline void i_size_write(struct inode *inode, loff_t i_size) { + vm_file_tail_unpack_on_resize(inode, i_size); #if BITS_PER_LONG==32 defined(CONFIG_SMP) write_seqcount_begin(inode-i_size_seqcount); inode-i_size = i_size; diff -Nurp linux003/mm/file_tail.c linux004/mm/file_tail.c --- linux003/mm/file_tail.c 2007-11-08 10:49:46.0 -0600 +++ linux004/mm/file_tail.c 2007-11-08 10:49:46.0 -0600 @@ -13,6 +13,7 @@ #include linux/buffer_head.h #include linux/fs.h #include linux/hardirq.h +#include linux/module.h #include linux/vm_file_tail.h /* @@ -167,3 +168,13 @@ out: unlock_page(page); return ret; } + +void __vm_file_tail_unpack_on_resize(struct inode *inode, loff_t new_size) +{ + loff_t old_size = i_size_read(inode); + if (new_size old_size) + vm_file_tail_unpack(inode-i_mapping); + else if (new_size PAGE_CACHE_SHIFT != old_size PAGE_CACHE_SHIFT) + vm_file_tail_free(inode-i_mapping); +} +EXPORT_SYMBOL(__vm_file_tail_unpack_on_resize); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC:PATCH 06/09] For readahead, leave data in tail
For readahead, leave data in tail Don't unpack it until it's actually read. Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- mm/readahead.c |5 + 1 file changed, 5 insertions(+) diff -Nurp linux005/mm/readahead.c linux006/mm/readahead.c --- linux005/mm/readahead.c 2007-11-07 08:14:01.0 -0600 +++ linux006/mm/readahead.c 2007-11-08 10:49:46.0 -0600 @@ -16,6 +16,7 @@ #include linux/task_io_accounting_ops.h #include linux/pagevec.h #include linux/pagemap.h +#include linux/vm_file_tail.h void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page) { @@ -147,6 +148,10 @@ __do_page_cache_readahead(struct address if (page_offset end_index) break; + if ((page_offset == end_index) vm_file_tail_packed(mapping)) + /* Tail page is already packed */ + break; + rcu_read_lock(); page = radix_tree_lookup(mapping-page_tree, page_offset); rcu_read_unlock(); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC:PATCH 07/09] shrink_active_list: pack file tails rather than move to inactive list
shrink_active_list: pack file tails rather than move to inactive list The big question is how aggressively we pack the tails. This looked like an easy place to start. If a page is being moved from the active list to the inactive list, and the tail can be safely packed, that is not mapped, not dirty, etc., the tail is packed and the page removed from the page cache. Right now, pages that never get off the inactive list will not be packed. I will be soliciting ideas for other places in the code where tails can be packed. One of my goals is not to be too aggressive, where tails are packed and unpacked repeatedly. I also don't want to add too much overhead, such as an extra scan of the inactive list. Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- mm/vmscan.c |6 ++ 1 file changed, 6 insertions(+) diff -Nurp linux006/mm/vmscan.c linux007/mm/vmscan.c --- linux006/mm/vmscan.c2007-11-07 08:14:01.0 -0600 +++ linux007/mm/vmscan.c2007-11-08 10:49:46.0 -0600 @@ -19,6 +19,7 @@ #include linux/pagemap.h #include linux/init.h #include linux/highmem.h +#include linux/vm_file_tail.h #include linux/vmstat.h #include linux/file.h #include linux/writeback.h @@ -1035,7 +1036,12 @@ force_reclaim_mapped: list_add(page-lru, l_active); continue; } + } else if (vm_file_tail_pack(page)) { + ClearPageActive(page); + page_cache_release(page); + continue; } + list_add(page-lru, l_inactive); } - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()
On Tue, 2007-10-02 at 13:32 +0100, David Howells wrote: Zach Brown [EMAIL PROTECTED] wrote: /* haha, continuing the fine tradition of terrible names in this api.. */ static inline void *PTR_PTR(void *err_ptr) { BUG_ON(!IS_ERR(err_ptr) || !err_ptr); return err_ptr; } How about ERR_CAST() instead? Or maybe CAST_ERR()? It's a better name than PTR_PTR(). :-) struct dentry *blah(...) { struct inode *inode; inode = thing(...); if (IS_ERR(inode)) return ERR_CAST(inode); } Where ERR_CAST is defined as: static inline void *ERR_CAST(const void *error) { return (void *) x; } Of course, the cast is unnecessary, and I'm sure you meant to return error: static inline void *ERR_CAST(const void *error) { return error; } Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()
On Tue, 2007-10-02 at 14:24 +0100, David Howells wrote: Dave Kleikamp [EMAIL PROTECTED] wrote: Of course, the cast is unnecessary, The cast is necessary as the argument is a const pointer and the return type is not. Ah yes. I stand corrected. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 20/30] IGET: Stop JFS from using iget() and read_inode()
On Mon, 2007-10-01 at 14:11 +0100, David Howells wrote: Stop the JFS filesystem from using iget() and read_inode(). Replace jfs_read_inode() with jfs_iget(), and call that instead of iget(). jfs_iget() then uses iget_locked() directly and returns a proper error code instead of an inode in the event of an error. jfs_fill_super() returns any error incurred when getting the root inode instead of EINVAL. Signed-off-by: David Howells [EMAIL PROTECTED] Acked-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/jfs/inode.c | 20 fs/jfs/jfs_inode.h |2 +- fs/jfs/namei.c | 34 ++ fs/jfs/super.c | 15 +-- 4 files changed, 40 insertions(+), 31 deletions(-) -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Jfs-discussion] [2/4] 2.6.23-rc6: known regressions
On Wed, 2007-09-19 at 13:44 +0200, Oliver Neukum wrote: Am Donnerstag 13 September 2007 schrieb Dave Kleikamp: On Wed, 2007-09-12 at 18:58 +0200, Michal Piotrowski wrote: Subject : umount triggers a warning in jfs and takes almost a minute References : http://lkml.org/lkml/2007/9/4/73 Last known good : ? Submitter : Oliver Neukum [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Status : unknown I'm still waiting to hear from Oliver whether or not this is actually a regression. OK, I've done test. On 2.6.22 I was unable to trigger the warning. On 2.6.23-rc6 I get the following warning in about 3/4 of all unmounts: Sep 19 13:08:04 oenone kernel: WARNING: at fs/jfs/jfs_logmgr.c:1643 jfs_flush_journal() Sep 19 13:08:04 oenone kernel: Sep 19 13:08:04 oenone kernel: Call Trace: Sep 19 13:08:04 oenone kernel: [8860f962] :jfs:jfs_flush_journal+0x26a/0x27d Sep 19 13:08:04 oenone kernel: [8029fca1] dispose_list+0xde/0xf7 Sep 19 13:08:04 oenone kernel: [885fd003] :jfs:jfs_umount+0x30/0xe5 Sep 19 13:08:04 oenone kernel: [885f9a31] :jfs:jfs_put_super+0xd/0x5e Sep 19 13:08:04 oenone kernel: [8028ef8f] generic_shutdown_super+0x60/0xf0 Sep 19 13:08:04 oenone kernel: [8028f02c] kill_block_super+0xd/0x1e Sep 19 13:08:04 oenone kernel: [8028f0f7] deactivate_super+0x6a/0x82 Sep 19 13:08:04 oenone kernel: [802a2184] sys_umount+0x249/0x25a Sep 19 13:08:04 oenone kernel: [802616ff] audit_syscall_entry+0x141/0x174 Sep 19 13:08:04 oenone kernel: [8020be8c] tracesys+0xdc/0xe1 Sep 19 13:08:04 oenone kernel: This is on a partition of a usb mass storage device with 2K sector size. IMHO it is a regression. I am recompiling with CONFIG_JFS_DEBUG. Looks like a regression then. I'll take a look at the debug output and get back to you. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote: Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all cases except one handles memory allocation failure so I get rid of those GFP_NOFAIL flags. Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc in jbd/jbd2? I will send a separate patch to cleanup that. No. GFP_NOFS avoids deadlock. It prevents the allocation from making recursive calls back into the file system that could end up blocking on jbd code. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote: On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote: Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all cases except one handles memory allocation failure so I get rid of those GFP_NOFAIL flags. Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc in jbd/jbd2? I will send a separate patch to cleanup that. No. GFP_NOFS avoids deadlock. It prevents the allocation from making recursive calls back into the file system that could end up blocking on jbd code. Oh, I see your patch now. You mean use GFP_NOFS instead of GFP_KERNEL. :-) OK then. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2/3] 2.6.23-rc6: known regressions v2
On Tue, 2007-09-18 at 16:24 +0200, Jan Kara wrote: Subject : umount triggers a warning in jfs and takes almost a minute References : http://lkml.org/lkml/2007/9/4/73 Last known good : ? Submitter : Oliver Neukum [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Status : unknown I thought Shaggy asked Oliver about some details (and he did not answer so far) so I'd assume Shaggy is handling this. I've put it on the back-burner since I haven't heard back from Oliver. I still haven't found out whether or not it is a regression. I'm not too concerned about fixing it right away. I don't think jfs on flash is very important. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote: On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote: On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote: Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Shouldn't we kill jbd_kmalloc instead? It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. I disagree. Why would jbd need to globally change the way it allocates memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a variety of structures. Having to change one particular instance won't necessarily mean we want to change all of them. Adding unnecessary wrappers only obfuscates the code making it harder to understand. You wouldn't want every subsystem to have it's own *_kmalloc() that took different arguments. Besides, there aren't that many calls to kmalloc and kfree in the jbd code, so there wouldn't be much pain in changing GFP flags or whatever, if it ever needed to be done. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2/4] 2.6.23-rc5: known regressions v2
On Sat, 2007-09-08 at 13:11 +0200, Michal Piotrowski wrote: Hi all, Here is a list of some known regressions in 2.6.23-rc5. Feel free to add new regressions/remove fixed etc. http://kernelnewbies.org/known_regressions FS Subject : umount triggers a warning in jfs and takes almost a minute References : http://lkml.org/lkml/2007/9/4/73 Last known good : ? Submitter : Oliver Neukum [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Status : unknown Is this a regression? Oliver, Have you been using jfs on this flash previously? If so, what was the most recent kernel that didn't have problems? Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs/jfs: use DIV_ROUND_UP where appropriate
On Wed, 2007-08-29 at 23:17 -0500, Shaun Zinck wrote: This replaces some macros and code, which do the same thing as DIV_ROUND_UP defined in kernel.h, to use the DIV_ROUND_UP macro. Signed-off-by: Shaun Zinck [EMAIL PROTECTED] Thanks. I've added this to the jfs git tree. It's queued for the -mm tree. Shaggy --- diff --git a/fs/jfs/jfs_dtree.h b/fs/jfs/jfs_dtree.h index 8561c6e..cdac2d5 100644 --- a/fs/jfs/jfs_dtree.h +++ b/fs/jfs/jfs_dtree.h @@ -74,7 +74,7 @@ struct idtentry { #define DTIHDRDATALEN11 /* compute number of slots for entry */ -#define NDTINTERNAL(klen) ( ((4 + (klen)) + (15 - 1)) / 15 ) +#define NDTINTERNAL(klen) (DIV_ROUND_UP((4 + (klen)), 15)) /* @@ -133,7 +133,7 @@ struct dir_table_slot { ( ((s64)((dts)-addr1)) 32 | __le32_to_cpu((dts)-addr2) ) /* compute number of slots for entry */ -#define NDTLEAF_LEGACY(klen)( ((2 + (klen)) + (15 - 1)) / 15 ) +#define NDTLEAF_LEGACY(klen)(DIV_ROUND_UP((2 + (klen)), 15)) #define NDTLEAF NDTINTERNAL diff --git a/fs/jfs/resize.c b/fs/jfs/resize.c index 71984ee..7f24a0b 100644 --- a/fs/jfs/resize.c +++ b/fs/jfs/resize.c @@ -172,7 +172,7 @@ int jfs_extendfs(struct super_block *sb, s64 newLVSize, int newLogSize) */ t64 = ((newLVSize - newLogSize + BPERDMAP - 1) L2BPERDMAP) L2BPERDMAP; - t32 = ((t64 + (BITSPERPAGE - 1)) / BITSPERPAGE) + 1 + 50; + t32 = DIV_ROUND_UP(t64, BITSPERPAGE) + 1 + 50; newFSCKSize = t32 sbi-l2nbperpage; newFSCKAddress = newLogAddress - newFSCKSize; -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Jfs-discussion] [PATCH 2/4] Fix mainline filesystems to handle ATTR_KILL_ bits correctly
On Mon, 2007-08-20 at 16:53 -0400, Jeff Layton wrote: This should fix all of the filesystems in the mainline kernels to handle ATTR_KILL_SUID and ATTR_KILL_SGID correctly. For most of them, this is just a matter of making sure that they call generic_attrkill early in the setattr inode op. Here's my ack for the jfs piece. Signed-off-by: Jeff Layton [EMAIL PROTECTED] Acked-by: Dave Kleikamp [EMAIL PROTECTED] -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Jfs-discussion] [PATCH 15/25] JFS: call attr_kill_to_mode from jfs_setattr
On Mon, 2007-08-06 at 09:54 -0400, Jeff Layton wrote: Signed-off-by: Jeff Layton [EMAIL PROTECTED] Acked-by: Dave Kleikamp [EMAIL PROTECTED] --- fs/jfs/acl.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/fs/jfs/acl.c b/fs/jfs/acl.c index 4d84bdc..34ca314 100644 --- a/fs/jfs/acl.c +++ b/fs/jfs/acl.c @@ -227,6 +227,8 @@ int jfs_setattr(struct dentry *dentry, struct iattr *iattr) struct inode *inode = dentry-d_inode; int rc; + attr_kill_to_mode(inode, iattr); + rc = inode_change_ok(inode, iattr); if (rc) return rc; -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 12/26] ext2 white-out support
On Tue, 2007-07-31 at 13:11 -0400, Josef Sipek wrote: On Tue, Jul 31, 2007 at 07:00:12PM +0200, Jan Blunck wrote: On Tue, Jul 31, Josef Sipek wrote: On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote: Introduce white-out support to ext2. I think storing whiteouts on the branches is wrong. It creates all sort of nasty cases when people actually try to use unioning. Imagine a (no-so unlikely) scenario where you have 2 unions, and they share a branch. If you create a whiteout in one union on that shared branch, the whiteout magically affects the other union as well! Whiteouts are a union-level construct, and therefore storing them at the branch level is wrong. So you think that just because you mounted the filesystem somewhere else it should look different? This is what sharing is all about. If you share a filesystem you also share the removal of objects. The removal happens at the union level, not the branch level. Say you have: /a/ /b/foo /c/foo And you mount /u1 as a union of {a,b}, and /u2 as union of {a,c}. Who does this? I'm assuming that a is the top layer. Aren't union mounts typically about sharing lower layers and having a separate rw layer for each union mount? $ find /u* /u1 /u1/foo /u2 /u2/foo $ rm /u1/foo # this creates whiteout for foo in /a $ find /u* /u1 /u2 Is that what you'd expect as a user? I don't think so. That's exactly what I would expect. If I were to: $ echo this is new /u1/foo I would expect: $ cat /u2/foo this is new So why should rm behave differently? I haven't really been tuned into union mounts, so maybe I'm missing out on something basic here. Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 12/26] ext2 white-out support
On Wed, 2007-08-01 at 14:44 -0400, Josef Sipek wrote: Alright not the greatest of examples, there is something to be said about symmetry, so...let me try again :) /a/ /b/bar(whiteout for bar) /c/foo/qwerty Now, let's mount a union of {a,b,c}, and we'll see: $ find /u /u /u/foo /u/foo/qwerty $ mv /u/foo /u/bar Now what? How do you rename? Do you rename in the same branch (assuming it is rw)? Er, no. According to Documentation/filesystems/union-mounts.txt, only the topmost layer of the mount stack can be altered. If you do, you'll get: $ find /u /u Oops! There's a whiteout in /b that hides the directory in /c -- rename(2) shouldn't make directory subtrees disappear. There are two ways to solve this: 1) cp -r the entire subtree being renamed to highest-priority branch, and rename there (you might have to recreate a series of directories to have a place to cp to...so you got cp -r _AND_ mkdir -p-like code in the VFS! 1/2 a :) ) I think this is the only alternative, given the design. 2) Don't store whiteouts within branches. This makes it really easy to rename and remove the whiteout. Sure, you could try to rename in-place and remove the whiteout, but what if you have: /a/ /b/bar(whiteout) /c/bar/blah /d/foo/qwerty $ mv /u/foo /u/bar You can't just remove the whiteout, because that'd uncover the whited-out directory bar in /c. Josef 'Jeff' Sipek. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vm/fs meetup details
Sorry for the late response, but I'm interested in attending. I didn't think I would be able to justify the trip for the one-day meeting, but I begged an invitation to the VM summit from Martin. I still need to get travel approval, but I'm optimistic that I can justify the trip now. Thanks, Shaggy On Thu, 2007-07-05 at 06:01 +0200, Nick Piggin wrote: Hi, The vm/fs meetup will be held September 4th from 10am till 4pm (with the option of going longer), at the University of Cambridge. Anton Altaparmakov has arranged a conference room for us with whiteboard and projector, so many thanks to him. I will send out the location and plans for meeting/getting there after we work out the best strategy for that. At the moment we have 15 people interested so far. We can have a few more people, so if you aren't cc'ed and would like to come along please let me know. We do have limited space, so I'm sorry in advance if anybody misses out. I'll post out a running list of suggested topics later, but they're really just a rough guideline. It will be a round-table kind of thing and long monologue talks won't be appropriate, however some slides or whiteboarding to interactively introduce and discuss your idea would be OK. I think we want to avoid assigning slots for specific people/topics. Feel free to propose anything, if it only gets a small amount of interest then at least you'll know who to discuss it with later :) Thanks, Nick -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote: It just occurred to me: If i_version is 64bit, then knfsd would need to be careful when reading it on a 32bit host. What are the locking rules? How does knfsd use i_version? I would think that if all it was doing was to compare (i_version == previous_version), then locking wouldn't really matter. Well, theoretically, previous_version could be 0x1, and i_version could be 0x1, knfsd checks the high word, then ext4 updates i_version to 0x2, then knfsd checks the low word, detecting no change. How likely is this? (I don't understand why i_version even needs to be 64 bits in the first place.) Presumably it is only updated under i_mutex protection, but having to get i_mutex to read it would seem a little heavy handed. How does knfsd protect itself from the inode changing after i_version is checked? Is any locking being done otherwise? Should it use a seqlock like i_size? Could we use the same seqlock that i_size uses, or would we need a separate one? NeilBrown -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Faster ext2_clear_inode()
On Mon, 2007-07-09 at 22:00 +0200, Jörn Engel wrote: On Mon, 9 July 2007 22:01:48 +0400, Alexey Dobriyan wrote: Yes. Note that ext2_clear_inode() is referenced from ext2_sops, so even empty, it leaves traces in resulting kernel. Is that your opinion or have you actually measured a difference? I strongly suspect that compilers are smart enough to optimize away a call to an empty static function. It's not a direct call to a static function. It is called as a super_ops method. I don't think the overhead is very significant, but it doesn't look like it could do any harm. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6][TAKE5] fallocate system call
On Thu, 2007-06-28 at 11:33 -0700, Andrew Morton wrote: On Thu, 28 Jun 2007 23:27:57 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: Please drop the non-ext4 patches from the ext4 tree and send incremental patches against the (non-ext4) fallocate patches in -mm. Please let us know what you think of Mingming's suggestion of posting all the fallocate patches including the ext4 ones as incremental ones against the -mm. I think Mingming was asking that Ted move the current quilt tree into git, presumably because she's working off git. I moved the fallocate patches to the very end of the series in the quilt tree. This way the patches will be in the quilt tree for testing, but Ted can easily leave them out of the git tree so you and Linus won't pull them with the ext4 patches. Fortunately, the ext4-specific fallocate patches don't conflict with the other patches in the queue, so they can (at least for now) be handled independently in the -mm tree. I'm not sure what to do, really. The core kernel patches need to be in Ted's tree for testing but that'll create a mess for me. ug. Options might be a) I drop the fallocate patches from -mm and from the ext4 tree, hack up any needed build fixes, then just wait for it all to mature and then think about it again b) We do what we normally don't do and reserve the syscall slots in mainline. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/26] make access() use mnt check
On Mon, 2007-06-25 at 11:27 -0700, Dave Hansen wrote: On Sat, 2007-06-23 at 08:45 +0100, Christoph Hellwig wrote: You probably want to add a big comment explaining why it's fine here. I've got this in the next set: - - if(IS_RDONLY(nd.dentry-d_inode)) + /* +* This is a rare case where using __mnt_is_readonly() +* is OK without a mnt_want/drop_write() pair. Since +* not actual write to the fs is performed here, we do s/not/no/ +* not need to telegraph to that to anyone. Also, we +* accept that access is inherently racy, and know that +* the fs might be remounted between this syscall, and +* any action taken because of its result. +*/ + if (__mnt_is_readonly(nd.mnt)) res = -EROFS; -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH resend] introduce I_SYNC
On Thu, 2007-05-31 at 16:25 +0200, Jörn Engel wrote: --- linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c~I_LOCK2007-05-07 10:28:55.0 +0200 +++ linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c 2007-05-29 13:10:32.0 +0200 @@ -1286,7 +1286,14 @@ int txCommit(tid_t tid, /* transaction * commit the transaction synchronously, so the last iput * will be done by the calling thread (or later) */ - if (tblk-u.ip-i_state I_LOCK) + /* +* I believe this code is no longer needed. Splitting I_LOCK +* into two bits, I_LOCK and I_SYNC should prevent this +* deadlock as well. But since I don't have a JFS testload +* to verify this, only a trivial s/I_LOCK/I_SYNC/ was done. +* Joern +*/ + if (tblk-u.ip-i_state I_SYNC) tblk-xflag = ~COMMIT_LAZY; } I think the code is still needed, and I think this change is correct. The deadlock that this code is avoiding is caused by clear_inode() calling wait_on_inode(). Since clear_inode() now calls inode_sync_wait(inode), we want to avoid the lazily committing this transaction when the I_SYNC flag is set. Unfortunately, recreating the deadlock is hard, and I haven't been able to recreate it with this code commented out. Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/5] inode reservation v0.1 (ext4 kernel patch)
On Thu, 2007-05-24 at 02:06 +0800, coly wrote: The patch is generated based on 2.6.20-ext4-2 branch. you can find the benchmark from other email. DO NOT waste time on reading the patch :-) I post this patch here is to show that I really spent time on it and the patch can work (even not well). I won't waste my time then. I'm discouraged from trying by the lack of indentation. It looks like the tabs got converted to a single space somehow. diff --git a/Makefile b/Makefile index 7e2750f..21d21e4 100644 --- a/Makefile +++ b/Makefile @@ -1,8 +1,7 @@ VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 20 -EXTRAVERSION = -NAME = Homicidal Dwarf Hamster +EXTRAVERSION = inores # *DOCUMENTATION* # To see a list of typical targets execute make help diff --git a/fs/ext4/bitmap.c b/fs/ext4/bitmap.c index 11e93c1..daf88b4 100644 --- a/fs/ext4/bitmap.c +++ b/fs/ext4/bitmap.c @@ -30,3 +30,29 @@ unsigned long ext4_count_free (struct buffer_head * map, unsigned int numchars) #endif /* EXT4FS_DEBUG */ +/* + * Read the inode allocation bitmap for a given block_group, reading + * into the specified slot in the superblock's bitmap cache. + * + * Return buffer_head of bitmap on success or NULL. + */ +struct buffer_head * +read_inode_bitmap(struct super_block * sb, unsigned long block_group) +{ + struct ext4_group_desc *desc; + struct buffer_head *bh = NULL; + + desc = ext4_get_group_desc(sb, block_group, NULL); + if (!desc) + goto error_out; + + bh = sb_bread(sb, ext4_inode_bitmap(sb, desc)); + if (!bh) + ext4_error(sb, read_inode_bitmap, + Cannot read inode bitmap - + block_group = %lu, inode_bitmap = %llu, + block_group, ext4_inode_bitmap(sb, desc)); +error_out: + return bh; +} + Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
On Thu, 2007-05-17 at 09:40 +1000, David Chinner wrote: On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote: On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote: Please don't make this always happen. c/mtime updates should be dependent on the mode being used and whether there is visible change to the file. If no userspace visible changes to the file occurred, then timestamps should not be changed. i_blocks will be updated, so it seems reasonable to update ctime. mtime shouldn't be changed, though, since the contents of the file will be unchanged. That's assuming blocks were actually allocated - if the prealloc range already has underlying blocks there is no change and so we should not be changing mtime either. Only the filesystem will know if it has changed the file, so I think that timestamp updates need to be driven down to that level, not done blindy at the highest layer Yes, I agree. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote: On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote: Following changes were made to the previous version: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len = 0. 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. Please don't make this always happen. c/mtime updates should be dependent on the mode being used and whether there is visible change to the file. If no userspace visible changes to the file occurred, then timestamps should not be changed. i_blocks will be updated, so it seems reasonable to update ctime. mtime shouldn't be changed, though, since the contents of the file will be unchanged. e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate() extending the file, otherwise no change in timestamps should occur. Cheers, Dave. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Tue, 2007-05-08 at 16:22 +0530, Amit K. Arora wrote: On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote: On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Ok. Will add this in the function description as well. Also, posix says nothing about fallocate() returning ENOTTY. Right. I don't seem to find any suitable error from posix description. Can you please suggest an error code which might make more sense here ? Will -ENOTSUPP be ok ? Since we want to say here that we don't support non-extent files. Isn't the idea that libc will interpret -ENOTTY, or whatever is returned here, and fall back to the current library code to do preallocation? This way, the caller of fallocate() will never see this return code, so it won't violate posix. You are right. But, we still need to standardize (and limit) the error codes which we should return from kernel when we want to fall back on the library implementation. The posix_fallocate() library function will have to look for a set of errors from fallocate() system call, upon receiving which it will do preallocation from user level; or else, it will return success/error-code returned by the system call to the user. I think we can make it fall back to library implementation of fallocate, whenever posix_fallocate() receives any of the following errors from fallocate() system call: 1. ENOSYS 2. EOPNOTSUPP 3. ENOTTY(?) Now the question is - should we limit the set of errors for this purpose to just 1 2 above ? In that case I will need to change the error being returned here to -EOPNOTSUPP (from current -ENOTTY). If you want my opinion, -EOPNOTSUPP is better than -ENOTTY. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] JFS: Fix race waking up jfsIO kernel thread
I've been looking at a hang that was reported off-list, and I believe I found the cause. I'm still waiting for confirmation on whether the patch does fix the problem, but I wanted to distribute the patch in case anyone else is seeing a similar hang. It looks like the problem is in kernels from 2.6.17 to the present. Thanks, Shaggy JFS: Fix race waking up jfsIO kernel thread It's possible for a journal I/O request to be added to the log_redrive queue and the jfsIO thread to be awakened after the thread releases log_redrive_lock but before it sets its state to TASK_INTERRUPTIBLE. The jfsIO thread should set its state before giving up the spinlock, so the waking thread will really wake it. Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c index ff7f1be..16c6268 100644 --- a/fs/jfs/jfs_logmgr.c +++ b/fs/jfs/jfs_logmgr.c @@ -2354,12 +2354,13 @@ int jfsIOWait(void *arg) lbmStartIO(bp); spin_lock_irq(log_redrive_lock); } - spin_unlock_irq(log_redrive_lock); if (freezing(current)) { + spin_unlock_irq(log_redrive_lock); refrigerator(); } else { set_current_state(TASK_INTERRUPTIBLE); + spin_unlock_irq(log_redrive_lock); schedule(); __set_current_state(TASK_RUNNING); } -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux page cache issue?
On Wed, 2007-03-28 at 02:45 -0400, Xin Zhao wrote: Hi, If a Linux process opens and reads a file A, then it closes the file. Will Linux keep the file A's data in cache for a while in case another process opens and reads the same in a short time? I think that is what I heard before. Yes. But after I digged into the kernel code, I am confused. When a process closes the file A, iput() will be called, which in turn calls the follows two functions: iput_final()-generic_drop_inode() A comment from the top of fs/dcache.c: /* * Notes on the allocation strategy: * * The dcache is a master of the icache - whenever a dcache entry * exists, the inode will always exist. iput() is done either when * the dcache entry is deleted or garbage collected. */ Basically, as long a a dentry is present, iput_final won't be called on the inode. But from the following calling chain, we can see that file close will eventually lead to evict and free all cached pages. Actually in truncate_complete_page(), the pages will be freed. This seems to imply that Linux has to re-read the same data from disk even if another process B read the same file right after process A closes the file. That does not make sense to me. /***calling chain ***/ generic_delete_inode/generic_forget_inode()- truncate_inode_pages()-truncate_inode_pages_range()- truncate_complete_page()-remove_from_page_cache()- __remove_from_page_cache()-radix_tree_delete() Am I missing something? Can someone please provide some advise? Thanks a lot -x Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] add file position info to proc
On Fri, 2007-03-09 at 14:54 +0100, Miklos Szeredi wrote: Comments are welcome. This patch adds support for finding out the current file position, open flags and possibly other info in the future. These new entries are added: /proc/PID/fdinfo/FD /proc/PID/task/TID/fdinfo/FD For each fd the information is provided in the following format: pos: 1234 flags:012 I think this information would be a little easier to access if there would be a single file per pid or thread containing something like: handle flags pos path 0 012 1234/dev/pts/1 1 014 5678/tmp/output etc. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] add file position info to proc
On Fri, 2007-03-09 at 10:23 -0500, Benjamin LaHaise wrote: On Fri, Mar 09, 2007 at 09:15:06AM -0600, Dave Kleikamp wrote: I think this information would be a little easier to access if there would be a single file per pid or thread containing something like: handle flags pos path 0 012 1234/dev/pts/1 1 014 5678/tmp/output etc. That would not be a good idea, as not all users have the same permissions for viewing this information. How will this have different permission issues than Miklos' patch? It's also quite against the design philosophy used elsewhere in /proc. What's so different between this and /proc/pid/maps or /proc/pid/mounts? -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mount option to ignore permissions
On Thu, 2007-03-08 at 23:39 +0100, Ihar `Philips` Filipau wrote: The problem have beaten me before. And now I have it again. Imaging external hard drive with proper file system (proper == supports posix permissions) where files were created by user A and then it (ext. hard drive) was brought to another location/computer and user B tried to read them. Failure. Why? Because Linux preserved permissions on hard drive - though they are already irrelevant on system fs is currently mounted on. And that renders literally all files accessible only by root. What is needed is special mount option to tell file system (*): (1) to ignore permissions when file/directory is read; (2) when file/directory is created it receives automatically world writable permissions 0666 (I cannot imaging how to simulate user friendly file attribute read-only, though it seems not relevant to external storage anyway). I'm looking into the code and it seems that every file system parses option on their own. Global flags (ro/rw, nodev, etc) are handled by mount(8) itself and passed to sys_mount() as bitmask. How gid/uid are passed to file system? I do not see them in parameter list to sys_mount(). Or they are handled somehow otherwise? A posix file system doesn't have a uid or gid associated with it. These are set based on the owner of the process creating the file or directory. Any ideas on how I can simulate such behavior and or on how to implement such attribute would be appreciated. jfs does support mount options to override uid, gid, and umask for existing files. The reason for this is so a file system can be shared between linux and os2, and os2 doesn't use these fields. (I just realized that I've never documented these flags.) I don't know how other file systems' maintainers would feel about supporting these flags. I can see how it would be useful for external hard drives. P.S. chmod/chown isn't option since (1) they do not work for ro file system and (2) doing that every time on NNNk files might quite tiresome - every time disk is reattached. This is true of chown, but if you would chmod everything to 777 (or 666 for non-directories), you wouldn't have to repeat that every time you reattach. You could mount the drive under a directory with restricted permissions to have some degree of security. P.P.S. BTW MacOSX has such option and it is automatically selected for external hard drives. What are the details? If something were to be proposed, it wouldn't hurt to try to be consistent. P.P.P.S. That doesn't happen with most external hard drives since they are all FAT32. I moved to ext2/hfs+ in part to avoid the recurring nightmares of past when I have worked with FAT32 all day long. And also ext2/hfs+ (under Linux/MacOSX) are better than FAT*. And also I need case sensitiveness. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] add file position info to proc
On Fri, 2007-03-09 at 22:03 +0100, Miklos Szeredi wrote: I think this information would be a little easier to access if there would be a single file per pid or thread containing something like: handle flags pos path 0 012 1234/dev/pts/1 1 014 5678/tmp/output etc. That would not be a good idea, as not all users have the same permissions for viewing this information. Since /proc/$$/fd is S_IRWXU, only the user owning the process (plus usually root) can view it. So making the single file you propose S_IRWXU too should solve any security issue, should not it? I think the problem is not with permissions, but with scaling to large numbers of file descriptors. Would not that be dealt accordingly with seq_file? Yes, most likely it wouldn't be so bad either. But it _would_ be: a) more complex, this implementation reuses most of the the /proc/fd code b) less efficient for querying a single fd So I prefer the current solution, but not very strongly. I don't have a real strong opinion, but I wanted to throw that out as a suggestion. You're the one writing the code. Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2007-03-02 at 18:45 +0800, Andreas Dilger wrote: On Mar 01, 2007 13:15 -0600, Eric Sandeen wrote: One thing I'd like to see is a cmd argument as well, to allow for example allocation vs. reservation (i.e. allocating blocks vs. simply reserving a number), as well as the inverse of those functions (un-reservation, de-allocation)? If the allocation interface allows allocation/reservation within arbitrary ranges, if the only way to un-allocate is via a truncate, that's pretty asymmetric. I'd rather we just get the oft-discussed punch() syscall instead. This is really what unallocate would do for persistent allocations and it would be useful for files that were not preallocated. I can see a difference though. punch() would throw away written data as well as pre-allocated-but-never-written-to data. I can see where a user might preallocate a large file and do a lot of random writes. At some point, he decides the file isn't going to grow much more, so let's free up the remaining pre-allocated blocks. This makes even more sense with reservation. The alternative would be to have punch() take a flag to specify if only preallocated or reserved blocks should be freed. For filesystems that don't implement punch glibc() would do zero-filling of the punched area I guess (to make it equivalent to reading from a hole in the file). Or it could just fail. Writing zeroes may be really slow and not give the caller any benefit. (The intention was to free blocks back to the file system.) Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
Amit wrote: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); On Thu, 2007-03-01 at 22:16 -0800, Andrew Morton wrote: On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: Just curious .. What does posix_fallocate() return ? bookmark this: http://www.opengroup.org/onlinepubs/009695399/nfindex.html Upon successful completion, posix_fallocate() shall return zero; otherwise, an error number shall be returned to indicate the error. Then there's no need for sys_allocate to return a long. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: On Fri, 2 Mar 2007 00:04:45 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + file = fget(fd); + if (!file) + goto out; + inode = file-f_path.dentry-d_inode; + if (inode-i_op inode-i_op-fallocate) + ret = inode-i_op-fallocate(inode, offset, len); + else + ret = -ENOTTY; + fput(file); +out: +return ret; +} ENOTTY is a bit unconventional - we often use EINVAL for this sort of thing. But EINVAL has other meanings for posix_fallocate() and isn't really appropriate here anyway. So I'm not sure what would be better... Would EINVAL (or whatever) make it back to the caller of posix_fallocate(), or would glibc fall back to its current implementation? Forgive me if I haven't put enough thought into it, but would it be useful to create a generic_fallocate() that writes zeroed pages for any non-existent pages in the range? I don't know how glibc currently implements posix_fallocate(), but maybe the kernel could do it more efficiently, even in generic code. Maybe we don't care, since the major file systems can probably do something better in their own code. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote: On Thu, 01 Mar 2007 22:44:16 + Dave Kleikamp [EMAIL PROTECTED] wrote: On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote: On Fri, 2 Mar 2007 00:04:45 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + file = fget(fd); + if (!file) + goto out; + inode = file-f_path.dentry-d_inode; + if (inode-i_op inode-i_op-fallocate) + ret = inode-i_op-fallocate(inode, offset, len); + else + ret = -ENOTTY; + fput(file); +out: +return ret; +} ENOTTY is a bit unconventional - we often use EINVAL for this sort of thing. But EINVAL has other meanings for posix_fallocate() and isn't really appropriate here anyway. So I'm not sure what would be better... Would EINVAL (or whatever) make it back to the caller of posix_fallocate(), or would glibc fall back to its current implementation? Forgive me if I haven't put enough thought into it, but would it be useful to create a generic_fallocate() that writes zeroed pages for any non-existent pages in the range? I don't know how glibc currently implements posix_fallocate(), but maybe the kernel could do it more efficiently, even in generic code. Maybe we don't care, since the major file systems can probably do something better in their own code. Given that glibc already implements fallocate for all filesystems, it will need to continue to do so for filesystems which don't implement this syscall - otherwise applications would start breaking. I didn't make it clear, but my point was to call generic_fallocate if the file system did not define i_op-allocate(). if (inode-i_op inode-i_op-fallocate) ret = inode-i_op-fallocate(inode, offset, len); else ret = generic_fallocate(inode, offset, len); I'm not sure it's worth the effort, but I thought I'd throw the idea out there. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [linux-cifs-client] i_mutex and deadlock
On Fri, 2007-02-23 at 10:02 -0600, Steve French (smfltc) wrote: A field in i_size_write (i_size_seqcount) must be protected against simultaneous update otherwise we risk looping in i_size_read. The suggestion in fs.h is to use i_mutex which seems too dangerous due to the possibility of deadlock. I'm not sure if it's as much a suggestion as a way of documenting the locking that exists (or existed when the comment was written). ... i_size_write() does need locking around it (normally i_mutex) ... There are 65 places in the fs directory which lock an i_mutex, and seven more in the mm directory. The vfs does clearly lock file inodes in some paths before calling into a particular filesystem (nfs, ext3, cifs etc.) - in particular for fsync but probably for others that are harder to trace. This seems to introduce the possibility of deadlock if a filesystem also uses i_mutex to protect file size updates Documentation/filesystems/Locking describes the use of i_mutex (was i_sem previously) and indicates that it is held by the vfs on three additional calls on file inodes which concern me (for deadlock possibility), setattr, truncate and unlink. nfs seems to limit its use of i_mutex to llseek and invalidate_mapping, and does not appear to grab the i_mutex (or any sem for that matter) to protect i_size_write (nfs calls i_size_write in nfs_grow_file) - and for the case of nfs_fhget (in which they bypass i_size_write and set i_size directly) does not seem to grab i_mutex either. ext3 also does not use i_mutex for this purpose (protecting i_size_write) - ony to protect a journalling ioctl. I am concerned about using i_mutex to protect the cifs calls to i_size_write (although it seems to fix a problem reported in i_size_read under stress) because of the following: 1) no one else calls i_size_write AFAIK (on our file inodes) I think you're right. 2) we don't block inside i_size_write do we ... (so why in the world do they take a slow mutex instead of a fast spinlock) My guess, is that in existing cases, it was already being held, so there is no need to do something different. I'm not sure if the comment is still accurate. What locking protects it in generic_commit_write() and nobh_commit_write()? 3) we don't really know what happens inside fsync (the paths through the page cache code seem complex and we don't want to reenter writepage in low memory conditions and deadlock updating the file size), and there is some concern that the vfs takes the i_mutex in other paths on file inodes before entering our code and could deadlock. Any reason, why an fs shouldn't simply use something else (a spinlock) other than i_mutex to protect the i_size_write call? i_mutex doesn't make sense in your case. Use whatever makes sense in cifs. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Accessing file-offset info for fds in /proc?
On Tue, 2007-02-20 at 02:31 -0500, Hank Leininger wrote: Is there anything provided by the kernel that would let you see the current offset of an existing filehandle? Sometimes when processing a very large file (grepping a log, bzip2'ing or gpg'ing a file, or whatever), I'd really like to know how far along it is, because I'm impatient. lsof has an -o flag to show offsets for file descriptors it lists, but it appears that's not supported under Linux. It looks like all of the information lsof and fuser print about files in use, etc can be gotten from /proc/*/fd/* (and /proc/*/maps, but I'm not really concerned with mmap'ed files, just positions on fds). Sometimes I'll resort to strace -s4096'ing the process to see what chunk of text it's currently reading, and try to guess from that. Silly. Has anybody ever developed a patch to implement this? I realize this could create a variety of information-leakage problems; the information probably would need to be restricted, such as by the same rules as dumpable. Are there any horribly painful reasons why this couldn't be done? It shouldn't be too painful. The code to populate /proc/*/fd/ has the file struct. It just doesn't have a place pass the offset to user-space since it's basically creating a symlink. In proc_fd_link(), it has the file struct. The offset is file-f_pos. One could create something like /proc/*/fd_offsets, whose read method could list the file descriptor, path, and offset for each open file. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fix(es) for ext2 fsync bug
On Tue, 2007-02-20 at 21:39 +, Valerie Henson wrote: On Tue, Feb 20, 2007 at 01:30:25PM -0800, Junfeng Yang wrote: On 2/20/07, Valerie Henson [EMAIL PROTECTED] wrote: Google. (GoogleFS runs on top of ext2.) It's surprising to know that... I guess they reply on GoogleFS's own replication and checksumming for consistency. Yep, they just want a local file system with ultrafast on-line performance. They don't care about recovery time particularly because of the GoogleFS replication (although I heard rumors they have some fast fsck scheme, maybe resembling the dirty bit stuff I did last year). I wonder if they would consider this a important bug? I know nothing about GoogleFS, but I would guess that they have more sophisticated recovery than relying on an fsync shortly before a crash to ensure data integrity. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: libfs.c:simple_lookup
On Mon, 2007-02-19 at 21:09 +0100, Jan Engelhardt wrote: Hello, simple_lookup() in fs/libfs.c does some extra steps, namely dentry-d_op = simple_dentry_operations; d_add(dentry, NULL); as far as I understand, this creates a negative dentry which will be deleted sometime later again. Is not it easier to not create it at all (since it is not going to be existent anyway)? I'm not sure I understand the question, so allow me to restate a similar question, and you can tell me if we're asking the same thing. In general, the negative dentry avoids the overhead of looking up a non-existent entry more than once by remembering that the entry does not exist. Since there is almost no overhead to calling simple_lookup(), would be be better off simply returning without creating the negative dentry (and letting simple_lookup be called more often)? Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Race between __sync_single_inode() and LogFS garbage collector
On Mon, 2007-02-19 at 21:31 +, Jörn Engel wrote: Looks like I really write the first log-structured filesystem for Linux. At least I can into a fairly arcane race that seems to be generic to all of them. Writing when space is tight may involve calling the garbage collector. The garbage collector will iget() random inodes, either to verify if a block is valid or to copy the block around. At this point, all writes to LogFS are serialized. __sync_single_inode() will first lock a random inode, then call write_inode(), then unlock the inode. So we can get this: __sync_single_inode() garbage collector - inode-i_state |= I_LOCK; ... ... mutex_lock(super-s_w_mutex); write_inode(inode, wait); ... ... iget(sb, ino); mutex_lock(super-s_w_mutex); ... ... wait_on_inode(inode); mutex_unlock(super-s_w_mutex); ... ... inode-i_state = ~I_LOCK; And once in a blue moon, those two will race for the same inode. As far as I can see, the race can only get fixed in two ways: 1. Never iget() inside the garbage collector. That would require having a private inode cache for LogFS. 2. Synchonize __sync_single_inode() and the garbage collector somehow. Variant 1 would result in double caching for the same object, something I would like to avoid. So does anyone have suggestions how variant 2 could be achieved? Essentially what I need is a way to say don't sync any inodes right now, I'll be back in 5 milliseconds or so. It'd be nice if you could drop s_w_mutex when the garbage collector calls i_get(). Otherwise, you may be able to call ilookup5_nowait() in the garbage collector, and skip that inode if I_LOCK is set. Jörn -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Race between __sync_single_inode() and LogFS garbage collector
On Mon, 2007-02-19 at 23:23 +, Jörn Engel wrote: On Mon, 19 February 2007 17:05:55 -0600, Dave Kleikamp wrote: It'd be nice if you could drop s_w_mutex when the garbage collector calls i_get(). Not possible. Garbage collection only happens when space is getting tight. At that moment all writes must be serialized or this race will be the least of my problems. :( Otherwise, you may be able to call ilookup5_nowait() in the garbage collector, and skip that inode if I_LOCK is set. Also not possible. I cannot skip that inode, or again this race will be a minor problem. The inode exists on the medium and I must get it by some means. Re-reading it from the medium is fine, writing is not and waiting for the write to happen brings me back to square one. Okay, I get it now. You've got more constraints than I initially realized. It is a nasty problem that has been haunting me for about a year now. For a while I tried ilookup5_nowait() and just used the inode in spite of the lock. But that will explode spectacularly when racing against generic_drop_inode(). You've obviously given this a lot more thought that I have, but this sounds like something that has possibilities. You couldn't implement your own drop_inode method that does better locking against the garbage collector? Double-caching or a common lock seem to be the only solutions. Jörn -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote: Another very heavyweight approach would be to simply force a full sync of the filesystem whenever fysnc() is called. Not pretty, and without the proper write ordering, the race is still potentially there. I don't think this race is an issue, in that it would require the crash to happen before the fsync completed, so there would be no expectation that the data is safe. It's a moot point, since I don't think this is an acceptable solution anyway. I'd say that the best way to handle this is in fsck, but quite frankly it's relatively low priority bug to handle, since a much simpler workaround is to tell people to use ext3 instead. Right. Who's still using ext2? -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 10:59 -0500, sfaibish wrote: On Thu, 15 Feb 2007 10:09:22 -0500, Dave Kleikamp [EMAIL PROTECTED] wrote: On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote: Another very heavyweight approach would be to simply force a full sync of the filesystem whenever fysnc() is called. Not pretty, and without the proper write ordering, the race is still potentially there. I don't think this race is an issue, in that it would require the crash to happen before the fsync completed, so there would be no expectation that the data is safe. It's a moot point, since I don't think this is an acceptable solution anyway. I'd say that the best way to handle this is in fsck, but quite frankly it's relatively low priority bug to handle, since a much simpler workaround is to tell people to use ext3 instead. Right. Who's still using ext2? It was my understanding from the persentation of Dawson that ext3 and jfs have same problem. Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug that eXplode has uncovered in the past. I'm not sure what was mentioned in the presentation though. I'd like any information about current problems in jfs. It is not an ext2 only problem. Also whatever solution we adopt we need to be sure that we test it using the eXplode methodology. /Sorin -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 11:11 -0800, Junfeng Yang wrote: Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug that eXplode has uncovered in the past. I'm not sure what was mentioned in the presentation though. I'd like any information about current problems in jfs. I believe you have fixed the JFS fsync bug, Dave. It was caused by reusing a directory inode as a file inode. If the machine crashes later, fsck would think this file is a directory, and clear all its data. Yeah. That one was fixed a while back. Thanks for clearing this up. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 07:31 +1100, David Chinner wrote: On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote: Just some quick notes on possible ways to fix the ext2 fsync bug that eXplode found. Whether or not anyone will bother to implement it is another matter. Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's metadata is all written out, fsck will complete the truncate for file A... thereby deleting file B's data. So fsync file B doesn't guarantee data is on disk after a crash. Details: http://www.stanford.edu/~engler/explode-osdi06.pdf Two possible solutions I can think of: * Rearrange order of duplicate block checking and fixing file size in fsck. Not sure how hard this is. (Ted?) * Keep a set of still allocated on disk block bitmaps that gets flushed whenever a sync happens. Don't allocate these blocks. Journaling file systems already have to do this. You don't need anything on disk or to fsck to fix this problem - just avoid it completely by keeping a list of recently truncated blocks in memory and don't reuse them until the old owner inode is sync'd to disk. I think that's pretty much what Val is suggesting. She suggests bitmaps rather than a list though. Maybe she should have used a better term than flushed, as this list only needs to be cleared, rather than written to disk. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] small smbfs bugfix
On Fri, 2007-01-26 at 09:51 +0300, Pavel Fedin wrote: Hello! This small patch fixes a problem with smbfs in 2.6.19.2 kernel. Should also apply to other kernels. NACK The problem is that smbfs doesn't understand user and noexec options. This makes it impossible to include into /etc/fstab lines like: /pc2/myshare /mnt/smb/pc2/myshare smbfs user,noauto 0 0 Smbmount expands user,noauto to user,noauto,noexec plus something else, don't remember exactly what. In the result such an entry simply can't be mounted because smbfs fails due to unknown options. This patch makes smbfs silently ignoring all unknown options instead of producing an error. I believe the problem is in smbmount. At the system call level, user and noauto are not valid options. These are usually parsed by the mount command, but not sent to the mount syscall. noexec also shouldn't be passed as a string in the syscall, but the MS_NOEXEC flag should be set in the mountflags argument. Also i would suggest to introduce this behavor to all other filesystems. This would greatly improve autofs usability. Imagine a situation: i have a ZIP drive and i'd like to be able to use ext2, ext3 and vfat-formatted cartridges. Also i'm russian user so i use NLS feature for names translation. In the fstab i would write: /dev/hdd /mnt/zip ext2,ext3,vfat user,noauto,iocharset=koi8-r,codepage=866 0 0 But in this case ext2 and ext3 wouldn't work at all since they would fail due to unknown iocharset and codepage options. If the file system doesn't honor the option, it shouldn't accept it. In the first example, if you really want to mount with noexec, and smbfs simply ignores it, you've undermined the security of the system. Not recognizing the iocharset and codepage probably isn't as big a problem, but ext2/3 would not be behaving the same was as if it truly respected those options. -- Best regards, Pavel mailto:[EMAIL PROTECTED] Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH: 2.6.20-rc4-mm1] JFS: Avoid deadlock introduced by explicit I/O plugging
On Thu, 2007-01-18 at 01:30 -0500, Josef Sipek wrote: On Wed, Jan 17, 2007 at 04:55:49PM -0600, Dave Kleikamp wrote: /* * jfs_lock.h @@ -42,6 +43,7 @@ do { \ if (cond) \ break; \ unlock_cmd; \ + blk_replug_current_nested();\ schedule(); \ lock_cmd; \ } \ Is {,un}lock_cmd a macro? ... They are the commands passed into this macro (as arguments) to release/take a lock. This is a home-grown wait_on_event type macro where the condition must be checked while holding a lock. And the commands passed in are themselves macros. The jfs code could probably be cleaned up a bit as far as excessive use of macros for locking, but it's been working for a few years with few problems. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH: 2.6.20-rc4-mm1] JFS: Avoid deadlock introduced by explicit I/O plugging
On Thu, 2007-01-18 at 10:18 +1100, Jens Axboe wrote: Can you try io_schedule() and verify that things just work? I actually did do that in the first place, but wondered if it was the right thing to introduce the accounting changes that came with that. I'll change it back to io_schedule() and test it again, just to make sure. If that's the right fix, I can push it directly since it won't have any dependencies on your patches. Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH: 2.6.20-rc4-mm1] JFS: Avoid deadlock introduced by explicit I/O plugging
On Thu, 2007-01-18 at 10:46 +1100, Jens Axboe wrote: On Wed, Jan 17 2007, Dave Kleikamp wrote: On Thu, 2007-01-18 at 10:18 +1100, Jens Axboe wrote: Can you try io_schedule() and verify that things just work? I actually did do that in the first place, but wondered if it was the right thing to introduce the accounting changes that came with that. I'll change it back to io_schedule() and test it again, just to make sure. It appears to be the correct change to me - you really are waiting for IO resources (otherwise it would not hang with the plug change), so doing an inc/dec of iowait around the schedule should be done. Okay, here it is. If that's the right fix, I can push it directly since it won't have any dependencies on your patches. Perfect! It should make the next -mm. JFS: call io_schedule() instead of schedule() to avoid deadlock The introduction of Jens Axboe's explicit i/o plugging patches introduced a deadlock in jfs. This was caused by the process initiating I/O not unplugging the queue before waiting on the commit thread. The commit thread itself was waiting for that I/O to complete. Calling io_schedule() rather than schedule() unplugs the I/O queue avoiding the deadlock, and it appears to be the right function to call in any case. Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] --- commit 4aa0d230c2cfc1ac4bcf7c5466f9943cf14233a9 tree b873dce6146f4880c6c48ab53c0079566f52a60b parent 82d5b9a7c63054a9a2cd838ffd177697f86e7e34 author Dave Kleikamp [EMAIL PROTECTED] Wed, 17 Jan 2007 21:18:35 -0600 committer Dave Kleikamp [EMAIL PROTECTED] Wed, 17 Jan 2007 21:18:35 -0600 fs/jfs/jfs_lock.h |2 +- fs/jfs/jfs_metapage.c |2 +- fs/jfs/jfs_txnmgr.c |2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/jfs/jfs_lock.h b/fs/jfs/jfs_lock.h index 7d78e83..df48ece 100644 --- a/fs/jfs/jfs_lock.h +++ b/fs/jfs/jfs_lock.h @@ -42,7 +42,7 @@ do { \ if (cond) \ break; \ unlock_cmd; \ - schedule(); \ + io_schedule(); \ lock_cmd; \ } \ current-state = TASK_RUNNING; \ diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index ceaf03b..58deae0 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -56,7 +56,7 @@ static inline void __lock_metapage(struct metapage *mp) set_current_state(TASK_UNINTERRUPTIBLE); if (metapage_locked(mp)) { unlock_page(mp-page); - schedule(); + io_schedule(); lock_page(mp-page); } } while (trylock_metapage(mp)); diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c index d558e51..6988a10 100644 --- a/fs/jfs/jfs_txnmgr.c +++ b/fs/jfs/jfs_txnmgr.c @@ -135,7 +135,7 @@ static inline void TXN_SLEEP_DROP_LOCK(wait_queue_head_t * event) add_wait_queue(event, wait); set_current_state(TASK_UNINTERRUPTIBLE); TXN_UNLOCK(); - schedule(); + io_schedule(); current-state = TASK_RUNNING; remove_wait_queue(event, wait); } -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via pointer conversion)
On Fri, 2006-11-17 at 10:06 -0500, Jeff Layton wrote: On Fri, 2006-11-17 at 09:01 -0600, Dave Kleikamp wrote: Wouldn't you only be able to only crack a few of the low-order bits due to a cluster of inodes being sequential? I don't think you'd be able crack enough of it to be useful. You may be able to determine where some inodes are relative to others, but I don't think you'd be able to point the their location in memory. I don't know anything about crypto, so I could be wrong. On a 64-bit kernel, that would be the case. On a 32-bit kernel, there are no high order bits to chop off, so this would effectively give you the address. By a few of the low-order bits, I mean very few, not 32. The slab allocator generally allocates one page at a time, so you typically don't get more than about 6 inodes in a slab page. Even if you consider that you may be able allocate several slab pages together, it would be hard to get very many contiguous pages of inode slab cache. Even if it were possible to force the system to allocate around 256 contiguous inodes, you would still only be able to determine 8 bits out of 32. At most one or two more if you could determine a pattern of inode numbers that would be skipped due to the inclusion of struct inode within the fs-dependent inode structure. I may be wrong, but I doubt that someone could determine the entire mask from the observed i_ino's. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mark_inode_dirty vs mark_inode_dirty_sync
On Fri, 2005-09-02 at 11:55 +0200, David Sanchez wrote: Hi, Please, could somebody explain me what the mark_inode_dirty* functions do and what is the difference between mark_inode_dirty and mark_inode_dirty_sync ? They put the inode on the superblock's dirty list and make the inode as dirty in the i_state field. This makes sure that the inode will eventually be written to disk. mark_inode_dirty_sync only sets the I_DIRTY_SYNC flag, which does not imply that any file data was changed. It is called when a minor change is made to an inode, such as a timestamp is changed. Some sync operations will only write the inode if data was written, so can avoid writing the an inode that is only dirtied by I_DIRTY_SYNC. mark_inode_dirty sets I_DIRTY which is I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES. This indicates that the in-memory inode has changes to the data that have not yet been written to disk. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mirror a file system on the fly
On Fri, 2005-08-19 at 01:10 +0530, Dave Schwartz wrote: Hi list, Not too sure if this is the right forum to ask this question but since my requirement is around linux filesystems, I shall take this liberty to post my question. My requirement is to develop a kernel/user space module to add an extension to the shell program environment such that this shell forks a mirror look-alike filesystem of the underlying OS to the programs run in that particular shell. Was trying to look thru the FAQ and a few list archives to look for ideas around my requirement. The archives were overwhelming. Any ideas/pointers will be a great help, The chroot command and mount --bind should give you a good start. Gracias, decebel -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH 0/2] JFS atomic inode security labeling
Here's the jfs implementation. The first patch fixes the transaction layering in the xattr code to allow xattrs to be added to an inode as a part of an existing transaction, and correctly calls jfs_init_acl() within the same transaction that creates a file/directory/device. The second patch actually implements jfs_init_security() similarly to the ext2/3 patches. The first patch was lightly tested, and the second was compile tested only. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH 1/2] JFS atomic xattr/acl handling
__jfs_setxattr should allow extended attributes to be set within an existing transaction diff -urp linux-2.6.13-rc3/fs/jfs/acl.c linux-1/fs/jfs/acl.c --- linux-2.6.13-rc3/fs/jfs/acl.c 2005-07-13 10:03:40.0 -0500 +++ linux-1/fs/jfs/acl.c2005-07-14 09:19:50.0 -0500 @@ -23,6 +23,7 @@ #include linux/quotaops.h #include linux/posix_acl_xattr.h #include jfs_incore.h +#include jfs_txnmgr.h #include jfs_xattr.h #include jfs_acl.h @@ -75,7 +76,8 @@ static struct posix_acl *jfs_get_acl(str return acl; } -static int jfs_set_acl(struct inode *inode, int type, struct posix_acl *acl) +static int jfs_set_acl(tid_t tid, struct inode *inode, int type, + struct posix_acl *acl) { char *ea_name; struct jfs_inode_info *ji = JFS_IP(inode); @@ -110,7 +112,7 @@ static int jfs_set_acl(struct inode *ino if (rc 0) goto out; } - rc = __jfs_setxattr(inode, ea_name, value, size, 0); + rc = __jfs_setxattr(tid, inode, ea_name, value, size, 0); out: kfree(value); @@ -143,7 +145,7 @@ int jfs_permission(struct inode *inode, return generic_permission(inode, mask, jfs_check_acl); } -int jfs_init_acl(struct inode *inode, struct inode *dir) +int jfs_init_acl(tid_t tid, struct inode *inode, struct inode *dir) { struct posix_acl *acl = NULL; struct posix_acl *clone; @@ -159,7 +161,7 @@ int jfs_init_acl(struct inode *inode, st if (acl) { if (S_ISDIR(inode-i_mode)) { - rc = jfs_set_acl(inode, ACL_TYPE_DEFAULT, acl); + rc = jfs_set_acl(tid, inode, ACL_TYPE_DEFAULT, acl); if (rc) goto cleanup; } @@ -173,7 +175,8 @@ int jfs_init_acl(struct inode *inode, st if (rc = 0) { inode-i_mode = mode; if (rc 0) - rc = jfs_set_acl(inode, ACL_TYPE_ACCESS, clone); + rc = jfs_set_acl(tid, inode, ACL_TYPE_ACCESS, +clone); } posix_acl_release(clone); cleanup: @@ -202,8 +205,15 @@ static int jfs_acl_chmod(struct inode *i return -ENOMEM; rc = posix_acl_chmod_masq(clone, inode-i_mode); - if (!rc) - rc = jfs_set_acl(inode, ACL_TYPE_ACCESS, clone); + if (!rc) { + tid_t tid = txBegin(inode-i_sb, 0); + down(JFS_IP(inode)-commit_sem); + rc = jfs_set_acl(tid, inode, ACL_TYPE_ACCESS, clone); + if (!rc) + rc = txCommit(tid, 1, inode, 0); + txEnd(tid); + up(JFS_IP(inode)-commit_sem); + } posix_acl_release(clone); return rc; diff -urp linux-2.6.13-rc3/fs/jfs/jfs_acl.h linux-1/fs/jfs/jfs_acl.h --- linux-2.6.13-rc3/fs/jfs/jfs_acl.h 2005-07-13 10:03:40.0 -0500 +++ linux-1/fs/jfs/jfs_acl.h2005-07-14 08:45:01.0 -0500 @@ -21,8 +21,16 @@ #ifdef CONFIG_JFS_POSIX_ACL int jfs_permission(struct inode *, int, struct nameidata *); -int jfs_init_acl(struct inode *, struct inode *); +int jfs_init_acl(tid_t, struct inode *, struct inode *); int jfs_setattr(struct dentry *, struct iattr *); -#endif /* CONFIG_JFS_POSIX_ACL */ +#else + +static inline int jfs_init_acl(tid_t tid, struct inode *inode, + struct inode *dir) +{ + return 0; +} + +#endif #endif /* _H_JFS_ACL */ diff -urp linux-2.6.13-rc3/fs/jfs/jfs_xattr.h linux-1/fs/jfs/jfs_xattr.h --- linux-2.6.13-rc3/fs/jfs/jfs_xattr.h 2005-06-17 14:48:29.0 -0500 +++ linux-1/fs/jfs/jfs_xattr.h 2005-07-13 16:59:04.0 -0500 @@ -52,8 +52,8 @@ struct jfs_ea_list { #defineEND_EALIST(ealist) \ ((struct jfs_ea *) (((char *) (ealist)) + EALIST_SIZE(ealist))) -extern int __jfs_setxattr(struct inode *, const char *, const void *, size_t, - int); +extern int __jfs_setxattr(tid_t, struct inode *, const char *, const void *, + size_t, int); extern int jfs_setxattr(struct dentry *, const char *, const void *, size_t, int); extern ssize_t __jfs_getxattr(struct inode *, const char *, void *, size_t); diff -urp linux-2.6.13-rc3/fs/jfs/namei.c linux-1/fs/jfs/namei.c --- linux-2.6.13-rc3/fs/jfs/namei.c 2005-07-13 10:03:40.0 -0500 +++ linux-1/fs/jfs/namei.c 2005-07-14 09:41:21.0 -0500 @@ -39,6 +39,24 @@ struct dentry_operations jfs_ci_dentry_o static s64 commitZeroLink(tid_t, struct inode *); /* + * NAME: free_ea_wmap(inode) + * + * FUNCTION: free uncommitted extended attributes from working map + * + */ +static inline void free_ea_wmap(struct inode *inode) +{ + dxd_t *ea = JFS_IP(inode)-ea; + + if (ea-flag DXD_EXTENT) { +
[RFC][PATCH 2/2] JFS atomic inode security labeling
Implement jfs_init_security() diff -Nurp linux-1/fs/jfs/jfs_xattr.h linux/fs/jfs/jfs_xattr.h --- linux-1/fs/jfs/jfs_xattr.h 2005-07-13 16:59:04.0 -0500 +++ linux/fs/jfs/jfs_xattr.h2005-07-14 10:27:23.0 -0500 @@ -61,4 +61,14 @@ extern ssize_t jfs_getxattr(struct dentr extern ssize_t jfs_listxattr(struct dentry *, char *, size_t); extern int jfs_removexattr(struct dentry *, const char *); +#ifdef CONFIG_JFS_SECURITY +extern int jfs_init_security(tid_t, struct inode *, struct inode *); +#else +static inline int jfs_init_security(tid_t tid, struct inode *inode, + struct inode *dir) +{ + return 0; +} +#endif + #endif /* H_JFS_XATTR */ diff -Nurp linux-1/fs/jfs/namei.c linux/fs/jfs/namei.c --- linux-1/fs/jfs/namei.c 2005-07-14 09:41:21.0 -0500 +++ linux/fs/jfs/namei.c2005-07-14 10:18:21.0 -0500 @@ -111,6 +111,12 @@ static int jfs_create(struct inode *dip, if (rc) goto out3; + rc = jfs_init_security(tid, ip, dip); + if (rc) { + txAbort(tid, 0); + goto out3; + } + if ((rc = dtSearch(dip, dname, ino, btstack, JFS_CREATE))) { jfs_err(jfs_create: dtSearch returned %d, rc); txAbort(tid, 0); @@ -239,6 +245,12 @@ static int jfs_mkdir(struct inode *dip, if (rc) goto out3; + rc = jfs_init_security(tid, ip, dip); + if (rc) { + txAbort(tid, 0); + goto out3; + } + if ((rc = dtSearch(dip, dname, ino, btstack, JFS_CREATE))) { jfs_err(jfs_mkdir: dtSearch returned %d, rc); txAbort(tid, 0); @@ -906,6 +918,10 @@ static int jfs_symlink(struct inode *dip down(JFS_IP(dip)-commit_sem); down(JFS_IP(ip)-commit_sem); + rc = jfs_init_security(tid, ip, dip); + if (rc) + goto out3; + tblk = tid_to_tblock(tid); tblk-xflag |= COMMIT_CREATE; tblk-ino = ip-i_ino; @@ -1349,6 +1365,12 @@ static int jfs_mknod(struct inode *dir, if (rc) goto out3; + rc = jfs_init_security(tid, ip, dir); + if (rc) { + txAbort(tid, 0); + goto out3; + } + if ((rc = dtSearch(dir, dname, ino, btstack, JFS_CREATE))) { txAbort(tid, 0); goto out3; diff -Nurp linux-1/fs/jfs/xattr.c linux/fs/jfs/xattr.c --- linux-1/fs/jfs/xattr.c 2005-07-13 16:59:58.0 -0500 +++ linux/fs/jfs/xattr.c2005-07-14 10:54:11.0 -0500 @@ -21,6 +21,7 @@ #include linux/xattr.h #include linux/posix_acl_xattr.h #include linux/quotaops.h +#include linux/security.h #include jfs_incore.h #include jfs_superblock.h #include jfs_dmap.h @@ -1148,3 +1149,38 @@ int jfs_removexattr(struct dentry *dentr return rc; } + +#ifdef CONFIG_JFS_SECURITY +int jfs_init_security(tid_t tid, struct inode *inode, struct inode *dir) +{ + int rc; + size_t len; + void *value; + char *suffix; + char *name; + + rc = security_inode_init_security(inode, dir, suffix, value, len); + if (rc) { + if (rc == -EOPNOTSUPP) + return 0; + return rc; + } + name = kmalloc(XATTR_SECURITY_PREFIX_LEN + 1 + strlen(suffix), + GFP_NOFS); + if (!name) { + rc = -ENOMEM; + goto kmalloc_failed; + } + strcpy(name, XATTR_SECURITY_PREFIX); + strcpy(name + XATTR_SECURITY_PREFIX_LEN, suffix); + + rc = __jfs_setxattr(tid, inode, name, value, len, 0); + + kfree(name); +kmalloc_failed: + kfree(suffix); + kfree(value); + + return rc; +} +#endif - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: files of size larger than fs size
On Wed, 2005-03-16 at 20:52 -0800, Max wrote: Bryan, I'm not experienced in filesystems. But I've derived the testcase program from some buggy application that occasionally created huge files on my fs. That was not so easy to reproduce since not every sequence of seeks/reads/writes results in a huge file. But finally I got it 100% reproducible. So I would appriciate if somebody make any good from that testcase code. man lseek tells you to test the return code against (off_t)-1. I found that for larger values, your test program is returning -1, but unsigned it appears as 18446744073709551615. Different filesizes look strange to me. What I have so far: JFS:281474976706576 = 2**48 - 4080 XFS: 72057594037923856 = 2**56 - 4080 EXT3: 1099511627784 = 2**40 + 8 I can speak for jfs, and it supports file sizes up to 2**52. It uses 40 bits to address the logical block offset within a file with a 4K (2**12) size block. The individual file systems set sb-s_maxbytes to the appropriate value. Thanks, Max P.S. direct link to the testcase program: http://bugzilla.kernel.org/attachment.cgi?id=4729 -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: files of size larger than fs size
On Thu, 2005-03-17 at 14:51 -0600, Dave Kleikamp wrote: On Thu, 2005-03-17 at 12:06 -0800, Bryan Henderson wrote: I found that for larger values, your test program is returning -1, but unsigned it appears as 18446744073709551615. You mean you ran it? Then what about the more interesting question of what your filesize ends up to be? You say JFS allows files up to 2**52 bytes, so I expect the test case would succeed up through the write at 2**48 and leave the filesize 2**48 + 8. But Max reports seeing 2**48 - 4080. Yeah, this is funny. I missed that part of the story. I'll try to figure out what is going on here. The problem appears to be mixing calls to lseek64 with calls to fread and fwrite. fread fwrite don't appear to honor the position set by lseek64. I assume replacing fread fwrite with read write will fix it. I haven't tried it yet, but I'm about to. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: files of size larger than fs size
With this change, the file size on jfs becomes 2^48 + 4 as expected. --- jfs_bug.c.orig 2005-03-17 14:18:48.229634648 -0600 +++ jfs_bug.c 2005-03-17 15:32:45.952750104 -0600 @@ -13,12 +13,14 @@ int data = 0; struct flock fl; void read1() { -size_t rc = fread(data,sizeof(data),1,f); +/* size_t rc = fread(data,sizeof(data),1,f); */ +size_t rc = read(fn, data, sizeof(data)); printf(read() rc = %llu\n,rc); } void write1() { -size_t rc = fwrite(data,sizeof(data),1,f); +/* size_t rc = fwrite(data,sizeof(data),1,f); */ +size_t rc = write(fn, data, sizeof(data)); printf(write() rc = %llu\n,rc); } -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: files of size larger than fs size
On Thu, 2005-03-17 at 13:48 -0800, Max wrote: Dave, Shouldn't fread(data,sizeof(data),1,f) and read(fn, data, sizeof(data)) produce identical results? Is it a bug or what? fseek(f, offset, SEEK_SET); fread(data, sizeof(data, 1, f); should produce identical results to lseek(fn, offset, SEEK_SET); read(fn, data, sizeof(data); However lseek(fn, offset, SEEK_SET); fread(data, sizeof(data, 1, f); isn't valid. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jfs, special characters
On Sun, 2005-02-27 at 15:00 -0800, Frederik Eaton wrote: Thanks, you were right that I recently upgraded to 2.6 (actually I transferred the hard drive between machines), and with a mount /home -o remount,iocharset=utf8 at least I can access the files now. But the names aren't what they should be, for instance I see: 04-The_Dark_Of_The_Matine.ogg (that's a theta). Were they supposed to be correct? Not necessarily. Even if the character returned is correct utf8, the terminal (xterm?) displaying the name may be assuming a different character set. It's also possible that on the older system, CONFIG_NLS_DEFAULT was not set to the same locale that the system used, so the character stored in the name was not the right unicode, but it was accessed consistently on the old system so it worked as long as nothing changed. As an aside, now, I just tried mount /home -o remount,iocharset=iso8859-1, and got same problem This is no surprise. iso8859-1 has the same mapping as iocharset=none (the 2.6 default). (The none keyword is first supported in 2.6.11-rc4-mm1 and 2.4.30-pre2, so you probably won't be able to use it.) ls: '/home/shared/frederik/Franz_Ferdinand/Franz_Ferdinand/04-The_Dark_Of_The_Matin?e.ogg': No such file or directory But when I repeated the mount /home -o remount,iocharset=utf8 command, the listing didn't show the name with a theta character, but instead with two question marks: 04-The_Dark_Of_The_Matin??e.ogg Strange behavior. I.e., upon changing the mount options and then changing them back I get a different listing. That I wouldn't expect. I really have no clue why it isn't consistent. The easiest way to fix these files (assuming there aren't too many) is to mount with iocharset=utf8, rename the files using only ascii characters, remount again with iocharset=iso8859-1, and rename the files to their proper names. After that you should be able to mount without any iocharset flag and get sane behavior. Frederik -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jfs, special characters
On Fri, 2005-02-25 at 20:34 -0800, Frederik Eaton wrote: Does JFS have problems with files with special characters in them? Sometimes. jfs attempts to store pathnames in 16-bit unicode, and uses the mount option iocharset to determine what to convert to/from. In the 2.4 kernel, the default value of iocharset is determined by the kernel config option, CONFIG_NLS_DEFAULT. If an existing pathname exists that doesn't map to the iocharset, jfs has the problem you describe. Mounting with -o iocharset=utf8 should let you access any existing files. In the 2.6 kernel, the default was changed to store each byte of the filename as a 16-bit value in the directory without using any conversion. There can still be a problem if an existing pathname has a value with a non-zero high-order byte. Again mounting with -o iocharset=utf8 will let you access all files. $ ls ls: 04-The_Dark_Of_The_Matin?e.ogg: No such file or directory [1]$ rm 04-The_Dark_Of_The_Matin\?e.ogg rm: cannot lstat `04-The_Dark_Of_The_Matin?e.ogg': No such file or directory Frederik -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SV: Ext3 destroying ownerships and permissions
This is a WAG, but: 1. Did you build ext3 as a module? 2. If so, are you running on a kernel that was built before applying the ext3 patch and configuring? If the answers are yes, the inode cache may be too small to contain the ext3 in-memory inode, and the ext3 code is overwriting the inode by a few bytes. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: 64-bit block sizes on 32-bit systems
My turn to chime in. JFS was designed around a 4K meta-data page size. It would require some major re-design to use larger block sizes. On the other hand, JFS could take advantage of 64-bit block addresses immediately. JFS internally store the block address in 40 bits. (Sorry, file size volume size are both limited to 4 peta-bytes on JFS.) At the rate that storage hardware and requirements are increasing, increasing the block size is a short-term solution that is only going to delay the inevitable requirement for 64-bit block addressability. There is a practical limit to a usable block-size. Someone threw out 64K, which seems reasonable to me. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFC] sane access to per-fs metadata (was Re:[PATCH]Documentation/ioctl-number.txt)
Alexander Viro wrote: On Fri, 23 Mar 2001, Dave Kleikamp wrote: Are you okay with JFS using a few ioctl's in the utilities and continuing the discussion of whether more general-purpose metadata access should occur in procfs or under your proposal? OK, one immediate problem with ioctls on directories: unions. Think what happens when you union-mount JFS somewhere. And mountpoint also is on JFS. Where should ioctl() go? AFAIC, I wouldn't object to calling this a limitation and letting the ioctl fail. In fact, the utilities could detect a union-mount and fail immediately. Although, someone, someday, will ask for it to work, won't they? Other that that (and general ugliness of ioctls) - no problems. I really think that trick I've described would be cleaner, but that's a separate story. It's not like we had a moratorium on new ioctls, after all and JFS wouldn't be the first fs to do something like that. I don't think that it's a good idea, but the worst thing that can happen is inconvenience for union-mount setups. When union-mount gets to testable stage, that is - it's not like it was going into the tree before 2.5, anyway. procfs use for per-filesystem stuff is a different issue - that is just asking for bad breakage. I won't argue this point without learning more about it. So far I've only played with procfs to peek at some global data. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFC] sane access to per-fs metadata (was Re: [PATCH]Documentation/ioctl-number.txt)
Al, I didn't know that creating file system ioctl's was such a hot topic. Since the functions I want to implement are for a very specific purpose (I don't expect anything except the JFS utilities to invoke them), I would expect an ioctl to be an appropriate vehicle. If not ioctl's, why not procfs? Your proposal is that each filesystem implements its own mini-procfs. What's the advantage of that? My intentions so far are to use ioctl's for specific operations required by the JFS utilities, and procfs for debugging output, tuning, and anything else that make sense. Alexander Viro wrote: That may look like an overkill, however * You can get rid of any need to register ioctls, etc. This is a one-time thing * You can add debugging/whatever at any moment with no need to update any utilities - everything is available from plain shell We can do this with procfs right now. * You can conveniently view whatever metadata you want - no need to shove everything into ioctls on one object. Again, why re-invent procfs? We could put this under /proc/fs/jfs/metadata. * You can use normal permissions control - just set appropriate permission bits for objects on jfsmeta IOW, you can get normal filesystem view (meaning that you have all usual UNIX toolkit available) for per-fs control stuff. And keep the ability to do proper locking - it's the same driver that handles the main fs and you have access to superblock. No need to change the API - everything is already there... I'm not sure how a utility would make atomic changes to several pieces of metadata. The underlying fs code would protect the integrity of every metadata "file", but changes to more than one of these "files" would not be done as a group without some additional locking that would have to be coordinated between the utility and the fs. This kind of thing could be handled by writing some special command to a "command-processor" type file, but why is this better than an ioctl? -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: (struct dentry *)-vfsmnt;
AIX stores all of this information in the LVM, not in the filesystem. The filesystem itself has nothing to do with importing and exporting volume groups. Having the information stored as part of LVM's metadata allows the utilities to only deal with LVM instead of every individual file system. Andreas Dilger wrote: Al writes: On Tue, 13 Mar 2001, Andreas Dilger wrote: On AIX, it is possible to import a volume group, and it automatically builds /etc/fstab entries from information stored in the fs. Having the "last mounted on" would have the mount point info, and of course LVM would hold the device names. Wait a minute. What happens if you bring /home from one box to another, that already has /home? Corrupted /etc/fstab? For the same reason that the UUID and LABEL are stored in the superblock: you want this infomation kept with the filesystem and not anywhere else, otherwise it will quickly get out-of-date. Wherever you mounted the filesystem last is where it would be mounted if you import the VG on another system. You can obviously edit /etc/fstab afterwards if it is wrong, and then remount the filesystem(s), and this will store the correct mountpoint into the filesystem for the next vgimport. -- David J. Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]