Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Dave Kleikamp

On Mon, 2008-01-28 at 12:33 -0600, Steve French wrote:
 On Jan 28, 2008 2:17 AM, Andi Kleen [EMAIL PROTECTED] wrote:
   I completely agree.  If one thread writes A and another writes B then the
   kernel should record either A or B, not ((A  0x) | (B 
   0x))
 
  The problem is pretty nasty unfortunately. To solve it properly I think
  the file_operations-read/write prototypes would need to be changed
  because otherwise it is not possible to do atomic relative updates
  of f_pos. Right now the actual update is burrowed deeply in the low level
  read/write implementation. But that would be a huge impact all over
  the tree :/
 
 If there were a wrapper around reads and writes of f_pos as there is
 for i_size e.g. it would hit a lot of code, but not as many as I had
 originally thought.  the most important ones are in the vfs itself, where
 there are only 59 uses of the field (not all need to be changed).   ext3
 has fewer (25), and cifs only 12 uses.

Most of the uses in ext3 and cifs deal with a directory's f_pos in
readdir, which is protected by i_mutex, so I don't think we need to
worry about them at all.
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [9/18] BKL-removal: Use unlocked_ioctl for jfs

2008-01-27 Thread Dave Kleikamp
On Sun, 2008-01-27 at 03:17 +0100, Andi Kleen wrote:
 Convert jfs_ioctl over to not use the BKL. The only potential race
 I could see was with two ioctls in parallel changing the flags
 and losing the updates. Use the i_mutex to protect against this.
 
 Cc: [EMAIL PROTECTED]
 
 Signed-off-by: Andi Kleen [EMAIL PROTECTED]

Added to the jfs git tree.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [10/18] BKL-removal: Implement a compat_ioctl handler for JFS

2008-01-27 Thread Dave Kleikamp
On Sun, 2008-01-27 at 03:17 +0100, Andi Kleen wrote:
 The ioctls were already compatible except for the actual values so this
 was fairly easy to do.
 
 Cc: [EMAIL PROTECTED]
 
 Signed-off-by: Andi Kleen [EMAIL PROTECTED]

Added to the jfs git tree.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 19/26] mount options: fix jfs

2008-01-24 Thread Dave Kleikamp
On Thu, 2008-01-24 at 20:34 +0100, Miklos Szeredi wrote:
 plain text document attachment (jfs_opts.patch)
 From: Miklos Szeredi [EMAIL PROTECTED]
 
 Add iocharset= and errors= options to /proc/mounts for jfs
 filesystems.
 
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Acked-by: Dave Kleikamp [EMAIL PROTECTED]

Andrew,
Would you like me to add this to the jfs git tree, or would you like to
handle these patches as a set?

Thanks,
Shaggy

 ---
 
 Index: linux/fs/jfs/super.c
 ===
 --- linux.orig/fs/jfs/super.c 2008-01-17 19:00:55.0 +0100
 +++ linux/fs/jfs/super.c  2008-01-21 19:39:30.0 +0100
 @@ -602,6 +602,12 @@ static int jfs_show_options(struct seq_f
   seq_printf(seq, ,umask=%03o, sbi-umask);
   if (sbi-flag  JFS_NOINTEGRITY)
   seq_puts(seq, ,nointegrity);
 + if (sbi-nls_tab)
 + seq_printf(seq, ,iocharset=%s, sbi-nls_tab-charset);
 + if (sbi-flag  JFS_ERR_CONTINUE)
 + seq_printf(seq, ,errors=continue);
 + if (sbi-flag  JFS_ERR_PANIC)
 + seq_printf(seq, ,errors=panic);
  
  #ifdef CONFIG_QUOTA
   if (sbi-flag  JFS_USRQUOTA)
 
 --
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 19/26] mount options: fix jfs

2008-01-24 Thread Dave Kleikamp

On Thu, 2008-01-24 at 13:57 -0800, Andrew Morton wrote:

 My usual algorithm here is to
 
 1: queue all the patches and send the ones which have a maintainer to
that maintainer until he merges it.
 
 2: If the patches have a dependency upon (say) a VFS patch then I'll
merge the VFS patch and will then goto 1.
 
 I don't think this particular patch has a VFS depencency so sure, merge
 away.  You'll probably see that I merged it anyway, but I'll drop it again
 when I see it turn up in your tree (I used to resync with the git trees
 at least daily, but I now do this far less frequently because it is such
 torture because everyone is paddling in everyone else's puddle).

Merged.  Thanks.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC:PATCH 09/09] VM tail statistics support

2007-11-08 Thread Dave Kleikamp
From: Luiz Fernando N. Capitulino [EMAIL PROTECTED]

[PATCH]: VM tail statistics support

This patch is a hack which introduces initial statistics support
for the VM tail functionality.

It uses debugfs and does accouting of:

1. Number of times vm_file_tail_pack() have been called
2. Number of times vm_file_tail_unpack() have been called
3. Total size of file tails allocations
4. Number of file tail allocations
5. Bytes saved

Signed-off-by: Luiz Fernando N. Capitulino [EMAIL PROTECTED]
Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]

---

 mm/file_tail.c |  127 -
 1 file changed, 126 insertions(+), 1 deletion(-)

diff -Nurp linux008/mm/file_tail.c linux009/mm/file_tail.c
--- linux008/mm/file_tail.c 2007-11-08 10:49:46.0 -0600
+++ linux009/mm/file_tail.c 2007-11-08 10:49:46.0 -0600
@@ -12,10 +12,51 @@
 
 #include linux/buffer_head.h
 #include linux/fs.h
+#include linux/init.h
+#include linux/debugfs.h
 #include linux/hardirq.h
 #include linux/module.h
+#include linux/spinlock.h
 #include linux/vm_file_tail.h
 
+struct {
+   struct dentry *root_dir;
+   struct dentry *nr_tails;
+   struct dentry *tail_size;
+   struct dentry *saved_bytes;
+   struct dentry *pack_called;
+   struct dentry *unpack_called;
+   struct dentry *read_called;
+} vm_tail_debugfs;
+
+struct {
+   u32 nr_tails;
+   u32 tail_size;
+   u32 saved_bytes;
+   u32 pack_called;
+   u32 unpack_called;
+   u32 read_called;
+   spinlock_t lock;
+} vm_tail_stats = { .lock = __SPIN_LOCK_UNLOCKED(lock) };
+
+static void vm_file_tail_stats_inc(int length)
+{
+   spin_lock(vm_tail_stats.lock);
+   vm_tail_stats.nr_tails++;
+   vm_tail_stats.tail_size += length;
+   vm_tail_stats.saved_bytes += (PAGE_SIZE - length);
+   spin_unlock(vm_tail_stats.lock);
+}
+
+static void vm_file_tail_stats_dec(int length)
+{
+   spin_lock(vm_tail_stats.lock);
+   vm_tail_stats.nr_tails--;
+   vm_tail_stats.tail_size -= length;
+   vm_tail_stats.saved_bytes -= (PAGE_SIZE - length);
+   spin_unlock(vm_tail_stats.lock);
+}
+
 /*
  * Free the file tail
  *
@@ -29,7 +70,10 @@ void __vm_file_tail_free(struct address_
 
spin_lock_irqsave(mapping-tail_lock, flags);
tail = mapping-tail;
-   mapping-tail = NULL;
+   if (tail) {
+   vm_file_tail_stats_dec(vm_file_tail_length(mapping));
+   mapping-tail = NULL;
+   }
spin_unlock_irqrestore(mapping-tail_lock, flags);
kfree(tail);
 }
@@ -49,6 +93,8 @@ void vm_file_tail_unpack(struct address_
struct page *page;
void *tail;
 
+   vm_tail_stats.unpack_called++;
+
if (!mapping-tail)
return;
 
@@ -85,6 +131,7 @@ void vm_file_tail_unpack(struct address_
add_to_page_cache_lru(page, mapping, index, gfp_mask);
unlock_page(page);
page_cache_release(page);
+   vm_file_tail_stats_dec(length);
} else
/* Free the tail */
__vm_file_tail_free(mapping);
@@ -120,6 +167,8 @@ int vm_file_tail_pack(struct page *page)
struct address_space *mapping;
void *tail;
 
+   vm_tail_stats.pack_called++;
+
if (TestSetPageLocked(page))
return 0;
 
@@ -163,12 +212,86 @@ int vm_file_tail_pack(struct page *page)
remove_from_page_cache(page);
page_cache_release(page);   /* pagecache ref */
ret = 1;
+   vm_file_tail_stats_inc(length);
 
 out:
unlock_page(page);
return ret;
 }
 
+static int __init create_debugfs_file(const char *name, struct dentry **dir,
+ u32 *var)
+{
+   *dir = debugfs_create_u32(name, S_IFREG|S_IRUGO,
+ vm_tail_debugfs.root_dir, var);
+   if (!*dir) {
+   printk(KERN_ERR ERROR: vm_tail: could not create %s\n, name);
+   return -ENOMEM;
+   }
+   return 0;
+}
+
+static int __init vm_file_tail_init(void)
+{
+   int err;
+
+   vm_tail_debugfs.root_dir = debugfs_create_dir(vm_tail, NULL);
+   if (!vm_tail_debugfs.root_dir) {
+   printk(KERN_ERR ERROR: %s Could not create root directory\n,
+  __FUNCTION__);
+   return -ENOMEM;
+   }
+
+   err = create_debugfs_file(nr_tails, vm_tail_debugfs.nr_tails,
+ vm_tail_stats.nr_tails);
+   if (err)
+   goto out_err;
+
+   err = create_debugfs_file(tail_size, vm_tail_debugfs.tail_size,
+ vm_tail_stats.tail_size);
+   if (err)
+   goto out_err1;
+
+   err = create_debugfs_file(saved_bytes, vm_tail_debugfs.saved_bytes,
+ vm_tail_stats.saved_bytes);
+   if (err)
+   goto out_err2;
+
+   err = create_debugfs_file

[RFC:PATCH 08/09] generic_file_aio_read can read directly from the tail. No need to unpack

2007-11-08 Thread Dave Kleikamp
generic_file_aio_read can read directly from the tail.  No need to unpack

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]

---

 include/linux/vm_file_tail.h |   13 ++
 mm/file_tail.c   |   54 +++
 mm/filemap.c |4 ++-
 3 files changed, 70 insertions(+), 1 deletion(-)

diff -Nurp linux007/include/linux/vm_file_tail.h 
linux008/include/linux/vm_file_tail.h
--- linux007/include/linux/vm_file_tail.h   2007-11-08 10:49:46.0 
-0600
+++ linux008/include/linux/vm_file_tail.h   2007-11-08 10:49:46.0 
-0600
@@ -53,6 +53,18 @@ static inline void vm_file_tail_unpack_i
vm_file_tail_unpack(mapping);
 }
 
+extern int __vm_file_tail_read(struct file *, loff_t *, read_descriptor_t *);
+
+static inline int vm_file_tail_read(struct file *filp, loff_t *ppos,
+   read_descriptor_t *desc)
+{
+   struct address_space *mapping = filp-f_mapping;
+   unsigned long index = *ppos  PAGE_CACHE_SHIFT;
+
+   if (mapping-tail  index == vm_file_tail_index(mapping))
+   return __vm_file_tail_read(filp, ppos, desc);
+   return 0;
+}
 #else /* !CONFIG_VM_FILE_TAILS */
 
 #define vm_file_tail_packed(mapping) 0
@@ -60,6 +72,7 @@ static inline void vm_file_tail_unpack_i
 #define vm_file_tail_pack(page) 0
 #define vm_file_tail_unpack(mapping) do {} while (0)
 #define vm_file_tail_unpack_index(mapping, index) do {} while (0)
+#define vm_file_tail_read(filp, ppos, desc) 0
 
 #endif /* CONFIG_VM_FILE_TAILS */
 
diff -Nurp linux007/mm/file_tail.c linux008/mm/file_tail.c
--- linux007/mm/file_tail.c 2007-11-08 10:49:46.0 -0600
+++ linux008/mm/file_tail.c 2007-11-08 10:49:46.0 -0600
@@ -178,3 +178,57 @@ void __vm_file_tail_unpack_on_resize(str
vm_file_tail_free(inode-i_mapping);
 }
 EXPORT_SYMBOL(__vm_file_tail_unpack_on_resize);
+
+/*
+ * Copy tail data to user buffer
+ *
+ * Returns 1 on success
+ */
+int __vm_file_tail_read(struct file *filp, loff_t *ppos,
+   read_descriptor_t *desc)
+{
+   unsigned long count = desc-count;
+   unsigned int flags;
+   unsigned long left;
+   struct address_space *mapping = filp-f_mapping;
+   unsigned long offset;
+   unsigned long index = *ppos  PAGE_CACHE_SHIFT;
+   unsigned long size;
+
+   if (fault_in_pages_writeable(desc-arg.buf, count))
+   /*
+* Keep this simple since this path is an optimization.  Let
+* the tricky stuff get handled in the fallback path.
+*/
+   return 0;
+
+   spin_lock_irqsave(mapping-tail_lock, flags);
+
+   offset = *ppos  ~PAGE_CACHE_MASK;
+   if (!mapping-tail || index != vm_file_tail_index(mapping) ||
+   offset = vm_file_tail_length(mapping)) {
+   spin_unlock_irqrestore(mapping-tail_lock, flags);
+   return 0;
+   }
+
+   size = vm_file_tail_length(mapping) - offset;
+   if (size  count)
+   size = count;
+
+   left = __copy_to_user_inatomic(desc-arg.buf,
+  (char *)mapping-tail + offset, size);
+
+   spin_unlock_irqrestore(mapping-tail_lock, flags);
+
+   if (left) {
+   size -= left;
+   desc-error = -EFAULT;
+   }
+   desc-count = count - size;
+   desc-written += size;
+   desc-arg.buf += size;
+   *ppos += size;
+   file_accessed(filp);
+
+   return 1;
+}
diff -Nurp linux007/mm/filemap.c linux008/mm/filemap.c
--- linux007/mm/filemap.c   2007-11-08 10:49:46.0 -0600
+++ linux008/mm/filemap.c   2007-11-08 10:49:46.0 -0600
@@ -1195,7 +1195,9 @@ generic_file_aio_read(struct kiocb *iocb
if (desc.count == 0)
continue;
desc.error = 0;
-   do_generic_file_read(filp,ppos,desc,file_read_actor);
+   if (!vm_file_tail_read(filp, ppos, desc))
+   do_generic_file_read(filp, ppos, desc,
+file_read_actor);
retval += desc.written;
if (desc.error) {
retval = retval ?: desc.error;
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC:PATCH 05/09] find_get_page() and find_lock_page() need to unpack the tail

2007-11-08 Thread Dave Kleikamp
find_get_page() and find_lock_page() need to unpack the tail

If the page being sought corresponds to the tail, and the tail is packed
in the inode, the tail must be unpacked.

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
---

 mm/filemap.c |3 +++
 1 file changed, 3 insertions(+)

diff -Nurp linux004/mm/filemap.c linux005/mm/filemap.c
--- linux004/mm/filemap.c   2007-11-07 08:14:01.0 -0600
+++ linux005/mm/filemap.c   2007-11-08 10:49:46.0 -0600
@@ -24,6 +24,7 @@
 #include linux/file.h
 #include linux/uio.h
 #include linux/hash.h
+#include linux/vm_file_tail.h
 #include linux/writeback.h
 #include linux/backing-dev.h
 #include linux/pagevec.h
@@ -600,6 +601,7 @@ struct page * find_get_page(struct addre
 {
struct page *page;
 
+   vm_file_tail_unpack_index(mapping, offset);
read_lock_irq(mapping-tree_lock);
page = radix_tree_lookup(mapping-page_tree, offset);
if (page)
@@ -624,6 +626,7 @@ struct page *find_lock_page(struct addre
 {
struct page *page;
 
+   vm_file_tail_unpack_index(mapping, offset);
 repeat:
read_lock_irq(mapping-tree_lock);
page = radix_tree_lookup(mapping-page_tree, offset);
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC:PATCH 01/09] Add tail to address space

2007-11-08 Thread Dave Kleikamp
Add tail to address space

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]

---

 fs/inode.c |3 +++
 include/linux/fs.h |4 
 mm/Kconfig |9 +
 3 files changed, 16 insertions(+)

diff -Nurp linux000/fs/inode.c linux001/fs/inode.c
--- linux000/fs/inode.c 2007-11-07 08:13:54.0 -0600
+++ linux001/fs/inode.c 2007-11-08 10:49:46.0 -0600
@@ -213,6 +213,9 @@ void inode_init_once(struct inode *inode
spin_lock_init(inode-i_data.i_mmap_lock);
INIT_LIST_HEAD(inode-i_data.private_list);
spin_lock_init(inode-i_data.private_lock);
+#ifdef CONFIG_VM_FILE_TAILS
+   spin_lock_init(inode-i_data.tail_lock);
+#endif
INIT_RAW_PRIO_TREE_ROOT(inode-i_data.i_mmap);
INIT_LIST_HEAD(inode-i_data.i_mmap_nonlinear);
i_size_ordered_init(inode);
diff -Nurp linux000/include/linux/fs.h linux001/include/linux/fs.h
--- linux000/include/linux/fs.h 2007-11-07 08:13:59.0 -0600
+++ linux001/include/linux/fs.h 2007-11-08 10:49:46.0 -0600
@@ -511,6 +511,10 @@ struct address_space {
spinlock_t  private_lock;   /* for use by the address_space 
*/
struct list_headprivate_list;   /* ditto */
struct address_space*assoc_mapping; /* ditto */
+#ifdef CONFIG_VM_FILE_TAILS
+   void*tail;  /* file tail */
+   spinlock_t  tail_lock;  /* protect tail */
+#endif
 } __attribute__((aligned(sizeof(long;
/*
 * On most architectures that alignment is already the case; but
diff -Nurp linux000/mm/Kconfig linux001/mm/Kconfig
--- linux000/mm/Kconfig 2007-11-07 08:14:01.0 -0600
+++ linux001/mm/Kconfig 2007-11-08 10:49:46.0 -0600
@@ -194,3 +194,12 @@ config NR_QUICK
 config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config VM_FILE_TAILS
+   bool Store file tails in slab cache
+   def_bool n
+   help
+ If the data at the end of a file, or the entire file, is small,
+ the kernel will attempt to store that data in the slab cache,
+ rather than allocate an entire page in the page cache.
+ If unsure, say N here.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC:PATCH 03/09] Release tail when inode is freed

2007-11-08 Thread Dave Kleikamp
Release tail when inode is freed

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
---

 fs/inode.c |2 ++
 1 file changed, 2 insertions(+)

diff -Nurp linux002/fs/inode.c linux003/fs/inode.c
--- linux002/fs/inode.c 2007-11-08 10:49:46.0 -0600
+++ linux003/fs/inode.c 2007-11-08 10:49:46.0 -0600
@@ -10,6 +10,7 @@
 #include linux/init.h
 #include linux/quotaops.h
 #include linux/slab.h
+#include linux/vm_file_tail.h
 #include linux/writeback.h
 #include linux/module.h
 #include linux/backing-dev.h
@@ -260,6 +261,7 @@ void __iget(struct inode * inode)
 void clear_inode(struct inode *inode)
 {
might_sleep();
+   vm_file_tail_free(inode-i_mapping);
invalidate_inode_buffers(inode);

BUG_ON(inode-i_data.nrpages);
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC:PATCH 04/09] Unpack or remove file tail when inode is resized

2007-11-08 Thread Dave Kleikamp
Unpack or remove file tail when inode is resized

If the inode size grows, we need to unpack the tail into a page.
If the inode shrinks, such that the entire tail is beyond the end of the
file, discard the tail.  If the file shrinks, but part of the tail is still
valid, just leave it.

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
---

 include/linux/fs.h |   14 ++
 mm/file_tail.c |   11 +++
 2 files changed, 25 insertions(+)

diff -Nurp linux003/include/linux/fs.h linux004/include/linux/fs.h
--- linux003/include/linux/fs.h 2007-11-08 10:49:46.0 -0600
+++ linux004/include/linux/fs.h 2007-11-08 10:49:46.0 -0600
@@ -715,6 +715,19 @@ static inline loff_t i_size_read(const s
 #endif
 }
 
+#ifdef CONFIG_VM_FILE_TAILS
+void __vm_file_tail_unpack_on_resize(struct inode *, loff_t);
+
+static inline void vm_file_tail_unpack_on_resize(struct inode *inode,
+loff_t size)
+{
+   if (inode-i_mapping  inode-i_mapping-tail)
+   __vm_file_tail_unpack_on_resize(inode, size);
+}
+#else
+#define vm_file_tail_unpack_on_resize(mapping, new_size) do {} while (0)
+#endif
+
 /*
  * NOTE: unlike i_size_read(), i_size_write() does need locking around it
  * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount
@@ -722,6 +735,7 @@ static inline loff_t i_size_read(const s
  */
 static inline void i_size_write(struct inode *inode, loff_t i_size)
 {
+   vm_file_tail_unpack_on_resize(inode, i_size);
 #if BITS_PER_LONG==32  defined(CONFIG_SMP)
write_seqcount_begin(inode-i_size_seqcount);
inode-i_size = i_size;
diff -Nurp linux003/mm/file_tail.c linux004/mm/file_tail.c
--- linux003/mm/file_tail.c 2007-11-08 10:49:46.0 -0600
+++ linux004/mm/file_tail.c 2007-11-08 10:49:46.0 -0600
@@ -13,6 +13,7 @@
 #include linux/buffer_head.h
 #include linux/fs.h
 #include linux/hardirq.h
+#include linux/module.h
 #include linux/vm_file_tail.h
 
 /*
@@ -167,3 +168,13 @@ out:
unlock_page(page);
return ret;
 }
+
+void __vm_file_tail_unpack_on_resize(struct inode *inode, loff_t new_size)
+{
+   loff_t old_size = i_size_read(inode);
+   if (new_size  old_size)
+   vm_file_tail_unpack(inode-i_mapping);
+   else if (new_size  PAGE_CACHE_SHIFT != old_size  PAGE_CACHE_SHIFT)
+   vm_file_tail_free(inode-i_mapping);
+}
+EXPORT_SYMBOL(__vm_file_tail_unpack_on_resize);
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC:PATCH 06/09] For readahead, leave data in tail

2007-11-08 Thread Dave Kleikamp
For readahead, leave data in tail

Don't unpack it until it's actually read.

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
---

 mm/readahead.c |5 +
 1 file changed, 5 insertions(+)

diff -Nurp linux005/mm/readahead.c linux006/mm/readahead.c
--- linux005/mm/readahead.c 2007-11-07 08:14:01.0 -0600
+++ linux006/mm/readahead.c 2007-11-08 10:49:46.0 -0600
@@ -16,6 +16,7 @@
 #include linux/task_io_accounting_ops.h
 #include linux/pagevec.h
 #include linux/pagemap.h
+#include linux/vm_file_tail.h
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -147,6 +148,10 @@ __do_page_cache_readahead(struct address
if (page_offset  end_index)
break;
 
+   if ((page_offset == end_index)  vm_file_tail_packed(mapping))
+   /* Tail page is already packed */
+   break;
+
rcu_read_lock();
page = radix_tree_lookup(mapping-page_tree, page_offset);
rcu_read_unlock();
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC:PATCH 07/09] shrink_active_list: pack file tails rather than move to inactive list

2007-11-08 Thread Dave Kleikamp
shrink_active_list: pack file tails rather than move to inactive list

The big question is how aggressively we pack the tails.  This looked like
an easy place to start.  If a page is being moved from the active list to
the inactive list, and the tail can be safely packed, that is not mapped,
not dirty, etc., the tail is packed and the page removed from the page
cache.

Right now, pages that never get off the inactive list will not be packed.

I will be soliciting ideas for other places in the code where tails can
be packed.  One of my goals is not to be too aggressive, where tails are
packed and unpacked repeatedly.  I also don't want to add too much overhead,
such as an extra scan of the inactive list.

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
---

 mm/vmscan.c |6 ++
 1 file changed, 6 insertions(+)

diff -Nurp linux006/mm/vmscan.c linux007/mm/vmscan.c
--- linux006/mm/vmscan.c2007-11-07 08:14:01.0 -0600
+++ linux007/mm/vmscan.c2007-11-08 10:49:46.0 -0600
@@ -19,6 +19,7 @@
 #include linux/pagemap.h
 #include linux/init.h
 #include linux/highmem.h
+#include linux/vm_file_tail.h
 #include linux/vmstat.h
 #include linux/file.h
 #include linux/writeback.h
@@ -1035,7 +1036,12 @@ force_reclaim_mapped:
list_add(page-lru, l_active);
continue;
}
+   } else if (vm_file_tail_pack(page)) {
+   ClearPageActive(page);
+   page_cache_release(page);
+   continue;
}
+
list_add(page-lru, l_inactive);
}
 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()

2007-10-02 Thread Dave Kleikamp
On Tue, 2007-10-02 at 13:32 +0100, David Howells wrote:
 Zach Brown [EMAIL PROTECTED] wrote:
 
  /* haha, continuing the fine tradition of terrible names in this  api..  */
  static inline void *PTR_PTR(void *err_ptr) {
  BUG_ON(!IS_ERR(err_ptr) || !err_ptr);
  return err_ptr;
  }
 
 How about ERR_CAST() instead?  Or maybe CAST_ERR()?

It's a better name than PTR_PTR(). :-)
 
   struct dentry *blah(...)
   {
   struct inode *inode;
 
   inode = thing(...);
   if (IS_ERR(inode))
   return ERR_CAST(inode);
   }
 
 Where ERR_CAST is defined as:
 
   static inline void *ERR_CAST(const void *error)
   {
   return (void *) x;
   }

Of course, the cast is unnecessary, and I'm sure you meant to return
error:

static inline void *ERR_CAST(const void *error)
{
return error;
}

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()

2007-10-02 Thread Dave Kleikamp
On Tue, 2007-10-02 at 14:24 +0100, David Howells wrote:
 Dave Kleikamp [EMAIL PROTECTED] wrote:
 
  Of course, the cast is unnecessary,
 
 The cast is necessary as the argument is a const pointer and the return type
 is not.

Ah yes.  I stand corrected.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 20/30] IGET: Stop JFS from using iget() and read_inode()

2007-10-01 Thread Dave Kleikamp
On Mon, 2007-10-01 at 14:11 +0100, David Howells wrote:
 Stop the JFS filesystem from using iget() and read_inode().  Replace
 jfs_read_inode() with jfs_iget(), and call that instead of iget().
 jfs_iget() then uses iget_locked() directly and returns a proper error code
 instead of an inode in the event of an error.
 
 jfs_fill_super() returns any error incurred when getting the root inode
 instead of EINVAL.
 
 Signed-off-by: David Howells [EMAIL PROTECTED]

Acked-by: Dave Kleikamp [EMAIL PROTECTED]

 ---
 
  fs/jfs/inode.c |   20 
  fs/jfs/jfs_inode.h |2 +-
  fs/jfs/namei.c |   34 ++
  fs/jfs/super.c |   15 +--
  4 files changed, 40 insertions(+), 31 deletions(-)

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Jfs-discussion] [2/4] 2.6.23-rc6: known regressions

2007-09-19 Thread Dave Kleikamp
On Wed, 2007-09-19 at 13:44 +0200, Oliver Neukum wrote:
 Am Donnerstag 13 September 2007 schrieb Dave Kleikamp:
  On Wed, 2007-09-12 at 18:58 +0200, Michal Piotrowski wrote:
  
   Subject : umount triggers a warning in jfs and takes almost a 
   minute
   References  : http://lkml.org/lkml/2007/9/4/73
   Last known good : ?
   Submitter   : Oliver Neukum [EMAIL PROTECTED]
   Caused-By   : ?
   Handled-By  : ?
   Status  : unknown
  
  I'm still waiting to hear from Oliver whether or not this is actually a
  regression.
 
 OK, I've done test. On 2.6.22 I was unable to trigger the warning.
 On 2.6.23-rc6 I get the following warning in about 3/4 of all unmounts:
 
 Sep 19 13:08:04 oenone kernel: WARNING: at fs/jfs/jfs_logmgr.c:1643 
 jfs_flush_journal()
 Sep 19 13:08:04 oenone kernel:
 Sep 19 13:08:04 oenone kernel: Call Trace:
 Sep 19 13:08:04 oenone kernel:  [8860f962] 
 :jfs:jfs_flush_journal+0x26a/0x27d
 Sep 19 13:08:04 oenone kernel:  [8029fca1] dispose_list+0xde/0xf7
 Sep 19 13:08:04 oenone kernel:  [885fd003] :jfs:jfs_umount+0x30/0xe5
 Sep 19 13:08:04 oenone kernel:  [885f9a31] 
 :jfs:jfs_put_super+0xd/0x5e
 Sep 19 13:08:04 oenone kernel:  [8028ef8f] 
 generic_shutdown_super+0x60/0xf0
 Sep 19 13:08:04 oenone kernel:  [8028f02c] kill_block_super+0xd/0x1e
 Sep 19 13:08:04 oenone kernel:  [8028f0f7] 
 deactivate_super+0x6a/0x82
 Sep 19 13:08:04 oenone kernel:  [802a2184] sys_umount+0x249/0x25a
 Sep 19 13:08:04 oenone kernel:  [802616ff] 
 audit_syscall_entry+0x141/0x174
 Sep 19 13:08:04 oenone kernel:  [8020be8c] tracesys+0xdc/0xe1
 Sep 19 13:08:04 oenone kernel:
 
 This is on a partition of a usb mass storage device with 2K sector size.
 IMHO it is a regression. I am recompiling with CONFIG_JFS_DEBUG.

Looks like a regression then.  I'll take a look at the debug output and get 
back to you.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Dave Kleikamp
On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:

 Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
 cases except one handles memory allocation failure so I get rid of those
 GFP_NOFAIL flags.
 
 Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
 in jbd/jbd2? I will send a separate patch to cleanup that.

No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
recursive calls back into the file system that could end up blocking on
jbd code.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Dave Kleikamp
On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
 On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
 
  Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
  cases except one handles memory allocation failure so I get rid of those
  GFP_NOFAIL flags.
  
  Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
  in jbd/jbd2? I will send a separate patch to cleanup that.
 
 No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
 recursive calls back into the file system that could end up blocking on
 jbd code.

Oh, I see your patch now.  You mean use GFP_NOFS instead of
GFP_KERNEL.  :-)  OK then.

 Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2/3] 2.6.23-rc6: known regressions v2

2007-09-18 Thread Dave Kleikamp
On Tue, 2007-09-18 at 16:24 +0200, Jan Kara wrote:

  Subject : umount triggers a warning in jfs and takes almost a minute
  References  : http://lkml.org/lkml/2007/9/4/73
  Last known good : ?
  Submitter   : Oliver Neukum [EMAIL PROTECTED]
  Caused-By   : ?
  Handled-By  : ?
  Status  : unknown
   I thought Shaggy asked Oliver about some details (and he did not
 answer so far) so I'd assume Shaggy is handling this.

I've put it on the back-burner since I haven't heard back from Oliver.
I still haven't found out whether or not it is a regression.  I'm not
too concerned about fixing it right away.  I don't think jfs on flash is
very important.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Dave Kleikamp
On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
 On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
  On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
   Here is the incremental small cleanup patch. 
   
   Remove kamlloc usages in jbd/jbd2 and consistently use 
   jbd_kmalloc/jbd2_malloc.
  
  Shouldn't we kill jbd_kmalloc instead?
  
 
 It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
 places to handle memory (de)allocation(page size) via kmalloc/kfree, so
 in the future if we need to change memory allocation in jbd(e.g. not
 using kmalloc or using different flag), we don't need to touch every
 place in the jbd code calling jbd_kmalloc.

I disagree.  Why would jbd need to globally change the way it allocates
memory?  It currently uses kmalloc (and jbd_kmalloc) for allocating a
variety of structures.  Having to change one particular instance won't
necessarily mean we want to change all of them.  Adding unnecessary
wrappers only obfuscates the code making it harder to understand.  You
wouldn't want every subsystem to have it's own *_kmalloc() that took
different arguments.  Besides, there aren't that many calls to kmalloc
and kfree in the jbd code, so there wouldn't be much pain in changing
GFP flags or whatever, if it ever needed to be done.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2/4] 2.6.23-rc5: known regressions v2

2007-09-08 Thread Dave Kleikamp
On Sat, 2007-09-08 at 13:11 +0200, Michal Piotrowski wrote:
 Hi all,
 
 Here is a list of some known regressions in 2.6.23-rc5.
 
 Feel free to add new regressions/remove fixed etc.
 http://kernelnewbies.org/known_regressions

 FS
 
 Subject : umount triggers a warning in jfs and takes almost a minute
 References  : http://lkml.org/lkml/2007/9/4/73
 Last known good : ?
 Submitter   : Oliver Neukum [EMAIL PROTECTED]
 Caused-By   : ?
 Handled-By  : ?
 Status  : unknown

Is this a regression?

Oliver,
Have you been using jfs on this flash previously?  If so, what was the
most recent kernel that didn't have problems?

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs/jfs: use DIV_ROUND_UP where appropriate

2007-08-31 Thread Dave Kleikamp
On Wed, 2007-08-29 at 23:17 -0500, Shaun Zinck wrote:
 This replaces some macros and code, which do the same thing as DIV_ROUND_UP
 defined in kernel.h, to use the DIV_ROUND_UP macro.
 
 Signed-off-by: Shaun Zinck [EMAIL PROTECTED]

Thanks.  I've added this to the jfs git tree.  It's queued for the -mm
tree.

Shaggy

 ---
 diff --git a/fs/jfs/jfs_dtree.h b/fs/jfs/jfs_dtree.h
 index 8561c6e..cdac2d5 100644
 --- a/fs/jfs/jfs_dtree.h
 +++ b/fs/jfs/jfs_dtree.h
 @@ -74,7 +74,7 @@ struct idtentry {
  #define DTIHDRDATALEN11
  
  /* compute number of slots for entry */
 -#define  NDTINTERNAL(klen) ( ((4 + (klen)) + (15 - 1)) / 15 )
 +#define  NDTINTERNAL(klen) (DIV_ROUND_UP((4 + (klen)), 15))
  
 
  /*
 @@ -133,7 +133,7 @@ struct dir_table_slot {
   ( ((s64)((dts)-addr1))  32 | __le32_to_cpu((dts)-addr2) )
  
  /* compute number of slots for entry */
 -#define  NDTLEAF_LEGACY(klen)( ((2 + (klen)) + (15 - 1)) / 15 )
 +#define  NDTLEAF_LEGACY(klen)(DIV_ROUND_UP((2 + (klen)), 15))
  #define  NDTLEAF NDTINTERNAL
  
 
 diff --git a/fs/jfs/resize.c b/fs/jfs/resize.c
 index 71984ee..7f24a0b 100644
 --- a/fs/jfs/resize.c
 +++ b/fs/jfs/resize.c
 @@ -172,7 +172,7 @@ int jfs_extendfs(struct super_block *sb, s64 newLVSize, 
 int newLogSize)
*/
   t64 = ((newLVSize - newLogSize + BPERDMAP - 1)  L2BPERDMAP)
L2BPERDMAP;
 - t32 = ((t64 + (BITSPERPAGE - 1)) / BITSPERPAGE) + 1 + 50;
 + t32 = DIV_ROUND_UP(t64, BITSPERPAGE) + 1 + 50;
   newFSCKSize = t32  sbi-l2nbperpage;
   newFSCKAddress = newLogAddress - newFSCKSize;
  
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Jfs-discussion] [PATCH 2/4] Fix mainline filesystems to handle ATTR_KILL_ bits correctly

2007-08-20 Thread Dave Kleikamp
On Mon, 2007-08-20 at 16:53 -0400, Jeff Layton wrote:
 This should fix all of the filesystems in the mainline kernels to handle
 ATTR_KILL_SUID and ATTR_KILL_SGID correctly. For most of them, this is
 just a matter of making sure that they call generic_attrkill early in
 the setattr inode op.

Here's my ack for the jfs piece.

 
 Signed-off-by: Jeff Layton [EMAIL PROTECTED]
Acked-by: Dave Kleikamp [EMAIL PROTECTED]

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Jfs-discussion] [PATCH 15/25] JFS: call attr_kill_to_mode from jfs_setattr

2007-08-07 Thread Dave Kleikamp
On Mon, 2007-08-06 at 09:54 -0400, Jeff Layton wrote:
 Signed-off-by: Jeff Layton [EMAIL PROTECTED]
Acked-by: Dave Kleikamp [EMAIL PROTECTED]

 ---
  fs/jfs/acl.c |2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)
 
 diff --git a/fs/jfs/acl.c b/fs/jfs/acl.c
 index 4d84bdc..34ca314 100644
 --- a/fs/jfs/acl.c
 +++ b/fs/jfs/acl.c
 @@ -227,6 +227,8 @@ int jfs_setattr(struct dentry *dentry, struct iattr 
 *iattr)
   struct inode *inode = dentry-d_inode;
   int rc;
 
 + attr_kill_to_mode(inode, iattr);
 +
   rc = inode_change_ok(inode, iattr);
   if (rc)
   return rc;
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 12/26] ext2 white-out support

2007-08-01 Thread Dave Kleikamp
On Tue, 2007-07-31 at 13:11 -0400, Josef Sipek wrote:
 On Tue, Jul 31, 2007 at 07:00:12PM +0200, Jan Blunck wrote:
  On Tue, Jul 31, Josef Sipek wrote:
  
   On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
Introduce white-out support to ext2.
   
   I think storing whiteouts on the branches is wrong. It creates all sort of
   nasty cases when people actually try to use unioning. Imagine a (no-so
   unlikely) scenario where you have 2 unions, and they share a branch. If 
   you
   create a whiteout in one union on that shared branch, the whiteout 
   magically
   affects the other union as well! Whiteouts are a union-level construct, 
   and
   therefore storing them at the branch level is wrong.
  
  So you think that just because you mounted the filesystem somewhere else it
  should look different? This is what sharing is all about. If you share a
  filesystem you also share the removal of objects.
 
 The removal happens at the union level, not the branch level. Say you have:
 
 /a/
 /b/foo
 /c/foo
 
 And you mount /u1 as a union of {a,b}, and /u2 as union of {a,c}.

Who does this?  I'm assuming that a is the top layer.  Aren't union
mounts typically about sharing lower layers and having a separate rw
layer for each union mount?

 $ find /u*
 /u1
 /u1/foo
 /u2
 /u2/foo
 $ rm /u1/foo # this creates whiteout for foo in /a
 $ find /u*
 /u1
 /u2
 
 Is that what you'd expect as a user? I don't think so.

That's exactly what I would expect.

If I were to:
$ echo this is new  /u1/foo

I would expect:
$ cat /u2/foo
this is new

So why should rm behave differently?

I haven't really been tuned into union mounts, so maybe I'm missing out
on something basic here.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 12/26] ext2 white-out support

2007-08-01 Thread Dave Kleikamp
On Wed, 2007-08-01 at 14:44 -0400, Josef Sipek wrote:
 Alright not the greatest of examples, there is something to be said about
 symmetry, so...let me try again :)
 
 /a/
 /b/bar(whiteout for bar)
 /c/foo/qwerty
 
 Now, let's mount a union of {a,b,c}, and we'll see:
 
 $ find /u
 /u
 /u/foo
 /u/foo/qwerty
 $ mv /u/foo /u/bar
 
 Now what? How do you rename? Do you rename in the same branch (assuming it
 is rw)?

Er, no.  According to Documentation/filesystems/union-mounts.txt, only
the topmost layer of the mount stack can be altered.

 If you do, you'll get:
 
 $ find /u
 /u
 
 Oops! There's a whiteout in /b that hides the directory in /c -- rename(2)
 shouldn't make directory subtrees disappear.
 
 There are two ways to solve this:
 
 1) cp -r the entire subtree being renamed to highest-priority branch, and
 rename there (you might have to recreate a series of directories to have a
 place to cp to...so you got cp -r _AND_ mkdir -p-like code in the VFS!
 1/2 a :) )

I think this is the only alternative, given the design.

 2) Don't store whiteouts within branches. This makes it really easy to
 rename and remove the whiteout.
 
 Sure, you could try to rename in-place and remove the whiteout, but what if
 you have:
 
 /a/
 /b/bar(whiteout)
 /c/bar/blah
 /d/foo/qwerty
 
 $ mv /u/foo /u/bar
 
 You can't just remove the whiteout, because that'd uncover the whited-out
 directory bar in /c.
 
 Josef 'Jeff' Sipek.
 
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vm/fs meetup details

2007-07-27 Thread Dave Kleikamp
Sorry for the late response, but I'm interested in attending.  I didn't
think I would be able to justify the trip for the one-day meeting, but I
begged an invitation to the VM summit from Martin.  I still need to get
travel approval, but I'm optimistic that I can justify the trip now.

Thanks,
Shaggy

On Thu, 2007-07-05 at 06:01 +0200, Nick Piggin wrote:
 Hi,
 
 The vm/fs meetup will be held September 4th from 10am till 4pm (with the
 option of going longer), at the University of Cambridge. 
 
 Anton Altaparmakov has arranged a conference room for us with whiteboard
 and projector, so many thanks to him. I will send out the location and
 plans for meeting/getting there after we work out the best strategy for
 that.
 
 At the moment we have 15 people interested so far. We can have a few
 more people, so if you aren't cc'ed and would like to come along please
 let me know. We do have limited space, so I'm sorry in advance if anybody
 misses out.
 
 I'll post out a running list of suggested topics later, but they're
 really just a rough guideline. It will be a round-table kind of thing
 and long monologue talks won't be appropriate, however some slides or
 whiteboarding to interactively introduce and discuss your idea would
 be OK.
 
 I think we want to avoid assigning slots for specific people/topics.
 Feel free to propose anything, if it only gets a small amount of
 interest then at least you'll know who to discuss it with later :)
 
 Thanks,
 Nick
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Dave Kleikamp
On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote:
 It just occurred to me:
 
  If i_version is 64bit, then knfsd would need to be careful when
  reading it on a 32bit host.  What are the locking rules?

How does knfsd use i_version?  I would think that if all it was doing
was to compare (i_version == previous_version), then locking wouldn't
really matter.  Well, theoretically, previous_version could be
0x1, and i_version could be 0x1, knfsd checks the high
word, then ext4 updates i_version to 0x2, then knfsd checks the
low word, detecting no change.  How likely is this?  (I don't understand
why i_version even needs to be 64 bits in the first place.)

  Presumably it is only updated under i_mutex protection, but having to
  get i_mutex to read it would seem a little heavy handed.

How does knfsd protect itself from the inode changing after i_version is
checked?  Is any locking being done otherwise?

  Should it use a seqlock like i_size?
  Could we use the same seqlock that i_size uses, or would we need a
  separate one?
 
 NeilBrown

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Faster ext2_clear_inode()

2007-07-09 Thread Dave Kleikamp
On Mon, 2007-07-09 at 22:00 +0200, Jörn Engel wrote:
 On Mon, 9 July 2007 22:01:48 +0400, Alexey Dobriyan wrote:
  
  Yes. Note that ext2_clear_inode() is referenced from ext2_sops, so even
  empty, it leaves traces in resulting kernel.
 
 Is that your opinion or have you actually measured a difference?
 I strongly suspect that compilers are smart enough to optimize away a
 call to an empty static function.

It's not a direct call to a static function.  It is called as a
super_ops method.  I don't think the overhead is very significant, but
it doesn't look like it could do any harm.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Dave Kleikamp
On Thu, 2007-06-28 at 11:33 -0700, Andrew Morton wrote:
 On Thu, 28 Jun 2007 23:27:57 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
   Please drop the non-ext4 patches from the ext4 tree and send incremental
   patches against the (non-ext4) fallocate patches in -mm.
  
  Please let us know what you think of Mingming's suggestion of posting
  all the fallocate patches including the ext4 ones as incremental ones
  against the -mm.
 
 I think Mingming was asking that Ted move the current quilt tree into git,
 presumably because she's working off git.

I moved the fallocate patches to the very end of the series in the quilt
tree.  This way the patches will be in the quilt tree for testing, but
Ted can easily leave them out of the git tree so you and Linus won't
pull them with the ext4 patches.

Fortunately, the ext4-specific fallocate patches don't conflict with the
other patches in the queue, so they can (at least for now) be handled
independently in the -mm tree.

 I'm not sure what to do, really.  The core kernel patches need to be in
 Ted's tree for testing but that'll create a mess for me.
 
 ug.
 
 Options might be
 
 a) I drop the fallocate patches from -mm and from the ext4 tree, hack up
any needed build fixes, then just wait for it all to mature and then
think about it again
 
 b) We do what we normally don't do and reserve the syscall slots in mainline.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/26] make access() use mnt check

2007-06-26 Thread Dave Kleikamp
On Mon, 2007-06-25 at 11:27 -0700, Dave Hansen wrote:
 On Sat, 2007-06-23 at 08:45 +0100, Christoph Hellwig wrote:

  You probably want to add a big comment explaining why it's fine here.
 
 I've got this in the next set:
 
 -
 -   if(IS_RDONLY(nd.dentry-d_inode))
 +   /*
 +* This is a rare case where using __mnt_is_readonly()
 +* is OK without a mnt_want/drop_write() pair.  Since
 +* not actual write to the fs is performed here, we do

s/not/no/

 +* not need to telegraph to that to anyone.  Also, we
 +* accept that access is inherently racy, and know that
 +* the fs might be remounted between this syscall, and
 +* any action taken because of its result.
 +*/
 +   if (__mnt_is_readonly(nd.mnt))
 res = -EROFS;
 

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH resend] introduce I_SYNC

2007-05-31 Thread Dave Kleikamp
On Thu, 2007-05-31 at 16:25 +0200, Jörn Engel wrote:
 --- linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c~I_LOCK2007-05-07
 10:28:55.0 +0200
 +++ linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c   2007-05-29
 13:10:32.0 +0200
 @@ -1286,7 +1286,14 @@ int txCommit(tid_t tid,  /*
 transaction 
  * commit the transaction synchronously, so the last
 iput
  * will be done by the calling thread (or later)
  */
 -   if (tblk-u.ip-i_state  I_LOCK)
 +   /*
 +* I believe this code is no longer needed.  Splitting
 I_LOCK
 +* into two bits, I_LOCK and I_SYNC should prevent
 this
 +* deadlock as well.  But since I don't have a JFS
 testload
 +* to verify this, only a trivial s/I_LOCK/I_SYNC/ was
 done.
 +* Joern
 +*/
 +   if (tblk-u.ip-i_state  I_SYNC)
 tblk-xflag = ~COMMIT_LAZY;
 }

I think the code is still needed, and I think this change is correct.
The deadlock that this code is avoiding is caused by clear_inode()
calling wait_on_inode().  Since clear_inode() now calls
inode_sync_wait(inode), we want to avoid the lazily committing this
transaction when the I_SYNC flag is set.

Unfortunately, recreating the deadlock is hard, and I haven't been able
to recreate it with this code commented out.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 2/5] inode reservation v0.1 (ext4 kernel patch)

2007-05-24 Thread Dave Kleikamp
On Thu, 2007-05-24 at 02:06 +0800, coly wrote:
 The patch is generated based on 2.6.20-ext4-2 branch. you can find the
 benchmark from other email.
 
 DO NOT waste time on reading the patch :-) I post this patch here is to
 show that I really spent time on it and the patch can work (even not
 well).

I won't waste my time then.  I'm discouraged from trying by the lack of
indentation.  It looks like the tabs got converted to a single space
somehow.

 diff --git a/Makefile b/Makefile
 index 7e2750f..21d21e4 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -1,8 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 20
 -EXTRAVERSION =
 -NAME = Homicidal Dwarf Hamster
 +EXTRAVERSION = inores
 
 # *DOCUMENTATION*
 # To see a list of typical targets execute make help
 diff --git a/fs/ext4/bitmap.c b/fs/ext4/bitmap.c
 index 11e93c1..daf88b4 100644
 --- a/fs/ext4/bitmap.c
 +++ b/fs/ext4/bitmap.c
 @@ -30,3 +30,29 @@ unsigned long ext4_count_free (struct buffer_head *
 map, unsigned int numchars)
 
 #endif  /*  EXT4FS_DEBUG  */
 
 +/*
 + * Read the inode allocation bitmap for a given block_group, reading
 + * into the specified slot in the superblock's bitmap cache.
 + *
 + * Return buffer_head of bitmap on success or NULL.
 + */
 +struct buffer_head *
 +read_inode_bitmap(struct super_block * sb, unsigned long block_group)
 +{
 + struct ext4_group_desc *desc;
 + struct buffer_head *bh = NULL;
 +
 + desc = ext4_get_group_desc(sb, block_group, NULL);
 + if (!desc)
 + goto error_out;
 +
 + bh = sb_bread(sb, ext4_inode_bitmap(sb, desc));
 + if (!bh)
 + ext4_error(sb, read_inode_bitmap,
 + Cannot read inode bitmap - 
 + block_group = %lu, inode_bitmap = %llu,
 + block_group, ext4_inode_bitmap(sb, desc));
 +error_out:
 + return bh;
 +}
 +

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-17 Thread Dave Kleikamp
On Thu, 2007-05-17 at 09:40 +1000, David Chinner wrote:
 On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
  On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:

   Please don't make this always happen. c/mtime updates should be dependent
   on the mode being used and whether there is visible change to the file. 
   If no
   userspace visible changes to the file occurred, then timestamps should not
   be changed.
  
  i_blocks will be updated, so it seems reasonable to update ctime.  mtime
  shouldn't be changed, though, since the contents of the file will be
  unchanged.
 
 That's assuming blocks were actually allocated - if the prealloc range already
 has underlying blocks there is no change and so we should not be changing
 mtime either. Only the filesystem will know if it has changed the file, so I
 think that timestamp updates need to be driven down to that level, not done
 blindy at the highest layer

Yes, I agree.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-16 Thread Dave Kleikamp
On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
 On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:

  Following changes were made to the previous version:
   1) Added description before sys_fallocate() definition.
   2) Return EINVAL for len=0 (With new draft that Ulrich pointed to,
  posix_fallocate should return EINVAL for len = 0.
   3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
   4) Do not return ENODEV for dirs (let individual file systems decide if
  they want to support preallocation to directories or not.
   5) Check for wrap through zero.
   6) Update c/mtime if fallocate() succeeds.
 
 Please don't make this always happen. c/mtime updates should be dependent
 on the mode being used and whether there is visible change to the file. If no
 userspace visible changes to the file occurred, then timestamps should not
 be changed.

i_blocks will be updated, so it seems reasonable to update ctime.  mtime
shouldn't be changed, though, since the contents of the file will be
unchanged.

 e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate()
 extending the file, otherwise no change in timestamps should occur.
 
 Cheers,
 
 Dave.
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-08 Thread Dave Kleikamp
On Tue, 2007-05-08 at 16:22 +0530, Amit K. Arora wrote:
 On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
  On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
   On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:

So we don't implement fallocate on bitmap-based files!  Well that's huge
news.  The changelog would be an appropriate place to communicate this,
along with reasons why, or a description of the plan to fix it.
   
   Ok. Will add this in the function description as well.
   
Also, posix says nothing about fallocate() returning ENOTTY.
   
   Right. I don't seem to find any suitable error from posix description.
   Can you please suggest an error code which might make more sense here ?
   Will -ENOTSUPP be ok ? Since we want to say here that we don't support
   non-extent files.
  
  Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
  here, and fall back to the current library code to do preallocation?
  This way, the caller of fallocate() will never see this return code, so
  it won't violate posix.
 
 You are right.
 
 But, we still need to standardize (and limit) the error codes
 which we should return from kernel when we want to fall back on the
 library implementation. The posix_fallocate() library function will have
 to look for a set of errors from fallocate() system call, upon receiving
 which it will do preallocation from user level; or else, it will return
 success/error-code returned by the system call to the user.
 
 I think we can make it fall back to library implementation of fallocate,
 whenever posix_fallocate() receives any of the following errors from
 fallocate() system call:
 
 1. ENOSYS
 2. EOPNOTSUPP
 3. ENOTTY(?)
 
 Now the question is - should we limit the set of errors for this purpose
 to just 1  2 above ? In that case I will need to change the error being
 returned here to -EOPNOTSUPP (from current -ENOTTY).

If you want my opinion, -EOPNOTSUPP is better than -ENOTTY.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] JFS: Fix race waking up jfsIO kernel thread

2007-05-04 Thread Dave Kleikamp
I've been looking at a hang that was reported off-list, and I believe I
found the cause.  I'm still waiting for confirmation on whether the
patch does fix the problem, but I wanted to distribute the patch in case
anyone else is seeing a similar hang.  It looks like the problem is in
kernels from 2.6.17 to the present.

Thanks,
Shaggy

JFS: Fix race waking up jfsIO kernel thread

It's possible for a journal I/O request to be added to the log_redrive
queue and the jfsIO thread to be awakened after the thread releases
log_redrive_lock but before it sets its state to TASK_INTERRUPTIBLE.

The jfsIO thread should set its state before giving up the spinlock, so
the waking thread will really wake it.

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]

diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index ff7f1be..16c6268 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -2354,12 +2354,13 @@ int jfsIOWait(void *arg)
lbmStartIO(bp);
spin_lock_irq(log_redrive_lock);
}
-   spin_unlock_irq(log_redrive_lock);
 
if (freezing(current)) {
+   spin_unlock_irq(log_redrive_lock);
refrigerator();
} else {
set_current_state(TASK_INTERRUPTIBLE);
+   spin_unlock_irq(log_redrive_lock);
schedule();
__set_current_state(TASK_RUNNING);
}

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux page cache issue?

2007-03-28 Thread Dave Kleikamp
On Wed, 2007-03-28 at 02:45 -0400, Xin Zhao wrote:
 Hi,
 
 If a Linux process opens and reads a file A, then it closes the file.
 Will Linux keep the file A's data in cache for a while in case another
 process opens and reads the same in a short time? I think that is what
 I heard before.

Yes.

 But after I digged into the kernel code, I am confused.
 
 When a process closes the file A, iput() will be called, which in turn
 calls the follows two functions:
 iput_final()-generic_drop_inode()

A comment from the top of fs/dcache.c:

/*
 * Notes on the allocation strategy:
 *
 * The dcache is a master of the icache - whenever a dcache entry
 * exists, the inode will always exist. iput() is done either when
 * the dcache entry is deleted or garbage collected.
 */

Basically, as long a a dentry is present, iput_final won't be called on
the inode.

 But from the following calling chain, we can see that file close will
 eventually lead to evict and free all cached pages. Actually in
 truncate_complete_page(), the pages will be freed.  This seems to
 imply that Linux has to re-read the same data from disk even if
 another process B read the same file right after process A closes the
 file. That does not make sense to me.
 
 /***calling chain ***/
 generic_delete_inode/generic_forget_inode()-
 truncate_inode_pages()-truncate_inode_pages_range()-
 truncate_complete_page()-remove_from_page_cache()-
 __remove_from_page_cache()-radix_tree_delete()
 
 Am I missing something? Can someone please provide some advise?
 
 Thanks a lot
 -x

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] add file position info to proc

2007-03-09 Thread Dave Kleikamp
On Fri, 2007-03-09 at 14:54 +0100, Miklos Szeredi wrote:
 Comments are welcome.
 
 
 This patch adds support for finding out the current file position,
 open flags and possibly other info in the future.
 
 These new entries are added:
 
   /proc/PID/fdinfo/FD
   /proc/PID/task/TID/fdinfo/FD
 
 For each fd the information is provided in the following format:
 
 pos:  1234
 flags:012

I think this information would be a little easier to access if there
would be a single file per pid or thread containing something like:

handle  flags   pos path
0   012 1234/dev/pts/1
1   014 5678/tmp/output
etc.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] add file position info to proc

2007-03-09 Thread Dave Kleikamp
On Fri, 2007-03-09 at 10:23 -0500, Benjamin LaHaise wrote:
 On Fri, Mar 09, 2007 at 09:15:06AM -0600, Dave Kleikamp wrote:
  I think this information would be a little easier to access if there
  would be a single file per pid or thread containing something like:
  
  handle  flags   pos path
  0   012 1234/dev/pts/1
  1   014 5678/tmp/output
  etc.
 
 That would not be a good idea, as not all users have the same permissions 
 for viewing this information.

How will this have different permission issues than Miklos' patch?

 It's also quite against the design philosophy 
 used elsewhere in /proc.

What's so different between this and /proc/pid/maps
or /proc/pid/mounts?

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount option to ignore permissions

2007-03-09 Thread Dave Kleikamp
On Thu, 2007-03-08 at 23:39 +0100, Ihar `Philips` Filipau wrote:

   The problem have beaten me before. And now I have it again.
   Imaging external hard drive with proper file system (proper ==
 supports posix permissions) where files were created by user A and
 then it (ext. hard drive) was brought to another location/computer and
 user B tried to read them. Failure. Why? Because Linux preserved
 permissions on hard drive - though they are already irrelevant on
 system fs is currently mounted on. And that renders literally all
 files accessible only by root.
 
   What is needed is special mount option to tell file system (*):
(1) to ignore permissions when file/directory is read;
(2) when file/directory is created it receives automatically world
 writable permissions 0666 (I cannot imaging how to simulate user
 friendly file attribute read-only, though it seems not relevant to
 external storage anyway).
 
   I'm looking into the code and it seems that every file system parses
 option on their own.
   Global flags (ro/rw, nodev, etc) are handled by mount(8) itself and
 passed to sys_mount() as bitmask.
 
   How gid/uid are passed to file system? I do not see them in
 parameter list to sys_mount(). Or they are handled somehow otherwise?

A posix file system doesn't have a uid or gid associated with it.  These
are set based on the owner of the process creating the file or
directory.

   Any ideas on how I can simulate such behavior and or on how to
 implement such attribute would be appreciated.

jfs does support mount options to override uid, gid, and umask for
existing files.  The reason for this is so a file system can be shared
between linux and os2, and os2 doesn't use these fields.  (I just
realized that I've never documented these flags.)

I don't know how other file systems' maintainers would feel about
supporting these flags.  I can see how it would be useful for external
hard drives.

 P.S. chmod/chown isn't option since (1) they do not work for ro file
 system and (2) doing that every time on NNNk files might quite
 tiresome - every time disk is reattached.

This is true of chown, but if you would chmod everything to 777 (or 666
for non-directories), you wouldn't have to repeat that every time you
reattach.  You could mount the drive under a directory with restricted
permissions to have some degree of security.

 P.P.S. BTW MacOSX has such option and it is automatically selected for
 external hard drives.

What are the details?  If something were to be proposed, it wouldn't
hurt to try to be consistent.

 P.P.P.S. That doesn't happen with most external hard drives since they
 are all FAT32. I moved to ext2/hfs+ in part to avoid the recurring
 nightmares of past when I have worked with FAT32 all day long. And
 also ext2/hfs+ (under Linux/MacOSX) are better than FAT*. And also I
 need case sensitiveness.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] add file position info to proc

2007-03-09 Thread Dave Kleikamp
On Fri, 2007-03-09 at 22:03 +0100, Miklos Szeredi wrote:
   I think this information would be a little easier to access if there
   would be a single file per pid or thread containing something like:
   
   handle flags   pos path
   0  012 1234/dev/pts/1
   1  014 5678/tmp/output
   etc.
   
   That would not be a good idea, as not all users have the same 
   permissions
   for viewing this information.
  
  Since /proc/$$/fd is S_IRWXU, only the user owning the process (plus
  usually root) can view it. So making the single file you propose S_IRWXU
  too should solve any security issue, should not it?
  
  I think the problem is not with permissions, but with scaling to large
  numbers of file descriptors.
  
  Would not that be dealt accordingly with seq_file?
 
 Yes, most likely it wouldn't be so bad either. But it _would_ be:
 
  a) more complex, this implementation reuses most of the  the /proc/fd code
  b) less efficient for querying a single fd
 
 So I prefer the current solution, but not very strongly.

I don't have a real strong opinion, but I wanted to throw that out as a
suggestion.  You're the one writing the code.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Dave Kleikamp
On Fri, 2007-03-02 at 18:45 +0800, Andreas Dilger wrote:
 On Mar 01, 2007  13:15 -0600, Eric Sandeen wrote:
  One thing I'd like to see is a cmd argument as well, to allow for 
  example allocation vs. reservation (i.e. allocating blocks vs. simply 
  reserving a number), as well as the inverse of those functions 
  (un-reservation, de-allocation)?
  
  If the allocation interface allows allocation/reservation within 
  arbitrary ranges, if the only way to un-allocate is via a truncate, 
  that's pretty asymmetric.
 
 I'd rather we just get the oft-discussed punch() syscall instead.
 This is really what unallocate would do for persistent allocations
 and it would be useful for files that were not preallocated.

I can see a difference though.  punch() would throw away written data as
well as pre-allocated-but-never-written-to data.  I can see where a user
might preallocate a large file and do a lot of random writes.  At some
point, he decides the file isn't going to grow much more, so let's free
up the remaining pre-allocated blocks.  This makes even more sense with
reservation.

The alternative would be to have punch() take a flag to specify if only
preallocated or reserved blocks should be freed.

 
 For filesystems that don't implement punch glibc() would do zero-filling
 of the punched area I guess (to make it equivalent to reading from a
 hole in the file).

Or it could just fail.  Writing zeroes may be really slow and not give
the caller any benefit.  (The intention was to free blocks back to the
file system.)

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Dave Kleikamp
Amit wrote:

  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

On Thu, 2007-03-01 at 22:16 -0800, Andrew Morton wrote:
 On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  Just curious .. What does posix_fallocate() return ?
 
 bookmark this:
 
 http://www.opengroup.org/onlinepubs/009695399/nfindex.html
 
 Upon successful completion, posix_fallocate() shall return zero;
 otherwise, an error number shall be returned to indicate the error.

Then there's no need for sys_allocate to return a long.
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Dave Kleikamp
On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
 On Fri, 2 Mar 2007 00:04:45 +0530
 Amit K. Arora [EMAIL PROTECTED] wrote:

  +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
  +{
  +   struct file *file;
  +   struct inode *inode;
  +   long ret = -EINVAL;
  +   file = fget(fd);
  +   if (!file)
  +   goto out;
  +   inode = file-f_path.dentry-d_inode;
  +   if (inode-i_op  inode-i_op-fallocate)
  +   ret = inode-i_op-fallocate(inode, offset, len);
  +   else
  +   ret = -ENOTTY;
  +   fput(file);
  +out:
  +return ret;
  +}
 

 ENOTTY is a bit unconventional - we often use EINVAL for this sort of
 thing.  But EINVAL has other meanings for posix_fallocate() and isn't
 really appropriate here anyway.  So I'm not sure what would be better...

Would EINVAL (or whatever) make it back to the caller of
posix_fallocate(), or would glibc fall back to its current
implementation?

Forgive me if I haven't put enough thought into it, but would it be
useful to create a generic_fallocate() that writes zeroed pages for any
non-existent pages in the range?  I don't know how glibc currently
implements posix_fallocate(), but maybe the kernel could do it more
efficiently, even in generic code.  Maybe we don't care, since the major
file systems can probably do something better in their own code.
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Dave Kleikamp
On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:
 On Thu, 01 Mar 2007 22:44:16 +
 Dave Kleikamp [EMAIL PROTECTED] wrote:
 
  On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
   On Fri, 2 Mar 2007 00:04:45 +0530
   Amit K. Arora [EMAIL PROTECTED] wrote:
  
+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   inode = file-f_path.dentry-d_inode;
+   if (inode-i_op  inode-i_op-fallocate)
+   ret = inode-i_op-fallocate(inode, offset, len);
+   else
+   ret = -ENOTTY;
+   fput(file);
+out:
+return ret;
+}
   
  
   ENOTTY is a bit unconventional - we often use EINVAL for this sort of
   thing.  But EINVAL has other meanings for posix_fallocate() and isn't
   really appropriate here anyway.  So I'm not sure what would be better...
  
  Would EINVAL (or whatever) make it back to the caller of
  posix_fallocate(), or would glibc fall back to its current
  implementation?
  
  Forgive me if I haven't put enough thought into it, but would it be
  useful to create a generic_fallocate() that writes zeroed pages for any
  non-existent pages in the range?  I don't know how glibc currently
  implements posix_fallocate(), but maybe the kernel could do it more
  efficiently, even in generic code.  Maybe we don't care, since the major
  file systems can probably do something better in their own code.
 
 Given that glibc already implements fallocate for all filesystems, it will
 need to continue to do so for filesystems which don't implement this
 syscall - otherwise applications would start breaking.

I didn't make it clear, but my point was to call generic_fallocate if
the file system did not define i_op-allocate().

if (inode-i_op  inode-i_op-fallocate)
ret = inode-i_op-fallocate(inode, offset, len);
else
ret = generic_fallocate(inode, offset, len);

I'm not sure it's worth the effort, but I thought I'd throw the idea out
there.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-cifs-client] i_mutex and deadlock

2007-02-23 Thread Dave Kleikamp
On Fri, 2007-02-23 at 10:02 -0600, Steve French (smfltc) wrote:
 A field in i_size_write (i_size_seqcount) must be protected against 
 simultaneous update otherwise we risk looping in i_size_read.
 
 The suggestion in fs.h is to use i_mutex which seems too dangerous due 
 to the possibility of deadlock.

I'm not sure if it's as much a suggestion as a way of documenting the
locking  that exists (or existed when the comment was written).

... i_size_write() does need locking around it  (normally i_mutex) ...

 There are 65 places in the fs directory which lock an i_mutex, and seven 
 more in the mm directory.   The vfs does clearly lock file inodes in 
 some paths before calling into a particular filesystem (nfs, ext3, cifs 
 etc.) - in particular for fsync but probably for others that are harder 
 to trace.  This seems to introduce the possibility of deadlock if a 
 filesystem also uses i_mutex to protect file size updates
 
 Documentation/filesystems/Locking describes the use of i_mutex (was 
 i_sem previously) and indicates that it is held by the vfs on three 
 additional calls on file inodes which concern me (for deadlock 
 possibility), setattr, truncate and unlink.
 
 nfs seems to limit its use of i_mutex to llseek and invalidate_mapping, 
 and does not appear to grab the i_mutex (or any sem for that matter) to 
 protect i_size_write
 (nfs calls i_size_write in nfs_grow_file) - and for the case of 
 nfs_fhget (in which they bypass i_size_write and set i_size directly) 
 does not seem to grab i_mutex either.
 
 ext3 also does not use i_mutex for this purpose (protecting 
 i_size_write) - ony to protect a journalling ioctl.
 
 I am concerned about using i_mutex to protect the cifs calls to 
 i_size_write (although it seems to fix a problem reported in i_size_read 
 under stress) because of the following:
 
 1) no one else calls i_size_write AFAIK (on our file inodes)

I think you're right.

 2) we don't block inside i_size_write do we ... (so why in the world do 
 they take a slow mutex instead of a fast spinlock)

My guess, is that in existing cases, it was already being held, so there
is no need to do something different.  I'm not sure if the comment is
still accurate.  What locking protects it in generic_commit_write() and
nobh_commit_write()?

 3) we don't really know what happens inside fsync (the paths through the 
 page cache code seem complex and we don't want to reenter writepage in 
 low memory conditions and deadlock updating the file size), and there is 
 some concern that the vfs takes the i_mutex in other paths on file 
 inodes before entering our code and could deadlock.
 
 Any reason, why an fs shouldn't simply use something else (a spinlock) 
 other than i_mutex to protect the i_size_write call?

i_mutex doesn't make sense in your case.  Use whatever makes sense in
cifs.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Accessing file-offset info for fds in /proc?

2007-02-20 Thread Dave Kleikamp
On Tue, 2007-02-20 at 02:31 -0500, Hank Leininger wrote:
 Is there anything provided by the kernel that would let you see the
 current offset of an existing filehandle?
 
 Sometimes when processing a very large file (grepping a log, bzip2'ing
 or gpg'ing a file, or whatever), I'd really like to know how far along
 it is, because I'm impatient.  lsof has an -o flag to show offsets for
 file descriptors it lists, but it appears that's not supported under
 Linux.  It looks like all of the information lsof and fuser print about 
 files in use, etc can be gotten from /proc/*/fd/* (and /proc/*/maps, but
 I'm not really concerned with mmap'ed files, just positions on fds).
 Sometimes I'll resort to strace -s4096'ing the process to see what chunk
 of text it's currently reading, and try to guess from that.  Silly.
 
 Has anybody ever developed a patch to implement this?  I realize this
 could create a variety of information-leakage problems; the information
 probably would need to be restricted, such as by the same rules as
 dumpable.  Are there any horribly painful reasons why this couldn't be
 done?

It shouldn't be too painful.  The code to populate /proc/*/fd/ has the
file struct.  It just doesn't have a place pass the offset to user-space
since it's basically creating a symlink.  In proc_fd_link(), it has the
file struct.  The offset is file-f_pos.

One could create something like /proc/*/fd_offsets, whose read method
could list the file descriptor, path, and offset for each open file.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fix(es) for ext2 fsync bug

2007-02-20 Thread Dave Kleikamp
On Tue, 2007-02-20 at 21:39 +, Valerie Henson wrote:
 On Tue, Feb 20, 2007 at 01:30:25PM -0800, Junfeng Yang wrote:
  On 2/20/07, Valerie Henson [EMAIL PROTECTED] wrote:
  
  Google. (GoogleFS runs on top of ext2.)
 
  It's surprising to know that... I guess they reply on GoogleFS's own
  replication and checksumming for consistency.
 
 Yep, they just want a local file system with ultrafast on-line
 performance.  They don't care about recovery time particularly because
 of the GoogleFS replication (although I heard rumors they have some
 fast fsck scheme, maybe resembling the dirty bit stuff I did last
 year).

I wonder if they would consider this a important bug?  I know nothing
about GoogleFS, but I would guess that they have more sophisticated
recovery than relying on an fsync shortly before a crash to ensure data
integrity.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libfs.c:simple_lookup

2007-02-19 Thread Dave Kleikamp
On Mon, 2007-02-19 at 21:09 +0100, Jan Engelhardt wrote:
 Hello,
 
 
 simple_lookup() in fs/libfs.c does some extra steps, namely
 
   dentry-d_op = simple_dentry_operations;
   d_add(dentry, NULL);
 
 as far as I understand, this creates a negative dentry which will be 
 deleted sometime later again. Is not it easier to not create it at all 
 (since it is not going to be existent anyway)?

I'm not sure I understand the question, so allow me to restate a similar
question, and you can tell me if we're asking the same thing.

In general, the negative dentry avoids the overhead of looking up a
non-existent entry more than once by remembering that the entry does
not exist.  Since there is almost no overhead to calling
simple_lookup(), would be be better off simply returning without
creating the negative dentry (and letting simple_lookup be called more
often)?

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Race between __sync_single_inode() and LogFS garbage collector

2007-02-19 Thread Dave Kleikamp
On Mon, 2007-02-19 at 21:31 +, Jörn Engel wrote:
 Looks like I really write the first log-structured filesystem for Linux.
 At least I can into a fairly arcane race that seems to be generic to all
 of them.
 
 Writing when space is tight may involve calling the garbage collector.
 The garbage collector will iget() random inodes, either to verify if a
 block is valid or to copy the block around.  At this point, all writes
 to LogFS are serialized.
 
 __sync_single_inode() will first lock a random inode, then call
 write_inode(), then unlock the inode.  So we can get this:
 
 
 __sync_single_inode() garbage collector
 -
 inode-i_state |= I_LOCK; ...
 ...   mutex_lock(super-s_w_mutex);
 write_inode(inode, wait); ...
   ... iget(sb, ino);
   mutex_lock(super-s_w_mutex);  ...
   ...   wait_on_inode(inode);
   mutex_unlock(super-s_w_mutex);
   ... 
 ...
 inode-i_state = ~I_LOCK;
 
 
 And once in a blue moon, those two will race for the same inode.  As far
 as I can see, the race can only get fixed in two ways:
 1. Never iget() inside the garbage collector.  That would require having
a private inode cache for LogFS.
 2. Synchonize __sync_single_inode() and the garbage collector somehow.
 
 Variant 1 would result in double caching for the same object, something
 I would like to avoid.  So does anyone have suggestions how variant 2
 could be achieved?  Essentially what I need is a way to say don't sync
 any inodes right now, I'll be back in 5 milliseconds or so.

It'd be nice if you could drop s_w_mutex when the garbage collector
calls i_get().

Otherwise, you may be able to call ilookup5_nowait() in the garbage
collector, and skip that inode if I_LOCK is set.

 
 Jörn
 
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Race between __sync_single_inode() and LogFS garbage collector

2007-02-19 Thread Dave Kleikamp
On Mon, 2007-02-19 at 23:23 +, Jörn Engel wrote:
 On Mon, 19 February 2007 17:05:55 -0600, Dave Kleikamp wrote:
  
  It'd be nice if you could drop s_w_mutex when the garbage collector
  calls i_get().
 
 Not possible.  Garbage collection only happens when space is getting
 tight.  At that moment all writes must be serialized or this race will
 be the least of my problems. :(
 
  Otherwise, you may be able to call ilookup5_nowait() in the garbage
  collector, and skip that inode if I_LOCK is set.
 
 Also not possible.  I cannot skip that inode, or again this race will be
 a minor problem.  The inode exists on the medium and I must get it by
 some means.  Re-reading it from the medium is fine, writing is not and
 waiting for the write to happen brings me back to square one.

Okay, I get it now.  You've got more constraints than I initially
realized.

 It is a nasty problem that has been haunting me for about a year now.
 For a while I tried ilookup5_nowait() and just used the inode in spite
 of the lock.  But that will explode spectacularly when racing against
 generic_drop_inode().

You've obviously given this a lot more thought that I have, but this
sounds like something that has possibilities.  You couldn't implement
your own drop_inode method that does better locking against the garbage
collector?

 Double-caching or a common lock seem to be the only solutions.
 
 Jörn
 
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Dave Kleikamp
On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:

 Another very heavyweight approach would be to simply force a full sync
 of the filesystem whenever fysnc() is called.  Not pretty, and without
 the proper write ordering, the race is still potentially there.

I don't think this race is an issue, in that it would require the crash
to happen before the fsync completed, so there would be no expectation
that the data is safe.  It's a moot point, since I don't think this is
an acceptable solution anyway.

 I'd say that the best way to handle this is in fsck, but quite frankly
 it's relatively low priority bug to handle, since a much simpler
 workaround is to tell people to use ext3 instead.

Right.  Who's still using ext2?
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Dave Kleikamp
On Thu, 2007-02-15 at 10:59 -0500, sfaibish wrote:
 On Thu, 15 Feb 2007 10:09:22 -0500, Dave Kleikamp  
 [EMAIL PROTECTED] wrote:
 
  On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:
 
  Another very heavyweight approach would be to simply force a full sync
  of the filesystem whenever fysnc() is called.  Not pretty, and without
  the proper write ordering, the race is still potentially there.
 
  I don't think this race is an issue, in that it would require the crash
  to happen before the fsync completed, so there would be no expectation
  that the data is safe.  It's a moot point, since I don't think this is
  an acceptable solution anyway.
 
  I'd say that the best way to handle this is in fsck, but quite frankly
  it's relatively low priority bug to handle, since a much simpler
  workaround is to tell people to use ext3 instead.
 
  Right.  Who's still using ext2?
 It was my understanding from the persentation of Dawson that ext3 and jfs  
 have
 same problem.

Hmm.  If jfs has the problem, it is a bug.  jfs is designed to handle
this correctly.  I'm pretty sure I've fixed at least one bug that
eXplode has uncovered in the past.  I'm not sure what was mentioned in
the presentation though.  I'd like any information about current
problems in jfs.

 It is not an ext2 only problem. Also whatever solution we  
 adopt
 we need to be sure that we test it using the eXplode methodology.
 
 /Sorin
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Dave Kleikamp
On Thu, 2007-02-15 at 11:11 -0800, Junfeng Yang wrote:
 Hmm.  If jfs has the problem, it is a bug.  jfs is designed to
 handle
 this correctly.  I'm pretty sure I've fixed at least one bug
 that 
 eXplode has uncovered in the past.  I'm not sure what was
 mentioned in
 the presentation though.  I'd like any information about
 current
 problems in jfs.
 
 
 I believe you have fixed the JFS fsync bug, Dave.  It was caused by
 reusing a directory inode as a file inode.  If the machine crashes
 later, fsck would think this file is a directory, and clear all its
 data. 

Yeah.  That one was fixed a while back.  Thanks for clearing this up.

Shaggy

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fix(es) for ext2 fsync bug

2007-02-14 Thread Dave Kleikamp
On Thu, 2007-02-15 at 07:31 +1100, David Chinner wrote:
 On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
  Just some quick notes on possible ways to fix the ext2 fsync bug that
  eXplode found.  Whether or not anyone will bother to implement it is
  another matter.
  
  Background: The eXplode file system checker found a bug in ext2 fsync
  behavior.  Do the following: truncate file A, create file B which
  reallocates one of A's old indirect blocks, fsync file B.  If you then
  crash before file A's metadata is all written out, fsck will complete
  the truncate for file A... thereby deleting file B's data.  So fsync
  file B doesn't guarantee data is on disk after a crash.  Details:
  
  http://www.stanford.edu/~engler/explode-osdi06.pdf 
  
  Two possible solutions I can think of:
  
  * Rearrange order of duplicate block checking and fixing file size in
fsck.  Not sure how hard this is. (Ted?)
  
  * Keep a set of still allocated on disk block bitmaps that gets
flushed whenever a sync happens.  Don't allocate these blocks.
Journaling file systems already have to do this.
 
 You don't need anything on disk or to fsck to fix this problem -
 just avoid it completely by keeping a list of recently truncated
 blocks in memory and don't reuse them until the old owner inode is
 sync'd to disk.

I think that's pretty much what Val is suggesting.  She suggests bitmaps
rather than a list though.  Maybe she should have used a better term
than flushed, as this list only needs to be cleared, rather than
written to disk.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] small smbfs bugfix

2007-01-26 Thread Dave Kleikamp
On Fri, 2007-01-26 at 09:51 +0300, Pavel Fedin wrote:
  Hello!
 
  This small patch fixes a problem with smbfs in 2.6.19.2 kernel.
 Should also apply to other kernels.

NACK

  The problem is that smbfs doesn't understand user and noexec
 options. This makes it impossible to include into /etc/fstab lines
 like:
 
 /pc2/myshare /mnt/smb/pc2/myshare smbfs user,noauto 0 0
 
  Smbmount expands user,noauto to user,noauto,noexec plus something
 else, don't remember exactly what. In the result such an entry simply
 can't be mounted because smbfs fails due to unknown options.
  This patch makes smbfs silently ignoring all unknown options instead of
 producing an error.

I believe the problem is in smbmount.  At the system call level, user
and noauto are not valid options.  These are usually parsed by the
mount command, but not sent to the mount syscall.  noexec also
shouldn't be passed as a string in the syscall, but the MS_NOEXEC flag
should be set in the mountflags argument.

  Also i would suggest to introduce this behavor to all other
 filesystems. This would greatly improve autofs usability. Imagine a
 situation: i have a ZIP drive and i'd like to be able to use ext2,
 ext3 and vfat-formatted cartridges. Also i'm russian user so i use NLS feature
 for names translation. In the fstab i would write:
 
 /dev/hdd /mnt/zip ext2,ext3,vfat user,noauto,iocharset=koi8-r,codepage=866 0 0
 
  But in this case ext2 and ext3 wouldn't work at all since they would
 fail due to unknown iocharset and codepage options.

If the file system doesn't honor the option, it shouldn't accept it.  In
the first example, if you really want to mount with noexec, and smbfs
simply ignores it, you've undermined the security of the system.  Not
recognizing the iocharset and codepage probably isn't as big a problem,
but ext2/3 would not be behaving the same was as if it truly respected
those options.

 
 -- 
 Best regards,
  Pavel  mailto:[EMAIL PROTECTED]

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH: 2.6.20-rc4-mm1] JFS: Avoid deadlock introduced by explicit I/O plugging

2007-01-18 Thread Dave Kleikamp
On Thu, 2007-01-18 at 01:30 -0500, Josef Sipek wrote:
 On Wed, Jan 17, 2007 at 04:55:49PM -0600, Dave Kleikamp wrote:

   /*
* jfs_lock.h
  @@ -42,6 +43,7 @@ do {  
  \
  if (cond)   \
  break;  \
  unlock_cmd; \
  +   blk_replug_current_nested();\
  schedule(); \
  lock_cmd;   \
  }   \
 
 Is {,un}lock_cmd a macro? ...

They are the commands passed into this macro (as arguments) to
release/take a lock.  This is a home-grown wait_on_event type macro
where the condition must be checked while holding a lock.  And the
commands passed in are themselves macros.  The jfs code could probably
be cleaned up a bit as far as excessive use of macros for locking, but
it's been working for a few years with few problems.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH: 2.6.20-rc4-mm1] JFS: Avoid deadlock introduced by explicit I/O plugging

2007-01-17 Thread Dave Kleikamp
On Thu, 2007-01-18 at 10:18 +1100, Jens Axboe wrote:

 Can you try io_schedule() and verify that things just work?

I actually did do that in the first place, but wondered if it was the
right thing to introduce the accounting changes that came with that.
I'll change it back to io_schedule() and test it again, just to make
sure.

If that's the right fix, I can push it directly since it won't have any
dependencies on your patches.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH: 2.6.20-rc4-mm1] JFS: Avoid deadlock introduced by explicit I/O plugging

2007-01-17 Thread Dave Kleikamp
On Thu, 2007-01-18 at 10:46 +1100, Jens Axboe wrote:
 On Wed, Jan 17 2007, Dave Kleikamp wrote:
  On Thu, 2007-01-18 at 10:18 +1100, Jens Axboe wrote:
 
   Can you try io_schedule() and verify that things just work?
 
  I actually did do that in the first place, but wondered if it was the
  right thing to introduce the accounting changes that came with that.
  I'll change it back to io_schedule() and test it again, just to make
  sure.
 
 It appears to be the correct change to me - you really are waiting for
 IO resources (otherwise it would not hang with the plug change), so
 doing an inc/dec of iowait around the schedule should be done.

Okay, here it is.

  If that's the right fix, I can push it directly since it won't have any
  dependencies on your patches.
 
 Perfect!

It should make the next -mm.

JFS: call io_schedule() instead of schedule() to avoid deadlock

The introduction of Jens Axboe's explicit i/o plugging patches introduced a
deadlock in jfs.  This was caused by the process initiating I/O not
unplugging the queue before waiting on the commit thread.  The commit
thread itself was waiting for that I/O to complete.  Calling io_schedule()
rather than schedule() unplugs the I/O queue avoiding the deadlock, and it
appears to be the right function to call in any case.

Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]

---
commit 4aa0d230c2cfc1ac4bcf7c5466f9943cf14233a9
tree b873dce6146f4880c6c48ab53c0079566f52a60b
parent 82d5b9a7c63054a9a2cd838ffd177697f86e7e34
author Dave Kleikamp [EMAIL PROTECTED] Wed, 17 Jan 2007 21:18:35 -0600
committer Dave Kleikamp [EMAIL PROTECTED] Wed, 17 Jan 2007 21:18:35 -0600

 fs/jfs/jfs_lock.h |2 +-
 fs/jfs/jfs_metapage.c |2 +-
 fs/jfs/jfs_txnmgr.c   |2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/jfs/jfs_lock.h b/fs/jfs/jfs_lock.h
index 7d78e83..df48ece 100644
--- a/fs/jfs/jfs_lock.h
+++ b/fs/jfs/jfs_lock.h
@@ -42,7 +42,7 @@ do {  \
if (cond)   \
break;  \
unlock_cmd; \
-   schedule(); \
+   io_schedule();  \
lock_cmd;   \
}   \
current-state = TASK_RUNNING;  \
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index ceaf03b..58deae0 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -56,7 +56,7 @@ static inline void __lock_metapage(struct metapage *mp)
set_current_state(TASK_UNINTERRUPTIBLE);
if (metapage_locked(mp)) {
unlock_page(mp-page);
-   schedule();
+   io_schedule();
lock_page(mp-page);
}
} while (trylock_metapage(mp));
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index d558e51..6988a10 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -135,7 +135,7 @@ static inline void TXN_SLEEP_DROP_LOCK(wait_queue_head_t * 
event)
add_wait_queue(event, wait);
set_current_state(TASK_UNINTERRUPTIBLE);
TXN_UNLOCK();
-   schedule();
+   io_schedule();
current-state = TASK_RUNNING;
remove_wait_queue(event, wait);
 }

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via pointer conversion)

2006-11-17 Thread Dave Kleikamp
On Fri, 2006-11-17 at 10:06 -0500, Jeff Layton wrote:
 On Fri, 2006-11-17 at 09:01 -0600, Dave Kleikamp wrote:
 
  Wouldn't you only be able to only crack a few of the low-order bits due
  to a cluster of inodes being sequential?  I don't think you'd be able
  crack enough of it to be useful.  You may be able to determine where
  some inodes are relative to others, but I don't think you'd be able to
  point the their location in memory.  I don't know anything about crypto,
  so I could be wrong.
  
 
 On a 64-bit kernel, that would be the case. On a 32-bit kernel, there
 are no high order bits to chop off, so this would effectively give you
 the address.

By a few of the low-order bits, I mean very few, not 32.  The slab
allocator generally allocates one page at a time, so you typically don't
get more than about 6 inodes in a slab page.  Even if you consider that
you may be able allocate several slab pages together, it would be hard
to get very many contiguous pages of inode slab cache.  Even if it were
possible to force the system to allocate around 256 contiguous inodes,
you would still only be able to determine 8 bits out of 32.  At most one
or two more if you could determine a pattern of inode numbers that would
be skipped due to the inclusion of struct inode within the fs-dependent
inode structure.  I may be wrong, but I doubt that someone could
determine the entire mask from the observed i_ino's.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mark_inode_dirty vs mark_inode_dirty_sync

2005-09-02 Thread Dave Kleikamp
On Fri, 2005-09-02 at 11:55 +0200, David Sanchez wrote:
 Hi,
 Please, could somebody explain me what the mark_inode_dirty* functions
 do and what is the difference between mark_inode_dirty and
 mark_inode_dirty_sync ?

They put the inode on the superblock's dirty list and make the inode as
dirty in the i_state field.  This makes sure that the inode will
eventually be written to disk.

mark_inode_dirty_sync only sets the I_DIRTY_SYNC flag, which does not
imply that any file data was changed.  It is called when a minor change
is made to an inode, such as a timestamp is changed.  Some sync
operations will only write the inode if data was written, so can avoid
writing the an inode that is only dirtied by I_DIRTY_SYNC.

mark_inode_dirty sets I_DIRTY which is I_DIRTY_SYNC | I_DIRTY_DATASYNC |
I_DIRTY_PAGES.  This indicates that the in-memory inode has changes to
the data that have not yet been written to disk.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mirror a file system on the fly

2005-08-18 Thread Dave Kleikamp
On Fri, 2005-08-19 at 01:10 +0530, Dave Schwartz wrote:
 Hi list,
  
 Not too sure if this is the right forum to ask this question but since
 my requirement is around linux filesystems, I shall take this liberty
 to post my question.
 
 My requirement is to develop a kernel/user space module to add an
 extension to the shell program environment such that this shell forks
 a mirror look-alike filesystem of the underlying OS to the programs
 run in that particular shell.
 
  
 Was trying to look thru the FAQ and a few list archives to look for
 ideas around my requirement. The archives were overwhelming.
  
 
 Any ideas/pointers will be a great help,

The chroot command and mount --bind should give you a good start.

 Gracias,
 decebel
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH 0/2] JFS atomic inode security labeling

2005-07-14 Thread Dave Kleikamp
Here's the jfs implementation.  The first patch fixes the transaction
layering in the xattr code to allow xattrs to be added to an inode as a
part of an existing transaction, and correctly calls jfs_init_acl()
within the same transaction that creates a file/directory/device.

The second patch actually implements jfs_init_security() similarly to
the ext2/3 patches.

The first patch was lightly tested, and the second was compile tested
only.
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH 1/2] JFS atomic xattr/acl handling

2005-07-14 Thread Dave Kleikamp
__jfs_setxattr should allow extended attributes to be set within an
existing transaction

diff -urp linux-2.6.13-rc3/fs/jfs/acl.c linux-1/fs/jfs/acl.c
--- linux-2.6.13-rc3/fs/jfs/acl.c   2005-07-13 10:03:40.0 -0500
+++ linux-1/fs/jfs/acl.c2005-07-14 09:19:50.0 -0500
@@ -23,6 +23,7 @@
 #include linux/quotaops.h
 #include linux/posix_acl_xattr.h
 #include jfs_incore.h
+#include jfs_txnmgr.h
 #include jfs_xattr.h
 #include jfs_acl.h
 
@@ -75,7 +76,8 @@ static struct posix_acl *jfs_get_acl(str
return acl;
 }
 
-static int jfs_set_acl(struct inode *inode, int type, struct posix_acl *acl)
+static int jfs_set_acl(tid_t tid, struct inode *inode, int type,
+  struct posix_acl *acl)
 {
char *ea_name;
struct jfs_inode_info *ji = JFS_IP(inode);
@@ -110,7 +112,7 @@ static int jfs_set_acl(struct inode *ino
if (rc  0)
goto out;
}
-   rc = __jfs_setxattr(inode, ea_name, value, size, 0);
+   rc = __jfs_setxattr(tid, inode, ea_name, value, size, 0);
 out:
kfree(value);
 
@@ -143,7 +145,7 @@ int jfs_permission(struct inode *inode, 
return generic_permission(inode, mask, jfs_check_acl);
 }
 
-int jfs_init_acl(struct inode *inode, struct inode *dir)
+int jfs_init_acl(tid_t tid, struct inode *inode, struct inode *dir)
 {
struct posix_acl *acl = NULL;
struct posix_acl *clone;
@@ -159,7 +161,7 @@ int jfs_init_acl(struct inode *inode, st
 
if (acl) {
if (S_ISDIR(inode-i_mode)) {
-   rc = jfs_set_acl(inode, ACL_TYPE_DEFAULT, acl);
+   rc = jfs_set_acl(tid, inode, ACL_TYPE_DEFAULT, acl);
if (rc)
goto cleanup;
}
@@ -173,7 +175,8 @@ int jfs_init_acl(struct inode *inode, st
if (rc = 0) {
inode-i_mode = mode;
if (rc  0)
-   rc = jfs_set_acl(inode, ACL_TYPE_ACCESS, clone);
+   rc = jfs_set_acl(tid, inode, ACL_TYPE_ACCESS,
+clone);
}
posix_acl_release(clone);
 cleanup:
@@ -202,8 +205,15 @@ static int jfs_acl_chmod(struct inode *i
return -ENOMEM;
 
rc = posix_acl_chmod_masq(clone, inode-i_mode);
-   if (!rc)
-   rc = jfs_set_acl(inode, ACL_TYPE_ACCESS, clone);
+   if (!rc) {
+   tid_t tid = txBegin(inode-i_sb, 0);
+   down(JFS_IP(inode)-commit_sem);
+   rc = jfs_set_acl(tid, inode, ACL_TYPE_ACCESS, clone);
+   if (!rc)
+   rc = txCommit(tid, 1, inode, 0);
+   txEnd(tid);
+   up(JFS_IP(inode)-commit_sem);
+   }
 
posix_acl_release(clone);
return rc;
diff -urp linux-2.6.13-rc3/fs/jfs/jfs_acl.h linux-1/fs/jfs/jfs_acl.h
--- linux-2.6.13-rc3/fs/jfs/jfs_acl.h   2005-07-13 10:03:40.0 -0500
+++ linux-1/fs/jfs/jfs_acl.h2005-07-14 08:45:01.0 -0500
@@ -21,8 +21,16 @@
 #ifdef CONFIG_JFS_POSIX_ACL
 
 int jfs_permission(struct inode *, int, struct nameidata *);
-int jfs_init_acl(struct inode *, struct inode *);
+int jfs_init_acl(tid_t, struct inode *, struct inode *);
 int jfs_setattr(struct dentry *, struct iattr *);
 
-#endif /* CONFIG_JFS_POSIX_ACL */
+#else
+
+static inline int jfs_init_acl(tid_t tid, struct inode *inode,
+  struct inode *dir)
+{
+   return 0;
+}
+
+#endif
 #endif /* _H_JFS_ACL */
diff -urp linux-2.6.13-rc3/fs/jfs/jfs_xattr.h linux-1/fs/jfs/jfs_xattr.h
--- linux-2.6.13-rc3/fs/jfs/jfs_xattr.h 2005-06-17 14:48:29.0 -0500
+++ linux-1/fs/jfs/jfs_xattr.h  2005-07-13 16:59:04.0 -0500
@@ -52,8 +52,8 @@ struct jfs_ea_list {
 #defineEND_EALIST(ealist) \
((struct jfs_ea *) (((char *) (ealist)) + EALIST_SIZE(ealist)))
 
-extern int __jfs_setxattr(struct inode *, const char *, const void *, size_t,
- int);
+extern int __jfs_setxattr(tid_t, struct inode *, const char *, const void *,
+ size_t, int);
 extern int jfs_setxattr(struct dentry *, const char *, const void *, size_t,
int);
 extern ssize_t __jfs_getxattr(struct inode *, const char *, void *, size_t);
diff -urp linux-2.6.13-rc3/fs/jfs/namei.c linux-1/fs/jfs/namei.c
--- linux-2.6.13-rc3/fs/jfs/namei.c 2005-07-13 10:03:40.0 -0500
+++ linux-1/fs/jfs/namei.c  2005-07-14 09:41:21.0 -0500
@@ -39,6 +39,24 @@ struct dentry_operations jfs_ci_dentry_o
 static s64 commitZeroLink(tid_t, struct inode *);
 
 /*
+ * NAME:   free_ea_wmap(inode)
+ *
+ * FUNCTION:   free uncommitted extended attributes from working map 
+ *
+ */
+static inline void free_ea_wmap(struct inode *inode)
+{
+   dxd_t *ea = JFS_IP(inode)-ea;
+
+   if (ea-flag  DXD_EXTENT) {
+   

[RFC][PATCH 2/2] JFS atomic inode security labeling

2005-07-14 Thread Dave Kleikamp
Implement jfs_init_security()

diff -Nurp linux-1/fs/jfs/jfs_xattr.h linux/fs/jfs/jfs_xattr.h
--- linux-1/fs/jfs/jfs_xattr.h  2005-07-13 16:59:04.0 -0500
+++ linux/fs/jfs/jfs_xattr.h2005-07-14 10:27:23.0 -0500
@@ -61,4 +61,14 @@ extern ssize_t jfs_getxattr(struct dentr
 extern ssize_t jfs_listxattr(struct dentry *, char *, size_t);
 extern int jfs_removexattr(struct dentry *, const char *);
 
+#ifdef CONFIG_JFS_SECURITY
+extern int jfs_init_security(tid_t, struct inode *, struct inode *);
+#else
+static inline int jfs_init_security(tid_t tid, struct inode *inode,
+   struct inode *dir)
+{
+   return 0;
+}
+#endif
+
 #endif /* H_JFS_XATTR */
diff -Nurp linux-1/fs/jfs/namei.c linux/fs/jfs/namei.c
--- linux-1/fs/jfs/namei.c  2005-07-14 09:41:21.0 -0500
+++ linux/fs/jfs/namei.c2005-07-14 10:18:21.0 -0500
@@ -111,6 +111,12 @@ static int jfs_create(struct inode *dip,
if (rc)
goto out3;
 
+   rc = jfs_init_security(tid, ip, dip);
+   if (rc) {
+   txAbort(tid, 0);
+   goto out3;
+   }
+
if ((rc = dtSearch(dip, dname, ino, btstack, JFS_CREATE))) {
jfs_err(jfs_create: dtSearch returned %d, rc);
txAbort(tid, 0);
@@ -239,6 +245,12 @@ static int jfs_mkdir(struct inode *dip, 
if (rc)
goto out3;
 
+   rc = jfs_init_security(tid, ip, dip);
+   if (rc) {
+   txAbort(tid, 0);
+   goto out3;
+   }
+
if ((rc = dtSearch(dip, dname, ino, btstack, JFS_CREATE))) {
jfs_err(jfs_mkdir: dtSearch returned %d, rc);
txAbort(tid, 0);
@@ -906,6 +918,10 @@ static int jfs_symlink(struct inode *dip
down(JFS_IP(dip)-commit_sem);
down(JFS_IP(ip)-commit_sem);
 
+   rc = jfs_init_security(tid, ip, dip);
+   if (rc)
+   goto out3;
+
tblk = tid_to_tblock(tid);
tblk-xflag |= COMMIT_CREATE;
tblk-ino = ip-i_ino;
@@ -1349,6 +1365,12 @@ static int jfs_mknod(struct inode *dir, 
if (rc)
goto out3;
 
+   rc = jfs_init_security(tid, ip, dir);
+   if (rc) {
+   txAbort(tid, 0);
+   goto out3;
+   }
+
if ((rc = dtSearch(dir, dname, ino, btstack, JFS_CREATE))) {
txAbort(tid, 0);
goto out3;
diff -Nurp linux-1/fs/jfs/xattr.c linux/fs/jfs/xattr.c
--- linux-1/fs/jfs/xattr.c  2005-07-13 16:59:58.0 -0500
+++ linux/fs/jfs/xattr.c2005-07-14 10:54:11.0 -0500
@@ -21,6 +21,7 @@
 #include linux/xattr.h
 #include linux/posix_acl_xattr.h
 #include linux/quotaops.h
+#include linux/security.h
 #include jfs_incore.h
 #include jfs_superblock.h
 #include jfs_dmap.h
@@ -1148,3 +1149,38 @@ int jfs_removexattr(struct dentry *dentr
 
return rc;
 }
+
+#ifdef CONFIG_JFS_SECURITY
+int jfs_init_security(tid_t tid, struct inode *inode, struct inode *dir)
+{
+   int rc;
+   size_t len;
+   void *value;
+   char *suffix;
+   char *name;
+
+   rc = security_inode_init_security(inode, dir, suffix, value, len);
+   if (rc) {
+   if (rc == -EOPNOTSUPP)
+   return 0;
+   return rc;
+   }
+   name = kmalloc(XATTR_SECURITY_PREFIX_LEN + 1 + strlen(suffix),
+  GFP_NOFS);
+   if (!name) {
+   rc = -ENOMEM;
+   goto kmalloc_failed;
+   }
+   strcpy(name, XATTR_SECURITY_PREFIX);
+   strcpy(name + XATTR_SECURITY_PREFIX_LEN, suffix);
+
+   rc = __jfs_setxattr(tid, inode, name, value, len, 0);
+
+   kfree(name);
+kmalloc_failed:
+   kfree(suffix);
+   kfree(value);
+
+   return rc;
+}
+#endif



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: files of size larger than fs size

2005-03-17 Thread Dave Kleikamp
On Wed, 2005-03-16 at 20:52 -0800, Max wrote:
 Bryan,
 
 I'm not experienced in filesystems. But I've derived the testcase
 program from some buggy application that occasionally created huge
 files on my fs. That was not so easy to reproduce since not every
 sequence of seeks/reads/writes results in a huge file. But finally I
 got it 100% reproducible.
 
 So I would appriciate if somebody make any good from that testcase
 code.

man lseek tells you to test the return code against (off_t)-1.  I found
that for larger values, your test program is returning -1, but unsigned
it appears as 18446744073709551615.

 
 Different filesizes look strange to me. What I have so far:
 
 JFS:281474976706576  =  2**48 - 4080
 XFS:  72057594037923856  =  2**56 - 4080
 EXT3: 1099511627784  =  2**40 + 8

I can speak for jfs, and it supports file sizes up to 2**52.  It uses 40
bits to address the logical block offset within a file with a 4K (2**12)
size block.  The individual file systems set sb-s_maxbytes to the
appropriate value.

 Thanks,
 Max
 
 P.S. direct link to the testcase program: 
 http://bugzilla.kernel.org/attachment.cgi?id=4729

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: files of size larger than fs size

2005-03-17 Thread Dave Kleikamp
On Thu, 2005-03-17 at 14:51 -0600, Dave Kleikamp wrote:
 On Thu, 2005-03-17 at 12:06 -0800, Bryan Henderson wrote:
  I found
  that for larger values, your test program is returning -1, but unsigned
  it appears as 18446744073709551615.
  
  You mean you ran it?  Then what about the more interesting question of 
  what your filesize ends up to be?  You say JFS allows files up to 2**52 
  bytes, so I expect the test case would succeed up through the write at 
  2**48 and leave the filesize 2**48 + 8.  But Max reports seeing 2**48 - 
  4080.
 
 Yeah, this is funny.  I missed that part of the story.  I'll try to
 figure out what is going on here.

The problem appears to be mixing calls to lseek64 with calls to fread
and fwrite.  fread  fwrite don't appear to honor the position set by
lseek64.  I assume replacing fread  fwrite with read  write will fix
it.  I haven't tried it yet, but I'm about to.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: files of size larger than fs size

2005-03-17 Thread Dave Kleikamp
With this change, the file size on jfs becomes 2^48 + 4 as expected.

--- jfs_bug.c.orig  2005-03-17 14:18:48.229634648 -0600
+++ jfs_bug.c   2005-03-17 15:32:45.952750104 -0600
@@ -13,12 +13,14 @@ int data = 0;
 struct flock fl;
 
 void read1() {
-size_t rc = fread(data,sizeof(data),1,f);
+/* size_t rc = fread(data,sizeof(data),1,f); */
+size_t rc = read(fn, data, sizeof(data));
 printf(read() rc = %llu\n,rc);
 }
 
 void write1() {
-size_t rc = fwrite(data,sizeof(data),1,f);
+/* size_t rc = fwrite(data,sizeof(data),1,f); */
+size_t rc = write(fn, data, sizeof(data));
 printf(write() rc = %llu\n,rc);
 }
 

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: files of size larger than fs size

2005-03-17 Thread Dave Kleikamp
On Thu, 2005-03-17 at 13:48 -0800, Max wrote:
 Dave,
 
 Shouldn't fread(data,sizeof(data),1,f) and read(fn, data, 
 sizeof(data)) produce identical results?
 Is it a bug or what?

fseek(f, offset, SEEK_SET);
fread(data, sizeof(data, 1, f);

should produce identical results to

lseek(fn, offset, SEEK_SET);
read(fn, data, sizeof(data);

However

lseek(fn, offset, SEEK_SET);
fread(data, sizeof(data, 1, f);

isn't valid.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jfs, special characters

2005-02-27 Thread Dave Kleikamp
On Sun, 2005-02-27 at 15:00 -0800, Frederik Eaton wrote:

 Thanks, you were right that I recently upgraded to 2.6 (actually I
 transferred the hard drive between machines), and with a mount /home -o
 remount,iocharset=utf8 at least I can access the files now. But the
 names aren't what they should be, for instance I see:
 
 04-The_Dark_Of_The_Matine.ogg
 
 (that's a theta). Were they supposed to be correct? 

Not necessarily.  Even if the character returned is correct utf8, the
terminal (xterm?) displaying the name may be assuming a different
character set.  It's also possible that on the older system,
CONFIG_NLS_DEFAULT was not set to the same locale that the system used,
so the character stored in the name was not the right unicode, but it
was accessed consistently on the old system so it worked as long as
nothing changed.

 As an aside, now, I just tried mount /home -o
 remount,iocharset=iso8859-1, and got same problem

This is no surprise.  iso8859-1 has the same mapping as iocharset=none
(the 2.6 default).  (The none keyword is first supported in
2.6.11-rc4-mm1 and 2.4.30-pre2, so you probably won't be able to use
it.)

 ls: 
 '/home/shared/frederik/Franz_Ferdinand/Franz_Ferdinand/04-The_Dark_Of_The_Matin?e.ogg':
  No such file or directory
 
 But when I repeated the mount /home -o remount,iocharset=utf8
 command, the listing didn't show the name with a theta character, but
 instead with two question marks:
 
 04-The_Dark_Of_The_Matin??e.ogg
 
 Strange behavior. I.e., upon changing the mount options and then
 changing them back I get a different listing.

That I wouldn't expect.  I really have no clue why it isn't consistent.

The easiest way to fix these files (assuming there aren't too many) is
to mount with iocharset=utf8, rename the files using only ascii
characters, remount again with iocharset=iso8859-1, and rename the files
to their proper names.  After that you should be able to mount without
any iocharset flag and get sane behavior.

 Frederik
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jfs, special characters

2005-02-26 Thread Dave Kleikamp
On Fri, 2005-02-25 at 20:34 -0800, Frederik Eaton wrote:
 Does JFS have problems with files with special characters in them?

Sometimes.  jfs attempts to store pathnames in 16-bit unicode, and uses
the mount option iocharset to determine what to convert to/from.

In the 2.4 kernel, the default value of iocharset is determined by the
kernel config option, CONFIG_NLS_DEFAULT.  If an existing pathname
exists that doesn't map to the iocharset, jfs has the problem you
describe.  Mounting with -o iocharset=utf8 should let you access any
existing files.

In the 2.6 kernel, the default was changed to store each byte of the
filename as a 16-bit value in the directory without using any
conversion.  There can still be a problem if an existing pathname has a
value with a non-zero high-order byte.  Again mounting with -o
iocharset=utf8 will let you access all files.

 $ ls
 ls: 04-The_Dark_Of_The_Matin?e.ogg: No such file or directory
 [1]$ rm 04-The_Dark_Of_The_Matin\?e.ogg
 rm: cannot lstat `04-The_Dark_Of_The_Matin?e.ogg': No such file or directory
 
 Frederik

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SV: Ext3 destroying ownerships and permissions

2001-05-09 Thread Dave Kleikamp

This is a WAG, but:

1. Did you build ext3 as a module?
2. If so, are you running on a kernel that was built before applying the 
ext3 patch and configuring?

If the answers are yes, the inode cache may be too small to contain the 
ext3 in-memory inode, and the ext3 code is overwriting the inode by a 
few bytes.



-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: 64-bit block sizes on 32-bit systems

2001-03-28 Thread Dave Kleikamp

My turn to chime in.

JFS was designed around a 4K meta-data page size.  It would require some
major re-design to use larger block sizes.  On the other hand, JFS could
take advantage of 64-bit block addresses immediately.  JFS internally
store the block address in 40 bits.  (Sorry, file size  volume size are
both limited to 4 peta-bytes on JFS.)

At the rate that storage hardware and requirements are increasing,
increasing the block size is a short-term solution that is only going to
delay the inevitable requirement for 64-bit block addressability.  There
is a practical limit to a usable block-size.  Someone threw out 64K,
which seems reasonable to me.
-- 
David Kleikamp
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFC] sane access to per-fs metadata (was Re:[PATCH]Documentation/ioctl-number.txt)

2001-03-23 Thread Dave Kleikamp

Alexander Viro wrote:
 
 On Fri, 23 Mar 2001, Dave Kleikamp wrote:
 
  Are you okay with JFS using a few ioctl's in the utilities and
  continuing the discussion of whether more general-purpose metadata
  access should occur in procfs or under your proposal?
 
 OK, one immediate problem with ioctls on directories: unions.
 Think what happens when you union-mount JFS somewhere. And
 mountpoint also is on JFS. Where should ioctl() go?

AFAIC, I wouldn't object to calling this a limitation and letting the
ioctl fail.  In fact, the utilities could detect a union-mount and fail
immediately.  Although, someone, someday, will ask for it to work, won't
they?

 Other that that (and general ugliness of ioctls) - no problems. I
 really think that trick I've described would be cleaner, but that's
 a separate story. It's not like we had a moratorium on new ioctls,
 after all and JFS wouldn't be the first fs to do something like that.
 I don't think that it's a good idea, but the worst thing that can
 happen is inconvenience for union-mount setups. When union-mount
 gets to testable stage, that is - it's not like it was going into the
 tree before 2.5, anyway.
 
 procfs use for per-filesystem stuff is a different issue - that is just
 asking for bad breakage.

I won't argue this point without learning more about it.  So far I've
only played with procfs to peek at some global data.

-- 
David Kleikamp
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFC] sane access to per-fs metadata (was Re: [PATCH]Documentation/ioctl-number.txt)

2001-03-23 Thread Dave Kleikamp

Al,
I didn't know that creating file system ioctl's was such a hot topic. 
Since the functions I want to implement are for a very specific purpose
(I don't expect anything except the JFS utilities to invoke them), I
would expect an ioctl to be an appropriate vehicle.

If not ioctl's, why not procfs?  Your proposal is that each filesystem
implements its own mini-procfs.  What's the advantage of that?

My intentions so far are to use ioctl's for specific operations required
by the JFS utilities, and procfs for debugging output, tuning, and
anything else that make sense.

Alexander Viro wrote:
 That may look like an overkill, however
 * You can get rid of any need to register ioctls, etc.

This is a one-time thing

 * You can add debugging/whatever at any moment with no need to
   update any utilities - everything is available from plain shell

We can do this with procfs right now.

 * You can conveniently view whatever metadata you want - no need to
   shove everything into ioctls on one object.

Again, why re-invent procfs?  We could put this under
/proc/fs/jfs/metadata.

 * You can use normal permissions control - just set appropriate
   permission bits for objects on jfsmeta
 
 IOW, you can get normal filesystem view (meaning that you have all usual
 UNIX toolkit available) for per-fs control stuff. And keep the ability to
 do proper locking - it's the same driver that handles the main fs and you
 have access to superblock. No need to change the API - everything is already
 there...

I'm not sure how a utility would make atomic changes to several pieces
of metadata.  The underlying fs code would protect the integrity of
every metadata "file", but changes to more than one of these "files"
would not be done as a group without some additional locking that would
have to be coordinated between the utility and the fs.  This kind of
thing could be handled by writing some special command to a
"command-processor" type file, but why is this better than an ioctl?


-- 
David Kleikamp
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: (struct dentry *)-vfsmnt;

2001-03-14 Thread Dave Kleikamp

AIX stores all of this information in the LVM, not in the filesystem. 
The filesystem itself has nothing to do with importing and exporting
volume groups.  Having the information stored as part of LVM's metadata
allows the utilities to only deal with LVM instead of every individual
file system.

Andreas Dilger wrote:
 
 Al writes:
  On Tue, 13 Mar 2001, Andreas Dilger wrote:
 
   On AIX, it is possible to import a volume group, and it automatically
   builds /etc/fstab entries from information stored in the fs.  Having the
   "last mounted on" would have the mount point info, and of course LVM
   would hold the device names.
 
  Wait a minute. What happens if you bring /home from one box to another,
  that already has /home? Corrupted /etc/fstab?

 For the same reason that the UUID and LABEL are stored in the superblock:
 you want this infomation kept with the filesystem and not anywhere else,
 otherwise it will quickly get out-of-date.  Wherever you mounted the
 filesystem last is where it would be mounted if you import the VG on
 another system.  You can obviously edit /etc/fstab afterwards if it is
 wrong, and then remount the filesystem(s), and this will store the
 correct mountpoint into the filesystem for the next vgimport.

-- 
David J. Kleikamp
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]