Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)

2008-02-18 Thread Miklos Szeredi
> > > > However David and Christoph are beavering away on the r-o-bind-mounts
> > > > patches and I expect that there will be overlaps with unprivileged 
> > > > mounts.
> > > > 
> > > > Could we coordinate things a bit please?  Decide who goes first, review
> > > > and maybe even test each others work, etc?
> > > 
> > > Al is setting up a git tree for VFS work.  per-mount r/o will go in
> > > as one of the first things, as well as his rework of the path lookup
> > > logic to fix the intents mess.
> > > 
> > 
> > That didn't answer my question..
> 
> Well, Al as the de facto VFS maintainer will decide on the ordering.

I think we agreed that r-o-bind mounts are more important, so they
should go first.  They have also received more attention.  OTOH there
isn't really any fundamental conflict between the two patchsets, so
going in together (if the ro-bind patches miss 2.6.25) should also be
possible.

> Reviewing this stuff properly is still on my todo list, but currently
> I'm busy with more important things.

So what should I do?

Would Al be wanting to merge this into his VFS tree?  (Can't find it
on git.kernel.org yet, BTW.)  I can set up a git tree for these
patches if that makes things easier.

Or should I just wait and resubmit after every kernel release, hoping
that it becomes _the_ most important thing on Christoph's list ;)

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


very poor ext3 write performance on big filesystems?

2008-02-18 Thread Tomasz Chmielewski

I have a 1.2 TB (of which 750 GB is used) filesystem which holds
almost 200 million files.
1.2 TB doesn't make this filesystem that big, but 200 million files
is a substantial number.



Most of the files are hardlinked multiple times, some of them are
hardlinked thousands of times.


Recently I began removing some of the unneeded files (or hardlinks) and,
to my surprise, it is taking longer than I initially expected.



After the cache is emptied (echo 3 > /proc/sys/vm/drop_caches) I can usually
remove about 5-20 files with moderate performance. I see up to
5000 kB read/write from/to the disk, and the wa (iowait) reported by top is
usually 20-70%.



After that, waiting for IO grows to 99%, and disk write speed is down to 
50 kB/s - 200 kB/s (fifty - two hundred kilobytes/s).



Is it normal for the write speed to go down to only a few dozen
kilobytes/s? Is it because of that many seeks? Can it be optimized
somehow? The machine has loads of free memory; perhaps it could be
used better?



Also, writing big files is very slow - it takes more than 4 minutes to 
write and sync a 655 MB file (so, a little bit more than 1 MB/s) - 
fragmentation perhaps?


+ dd if=/dev/zero of=testfile bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 3,12109 seconds, 210 MB/s
+ sync
0.00user 2.14system 4:06.76elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+883minor)pagefaults 0swaps


# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda        1,2T  697G  452G  61% /mnt/iscsi_backup

# df -i
Filesystem      Inodes   IUsed   IFree IUse% Mounted on
/dev/sda          154M     20M    134M   13% /mnt/iscsi_backup




--
Tomasz Chmielewski
http://wpkg.org



Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Andi Kleen
Tomasz Chmielewski <[EMAIL PROTECTED]> writes:
>
> Is it normal to expect the write speed go down to only few dozens of
> kilobytes/s? Is it because of that many seeks? Can it be somehow
> optimized? 

I have similar problems on my linux source partition which also
has a lot of hard linked files (although probably not quite
as many as you do). It seems like hard linking prevents
some of the heuristics ext* uses to generate non fragmented
disk layouts and the resulting seeking makes things slow.

What has helped a bit was to recreate the file system with -O^dir_index;
dir_index seems to cause more seeks.

Keeping enough free space is also a good idea because that
allows the file system code better choices on where to place data.

-Andi



Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 03:03:44PM +0100, Andi Kleen wrote:
> Tomasz Chmielewski <[EMAIL PROTECTED]> writes:
> >
> > Is it normal to expect the write speed go down to only few dozens of
> > kilobytes/s? Is it because of that many seeks? Can it be somehow
> > optimized? 
> 
> I have similar problems on my linux source partition which also
> has a lot of hard linked files (although probably not quite
> as many as you do). It seems like hard linking prevents
> some of the heuristics ext* uses to generate non fragmented
> disk layouts and the resulting seeking makes things slow.

ext3 tries to keep inodes in the same block group as their containing
directory.  If you have lots of hard links, obviously it can't really
do that, especially since we don't have a good way at mkdir time to
tell the filesystem, "Psst!  This is going to be a hard link clone of
that directory over there, put it in the same block group".

> What has helped a bit was to recreate the file system with -O^dir_index
> dir_index seems to cause more seeks.

Part of it may have simply been recreating the filesystem, not
necessarily removing the dir_index feature.  Dir_index speeds up
individual lookups, but it slows down workloads that do a readdir
followed by a stat of all of the files in the directory, because
readdir then returns entries in hash order rather than inode order.
You can work around this by calling readdir(), sorting all of the
entries by inode number, and then calling open or stat or whatever.
So this can help out for workloads that are doing find or rm -r on a
dir_index filesystem.  Basically, dir_index helps for some things,
hurts for others.  Once things are in the cache it doesn't matter, of
course.
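
As a rough illustration of the workaround described above (read the
whole directory, sort by inode number, then stat), here is a minimal
standalone sketch.  It is not the spd_readdir code; the names
struct entry, entry_cmp and stat_dir_sorted are made up for this
example, and it assumes a POSIX system where d_ino is filled in.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* One cached directory entry: the inode number and a copy of the name. */
struct entry {
	ino_t ino;
	char *name;
};

/* Order entries by inode number; compare instead of subtracting so
 * that wide inode numbers cannot overflow or truncate the result. */
static int entry_cmp(const void *a, const void *b)
{
	const struct entry *ea = a, *eb = b;

	return (ea->ino > eb->ino) - (ea->ino < eb->ino);
}

/*
 * Read every entry of `path`, sort the entries by inode number, and
 * only then lstat() them.  Returns the number of entries that could
 * be stat'ed, or -1 if the directory cannot be read.
 */
static long stat_dir_sorted(const char *path)
{
	DIR *dir;
	struct dirent *de;
	struct entry *list = NULL;
	size_t num = 0, max = 0, i;
	long ok = 0;

	assert(path != NULL);
	dir = opendir(path);
	if (!dir)
		return -1;

	/* Pass 1: slurp the directory in whatever order readdir gives. */
	while ((de = readdir(dir)) != NULL) {
		if (num == max) {
			size_t new_max = max ? max * 2 : 64;
			struct entry *tmp = realloc(list, new_max * sizeof(*list));

			if (!tmp) {
				ok = -1;
				break;
			}
			list = tmp;
			max = new_max;
		}
		list[num].name = strdup(de->d_name);
		if (!list[num].name) {
			ok = -1;
			break;
		}
		list[num].ino = de->d_ino;
		num++;
	}
	closedir(dir);

	/* Pass 2: visit the inodes in ascending order to cut down seeks. */
	if (ok == 0 && num > 0) {
		qsort(list, num, sizeof(*list), entry_cmp);
		for (i = 0; i < num; i++) {
			char full[4096];
			struct stat st;
			int len = snprintf(full, sizeof(full), "%s/%s",
					   path, list[i].name);

			if (len > 0 && (size_t)len < sizeof(full) &&
			    lstat(full, &st) == 0)
				ok++;
		}
	}

	for (i = 0; i < num; i++)
		free(list[i].name);
	free(list);
	return ok;
}
```

Calling stat_dir_sorted("/some/dir") from a small main() is enough to
compare against a plain readdir-then-stat loop on a cold cache.  The
cost is holding the whole directory in memory, which is the same
trade-off the LD_PRELOAD approach makes.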

The LD_PRELOAD library below can help in some cases.  Mutt has this hack
built in for maildir directories, which helps.

> Also keeping enough free space is also a good idea because that
> allows the file system code better choices on where to place data.

Yep, that too.

- Ted

/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -fPIC -shared spd_readdir.c -ldl
 *
 * Use it by setting the LD_PRELOAD environment variable:
 * 
 * export LD_PRELOAD=/usr/local/sbin/spd_readdir.so
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 * 
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

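#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <errno.h>
#include <dlfcn.h>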

struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	fd;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;
#ifdef DEBUG
static int do_debug = 0;
#endif

static void setup_ptr()
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG"))
		do_debug++;
#endif
}

static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i=0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
}	

static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;
	
	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

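	/* Compare rather than subtract: the difference of two ino_t values
	 * can wrap or be truncated when converted to int. */
	return (i_a > i_b) - (i_a < i_b);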
}


DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct dirent_s *ds, *dnew;
	struct dirent64 *d;
	struct stat st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct = malloc(sizeof(struct dir_s)

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Andi Kleen
On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> ext3 tries to keep inodes in the same block group as their containing
> directory.  If you have lots of hard links, obviously it can't really
> do that, especially since we don't have a good way at mkdir time to
> tell the filesystem, "Psst!  This is going to be a hard link clone of
> that directory over there, put it in the same block group".

Hmm, you think such a hint interface would be worth it?

> 
> > What has helped a bit was to recreate the file system with -O^dir_index
> > dir_index seems to cause more seeks.
> 
> Part of it may have simply been recreating the filesystem, not

Undoubtedly.

> necessarily removing the dir_index feature.  Dir_index speeds up
> individual lookups, but it slows down workloads that do a readdir

But only for large directories right? For kernel source like
directory sizes it seems to be a general loss.

-Andi



Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Tomasz Chmielewski

Theodore Tso wrote:

(...)

>> What has helped a bit was to recreate the file system with -O^dir_index
>> dir_index seems to cause more seeks.
>
> Part of it may have simply been recreating the filesystem, not
> necessarily removing the dir_index feature.


You mean, copy data somewhere else, mkfs a new filesystem, and copy data 
back?


Unfortunately, doing it on a file level is not possible within a
reasonable amount of time.


I tried to copy that filesystem once (when it was much smaller) with 
"rsync -a -H", but after 3 days, rsync was still building an index and 
didn't copy any file.



Also, as files/hardlinks come and go, it would degrade again.


Are there better choices than ext3 for a filesystem with lots of 
hardlinks? ext4, once it's ready? xfs?



--
Tomasz Chmielewski
http://wpkg.org


Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> I tried to copy that filesystem once (when it was much smaller) with "rsync 
> -a -H", but after 3 days, rsync was still building an index and didn't copy 
> any file.

If you're going to copy the whole filesystem don't use rsync!  Use cp
or a tar pipeline to move the files.

> Also, as files/hardlinks come and go, it would degrade again.

Yes...

> Are there better choices than ext3 for a filesystem with lots of hardlinks? 
> ext4, once it's ready? xfs?

All filesystems are going to have problems keeping inodes close to
directories when you have huge numbers of hard links.

I'd really need to know exactly what kind of operations you were
trying to do that were causing problems before I could say for sure.
Yes, you said you were removing unneeded files, but how were you doing
it?  With rm -r of old hard-linked directories?  How big are the
average files involved?  Etc.

- Ted


Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 04:18:23PM +0100, Andi Kleen wrote:
> On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> > ext3 tries to keep inodes in the same block group as their containing
> > directory.  If you have lots of hard links, obviously it can't really
> > do that, especially since we don't have a good way at mkdir time to
> > tell the filesystem, "Psst!  This is going to be a hard link clone of
> > that directory over there, put it in the same block group".
> 
> Hmm, you think such a hint interface would be worth it?

It would definitely help ext2/3/4.  An interesting question is whether
it would help enough other filesystems that it's worth adding.

> > necessarily removing the dir_index feature.  Dir_index speeds up
> > individual lookups, but it slows down workloads that do a readdir
> 
> But only for large directories right? For kernel source like
> directory sizes it seems to be a general loss.

On my todo list is a hack which does the sorting of directory entries
by inode number inside the kernel for smallish directories (say, less
than 2-3 blocks) where using the kernel memory space to store the
directory entries is acceptable, and which would speed up dir_index
performance for kernel source-like directory sizes --- without needing
to use the spd_readdir LD_PRELOAD hack.

But yes, right now, if you know that your directories are almost
always going to be kernel source like in size, then omitting dir_index
is probably going to be a good idea.

- Ted


Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote:
> > Use cp
> > or a tar pipeline to move the files.
> 
> Are you sure cp handles hardlinks correctly? I know tar does,
> but I have my doubts about cp.

I *think* GNU cp does the right thing with --preserve=links.  I'm not
100% sure, though --- like you, probably, I always use tar for moving
or copying directory hierarchies.

   - Ted


Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Andi Kleen
On Mon, Feb 18, 2008 at 10:16:32AM -0500, Theodore Tso wrote:
> On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> > I tried to copy that filesystem once (when it was much smaller) with "rsync 
> > -a -H", but after 3 days, rsync was still building an index and didn't copy 
> > any file.
> 
> If you're going to copy the whole filesystem don't use rsync! 

Yes, I managed to kill systems (drive them really badly into oom and
get very long swap storms) with rsync -H in the past too. Something is very 
wrong with the rsync implementation of this.

> Use cp
> or a tar pipeline to move the files.

Are you sure cp handles hardlinks correctly? I know tar does,
but I have my doubts about cp.

-Andi


Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Tomasz Chmielewski

Theodore Tso wrote:

>> Are there better choices than ext3 for a filesystem with lots of hardlinks?
>> ext4, once it's ready? xfs?
>
> All filesystems are going to have problems keeping inodes close to
> directories when you have huge numbers of hard links.
>
> I'd really need to know exactly what kind of operations you were
> trying to do that were causing problems before I could say for sure.
> Yes, you said you were removing unneeded files, but how were you doing
> it?  With rm -r of old hard-linked directories?

Yes, with rm -r.

> How big are the
> average files involved?  Etc.


It's hard to estimate the average size of a file. I'd say there are not 
many files bigger than 50 MB.


Basically, it's a filesystem where backups are kept. Backups are made 
with BackupPC [1].


Imagine a full rootfs backup of 100 Linux systems.

Instead of compressing and writing "/bin/bash" 100 times for each 
separate system, we do it once, and hardlink. Then, keep 40 copies back, 
and you have 4000 hardlinks.


For individual or user files, the number of hardlinks will be smaller of 
course.


The directories I want to remove usually have the structure of a "normal"
Linux rootfs, nothing special there (other than that most of the files
have multiple hardlinks).



I noticed using write back helps a tiny bit, but as dm and md don't 
support write barriers, I'm not very eager to use it.



[1] http://backuppc.sf.net
http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues



--
Tomasz Chmielewski
http://wpkg.org



Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 05:16:55PM +0100, Tomasz Chmielewski wrote:
> Theodore Tso wrote:
>
>> I'd really need to know exactly what kind of operations you were
>> trying to do that were causing problems before I could say for sure.
>> Yes, you said you were removing unneeded files, but how were you doing
>> it?  With rm -r of old hard-linked directories?
>
> Yes, with rm -r.

You should definitely try the spd_readdir hack; that will help reduce
the seek times.  This will probably help on any block group oriented
filesystems, including XFS, etc.

>> How big are the
>> average files involved?  Etc.
>
> It's hard to estimate the average size of a file. I'd say there are not 
> many files bigger than 50 MB.

Well, Ext4 will help for files bigger than 48k.
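
The 48k figure presumably corresponds to the 12 direct block pointers
of an ext2/ext3 inode with 4 KiB blocks: a file larger than that needs
indirect blocks on ext3, which ext4 extents avoid.  A small sketch of
that arithmetic follows; the constant and helper names are
illustrative only and are not taken from the kernel sources.

```c
#include <assert.h>

/* Assumption: the classic ext2/ext3 inode layout with 12 direct
 * block pointers before the indirect blocks start. */
#define NDIR_BLOCKS 12UL

/* Largest file size (in bytes) that fits in the direct blocks alone. */
static unsigned long direct_limit_bytes(unsigned long block_size)
{
	assert(block_size > 0);
	return NDIR_BLOCKS * block_size;
}

/* Non-zero if a file of this size needs at least one indirect block. */
static int needs_indirect(unsigned long file_size, unsigned long block_size)
{
	return file_size > direct_limit_bytes(block_size);
}
```

With 4096-byte blocks direct_limit_bytes() gives 49152 bytes, i.e.
48 KiB; with 1 KiB blocks the limit would be only 12 KiB.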

The other thing that might help for you is using an external journal
on a separate hard drive (either for ext3 or ext4).  That will help
alleviate some of the seek storms going on, since the journal is
written to only sequentially, and putting it on a separate hard drive
will help remove some of the contention on the hard drive.  

I assume that your 1.2 TB filesystem is located on a RAID array; did
you use the mke2fs -E stride option to make sure all of the bitmaps
don't get concentrated on one hard drive spindle?  One of the failure
modes which can happen is if you use a 4+1 raid 5 setup, that all of
the block and inode bitmaps can end up getting laid out on a single
hard drive, so it becomes a bottleneck for bitmap intensive workloads
--- including "rm -rf".  So that's another thing that might be going
on.  If you do a "dumpe2fs", and look at the block numbers for the
block and inode allocation bitmaps, and you find that they are all
landing on the same physical hard drive, then that's very clearly the
biggest problem given an "rm -rf" workload.  You should be able to see
this as well visually; if one hard drive has its hard drive light
almost constantly on, and the other ones don't have much activity,
that's probably what is happening.
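
To see why the bitmaps can pile up on one spindle, here is a
simplified sketch.  It assumes 32768 blocks per group (the usual value
for 4 KiB blocks), treats the bitmaps as sitting at the start of each
group, and models only the position of a data chunk within a stripe,
ignoring parity rotation.  The function names are made up for
illustration.

```c
#include <assert.h>

/* Index of the data disk holding `block`, for chunks of `stride`
 * filesystem blocks striped over `ndata` data disks. */
static unsigned long data_disk_of_block(unsigned long long block,
					unsigned long stride,
					unsigned long ndata)
{
	return (unsigned long)((block / stride) % ndata);
}

/* Count how many distinct data disks the first block of each block
 * group lands on (at most 64 data disks in this sketch). */
static unsigned count_disks_hit(unsigned long ngroups,
				unsigned long blocks_per_group,
				unsigned long stride,
				unsigned long ndata)
{
	unsigned long long seen = 0;
	unsigned count = 0;
	unsigned long g;

	assert(stride > 0 && ndata > 0 && ndata <= 64);
	for (g = 0; g < ngroups; g++) {
		unsigned long long first =
			(unsigned long long)g * blocks_per_group;

		seen |= 1ULL << data_disk_of_block(first, stride, ndata);
	}
	while (seen) {
		count += (unsigned)(seen & 1);
		seen >>= 1;
	}
	return count;
}
```

With 4 data disks and a 16-block (64 KiB) chunk, a stripe is 64 blocks
and 32768 is an exact multiple of it, so every group starts on the
same data disk; with 3 data disks the group starts rotate over all
three.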

- Ted


[RFC 02/11] introduce simple_fs_type

2008-02-18 Thread Arnd Bergmann
There are a number of pseudo file systems in the kernel
that are basically copies of debugfs, all implementing the
same boilerplate code, just with different bugs.

This adds yet another copy to the kernel in the libfs directory,
with generalized helpers that can be used by any of them.

The most interesting function here is the new "struct dentry *
simple_register_filesystem(struct simple_fs_type *type)", which
returns the root directory of a new file system that can then
be passed to simple_create_file() and similar functions as a
parent.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -263,11 +263,6 @@ int simple_link(struct dentry *old_dentr
return 0;
 }
 
-static inline int simple_positive(struct dentry *dentry)
-{
-   return dentry->d_inode && !d_unhashed(dentry);
-}
-
 int simple_empty(struct dentry *dentry)
 {
struct dentry *child;
@@ -409,109 +404,6 @@ int simple_write_end(struct file *file, 
return copied;
 }
 
-/*
- * the inodes created here are not hashed. If you use iunique to generate
- * unique inode values later for this filesystem, then you must take care
- * to pass it an appropriate max_reserved value to avoid collisions.
- */
-int simple_fill_super(struct super_block *s, int magic, struct tree_descr *files)
-{
-   struct inode *inode;
-   struct dentry *root;
-   struct dentry *dentry;
-   int i;
-
-   s->s_blocksize = PAGE_CACHE_SIZE;
-   s->s_blocksize_bits = PAGE_CACHE_SHIFT;
-   s->s_magic = magic;
-   s->s_op = &simple_super_operations;
-   s->s_time_gran = 1;
-
-   inode = new_inode(s);
-   if (!inode)
-   return -ENOMEM;
-   /*
-* because the root inode is 1, the files array must not contain an
-* entry at index 1
-*/
-   inode->i_ino = 1;
-   inode->i_mode = S_IFDIR | 0755;
-   inode->i_uid = inode->i_gid = 0;
-   inode->i_blocks = 0;
-   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-   inode->i_op = &simple_dir_inode_operations;
-   inode->i_fop = &simple_dir_operations;
-   inode->i_nlink = 2;
-   root = d_alloc_root(inode);
-   if (!root) {
-   iput(inode);
-   return -ENOMEM;
-   }
-   for (i = 0; !files->name || files->name[0]; i++, files++) {
-   if (!files->name)
-   continue;
-
-   /* warn if it tries to conflict with the root inode */
-   if (unlikely(i == 1))
-   printk(KERN_WARNING "%s: %s passed in a files array"
-   "with an index of 1!\n", __func__,
-   s->s_type->name);
-
-   dentry = d_alloc_name(root, files->name);
-   if (!dentry)
-   goto out;
-   inode = new_inode(s);
-   if (!inode)
-   goto out;
-   inode->i_mode = S_IFREG | files->mode;
-   inode->i_uid = inode->i_gid = 0;
-   inode->i_blocks = 0;
-   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-   inode->i_fop = files->ops;
-   inode->i_ino = i;
-   d_add(dentry, inode);
-   }
-   s->s_root = root;
-   return 0;
-out:
-   d_genocide(root);
-   dput(root);
-   return -ENOMEM;
-}
-
-static DEFINE_SPINLOCK(pin_fs_lock);
-
-int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
-{
-   struct vfsmount *mnt = NULL;
-   spin_lock(&pin_fs_lock);
-   if (unlikely(!*mount)) {
-   spin_unlock(&pin_fs_lock);
-   mnt = vfs_kern_mount(type, 0, type->name, NULL);
-   if (IS_ERR(mnt))
-   return PTR_ERR(mnt);
-   spin_lock(&pin_fs_lock);
-   if (!*mount)
-   *mount = mnt;
-   }
-   mntget(*mount);
-   ++*count;
-   spin_unlock(&pin_fs_lock);
-   mntput(mnt);
-   return 0;
-}
-
-void simple_release_fs(struct vfsmount **mount, int *count)
-{
-   struct vfsmount *mnt;
-   spin_lock(&pin_fs_lock);
-   mnt = *mount;
-   if (!--*count)
-   *mount = NULL;
-   spin_unlock(&pin_fs_lock);
-   mntput(mnt);
-}
-
 ssize_t simple_read_from_buffer(void __user *to, size_t count, loff_t *ppos,
const void *from, size_t available)
 {
@@ -786,14 +678,11 @@ EXPORT_SYMBOL(simple_dir_inode_operation
 EXPORT_SYMBOL(simple_dir_operations);
 EXPORT_SYMBOL(simple_empty);
 EXPORT_SYMBOL(d_alloc_name);
-EXPORT_SYMBOL(simple_fill_super);
 EXPORT_SYMBOL(simple_getattr);
 EXPORT_SYMBOL(simple_link);
 EXPORT_SYMBOL(simple_lookup);
-EXPORT_SYMBOL(simple_pin_fs);
 EXPORT_SYMBOL(simple_prepare_write);
 EXPORT_SYMBOL(simple_readpag

[RFC 06/11] split out linux/libfs.h from linux/fs.h

2008-02-18 Thread Arnd Bergmann
With libfs turning into a larger subsystem, it makes
sense to have a separate header that is not included
by the low-level vfs code.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/fs/debugfs/inode.c
===
--- linux-2.6.orig/fs/debugfs/inode.c
+++ linux-2.6/fs/debugfs/inode.c
@@ -18,6 +18,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/fs/dcache.c
===
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -947,6 +947,7 @@ struct dentry *d_alloc_name(struct dentr
q.hash = full_name_hash(q.name, q.len);
return d_alloc(parent, &q);
 }
+EXPORT_SYMBOL(d_alloc_name);
 
 /**
  * d_instantiate - fill in inode information for a dentry
Index: linux-2.6/drivers/usb/core/inode.c
===
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -27,6 +27,7 @@
 
 /*/
 
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/fs/binfmt_misc.c
===
--- linux-2.6.orig/fs/binfmt_misc.c
+++ linux-2.6/fs/binfmt_misc.c
@@ -16,6 +16,7 @@
  *  2001-02-28 AV: rewritten into something that resembles C. Original didn't.
  */
 
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/fs/configfs/mount.c
===
--- linux-2.6.orig/fs/configfs/mount.c
+++ linux-2.6/fs/configfs/mount.c
@@ -25,6 +25,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/fs/debugfs/file.c
===
--- linux-2.6.orig/fs/debugfs/file.c
+++ linux-2.6/fs/debugfs/file.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/fs/fuse/control.c
===
--- linux-2.6.orig/fs/fuse/control.c
+++ linux-2.6/fs/fuse/control.c
@@ -9,6 +9,7 @@
 #include "fuse_i.h"
 
 #include 
+#include 
 #include 
 
 #define FUSE_CTL_SUPER_MAGIC 0x65735543
Index: linux-2.6/fs/nfsd/nfsctl.c
===
--- linux-2.6.orig/fs/nfsd/nfsctl.c
+++ linux-2.6/fs/nfsd/nfsctl.c
@@ -8,6 +8,7 @@
 
 #include 
 
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c
+++ linux-2.6/net/sunrpc/rpc_pipe.c
@@ -8,6 +8,7 @@
  * Copyright (c) 2002, Trond Myklebust <[EMAIL PROTECTED]>
  *
  */
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/security/inode.c
===
--- linux-2.6.orig/security/inode.c
+++ linux-2.6/security/inode.c
@@ -16,6 +16,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #define SECURITYFS_MAGIC   0x73636673
Index: linux-2.6/security/selinux/selinuxfs.c
===
--- linux-2.6.orig/security/selinux/selinuxfs.c
+++ linux-2.6/security/selinux/selinuxfs.c
@@ -14,6 +14,7 @@
  * the Free Software Foundation, version 2.
  */
 
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/virt/kvm/kvm_main.c
===
--- linux-2.6.orig/virt/kvm/kvm_main.c
+++ linux-2.6/virt/kvm/kvm_main.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/spufs.h
===
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/spufs.h
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/spufs.h
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
Index: linux-2.6/fs/autofs4/autofs_i.h
===
--- linux-2.6.orig/fs/autofs4/autofs_i.h
+++ linux-2.6/fs/autofs4/autofs_i.h
@@ -22,6 +22,7 @@
 #define AUTOFS_IOC_COUNT 32
 
 #include 
+#include 
 #include 
 #include 
 #include 
Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1957,12 +1957,7 @@ extern struct dentry *simple_lookup(stru
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
 extern const struct file_operations simple_dir_operations;
 extern const struct inode_operations simple_dir_inode_operations;
-struct tree_descr { char *name; const struct file_operations *ops; int mode; };
 struct dentry *d_alloc_name(struct dentry *, const char *);
-extern int simple_fill_super(struct super_block *, int, cons

[RFC 04/11] slim down securityfs

2008-02-18 Thread Arnd Bergmann
With the new simple_fs_type in place, securityfs practically
becomes a nop and we just need to leave code around to manage
its mount point.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/security/inode.c
===
--- linux-2.6.orig/security/inode.c
+++ linux-2.6/security/inode.c
@@ -13,176 +13,14 @@
  */
 
 /* #define DEBUG */
+
 #include 
-#include 
-#include 
-#include 
 #include 
-#include 
 #include 
 
 #define SECURITYFS_MAGIC   0x73636673
 
-static struct vfsmount *mount;
-static int mount_count;
-
-/*
- * TODO:
- *   I think I can get rid of these default_file_ops, but not quite sure...
- */
-static ssize_t default_read_file(struct file *file, char __user *buf,
-size_t count, loff_t *ppos)
-{
-   return 0;
-}
-
-static ssize_t default_write_file(struct file *file, const char __user *buf,
-  size_t count, loff_t *ppos)
-{
-   return count;
-}
-
-static int default_open(struct inode *inode, struct file *file)
-{
-   if (inode->i_private)
-   file->private_data = inode->i_private;
-
-   return 0;
-}
-
-static const struct file_operations default_file_ops = {
-   .read = default_read_file,
-   .write =default_write_file,
-   .open = default_open,
-};
-
-static struct inode *get_inode(struct super_block *sb, int mode, dev_t dev)
-{
-   struct inode *inode = new_inode(sb);
-
-   if (inode) {
-   inode->i_mode = mode;
-   inode->i_uid = 0;
-   inode->i_gid = 0;
-   inode->i_blocks = 0;
-   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-   switch (mode & S_IFMT) {
-   default:
-   init_special_inode(inode, mode, dev);
-   break;
-   case S_IFREG:
-   inode->i_fop = &default_file_ops;
-   break;
-   case S_IFDIR:
-   inode->i_op = &simple_dir_inode_operations;
-   inode->i_fop = &simple_dir_operations;
-
-   /* directory inodes start off with i_nlink == 2 (for "." entry) */
-   inc_nlink(inode);
-   break;
-   }
-   }
-   return inode;
-}
-
-/* SMP-safe */
-static int mknod(struct inode *dir, struct dentry *dentry,
-int mode, dev_t dev)
-{
-   struct inode *inode;
-   int error = -EPERM;
-
-   if (dentry->d_inode)
-   return -EEXIST;
-
-   inode = get_inode(dir->i_sb, mode, dev);
-   if (inode) {
-   d_instantiate(dentry, inode);
-   dget(dentry);
-   error = 0;
-   }
-   return error;
-}
-
-static int mkdir(struct inode *dir, struct dentry *dentry, int mode)
-{
-   int res;
-
-   mode = (mode & (S_IRWXUGO | S_ISVTX)) | S_IFDIR;
-   res = mknod(dir, dentry, mode, 0);
-   if (!res)
-   inc_nlink(dir);
-   return res;
-}
-
-static int create(struct inode *dir, struct dentry *dentry, int mode)
-{
-   mode = (mode & S_IALLUGO) | S_IFREG;
-   return mknod(dir, dentry, mode, 0);
-}
-
-static inline int positive(struct dentry *dentry)
-{
-   return dentry->d_inode && !d_unhashed(dentry);
-}
-
-static int fill_super(struct super_block *sb, void *data, int silent)
-{
-   static struct tree_descr files[] = {{""}};
-
-   return simple_fill_super(sb, SECURITYFS_MAGIC, files);
-}
-
-static int get_sb(struct file_system_type *fs_type,
- int flags, const char *dev_name,
- void *data, struct vfsmount *mnt)
-{
-   return get_sb_single(fs_type, flags, data, fill_super, mnt);
-}
-
-static struct file_system_type fs_type = {
-   .owner =THIS_MODULE,
-   .name = "securityfs",
-   .get_sb =   get_sb,
-   .kill_sb =  kill_litter_super,
-};
-
-static int create_by_name(const char *name, mode_t mode,
- struct dentry *parent,
- struct dentry **dentry)
-{
-   int error = 0;
-
-   *dentry = NULL;
-
-   /* If the parent is not specified, we create it in the root.
-* We need the root dentry to do this, which is in the super
-* block. A pointer to that is in the struct vfsmount that we
-* have around.
-*/
-   if (!parent ) {
-   if (mount && mount->mnt_sb) {
-   parent = mount->mnt_sb->s_root;
-   }
-   }
-   if (!parent) {
-   pr_debug("securityfs: Ah! can not find a parent!\n");
-   return -EFAULT;
-   }
-
-   mutex_lock(&parent->d_inode->i_mutex);
-   *dentry = lookup_one_len(name, parent, strlen(name));
-   if (!IS_ERR(dentry)) {
-   if ((mode & S_IFMT) == S_IFDIR)
-

[RFC 05/11] slim down usbfs

2008-02-18 Thread Arnd Bergmann
Half of the usbfs code is the same as debugfs, so we can
replace it now with calls to the generic libfs versions.
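
To illustrate the kind of duplication meant here: the mode-masking done in
usbfs_mkdir()/usbfs_create() is the same as in the securityfs and debugfs
copies. The following is only a userspace sketch, not part of the patch. The
DEMO_ constants and the demo_* function names are made up for illustration;
the octal values are assumed to match <linux/stat.h>.

```c
#include <assert.h>

/* Octal values as in Linux's <linux/stat.h>, redefined under a DEMO_
 * prefix so the sketch does not depend on feature-test macros. */
#define DEMO_S_IFMT	0170000
#define DEMO_S_IFREG	0100000
#define DEMO_S_IFDIR	0040000
#define DEMO_S_ISUID	0004000
#define DEMO_S_ISGID	0002000
#define DEMO_S_ISVTX	0001000
#define DEMO_S_IRWXUGO	0000777
#define DEMO_S_IALLUGO	(DEMO_S_ISUID | DEMO_S_ISGID | DEMO_S_ISVTX | DEMO_S_IRWXUGO)

/* Directories keep only the rwx bits and the sticky bit (hypothetical helper). */
static unsigned int demo_dir_mode(unsigned int mode)
{
	unsigned int res = (mode & (DEMO_S_IRWXUGO | DEMO_S_ISVTX)) | DEMO_S_IFDIR;

	assert((res & DEMO_S_IFMT) == DEMO_S_IFDIR);
	return res;
}

/* Regular files keep all permission bits, including setuid/setgid. */
static unsigned int demo_file_mode(unsigned int mode)
{
	return (mode & DEMO_S_IALLUGO) | DEMO_S_IFREG;
}
```

With these definitions, demo_dir_mode(04755) drops the setuid bit, while
demo_file_mode(04755) keeps it. Each pseudo file system currently carries its
own copy of exactly this logic.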

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/drivers/usb/core/inode.c
===
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -47,11 +47,10 @@
 #define USBFS_DEFAULT_BUSMODE (S_IXUGO | S_IRUGO)
 #define USBFS_DEFAULT_LISTMODE S_IRUGO
 
-static struct super_operations usbfs_ops;
-static const struct file_operations default_file_operations;
-static struct vfsmount *usbfs_mount;
-static int usbfs_mount_count;  /* = 0 */
-static int ignore_mount = 0;
+static DEFINE_SIMPLE_FS(usb_fs_type, "usbfs", NULL, USBDEVICE_SUPER_MAGIC);
+static struct dentry *usbfs_root;
+
+static int ignore_mount = 1;
 
 static struct dentry *devices_usbfs_dentry;
 static int num_buses;  /* = 0 */
@@ -263,186 +262,11 @@ static int remount(struct super_block *s
return -EINVAL;
}
 
-   if (usbfs_mount && usbfs_mount->mnt_sb)
-   update_sb(usbfs_mount->mnt_sb);
-
-   return 0;
-}
-
-static struct inode *usbfs_get_inode (struct super_block *sb, int mode, dev_t dev)
-{
-   struct inode *inode = new_inode(sb);
-
-   if (inode) {
-   inode->i_mode = mode;
-   inode->i_uid = current->fsuid;
-   inode->i_gid = current->fsgid;
-   inode->i_blocks = 0;
-   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-   switch (mode & S_IFMT) {
-   default:
-   init_special_inode(inode, mode, dev);
-   break;
-   case S_IFREG:
-   inode->i_fop = &default_file_operations;
-   break;
-   case S_IFDIR:
-   inode->i_op = &simple_dir_inode_operations;
-   inode->i_fop = &simple_dir_operations;
-
-   /* directory inodes start off with i_nlink == 2 (for "." entry) */
-   inc_nlink(inode);
-   break;
-   }
-   }
-   return inode; 
-}
-
-/* SMP-safe */
-static int usbfs_mknod (struct inode *dir, struct dentry *dentry, int mode,
-   dev_t dev)
-{
-   struct inode *inode = usbfs_get_inode(dir->i_sb, mode, dev);
-   int error = -EPERM;
-
-   if (dentry->d_inode)
-   return -EEXIST;
-
-   if (inode) {
-   d_instantiate(dentry, inode);
-   dget(dentry);
-   error = 0;
-   }
-   return error;
-}
-
-static int usbfs_mkdir (struct inode *dir, struct dentry *dentry, int mode)
-{
-   int res;
-
-   mode = (mode & (S_IRWXUGO | S_ISVTX)) | S_IFDIR;
-   res = usbfs_mknod (dir, dentry, mode, 0);
-   if (!res)
-   inc_nlink(dir);
-   return res;
-}
-
-static int usbfs_create (struct inode *dir, struct dentry *dentry, int mode)
-{
-   mode = (mode & S_IALLUGO) | S_IFREG;
-   return usbfs_mknod (dir, dentry, mode, 0);
-}
-
-static inline int usbfs_positive (struct dentry *dentry)
-{
-   return dentry->d_inode && !d_unhashed(dentry);
-}
-
-static int usbfs_empty (struct dentry *dentry)
-{
-   struct list_head *list;
-
-   spin_lock(&dcache_lock);
-
-   list_for_each(list, &dentry->d_subdirs) {
-   struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
-   if (usbfs_positive(de)) {
-   spin_unlock(&dcache_lock);
-   return 0;
-   }
-   }
-
-   spin_unlock(&dcache_lock);
-   return 1;
-}
-
-static int usbfs_unlink (struct inode *dir, struct dentry *dentry)
-{
-   struct inode *inode = dentry->d_inode;
-   mutex_lock(&inode->i_mutex);
-   drop_nlink(dentry->d_inode);
-   dput(dentry);
-   mutex_unlock(&inode->i_mutex);
-   d_delete(dentry);
-   return 0;
-}
-
-static int usbfs_rmdir(struct inode *dir, struct dentry *dentry)
-{
-   int error = -ENOTEMPTY;
-   struct inode * inode = dentry->d_inode;
-
-   mutex_lock(&inode->i_mutex);
-   dentry_unhash(dentry);
-   if (usbfs_empty(dentry)) {
-   drop_nlink(dentry->d_inode);
-   drop_nlink(dentry->d_inode);
-   dput(dentry);
-   inode->i_flags |= S_DEAD;
-   drop_nlink(dir);
-   error = 0;
-   }
-   mutex_unlock(&inode->i_mutex);
-   if (!error)
-   d_delete(dentry);
-   dput(dentry);
-   return error;
-}
-
-
-/* default file operations */
-static ssize_t default_read_file (struct file *file, char __user *buf,
- size_t count, loff_t *ppos)
-{
-   return 0;
-}
-
-static ssize_t default_write_file (struct file *file, const char __user *buf,
-  size_t count, loff_t *ppos)
-{
-   ret

[RFC 09/11] split out libfs/super.c from libfs.c

2008-02-18 Thread Arnd Bergmann
Consolidate all super block manipulation code in libfs in a single
source file.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -12,63 +12,6 @@
 
 #include 
 
-static const struct super_operations simple_super_operations = {
-   .statfs = simple_statfs,
-};
-
-/*
- * Common helper for pseudo-filesystems (sockfs, pipefs, bdev - stuff that
- * will never be mountable)
- */
-int get_sb_pseudo(struct file_system_type *fs_type, char *name,
-   const struct super_operations *ops, unsigned long magic,
-   struct vfsmount *mnt)
-{
-   struct super_block *s = sget(fs_type, NULL, set_anon_super, NULL);
-   struct dentry *dentry;
-   struct inode *root;
-   struct qstr d_name = {.name = name, .len = strlen(name)};
-
-   if (IS_ERR(s))
-   return PTR_ERR(s);
-
-   s->s_flags = MS_NOUSER;
-   s->s_maxbytes = ~0ULL;
-   s->s_blocksize = 1024;
-   s->s_blocksize_bits = 10;
-   s->s_magic = magic;
-   s->s_op = ops ? ops : &simple_super_operations;
-   s->s_time_gran = 1;
-   root = new_inode(s);
-   if (!root)
-   goto Enomem;
-   /*
-* since this is the first inode, make it number 1. New inodes created
-* after this must take care not to collide with it (by passing
-* max_reserved of 1 to iunique).
-*/
-   root->i_ino = 1;
-   root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
-   root->i_uid = root->i_gid = 0;
-   root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
-   dentry = d_alloc(NULL, &d_name);
-   if (!dentry) {
-   iput(root);
-   goto Enomem;
-   }
-   dentry->d_sb = s;
-   dentry->d_parent = dentry;
-   d_instantiate(dentry, root);
-   s->s_root = dentry;
-   s->s_flags |= MS_ACTIVE;
-   return simple_set_mnt(mnt, s);
-
-Enomem:
-   up_write(&s->s_umount);
-   deactivate_super(s);
-   return -ENOMEM;
-}
-
 int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
 {
struct inode *inode = old_dentry->d_inode;
@@ -238,7 +181,6 @@ ssize_t simple_read_from_buffer(void __u
return count;
 }
 
-EXPORT_SYMBOL(get_sb_pseudo);
 EXPORT_SYMBOL(simple_write_begin);
 EXPORT_SYMBOL(simple_write_end);
 EXPORT_SYMBOL(simple_empty);
Index: linux-2.6/fs/libfs/super.c
===
--- linux-2.6.orig/fs/libfs/super.c
+++ linux-2.6/fs/libfs/super.c
@@ -54,6 +54,60 @@ static const struct super_operations sim
 };
 
 /*
+ * Common helper for pseudo-filesystems (sockfs, pipefs, bdev - stuff that
+ * will never be mountable)
+ */
+int get_sb_pseudo(struct file_system_type *fs_type, char *name,
+   const struct super_operations *ops, unsigned long magic,
+   struct vfsmount *mnt)
+{
+   struct super_block *s = sget(fs_type, NULL, set_anon_super, NULL);
+   struct dentry *dentry;
+   struct inode *root;
+   struct qstr d_name = {.name = name, .len = strlen(name)};
+
+   if (IS_ERR(s))
+   return PTR_ERR(s);
+
+   s->s_flags = MS_NOUSER;
+   s->s_maxbytes = ~0ULL;
+   s->s_blocksize = 1024;
+   s->s_blocksize_bits = 10;
+   s->s_magic = magic;
+   s->s_op = ops ? ops : &simple_super_operations;
+   s->s_time_gran = 1;
+   root = new_inode(s);
+   if (!root)
+   goto Enomem;
+   /*
+* since this is the first inode, make it number 1. New inodes created
+* after this must take care not to collide with it (by passing
+* max_reserved of 1 to iunique).
+*/
+   root->i_ino = 1;
+   root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
+   root->i_uid = root->i_gid = 0;
+   root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+   dentry = d_alloc(NULL, &d_name);
+   if (!dentry) {
+   iput(root);
+   goto Enomem;
+   }
+   dentry->d_sb = s;
+   dentry->d_parent = dentry;
+   d_instantiate(dentry, root);
+   s->s_root = dentry;
+   s->s_flags |= MS_ACTIVE;
+   return simple_set_mnt(mnt, s);
+
+Enomem:
+   up_write(&s->s_umount);
+   deactivate_super(s);
+   return -ENOMEM;
+}
+EXPORT_SYMBOL(get_sb_pseudo);
+
+/*
  * the inodes created here are not hashed. If you use iunique to generate
  * unique inode values later for this filesystem, then you must take care
  * to pass it an appropriate max_reserved value to avoid collisions.

-- 



[RFC 08/11] split out libfs/dentry.c from libfs.c

2008-02-18 Thread Arnd Bergmann
Consolidate all dentry manipulation code in libfs in a single
source file.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>

Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -12,188 +12,6 @@
 
 #include 
 
-int simple_getattr(struct vfsmount *mnt, struct dentry *dentry,
-  struct kstat *stat)
-{
-   struct inode *inode = dentry->d_inode;
-   generic_fillattr(inode, stat);
-   stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9);
-   return 0;
-}
-
-int simple_statfs(struct dentry *dentry, struct kstatfs *buf)
-{
-   buf->f_type = dentry->d_sb->s_magic;
-   buf->f_bsize = PAGE_CACHE_SIZE;
-   buf->f_namelen = NAME_MAX;
-   return 0;
-}
-
-/*
- * Retaining negative dentries for an in-memory filesystem just wastes
- * memory and lookup time: arrange for them to be deleted immediately.
- */
-static int simple_delete_dentry(struct dentry *dentry)
-{
-   return 1;
-}
-
-/*
- * Lookup the data. This is trivial - if the dentry didn't already
- * exist, we know it is negative.  Set d_op to delete negative dentries.
- */
-struct dentry *simple_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
-{
-   static struct dentry_operations simple_dentry_operations = {
-   .d_delete = simple_delete_dentry,
-   };
-
-   if (dentry->d_name.len > NAME_MAX)
-   return ERR_PTR(-ENAMETOOLONG);
-   dentry->d_op = &simple_dentry_operations;
-   d_add(dentry, NULL);
-   return NULL;
-}
-
-int simple_sync_file(struct file * file, struct dentry *dentry, int datasync)
-{
-   return 0;
-}
- 
-int dcache_dir_open(struct inode *inode, struct file *file)
-{
-   static struct qstr cursor_name = {.len = 1, .name = "."};
-
-   file->private_data = d_alloc(file->f_path.dentry, &cursor_name);
-
-   return file->private_data ? 0 : -ENOMEM;
-}
-
-int dcache_dir_close(struct inode *inode, struct file *file)
-{
-   dput(file->private_data);
-   return 0;
-}
-
-loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
-{
-   mutex_lock(&file->f_path.dentry->d_inode->i_mutex);
-   switch (origin) {
-   case 1:
-   offset += file->f_pos;
-   case 0:
-   if (offset >= 0)
-   break;
-   default:
-   mutex_unlock(&file->f_path.dentry->d_inode->i_mutex);
-   return -EINVAL;
-   }
-   if (offset != file->f_pos) {
-   file->f_pos = offset;
-   if (file->f_pos >= 2) {
-   struct list_head *p;
-   struct dentry *cursor = file->private_data;
-   loff_t n = file->f_pos - 2;
-
-   spin_lock(&dcache_lock);
-   list_del(&cursor->d_u.d_child);
-   p = file->f_path.dentry->d_subdirs.next;
-   while (n && p != &file->f_path.dentry->d_subdirs) {
-   struct dentry *next;
-   next = list_entry(p, struct dentry, d_u.d_child);
-   if (!d_unhashed(next) && next->d_inode)
-   n--;
-   p = p->next;
-   }
-   list_add_tail(&cursor->d_u.d_child, p);
-   spin_unlock(&dcache_lock);
-   }
-   }
-   mutex_unlock(&file->f_path.dentry->d_inode->i_mutex);
-   return offset;
-}
-
-/* Relationship between i_mode and the DT_xxx types */
-static inline unsigned char dt_type(struct inode *inode)
-{
-   return (inode->i_mode >> 12) & 15;
-}
-
-/*
- * Directory is locked and all positive dentries in it are safe, since
- * for ramfs-type trees they can't go away without unlink() or rmdir(),
- * both impossible due to the lock on directory.
- */
-
-int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
-{
-   struct dentry *dentry = filp->f_path.dentry;
-   struct dentry *cursor = filp->private_data;
-   struct list_head *p, *q = &cursor->d_u.d_child;
-   ino_t ino;
-   int i = filp->f_pos;
-
-   switch (i) {
-   case 0:
-   ino = dentry->d_inode->i_ino;
-   if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
-   break;
-   filp->f_pos++;
-   i++;
-   /* fallthrough */
-   case 1:
-   ino = parent_ino(dentry);
-   if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
-   break;
-   filp->f_pos++;
-   i++;
-   /* fallthrough */
-   default:
-  

[RFC 01/11] add generic versions of debugfs file operations

2008-02-18 Thread Arnd Bergmann
The file operations in debugfs are rather generic and can
be used by other file systems, so it makes sense to
include them in libfs under more generic names and export
them to modules.

This patch adds a new copy of these operations to libfs,
so that the debugfs version can later be cut down.
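
As a rough illustration of the pattern being made generic: each attribute
file is a get/set pair that works on a u64, plus a format string used on
read. The sketch below is userspace C, not kernel code. The accessor bodies
mirror the ones in the patch, while struct demo_attr_ops and demo_attr_show()
are hypothetical stand-ins for what the DEFINE_SIMPLE_*_ATTRIBUTE macros and
the read path set up.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-ins for the kernel's fixed-width types (assumption). */
typedef uint64_t u64;
typedef uint8_t u8;

/* Hypothetical descriptor: accessors plus the format used on read. */
struct demo_attr_ops {
	int (*get)(void *data, u64 *val);
	int (*set)(void *data, u64 val);
	const char *fmt;
};

static int simple_u8_set(void *data, u64 val)
{
	*(u8 *)data = (u8)val;	/* value is truncated to 8 bits */
	return 0;
}

static int simple_u8_get(void *data, u64 *val)
{
	*val = *(u8 *)data;
	return 0;
}

/* One accessor pair serves both the decimal and the hex variant. */
static const struct demo_attr_ops demo_ops_u8 = {
	simple_u8_get, simple_u8_set, "%llu\n"
};
static const struct demo_attr_ops demo_ops_x8 = {
	simple_u8_get, simple_u8_set, "0x%02llx\n"
};

/* Format the current value into buf, roughly what a read() would return. */
static int demo_attr_show(const struct demo_attr_ops *ops, void *data,
			  char *buf, size_t len)
{
	u64 val;
	int ret;

	assert(ops && ops->get && buf);
	ret = ops->get(data, &val);
	if (ret)
		return ret;
	return snprintf(buf, len, ops->fmt, (unsigned long long)val);
}
```

Only the width of the cast differs between the u8, u16, u32 and u64
variants, which is why one shared copy of these helpers is enough for all
users.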

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>

Index: linux-2.6/fs/Makefile
===
--- linux-2.6.orig/fs/Makefile
+++ linux-2.6/fs/Makefile
@@ -13,6 +13,8 @@ obj-y :=  open.o read_write.o file_table.
pnode.o drop_caches.o splice.o sync.o utimes.o \
stack.o
 
+obj-$(CONFIG_LIBFS) += libfs/
+
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
 else
Index: linux-2.6/include/linux/libfs.h
===
--- /dev/null
+++ linux-2.6/include/linux/libfs.h
@@ -0,0 +1,21 @@
+#ifndef __LIBFS_H__
+#define __LIBFS_H__
+
+#include 
+
+extern const struct file_operations simple_fops_u8;
+extern const struct file_operations simple_fops_x8;
+extern const struct file_operations simple_fops_u16;
+extern const struct file_operations simple_fops_x16;
+extern const struct file_operations simple_fops_u32;
+extern const struct file_operations simple_fops_x32;
+extern const struct file_operations simple_fops_u64;
+extern const struct file_operations simple_fops_bool;
+extern const struct file_operations simple_fops_blob;
+
+struct simple_blob_wrapper {
+   void *data;
+   unsigned long size;
+};
+
+#endif /* __LIBFS_H__ */
Index: linux-2.6/fs/libfs/Makefile
===
--- /dev/null
+++ linux-2.6/fs/libfs/Makefile
@@ -0,0 +1,3 @@
+libfs-y += file.o
+
+obj-$(CONFIG_LIBFS) += libfs.o
Index: linux-2.6/fs/libfs/file.c
===
--- /dev/null
+++ linux-2.6/fs/libfs/file.c
@@ -0,0 +1,126 @@
+/*
+ * fs/libfs/file.c
+ * Library for filesystems writers.
+ */
+
+#include 
+#include 
+#include 
+
+#include 
+
+/* commonly used attribute file operations */
+static int simple_u8_set(void *data, u64 val)
+{
+   *(u8 *)data = val;
+   return 0;
+}
+static int simple_u8_get(void *data, u64 *val)
+{
+   *val = *(u8 *)data;
+   return 0;
+}
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_u8, simple_u8_get, simple_u8_set, "%llu\n");
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_x8, simple_u8_get, simple_u8_set, "0x%02llx\n");
+
+static int simple_u16_set(void *data, u64 val)
+{
+   *(u16 *)data = val;
+   return 0;
+}
+static int simple_u16_get(void *data, u64 *val)
+{
+   *val = *(u16 *)data;
+   return 0;
+}
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_u16, simple_u16_get, simple_u16_set, "%llu\n");
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_x16, simple_u16_get, simple_u16_set, "0x%04llx\n");
+
+static int simple_u32_set(void *data, u64 val)
+{
+   *(u32 *)data = val;
+   return 0;
+}
+static int simple_u32_get(void *data, u64 *val)
+{
+   *val = *(u32 *)data;
+   return 0;
+}
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_u32, simple_u32_get, simple_u32_set, "%llu\n");
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_x32, simple_u32_get, simple_u32_set, "0x%08llx\n");
+
+static int simple_u64_set(void *data, u64 val)
+{
+   *(u64 *)data = val;
+   return 0;
+}
+
+static int simple_u64_get(void *data, u64 *val)
+{
+   *val = *(u64 *)data;
+   return 0;
+}
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_u64, simple_u64_get, simple_u64_set, "%llu\n");
+
+static ssize_t read_file_bool(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+   char buf[3];
+   u32 *val = file->private_data;
+
+   if (*val)
+   buf[0] = 'Y';
+   else
+   buf[0] = 'N';
+   buf[1] = '\n';
+   buf[2] = 0x00;
+   return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t write_file_bool(struct file *file, const char __user *user_buf,
+  size_t count, loff_t *ppos)
+{
+   char buf[32];
+   int buf_size;
+   u32 *val = file->private_data;
+
+   buf_size = min(count, (sizeof(buf)-1));
+   if (copy_from_user(buf, user_buf, buf_size))
+   return -EFAULT;
+
+   switch (buf[0]) {
+   case 'y':
+   case 'Y':
+   case '1':
+   *val = 1;
+   break;
+   case 'n':
+   case 'N':
+   case '0':
+   *val = 0;
+   break;
+   }
+
+   return count;
+}
+
+const struct file_operations simple_fops_bool = {
+   .read = read_file_bool,
+   .write =write_file_bool,
+   .open = simple_open,
+};
+EXPORT_SYMBOL_GPL(simple_fops_bool);
+
+static ssize_t read_file_blob(struct file *file, char __user *user_buf,
+

[RFC 00/11] possible debugfs/libfs consolidation

2008-02-18 Thread Arnd Bergmann
I noticed that there is a lot of duplication in pseudo
file systems, so I started looking into how to consolidate
them. I ended up with a largish rework of the structure
of libfs and moving almost all of debugfs in there as well.

As an example, I also have patches that reduce debugfs,
securityfs and usbfs to the point where they are mostly
thin wrappers around libfs, with large comment blocks.
Other file systems could be changed in the same way, but
I'd first like to see whether people agree that I'm on the
right track.

These patches have seen practically no testing so far,
so don't expect them to work, but please tell me what
you think about the concept.

Arnd <><



[RFC 03/11] slim down debugfs

2008-02-18 Thread Arnd Bergmann
With most of debugfs now copied to generic code in libfs,
we can remove the original copy and replace it with thin
wrappers around libfs.
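
For readers who want the idea without the full diff, "thin wrapper" here
means the file system keeps its public entry points but hands all the work
to shared code. Below is a minimal userspace sketch of that shape. All the
demo_* names and both structs are invented for illustration and are not
kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct demo_fops {
	const char *kind;
};

struct demo_file {
	const char *name;
	unsigned int mode;
	void *data;
	const struct demo_fops *fops;
};

/* Shared "library" side: one ops table for all u8 attribute files. */
static const struct demo_fops demo_simple_fops_u8 = { "u8" };

/* Shared "library" side: generic file creation. */
static struct demo_file demo_create_file(const char *name, unsigned int mode,
					 void *data, const struct demo_fops *fops)
{
	struct demo_file f = { name, mode, data, fops };

	assert(name != NULL && fops != NULL);
	return f;
}

/* File system side: the wrapper only selects the shared ops table. */
static struct demo_file demo_fs_create_u8(const char *name, unsigned int mode,
					  unsigned char *value)
{
	return demo_create_file(name, mode, value, &demo_simple_fops_u8);
}
```

The debugfs_create_u8() change in the diff has the same shape: the signature
stays as it was and only the ops table passed along changes.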

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/fs/Kconfig
===
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -1001,6 +1001,14 @@ config CONFIGFS_FS
  Both sysfs and configfs can and should exist together on the
  same system. One is not a replacement for the other.
 
+config LIBFS
+   tristate
+   default m
+   help
+ libfs is a helper library used by many of the simpler file
+ systems. Parts of libfs can be modular when all of its users
+ are modules as well, and the users should select this symbol.
+
 endmenu
 
 menu "Miscellaneous filesystems"
Index: linux-2.6/fs/debugfs/file.c
===
--- linux-2.6.orig/fs/debugfs/file.c
+++ linux-2.6/fs/debugfs/file.c
@@ -19,55 +19,6 @@
 #include 
 #include 
 
-static ssize_t default_read_file(struct file *file, char __user *buf,
-size_t count, loff_t *ppos)
-{
-   return 0;
-}
-
-static ssize_t default_write_file(struct file *file, const char __user *buf,
-  size_t count, loff_t *ppos)
-{
-   return count;
-}
-
-static int default_open(struct inode *inode, struct file *file)
-{
-   if (inode->i_private)
-   file->private_data = inode->i_private;
-
-   return 0;
-}
-
-const struct file_operations debugfs_file_operations = {
-   .read = default_read_file,
-   .write =default_write_file,
-   .open = default_open,
-};
-
-static void *debugfs_follow_link(struct dentry *dentry, struct nameidata *nd)
-{
-   nd_set_link(nd, dentry->d_inode->i_private);
-   return NULL;
-}
-
-const struct inode_operations debugfs_link_operations = {
-   .readlink   = generic_readlink,
-   .follow_link= debugfs_follow_link,
-};
-
-static int debugfs_u8_set(void *data, u64 val)
-{
-   *(u8 *)data = val;
-   return 0;
-}
-static int debugfs_u8_get(void *data, u64 *val)
-{
-   *val = *(u8 *)data;
-   return 0;
-}
-DEFINE_SIMPLE_ATTRIBUTE(fops_u8, debugfs_u8_get, debugfs_u8_set, "%llu\n");
-
 /**
  * debugfs_create_u8 - create a debugfs file that is used to read and write an unsigned 8-bit value
  * @name: a pointer to a string containing the name of the file to create.
@@ -95,22 +46,10 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_u8, debugfs
 struct dentry *debugfs_create_u8(const char *name, mode_t mode,
 struct dentry *parent, u8 *value)
 {
-   return debugfs_create_file(name, mode, parent, value, &fops_u8);
+   return debugfs_create_file(name, mode, parent, value, &simple_fops_u8);
 }
 EXPORT_SYMBOL_GPL(debugfs_create_u8);
 
-static int debugfs_u16_set(void *data, u64 val)
-{
-   *(u16 *)data = val;
-   return 0;
-}
-static int debugfs_u16_get(void *data, u64 *val)
-{
-   *val = *(u16 *)data;
-   return 0;
-}
-DEFINE_SIMPLE_ATTRIBUTE(fops_u16, debugfs_u16_get, debugfs_u16_set, "%llu\n");
-
 /**
  * debugfs_create_u16 - create a debugfs file that is used to read and write an unsigned 16-bit value
  * @name: a pointer to a string containing the name of the file to create.
@@ -138,22 +77,10 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_u16, debugf
 struct dentry *debugfs_create_u16(const char *name, mode_t mode,
  struct dentry *parent, u16 *value)
 {
-   return debugfs_create_file(name, mode, parent, value, &fops_u16);
+   return debugfs_create_file(name, mode, parent, value, &simple_fops_u16);
 }
 EXPORT_SYMBOL_GPL(debugfs_create_u16);
 
-static int debugfs_u32_set(void *data, u64 val)
-{
-   *(u32 *)data = val;
-   return 0;
-}
-static int debugfs_u32_get(void *data, u64 *val)
-{
-   *val = *(u32 *)data;
-   return 0;
-}
-DEFINE_SIMPLE_ATTRIBUTE(fops_u32, debugfs_u32_get, debugfs_u32_set, "%llu\n");
-
 /**
  * debugfs_create_u32 - create a debugfs file that is used to read and write an unsigned 32-bit value
  * @name: a pointer to a string containing the name of the file to create.
@@ -181,23 +108,10 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_u32, debugf
 struct dentry *debugfs_create_u32(const char *name, mode_t mode,
 struct dentry *parent, u32 *value)
 {
-   return debugfs_create_file(name, mode, parent, value, &fops_u32);
+   return debugfs_create_file(name, mode, parent, value, &simple_fops_u32);
 }
 EXPORT_SYMBOL_GPL(debugfs_create_u32);
 
-static int debugfs_u64_set(void *data, u64 val)
-{
-   *(u64 *)data = val;
-   return 0;
-}
-
-static int debugfs_u64_get(void *data, u64 *val)
-{
-   *val = *(u64 *)data;
-   return 0;
-}
-DEFINE_SIMPLE_ATTRIBUTE(fops_u64, debugfs_u64_get, debugfs_u64_set, "%llu\n");
-
 /**
  * debugfs_create_u64 - create a debugfs f

[RFC 07/11] split out libfs/file.c from libfs.c

2008-02-18 Thread Arnd Bergmann
Consolidate all file manipulation code in libfs in a single
source file.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -421,165 +421,6 @@ ssize_t simple_read_from_buffer(void __u
 }
 
 /*
- * Transaction based IO.
- * The file expects a single write which triggers the transaction, and then
- * possibly a read which collects the result - which is stored in a
- * file-local buffer.
- */
-char *simple_transaction_get(struct file *file, const char __user *buf, size_t size)
-{
-   struct simple_transaction_argresp *ar;
-   static DEFINE_SPINLOCK(simple_transaction_lock);
-
-   if (size > SIMPLE_TRANSACTION_LIMIT - 1)
-   return ERR_PTR(-EFBIG);
-
-   ar = (struct simple_transaction_argresp *)get_zeroed_page(GFP_KERNEL);
-   if (!ar)
-   return ERR_PTR(-ENOMEM);
-
-   spin_lock(&simple_transaction_lock);
-
-   /* only one write allowed per open */
-   if (file->private_data) {
-   spin_unlock(&simple_transaction_lock);
-   free_page((unsigned long)ar);
-   return ERR_PTR(-EBUSY);
-   }
-
-   file->private_data = ar;
-
-   spin_unlock(&simple_transaction_lock);
-
-   if (copy_from_user(ar->data, buf, size))
-   return ERR_PTR(-EFAULT);
-
-   return ar->data;
-}
-
-ssize_t simple_transaction_read(struct file *file, char __user *buf, size_t size, loff_t *pos)
-{
-   struct simple_transaction_argresp *ar = file->private_data;
-
-   if (!ar)
-   return 0;
-   return simple_read_from_buffer(buf, size, pos, ar->data, ar->size);
-}
-
-int simple_transaction_release(struct inode *inode, struct file *file)
-{
-   free_page((unsigned long)file->private_data);
-   return 0;
-}
-
-/* Simple attribute files */
-
-struct simple_attr {
-   int (*get)(void *, u64 *);
-   int (*set)(void *, u64);
-   char get_buf[24];   /* enough to store a u64 and "\n\0" */
-   char set_buf[24];
-   void *data;
-   const char *fmt;/* format for read operation */
-   struct mutex mutex; /* protects access to these buffers */
-};
-
-/* simple_attr_open is called by an actual attribute open file operation
- * to set the attribute specific access operations. */
-int simple_attr_open(struct inode *inode, struct file *file,
-int (*get)(void *, u64 *), int (*set)(void *, u64),
-const char *fmt)
-{
-   struct simple_attr *attr;
-
-   attr = kmalloc(sizeof(*attr), GFP_KERNEL);
-   if (!attr)
-   return -ENOMEM;
-
-   attr->get = get;
-   attr->set = set;
-   attr->data = inode->i_private;
-   attr->fmt = fmt;
-   mutex_init(&attr->mutex);
-
-   file->private_data = attr;
-
-   return nonseekable_open(inode, file);
-}
-
-int simple_attr_release(struct inode *inode, struct file *file)
-{
-   kfree(file->private_data);
-   return 0;
-}
-
-/* read from the buffer that is filled with the get function */
-ssize_t simple_attr_read(struct file *file, char __user *buf,
-size_t len, loff_t *ppos)
-{
-   struct simple_attr *attr;
-   size_t size;
-   ssize_t ret;
-
-   attr = file->private_data;
-
-   if (!attr->get)
-   return -EACCES;
-
-   ret = mutex_lock_interruptible(&attr->mutex);
-   if (ret)
-   return ret;
-
-   if (*ppos) {/* continued read */
-   size = strlen(attr->get_buf);
-   } else {/* first read */
-   u64 val;
-   ret = attr->get(attr->data, &val);
-   if (ret)
-   goto out;
-
-   size = scnprintf(attr->get_buf, sizeof(attr->get_buf),
-attr->fmt, (unsigned long long)val);
-   }
-
-   ret = simple_read_from_buffer(buf, len, ppos, attr->get_buf, size);
-out:
-   mutex_unlock(&attr->mutex);
-   return ret;
-}
-
-/* interpret the buffer as a number to call the set function with */
-ssize_t simple_attr_write(struct file *file, const char __user *buf,
- size_t len, loff_t *ppos)
-{
-   struct simple_attr *attr;
-   u64 val;
-   size_t size;
-   ssize_t ret;
-
-   attr = file->private_data;
-   if (!attr->set)
-   return -EACCES;
-
-   ret = mutex_lock_interruptible(&attr->mutex);
-   if (ret)
-   return ret;
-
-   ret = -EFAULT;
-   size = min(sizeof(attr->set_buf) - 1, len);
-   if (copy_from_user(attr->set_buf, buf, size))
-   goto out;
-
-   ret = len; /* claim we got the whole input */
-   attr->set_buf[size] = '\0';
-   val = simple_strtol(attr->set_buf, NULL, 0);
-   attr->set(attr->data, val);
-out:
-   mutex_unlock(&attr->mutex);
- 

[RFC 10/11] split out libfs/inode.c from libfs.c

2008-02-18 Thread Arnd Bergmann
Consolidate all inode manipulation code in libfs in a single
source file.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>

Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -12,78 +12,6 @@
 
 #include 
 
-int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
-{
-   struct inode *inode = old_dentry->d_inode;
-
-   inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-   inc_nlink(inode);
-   atomic_inc(&inode->i_count);
-   dget(dentry);
-   d_instantiate(dentry, inode);
-   return 0;
-}
-
-int simple_empty(struct dentry *dentry)
-{
-   struct dentry *child;
-   int ret = 0;
-
-   spin_lock(&dcache_lock);
-   list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
-   if (simple_positive(child))
-   goto out;
-   ret = 1;
-out:
-   spin_unlock(&dcache_lock);
-   return ret;
-}
-
-int simple_unlink(struct inode *dir, struct dentry *dentry)
-{
-   struct inode *inode = dentry->d_inode;
-
-   inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-   drop_nlink(inode);
-   dput(dentry);
-   return 0;
-}
-
-int simple_rmdir(struct inode *dir, struct dentry *dentry)
-{
-   if (!simple_empty(dentry))
-   return -ENOTEMPTY;
-
-   drop_nlink(dentry->d_inode);
-   simple_unlink(dir, dentry);
-   drop_nlink(dir);
-   return 0;
-}
-
-int simple_rename(struct inode *old_dir, struct dentry *old_dentry,
-   struct inode *new_dir, struct dentry *new_dentry)
-{
-   struct inode *inode = old_dentry->d_inode;
-   int they_are_dirs = S_ISDIR(old_dentry->d_inode->i_mode);
-
-   if (!simple_empty(new_dentry))
-   return -ENOTEMPTY;
-
-   if (new_dentry->d_inode) {
-   simple_unlink(new_dir, new_dentry);
-   if (they_are_dirs)
-   drop_nlink(old_dir);
-   } else if (they_are_dirs) {
-   drop_nlink(old_dir);
-   inc_nlink(new_dir);
-   }
-
-   old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
-   new_dir->i_mtime = inode->i_ctime = CURRENT_TIME;
-
-   return 0;
-}
-
 int simple_readpage(struct file *file, struct page *page)
 {
clear_highpage(page);
@@ -183,11 +111,6 @@ ssize_t simple_read_from_buffer(void __u
 
 EXPORT_SYMBOL(simple_write_begin);
 EXPORT_SYMBOL(simple_write_end);
-EXPORT_SYMBOL(simple_empty);
-EXPORT_SYMBOL(simple_link);
 EXPORT_SYMBOL(simple_prepare_write);
 EXPORT_SYMBOL(simple_readpage);
-EXPORT_SYMBOL(simple_rename);
-EXPORT_SYMBOL(simple_rmdir);
-EXPORT_SYMBOL(simple_unlink);
 EXPORT_SYMBOL(simple_read_from_buffer);
Index: linux-2.6/fs/libfs/inode.c
===
--- linux-2.6.orig/fs/libfs/inode.c
+++ linux-2.6/fs/libfs/inode.c
@@ -417,4 +417,79 @@ exit:
 }
 EXPORT_SYMBOL_GPL(simple_rename_named);
 
+int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
+{
+   struct inode *inode = old_dentry->d_inode;
 
+   inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+   inc_nlink(inode);
+   atomic_inc(&inode->i_count);
+   dget(dentry);
+   d_instantiate(dentry, inode);
+   return 0;
+}
+EXPORT_SYMBOL(simple_link);
+
+int simple_empty(struct dentry *dentry)
+{
+   struct dentry *child;
+   int ret = 0;
+
+   spin_lock(&dcache_lock);
+   list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
+   if (simple_positive(child))
+   goto out;
+   ret = 1;
+out:
+   spin_unlock(&dcache_lock);
+   return ret;
+}
+EXPORT_SYMBOL(simple_empty);
+
+int simple_unlink(struct inode *dir, struct dentry *dentry)
+{
+   struct inode *inode = dentry->d_inode;
+
+   inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+   drop_nlink(inode);
+   dput(dentry);
+   return 0;
+}
+EXPORT_SYMBOL(simple_unlink);
+
+int simple_rmdir(struct inode *dir, struct dentry *dentry)
+{
+   if (!simple_empty(dentry))
+   return -ENOTEMPTY;
+
+   drop_nlink(dentry->d_inode);
+   simple_unlink(dir, dentry);
+   drop_nlink(dir);
+   return 0;
+}
+EXPORT_SYMBOL(simple_rmdir);
+
+int simple_rename(struct inode *old_dir, struct dentry *old_dentry,
+   struct inode *new_dir, struct dentry *new_dentry)
+{
+   struct inode *inode = old_dentry->d_inode;
+   int they_are_dirs = S_ISDIR(old_dentry->d_inode->i_mode);
+
+   if (!simple_empty(new_dentry))
+   return -ENOTEMPTY;
+
+   if (new_dentry->d_inode) {
+   simple_unlink(new_dir, new_dentry);
+   if (they_are_dirs)
+   drop_nlink(old_dir);
+   } else if (they_are_dirs) {
+   drop_nlink(old_dir);
+   inc_nlink(new_dir);

[RFC 11/11] split out libfs/aops.c from libfs.c

2008-02-18 Thread Arnd Bergmann
Consolidate all address space manipulation code in libfs in a single
source file.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>


Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ /dev/null
@@ -1,116 +0,0 @@
-/*
- * fs/libfs.c
- * Library for filesystems writers.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-int simple_readpage(struct file *file, struct page *page)
-{
-   clear_highpage(page);
-   flush_dcache_page(page);
-   SetPageUptodate(page);
-   unlock_page(page);
-   return 0;
-}
-
-int simple_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
-{
-   if (!PageUptodate(page)) {
-   if (to - from != PAGE_CACHE_SIZE)
-   zero_user_segments(page,
-   0, from,
-   to, PAGE_CACHE_SIZE);
-   }
-   return 0;
-}
-
-int simple_write_begin(struct file *file, struct address_space *mapping,
-   loff_t pos, unsigned len, unsigned flags,
-   struct page **pagep, void **fsdata)
-{
-   struct page *page;
-   pgoff_t index;
-   unsigned from;
-
-   index = pos >> PAGE_CACHE_SHIFT;
-   from = pos & (PAGE_CACHE_SIZE - 1);
-
-   page = __grab_cache_page(mapping, index);
-   if (!page)
-   return -ENOMEM;
-
-   *pagep = page;
-
-   return simple_prepare_write(file, page, from, from+len);
-}
-
-static int simple_commit_write(struct file *file, struct page *page,
-  unsigned from, unsigned to)
-{
-   struct inode *inode = page->mapping->host;
-   loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
-   if (!PageUptodate(page))
-   SetPageUptodate(page);
-   /*
-* No need to use i_size_read() here, the i_size
-* cannot change under us because we hold the i_mutex.
-*/
-   if (pos > inode->i_size)
-   i_size_write(inode, pos);
-   set_page_dirty(page);
-   return 0;
-}
-
-int simple_write_end(struct file *file, struct address_space *mapping,
-   loff_t pos, unsigned len, unsigned copied,
-   struct page *page, void *fsdata)
-{
-   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-   /* zero the stale part of the page if we did a short copy */
-   if (copied < len) {
-   void *kaddr = kmap_atomic(page, KM_USER0);
-   memset(kaddr + from + copied, 0, len - copied);
-   flush_dcache_page(page);
-   kunmap_atomic(kaddr, KM_USER0);
-   }
-
-   simple_commit_write(file, page, from, from+copied);
-
-   unlock_page(page);
-   page_cache_release(page);
-
-   return copied;
-}
-
-ssize_t simple_read_from_buffer(void __user *to, size_t count, loff_t *ppos,
-   const void *from, size_t available)
-{
-   loff_t pos = *ppos;
-   if (pos < 0)
-   return -EINVAL;
-   if (pos >= available)
-   return 0;
-   if (count > available - pos)
-   count = available - pos;
-   if (copy_to_user(to, from + pos, count))
-   return -EFAULT;
-   *ppos = pos + count;
-   return count;
-}
-
-EXPORT_SYMBOL(simple_write_begin);
-EXPORT_SYMBOL(simple_write_end);
-EXPORT_SYMBOL(simple_prepare_write);
-EXPORT_SYMBOL(simple_readpage);
-EXPORT_SYMBOL(simple_read_from_buffer);
Index: linux-2.6/fs/libfs/Makefile
===
--- linux-2.6.orig/fs/libfs/Makefile
+++ linux-2.6/fs/libfs/Makefile
@@ -1,3 +1,3 @@
 libfs-y += file.o
 
-obj-$(CONFIG_LIBFS) += libfs.o inode.o super.o dentry.o
+obj-$(CONFIG_LIBFS) += libfs.o inode.o super.o dentry.o aops.o
Index: linux-2.6/fs/libfs/aops.c
===
--- /dev/null
+++ linux-2.6/fs/libfs/aops.c
@@ -0,0 +1,113 @@
+/*
+ * fs/libfs/aops.c
+ * Library for filesystems writers -- address space operations
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+int simple_readpage(struct file *file, struct page *page)
+{
+   clear_highpage(page);
+   flush_dcache_page(page);
+   SetPageUptodate(page);
+   unlock_page(page);
+   return 0;
+}
+EXPORT_SYMBOL(simple_readpage);
+
+int simple_prepare_write(struct file *file, struct page *page,
+   unsigned from, unsigned to)
+{
+   if (!PageUptodate(page)) {
+   if (to - from != PAGE_CACHE_SIZE)
+   zero_user_segments(page,
+   0, from,
+   to, PAGE_CACHE_SIZE);
+   }
+   return 0;
+}
+EXPORT_SYMBOL(simple_prepare_write);
+
+int simple_write_begin(struct file *file, struct address_space *mapping,
+  

Re: [linux-cifs-client] review 5, was Re: projected date for mount.cifs to support DFS junction points

2008-02-18 Thread Steve French
The patch looks fine, but since it no longer sets obj_type, I
want to think about it a little more, since obj_type may be useful coming
back from the open path (although the mode is probably good enough).
jra added support to Samba for a new POSIX open/create/mkdir request
(which we only use for mkdir at the moment). Using this call for
open (when the server indicates support for it, as Samba does) would
cut the number of round trips to the server on these calls, as it does
now for mkdir.

On Feb 15, 2008 4:11 PM, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> If you like these kind of consolidation patches here's another one:
>
>
>



-- 
Thanks,

Steve


Question about synchronous write on SSD

2008-02-18 Thread Kyungmin Park
Hi,

Do you remember the topic "solid state drive access and context
switching" [1]?
I wanted to measure whether it really gives better performance on an SSD.

To write to the SSD synchronously, I hacked
'generic_make_request()' [2] and got the following results.

# echo 3 > /proc/sys/vm/drop_caches
# tiotest -f 100 -R -d /dev/sdd1
Tiotest results for 4 concurrent io threads:
| Write  400 MBs |    4.7 s |  84.306 MB/s |   8.4 %  |  77.5 % |
| Read   400 MBs |    4.3 s |  92.945 MB/s |   7.2 %  |  53.5 % |
Tiotest latency results:
| Write |    0.126 ms |  706.379 ms |  0.0 |   0.0 |
| Read  |    0.161 ms |  311.738 ms |  0.0 |   0.0 |

# echo 3 > /proc/sys/vm/drop_caches
# tiotest -f 1000 -R -d /dev/sdd1
Tiotest results for 4 concurrent io threads:
| Write 4000 MBs |   47.5 s |  84.124 MB/s |   7.0 %  |  83.6 % |
| Read  4000 MBs |   41.9 s |  95.530 MB/s |   7.8 %  |  55.6 % |
Tiotest latency results:
| Write |    0.176 ms |  714.677 ms |  0.0 |   0.0 |
| Read  |    0.161 ms |  311.815 ms |  0.0 |   0.0 |

However, the performance is the same as before, which means the patch has no effect.

Could you tell me whether this is the proper place to hack, or should it be done somewhere else?

Thank you,
Kyungmin Park

p.s. Of course, I got the following message:
WARNING: at block/blk-core.c:1351 generic_make_request+0x126/0x3d8()

1. http://lkml.org/lkml/2007/12/3/247
2. simple hack
diff --git a/block/blk-core.c b/block/blk-core.c
index e9754dc..7262720 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1345,6 +1345,14 @@ static inline void __generic_make_request(struct bio *bio)
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	/* FIXME simple hack by kmpark */
+	if (MINOR(bio->bi_bdev->bd_dev) == 49 &&
+	    MAJOR(bio->bi_bdev->bd_dev) == 8 && bio_data_dir(bio) == WRITE) {
+		WARN_ON_ONCE(1);
+		/* Write synchronous */
+		bio->bi_rw |= (1 << BIO_RW_SYNC);
+	}
+
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
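
For readers wondering where 8 and 49 come from: block major 8 is the SCSI disk ("sd") major, and each of the first sixteen disks gets 16 consecutive minors, so sdd (disk index 3) starts at minor 48 and /dev/sdd1 is minor 49. The following is only a user-space sketch of the same comparison; the helper names is_target_partition() and sd_minor() are made up for illustration and are not part of the hack.

```c
#include <stdbool.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Minor number of a partition on one of the first 16 sd disks
 * (sda..sdp) under major 8. disk_index is 0 for sda, 3 for sdd;
 * partition is 0 for the whole disk. Hypothetical helper.
 */
unsigned int sd_minor(unsigned int disk_index, unsigned int partition)
{
	return disk_index * 16 + partition;
}

/*
 * User-space analogue of the MAJOR()/MINOR() test in the hack:
 * true only for device 8:49, which is conventionally /dev/sdd1.
 */
bool is_target_partition(dev_t dev)
{
	return major(dev) == 8 && minor(dev) == sd_minor(3, 1);
}
```

In user space the dev_t would typically come from the st_rdev field filled in by stat("/dev/sdd1", ...). Hard-coding the numbers as the hack does only works as long as the test disk keeps the same name across boots.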


Re: [PATCH 0/8][for -mm] mem_notify v6

2008-02-18 Thread KOSAKI Motohiro
Hi Paul,

Thank you for the wonderful, interesting comments.
They are really helpful.

I used to be an HPC guy working with a large NUMA box.
I promise I won't ignore HPC users.
Unfortunately, I have no experience using CPUSET,
because at that time it was still under development.

I hope to discuss CPUSET use cases and mem_notify requirements with you.
To be honest, I thought HPC users would not use mem_notify, sorry.


> I have what seems, intuitively, a similar problem at the opposite
> end of the world, on big-honkin NUMA boxes (hundreds or thousands of
> CPUs, terabytes of main memory.)  The problem there is often best
> resolved if we can kill the offending task, rather than shrink its
> memory footprint.  The situation is that several compute intensive
> multi-threaded jobs are running, each in their own dedicated cpuset.

agreed.

> So we like to identify such jobs as soon as they begin to swap,
> and kill them very very quickly (before the direct reclaim code
> in mm/vmscan.c can push more than a few pages to the swap device.)

You mean killing the process just after it starts to swap, right?
Unfortunately, most users want to receive the notification before swapping starts ;-)
because they want to avoid swap altogether.

I think we need to discuss this point more.


> For a much earlier, unsuccessful, attempt to accomplish this, see:
> 
>   [Patch] cpusets policy kill no swap
>   http://lkml.org/lkml/2005/3/19/148
> 
> Now, it may well be that we are too far apart to share any part of
> a solution; one seldom uses the same technology to build a Tour de
> France bicycle as one uses to build a Lockheed C-5A Galaxy heavy
> cargo transport.
> 
> One clear difference is the policy of what action we desire to take
> when under memory pressure: do we invite user space to free memory so
> as to avoid the wrath of the oom killer, or do we go to the opposite
> extreme, seeking a nearly instant killing, faster than the oom
> killer can even begin its search for a victim.

Hmm, sorry.
I don't understand your patch yet, because I don't know CPUSET very well.

I'll learn more about CPUSET this week and reply again next week ;-)


> Another clear difference is the use of cpusets, which are a major and
> vital part of administering the big NUMA boxes, and I presume are not
> even compiled into embedded kernels (correct?).  This difference may be
> unbridgeable ... these big NUMA systems require per-cpuset mechanisms,
> whereas embedded may require builds without cpusets.

Yes, some embedded distributions (e.g. MontaVista) are distributed as source,
but embedded people strongly dislike code size bloat.
I think they will never turn on CPUSET.

I hope mem_notify works fine without CPUSET.


> 1) You have a little bit of code in the kernel to throttle the
>thundering herd problem.  Perhaps this could be moved to user space
>... one user daemon that is always notified of such memory pressure
>alarms, and in turn notifies interested applications.  This might
>avoid the need to add poll_wait_exclusive() to the kernel.  And it
>moves any fussy details of how to tame the thundering herd out of
>the kernel.

I think you are talking about a user-space OOM manager.
That is clearly different from notifying many user processes directly.

I doubt that the memory manager daemon model works on desktops and
typical servers.
Thus, the current implementation is optimized for an environment without a manager.

Of course, that doesn't mean I refuse to add code for an OOM manager.
It is a very interesting idea.

I hope to discuss it more.
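
To make the daemon model from point 1 concrete, here is a minimal user-space sketch: one process waits on a notification descriptor with poll() and then relays a one-byte token to each interested client. Everything specific here is an assumption for illustration: the function names, the token, and the idea that the notification source is a pollable descriptor that the caller has already opened. It is not code from the mem_notify patch set.

```c
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Wait until notify_fd becomes readable.
 * Returns 1 if it did, 0 on timeout, -1 on error.
 */
int wait_for_pressure(int notify_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Relay one notification by writing a single byte to each client
 * descriptor (for example one end of a pipe or socket per client).
 * Returns the number of clients that were notified successfully.
 */
size_t relay_to_clients(const int *client_fds, size_t nclients)
{
	const char token = 'P';	/* arbitrary marker byte */
	size_t notified = 0;

	for (size_t i = 0; i < nclients; i++)
		if (write(client_fds[i], &token, 1) == 1)
			notified++;
	return notified;
}
```

With this split, how many clients to wake and in which order is decided entirely in the daemon, which is the trade-off discussed above: the kernel stays simple, but every system then needs such a daemon running.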


> 2) Another possible mechanism for communicating events from
>the kernel to user space is inotify.  For example, I added
>the line:
> 
>   fsnotify_modify(dentry);   # dentry is current task's cpuset

Excellent!
That is a really good idea.

Thanks.


> 3) Perhaps, instead of sending simple events, one could update
>a meter of the rate of recent such events, such as the per-cpuset
>'memory_pressure' mechanism does.  This might lead to addressing
>Andrew Morton's comment:
> 
>   If this feature is useful then I'd expect that some
>   applications would want notification at different times, or at
>   different levels of VM distress.  So this semi-randomly-chosen
>   notification point just won't be strong enough in real-world
>   use.

Hmmm, I don't think so.
I think the timing of memory_pressure_notify(1) is already the best available.

A page moving from the active list to the inactive list indicates that swap I/O
will happen a bit later.

But memory_pressure_notify(0) is a bit messy.
I'll try to simplify it.
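
As an illustration of the "meter" idea in point 3, here is a small user-space sketch of an exponentially decaying event counter, loosely in the spirit of the per-cpuset memory_pressure value. The struct, the function names and the constants are assumptions chosen for this example (0.933 per second gives roughly a 10-second half-life); it is not taken from the mem_notify patches.

```c
#include <stdint.h>

#define METER_SCALE    1000u	/* fixed-point scale of val and pending */
#define METER_COEF      933u	/* per-second decay factor, 0.933 */
#define METER_MAXTICKS   99u	/* cap on decay steps per update */
#define METER_MAXCNT 1000000u	/* cap on pending, to avoid overflow */

struct event_meter {
	uint64_t last;		/* time in seconds of the last decay step */
	uint32_t pending;	/* events since then, scaled by METER_SCALE */
	uint32_t val;		/* decayed event rate, scaled by METER_SCALE */
};

/* Decay val for the seconds elapsed since m->last, then fold in pending. */
void meter_update(struct event_meter *m, uint64_t now)
{
	uint64_t ticks = now - m->last;

	if (ticks == 0)
		return;
	if (ticks > METER_MAXTICKS)
		ticks = METER_MAXTICKS;
	while (ticks-- > 0)
		m->val = (uint32_t)((uint64_t)METER_COEF * m->val / METER_SCALE);
	m->last = now;
	m->val += (uint32_t)((uint64_t)(METER_SCALE - METER_COEF) *
			     m->pending / METER_SCALE);
	m->pending = 0;
}

/* Record one memory-pressure event at time now (in seconds). */
void meter_mark(struct event_meter *m, uint64_t now)
{
	meter_update(m, now);
	if (m->pending < METER_MAXCNT)
		m->pending += METER_SCALE;
}

/* Read the current meter value at time now (in seconds). */
uint32_t meter_read(struct event_meter *m, uint64_t now)
{
	meter_update(m, now);
	return m->val;
}
```

An application could then compare meter_read() against its own threshold, so different applications can react at different levels of VM distress instead of all waking at one fixed notification point.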


> 4) A place that I found well suited for my purposes (watching for
>swapping from direct reclaim) was just before the lines in the
>pageout() routine in mm/vmscan.c:
> 
>   if (clear_page_dirty_for_io(page)) {
>   ...
>   res = mapping->a_ops->writepage(page, &wbc);
> 
>It seemed that testing "PageAnon(page)" here allowed me to easily
>distinguish between dirty pages going back to the file system, and
>pages going to swap (this detail is