Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)
> > > > However David and Christoph are beavering away on the r-o-bind-mounts
> > > > patches and I expect that there will be overlaps with unprivileged
> > > > mounts.
> > > >
> > > > Could we coordinate things a bit please? Decide who goes first, review
> > > > and maybe even test each other's work, etc?
> > >
> > > Al is setting up a git tree for VFS work. per-mount r/o will go in
> > > as one of the first things, as well as his rework of the path lookup
> > > logic to fix the intents mess.
> > >
> >
> > That didn't answer my question..

Well, Al as the de facto VFS maintainer will decide on the ordering.

I think we agreed that r-o-bind mounts are more important, so they
should go first. They have also received more attention.

OTOH there isn't really any fundamental conflict between the two
patchsets, so going in together (if the ro-bind patches miss 2.6.25)
should also be possible.

> Reviewing this stuff properly is still on my todo list, but currently
> I'm busy with more important things.

So what should I do? Would Al be wanting to merge this into his VFS
tree? (Can't find it on git.kernel.org yet, BTW.)

I can set up a git tree for these patches if that makes things easier.

Or should I just wait and resubmit after every kernel release, hoping
that it becomes _the_ most important thing on Christoph's list ;)

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
very poor ext3 write performance on big filesystems?
I have a 1.2 TB filesystem (of which 750 GB is used) which holds
almost 200 million files.
1.2 TB doesn't make this filesystem that big, but 200 million files
is a decent number.

Most of the files are hardlinked multiple times, some of them are
hardlinked thousands of times.

Recently I began removing some of the unneeded files (or hardlinks) and,
to my surprise, it takes longer than I initially expected.

After the cache is emptied (echo 3 > /proc/sys/vm/drop_caches) I can
usually remove about 5-20 files with moderate performance. I see up to
5000 kB read/write from/to the disk, wa reported by top is usually 20-70%.

After that, waiting for IO grows to 99%, and disk write speed is down to
50 kB/s - 200 kB/s (fifty - two hundred kilobytes/s).

Is it normal for the write speed to go down to only a few dozen
kilobytes/s? Is it because of that many seeks? Can it be somehow
optimized? The machine has loads of free memory, perhaps it could be
used better?

Also, writing big files is very slow - it takes more than 4 minutes to
write and sync a 655 MB file (so, a little bit more than 1 MB/s) -
fragmentation perhaps?

+ dd if=/dev/zero of=testfile bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 3,12109 seconds, 210 MB/s
+ sync
0.00user 2.14system 4:06.76elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+883minor)pagefaults 0swaps

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda        1,2T  697G  452G  61% /mnt/iscsi_backup

# df -i
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sda         154M   20M  134M   13% /mnt/iscsi_backup

--
Tomasz Chmielewski
http://wpkg.org
Re: very poor ext3 write performance on big filesystems?
Tomasz Chmielewski <[EMAIL PROTECTED]> writes:
>
> Is it normal to expect the write speed go down to only few dozens of
> kilobytes/s? Is it because of that many seeks? Can it be somehow
> optimized?

I have similar problems on my linux source partition which also
has a lot of hard linked files (although probably not quite
as many as you do). It seems like hard linking prevents
some of the heuristics ext* uses to generate non fragmented
disk layouts and the resulting seeking makes things slow.

What has helped a bit was to recreate the file system with -O^dir_index
dir_index seems to cause more seeks.

Keeping enough free space is also a good idea because that
allows the file system code better choices on where to place data.

-Andi
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 03:03:44PM +0100, Andi Kleen wrote:
> Tomasz Chmielewski <[EMAIL PROTECTED]> writes:
> >
> > Is it normal to expect the write speed go down to only few dozens of
> > kilobytes/s? Is it because of that many seeks? Can it be somehow
> > optimized?
>
> I have similar problems on my linux source partition which also
> has a lot of hard linked files (although probably not quite
> as many as you do). It seems like hard linking prevents
> some of the heuristics ext* uses to generate non fragmented
> disk layouts and the resulting seeking makes things slow.

ext3 tries to keep inodes in the same block group as their containing
directory. If you have lots of hard links, obviously it can't really
do that, especially since we don't have a good way at mkdir time to
tell the filesystem, "Psst! This is going to be a hard link clone of
that directory over there, put it in the same block group".

> What has helped a bit was to recreate the file system with -O^dir_index
> dir_index seems to cause more seeks.

Part of it may have simply been recreating the filesystem, not
necessarily removing the dir_index feature. Dir_index speeds up
individual lookups, but it slows down workloads that do a readdir
followed by a stat of all of the files in the directory. You can work
around this by calling readdir(), sorting all of the entries by inode
number, and then calling open or stat or whatever. So this can help
out for workloads that are doing find or rm -r on a dir_index
filesystem. Basically, it helps for some things, hurts for others.
Once things are in the cache it doesn't matter of course.

The following LD_PRELOAD library can help in some cases. Mutt has this
hack encoded in for maildir directories, which helps.

> Also keeping enough free space is also a good idea because that
> allows the file system code better choices on where to place data.

Yep, that too.

						- Ted

/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -shared spd_readdir.c -ldl
 *
 * Use it by setting the LD_PRELOAD environment variable:
 *
 * export LD_PRELOAD=/usr/local/sbin/spd_readdir.so
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 *
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

/*
 * The header names were lost in the archive; this list is reconstructed
 * from the calls used in the code that follows.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <errno.h>
#include <dlfcn.h>

/* One cached directory entry; d_name is a heap copy of the name. */
struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

/* Per-open-directory state: the real DIR plus the sorted entry cache. */
struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	fd;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

/* Pointers to the C library's real implementations, resolved lazily. */
static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;
#ifdef DEBUG
static int do_debug = 0;
#endif

static void setup_ptr()
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG"))
		do_debug++;
#endif
}

static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i=0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
}

/* Sort by inode number, but always keep "." and ".." first. */
static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;

	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

	/* Compare rather than subtract: ino_t can be wider than int. */
	return (i_a > i_b) - (i_a < i_b);
}

DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s *dirstruct;
	struct dirent_s *ds, *dnew;
	struct dirent64 *d;
	struct stat st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct = malloc(sizeof(struct dir_s)
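The workaround described in the message above (call readdir(), sort the
entries by inode number, then stat each one) can also be done directly in
an application, without LD_PRELOAD. The following is only a minimal sketch
under stated assumptions: the names struct ent, read_sorted(), free_ents()
and stat_in_inode_order() are made up for this illustration and are not
part of spd_readdir.c or of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper type: one directory entry as seen by readdir(). */
struct ent {
	ino_t ino;	/* inode number reported by readdir() */
	char *name;	/* heap copy of d_name */
};

static int ent_cmp(const void *a, const void *b)
{
	const struct ent *ea = a, *eb = b;

	/* Compare instead of subtracting: ino_t may be wider than int. */
	return (ea->ino > eb->ino) - (ea->ino < eb->ino);
}

void free_ents(struct ent *ents, size_t n)
{
	while (n--)
		free(ents[n].name);
	free(ents);
}

/*
 * Collect the entries of path (minus "." and ".."), sorted by inode
 * number. Returns 0 on success and -1 on error.
 */
int read_sorted(const char *path, struct ent **out, size_t *count)
{
	struct ent *ents = NULL, *tmp;
	size_t n = 0, max = 0;
	struct dirent *de;
	DIR *dir;

	assert(path && out && count);
	dir = opendir(path);
	if (!dir)
		return -1;
	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		if (n == max) {
			max = max ? 2 * max : 64;
			tmp = realloc(ents, max * sizeof(*ents));
			if (!tmp)
				goto fail;
			ents = tmp;
		}
		ents[n].name = strdup(de->d_name);
		if (!ents[n].name)
			goto fail;
		ents[n].ino = de->d_ino;
		n++;
	}
	closedir(dir);
	if (n > 1)
		qsort(ents, n, sizeof(*ents), ent_cmp);
	*out = ents;
	*count = n;
	return 0;
fail:
	closedir(dir);
	free_ents(ents, n);
	return -1;
}

/* lstat() every entry in inode order; returns how many succeeded, or -1. */
long stat_in_inode_order(const char *path)
{
	struct ent *ents;
	struct stat st;
	char buf[4096];
	size_t n, i;
	long ok = 0;

	if (read_sorted(path, &ents, &n) < 0)
		return -1;
	for (i = 0; i < n; i++) {
		snprintf(buf, sizeof(buf), "%s/%s", path, ents[i].name);
		if (lstat(buf, &st) == 0)
			ok++;
	}
	free_ents(ents, n);
	return ok;
}
```

A caller would replace the lstat() with whatever per-file work it needs
(open, unlink, ...). The point is only that the files are visited in
inode order rather than in the hash order a dir_index directory returns.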
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> ext3 tries to keep inodes in the same block group as their containing
> directory. If you have lots of hard links, obviously it can't really
> do that, especially since we don't have a good way at mkdir time to
> tell the filesystem, "Psst! This is going to be a hard link clone of
> that directory over there, put it in the same block group".

Hmm, you think such a hint interface would be worth it?

>
> > What has helped a bit was to recreate the file system with -O^dir_index
> > dir_index seems to cause more seeks.
>
> Part of it may have simply been recreating the filesystem, not

Undoubtedly.

> necessarily removing the dir_index feature. Dir_index speeds up
> individual lookups, but it slows down workloads that do a readdir

But only for large directories right? For kernel source like
directory sizes it seems to be a general loss.

-Andi
Re: very poor ext3 write performance on big filesystems?
Theodore Tso wrote:

(...)

>> What has helped a bit was to recreate the file system with -O^dir_index
>> dir_index seems to cause more seeks.
>
> Part of it may have simply been recreating the filesystem, not
> necessarily removing the dir_index feature.

You mean, copy data somewhere else, mkfs a new filesystem, and copy data
back?

Unfortunately, doing it on a file level is not possible within a
reasonable amount of time.

I tried to copy that filesystem once (when it was much smaller) with
"rsync -a -H", but after 3 days, rsync was still building an index and
hadn't copied any files.

Also, as files/hardlinks come and go, it would degrade again.

Are there better choices than ext3 for a filesystem with lots of
hardlinks? ext4, once it's ready? xfs?

--
Tomasz Chmielewski
http://wpkg.org
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> I tried to copy that filesystem once (when it was much smaller) with "rsync
> -a -H", but after 3 days, rsync was still building an index and didn't copy
> any file.

If you're going to copy the whole filesystem don't use rsync! Use cp
or a tar pipeline to move the files.

> Also, as files/hardlinks come and go, it would degrade again.

Yes...

> Are there better choices than ext3 for a filesystem with lots of hardlinks?
> ext4, once it's ready? xfs?

All filesystems are going to have problems keeping inodes close to
directories when you have huge numbers of hard links.

I'd really need to know exactly what kind of operations you were
trying to do that were causing problems before I could say for sure.
Yes, you said you were removing unneeded files, but how were you doing
it? With rm -r of old hard-linked directories? How big are the
average files involved? Etc.

						- Ted
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 04:18:23PM +0100, Andi Kleen wrote:
> On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> > ext3 tries to keep inodes in the same block group as their containing
> > directory. If you have lots of hard links, obviously it can't really
> > do that, especially since we don't have a good way at mkdir time to
> > tell the filesystem, "Psst! This is going to be a hard link clone of
> > that directory over there, put it in the same block group".
>
> Hmm, you think such a hint interface would be worth it?

It would definitely help ext2/3/4. An interesting question is whether
it would help enough other filesystems that it's worth adding.

> > necessarily removing the dir_index feature. Dir_index speeds up
> > individual lookups, but it slows down workloads that do a readdir
>
> But only for large directories right? For kernel source like
> directory sizes it seems to be a general loss.

On my todo list is a hack which does the sorting of directory inodes
by inode number inside the kernel for smallish directories (say, less
than 2-3 blocks) where using the kernel memory space to store the
directory entries is acceptable, and which would speed up dir_index
performance for kernel source-like directory sizes --- without needing
to use the spd_readdir LD_PRELOAD hack.

But yes, right now, if you know that your directories are almost
always going to be kernel source like in size, then omitting dir_index
is probably going to be a good idea.

						- Ted
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote:
> > Use cp
> > or a tar pipeline to move the files.
>
> Are you sure cp handles hardlinks correctly? I know tar does,
> but I have my doubts about cp.

I *think* GNU cp does the right thing with --preserve=links. I'm not
100% sure, though --- like you, probably, I always use tar for moving
or copying directory hierarchies.

						- Ted
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 10:16:32AM -0500, Theodore Tso wrote:
> On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> > I tried to copy that filesystem once (when it was much smaller) with "rsync
> > -a -H", but after 3 days, rsync was still building an index and didn't copy
> > any file.
>
> If you're going to copy the whole filesystem don't use rsync!

Yes, I managed to kill systems (drive them really badly into oom and
get very long swap storms) with rsync -H in the past too. Something is
very wrong with the rsync implementation of this.

> Use cp
> or a tar pipeline to move the files.

Are you sure cp handles hardlinks correctly? I know tar does,
but I have my doubts about cp.

-Andi
Re: very poor ext3 write performance on big filesystems?
Theodore Tso wrote:

>> Are there better choices than ext3 for a filesystem with lots of hardlinks?
>> ext4, once it's ready? xfs?
>
> All filesystems are going to have problems keeping inodes close to
> directories when you have huge numbers of hard links.
>
> I'd really need to know exactly what kind of operations you were
> trying to do that were causing problems before I could say for sure.
> Yes, you said you were removing unneeded files, but how were you doing
> it? With rm -r of old hard-linked directories?

Yes, with rm -r.

> How big are the
> average files involved? Etc.

It's hard to estimate the average size of a file. I'd say there are not
many files bigger than 50 MB.

Basically, it's a filesystem where backups are kept. Backups are made
with BackupPC [1].

Imagine a full rootfs backup of 100 Linux systems.

Instead of compressing and writing "/bin/bash" 100 times for each
separate system, we do it once, and hardlink. Then, keep 40 copies back,
and you have 4000 hardlinks.

For individual or user files, the number of hardlinks will be smaller of
course.

The directories I want to remove usually have the structure of a "normal"
Linux rootfs, nothing special there (other than that most of the files
will have multiple hardlinks).

I noticed using write back helps a tiny bit, but as dm and md don't
support write barriers, I'm not very eager to use it.

[1] http://backuppc.sf.net
http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues

--
Tomasz Chmielewski
http://wpkg.org
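To make the pooling arithmetic above concrete: each backup that contains
an unchanged file adds one more directory entry for the same inode, so
100 systems times 40 retained backups gives 4000 names for a single
pooled "/bin/bash". The following is only a sketch of that idea and not
BackupPC's actual code; pool_link() and the copyN file names are
hypothetical and exist only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create "copies" hard links to pool_file inside dir, named copy0,
 * copy1, ... (hypothetical naming). Returns the resulting link count
 * of the pooled file, or -1 on error.
 */
long pool_link(const char *pool_file, const char *dir, int copies)
{
	char name[4096];
	struct stat st;
	int i;

	assert(pool_file && dir);
	for (i = 0; i < copies; i++) {
		snprintf(name, sizeof(name), "%s/copy%d", dir, i);
		/* No data is copied: link() only adds a directory entry. */
		if (link(pool_file, name) < 0)
			return -1;
	}
	if (stat(pool_file, &st) < 0)
		return -1;
	return (long)st.st_nlink;
}
```

Every one of those names lives in a different directory, while there is
only one inode, which is why the filesystem cannot keep the inode close
to all of its containing directories.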
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 05:16:55PM +0100, Tomasz Chmielewski wrote:
> Theodore Tso wrote:
>
>> I'd really need to know exactly what kind of operations you were
>> trying to do that were causing problems before I could say for sure.
>> Yes, you said you were removing unneeded files, but how were you doing
>> it? With rm -r of old hard-linked directories?
>
> Yes, with rm -r.

You should definitely try the spd_readdir hack; that will help reduce
the seek times. This will probably help on any block group oriented
filesystems, including XFS, etc.

>> How big are the
>> average files involved? Etc.
>
> It's hard to estimate the average size of a file. I'd say there are not
> many files bigger than 50 MB.

Well, Ext4 will help for files bigger than 48k.

The other thing that might help for you is using an external journal
on a separate hard drive (either for ext3 or ext4). That will help
alleviate some of the seek storms going on, since the journal is
written to only sequentially, and putting it on a separate hard drive
will help remove some of the contention on the hard drive.

I assume that your 1.2 TB filesystem is located on a RAID array; did
you use the mke2fs -E stride option to make sure all of the bitmaps
don't get concentrated on one hard drive spindle? One of the failure
modes which can happen is if you use a 4+1 raid 5 setup, that all of
the block and inode bitmaps can end up getting laid out on a single
hard drive, so it becomes a bottleneck for bitmap intensive workloads
--- including "rm -rf". So that's another thing that might be going
on. If you do a "dumpe2fs", and look at the block numbers for the
block and inode allocation bitmaps, and you find that they are all
landing on the same physical hard drive, then that's very clearly the
biggest problem given an "rm -rf" workload. You should be able to see
this as well visually; if one hard drive has its hard drive light
almost constantly on, and the other ones don't have much activity,
that's probably what is happening.

						- Ted
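The bitmap concentration described above comes down to modular
arithmetic on block numbers. The sketch below is a deliberately
simplified model and not how md or any RAID controller actually maps
blocks: it treats the array as a plain stripe over the data disks and
ignores RAID 5 parity rotation, so it is closer to RAID 0 or RAID 4.
The function names and the example figures (4 KiB blocks, 32768 blocks
per group, a 64 KiB chunk, bitmaps at the start of each group) are
assumptions chosen for illustration.

```c
#include <assert.h>

/*
 * Index of the data disk holding file system block blk, in a plain
 * stripe with chunk_blocks blocks per chunk (simplified model).
 */
unsigned data_disk_of(unsigned long long blk, unsigned chunk_blocks,
		      unsigned data_disks)
{
	assert(chunk_blocks > 0 && data_disks > 0);
	return (unsigned)((blk / chunk_blocks) % data_disks);
}

/*
 * Of the first "groups" block groups, count how many begin on data
 * disk 0. If a group's bitmaps sit at its first blocks, this is the
 * number of groups whose bitmaps share that one disk.
 */
unsigned groups_on_disk0(unsigned groups,
			 unsigned long long blocks_per_group,
			 unsigned chunk_blocks, unsigned data_disks)
{
	unsigned g, hits = 0;

	for (g = 0; g < groups; g++)
		if (data_disk_of(g * blocks_per_group, chunk_blocks,
				 data_disks) == 0)
			hits++;
	return hits;
}
```

With 4 data disks and a 16-block chunk, 32768 is a multiple of the
64-block stripe, so in this model every group starts on disk 0. With 3
data disks only every third group does. Offsetting the bitmaps per
group so that they no longer line up with the stripe is what the
stride option is meant to achieve.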
[RFC 02/11] introduce simple_fs_type
There are a number of pseudo file systems in the kernel
that are basically copies of debugfs, all implementing the
same boilerplate code, just with different bugs.

This adds yet another copy to the kernel in the libfs directory,
with generalized helpers that can be used by any of them.

The most interesting function here is the new
"struct dentry *simple_register_filesystem(struct simple_fs_type *type)",
which returns the root directory of a new file system that can
then be passed to simple_create_file() and similar functions
as a parent.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -263,11 +263,6 @@ int simple_link(struct dentry *old_dentr
 	return 0;
 }
 
-static inline int simple_positive(struct dentry *dentry)
-{
-	return dentry->d_inode && !d_unhashed(dentry);
-}
-
 int simple_empty(struct dentry *dentry)
 {
 	struct dentry *child;
@@ -409,109 +404,6 @@ int simple_write_end(struct file *file,
 	return copied;
 }
 
-/*
- * the inodes created here are not hashed. If you use iunique to generate
- * unique inode values later for this filesystem, then you must take care
- * to pass it an appropriate max_reserved value to avoid collisions.
- */
-int simple_fill_super(struct super_block *s, int magic, struct tree_descr *files)
-{
-	struct inode *inode;
-	struct dentry *root;
-	struct dentry *dentry;
-	int i;
-
-	s->s_blocksize = PAGE_CACHE_SIZE;
-	s->s_blocksize_bits = PAGE_CACHE_SHIFT;
-	s->s_magic = magic;
-	s->s_op = &simple_super_operations;
-	s->s_time_gran = 1;
-
-	inode = new_inode(s);
-	if (!inode)
-		return -ENOMEM;
-	/*
-	 * because the root inode is 1, the files array must not contain an
-	 * entry at index 1
-	 */
-	inode->i_ino = 1;
-	inode->i_mode = S_IFDIR | 0755;
-	inode->i_uid = inode->i_gid = 0;
-	inode->i_blocks = 0;
-	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-	inode->i_op = &simple_dir_inode_operations;
-	inode->i_fop = &simple_dir_operations;
-	inode->i_nlink = 2;
-	root = d_alloc_root(inode);
-	if (!root) {
-		iput(inode);
-		return -ENOMEM;
-	}
-	for (i = 0; !files->name || files->name[0]; i++, files++) {
-		if (!files->name)
-			continue;
-
-		/* warn if it tries to conflict with the root inode */
-		if (unlikely(i == 1))
-			printk(KERN_WARNING "%s: %s passed in a files array"
-				"with an index of 1!\n", __func__,
-				s->s_type->name);
-
-		dentry = d_alloc_name(root, files->name);
-		if (!dentry)
-			goto out;
-		inode = new_inode(s);
-		if (!inode)
-			goto out;
-		inode->i_mode = S_IFREG | files->mode;
-		inode->i_uid = inode->i_gid = 0;
-		inode->i_blocks = 0;
-		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		inode->i_fop = files->ops;
-		inode->i_ino = i;
-		d_add(dentry, inode);
-	}
-	s->s_root = root;
-	return 0;
-out:
-	d_genocide(root);
-	dput(root);
-	return -ENOMEM;
-}
-
-static DEFINE_SPINLOCK(pin_fs_lock);
-
-int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
-{
-	struct vfsmount *mnt = NULL;
-	spin_lock(&pin_fs_lock);
-	if (unlikely(!*mount)) {
-		spin_unlock(&pin_fs_lock);
-		mnt = vfs_kern_mount(type, 0, type->name, NULL);
-		if (IS_ERR(mnt))
-			return PTR_ERR(mnt);
-		spin_lock(&pin_fs_lock);
-		if (!*mount)
-			*mount = mnt;
-	}
-	mntget(*mount);
-	++*count;
-	spin_unlock(&pin_fs_lock);
-	mntput(mnt);
-	return 0;
-}
-
-void simple_release_fs(struct vfsmount **mount, int *count)
-{
-	struct vfsmount *mnt;
-	spin_lock(&pin_fs_lock);
-	mnt = *mount;
-	if (!--*count)
-		*mount = NULL;
-	spin_unlock(&pin_fs_lock);
-	mntput(mnt);
-}
-
 ssize_t simple_read_from_buffer(void __user *to, size_t count, loff_t *ppos,
 				const void *from, size_t available)
 {
@@ -786,14 +678,11 @@ EXPORT_SYMBOL(simple_dir_inode_operation
 EXPORT_SYMBOL(simple_dir_operations);
 EXPORT_SYMBOL(simple_empty);
 EXPORT_SYMBOL(d_alloc_name);
-EXPORT_SYMBOL(simple_fill_super);
 EXPORT_SYMBOL(simple_getattr);
 EXPORT_SYMBOL(simple_link);
 EXPORT_SYMBOL(simple_lookup);
-EXPORT_SYMBOL(simple_pin_fs);
 EXPORT_SYMBOL(simple_prepare_write);
 EXPORT_SYMBOL(simple_readpag
[RFC 06/11] split out linux/libfs.h from linux/fs.h
With libfs turning into a larger subsystem, it makes sense to have
a separate header that is not included by the low-level vfs code.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/fs/debugfs/inode.c
===================================================================
--- linux-2.6.orig/fs/debugfs/inode.c
+++ linux-2.6/fs/debugfs/inode.c
@@ -18,6 +18,7 @@
 
 #include
 #include
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -947,6 +947,7 @@ struct dentry *d_alloc_name(struct dentr
 	q.hash = full_name_hash(q.name, q.len);
 	return d_alloc(parent, &q);
 }
+EXPORT_SYMBOL(d_alloc_name);
 
 /**
  * d_instantiate - fill in inode information for a dentry
Index: linux-2.6/drivers/usb/core/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -27,6 +27,7 @@
 
 /*****************************************************************************/
 
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/fs/binfmt_misc.c
===================================================================
--- linux-2.6.orig/fs/binfmt_misc.c
+++ linux-2.6/fs/binfmt_misc.c
@@ -16,6 +16,7 @@
  * 2001-02-28 AV: rewritten into something that resembles C. Original didn't.
  */
 
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/fs/configfs/mount.c
===================================================================
--- linux-2.6.orig/fs/configfs/mount.c
+++ linux-2.6/fs/configfs/mount.c
@@ -25,6 +25,7 @@
  */
 
 #include
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/fs/debugfs/file.c
===================================================================
--- linux-2.6.orig/fs/debugfs/file.c
+++ linux-2.6/fs/debugfs/file.c
@@ -15,6 +15,7 @@
 
 #include
 #include
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/fs/fuse/control.c
===================================================================
--- linux-2.6.orig/fs/fuse/control.c
+++ linux-2.6/fs/fuse/control.c
@@ -9,6 +9,7 @@
 #include "fuse_i.h"
 
 #include
+#include <linux/libfs.h>
 #include
 
 #define FUSE_CTL_SUPER_MAGIC 0x65735543
Index: linux-2.6/fs/nfsd/nfsctl.c
===================================================================
--- linux-2.6.orig/fs/nfsd/nfsctl.c
+++ linux-2.6/fs/nfsd/nfsctl.c
@@ -8,6 +8,7 @@
 
 #include
 
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/net/sunrpc/rpc_pipe.c
===================================================================
--- linux-2.6.orig/net/sunrpc/rpc_pipe.c
+++ linux-2.6/net/sunrpc/rpc_pipe.c
@@ -8,6 +8,7 @@
  * Copyright (c) 2002, Trond Myklebust <[EMAIL PROTECTED]>
  *
  */
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/security/inode.c
===================================================================
--- linux-2.6.orig/security/inode.c
+++ linux-2.6/security/inode.c
@@ -16,6 +16,7 @@
 
 #include
 #include
+#include <linux/libfs.h>
 #include
 
 #define SECURITYFS_MAGIC 0x73636673
Index: linux-2.6/security/selinux/selinuxfs.c
===================================================================
--- linux-2.6.orig/security/selinux/selinuxfs.c
+++ linux-2.6/security/selinux/selinuxfs.c
@@ -14,6 +14,7 @@
  * the Free Software Foundation, version 2.
  */
 
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/virt/kvm/kvm_main.c
===================================================================
--- linux-2.6.orig/virt/kvm/kvm_main.c
+++ linux-2.6/virt/kvm/kvm_main.c
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include <linux/libfs.h>
 
 #include
 #include
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/spufs.h
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/spufs.h
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/spufs.h
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include <linux/libfs.h>
 
 #include
 #include
Index: linux-2.6/fs/autofs4/autofs_i.h
===================================================================
--- linux-2.6.orig/fs/autofs4/autofs_i.h
+++ linux-2.6/fs/autofs4/autofs_i.h
@@ -22,6 +22,7 @@
 #define AUTOFS_IOC_COUNT 32
 
 #include
+#include <linux/libfs.h>
 #include
 #include
 #include
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1957,12 +1957,7 @@ extern struct dentry *simple_lookup(stru
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
 extern const struct file_operations simple_dir_operations;
 extern const struct inode_operations simple_dir_inode_operations;
-struct tree_descr { char *name; const struct file_operations *ops; int mode; };
 struct dentry *d_alloc_name(struct dentry *, const char *);
-extern int simple_fill_super(struct super_block *, int, cons
[RFC 04/11] slim down securityfs
With the new simple_fs_type in place, securityfs practically
becomes a nop and we just need to leave code around to manage
its mount point.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/security/inode.c
===================================================================
--- linux-2.6.orig/security/inode.c
+++ linux-2.6/security/inode.c
@@ -13,176 +13,14 @@
  */
 
 /* #define DEBUG */
+
 #include
-#include
-#include
-#include
 #include
-#include
 #include
 
 #define SECURITYFS_MAGIC 0x73636673
 
-static struct vfsmount *mount;
-static int mount_count;
-
-/*
- * TODO:
- * I think I can get rid of these default_file_ops, but not quite sure...
- */
-static ssize_t default_read_file(struct file *file, char __user *buf,
-				 size_t count, loff_t *ppos)
-{
-	return 0;
-}
-
-static ssize_t default_write_file(struct file *file, const char __user *buf,
-				   size_t count, loff_t *ppos)
-{
-	return count;
-}
-
-static int default_open(struct inode *inode, struct file *file)
-{
-	if (inode->i_private)
-		file->private_data = inode->i_private;
-
-	return 0;
-}
-
-static const struct file_operations default_file_ops = {
-	.read =		default_read_file,
-	.write =	default_write_file,
-	.open =		default_open,
-};
-
-static struct inode *get_inode(struct super_block *sb, int mode, dev_t dev)
-{
-	struct inode *inode = new_inode(sb);
-
-	if (inode) {
-		inode->i_mode = mode;
-		inode->i_uid = 0;
-		inode->i_gid = 0;
-		inode->i_blocks = 0;
-		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		switch (mode & S_IFMT) {
-		default:
-			init_special_inode(inode, mode, dev);
-			break;
-		case S_IFREG:
-			inode->i_fop = &default_file_ops;
-			break;
-		case S_IFDIR:
-			inode->i_op = &simple_dir_inode_operations;
-			inode->i_fop = &simple_dir_operations;
-
-			/* directory inodes start off with i_nlink == 2 (for "." entry) */
-			inc_nlink(inode);
-			break;
-		}
-	}
-	return inode;
-}
-
-/* SMP-safe */
-static int mknod(struct inode *dir, struct dentry *dentry,
-			 int mode, dev_t dev)
-{
-	struct inode *inode;
-	int error = -EPERM;
-
-	if (dentry->d_inode)
-		return -EEXIST;
-
-	inode = get_inode(dir->i_sb, mode, dev);
-	if (inode) {
-		d_instantiate(dentry, inode);
-		dget(dentry);
-		error = 0;
-	}
-	return error;
-}
-
-static int mkdir(struct inode *dir, struct dentry *dentry, int mode)
-{
-	int res;
-
-	mode = (mode & (S_IRWXUGO | S_ISVTX)) | S_IFDIR;
-	res = mknod(dir, dentry, mode, 0);
-	if (!res)
-		inc_nlink(dir);
-	return res;
-}
-
-static int create(struct inode *dir, struct dentry *dentry, int mode)
-{
-	mode = (mode & S_IALLUGO) | S_IFREG;
-	return mknod(dir, dentry, mode, 0);
-}
-
-static inline int positive(struct dentry *dentry)
-{
-	return dentry->d_inode && !d_unhashed(dentry);
-}
-
-static int fill_super(struct super_block *sb, void *data, int silent)
-{
-	static struct tree_descr files[] = {{""}};
-
-	return simple_fill_super(sb, SECURITYFS_MAGIC, files);
-}
-
-static int get_sb(struct file_system_type *fs_type,
-		  int flags, const char *dev_name,
-		  void *data, struct vfsmount *mnt)
-{
-	return get_sb_single(fs_type, flags, data, fill_super, mnt);
-}
-
-static struct file_system_type fs_type = {
-	.owner =	THIS_MODULE,
-	.name =		"securityfs",
-	.get_sb =	get_sb,
-	.kill_sb =	kill_litter_super,
-};
-
-static int create_by_name(const char *name, mode_t mode,
-			  struct dentry *parent,
-			  struct dentry **dentry)
-{
-	int error = 0;
-
-	*dentry = NULL;
-
-	/* If the parent is not specified, we create it in the root.
-	 * We need the root dentry to do this, which is in the super
-	 * block. A pointer to that is in the struct vfsmount that we
-	 * have around.
-	 */
-	if (!parent ) {
-		if (mount && mount->mnt_sb) {
-			parent = mount->mnt_sb->s_root;
-		}
-	}
-	if (!parent) {
-		pr_debug("securityfs: Ah! can not find a parent!\n");
-		return -EFAULT;
-	}
-
-	mutex_lock(&parent->d_inode->i_mutex);
-	*dentry = lookup_one_len(name, parent, strlen(name));
-	if (!IS_ERR(dentry)) {
-		if ((mode & S_IFMT) == S_IFDIR)
-
[RFC 05/11] slim down usbfs
Half of the usbfs code is the same as debugfs, so we can
replace it now with calls to the generic libfs versions.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/drivers/usb/core/inode.c
===================================================================
--- linux-2.6.orig/drivers/usb/core/inode.c
+++ linux-2.6/drivers/usb/core/inode.c
@@ -47,11 +47,10 @@
 #define USBFS_DEFAULT_BUSMODE (S_IXUGO | S_IRUGO)
 #define USBFS_DEFAULT_LISTMODE S_IRUGO
 
-static struct super_operations usbfs_ops;
-static const struct file_operations default_file_operations;
-static struct vfsmount *usbfs_mount;
-static int usbfs_mount_count;	/* = 0 */
-static int ignore_mount = 0;
+static DEFINE_SIMPLE_FS(usb_fs_type, "usbfs", NULL, USBDEVICE_SUPER_MAGIC);
+static struct dentry *usbfs_root;
+
+static int ignore_mount = 1;
 
 static struct dentry *devices_usbfs_dentry;
 static int num_buses;	/* = 0 */
@@ -263,186 +262,11 @@ static int remount(struct super_block *s
 		return -EINVAL;
 	}
 
-	if (usbfs_mount && usbfs_mount->mnt_sb)
-		update_sb(usbfs_mount->mnt_sb);
-
-	return 0;
-}
-
-static struct inode *usbfs_get_inode (struct super_block *sb, int mode, dev_t dev)
-{
-	struct inode *inode = new_inode(sb);
-
-	if (inode) {
-		inode->i_mode = mode;
-		inode->i_uid = current->fsuid;
-		inode->i_gid = current->fsgid;
-		inode->i_blocks = 0;
-		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		switch (mode & S_IFMT) {
-		default:
-			init_special_inode(inode, mode, dev);
-			break;
-		case S_IFREG:
-			inode->i_fop = &default_file_operations;
-			break;
-		case S_IFDIR:
-			inode->i_op = &simple_dir_inode_operations;
-			inode->i_fop = &simple_dir_operations;
-
-			/* directory inodes start off with i_nlink == 2 (for "." entry) */
-			inc_nlink(inode);
-			break;
-		}
-	}
-	return inode;
-}
-
-/* SMP-safe */
-static int usbfs_mknod (struct inode *dir, struct dentry *dentry, int mode,
-			dev_t dev)
-{
-	struct inode *inode = usbfs_get_inode(dir->i_sb, mode, dev);
-	int error = -EPERM;
-
-	if (dentry->d_inode)
-		return -EEXIST;
-
-	if (inode) {
-		d_instantiate(dentry, inode);
-		dget(dentry);
-		error = 0;
-	}
-	return error;
-}
-
-static int usbfs_mkdir (struct inode *dir, struct dentry *dentry, int mode)
-{
-	int res;
-
-	mode = (mode & (S_IRWXUGO | S_ISVTX)) | S_IFDIR;
-	res = usbfs_mknod (dir, dentry, mode, 0);
-	if (!res)
-		inc_nlink(dir);
-	return res;
-}
-
-static int usbfs_create (struct inode *dir, struct dentry *dentry, int mode)
-{
-	mode = (mode & S_IALLUGO) | S_IFREG;
-	return usbfs_mknod (dir, dentry, mode, 0);
-}
-
-static inline int usbfs_positive (struct dentry *dentry)
-{
-	return dentry->d_inode && !d_unhashed(dentry);
-}
-
-static int usbfs_empty (struct dentry *dentry)
-{
-	struct list_head *list;
-
-	spin_lock(&dcache_lock);
-
-	list_for_each(list, &dentry->d_subdirs) {
-		struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
-		if (usbfs_positive(de)) {
-			spin_unlock(&dcache_lock);
-			return 0;
-		}
-	}
-
-	spin_unlock(&dcache_lock);
-	return 1;
-}
-
-static int usbfs_unlink (struct inode *dir, struct dentry *dentry)
-{
-	struct inode *inode = dentry->d_inode;
-	mutex_lock(&inode->i_mutex);
-	drop_nlink(dentry->d_inode);
-	dput(dentry);
-	mutex_unlock(&inode->i_mutex);
-	d_delete(dentry);
-	return 0;
-}
-
-static int usbfs_rmdir(struct inode *dir, struct dentry *dentry)
-{
-	int error = -ENOTEMPTY;
-	struct inode * inode = dentry->d_inode;
-
-	mutex_lock(&inode->i_mutex);
-	dentry_unhash(dentry);
-	if (usbfs_empty(dentry)) {
-		drop_nlink(dentry->d_inode);
-		drop_nlink(dentry->d_inode);
-		dput(dentry);
-		inode->i_flags |= S_DEAD;
-		drop_nlink(dir);
-		error = 0;
-	}
-	mutex_unlock(&inode->i_mutex);
-	if (!error)
-		d_delete(dentry);
-	dput(dentry);
-	return error;
-} - - -/* default file operations */ -static ssize_t default_read_file (struct file *file, char __user *buf, - size_t count, loff_t *ppos) -{ - return 0; -} - -static ssize_t default_write_file (struct file *file, const char __user *buf, - size_t count, loff_t *ppos) -{ - ret
[RFC 09/11] split out libfs/super.c from libfs.c
Consolidate all super block manipulation code in libfs in a single
source file.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>

Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -12,63 +12,6 @@
 
 #include
 
-static const struct super_operations simple_super_operations = {
-	.statfs		= simple_statfs,
-};
-
-/*
- * Common helper for pseudo-filesystems (sockfs, pipefs, bdev - stuff that
- * will never be mountable)
- */
-int get_sb_pseudo(struct file_system_type *fs_type, char *name,
-	const struct super_operations *ops, unsigned long magic,
-	struct vfsmount *mnt)
-{
-	struct super_block *s = sget(fs_type, NULL, set_anon_super, NULL);
-	struct dentry *dentry;
-	struct inode *root;
-	struct qstr d_name = {.name = name, .len = strlen(name)};
-
-	if (IS_ERR(s))
-		return PTR_ERR(s);
-
-	s->s_flags = MS_NOUSER;
-	s->s_maxbytes = ~0ULL;
-	s->s_blocksize = 1024;
-	s->s_blocksize_bits = 10;
-	s->s_magic = magic;
-	s->s_op = ops ? ops : &simple_super_operations;
-	s->s_time_gran = 1;
-	root = new_inode(s);
-	if (!root)
-		goto Enomem;
-	/*
-	 * since this is the first inode, make it number 1. New inodes created
-	 * after this must take care not to collide with it (by passing
-	 * max_reserved of 1 to iunique).
-	 */
-	root->i_ino = 1;
-	root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
-	root->i_uid = root->i_gid = 0;
-	root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
-	dentry = d_alloc(NULL, &d_name);
-	if (!dentry) {
-		iput(root);
-		goto Enomem;
-	}
-	dentry->d_sb = s;
-	dentry->d_parent = dentry;
-	d_instantiate(dentry, root);
-	s->s_root = dentry;
-	s->s_flags |= MS_ACTIVE;
-	return simple_set_mnt(mnt, s);
-
-Enomem:
-	up_write(&s->s_umount);
-	deactivate_super(s);
-	return -ENOMEM;
-}
-
 int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
 {
 	struct inode *inode = old_dentry->d_inode;
@@ -238,7 +181,6 @@ ssize_t simple_read_from_buffer(void __u
 	return count;
 }
 
-EXPORT_SYMBOL(get_sb_pseudo);
 EXPORT_SYMBOL(simple_write_begin);
 EXPORT_SYMBOL(simple_write_end);
 EXPORT_SYMBOL(simple_empty);
Index: linux-2.6/fs/libfs/super.c
===
--- linux-2.6.orig/fs/libfs/super.c
+++ linux-2.6/fs/libfs/super.c
@@ -54,6 +54,60 @@ static const struct super_operations sim
 };
 
 /*
+ * Common helper for pseudo-filesystems (sockfs, pipefs, bdev - stuff that
+ * will never be mountable)
+ */
+int get_sb_pseudo(struct file_system_type *fs_type, char *name,
+	const struct super_operations *ops, unsigned long magic,
+	struct vfsmount *mnt)
+{
+	struct super_block *s = sget(fs_type, NULL, set_anon_super, NULL);
+	struct dentry *dentry;
+	struct inode *root;
+	struct qstr d_name = {.name = name, .len = strlen(name)};
+
+	if (IS_ERR(s))
+		return PTR_ERR(s);
+
+	s->s_flags = MS_NOUSER;
+	s->s_maxbytes = ~0ULL;
+	s->s_blocksize = 1024;
+	s->s_blocksize_bits = 10;
+	s->s_magic = magic;
+	s->s_op = ops ? ops : &simple_super_operations;
+	s->s_time_gran = 1;
+	root = new_inode(s);
+	if (!root)
+		goto Enomem;
+	/*
+	 * since this is the first inode, make it number 1. New inodes created
+	 * after this must take care not to collide with it (by passing
+	 * max_reserved of 1 to iunique).
+	 */
+	root->i_ino = 1;
+	root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
+	root->i_uid = root->i_gid = 0;
+	root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+	dentry = d_alloc(NULL, &d_name);
+	if (!dentry) {
+		iput(root);
+		goto Enomem;
+	}
+	dentry->d_sb = s;
+	dentry->d_parent = dentry;
+	d_instantiate(dentry, root);
+	s->s_root = dentry;
+	s->s_flags |= MS_ACTIVE;
+	return simple_set_mnt(mnt, s);
+
+Enomem:
+	up_write(&s->s_umount);
+	deactivate_super(s);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL(get_sb_pseudo);
+
+/*
  * the inodes created here are not hashed. If you use iunique to generate
  * unique inode values later for this filesystem, then you must take care
  * to pass it an appropriate max_reserved value to avoid collisions.

--
[RFC 08/11] split out libfs/dentry.c from libfs.c
Consolidate all dentry manipulation code in libfs in a single source file. Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]> Index: linux-2.6/fs/libfs.c === --- linux-2.6.orig/fs/libfs.c +++ linux-2.6/fs/libfs.c @@ -12,188 +12,6 @@ #include -int simple_getattr(struct vfsmount *mnt, struct dentry *dentry, - struct kstat *stat) -{ - struct inode *inode = dentry->d_inode; - generic_fillattr(inode, stat); - stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9); - return 0; -} - -int simple_statfs(struct dentry *dentry, struct kstatfs *buf) -{ - buf->f_type = dentry->d_sb->s_magic; - buf->f_bsize = PAGE_CACHE_SIZE; - buf->f_namelen = NAME_MAX; - return 0; -} - -/* - * Retaining negative dentries for an in-memory filesystem just wastes - * memory and lookup time: arrange for them to be deleted immediately. - */ -static int simple_delete_dentry(struct dentry *dentry) -{ - return 1; -} - -/* - * Lookup the data. This is trivial - if the dentry didn't already - * exist, we know it is negative. Set d_op to delete negative dentries. - */ -struct dentry *simple_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd) -{ - static struct dentry_operations simple_dentry_operations = { - .d_delete = simple_delete_dentry, - }; - - if (dentry->d_name.len > NAME_MAX) - return ERR_PTR(-ENAMETOOLONG); - dentry->d_op = &simple_dentry_operations; - d_add(dentry, NULL); - return NULL; -} - -int simple_sync_file(struct file * file, struct dentry *dentry, int datasync) -{ - return 0; -} - -int dcache_dir_open(struct inode *inode, struct file *file) -{ - static struct qstr cursor_name = {.len = 1, .name = "."}; - - file->private_data = d_alloc(file->f_path.dentry, &cursor_name); - - return file->private_data ? 
0 : -ENOMEM; -} - -int dcache_dir_close(struct inode *inode, struct file *file) -{ - dput(file->private_data); - return 0; -} - -loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin) -{ - mutex_lock(&file->f_path.dentry->d_inode->i_mutex); - switch (origin) { - case 1: - offset += file->f_pos; - case 0: - if (offset >= 0) - break; - default: - mutex_unlock(&file->f_path.dentry->d_inode->i_mutex); - return -EINVAL; - } - if (offset != file->f_pos) { - file->f_pos = offset; - if (file->f_pos >= 2) { - struct list_head *p; - struct dentry *cursor = file->private_data; - loff_t n = file->f_pos - 2; - - spin_lock(&dcache_lock); - list_del(&cursor->d_u.d_child); - p = file->f_path.dentry->d_subdirs.next; - while (n && p != &file->f_path.dentry->d_subdirs) { - struct dentry *next; - next = list_entry(p, struct dentry, d_u.d_child); - if (!d_unhashed(next) && next->d_inode) - n--; - p = p->next; - } - list_add_tail(&cursor->d_u.d_child, p); - spin_unlock(&dcache_lock); - } - } - mutex_unlock(&file->f_path.dentry->d_inode->i_mutex); - return offset; -} - -/* Relationship between i_mode and the DT_xxx types */ -static inline unsigned char dt_type(struct inode *inode) -{ - return (inode->i_mode >> 12) & 15; -} - -/* - * Directory is locked and all positive dentries in it are safe, since - * for ramfs-type trees they can't go away without unlink() or rmdir(), - * both impossible due to the lock on directory. 
- */ - -int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir) -{ - struct dentry *dentry = filp->f_path.dentry; - struct dentry *cursor = filp->private_data; - struct list_head *p, *q = &cursor->d_u.d_child; - ino_t ino; - int i = filp->f_pos; - - switch (i) { - case 0: - ino = dentry->d_inode->i_ino; - if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0) - break; - filp->f_pos++; - i++; - /* fallthrough */ - case 1: - ino = parent_ino(dentry); - if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0) - break; - filp->f_pos++; - i++; - /* fallthrough */ - default: -
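A side note on the dt_type() helper moved by this patch: it relies on the file-type bits of i_mode sitting in bits 12..15, so shifting them down gives the DT_xxx value that readdir reports. Here is a minimal userspace sketch of that relationship, assuming the Linux/glibc values for S_IFxxx and DT_xxx. mode_to_dt() is a hypothetical name used only for illustration, not something from the patch.

```c
#define _DEFAULT_SOURCE	/* exposes the DT_* constants in <dirent.h> on glibc */
#include <assert.h>
#include <dirent.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Userspace mirror of dt_type(): take the file-type bits (12..15) of a
 * mode and shift them down to a 4-bit DT_* style value.
 */
static unsigned char mode_to_dt(mode_t mode)
{
	return (mode >> 12) & 15;
}

/* Compile-time check that the assumption holds on this platform. */
static_assert(((S_IFDIR >> 12) & 15) == DT_DIR,
	      "S_IFDIR is expected to map to DT_DIR");
```

The permission bits below bit 12 are simply discarded, so mode_to_dt(S_IFDIR | 0755) and mode_to_dt(S_IFDIR) give the same answer.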
[RFC 01/11] add generic versions of debugfs file operations
The file operations in debugfs are rather generic and can be used by
other file systems, so it could be worthwhile to include them in libfs,
with more generic names, and export them to modules.

This patch adds a new copy of these operations to libfs, so that the
debugfs version can later be cut down.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>

Index: linux-2.6/fs/Makefile
===
--- linux-2.6.orig/fs/Makefile
+++ linux-2.6/fs/Makefile
@@ -13,6 +13,8 @@ obj-y :=	open.o read_write.o file_table.
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
 		stack.o
 
+obj-$(CONFIG_LIBFS) += libfs/
+
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
 else
Index: linux-2.6/include/linux/libfs.h
===
--- /dev/null
+++ linux-2.6/include/linux/libfs.h
@@ -0,0 +1,21 @@
+#ifndef __LIBFS_H__
+#define __LIBFS_H__
+
+#include
+
+extern const struct file_operations simple_fops_u8;
+extern const struct file_operations simple_fops_x8;
+extern const struct file_operations simple_fops_u16;
+extern const struct file_operations simple_fops_x16;
+extern const struct file_operations simple_fops_u32;
+extern const struct file_operations simple_fops_x32;
+extern const struct file_operations simple_fops_u64;
+extern const struct file_operations simple_fops_bool;
+extern const struct file_operations simple_fops_blob;
+
+struct simple_blob_wrapper {
+	void *data;
+	unsigned long size;
+};
+
+#endif /* __LIBFS_H__ */
Index: linux-2.6/fs/libfs/Makefile
===
--- /dev/null
+++ linux-2.6/fs/libfs/Makefile
@@ -0,0 +1,3 @@
+libfs-y += file.o
+
+obj-$(CONFIG_LIBFS) += libfs.o
Index: linux-2.6/fs/libfs/file.c
===
--- /dev/null
+++ linux-2.6/fs/libfs/file.c
@@ -0,0 +1,126 @@
+/*
+ *	fs/libfs/file.c
+ *	Library for filesystems writers.
+ */
+
+#include
+#include
+#include
+
+#include
+
+/* commonly used attribute file operations */
+static int simple_u8_set(void *data, u64 val)
+{
+	*(u8 *)data = val;
+	return 0;
+}
+static int simple_u8_get(void *data, u64 *val)
+{
+	*val = *(u8 *)data;
+	return 0;
+}
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_u8, simple_u8_get, simple_u8_set, "%llu\n");
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_x8, simple_u8_get, simple_u8_set, "0x%02llx\n");
+
+static int simple_u16_set(void *data, u64 val)
+{
+	*(u16 *)data = val;
+	return 0;
+}
+static int simple_u16_get(void *data, u64 *val)
+{
+	*val = *(u16 *)data;
+	return 0;
+}
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_u16, simple_u16_get, simple_u16_set, "%llu\n");
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_x16, simple_u16_get, simple_u16_set, "0x%04llx\n");
+
+static int simple_u32_set(void *data, u64 val)
+{
+	*(u32 *)data = val;
+	return 0;
+}
+static int simple_u32_get(void *data, u64 *val)
+{
+	*val = *(u32 *)data;
+	return 0;
+}
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_u32, simple_u32_get, simple_u32_set, "%llu\n");
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_x32, simple_u32_get, simple_u32_set, "0x%08llx\n");
+
+static int simple_u64_set(void *data, u64 val)
+{
+	*(u64 *)data = val;
+	return 0;
+}
+
+static int simple_u64_get(void *data, u64 *val)
+{
+	*val = *(u64 *)data;
+	return 0;
+}
+DEFINE_SIMPLE_EXPORTED_ATTRIBUTE(simple_fops_u64, simple_u64_get, simple_u64_set, "%llu\n");
+
+static ssize_t read_file_bool(struct file *file, char __user *user_buf,
+			      size_t count, loff_t *ppos)
+{
+	char buf[3];
+	u32 *val = file->private_data;
+
+	if (*val)
+		buf[0] = 'Y';
+	else
+		buf[0] = 'N';
+	buf[1] = '\n';
+	buf[2] = 0x00;
+	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t write_file_bool(struct file *file, const char __user *user_buf,
+			       size_t count, loff_t *ppos)
+{
+	char buf[32];
+	int buf_size;
+	u32 *val = file->private_data;
+
+	buf_size = min(count, (sizeof(buf)-1));
+	if (copy_from_user(buf, user_buf, buf_size))
+		return -EFAULT;
+
+	switch (buf[0]) {
+	case 'y':
+	case 'Y':
+	case '1':
+		*val = 1;
+		break;
+	case 'n':
+	case 'N':
+	case '0':
+		*val = 0;
+		break;
+	}
+
+	return count;
+}
+
+const struct file_operations simple_fops_bool = {
+	.read =		read_file_bool,
+	.write =	write_file_bool,
+	.open =		simple_open,
+};
+EXPORT_SYMBOL_GPL(simple_fops_bool);
+
+static ssize_t read_file_blob(struct file *file, char __user *user_buf,
[RFC 00/11] possible debugfs/libfs consolidation
I noticed that there is a lot of duplication in pseudo file systems, so
I started looking into how to consolidate them. I ended up with a
largish rework of the structure of libfs, moving almost all of debugfs
in there as well.

As an example, I also have patches that reduce debugfs, securityfs and
usbfs to the point where they are mostly thin wrappers around libfs,
with large comment blocks. Other file systems could be changed in the
same way, but I would first like to see whether people agree that I'm
on the right track.

These patches have seen practically no testing so far, so don't expect
them to work, but please tell me what you think about the concept.

	Arnd <><
[RFC 03/11] slim down debugfs
With most of debugfs now copied to generic code in libfs, we can remove the original copy and replace it with thin wrappers around libfs. Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]> Index: linux-2.6/fs/Kconfig === --- linux-2.6.orig/fs/Kconfig +++ linux-2.6/fs/Kconfig @@ -1001,6 +1001,14 @@ config CONFIGFS_FS Both sysfs and configfs can and should exist together on the same system. One is not a replacement for the other. +config LIBFS + tristate + default m + help + libfs is a helper library used by many of the simpler file + systems. Parts of libfs can be modular when all of its users + are modules as well, and the users should select this symbol. + endmenu menu "Miscellaneous filesystems" Index: linux-2.6/fs/debugfs/file.c === --- linux-2.6.orig/fs/debugfs/file.c +++ linux-2.6/fs/debugfs/file.c @@ -19,55 +19,6 @@ #include #include -static ssize_t default_read_file(struct file *file, char __user *buf, -size_t count, loff_t *ppos) -{ - return 0; -} - -static ssize_t default_write_file(struct file *file, const char __user *buf, - size_t count, loff_t *ppos) -{ - return count; -} - -static int default_open(struct inode *inode, struct file *file) -{ - if (inode->i_private) - file->private_data = inode->i_private; - - return 0; -} - -const struct file_operations debugfs_file_operations = { - .read = default_read_file, - .write =default_write_file, - .open = default_open, -}; - -static void *debugfs_follow_link(struct dentry *dentry, struct nameidata *nd) -{ - nd_set_link(nd, dentry->d_inode->i_private); - return NULL; -} - -const struct inode_operations debugfs_link_operations = { - .readlink = generic_readlink, - .follow_link= debugfs_follow_link, -}; - -static int debugfs_u8_set(void *data, u64 val) -{ - *(u8 *)data = val; - return 0; -} -static int debugfs_u8_get(void *data, u64 *val) -{ - *val = *(u8 *)data; - return 0; -} -DEFINE_SIMPLE_ATTRIBUTE(fops_u8, debugfs_u8_get, debugfs_u8_set, "%llu\n"); - /** * debugfs_create_u8 - create a debugfs file that is used 
to read and write an unsigned 8-bit value * @name: a pointer to a string containing the name of the file to create. @@ -95,22 +46,10 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_u8, debugfs struct dentry *debugfs_create_u8(const char *name, mode_t mode, struct dentry *parent, u8 *value) { - return debugfs_create_file(name, mode, parent, value, &fops_u8); + return debugfs_create_file(name, mode, parent, value, &simple_fops_u8); } EXPORT_SYMBOL_GPL(debugfs_create_u8); -static int debugfs_u16_set(void *data, u64 val) -{ - *(u16 *)data = val; - return 0; -} -static int debugfs_u16_get(void *data, u64 *val) -{ - *val = *(u16 *)data; - return 0; -} -DEFINE_SIMPLE_ATTRIBUTE(fops_u16, debugfs_u16_get, debugfs_u16_set, "%llu\n"); - /** * debugfs_create_u16 - create a debugfs file that is used to read and write an unsigned 16-bit value * @name: a pointer to a string containing the name of the file to create. @@ -138,22 +77,10 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_u16, debugf struct dentry *debugfs_create_u16(const char *name, mode_t mode, struct dentry *parent, u16 *value) { - return debugfs_create_file(name, mode, parent, value, &fops_u16); + return debugfs_create_file(name, mode, parent, value, &simple_fops_u16); } EXPORT_SYMBOL_GPL(debugfs_create_u16); -static int debugfs_u32_set(void *data, u64 val) -{ - *(u32 *)data = val; - return 0; -} -static int debugfs_u32_get(void *data, u64 *val) -{ - *val = *(u32 *)data; - return 0; -} -DEFINE_SIMPLE_ATTRIBUTE(fops_u32, debugfs_u32_get, debugfs_u32_set, "%llu\n"); - /** * debugfs_create_u32 - create a debugfs file that is used to read and write an unsigned 32-bit value * @name: a pointer to a string containing the name of the file to create. 
@@ -181,23 +108,10 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_u32, debugf struct dentry *debugfs_create_u32(const char *name, mode_t mode, struct dentry *parent, u32 *value) { - return debugfs_create_file(name, mode, parent, value, &fops_u32); + return debugfs_create_file(name, mode, parent, value, &simple_fops_u32); } EXPORT_SYMBOL_GPL(debugfs_create_u32); -static int debugfs_u64_set(void *data, u64 val) -{ - *(u64 *)data = val; - return 0; -} - -static int debugfs_u64_get(void *data, u64 *val) -{ - *val = *(u64 *)data; - return 0; -} -DEFINE_SIMPLE_ATTRIBUTE(fops_u64, debugfs_u64_get, debugfs_u64_set, "%llu\n"); - /** * debugfs_create_u64 - create a debugfs f
[RFC 07/11] split out libfs/file.c from libfs.c
Consolidate all file manipulation code in libfs in a single source file. Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]> Index: linux-2.6/fs/libfs.c === --- linux-2.6.orig/fs/libfs.c +++ linux-2.6/fs/libfs.c @@ -421,165 +421,6 @@ ssize_t simple_read_from_buffer(void __u } /* - * Transaction based IO. - * The file expects a single write which triggers the transaction, and then - * possibly a read which collects the result - which is stored in a - * file-local buffer. - */ -char *simple_transaction_get(struct file *file, const char __user *buf, size_t size) -{ - struct simple_transaction_argresp *ar; - static DEFINE_SPINLOCK(simple_transaction_lock); - - if (size > SIMPLE_TRANSACTION_LIMIT - 1) - return ERR_PTR(-EFBIG); - - ar = (struct simple_transaction_argresp *)get_zeroed_page(GFP_KERNEL); - if (!ar) - return ERR_PTR(-ENOMEM); - - spin_lock(&simple_transaction_lock); - - /* only one write allowed per open */ - if (file->private_data) { - spin_unlock(&simple_transaction_lock); - free_page((unsigned long)ar); - return ERR_PTR(-EBUSY); - } - - file->private_data = ar; - - spin_unlock(&simple_transaction_lock); - - if (copy_from_user(ar->data, buf, size)) - return ERR_PTR(-EFAULT); - - return ar->data; -} - -ssize_t simple_transaction_read(struct file *file, char __user *buf, size_t size, loff_t *pos) -{ - struct simple_transaction_argresp *ar = file->private_data; - - if (!ar) - return 0; - return simple_read_from_buffer(buf, size, pos, ar->data, ar->size); -} - -int simple_transaction_release(struct inode *inode, struct file *file) -{ - free_page((unsigned long)file->private_data); - return 0; -} - -/* Simple attribute files */ - -struct simple_attr { - int (*get)(void *, u64 *); - int (*set)(void *, u64); - char get_buf[24]; /* enough to store a u64 and "\n\0" */ - char set_buf[24]; - void *data; - const char *fmt;/* format for read operation */ - struct mutex mutex; /* protects access to these buffers */ -}; - -/* simple_attr_open is called by an actual 
attribute open file operation - * to set the attribute specific access operations. */ -int simple_attr_open(struct inode *inode, struct file *file, -int (*get)(void *, u64 *), int (*set)(void *, u64), -const char *fmt) -{ - struct simple_attr *attr; - - attr = kmalloc(sizeof(*attr), GFP_KERNEL); - if (!attr) - return -ENOMEM; - - attr->get = get; - attr->set = set; - attr->data = inode->i_private; - attr->fmt = fmt; - mutex_init(&attr->mutex); - - file->private_data = attr; - - return nonseekable_open(inode, file); -} - -int simple_attr_release(struct inode *inode, struct file *file) -{ - kfree(file->private_data); - return 0; -} - -/* read from the buffer that is filled with the get function */ -ssize_t simple_attr_read(struct file *file, char __user *buf, -size_t len, loff_t *ppos) -{ - struct simple_attr *attr; - size_t size; - ssize_t ret; - - attr = file->private_data; - - if (!attr->get) - return -EACCES; - - ret = mutex_lock_interruptible(&attr->mutex); - if (ret) - return ret; - - if (*ppos) {/* continued read */ - size = strlen(attr->get_buf); - } else {/* first read */ - u64 val; - ret = attr->get(attr->data, &val); - if (ret) - goto out; - - size = scnprintf(attr->get_buf, sizeof(attr->get_buf), -attr->fmt, (unsigned long long)val); - } - - ret = simple_read_from_buffer(buf, len, ppos, attr->get_buf, size); -out: - mutex_unlock(&attr->mutex); - return ret; -} - -/* interpret the buffer as a number to call the set function with */ -ssize_t simple_attr_write(struct file *file, const char __user *buf, - size_t len, loff_t *ppos) -{ - struct simple_attr *attr; - u64 val; - size_t size; - ssize_t ret; - - attr = file->private_data; - if (!attr->set) - return -EACCES; - - ret = mutex_lock_interruptible(&attr->mutex); - if (ret) - return ret; - - ret = -EFAULT; - size = min(sizeof(attr->set_buf) - 1, len); - if (copy_from_user(attr->set_buf, buf, size)) - goto out; - - ret = len; /* claim we got the whole input */ - attr->set_buf[size] = '\0'; - val = 
simple_strtol(attr->set_buf, NULL, 0); - attr->set(attr->data, val); -out: - mutex_unlock(&attr->mutex); -
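A note on the input handling in simple_attr_write(): it copies at most sizeof(set_buf) - 1 bytes, NUL-terminates the buffer and parses it with base 0, so decimal, 0x-prefixed hex and 0-prefixed octal input are all accepted. A rough userspace sketch of that step follows. It assumes strtoull() as a stand-in for the kernel's simple_strtol(), so negative input behaves differently, and parse_attr_input() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy up to 23 bytes of user input into a local buffer, terminate it,
 * and let base-0 parsing pick decimal, hex or octal from the prefix.
 */
static uint64_t parse_attr_input(const char *buf, size_t len)
{
	char tmp[24];	/* same size as set_buf in struct simple_attr */
	size_t size = len < sizeof(tmp) - 1 ? len : sizeof(tmp) - 1;

	assert(buf != NULL);
	memcpy(tmp, buf, size);
	tmp[size] = '\0';
	return (uint64_t)strtoull(tmp, NULL, 0);
}
```

Input longer than the buffer is silently cut off, which matches the "claim we got the whole input" comment in the original function.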
[RFC 10/11] split out libfs/inode.c from libfs.c
Consolidate all inode manipulation code in libfs in a single source
file.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>

Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -12,78 +12,6 @@
 
 #include
 
-int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
-{
-	struct inode *inode = old_dentry->d_inode;
-
-	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
-	dget(dentry);
-	d_instantiate(dentry, inode);
-	return 0;
-}
-
-int simple_empty(struct dentry *dentry)
-{
-	struct dentry *child;
-	int ret = 0;
-
-	spin_lock(&dcache_lock);
-	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
-		if (simple_positive(child))
-			goto out;
-	ret = 1;
-out:
-	spin_unlock(&dcache_lock);
-	return ret;
-}
-
-int simple_unlink(struct inode *dir, struct dentry *dentry)
-{
-	struct inode *inode = dentry->d_inode;
-
-	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	drop_nlink(inode);
-	dput(dentry);
-	return 0;
-}
-
-int simple_rmdir(struct inode *dir, struct dentry *dentry)
-{
-	if (!simple_empty(dentry))
-		return -ENOTEMPTY;
-
-	drop_nlink(dentry->d_inode);
-	simple_unlink(dir, dentry);
-	drop_nlink(dir);
-	return 0;
-}
-
-int simple_rename(struct inode *old_dir, struct dentry *old_dentry,
-		struct inode *new_dir, struct dentry *new_dentry)
-{
-	struct inode *inode = old_dentry->d_inode;
-	int they_are_dirs = S_ISDIR(old_dentry->d_inode->i_mode);
-
-	if (!simple_empty(new_dentry))
-		return -ENOTEMPTY;
-
-	if (new_dentry->d_inode) {
-		simple_unlink(new_dir, new_dentry);
-		if (they_are_dirs)
-			drop_nlink(old_dir);
-	} else if (they_are_dirs) {
-		drop_nlink(old_dir);
-		inc_nlink(new_dir);
-	}
-
-	old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
-		new_dir->i_mtime = inode->i_ctime = CURRENT_TIME;
-
-	return 0;
-}
-
 int simple_readpage(struct file *file, struct page *page)
 {
 	clear_highpage(page);
@@ -183,11 +111,6 @@ ssize_t simple_read_from_buffer(void __u
 
 EXPORT_SYMBOL(simple_write_begin);
 EXPORT_SYMBOL(simple_write_end);
-EXPORT_SYMBOL(simple_empty);
-EXPORT_SYMBOL(simple_link);
 EXPORT_SYMBOL(simple_prepare_write);
 EXPORT_SYMBOL(simple_readpage);
-EXPORT_SYMBOL(simple_rename);
-EXPORT_SYMBOL(simple_rmdir);
-EXPORT_SYMBOL(simple_unlink);
 EXPORT_SYMBOL(simple_read_from_buffer);
Index: linux-2.6/fs/libfs/inode.c
===
--- linux-2.6.orig/fs/libfs/inode.c
+++ linux-2.6/fs/libfs/inode.c
@@ -417,4 +417,79 @@ exit:
 }
 EXPORT_SYMBOL_GPL(simple_rename_named);
 
+int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+	inc_nlink(inode);
+	atomic_inc(&inode->i_count);
+	dget(dentry);
+	d_instantiate(dentry, inode);
+	return 0;
+}
+EXPORT_SYMBOL(simple_link);
+
+int simple_empty(struct dentry *dentry)
+{
+	struct dentry *child;
+	int ret = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
+		if (simple_positive(child))
+			goto out;
+	ret = 1;
+out:
+	spin_unlock(&dcache_lock);
+	return ret;
+}
+EXPORT_SYMBOL(simple_empty);
+
+int simple_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+
+	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+	drop_nlink(inode);
+	dput(dentry);
+	return 0;
+}
+EXPORT_SYMBOL(simple_unlink);
+
+int simple_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	if (!simple_empty(dentry))
+		return -ENOTEMPTY;
+
+	drop_nlink(dentry->d_inode);
+	simple_unlink(dir, dentry);
+	drop_nlink(dir);
+	return 0;
+}
+EXPORT_SYMBOL(simple_rmdir);
+
+int simple_rename(struct inode *old_dir, struct dentry *old_dentry,
+		struct inode *new_dir, struct dentry *new_dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int they_are_dirs = S_ISDIR(old_dentry->d_inode->i_mode);
+
+	if (!simple_empty(new_dentry))
+		return -ENOTEMPTY;
+
+	if (new_dentry->d_inode) {
+		simple_unlink(new_dir, new_dentry);
+		if (they_are_dirs)
+			drop_nlink(old_dir);
+	} else if (they_are_dirs) {
+		drop_nlink(old_dir);
+		inc_nlink(new_dir);
[RFC 11/11] split out libfs/aops.c from libfs.c
Consolidate all address space manipulation code in libfs in a single
source file.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>

Index: linux-2.6/fs/libfs.c
===
--- linux-2.6.orig/fs/libfs.c
+++ /dev/null
@@ -1,116 +0,0 @@
-/*
- *	fs/libfs.c
- *	Library for filesystems writers.
- */
-
-#include
-#include
-#include
-#include
-#include
-#include
-
-#include
-
-int simple_readpage(struct file *file, struct page *page)
-{
-	clear_highpage(page);
-	flush_dcache_page(page);
-	SetPageUptodate(page);
-	unlock_page(page);
-	return 0;
-}
-
-int simple_prepare_write(struct file *file, struct page *page,
-			unsigned from, unsigned to)
-{
-	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE)
-			zero_user_segments(page,
-				0, from,
-				to, PAGE_CACHE_SIZE);
-	}
-	return 0;
-}
-
-int simple_write_begin(struct file *file, struct address_space *mapping,
-			loff_t pos, unsigned len, unsigned flags,
-			struct page **pagep, void **fsdata)
-{
-	struct page *page;
-	pgoff_t index;
-	unsigned from;
-
-	index = pos >> PAGE_CACHE_SHIFT;
-	from = pos & (PAGE_CACHE_SIZE - 1);
-
-	page = __grab_cache_page(mapping, index);
-	if (!page)
-		return -ENOMEM;
-
-	*pagep = page;
-
-	return simple_prepare_write(file, page, from, from+len);
-}
-
-static int simple_commit_write(struct file *file, struct page *page,
-			       unsigned from, unsigned to)
-{
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
-	if (!PageUptodate(page))
-		SetPageUptodate(page);
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold the i_mutex.
-	 */
-	if (pos > inode->i_size)
-		i_size_write(inode, pos);
-	set_page_dirty(page);
-	return 0;
-}
-
-int simple_write_end(struct file *file, struct address_space *mapping,
-			loff_t pos, unsigned len, unsigned copied,
-			struct page *page, void *fsdata)
-{
-	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-	/* zero the stale part of the page if we did a short copy */
-	if (copied < len) {
-		void *kaddr = kmap_atomic(page, KM_USER0);
-		memset(kaddr + from + copied, 0, len - copied);
-		flush_dcache_page(page);
-		kunmap_atomic(kaddr, KM_USER0);
-	}
-
-	simple_commit_write(file, page, from, from+copied);
-
-	unlock_page(page);
-	page_cache_release(page);
-
-	return copied;
-}
-
-ssize_t simple_read_from_buffer(void __user *to, size_t count, loff_t *ppos,
-				const void *from, size_t available)
-{
-	loff_t pos = *ppos;
-	if (pos < 0)
-		return -EINVAL;
-	if (pos >= available)
-		return 0;
-	if (count > available - pos)
-		count = available - pos;
-	if (copy_to_user(to, from + pos, count))
-		return -EFAULT;
-	*ppos = pos + count;
-	return count;
-}
-
-EXPORT_SYMBOL(simple_write_begin);
-EXPORT_SYMBOL(simple_write_end);
-EXPORT_SYMBOL(simple_prepare_write);
-EXPORT_SYMBOL(simple_readpage);
-EXPORT_SYMBOL(simple_read_from_buffer);
Index: linux-2.6/fs/libfs/Makefile
===
--- linux-2.6.orig/fs/libfs/Makefile
+++ linux-2.6/fs/libfs/Makefile
@@ -1,3 +1,3 @@
 libfs-y += file.o
 
-obj-$(CONFIG_LIBFS) += libfs.o inode.o super.o dentry.o
+obj-$(CONFIG_LIBFS) += libfs.o inode.o super.o dentry.o aops.o
Index: linux-2.6/fs/libfs/aops.c
===
--- /dev/null
+++ linux-2.6/fs/libfs/aops.c
@@ -0,0 +1,113 @@
+/*
+ *	fs/libfs/aops.c
+ *	Library for filesystems writers -- address space operations
+ */
+
+#include
+#include
+#include
+#include
+
+#include
+
+int simple_readpage(struct file *file, struct page *page)
+{
+	clear_highpage(page);
+	flush_dcache_page(page);
+	SetPageUptodate(page);
+	unlock_page(page);
+	return 0;
+}
+EXPORT_SYMBOL(simple_readpage);
+
+int simple_prepare_write(struct file *file, struct page *page,
+			unsigned from, unsigned to)
+{
+	if (!PageUptodate(page)) {
+		if (to - from != PAGE_CACHE_SIZE)
+			zero_user_segments(page,
+				0, from,
+				to, PAGE_CACHE_SIZE);
+	}
+	return 0;
+}
+EXPORT_SYMBOL(simple_prepare_write);
+
+int simple_write_begin(struct file *file, struct address_space *mapping,
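The clamping logic in simple_read_from_buffer(), which this patch removes along with the rest of fs/libfs.c, is easy to get wrong at the boundaries, so here is a userspace sketch of the same steps. The assumptions are mine: memcpy() stands in for copy_to_user() (so the -EFAULT path is gone), off_t stands in for loff_t, and read_from_buffer() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy up to count bytes from a fixed buffer of size available, starting
 * at *ppos, and advance *ppos by the number of bytes actually copied.
 */
static ssize_t read_from_buffer(void *to, size_t count, off_t *ppos,
				const void *from, size_t available)
{
	off_t pos;

	assert(ppos != NULL);
	pos = *ppos;
	if (pos < 0)
		return -EINVAL;
	if ((size_t)pos >= available)
		return 0;		/* at or past the end: EOF */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;	/* clamp to what is left */
	memcpy(to, (const char *)from + pos, count);
	*ppos = pos + (off_t)count;
	return (ssize_t)count;
}
```

Reading a 5-byte buffer in chunks of 4 returns 4, then 1, then 0, which is the short-read-then-EOF sequence a caller of read() expects.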
Re: [linux-cifs-client] review 5, was Re: projected date for mount.cifs to support DFS junction points
The patch looks fine, but since it no longer sets obj_type I want to think about it a little more: obj_type may be useful coming back from the open path (although the mode is probably good enough).

jra added support to Samba for a new POSIX open/create/mkdir request (which we only use for mkdir at the moment). Using this call for open (when the server indicates support for it, as Samba does) would cut the number of roundtrips to the server on these calls, as it does now for mkdir.

On Feb 15, 2008 4:11 PM, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> If you like these kind of consolidation patches here's another one:
>
>

--
Thanks,

Steve
Question about synchronous write on SSD
Hi,

Do you remember the thread "solid state drive access and context switching" [1]? I wanted to measure whether synchronous writes really perform better on an SSD. To force writes to the SSD to be synchronous, I hacked generic_make_request() [2] and got the following results.

# echo 3 > /proc/sys/vm/drop_caches
# tiotest -f 100 -R -d /dev/sdd1
Tiotest results for 4 concurrent io threads:
| Write  400 MBs |  4.7 s | 84.306 MB/s | 8.4 % | 77.5 % |
| Read   400 MBs |  4.3 s | 92.945 MB/s | 7.2 % | 53.5 % |
Tiotest latency results:
| Write | 0.126 ms | 706.379 ms | 0.0 | 0.0 |
| Read  | 0.161 ms | 311.738 ms | 0.0 | 0.0 |

# echo 3 > /proc/sys/vm/drop_caches
# tiotest -f 1000 -R -d /dev/sdd1
Tiotest results for 4 concurrent io threads:
| Write 4000 MBs | 47.5 s | 84.124 MB/s | 7.0 % | 83.6 % |
| Read  4000 MBs | 41.9 s | 95.530 MB/s | 7.8 % | 55.6 % |
Tiotest latency results:
| Write | 0.176 ms | 714.677 ms | 0.0 | 0.0 |
| Read  | 0.161 ms | 311.815 ms | 0.0 | 0.0 |

However, the performance is the same as before, so the patch seems to make no difference. Could you tell me whether this is the proper place for the hack, or whether it belongs somewhere else?

Thank you,
Kyungmin Park

p.s. Of course I got the following message:
WARNING: at block/blk-core.c:1351 generic_make_request+0x126/0x3d8()

1. http://lkml.org/lkml/2007/12/3/247
2. simple hack

diff --git a/block/blk-core.c b/block/blk-core.c
index e9754dc..7262720 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1345,6 +1345,14 @@ static inline void __generic_make_request(struct bio *bio)
 		if (bio_check_eod(bio, nr_sectors))
 			goto end_io;
 
+		/* FIXME simple hack by kmpark */
+		if (MINOR(bio->bi_bdev->bd_dev) == 49 &&
+		    MAJOR(bio->bi_bdev->bd_dev) == 8 && bio_data_dir(bio) == WRITE) {
+			WARN_ON_ONCE(1);
+			/* Write synchronous */
+			bio->bi_rw |= (1 << BIO_RW_SYNC);
+		}
+
 		/*
 		 * Resolve the mapping until finished. (drivers are
 		 * still free to implement/resolve their own stacking
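
For readers checking the device test in the hack: the condition MAJOR == 8 && MINOR == 49 is meant to select /dev/sdd1, the tiotest target. The sketch below reproduces that arithmetic outside the kernel. It assumes the kernel-internal dev_t layout from include/linux/kdev_t.h (20 minor bits) and the conventional sd numbering, where each of the first sixteen disks on major 8 owns 16 consecutive minors. sd_minor() and hack_matches() are hypothetical helper names used only for illustration and are not part of the posted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Kernel-internal dev_t layout, as in include/linux/kdev_t.h. */
#define MINORBITS	20
#define MINORMASK	((1U << MINORBITS) - 1)
#define MAJOR(dev)	((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev)	((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi)	(((unsigned int)(ma) << MINORBITS) | (mi))

#define SCSI_DISK0_MAJOR	8	/* major of the first 16 sd disks */
#define SD_MINORS_PER_DISK	16	/* whole disk + 15 partitions */

/*
 * Hypothetical helper: minor number of partition `part` on the
 * `disk_index`-th sd disk (0 = sda, 1 = sdb, ...). Partition 0 is the
 * whole disk. Only valid for the first 16 disks, which live on major 8.
 */
static inline unsigned int sd_minor(unsigned int disk_index, unsigned int part)
{
	assert(disk_index < 16 && part < SD_MINORS_PER_DISK);
	return disk_index * SD_MINORS_PER_DISK + part;
}

/*
 * Hypothetical helper mirroring the condition in the hack: true only
 * for write requests aimed at device 8:49.
 */
static inline bool hack_matches(unsigned int dev, bool is_write)
{
	return MAJOR(dev) == SCSI_DISK0_MAJOR && MINOR(dev) == 49 && is_write;
}
```

With these definitions, sd_minor(3, 1) is 3 * 16 + 1 = 49, so hack_matches(MKDEV(8, 49), true) holds for writes to sdd1 and for nothing else. Reads, and writes to the whole-disk node 8:48, do not match.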
Re: [PATCH 0/8][for -mm] mem_notify v6
Hi Paul,

Thank you for the wonderful and interesting comments. They are really helpful.

I used to be an HPC guy working with large NUMA boxes, so I promise I won't ignore HPC users. Unfortunately I have no experience using CPUSET, because at that time it was still under development.

I would like to discuss the CPUSET use cases and the mem_notify requirements with you. To be honest, I thought HPC users wouldn't use mem_notify, sorry.

> I have what seems, intuitively, a similar problem at the opposite
> end of the world, on big-honkin NUMA boxes (hundreds or thousands of
> CPUs, terabytes of main memory.) The problem there is often best
> resolved if we can kill the offending task, rather than shrink its
> memory footprint. The situation is that several compute intensive
> multi-threaded jobs are running, each in their own dedicated cpuset.

Agreed.

> So we like to identify such jobs as soon as they begin to swap,
> and kill them very very quickly (before the direct reclaim code
> in mm/vmscan.c can push more than a few pages to the swap device.)

You want to kill the process just after it starts swapping, right? Unfortunately, almost all users want to receive the notification before swapping starts ;-) because they want to avoid swap entirely. I think we need to discuss this point more.

> For a much earlier, unsuccessful, attempt to accomplish this, see:
>
> [Patch] cpusets policy kill no swap
> http://lkml.org/lkml/2005/3/19/148
>
> Now, it may well be that we are too far apart to share any part of
> a solution; one seldom uses the same technology to build a Tour de
> France bicycle as one uses to build a Lockheed C-5A Galaxy heavy
> cargo transport.
>
> One clear difference is the policy of what action we desire to take
> when under memory pressure: do we invite user space to free memory so
> as to avoid the wrath of the oom killer, or do we go to the opposite
> extreme, seeking a nearly instant killing, faster than the oom
> killer can even begin its search for a victim.

Hmm, sorry, I don't understand your patch yet, because I don't know CPUSET well.
I'll spend this week learning more about CPUSET and reply again around next week ;-)

> Another clear difference is the use of cpusets, which are a major and
> vital part of administering the big NUMA boxes, and I presume are not
> even compiled into embedded kernels (correct?). This difference may be
> unbridgeable ... these big NUMA systems require per-cpuset mechanisms,
> whereas embedded may require builds without cpusets.

Yes, some embedded distributions (e.g. MontaVista) are distributed as source, but embedded people strongly dislike code size bloat. I think they never turn on CPUSET. I want mem_notify to work fine without CPUSET.

> 1) You have a little bit of code in the kernel to throttle the
>    thundering herd problem. Perhaps this could be moved to user space
>    ... one user daemon that is always notified of such memory pressure
>    alarms, and in turn notifies interested applications. This might
>    avoid the need to add poll_wait_exclusive() to the kernel. And it
>    moves any fussy details of how to tame the thundering herd out of
>    the kernel.

I think you are talking about a user space OOM manager. That model is obviously different from notifying many independent user processes. I doubt the memory manager daemon model works on desktops and typical servers, so the current implementation is optimized for an environment without a manager. Of course, that doesn't mean I refuse to add code for an OOM manager. It is a very interesting idea, and I'd like to discuss it more.

> 2) Another possible mechanism for communicating events from
>    the kernel to user space is inotify. For example, I added
>    the line:
>
>        fsnotify_modify(dentry); # dentry is current tasks cpuset

Excellent! That is a really good idea. Thanks.

> 3) Perhaps, instead of sending simple events, one could update
>    a meter of the rate of recent such events, such as the per-cpuset
>    'memory_pressure' mechanism does.
>    This might lead to addressing
>    Andrew Morton's comment:
>
>        If this feature is useful then I'd expect that some
>        applications would want notification at different times, or at
>        different levels of VM distress. So this semi-randomly-chosen
>        notification point just won't be strong enough in real-world
>        use.

Hmmm, I don't think so. I think the timing of memory_pressure_notify(1) is already the best available: a page moving from the active list to the inactive list indicates that swap I/O will happen a little later. But memory_pressure_notify(0) is a bit messy. I'll try to simplify it.

> 4) A place that I found well suited for my purposes (watching for
>    swapping from direct reclaim) was just before the lines in the
>    pageout() routine in mm/vmscan.c:
>
>        if (clear_page_dirty_for_io(page)) {
>            ...
>            res = mapping->a_ops->writepage(page, &wbc);
>
>    It seemed that testing "PageAnon(page)" here allowed me to easily
>    distinguish between dirty pages going back to the file system, and
>    pages going to swap (this detail is