Re: [RFC] set MS_NOATIME on FAT ?
Werner Almesberger <[EMAIL PROTECTED]> writes:

> Ah, I see. But, at the moment, VFAT doesn't set atime from adate,
> and vice versa, or have I overlooked something ?

Right. However, if you need NOATIME, you can set it with mount options.
And I think we just need to fix ->adate; there is no need to change the
default options.

Thanks.
--
OGAWA Hirofumi <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
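The mount-option route mentioned above can also be taken from a program
through mount(2). What follows is only a minimal sketch, assuming a Linux
system with glibc's <sys/mount.h>; the helper names with_noatime() and
mount_vfat_noatime() are made up for this illustration and appear nowhere
in the thread. From the shell, the equivalent is "mount -o noatime".

```c
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: add the two "do not update access times" flags
 * to an existing set of mount flags. These are the same two flags the
 * proposed patch would OR into sb->s_flags.
 */
unsigned long with_noatime(unsigned long flags)
{
	return flags | MS_NOATIME | MS_NODIRATIME;
}

/*
 * Hypothetical helper: mount a VFAT volume with atime updates disabled.
 * Needs CAP_SYS_ADMIN. Returns 0 on success, -1 with errno set otherwise.
 */
int mount_vfat_noatime(const char *dev, const char *dir)
{
	return mount(dev, dir, "vfat", with_noatime(0), NULL);
}
```

Doing this per mount leaves the driver's default untouched, which is the
point of the reply above.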
Re: [RFC] set MS_NOATIME on FAT ?
OGAWA Hirofumi wrote:
> No. The fatfs has the ->adate, so I think we should update it rather.

Ah, I see. But, at the moment, VFAT doesn't set atime from adate,
and vice versa, or have I overlooked something ?

Thanks,
- Werner

--
Werner Almesberger, Buenos Aires, Argentina    [EMAIL PROTECTED]
http://www.almesberger.net/
Re: [RFC] set MS_NOATIME on FAT ?
Werner Almesberger <[EMAIL PROTECTED]> writes:

> as far as I can tell, none of FAT or its offsprings use atime, so
> perhaps fs/fat/inode.c should just set MS_NOATIME, so that we don't
> get unnecessary "inode" writes ?

No. The fatfs has the ->adate, so I think we should update it rather.

Thanks.
--
OGAWA Hirofumi <[EMAIL PROTECTED]>
[RFC] set MS_NOATIME on FAT ?
Hi,

as far as I can tell, none of FAT or its offspring use atime, so
perhaps fs/fat/inode.c should just set MS_NOATIME, so that we don't
get unnecessary "inode" writes ? (They hurt if you want to reduce
worst-case latency in the write path.)

Here's a patch for 2.6.11 (with some offset, because I pulled it from
a larger patch). Does this look good ?

Thanks,
- Werner

--- cut here ---

Signed-off-by: Werner Almesberger <[EMAIL PROTECTED]>

--- linux-2.6.11-orig/fs/fat/inode.c	Wed Mar  2 04:38:08 2005
+++ linux-2.6.11/fs/fat/inode.c	Thu Mar  3 01:35:57 2005
@@ -413,7 +483,7 @@ static void __exit fat_destroy_inodecach
 
 static int fat_remount(struct super_block *sb, int *flags, char *data)
 {
-	*flags |= MS_NODIRATIME;
+	*flags |= MS_NODIRATIME | MS_NOATIME;
 	return 0;
 }
 
@@ -1058,7 +1128,7 @@ int fat_fill_super(struct super_block *s
 	sb->s_fs_info = sbi;
 	memset(sbi, 0, sizeof(struct msdos_sb_info));
 
-	sb->s_flags |= MS_NODIRATIME;
+	sb->s_flags |= MS_NODIRATIME | MS_NOATIME;
 	sb->s_magic = MSDOS_SUPER_MAGIC;
 	sb->s_op = &fat_sops;
 	sb->s_export_op = &fat_export_ops;

--
Werner Almesberger, Buenos Aires, Argentina    [EMAIL PROTECTED]
http://www.almesberger.net/
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Bryan Henderson wrote:
> I think "reservation" is wrong for one of them and anyone using it that
> way should stop.

Hehe, start with ext3 :-)

> I believe the common terminology is:

Sounds reasonable. The thing with "reservation" is that people use it
in daily life with all kinds of meanings, and often with the object of
the reservation, e.g. "reserve a seat" (typically a specific seat),
"reserve some time" (often not a specific interval), or "reserve a
table" (at a restaurant, you don't know which one, but the restaurant
staff does).

To muddy the issue further, reservations can be more or less firm.
E.g. if we "reserve" the next hundred blocks, so that allocation is
contiguous, we may want to be able to take them away if some other
file needs them. On the other hand, if storage is already committed,
but just not on disk yet, that reservation shouldn't be revocable.

- Werner

--
Werner Almesberger, Buenos Aires, Argentina    [EMAIL PROTECTED]
http://www.almesberger.net/
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Alex Tomas wrote:
> you can drop PG_locked right as you set PG_writeback, I think

Hmm, not sure. mpage_writepage never calls writepage with
PG_writeback, only with PG_locked. Also, mpage_writepage calls
get_block with PG_locked, so the allocation, which may take a
while, holds the lock.

This situation is admittedly a bit annoying: on the one hand, "sync"
should write all dirty data. On the other hand, if a random user
typing "sync" can break performance guarantees, these guarantees
aren't very valuable.

- Werner

--
Werner Almesberger, Buenos Aires, Argentina    [EMAIL PROTECTED]
http://www.almesberger.net/
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
>Hmm, it's a bit confusing that we call both things "reservation".

I think "reservation" is wrong for one of them and anyone using it that
way should stop. I believe the common terminology is:

  - choosing the blocks is "placement."

  - committing the required number of blocks from the resource pool for
    the instant use is "reservation."

  - the combination of reservation and placement is "allocation."

Obviously, traditional filesystem drivers haven't split placement from
reservation, so don't bother to use those terms. Most delaying schemes
delay the placement but not the reservation because they don't want to
accept the possibility that a write would fail for lack of space after
the write() system call succeeded.

Even in non-filesystem areas, "allocate" usually means to assign
particular resources, while "reserve" just means to make arrangements so
that a future allocate will succeed. For example, if you know you need up
to 10 blocks of memory to complete a task without deadlocking, but you
don't know yet exactly how many, you would reserve 10 blocks and later,
if necessary, allocate the actual blocks.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
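The three terms above can be made concrete with a toy block pool. This is
only a sketch of the terminology, not code from any filesystem: the pool
size and the names struct pool, pool_reserve(), pool_place() and
pool_allocate() are all invented for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_BLOCKS 16			/* illustrative pool size */

struct pool {
	bool   in_use[POOL_BLOCKS];	/* blocks that have been placed */
	size_t reserved;		/* promised, but not yet placed */
	size_t placed;			/* number of blocks in use */
};

/* Reservation: commit n blocks from the pool without choosing which. */
bool pool_reserve(struct pool *p, size_t n)
{
	if (p->placed + p->reserved + n > POOL_BLOCKS)
		return false;		/* the caller can fail early here */
	p->reserved += n;
	return true;
}

/*
 * Placement: turn one previously reserved block into a specific block
 * number. Returns the block index, or -1 if nothing is reserved.
 */
int pool_place(struct pool *p)
{
	if (p->reserved == 0)
		return -1;
	for (int i = 0; i < POOL_BLOCKS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			p->reserved--;
			p->placed++;
			return i;
		}
	}
	return -1;	/* not reached while placed + reserved <= POOL_BLOCKS */
}

/* Allocation: reservation and placement combined in one step. */
int pool_allocate(struct pool *p)
{
	if (!pool_reserve(p, 1))
		return -1;
	return pool_place(p);
}
```

A delaying scheme in this picture calls pool_reserve() at write() time,
so that lack of space is reported immediately, and pool_place() only when
the data actually goes to disk.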
RFC: exporting per-superblock statistics to user space
we still have a need to provide "iostat" like statistics for NFS
clients. attached are a couple of patches, against 2.6.11.3, which
prototype an approach for providing this kind of data to user programs.
i'd like some comment on the approach.

01-mountstats.patch adds a new file called /proc/self/mountstats and a
new file system hook called show_stats. this just replicates
/proc/mounts and the show_options hook.

02-nfs-iostat.patch teaches the NFS client to use the new show_stats
hook as a demonstration.

note that this approach addresses previously voiced concerns about
exporting per-superblock stats to user space.

1. processes can't see stats for file systems mounted outside their
   namespace.

2. reading the stats file is serialized with mount and unmount
   operations.

3. the approach doesn't use /sys or kobjects.

4. there are no lifetime issues tied to file systems loaded as a
   module.

[PATCH] VFS: New /proc file /proc/self/mountstats

 Create a new file under /proc/self, called mountstats, where mounted file
 systems can export information (configuration options, performance counters,
 and so on). Use a mechanism similar to /proc/mounts and s_ops->show_options.

 This mechanism does not violate namespace security, and is safe to use while
 other processes are unmounting file systems.

 Test-plan:
 Test concurrent mount/unmount operations while cat'ing /proc/self/mountstats.

 Version: Mon, 14 Mar 2005 17:06:04 -0500

 Signed-off-by: Chuck Lever <[EMAIL PROTECTED]>
---

 fs/namespace.c     |   66 +
 fs/proc/base.c     |   40 +++
 include/linux/fs.h |    1
 3 files changed, 107 insertions(+)

diff -X /home/cel/src/linux/dont-diff -Naurp 00-stock/fs/namespace.c 01-mountstats/fs/namespace.c
--- 00-stock/fs/namespace.c	2005-03-02 02:38:13.0 -0500
+++ 01-mountstats/fs/namespace.c	2005-03-14 15:24:51.565085000 -0500
@@ -265,6 +265,72 @@ struct seq_operations mounts_op = {
 	.show	= show_vfsmnt
 };
 
+/* iterator */
+static void *ms_start(struct seq_file *m, loff_t *pos)
+{
+	struct namespace *n = m->private;
+	struct list_head *p;
+	loff_t l = *pos;
+
+	down_read(&n->sem);
+	list_for_each(p, &n->list)
+		if (!l--)
+			return list_entry(p, struct vfsmount, mnt_list);
+	return NULL;
+}
+
+static void *ms_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct namespace *n = m->private;
+	struct list_head *p = ((struct vfsmount *)v)->mnt_list.next;
+	(*pos)++;
+	return p==&n->list ? NULL : list_entry(p, struct vfsmount, mnt_list);
+}
+
+static void ms_stop(struct seq_file *m, void *v)
+{
+	struct namespace *n = m->private;
+	up_read(&n->sem);
+}
+
+static int show_vfsstat(struct seq_file *m, void *v)
+{
+	struct vfsmount *mnt = v;
+	int err = 0;
+
+	/* device */
+	if (mnt->mnt_devname) {
+		seq_puts(m, "device ");
+		mangle(m, mnt->mnt_devname);
+	} else
+		seq_puts(m, "no device");
+
+	/* mount point */
+	seq_puts(m, " mounted on ");
+	seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
+	seq_putc(m, ' ');
+
+	/* file system type */
+	seq_puts(m, "with fstype ");
+	mangle(m, mnt->mnt_sb->s_type->name);
+
+	/* optional statistics */
+	if (mnt->mnt_sb->s_op->show_stats) {
+		seq_putc(m, ' ');
+		err = mnt->mnt_sb->s_op->show_stats(m, mnt);
+	}
+
+	seq_putc(m, '\n');
+	return err;
+}
+
+struct seq_operations mountstats_op = {
+	.start	= ms_start,
+	.next	= ms_next,
+	.stop	= ms_stop,
+	.show	= show_vfsstat,
+};
+
 /**
  * may_umount_tree - check if a mount tree is busy
  * @mnt: root of mount tree
diff -X /home/cel/src/linux/dont-diff -Naurp 00-stock/fs/proc/base.c 01-mountstats/fs/proc/base.c
--- 00-stock/fs/proc/base.c	2005-03-02 02:38:12.0 -0500
+++ 01-mountstats/fs/proc/base.c	2005-03-14 15:24:51.571085000 -0500
@@ -60,6 +60,7 @@ enum pid_directory_inos {
 	PROC_TGID_STATM,
 	PROC_TGID_MAPS,
 	PROC_TGID_MOUNTS,
+	PROC_TGID_MOUNTSTATS,
 	PROC_TGID_WCHAN,
 #ifdef CONFIG_SCHEDSTATS
 	PROC_TGID_SCHEDSTAT,
@@ -91,6 +92,7 @@ enum pid_directory_inos {
 	PROC_TID_STATM,
 	PROC_TID_MAPS,
 	PROC_TID_MOUNTS,
+	PROC_TID_MOUNTSTATS,
 	PROC_TID_WCHAN,
 #ifdef CONFIG_SCHEDSTATS
 	PROC_TID_SCHEDSTAT,
@@ -134,6 +136,7 @@ static struct pid_entry tgid_base_stuff[
 	E(PROC_TGID_ROOT,      "root",    S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_EXE,       "exe",     S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_MOUNTS,    "mounts",  S_IFREG|S_IRUGO),
+	E(PROC_TGID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUGO),
 #ifdef CONFIG_SECURITY
 	E(PROC_TGID_ATTR,      "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -164,6 +167,7 @@ static struct pid_entry tid_base_stuff[]
 	E(PROC_TID_ROOT,       "root",    S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_EXE,        "exe",     S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_MOUNTS,     "mounts",  S_IFREG|S_IRUGO),
+	E(PROC_TID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUGO),
 #ifdef CONFIG_SECURITY
 	E(PROC_TID_ATTR,       "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -528,6 +532,38 @@ static struct file_operations proc_mount
 	.release	= mounts_rele
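On the user-space side, a consumer of the proposed file might pick the
fixed fields out of each line as sketched below. This assumes the line
layout printed by show_vfsstat() in the patch as posted ("device X
mounted on Y with fstype Z", or "no device mounted on ..."); struct
mountstat and parse_mountstats_line() are names invented for the sketch,
and whatever a file system's show_stats hook appends after the fstype is
simply ignored here.

```c
#include <stdio.h>
#include <string.h>

struct mountstat {
	char device[256];	/* empty string if the line says "no device" */
	char mountpoint[256];
	char fstype[64];
};

/*
 * Parse the fixed prefix of one mountstats line.
 * Returns 0 on success, -1 if the line does not match either layout.
 */
int parse_mountstats_line(const char *line, struct mountstat *ms)
{
	if (sscanf(line, "device %255s mounted on %255s with fstype %63s",
		   ms->device, ms->mountpoint, ms->fstype) == 3)
		return 0;

	/* show_vfsstat() prints "no device" when mnt_devname is NULL */
	if (sscanf(line, "no device mounted on %255s with fstype %63s",
		   ms->mountpoint, ms->fstype) == 2) {
		ms->device[0] = '\0';
		return 0;
	}
	return -1;
}
```

Reading with %s is enough for the mount point because seq_path() is
called with an escape set covering space, tab, newline and backslash, so
those characters do not appear literally in the path field.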
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Werner Almesberger (WA) writes:

 >> locked during writeback? PG_writeback should be used instead of PG_locked.

 WA> In mpage_writepages, writepage can also get called with the page just
 WA> PG_locked.

you can drop PG_locked right as you set PG_writeback, I think

thanks, Alex
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Alex Tomas wrote:
> I see no reason to reserve specific block in ->prepare/->commit in
> delayed allocation case. We already do this with reservation.

This seems like a sensible approach to me. Trying to reserve specific
blocks in an FS-independent way was what got us in trouble on ABISS.
So the plan B is to add this kind of reservation to where it is
really lacking (i.e. FAT).

Hmm, it's a bit confusing that we call both things "reservation".
Well, airlines do this too, "free seating".

> locked during writeback? PG_writeback should be used instead of PG_locked.

In mpage_writepages, writepage can also get called with the page just
PG_locked.

- Werner

--
Werner Almesberger, Buenos Aires, Argentina    [EMAIL PROTECTED]
http://www.almesberger.net/
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Werner Almesberger (WA) writes:

 WA> Do you plan to reserve space as "blocks, somewhere", or as "these
 WA> specific on-disk locations" ? In ABISS, we did something of the
 WA> latter kind (in order to make large contiguous allocations also on
 WA> FAT), and it turned out to be a big mess, because ABISS needed too
 WA> much support from the file system driver. So we just scrapped that
 WA> bit :-)

I see no reason to reserve specific blocks in ->prepare/->commit in the
delayed allocation case. We already do this with reservation. The sole
point of delayed allocation is to allocate many blocks at once: to
minimize fragmentation, to decrease allocator involvement, and to avoid
allocation at all if the file gets truncated quickly.

 WA> The main parts: we added a new page flag, PG_delalloc, which
 WA> basically tells everyone to stay away from that page. There are
 WA> two purposes: (a) to make sure no allocation happens unless
 WA> explicitly requested, and (b) prevent the page from being written
 WA> back while it is still in ABISS' playout buffer. The reason for
 WA> (b) is that the page gets locked during writeback, which could
 WA> cause delays if the ABISS-using application then decides to
 WA> access the page.

locked during writeback? PG_writeback should be used instead of PG_locked.

thanks, Alex
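The point about allocating many blocks at once can be shown with a toy
model that only counts how often the allocator runs. Nothing here is
taken from the ext3 delalloc patch; struct da_file and the da_*()
functions are hypothetical names, and "allocation" is reduced to a
counter.

```c
#include <stddef.h>

struct da_file {
	size_t pending;		/* blocks written, no on-disk location yet */
	size_t allocated;	/* blocks that have an on-disk location */
	size_t alloc_calls;	/* how many times the allocator ran */
};

/* write(): only remember that one more block will need a home. */
void da_write_block(struct da_file *f)
{
	f->pending++;
}

/* Truncate before flush: the pending blocks never reach the allocator. */
void da_truncate(struct da_file *f)
{
	f->pending = 0;
}

/* Flush: one allocator call covers the whole batch of pending blocks. */
void da_flush(struct da_file *f)
{
	if (f->pending == 0)
		return;
	f->alloc_calls++;
	f->allocated += f->pending;
	f->pending = 0;
}
```

Eight da_write_block() calls followed by da_flush() cost one allocator
call instead of eight, and a file that is written and then truncated
before the flush costs none.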
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Suparna Bhattacharya wrote:
> I'm looking at whether we can do most of it at VFS level

Do you plan to reserve space as "blocks, somewhere", or as "these
specific on-disk locations" ? In ABISS, we did something of the
latter kind (in order to make large contiguous allocations also on
FAT), and it turned out to be a big mess, because ABISS needed too
much support from the file system driver. So we just scrapped that
bit :-)

> Of course, I haven't looked at how ABISS does delayed alloc --
> do you have a patch snippet I can look at ?

I just made a release. The kernel patch is in
abiss-7/kernel/abiss.patch

It's all in one big patch, sorry. The main purpose of this is to see
what we can achieve, so it's not very polished.

The main parts: we added a new page flag, PG_delalloc, which
basically tells everyone to stay away from that page. There are
two purposes: (a) to make sure no allocation happens unless
explicitly requested, and (b) prevent the page from being written
back while it is still in ABISS' playout buffer. The reason for
(b) is that the page gets locked during writeback, which could
cause delays if the ABISS-using application then decides to
access the page.

The "hands off" code is mainly in fs/buffer.c, in the functions
__block_commit_write (set the page dirty, then go away),
cont_prepare_write (for FAT, do nothing), block_prepare_write
(for ext2, do nothing), and then fs/mpage.c:mpage_writepages
(skip pages marked for delayed allocation).

cont_prepare_write also needs to handle the special case where it
has to fill holes in a file. In this case, it simply overrides
delayed allocation. This bit will need more work.

Since ABISS prefetches pages, cont_prepare_write and
block_prepare_write may now see pages that are already up to date,
so they must not zero them.

The prefetching happens in fs/abiss/sched_lib.c:abiss_read_page,
and writeback in abiss_put_page. We also experimented with leaving
the writeback to MM, but that led to OOM far too often.
The current solution works quite smoothly even if we tax the system
hard.

In order to keep things simple, I didn't try to make delayed
allocation do anything for writers that don't use ABISS.

The life cycle of a page is about as follows: when an application
reads or writes a file, ABISS maintains a playout buffer for it,
that typically reaches a few hundred kB ahead of the current file
position. Pages are prefetched and locked in the playout buffer.

The playout buffer is dimensioned such that when file data enters the
playout buffer, there is enough time for the data to be in memory
by the time the application reaches it.

ABISS just calls readpage to get the data, which either causes it
to be read from disk, or the page to be zeroed, if we're beyond
EOF or at a hole.

The application accesses the page through the normal VFS functions,
so in the case of writing, the prepare/commit process happens.

Once the application has accessed the page, and moves the playout
buffer beyond it, the page is released and written back to disk.

Prefetching and writeback are done in a separate kernel thread, so
the application does not get delayed.

- Werner

--
Werner Almesberger, Buenos Aires, Argentina    [EMAIL PROTECTED]
http://www.almesberger.net/
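The "hands off" behaviour described above (writeback skips pages marked
for delayed allocation until the playout buffer has moved past them) can
be modelled in user space as follows. This is a simplified sketch and not
code from abiss.patch: the flag values, struct sk_page and the sk_*()
functions are invented, the real PG_delalloc is a kernel page flag, and
actual I/O is replaced by clearing a dirty bit.

```c
#include <stddef.h>

/* Illustrative flag bits; they do not match real kernel page flags. */
#define SK_DIRTY	0x1u
#define SK_DELALLOC	0x2u	/* stand-in for ABISS' PG_delalloc */

struct sk_page {
	unsigned int flags;
};

/*
 * One writeback pass over n pages: dirty pages are "written", except
 * those still marked for delayed allocation, which are left alone.
 * Returns the number of pages written.
 */
size_t sk_writeback_pass(struct sk_page *pages, size_t n)
{
	size_t written = 0;

	for (size_t i = 0; i < n; i++) {
		if (!(pages[i].flags & SK_DIRTY))
			continue;
		if (pages[i].flags & SK_DELALLOC)
			continue;	/* still in the playout buffer */
		pages[i].flags &= ~SK_DIRTY;	/* pretend the I/O happened */
		written++;
	}
	return written;
}

/* The playout buffer has moved past the page: allow it to be written. */
void sk_release_page(struct sk_page *page)
{
	page->flags &= ~SK_DELALLOC;
}
```

In the sketch a released page is picked up by the next pass; in ABISS as
described above, the release and the writeback are both done by its own
kernel thread.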
Active Block I/O Scheduling System (ABISS), version 7
The Active Block I/O Scheduling System (ABISS) is an extension of the
hard-disk storage subsystem of Linux, whose main purpose is to provide
a guaranteed reading and (eventually) writing bit rate to applications.

The ABISS project is conducted by Philips Research in Eindhoven, the
Netherlands (see
http://www.research.philips.com/technologies/storage/index.html).

http://abiss.sourceforge.net/abiss-7.tar.gz
md5sum  081abbfa1d11ce268dab300576edc194
sha1sum 7851ebd768fc1a96207836b5189450c90e4ddd05

This release upgrades ABISS to the 2.6.11 kernel, brings some major
cleanup and introduces experimental support for writing with a
guaranteed rate.

The highlights:

 - the "allocator" functionality has been completely removed. It
   represented a very complicated way of doing things that can be
   done much more efficiently and cleanly in the file system driver,
   complicated the inner workings of ABISS, and wasn't of much use
   in its present state anyway.

 - removed the abiss_detach message, which was a no-op

 - this release adds an experimental mechanism for delayed allocation
   of file blocks. In its current form, this is mainly intended for
   exploring performance aspects, and may have as yet undiscovered
   fascinating bugs. This may also be of interest to a broader
   audience, hence the cross-posting to linux-fsdevel.

 - ABISS now tries to guarantee the accepted data rate also when
   writing. For now, this only works for FAT and ext2, and when
   delayed allocations are enabled. All this is still very
   experimental and only works most of the time.

For additional information, please have a look at
http://abiss.sourceforge.net/

- Werner

--
Werner Almesberger, Buenos Aires, Argentina    [EMAIL PROTECTED]
http://www.almesberger.net/
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
On Mon, Mar 14, 2005 at 05:36:58AM -0300, Werner Almesberger wrote:
> Mingming Cao wrote:
> > I agree delayed allocation make much sense with multiblock allocation.
> > But I still think itself worth the effort, even without multiple block
> > allocation.
>
> On ABISS, we're currently also experimenting with delayed allocation.
> There, the goal is less to improve overall performance, but to move
> the accesses out of the synchronous code path for write(2).
>
> The code works quite nicely for FAT and ext2, limiting the time it
> takes to make a write call writing new data to about 4-6 ms on a
> fairly sluggish machine (plus about 2-4 ms for moving the playout
> point, which is a separate operation in ABISS), and with eight
> competing best-effort writers who each enjoy write latencies of some
> 8 seconds, worst-case, overwriting old data.
>
> Of course, this fails horribly on ext3, because it doesn't do anything
> useful with the journal. Another problem is error handling. Since FAT
> and ext2 don't have any form of reservation, a full disk isn't detected
> until it's far too late.
>
> So, a VFS-level reservation function would indeed be nice to have.
>
> I looked at ext3 delalloc briefly, and while it did indeed improve
> performance quite nicely, by being tied to ext3 internals, it would
> be difficult to use in the framework of ABISS, where the code paths
> are different (e.g. the prepare/commit functions should be as close
> to no-ops as possible, and leave all the work to the prefetcher
> thread), and which tries to be relatively file system independent.

I'm looking at whether we can do most of it at VFS level ...
with ext3 only taking care of the additional journalling bit - seems
quite feasible. There are two reqs (1) reservation (2) changing
mpage_writepages to use get_blocks(), which don't seem too hard.
ext3 ordered mode will need a bit more thought.

Of course, I haven't looked at how ABISS does delayed alloc --
do you have a patch snippet I can look at ?

Regards
Suparna

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
Mingming Cao wrote:
> I agree delayed allocation make much sense with multiblock allocation.
> But I still think itself worth the effort, even without multiple block
> allocation.

On ABISS, we're currently also experimenting with delayed allocation.
There, the goal is less to improve overall performance, but to move
the accesses out of the synchronous code path for write(2).

The code works quite nicely for FAT and ext2, limiting the time it
takes to make a write call writing new data to about 4-6 ms on a
fairly sluggish machine (plus about 2-4 ms for moving the playout
point, which is a separate operation in ABISS), and with eight
competing best-effort writers who each enjoy write latencies of some
8 seconds, worst-case, overwriting old data.

Of course, this fails horribly on ext3, because it doesn't do anything
useful with the journal. Another problem is error handling. Since FAT
and ext2 don't have any form of reservation, a full disk isn't detected
until it's far too late.

So, a VFS-level reservation function would indeed be nice to have.

I looked at ext3 delalloc briefly, and while it did indeed improve
performance quite nicely, by being tied to ext3 internals, it would
be difficult to use in the framework of ABISS, where the code paths
are different (e.g. the prepare/commit functions should be as close
to no-ops as possible, and leave all the work to the prefetcher
thread), and which tries to be relatively file system independent.

- Werner

--
Werner Almesberger, Buenos Aires, Argentina    [EMAIL PROTECTED]
http://www.almesberger.net/