Re: Getting FS access events
Hi,

On Tue, May 15, 2001 at 04:37:01PM +1200, Chris Wedgwood wrote:
> On Sun, May 13, 2001 at 08:39:23PM -0600, Richard Gooch wrote:
> > Yeah, we need a decent unfragmenter. We can do that now with bmap().
>
> SCT wrote a defragger for ext2 but it only handles 1k blocks :(

Actually, I wrote it for extfs, and Alexey Vovenko ported it to ext2. Extfs *really* needed a defragmenter, because it had weird behaviour patterns which at times included allocating all of the blocks of a file in descending disk-block order.

Cheers,
 Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Getting FS access events
Hi,

On Fri, May 18, 2001 at 09:55:14AM +0200, Rogier Wolff wrote:
> The "boot quickly" was an example. "Load netscape quickly" on some
> systems is done by dd-ing the binary to /dev/null.

This is one of the reasons why some filesystems use extent maps instead of inode indirection trees. The problem of caching the metadata basically just goes away if your mapping information is a few bytes saying "this file is an extent of a hundred blocks at offset FOO followed by fifty blocks at offset BAR."

If the mapping metadata is _that_ compact, then your binaries are almost guaranteed to be either mapped in the inode or in a single mapping block, so the problem of seeking between indirect blocks basically just goes away.

You still have to do things like prime the inode/indirect cache before the first data access if you want directory scans to go fast, and you still have to preload data pages for readahead, of course. If the objective is "start netscape faster", then the cost of having to do one synchronous IO to pull in a single indirect extent map block is going to be negligible next to the other costs.

(Extent maps have their own problems, especially when it comes to dealing with holes, but that's a different story...)

--Stephen
Re: Getting FS access events
Hi,

On Sat, May 19, 2001 at 12:47:15PM -0700, Linus Torvalds wrote:
> On Sat, 19 May 2001, Pavel Machek wrote:
> > > Don't get _too_ hung up about the power-management kind of "invisible
> > > suspend/resume" sequence where you resume the whole kernel state.
> >
> > Ugh. Now I'm confused. How do you do useful resume from disk when you
> > don't restore complete state? Do you propose something like "write
> > only pagecache to disk"?
>
> Go back to the original _reason_ for this whole discussion.
>
> It's not really a "resume" event, it's a "populate caches really
> efficiently at boot" event.

Then you'd better be sure that the cache (or at least, the saved image) only contains data which is guaranteed not to be written between successive restores from the same image.

The big advantage of just resuming from the state of the previous shutdown (whether it's cache or the whole kernel state) is that you've got a much higher expectation that nothing on disk got modified between the save and the restore.

--Stephen
Re: Getting FS access events
> I'm confused. I've always wondered why, before suspending the state
> of a machine to disk, we don't just throw away unnecessary data
> like anything not actively referenced.

swsusp does exactly that.
Re: Getting FS access events
On Sat, 19 May 2001, Pavel Machek wrote:
> > Don't get _too_ hung up about the power-management kind of "invisible
> > suspend/resume" sequence where you resume the whole kernel state.
>
> Ugh. Now I'm confused. How do you do useful resume from disk when you
> don't restore complete state? Do you propose something like "write
> only pagecache to disk"?

Go back to the original _reason_ for this whole discussion.

It's not really a "resume" event, it's a "populate caches really efficiently at boot" event. But the two are basically the same problem; it's only a matter of how much you populate (do you populate _everything_, or just the disk caches? Populating just the caches is the smaller and simpler problem, but it only solves the "fast boot" issue).

Linus
Re: Getting FS access events
Hi!

> > resume from disk is actually pretty hard to do in a way that it is
> > read linearly.
> >
> > While playing with swsusp patches (== suspend to disk) I found out that
> > it was slow. It needs to do an atomic snapshot, and the only reasonable
> > way to do that is to free half of RAM, cli() and copy.
>
> Note that "resume from disk" does _not_ necessarily have to resume kernel
> data structures. It is enough if it just resumes the caches etc.
> Don't get _too_ hung up about the power-management kind of "invisible
> suspend/resume" sequence where you resume the whole kernel state.

Ugh. Now I'm confused. How do you do useful resume from disk when you don't restore complete state? Do you propose something like "write only pagecache to disk"?

Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+
Re: Getting FS access events
On Tue, 15 May 2001, Pavel Machek wrote:
> resume from disk is actually pretty hard to do in a way that it is
> read linearly.
>
> While playing with swsusp patches (== suspend to disk) I found out that
> it was slow. It needs to do an atomic snapshot, and the only reasonable
> way to do that is to free half of RAM, cli() and copy.

Note that "resume from disk" does _not_ necessarily have to resume kernel data structures. It is enough if it just resumes the caches etc.

Don't get _too_ hung up about the power-management kind of "invisible suspend/resume" sequence where you resume the whole kernel state.

Linus
Re: Getting FS access events
Linus Torvalds wrote:
> I'm really serious about doing "resume from disk". If you want a fast
> boot, I will bet you a dollar that you cannot do it faster than by loading
> a contiguous image of several megabytes contiguously into memory. There is
> NO overhead, you're pretty much guaranteed platter speeds, and there are
> no issues about trying to order accesses etc. There are also no issues
> about messing up any run-time data structures.

Linus,

The "boot quickly" was an example. "Load netscape quickly" on some systems is done by dd-ing the binary to /dev/null.

Now, you're going to say again that this won't work because of buffer-cache/page-cache incoherency. That is NOT the point. The point is that the fun about a cache is that it's just a cache: it speeds things up transparently. If I need a new "prime-the-cache" program to mmap the files and trigger a page-in in the right order, then that's fine with me. The fun about doing these tricks is that it works, and keeps on working (functionally) even if it stops working (fast).

Yes, there is a way to boot even faster: preloading memory. Fine. But this doesn't allow me to load netscape quicker.

Roger.

--
** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.
Re: Getting FS access events
Hi!

> Besides, just how often do you reboot the box? If that's the hotspot for
> you - when the hell does the poor beast find time to do something useful?

Ten times a day? But booting is a special case: you can read your mail while compiling a kernel, but try to read your mail while your machine is booting.

What's worse, boot time tends to be time critical, as in "I need to find that mail that tells me where I'm expected to be half an hour from now. Ouch. It's going to take 40 minutes to get there."

Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt, details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
Re: Getting FS access events
Hi!

> And because your suspend/resume idea isn't really going to help me
> much. That's because my boot scripts have the notion of
> "personalities" (change the boot configuration by asking the user
> early on in the boot process). If I suspend after I've got XDM
> running, it's too late.

Why not e2defrag so that everything needed for bootup is linear at the start of the disk? Use strace to collect statistics of what happens during bootup. [strace should be good enough. If not, uml is.]

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt, details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
Re: Getting FS access events
Hi!

> I'm really serious about doing "resume from disk". If you want a fast
> boot, I will bet you a dollar that you cannot do it faster than by loading
> a contiguous image of several megabytes contiguously into memory. There is
> NO overhead, you're pretty much guaranteed platter speeds, and there are
> no issues about trying to order accesses etc. There are also no issues
> about messing up any run-time data structures.

Resume from disk is actually pretty hard to do in a way that it is read linearly.

While playing with swsusp patches (== suspend to disk) I found out that it was slow. It needs to do an atomic snapshot, and the only reasonable way to do that is to free half of RAM, cli() and copy.

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt, details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
Re: Getting FS access events
Anton Altaparmakov wrote:
> True, but I was under the impression that Linus' master plan was that the
> two would be in entirely separate name spaces using separate cached copies
> of the device blocks.

Nothing was said about the superblock at all.

-hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
At 02:30 16/05/2001, H. Peter Anvin wrote:
>Anton Altaparmakov wrote:
> > And how are you thinking of this working "without introducing new
> > interfaces" if the caches are indeed incoherent? Please correct me if I
> > understand wrong, but when two caches are incoherent, I thought it means
> > that the above _would_ screw up unless protected by exclusive write
> > locking as I suggested in my previous post, with the side effect that
> > you can't write the boot block without unmounting the filesystem or
> > modifying some interface somewhere.
>
>Not if direct device access and the superblock exist in the same mapping
>space, OR an explicit interface to write the boot block is created.

True, but I was under the impression that Linus' master plan was that the two would be in entirely separate name spaces using separate cached copies of the device blocks.

Putting them into the same cache would make things work of course, although direct access would probably give you a view of an inconsistent file system if the fs was writing around the page cache at the time (unless the fs and direct accesses lock every page on write access, perhaps by zeroing the uptodate flag on the page).

An explicit interface for the boot block would be interesting. AFAICS it would have to call down into the file system driver itself (a read/write_boot_block method in super_operations perhaps?) due to the differences in how the boot block is stored on different filesystems (thinking of the "boot block is a file" NTFS case).

Best regards,
Anton
--
Anton Altaparmakov (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
Re: Getting FS access events
Anton Altaparmakov wrote:
> And how are you thinking of this working "without introducing new
> interfaces" if the caches are indeed incoherent? Please correct me if I
> understand wrong, but when two caches are incoherent, I thought it means
> that the above _would_ screw up unless protected by exclusive write
> locking as I suggested in my previous post, with the side effect that
> you can't write the boot block without unmounting the filesystem or
> modifying some interface somewhere.

Not if direct device access and the superblock exist in the same mapping space, OR an explicit interface to write the boot block is created.

-hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
At 23:35 15/05/2001, H. Peter Anvin wrote:
>"Albert D. Cahalan" wrote:
> > H. Peter Anvin writes:
> > > This would leave no way (without introducing new interfaces) to write,
> > > for example, the boot block on an ext2 filesystem. Note that the
> > > bootblock (defined as the first 1024 bytes) is not actually used by
> > > the filesystem, although depending on the block size it may share a
> > > block with the superblock (if blocksize > 1024).
> >
> > The lack of coherency would screw this up anyway, wouldn't it?
> > You have a block device, soon to be in the page cache, and
> > a superblock, also soon to be in the page cache. LILO writes to
> > the block device, while the ext2 driver updates the superblock.
> > Whatever gets written out last wins, and the other is lost.
>
>Albert, I *did* say "this better work or we have a problem."

And how are you thinking of this working "without introducing new interfaces" if the caches are indeed incoherent? Please correct me if I understand wrong, but when two caches are incoherent, I thought it means that the above _would_ screw up unless protected by exclusive write locking as I suggested in my previous post, with the side effect that you can't write the boot block without unmounting the filesystem or modifying some interface somewhere.

As not all filesystems are like ext2, perhaps it would be better to fix ext2 and not the cache coherency? If ext2 is claiming ownership of a device, then it should do so in its entirety IMHO. You could always extend ext2 to use the NTFS approach, where the bootsector is nothing more than a file which happens to exist on sector(s) zero (and following) of the device... (just a thought)

Best regards,
Anton
--
Anton Altaparmakov (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
Re: Getting FS access events
"Albert D. Cahalan" wrote:
> H. Peter Anvin writes:
> > This would leave no way (without introducing new interfaces) to write,
> > for example, the boot block on an ext2 filesystem. Note that the
> > bootblock (defined as the first 1024 bytes) is not actually used by
> > the filesystem, although depending on the block size it may share a
> > block with the superblock (if blocksize > 1024).
>
> The lack of coherency would screw this up anyway, wouldn't it?
> You have a block device, soon to be in the page cache, and
> a superblock, also soon to be in the page cache. LILO writes to
> the block device, while the ext2 driver updates the superblock.
> Whatever gets written out last wins, and the other is lost.

Albert, I *did* say "this better work or we have a problem."

-hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
H. Peter Anvin writes:
> This would leave no way (without introducing new interfaces) to write,
> for example, the boot block on an ext2 filesystem. Note that the
> bootblock (defined as the first 1024 bytes) is not actually used by
> the filesystem, although depending on the block size it may share a
> block with the superblock (if blocksize > 1024).

The lack of coherency would screw this up anyway, wouldn't it? You have a block device, soon to be in the page cache, and a superblock, also soon to be in the page cache. LILO writes to the block device, while the ext2 driver updates the superblock. Whatever gets written out last wins, and the other is lost.
Re: Getting FS access events
On Tue, May 15, 2001 at 02:02:29PM -0700, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Alexander Viro <[EMAIL PROTECTED]> wrote:
> >On Tue, 15 May 2001, H. Peter Anvin wrote:
> >> Alexander Viro wrote:
> >> > > None whatsoever. The one thing that matters is that no one starts
> >> > > making the assumption that mapping->host->i_mapping == mapping.

Don't worry too much about that; that relationship has been false for Coda ever since i_mapping was introduced.

The only problem that is still lingering is related to i_size. Writes update inode->i_mapping->host->i_size, and stat reads inode->i_size, which are not the same. I sent a small patch to stat.c for this a long time ago (Linux 2.3.99-pre6-7), which made the assumption in stat that i_mapping->host was an inode (effectively tmp.st_size = inode->i_mapping->host->i_size). Other solutions were to finish the getattr implementation, or keep a small Coda-specific wrapper for generic_file_write around.

> >> > One actually shouldn't assume that mapping->host is an inode.
> >>
> >> What else could it be, since it's a "struct inode *"? NULL?
> >
> >struct block_device *, for one thing. We'll have to do that as soon
> >as we do block devices in pagecache.
>
> No, Al. It's an inode. It was a major mistake to ever think anything
> else.

So is anyone interested in a small patch for stat.c? It fixes, as far as I know, the last place that 'assumes' that inode->i_mapping->host is identical to &inode.

Jan
Re: Getting FS access events
Alexander Viro wrote:
> void *.
>
> Look, methods of your address_space certainly know what the hell they
> are dealing with. Just as autofs_root_readdir() knows what
> inode->u.generic_ip really points to.
>
> Anybody else has no business to care about the contents of ->host.

Why do we need a ->host at all, then? Why not simply make it a private pointer?

-hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
In article <[EMAIL PROTECTED]>, Alexander Viro <[EMAIL PROTECTED]> wrote:
>>
>> How would you know what datatype it is? A union? Making "struct
>> block_device *" a "struct inode *" in a nonmounted filesystem? In a
>> devfs? (Seriously. Being able to do these kinds of data-structural
>> equivalence is IMO the nice thing about devfs & co...)
>
>void *.

No. It used to be that way, and it was a horrible mess.

We _need_ to know that it's an inode, because the generic mapping functions basically need to do things like

	mark_inode_dirty_pages(mapping->host);

which in turn needs the host to be an inode (otherwise you don't know how and where to write the dang things back again).

There's no question that you can avoid it being an inode by virtualizing more of it, and adding more virtual functions to the mapping operations (right now the only one you'd HAVE to add is the "mark_page_dirty()" operation), but the fact is that code gets really ugly by doing things like that.

It was an absolute pleasure to remove all the casts of "mapping->host". With "void *" it needed to be cast to the right type (and you had to be able to _prove_ that you knew what the right type was). With "inode *", the type is statically known, and you don't actually lose anything (at worst, you'd have a virtual inode and then do an extra layer of indirection there).

I really don't think we want to go back to "void *".

Linus
Re: Getting FS access events
On Tue, 15 May 2001, Alexander Viro wrote:
> On 15 May 2001, Kai Henningsen wrote:
> > [EMAIL PROTECTED] (Alexander Viro) wrote on 15.05.01 in
> > <[EMAIL PROTECTED]>:
> > > ... and Multics had all access to files through the equivalent of
> > > mmap() in the 60s. "Segments" in ls(1) got that name for a good reason.
> >
> > Where's something called "segments" connected with ls(1)? I can't seem to
> > find the reference.
>
> ls == list segments. Name came from Multics.

Basically, they had the whole address space consisting of mmapped files. An address was (segment << 18) + offset (both fields up to 18 bits), and the primitive was "attach segment (== file) to address space". Each segment had its own page table, BTW. Directories were special segments and contained references to other segments (both files and directories). Root had a fixed ID. You could look up a segment by name.
Re: Getting FS access events
In article <[EMAIL PROTECTED]>, Alexander Viro <[EMAIL PROTECTED]> wrote:
>On Tue, 15 May 2001, H. Peter Anvin wrote:
>> Alexander Viro wrote:
>> > > None whatsoever. The one thing that matters is that no one starts
>> > > making the assumption that mapping->host->i_mapping == mapping.
>> >
>> > One actually shouldn't assume that mapping->host is an inode.
>>
>> What else could it be, since it's a "struct inode *"? NULL?
>
>struct block_device *, for one thing. We'll have to do that as soon
>as we do block devices in pagecache.

No, Al. It's an inode. It was a major mistake to ever think anything else.

I see your problem, but it's not a real problem. What you do for block devices (or anything like that where you might have _multiple_ inodes pointing to the same thing) is to just create a "virtual inode", and have THAT be the one that the mapping is associated with. Basically each "struct block_device *" would have an inode associated with it, to act as an anchor for things like this.

What is "struct inode", after all? It's just the virtual representation of an "entity". The inodes associated with /dev/hda are not the inodes associated with the actual _device_ - they are just on-disk "links" to the physical device.

[ Aside: there are good arguments to _not_ embed "struct inode" into "struct block_device", but instead do it the other way around - the same way we have filesystem-specific inode data inside "struct inode" we can easily have device-type specific data there. And it makes a whole lot more sense to attach a mount to an inode than it makes to attach a mount to a "struct block_device".

Done right, we could eventually get rid of "loopback block devices". They'd just be inodes that aren't of type "struct block_device", and the index to "struct buffer_head" would not be , but . See? The added level of indirection is one that we actually already _use_, it's just that we have this loopback device special case for it..

In a "perfect" setup you could actually do "mount -t ext2 file /mnt/x" without having _any_ loopback setup or anything like that, simply because you don't _need_ it. It would be automatic. ]

Linus
Re: Getting FS access events
On 15 May 2001, Kai Henningsen wrote:
> [EMAIL PROTECTED] (Alexander Viro) wrote on 15.05.01 in
> <[EMAIL PROTECTED]>:
> > ... and Multics had all access to files through the equivalent of
> > mmap() in the 60s. "Segments" in ls(1) got that name for a good reason.
>
> Where's something called "segments" connected with ls(1)? I can't seem to
> find the reference.

ls == list segments. Name came from Multics.
Re: Getting FS access events
[EMAIL PROTECTED] (Alexander Viro) wrote on 15.05.01 in <[EMAIL PROTECTED]>:
> ... and Multics had all access to files through the equivalent of mmap()
> in the 60s. "Segments" in ls(1) got that name for a good reason.

Where's something called "segments" connected with ls(1)? I can't seem to find the reference.

MfG Kai
Re: Getting FS access events
On Tue, 15 May 2001, H. Peter Anvin wrote:
> Alexander Viro wrote:
> > > What else could it be, since it's a "struct inode *"? NULL?
> >
> > struct block_device *, for one thing. We'll have to do that as soon
> > as we do block devices in pagecache.
>
> How would you know what datatype it is? A union? Making "struct
> block_device *" a "struct inode *" in a nonmounted filesystem? In a
> devfs? (Seriously. Being able to do these kinds of data-structural
> equivalence is IMO the nice thing about devfs & co...)

void *.

Look, methods of your address_space certainly know what the hell they are dealing with. Just as autofs_root_readdir() knows what inode->u.generic_ip really points to.

Anybody else has no business to care about the contents of ->host.
Re: Getting FS access events
On Tue, 15 May 2001, H. Peter Anvin wrote:
> Alexander Viro wrote:
> > > None whatsoever. The one thing that matters is that no one starts
> > > making the assumption that mapping->host->i_mapping == mapping.
> >
> > One actually shouldn't assume that mapping->host is an inode.
>
> What else could it be, since it's a "struct inode *"? NULL?

struct block_device *, for one thing. We'll have to do that as soon as we do block devices in pagecache.
Re: Getting FS access events
Alexander Viro wrote:
> > What else could it be, since it's a "struct inode *"? NULL?
>
> struct block_device *, for one thing. We'll have to do that as soon
> as we do block devices in pagecache.

How would you know what datatype it is? A union? Making "struct block_device *" a "struct inode *" in a nonmounted filesystem? In a devfs? (Seriously. Being able to do these kinds of data-structural equivalence is IMO the nice thing about devfs & co...)

-hpa
--
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
On Tue, 15 May 2001, H. Peter Anvin wrote:
> Alexander Viro wrote:
> > On 15 May 2001, H. Peter Anvin wrote:
> > > isofs wouldn't be too bad as long as struct mapping:struct inode is a
> > > many-to-one mapping.
> >
> > Erm... What's wrong with inode->u.isofs_i.my_very_own_address_space ?
>
> None whatsoever. The one thing that matters is that no one starts making
> the assumption that mapping->host->i_mapping == mapping.

One actually shouldn't assume that mapping->host is an inode.
Re: Getting FS access events
Alexander Viro wrote: > > > > None whatsoever. The one thing that matters is that no one starts making > > the assumption that mapping->host->i_mapping == mapping. > > One actually shouldn't assume that mapping->host is an inode. > What else could it be, since it's a "struct inode *"? NULL? -hpa -- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
Alexander Viro wrote: > > On 15 May 2001, H. Peter Anvin wrote: > > > isofs wouldn't be too bad as long as struct mapping:struct inode is a > > many-to-one mapping. > > Erm... What's wrong with inode->u.isofs_i.my_very_own_address_space ? > None whatsoever. The one thing that matters is that no one starts making the assumption that mapping->host->i_mapping == mapping. -hpa -- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
On 15 May 2001, H. Peter Anvin wrote: > isofs wouldn't be too bad as long as struct mapping:struct inode is a > many-to-one mapping. Erm... What's wrong with inode->u.isofs_i.my_very_own_address_space ?
Re: Getting FS access events
Followup to: <[EMAIL PROTECTED]> By author: Anton Altaparmakov <[EMAIL PROTECTED]> In newsgroup: linux.dev.kernel > > They shouldn't, but maybe some stupid utility or a typo will do it, creating > two incoherent copies of the same block on the device. -> Bad Things can > happen. > > Can't we simply stop people from doing it by say having mount lock the > device from further opens (and vice versa of course, doing a "dd" should > result in lock of device preventing a mount during the duration of "dd"). - > Wouldn't this be a good thing, guaranteeing that problems cannot happen > while not incurring any overhead except on device open/close? Or is this a > matter of "give the user enough rope"? - If proper rw locking is > implemented it could allow simultaneous -o ro mount with a dd from the > device but do exclusive write locking, for example, for maximum flexibility. > This would leave no way (without introducing new interfaces) to write, for example, the boot block on an ext2 filesystem. Note that the bootblock (defined as the first 1024 bytes) is not actually used by the filesystem, although depending on the block size it may share a block with the superblock (if blocksize > 1024). -hpa -- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
Followup to: <[EMAIL PROTECTED]> By author: Alexander Viro <[EMAIL PROTECTED]> In newsgroup: linux.dev.kernel > > UNIX-like ones (and that includes QNX) are easy. HFS is hopeless - it won't > be fixed unless the authors do it. Tigran will probably fix BFS just as a > learning experience ;-) ADFS looks tolerably easy to fix. AFFS... directories > will be pure hell - blocks jump from directory to directory at zero notice. > NTFS and HPFS will win from the switch (esp. NTFS). FAT is not a problem, if we > are willing to break CVF and let the author fix it. Reiserfs... Dunno. They've > got a private (slightly mutated) copy of ~60% of fs/buffer.c. UDF should be > OK. ISOFS... ask Peter. JFFS - dunno. > isofs wouldn't be too bad as long as struct mapping:struct inode is a many-to-one mapping. -hpa -- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt
Re: Getting FS access events
>And because your suspend/resume idea isn't really going to help me >much. That's because my boot scripts have the notion of >"personalities" (change the boot configuration by asking the user >early on in the boot process). If I suspend after I've got XDM >running, it's too late. Preface: As has been mentioned on this discussion thread, some disk devices maintain a cache of their own, running on a small (by today's standards) CPU. These caches are probably sector oriented, not block oriented, but are almost certainly not page oriented or filesystem oriented. Well, OK, some might have DOS filesystem knowledge built-in, I suppose... yuck! Anyway, although there may be slight differences, they are effectively block-oriented caches. As long as they are write-through (and/or there are cache flushing commands, etc), they are reasonably coherent with the operating system's main cache, and they meet the expectations of database programs, etc. that want stable storage. In terms of efficiency, there are questions about read-ahead, write-behind, write-through with invalidation or write-through with cache update -- the usual stuff. I leave it as an exercise for the reader to decide how best to tune their system, and merely assert that it can be done. Imagine, as a mental exercise, that you move this block-oriented cache out of the disk drive, and into the main CPU and operating system, say roughly at the disk driver level. We lose the efficiency of having the small CPU do the block lookups, but a hashed block lookup is rather cheap nowadays, wouldn't you say? Ignoring issues of, "What if the disk drive fails independently of the main CPU, or vice versa?", the transplanted block cache should operate pretty much as it did in the disk drive. In particular, it should continue to operate properly with the main CPU's main page cache. Conclusion: a page cache can successfully run over an appropriately designed block cache. QED. What's the hitch? 
It's the "appropriately designed" constraint. It is quite possible that the Linux block cache is not designed (data structures and code paths considered together) in a way that allows it to mimic a simple disk drive's block cache. I assume that there's some impediment, or this discussion wouldn't have lasted so long -- the idea of using the Linux block cache to model a disk drive's block cache is pretty obvious, after all. >So what I want is a solution that will keep the kernel clean (believe >me, I really do want to keep it clean), but gives me a fast boot too. >And I believe the solution is out there. We just haven't found it yet. Well, if you want a fast boot *on a single type of disk drive*, and the existing Linux block cache doesn't work, you could extend the driver for that hardware with an optional block cache, independently of Linux's block cache, along with an appropriate interface to populate it with boot-time blocks, and to flush it when no longer needed. That's not exactly clean, though, is it? You could extend the md (or LVM) drivers, or create a new driver similar to one of them, that incorporates a simple block cache, with appropriate mechanisms for populating and flushing it. Clean? er, no, rather muddy, in fact. You might want to lock down the pages that you've prepopulated, rather than let them be discarded before they're needed. This could be designed into a new block cache, but you might need to play some accounting games to get it right with the existing block cache. Finally, there's Linus' offer for a preread call, to prepopulate the page cache. By virtue of your knowledge of the underlying implementation of the system, you could preload the file system index pages into the block cache, and load the data pages into the page cache. Clean! Sewer-like! 
Craig Milo Rogers
Re: Getting FS access events
On Tuesday, May 15, 2001 04:33:57 AM -0400 Alexander Viro <[EMAIL PROTECTED]> wrote: > > > On Tue, 15 May 2001, Linus Torvalds wrote: > >> Looks like there are 19 filesystems that use the buffer cache right now: >> >> grep -l bread fs/*/*.c | cut -d/ -f2 | sort -u | wc >> >> So quite a bit of work involved. > > Reiserfs... Dunno. They've got a private (slightly mutated) copy of > ~60% of fs/buffer.c. But putting the log and the metadata in the page cache makes memory pressure and such cleaner, so this is one of my goals for 2.5. reiserfs will still have alias issues due to the packed tails (one copy in the btree, another in the page), but it will be no worse than it is now. -chris
Re: Getting FS access events
On Tuesday 15 May 2001 12:44, Alexander Viro wrote: > On Tue, 15 May 2001, Daniel Phillips wrote: > > That's because you left out his invalidate: > > > > * create an instance in pagecache > > * start reading into buffer cache (doesn't invalidate, right?) > > * start writing using pagecache (invalidate buffer copy) > > Bzzert. You have a race here. Let's make it explicit: > > start writing > put write request in queue > block on that > start reading into buffer cache > put read request into queue > read from media > write to media > > And no, we can't invalidate from IO completion hook. > > > * lose the page > > * try to read it (via pagecache) > > > > Everything ok. > > Nope. The problem is that we have two IO operations on the same physical block in the queue at the same time, and we don't know it. Maybe we should know it. For your specific example we are ok if we do: * create an instance in pagecache * start reading into buffer cache (doesn't invalidate, right?) * start writing using pagecache (invalidate buffer copy) * lose the page (invalidate buffer copy) * try to read it (via pagecache) We are also ok if we follow my suggested optimization and move the page to the buffer cache instead of just losing it. We are not ok if we do: * try to read it (via buffercache) because its copy is out of date, but this can be fixed by enforcing coherency in the request queue. 1) Why should the request queue not be coherent? 2) Can we stop talking about buffer cache here and start talking about blocks mapped into a separate address space in the page cache? From Linus's previous comments in this thread we are going to have that anyway, and your race also applies there. I'd like to call that separate address space a 'block cache'. -- Daniel
Re: Getting FS access events
On Tue, 15 May 2001, Daniel Phillips wrote: > That's because you left out his invalidate: > > * create an instance in pagecache > * start reading into buffer cache (doesn't invalidate, right?) > * start writing using pagecache (invalidate buffer copy) Bzzert. You have a race here. Let's make it explicit: start writing put write request in queue block on that start reading into buffer cache put read request into queue read from media write to media And no, we can't invalidate from IO completion hook. > * lose the page > * try to read it (via pagecache) > > Everything ok. Nope.
Re: Getting FS access events
On Tuesday 15 May 2001 08:57, Alexander Viro wrote: > On Tue, 15 May 2001, Richard Gooch wrote: > > > What happens if you create a buffer cache entry? Does that > > > invalidate the page cache one? Or do you just allow invalidates > > > one way, and not the other? And why? > > > > I just figured on one way invalidates, because that seems cheap and > > easy and has some benefits. Invalidating the other way is costly, > > so don't bother, even if there were some benefits. > > Cute. > * create an instance in pagecache > * start reading into buffer cache (doesn't invalidate, right?) > * start writing using pagecache > * lose the page > * try to read it (via pagecache) > Woops - just found a copy in buffer cache, let's pick data from it. > Pity that said data is obsolete... That's because you left out his invalidate: * create an instance in pagecache * start reading into buffer cache (doesn't invalidate, right?) * start writing using pagecache (invalidate buffer copy) * lose the page * try to read it (via pagecache) Everything ok. As an optimization, instead of 'lose the page', do 'move page blocks to buffer cache'. -- Daniel
Re: Getting FS access events
[EMAIL PROTECTED] said: > JFFS - dunno. Bah. JFFS doesn't use any of those horrible block device thingies. -- dwmw2
Re: Getting FS access events
At 08:13 15/05/01, Linus Torvalds wrote: >On Tue, 15 May 2001, Richard Gooch wrote: > > So what happens if I dd from the block device and also from a file on > > the mounted FS, where that file overlaps the bnums I dd'ed? Do we get > > two copies in the page cache? One for the block device access, and one > > for the file access? >Yup. And never the two shall meet. >Why should they? Why would you ever do something like that, or care about >the fact? They shouldn't, but maybe some stupid utility or a typo will do it, creating two incoherent copies of the same block on the device. -> Bad Things can happen. Can't we simply stop people from doing it by, say, having mount lock the device from further opens (and vice versa of course, doing a "dd" should result in a lock of the device preventing a mount during the duration of "dd")? - Wouldn't this be a good thing, guaranteeing that problems cannot happen while not incurring any overhead except on device open/close? Or is this a matter of "give the user enough rope"? - If proper rw locking is implemented it could allow a simultaneous -o ro mount with a dd from the device but do exclusive write locking, for example, for maximum flexibility. Just my 2p. Anton -- Anton Altaparmakov (replace at with @) Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/ ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
Re: Getting FS access events
Alan Cox <[EMAIL PROTECTED]> writes: > > Larry, go read up on TOPS-20. :-) SunOS did give unix mmap(), but it > > did not come up the idea. > Seems to be TOPS-10 > http://www.opost.com/dlm/tenex/fjcc72/ TENEX is not TOPS-10. TOPS-10 didn't get virtual memory until around 1974. By then, TENEX had been around for years. TOPS-20 was developed from TENEX starting around 1973. -- http://lars.nocrew.org/
Re: Getting FS access events
On Tue, 15 May 2001, Linus Torvalds wrote: > Looks like there are 19 filesystems that use the buffer cache right now: > > grep -l bread fs/*/*.c | cut -d/ -f2 | sort -u | wc > > So quite a bit of work involved. UNIX-like ones (and that includes QNX) are easy. HFS is hopeless - it won't be fixed unless the authors do it. Tigran will probably fix BFS just as a learning experience ;-) ADFS looks tolerably easy to fix. AFFS... directories will be pure hell - blocks jump from directory to directory at zero notice. NTFS and HPFS will win from the switch (esp. NTFS). FAT is not a problem, if we are willing to break CVF and let the author fix it. Reiserfs... Dunno. They've got a private (slightly mutated) copy of ~60% of fs/buffer.c. UDF should be OK. ISOFS... ask Peter. JFFS - dunno. So probably we'll have to keep the buffer cache (AFFS looks like a real killer), but we will be able to do pagecache-only versions of a_ops methods. If an fs has no metadata in the buffer cache we can drop unmap_underlying_metadata() for it.
Re: Getting FS access events
On Tue, 15 May 2001, Chris Wedgwood wrote: > > On Tue, May 15, 2001 at 12:13:13AM -0700, Linus Torvalds wrote: > > We should not create crap code just because we _can_. > > How about removing code? Absolutely. It's not all that often that we can do it, but when we can, it's the best thing in the world. > In 2.5.x is we move fs metadata into the pagecache, do we even need a > buffer cache anymore? Can't we just ditch it completely and make all > device access raw? Yes and no. Yes, it would be nice. But no, I doubt we'll move _all_ metadata into the page-cache. I doubt, for example, that we'll find people re-doing all the other filesystems. So even if ext2 was page-cache only, what about all the 35 other filesystems out there in the standard sources, never mind others that haven't been integrated (XFS, ext3 etc..). Yeah, I know. Some of them already do not use the buffer cache at all (the network filesystems come to mind ;), but even so.. Looks like there are 19 filesystems that use the buffer cache right now: grep -l bread fs/*/*.c | cut -d/ -f2 | sort -u | wc So quite a bit of work involved. But on the whole I'm definitely hoping that yes, we'll relegate the "buffer_head" to be mainly just for IO, and not be a first-class caching entity at all. It's just that I think it will take a _long_ time until we actually reach that noble goal completely. Linus
Re: Getting FS access events
On Tue, 15 May 2001, Richard Gooch wrote: > > > > What happens if you create a buffer cache entry? Does that > > invalidate the page cache one? Or do you just allow invalidates one > > way, and not the other? And why? > > I just figured on one way invalidates, because that seems cheap and > easy and has some benefits. Invalidating the other way is costly, so > don't bother, even if there were some benefits. Ahh.. Well, excuse me while I puke all over your shoes. Why don't you go hack the NT kernel, or something like that? I have some taste, and part of that is having this silly notion of "Things should make sense". We should not create crap code just because we _can_. Sure, it's easy to write the code you suggest. Do you really want a system like that? A system where you have rules that make no sense, except "it was easy to invalidate one way, so let's do that, and never mind that it makes no logical sense at all?". > > Ehh.. And then you'll be unhappy _again_, when we early in 2.5.x > > start using the page cache for block device accesses. Which we > > _have_ to do if we want to be able to mmap block devices. Which we > > _do_ want to do (hint: DVD's etc). > > So what happens if I dd from the block device and also from a file on > the mounted FS, where that file overlaps the bnums I dd'ed? Do we get > two copies in the page cache? One for the block device access, and one > for the file access? Yup. And never the two shall meet. Why should they? Why would you ever do something like that, or care about the fact? Why would you design a system around a perversity, slowing down (and uglifying) the sane and common case? > And because your suspend/resume idea isn't really going to help me > much. That's because my boot scripts have the notion of > "personalities" (change the boot configuration by asking the user > early on in the boot process). If I suspend after I've got XDM > running, it's too late. Note that I never said "suspend". I said _resume_. 
You would create the resume-image once, and you'd create it not at shutdown time, but at the point you want to resume from. You don't want to ever suspend the dang thing - just shut it down, and reboot it quickly by resuming from the snapshot. So you just create a simple resume snapshot. Which is easy to do, with the exact same tools that you've been talking about all the time. What you do is: - trace what pages get loaded off the disk - create a snapshot of the contents of those pages - archive it all up (may I suggest compressing it at the same time?) - the "resume" function is just a "uncompress and populate the virtual caches with the contents" action. Note that the "uncompress and populate" doesn't actually have to use the _real_ disk contents of the file. A byte is a byte is a byte, and it doesn't actually need to come from the actual filesystem the system _thinks_ it comes from. Once it is loaded into memory, it's just a value. You've "primed" your caches, so when you actually run the bootup scripts, you'll have some random hit-rate (say, 98%), and improve the bootup immensely that way. Another way of saying this: Imagine that you "tar" up and compress the files you need for booting. You then uncompress and untar the archive, but instead of untar'ing onto a filesystem, you _just_ populate the caches. This is how some CPU's bootstrap themselves: they fill their icache from a serial rom (at least some alpha chips did this). Never mind that they didn't actually get that initial state from the _real_ backing store (RAM, or in the hypothetical "resume" case, the filesystem off disk). There's no way to tell, if your cached copies have the same data as the data on disk. Never mind that the data _got_ there a strange way. (And yes, your "cache priming" had better prime the cache with the same stuff that _is_ on the real filesystem, otherwise you'd obviously get strange behaviour with the caches not actually matching what the filesystem contents are. 
But that's simple to do, and it's easy enough to boot up in safe mode without a cache priming stage). One of the advantages of "resuming" (or "priming the cache", or whatever you want to call it) is that you're free to lay out the resume/cache image any way you want on disk, as it has nothing to do with the actual filesystem - except for the fact of sharing some of the same data. Which means that you can really read it in efficiently. Linus
Re: Getting FS access events
On Tue, 15 May 2001, Richard Gooch wrote: > > What happens if you create a buffer cache entry? Does that > > invalidate the page cache one? Or do you just allow invalidates one > > way, and not the other? And why? > > I just figured on one way invalidates, because that seems cheap and > easy and has some benefits. Invalidating the other way is costly, so > don't bother, even if there were some benefits. Cute. * create an instance in pagecache * start reading into buffer cache (doesn't invalidate, right?) * start writing using pagecache * lose the page * try to read it (via pagecache) Woops - just found a copy in buffer cache, let's pick data from it. Pity that said data is obsolete... > So what happens if I dd from the block device and also from a file on > the mounted FS, where that file overlaps the bnums I dd'ed? Do we get > two copies in the page cache? One for the block device access, and one > for the file access? Yes.
Re: Getting FS access events
Linus Torvalds writes: > > On Tue, 15 May 2001, Richard Gooch wrote: > > > > However, what about simply invalidating an entry in the buffer cache > > when you do a write from the page cache? > > And how do you do the invalidate the other way, pray tell? > > What happens if you create a buffer cache entry? Does that > invalidate the page cache one? Or do you just allow invalidates one > way, and not the other? And why? I just figured on one way invalidates, because that seems cheap and easy and has some benefits. Invalidating the other way is costly, so don't bother, even if there were some benefits. > > Actually, I'd kind of like it if the page cache steals from the buffer > > cache on read. The buffer cache is mostly populated by fsck. Once I've > > done the fsck, those buffers are useless to me. They might be useful > > again if they are steal-able by the page cache. > > Ehh.. And then you'll be unhappy _again_, when we early in 2.5.x > start using the page cache for block device accesses. Which we > _have_ to do if we want to be able to mmap block devices. Which we > _do_ want to do (hint: DVD's etc). So what happens if I dd from the block device and also from a file on the mounted FS, where that file overlaps the bnums I dd'ed? Do we get two copies in the page cache? One for the block device access, and one for the file access? > Face it. What you ask for is stupid and fundamentally unworkable. > > Tell me WHY you are completely ignoring my arguments, when I (a) > tell you why your way is bad and stupid (and when you ignore the > arguments, don't complain when I call you stupid) and (b) I give you > alternate ways to do the same thing, except my suggestion is > _faster_ and has none of the downside yours has. > > WHY? Because I like to understand completely all the different options before giving up on any. That in itself is a good enough reason, IMO. 
Because I've found that when arguing about this kind of stuff, even if the other person asks for something that is "wrong" or "stupid" from your own point of view, if you respect their intelligence, then maybe you can together find an alternative solution that solves the underlying problem but does it cleanly. I've been on the other side of this with a friend and colleague. We used to have healthy arguments that lasted all afternoon. He'd ask for something that was unclean and didn't fit into the structure or the philosophy. But I respected his intelligence, skill and his need for a solution. In the end, we'd come up with a better way than either one would have proposed. We had a dialogue. And because your suspend/resume idea isn't really going to help me much. That's because my boot scripts have the notion of "personalities" (change the boot configuration by asking the user early on in the boot process). If I suspend after I've got XDM running, it's too late. So what I want is a solution that will keep the kernel clean (believe me, I really do want to keep it clean), but gives me a fast boot too. And I believe the solution is out there. We just haven't found it yet. Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
Re: Getting FS access events
On Tue, 15 May 2001, Richard Gooch wrote: > > However, what about simply invalidating an entry in the buffer cache > when you do a write from the page cache? And how do you do the invalidate the other way, pray tell? What happens if you create a buffer cache entry? Does that invalidate the page cache one? Or do you just allow invalidates one way, and not the other? And why? > Actually, I'd kind of like it if the page cache steals from the buffer > cache on read. The buffer cache is mostly populated by fsck. Once I've > done the fsck, those buffers are useless to me. They might be useful > again if they are steal-able by the page cache. Ehh.. And then you'll be unhappy _again_, when we early in 2.5.x start using the page cache for block device accesses. Which we _have_ to do if we want to be able to mmap block devices. Which we _do_ want to do (hint: DVD's etc). Face it. What you ask for is stupid and fundamentally unworkable. Tell me WHY you are completely ignoring my arguments, when I (a) tell you why your way is bad and stupid (and when you ignore the arguments, don't complain when I call you stupid) and (b) I give you alternate ways to do the same thing, except my suggestion is _faster_ and has none of the downside yours has. WHY? Linus
Re: Getting FS access events
Linus Torvalds writes: > You could choose to do "partial coherency", ie be coherent only one > way, for example. That would make the coherency overhead much less, > but would also make the caches basically act very unpredictably - > you might have somebody write through the page cache yet on a read > actually not _see_ what he wrote, because it got written out to disk > and was shadowed by cached data in the buffer cache that didn't get > updated. OK, I see your concern. And the old way of doing things, placing a copy in the buffer cache when the page cache does a write, will eat away at performance. However, what about simply invalidating an entry in the buffer cache when you do a write from the page cache? By the time you get ready to do the I/O, you have the device bnum, so then isn't it a trivial operation to index into the buffer cache and invalidate that block? Is there some other subtlety I'm missing here? Actually, I'd kind of like it if the page cache steals from the buffer cache on read. The buffer cache is mostly populated by fsck. Once I've done the fsck, those buffers are useless to me. They might be useful again if they are steal-able by the page cache. Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
Re: Getting FS access events
Linus Torvalds writes: > > On Mon, 14 May 2001, Richard Gooch wrote: > > > > Is there some fundamental reason why a buffer cache can't ever be > > fast? > > Yes. > > Or rather, there is a fundamental reason why we must NEVER EVER look at > the buffer cache: it is not coherent with the page cache. > > And keeping it coherent would be _extremely_ expensive. How do we > know? Because we used to do that. Remember the small mindcraft > benchmark? Yup. Double copies all over the place, double lookups, double > everything. > > You could think: "oh, we only need to look up the buffer cache when we > create a new page cache mapping, so..". > > You'd be wrong. We'd need to go the other way too: every time we create a > new buffer cache entry, we'd need to make sure that it isn't mapped > somewhere in the page cache (impossible), or otherwise we'd do the wrong > thing sometimes (ie we might have two dirty copies, and we wouldn't know > _which_ one is valid etc). > > Aliasing is bad. Don't do it. OK, this (combined with the other message) explains why we want to keep away from the buffer cache. Thanks. > You know, the mark of intelligence is realizing when you're making > the same mistake over and over and over again, and not hitting your > head in the wall five hundred times before you understand that it's > not a clever thing to do. But you didn't have to add this. Please note that I asked why not use the buffer cache. I didn't proclaim that it was the ideal solution. I did say what benefits it had, but I didn't assert that the benefits outweighed the disadvantages. > Please show some intelligence. Well, frankly, I think I have. Things are obvious when you know them already. Even if I'm ignorant, I'm not stupid! 
Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
Re: Getting FS access events
On Mon, 14 May 2001, David S. Miller wrote: > > Larry McVoy writes: > > Hell, that's the OS that gave us mmap, remember that? > > Larry, go read up on TOPS-20. :-) SunOS did give unix mmap(), but it > did not come up with the idea. s/TOPS-20/Multics/
Re: Getting FS access events
On Mon, 14 May 2001, Linus Torvalds wrote: > The current page cache is completely non-coherent (with _anything_: it's > not coherent with other files using a page cache because they have a > different index, and it's not coherent with the buffer cache because that > one isn't even in the same name space). Unfortunately, we have cases when a disk block migrates from buffer cache to page cache. Source of serious PITA and (IMO) the only serious reason to take indirect blocks into page cache.
Re: Getting FS access events
Larry McVoy writes: > Hell, that's the OS that gave us mmap, remember that? Larry, go read up on TOPS-20. :-) SunOS did give unix mmap(), but it did not come up with the idea. Later, David S. Miller [EMAIL PROTECTED]
Re: Getting FS access events
On Mon, 14 May 2001, Larry McVoy wrote: > Hell, that's the OS that gave us mmap, remember that? "I got it from Agnes..." Don't get me wrong, SunOS 4 was probably the nicest thing Sun had ever released and I love it, but mmap(2) was _not_ the best of ideas. Files as streams of bytes and files as persistent segments really do not mix well. If you still have their source, check the effects of write() from an mmaped area. Especially when you play with unaligned stuff. That said, in all sane cases we want indexing by (vnode,offset), not by (device,block number). We _certainly_ don't want uncontrolled readahead on block level. E.g. because we might have just allocated a new block and are busy filling it with data we want to write. The last thing we want is some fsckwit overwriting it with crap we have on disk. And that's what such readahead is. Besides, just how often do you reboot the box? If that's the hotspot for you - when the hell does the poor beast find time to do something useful?
Re: Getting FS access events
On Mon, 14 May 2001, Linus Torvalds wrote: > > Or rather, there is a fundamental reason why we must NEVER EVER look at > the buffer cache: it is not coherent with the page cache. > > And keeping it coherent would be _extremely_ expensive. How do we > know? Because we used to do that. Remember the small mindcraft > benchmark? Yup. Double copies all over the place, double lookups, double > everything. I think I should explain a bit more. The current page cache is completely non-coherent (with _anything_: it's not coherent with other files using a page cache because they have a different index, and it's not coherent with the buffer cache because that one isn't even in the same name space). Now, being non-coherent is always the best option if you can get away with it. It means that there is no way you can ever have _any_ performance overhead from maintaining the coherency, and it's 100% reproducible - there's no question where the page cache gets its data from (the raw disk device. No if's, but's and why's). The disadvantage of virtual caches is that they can have aliases. That's fine, but you have to be aware of it, and you have to live with the consequences. That's what we do now. There are no aliases that are worth worrying about, so virtual caches work perfectly. This is not always true (virtual CPU data caches tend to be a really bad idea, while virtual CPU instruction caches tend to work fairly well, although potentially with a lower utilization ratio than a physical one due to aliasing). The other alternative is to have a physical cache. That's fine too: you avoid aliases, but you have to look up the physical address when looking up the cache. THIS is the real cost of the buffer cache - not the hashing and the locking, but the fact that you have to know the physical location. A mixed-mode cache is not a good idea. It gets the worst from both worlds, without getting _any_ of the good qualities. 
You have the horrible coherency issue, together with the overhead of having to find out the physical address. You could choose to do "partial coherency", ie be coherent only one way, for example. That would make the coherency overhead much less, but would also make the caches basically act very unpredictably - you might have somebody write through the page cache yet on a read actually not _see_ what he wrote, because it got written out to disk and was shadowed by cached data in the buffer cache that didn't get updated. So "partial coherency" might avoid some of the performance issues, but it's unacceptable to me simply because it's pretty non-repeatable and has some strange behaviour that can be considered "obviously wrong" (see above about one example). Which leaves us with the fact that the page cache is best done the way it is, and anybody who has coherency concerns might really think about those concerns another way. I'm really serious about doing "resume from disk". If you want a fast boot, I will bet you a dollar that you cannot do it faster than by loading a contiguous image of several megabytes contiguously into memory. There is NO overhead, you're pretty much guaranteed platter speeds, and there are no issues about trying to order accesses etc. There are also no issues about messing up any run-time data structures. Give it some thought. Linus
Re: Getting FS access events
On Mon, May 14, 2001 at 09:00:44PM -0700, Linus Torvalds wrote: > Or rather, there is a fundamental reason why we must NEVER EVER look at > the buffer cache: it is not coherent with the page cache. Not that Linus needs any backing up, but Sun got rid of the buffer cache and just had a page cache in SunOS 4.0, which was before I got there, I suspect something like 15 years ago. It was a good move. SunOS was an extremely pleasant place to work, all you had to understand was vnode,offset and you basically understood the VM system. It is so _blindingly_ obvious that Linus is right, it's been proven, you don't have to think about it, just read some history. Hell, that's the OS that gave us mmap, remember that? > Really. Give it up. Your silly "make bootup faster" is not going to happen > this way. You're trying to break some rather fundamental data structures, > all for the unusual case of booting up. There are other ways to boot up > quickly: look into pre-filling your memory image (aka "resume from disk"), Which is pretty much what I have been asking for, in a general way, for a long time. I've wanted "directory clustering" forever, where you read one file, read the next, and go into "file readahead mode" wherein you slurp in the entire directory's worth of files in one I/O. If we had that, not only would we go faster in general, you could easily tweak it slightly for the fast bootup. > You know, the mark of intelligence is realizing when you're making the > same mistake over and over and over again, and not hitting your head in > the wall five hundred times before you understand that it's not a clever > thing to do. > > Please show some intelligence. Those who don't learn from history are doomed to repeat it, eh? 
-- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
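Larry's "file readahead mode" heuristic could be sketched like this in userspace. The two-reads trigger and every name here are invented for illustration; the real thing would live in the kernel's readahead path:

```python
# Toy sketch of Larry's "directory clustering": after a process reads a
# second file from the same directory, prefetch every remaining file in
# that directory in one pass. posix_fadvise(WILLNEED) stands in for the
# one-big-I/O slurp he describes.
import os

class DirClusterReadahead:
    def __init__(self):
        self.reads_per_dir = {}   # directory -> count of files read
        self.prefetched = set()   # directories already slurped

    def on_file_read(self, path):
        d = os.path.dirname(os.path.abspath(path))
        self.reads_per_dir[d] = self.reads_per_dir.get(d, 0) + 1
        if self.reads_per_dir[d] >= 2 and d not in self.prefetched:
            self.prefetched.add(d)
            return self._prefetch_dir(d)
        return []

    def _prefetch_dir(self, d):
        hinted = []
        for entry in os.scandir(d):
            if entry.is_file():
                # Hint the kernel instead of actually reading the data.
                with open(entry.path, "rb") as f:
                    os.posix_fadvise(f.fileno(), 0, 0,
                                     os.POSIX_FADV_WILLNEED)
                hinted.append(entry.name)
        return sorted(hinted)
```

A userspace version can only hint file by file; the win Larry wants comes from the filesystem laying those files out together so the prefetch becomes one contiguous read.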
Re: Getting FS access events
On Mon, 14 May 2001, Richard Gooch wrote: > > Is there some fundamental reason why a buffer cache can't ever be > fast? Yes. Or rather, there is a fundamental reason why we must NEVER EVER look at the buffer cache: it is not coherent with the page cache. And keeping it coherent would be _extremely_ expensive. How do we know? Because we used to do that. Remember the small mindcraft benchmark? Yup. Double copies all over the place, double lookups, double everything. You could think: "oh, we only need to look up the buffer cache when we create a new page cache mapping, so..". You'd be wrong. We'd need to go the other way too: every time we create a new buffer cache entry, we'd need to make sure that it isn't mapped somewhere in the page cache (impossible), or otherwise we'd do the wrong thing sometimes (ie we might have two dirty copies, and we wouldn't know _which_ one is valid etc). Aliasing is bad. Don't do it. Really. Give it up. Your silly "make bootup faster" is not going to happen this way. You're trying to break some rather fundamental data structures, all for the unusual case of booting up. There are other ways to boot up quickly: look into pre-filling your memory image (aka "resume from disk"), which I will _guarantee_ you is a lot faster than anything else you can come up with, and which doesn't have the downsides that your approach has. You know, the mark of intelligence is realizing when you're making the same mistake over and over and over again, and not hitting your head in the wall five hundred times before you understand that it's not a clever thing to do. Please show some intelligence. Linus
Re: Getting FS access events
On Tuesday 15 May 2001 01:19, Richard Gooch wrote: > Linus Torvalds writes: > > On Sun, 13 May 2001, Richard Gooch wrote: > > > So, why can't the page cache check if a block is in the buffer > > > cache? > > > > Because it would make the damn thing slower. > > > > The whole point of the page cache is to be FAST FAST FAST. The > > reason we _have_ a page cache is that the buffer cache is slow and > > inefficient, and it will always remain so. > > Is there some fundamental reason why a buffer cache can't ever be > fast? Just looking at getblk, it takes one more lock than read_cache_page (these are noops in UP) and otherwise has very nearly the same sequence of operations. This can't be the slowness he's talking about. I know of three ways the buffer cache earned its reputation for slowness: 1) There used to be a copy from the buffer cache to page cache on every write, to keep the two in sync 2) Having the same data in both the buffer and page cache created extra memory pressure 3) To get at file data through the buffer cache you have to traverse all the index blocks every time, whereas with the logically-indexed page cache you go straight to the page data, if it's there, and in theory[1], only up as many levels of index as you have to. Once you have looked into the page cache and know the page isn't there you know you are going to have to read it. At this point, the overhead of hashing into, say, the buffer cache to see if the block is there is trivial. Just one saved read by doing that will be worth hundreds of hash lookups. But why use the buffer cache? The page cache will work perfectly well for this. There's a big saving in using a block cache for readahead instead of file-oriented readahead: if we guess wrong and don't actually need the readahead blocks then we paid less to get them - we didn't call into the filesystem to map each one. 
Additionally, a block cache can do things that file readahead can't, as you showed in your example: > - inode at block N > - indirect block at N+k+j > - data block at N+k Another example is where you have blocks from two different files mixed together, and you read both of those files. Note that your scsi disk controller is keeping a cache for you over on its side of the bus. This erodes the benefit of the block cache somewhat, but the same argument applies to file readahead. For all people who don't have scsi the block cache would be a big win. [1] This remains theoretical until we get the indirect blocks into the page cache. -- Daniel
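Daniel's point that a post-miss lookup into a block cache is nearly free can be sketched as follows. The structures are toy dicts standing in for the real caches; all names are illustrative:

```python
# Toy sketch: a page-cache miss costs a disk read anyway, so one extra
# hash lookup into a physically-indexed block cache first is cheap by
# comparison, and a hit there "steals" the block into the page cache
# instead of reading the platter.

def read_page(pgcache, blkcache, block_map, disk, inode, offset):
    key = (inode, offset)
    if key in pgcache:             # fast path: logical lookup only
        return pgcache[key]
    bnum = block_map[key]          # miss: we had to map the block anyway
    if bnum in blkcache:           # the "trivial" extra hash lookup...
        data = blkcache.pop(bnum)  # ...steal it, saving a disk read
    else:
        data = disk[bnum]          # genuine miss: go to the platter
    pgcache[key] = data
    return data
```

Popping the block on steal is what keeps the two caches from aliasing the same data, which is the property the thread keeps circling back to.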
Re: Getting FS access events
Linus Torvalds writes: > > > On Sun, 13 May 2001, Richard Gooch wrote: > > > > OK, provided the prefetch will queue up a large number of requests > > before starting the I/O. If there was a way of controlling when the > > I/O actually starts (say by having a START flag), that would be ideal, > > I think. > > Ehh. The "start" flag is when you actually start reading, That would be OK. > > So, why can't the page cache check if a block is in the buffer cache? > > Because it would make the damn thing slower. > > The whole point of the page cache is to be FAST FAST FAST. The > reason we _have_ a page cache is that the buffer cache is slow and > inefficient, and it will always remain so. Is there some fundamental reason why a buffer cache can't ever be fast? > We want to get _away_ from the buffer cache, not add support for a legacy > cache into the new and more efficient one. > > And remember: when raw devices are in the page cache, you simply WILL NOT > HAVE a buffer cache at all. > > Just stop this line of thought. It's not going anywhere. I'm just going back to it because I don't see how we can otherwise handle this case: - inode at block N - indirect block at N+k+j - data block at N+k and have the prefetch read blocks N, N+k and N+k+j in that order. Reading them via the FS will result in two seeks, because we need to read N before we know to read N+k+j, and we need to read N+k+j before we know to read N+k. Doing the work at the block device layer makes this simple. However, if there was a way of doing this at the page cache level, then I'd be happy. > > > Try it. You won't be able to. "read()" is an inherently > > > synchronizing operation, and you cannot get _any_ overlap with > > > multiple reads, except for the pre-fetching that the kernel will do > > > for you anyway. > > > > How's that? It won't matter if read(2) synchronises, because I'll be > > issuing the requests in device bnum order. > > Ehh.. You don't seem to know how disks work. 
> > By the time you follow up with the next "read", the platter will > probably have rotated past the point you want to read. You need to > have multiple outstanding requests (or _biiig_ requests) to get > close to platter speed. Sure, I know about rotational latency. I'm counting on read-ahead. > [ Aside: with most IDE stuff doing extensive track buffering, you won't > see this as much. It depends on the disk, the cache size, and the > buffering characteristics. ] These days, even IDE drives come with 2 MiB of cache or more. Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
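Richard's plan of issuing reads in device bnum order can be sketched like this. The mapping dict stands in for real bmap() results (on Linux these are exposed to userspace via the FIBMAP ioctl); everything else is illustrative:

```python
# Toy sketch: resolve each (file, block) access to a physical block
# number the way bmap() does, then issue the prefetch reads in ascending
# physical order so the head sweeps the platter once.

def prefetch_order(accesses, bmap):
    """accesses: list of (path, logical_block) in the order they were
    logged at boot; bmap: maps each access to a physical block number.
    Returns the accesses sorted by physical block."""
    return sorted(accesses, key=lambda a: bmap[a])

# Richard's pathological layout: inode at N, data at N+k, indirect
# block at N+k+j. Reading via the FS forces the order N, N+k+j, N+k
# (two seeks, since each read reveals the next address); a physically
# sorted prefetch reads N, N+k, N+k+j in one sweep.
```

The sort only helps if the requests can actually be queued without synchronizing, which is exactly the read()-vs-prefetch() dispute in the surrounding mails.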
Re: Getting FS access events
On Sun, 13 May 2001, Richard Gooch wrote: > > OK, provided the prefetch will queue up a large number of requests > before starting the I/O. If there was a way of controlling when the > I/O actually starts (say by having a START flag), that would be ideal, > I think. Ehh. The "start" flag is when you actually start reading, or when you've prefetched so much that the queue has filled up. That's the behaviour you'd get naturally, and it's the behaviour you want. > So, why can't the page cache check if a block is in the buffer cache? Because it would make the damn thing slower. The whole point of the page cache is to be FAST FAST FAST. The reason we _have_ a page cache is that the buffer cache is slow and inefficient, and it will always remain so. We want to get _away_ from the buffer cache, not add support for a legacy cache into the new and more efficient one. And remember: when raw devices are in the page cache, you simply WILL NOT HAVE a buffer cache at all. Just stop this line of thought. It's not going anywhere. > That opens up a nasty race: if the dentry is released before the > pointer is harvested, you get a bogus pointer. ..which is why you increment the dentry count when you profile it, and decrement it when you have output the path... > > Try it. You won't be able to. "read()" is an inherently > > synchronizing operation, and you cannot get _any_ overlap with > > multiple reads, except for the pre-fetching that the kernel will do > > for you anyway. > > How's that? It won't matter if read(2) synchronises, because I'll be > issuing the requests in device bnum order. Ehh.. You don't seem to know how disks work. By the time you follow up with the next "read", the platter will probably have rotated past the point you want to read. You need to have multiple outstanding requests (or _biiig_ requests) to get close to platter speed. [ Aside: with most IDE stuff doing extensive track buffering, you won't see this as much. 
It depends on the disk, the cache size, and the buffering characteristics. ] Linus
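The non-synchronizing prefetch Linus argues for can be approximated from userspace today with posix_fadvise(POSIX_FADV_WILLNEED), which queues readahead and returns immediately instead of waiting for data the way read() does. This is a stand-in for illustration, not the prefetch() interface discussed in the thread:

```python
# Sketch of the idea behind a non-synchronizing prefetch: hand the
# kernel a batch of ranges to read ahead, then go do other work. The
# elevator can merge and sort the queued I/O because nothing blocks
# between requests.
import os

def prefetch_files(paths):
    """Queue readahead for whole files without reading any data."""
    hinted = []
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            # Returns immediately; the actual I/O happens asynchronously.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            hinted.append(path)
        finally:
            os.close(fd)
    return hinted
```

Contrast with a loop of read() calls: each read completes before the next is issued, so the elevator never sees more than one request at a time, which is Linus's point about overlap.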
Re: Getting FS access events
Daniel writes: > But we don't need anything so fancy to try out your idea, we just need > a lvm-like device that can: > > - Maintain a block cache > - Remap logical to physical blocks > - Record the block accesses > - Physically reorder the blocks according to the recorded order > - Load a given region of disk into the block cache on command The current LVM device (if compiled with DEBUG_MAP) will report all of the logical->physical block mappings via printk. Probably too heavyweight for a large amount of IO. It could be changed to save the block numbers into a cache, to be extracted later. All of the LVM mapping is done in the lvm_map() function. Cheers, Andreas -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
Re: Getting FS access events
On Monday 14 May 2001 07:15, Richard Gooch wrote: > Linus Torvalds writes: > > But sure, you can use bmap if you want. It would be interesting to > > hear whether it makes much of a difference.. > > I doubt bmap() would make any difference if there is a way of > controlling when the I/O starts. > > However, this still doesn't address the issue of indirect blocks. If > the indirect block has a higher bnum than the data blocks it points > to, you've got a costly seek. This is why I'm still attracted to the > idea of doing this at the block device layer. It's easy to capture > *all* accesses and then warm the buffer cache. > > So, why can't the page cache check if a block is in the buffer cache? That's not quite what you want, if only because there won't be anything in the buffer cache pretty soon. What we really want is a block cache, tightly integrated with the page cache. Readahead with a block cache would be more effective than our current file-based readahead. For example, it handles the case where blocks of two files are interleaved. Since we know that the page cache maps each block at most once, the optimal thing to do would be to just move a pointer from the block cache to the page cache whenever we can. Unfortunately the layering in the VFS as it stands isn't friendly to this: typically we allocate a page in generic_file_read long before we ask the filesystem to map it. To test this zero-copy idea we'd need to replace generic_file_read and for mmap, filemap_nopage. But we don't need anything so fancy to try out your idea, we just need a lvm-like device that can: - Maintain a block cache - Remap logical to physical blocks - Record the block accesses - Physically reorder the blocks according to the recorded order - Load a given region of disk into the block cache on command None of this has to be particularly general to get to the benchmarking stage. E.g, the 'block cache' only needs to cache one physical region. 
The central idea here is that you obviously can't do any better than to have all the blocks you want to read at boot physically together on disk. The advantage of using this lvm-style remapping is, it will work for any filesystem. The disadvantage is that the ordering is then cast in stone - after the system is up it might not like the ordering you chose for the boot, and the elevator will be completely confused ;-) But the thing is, everything you need to measure the boot performance is together in one place, just one device driver to write. Then once you know what the perfect result is you have a yardstick to measure the effectiveness of other, less intrusive approaches. I took a look at the lvm and md code to see if there's a quick way to press them into service for this test, and there probably is, but the complexity there is daunting. I think starting with a clean sheet and writing a new driver would be easier. -- Daniel
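The record-then-reorder device Daniel describes reduces to a small amount of bookkeeping, sketched here as a toy model. All names are illustrative; the real thing would be a block-layer driver, not Python:

```python
# Toy model of Daniel's lvm-like test device: record the order in which
# logical blocks are first read, then compute a new physical layout that
# places them contiguously in that order, so the next boot reads them in
# one sweep.

class RemapDevice:
    def __init__(self, nblocks):
        self.remap = {i: i for i in range(nblocks)}  # logical -> physical
        self.access_log = []                         # first-access order

    def read(self, logical):
        if logical not in self.access_log:
            self.access_log.append(logical)
        return self.remap[logical]        # physical block to fetch

    def relayout(self, start=0):
        """Reorder: logged blocks land contiguously from 'start' in
        first-access order. (A real driver would also relocate the
        displaced blocks and physically copy the data; omitted here.)"""
        for phys, logical in enumerate(self.access_log, start):
            self.remap[logical] = phys
```

This also makes Daniel's caveat concrete: once relayout() runs, the ordering is frozen, and any post-boot access pattern that differs from the logged one pays for it.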
Re: Fwd: Re: Getting FS access events
Richard Gooch <[EMAIL PROTECTED]>: > > > OK, provided the prefetch will queue up a large number of requests > before starting the I/O. If there was a way of controlling when the > I/O actually starts (say by having a START flag), that would be ideal, > I think. > The START flag is equivalent to the first actual read, whereupon the elevator code will do the Right Thing. > That opens up a nasty race: if the dentry is released before the > pointer is harvested, you get a bogus pointer. > You simply increase the reference count of every dentry you visit, and free it when the log is read. > How's that? It won't matter if read(2) synchronises, because I'll be > issuing the requests in device bnum order. > Of course it does, because the kernel needs to wait for the next read() system call from your application, which it can only do after the first one completes, which adds another delay which will slow you down, especially with high-latency I/O protocols. -- Matthias Urlichs | noris network AG | http://smurf.noris.de/
Re: Getting FS access events
Linus Torvalds writes: > > On Sun, 13 May 2001, Richard Gooch wrote: > > > > Think about it:-) You need to generate prefetch accesses in ascending > > device bnum order. > > I seriously doubt it is worth it. > > Th ekernel will do the ordering for you anyway: that's what the > elevator is, and that's why you have a "prefetch" system call (to > avoid the synchronization that kills the elevator). And you'll end > up wanting to pre-fetch on virtual addresses, which implies that you > have to open the files: I doubt you want to have tons of files open > and try to get a "global" order. OK, provided the prefetch will queue up a large number of requests before starting the I/O. If there was a way of controlling when the I/O actually starts (say by having a START flag), that would be ideal, I think. > But sure, you can use bmap if you want. It would be interesting to > hear whether it makes much of a difference.. I doubt bmap() would make any difference if there is a way of controlling when the I/O starts. However, this still doesn't address the issue of indirect blocks. If the indirect block has a higher bnum than the data blocks it points to, you've got a costly seek. This is why I'm still attracted to the idea of doing this at the block device layer. It's easy to capture *all* accesses and then warm the buffer cache. So, why can't the page cache check if a block is in the buffer cache? > > Sure, this would work too. I'm a bit worried about the increased > > amount of traffic this will generate. > > No increased traffic. "path" is a pointer (to a dentry), ie 32 > bits. "ino" is at least 128 bits on some filesystems. You make for _less_ > data to save. > > > So on every page fault or read(2) call, we have to generate the full > > path from the dentry? Isn't that going to add a fair bit of overhead? > > You just save the dentry pointer. You do the path _later_, when > somebody reads it away from the /proc file. 
That opens up a nasty race: if the dentry is released before the pointer is harvested, you get a bogus pointer. > > I don't see the advantage of the prefetch(2) system call. It seems to > me I can get the same effect by just making read(2) calls in another > task. Of course, I'd need to use bmap() to generate the sort key, but > I don't see why that's a bad thing. > > Try it. You won't be able to. "read()" is an inherently > synchronizing operation, and you cannot get _any_ overlap with > multiple reads, except for the pre-fetching that the kernel will do > for you anyway. How's that? It won't matter if read(2) synchronises, because I'll be issuing the requests in device bnum order. Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
Re: Getting FS access events
On Sun, 13 May 2001, Richard Gooch wrote: > > Think about it:-) You need to generate prefetch accesses in ascending > device bnum order. I seriously doubt it is worth it. The kernel will do the ordering for you anyway: that's what the elevator is, and that's why you have a "prefetch" system call (to avoid the synchronization that kills the elevator). And you'll end up wanting to pre-fetch on virtual addresses, which implies that you have to open the files: I doubt you want to have tons of files open and try to get a "global" order. But sure, you can use bmap if you want. It would be interesting to hear whether it makes much of a difference.. > > Why not just "path,pagenr" instead? You make your instrumentation save > > away the whole pathname, by just using the dentry pointer. Many > > filesystems don't even _have_ a "inum", so anything less doesn't work > > anyway. > > Sure, this would work too. I'm a bit worried about the increased > amount of traffic this will generate. No increased traffic. "path" is a pointer (to a dentry), ie 32 bits. "ino" is at least 128 bits on some filesystems. You make for _less_ data to save. > > So on every page fault or read(2) call, we have to generate the full > > path from the dentry? Isn't that going to add a fair bit of overhead? > > You just save the dentry pointer. You do the path _later_, when > somebody reads it away from the /proc file. > > I don't see the advantage of the prefetch(2) system call. It seems to > me I can get the same effect by just making read(2) calls in another > task. Of course, I'd need to use bmap() to generate the sort key, but > I don't see why that's a bad thing. Try it. You won't be able to. "read()" is an inherently synchronizing operation, and you cannot get _any_ overlap with multiple reads, except for the pre-fetching that the kernel will do for you anyway. And when it comes to IO and the elevator, overlap is where it matters. Sending out several tagged commands to the disk in one go. 
You'd have to have multiple processes doing the reads to get the same kind of performance. Much easier to do "prefetch()", when that's really what you want anyway. Remember, you're not interested in the data. You're just populating the cache. Linus
Re: Getting FS access events
Rik van Riel writes: > On Sun, 13 May 2001, Richard Gooch wrote: > > Larry McVoy writes: > > > Ha. For once you're both wrong but not where you are thinking. One > > > of the few places that I actually hacked Linux was for exactly this > > > - it was in the 0.99 days I think. I saved the list of I/O's in a > > > file and filled the buffer cache with them at next boot. It > > > actually didn't help at all. > > > > Maybe you did something wrong :-) > > How about "the data loads got instrumented, but the metadata > loads which caused over half of the disk seeks didn't" ? > > (just a wild guess ... if it turns out to be true we may want > to look into doing aggressive readahead on inode blocks ;)) Caching metadata is definitely part of my cunning plan. I'd like to think that once Al's metadata-in-page-cache patches go in, we'll get that for free. However, that will still leave indirect blocks unordered. I don't see a clean way of fixing that. Which is why doing things at the block device layer has its attractions (except it doesn't work). Hm. Is there a reason why the page cache can't see if a block is in the block cache, and read it from there first? Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
Re: Getting FS access events
On Sun, 13 May 2001, Richard Gooch wrote:
> Larry McVoy writes:
> > Ha. For once you're both wrong but not where you are thinking. One
> > of the few places that I actually hacked Linux was for exactly this
> > - it was in the 0.99 days I think. I saved the list of I/O's in a
> > file and filled the buffer cache with them at next boot. It
> > actually didn't help at all.
>
> Maybe you did something wrong :-)

How about "the data loads got instrumented, but the metadata loads
which caused over half of the disk seeks didn't" ?

(just a wild guess ... if it turns out to be true we may want to look
into doing aggressive readahead on inode blocks ;))

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://distro.conectiva.com/

Send all your spam to [EMAIL PROTECTED] (spam digging piggy)
Re: Getting FS access events
Larry McVoy writes:
> On Sun, May 13, 2001 at 06:32:02PM -0700, Linus Torvalds wrote:
> > > Hi, Linus. I've been thinking more about trying to warm the page
> > > cache with blocks needed by the bootup process. What is currently
> > > missing is (AFAIK) a mechanism to find out what inodes and blocks
> > > have been accessed. Sure, you can use bmap() to convert from file
> > > block to device block, but first you need to figure out the file
> > > blocks accessed. I'd like to find out what kind of patch you'd
> > > accept to provide the missing functionality.
> >
> > Why would you use bmap() anyway? You CANNOT warm up the page cache
> > with the physical map nr as discussed. So there's no real point in
> > using bmap() at any time.
>
> Ha. For once you're both wrong but not where you are thinking. One
> of the few places that I actually hacked Linux was for exactly this
> - it was in the 0.99 days I think. I saved the list of I/O's in a
> file and filled the buffer cache with them at next boot. It
> actually didn't help at all.

Maybe you did something wrong :-)

Seriously, maybe you're right, and maybe not. I'd like to find out, and
having the infrastructure to get FS access events will help in that (as
well as your preferred approach: see below). If I am digging into a
rathole, I'll do it with my eyes open ;-)

> I don't remember why, maybe it was back so long ago that I didn't
> have the memory, but I think it was more subtle than that. It's
> basically a queuing problem and my instincts were wrong, I thought
> if I could get all the data in there then things would go faster.
> If you think through all the stuff going on during a boot it doesn't
> really work that way.

Well, on my machines anyway, the discs rattle an awful lot during
bootup. Not just little adjacent seeks, but big, partition crossing
seeks.
> Anyway, a _much_ better thing to do would be to have all this data
> laid out contig, then slurp in all the blocks in one I/O and then let
> them get turned into files. This has been true for the last 30
> years and people still don't do it. We're actually moving in this
> direction with BitKeeper, in the future, large numbers of small
> files will be stored in one big file and extracted on demand. Then
> we do one I/O to get all the related stuff.

Yeah, we need a decent unfragmenter. We can do that now with bmap().
But to speed up boots, for example, we need to lay all the inodes that
are accessed during boot in one contiguous chunk on the disc. Again, we
need to know which files are being accessed to know that.
/proc/fsaccess would tell us that.

The down side of just relying on contiguous files is that some files
(especially bloated C libraries) are not fully used. I would not be at
all surprised if more than 75% of glibc is not (or rarely) used.
There's a lot of stuff in there that isn't used very often.

However, a *refragmenter* might be interesting. Find out which blocks
in which files are actually used during boot, and lay just those out in
a contiguous section. *That* would smoke!

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
Re: Getting FS access events
Linus Torvalds writes:
>
> On Sun, 13 May 2001, Richard Gooch wrote:
> >
> > Hi, Linus. I've been thinking more about trying to warm the page
> > cache with blocks needed by the bootup process. What is currently
> > missing is (AFAIK) a mechanism to find out what inodes and blocks
> > have been accessed. Sure, you can use bmap() to convert from file
> > block to device block, but first you need to figure out the file
> > blocks accessed. I'd like to find out what kind of patch you'd
> > accept to provide the missing functionality.
>
> Why would you use bmap() anyway? You CANNOT warm up the page cache
> with the physical map nr as discussed. So there's no real point in
> using bmap() at any time.

Think about it:-) You need to generate prefetch accesses in ascending
device bnum order. So the bmap() is there to tell you those device
bnums. You'd still prefetch using file bnums, but the *ordering* is
done based on device bnum. In fact, once the list is sorted, you can
chuck out the device bnums. You only need to store inum/path and file
bnum in the database.

> > One approach would be to create a new ioctl(2) for a FS that would
> > read out inum,bnum pairs.
>
> Why not just "path,pagenr" instead? You make your instrumentation save
> away the whole pathname, by just using the dentry pointer. Many
> filesystems don't even _have_ a "inum", so anything less doesn't work
> anyway.

Sure, this would work too. I'm a bit worried about the increased
amount of traffic this will generate.

> Example acceptable approach:
>
>  - save away full dentry and page number. Don't make it an ioctl. Think
>    "profiling" - this is _exactly_ the same thing, and profiling uses a
>     (a) command line argument to turn it on
>     (b) /proc/profile
>    (and because you have the full pathname, you should just make the dang
>    /proc/fsaccess file be ASCII)

So on every page fault or read(2) call, we have to generate the full
path from the dentry? Isn't that going to add a fair bit of overhead?
Remember, we want to do this on every boot (to keep the database as
up-to-date as possible).

>  - add a "prefetch()" system call that does all the same things
>    "read()" does, but doesn't actually wait for (or transfer) the
>    data. Basically just a read-ahead thing. So you'd basically end up
>    doing
>
>	foreach (filename in /proc/fsaccess)
>		fd = open(filename);
>		foreach (sorted pagenr for filename in /proc/fsaccess)
>			prefetch(fd, pagenr);
>		end
>	end

I don't see the advantage of the prefetch(2) system call. It seems to
me I can get the same effect by just making read(2) calls in another
task. Of course, I'd need to use bmap() to generate the sort key, but
I don't see why that's a bad thing.

> Forget about all these crappy "ioctl" ideas. Basic rule of thumb: if
> you think an ioctl is a good idea, you're (a) being stupid and (b)
> thinking wrong and (c) on the wrong track.

Don't hold back now. Tell us what you really think :-)

> And notice how there's not a single bmap anywhere, and not a single
> "raw device open" anywhere.

I don't mind the /proc/fsaccess approach, I'm just worried about the
overhead of doing the dentry->pathname conversions on each fault/read.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
Re: Getting FS access events
On Sun, May 13, 2001 at 06:32:02PM -0700, Linus Torvalds wrote:
> > Hi, Linus. I've been thinking more about trying to warm the page
> > cache with blocks needed by the bootup process. What is currently
> > missing is (AFAIK) a mechanism to find out what inodes and blocks
> > have been accessed. Sure, you can use bmap() to convert from file
> > block to device block, but first you need to figure out the file
> > blocks accessed. I'd like to find out what kind of patch you'd
> > accept to provide the missing functionality.
>
> Why would you use bmap() anyway? You CANNOT warm up the page cache
> with the physical map nr as discussed. So there's no real point in
> using bmap() at any time.

Ha. For once you're both wrong but not where you are thinking. One of
the few places that I actually hacked Linux was for exactly this - it
was in the 0.99 days I think. I saved the list of I/O's in a file and
filled the buffer cache with them at next boot. It actually didn't
help at all.

I don't remember why, maybe it was back so long ago that I didn't have
the memory, but I think it was more subtle than that. It's basically a
queuing problem and my instincts were wrong, I thought if I could get
all the data in there then things would go faster. If you think
through all the stuff going on during a boot it doesn't really work
that way.

Anyway, a _much_ better thing to do would be to have all this data
laid out contig, then slurp in all the blocks in one I/O and then let
them get turned into files. This has been true for the last 30 years
and people still don't do it. We're actually moving in this direction
with BitKeeper, in the future, large numbers of small files will be
stored in one big file and extracted on demand. Then we do one I/O to
get all the related stuff.

Dave Hitz at NetApp is about the only guy I know who really gets this,
Daniel Phillips may also get it, he's certainly thinking about it.
Lots of little I/O's == bad, one big I/O == good.
Work through the numbers and it starts to look like you'd never want
to do less than a 1MB I/O, probably not less than a 4MB I/O.
--
---
Larry McVoy              lm at bitmover.com      http://www.bitmover.com/lm
Re: Getting FS access events
On Sun, 13 May 2001, Richard Gooch wrote:
>
> Hi, Linus. I've been thinking more about trying to warm the page
> cache with blocks needed by the bootup process. What is currently
> missing is (AFAIK) a mechanism to find out what inodes and blocks have
> been accessed. Sure, you can use bmap() to convert from file block to
> device block, but first you need to figure out the file blocks
> accessed. I'd like to find out what kind of patch you'd accept to
> provide the missing functionality.

Why would you use bmap() anyway? You CANNOT warm up the page cache with
the physical map nr as discussed. So there's no real point in using
bmap() at any time.

> One approach would be to create a new ioctl(2) for a FS that would
> read out inum,bnum pairs.

Why not just "path,pagenr" instead? You make your instrumentation save
away the whole pathname, by just using the dentry pointer. Many
filesystems don't even _have_ a "inum", so anything less doesn't work
anyway.

Example acceptable approach:

 - save away full dentry and page number. Don't make it an ioctl. Think
   "profiling" - this is _exactly_ the same thing, and profiling uses a
    (a) command line argument to turn it on
    (b) /proc/profile
   (and because you have the full pathname, you should just make the dang
   /proc/fsaccess file be ASCII)

 - add a "prefetch()" system call that does all the same things
   "read()" does, but doesn't actually wait for (or transfer) the
   data. Basically just a read-ahead thing. So you'd basically end up
   doing

	foreach (filename in /proc/fsaccess)
		fd = open(filename);
		foreach (sorted pagenr for filename in /proc/fsaccess)
			prefetch(fd, pagenr);
		end
	end

Forget about all these crappy "ioctl" ideas. Basic rule of thumb: if
you think an ioctl is a good idea, you're (a) being stupid and (b)
thinking wrong and (c) on the wrong track.

And notice how there's not a single bmap anywhere, and not a single
"raw device open" anywhere.
		Linus