Re: [PATCH 00/37] Permit filesystem local caching
On Tuesday 26 February 2008 06:33, David Howells wrote:
> > Suppose one were to take a mundane approach to the persistent cache
> > problem instead of layering filesystems. What you would do then is
> > change NFS's ->write_page and variants to fiddle the persistent
> > cache
>
> It is a requirement laid down by the Linux NFS fs maintainers that the
> writes to the cache be asynchronous, even if the writes to NFS aren't.

As it happens, I will be hanging out for the next few days with said NFS
maintainers, so it would help to be as informed as possible about your
patch set.

> Note further that NFS's write_page() != writing to the cache. Writing
> to the cache is typically done by NFS's readpages().

Yes, of course. But also by ->write_page, no?

> > Which I could eventually find out by reading all the patches but
> > asking you is so much more fun :-)
>
> And a waste of my time. I've provided documentation in the main
> FS-Cache patch, both as text files and in comments in header files,
> that answer your questions. Please read them first.

37 patches, none of which has "Documentation" in the subject line, and you
did not provide a diffstat in patch 0 for the patch set as a whole. If I
had known it was there, of course I would have read it. It is great to see
this level of documentation, but I do not think it is fair to blame your
(one) reader for missing it. See the smiley above?

The _real_ reason I am asking you is that I do not think anybody
understands your patch set, in spite of your considerable efforts to
address that. Discussion in public, right or wrong, is the only way to fix
that. It is counterproductive to drive readers away from the discussion
for fear that they may miss some point obvious to the original author, or
perhaps already discussed earlier on lkml, and get flamed for it.

Obviously, the patch set is not going to be perfect when it goes in, and
it would be a silly abuse of the open source process to require that, but
the parts where it touches the rest of the system have to be really well
understood, and it is clear from the level of participation in the thread
that they are not.

One bit that already came out of this, which you have alluded to several
times yourself but somehow seem to keep glossing over, is that you need a
->direct_bio file operations method. So does loopback mount. It might be
worth putting some effort into seeing how ->direct_IO can be refactored to
make that happen. You can get it in separately on the basis of helping
loopback, and it will make your patches nicer.

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> I need to respond to this in pieces... first the bit that is bugging
> me:
>
> > * two new page flags
> >
> > I need to keep track of two bits of per-cached-page information:
> >
> > (1) This page is known by the cache, and the cache must be informed
> >     if the page is going to go away.
>
> I still do not understand the life cycle of this bit. What does the
> cache do when it learns the page has gone away?

That's up to the cache. CacheFS, for example, unpins some resources when
all the pages managed by a pointer block are taken away from it. The cache
may also reserve a block on disk to back this page, and that reservation
may then be discarded by the netfs uncaching the page.

The cache may also speculatively take copies of the page if the machine is
idle.

Documentation/filesystems/caching/netfs-api.txt describes the caching API
as a process, including the presentation of netfs pages to the cache and
their uncaching.

> How is it informed?

[Documentation/filesystems/caching/netfs-api.txt]
== PAGE UNCACHING ==

To uncache a page, this function should be called:

	void fscache_uncache_page(struct fscache_cookie *cookie,
				  struct page *page);

This function permits the cache to release any in-memory representation it
might be holding for this netfs page. This function must be called once for
each page on which the read or write page functions above have been called
to make sure the cache's in-memory tracking information gets torn down.

Note that pages can't be explicitly deleted from the data file. The whole
data file must be retired (see the relinquish cookie function below).

Furthermore, note that this does not cancel the asynchronous read or write
operation started by the read/alloc and write functions.
[/]

> Who owns the page cache in which such a page lives, the nfs client?
> Filesystem that hosts the page? A third page cache owned by the cache
> itself? (See my basic confusion about how many page cache levels you
> have, below.)

[Documentation/filesystems/caching/fscache.txt]
 (7) Data I/O is done direct to and from the netfs's pages. The netfs
     indicates that page A is at index B of the data-file represented by
     cookie C, and that it should be read or written. The cache backend
     may or may not start I/O on that page, but if it does, a netfs
     callback will be invoked to indicate completion. The I/O may be
     either synchronous or asynchronous.
[/]

I should perhaps make the documentation more explicit: the pages passed to
the routines defined in include/linux/fscache.h are netfs pages, normally
belonging to the pagecache of the appropriate netfs inode. This is,
however, mentioned in the function banner comments in fscache.h.

> Suppose one were to take a mundane approach to the persistent cache
> problem instead of layering filesystems. What you would do then is
> change NFS's ->write_page and variants to fiddle the persistent
> cache

It is a requirement laid down by the Linux NFS fs maintainers that the
writes to the cache be asynchronous, even if the writes to NFS aren't.

Note further that NFS's write_page() != writing to the cache. Writing to
the cache is typically done by NFS's readpages().

Besides, at the moment, caching is suppressed for any NFS file opened for
writing due to coherency issues. This is something to be revisited later.

> as well as the network, instead of just the network as now.

Not as now. See above.
> This fiddling could even consist of ->write calls to another
> filesystem, though working directly with the bio interface would
> yield the fastest, and therefore to my mind, best result.

You can't necessarily access the BIO interface, and even if you can, the
cache is still a filesystem. Essentially, what cachefiles does is what you
say: it performs ->write calls on another filesystem.

FS-Cache also protects the netfs against (a) there being no cache, (b) the
cache suffering a fatal I/O error and (c) the cache being removed; and
protects the cache against (d) the netfs uncaching pages that the cache is
using and (e) conflicting operations from the netfs, some of which may be
queued for asynchronous processing.

FS-Cache also groups asynchronous netfs store requests together, which
hopefully, one day, I'll be able to pass on to the backing fs.

> In any case, you find out how to write the page to backing store by
> asking the filesystem, which in the naive approach would be nfs
> augmented with caching library calls.

NFS and AFS and CIFS and ISOFS, but yes, that's what fscache is, if you
like: a caching library.

> The filesystem keeps its own metadata around to know how to map the
> page to disk. So again naively, this metadata could tell the nfs
> client that the page is not mapped to disk at all.

The netfs should _not_ know about the metadata of a backing fs. Firstly,
there are many different potent
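To make the uncaching contract quoted above concrete, here is a minimal
sketch of the call a netfs would make when one of its pages is about to go
away. fscache_uncache_page() is the function quoted from netfs-api.txt;
the mynetfs_* names and the cookie field are hypothetical placeholders,
not code from the patch set.

	/*
	 * Sketch: before a netfs page is torn down, the cache's
	 * in-memory tracking for it must be torn down too (once per
	 * page that was passed to the read/alloc or write functions).
	 */
	static void mynetfs_kill_page(struct mynetfs_inode *ni,
				      struct page *page)
	{
		/* let the cache drop any state it holds for this page */
		fscache_uncache_page(ni->cookie, page);

		/* ... the netfs can now release the page as usual ... */
	}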
Re: [PATCH 00/37] Permit filesystem local caching
I need to respond to this in pieces... first the bit that is bugging me:

> * two new page flags
>
> I need to keep track of two bits of per-cached-page information:
>
> (1) This page is known by the cache, and the cache must be informed if
>     the page is going to go away.

I still do not understand the life cycle of this bit. What does the cache
do when it learns the page has gone away? How is it informed? Who owns the
page cache in which such a page lives: the nfs client? The filesystem that
hosts the page? A third page cache owned by the cache itself? (See my
basic confusion about how many page cache levels you have, below.)

Suppose one were to take a mundane approach to the persistent cache
problem instead of layering filesystems. What you would do then is change
NFS's ->write_page and variants to fiddle the persistent cache as well as
the network, instead of just the network as now. This fiddling could even
consist of ->write calls to another filesystem, though working directly
with the bio interface would yield the fastest, and therefore to my mind
best, result. In any case, you find out how to write the page to backing
store by asking the filesystem, which in the naive approach would be nfs
augmented with caching library calls. The filesystem keeps its own
metadata around to know how to map the page to disk. So again naively,
this metadata could tell the nfs client that the page is not mapped to
disk at all.

So I do not see what your per-page bit is for, obviously because I do not
fully understand your caching scheme. Which I could eventually find out by
reading all the patches, but asking you is so much more fun :-)

By the way, how many levels of page caching for the same data are there?
Is it:

  1) nfs client
  2) cache layer's own page cache
  3) filesystem hosting the cache

or just:

  1) nfs client page cache
  2) filesystem hosting the cache

I think it is the second, but that is already double caching, which has
got to hurt.

Regards,

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> On Monday 25 February 2008 15:19, David Howells wrote:
> > So I guess there's a problem in cachefiles's efficiency - possibly
> > due to the fact that it tries to be fully asynchronous.
>
> OK, not just my imagination, and it makes me feel better about the
> patch set because efficiency bugs are fixable while fundamental
> limitations are not. One can hope:-)

> How much of a hurry are you in to merge this feature? You have bits
> like this:

I'd like to get it upstream sooner rather than later. As it's not
upstream, but its prerequisite patches touch a lot of code, I have to
spend time regularly making my patches work again. Merge windows are
completely not fun.

> "Add a function to install a monitor on the page lock waitqueue for a
> particular page, thus allowing the page being unlocked to be detected.
> This is used by CacheFiles to detect read completion on a page in the
> backing filesystem so that it can then copy the data to the waiting
> netfs page."
>
> We already have that hook, it is called bio_endio.

Except that isn't accessible. CacheFiles currently has no access to the
notification from the blockdev to the backing fs, if indeed there is one.
All we can do is trap the backing fs page becoming available.

> My strong intuition is that your whole mechanism should sit directly on
> the block device, no matter how attractive it seems to be able to
> piggyback on the namespace and layout management code of existing
> filesystems.

There's a place for both.

Consider a laptop with a small disk, possibly subdivided between Linux and
Windows. Linux then subdivides its bit further to get a swap space. What
you then propose is to break off yet another chunk to provide the cache.
You can't then use this other chunk for anything else, even if it's, say,
1% used by the cache.

The way CacheFiles works is that you tell it that it can use up to a
certain percentage of the otherwise free disk space on an otherwise
existing filesystem. In the laptop case, you may just have a single big
partition. The cache will fill up as much of it as it can, and as the
other contents of the partition consume space, the cache will be culled to
make room.

On the other hand, on a system like my desktop, where I can slap in extra
disks with mounds of extra disk space, it might very well make sense to
commit block devices to caching, as this can be used to gain performance.

I have another cache backend (CacheFS) which takes the form of a
filesystem, thus allowing you to mount a blockdev as a cache. It's much
faster than Ext3 at storing and retrieving files... at first. The problem
is that I've mucked up the free space retrieval such that performance
degrades by 20x over time for files of any size.

Basically any cache on a raw blockdev _is_ a filesystem, just one in which
you're randomly allowed to discard data to make life easier.

> I see your current effort as the moral equivalent of FUSE: you are able
> to demonstrate certain desirable behavioral properties, but you are
> unable to reach full theoretical efficiency because there are layers
> and layers of interface gunk interposed between the netfs user and the
> cache device.

The interface gunk is meant to be as thin as possible, but there are
constraints (see the documentation in the main FS-Cache patch for more
details):

 (1) It's a requirement that it not be tied to just one netfs, say AFS.
     We might have several netfs's that want caching: AFS, CIFS, ISOFS
     (okay, that last isn't really a netfs, but it might still want
     caching).
 (2) I want to be able to change the backing cache. Under some
     circumstances I might want to use an existing filesystem, under
     others I might want to commit a blockdev. I've even been asked about
     using battery-backed RAM - which has different design constraints.

 (3) The constraint has been imposed by the NFS team that the cache be
     completely asynchronous. I haven't quite met this: readpages() will
     wait until the cache knows whether or not the pages are available,
     on the principle that read operations done through the cache can be
     considered synchronous. This is an attempt to reduce the context
     switchage involved.

Unfortunately, the asynchronicity requirement has caused the middle layer
to bloat. Fortunately, the backing cache needn't bloat as it can use the
middle layer's bloat.

> That said, I also see you have put a huge amount of work into this over
> the years, it is nicely broken out, you are responsive and easy to work
> with, all arguments for an early merge. Against that, you invade core
> kernel for reasons that are not necessarily justified:
>
> * two new page flags

I need to keep track of two bits of per-cached-page information:

 (1) This page is known by the cache, and the cache must be informed if
     the page is going to go away.

 (2) This page is being written to disk by the cache, and it cannot be
     released un
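The archive cuts off here, but the second flag is evidently about pages
pinned by in-flight cache writes. As a rough illustration of how a netfs
might honour both bits in its ->releasepage(), assuming accessor names
like PageFsCache()/PageFsCacheWrite() for the two new flags (the exact
helper names are an assumption here, as is MYNETFS_I()):

	/* Hypothetical netfs ->releasepage(): refuse to give a page
	 * back to the VM while the cache is still writing it out. */
	static int mynetfs_releasepage(struct page *page, gfp_t gfp)
	{
		if (PageFsCacheWrite(page))
			return 0;	/* cache still owns a write on it */

		if (PageFsCache(page)) {
			/* tell the cache this netfs page is going away */
			fscache_uncache_page(
				MYNETFS_I(page->mapping->host)->cookie,
				page);
			ClearPageFsCache(page);
		}
		return 1;		/* page may now be released */
	}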
Re: [PATCH 00/37] Permit filesystem local caching
On Monday 25 February 2008 15:19, David Howells wrote:
> So I guess there's a problem in cachefiles's efficiency - possibly due
> to the fact that it tries to be fully asynchronous.

OK, not just my imagination, and it makes me feel better about the patch
set because efficiency bugs are fixable while fundamental limitations are
not. One can hope:-)

How much of a hurry are you in to merge this feature? You have bits like
this:

  "Add a function to install a monitor on the page lock waitqueue for a
  particular page, thus allowing the page being unlocked to be detected.
  This is used by CacheFiles to detect read completion on a page in the
  backing filesystem so that it can then copy the data to the waiting
  netfs page."

We already have that hook, it is called bio_endio. My strong intuition is
that your whole mechanism should sit directly on the block device, no
matter how attractive it seems to be able to piggyback on the namespace
and layout management code of existing filesystems.

I see your current effort as the moral equivalent of FUSE: you are able to
demonstrate certain desirable behavioral properties, but you are unable to
reach full theoretical efficiency because there are layers and layers of
interface gunk interposed between the netfs user and the cache device.

That said, I also see you have put a huge amount of work into this over
the years, it is nicely broken out, you are responsive and easy to work
with, all arguments for an early merge. Against that, you invade core
kernel for reasons that are not necessarily justified:

  * two new page flags
  * a new fileops method
  * many changes to LSM, including a new object class and new hooks
  * separate fs*id from task struct
  * new page-private destructor hook
  * probably other bits I missed

Would it be correct to say that some of these changes are to support
disconnected operation? If so, you really have two patch sets:

  1) Persistent netfs cache
  2) Disconnected netfs operation

You have some short snappers that look generally useful:

  * add_wait_queue_tail (cool)
  * write to a file without a struct file (includes ->mapping cleanup,
    probably good)
  * export fsync_super

Why not hunt around for existing in-kernel users that would benefit, so
these can be submitted as standalone patches, shortening the remaining
patch set and partially overcoming objections due to core kernel changes?

One thing I don't see is users coming on to lkml and saying "please merge
this, it works great for me". Since you probably have such users, why not
give them a poke?

Your cachefilesd is going to need anti-deadlock medicine like ddsnap has.
Since you don't seem at all worried about that right now, I suspect you
have not hammered this code really heavily, correct? Without preventative
measures, any memory-using daemon sitting in the block IO path will
deadlock if you hit it hard enough.

A couple of years ago you explained the purpose of the new page flags to
me and there is no way I can find that email again. Could you explain it
again please? Meanwhile I am doing my duty and reading your OLS slides
etc.

Regards,

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> This factor of four (even worse on XFS, not quite as bad on Ext3) is
> worth ruminating upon. Is all of the difference explained by avoiding
> seeks on the server, which has the files in memory?

Here are some more stats for you to consider:

 (1) Copy the data across the network to a fresh Ext3 fs on the same
     partition I was using for the cache:

	[EMAIL PROTECTED] ~]# time cp -a /warthog/aaa /var/fscache
	real	0m39.052s
	user	0m0.368s
	sys	0m15.229s

 (2) Reboot and read back the files just written into Ext3 on the local
     disk:

	[EMAIL PROTECTED] ~]# time tar cf - /var/fscache/aaa >/dev/zero
	real	0m40.574s
	user	0m0.164s
	sys	0m3.512s

 (3) Run through the cache population process, and then run a tar
     directly on cachefiles's cache directory after a reboot:

	[EMAIL PROTECTED] ~]# time tar cf - /var/fscache/cache >/dev/zero
	real	4m53.104s
	user	0m0.192s
	sys	0m4.240s

So I guess there's a problem in cachefiles's efficiency - possibly due to
the fact that it tries to be fully asynchronous.

In case (1) this is very similar to the time for a read through a
completely cold cache (37.497s).

In case (2) this is comparable to cachefiles with a cache warmed prior to
a reboot (1m54.350s); in this case, however, cachefiles is doing some
extra work:

 (a) It's doing a lookup on the server for each file, in addition to the
     lookups on the disk. However, just doing a tar from plain NFS, the
     command completes in 22.330s.

 (b) It's reading an xattr per object for cache coherency management.

 (c) As the cache knows nothing of directories, files, etc., it lays its
     directory subtree out in a way that suits it. File lookup keys are
     turned into filenames. This may result in a less efficient
     arrangement in the cache than the original data, especially as
     directories may become very large, so Ext3 may be doing some extra
     work.

In case (3), this perhaps suggests that cachefiles's directory layout may
be part of the problem. Running the following:

	ls -ldSr `find . -type d`

in /var/fscache/cache shows that the directories are either 4096 bytes in
size (158 instances) or 12288 bytes in size (105 instances), for a total
of 263 directories. There are 19255 files.

Running that ls command in /warthog/aaa shows 1185 directories, all but
three of them 4096 bytes in size; two are 12288 bytes and one is 20480
bytes in size (include/linux/, unsurprisingly). There are 19258 files,
three of which are hardlinks to other files in the tree.

> This could be easily tested by running a test against a server that is
> the same as the client, and does not have the files in memory. If local
> access is still slower than network then there is a real issue with
> cache efficiency.

My server is also my desktop machine. The only way to guarantee that the
memory is scrubbed is to reboot it:-(

I'll look at setting up one of my other machines as an NFS server.

David
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> I am eventually going to suggest cutting the backing filesystem
> entirely out of the picture,

You still need a database to manage the cache. A filesystem such as Ext3
makes a very handy database for four reasons:

 (1) It exists and works.

 (2) It has a well defined interface within the kernel.

 (3) I can place my cache on, say, my root partition on my laptop. I
     don't have to dedicate a partition to the cache.

 (4) Userspace cache management tools (such as cachefilesd) have an
     already existing interface to use: rmdir, unlink, open, getdents,
     etc..

I do have a cache-on-blockdev thing, but it's basically a wandering tree
filesystem inside. It is, or was, much faster than ext3 on a clean cache,
but it degrades horribly over time because my free space reclamation
sucks - it gradually randomises the block allocation sequence over time.

So, what would you suggest instead of a backing filesystem?

> I really do not like the idea of force fitting this cache into a
> generic vfs model. Sun was collectively smoking some serious crack when
> they cooked that one up. But there is also the ageless principle
> "isness is more important than niceness".

What do you mean? I'm not doing it like Sun. The cache is a side path from
the netfs. It should be transparent to the user, the VFS and the server.

The only place it might not be transparent is that you might have to
instruct the netfs mount to use the cache. I'd prefer to do it some other
way than passing parameters to mount, though, as (1) this causes fun with
NIS distributed automounter maps, and (2) people are asking for a finer
grain of control than per-mountpoint. Unfortunately, I can't seem to find
a way to do it that's acceptable to Al.

> Which would require a change to NFS, not an option because you hope to
> work with standard servers? Of course with years to think about this,
> the required protocol changes were put into v4. Not.

I don't think there's much I can do about NFS. It requires the filesystem
with which the NFS server is dealing to have inode uniquifiers, which are
then incorporated into the file handle. I don't think the NFS protocol
itself needs to change to support this.

> Have you completely exhausted optimization ideas for the file handle
> lookup?

No, but there aren't many. CacheFiles doesn't actually do very much, and
it's hard to reduce that "not very much". The most obvious thing is to
prepopulate the dcache, but that's at the expense of memory usage.

Actually, if I cache the name => FH mapping I used last time, I can make a
start on looking up in the cache whilst simultaneously accessing the
server. If what's on the server has changed, I can ditch the speculative
cache lookup I was making and start a new cache lookup. However, storing
directory entries has penalties of its own, though it'll be necessary if
we want to do disconnected operation.

> > Where "lookup table" == "dcache". That would be good, yes.
> > cachefilesd prescans all the files in the cache, which ought to do
> > just that, but it doesn't seem to be very effective. I'm not sure
> > why.
>
> RCU? Anyway, it is something to be tracked down and put right.

cachefilesd runs in userspace. It's possible it isn't doing enough to
preload all the metadata.

> What I tried to say. So still... got any ideas? That extra synchronous
> network round trip is a killer. Can it be made streaming/async to keep
> throughput healthy?

That's a per-netfs thing. With the test rig I've got, it's going to the
on-disk cache that's the killer.
Going over the network is much faster. See the results I posted. For the
tarball load, and using Ext3 to back the cache:

	Cold NFS cache, no disk cache:		0m22.734s
	Warm on-disk cache, cold pagecaches:	1m54.350s

The problem is that reading using tar is a worst-case workload for this.
Everything it does is pretty much completely synchronous.

One thing that might help is if things like tar and find can be made to
use fadvise() on directories to hint to the filesystem (NFS, AFS,
whatever) that it's going to access every file in those directories.

Certainly AFS could make use of that: the directory is read as a file, and
the netfs then parses the file to get a list of vnode IDs that that
directory points to. It could then do bulk status fetch operations to
instantiate the inodes 50 at a time.

I don't know whether NFS could use it. Someone like Trond or SteveD or
Chuck would have to answer that.

David
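For what it's worth, the hint suggested above already has a plausible
userspace shape: a tar- or find-like tree walker could issue
posix_fadvise(POSIX_FADV_WILLNEED) on each directory before descending
into it. Whether a netfs would actually act on the advice is exactly the
open question here; the sketch below only shows the calling side.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Hint that every file under this directory is about to be read. */
	static int hint_dir(const char *path)
	{
		int err, fd = open(path, O_RDONLY | O_DIRECTORY);

		if (fd < 0) {
			perror(path);
			return -1;
		}
		/* offset 0, len 0 == "to the end of the file" */
		err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
		if (err)
			fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
		close(fd);
		return err ? -1 : 0;
	}

	int main(int argc, char *argv[])
	{
		for (int i = 1; i < argc; i++)
			hint_dir(argv[i]);
		return 0;
	}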
Re: [PATCH 00/37] Permit filesystem local caching
On Friday 22 February 2008 04:48, David Howells wrote:
> > But looking up the object in the cache should be nearly free - much
> > less than a microsecond per block.
>
> The problem is that you have to do a database lookup of some sort,
> possibly involving several synchronous disk operations.

Right, so the obvious optimization strategy for this corner of it is to
decimate the synchronous disk ops for the average case, for which there
are a variety of options, one of which you already suggested.

> CacheFiles does a disk lookup by taking the key given to it by NFS,
> turning it into a set of file or directory names, and doing a short
> pathwalk to the target cache file. Throwing in extra indices won't
> necessarily help. What matters is how quick the backing filesystem is
> at doing lookups. As it turns out, Ext3 is a fair bit better than BTRFS
> when the disk cache is cold.

All understood. I am eventually going to suggest cutting the backing
filesystem entirely out of the picture, with a view to improving both
efficiency and transparency, hopefully with a code size reduction as well.
But you are up and running with the filesystem approach, enough to tackle
the basic algorithm questions, which is worth a lot.

I really do not like the idea of force fitting this cache into a generic
vfs model. Sun was collectively smoking some serious crack when they
cooked that one up. But there is also the ageless principle "isness is
more important than niceness".

> > > The metadata problem is quite a tricky one since it increases with
> > > the number of files you're dealing with. As things stand in my
> > > patches, when NFS, for example, wants to access a new inode, it
> > > first has to go to the server to lookup the NFS file handle, and
> > > only then can it go to the cache to find out if there's a matching
> > > object in the cache.
> >
> > So without the persistent cache it can omit the LOOKUP and just send
> > the filehandle as part of the READ?
>
> What 'it'? Note that to get the filehandle, you have to do a LOOKUP op.
> With the cache, we could actually cache the results of lookups that
> we've done, however, we don't know that the results are still valid
> without going to the server:-/

What I was trying to say. It => the cache logic.

> AFS has a way around that - it versions its vnode (inode) IDs.

Which would require a change to NFS, not an option because you hope to
work with standard servers? Of course with years to think about this, the
required protocol changes were put into v4. Not.

/me hopes for an NFS hack to show up and explain the thinking there

Actually, there are many situations where changing both the client (you
must do that anyway) and the server is logistically practical. In fact
that is true for all actual use cases I know of for this cache model. So
elaborating the protocol is not an option to reject out of hand. A hack
along those lines could (should?) be provided as an opportunistic option.

Have you completely exhausted optimization ideas for the file handle
lookup?

> > > The reason my client going to my server is so quick is that the
> > > server has the dcache and the pagecache preloaded, so that
> > > across-network lookup operations are really, really quick, as
> > > compared to the synchronous slogging of the local disk to find the
> > > cache object.
> >
> > Doesn't that just mean you have to preload the lookup table for the
> > persistent cache so you can determine whether you are caching the
> > data for a filehandle without going to disk?
>
> Where "lookup table" == "dcache".
> That would be good, yes. cachefilesd prescans all the files in the
> cache, which ought to do just that, but it doesn't seem to be very
> effective. I'm not sure why.

RCU? Anyway, it is something to be tracked down and put right.

> > Your big can't-get-there-from-here is the round trip to the server to
> > determine whether you should read from the local cache. Got any
> > ideas?
>
> I'm not sure what you mean. Your statement should probably read "... to
> determine _what_ you should read from the local cache".

What I tried to say. So still... got any ideas? That extra synchronous
network round trip is a killer. Can it be made streaming/async to keep
throughput healthy?

> > And where is the Trond-meister in all of this?
>
> Keeping quiet as far as I can tell.

/me does the Trond summoning dance

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
Chris Mason <[EMAIL PROTECTED]> wrote:

> Thanks for trying this, of course I'll ask you to try again with the
> latest v0.13 code, it has a number of optimizations especially for CPU
> usage.

Here you go. The numbers are very similar.

David

= FEW BIG FILES TEST ON BTRFS v0.13 =

Completely cold caches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.202s
	user	0m0.000s
	sys	0m1.716s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.212s
	user	0m0.000s
	sys	0m0.896s

Warm BTRFS pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.197s
	user	0m0.000s
	sys	0m0.192s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.376s
	user	0m0.000s
	sys	0m0.372s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m1.543s
	user	0m0.004s
	sys	0m1.448s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m3.111s
	user	0m0.000s
	sys	0m2.856s

== MANY SMALL/MEDIUM FILE READING TEST ON BTRFS v0.13 ==

Completely cold caches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m31.575s
	user	0m0.176s
	sys	0m6.316s

Warm BTRFS pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m16.081s
	user	0m0.164s
	sys	0m5.528s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	2m15.245s
	user	0m0.064s
	sys	0m2.808s
RE: [PATCH 00/37] Permit filesystem local caching
> Well, the AFS paper that was referenced earlier was written around the
> time of 10bt and 100bt. Local disk caching worked well then. There
> should also be some papers at CITI about disk caching over slower
> connections, and disconnected operation (which should still be
> applicable today). There are still winners from local disk caching, but
> their numbers have been reduced. Server load reduction should be a win.
> I'm not sure if it's worth it from a security/manageability standpoint,
> but I haven't looked that closely at David's code.

One area that you might want to look at is WAN performance. When RPC RTT
goes up, ordinary NFS performance goes down. This tends to get overlooked
by the machine room folks. (There are several tools out there that can
introduce delay in an IP packet stream and emulate WAN RTTs.)

Just a thought,
rick
Re: [PATCH 00/37] Permit filesystem local caching
David Howells <[EMAIL PROTECTED]> wrote:

> > > Have you got before/after benchmark results?
> >
> > See attached.
>
> Attached here are results using BTRFS (patched so that it'll work at
> all) rather than Ext3 on the client on the partition backing the cache.

And here are XFS results. Tuning XFS makes a *really* big difference for
the lots of small/medium files being tarred case. However, in general
BTRFS is much better.

David
---
= FEW BIG FILES TEST ON XFS =

Completely cold caches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.286s
	user	0m0.000s
	sys	0m1.828s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.228s
	user	0m0.000s
	sys	0m1.360s

Warm NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.058s
	user	0m0.000s
	sys	0m0.060s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.122s
	user	0m0.000s
	sys	0m0.120s

Warm XFS pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.181s
	user	0m0.000s
	sys	0m0.180s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m1.034s
	user	0m0.000s
	sys	0m0.404s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m1.540s
	user	0m0.000s
	sys	0m0.256s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m3.003s
	user	0m0.000s
	sys	0m0.532s

== MANY SMALL/MEDIUM FILE READING TEST ON XFS ==

Completely cold caches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	4m56.827s
	user	0m0.180s
	sys	0m6.668s

Warm NFS pagecache:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m15.084s
	user	0m0.212s
	sys	0m5.008s

Warm XFS pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m13.547s
	user	0m0.220s
	sys	0m5.652s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	4m36.316s
	user	0m0.148s
	sys	0m4.440s

=== MANY SMALL/MEDIUM FILE READING TEST ON AN OPTIMISED XFS ===

mkfs.xfs -d agcount=4 -l size=128m,version=2 /dev/sda6

Completely cold caches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	3m44.033s
	user	0m0.248s
	sys	0m6.632s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	3m8.582s
	user	0m0.108s
	sys	0m3.420s
Re: [PATCH 00/37] Permit filesystem local caching
Chris Mason <[EMAIL PROTECTED]> wrote:

> > The interesting case is where the disk cache is warm, but the
> > pagecache is cold (ie: just after a reboot after filling the caches).
> > Here, for the two big files case, BTRFS appears quite a bit better
> > than Ext3, showing a 21% reduction in time for the smaller case and a
> > 13% reduction for the larger case.
>
> I'm afraid I don't have a good handle on the filesystem operations that
> result from this workload. Are we reading from the FS to fill the NFS
> page cache?

I'm not sure what you're asking.

When the cache is cold, we determine that we can't read from the cache
very quickly. We then read data from the server and, in the background,
create the metadata in the cache and store the data to it (by copying
netfs pages to backingfs pages).

When the cache is warm, we read the data from the cache, copying the data
from the backingfs pages to the netfs pages. We use bmap() to ascertain
that there is data to be read, otherwise we detect a hole and fall back to
reading from the server.

Looking up a cache object involves a sequence of lookup() ops and
getxattr() ops on the backingfs. Should an object not exist, we defer
creation of that object to a background thread and do lookups(), mkdirs(),
setxattrs() and a create() to manufacture the object.

We read data from an object by calling readpages() on the backingfs to
bring the data into the pagecache. We monitor the PG_lock bits to find out
when each page is read or has completed with an error.

Writing pages to the cache is done completely in the background.
PG_fscache_write is set on a page when it is handed to fscache for
storage, then at some point a background thread wakes up and calls
write_one_page() in the backingfs to write that page to the cache file. At
the moment, this copies the data into a backingfs page which is then
marked PG_dirty, and the VM writes it out in the usual way.

> > More surprising is that BTRFS performed significantly worse (15%
> > increase in time) in the case where the cache on disk was fully
> > populated and then the machine had been rebooted to clear the
> > pagecaches.
>
> Which FS operations are included here? Finding all the files or just an
> unmount? Btrfs defrags metadata in the background, and unmount has to
> wait for that defrag to finish.

BTRFS might not be doing any writing at all here - apart from local atimes
(used by cache culling), that is. What it does have to do is lots of
lookups, reads and getxattrs, all of which are synchronous.

David
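The bmap()-based hole check mentioned above can be illustrated roughly as
follows; this is a simplification of the idea, not the actual CacheFiles
code, and cache_block_present() is an invented name:

	/*
	 * Ask the backing fs whether a disk block backs page 'index' of
	 * the sparse cache file.  bmap() returning 0 means a hole, i.e.
	 * the data was never cached and must be fetched from the server.
	 */
	static bool cache_block_present(struct inode *backing_inode,
					pgoff_t index)
	{
		sector_t block;

		/* convert page index to block index in the backing file */
		block = (sector_t)index <<
			(PAGE_SHIFT - backing_inode->i_blkbits);

		return bmap(backing_inode, block) != 0;
	}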
Re: [PATCH 00/37] Permit filesystem local caching
On Thursday 21 February 2008, David Howells wrote:
> David Howells <[EMAIL PROTECTED]> wrote:
>
> > > Have you got before/after benchmark results?
> >
> > See attached.
>
> Attached here are results using BTRFS (patched so that it'll work at
> all) rather than Ext3 on the client on the partition backing the cache.

Thanks for trying this; of course I'll ask you to try again with the
latest v0.13 code, it has a number of optimizations especially for CPU
usage.

> Note that I didn't bother redoing the tests that didn't involve a
> cache, as the choice of filesystem backing the cache should have no
> bearing on the result.
>
> Generally, completely cold caches shouldn't show much variation as all
> the writing can be done completely asynchronously, provided the client
> doesn't fill its RAM.
>
> The interesting case is where the disk cache is warm, but the pagecache
> is cold (ie: just after a reboot after filling the caches). Here, for
> the two big files case, BTRFS appears quite a bit better than Ext3,
> showing a 21% reduction in time for the smaller case and a 13%
> reduction for the larger case.

I'm afraid I don't have a good handle on the filesystem operations that
result from this workload. Are we reading from the FS to fill the NFS page
cache?

> For the many small/medium files case, BTRFS performed significantly
> better (15% reduction in time) in the case where the caches were
> completely cold. I'm not sure why, though - perhaps because it doesn't
> execute a write_begin() stage during the write_one_page() call and thus
> doesn't go allocating disk blocks to back the data, but instead
> allocates them later.

If your write_one_page call does parts of btrfs_file_write, you'll get
delayed allocation for anything bigger than 8k by default. <= 8k will get
packed into the btree leaves.

> More surprising is that BTRFS performed significantly worse (15%
> increase in time) in the case where the cache on disk was fully
> populated and then the machine had been rebooted to clear the
> pagecaches.

Which FS operations are included here? Finding all the files, or just an
unmount? Btrfs defrags metadata in the background, and unmount has to wait
for that defrag to finish.

Thanks again,
Chris
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> > The way the client works is like this:
>
> Thanks for the excellent ascii art, that cleared up the confusion right
> away.

You know what they say about pictures... :-)

> > What are you trying to do exactly? Are you actually playing with it,
> > or just looking at the numbers I've produced?
>
> Trying to see if you are offering enough of a win to justify testing
> it, and if that works out, then going shopping for a bin of rotten
> vegetables to throw at your design, which I hope you will perceive as
> useful.

One thing that you have to remember: my test setup is pretty much the
worst case for showing the need for caching to improve performance.
There's a single client and a single server, they've got GigE networking
between them that has very little other load, and the server has
sufficient memory to hold the entire test data set.

> From the numbers you have posted I think you are missing some basic
> efficiencies that could take this design from the sorta-ok zone to wow!

Not really, it's just that this lashup could be considered designed to
show local caching in the worst light.

> But looking up the object in the cache should be nearly free - much
> less than a microsecond per block.

The problem is that you have to do a database lookup of some sort,
possibly involving several synchronous disk operations.

CacheFiles does a disk lookup by taking the key given to it by NFS,
turning it into a set of file or directory names, and doing a short
pathwalk to the target cache file. Throwing in extra indices won't
necessarily help. What matters is how quick the backing filesystem is at
doing lookups. As it turns out, Ext3 is a fair bit better than BTRFS when
the disk cache is cold.

> > The metadata problem is quite a tricky one since it increases with
> > the number of files you're dealing with. As things stand in my
> > patches, when NFS, for example, wants to access a new inode, it first
> > has to go to the server to lookup the NFS file handle, and only then
> > can it go to the cache to find out if there's a matching object in
> > the cache.
>
> So without the persistent cache it can omit the LOOKUP and just send
> the filehandle as part of the READ?

What 'it'? Note that to get the filehandle, you have to do a LOOKUP op.
With the cache, we could actually cache the results of lookups that we've
done; however, we don't know that the results are still valid without
going to the server:-/

AFS has a way around that - it versions its vnode (inode) IDs.

> > The reason my client going to my server is so quick is that the
> > server has the dcache and the pagecache preloaded, so that
> > across-network lookup operations are really, really quick, as
> > compared to the synchronous slogging of the local disk to find the
> > cache object.
>
> Doesn't that just mean you have to preload the lookup table for the
> persistent cache so you can determine whether you are caching the data
> for a filehandle without going to disk?

Where "lookup table" == "dcache". That would be good, yes. cachefilesd
prescans all the files in the cache, which ought to do just that, but it
doesn't seem to be very effective. I'm not sure why.

> > I can probably improve this a little by pre-loading the subindex
> > directories (hash tables) that I use to reduce the directory size in
> > the cache, but I don't know by how much.
>
> Ah I should have read ahead. I think the correct answer is "a lot".

Quite possibly.
It'll allow me to dispense with at least one fs lookup call per cache
object request call.

> Your big can't-get-there-from-here is the round trip to the server to
> determine whether you should read from the local cache. Got any ideas?

I'm not sure what you mean. Your statement should probably read "... to
determine _what_ you should read from the local cache".

> And where is the Trond-meister in all of this?

Keeping quiet as far as I can tell.

David
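To make the "key turned into file and directory names" step described
above concrete, here is a hypothetical userspace rendering of such a
scheme: a one-byte hash picks a subindex directory (keeping any single
directory from growing huge) and the key itself is hex-encoded into a leaf
filename. The layout shown is invented for illustration; the real
CacheFiles encoding differs.

	#include <stdio.h>

	static unsigned char subindex_hash(const unsigned char *key,
					   size_t len)
	{
		unsigned char h = 0;

		while (len--)
			h = h * 31 + *key++;
		return h;
	}

	/* Render "<root>/@xx/<hex-key>" - one pathwalk per component. */
	static void cache_object_path(char *buf, size_t buflen,
				      const char *root,
				      const unsigned char *key,
				      size_t keylen)
	{
		size_t n = snprintf(buf, buflen, "%s/@%02x/", root,
				    subindex_hash(key, keylen));

		for (size_t i = 0; i < keylen && n + 2 < buflen; i++)
			n += snprintf(buf + n, buflen - n, "%02x", key[i]);
	}

	int main(void)
	{
		unsigned char key[] = { 0x01, 0x04, 0xde, 0xad, 0xbe, 0xef };
		char path[256];

		cache_object_path(path, sizeof(path), "/var/fscache/cache",
				  key, sizeof(key));
		printf("%s\n", path);
		return 0;
	}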
Re: [PATCH 00/37] Permit filesystem local caching
On Thursday 21 February 2008 16:07, David Howells wrote:
> The way the client works is like this:

Thanks for the excellent ascii art, that cleared up the confusion right
away.

> What are you trying to do exactly? Are you actually playing with it, or
> just looking at the numbers I've produced?

Trying to see if you are offering enough of a win to justify testing it,
and if that works out, then going shopping for a bin of rotten vegetables
to throw at your design, which I hope you will perceive as useful. In
short, I am looking for a reason to throw engineering effort at it.

From the numbers you have posted I think you are missing some basic
efficiencies that could take this design from the sorta-ok zone to wow! I
think you may already be in the wow zone for taking load off a server, and
I know of applications where an NFS server gets hammered so badly that
having the client suck a little in the unloaded case is a price worth
paying. But the whole idea would be much more attractive if the
regressions were smaller.

> > Who is supposed to win big? Is this mainly about reducing the load on
> > the server, or is the client supposed to win even with a lightly
> > loaded server?
>
> These are difficult questions to answer. The obvious answer to both is
> "it depends", and the real answer to both is "it's a compromise".
>
> Inserting a cache adds overhead: you have to look in the cache to see
> if your objects are mirrored there, and then you have to look in the
> cache to see if the data you want is stored there; and then you might
> have to go to the server anyway and then schedule a copy to be stored
> in the cache.

But looking up the object in the cache should be nearly free - much less
than a microsecond per block. If not then there are design issues. I
suspect that you are doing yourself a disservice by going all the way
through the vfs to do this cache lookup, but this needs to be proved.

> The characteristics of this type of cache depend on a number of things:
> the filesystem backing it being the most obvious variable, but also how
> fragmented it is and the properties of the disk drive or drives it is
> on.

Double caching and vm unawareness of that has to hurt.

> The metadata problem is quite a tricky one since it increases with the
> number of files you're dealing with. As things stand in my patches,
> when NFS, for example, wants to access a new inode, it first has to go
> to the server to lookup the NFS file handle, and only then can it go to
> the cache to find out if there's a matching object in the cache.

So without the persistent cache it can omit the LOOKUP and just send the
filehandle as part of the READ?

> Worse, the cache must then perform several synchronous disk-bound
> metadata operations before it becomes possible to read from the cache.
> Worse still, this means that a read on the network file cannot proceed
> until (a) we've been to the server *plus* (b) we've been to the disk.
>
> The reason my client going to my server is so quick is that the server
> has the dcache and the pagecache preloaded, so that across-network
> lookup operations are really, really quick, as compared to the
> synchronous slogging of the local disk to find the cache object.

Doesn't that just mean you have to preload the lookup table for the
persistent cache so you can determine whether you are caching the data for
a filehandle without going to disk?
> I can probably improve this a little by pre-loading the subindex
> directories (hash tables) that I use to reduce the directory size in
> the cache, but I don't know by how much.

Ah, I should have read ahead. I think the correct answer is "a lot".

Your big can't-get-there-from-here is the round trip to the server to
determine whether you should read from the local cache. Got any ideas?

And where is the Trond-meister in all of this?

Regards,

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> When you say Ext3 cache vs NFS cache, is the first on the server and
> the second on the client?

The filesystem on the server is pretty much irrelevant as long as (a) it
doesn't change, and (b) all the data is in memory on the server anyway.

The way the client works is like this:

    +---------+
    |         |
    | NFS     |--+
    |         |  |
    +---------+  |
                 |        +----------+     +--------------+     +--------------+
    +---------+  +------->|          |     |              |     |              |
    |         |           |          |     |  CacheFiles  |     |  Ext3        |
    | AFS     |---------->| FS-Cache |---->|  /var/cache  |---->|  /dev/sda6   |
    |         |           |          |     |              |     |              |
    +---------+  +------->|          |     +--------------+     +--------------+
                 |        +----------+
    +---------+  |
    |         |  |
    | ISOFS   |--+
    |         |
    +---------+

 (1) NFS, say, asks FS-Cache to store/retrieve data for it;

 (2) FS-Cache asks the cache backend, in this case CacheFiles, to honour
     the operation;

 (3) CacheFiles 'opens' a file in a mounted filesystem, say Ext3, and
     does read and write operations of a sort on it;

 (4) Ext3 decides how the cache data is laid out on disk - CacheFiles
     just attempts to use one sparse file per netfs inode.

> I am trying to spot the numbers that show the sweet spot for this
> optimization, without much success so far.

What are you trying to do exactly? Are you actually playing with it, or
just looking at the numbers I've produced?

> Who is supposed to win big? Is this mainly about reducing the load on
> the server, or is the client supposed to win even with a lightly loaded
> server?

These are difficult questions to answer. The obvious answer to both is "it
depends", and the real answer to both is "it's a compromise".

Inserting a cache adds overhead: you have to look in the cache to see if
your objects are mirrored there, and then you have to look in the cache to
see if the data you want is stored there; and then you might have to go to
the server anyway and then schedule a copy to be stored in the cache.

The characteristics of this type of cache depend on a number of things:
the filesystem backing it being the most obvious variable, but also how
fragmented it is and the properties of the disk drive or drives it is on.

Whether it's worth having a cache depends on the characteristics of the
network versus the characteristics of the cache: latency of the cache vs
latency of the network, for example. Network loading is another: having a
cache on each of several clients sharing a server can reduce network
traffic by avoiding the read requests to the server. NFS has a
characteristic that it keeps spamming the server with file status
requests, so even if you take the read requests out of the load, an NFS
client still generates quite a lot of network traffic to the server - but
the reduction is still useful.

The metadata problem is quite a tricky one since it increases with the
number of files you're dealing with. As things stand in my patches, when
NFS, for example, wants to access a new inode, it first has to go to the
server to lookup the NFS file handle, and only then can it go to the cache
to find out if there's a matching object in the cache. Worse, the cache
must then perform several synchronous disk-bound metadata operations
before it becomes possible to read from the cache. Worse still, this means
that a read on the network file cannot proceed until (a) we've been to the
server *plus* (b) we've been to the disk.

The reason my client going to my server is so quick is that the server has
the dcache and the pagecache preloaded, so that across-network lookup
operations are really, really quick, as compared to the synchronous
slogging of the local disk to find the cache object.
I can probably improve this a little by pre-loading the subindex
directories (hash tables) that I use to reduce the directory size in the
cache, but I don't know by how much.

Anyway, to answer your questions:

 (1) It may help with heavily loaded networks with lots of read-only
     traffic.

 (2) It may help with slow connections (like doing NFS between the UK and
     Australia).

 (3) It could be used to do offline/disconnected operation.

David
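Step (1) of the diagram above, seen from the netfs side, might look
roughly like the following sketch. fscache_read_or_alloc_page() and its
three-way result are taken from the patch set's netfs API documentation;
the mynetfs_* names are placeholders and the error handling is heavily
condensed.

	/* Completion callback run when the cache backend finishes I/O. */
	static void mynetfs_cache_end_io(struct page *page, void *context,
					 int error)
	{
		if (!error)
			SetPageUptodate(page);
		unlock_page(page);
	}

	static int mynetfs_readpage(struct file *file, struct page *page)
	{
		struct mynetfs_inode *ni = MYNETFS_I(file->f_mapping->host);
		int ret;

		ret = fscache_read_or_alloc_page(ni->cookie, page,
						 mynetfs_cache_end_io, NULL,
						 GFP_KERNEL);
		switch (ret) {
		case 0:		/* dispatched to the cache; completes async */
			return 0;
		case -ENODATA:	/* block reserved but no data cached yet */
		case -ENOBUFS:	/* no cache available at all */
		default:
			/* fall back to the server; the netfs would then
			 * fscache_write_page() the result in background */
			return mynetfs_read_from_server(page);
		}
	}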
Re: [PATCH 00/37] Permit filesystem local caching
David Howells <[EMAIL PROTECTED]> wrote:

> > Have you got before/after benchmark results?
>
> See attached.

Attached here are results using BTRFS (patched so that it'll work at all)
rather than Ext3 on the client on the partition backing the cache.

Note that I didn't bother redoing the tests that didn't involve a cache,
as the choice of filesystem backing the cache should have no bearing on
the result.

Generally, completely cold caches shouldn't show much variation as all the
writing can be done completely asynchronously, provided the client doesn't
fill its RAM.

The interesting case is where the disk cache is warm, but the pagecache is
cold (ie: just after a reboot after filling the caches). Here, for the two
big files case, BTRFS appears quite a bit better than Ext3, showing a 21%
reduction in time for the smaller case and a 13% reduction for the larger
case.

For the many small/medium files case, BTRFS performed significantly better
(15% reduction in time) in the case where the caches were completely cold.
I'm not sure why, though - perhaps because it doesn't execute a
write_begin() stage during the write_one_page() call and thus doesn't go
allocating disk blocks to back the data, but instead allocates them later.

More surprising is that BTRFS performed significantly worse (15% increase
in time) in the case where the cache on disk was fully populated and then
the machine had been rebooted to clear the pagecaches.

It's important to note that I've only run each test once apiece, so the
numbers should be taken with a modicum of salt (bad statistics and all
that).

David
---
=== FEW BIG FILES TEST ON BTRFS ===

Completely cold caches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.124s
	user	0m0.000s
	sys	0m1.260s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.538s
	user	0m0.000s
	sys	0m2.624s

Warm NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.061s
	user	0m0.000s
	sys	0m0.064s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.118s
	user	0m0.000s
	sys	0m0.116s

Warm BTRFS pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.189s
	user	0m0.000s
	sys	0m0.188s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.369s
	user	0m0.000s
	sys	0m0.368s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m1.540s
	user	0m0.000s
	sys	0m1.440s

	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m3.132s
	user	0m0.000s
	sys	0m1.724s

=== MANY SMALL/MEDIUM FILE READING TEST ON BTRFS ===

Completely cold caches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m31.838s
	user	0m0.192s
	sys	0m6.076s

Warm NFS pagecache:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m14.841s
	user	0m0.148s
	sys	0m4.988s

Warm BTRFS pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m16.773s
	user	0m0.148s
	sys	0m5.512s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	2m12.527s
	user	0m0.080s
	sys	0m2.908s
RE: [PATCH 00/37] Permit filesystem local caching
Well, the AFS paper that was referenced earlier was written around the time of
10bt and 100bt.  Local disk caching worked well then.  There should also be
some papers at CITI about disk caching over slower connections, and
disconnected operation (which should still be applicable today).

There are still winners from local disk caching, but their numbers have been
reduced.  Server load reduction should be a win.  I'm not sure if it's worth
it from a security/manageability standpoint, but I haven't looked that
closely at David's code.

-Dan

-----Original Message-----
From: Daniel Phillips [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 21, 2008 2:44 PM
To: David Howells
Cc: Myklebust, Trond; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
linux-security-module@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCH 00/37] Permit filesystem local caching

Hi David,

I am trying to spot the numbers that show the sweet spot for this
optimization, without much success so far.  Who is supposed to win big?  Is
this mainly about reducing the load on the server, or is the client supposed
to win even with a lightly loaded server?

When you say Ext3 cache vs NFS cache, is the first on the server and the
second on the client?

Regards,

Daniel

___
NFSv4 mailing list
[EMAIL PROTECTED]
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
Hi David,

I am trying to spot the numbers that show the sweet spot for this
optimization, without much success so far.  Who is supposed to win big?  Is
this mainly about reducing the load on the server, or is the client supposed
to win even with a lightly loaded server?

When you say Ext3 cache vs NFS cache, is the first on the server and the
second on the client?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
On Thu, Feb 21, 2008 at 9:55 AM, David Howells <[EMAIL PROTECTED]> wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
>
> > Have you got before/after benchmark results?
>
> See attached.
>
> These show a couple of things:
>
>  (1) Dealing with lots of metadata slows things down a lot.  Note the
>      result of looking up and reading lots of small files with tar (the
>      last result).  The NFS client has to both consult the NFS server *and*
>      the cache.  Not only that, but any asynchronicity the cache may like
>      to do is rendered ineffective by the fact tar wants to do a read on a
>      file pretty much directly after opening it.
>
>  (2) Getting metadata from the local disk fs is slower than pulling it
>      across an unshared gigabit ethernet from a server that already has it
>      in memory.

Hi David,

Your results remind me of this, in case you're interested...

http://www.citi.umich.edu/techreports/reports/citi-tr-92-3.pdf
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> Have you got before/after benchmark results?

See attached.

These show a couple of things:

 (1) Dealing with lots of metadata slows things down a lot.  Note the result
     of looking up and reading lots of small files with tar (the last
     result).  The NFS client has to both consult the NFS server *and* the
     cache.  Not only that, but any asynchronicity the cache may like to do
     is rendered ineffective by the fact tar wants to do a read on a file
     pretty much directly after opening it.

 (2) Getting metadata from the local disk fs is slower than pulling it across
     an unshared gigabit ethernet from a server that already has it in
     memory.

These points don't mean that fscache is no use, just that you have to
consider carefully whether it's of use to *you* given your particular
situation, and that depends on various factors.

Note that currently FS-Caching is disabled for individual NFS files opened
for writing as there's no way to handle the coherency problems thereby
introduced.

David
---
=== FS-CACHE FOR NFS BENCHMARKS ===

 (*) The NFS client has a 1.86GHz Core2 Duo CPU and 1GB of RAM.

 (*) The NFS client has a Seagate ST380211AS 80GB 7200rpm SATA disk on an
     interface running in AHCI mode.  The chipset is an Intel G965.

 (*) A partition of approx 4.5GB is committed to caching, and is formatted as
     Ext3 with a blocksize of 4096 and directory indices.

 (*) The NFS client is using SELinux.

 (*) The NFS server is running an in-kernel NFSd, and has a 2.66GHz Core2 Duo
     CPU and 6GB of RAM.  The chipset is an Intel P965.

 (*) The NFS client is connected to the NFS server by Gigabit Ethernet.

 (*) The NFS mount is made with defaults for all options not relating to the
     cache:

	warthog:/warthog /warthog nfs
	rw,vers=3,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,
	retrans=2,sec=sys,fsc,addr=w.x.y.z 0 0

== FEW BIG FILES TEST ==

Where:

 (*) The NFS server has two files:

	[EMAIL PROTECTED] ~]# ls -l /warthog/bigfile
	-rw-rw-r-- 1 4043 4043 104857600 2006-11-30 09:39 /warthog/bigfile
	[EMAIL PROTECTED] ~]# ls -l /warthog/biggerfile
	-rw-rw-r-- 1 4043 4041 209715200 2006-03-21 13:56 /warthog/biggerfile

     Both of which are in memory on the server in all cases.
No patches, cold NFS cache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m1.909s
	user	0m0.000s
	sys	0m0.520s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m3.750s
	user	0m0.000s
	sys	0m0.904s

CONFIG_FSCACHE=n, cold NFS cache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.003s
	user	0m0.000s
	sys	0m0.124s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.100s
	user	0m0.004s
	sys	0m0.488s

Cold NFS cache, no disk cache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.084s
	user	0m0.000s
	sys	0m0.136s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.020s
	user	0m0.000s
	sys	0m0.720s

Completely cold caches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.412s
	user	0m0.000s
	sys	0m0.892s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.449s
	user	0m0.000s
	sys	0m2.300s

Warm NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.067s
	user	0m0.000s
	sys	0m0.064s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.133s
	user	0m0.000s
	sys	0m0.136s

Warm Ext3 pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.173s
	user	0m0.000s
	sys	0m0.172s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.316s
	user	0m0.000s
	sys	0m0.316s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m1.955s
	user	0m0.000s
	sys	0m0.244s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m3.596s
	user	0m0.000s
	sys	0m0.460s

== MANY SMALL/MEDIUM FILE READING TEST ==

Where:

 (*) The NFS server has an old kernel tree:

	[EMAIL PROTECTED] ~]# du -s /warthog/aaa
	347340	/warthog/aaa
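[The tar result makes more sense given the read path that point (1) above
describes: on a miss the client must try the cache and then still go to the
server before it can store the data back locally.  A toy, self-contained
sketch of that cache-then-server fallback shape follows; all names are
hypothetical simplifications, not the FS-Cache API.]

	/* Toy illustration of the cache-then-server read fallback.
	 * Not the real FS-Cache API; the stubs stand in for disk and
	 * network I/O. */
	#include <stdio.h>
	#include <string.h>

	struct page { char data[4096]; int in_cache; };

	/* Stub: pretend the disk cache only has pages marked in_cache. */
	static int cache_read_page(struct page *p)
	{
		return p->in_cache ? 0 : -1;
	}

	/* Stub: the server always has the data. */
	static int server_read_page(struct page *p)
	{
		memset(p->data, 'x', sizeof(p->data));
		return 0;
	}

	/* Stub: store to cache; the real thing would be asynchronous. */
	static void cache_store_page(struct page *p)
	{
		p->in_cache = 1;
	}

	static const char *netfs_read_page(struct page *p)
	{
		if (cache_read_page(p) == 0)
			return "cache hit";	/* satisfied locally */
		if (server_read_page(p) < 0)
			return "error";
		cache_store_page(p);		/* fill cache for next time */
		return "server read, cache filled";
	}

	int main(void)
	{
		struct page p = { .in_cache = 0 };
		printf("%s\n", netfs_read_page(&p)); /* server read, cache filled */
		printf("%s\n", netfs_read_page(&p)); /* cache hit */
		return 0;
	}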
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> > These patches add local caching for network filesystems such as NFS.
>
> Have you got before/after benchmark results?

I need to get a new hard drive for my test machine before I can go and get
some more up-to-date benchmark results.  It does seem, however, that the I/O
error handling capabilities of FS-Cache work properly :-)

David
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
Hi David,

On Wednesday 20 February 2008 08:05, David Howells wrote:
> These patches add local caching for network filesystems such as NFS.

Have you got before/after benchmark results?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
On Wed, 2008-02-20 at 13:58 -0600, Serge E. Hallyn wrote:
>
> Seems *really* weird that every time you send this, patch 6 doesn't seem
> to reach me in any of my mailboxes... (did get it from the url
> you listed)

That's because patch #6 is 169K.  You also don't get patch #20, which is
140K, and patch #14, which is 237K.

For the SELinux list, messages longer than 100K need to be approved.  Since
David can't break his patches into smaller sizes and since he provides a link
to get the whole thing anyway, I don't bother approving the large patches.

The LSM list appears to have the same policy.  There are no messages larger
than 100K.

--
James Carter <[EMAIL PROTECTED]>
National Security Agency
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
Serge E. Hallyn <[EMAIL PROTECTED]> wrote:

> Seems *really* weird that every time you send this, patch 6 doesn't seem
> to reach me in any of my mailboxes... (did get it from the url
> you listed)

It's the largest of the patches, so that's not entirely surprising.  Hence
why I also included the URL to the tarball.

> I'm sorry if I miss where you explicitly state this, but is it safe to
> assume, as perusing the patches suggests, that
>
> 1. tsk->sec never changes other than in task_alloc_security()?

Correct.

> 2. tsk->act_as is only ever dereferenced from (a) current->

That ought to be correct.

> except (b) in do_coredump?

Actually, do_coredump() only deals with current->act_as.

> (thereby carefully avoiding locking issues)

That's the idea.

> I'd still like to see some performance numbers.  Not to object to
> these patches, just to make sure there's no need to try and optimize
> more of the dereferences away when they're not needed.

I hope that the performance impact is minimal.  The kernel should spend very
little time looking at the security data.  I'll try and get some though.

> Oh, manually copied from patch 6, I see you have in the task_security
> struct definition:
>
>	kernel_cap_t	cap_bset;	/* ? */
>
> That comment can be filled in with 'capability bounding set' (for this
> task and all its future descendants).

Thanks.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
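[For anyone following this exchange without the patches to hand, the shape
under discussion is roughly the following.  This is a reconstruction from the
cover letter and the comments above, not the actual contents of patch 6; the
kernel_cap_t stand-in and the omitted fields are assumptions.]

	/* Sketch of the objective/subjective security split: two pointers
	 * in task_struct, one fixed, one overridable.  Illustrative only. */
	typedef unsigned long long kernel_cap_t;  /* stand-in for the kernel type */

	struct task_security {
		/* ... uids/gids, keyrings, LSM security pointer, refcount ... */
		kernel_cap_t	cap_bset;	/* capability bounding set (for
						 * this task and all its future
						 * descendants) */
	};

	struct task_struct_sketch {
		/* ... */
		struct task_security *sec;	/* objective security: how other
						 * tasks may affect this one;
						 * set only in
						 * task_alloc_security() */
		struct task_security *act_as;	/* subjective security: how this
						 * task may affect other
						 * objects; only ever
						 * dereferenced via
						 * current->act_as */
		/* ... */
	};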
Re: [PATCH 00/37] Permit filesystem local caching
Quoting David Howells ([EMAIL PROTECTED]):
>
> These patches add local caching for network filesystems such as NFS.
>
> The patches can roughly be broken down into a number of sets:
>
>  (*) 01-keys-inc-payload.diff
>  (*) 02-keys-search-keyring.diff
>  (*) 03-keys-callout-blob.diff
>
>      Three patches to the keyring code made to help the CIFS people.
>      Included because of patches 05-08.
>
>  (*) 04-keys-get-label.diff
>
>      A patch to allow the security label of a key to be retrieved.
>      Included because of patches 05-08.
>
>  (*) 05-security-current-fsugid.diff
>  (*) 06-security-separate-task-bits.diff

Seems *really* weird that every time you send this, patch 6 doesn't seem
to reach me in any of my mailboxes... (did get it from the url
you listed)

I'm sorry if I miss where you explicitly state this, but is it safe to
assume, as perusing the patches suggests, that

1. tsk->sec never changes other than in task_alloc_security()?

2. tsk->act_as is only ever dereferenced from (a) current->, except (b) in
   do_coredump? (thereby carefully avoiding locking issues)

I'd still like to see some performance numbers.  Not to object to these
patches, just to make sure there's no need to try and optimize more of the
dereferences away when they're not needed.

Oh, manually copied from patch 6, I see you have in the task_security
struct definition:

	kernel_cap_t	cap_bset;	/* ? */

That comment can be filled in with 'capability bounding set' (for this
task and all its future descendants).

thanks,
-serge

>  (*) 07-security-subjective.diff
>  (*) 08-security-kernel_service-class.diff
>  (*) 09-security-kernel-service.diff
>  (*) 10-security-nfsd.diff
>
>      Patches to permit the subjective security of a task to be overridden.
>      All the security details in task_struct are decanted into a new struct
>      that task_struct then has two pointers to: one that defines the
>      objective security of that task (how other tasks may affect it) and
>      one that defines the subjective security (how it may affect other
>      objects).
>
>      Note that I have dropped the idea of struct cred for the moment.  With
>      the amount of stuff that was excluded from it, it wasn't actually any
>      use to me.  However, it can be added later.
>
>      Required for cachefiles.
>
>  (*) 11-release-page.diff
>  (*) 12-fscache-page-flags.diff
>  (*) 13-add_wait_queue_tail.diff
>  (*) 14-fscache.diff
>
>      Patches to provide a local caching facility for network filesystems.
>
>  (*) 15-cachefiles-ia64.diff
>  (*) 16-cachefiles-ext3-f_mapping.diff
>  (*) 17-cachefiles-write.diff
>  (*) 18-cachefiles-monitor.diff
>  (*) 19-cachefiles-export.diff
>  (*) 20-cachefiles.diff
>
>      Patches to provide a local cache in a directory of an already mounted
>      filesystem.
>
>  (*) 21-nfs-comment.diff
>  (*) 22-nfs-fscache-option.diff
>  (*) 23-nfs-fscache-kconfig.diff
>  (*) 24-nfs-fscache-top-index.diff
>  (*) 25-nfs-fscache-server-obj.diff
>  (*) 26-nfs-fscache-super-obj.diff
>  (*) 27-nfs-fscache-inode-obj.diff
>  (*) 28-nfs-fscache-use-inode.diff
>  (*) 29-nfs-fscache-invalidate-pages.diff
>  (*) 30-nfs-fscache-iostats.diff
>  (*) 31-nfs-fscache-page-management.diff
>  (*) 32-nfs-fscache-read-context.diff
>  (*) 33-nfs-fscache-read-fallback.diff
>  (*) 34-nfs-fscache-read-from-cache.diff
>  (*) 35-nfs-fscache-store-to-cache.diff
>  (*) 36-nfs-fscache-mount.diff
>  (*) 37-nfs-fscache-display.diff
>
>      Patches to provide NFS with local caching.
>
> A couple of questions on the NFS iostat changes: (1) Should I update the
> iostat version number; (2) is it permitted to have conditional iostats?
>
> I've brought the patchset up to date with respect to the 2.6.25-rc1 merge
> window, in particular altering Smack to handle the split in objective and
> subjective security in the task_struct.
>
> --
> A tarball of the patches is available at:
>
>	http://people.redhat.com/~dhowells/fscache/patches/nfs+fscache-30.tar.bz2
>
> To use this version of CacheFiles, the cachefilesd-0.9 is also required.
> It is available as an SRPM:
>
>	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9-1.fc7.src.rpm
>
> Or as individual bits:
>
>	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9.tar.bz2
>	http://people.redhat.com/~dhowells/fscache/cachefilesd.fc
>	http://people.redhat.com/~dhowells/fscache/cachefilesd.if
>	http://people.redhat.com/~dhowells/fscache/cachefilesd.te
>	http://people.redhat.com/~dhowells/fscache/cachefilesd.spec
>
> The .fc, .if and .te files are for manipulating SELinux.
>
> David
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 00/37] Permit filesystem local caching
These patches add local caching for network filesystems such as NFS.

The patches can roughly be broken down into a number of sets:

 (*) 01-keys-inc-payload.diff
 (*) 02-keys-search-keyring.diff
 (*) 03-keys-callout-blob.diff

     Three patches to the keyring code made to help the CIFS people.
     Included because of patches 05-08.

 (*) 04-keys-get-label.diff

     A patch to allow the security label of a key to be retrieved.
     Included because of patches 05-08.

 (*) 05-security-current-fsugid.diff
 (*) 06-security-separate-task-bits.diff
 (*) 07-security-subjective.diff
 (*) 08-security-kernel_service-class.diff
 (*) 09-security-kernel-service.diff
 (*) 10-security-nfsd.diff

     Patches to permit the subjective security of a task to be overridden.
     All the security details in task_struct are decanted into a new struct
     that task_struct then has two pointers to: one that defines the
     objective security of that task (how other tasks may affect it) and one
     that defines the subjective security (how it may affect other objects).

     Note that I have dropped the idea of struct cred for the moment.  With
     the amount of stuff that was excluded from it, it wasn't actually any
     use to me.  However, it can be added later.

     Required for cachefiles.

 (*) 11-release-page.diff
 (*) 12-fscache-page-flags.diff
 (*) 13-add_wait_queue_tail.diff
 (*) 14-fscache.diff

     Patches to provide a local caching facility for network filesystems.

 (*) 15-cachefiles-ia64.diff
 (*) 16-cachefiles-ext3-f_mapping.diff
 (*) 17-cachefiles-write.diff
 (*) 18-cachefiles-monitor.diff
 (*) 19-cachefiles-export.diff
 (*) 20-cachefiles.diff

     Patches to provide a local cache in a directory of an already mounted
     filesystem.

 (*) 21-nfs-comment.diff
 (*) 22-nfs-fscache-option.diff
 (*) 23-nfs-fscache-kconfig.diff
 (*) 24-nfs-fscache-top-index.diff
 (*) 25-nfs-fscache-server-obj.diff
 (*) 26-nfs-fscache-super-obj.diff
 (*) 27-nfs-fscache-inode-obj.diff
 (*) 28-nfs-fscache-use-inode.diff
 (*) 29-nfs-fscache-invalidate-pages.diff
 (*) 30-nfs-fscache-iostats.diff
 (*) 31-nfs-fscache-page-management.diff
 (*) 32-nfs-fscache-read-context.diff
 (*) 33-nfs-fscache-read-fallback.diff
 (*) 34-nfs-fscache-read-from-cache.diff
 (*) 35-nfs-fscache-store-to-cache.diff
 (*) 36-nfs-fscache-mount.diff
 (*) 37-nfs-fscache-display.diff

     Patches to provide NFS with local caching.

A couple of questions on the NFS iostat changes: (1) Should I update the
iostat version number; (2) is it permitted to have conditional iostats?

I've brought the patchset up to date with respect to the 2.6.25-rc1 merge
window, in particular altering Smack to handle the split in objective and
subjective security in the task_struct.

--
A tarball of the patches is available at:

	http://people.redhat.com/~dhowells/fscache/patches/nfs+fscache-30.tar.bz2

To use this version of CacheFiles, cachefilesd-0.9 is also required.  It is
available as an SRPM:

	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9-1.fc7.src.rpm

Or as individual bits:

	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9.tar.bz2
	http://people.redhat.com/~dhowells/fscache/cachefilesd.fc
	http://people.redhat.com/~dhowells/fscache/cachefilesd.if
	http://people.redhat.com/~dhowells/fscache/cachefilesd.te
	http://people.redhat.com/~dhowells/fscache/cachefilesd.spec

The .fc, .if and .te files are for manipulating SELinux.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
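[To see why cachefiles needs the subjective/objective split described in the
cover letter above: a kernel thread doing I/O on behalf of the cache must act
with the cache's credentials rather than the calling task's.  A toy userspace
illustration of that save/override/restore pattern follows; every name in it
is hypothetical, and the real patches obviously do this inside the kernel.]

	/* Toy illustration of overriding a task's subjective security
	 * around cache I/O.  Not the patch API. */
	#include <stdio.h>

	struct task_security { const char *label; };

	struct task {
		struct task_security *sec;	/* objective: never changes */
		struct task_security *act_as;	/* subjective: overridable */
	};

	static void do_cache_io(struct task *t)
	{
		printf("doing cache I/O as '%s'\n", t->act_as->label);
	}

	int main(void)
	{
		struct task_security user = { "user_t" };
		struct task_security cache = { "cachefiles_t" };
		struct task current = { .sec = &user, .act_as = &user };

		/* Save, override, do the work, restore. */
		struct task_security *old = current.act_as;
		current.act_as = &cache;
		do_cache_io(&current);
		current.act_as = old;

		printf("objective security still '%s'\n", current.sec->label);
		return 0;
	}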
[PATCH 00/37] Permit filesystem local caching
These patches add local caching for network filesystems such as NFS.

The patches can roughly be broken down into a number of sets:

 (*) 01-keys-inc-payload.diff
 (*) 02-keys-search-keyring.diff
 (*) 03-keys-callout-blob.diff

     Three patches to the keyring code made to help the CIFS people.
     Included because of patches 05-08.

 (*) 04-keys-get-label.diff

     A patch to allow the security label of a key to be retrieved.
     Included because of patches 05-08.

 (*) 05-security-current-fsugid.diff
 (*) 06-security-separate-task-bits.diff
 (*) 07-security-subjective.diff
 (*) 08-security-kernel_service-class.diff
 (*) 09-security-kernel-service.diff
 (*) 10-security-nfsd.diff

     Patches to permit the subjective security of a task to be overridden.
     All the security details in task_struct are decanted into a new struct
     that task_struct then has two pointers to: one that defines the
     objective security of that task (how other tasks may affect it) and one
     that defines the subjective security (how it may affect other objects).

     Note that I have dropped the idea of struct cred for the moment.  With
     the amount of stuff that was excluded from it, it wasn't actually any
     use to me.  However, it can be added later.

     Required for cachefiles.

 (*) 11-release-page.diff
 (*) 12-fscache-page-flags.diff
 (*) 13-add_wait_queue_tail.diff
 (*) 14-fscache.diff

     Patches to provide a local caching facility for network filesystems.

 (*) 15-cachefiles-ia64.diff
 (*) 16-cachefiles-ext3-f_mapping.diff
 (*) 17-cachefiles-write.diff
 (*) 18-cachefiles-monitor.diff
 (*) 19-cachefiles-export.diff
 (*) 20-cachefiles.diff

     Patches to provide a local cache in a directory of an already mounted
     filesystem.

 (*) 21-nfs-comment.diff
 (*) 22-nfs-fscache-option.diff
 (*) 23-nfs-fscache-kconfig.diff
 (*) 24-nfs-fscache-top-index.diff
 (*) 25-nfs-fscache-server-obj.diff
 (*) 26-nfs-fscache-super-obj.diff
 (*) 27-nfs-fscache-inode-obj.diff
 (*) 28-nfs-fscache-use-inode.diff
 (*) 29-nfs-fscache-invalidate-pages.diff
 (*) 30-nfs-fscache-iostats.diff
 (*) 31-nfs-fscache-page-management.diff
 (*) 32-nfs-fscache-read-context.diff
 (*) 33-nfs-fscache-read-fallback.diff
 (*) 34-nfs-fscache-read-from-cache.diff
 (*) 35-nfs-fscache-store-to-cache.diff
 (*) 36-nfs-fscache-mount.diff
 (*) 37-nfs-fscache-display.diff

     Patches to provide NFS with local caching.

A couple of questions on the NFS iostat changes: (1) Should I update the
iostat version number; (2) is it permitted to have conditional iostats?

I've massively split up the NFS patches as requested by Trond Myklebust and
Chuck Lever.  I've also brought the patches up to date with the patch window
turbulence.

--
A tarball of the patches is available at:

	http://people.redhat.com/~dhowells/fscache/patches/nfs+fscache-29.tar.bz2

To use this version of CacheFiles, cachefilesd-0.9 is also required.  It is
available as an SRPM:

	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9-1.fc7.src.rpm

Or as individual bits:

	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9.tar.bz2
	http://people.redhat.com/~dhowells/fscache/cachefilesd.fc
	http://people.redhat.com/~dhowells/fscache/cachefilesd.if
	http://people.redhat.com/~dhowells/fscache/cachefilesd.te
	http://people.redhat.com/~dhowells/fscache/cachefilesd.spec

The .fc, .if and .te files are for manipulating SELinux.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html