Re: Random disk cache expiry
On Friday 31 January 2003 03:47, David Schultz wrote:
> You have found an optimal replacement algorithm for the case of
> repeated sequential reads. In fact, if you know in advance what
> the access pattern is going to be, it is *always* possible to find
> an optimal replacement algorithm. Specifically, you always
> replace the block in the cache that will not be used for the
> longest time in the future.
>
> If you want the system to always cache the right thing in the
> general case, the only hint you would need to provide the system
> is a reference string specifying the predicted access pattern.
> (If I were to do this, my first reaction would be to encode it as
> an FSA.) If that proves to be infeasible, I'm sure there are ways
> to approximate the same thing. The hard parts, I think, would be
> teaching the VM system to use the new information, and gathering
> statistics from which you form your hints.

I wouldn't even begin to try to deduce the access pattern from statistics; that sounds hard. In many cases, the application knows in advance what the access pattern will be. Someone else in the thread has suggested attaching hints to the file as one option, and I think the app itself should be able to signal its intention to repeatedly read the file.

I think trying to encode generalised access patterns is maybe not such a win. You could end up using a lot of memory storing the access pattern and a lot of effort encoding things as FSAs. Also, the FSA has to pick not just the block that won't be used for the longest time but a block that is actually in memory, and for a more general access pattern that is not necessarily easy. Reading a file straight through several times is relatively common and can be encoded very easily without much memory, and assuming the file is not being accessed non-sequentially by another process, picking the correct block is easy.
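For what it's worth, an application can already signal this kind of intention on POSIX systems that honour posix_fadvise(2) (exposed in Python as os.posix_fadvise, which is not available on every platform — hence the guard). A minimal sketch:

```python
import os

def read_sequentially(path, chunk=64 * 1024):
    """Read a file start to finish, hinting the kernel about the pattern."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advisory hint: we will read sequentially, so aggressive
        # read-ahead helps, and consumed pages are fair eviction targets.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        parts = []
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            parts.append(buf)
        return b"".join(parts)
    finally:
        os.close(fd)
```

The hint is purely advisory — the kernel is free to ignore it — but it is exactly the "app signals its intention" mechanism described above, without any per-file metadata.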
Anyway, it's all academic for me really as I don't even run BSD, and although I can write it when I have to, I don't find C anywhere near as much fun as Perl. I just thought it was an interesting question. I should probably go and get my coat now ;-)

F

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
For those thinking of playing with predictive caching (likely an area of considerable student endeavour/interest these days at both the filesystem and "web" level):

--- Matthew Dillon:
> So there is no 'perfect' caching algorithm. There
> are simply too many variables even in a well defined
> environment for even the best system heuristics to
> cover optimally.

--- David Schultz:
> If that proves to be infeasible, I'm sure there are
> ways to approximate the same thing. The hard parts,
> I think, would be teaching the VM system to use the
> new information, and gathering statistics from which
> you form your hints.

---

Right. It's easy if you know the complete future of the total system state, which of course you never will. Someone interested in this might try to apply the latest machine learning techniques, classifiers, etc., to the online problem. Variants of this are receiving lots of attention in areas such as gene sequence prediction.

I dunno, but it seems like a lot of the math ends up pretty similar to economics, and we all know how well those models work. Kind of funny, running an economic simulation in your kernel... but it's actually getting possible at some level, at least for research systems on modern machines. There was a time when you would be fired for putting floating-point in an OS.

http://csl.cse.ucsc.edu/acme.shtml :

"Many cache replacement policies have been invented and some perform better than others under certain workload and network-topological conditions. It is impossible and sub-optimal to manually choose cache replacement policies for workloads and topologies that are under continuous change. We use machine learning algorithms to automatically select the best current policy or mixtures of policies from a policy (a.k.a. expert) pool to provide an "adaptive caching" service."

- bruce
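The "policy pool" idea in that abstract can be illustrated with a standard multiplicative-weights scheme. This is my own toy reconstruction, not ACME's actual algorithm: each replacement policy runs on its own shadow cache, and a policy's weight is penalized whenever its shadow cache misses, so the policy best suited to the current workload floats to the top.

```python
import collections

class Expert:
    """One replacement policy simulated on its own 'shadow' cache."""

    def __init__(self, capacity, evict_newest):
        self.capacity = capacity
        self.evict_newest = evict_newest  # True ~ MRU eviction, False ~ LRU
        self.cache = collections.OrderedDict()
        self.weight = 1.0

    def access(self, block):
        hit = block in self.cache
        if hit:
            self.cache.move_to_end(block)  # mark most recently used
        else:
            if len(self.cache) >= self.capacity:
                # popitem(last=True) removes the most-recently-used entry,
                # popitem(last=False) the least-recently-used one.
                self.cache.popitem(last=self.evict_newest)
            self.cache[block] = True
        return hit

def policy_weights(trace, capacity, eta=0.1):
    """Multiplicative weights: penalize each expert whenever it misses."""
    experts = {"LRU": Expert(capacity, False), "MRU": Expert(capacity, True)}
    for block in trace:
        for expert in experts.values():
            if not expert.access(block):
                expert.weight *= (1.0 - eta)
    return {name: e.weight for name, e in experts.items()}
```

On a cyclic scan of a file larger than the cache, the MRU expert misses far less often than the LRU expert, so its weight ends up higher — a selector consulting the weights would "learn" to prefer MRU eviction for that workload.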
Re: Random disk cache expiry
> You have found an optimal replacement algorithm for the case of
> repeated sequential reads. In fact, if you know in advance what
> the access pattern is going to be, it is *always* possible to find
> an optimal replacement algorithm. Specifically, you always
> replace the block in the cache that will not be used for the
> longest time in the future.

Did everyone read the UBM paper from OSDI? It presents one possible solution for dealing with sequentially accessed files. Why is it not enough (at least to begin with)?
Re: Random disk cache expiry
Thus spake Fergal Daly <[EMAIL PROTECTED]>:
> [EMAIL PROTECTED] (Tim Kientzle) wrote in message news:...
> > Personally, I think there's a lot of merit to _trying_
>
> There's even more merit in only pretending to try...

Welcome to my quotes file.

> As you can see, the locking cache is always better than the random one and the
> file doesn't have to be very big for it to be hugely better.

You have found an optimal replacement algorithm for the case of repeated sequential reads. In fact, if you know in advance what the access pattern is going to be, it is *always* possible to find an optimal replacement algorithm. Specifically, you always replace the block in the cache that will not be used for the longest time in the future.

If you want the system to always cache the right thing in the general case, the only hint you would need to provide the system is a reference string specifying the predicted access pattern. (If I were to do this, my first reaction would be to encode it as an FSA.) If that proves to be infeasible, I'm sure there are ways to approximate the same thing. The hard parts, I think, would be teaching the VM system to use the new information, and gathering statistics from which you form your hints.
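The optimal policy described here is Belady's classic MIN algorithm. A small illustrative simulator (offline, assuming the whole reference string is known in advance, which is exactly the catch):

```python
def opt_faults(refs, capacity):
    """Belady's optimal replacement: on a miss with a full cache, evict
    the resident block whose next use lies furthest in the future."""
    cache = set()
    faults = 0
    for i, block in enumerate(refs):
        if block in cache:
            continue
        faults += 1
        if len(cache) >= capacity:
            def next_use(b):
                # Position of b's next reference after i; a block never
                # referenced again is the ideal victim.
                for j in range(i + 1, len(refs)):
                    if refs[j] == b:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return faults
```

On the textbook reference string 1,2,3,4,1,2,5,1,2,3,4,5 with three frames this yields 7 faults, the known optimum — no online algorithm can beat it, which is why everything else in this thread is about approximating it with hints.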
Re: Random disk cache expiry
:I think you missed Matt's point, which is well-taken:
:
:Even if everybody accesses it sequentially, if you have 100 processes
:accessing it sequentially at the *same* time, then it would be to your
:benefit to leave the "old" pages around because even though *this*
:process won't access it again, the *next* process very well might, if
:it just happens to be reading it sequentially as well but is a little
:further behind on its sequential read.

Right, and if the same 100 processes are accessing N files sequentially instead of just one, you get a different effect... you might blow out your cache if the aggregate size of the N files is too large. But then if some of those processes are accessing the same file, and other processes were accessing different files, you might want to cache that file but possibly not cache others, even though all the files are (for this example) the same size. But then what if some of the processes accessing some of those other files were from slow clients? You could get away with not caching those files, and then you might possibly be able to cache all the remaining files (being accessed by faster clients). And so on, and so forth.

It gets even more complicated when you throw in read versus write versus read-modify-write accesses, and even more complicated when you add other load factors (e.g. the sheer number of connections might reduce the memory available for caching on the fly, or trying to balance executable pages versus data pages to optimize paging and program performance).

So there is no 'perfect' caching algorithm. There are simply too many variables even in a well defined environment for even the best system heuristics to cover optimally.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
Re: Random disk cache expiry
Hi, re:

> If a file's access history were stored as a "hint"
> associated with the file, then it would
> be possible to make better up-front decisions about
> how to allocate cache space.

I believe at one time this was a hot area, and now maybe it is back. I vaguely recall a recent PhD in this area; you might want to contact him and get some up-to-date pointers (I think Tom Kroger spent some time at CMU looking at a lot of file access traces in this regard). I believe Tom modified a Linux filesystem to do just this and did a fair amount of benchmarking (which, of course, proved it (or at least his PhD ;) was useful)! In any case you might get some interesting starting refs if you are interested in this area...

http://www.cse.ucsc.edu/~tmk/
http://csl.cse.ucsc.edu/prediction.shtml
http://csl.cse.ucsc.edu/acme.shtml

- bruce
Re: Random disk cache expiry
On Thursday 30 January 2003 07:06 pm, Tim Kientzle wrote:
| Matthew Dillon wrote:
| > Your idea of 'sequential' access cache restriction only
| > works if there is just one process doing the accessing.
|
| Not necessarily. I suspect that there is
| a strong tendency to access particular files
| in particular ways. E.g., in your example of
| a download server, those files are always
| read sequentially. You can make similar assertions
| about a lot of files:
| [...]
| For example, if a process
| started to read a 10GB file that has historically been
| accessed sequentially, you could immediately decide
| to enable read-ahead for performance, but also mark
| those pages to be released as soon as they were read by the
| process.

I think you missed Matt's point, which is well-taken:

Even if everybody accesses it sequentially, if you have 100 processes accessing it sequentially at the *same* time, then it would be to your benefit to leave the "old" pages around because even though *this* process won't access it again, the *next* process very well might, if it just happens to be reading it sequentially as well but is a little further behind on its sequential read.
Re: Random disk cache expiry
:Not necessarily. I suspect that there is
:a strong tendency to access particular files
:in particular ways. E.g., in your example of
:a download server, those files are always
:read sequentially. You can make similar assertions
:about a lot of files: manpages, gzip files,
:C source code files, etc, are "always" read
:sequentially.
:
:If a file's access history were stored as a "hint"
:associated with the file, then it would
:be possible to make better up-front decisions about
:how to allocate cache space. The ideal would be to

This has been tried. It works up to a point, but not to the extent that you want it to. The basic problem is that past history does not necessarily predict future behavior. With the web server example, different client loads will result in different access behaviors. They might still all be sequential, but the combinations of multiple users will change the behavior enough that you would not be able to use the history as a reliable metric to control the cache.

There is also an issue of how to store the 'history'. It isn't a simple matter of storing when a block was last accessed. Analysis of the access history is just as important, and a lot of the type of analysis we humans do is intuitive and just cannot be replicated by a computer.

Basically it all devolves down into the case where if you know exactly how something is going to be accessed, or you need caching to work a certain way in order to guarantee a certain behavior, the foreknowledge you have of the access methodologies will allow you to cache the information manually far better than the system could cache it heuristically.

:store such hints on disk (maybe as an extended
:attribute?), but it might also be useful to cache
:them in memory somewhere. That would allow the
:cache-management code to make much earlier decisions
:about how to handle a file. For example, if a process
:started to read a 10GB file that has historically been
:accessed sequentially, you could immediately decide
:to enable read-ahead for performance, but also mark
:those pages to be released as soon as they were read by the
:process.
:
:FWIW, a web search for "randomized caching" yields
:some interesting reading. Apparently, there are
:a few randomized cache-management algorithms for
:which the mathematics work out reasonably well,
:despite Terry's protestations to the contrary. ;-)
:I haven't yet found any papers describing experiences
:with real implementations, though.
:
:If only I had the time to spend poring over FreeBSD's
:cache-management code to see how these ideas might
:actually be implemented...
:
:Tim Kientzle

It should be noted that we already implement most of the heuristics you talk about. We have a heuristic that detects sequential access patterns, for example, and enables clustered read-ahead. The problem isn't detection, the problem is scale. These heuristics work wonderfully at a small scale (i.e. let's read 64K ahead versus trying to cache 64MB worth of the file). Just knowing something is sequential does not allow you to choose how much memory you should set aside to cache that object, for example. Automatically depressing the priority of pages read sequentially after they've been used can have as terrible a performance impact as it can a positive one, depending on the size of the object, the number of distinct objects being accessed in that manner, perceived latency by end users, the number of end users, the speed of their connections (some objects may be accessed more slowly than others depending on the client's network bandwidth), and so forth.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
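The detection heuristic Matt mentions — notice consecutive offsets, grow a read-ahead window, collapse it on a seek — can be sketched roughly like this. This is a toy model of the idea only, not FreeBSD's actual code (the real version is a per-open-file heuristic in the kernel's read path):

```python
class SequentialDetector:
    """Toy read-ahead heuristic: consecutive offsets grow the window,
    a seek collapses it."""

    def __init__(self, block_size=4096, max_window=32):
        self.block_size = block_size
        self.max_window = max_window
        self.next_expected = None
        self.window = 0  # current read-ahead window, in blocks

    def on_read(self, offset):
        if offset == self.next_expected:
            # Sequential: double the read-ahead window, capped.
            self.window = min(max(1, self.window * 2), self.max_window)
        else:
            # Random seek: stop read-ahead entirely.
            self.window = 0
        self.next_expected = offset + self.block_size
        return self.window  # how many blocks to read ahead
```

Note how this embodies exactly the "small scale" point: the heuristic can decide to fetch the next 64K, but nothing in it says how much of the file deserves to *stay* cached.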
Re: Random disk cache expiry
Matthew Dillon wrote:
> Your idea of 'sequential' access cache restriction only
> works if there is just one process doing the accessing.

Not necessarily. I suspect that there is a strong tendency to access particular files in particular ways. E.g., in your example of a download server, those files are always read sequentially. You can make similar assertions about a lot of files: manpages, gzip files, C source code files, etc., are "always" read sequentially.

If a file's access history were stored as a "hint" associated with the file, then it would be possible to make better up-front decisions about how to allocate cache space. The ideal would be to store such hints on disk (maybe as an extended attribute?), but it might also be useful to cache them in memory somewhere. That would allow the cache-management code to make much earlier decisions about how to handle a file. For example, if a process started to read a 10GB file that has historically been accessed sequentially, you could immediately decide to enable read-ahead for performance, but also mark those pages to be released as soon as they were read by the process.

FWIW, a web search for "randomized caching" yields some interesting reading. Apparently, there are a few randomized cache-management algorithms for which the mathematics work out reasonably well, despite Terry's protestations to the contrary. ;-) I haven't yet found any papers describing experiences with real implementations, though.

If only I had the time to spend poring over FreeBSD's cache-management code to see how these ideas might actually be implemented...

Tim Kientzle
Re: Random disk cache expiry
On Thursday 30 January 2003 05:22 pm, Matthew Dillon wrote:
| Well, here's a counterpoint. Let's say you have an FTP
| server with 1G of ram full of, say, pirated CDs at 600MB a
| pop.
|
| Now let's say someone puts up a new Madonna CD and suddenly
| you have thousands of people from all over the world trying
| to download a single 600MB file.
|
| Let's try another one. Let's say you have an FTP server with
| 1G of ram full of hundreds of MPEG encoded pirated CDs at
| 50MB a pop and you have thousands of people from all over the
| world trying to download a core set of 25 CDs, which exceeds
| the available ram you have to cache all of them.
|
| What I'm trying to illustrate here is the impossibility of
| what you are asking. Your idea of 'sequential' access cache
| restriction only works if there is just one process doing the
| accessing. But if you have 25 processes accessing 25 different
| files sequentially it doesn't work, and how is the system supposed to
| detect the difference between 25 processes accessing 25 50MB files on
| a 1G machine (which doesn't fit in the cache) versus 300 processes
| accessing 15 50MB files on a 1G machine (which does fit)?
| Furthermore, how do you differentiate between 30 processes all
| downloading the same 600MB CD versus 30 processes downloading two
| different 600MB CDs, on a machine with 1G of cache?
|
| You can't. That's the problem. There is no magic number between
| 0 and the amount of memory you have where you can say "I am going
| to stop caching this sequential file" that covers even the more
| common situations that come up. There is no algorithm that can
| detect the above situations before the fact or on the fly. You
| can analyze the situation after the fact, but by then it is too
| late, and the situation may change from minute to minute. One minute
| you have 300 people trying to download one CD, the next minute you
| have 20 people trying to download 10 different CDs.
|
| -Matt

You are absolutely right, and thank you for fulfilling my request to pillory me for that bit:

| :Of course the trick here is waving my hands and saying "assume that
| : you know how the file will be accessed in the future." You ought
| : to pillory me for *that* bit. Even with hinting there are problems
| : with this whole idea. Still with some hinting the algorithm could
| : probably be a little more clever.
| :
| :(Actually, Terry Lambert *did* pillory me for that bit, just a bit,
| : when he pointed out the impossibility of knowing whether the file
| : is being used in the same way by other processes.)

Though actually that was awfully gentle for a pillory.

All that said, there are *always* pathological cases for any algorithm; it is my intuition that somehow giving cache retention "priority" to processes that randomly access files would be likely to be an overall win. Not that I'm planning to put my money where my mouth is and write some code to test this contention.

If you wanted to get *really* fancy you could add some code that tracked hits for given blocks over time and increased priority for blocks that were brought in often "recently" -- sort of an extended LRU list that wouldn't map a whole block in but would tell you what you'd tossed out lately, and you could give priority to blocks that were out of the "list cache" (or whatever we would call this secondary list of recently-used blocks). Um, in re-reading that paragraph it seems clear as mud . . .

Whether any of these schemes would be of any practical benefit in any common situations is highly dubious; they would all increase overhead, for one thing. I think that the definitive answer, really, is the one you gave earlier: if you have a specialized application or server which you know will make intense use of the file system in a predictable way, then you should customize the way it accesses the files, because no general-purpose algorithm can ever be optimal for all cases.
Re: Random disk cache expiry
Well, here's a counterpoint. Let's say you have an FTP server with 1G of ram full of, say, pirated CDs at 600MB a pop.

Now let's say someone puts up a new Madonna CD and suddenly you have thousands of people from all over the world trying to download a single 600MB file.

Let's try another one. Let's say you have an FTP server with 1G of ram full of hundreds of MPEG encoded pirated CDs at 50MB a pop, and you have thousands of people from all over the world trying to download a core set of 25 CDs, which exceeds the available ram you have to cache all of them.

What I'm trying to illustrate here is the impossibility of what you are asking. Your idea of 'sequential' access cache restriction only works if there is just one process doing the accessing. But if you have 25 processes accessing 25 different files sequentially it doesn't work, and how is the system supposed to detect the difference between 25 processes accessing 25 50MB files on a 1G machine (which doesn't fit in the cache) versus 300 processes accessing 15 50MB files on a 1G machine (which does fit)? Furthermore, how do you differentiate between 30 processes all downloading the same 600MB CD versus 30 processes downloading two different 600MB CDs, on a machine with 1G of cache?

You can't. That's the problem. There is no magic number between 0 and the amount of memory you have where you can say "I am going to stop caching this sequential file" that covers even the more common situations that come up. There is no algorithm that can detect the above situations before the fact or on the fly. You can analyze the situation after the fact, but by then it is too late, and the situation may change from minute to minute. One minute you have 300 people trying to download one CD, the next minute you have 20 people trying to download 10 different CDs.
-Matt

:The suggestion here basically boils down to this: if the system could
:act on hints that somebody will be doing sequential access, then it
:should be more timid about caching for that file access. That is to
:say, it should allow that file to "use up" a smaller number of blocks
:from the cache (yes, the VM) at a time, and it should favor, if
:anything, a LIFO scheme instead of the usual FIFO (LRU) scheme. (That
:is to say, for the special case of *sequential* access, LRU == FIFO,
:and yet LIFO is probably more optimal for this case, at least if the
:file will be re-read later.)
:
:Caching will do more good on files that will be randomly accessed;
:an intermediate amount of good on files sequentially accessed but
:rewound and/or accessed over and over; and if the file system could
:somehow know (or be hinted) that the file is being sequentially
:accessed and is unlikely to be accessed again for a good long while it
:would clearly be better off not caching it at all.
:
:Of course the trick here is waving my hands and saying "assume that you
:know how the file will be accessed in the future." You ought to
:pillory me for *that* bit. Even with hinting there are problems with
:this whole idea. Still with some hinting the algorithm could probably
:be a little more clever.
:
:(Actually, Terry Lambert *did* pillory me for that bit, just a bit, when
:he pointed out the impossibility of knowing whether the file is being
:used in the same way by other processes.)
:
:And . . . also to Terry, yes, I know that my proposal above
:over-simplifies, but the point is that for sequential access you want
:to go "gentle" on making the cache of other processes' and earlier reads
:go away.
Re: Random disk cache expiry
The suggestion here basically boils down to this: if the system could act on hints that somebody will be doing sequential access, then it should be more timid about caching for that file access. That is to say, it should allow that file to "use up" a smaller number of blocks from the cache (yes, the VM) at a time, and it should favor, if anything, a LIFO scheme instead of the usual FIFO (LRU) scheme. (That is to say, for the special case of *sequential* access, LRU == FIFO, and yet LIFO is probably more optimal for this case, at least if the file will be re-read later.)

Caching will do more good on files that will be randomly accessed; an intermediate amount of good on files sequentially accessed but rewound and/or accessed over and over; and if the file system could somehow know (or be hinted) that the file is being sequentially accessed and is unlikely to be accessed again for a good long while, it would clearly be better off not caching it at all.

Of course the trick here is waving my hands and saying "assume that you know how the file will be accessed in the future." You ought to pillory me for *that* bit. Even with hinting there are problems with this whole idea. Still, with some hinting the algorithm could probably be a little more clever.

(Actually, Terry Lambert *did* pillory me for that bit, just a bit, when he pointed out the impossibility of knowing whether the file is being used in the same way by other processes.)

And . . . also to Terry, yes, I know that my proposal above over-simplifies, but the point is that for sequential access you want to go "gentle" on making the cache of other processes' and earlier reads go away.

On Monday 27 January 2003 12:44 am, Tim Kientzle wrote:
| Brian T. Schellenberger wrote:
| > This to me is eminently sensible.
| > In fact there seem like two rules that have come up in this
| > discussion:
| >
| > 1. For sequential access, you should be very hesitant to throw away
| > *another* process's blocks, at least once you have used more than,
| > say, 25% of the cache or potential cache.
| >
| > 2. For sequential access, you should stop caching before you throw
| > away your own blocks. If it's sequential, it seems to me, it is
| > always a loss to throw away your *own* process's older blocks on
| > the same file.
|
| The question isn't "should I throw away blocks or not?".
| Obviously, the ideal is to keep everything in the cache, but
| that's not possible.
|
| The question is "what blocks should be discarded?" You've
| ruled out quite a few possibilities here. What blocks _should_
| be discarded?
|
| Tim Kientzle
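The LIFO intuition above is easy to check with a toy simulation (illustrative only; the block counts are made up). On a file re-read sequentially through a too-small cache, evicting the *most* recently used block preserves a stable prefix across passes, while LRU evicts every block just before it is needed again:

```python
from collections import OrderedDict

def simulate(trace, capacity, evict_mru):
    """Count cache hits over a block trace, evicting either the LRU
    end (evict_mru=False) or the MRU end (evict_mru=True)."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                # last=True pops the most-recently-used entry (MRU/LIFO),
                # last=False pops the least-recently-used entry (LRU).
                cache.popitem(last=evict_mru)
            cache[block] = True
    return hits

# Hypothetical workload: an 8-block file read end-to-end five times,
# through a 4-block cache.
trace = list(range(8)) * 5
```

With this workload LRU scores zero hits — classic sequential-flooding behavior — while MRU-style eviction keeps part of the file resident and hits on it every pass.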
Re: Random disk cache expiry
Basically what it comes down to is that without foreknowledge of the data locations being accessed, it is not possible for any cache algorithm to adapt to all the myriad ways data might be accessed. If you focus the cache on one methodology it will probably perform terribly when presented with some other methodology.

What this means is that for the cases where you *KNOW* how a program intends to access a dataset larger than main memory, it is better to have the program explicitly cache/not-cache the data under program control rather than trying to force the system cache to adapt.

I'll also be nice and decode some of Terry's jargon for the rest of the readers.

:will result in significant failure of random page replacement to
:result in cache hits; likewise, going to 85% overage will practically
:guarantee an almost 100% failure rate, as cyclical access with random
:replacement is statistically more likely, in aggregate, to replace
:the pages which are there longer (the probability is iterative and
:additive: it's effectively a permutation).

What Terry is saying is that if you have a dataset that is 2x the size of your cache, the cache hit rate on that data with random page replacement is NOT going to be 50%. This is because with random page replacement the likelihood of a piece of data being found in the cache depends on how long the data has been sitting in the cache. The longer the data has been sitting in the cache, the less likely you will find it when you need it (because it is more likely to have been replaced by the random replacement algorithm over time).

So, again, the best caching methodology to use in the case where you *know* how much data you will be accessing and how you will access it is to build the caching directly into your program and not depend on system caching at all (for datasets which are far larger than main memory). This is true of all systems, not just BSD.
This is one reason why databases do their own caching (another is so databases know when an access will require I/O for scheduling reasons, but that's a different story). The FreeBSD VM caching system does prevent one process from exhausting another process's cached data due to a sequential access, but the FreeBSD VM cache does not try to outsmart sequential data accesses to datasets which are larger than available cache space, because it's an insanely difficult (impossible) problem to solve properly without foreknowledge of what data elements will be accessed when.

This isn't to say that we can't improve on what we have now. I certainly think we can. But there is no magic bullet that will deal with every situation.

-Matt
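The "NOT going to be 50%" claim is easy to check numerically. A sketch, assuming a dataset exactly twice the cache size, cyclic access, and a fixed RNG seed for reproducibility:

```python
import random

def random_replacement_hit_rate(n_blocks, capacity, passes, seed=0):
    """Cyclically scan n_blocks through a cache that evicts a
    uniformly random resident block on each miss."""
    rng = random.Random(seed)
    cache = set()
    hits = accesses = 0
    for _ in range(passes):
        for block in range(n_blocks):
            accesses += 1
            if block in cache:
                hits += 1
            else:
                if len(cache) >= capacity:
                    # sorted() only to make the victim choice deterministic
                    # for a given seed; the policy itself is random.
                    cache.discard(rng.choice(sorted(cache)))
                cache.add(block)
    return hits / accesses

rate = random_replacement_hit_rate(n_blocks=200, capacity=100, passes=50)
```

The measured rate lands well below the naive 50% that "half the data fits" might suggest (a back-of-envelope fixed-point argument puts it near 20% for these sizes), which is exactly the survivorship effect being described: by the time the scan comes back around, the oldest pages are the ones most likely to have been randomly evicted.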
Re: Random disk cache expiry
Thus spake Tim Kientzle <[EMAIL PROTECTED]>:
> Sean Hamilton proposes:
> >Wouldn't it seem logical to have [randomized disk cache expiration] in
> >place at all times?
>
> Terry Lambert responds:
> >>:I really dislike the idea of random expiration; I don't understand
> >>:the point, unless you are trying to get better numbers on some
> >>:benchmark.
>
> Matt Dillon concedes:
> >>... it's only useful when you are cycling through a [large] data set ...
>
> Cycling through large data sets is not really that uncommon.
> I do something like the following pretty regularly:
>    find /usr/src -type f | xargs grep function_name
>
> Even scanning through a large dataset once can really hurt
> competing applications on the same machine by flushing
> their data from the cache for no gain. I think this
> is where randomized expiration might really win, by reducing the
> penalty for disk-cache-friendly applications who are competing
> with disk-cache-unfriendly applications.
>
> There's an extensive literature on randomized algorithms.
> Although I'm certainly no expert, I understand that such
> algorithms work very well in exactly this sort of application,
> since they "usually" avoid worst-case behavior under a broad
> variety of inputs. The current cache is, in essence,
> tuned specifically to work badly on a system where applications
> are scanning through large amounts of data. No matter what
> deterministic caching algorithm you use, you're choosing
> to behave badly under some situation.

Yes, if you randomly vary the behavior of the algorithm, you can guarantee that on average, performance will never be too bad for any particular input, but it will never be very good, on average, either. You can't mathematically prove everything about the memory access patterns of real-world programs, but LRU seems to do pretty well in a variety of situations. It does, however, have its worst cases.
A random replacement algorithm is very unlikely to do *as* badly as LRU in LRU's worst case; its performance is consistent and relatively poor. Keep in mind that there's a bias in most real-world programs that favors LRU more than randomness.

So should FreeBSD make it possible to ask for random replacement? Probably, since it would be helpful for those times when you *know* that LRU isn't going to do the right thing. (In the sequential read special case the OP mentions, random replacement is better than LRU, but still worse than a deterministic algorithm that just caches the prefix of the file that will fit in memory. So in this situation we could do even better in theory.)

Should randomness be part of the default replacement algorithm, as the OP suggests? Probably not, since that would be pessimizing performance in the common case for the sake of improving it in an uncommon case.

Should the system be able to detect cases where the default replacement algorithm is failing and dynamically modify its behavior? I think that would be really cool, if it is possible...
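The "cache the prefix" policy mentioned above is trivial to express once the repeated-scan pattern is known. A sketch with made-up sizes: a cache of `capacity` blocks permanently pins the first `capacity - 1` blocks and streams everything else through the one remaining slot.

```python
def pin_prefix_hits(n_blocks, capacity, passes):
    """Deterministic policy for a known repeated sequential scan:
    permanently cache blocks 0..capacity-2, stream the rest of the
    file through the single remaining cache slot."""
    pinned = set(range(capacity - 1))
    hits = 0
    for p in range(passes):
        for block in range(n_blocks):
            # Every pass after the cold first one hits each pinned block.
            if p > 0 and block in pinned:
                hits += 1
    return hits
```

Each warm pass scores exactly `capacity - 1` hits, guaranteed, with no reliance on luck — which is the sense in which a deterministic, pattern-aware policy beats random replacement here, and why foreknowledge keeps coming up in this thread.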
Re: Random disk cache expiry
"Brian T. Schellenberger" wrote: > 2. For sequential access, you should stop caching before you throw away > your own blocks. If it's sequential it is, it seems to me, always a > lose to throw away your *own* processes older bllocks on thee same > file. You can not have a block in a vm object which is not a cached copy of a data block in the backing store object. You can not access data blocks from the disk, except in the context of a backing store object. FreeBSD has a unified VM and buffer cache. FreeBSD does not copy data off the disk into a buffer, and then decide whether to copy the buffer/reference it from a cache: if it is in a buffer, it's in the cache, period. You are talking about avoiding a copy that doesn't happen. The real issue is whether, after notification of completion of the transfer (via wakeup of the blocked "read" system call), the VM buffer in question is relinked to the other end of the LRU list, artificially "aging" it. The answer to this is that you can't do it, because you do not know that the file system access itself is sequential for all people who have the file open at the time you make the request, and you can't know to which buffers you are actually referring, because this is handled via demand paging, "under the covers", and therefore not visible to the "read" call at that level. > These algorithmic changes seem to me to be more likely to be optimal > most of the time. No. All openers get different struct file's, but they get the same vp, which means they get the same vm_object_t. > A random approach *does* reduce the penalty for worst-case scenarios but > at the cost of reducing the benefit of both "normal" and "best-case" > scenarios, it seems to me, even more. I believe the "random page replacement" alrgotihm that was being referred to here is actually only supposed to be triggered if the file access has been advised tbe sequential; even so, there is no win in doing this, as you say (see other posting). 
-- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
Sean Hamilton wrote: > In my case I have a webserver serving up a few dozen files of about 10 MB > each. While yes it is true that I could purchase more memory, and I could > purchase more drives and stripe them, I am more interested in the fact that > this server is constantly grinding away because it has found a weakness in > the caching algorithm. This is a problem in Linux, and in SVR4, as well. This is the main reason that scheduler classes were implemented for SVR4, and the main reason the "fixed" scheduler class was implemented in SVR4.0.2/SVR4.2, as of UnixWare. The system guarantees time to the programs running under this scheduling class, and whatever other software pages out, the program then has the CPU time to page back in. This was UnixWare's response to "move mouse, wiggle cursor", when the X server performance went in the toilet. The reason the UnixWare X server performance was in the toilet is that the UnixWare "ld" program mmap's the .o files, and then seeks around in them to resolve symbols, rather than reading the parts in and building in-core data structures. The result is predictable: the linker, when linking any sufficiently large set of object files, thrashes all other pages out of core, as it moves around accessing pages in these files. The obvious answer to this problem -- and your problem -- is to implement a working set size quota on a per-vnode basis, as I previously suggested. By doing this, you cause the pages in the large objects to replace the old pages in the large objects, and your other data does not get thrashed out of core. 
This works great on UnixWare, but requires modification of the VM system in order to provide counting for the number of buffers that are hung off a given file object, with the result being that the code is not binary compatible (this is why the modification never made it into UnixWare, even though it was tested, and found to solve the "ld/X server problem" without leading to more thrashing, and do it without needing to introduce a new scheduling class, and make less CPU time available to other applications on the system to let the X server have sufficient time to thrash its pages back in). This approach will also work for your problem, which is that your several 10M files thrash everything else out of the cache, including the executable pages. Note that this approach need not have any effect under normal conditions, until available memory hits some low watermark, causing it to trigger, and therefore we are talking about a single "if" test in the page replacement path, in the normal non-trigger case, to implement it. > After further thought, I propose something much simpler: when the kernel is > hinted that access will be sequential, it should stop caching when there is > little cache space available, instead of throwing away old blocks, or be > much more hesitant to throw away old blocks. Consider that in almost all > cases where access is sequential, as reading continues, the chances of the > read being aborted increase: ie, users downloading files, directory tree > traversal, etc. Since the likelihood of reading the first > byte is very high, and the next one less high, and the next less yet, etc, > it seems to make sense to tune the caching algorithm to accommodate this. This is much harder to implement. Specifically, the sequential nature is heuristically detected, and it is this heuristic, not the madvise, which is at issue. If this heuristic did *not* get triggered, then you would lose your read-ahead. 
Therefore it's not something that can be easily turned off. Second, the VM and buffer cache are unified in FreeBSD. This means that you can not "reserve" buffer cache entries that are then not cached in VM objects, in order to cause the entries to turn over. Even if you were able to do this, through some serious kludge, you would not be able to differentiate the things that needed to be thrown out to make room for the transient pages, which leaves you in the same boat you were in before. > While discussing disks, I have a minor complaint: at least on IDE systems, > when doing something like an untar, the entire system is painfully > unresponsive, even though CPU load is low. I presume this is because when an > executable is run, it needs to sit and wait for the disk. Wouldn't it make > sense to give very high disk priority to executables? Isn't that worth the > extra seeks? Actually, the reason for this is that the data portion of tagged writes can not disconnect from the drive in ATA, and therefore the tagged command queueing on reads does not help you for writes. The ATA protocol is broken. If you switched to a SCSI disk, you would see this problem go away. This was recently discussed in detail in the context of a discussion with a MAXTOR drive engineer in a thread on the FreeBSD-FS mailing list. In the limit, though, all disk requests come about as a result of executables making demands. Another serious issue
Re: Random disk cache expiry
Tim Kientzle wrote: > Cycling through large data sets is not really that uncommon. > I do something like the following pretty regularly: > find /usr/src -type f | xargs grep function_name > > Even scanning through a large dataset once can really hurt > competing applications on the same machine by flushing > their data from the cache for no gain. I think this > is where randomized expiration might really win, by reducing the > penalty for disk-cache-friendly applications who are competing > with disk-cache-unfriendly applications. Let's set aside for a minute that this discussion does not -- can not -- apply to directory metadata, and that the limiting factor in your example is going to be the divorce of inode data from vnodes in the UFS/UFS2/EXT2 ihash code, or, if there is a low maximum vnode count, the vnodes themselves, and in any divorce case or vnode LRU, the association between the cached data and the FS object will be lost -- effectively losing the cached data, no matter what, since it can't be recovered, even if it's in memory, because nothing references it or remembers which pages it's in. Let's say, instead, that we are only talking about the data being grep'ed. There are a million ways to "win" against such applications; one obvious one is to establish a "cache time to live" for any cached data, and not evict if it has not passed its time to live (effectively establishing two zones, one for cached data, and one for transient data which would be cached, if you had the RAM for it). Doing this randomized drop approach is bogus. What you are effectively hoping for is that the pages will not be evicted before they are hit again, because your shotgun pellets will not shoot them down. 
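The "cache time to live" two-zone idea Terry sketches above can be modeled in a few lines of user-space Python (an illustrative sketch only, not FreeBSD code; the class name, quota, and clock plumbing are all invented for the example). Entries younger than the TTL cannot be evicted, so a one-pass scan that would normally flush the whole cache instead finds nothing evictable and its blocks are treated as transient:

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache whose entries cannot be evicted before min_ttl seconds.

    Once the cache is full of still-live entries, new insertions are
    refused, so a one-pass scan cannot flush an established working set.
    """
    def __init__(self, capacity, min_ttl, clock=time.monotonic):
        self.capacity = capacity
        self.min_ttl = min_ttl
        self.clock = clock
        self.entries = OrderedDict()   # key -> insertion time, LRU order

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)      # refresh LRU position
            return True
        return False

    def put(self, key):
        """Cache key; return False if it had to be treated as transient."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        if len(self.entries) < self.capacity:
            self.entries[key] = self.clock()
            return True
        oldest, born = next(iter(self.entries.items()))
        if self.clock() - born >= self.min_ttl:
            del self.entries[oldest]           # TTL expired: safe to evict
            self.entries[key] = self.clock()
            return True
        return False    # whole cache still "live": do not cache this block
```

In this model the scan's blocks simply fail to displace anything until the resident data has outlived its TTL, which is the "second zone" for transient data that would be cached if the RAM were there.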
I can prove mathematically (or you can read it out of Knuth's Sorting and Searching) that even with a "perfect" random number generator, as soon as the data set is 115% or more of the cache size, you are almost certainly guaranteed that the pages will be evicted before they are re-referenced, given sequential access. > There's an extensive literature on randomized algorithms. > Although I'm certainly no expert, I understand that such > algorithms work very well in exactly this sort of application, > since they "usually" avoid worst-case behavior under a broad > variety of inputs. The current cache is, in essence, > tuned specifically to work badly on a system where applications > are scanning through large amounts of data. No matter what > deterministic caching algorithm you use, you're choosing > to behave badly under some situation. As I stated before, this is frequently used to "cheat" benchmarks that are intended to evict cache content to defeat performance improvements from caching (benchmarks intended to "cheat" caching). The "magic" part here, though, is that the working set size ends up being only fractionally larger than the cache size. For example, the "Polygraph" benchmark intentionally "busts" the cache on intermediate servers, to make a political statement: http://www.web-polygraph.org/docs/workloads/ Network Appliance and other vendors have a random page replacement policy that scores them much better numbers on this test, simply by doing random replacement instead of LRU or some other weight based replacement which the test software is capable of detecting during the "run up" phase of the test, where the software attempts to determine the "peak rate" following cache overflow. While this gives "good numbers" for the benchmark, in a real workload, this is unlikely to be representative of real performance. 
Likewise, since their intent is to find the "knee" in the performance curve, and populate the cache beyond that, then do testing there (thus "proving" that all connections "should be" end-to-end, rather than permitting the existence of proxies), this is unlikely to be representative of a real-world cache overflow condition, since in the real world, the overflow is unlikely to be limited to "101% of cache capacity", but will instead not stop there; going 15% above planned capacity will result in significant failure of random page replacement to result in cache hits; likewise, going to 85% overage will practically guarantee an almost 100% failure rate, as cyclical access with random replacement is statistically more likely, in aggregate, to replace the pages which are there longer (the probability is iterative and additive: it's effectively a permutation). > Personally, I think there's a lot of merit to _trying_ > randomized disk cache expiry and seeing how it works in practice. I think that if this is implemented by someone who doesn't want to believe Donald Knuth (8-)), they should at least be willing to test with a data set of 3 times the available cache size, so that they can see that the algorithm does not result in any performance improvement. It should work very poorly, according to the math. Another approach to "get what you want" wo
Re: Random disk cache expiry
Brian T. Schellenberger wrote: > This to me is eminently sensible. In fact there seem like two rules that > have come up in this discussion: 1. For sequential access, you should be > very hesitant to throw away *another* process's blocks, at least once you > have used more than, say, 25% of the cache or potential cache. 2. For > sequential access, you should stop caching before you throw away your own > blocks. If it's sequential, it is, it seems to me, always a loss to throw > away your *own* process's older blocks on the same file. The question isn't "should I throw away blocks or not?". Obviously, the ideal is to keep everything in the cache, but that's not possible. The question is "what blocks should be discarded?" You've ruled out quite a few possibilities here. What blocks _should_ be discarded? Tim Kientzle To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
On Sunday 26 January 2003 11:55 pm, Sean Hamilton wrote: | - Original Message - | From: "Tim Kientzle" <[EMAIL PROTECTED]> | | | Cycling through large data sets is not really that uncommon. | | I do something like the following pretty regularly: | | find /usr/src -type f | xargs grep function_name | | | | Even scanning through a large dataset once can really hurt | | competing applications on the same machine by flushing | | their data from the cache for no gain. I think this | | is where randomized expiration might really win, by reducing the | | penalty for disk-cache-friendly applications who are competing | | with disk-cache-unfriendly applications. | | | After further thought, I propose something much simpler: when the | kernel is hinted that access will be sequential, it should stop | caching when there is little cache space available, instead of | throwing away old blocks, or be much more hesitant to throw away old | blocks. This to me is eminently sensible. In fact there seem like two rules that have come up in this discussion: 1. For sequential access, you should be very hesitant to throw away *another* process's blocks, at least once you have used more than, say, 25% of the cache or potential cache. 2. For sequential access, you should stop caching before you throw away your own blocks. If it's sequential, it is, it seems to me, always a loss to throw away your *own* process's older blocks on the same file. These algorithmic changes seem to me to be more likely to be optimal most of the time. A random approach *does* reduce the penalty for worst-case scenarios but at the cost of reducing the benefit of both "normal" and "best-case" scenarios, it seems to me, even more. | Consider that in almost all cases where access is sequential, | as reading continues, the chances of the read being aborted increase: | ie, users downloading files, directory tree traversal, etc. 
Since the | likelihood of reading the first byte is very high, and | the next one less high, and the next less yet, etc, it seems to make | sense to tune the caching algorithm to accommodate this. | | While discussing disks, I have a minor complaint: at least on IDE | systems, when doing something like an untar, the entire system is | painfully unresponsive, even though CPU load is low. I presume this | is because when an executable is run, it needs to sit and wait for | the disk. Wouldn't it make sense to give very high disk priority to | executables? Isn't that worth the extra seeks? | | sh To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
On Sun, 26 Jan 2003, Sean Hamilton wrote: > > In my case I have a webserver serving up a few dozen files of about 10 MB > each. While yes it is true that I could purchase more memory, and I could > purchase more drives and stripe them, I am more interested in the fact that > this server is constantly grinding away because it has found a weakness in > the caching algorithm. > > After further thought, I propose something much simpler: when the kernel is > hinted that access will be sequential, it should stop caching when there is > little cache space available, instead of throwing away old blocks, or be > much more hesitant to throw away old blocks. Consider that in almost all > cases where access is sequential, as reading continues, the chances of the > read being aborted increase: ie, users downloading files, directory tree > traversal, etc. Since the likelihood of reading the first > byte is very high, and the next one less high, and the next less yet, etc, > it seems to make sense to tune the caching algorithm to accommodate this. Your case seems to be a highly specific one, and would require very specific tuning. And then one will be able to find some other "unwanted behaviour" once you tune your system for a particular behaviour. -ASR http://www.crosswinds.net/~rajekar MERYL STREEP is my obstetrician! To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
- Original Message - From: "Tim Kientzle" <[EMAIL PROTECTED]> | Cycling through large data sets is not really that uncommon. | I do something like the following pretty regularly: | find /usr/src -type f | xargs grep function_name | | Even scanning through a large dataset once can really hurt | competing applications on the same machine by flushing | their data from the cache for no gain. I think this | is where randomized expiration might really win, by reducing the | penalty for disk-cache-friendly applications who are competing | with disk-cache-unfriendly applications. In my case I have a webserver serving up a few dozen files of about 10 MB each. While yes it is true that I could purchase more memory, and I could purchase more drives and stripe them, I am more interested in the fact that this server is constantly grinding away because it has found a weakness in the caching algorithm. After further thought, I propose something much simpler: when the kernel is hinted that access will be sequential, it should stop caching when there is little cache space available, instead of throwing away old blocks, or be much more hesitant to throw away old blocks. Consider that in almost all cases where access is sequential, as reading continues, the chances of the read being aborted increase: ie, users downloading files, directory tree traversal, etc. Since the likelihood of reading the first byte is very high, and the next one less high, and the next less yet, etc, it seems to make sense to tune the caching algorithm to accommodate this. While discussing disks, I have a minor complaint: at least on IDE systems, when doing something like an untar, the entire system is painfully unresponsive, even though CPU load is low. I presume this is because when an executable is run, it needs to sit and wait for the disk. Wouldn't it make sense to give very high disk priority to executables? Isn't that worth the extra seeks? 
sh To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
Sean Hamilton proposes: > Wouldn't it seem logical to have [randomized disk cache expiration] in place at all times? Terry Lambert responds: > I really dislike the idea of random expiration; I don't understand > the point, unless you are trying to get better numbers on some > benchmark. Matt Dillon concedes: > ... it's only useful when you are cycling through a [large] data set ... Cycling through large data sets is not really that uncommon. I do something like the following pretty regularly: find /usr/src -type f | xargs grep function_name Even scanning through a large dataset once can really hurt competing applications on the same machine by flushing their data from the cache for no gain. I think this is where randomized expiration might really win, by reducing the penalty for disk-cache-friendly applications who are competing with disk-cache-unfriendly applications. There's an extensive literature on randomized algorithms. Although I'm certainly no expert, I understand that such algorithms work very well in exactly this sort of application, since they "usually" avoid worst-case behavior under a broad variety of inputs. The current cache is, in essence, tuned specifically to work badly on a system where applications are scanning through large amounts of data. No matter what deterministic caching algorithm you use, you're choosing to behave badly under some situation. Personally, I think there's a lot of merit to _trying_ randomized disk cache expiry and seeing how it works in practice. (I would also observe here that 5.0 now has a fast, high-quality source of randomness that seems ideal for exactly such applications.) I don't believe that it would _prevent_ applications from using optimizations such as those that Terry suggests, while possibly providing reasonable performance under a broader range of scenarios than are currently supported. Sounds like a good idea to me. 
Tim Kientzle To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
Matthew Dillon wrote: > :I really dislike the idea of random expiration; I don't understand > :the point, unless you are trying to get better numbers on some > :.. > >Well, the basic scenario is something like this: Let's say you have >512MB of ram and you are reading a 1GB file sequentially, over and over >again. The cache scenario that gives you the greatest performance is >basically to keep 512MB worth of the file in memory and throw away the >rest as you read it. You don't need random expiration for this; in fact, random expiration will give you 50% worse performance than you expect. If you are doing this, the proper thing is to map the file window you want to keep separately from the rest of the file, or, if it's small enough you can mmap the whole thing, then to madvise the part of the file you want to not cache as "sequential", and drop the cache. That this is not well supported is a bug in FreeBSD (IMO). >This is effectively what random disk cache expiration gives you (assuming >you expire whole tracks worth of data, not just random pages). i.e. it's >only useful when you are cycling through a data set that is larger than >main memory over and over again. And not very, there. Because you destroy locality of reference -- or, as in this case, you assume that there is no locality in the first place. I think it works at 50% of the efficiency of the madvise that I suggest, above, which, at least, gets hits on most of the file, without the risk of page replacement of the cached area of the file that you intend to read subsequently (25% hit rate in the hittable area, assuming this, since page replacement is random, if the replacement range is 50% of the size of the data set... i.e.: numbers in real life will vary downward, the larger the data set). >Most people deal with this issue simply by beefing up their disk >subsystem. After all, you don't have to stripe very many disks (only >2 or 3) to get 100MB/sec+ in throughput from the disk subsystem. 
I think most people don't have this problem, because this is a very degenerate case, which requires either a serendipitously incredibly bad design, or intentional screwing of caching, in order to do something like benchmark the disk I/O subsystem separately from the cache (for legitimate cases) or to "prove" that caching is not useful because you have an axe to grind (e.g. the proxy cache "benchmark" suite that does this because the author is a mathematician that has developed a pessimizing heuristic because he is philosophically opposed to proxy caching, due to his own narrow interpretation of "end to end guarantees"). In general, this is not a useful case to try to optimize, and the people who try to optimize this case are doing it to cheat on a benchmark, not because there's any real-world application that requires it. 8-(. -- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
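The mmap-the-window-then-madvise approach Terry describes can be sketched in user space. This is an illustrative approximation in Python (whose mmap module exposes madvise on platforms that support it), not the C the thread has in mind; the scratch file and sizes are invented for the example, and the hasattr guards hedge against platforms lacking the constants:

```python
import mmap
import os
import tempfile

# Create a scratch file to stand in for the big data file.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * (1 << 20))           # 1 MiB of stand-in data

with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
    # Advise the VM that access will be sequential, so read-ahead is
    # aggressive and pages behind the read point are cheap to reclaim.
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        m.madvise(mmap.MADV_SEQUENTIAL)

    data = m[:]                          # the sequential pass

    # After the pass, advise that these pages are no longer needed,
    # so they can be dropped instead of displacing other cached data.
    if hasattr(mmap, "MADV_DONTNEED"):
        m.madvise(mmap.MADV_DONTNEED)

os.close(fd)
os.unlink(path)
```

In the 512MB-cache/1GB-file scenario, the idea is to mmap and keep the prefix that fits while advising the remainder as sequential/don't-need, so the kept window is never at risk of replacement by the transient tail.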
Re: Random disk cache expiry
:I really dislike the idea of random expiration; I don't understand :the point, unless you are trying to get better numbers on some :.. Well, the basic scenario is something like this: Let's say you have 512MB of ram and you are reading a 1GB file sequentially, over and over again. The cache scenario that gives you the greatest performance is basically to keep 512MB worth of the file in memory and throw away the rest as you read it. This is effectively what random disk cache expiration gives you (assuming you expire whole tracks worth of data, not just random pages). i.e. it's only useful when you are cycling through a data set that is larger than main memory over and over again. Most people deal with this issue simply by beefing up their disk subsystem. After all, you don't have to stripe very many disks (only 2 or 3) to get 100MB/sec+ in throughput from the disk subsystem. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
Matthew Dillon wrote: > Hi Sean. I've wanted to have a random-disk-cache-expiration feature > for a long time. We do not have one now. We do have mechanisms in > place to reduce the impact of sequential cycling a large dataset so > it does not totally destroy unrelated cached data. I really dislike the idea of random expiration; I don't understand the point, unless you are trying to get better numbers on some benchmark that intentionally determines the cache size, and then chooses to use a dataset larger than that, to factor out caching and/or make bogus representations about the value of caching in the first place. There are a number of well-known proxy cache benchmarks that do this, because the person who wrote them wanted to advocate against the use of proxy caches entirely. A number of proxy cache vendors use random page replacement to get around the benchmark tuning itself to get the worst possible performance when using a proxy cache; with random replacement, you maintain 25% of your working set, instead of 0% with intentional overflow from the benchmark (or 100%, if you are not doing bogus things to try and advocate against cache vendors). > Due to the way our page queues work, it's not an easy problem to solve. It's actually pretty trivial, if you are willing to do it on a per-vnode basis, rather than a per-open instance basis. This is called a "working set quota" on VMS, and the code to do this has been around there for over 20 years now. The way you do this is to keep a count of pages associated with a given vnode, and if you are over the quota, and want more pages, to steal from the end of the per-vnode LRU, rather than going to a system wide LRU, and competing with other files. This assumes that you are not ignoring the value of locality of reference, and hacking the code to make some bogus benchmark run faster, and are instead replacing pages for performance reasons -- real performance reasons, not fake ones. 
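The per-vnode working set quota described above reduces to a few lines in a toy user-space model (illustrative Python, not VMS or FreeBSD code; the class name, quota value, and file names are invented for the example). Each "vnode" keeps its own LRU, and an over-quota file steals from its own tail instead of from the system-wide pool:

```python
from collections import OrderedDict, defaultdict

class VnodeQuotaCache:
    """Page cache with a per-file ("per-vnode") working set quota.

    When a file exceeds its quota, the new page replaces one of that
    file's *own* pages (its per-file LRU tail), so one huge file cannot
    thrash every other file's pages out of the cache.
    """
    def __init__(self, quota):
        self.quota = quota
        self.pages = defaultdict(OrderedDict)  # vnode -> {page: None}, LRU order

    def access(self, vnode, page):
        lru = self.pages[vnode]
        if page in lru:
            lru.move_to_end(page)              # hit: refresh LRU position
            return True
        if len(lru) >= self.quota:
            lru.popitem(last=False)            # steal from this file's own LRU
        lru[page] = None
        return False                           # miss

cache = VnodeQuotaCache(quota=4)
# One process scans a huge file while another keeps touching a small one.
for page in range(100):
    cache.access("bigfile", page)
    cache.access("smallfile", page % 2)
```

After the run, the big file holds exactly its quota of four pages and the small file's two pages still hit, which is the behavior the thread wants: the large sequential scan replaces only its own older pages.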
-- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Random disk cache expiry
:Greetings, : :I have a situation where I am reading large quantities of data from disk :sequentially. The problem is that as the data is read, the oldest cached :blocks are thrown away in favor of new ones. When I start re-reading data :from the beginning, it has to read the entire file from disk again. Is there :some sort of sysctl which could be changed to induce a more random expiry of :cached disk blocks? Wouldn't it seem logical to have something like this in :place at all times? : :thanks, : :sh Hi Sean. I've wanted to have a random-disk-cache-expiration feature for a long time. We do not have one now. We do have mechanisms in place to reduce the impact of sequential cycling a large dataset so it does not totally destroy unrelated cached data. Due to the way our page queues work, it's not an easy problem to solve. You might be able to simulate more proactive cache control by using O_DIRECT reads for some of the data, and normal reads for the rest of the data (see the 'fcntl' manual page). But it might be better simply to purchase more memory, purchase a faster hard drive, or stripe two hard drives together. HDs these days can do 25-50MB/s each so striping two together should yield 50-100MB/s of sequential throughput. See the 'STRIPING DISKS' section in 'man tuning'. -Matt Matthew Dillon <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
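Matt's suggestion of O_DIRECT reads for the transient data has a softer, portable cousin in posix_fadvise(2), which gives the kernel exactly the per-descriptor hints this thread keeps asking for. A hedged sketch (Python on a POSIX system; the scratch file and sizes are invented, and the hasattr guard covers platforms without the call):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"y" * 65536)
    os.lseek(fd, 0, os.SEEK_SET)

    if hasattr(os, "posix_fadvise"):
        # Hint: this descriptor will be read sequentially, so read-ahead
        # can be aggressive and consumed pages are cheap to reclaim.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

    data = os.read(fd, 65536)        # the sequential pass

    if hasattr(os, "posix_fadvise"):
        # Hint: done with these pages; drop them from the cache rather
        # than letting them displace other files' cached data.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
    os.unlink(path)
```

Unlike O_DIRECT, these are advisory hints with no alignment requirements, so a web server cycling through a few dozen 10MB files could mark each completed pass DONTNEED and stop evicting everyone else's working set.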