On Wednesday, Jul 16, 2003, at 15:00 America/Guayaquil, Jason Foster wrote:
You'd get the same thing using a non-linear function instead of variable weightings.
<snip/>
If we're thinking non-linear, then how about considering fuzzy parameters?
Hmmm. This would place even more configuration hassles on the installer/sysadm since all those fuzzy parameters would have to have default values.
This would be an opportunity to include some of Berin's thoughts on adaptive rules and add in even more math :)
I think that Berin's rules are already taken care of. Peter is perfectly right in showing how a non-linear yet simply polynomial function can account for different cost behaviors.
Basically, we are moving from an explicit set of rules to an implicit cost description in mathematical terms. This thread shows that you need to be pretty well-versed in math to understand why the implicit one is more general. It's not obvious at all.
It's also not obvious how to write a cost function for a particular environment, but I believe that 95% of the cases can be accounted for with the simple
cost(request) = time
which is the implicit cost function used in *all* cache systems I've seen in my life. Still, the design I outlined, being fully adaptive, can "move" with the cost changes without requiring human intervention and it can also show the efficiency numbers to help administrators tune it further.
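To make this concrete, here is a minimal sketch (in Java, with invented names, not actual Cocoon interfaces) of what a pluggable cost function could look like, with the time-based one as the default:

// Hypothetical sketch, not actual Cocoon code: the cost function as a pluggable abstraction.
interface CostFunction {
    /** Cost of (re)generating the resource identified by 'key', in arbitrary units. */
    double cost(String key, long generationTimeMillis);
}

// The implicit "cost(request) = time" function: the cost is just the measured generation time.
class TimeCostFunction implements CostFunction {
    public double cost(String key, long generationTimeMillis) {
        return generationTimeMillis;
    }
}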
Is it just me, or are we basically trying to solve an optimization problem?
Yep, we are.
Under this formulation the goal is to find the optimal set of cache entries that maximizes the total cost savings where each element has an associated individual cost savings.
No, that's only one side of it. We must also understand what it's better not to even try to cache (thus also avoiding the cost of statistical analysis, SAX serialization, disk access, and memory consumption).
The trick is that the cost savings of an entry varies and that there is a fixed cache size.
And that we don't know whether going through the cache ergodicity validation process is faster than not even trying, and recreating the resource right away without caching it.
There are a lot of variables on the table.
Unfortunately even with constant cost savings this is a variant of the Knapsack problem, which means it's NP-complete. Stefano's cache would then be a packing heuristic :)
Or maybe I'm completely missing something...
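Just to spell out the formulation you are sketching (my notation, and a simplification, since the savings and sizes actually vary over time):

maximize     sum_i  s_i * x_i          (s_i = expected cost savings of keeping entry i)
subject to   sum_i  w_i * x_i <= C     (w_i = size of entry i, C = cache capacity)
             x_i in {0, 1}             (keep entry i in the cache, or not)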
The *real* minimization of the cost function is probably an NP-complete problem, I agree (I didn't prove it, but I have the same perception), but it doesn't make sense to attempt the "real" solution.
Please note that with monotonic cost functions, the local minimum is also the global minimum. This means: just favor those samples which save cost and you'll reach the optimal point incrementally. There's no reason to brute-force; just start the system and let it improve as it goes (the same model used by the HotSpot server JVM, btw).
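A rough sketch in Java of what I mean by "let it improve as it goes" (the names and the smoothing constant are made up for illustration; this is not the actual implementation): keep a running estimate of the cost saved (or wasted) by caching each resource, nudge it with every new sample, and let the sign of that estimate decide whether it is worth trying to cache that resource at all.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the incremental adaptation, not actual Cocoon code.
class AdaptiveCacheAdvisor {

    private static final double ALPHA = 0.1;  // smoothing factor for the running estimate
    private final Map<String, Double> estimatedSavings = new HashMap<String, Double>();

    /**
     * Record one sample: how much cost a cache hit saved (positive) or how much
     * the caching overhead wasted (negative) for this resource.
     */
    void sample(String key, double observedSavings) {
        Double current = estimatedSavings.get(key);
        double updated = (current == null)
                ? observedSavings
                : (1 - ALPHA) * current + ALPHA * observedSavings;  // exponential moving average
        estimatedSavings.put(key, updated);
    }

    /** Cache the resource only while the running estimate says it pays off. */
    boolean worthCaching(String key) {
        Double current = estimatedSavings.get(key);
        return current == null || current > 0;  // unknown resources get the benefit of the doubt
    }
}

Each sample moves the estimate a little toward whatever the environment is currently doing, which is exactly the incremental hill-climbing described above.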
It must be said that since this cache works by sampling, it is subject to Shannon's sampling theorem, which says that you have to sample at at least twice the frequency of the changes to be able to reconstruct the original signal.
In our case, this means that if the cost of generating a particular resource varies over time faster than the request frequency for that resource, the cache will simply not converge for that particular resource, but will basically keep performing a random sampling process on it.
But this should not worry us, since low request frequency means low impact on performance. Therefore, for those resources where the cache doesn't really converge, we lose some optimality, but it is not a big deal. [Unless the cost function varies a lot within a few seconds, but that would indicate a defective hardware setup!]
NOTE: I'm not stating that the cache doesn't converge for resources with low access frequency! Not at all! The cache doesn't converge if the time between two subsequent requests is more than half of the time the cost function takes to change significantly.
An example of this would be generating, every two days at random times, a PDF report from a database whose cost function has a daily periodicity.
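In numbers (purely illustrative, to apply the criterion to that example):

// Illustrative only: the sampling criterion above, applied to the PDF report example.
class ConvergenceCheck {
    /** Converges when requests arrive at least twice per "significant change" period. */
    static boolean converges(double hoursBetweenRequests, double hoursPerSignificantChange) {
        return hoursBetweenRequests <= hoursPerSignificantChange / 2;
    }

    public static void main(String[] args) {
        // PDF report requested every ~48h against a ~24h cost periodicity:
        System.out.println(converges(48, 24));  // false -> the cache cannot converge for it
        // A resource requested more than twice a day against the same daily periodicity:
        System.out.println(converges(8, 24));   // true  -> converges
    }
}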
Normally, the cost function periodicity of IT systems is a daily function superimposed on a weekly function, superimposed on a yearly function. That means that if you do a Fourier transform of the cost function samples over time, you get a function with three bell-shaped maxima: one peaked around the day, another around the week, and another around the year. (Months are normally not significant.)
In such a scenario, all resources that are requested more than twice a day will exhibit convergence.
Note that the cache doesn't need to do Fourier analysis of the samples; this is an implicit property of the cache design.
-- Stefano.