NOTE: this is a refactoring of an email I wrote 2 1/2 years ago. The original can be found here:
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=98205774411049&w=2
I re-edited the content to fit the current state of affairs and I'm resending it, hoping to trigger some discussion that didn't happen in the past.
1st post! (Sorry, I had to say that).
---------- o ----------
WARNING: this RT is long and very dense, so I suggest you turn on your printer. Also, keep in mind this comes from hard-core academic research and you might find it 'too theoretical' for your practical taste. If that's the case, get over it, since it's only when you fly high in the abstraction that you perceive the borders of your problem space. You'll find some math formulas, but I've avoided all deep statistical proofs that don't really belong here.
No kidding. 18pp. from my printer.
As usual, all feedback will be welcome.
Kool. Let me restate things in different terms then.
The basic underlying issue here is that we want a smart and adaptive cache. I think this is an admirable goal, and I wanted to borrow some of the ideas to create a smart and adaptive pool controller for the MPool package.
However, that would be overthinking it a bit for that purpose... What we need to do is balance "killer" with "good enough". What we have is better than what we used to have, so we want to improve even more.
There are three major issues at stake here:
1) We don't have a target metric for "close enough". We need a goal that is measurable and reasonable. Without that goal we will optimize away essential features because we need the extra ms.
2) We need something that adapts to real use.
3) We need to be able to identify the primary resource that should be protected. For some people memory consumption is key; for others, raw time is key.
Lastly, the system you are describing can be stated in terms that are familiar to Artificial Intelligence programmers: we need a cache controller that is "intelligent". By that I mean it can make complex decisions based on a set of rules as it adapts to the environment it finds itself in. The cache controller will be referred to as an "agent" for the rest of this discussion.
The proposed agent would use statistical analysis of the past N requests to identify the best course of action. Based on the information available to it, the agent would apply a set of rules based on weighted numbers. In fact, in AI terms the numbers are continually re-evaluated and re-weighted as the environment changes. This approach enables the agent to be more efficient as it only needs to store one weighting value for each resource instead of N values. However, if the cache needs to be proactive in its evaluation, it needs to have a unique weighting value for each timeframe that it must make decisions on.
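To make the "one weighting value per resource" idea concrete: one common way to fold the past N samples into a single number is an exponentially decayed average. Here is a minimal sketch -- the class and field names are my own invention for illustration, not anything in Cocoon or MPool:

```java
/**
 * Hypothetical sketch: keep one exponentially-decayed weight per resource
 * instead of storing the last N raw samples.
 */
public class ResourceWeight {
    private final double alpha;   // smoothing factor, 0 < alpha <= 1
    private double weight;        // current estimate
    private boolean seeded;       // true once the first sample has arrived

    public ResourceWeight(double alpha) {
        this.alpha = alpha;
    }

    /** Fold a new sample (e.g. production time in ms) into the weight. */
    public void update(double sample) {
        if (!seeded) {
            weight = sample;      // first sample seeds the estimate
            seeded = true;
        } else {
            // Recent samples count more; old history decays away.
            weight = alpha * sample + (1.0 - alpha) * weight;
        }
    }

    public double get() {
        return weight;
    }
}
```

The nice property is exactly the one described above: the value is continually re-weighted as the environment changes, and the storage cost is constant per resource.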
We have a number of imprecise measurements that we need to account for:
* Current memory utilization
* Number of concurrently running threads
* Speed of production
* Speed of serialization
* Speed of evaluation
* Last requested time
* Size of serialized cache repository
All of these would be coerced into a binary decision: to cache or not to cache.
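One simple way to coerce several imprecise measurements into a single yes/no bit is a weighted score compared against a threshold. The weights and names below are invented for illustration -- a real agent would re-weight them continuously as described above:

```java
/** Hypothetical sketch: fold fuzzy measurements into a cache/don't-cache bit. */
public class CacheDecision {
    // Invented weights: negative values argue against caching, positive for it.
    static final double W_MEMORY        = -2.0; // high memory use argues against
    static final double W_PRODUCTION    =  1.5; // expensive production argues for
    static final double W_SERIALIZATION = -1.0; // expensive cache I/O argues against

    /** All inputs normalized to 0..1; returns true if caching looks worthwhile. */
    public static boolean shouldCache(double memoryUse,
                                      double productionCost,
                                      double serializationCost) {
        double score = W_MEMORY * memoryUse
                     + W_PRODUCTION * productionCost
                     + W_SERIALIZATION * serializationCost;
        return score > 0.0;
    }
}
```

A cheap-to-produce resource on a memory-starved system scores negative (don't cache); an expensive resource on an idle system scores positive (cache).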
We would have to apply a set of rules that make sense in this instance:
* If resource is already cached, use cached resource.
* If current system load is too great, extend ergodic period.
* If production time less than serialization time, never cache.
* If last requested time is older than ergodic period, purge entry.
* If memory utilization too high, purge oldest entries.
Etc. etc.
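A rule set like the one above could be expressed as an ordered chain of simple predicates, each mapping the current measurements to an action, where the first applicable rule wins. A rough sketch -- the interfaces and names here are invented, not an existing Cocoon API:

```java
/** Hypothetical sketch of an ordered rule chain for the cache agent. */
public class RuleChain {
    enum Action { USE_CACHED, NEVER_CACHE, PURGE, CACHE }

    interface Rule {
        /** Return an action, or null if the rule does not apply. */
        Action evaluate(Measurements m);
    }

    /** Invented measurement holder; a real one would be fed by the agent. */
    static class Measurements {
        boolean alreadyCached;
        long productionTimeMs;
        long serializationTimeMs;
        double memoryUtilization; // 0..1
    }

    /** Walk the chain in order; the first applicable rule wins. */
    static Action decide(Rule[] rules, Measurements m, Action fallback) {
        for (Rule rule : rules) {
            Action a = rule.evaluate(m);
            if (a != null) return a;
        }
        return fallback;
    }

    // Three of the rules from the list above, in priority order.
    static final Rule[] RULES = {
        m -> m.alreadyCached ? Action.USE_CACHED : null,
        m -> m.productionTimeMs < m.serializationTimeMs ? Action.NEVER_CACHE : null,
        m -> m.memoryUtilization > 0.9 ? Action.PURGE : null,
    };
}
```

Tuning for a particular project then just means swapping in a different `RULES` array (or loading it from configuration), which is exactly the kind of per-deployment flexibility argued for below.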
In fact, the rules for the cache should be tunable and optimizable for the particular project. Perhaps deployment concerns require that your webapp use no more than 20 MB--that limits the amount of caching that can take place.
Using a rule-based approach will achieve the goals of the adaptive cache while giving us more insight into how the decision process is made.
As to the good-enough vs. perfect issue, caching partial pipelines (i.e. the results of the generator, of each transformer, and the final result) will prove to be an inadequate way to improve system performance. Here are the reasons why:
* We do not have accurate enough tools to determine the cost of any particular component in the pipeline. The only true way to determine the cost of a transformer is to measure the cost with it included vs. the cost with it omitted. This is not desirable for the end result, so the extra production costs to determine component cost are not worth the effort. To make matters worse, certain components (like the SQLTransformer) will behave differently when used in different pipelines.
* The resource requirements for storing the results of partial pipelines will outweigh the benefit of using them. Whether it is memory or disk space, we have a finite amount no matter how generous. Most production sites will vary little over their lives. The ergodic periods and other criteria will provide all the variation that is required.
* The difference in production time gained by starting from a later step is fairly minimal, since the true cost of production is in the final step: the Serializer. Communication-heavy processes such as database communication and serialization throttle the system more than any other production cost. The final serialization is usually the most costly because the client is communicating over a slower resource--a T1 line is slower than a 10base-T connection, which is in turn slower than a 100base-T connection.
For this reason, providing a generic cache that works on whole resources is a much more efficient use of time. For example, it would make my site run much more efficiently if I could use a cache for my database-bound objects instead of calling the database to re-read the same information over and over. Allowing the cache to have hooks for a persistence mechanism will let it handle write-back style caching for user objects. A write-back cache asynchronously writes the information to the persistence mechanism while minimally affecting the critical computation path.
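A write-back hook of this sort might look like the sketch below: writes update the in-memory map immediately, and a background thread flushes dirty entries to the persistence mechanism off the critical path. The `Store` interface and all names here are invented for illustration, not a proposed API:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of a write-back cache with a persistence hook. */
public class WriteBackCache<K, V> {
    /** Invented persistence hook; real code would write to a database, etc. */
    public interface Store<K, V> {
        void persist(K key, V value);
    }

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final BlockingQueue<K> dirty = new LinkedBlockingQueue<>();
    private final Store<K, V> store;

    public WriteBackCache(Store<K, V> store) {
        this.store = store;
        Thread flusher = new Thread(this::flushLoop, "cache-flusher");
        flusher.setDaemon(true);
        flusher.start();
    }

    /** Fast path: update memory only; persistence happens asynchronously. */
    public void put(K key, V value) {
        cache.put(key, value);
        dirty.add(key);
    }

    public V get(K key) {
        return cache.get(key);
    }

    /** Background loop: drain dirty keys to the persistence mechanism. */
    private void flushLoop() {
        try {
            while (true) {
                K key = dirty.take();   // blocks until a write is pending
                store.persist(key, cache.get(key));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The point is that `put` returns as soon as the map is updated; the database write happens on the flusher thread, so the critical computation path pays only the cost of a map insert and a queue add.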
Just my observations.
--
"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." - Benjamin Franklin