On Monday, Jul 14, 2003, at 11:29 America/Guayaquil, Berin Loritsch wrote:

The basic underlying issue here is that we want a smart and adaptive cache.

Yep. Basically, this comes from the fact that I would not know whether caching a particular resource fragment makes things faster or not.


[snipping the description of the thinking that led to the design expressed in my post]


We have a number of imprecise measurements that we need to account for:


* Current memory utilization
* Number of concurrently running threads
* Speed of production
* Speed of serialization
* Speed of evaluation
* Last requested time
* Size of serialized cache repository

All of these would be coerced into a binary decision: to cache or not

yep. my design allows the "metric" of your processing "cost" to be pluggable. of course, we might provide basic ones that people will generally use; in other cases, you might want to write your own.
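
to make this concrete, here is a minimal sketch of what a pluggable cost metric could look like. all the names here (CostFunction, ResourceEvent, TimeCostFunction) are made up for illustration only; this is not an existing Cocoon API:

// A minimal sketch of the pluggable "cost metric" idea. All names here
// are hypothetical and only illustrate the concept.

/** Something that happened to a resource: production, serialization, lookup... */
class ResourceEvent {
    final String resourceUri;
    final long elapsedNanos;   // how long the event took
    final long bytesProduced;  // how big the result was

    ResourceEvent(String resourceUri, long elapsedNanos, long bytesProduced) {
        this.resourceUri = resourceUri;
        this.elapsedNanos = elapsedNanos;
        this.bytesProduced = bytesProduced;
    }
}

/** The pluggable metric: turns an event into a scalar "cost". */
interface CostFunction {
    double cost(ResourceEvent event);
}

/** The basic metric most people would use: cost == elapsed time. */
class TimeCostFunction implements CostFunction {
    public double cost(ResourceEvent event) {
        return event.elapsedNanos;
    }
}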


We would have to apply a set of rules that make sense in this instance:

* If resource is already cached, use cached resource.
* If current system load is too great, extend ergodic period.
* If production time less than serialization time, never cache.
* If last requested time is older than ergodic period, purge entry.
* If memory utilization too high, purge oldest entries.

here is where I start to disagree. The above rules mix concerns like crazy and some of them are plain wrong.


#1 -> you are assuming that going through the process of evaluating the cache is always optimal. it is not always the case
#2 -> the ergodic period is a property of the resource, not of the cache; you can't safely extend it
#3 -> you are assuming that time locality is infinite: measuring now means that the same numbers apply forever. This is not the case.
#4 -> this is correct
#5 -> you are assuming that time-locality works for cache optimization. this is not always the case. the correct rule is "if you need memory, discard the entry that has the smallest cost difference between creation cost and caching cost"


if you start to think in terms of "costs" and not "time", you'll start seeing how many of your cache assumptions are context specific and not general at all.
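
as an illustration of the corrected rule in #5, here is a rough sketch of cost-based eviction, with invented names: when memory is needed, you evict the entry whose creation cost exceeds its caching cost by the smallest margin, because that's the entry you gain the least from keeping.

import java.util.Comparator;
import java.util.List;

// Sketch of the eviction rule from point #5: evict the entry with the
// smallest difference between what it costs to recreate the resource and
// what it costs to keep/serve it from the cache. Hypothetical names.
class CacheEntry {
    String key;
    double creationCost;  // sampled cost of producing the resource from scratch
    double cachingCost;   // sampled cost of storing + retrieving it from the cache

    /** How much we actually save by keeping this entry cached. */
    double savings() {
        return creationCost - cachingCost;
    }
}

class CostBasedEvictor {
    /** Pick the entry whose removal hurts the least (smallest savings). */
    CacheEntry selectVictim(List<CacheEntry> entries) {
        return entries.stream()
                .min(Comparator.comparingDouble(CacheEntry::savings))
                .orElse(null);
    }
}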

Etc. etc.

In fact the rules for the cache should be able to be tuned and optimized
for the particular project. Perhaps deployment concerns require that
no more than 20 MB be used for your webapp--that limits the amount of
caching that can take place.

this is part of the cost function that is pluggable.
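
for example, a deployment constraint like the 20 MB limit above could simply become a term of the pluggable cost function instead of a hardwired rule. a rough sketch, with made-up names and numbers:

// Rough sketch: a deployment constraint ("no more than 20 MB for the cache")
// expressed as a term of the cost function instead of a hardwired rule.
// Names and numbers are illustrative only.
class MemoryBoundedCost {
    static final long BUDGET_BYTES = 20L * 1024 * 1024; // 20 MB

    /**
     * Cost of keeping an entry cached: its retrieval time, plus a penalty
     * that makes caching never worth it once the budget would be exceeded.
     */
    double cachingCost(double retrievalNanos, long entryBytes, long cacheBytesUsed) {
        boolean overBudget = cacheBytesUsed + entryBytes > BUDGET_BYTES;
        return overBudget ? Double.MAX_VALUE : retrievalNanos;
    }
}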


Using a rule-based approach will achieve the desires of the adaptive cache,
with more understanding of how the decision process is made.

dead wrong. read my post again: a purely mathematical model is much more general and doesn't require silly hardwired AI concepts at all.


it's like spamassassin vs. bogofilter for spam filtering: the first needs a community to understand the new spammer tricks and react accordingly every time. the second does it simply through mathematical analysis.

you can be skeptical but it works like magic! (it's better at spam analysis than *I* am!)

I plan to apply the same pure-math approach to caching as well. This will require a little bit of faith from people, but I will have numbers to show.

As to the good enough vs. perfect issue, caching partial pipelines (i.e.
the results of a generator, each transformer, and the final result) will
prove to be an inadequate way to improve system performance.

Pfff. Stating this as a general truth is simply silly.


Maybe it's hard to achieve, true, but saying that it doesn't improve performance is definitely wrong. You are basically stating overall, general cost properties of resources without even knowing what resources you are talking about.

Here are the reasons why:

* We do not have accurate enough tools to determine the cost of any particular
component in the pipeline.

And you don't need to do it. You just sample the cost of the entire resource. Then let the system try different assemblies (evaluating the final resource creation costs), and at that point you have enough information on what the best strategy is (but you keep updating the information).
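
a sketch of what "sample the cost of the entire resource and let the system try different assemblies" could look like, with purely illustrative names (a simplification, not the design itself):

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: keep a decayed running estimate of the cost of
// serving each resource under each assembly/strategy, keep sampling, and
// prefer whatever is currently cheapest.
class CostSampler {
    private static final double DECAY = 0.9; // old samples fade: time locality is not infinite
    private final Map<String, Double> estimatedCost = new HashMap<>();

    /** Record an observed cost for serving the resource with a given strategy. */
    void sample(String resource, String strategy, double observedCost) {
        String key = resource + "|" + strategy;
        Double old = estimatedCost.get(key);
        double updated = (old == null) ? observedCost : DECAY * old + (1 - DECAY) * observedCost;
        estimatedCost.put(key, updated);
    }

    /** Current estimate; untried strategies default to 0 so they get sampled at least once. */
    double estimate(String resource, String strategy) {
        return estimatedCost.getOrDefault(resource + "|" + strategy, 0.0);
    }

    /** The currently cheaper of two strategies for this resource. */
    String cheaper(String resource, String a, String b) {
        return estimate(resource, a) <= estimate(resource, b) ? a : b;
    }
}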


God, it's all described in my design. why don't people read what I write if I put math in it? :-((((

The only true way to determine the cost of a
transformer is to measure the cost with it included vs. the cost with it
omitted. This is not desirable for the end result, so the extra production
costs to determine component cost are not worth the effort. To make matters
worse, certain components (like the SQLTransformer) will behave
differently when used in different pipelines.

see above



* The resource requirements for storing the results of partial pipelines will
outweigh the benefit for using them.

Again, how the hell do you know? you are assuming global properties. Maybe it's the case for 80% of the resources, but it's exactly that remaining 20% that I have no idea how to optimize and where I want the cache to adapt.


If you do the heuristics yourself, this is not adaptive at all! your "good enough" is exactly what we have today and, believe me, it's not good enough at all.

Whether it is memory or disk space,
we have a finite amount no matter how generous. Most production sites will
vary little over their lives. The ergodic periods and other criteria will
provide all the variation that is required.

Again, time vs. cost. You can write your cost function with anything you want. Even the temperature of the CPU, if that matters to you, and the system will adapt to minimize it, leaving the other variables (time, for example) free to move.


Caching is a function minimization problem. I *KNOW* I don't know the function, so I design the system to be general enough to minimize any function.

You (like all the other caches I've seen in my life so far) try to estimate the properties of a general set of functions and create heuristics to minimize them.

Needless to say, being more general, my approach can converge to the same solutions, but without requiring knowledge of the function to minimize. (again, spamassassin vs. bogofilter)
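
stripped to the bone, the per-resource decision is just a comparison of accumulated costs, not a set of rules. roughly something like this (a simplified sketch under my own assumptions, not the full model in the design doc):

// Bare-bones sketch of "caching as function minimization": over a sampling
// window, compare the total observed cost of serving a resource directly
// against the total cost of serving it through the cache (lookups, validity
// checks and the occasional miss). Cache only while it actually wins.
class CacheDecision {
    double directCostTotal;   // accumulated cost of producing the resource every time
    double cachedCostTotal;   // accumulated cost of lookups, validity checks and misses

    boolean worthCaching() {
        // The pluggable cost function decides what "cost" means; the decision is the same.
        return cachedCostTotal < directCostTotal;
    }
}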


* The difference in production time by starting from a later step is fairly
minimal since the true cost of production is in the final step: the
Serializer.

again, heuristics.


Communication-heavy processes such as database communication
and serialization throttle the system more than any other production cost.

but maybe not. let the system adapt, don't make the guess for yourself.


The final serialization is usually the most costly due to the fact that the
client is communicating over a slower resource--a T1 line is slower than
a 10base-T connection which is in turn slower than a 100base-T connection.

assumptions over assumptions. maybe correct, I don't question that, but why make them when you have a system that discovers them automatically and, *MORE IMPORTANT*, adapts to those changes without your intervention in order to converge to a better solution?


it sounds like magic, I know, but read the math, it's all there.


For this reason, providing a generic cache that works on whole resources is
a much more efficient use of time.

you stack assumptions on top of assumptions and come up with a general answer? c'mon.


For example, it would make my site run
much more efficiently if I could use a cache for my database bound objects
instead of creating a call to the database to re-read the same information
over and over.

hey, careful, an adaptive cache will never be smarter than you. Example: if you make a database connection to find out whether the database data has changed, the cost difference between caching and not caching will very likely be small, if not negative.


If you use an event-driven, control-inverted approach and the database calls you to signal the changes, the cache will be much more efficient.
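
a sketch of what that event-driven, control-inverted setup could look like, with illustrative names (this is not an existing Cocoon or database API): instead of the cache polling the database on every request, the database layer pushes an invalidation event and the cached entry is simply dropped.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of event-driven (control-inverted) invalidation: the data source
// notifies the cache when something changes, so no per-request validity
// query is needed. All names are illustrative.
interface ChangeListener {
    void changed(String key);
}

class EventInvalidatedCache implements ChangeListener {
    private final Map<String, Object> entries = new ConcurrentHashMap<>();

    void put(String key, Object value) {
        entries.put(key, value);
    }

    Object get(String key) {
        return entries.get(key); // no database round-trip to check freshness
    }

    /** Called by the database layer when the underlying data changes. */
    public void changed(String key) {
        entries.remove(key);
    }
}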

my cache design optimizes what's there, it doesn't write code for you.

but saying that pipeline fragment caching is not generally useful is pure bullshit.

Allowing the cache to have hooks for a persistence mechanism
will allow it to handle write-back style caching for user objects. Write-back
caches will asynchronously write the information to the persistence mechanism
while the critical computation path is minimally affected.

As I said, this is a completely different concern.


The cache system is given:

 1) a cost function
 2) a pipeline setup
 3) a set of components

this creates an overall cost function for the 'operation' of this system.

by running the system, using requests as cost samples and cost analysis as a way to drive the sampling, the cache system converges to a time-local minimization of the overall cost of running the system.
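
in rough pseudo-Java (all names invented for illustration), the loop looks like this: every request is both served and used as a cost sample, and the per-resource decision keeps being re-evaluated as samples come in.

// Rough illustration of "requests as cost sampling": each request both serves
// the resource and feeds the cost estimates that drive the next decision.
// Everything here is a hypothetical sketch, not the actual cache code.
class AdaptiveCacheLoop {
    private final CostSamplerLike sampler;

    AdaptiveCacheLoop(CostSamplerLike sampler) {
        this.sampler = sampler;
    }

    Object handle(String resource, ProducerLike producer, CacheLike cache) {
        boolean useCache = sampler.cachingCurrentlyCheaper(resource);
        long start = System.nanoTime();
        Object result = useCache ? cache.getOrProduce(resource, producer)
                                 : producer.produce(resource);
        sampler.record(resource, useCache, System.nanoTime() - start);
        return result; // the next request for this resource may take the other path
    }
}

interface CostSamplerLike {
    boolean cachingCurrentlyCheaper(String resource);
    void record(String resource, boolean usedCache, long elapsedNanos);
}

interface ProducerLike { Object produce(String resource); }
interface CacheLike { Object getOrProduce(String resource, ProducerLike producer); }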

I can also provide you with numbers on how efficient the minimization was, give you insights into the resources where the caching is inefficient (you can call them hotspots, if you want), or provide you with strategies on cost effectiveness (for example, it can forecast the minimization effectiveness as a function of memory or disk speed).

With *this* information, gathered on a running system, you can tune:

1) your cost function
2) the physical resources that enter your cost function (cpu speed, memory, disk speed, network speed)
3) the caching logic of the hotspots


Of course, the cache should not do the tuning itself, but merely try to do its best with what you give it, and then give you numbers that allow you to further tune the system.

Please don't reply without having understood what I wrote, and if something is unclear, please ask, don't just assume.

TIA

--
Stefano, a little depressed by the fact that what he considers the best idea of his entire life is not even barely understood :-(



