On Wednesday, Jul 16, 2003, at 14:40 America/Guayaquil, Berin Loritsch wrote:


Stefano Mazzocchi wrote:

What you left as an excersize to the reader was the Cost function.
See my email to Marc for more insights.

And my response to that. I think I am starting to "get it". Having an adaptive cost function with an adaptive cache can efficiently protect against absolute constraints and promote the primary concern.

Bingo!


In the end we might find that the memory/disk space consumption is
our most precious resources--esp. if we have absolute limits imposed
on us from outside. Time would always expand to the remaining proportion.

Yep.


Note that the collected data might be used to "predict" what would be the increased efficiency of the cache if, for example, more RAM was added to the system.

Today, this information is just a wild guess.


The
rules-based approach is one way of achieving that cost function declaratively.
True. But I prefer a purely statistical approach because I'm lazy and I don't want write those rules: I want to the system to find them out for me. ;-)

I am most concerned with rules imposed from outside. For example, some
hosting environments will have sys admins on your back if you have runtime
requirements that exceed their perceptions of optimal resource use. In those
cases we don't have the luxury of a purely adaptive cache.

I know. :( You are touching the nerve here: sysadmins who believe that they can hand-optimize their caches better than an adaptive system can do.


If we hit that wall, there is nothing technological we can do about it. But I think that if we give fancy data on what the cache is doing and how they can further tune it, you give them back the 'feeling' that they are doing the job of optimization, not the system (which is partially true). And that might satisfy their need for complete control.

But I have to prove it works first ;-)

Explicit rules are easier to
debug and understand, also to predict.
Here I disagree.

Currently we have a very simple caching rule:


If memory consumption goes higher than a particular threshold, start purging
the cache. This in turn increases the cache miss ratio, but as the purging
is smart enough to get rid of old entries and not recently accessed items.


Anyhoo, this is neither here nor there. I am beginning to see how we can
impose close to absolute limits with a simple weighted cost function.

Exactly! I came to that same exact conclusion from this very point: I tried to come up with an optimal way to know which object to throw aaway when the cache is full, and I found out that we were heuristically trying to estimate which one was saving more resources... but without even trying to measure it and thinking that time is the only valuable resource.


The whole design started exactly from there! and I'm very glad you got it too (it means I'm not completely out of mind)

Truth be told, your search algorithm might be as efficient as they come.
It might not.
True. It has to be demostrated. But the incredible ability of bogofilter gives me hope that it is the case even for caching.

As long as we have a consistent interface for the cache controller, we can
swap out search algorithms and still use our same cost function. That way
we can find the optimal search algorithm for the cache. Anyhoo, this is
another discussion entirely.

Yep, this is an implementation detail at this point.


But do keep in mind the poor administrator/installer. The guy who is
managing the install is more interested in the efficiency of the cache
and how Cocoon uses resources than the developer. The user in many cases
won't know the difference because their network connection is throttling
the response, or their browser is the bottleneck. The efficiency of the
cache will likely have the most impact on the scalability of the server.
That is my OPINION, and should not be taken as gospel. Please do not
shoot me for holding that opinion.
Not at all. When implemented, the dynamic adaptability will be configurable and you can turn it off or plug in your own cache if you dislike it.
Still, keeping in mind the poor administrator/installer, I don't see how a rule engine can be easier to configure than a system that doesn't need anything to learn and adapting and optimizing resources as it works. by itself and with no need for supervision.

The big thing is that we need a way for the administrator/installer to set
hard limits and/or the order of precedence for the cost function parameters.
That way they don't have to experiment with the actual weighting values,
and they will be just as adaptive as the search algorithm.

True. But keep in mind that even a reasonable guess on the cost function will do magic compared to the usual caching approach.


I took the 18 pages home and started to play with some code. I appreciate
the journey to get to the search algorithm you proposed. The functions
as declared in the beginning were not that efficient, and on my machine
as soon as you had more than ~410 samples for the invalid/valid cost
evaluations the function took ~10ms (the granularity of my clock) to
evaluate. The evaluation used constants for all the cost figures because
that would be the computationally cheapest way to evaluate the efficiency
of those functions.


After this exercise, to my dismay, was the better way of evaluating a
continual cost value.  That way instead of having a sample for each
request at a particular time, we only had to work with three samples.
A vastly improved search algorithm in terms of being computationally
cheap.
Hmmm, I'm very intersted on this, can you elaborate more?

I started by implementing the first algorithms for valid(r) and invalid(r),
as well as the eff(r) and the evaluation for whether it was efficient to cache
or not. That meant I had to create the functions for Pt(r) and Ct(r).

ok


The functions as defined by your initial algorithm used the sum notation,
which means we needed to maintain a list of samples for the resource. I.e.
the cost functions identified were a function of the resource and time of
request. The more samples maintained, the less efficient the cache evaluation
became.

yes, I see your point. I think it's entirely possible to use a windowed version of the sum without degrading much the results (but this is, again, a wild guess, this should have to be tested). that means, just sum the last n terms, not more.


Now, considering I have a Windows based box, the JVM reports time
with a granularity of 10ms which is not enough for truly evaluating the
performance of the algorithm.

yep.


The better example that you had in your whitepaper had three continuous cost
values (C1, C2, and C3). C1 was the cost for direct production without caching,
C2 was the cost of using the cached value, and C3 was the cost for production
and storage using the cache. I.e. we had P, Pc, and l continously evaluated.
These three samples were being adaptively applied to the overall eff evaluation.


I will have to play with the improved psuedo-code to provide a better idea.

I'm thinking about implementing the cache separately from cocoon to see if the math behind it works or not. If you have code to share, that would be wonderful (you can place it in the scratchpad if you want)


I think it's the case and today's CPU don't make it slower to use floating-point math. Still, the hyperbolic tangent function was just one solution. there is at least another family of simply polynomial functions that exhibit the same overall properties.

:) It depends on your processor architecture really. Most 32-bit based
architectures like Pentium and PowerPC still lean a more strongly in the
direction of integer performance.


true.

Most 64-bit based architectures like
Itanium, Alpha, MIPS, etc. lean more strongly in the direction of floating
point performance. In the end, I would favor something simple like tanh()
as apposed to the complexity of a polynomial function. It is simply
easier to read.

that's why I used it ;-)


Perhaps seeing what you want in code would be the best thing. It would
help solidify things for us that don't do the math or are overwhelmed by
a very long whitepaper and trying to derive from it how it will practically
make our lives better. It is would help determine the cost/benefit ratio
of actually developing this thing. As it stands, any idea that takes 18
pages to explain gives the impression of a very high cost. Whether this
bears out in practice or not remains to be seen.
I can't tell you when, but I'll do this.
At the end, it might all be an accademic exercise that doesn't work IRL, I don't know. But in case it does, it would be a very different system that would provide cocoon with even more advantages over other solutions (or, at least, that's my goal).

I believe I am starting to see the light.

That makes me *very* happy! really! the more minds work on this, the better.


--
Stefano.



Reply via email to