On Friday, Jul 18, 2003, at 13:03 America/Guayaquil, Hunsberger, Peter wrote:
Let me try this a different way: one of the design decisions driving
the use of SAX over DOM is that SAX is more memory efficient. However,
if you're caching SAX event streams this is no longer true (assuming the
SAX data structures and DOM data structures are more or less equivalent in
size). Thus, caching calls into question the whole way in which parsing
and transformation work: if you're going to cache, why not cache
something which is directly useful to the parsing and transformation
stage instead of the output? It's a bit of a radical thought because in
a way you're no longer assembling pipelines. Rather, you're pushing
data into a database and pulling data out of the database. The database
just happens to work as a cache at the same time as it works for storing
parser output. Since it's a database, the contents can be normalized
(saving space), and since it feeds transformers directly, it saves parsing
overhead (saving CPU). (Recall the discussion we had on push vs. pull
parsing and lazy evaluation.)
Ok, I think I got it: instead of saving the output, you want to save the 'internal state' of the producer that is generating that output.
The only way to implement this would be to force the pipeline components to output two different things: the output (which goes into the pipeline) and a serializable form of their internal state (which goes into the cache)... and also to provide a method for the cache to 'push' a cached internal state back into them, so that they can recreate their operation from it.
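To make "serializable form of their internal state" concrete, I mean something like the following contract (a rough sketch only; the names are invented and nothing like this exists in the current codebase):

    import java.io.Serializable;

    // Hypothetical contract: a pipeline component that, besides producing
    // its normal SAX output, can hand the cache a serializable snapshot of
    // its internal state and later be re-primed from such a snapshot.
    public interface StatefulProducer {

        // normal operation: generate SAX events into the pipeline
        void generate() throws Exception;

        // snapshot of the producer's internal state, stored by the cache
        Serializable saveState();

        // re-prime the producer from a previously cached snapshot
        void restoreState(Serializable state);
    }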
If I understood correctly, I really don't see the advantage.
1) Since you aren't tracking a history of events, there is no relationship to Fourier transforms and sampling periods; they're not relevant.
? I don't see why.
My point, put in simple down-to-earth terms, is this: if the behavior of the system changes faster than we can measure it, we'll never converge.
This is the basis of all feedback systems and the reason why B-2 pilots can't fly that airplane without computer control: their reaction time is too slow. Piloting, there, means "suggesting the general tendency of directionality"; the rest is done by the computer.
One way to explain this is with Fourier analysis, which I tend to prefer over the Z-transform used in discrete domains, or Laplace analysis, but that's just personal taste.
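To put the sampling argument in one line: if the cost of producing a resource oscillates with some characteristic frequency f_change, and the cache can only re-measure that cost when the resource is actually re-produced, at some effective rate f_sample, then the usual sampling condition

    f_sample >= 2 * f_change

has to hold (roughly) for the estimate to track the real cost at all; below that, the cache is always chasing a value that has already moved on, which is what I mean by "never converge".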
Only if you mapped a specific load to a particular cost would
a period apply (and then it would be in relationship to loading and not
time!). Creating a map of load to cost for each fragment producer would
be possible, but how do you measure "load" in a meaningful way that can
be extrapolated to individual producer behavior over global load? I don't
think you can without consuming a lot of resources...
You don't need to measure load. All I said is that the algorithm doesn't work if the system behavior changes faster than the cache can adapt to it. From some *very* lazy calculations on observable behaviors, I think that the adaptive cache converges for resources that are requested once every 3 hours or more frequently.
2) I do think you may want some randomness. However, at this point I
don't think it applies. Consider a simple system with say 5 resources.
The resources are each initially found to have a cost of production of
.1, .2, .3, etc., respectively. At this point the cache is full (must
be Cocoon running on a Palm Pilot) so we trash the lowest cost item.
Now the system starts to get busy and the cost of recreating it starts
to rise monotonically until it reaches .2. Now we use some randomness
(assuming all other factors equal) to decide which guy to toss. Upon
re-evaluating the .2 item it is found to have exponentially rising cost
of production and now costs .4 to recreate. The system continues to
load up and eventually the .2 guy costs .3 to recreate. Again some
randomness occurs and we choose the original .3 guy to retry and it
still costs .3 to recreate and it eventually becomes the lowest cost
item to recreate under load (the original .1 guy bumps up to .301 or
whatever).
At this point we have found the cost of producing the items under load.
If the system load now starts to decrease we have to decide whether the
cost under load is still useful or we need to re-evaluate the cost under
reduced load. The only way we can re-evaluate it is if we arbitrarily
force it out of cache. At first it would seem that since the system is
under reduced load there are resources available to process the item that
has the lowest cost at load even if it is not the lowest cost with
reduced load. However, this does not guarantee the best global response
times under reduced load. One can argue (and, I believe, prove) that if
you pick your cost functions properly you will arrive at the best global
response time. This means the question is how to pick the cost
function.
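Let me try to restate the policy you describe as code, just to check that I'm reading you correctly (this is only my paraphrase in rough Java; none of it is actual Cocoon code, and the names are made up, with the costs taken from your .1/.2/.3 example):

    import java.util.*;

    // My reading of the policy: evict the entry that is cheapest to
    // recreate, break ties at random, and let the next request re-measure
    // the cost of whatever was evicted under the load of that moment.
    class CostCache {

        static class Entry {
            final String key;
            double cost;   // last measured cost of (re)producing this resource
            Entry(String key, double cost) { this.key = key; this.cost = cost; }
        }

        private final Map<String, Entry> entries = new HashMap<String, Entry>();
        private final int capacity;
        private final Random random = new Random();

        CostCache(int capacity) { this.capacity = capacity; }

        // store (or refresh) a resource together with its freshly measured cost
        void put(String key, double measuredCost) {
            if (!entries.containsKey(key) && entries.size() >= capacity) {
                evict();
            }
            entries.put(key, new Entry(key, measuredCost));
        }

        // evict the cheapest-to-recreate entry; randomness only breaks ties
        private void evict() {
            if (entries.isEmpty()) {
                return;
            }
            double min = Double.MAX_VALUE;
            for (Entry e : entries.values()) {
                min = Math.min(min, e.cost);
            }
            List<Entry> cheapest = new ArrayList<Entry>();
            for (Entry e : entries.values()) {
                if (e.cost <= min) {
                    cheapest.add(e);
                }
            }
            Entry victim = cheapest.get(random.nextInt(cheapest.size()));
            entries.remove(victim.key);
            // the evicted resource is re-produced on its next request, and the
            // new measurement (taken under the current load) replaces the stale one
        }
    }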
I really don't see how you can apply probabilistic behavior without randomness.
And if you don't want probabilistic behavior, I don't see how you can create a stable system.
First examine the case where we have a less than optimal cost function;
if it has enough of an impact it will push the load up and force a
reevaluation of other items. Eventually one will either settle into a
steady state or be thrashing (A little background: I helped write memory
paging algorithms for the OS in my distant youth.) If you are thrashing
one of two things is possible: you have so few resources for the given
load that nothing you do will help OR alternately, your cost evaluation
function is really poor. So, we see everything comes back to picking
the right cost algorithm. This is the NP-hard packing problem (in
another guise), which is why it makes some sense to allow for the
possibility of introducing randomness at this point. We can use
Monte Carlo-like simulation to make up for picking less than optimal
cost functions.
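If I translate "Monte Carlo-like simulation" into code, I read it as something like this (again my invention, not anything that exists; 'noise' is an arbitrary parameter):

    import java.util.Random;

    // Randomness moved into the cost *evaluation* rather than the victim
    // choice: perturb the last measured cost by a small random factor, so
    // that even an entry whose estimate is stale, or produced by a poor
    // cost function, occasionally looks cheap enough to be evicted and
    // re-measured under the current load.
    class NoisyCostEstimator {

        private final Random random = new Random();
        private final double noise;   // relative perturbation, e.g. 0.1 = +/- 10%

        NoisyCostEstimator(double noise) { this.noise = noise; }

        double estimate(double measuredCost) {
            double factor = 1.0 + (2.0 * random.nextDouble() - 1.0) * noise;
            return measuredCost * factor;
        }
    }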
I really don't see why moving randomness from the choice stage to the cost evaluation stage can improve the convergence of the system (rather the opposite, I would say).
Two important results:
1) It only makes sense to fall back to introducing randomness under conditions of less than full load or if we are thrashing.
??? I saw no proof of this at all.
Both of these can be hard to determine; thrashing cycles can be long and involved.
??? like "what entry is the one with the smaller efficiency?" c'mon.
Under full load or near full load re-evaluation will be forced in any case (if resources are impacted enough for it to matter).
I don't get it.
2) Using randomness is one way of evaluating cost. If you have a function that produces good cost results, then adding the randomness doesn't help. In other words, Monte Carlo-like behavior can in itself be a cost evaluation function.
I don't get it either.
I think I'm missing your point entirely. Sorry for being so slow, but I'm really trying hard!
-- Stefano.
