Hunsberger, Peter wrote:

Stefano Mazzocchi <[EMAIL PROTECTED]> responds:

<snip/>


I think I really lost you here. What does it mean to "retain the intermediate results of the parser"? What are you referring to? And what kind of database do you envision in a push pipe? Sorry, I don't get it, but I smell something interesting, so please elaborate further.


Let me try this a different way:  one of the design decisions driving
the use of SAX over DOM is that SAX is more memory efficient.  However,
if you're caching SAX event streams this is no longer true (assuming the
SAX data structures and DOM data structures are more or less equivalent in
size).  Thus, caching calls into question the whole way in which parsing
and transformation work: if you're going to cache, why not cache
something which is directly useful to the parsing and transformation
stage instead of the output?  It's a bit of a radical thought because in
a way you're no longer assembling pipelines.  Rather, you're pushing
data into a database and pulling data out of the database.  The database
just happens to work as a cache at the same time as it works for storing
parser output.  Since it's a database, the contents can be normalized
(saving space), and since it feeds transformers directly, it saves parsing
overhead (saving CPU).  (Recall the discussion we had on push vs. pull
parsing and lazy evaluation.)
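
To make the memory argument concrete, here is a minimal sketch of what caching a SAX event stream amounts to. This is not Cocoon's actual event recorder; the names SaxRecorder and RecordedEvent are made up, and only elements and text are handled. The point is simply that every event ends up copied into memory, which is not far from what a DOM holds:

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

public class SaxRecorder extends DefaultHandler {

    // Each recorded event carries enough state to be replayed later.
    private interface RecordedEvent {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<RecordedEvent> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        final Attributes copy = new AttributesImpl(atts); // parsers reuse the Attributes object, so copy it
        events.add(t -> t.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(t -> t.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        final String text = new String(ch, start, length); // the parser owns the char buffer, so copy it
        events.add(t -> t.characters(text.toCharArray(), 0, text.length()));
    }

    // Push the recorded events into any downstream handler: no re-parsing involved.
    public void replay(ContentHandler target) throws SAXException {
        target.startDocument();
        for (RecordedEvent e : events) {
            e.replay(target);
        }
        target.endDocument();
    }
}

A transformer fed from replay() never touches the parser again; the trade-off is exactly the memory the recorded events occupy.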

What you described here is a function of storage and retrieval. No matter how that is changed or optimized, the process of determining whether to cache or not is up to the algorithm that Stefano described. Your proposal would have an effect on the output of the cost function--hopefully the cost would be smaller. That said, using a standard cost function helps you determine authoritatively whether your storage mechanism is truly cheaper or not, because it reduces several different parameters to a single comparable value.

You might find that your storage solution will cost less with some cost
functions and more with others.  In that case, we would be able to tune
the storage mechanism with the cost function that suits the system best.
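
To make the "standard cost function" idea concrete, something along these lines would do. The interface and parameter names below are mine, not an existing Cocoon API; the only point is that storage size, production time and hit rate get folded into a single comparable number, and the whole thing can be swapped out for tuning:

// A pluggable yardstick: lower cost means "more worth keeping in the cache".
public interface CacheCostFunction {
    double cost(long bytesStored, long productionMillis, double hitsPerMinute);
}

// One possible implementation: memory spent per millisecond of work saved.
class MemoryPerSavedWork implements CacheCostFunction {
    public double cost(long bytesStored, long productionMillis, double hitsPerMinute) {
        double savedWork = productionMillis * Math.max(hitsPerMinute, 0.001); // avoid division by zero
        return bytesStored / savedWork;
    }
}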


This said, I admit there are tons of other ways to achieve similar functionality; I just expressed one.

Yes, I guess I'll comment here instead of on that part of the thread.
Two thoughts:


1) Since you aren't tracking a history of events there is no
relationship to Fourier transforms and sampling periods; they're not
relevant.  Only if you mapped a specific load to a particular cost would
a period apply (and then it would be in relation to load and not
time!).  Creating a map of load to cost for each fragment producer would
be possible, but how do you measure "load" in a meaningful way that can
be extrapolated to individual producer behavior over global load?  I don't
think you can without consuming a lot of resources...
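
For what it's worth, representing the map itself is trivial -- something like the sketch below (names invented). The expensive part is exactly what you point out: getting a trustworthy load figure for each sample.

import java.util.TreeMap;

// One "load -> cost" profile per fragment producer.
class LoadCostProfile {
    // key: system load at measurement time; value: observed production cost in ms
    private final TreeMap<Double, Long> samples = new TreeMap<>();

    void record(double load, long costMillis) {
        samples.put(load, costMillis);
    }

    // Estimate the cost at the current load by taking the nearest recorded sample.
    Long estimate(double load) {
        Double below = samples.floorKey(load);
        Double above = samples.ceilingKey(load);
        if (below == null) return above == null ? null : samples.get(above);
        if (above == null) return samples.get(below);
        return (load - below <= above - load) ? samples.get(below) : samples.get(above);
    }
}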

Any time we track a period of history, that does affect things. Given global load algorithms and the 10ms granularity of many JVM clocks, that might be a function best suited for JNI integration. The JNI interface will provide hooks to obtain more precise memory info, more precise timing info (10ms granularity means that until I have >= 10ms of timing, all requests are measured as 0ms--clearly not adequate), as well as a hook to obtain system load. System load is available in UNIX environments, and it would need to be translated for Windows environments, but it is something that sysadmins care greatly about.
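
For illustration, the JNI surface could be as small as the sketch below. The class, method and library names are hypothetical, and the native side still has to be written per platform (getloadavg() on most UNIXes, something else entirely on Windows):

// Hypothetical JNI facade for the hooks described above.
public final class SystemProbe {

    static {
        System.loadLibrary("systemprobe"); // hypothetical native library, one build per platform
    }

    // Timing with finer granularity than the ~10ms of System.currentTimeMillis() on some JVMs.
    public static native long highResolutionNanos();

    // One-minute load average as reported by the OS: the number sysadmins actually watch.
    public static native double loadAverage();

    // Physical memory free on the machine; Runtime.freeMemory() only covers the JVM heap.
    public static native long freePhysicalMemory();
}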

2) I do think you may want some randomness.  However, at this point I
don't think it applies.  Consider a simple system with, say, 5 resources.

<snip/>


At this point we have found the cost of producing the items under load.
If the system load now starts to decrease we have to decide whether the
cost under load is still useful or whether we need to re-evaluate the cost
under reduced load.  The only way we can re-evaluate it is if we arbitrarily
force it out of cache.  At first it would seem that, since the system is
under reduced load, there are resources available to process the item that
has the lowest cost at load, even if it is not the lowest cost with
reduced load.  However, this does not guarantee the best global response
times under reduced load.  One can argue (and, I believe, prove) that if
you pick your cost functions properly you will arrive at the best global
response time.  This means the question is how to pick the cost
function.
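
A sketch of that "force it out so it gets re-measured" decision, just to pin down what gets compared; the field names and the threshold are arbitrary placeholders:

// Cost as measured when the fragment was last produced, plus the load at that moment.
class CachedFragment {
    long costMillis;
    double loadAtMeasurement;
}

class ReevaluationPolicy {
    private static final double LOAD_DROP_FACTOR = 0.5; // illustrative threshold, not a recommendation

    // True if the load has fallen far enough below the measurement load that the
    // stored cost is probably stale and the entry should be forced out and re-measured.
    boolean shouldForceOut(CachedFragment f, double currentLoad) {
        return currentLoad < f.loadAtMeasurement * LOAD_DROP_FACTOR;
    }
}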

By separating the cost function out, we can experiment with it until we arrive at something that works.

So, we see everything comes back to picking
the right cost algorithm.  This is the NP-hard packing problem (in
another guise).  That is why it makes some sense to allow
for the possibility of introducing randomness at this point.  We can use
Monte Carlo-like simulation to make up for picking less-than-optimal
cost functions.

Yep.


Two important results:

1) It only makes sense to fall back to introducing randomness under
conditions of less-than-full load or if we are thrashing.  Both of these
can be hard to determine; thrashing cycles can be long and involved.
Under full load or near full load re-evaluation will be forced in any
case (if resources are impacted enough for it to matter).

How do you know what full load is? 100% CPU utilization? 100% memory utilization? Just because the CPU is running at 100% does not mean that we have a particularly high load. Check out a UNIX system using the uptime or top command. You will find that there is a marked difference between 100% CPU utilization with a load of 1.32 and 70% CPU utilization with a load of 21.41.
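
As an aside, on Java 6 and later the load average being discussed here can be read without JNI through the standard management bean (it returns a negative value on platforms that don't expose one):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        System.out.println("processors:   " + os.getAvailableProcessors());
        System.out.println("load average: " + os.getSystemLoadAverage());
        // A load average well above the processor count means work is queueing,
        // even though plain CPU utilization can read "100%" in both situations.
    }
}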

2) Using randomness is one way of evaluating cost.  If you have a
function that produces good cost results then adding the randomness
doesn't help.  In other words, Monte Carlo-like behavior can in
itself be a cost evaluation function.
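
Tying that back to the pluggable cost function sketched earlier: "Monte Carlo as the cost function" can literally be a function that ignores the measurements and returns noise. Purely illustrative, and it assumes the hypothetical CacheCostFunction interface from the earlier sketch:

import java.util.Random;

// Ignores every measurement and effectively samples eviction candidates at random.
class RandomCost implements CacheCostFunction {
    private final Random rng = new Random();

    public double cost(long bytesStored, long productionMillis, double hitsPerMinute) {
        return rng.nextDouble();
    }
}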

Ok. All this theory can be proven...


--

"They that give up essential liberty to obtain a little temporary safety
 deserve neither liberty nor safety."
                - Benjamin Franklin


