Berin Loritsch <[EMAIL PROTECTED]> comments: <snip/>

> > Let me try this a different way: one of the design decisions driving
> > the use of SAX over DOM is that SAX is more memory efficient.
> > However, if you're caching SAX event streams this is no longer true
> > (assuming the SAX data structures and DOM data structures are more or
> > less equivalent in size). Thus, caching calls into question the whole
> > way in which parsing and transformation work: if you're going to
> > cache, why not cache something which is directly useful to the
> > parsing and transformation stage instead of the output? It's a bit of
> > a radical thought because in a way you're no longer assembling
> > pipelines. Rather, you're pushing data into a database and pulling
> > data out of the database. The database just happens to work as a
> > cache at the same time as it works for storing parser output. Since
> > it's a database the contents can be normalized (saving space); since
> > it feeds transformers directly it saves parsing overhead (saving
> > CPU). (Recall the discussion we had on push vs. pull parsing and lazy
> > evaluation.)
>
> What you described here is a function of storage and retrieval. No
> matter how that is changed or optimized, the process of determining
> whether to cache or not is up to the algorithm that Stefano described.

Yes, except for one thing: the distinction between caching and
non-caching producers becomes blurred, since both are generating their
output into the "cache". The real distinction now is that some producers
have such a small ergodic period that it may not make sense to keep the
results in the "cache". However, you don't need to divide producers up
into caching/non-caching, since the cache manager can figure that out
for you.

<snip/>

> > Two thoughts:
> >
> > 1) Since you aren't tracking a history of events there is no
> > relationship to Fourier transforms and sampling periods; they're not
> > relevant. Only if you mapped a specific load to a particular cost
> > would a period apply (and then it would be in relationship to loading
> > and not time!). Creating a map of load to cost for each fragment
> > producer would be possible, but how do you measure "load" in a
> > meaningful way that can be extrapolated to individual producer
> > behavior over global load? I don't think you can without consuming a
> > lot of resources...
>
> Any time we track a period of history, that does affect things. Given
> global load algorithms and the 10ms granularity of many JVM clocks,
> that might be a function best suited for JNI integration. The JNI
> interface will provide hooks to obtain more precise memory info, more
> precise timing info (10ms means that until I have >= 10ms of timing all
> requests are measured as 0ms--clearly not adequate), as well as a hook
> to obtain system load. This is a function available in UNIX
> environments, and it would need to be translated for Windows
> environments, but it is something that SYS admins care greatly about.

I can't really see the history being that useful: you need to know
"load" at each point as well as cost, and as you emphasize, what is
load?
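For what it's worth, if "load" just means the UNIX load average, you
don't strictly need JNI to read it on Linux: it sits in /proc/loadavg.
A very rough sketch follows -- the file and its format are
Linux-specific (Windows would need something else entirely), and the
class and method names are just placeholders, nothing in Cocoon:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/**
 * Rough sketch: read the one-minute load average from /proc/loadavg.
 * Linux-specific, and purely illustrative.
 */
public class LoadAverage {

    /** Returns the one-minute load average, or -1 if it can't be read. */
    public static double oneMinute() {
        BufferedReader in = null;
        try {
            in = new BufferedReader(new FileReader("/proc/loadavg"));
            // The single line looks like "0.42 0.37 0.30 1/123 4567";
            // the first token is the one-minute average.
            String line = in.readLine();
            if (line == null) {
                return -1.0;
            }
            return Double.parseDouble(line.split("\\s+")[0]);
        } catch (IOException e) {
            return -1.0;
        } catch (NumberFormatException e) {
            return -1.0;
        } finally {
            if (in != null) {
                try { in.close(); } catch (IOException ignored) {}
            }
        }
    }
}

That still leaves the timing granularity problem you raise, of course;
it only covers the "what is load" half.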
> > Two important results:
> >
> > 1) It only makes sense to fall back to introducing randomness under
> > conditions of less than full load or if we are thrashing. Both of
> > these can be hard to determine; thrashing cycles can be long and
> > involved. Under full load or near full load re-evaluation will be
> > forced in any case (if resources are impacted enough for it to
> > matter).
>
> How do you know what full load is? 100% CPU utilization? 100% memory
> utilization? Just because the CPU is running 100% does not mean that
> we have a particularly high load. Check out a UNIX system using the ps
> command. You will find that there is a marked difference between 100%
> CPU utilization and a load of 1.32, and 70% CPU utilization and a load
> of 21.41.

Yes, that's sort of my point: you can only get some sort of
approximation. As a result you may want 2) more often than not:

> > 2) Using randomness is one way of evaluating cost. If you have a
> > function that produces good cost results then adding the randomness
> > doesn't help. In other words, a Monte Carlo like behavior can in
> > itself be a cost evaluation function.
>
> Ok. All this theory can be proven...

Well, just getting solid evidence would be great; "proof" gets you back
into the math... ;-)
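To make 2) a bit more concrete, here is roughly the kind of thing I
have in mind -- purely illustrative, the class, CacheEntry and Producer
are made-up names, nothing that exists in Cocoon today: on a small
random fraction of cache hits you regenerate the fragment anyway, time
it, and use that measurement to decide whether the entry is still worth
keeping. The random sampling itself is the cost evaluation function.

import java.util.Random;

/**
 * Illustrative sketch only: random re-evaluation as a cost estimator.
 */
public class MonteCarloCacheManager {

    /** Re-evaluate roughly 5% of cache hits. */
    private static final double SAMPLE_PROBABILITY = 0.05;

    private final Random random = new Random();

    /** Minimal stand-in for a cached fragment plus its bookkeeping. */
    static class CacheEntry {
        byte[] content;          // e.g. serialized SAX events
        long lastProductionCost; // millis spent producing it last time
    }

    /** Whatever actually generates the fragment; hypothetical. */
    interface Producer {
        byte[] produce();
    }

    /**
     * On a cache hit, occasionally regenerate the entry anyway and
     * record the measured cost (subject, of course, to the ~10ms clock
     * granularity you mention). An entry that turns out to be cheap to
     * produce can then be dropped from the cache instead of kept.
     */
    byte[] hit(CacheEntry entry, Producer producer) {
        if (random.nextDouble() < SAMPLE_PROBABILITY) {
            long start = System.currentTimeMillis();
            byte[] fresh = producer.produce();
            entry.lastProductionCost = System.currentTimeMillis() - start;
            entry.content = fresh;
            return fresh;
        }
        return entry.content;
    }
}

No claim that this is the right policy, just that the randomness and
the cost measurement can be one and the same mechanism.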
