Berin Loritsch wrote:
As to the good enough vs. perfect issue, caching partial pipelines (i.e. the results of a generator, each transformer, and the final result) will prove to be an inadequate way to improve system performance.
I think caching parts of a pipeline is a very smart way of optimizing the cache. One example:
- Complex generator (ttl 12h)
- Transformation (expensive)
- CIncludeTransformer (cheap in terms of CPU usage; includes perhaps something like a static header and the time of day). One of the included sources is dynamic (the time of day) and has a time to live of one minute.
- Serializer
So the complete pipeline has a ttl of 1 minute, but it makes more sense to cache the generation and transformation for 12h rather than the complete pipeline for 1 minute. And as I understand Stefano's ideas, his cache would adapt to such a situation, knowing that the CPU time saved can be maximized by also caching the first part of the pipeline (if the cache agent observes that the component is accessed more than once in 12h).
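To make the arithmetic concrete, here is a rough Java sketch of the idea (the class and names are mine, not Cocoon APIs): the expensive generation/transformation result is kept for 12 hours, the assembled page only for one minute, so within the 12-hour window only the cheap include step ever re-runs.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.Supplier;

    /** Hypothetical sketch: cache an intermediate pipeline result separately
     *  from the fully assembled page, each with its own time-to-live. */
    public class PartialPipelineCache {

        /** A cached value plus the instant at which it expires. */
        static final class Entry {
            final String value;
            final Instant expires;
            Entry(String value, Duration ttl) {
                this.value = value;
                this.expires = Instant.now().plus(ttl);
            }
            boolean isFresh() { return Instant.now().isBefore(expires); }
        }

        private Entry intermediate;   // generator + expensive transformer, ttl 12 h
        private Entry assembled;      // after the cheap include step, ttl 1 min

        /** Returns the final page, recomputing only the stages whose cache expired. */
        String getPage(Supplier<String> expensiveStages, Supplier<String> timeOfDay) {
            if (assembled != null && assembled.isFresh()) {
                return assembled.value;                       // whole page still valid
            }
            if (intermediate == null || !intermediate.isFresh()) {
                // the expensive part runs at most once every 12 hours
                intermediate = new Entry(expensiveStages.get(), Duration.ofHours(12));
            }
            // the cheap include step re-runs at most once per minute
            String page = intermediate.value.replace("@TIME@", timeOfDay.get());
            assembled = new Entry(page, Duration.ofMinutes(1));
            return page;
        }

        public static void main(String[] args) {
            PartialPipelineCache cache = new PartialPipelineCache();
            String page = cache.getPage(
                    () -> "<html>...expensive content... @TIME@</html>",
                    () -> Instant.now().toString());
            System.out.println(page);
        }
    }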
Again, I respectfully disagree; IMO partial pipeline caching has proven my point. Consider the same scenario above. Transformation, while computationally expensive, is still less costly than serialization due to the blocking nature of the serializer. The generator is the part of the pipeline most likely to alter the contents of a resource, which means the entire pipeline has to be re-evaluated anyway.
In the case of the CIncludeTransformer, including the dynamic time of day is a bad example. What if I give the illusion of dynamics by using JavaScript for the same purpose? I get the same dynamics, yet I don't have the overhead of invalidating my cache.
However, because that is an implementation detail not all developers think of, even if we regenerate the entire pipeline every minute, that can be done by an asynchronous process: the cache serves up the old content while the new content is being generated. This is how we update the sitemap.
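A rough sketch of that asynchronous regeneration, again with invented names rather than actual Cocoon code: requests always get an immediate answer from the cache, and a background thread rebuilds the resource once the ttl has passed.

    import java.util.concurrent.*;
    import java.util.function.Supplier;

    /** Hypothetical sketch: when a cached resource expires, keep serving the old
     *  copy and refresh it from a background thread, so no request ever waits
     *  for a full pipeline run. */
    public class RefreshBehindCache {

        private final Supplier<String> pipeline;          // full pipeline run
        private final long ttlMillis;
        private volatile String content;
        private volatile long refreshedAt;
        private final ExecutorService refresher = Executors.newSingleThreadExecutor();
        private final Semaphore refreshing = new Semaphore(1);

        RefreshBehindCache(Supplier<String> pipeline, long ttlMillis) {
            this.pipeline = pipeline;
            this.ttlMillis = ttlMillis;
            this.content = pipeline.get();                // prime the cache once
            this.refreshedAt = System.currentTimeMillis();
        }

        /** Always answers immediately from the cache; schedules a refresh if stale. */
        String get() {
            if (System.currentTimeMillis() - refreshedAt > ttlMillis
                    && refreshing.tryAcquire()) {
                refresher.submit(() -> {
                    try {
                        content = pipeline.get();         // regenerate off the request path
                        refreshedAt = System.currentTimeMillis();
                    } finally {
                        refreshing.release();
                    }
                });
            }
            return content;                               // old content until the refresh lands
        }

        void shutdown() { refresher.shutdown(); }

        public static void main(String[] args) throws Exception {
            RefreshBehindCache cache = new RefreshBehindCache(
                    () -> "page built at " + System.currentTimeMillis(), 1000);
            System.out.println(cache.get());
            Thread.sleep(1500);
            System.out.println(cache.get());              // stale copy, refresh queued
            Thread.sleep(200);
            System.out.println(cache.get());              // refreshed copy
            cache.shutdown();
        }
    }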
Practically speaking, most pipelines in my applications take less than 100 ms to generate, including database access and serialization. In fact, most take between 20-60 ms depending on the complexity of the pipeline. If your generation times are much longer than that, then you really need to look into it. That figure incorporates a complex generator and up to five transformers. I'd say that is impressive. Add a cache and the results are returned in 0-20 ms depending on the load of the machine. As the machine gets heavily loaded, generation may take longer than a second, but cached resources are still served within 10% of the time a full generation of the resource would take.
For this reason, providing a generic cache that works on whole resources is a much more efficient use of time.
Doesn't it make more sense, then, just to run Squid in front of Cocoon?
Does Squid allow you to cache user objects? No. By a generic cache, I mean one that is also available to cache the data objects used in your site, not just your generated pages.
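Something like the following toy sketch (hypothetical names, not an existing Cocoon class) is what I mean by generic: the same store holds a user object that Squid can never see and a rendered page, looked up the same way.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    /** Hypothetical sketch: one cache that an application can use for its own
     *  data objects as well as for rendered pages, which an HTTP cache like
     *  Squid cannot see. */
    public class GenericCache {

        private record Entry(Object value, Instant expires) {}

        private final Map<String, Entry> store = new ConcurrentHashMap<>();

        /** Returns the cached value for the key, computing and storing it if absent or expired. */
        @SuppressWarnings("unchecked")
        <T> T get(String key, Duration ttl, Supplier<T> loader) {
            Entry e = store.get(key);
            if (e == null || Instant.now().isAfter(e.expires())) {
                T value = loader.get();
                store.put(key, new Entry(value, Instant.now().plus(ttl)));
                return value;
            }
            return (T) e.value();
        }

        public static void main(String[] args) {
            GenericCache cache = new GenericCache();
            // a domain object, invisible to a front-end HTTP cache
            Map<String, String> user =
                    cache.get("user:42", Duration.ofMinutes(30), () -> Map.of("name", "Ada"));
            // a rendered page, cached in exactly the same way
            String page =
                    cache.get("page:/home", Duration.ofMinutes(1), () -> "<html>home</html>");
            System.out.println(user + " / " + page);
        }
    }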
* We do not have accurate enough tools to determine the cost of any particular component in the pipeline.
I think measuring the time for any component/pipeline is quite difficult. It is always affected by the system load.
That's fine. System load should be accounted for. What I am speaking of is the inability to correctly determine the cost of any one particular stage.
In your example above, the profiler as it is written now will return a set of results for what it has measured. The problem is that the total processing time does not match all those results added together. It isn't even close.
Until you have the ability to measure that metric correctly, you have no way to determine the cost correctly, and your adaptive cache will start making the wrong decisions.
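To illustrate why the per-stage numbers don't add up, here is a toy sketch (not the actual profiler) of naive timing in a push-style pipeline: each stage's timer also counts everything downstream of it, so the sum of the stages comes out far larger than the real end-to-end time.

    import java.util.function.Consumer;

    /** Hypothetical sketch of why naive per-stage timings in a push (SAX-style)
     *  pipeline don't sum to the end-to-end time: each stage's timer also
     *  counts the time spent in every stage downstream of it. */
    public class StageTimingDemo {

        /** Wraps a stage so the time spent inside it (including everything it pushes to) accumulates. */
        static Consumer<String> timed(long[] nanos, Consumer<String> stage) {
            return event -> {
                long start = System.nanoTime();
                stage.accept(event);
                nanos[0] += System.nanoTime() - start;
            };
        }

        /** Busy-waits for the given number of milliseconds to simulate work. */
        static void burn(long millis) {
            long end = System.nanoTime() + millis * 1_000_000;
            while (System.nanoTime() < end) { /* simulate work */ }
        }

        public static void main(String[] args) {
            long[] generatorNs = {0}, transformerNs = {0}, serializerNs = {0};

            Consumer<String> serializer  = timed(serializerNs,  e -> burn(5));
            Consumer<String> transformer = timed(transformerNs, e -> { burn(10); serializer.accept(e); });
            Consumer<String> generator   = timed(generatorNs,   e -> { burn(2);  transformer.accept(e); });

            long total = System.nanoTime();
            for (int i = 0; i < 10; i++) generator.accept("event-" + i);
            total = System.nanoTime() - total;

            long sum = generatorNs[0] + transformerNs[0] + serializerNs[0];
            System.out.printf("generator=%dms transformer=%dms serializer=%dms%n",
                    generatorNs[0] / 1_000_000, transformerNs[0] / 1_000_000, serializerNs[0] / 1_000_000);
            System.out.printf("sum of stages=%dms, actual total=%dms%n",
                    sum / 1_000_000, total / 1_000_000);
            // the stage figures nest inside each other, so their sum overstates the real cost
        }
    }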
--
"They that give up essential liberty to obtain a little temporary safety
deserve neither liberty nor safety."
- Benjamin Franklin