Stefano Mazzocchi wrote:
On 16 Dec 2003, at 14:02, bernhard huber wrote:
hi, <snip/>
Now, the way the event cache works is like this:
a) a cache validity is generated
b) the pipeline is executed
c) the result is stored in the cache
then the pipeline is never called, until an event is triggered externally (from an avalon component) that invalidates that particular cache entity.
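To make the flow above concrete, here is a rough sketch in Java. It is only an illustration under my own assumptions: the class EventCacheSketch, its methods and the string event names are made up for this mail and are not the actual Cocoon event cache API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch of an event-invalidated cache; not Cocoon's real API. */
class EventCacheSketch {
    // cache key -> stored pipeline result
    private final Map<String, String> results = new HashMap<>();
    // event name -> cache keys that stay valid until that event fires
    private final Map<String, Set<String>> keysByEvent = new HashMap<>();

    /** Returns the cached result, running the pipeline only on a miss. */
    String get(String key, String event, Supplier<String> pipeline) {
        String cached = results.get(key);
        if (cached != null) {
            return cached; // hit: the pipeline is not called at all
        }
        // a) generate the validity: the entry is valid until `event` fires
        keysByEvent.computeIfAbsent(event, e -> new HashSet<>()).add(key);
        // b) execute the pipeline
        String result = pipeline.get();
        // c) store the result in the cache
        results.put(key, result);
        return result;
    }

    /** Called from outside (e.g. by another component) when something changed. */
    void fire(String event) {
        Set<String> keys = keysByEvent.remove(event);
        if (keys != null) {
            for (String key : keys) {
                results.remove(key);
            }
        }
    }
}
```

With this shape, a second get() for the same key returns the stored result without touching the pipeline, and only fire() makes the next get() regenerate it.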
Some experience I had using a simple Servlet Cache Filter that caches by session ID: the session is not touched as long as the cache entry is valid, so the session expires because of this caching. But perhaps that's just an issue of the servlet engine, or of the Servlet Cache Filter.
Your sentence "...the pipeline is never called" just reminded me of that situation, and of the danger of pruning too optimistically.
Through my JSR 170 work, I've been exposed to what Day Software does with their Communique CMS.
What they do is very simple architecturally yet extremely elegant and effective.
They don't use the file system. Never. They store everything in a repository. Consider it a virtual file system with observable hooks for now (it's much more than that but it's not important for this discussion).
Whenever a resource is generated by the publishing layer, this layer instantiates a sort of "reading transaction" so that the repository can keep track of all the dependencies of that particular resource.
Note that they have libraries that, for example, generate images out of markup (sort-of Batik serializer style) so those dependencies might be quite big (I heard up to 100 files for a single resource).
When a resource is modified in the repository, the tree of dependencies is crawled "backwards" and all resources that depend on it get invalidated. Invalidation goes all the way up to an Apache module.
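A minimal sketch of that idea in Java, as I understand it. Everything here (DependencyTracker, recordGeneration, invalidate) is a hypothetical name of mine, not Day's API, and the real repository surely does much more: the "reading transaction" is reduced to a set of nodes read while generating a resource.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical dependency tracker; names are illustrative only. */
class DependencyTracker {
    // dependency -> resources that read it while being generated
    private final Map<String, Set<String>> dependents = new HashMap<>();
    private final Set<String> valid = new HashSet<>();

    /** Records everything that was read while `resource` was generated. */
    void recordGeneration(String resource, Set<String> readDuringGeneration) {
        for (String dep : readDuringGeneration) {
            dependents.computeIfAbsent(dep, d -> new HashSet<>()).add(resource);
        }
        valid.add(resource);
    }

    /** Crawls the dependencies "backwards" from a modified node. */
    Set<String> invalidate(String modified) {
        Set<String> invalidated = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(modified);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            Set<String> direct = dependents.getOrDefault(current, Collections.<String>emptySet());
            for (String resource : direct) {
                if (invalidated.add(resource)) { // visit each resource once
                    valid.remove(resource);
                    pending.push(resource);      // its own dependents are stale too
                }
            }
        }
        return invalidated;
    }

    boolean isValid(String resource) {
        return valid.contains(resource);
    }
}
```

For example, if home.html read logo.png and logo.png was generated from logo.svg, then invalidate("logo.svg") marks both logo.png and home.html as stale, while unrelated pages stay valid.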
This allows Communique to handle *extreme* load (they run Sony Style with just two boxes for fault tolerance and simple load balancing and that site generates tens of millions of requests per day, with huge peaks at break times). Note that Communique is a 100% pure Java servlet and the repository is all Java again and runs in the same JVM: no database at all, no networking overhead.
How do they do that? Well, first thing is that most requests are handled directly by the web server... the servlet engine is called only when the resource needs to be regenerated.
This leaves the machines almost doing nothing all day (if you run stuff from mod_cache, you can fill a T1 with a 486) and ready to go when a new resource has to be generated.
Now, the drawbacks:
1) if you are *not* in control of your data environment, the above system doesn't work... unless you have synchronous polling on the datasources... which is not any better than the caching system we have.
2) the caching strategy is centralized. I'm not sure if components can have their own, but for sure it's a pain. [note: they don't have a pipelined rendering layer, just a one stage, template driven, approach]
Communique is a publishing system on steroids, so I hear that writing an entire web application with Communique is probably harder than using a simple webapp framework.
Cocoon wants to do both things and do them well, with as little effort and code as possible.
Cocoon cannot have a predefined global caching strategy, it doesn't make sense. But it *does* make sense to have a pipeline-granular caching strategy, with the ability to modify it at the component level.
We have this already, we just need to polish it up a little and find out what is *really* useful and how things can be made more usable.
Today, modifying the caching strategy at the component level is black magic: nobody does it. I'm scared about it myself, so I can't even imagine users trying to do this themselves.
The off-the-shelf pipeline caches have some "magic" associated with them.... they are black boxes, basically, nobody really knows when something is caching or not.... it's hard to tell, hard to visualize, hard to control, hard to tune and hard to modify.
This makes the whole thing much less powerful than it really is.
You know how much I care about caching, but there is still a lot of work to do... especially now that new "inverted" scenarios of use are going to appear on the horizon with observable repositories.
We're talking about validities, but before checking a validity, we first have to obtain it through the cache key.
In the current Cocoon architecture, keys of cache entries are built with arbitrary data defined by each of the individual pipeline components. The result of this is that we can have several different cached responses for a single request definition (URI + headers).
The big benefit of this approach is that many variations can be cached (depending on night/day, local weather, whatever), but the main disadvantage is that the pipeline *must* be built for every request in order to compute the cache key, even if the response is served from the cache afterwards.
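To illustrate why the pipeline has to be built first, here is a deliberately simplified model in Java. PipelineKeySketch and its Component interface are hypothetical and only mimic the idea; they are not the real Cocoon caching interfaces.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical model of component-built cache keys; not Cocoon's real classes. */
class PipelineKeySketch {
    /** Each component contributes an arbitrary key fragment for the request. */
    interface Component {
        String keyFragment(Map<String, String> request);
    }

    /**
     * The cache key exists only once every component of the pipeline has been
     * set up and asked for its fragment, even if the response then comes from
     * the cache.
     */
    static String buildKey(List<Component> pipeline, Map<String, String> request) {
        StringBuilder key = new StringBuilder();
        for (Component component : pipeline) {
            key.append('[').append(component.keyFragment(request)).append(']');
        }
        return key.toString();
    }
}
```

A generator keyed on the URI and a transformer keyed on night/day give two distinct keys, and therefore two cached responses, for the same URI.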
A solution would be to have another pipeline implementation that uses a different strategy to build cache keys. What comes to mind is that instead of returning arbitrary values for the key, components could return some matching criteria on request metadata. The pipeline could then organize the cache entries by URI, each URI having a list of cached responses along with the matching criteria.
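Roughly, and only as a sketch of the idea: the class CriteriaCache and the use of a Predicate over a metadata map are my assumptions for illustration, nothing like this exists in Cocoon today.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical URI-indexed cache with matching criteria; illustrative only. */
class CriteriaCache {
    /** A cached response plus the criteria under which it may be served. */
    private static final class Entry {
        final Predicate<Map<String, String>> criteria;
        final String response;

        Entry(Predicate<Map<String, String>> criteria, String response) {
            this.criteria = criteria;
            this.response = response;
        }
    }

    // URI -> cached responses, each with its matching criteria
    private final Map<String, List<Entry>> entriesByUri = new HashMap<>();

    void store(String uri, Predicate<Map<String, String>> criteria, String response) {
        entriesByUri.computeIfAbsent(uri, u -> new ArrayList<>())
                    .add(new Entry(criteria, response));
    }

    /** Finds a response from the URI and request metadata alone. */
    Optional<String> lookup(String uri, Map<String, String> metadata) {
        List<Entry> entries = entriesByUri.getOrDefault(uri, Collections.<Entry>emptyList());
        for (Entry entry : entries) {
            if (entry.criteria.test(metadata)) {
                return Optional.of(entry.response);
            }
        }
        return Optional.empty(); // miss: only now would the pipeline be built
    }
}
```

The point of the design is that lookup() needs nothing from the pipeline components: the criteria were stored with the entry the first time the response was generated.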
It has just occurred to me how cool this would be for the CLI. Doing this would make it possible to identify whether or not _any_ effort should be expended upon generating a page, rather than the current system which involves actually getting a value from the cache before you decide to discard it.
The approach you describe above could result in a truly significant speed improvement for offline site creation.
Regards,
Upayavira, who is actually starting to use the CLI/bean on a real site for the first time!
This approach would reduce the possible cached variations for a given request, but would allow us to find cached content (and its validity) without incurring the cost of building the pipeline.
What do you think?
Sylvain
