Stefano Mazzocchi <[EMAIL PROTECTED]> writes (and writes, and writes, and writes):
<small snip/>

> WARNING: this RT is long! and very dense, so I suggest you to
> turn on your printer.

I don't have time to go through this in detail yet, but I've had a couple of fundamental questions that it might be useful to raise. I suspect the answer to some of these questions is more of a Cocoon 3.0 type of solution than anything that would happen short term, but nonetheless it might be possible to consider some of them at the moment (and I may never get around to writing it later)....

<small snip/>

> Final note: we are discussing resources which are produced
> using a "cacheable" pipeline *ONLY*. If the pipeline is not
> cacheable (means: it's not entirely composed of cache-aware
> components) caching never takes place.

Strange as it may seem, I think this statement might actually be questionable! It raises the question of what we mean by caching in the first place. You touch on this later, but let me suggest a couple of possible answers here:

- client caching (304 headers, ...)
- proxied caching
- server caching (what the RT is mostly about?)

Within server caching we can still dredge up more detail. In particular, with Cocoon we need to analyze the very mechanics of why caching is an issue at all:

1) Cocoon allows the production of various serialized outputs from dynamic inputs. If nothing is dynamic, no caching is needed (go direct to the source). Or, alternately, think of the source as being the cache!

2) Within Cocoon, dynamic production (ignoring Readers for the moment) is done via the creation and later serialization of SAX events.

To put it another way, within Cocoon caching is needed to optimize the production and serialization of the SAX events. The fact is, for some moment in time the SAX events are persisted within Cocoon, and ultimately the serialized results are also persisted and hopefully cached. (With partially cacheable pipelines the serialized results cannot be cached.)
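To make the "entirely composed of cache-aware components" condition from the quoted note concrete, here is a minimal sketch. All names are illustrative, not actual Cocoon APIs: the point is only that a single non-cache-aware component makes the whole pipeline non-cacheable.

```python
# Hypothetical sketch: a pipeline is cacheable only if *every* component
# (generator, transformers, serializer) is cache-aware, i.e. can report
# validity for its output. Names are illustrative, not Cocoon classes.

class Component:
    def __init__(self, name, cache_aware):
        self.name = name
        self.cache_aware = cache_aware

def pipeline_is_cacheable(components):
    """Caching takes place only when all components are cache-aware."""
    return all(c.cache_aware for c in components)

pipe = [Component("file-generator", True),
        Component("xslt-transformer", True),
        Component("sql-transformer", False),   # e.g. volatile DB access
        Component("html-serializer", True)]

# One non-cache-aware component spoils the whole pipeline:
print(pipeline_is_cacheable(pipe))
```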
Skimming through it, most of this paper seems to deal with the issue of how to determine, in an efficient manner, whether it is more efficient to retain these cached resources for some duration less than their normal ergodic period or to regenerate them from scratch. This immediately raises the question of how to determine the ergodic period of an item. At first it would seem that if there is no way to determine the ergodic period of a fragment, there is no reason to cache it!

However, there is an alternative method of using the cache (which Geoff Howard has been working on): an event-invalidated cache. In this model cache validity is determined by some event external to the production of the cached fragment, and the cached fragment has no natural ergodic period. Such fragments still fit mostly within the model given here: although we do not know when the external event may transpire, we can still determine whether it is more efficient to regenerate the fragment from scratch than to retain it in cache. If a cache-invalidating event transpires then, for such fragments, it may also make sense to push the new version of the fragment into the cache at that time. Common use cases might be CMSs where authoring or editing events are expensive and rare (e.g. regenerating Javadoc). In our case, we have a large set of metadata that is expensive to generate but rarely updated. This metadata is global across all users, and if there are resources available we want it in the cache.

This points out that in order to push something into cache, one wants to make the same calculation the cache manager would make to expire it from cache: is it more efficient to push a new version of this now? If not, there may eventually be a pull request, at which point the normal cache evaluation will determine how long to keep the new fragment cached.
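The event-invalidated model described above can be sketched as follows. This is only an illustration of the idea (entries with no natural ergodic period, evicted by an external event that may also push a fresh version); the class and method names are hypothetical, not Geoff Howard's actual implementation.

```python
# Sketch of an event-invalidated cache: validity is revoked by an
# external event rather than a TTL/ergodic period; the (expensive,
# rare) producer may push a new version at invalidation time.

class EventInvalidatedCache:
    def __init__(self):
        self._store = {}          # key -> cached fragment
        self._listeners = {}      # event name -> keys depending on it

    def put(self, key, fragment, events):
        self._store[key] = fragment
        for ev in events:
            self._listeners.setdefault(ev, set()).add(key)

    def get(self, key):
        # Valid indefinitely, until an event evicts it.
        return self._store.get(key)

    def fire(self, event, regenerate=None):
        """External event invalidates dependent keys; optionally a
        producer callback pushes new versions into cache right away."""
        for key in self._listeners.get(event, set()):
            self._store.pop(key, None)
            if regenerate is not None:
                self._store[key] = regenerate(key)

cache = EventInvalidatedCache()
cache.put("metadata", "<report v='1'/>", events=["hr-reports-updated"])
cache.fire("hr-reports-updated", regenerate=lambda k: "<report v='2'/>")
print(cache.get("metadata"))
```

Whether `fire()` should push (regenerate) or merely evict is exactly the cost calculation discussed above: push now only if it is more efficient than waiting for the next pull request.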
<snip on the introductory math/>

> The first result of the above model is that site
> administrators cannot decide whether or not a particular
> resource needs to be cached since they don't have a way to
> measure the efficiency of the cache on that particular
> resource: they don't have all the necessary information.
>
> So:
>
> +----------------------------------------------------------+
> | Result #1:                                               |
> |                                                          |
> | To obtain optimal caching efficiency, the system must be |
> | totally adaptive on all cacheable resources.             |
> |                                                          |
> | which means (in Cocoon terms)                            |
> |                                                          |
> | The sitemap should *NOT* contain caching information     |
> | since the caching concern and discrimination doesn't     |
> | belong to any individual's concern area.                 |
> +----------------------------------------------------------+

This is a side issue: although a site administrator may not have the information needed at run time to know whether a given fragment should be cached, they may still have knowledge of the cachability of a fragment. For example, they may know that the HR system generates a new set of reports into a given directory every weekday night between 3:00 and 4:00 AM. Outside of those times the fragments are eligible for caching. It would be nice if there were an easy way to configure this without having to create your own generator (assuming one exists that can already do the job).

> - o -
>
> There are three possible ways to generate a resource
>
> 1) ---> cache? -(no)--> production --->
> 2) ---> cache? -(yes)-> valid? -(no)--> production --> storage -->
> 3) ---> cache? -(yes)-> valid? -(yes)-> lookup --->

With fragments one also has to allow for intermediate versions somewhere in between these, which brings me to my main reason for questioning the assumption that caching applies only to cacheable pipelines: since fragments haven't been serialized (as final output), they need to be (at the moment) persisted representations of SAX streams or DOM instances.
We've discussed in the past whether this could not be improved upon with some form of intermediate-results database being used to capture and manage a more abstract infoset. (Slide comes up in this context, but I don't know enough about it to judge its applicability.) The issue that plays into this is generator push vs. transformer pull, and in particular XML parsing and transformation models. Consider, for example, a standard Xalan transformation (ignoring the XSLT itself):

  generator push -> parse -> DTM -> transform pull -> transform push -> parse -> etc.

Now a second call for the same transform comes along. Perhaps the generator fragments are cached, but everything from the parse on still happens (i.e. generalized transforms with external resource hooks). What if instead the DTM (or similar) itself was the cached instance? This obviously ties Cocoon directly to a particular parser (or requires standardized handling of some infoset model!), but I hope one can see why this is desirable.

Essentially, I'm raising the question of whether more efficient caching isn't tied to directly retaining the intermediate results of the parser and placing these results in a normalized database of sorts. At this point caching vs. non-caching pipeline isn't as much of an issue as determining that, for a given resource, the ergodic period is such that it just doesn't make sense to keep the result in the cache...

<snip on intro to efficiency model/>

> Thus, the discriminating algorithm is:
>
>  - generate a random real value between [0,1] ---> n
>  - obtain the caching efficiency for the given resource --> eff(r)
>  - calculate the chance of caching ---> c(eff(r))
>  - perform caching if n < c(eff(r))

Why a random n? Doesn't it make more sense to start with n = 1 and decrease n only as resources become scarce? In other words, isn't n your (current) cost-of-caching measure?
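The quoted algorithm and the deterministic variant questioned above can be sketched side by side. Note that `eff(r)` and `c()` here are placeholders (a simple clamp stands in for whatever efficiency model the RT defines); only the discrimination step itself is taken from the quote.

```python
import random

def chance_of_caching(eff):
    """c(eff(r)): some monotone map of efficiency into [0, 1].
    A plain clamp is used as a stand-in for the RT's definition."""
    return max(0.0, min(1.0, eff))

def should_cache(eff_r, n=None):
    """Quoted algorithm: draw n uniformly in [0, 1] and cache when
    n < c(eff(r)).  Passing n explicitly gives the deterministic
    variant suggested above, where n is a threshold tracking the
    current cost of caching rather than a random draw."""
    if n is None:
        n = random.random()
    return n < chance_of_caching(eff_r)

# Deterministic variant: with the threshold at 0.3, an efficient
# resource is cached and an inefficient one is not.
print(should_cache(0.9, n=0.3))   # True
print(should_cache(0.2, n=0.3))   # False
```

The substance of the question stands independently of the sketch: if `n` encodes resource scarcity deterministically, identical resources are treated identically, whereas the random draw only caches them in proportion to `c(eff(r))`.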
<BIG snip/>

> Assuming that memory represents an ordered collection of
> randomly accessible bytes, the act of 'storing a resource'
> and 'getting a resource' imply the act of 'serialization' and
> 'deserialization' of the resource.

If you view the cache as (potentially) more of a DTM-like database, then I think the model is more like Direct Memory Access: no need to move things in or out of the cache; instead you're operating directly on the cache (the intermediate results and the cache are one and the same).

<big snip/>

> So far we have treated the pipelines as they were composed
> only by generators and transformers. In short, each pipeline
> can be seen as a SAX producer. This is a slight difference
> from the original C2 term of pipeline that included a
> serializer as well, but it has been shown how the addition of
> xinclusion requires the creation of two different terms to
> define pipelines that are used only internally for inclusion
> or those pipelines that "get out" and must therefore be serialized.
>
> I have the feeling that this requires some sitemap semantics
> modification, allowing administrators to clearly separate
> those resources who are visible from the outside (thus
> require a serializer) and those who are internal only (thus
> don't require a serializer).

I think this is partially addressed by caching the intermediate results, though as you state, there are still clearly two different types of cache: the internal intermediate-results cache and the serialized final-results cache (when available).

<medium sized snip/>

> It must be noted that normal operations like XSLT
> transformation cannot provide a maximum age because there is
> no information on when the stylesheet can be changed. On the
> other hand, it's not normally harmful to have old
> stylesheets, so it's up to the administrator to tune the
> caching system for their needs.
Hmm, perhaps another reason for the administrator to be able to provide pipeline-level caching configuration information? I.e. "I consider this XSLT completely stable (it hasn't been touched in years)" vs. "this XSLT is still under development (updated hourly!)"...

<small snip/>

> Awaiting for your comments, I thank you all for the patience :)

I'd still like to get into the particulars of the RT explicitly; most of it, and the subsequent discussion on the list, seems to be heading in a good direction. So far I'm not trying to throw any wrenches into the current work, but rather to raise the question of whether there isn't perhaps a better way to pipeline XML than that afforded by being able to plug and play parsers and transformers....