Stefano Mazzocchi <[EMAIL PROTECTED]> writes (and writes, and writes, and writes):
<small snip/>

> WARNING: this RT is long! and very dense, so I suggest you to
> turn on your printer.

I don't have time to go through this in detail yet, but I've had a couple of fundamental questions that it might be useful to raise. I suspect the answer to some of these questions is more of a Cocoon 3.0 type of solution than anything that would happen short term, but nonetheless it might be possible to consider some of them at the moment (and I may never get around to writing it later)....

<small snip/>

> Final note: we are discussing resources which are produced
> using a "cacheable" pipeline *ONLY*. If the pipeline is not
> cacheable (means: it's not entirely composed of cache-aware
> components) caching never takes place.

Strange as it may seem, I think this statement might actually be questionable! It raises the question of what we mean by caching in the first place. You touch on this later, but let me suggest a couple of possible answers here:

- client caching (304 headers, ...)
- proxied caching
- server caching (what the RT is mostly about?)

Within server caching we can still dredge up more detail. In particular, with Cocoon we need to analyze the very mechanics of why caching is an issue at all:

1) Cocoon allows the production of various serialized outputs from dynamic inputs. If nothing is dynamic, no caching is needed (go direct to the source). Or, alternately, think of the source as being the cache!

2) Within Cocoon, dynamic production (ignoring Readers for the moment) is done via the creation and later serialization of SAX events.

To put it another way, within Cocoon caching is needed to optimize the production and serialization of the SAX events. The fact is, for some moment in time the SAX events are persisted within Cocoon, and ultimately the serialized results are also persisted and hopefully cached. (With partially cacheable pipelines the serialized results cannot be cached.)
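To make the "entirely composed of cache-aware components" condition from the quoted note concrete, here is a minimal sketch. All names are illustrative, not actual Cocoon APIs: the point is only that a single non-cache-aware component makes the whole pipeline non-cacheable.

```python
# Hypothetical sketch: a pipeline is cacheable only if *every* component
# (generator, transformers, serializer) is cache-aware, i.e. can report
# validity for its output. Names are illustrative, not Cocoon classes.

class Component:
    def __init__(self, name, cache_aware):
        self.name = name
        self.cache_aware = cache_aware

def pipeline_is_cacheable(components):
    """Caching takes place only when all components are cache-aware."""
    return all(c.cache_aware for c in components)

pipe = [Component("file-generator", True),
        Component("xslt-transformer", True),
        Component("sql-transformer", False),   # e.g. volatile DB access
        Component("html-serializer", True)]

# One non-cache-aware component spoils the whole pipeline:
print(pipeline_is_cacheable(pipe))
```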
Skimming through it, most of this paper seems to deal with the issue of how to determine, in an efficient manner, whether it is more efficient to retain these cached resources for some duration less than their normal ergodic period or to regenerate them from scratch. This immediately raises the question of how to determine the ergodic period of an item. At first it would seem that if there is no way to determine the ergodic period of a fragment, there is no reason to cache it!

However, there is an alternative method of using the cache (which Geoff Howard has been working on): an event-invalidated cache. In this model cache validity is determined by some event external to the production of the cached fragment, and the cached fragment has no natural ergodic period. Such fragments still fit mostly within the model given here: although we do not know when the external event may transpire, we can still determine whether it is more efficient to regenerate the fragment from scratch than to retain it in cache. If a cache-invalidating event transpires then, for such fragments, it may also make sense to push the new version of the fragment into the cache at that time. Common use cases might be CMSs where authoring or editing events are expensive and rare (e.g. regenerating Javadoc). In our case, we have a large set of metadata that is expensive to generate but rarely updated. This metadata is global across all users, and if there are resources available we want it in the cache.

This points out that in order to push something into cache, one wants to make the same calculation the cache manager would make to expire it from cache: is it more efficient to push a new version of this now? If not, there may eventually be a pull request, at which point the normal cache evaluation will determine how long to keep the new fragment cached.
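The event-invalidated model described above can be sketched as follows. This is only an illustration of the idea (entries with no natural ergodic period, evicted by an external event that may also push a fresh version); the class and method names are hypothetical, not Geoff Howard's actual implementation.

```python
# Sketch of an event-invalidated cache: validity is revoked by an
# external event rather than a TTL/ergodic period; the (expensive,
# rare) producer may push a new version at invalidation time.

class EventInvalidatedCache:
    def __init__(self):
        self._store = {}          # key -> cached fragment
        self._listeners = {}      # event name -> keys depending on it

    def put(self, key, fragment, events):
        self._store[key] = fragment
        for ev in events:
            self._listeners.setdefault(ev, set()).add(key)

    def get(self, key):
        # Valid indefinitely, until an event evicts it.
        return self._store.get(key)

    def fire(self, event, regenerate=None):
        """External event invalidates dependent keys; optionally a
        producer callback pushes new versions into cache right away."""
        for key in self._listeners.get(event, set()):
            self._store.pop(key, None)
            if regenerate is not None:
                self._store[key] = regenerate(key)

cache = EventInvalidatedCache()
cache.put("metadata", "<report v='1'/>", events=["hr-reports-updated"])
cache.fire("hr-reports-updated", regenerate=lambda k: "<report v='2'/>")
print(cache.get("metadata"))
```

Whether `fire()` should push (regenerate) or merely evict is exactly the cost calculation discussed above: push now only if it is more efficient than waiting for the next pull request.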
<snip on the introductory math/>

> The first result of the above model is that site
> administrators cannot decide whether or not a particular
> resource needs to be cached since they don't have a way to
> measure the efficiency of the cache on that particular
> resource: they don't have all the necessary information.
>
> So:
>
> +----------------------------------------------------------+
> | Result #1:                                               |
> |                                                          |
> | To obtain optimal caching efficiency, the system must be |
> | totally adaptive on all cacheable resources.             |
> |                                                          |
> | which means (in Cocoon terms)                            |
> |                                                          |
> | The sitemap should *NOT* contain caching information     |
> | since the caching concern and discrimination doesn't     |
> | belong to any individual's concern area.                 |
> +----------------------------------------------------------+

This is a side issue: although a site administrator may not have the information needed at run time to know whether a given fragment should be cached, they may still have knowledge of the cachability of a fragment. For example, they may know that the HR system generates a new set of reports into a given directory every weekday night between 3:00 and 4:00 AM. Outside of those times the fragments are eligible for caching. It would be nice if there were an easy way to configure this without having to create your own generator (assuming one exists that can already do the job).

> - o -
>
> There are three possible ways to generate a resource
>
> 1) ---> cache? -(no)--> production --->
> 2) ---> cache? -(yes)-> valid? -(no)--> production --> storage -->
> 3) ---> cache? -(yes)-> valid? -(yes)-> lookup --->

With fragments one also has to allow for intermediate versions somewhere in between these, which brings me to my main reason for questioning the assumption that caching applies only to cacheable pipelines: since fragments haven't been serialized (as final output), they need to be (at the moment) persisted representations of SAX streams or DOM instances.
We've discussed in the past whether this could not be improved upon with some form of intermediate-results database being used to capture and manage a more abstract infoset. (Slide comes up in this context, but I don't know enough about it to judge its applicability.) The issue that plays into this is generator push vs. transformer pull, and in particular XML parsing and transformation models. Consider, for example, a standard Xalan transformation (ignoring the XSLT itself):

  generator push -> parse -> DTM -> transform pull -> transform push -> parse -> etc.

Now a second call for the same transform comes along. Perhaps the generator fragments are cached, but everything from the parse on still happens (i.e. generalized transforms with external resource hooks). What if instead the DTM (or similar) itself was the cached instance? This obviously ties Cocoon directly to a particular parser (or requires standardized handling of some infoset model!), but I hope one can see why this is desirable.

Essentially, I'm raising the question of whether more efficient caching isn't tied to directly retaining the intermediate results of the parser and placing these results in a normalized database of sorts. At this point caching vs. non-caching pipeline isn't as much of an issue as determining that, for a given resource, the ergodic period is such that it just doesn't make sense to keep the result in the cache...

<snip on intro to efficiency model/>

> Thus, the discriminating algorithm is:
>
>  - generate a random real value between [0,1] ---> n
>  - obtain the caching efficiency for the given resource --> eff(r)
>  - calculate the chance of caching ---> c(eff(r))
>  - perform caching if n < c(eff(r))

Why a random n? Doesn't it make more sense to start with n = 1 and decrease n only as resources become scarce? In other words, isn't n your (current) cost-of-caching measure?
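The quoted algorithm and the deterministic variant questioned above can be sketched side by side. Note that `eff(r)` and `c()` here are placeholders (a simple clamp stands in for whatever efficiency model the RT defines); only the discrimination step itself is taken from the quote.

```python
import random

def chance_of_caching(eff):
    """c(eff(r)): some monotone map of efficiency into [0, 1].
    A plain clamp is used as a stand-in for the RT's definition."""
    return max(0.0, min(1.0, eff))

def should_cache(eff_r, n=None):
    """Quoted algorithm: draw n uniformly in [0, 1] and cache when
    n < c(eff(r)).  Passing n explicitly gives the deterministic
    variant suggested above, where n is a threshold tracking the
    current cost of caching rather than a random draw."""
    if n is None:
        n = random.random()
    return n < chance_of_caching(eff_r)

# Deterministic variant: with the threshold at 0.3, an efficient
# resource is cached and an inefficient one is not.
print(should_cache(0.9, n=0.3))   # True
print(should_cache(0.2, n=0.3))   # False
```

The substance of the question stands independently of the sketch: if `n` encodes resource scarcity deterministically, identical resources are treated identically, whereas the random draw only caches them in proportion to `c(eff(r))`.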
<BIG snip/>

> Assuming that memory represents an ordered collection of
> randomly accessible bytes, the act of 'storing a resource'
> and 'getting a resource' imply the act of 'serialization' and
> 'deserialization' of the resource.

If you view the cache as (potentially) more of a DTM-like database, then I think the model is more like Direct Memory Access: no need to move things in or out of the cache; instead you're operating directly on the cache (the intermediate results and the cache are one and the same).

<big snip/>

> So far we have treated the pipelines as they were composed
> only by generators and transformers. In short, each pipeline
> can be seen as a SAX producer. This is a slight difference
> from the original C2 term of pipeline that included a
> serializer as well, but it has been shown how the addition of
> xinclusion requires the creation of two different terms to
> define pipelines that are used only internally for inclusion
> or those pipelines that "get out" and must therefore be serialized.
>
> I have the feeling that this requires some sitemap semantics
> modification, allowing administrators to clearly separate
> those resources who are visible from the outside (thus
> require a serializer) and those who are internal only (thus
> don't require a serializer).

I think this is partially addressed by caching the intermediate results, though as you state, there are still clearly two different types of cache: the internal intermediate-results cache and the serialized final-results cache (when available).

<medium sized snip/>

> It must be noted that normal operations like XSLT
> transformation cannot provide a maximum age because there is
> no information on when the stylesheet can be changed. On the
> other hand, it's not normally harmful to have old
> stylesheets, so it's up to the administrator to tune the
> caching system for their needs.
Hmm, perhaps another reason for the administrator to be able to provide pipeline-level caching configuration information? I.e. "I consider this XSLT completely stable (it hasn't been touched in years)" vs. "this XSLT is still under development (updated hourly!)"...

<small snip/>

> Awaiting for your comments, I thank you all for the patience :)

I'd still like to get into the particulars of the RT explicitly; most of it, and the subsequent discussion on the list, seems to be heading in a good direction. So far I'm not trying to throw any wrenches into the current work, but rather to raise the question of whether there isn't perhaps a better way to pipeline XML than that afforded by being able to plug and play parsers and transformers....