On Thursday, Jul 17, 2003, at 13:29 America/Guayaquil, Hunsberger, Peter wrote:


Stefano Mazzocchi <[EMAIL PROTECTED]> writes (and writes, and writes,
and writes):

LOL!


<small snip/>

WARNING: this RT is long! and very dense, so I suggest you
turn on your printer.

I don't have time to go through this in detail yet, but I have a couple of fundamental questions that may be useful to raise. I think the answer to some of these questions is more of a Cocoon 3.0 type of solution than anything that would happen short term, but nonetheless it might be possible to consider some of them at the moment (and I may never get around to writing this later)....

<small snip/>

Final note: we are discussing resources which are produced
using a "cacheable" pipeline *ONLY*. If the pipeline is not
cacheable (meaning it is not entirely composed of cache-aware
components), caching never takes place.

Strange as it may seem, I think this statement might actually be questionable! This raises the question of what we mean by caching in the first place. You touch on this later, but let me suggest a couple of possible answers here:

- client caching, 304 headers...

- proxied caching

- server caching (what the RT is mostly about)

All caching originates from the server: even proxy/client caching happens only after the server attaches some metadata to the response.


I agree that Cocoon should be as proxy/client-cache friendly as possible: that means, if the caching logic of the pipeline components can yield an ergodic period, we signal it to the proxy/client.

If not, we can trigger the resource validity estimation and return an empty HTTP response with the proper code (304, as you note above) to signify that the proxied/client-cached data is still valid and we don't have to regenerate it, look it up, or resend it.

Everything else is the internal cache's concern.
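That revalidation path can be sketched in a few lines. This is a minimal sketch with hypothetical names, not the actual Cocoon servlet code: compare the client's If-Modified-Since timestamp against the resource's last-modified time and short-circuit with 304 Not Modified when nothing has changed.

```java
// Sketch of the proxy/client revalidation path described above.
// Hypothetical helper, not Cocoon API.
public class ConditionalGet {

    // ifModifiedSince: millis from the If-Modified-Since header, or -1 if absent.
    // lastModified: millis when the resource last changed.
    public static int status(long ifModifiedSince, long lastModified) {
        if (ifModifiedSince >= 0 && lastModified <= ifModifiedSince) {
            // Cached copy still valid: empty 304 response, nothing is
            // regenerated, looked up, or resent.
            return 304;
        }
        return 200; // regenerate (or serve from the internal cache)
    }
}
```

With a 304 the response body stays empty, which is exactly the "don't regenerate, look up, or resend" case.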

Within server caching we can still dredge up more detail.  In
particular, with Cocoon we need to analyze the very mechanics of why
caching is an issue at all:

1) Cocoon allows the production of various serialized outputs from
dynamic inputs. If nothing is dynamic, no caching is needed (go directly
to the source). Or, alternatively, think of the source as being the cache!


2) Within Cocoon, dynamic production (ignoring Readers for the moment)
is done via the creation and later serialization of SAX events.

To put it another way, within Cocoon caching is needed to optimize the
production and serialization of the SAX events. The fact is, for some
moment in time the SAX events are persisted within Cocoon, and ultimately
the serialized results are also persisted and hopefully cached. (With
partially caching pipelines the serialized results cannot be cached.) As
I skim through it, most of this paper seems to deal with the issue of
determining, in an efficient manner, whether it is more efficient to
retain these cached resources for some duration less than their normal
ergodic period or to regenerate them from scratch. This immediately
raises the question of how to determine the ergodic period of an item.

Yep, that's a big concern.

At first it would seem that if there is no way to determine the ergodic
period of a fragment there is no reason to cache it! However, there is
an alternative method of using the cache (which Geoff Howard has been
working on) which is to have an event invalidated cache. In this model
cache validity is determined by some event external to the production of
the cached fragment and the cached fragment has no natural ergodic
period. Such fragments still fit mostly within the model given here:
although we do not know when the external event may transpire we can
still determine that it is more efficient to regenerate the fragment
from scratch than retain it in cache.

I agree. Also let me point out that the logic of cache invalidation of fragments is totally orthogonal to the adaptive algorithms described.
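The event-invalidated model described above could be sketched like this. Illustrative names only, not Geoff Howard's actual implementation; note the cached entries carry no ergodic period at all, only an event that expires them (one key per event, to keep the sketch short):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an event-invalidated cache: validity is decided by an
// external event, not by any natural expiry time of the fragment.
public class EventCache {
    private final Map<String, String> entries = new HashMap<>();     // key -> fragment
    private final Map<String, String> keysByEvent = new HashMap<>(); // event -> key

    // Cache a fragment and remember which external event invalidates it.
    public void put(String key, String fragment, String event) {
        entries.put(key, fragment);
        keysByEvent.put(event, key);
    }

    public String get(String key) {
        return entries.get(key);
    }

    // An external event (e.g. "document-edited") expires the fragment.
    public void invalidate(String event) {
        String key = keysByEvent.remove(event);
        if (key != null) entries.remove(key);
    }
}
```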


If a cache-invalidating event transpires then, for such fragments, it
may also make sense to push the new version of the fragment into the
cache at that time. Common use cases might be CMSs where authoring or
editing events are expensive and rare (e.g. regenerating Javadoc). In our
case, we have a large set of metadata that is expensive to generate but
rarely updated. This metadata is global across all users, and if there
are resources available we want it in the cache.


This points out that in order to push something into cache one wants to
make the same calculation as the cache manager would make to expire it
from cache: is it more efficient to push a new version of this now? If
not, there may eventually be a pull request, at which point the normal
cache evaluation will determine how long to keep the new fragment
cached.
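The "same calculation" could be sketched as a simple cost comparison. All names and cost terms here are illustrative assumptions, not the RT's actual efficiency formula: push the regenerated fragment only if storing it once beats reproducing it on every expected request.

```java
// Sketch of a push-vs-wait decision reusing cache-manager style costs.
// Hypothetical names; the real model is the RT's efficiency function.
public class PushDecision {

    // Cost of serving the next N requests with the fragment pushed now
    // (one production, one store, N cheap lookups) versus without it
    // (one production per request).
    public static boolean worthPushing(double productionCost,
                                       double lookupCost,
                                       double storageCost,
                                       int expectedRequests) {
        double withoutCache = productionCost * expectedRequests;
        double withCache = productionCost + storageCost
                         + lookupCost * expectedRequests;
        return withCache < withoutCache;
    }
}
```

An expensive, frequently pulled fragment is worth pushing; a cheap or rarely pulled one is not.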

Hmmm, very interesting point. Didn't think about this.... I'll let it percolate thru my synapses a little before replying...hmmm...


<snip on the introductory math>

The first result of the above model is that site
administrators cannot decide whether or not a particular
resource needs to be cached since they don't have a way to
measure the efficiency of the cache on that particular
resource: they don't have all the necessary information.

So:

       +----------------------------------------------------------+
       | Result #1:                                               |
       |                                                          |
       | To obtain optimal caching efficiency, the system must be |
       | totally adaptive on all cacheable resources.             |
       |                                                          |
       |             which means (in Cocoon terms)                |
       |                                                          |
       | The sitemap should *NOT* contain caching information     |
| since the caching concern and its discrimination do not  |
       | belong to any individual's concern area.                 |
       +----------------------------------------------------------+

This is a side issue: although a site administrator may not have the
information needed at run time to know whether a given fragment should
be cached, they may still have knowledge of the cacheability of a
fragment. For example, they may know that the HR system generates a new
set of reports into a given directory every weekday night between 3:00
and 4:00AM. Outside of those times the fragments are eligible for
caching. It would be nice if there were an easy way to configure this
without having to create your own generator (assuming one exists that
can already do the job).
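The HR-reports example amounts to an administrator-declared validity window. A minimal sketch, assuming such a hypothetical helper existed (it is not an existing Cocoon component):

```java
import java.time.LocalTime;

// Sketch of a-priori cacheability knowledge: fragments are cacheable
// except while the source is known to be rewritten (e.g. 3:00-4:00AM).
public class ValidityWindow {
    private final LocalTime start;
    private final LocalTime end;

    public ValidityWindow(LocalTime start, LocalTime end) {
        this.start = start;
        this.end = end;
    }

    // Cacheable outside the configured [start, end) rewrite window.
    public boolean isCacheable(LocalTime now) {
        return now.isBefore(start) || !now.isBefore(end);
    }
}
```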

You are totally right. The above result is a little too strong. An adaptive system should benefit from a-priori knowledge of how the environment behaves. It might also allow an easier migration path for hard-core sysadmins who don't believe in math ;-)



- o -

There are three possible ways to generate a resource:

  1)  ---> cache? -(no)--> production --->
  2)  ---> cache? -(yes)-> valid? -(no)--> production --> storage -->
  3)  ---> cache? -(yes)-> valid? -(yes)-> lookup --->
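These three paths can be sketched as a small dispatch function (hypothetical names; the real pipeline machinery is SAX-based, not string-based):

```java
import java.util.Map;
import java.util.function.Supplier;

// Sketch of the three resource-generation paths above.
public class ResourceFlow {
    public static String serve(String key, boolean cacheable, boolean valid,
                               Supplier<String> produce, Map<String, String> store) {
        if (!cacheable) {
            return produce.get();     // 1) straight production, no cache involved
        }
        if (!valid) {                 // 2) produce, then store for next time
            String result = produce.get();
            store.put(key, result);
            return result;
        }
        return store.get(key);        // 3) pure cache lookup
    }
}
```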

With fragments one also has to allow for intermediate versions somewhere
in between these, which moves me on to my main reason for questioning
the assumption about caching only applying to cacheable pipelines: since
fragments haven't been serialized (as final output) they need to be (at
the moment) persisted representations of SAX streams or DOM instances.
We've discussed in the past whether this could not be improved upon with
some form of intermediate results database being used to capture and
manage a more abstract infoset. (Slide comes up in this context, but I
don't know enough about it to judge its applicability.) The issue that
plays into this is generator push vs. transformer pull, and in particular
XML parsing and transformation models. Consider, for example, a standard
Xalan transformation:


        generator push -> parse -> DTM -> transform pull -> transform push -> parse -> etc.

(ignoring the XSLT itself). Now a second call for the same transform
comes along. Perhaps the generator fragments are cached, but everything
from the parse on still happens (i.e. generalized transforms with
external resource hooks). What if instead the DTM (or similar) itself
were the cached instance? This obviously ties Cocoon directly to a
particular parser (or requires standardized handling of some infoset
model!), but I hope one can see why this is desirable. Essentially, I'm
raising the question of whether more efficient caching isn't tied to
directly retaining the intermediate results of the parser and placing
these results in a normalized database of sorts. At this point caching
vs. non-caching pipeline isn't as much of an issue as determining that
for a given resource the ergodic period is such that it just doesn't
make sense to keep the result in the cache...
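One reading of "retain the intermediate results of the parser" can be sketched with a DOM standing in for Xalan's DTM. This is purely illustrative (hypothetical class, not a proposal for Cocoon's actual cache): cache the parsed in-memory model so a second request skips the parse step entirely.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch: cache the parser's in-memory result (DOM here, in place of a
// DTM) keyed by resource, instead of re-parsing on every request.
public class ParsedCache {
    private final Map<String, Document> parsed = new HashMap<>();
    private final DocumentBuilder builder;

    public ParsedCache() {
        try {
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public Document get(String key, String xml) {
        Document doc = parsed.get(key);
        if (doc == null) {
            try { // first request: parse once and keep the infoset
                doc = builder.parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            parsed.put(key, doc);
        }
        return doc; // later requests: no parse at all
    }
}
```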

I think you really lost me here. What does it mean to "retain the intermediate results of the parser"? What are you referring to? And what kind of database do you envision in a push pipe? Sorry, I don't get it, but I smell something interesting, so please elaborate more.



<snip on intro to efficiency model/>


Thus, the discriminating algorithm is:

  - generate a random real value in [0,1] ---> n
  - obtain the caching efficiency for the given resource --> eff(r)
  - calculate the chance of caching ---> c(eff(r))
  - perform caching if n < c(eff(r))
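The discriminating algorithm above, as a direct sketch. The shape of c(eff) is a placeholder assumption (the RT leaves it open); the only properties used here are that it is monotone and maps efficiency into [0,1):

```java
import java.util.Random;

// Sketch of the probabilistic caching discriminator.
public class Discriminator {

    // Hypothetical monotone mapping: efficiency in [0,inf) -> chance in [0,1).
    public static double chance(double eff) {
        return eff / (eff + 1.0);
    }

    public static boolean shouldCache(double eff, Random rng) {
        double n = rng.nextDouble(); // random real in [0,1)
        return n < chance(eff);      // cache with probability c(eff(r))
    }
}
```

High-efficiency resources are cached almost always, low-efficiency ones almost never, but every resource keeps a nonzero chance to be re-evaluated.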

Why a random n? Doesn't it make more sense to start with n = 1 and decrease n only as resources become scarce? In other words, isn't n your (current) cost of caching measure?

See my reply to Berin; maybe that gives you some insight into why I chose a probabilistic approach to the adaptive nature of the system.


This said, I admit there are tons of other ways to achieve similar functionality; I just expressed one.


<BIG snip/>



Assuming that memory represents an ordered collection of randomly accessible bytes, the act of 'storing a resource' and 'getting a resource' imply the act of 'serialization' and 'deserialization' of the resource.

If you view the cache as (potentially) more of a DTM-like database, then I think the model is perhaps more like Direct Memory Access: no need to move things into or out of the cache; instead, you operate directly on the cache (the intermediate results and the cache are one and the same).

I feel this is related to the above 'database of sorts'.


<big snip/>

So far we have treated the pipelines as if they were composed
only of generators and transformers. In short, each pipeline
can be seen as a SAX producer. This is a slight difference
from the original C2 term of pipeline, which included a
serializer as well, but it has been shown how the addition of
xinclusion requires the creation of two different terms to
distinguish pipelines that are used only internally for
inclusion from those pipelines that "get out" and must
therefore be serialized.

I have the feeling that this requires some sitemap semantics
modification, allowing administrators to clearly separate
those resources that are visible from the outside (and thus
require a serializer) from those that are internal only (and
thus don't require a serializer).

I think this is partially addressed by caching the intermediate results,
though as you state, there are still clearly two different types of
cache: the internal intermediate results cache and the serialized final
results cache (when available).

This has already been implemented ;-) Content aggregation works exactly like this in today's Cocoon (and has for a while).



<medium sized snip/>


It must be noted that normal operations like XSLT
transformation cannot provide a maximum age because there is
no information on when the stylesheet can be changed. On the
other hand, it's not normally harmful to have old
stylesheets, so it's up to the administrator to tune the
caching system for their needs.

Hmm, perhaps another reason for the administrator to be able to provide pipeline-level caching configuration information? I.e., "I consider this XSLT completely stable (hasn't been touched in years)" vs. "this XSLT is still under development (updated hourly!)"...

Very true!
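In sitemap terms the suggestion might look something like this. This is a hypothetical extension, only showing where such per-stylesheet stability hints could live (`assume-valid-for` is an invented parameter, not existing Cocoon syntax, though `map:transform` and `map:parameter` are real sitemap elements):

```xml
<!-- Hypothetical: declare stylesheet stability in the sitemap. -->
<map:transform src="stylesheets/site2html.xsl">
  <!-- stable for years: treat as valid indefinitely -->
  <map:parameter name="assume-valid-for" value="unlimited"/>
</map:transform>

<map:transform src="stylesheets/experimental.xsl">
  <!-- updated hourly: recheck after 3600 seconds -->
  <map:parameter name="assume-valid-for" value="3600"/>
</map:transform>
```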



<small snip/>


Awaiting your comments, I thank you all for your patience :)

I'd still like to get into the particulars of the RT explicitly; most of
it, and the subsequent discussion on the list, seems to be heading in a
good direction. So far I'm not trying to throw any wrenches into the
current work, but rather to raise the question of whether there isn't
perhaps a better way to pipeline XML than is afforded by being able to
plug and play parsers and transformers....

I'm all ears.


--
Stefano.


