Re: UUID musings and area bashing

solprovider Tue, 15 Aug 2006 19:35:34 -0700

There was no possibility I could ignore that subject line.


On 8/15/06, Joern Nettingsmeier <[EMAIL PROTECTED]> wrote:

Andreas Hartmann wrote:
>> * are UUIDs unique across publications?
>>   -> if yes, {pubId} is redundant. do we want to drag it along?
> It would be great if we could omit it, but this would require a
> performant lookup mechanism. Or we just put all content in a
> single box, and add the publication ID to the meta data. This
> sounds quite appealing to me.
+1 for all-in-one-box, but -1 for adding it to the per-{UUID+lang}
metadata. that's suboptimal from an index efficiency pov. the
publication should maintain a list of which UUIDs belong to it. this is
also easier to debug, since you can see everything at one glance.
moreover, the property "belonging to <pub>" is orthogonal to revisions,
while per-{UUID+lang} metadata is not.


Speaking from the future, it is possible to put all Resources into one
datastore.  We would need to make the SitetreeGenerator aware that it
must select based on Publication when using a flat Index, but that
would be easy.
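
A minimal sketch of the one-box idea, not Lenya code: all Resources sit in a single flat store, and each Publication keeps its own list of IDs, as suggested in the quoted text. The names `store`, `members`, and `resources_for` are hypothetical and exist only for illustration.

```python
# Flat store: every Resource lives in one mapping, keyed by its unique ID.
# (Plain strings stand in for real Resources in this sketch.)
store = {
    "a1": "Home (pub-one)",
    "b2": "Home (pub-two)",
    "c3": "About (pub-one)",
}

# Each Publication maintains the list of IDs that belong to it,
# instead of stamping the publication onto every Resource's metadata.
members = {
    "pub-one": ["a1", "c3"],
    "pub-two": ["b2"],
}


def resources_for(publication, store, members):
    """Select only the Resources of one Publication from the flat store."""
    return [store[unid] for unid in members.get(publication, []) if unid in store]
```

A sitetree generator working on a flat index would make this kind of selection before building the tree for one Publication.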

>> * UUIDs are definitely orthogonal to revisions. we do not need to access
>> revisions other than "current" most of the time, but we should make it
>> possible now in order to avoid having to tack on another mechanism for
>> situations where revisions are involved.
> +1, this sounds useful.
so the "unique index" to borrow from database theory would be tuple
{UUID+lang+revision}


"live" is the name of the "current" Revision.  "live" is shorter, and
backwards-compatible with 1.2 when it was famous as the name of the
primary "Area" (an obsolete concept used to scatter information about
a Resource into many locations to increase the difficulty of
administration.)

1.3's content: protocol uses "UUID_language!revision".  But the
current language and "live" revision are assumed if they are omitted,
so the following return the same resource (assuming "en" is the
current language):
content:/uuid
content:/uuid_en!live
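
The defaulting rule can be sketched as a small parser. This is an illustration only, not the 1.3 implementation: the function name, the `ContentAddress` type, and the assumption that the ID itself never contains "_" or "!" are all mine.

```python
from typing import NamedTuple


class ContentAddress(NamedTuple):
    unid: str
    language: str
    revision: str


def parse_content_address(address, current_language="en"):
    """Split 'content:/UUID_language!revision', filling in omitted parts."""
    prefix = "content:/"
    if not address.startswith(prefix):
        raise ValueError(f"not a content: address: {address!r}")
    rest = address[len(prefix):]
    # Assumption: the ID contains neither '!' nor '_'.
    rest, _, revision = rest.partition("!")
    unid, _, language = rest.partition("_")
    if not unid:
        raise ValueError(f"missing ID in {address!r}")
    # Omitted language -> current language; omitted revision -> "live".
    return ContentAddress(unid, language or current_language, revision or "live")
```

With "en" as the current language, both example addresses above parse to the same tuple.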

>> [terminology]
>> are we really "addressing documents"?
>> currently, i find in the sitemaps the term "document-uuid".
>> that implies we use the term "document" to mean "the set of all stored
>> data snippets (including meta) that corresponds to a particular UUID".
>> so we are not addressing documents. we are addressing particular
>> instances of a document in a certain publication, area and language.


This was fun to design.  The content: protocol has three methods:
- DATA is the default.  It returns the Resource's content: XML for
type "xml" and binary files for "file".
- META returns XML.  For type "xml", the result is the same as DATA;
for type "file", it returns the associated "meta" XML.
- INFO returns XML describing structural information about a Resource,
including all Translations and the Revisions of each Translation.
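
A rough sketch of how the three methods differ by Resource type. The `Resource` class, the `fetch` function, and the element names in the INFO output are invented for this example; the real INFO format is not described here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Resource:
    unid: str
    type: str                 # "xml" or "file" in this sketch
    content: object           # XML text for "xml", bytes for "file"
    meta_xml: str = ""        # separate "meta" XML, used by "file" Resources
    translations: dict = field(default_factory=dict)  # language -> revision names


def fetch(resource, method="DATA"):
    """Return what DATA, META, or INFO would yield for a Resource."""
    if method == "DATA":
        return resource.content
    if method == "META":
        # "xml" Resources answer META with the same XML as DATA.
        return resource.content if resource.type == "xml" else resource.meta_xml
    if method == "INFO":
        # Hypothetical structure: one element per Translation and Revision.
        root = ET.Element("resource", {"unid": resource.unid, "type": resource.type})
        for language, revisions in resource.translations.items():
            translation = ET.SubElement(root, "translation", {"language": language})
            for revision in revisions:
                ET.SubElement(translation, "revision", {"name": revision})
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unknown method: {method}")
```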

> At the moment, a document is a particular translation in a particular
> area in a particular publication (we didn't yet change the terminology,
> at least as far as the class names are concerned).
> IIRC we agreed upon the term "translation" for this.
> We don't have a class for "the object that contains all translations
> for a UUID in a certain area" yet. That would be a document/resource/asset
> (IIRC "document" was the preferred term).


I still feel that "document" is too closely associated with "XML
Document" to be used for all resources in a product based on XML.
While there are also "MSWord Documents" and "PDF Documents", they are
not "Documents" in Lenya.  It also feels weird to refer to "JPEG
Documents" and "GIF Documents".  1.2 referred to JPEGs and GIFs (and
MSWord Documents and PDFs) as "Assets", but the distinction between
Documents and Assets is not relevant to the code in 1.3, so 1.3 calls
them all "Resources".  Resources have types "xml", "file", "link", and
(coming soon) "text".  An XML Document is an "xml" Resource.  An
MSWord Document is a "file" Resource.  There is no need to constantly
define the terms so people understand this usage.

We can start this discussion again when others start working on 1.3.

so let me propose the following:
<section status="draft" normative="yes">
the entirety of all data pertaining to what is traditionally called a
"web page" is called a *document* within lenya.


1.2 and 1.3 refer to a "web page" as a Page.  (Think "PageEnvelope".)
You violate the above definition in the next section, because a "web
page" may be composed of more than one "document".  (Think blogs.)

documents are uniquely identified by *UUIDs*, which may therefore be
called *document UUIDs* for extra clarity.


Documents/Resources.  UUIDs/UNIDs.

Documents are a type of Resource, but which type of Resource depends
on the type of Document.  I won't argue if others prefer
working with GIF Documents; I'll just assume English is not their
native language.

A UUID is a UNID.  I have not written the UNID generator yet (because
1.3 does not have the ability to add Resources yet), but I plan to use
UUIDs as UNIDs.  There is no reason not to use UUIDs as the UNIDs, but
the code is more flexible with UNIDs.  (And I don't have to touch the
complicated 1.2->1.3 migration routine again.  It will be easier to
force all non-UUID UNIDs to UUIDs later if we decide to remove the
flexibility.  I expect the files->JCR migration to do something like
that.)
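
For illustration only, since the generator is not written yet: a sketch of using UUIDs as UNIDs while still tolerating older non-UUID UNIDs. The function names are hypothetical; `uuid.uuid4()` and `uuid.UUID()` are from the Python standard library.

```python
import uuid


def new_unid():
    """Generate a fresh UNID; here it is simply a random (version 4) UUID."""
    return str(uuid.uuid4())


def is_uuid(unid):
    """Report whether an existing UNID string already parses as a UUID."""
    try:
        uuid.UUID(unid)
    except ValueError:
        return False
    return True


def force_to_uuid(unid):
    """Keep UUID-shaped UNIDs as they are; replace anything else with a new UUID."""
    return unid if is_uuid(unid) else new_unid()
```

A later migration could run something like `force_to_uuid` over every stored UNID, keeping a map from old to new values so links can be rewritten.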

documents contain one or more *translations*. "translation" here refers
to the actual content, and includes the "original language version",
being a general category.
each translation has *metadata* associated with it.

the terms MUST, MUST NOT and SHOOTING OFFENSE are to be interpreted as
described in RFC2119.
</section>

>> [areas]
>> thinking about andreas' suggestion, it becomes ever more evident to me
>> that the area concept is flawed. areas should be done in altogether in
>> the not too far future.
> I agree that it has to be reconsidered, but should we address in 1.4?
HELL NO! :-D
this is 1.5 stuff. but i should think that the 1.4 cycle will be short
anyway.


Heavy agreement.  Just complete 1.4.  Then come to 1.3 where Areas are
almost forgotten.  (Some of the backwards-compatibility code still
uses the term, but the functions are deprecated.)

>>> An internal link URL might look like this:
>>>   document://{pubId}:{area}:{uuid}:{language}
>> what about lenya: and lenyadoc:? i must confess i have never quite
>> grasped the concept...
> lenya:// is one layer below this, it addresses repository nodes.
does that mean that it's obsolete now? or if not, what is it currently
used for?
> lenyadoc:// is probably fine for links. Maybe we should just use that one.
>> in any case, the protocol should definitely begin with "lenya...", so
>> that it's immediately obvious what's going on in the sitemap.
>> i would even go as far as suggesting that all our input modules and
>> pseudo-protocols that are not suited for upstream cocoon be re-named
>> lenya-fallback, {lenya-docinfo:...} etc.
>> this would greatly reduce the learning curve for our users, and make
>> life easier for casual committers from other apache projects, since it's
>> obvious if custom magic is at work, as opposed to core cocoon
>> functionality.
> -0.5, I'd prefer to keep them short, but it's OK with me to change it.
i strongly feel that cocoon namespaces must be restructured, even at the
cost of increased verbosity. it should be easy to register both the
traditional and the prefixed name for a grace period, and move the
sitemaps over piecemeal without breaking external code too soon.
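
The grace-period idea can be sketched as a registry that accepts both names and warns on the old one. This is not Cocoon's or Lenya's component registration; `ModuleRegistry` and its methods are made up for the example, and only the names "fallback" and "lenya-fallback" come from the discussion.

```python
import warnings


class ModuleRegistry:
    """Toy registry mapping module names to handlers, with legacy aliases."""

    def __init__(self):
        self._modules = {}
        self._deprecated = {}  # legacy name -> preferred name

    def register(self, name, module, legacy_name=None):
        self._modules[name] = module
        if legacy_name is not None:
            # Keep the traditional name working during the grace period.
            self._modules[legacy_name] = module
            self._deprecated[legacy_name] = name

    def lookup(self, name):
        if name in self._deprecated:
            warnings.warn(
                f"'{name}' is deprecated; use '{self._deprecated[name]}'",
                DeprecationWarning,
                stacklevel=2,
            )
        return self._modules[name]
```

Sitemaps still using the short name keep working but produce a warning, so they can be moved over one at a time.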


This is all specific to 1.4.  The fallback: protocol is rather mindless,
and is another well-forgotten concept in 1.3.  I have not used the
other protocols mentioned.  Are any of them useful for something not
handled by 1.3's content: and module: protocols?  We can rename
1.3's protocols for better branding, but I prefer simplicity.

solprovider
