Re: Mapping URLs to documents

Andreas Hartmann Mon, 09 Jan 2006 01:17:57 -0800

Michael Wechner wrote:

Andreas Hartmann wrote:

Hi Lenya devs,

I'd like to raise an issue that has bothered me for quite a long time
and share some random thoughts.

Currently, Lenya is based on the following axioms
(please correct me if I'm wrong):

1. A URL is represented by exactly one document.

What do you mean by one document? In the case of the default publication
this might be true, but one can do it very differently.


No, that's not yet supported by the Lenya framework. Of course you can
implement a custom solution using plain Cocoon.

Joachim Wolfang suggested supporting multiple content items per page
(http://wiki.apache.org/lenya/ProposalContentModel), but AFAIK it is
still in proposal state.

2. A document can be represented by an arbitrary number of URLs.


you mean like "softlinks"?


Actually not quite. Softlinks would be fine, because that would imply
a "real", native URL. But Lenya supports multiple "native" URLs for
a Document object. In the filesystem, that would mean that one and
the same file occurs in multiple paths, without the ability of singling
out one of them.

3. For each document, there is exactly one canonical URL.


what do you mean by canonical URL?


We once created the term to be able to denote a kind of "primary" URL.
It is not clearly defined what this URL should look like. You can
probably compare it to a canonical filesystem path.

If you generate the canonical URLs of two documents d1 and d2, and the
canonical URLs are equal, then d1 and d2 represent the same document.

The DefaultDocumentBuilder returns /foo.html for the default language
and /foo_xx.html for the other languages as canonical URLs.

This is reflected in the following methods:

  DocumentBuilder.buildDocument(...)
  DocumentBuilder.buildCanonicalUrl(...)

If you use the DocumentBuilder, then I guess the above is correct, but I don't
think one has to use the DocumentBuilder.


It's virtually impossible to use Lenya without the DocumentBuilder.
You couldn't use any of the valuable features like workflow, the usecase
framework, transactions etc.

At the moment, the concept of multiple URLs per document is typically
used for language versions (foo_{defaultlanguage}.html = foo.html)
and to support different URL suffixes (foo, foo.htm, foo.html).
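
To make the ambiguity concrete, here is a minimal Python sketch (not Lenya
code; the pattern, the function names and the default language "en" are
assumptions made up for this illustration). It shows how several URLs
collapse to the same document path and language, while only one of them is
the canonical URL:

```python
import re

DEFAULT_LANGUAGE = "en"  # assumption for this sketch

# Matches /foo, /foo.htm, /foo.html, /foo_de.html, ...
URL_PATTERN = re.compile(
    r"^(?P<path>/.+?)(?:_(?P<lang>[a-z]{2}))?(?P<suffix>\.html?)?$"
)


def parse_url(url):
    """Return (document path, language), or None if the URL does not match."""
    match = URL_PATTERN.match(url)
    if match is None:
        return None
    return match.group("path"), match.group("lang") or DEFAULT_LANGUAGE


def canonical_url(path, language):
    """No language suffix for the default language, _xx for all others."""
    if language == DEFAULT_LANGUAGE:
        return path + ".html"
    return f"{path}_{language}.html"


# Four different URLs, one and the same document:
urls = ["/foo", "/foo.htm", "/foo.html", "/foo_en.html"]
same_document = {parse_url(url) for url in urls}
```

With these assumptions, same_document contains the single entry
("/foo", "en"), and canonical_url("/foo", "en") singles out /foo.html.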

The site structure is currently tightly connected to the URL space.
Link URLs are derived directly from the site structure:

  <node id="foo">
    <node id="bar"/>
  </node>

is interpreted as

  /foo/bar

The language version is handled orthogonally to the site structure.
The URL is determined by combining both document ID and language.
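
As a rough illustration of this derivation, the following Python sketch walks
a site structure and combines each node path with a language. It is not the
Lenya implementation: the <site> wrapper element, the function names and the
default language "en" are assumptions for this example.

```python
import xml.etree.ElementTree as ET

SITE_XML = """
<site>
  <node id="foo">
    <node id="bar"/>
  </node>
</site>
"""


def node_paths(element, prefix=""):
    """Yield the path of every <node> below element, depth first."""
    for child in element.findall("node"):
        path = f"{prefix}/{child.get('id')}"
        yield path
        yield from node_paths(child, path)


def url_for(path, language, default_language="en"):
    """Combine document path and language; both are handled independently."""
    if language == default_language:
        return path + ".html"
    return f"{path}_{language}.html"


site = ET.fromstring(SITE_XML.strip())
paths = list(node_paths(site))  # the paths /foo and /foo/bar
```

Here url_for("/foo/bar", "de") gives /foo/bar_de.html, so the URL depends on
the position in the site structure as well as on the language.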


If we want to allow multiple site structures, we have to choose between
the following options:

1. The connection between site structures and URL space is kept.
   This implies:

   - a document has a different canonical URL for each site structure
   - calculating a document's URL depends on the site structure

2. The purpose of the site structure is reduced to building navigation
   widgets etc., the URL space is orthogonal to that.

   - a document has only one single canonical URL
   - the site structure stores the UUID of a document
   - navigating the site structure is not reflected 1:1 in the URL space

I am not sure if I understand you correctly, but I would say we should go
with (2), but I guess if you make an example, e.g.

/en/developers/andreas-hartmann

/de/entwickler/andreas-hartmann

/en/committers/andi

/de/committers/andreas


In my opinion, only one of these URLs should actually represent the document.
The others should merely point to the document, i.e. by redirects, URL
rewriting or another concept like this. If you ask the document for its
URL, there should be only one option that can be returned.
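
A small Python sketch of what I mean, using the example URLs above. The
document id "doc-42", the two tables and the choice of which URL is the native
one are purely hypothetical; redirects are just one of the possible
mechanisms.

```python
# Hypothetical data: exactly one native URL per document ...
NATIVE_URLS = {"/en/developers/andreas-hartmann": "doc-42"}

# ... and any number of aliases which merely point to the native URL.
ALIASES = {
    "/de/entwickler/andreas-hartmann": "/en/developers/andreas-hartmann",
    "/en/committers/andi": "/en/developers/andreas-hartmann",
    "/de/committers/andreas": "/en/developers/andreas-hartmann",
}


def resolve(url):
    """Return (200, document id), (301, target URL) or (404, None)."""
    if url in NATIVE_URLS:
        return 200, NATIVE_URLS[url]
    if url in ALIASES:
        return 301, ALIASES[url]
    return 404, None


def url_of(document_id):
    """Return the single URL that represents the document, or None."""
    for url, doc in NATIVE_URLS.items():
        if doc == document_id:
            return url
    return None
```

The point is that url_of() never has to choose between several candidates,
while resolve() still accepts all four URLs.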

Option (2) implies that, when a document is created, its URL and its location
in the site structure have to be determined. IMO this is just a GUI issue.
In most cases, a default site structure which corresponds to the URL space
will be used to create documents. These documents can be referenced from
other site structures later on.

I'm not particularly fond of the DocumentBuilder concept. With option (2)
and the default site structure it would be obsolete, because the document
could be derived directly from the default site structure. The ambiguity
that multiple, arbitrary URLs can point to a document would be removed.

----

The question is if multiple URLs for a document should be allowed at all.

Sure, why not? I think there are many usecases for that, and existing URL spaces
which couldn't be handled by Lenya if it doesn't support this...


Sure, the system should allow multiple URLs to point to a document.
But, as I already mentioned, there are several concepts to support this:

- redirects
- URL rewriting (proxy)
- soft links
- ... (?)

We don't have to support multiple URLs to natively *represent* a document.

Actually I don't think this is necessary. At the moment, many publications
show the following behaviour:

/foo.html       -> Hello World!
/foo_en.html    -> Hello World!
/foo_de.html    -> Hallo Welt!

Why is the support for /foo_en.html necessary? I see only two reasons:
1. Laziness. You don't have to find out the default language to create a URL.
2. You can switch the default language without creating dead URLs.

IMO neither of them outweighs the disadvantages of an ambiguous URL space.
In fact, (2) should probably be avoided because the content of a document
page changes (it becomes a different language version). So IMO it could look
like this:

/foo.html       -> Hello World!
/foo_en.html    -> 404
/foo_de.html    -> Hallo Welt!

What if you switch the default language to German,


... which is not a good thing to do IMO, see above ...

then suddenly all foo_de become 404?!


You could solve this using redirects, as Solprovider suggested.
Or using softlinks.

Actually this would simplify the URL mapping concept by merging document ID
(or better, document path, to avoid confusion with the UUID) and language.
In the site structure, there wouldn't be multiple language versions of
a document, but only links to documents. The connection between the actual
language versions of a document would be represented in another location
(see ContentNode and Document in o.a.l.cms.repo for more information).

Assuming we have two documents which are language versions of the
same content:

* language="en" uuid="1-en"
* language="de" uuid="1-de"

This could be represented for instance by the following default site structures:

1. /foo.html
   /foo_de.html

   <node id="foo" document-uuid="1-en"/>
   <node id="foo_de" document-uuid="1-de"/>

   (note that the language suffix "_de" is just a part of the URL)

I am not sure if this is a good idea and what the consequences are ... my gut
tells me that it's a bad idea ;-)
(e.g. in the case of switching the default language)



OK, how about this:

    <node id="foo" softlink="1-en"/>
    <node id="foo_en" document-uuid="1-en"/>
    <node id="foo_de" document-uuid="1-de"/>

If you change the default language, you'd have to change the links
(automatically), but IMO this price can be paid.
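
To sketch what "automatically" could look like, here is a small Python
example. Nothing in it is existing Lenya code: the <site> wrapper element, the
LANGUAGE_VERSIONS table (a stand-in for wherever the connection between
language versions is stored) and both functions are assumptions.

```python
import xml.etree.ElementTree as ET

SITE_XML = """
<site>
  <node id="foo" softlink="1-en"/>
  <node id="foo_en" document-uuid="1-en"/>
  <node id="foo_de" document-uuid="1-de"/>
</site>
"""

# Hypothetical stand-in: which UUIDs are language versions of the same content.
LANGUAGE_VERSIONS = [{"en": "1-en", "de": "1-de"}]


def resolve(site, node_id):
    """Return the UUID behind a node, following a softlink if there is one."""
    for node in site.findall("node"):
        if node.get("id") == node_id:
            return node.get("document-uuid") or node.get("softlink")
    return None


def switch_default_language(site, new_language):
    """Re-point every softlink to the version in the new default language."""
    for node in site.findall("node"):
        target = node.get("softlink")
        if target is None:
            continue
        for versions in LANGUAGE_VERSIONS:
            if target in versions.values() and new_language in versions:
                node.set("softlink", versions[new_language])


site = ET.fromstring(SITE_XML.strip())
before = resolve(site, "foo")
switch_default_language(site, "de")
after = resolve(site, "foo")
```

Only the softlink node changes; foo_en and foo_de keep pointing to the same
documents, so none of the native URLs turns into a 404.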

2. /en/foo.html
   /de/foo.html

   <node id="en">
     <node id="foo" document-uuid="1-en"/>
   </node>
   <node id="de">
     <node id="foo" document-uuid="1-de"/>
   </node>

Assuming that a document can only be referenced once in the default site
structure, it is now trivial to map URLs to documents and vice versa, without
using a DocumentBuilder. The important fact is that the knowledge of how to map
URLs to documents belongs to the component which *creates* documents. That's
why there's no knowledge duplication if you hard-code that the German
version of /en/foo should be created at /de/foo.
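
A minimal Python sketch of this two-way mapping, based on structure (2). The
<site> wrapper element and the function name are assumptions for this example,
and URL suffixes are left out.

```python
import xml.etree.ElementTree as ET

SITE_XML = """
<site>
  <node id="en">
    <node id="foo" document-uuid="1-en"/>
  </node>
  <node id="de">
    <node id="foo" document-uuid="1-de"/>
  </node>
</site>
"""


def build_indexes(site):
    """Return (url -> uuid, uuid -> url) for the default site structure."""
    url_to_uuid, uuid_to_url = {}, {}

    def walk(element, prefix):
        for node in element.findall("node"):
            path = f"{prefix}/{node.get('id')}"
            uuid = node.get("document-uuid")
            if uuid is not None:
                # The mapping is only unambiguous if each document
                # is referenced exactly once.
                if uuid in uuid_to_url:
                    raise ValueError(f"{uuid} is referenced more than once")
                url_to_uuid[path] = uuid
                uuid_to_url[uuid] = path
            walk(node, path)

    walk(site, "")
    return url_to_uuid, uuid_to_url


url_to_uuid, uuid_to_url = build_indexes(ET.fromstring(SITE_XML.strip()))
```

Both directions are plain dictionary lookups, which is all that is left of
the work the DocumentBuilder does today.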

----

Supporting the other case, multiple URL suffixes for a document, is certainly
necessary. But I'd separate this information from the document itself.
IMO the URL suffix should be used to request a certain view of a document:

/foo              -> HTML view
/foo.html         -> HTML view
/foo.pdf          -> PDF view
/foo.print.html   -> print HTML view (if CSS is not appropriate or whatever)

This might be one scheme, but others are possible as well. I think Lenya needs
to allow flexibility here, because otherwise you shut Lenya out from many URL
spaces being used.


Yes, it was just an example.

The canonical URL of a document would be assembled from the canonical
base URL (/foo) and the extension denoting the view. This would be done
by the client code; the document itself (or whatever component knows the
document's URL) would just return the canonical base URL. (BTW, the term
canonical is not necessary anymore since only one base URL exists per document.)
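
In Python, the split between base URL and view could be sketched like this.
The VIEWS table only mirrors the example scheme above; the view names and the
function names are assumptions, and other schemes would just use a different
table.

```python
# Extension -> view name, following the example scheme (names are made up).
VIEWS = {
    "": "html",
    ".html": "html",
    ".pdf": "pdf",
    ".print.html": "print-html",
}


def split_url(url):
    """Split a URL into (base URL, view name), longest extension first."""
    for extension in sorted(VIEWS, key=len, reverse=True):
        if extension and url.endswith(extension):
            return url[: -len(extension)], VIEWS[extension]
    return url, VIEWS[""]


def view_url(base_url, extension):
    """Client code assembles the final URL from base URL and extension."""
    if extension not in VIEWS:
        raise ValueError(f"unknown view extension: {extension!r}")
    return base_url + extension
```

The document only ever returns /foo; split_url() and view_url() belong to the
client side.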


----

Another question: With multiple site structures, how does the system keep
track of the currently selected site structure?

  - URL prefix

That would be my first suggestion, similar to "context" for servlets.


Yes, but it would require reserved URL spaces.

[...]

I think it's best if we use a few real world examples, because then it
becomes much clearer very quickly.


Yes, my statements were of a rather general nature. Is there anything
particular you'd like an example for?

Thanks for your comments,

-- Andreas


