Re: [RT] Escaping Sitemap Hell

Stefano Mazzocchi Thu, 06 Jan 2005 21:50:01 -0800

Daniel Fagerstrom wrote:

 (was: Splitting xconf files step 2: the sitemap)
Although the Cocoon sitemap is a really cool innovation it is not entirely without problems:
* Sitemaps for large webapps easily become a mess
* It sucks as a map describing the site [1]
* It doesn't give that much support for "cool URLs" [2]
In this RT I will try to analyze the situation especially with respect to URL space design and then move on to discuss a possible solution.
Before you enthusiastically dive into the text:
* It is a long RT (as my RTs usually are)
* It might contain provoking and hopefully even thought-provoking ideas
* No, I will not require that everything in it should be part of 2.2
* No, I don't propose that we should scrap the current sitemap, actually I believe that we should support it for the next few millennia ;)


See my comments intermixed.

                           --- o0o ---
Peter and I had some discussion:
Peter Hunsberger wrote:
On Tue, 04 Jan 2005 13:25:05 +0100, Daniel Fagerstrom <[EMAIL PROTECTED]> wrote:
<snip/>
Anyway, sometimes when I need to refactor or add functionality to some of our Cocoon applications, where I or colleagues of mine have written endless sitemaps, I have felt that it would have been nice if the sitemap were more declarative, so that I could ask it basic things like getting a list of the URLs or URL patterns it handles. Also, if I have a URL in a large webapp and it doesn't work as expected, it can require quite some work to trace through all the involved sitemaps to see which rule is actually used.
Of course I understand that if I used a set of efficient conventions about how to structure my URL space and my sitemaps the problem would be much smaller. The problem is that I haven't found such a set of conventions yet. Of course I'm following some kind of principles, but I don't have anything that I'm completely happy with yet. Does anyone have good design patterns for URL space structuring and sitemap structuring that you want to share?
We have conventions that use sort of type extensions on the names: patient.search, patient.list, patient.edit where the search, list, edit (and other) screen patterns are common across many different metadata sources (in this case patient). We don't do match *.edit directly in the sitemap (any more) but I find that if you've got to handle orthogonal concerns then x.y.z naming patterns can sometimes help.
Ok, let's look at this in a more abstract setting:
Resource Aspects
================
In the example above we have an object, or better a _resource_, the patient, that everything else is about. The resource should be identifiable in a unique way, in this case by e.g. the social security number.

First big mistake: you think that http-based URIs and http-based URLs are the same thing.

Well, WRONG.

There is nothing that says that every http-URI should be automatically treated as a URL. This is a very common misconception, but nevertheless a big one.

There are a number of _operations_ that can be performed at the patient resource: show, edit, list, search etc, (although the search might be on the set of patient rather than a single one).

The resource has a _type_, patient, that might affect how we choose to show it etc.

Second mistake: it is an architectural design issue to *avoid* adding a type to a URI. These are three separate issues:

 1) how to resolve a URI into a URL
 2) how to negotiate the content of that URL
 3) how to map that returned URL metadata (the HTTP response headers) to a recognized type or format.

combining them into one is just a really poor way to use the web architecture.
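
To make the separation concrete, here is a minimal Python sketch that keeps the three steps in three functions. Everything in it is an assumption made for illustration: the `RESOLVER` table, the `urn:example:` identifier, the `myhospital.example` locations and the function names are not part of any Cocoon or web API.

```python
from dataclasses import dataclass

# Hypothetical lookup table: identifier -> {media type: location}.
RESOLVER = {
    "urn:example:person:123456789": {
        "text/html": "http://myhospital.example/store/0001",
        "application/pdf": "http://myhospital.example/store/0002",
    },
}

@dataclass
class Response:
    url: str
    headers: dict

def resolve(uri):
    """Step 1: turn an identifier (URI) into candidate locations (URLs)."""
    return RESOLVER[uri]

def negotiate(candidates, accepted):
    """Step 2: pick the first media type the client says it accepts."""
    for media_type in accepted:
        if media_type in candidates:
            return Response(candidates[media_type], {"Content-Type": media_type})
    raise LookupError("no acceptable representation")

def media_type_of(response):
    """Step 3: read the type from the response headers, never from the URL."""
    return response.headers["Content-Type"].split(";")[0].strip()
```

Note that neither the identifier nor the locations carry a type or an extension; the type only shows up in the response metadata.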

There are in general other aspects that will steer how we render the response when someone asks for the resource:
* The _format_ of the response: html, pdf, svg, gif etc.
* The _status_ of the resource: old, draft, new etc.
* The _access_ rights of the response: public, team, member etc.
There are plenty of other possible aspect areas as well.

Cool Webapp URLs
================
I searched the web to gain some insights into URL space design. It soon became clear that I should re-read Tim Berners-Lee's classic, "Cool URIs don't change" [2]. I must say I wasn't prepared for the shock; I had completely missed how radical its message was when I last read it. I can also recommend reading [3], a W3C note that codifies the message from [2] and some other good URI practices into a set of guidelines.


I suggest you read

  http://www.w3.org/TR/webarch/

So what is an URI? According to [3]:
A URI is, actually, a _reference to a resource, with fixed and independent semantics_.

This means that the URI should reference to a specific product, _always_.


GRRRR! A URI IS NOT A REFERENCE! A URI IS AN IDENTIFIER!

How to get a reference out of an identifier is a totally different thing.

Independent semantics means that a social security number is not enough, it should say that it is a person (from USA) as well. See [3] for the philosophical details.


Pfff, independent semantics doesn't mean anything. A perfectly valid URI is

 urn:943098029834098/9829982739487298374

* The URI should be easy to type


What the hell does this mean?

http://tinyurl.com/5r8kl

is easier to type than

http://www.amazon.com/exec/obidos/tg/detail/-/0465026567/

but which one is "better"? They both locate the same resource, but which one of them identifies it better?

* It should not contain too much meaning, especially not about implementation details

Now I try to apply the ideas from [2] and [3] on the different resource aspects mentioned above. When I use words like "should" or "should not" without any motivation it means that I believed in the motivation from the gurus in the references ;) I will try to motivate my own ideas ;)

What I'm going to suggest might be quite far from how you design your URL spaces. It is certainly far from the implementation-detail-plagued mess that I have created in my own applications.
The Resource
------------
The idea is that a URL identifies a resource. For the patient case above it could be:
http://myhospital.com/person/123456789
If we use a hierarchical URI space like /person/123456789, the "parent" URIs e.g. /person should also refer to a resource.

There is *NO SUCH THING* as a parent URI, because URIs do not have the notion of paths. It is a *convention*, established by early web server implementations (and perpetuated by Apache httpd), that the / in the path gets automatically mapped to the / in the file system, or to a hierarchical system where the / is used as a delimiter for hierarchy identifiers.

There is *NOTHING* in any web spec that says this is the rule or, for that matter, that this is a good thing.

/ is a "separator" in fact, from a URI point of view

 http://myhospital.com/123456789/person

and

 http://myhospital.com/person/123456789

show no difference in identification power.. which is what URIs do: they identify!

It's in most cases not a good idea to put a lot of topic classification effort into the URI hierarchy. Classifications are not unique and will change according to changing interests and world view.

This is true. But it is also true that, if you follow this reasoning, you should not be using http:// URIs at all!

In fact, what happens to a URI when say, two hospitals merge and they decide that it's in their best interest to get rid of the previous references of the names, including those in the URIs?

This is the reason why a lot of people prefer URNs over http-URIs, for example:

 1) the handle system: http://www.handle.net/
 2) the LSID system: http://www.omg.org/docs/dtc/04-05-01.pdf
 3) the DOI system: http://www.doi.org/

TimBL believes that the above systems are just a different way to skin a cat and they don't really solve anything (even if he agrees on the problem that the domain part of http-URIs is the weakest part of an http-URI, in terms of long-term persistence)

Also, you should take a look at 'Dynamic Delegation Discovery System' (DDDS):

  http://uri.net/ddds.html

which aims to become the standard way to translate a URI into a URL.

Operations
----------
What about the operations on the resource: list, search, edit etc? I find the object oriented style in WebDAV elegant where you use one URL together with different HTTP methods to perform different operations.

It's not the OO style of WebDAV, but it's the design of HTTP. Here is another example of somebody ruining a perfectly great design by not getting it: the browsers only allowed people to overload the actions in forms, but never in anchor tags and the browsers never allowed javascript to change that either.

Sam Ruby also has some interesting ideas about using URLs to identify "objects" and different SOAP messages for different methods on the object in his "REST+SOAP" article [4]. But neither ad hoc HTTP methods nor XML posts seem like good candidates for invoking operations on a resource in a typical webapp. So maybe something like:
/person/123456789/edit or
/person/123456789.edit or
/person/123456789?operation=edit
is a good idea.
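
Whichever of the three spellings is chosen, the application only needs the pair (resource, operation). A minimal Python sketch, assuming a hypothetical fixed set of operation names and a default of `show`, that normalizes all three styles:

```python
from urllib.parse import urlsplit, parse_qs

OPERATIONS = {"show", "edit", "list", "search"}  # hypothetical operation names

def split_operation(url, default="show"):
    """Return (resource_path, operation) for any of the three URL styles."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = parse_qs(parts.query)
    # Style 3: /person/123456789?operation=edit
    if query.get("operation", [""])[0] in OPERATIONS:
        return path, query["operation"][0]
    head, _, last = path.rpartition("/")
    # Style 1: /person/123456789/edit
    if last in OPERATIONS:
        return head, last
    # Style 2: /person/123456789.edit
    stem, dot, suffix = last.rpartition(".")
    if dot and suffix in OPERATIONS:
        return f"{head}/{stem}", suffix
    return path, default
```

Because the operation is split off before anything else looks at the path, the resource part stays the same whichever style the site settles on.
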
Resource Type
-------------
Should the type of the resource be part of the URI?


Absolutely not!

We probably have to include some type info in the URL to give it "independent semantics" (person e.g.). But we should not put types that might change like patient, manager, project-leader etc in the URL. And we should especially avoid types that only have to do with implementation details like what pipeline we want to use for rendering the resource.
Format
------
Cocoon especially shines in handling all the various file name extensions: .html, .wml, .pdf, .txt, .doc, .jpg, .png, .svg, etc, etc. But I'm sorry, if you want cool URLs you have to kiss them goodbye as well ;)

This is, again, another one of those major screwups from some browsers (mostly IE) where the "extension" of a URL (as if such a thing existed!) was used to identify the mime-type instead of the response headers.

It might be a good idea to send an html page to a browser on a PC and a wml page to a PDA user. But you shouldn't require your user to remember different URLs for different clients; that's a task for server-driven content negotiation.
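
Server-driven negotiation boils down to reading the client's Accept header and picking one of the formats the server can produce. A simplified Python sketch (it handles q-values and `type/*` wildcards but ignores other media-type parameters; the function names are made up for this example):

```python
def parse_accept(header):
    """Return media types from an Accept header, highest q-value first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        entries.append((q, media_type))
    # sorted() is stable, so ties keep the order the client sent them in.
    return [m for q, m in sorted(entries, key=lambda e: -e[0]) if q > 0]

def choose_format(accept_header, available):
    """Pick the first available media type that the client prefers most."""
    for wanted in parse_accept(accept_header):
        for offered in available:
            if wanted in (offered, "*/*") or (
                wanted.endswith("/*") and offered.startswith(wanted[:-1])
            ):
                return offered
    return None
```

With this, a PDA sending `Accept: text/vnd.wap.wml, text/html;q=0.5` and a desktop browser sending `Accept: text/html` can both ask for the same URL and get different representations.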

Using .html is not especially future proof, should all links become invalid when you decide to reimplement your site with dynamic SVG?

Often it is good to provide the user with a nice printable version of your page. But why should you advertise Adobe's products in your URLs?

Unfair: many non-adobe things produce PDF and it's a royalty-free specification to use.

http://partners.adobe.com/public/developer/pdf/index_reference.html

A few years ago it was .ps or .dvi on academic sites and .doc on commercial sites. Right now it happens to be .pdf, but will that be forever?

Same thing with images: the user doesn't care about the format as long as it can be shown in the browser (content negotiation), and neither should you make your content links (or Google's image search) dependent on a particular compression scheme that happens to be popular right now.

There are of course cases where you really want to give your user the ability to choose a specific format. Then a file name extension is a good idea. If you happen to maintain http://www.adobe.com/products/acrobat/ it's ok to put some .pdf there e.g. ;)

But in most cases file name extensions are an implementation detail that is not relevant for your users.

This is correct. Although a URL that might break in the future but shows me a page in my browser today is better than a URL that might not break tomorrow but doesn't show me anything at all today ;-)

Status
------
The status will by definition change, and that makes your URL uncool if the status is part of the URL.
Access Rights
-------------
Access rights will often change for a document. I know it is easy to write path-dependent rules for access rights in most webserver configuration files. But you expose irrelevant implementation details and it's not future proof.
Am I Really Serious?
--------------------
Why should a webapp URL be cool and future proof? Well, it's the interface to your webapp. We agree that we shouldn't change interfaces in Cocoon on a whim, so why should we treat the users of our webapps differently? And like it or not, useful software sometimes lives for decades. If you build useful webapps you should consider planning ahead.

Currently we are all used to webapps that use the most horrible URLs, containing tons of implementation details and changing every now and then. But it is not a law of nature that it must be like that. It is mainly a result of webapp development still being immature and the tools being far from perfect. Of course the user should be able to bookmark a useful form or wizard.

Also I believe that exposing implementation details in ones URLs is at least as bad as making all member variables public in Java classes. It makes your webapp monolithic and fragile.

To get this straight: I totally agree that a cool URL scheme is a great thing and I also think that the best URL scheme is something like

 http://site.com/342343

and that's it... that's the only way never to change anything because those numbers are the only 'semantically neutral' thing that you can do?

But still, my blog news URLs are of the form

 http://www.betaversion.org/~stefano/linotype/news/34/

which have several problems:

1) we might forget to register the domain and somebody might steal it from us

 2) well, my name might change (but that's unlikely)

 3) the company that has a trademark on linotype might sue me

 4) I might decide to add other types of items to my blog, like images or articles or whatever else... then news/id/ would seem awkward

but the best part is the number, chosen to be incremental and unique in that space.

                           --- o0o ---
You might find the views expressed above rather extreme and maybe impractical. As indicated above they are also far away from what I currently do in my webapps. But I have for quite some time thought about how to fight the too easily increasing entropy in the webapps we develop. I have suspected that badly designed URL spaces have been part of the trouble. And when I re-read TimBL's classic I suddenly realized that the habit of exposing implementation in the URLs might be at the root of the evil.

There is truth in this, but what I found irritating was the lack of understanding of the difference between a URI and a URL.

Cocoon's internals show some of this too (and I have to admit that I understood what URIs really were only after starting to work on the semantic web) but this should not be perpetuated further.

Whether this realization will survive contact with your comments and other parts of reality is of course too early to tell ;)
Does Cocoon Support Cool URLs?
==============================


Yessir!

But how does Cocoon support the above ideas about URL space design?
Well, in some way one could say that it supports it. The sitemap is so powerful that you can program most usage patterns in it in some more or less elegant way. But AFAICS, writing webapps following the URL space design ideas above would be rather tricky. So I would say that Cocoon doesn't support it that well.

I rather strongly (and probably not surprisingly) disagree with this statement.

The main reasons are:
* The sitemap is not that useful as a site map


How is this making it worse to support "cool URLs"?

* The sitemap gives excellent support for choosing resource production implementation based on the implementation details coded into the URL, but not for avoiding it

wrong! that's why we have pluggable matchers! the fact that you choose to match by URL is your choice, not an architectural decision!

* The sitemap mixes site map concerns with resource production implementation details

Yes, the cocoon sitemap describes how resources get produced in the pipelines.... but what is the site map you are talking about? a collection of all the resources available on the site? or just the URL matchers without anything else?

Is it a Map of the Site?
------------------------
The Forrest people don't think that the sitemap is enough as map of the site. They have a special linkmap [1] that gives a map over the site and that is used for internal navigation and for creating menu trees. I have a similar view. From the sitemap it can be hard to answer basic questions like:
* What is the URL space of the application
* What is the structure of the URL space
* How is the resource referred to by this URL produced


Hold it right there!

If you think that understanding the URLspace of the application for a sitemap is hard, then what about PHP? JSP? what about web.xml descriptors? are they any better?

Second point: how in hell is "structure of the URL space" different from "the URL space of the application"?

Third point: this is *flow*; it should *NOT* be part of a sitemap anyway.

The basic view of a URL in the sitemap is that it is any string. Even if there are constructions like mount, the URL space is not considered hierarchical. That means that the URLs can be presented as patterns in any order in the sitemap and you have to read through all of it to see if there is a rule for a certain URL.

As I mentioned already, this is a design decision based on the fact that it is *arbitrary* to consider the / as a hierarchical separator.

Also, matchers are *NOT* URL-specific and it's a very useful concept. Forcing matching to be:

 1) URL-based

and

 2) intrinsically hierarchical

is IMO a *severe* step backward in terms of architectural design.

A real map for the site should be tree structured like the linkmap in forrest. Take a look at the example in [1], (I don't suggest using the "free form" XML, something stricter is required). Such a tree model will also help in planning the URI space as it gives a good overview of it.


Forrest and cocoon serve different purposes.

While I totally welcome the fact that Forrest has such "linkmaps", I don't think they are general-enough concepts to drive the entire framework. They are fine as specific cases, especially appealing for a website generation facility like forrest, but as a general concept it is too weak.

The Forrest linkmap has no notion of wildcards, which is a must in Cocoon. We continue discussing that.


All right.

Choosing Production Pipeline
----------------------------
With the sitemap it is very easy to choose the pipeline used for producing the response based on a URL pattern "*.x.y.z". That more or less forces the user to code implementation details, i.e. what pipeline to use, into the URL. This is only a problem for wildcard patterns; otherwise we just associate the pipeline with the concrete "cool URL".


At this point I seriously wonder: are you aware that matchers are pluggable?

Earlier I suggested that aspects like type, format, status, access rights etc. shouldn't be part of the URL, as those aspects might change for the resource. OTOH these aspects certainly are necessary for choosing the rendering pipeline, so what should we do?


URL-parameter matching.

 <match type="wildcard" pattern="/news/*">
   <match type="param" pattern="edit">
    ....
   </match>
   <match type="param" pattern="delete">
    ....
   </match>
 </match>

or, if you have HTTP action control (as in form actions), you can do

 <match type="wildcard" pattern="/news/*">
   <match type="action" pattern="get">
    ....
   </match>
   <match type="action" pattern="post">
    ....
   </match>
 </match>

and, most of all, you do *NOT* include access control information in the URL! nor type! nor status!
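
The two sitemap fragments above can be read as nested predicates over the whole request, not just over the URL string. A small Python sketch of that reading, where the request is a plain dict and every name (`wildcard`, `param`, `method`, the pipeline labels) is an assumption of the example, not Cocoon's matcher API:

```python
from fnmatch import fnmatchcase

# A "request" is a plain dict here; its keys are assumptions of this sketch.
def wildcard(pattern):
    # Note: fnmatch's "*" also matches "/", unlike a per-segment wildcard.
    return lambda req: fnmatchcase(req["path"], pattern)

def param(name):
    return lambda req: name in req.get("params", {})

def method(name):
    return lambda req: req.get("method", "GET").upper() == name.upper()

def resolve(rules, req):
    """Walk nested (matcher, target) rules; the first match wins."""
    for matcher, target in rules:
        if matcher(req):
            if isinstance(target, list):
                found = resolve(target, req)
                if found is not None:
                    return found
            else:
                return target
    return None

RULES = [
    (wildcard("/news/*"), [
        (param("edit"), "edit-pipeline"),
        (param("delete"), "delete-pipeline"),
        (method("post"), "store-pipeline"),
        (method("get"), "show-pipeline"),
    ]),
]
```

Any callable taking the request can be plugged in as a matcher, which is the point: matching on the path is one choice among several.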

The requested resource will often be based on some content or combination of content that we can access from Cocoon. The content can be a file, data in a db, result from a business object etc. Let us assume that it resides in some kind of content repository. Now if we think about it, isn't it more natural to ask the content that we are going to use about its properties like type, format, status, access rights, etc., than to encode them in the URL? These properties can be encoded in the file name, in metadata in some property file, within the file, in a DB etc.


Ok, now that the nonsense venting is over, we seem to be getting at your RT.

Now instead of having the rule:

*.x.y.z ==> XYZPipeline

we have

* where repository:{1} has properties {x, y, z} ==> XYZPipeline

or

* where repository:{1}.x.y.z exists ==> XYZPipeline


Oh, a rule system for sitemap!

hmmmm, interesting... know what? the above smells a *lot* like you are querying RDF. hmmmm...

We get the right pipeline by querying the repository instead of encoding it in the URL. A further advantage is that the rule becomes "listable" as the "where" clause in the match expresses what are the allowed values for the wildcard.
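
A rough Python sketch of such a rule, with an in-memory dict standing in for the content repository. The keys, property names and pipeline names are all invented for illustration:

```python
# Hypothetical in-memory stand-in for the content repository.
REPOSITORY = {
    "123456789": {"type": "patient", "status": "draft"},
    "987654321": {"type": "patient", "status": "published"},
    "annual-report": {"type": "document", "status": "published"},
}

# Each rule: (required properties, pipeline name). Names are illustrative.
RULES = [
    ({"type": "patient", "status": "draft"}, "DraftPatientPipeline"),
    ({"type": "patient"}, "PatientPipeline"),
    ({"type": "document"}, "DocumentPipeline"),
]

def matches(properties, required):
    return all(properties.get(k) == v for k, v in required.items())

def pipeline_for(key):
    """Pick a pipeline from the resource's stored properties, not from its URL."""
    properties = REPOSITORY.get(key)
    if properties is None:
        return None
    for required, pipeline in RULES:
        if matches(properties, required):
            return pipeline
    return None

def listable(required):
    """Enumerate the wildcard values a rule allows (the 'listable' property)."""
    return sorted(k for k, p in REPOSITORY.items() if matches(p, required))
```

If a patient record changes status, its URL stays the same and only the selected pipeline changes; `listable` shows how the "where" clause lets the rule enumerate the URLs it covers.
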
Separating the Concerns
-----------------------
The sitemap handles two concerns: it maps a URL to a pipeline that produces a response and it describes how to put together this pipeline from sitemap components.


True.

The first concern is related to site design and the second is more a form of programming. Putting them together makes it hard to see the URL structure and also makes it tempting to group URLs based on common pipeline implementation instead of on site structure.


Fair enough.

Virtual Pipeline Components (VPCs) give us a way out of this. Large parts of our sites might be buildable with pipelines already constructed in some standard blocks.


Right.

I would propose to go even further, in the "real" site map it should only be allowed to call VPC pipelines, no pipeline construction is allowed, that should be done in the component area.

In the "real" site map the current context is set and the arguments to the called VPC are given.


Hmmm, rather drastic, but let's stick to it for your proposal.

Search Order
------------
 The problem for us, is as you allude to at the start of this
thread: Cocoon takes the first match, where what you really want is a
more XSLT "best match" type of handling; sometimes *.a, *.b, *.c works
and other times it's m.*, n.*, o.*...
In the past that has led me to suggest a sort of XSLT flow, but thinking about it in this light I wonder if what I really want is just XSLT sitemap matching (same thing in the end)...
I also believe that a "best match" type of handling is preferable; it increases IMO usability and it also makes it possible to use tree-based matching algorithms that are far more efficient than the current linear search.


This is a valid point.
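
The difference between the two strategies shows up as soon as patterns overlap. A small Python sketch, using the count of literal characters as a crude stand-in for XSLT-style priorities (that scoring rule and the pipeline names are assumptions of the example):

```python
from fnmatch import fnmatchcase

def first_match(rules, url):
    """Current sitemap behaviour: rules are tried in document order."""
    for pattern, pipeline in rules:
        if fnmatchcase(url, pattern):
            return pipeline
    return None

def specificity(pattern):
    # More literal characters means a more specific pattern.
    return sum(1 for ch in pattern if ch not in "*?")

def best_match(rules, url):
    """Pick the most specific matching rule, regardless of rule order."""
    candidates = [
        (specificity(pattern), pipeline)
        for pattern, pipeline in rules
        if fnmatchcase(url, pattern)
    ]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None

RULES = [
    ("*.edit", "generic-edit"),
    ("patient.*", "patient-any"),
    ("patient.edit", "patient-edit"),
]
```

For `patient.edit`, `first_match` returns `generic-edit` because that rule comes first, while `best_match` returns `patient-edit`. This sketch still scans every rule; the efficiency gain mentioned above would need a tree or trie index on top.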

The new sitemap
===============
To sum up the proposal:
Pipelines:
* Pipeline construction is only done as VPCs in component areas (often in blocks).

Sitemap:
* The sitemap follows the tree structure of the URL space (like the Forrest linkmap).
* Its responsibility is to map URLs to VPCs
* It can set the current context for each level in the tree (for dereferencing relative paths used in the VPC)
* Wildcards can have restrictions based on properties in the content repository
* It's best match based rather than rule order based
* Of course we have an include construct so that we can reuse sub sites
It might look like:

 <sitemap>
   <path match="person" context="adm/persons" pipeline="block:skin:default(search.xml)">
     <path match="*:patient" test="mydb:/patients/{patient} exists" context="adm/patients" pipeline="journal-summary({patient})">
       <path match="edit" pipeline="edit({patient})"/>
       <path match="list" pipeline="list({patient})"/>
     </path>
   </path>
 </sitemap>

Don't worry about the syntactical details in the example; it needs much more thought, I just wanted to make it a little bit more concrete. The path separator "/" is implicitly assumed between the levels. "*:patient" means that the content of "*" can be referred to as "patient".
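
To show how such a tree could be walked, here is a Python sketch that mirrors the example as nested dicts. The dict layout, the `exists` callback standing in for the repository test, and the rule that literal segments win over wildcards are all assumptions of this sketch, not part of the proposal:

```python
# Hypothetical tree mirroring the example: each node has an optional pipeline
# and children keyed by a literal segment or by "*:name" for a named wildcard.
SITEMAP = {
    "person": {
        "pipeline": "block:skin:default(search.xml)",
        "children": {
            "*:patient": {
                "pipeline": "journal-summary({patient})",
                # Stands in for the "mydb:/patients/{patient} exists" test.
                "exists": lambda value: value in {"123456789"},
                "children": {
                    "edit": {"pipeline": "edit({patient})"},
                    "list": {"pipeline": "list({patient})"},
                },
            },
        },
    },
}

def resolve(path, tree=SITEMAP):
    """Walk the tree one path segment at a time, binding named wildcards."""
    node, bindings = {"children": tree}, {}
    for segment in path.strip("/").split("/"):
        children = node.get("children", {})
        if segment in children:  # literal segments win over wildcards
            node = children[segment]
            continue
        for key, child in children.items():
            if key.startswith("*:") and child.get("exists", lambda v: True)(segment):
                bindings[key[2:]] = segment
                node = child
                break
        else:
            return None  # no child matched this segment
    pipeline = node.get("pipeline")
    return pipeline.format(**bindings) if pipeline else None
```

Here `resolve("/person/123456789/edit")` yields `edit(123456789)`, and an unknown patient number yields `None` because the repository test fails.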

Much of what I propose can be achieved with VPCs and a new "property aware" matcher. But IMO the stricter SoC above, the ability to "query" the sitemap, the possible advantages of the "best match" search, are reasons enough to go further.

First thing that comes to mind is that the implicit assumption of '/' is just bad. I would be against the proposal just for that.

Second, you lose the ability to do non-URL matching, which is, again another reason to vote against this.

Third, conditional matching is just nonsense, it's mixing flow concerns with matching.

Fourth, I don't find the above any more readable than a sitemap that uses VPCs.

I'll think about the rule-based pipeline resolution (which is an interesting concept in itself) but the rest, I'm sorry, it really does not resonate with me at all.

--
Stefano.
