Re: URL Theory & Best Practices

Tony Collen Fri, 08 Nov 2002 13:49:18 -0800

Comments inline...

Miles Elam wrote:

Justin Fagnani-Bell wrote:
I've wrestled with similar problems for a while with my content management system, which uses a database for content and structure. I'm in the process of setting the system up to use file extensions, so the client specifies the file type and Cocoon returns that type. If they request /a.html, they get HTML; /a.pdf, they get PDF; and so on. This seems elegant, but it has problems when you consider the points covered in the slashforward article. Here's the compromise I've come up with so far, adapted to a filesystem like the one you're using. I'm still toying with these ideas, so I'd like to hear comments.

1) Instead of having directories with index.xml files, have a directory and an xml file with the same name at the same level.
So you have /a/b/ actually returning /a/b.xml. You could map a request for /a/b/index.html to /a/b.xml as well. This way you can add a leaf, and if you later need to add sub-nodes and turn the leaf into a node, you just add a directory and some files underneath it.
Sounds good to me.
2) Redirect all URLs to *not* end in a slash. I see the point of the article you've linked to, and agree with it, but the file extension is the only form of file metadata that's pretty standard. Ending all URLs in slashes only works, in my opinion, if all the files are the same type; if not, it's really nice to have a way of identifying the type from the URL, not just from the mime-type response header. So, considering that any request is going to point to a leaf (or an error page), I would redirect /a/b/ to /a/b.html.
But can't the delivered type differ depending on the incoming client?

Yes, but a problem then arises when someone is using IE and wants a PDF, but your user-agent rules will only serve a PDF to FooCo PDF Browser 1.0. IMO browsers should respect the mime-type header. I believe the mime-type header is very useful when you want to use something like a PHP script to send an image or a .tar.gz file. In fact, it's essential for that to work; otherwise the browser interprets the data as garbage.
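
As a rough sketch of points 1 and 2 above, here is one way the mapping could look. It is written in Python rather than as a Cocoon sitemap, and the `resolve` function, the `MIME_TYPES` table, and the choice of .html as the redirect target are all assumptions made for illustration, not anything taken from Justin's system.

```python
import posixpath

# Hypothetical extension-to-MIME table; the thread only names html and pdf.
MIME_TYPES = {
    ".html": "text/html",
    ".pdf": "application/pdf",
    ".xml": "text/xml",
}
DEFAULT_EXT = ".html"  # assumed default representation used for redirects


def resolve(path):
    """Return an (action, value, mime) tuple for a request path.

    action is "redirect" (value is the new URL), "serve" (value is the
    source XML file), or "error" (value is an HTTP status code).
    """
    # Point 2: /a/b/ is redirected to an extension-bearing URL, /a/b.html.
    if path.endswith("/") and path != "/":
        return ("redirect", path.rstrip("/") + DEFAULT_EXT, None)

    base, ext = posixpath.splitext(path)
    if ext not in MIME_TYPES:
        return ("error", 404, None)

    # Point 1: /a/b/index.html is treated as another name for /a/b.html.
    if posixpath.basename(base) == "index":
        base = posixpath.dirname(base)

    # Every representation is generated from the sibling file /a/b.xml.
    return ("serve", base + ".xml", MIME_TYPES[ext])


if __name__ == "__main__":
    for url in ("/a/b/", "/a/b.pdf", "/a/b/index.html"):
        print(url, "->", resolve(url))
```

The extension only selects the MIME type sent back; the source is always the same XML file, so turning a leaf into a node later does not change any existing URL.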

This is where we differ slightly. In my mind /a/b/ is the intrinsic resource. /a/b/index.html is the explicit call for the HTML representation of /a/b/. If you redirect a client to /a/b/index.html and the client bookmarks it, they are bookmarking the HTML representation, not the intrinsic resource. I understand the efficiency issues, but a user agent match, when viewed in the context of sitemap matches, server-side logic, servlet request and response object creation, and other assorted method calls, is just a couple of string comparisons.

This is pretty much the original problem I was trying to solve. Sure, having a clean URL space that always ends in a / is useful, but if you look at how that would work on the server side, it means you create a physical directory for each page and then create an index.html in it. You have tons of files named index.html on your web server, but at least it's all organized with the directories.

In particular, as new clients become more and more capable, a give and take can take place when the resource identifier is left ambiguous. For example, giving Opera the XHTML/CSS version and IE6 the XML w/ XSLT processing instruction. I'm sure we're all aware of IE's fixation on file extension (or at least anyone who's fought with serving PDFs when the URL didn't end in PDF). If you pass XML w/ processing instruction from a URL tagged with .html, I'm not entirely convinced that IE will get this straight. The file extension can become a straitjacket.

As clients become more advanced, some work (e.g. XSLT processing, XInclude work, etc.) can be offloaded from the server. If someone has the .html version bookmarked or copied to email, we have basically made a contract with the user that they will always receive HTML for this resource, no matter the capabilities of the client.

In my opinion, URLs should not change.

As further explained at http://www.useit.com/alertbox/990321.html
The rundown:

- URLs should not change
- URLs are easy to remember (and therefore are organized logically)
- URLs are easy to type and are generally all in lowercase

That is one of the main things that drew me to Cocoon: URI abstraction. Once the URL is abstracted enough to act as a true URI, it can start acting as a true identifier instead of ad hoc, vague gobbledygook. Of course this also assumes that the URL/URI remains set in stone and is not a moving target.

Yes! This is exactly the conclusion I was coming to on my own. URIs are no more than data abstractions. They usually provide a view of some data, and more often than not, a URL on a web server directly correlates with a physical file on a disk (e.g. index.html). Cocoon allows one to create a purely virtual URL space in which no real files need exist on the server. It probably doesn't matter how the underlying data is abstracted, whether it be a one-to-one correlation to a directory tree on a disk somewhere, or an XPath statement into an XML file, or arguments to a CGI script that accesses a database depending on the order of the items in the request. Imagine a request for /articles/bydate/2002/10/31/ mapping to articles.php?mode=bydate&year=2002&month=10&day=31, which in turn queries a database.
Accessing a URL can provide a default view of the data, and depending on the request, the data can be presented in different ways. In the case of things like PHP and CGI scripts, the URL sometimes accepts incoming data (GET or POST data) and will return different results based on the messages passed to it. Cocoon allows you to provide different views of a resource based on the User-Agent string which is supplied by the browser. URLs represent objects.
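
To make the /articles/bydate/ example concrete, here is a minimal sketch of that rewrite in Python. The regular expression and the `rewrite` function are assumptions for illustration; only the source URL and the articles.php query string come from the example above.

```python
import re
from urllib.parse import urlencode

# Assumed URL shape: /articles/<mode>/<year>/<month>/<day>/
ARTICLE_URL = re.compile(
    r"^/articles/(?P<mode>[a-z]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/$"
)


def rewrite(path):
    """Map a clean article URL to the internal script URL, or None if no match."""
    match = ARTICLE_URL.match(path)
    if match is None:
        return None
    # Build the query string in a fixed order so the result is predictable.
    query = [(name, match.group(name)) for name in ("mode", "year", "month", "day")]
    return "articles.php?" + urlencode(query)


if __name__ == "__main__":
    print(rewrite("/articles/bydate/2002/10/31/"))
```

The public URL stays the same even if articles.php is later replaced by something else; only the right-hand side of the mapping changes.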

This way the extension isn't revealing the underlying technology of the site, but the type of file the client is expecting, and this goes for directories too.

If all we're really serving up is data, and XML is "just data" (http://radio.weblogs.com/0101679/), then perhaps all of our matches should match for *.xml. Based on other things, like the User-Agent string, or request parameters, we can provide different views of the data (PDF, SVG, HTML etc). A page named "foo.xml" could be an instance of intelligent data, whereby Cocoon supplies the "smarts" to change the data depending on any number of conditions.
In the end, it probably doesn't matter how the data is abstracted, as long as it's consistent, easy to use, and mostly permanent (or rather, flexible enough to survive if the abstraction changes in the future).
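
A minimal sketch of that idea, assuming a request parameter named `view` and a small table of User-Agent rules (both hypothetical, as is the `choose_view` function): an explicit request wins, then the User-Agent rules apply, then a default.

```python
# Views mentioned above and the MIME type each one would be served with.
VIEWS = {
    "html": "text/html",
    "pdf": "application/pdf",
    "svg": "image/svg+xml",
}

# Hypothetical rules: a substring of the User-Agent string and the view it gets.
USER_AGENT_RULES = [
    ("FooCo PDF Browser", "pdf"),
]
DEFAULT_VIEW = "html"  # assumed fallback


def choose_view(user_agent, params):
    """Pick a view of foo.xml from request parameters and the User-Agent."""
    # An explicit ?view=... lets an IE user ask for the PDF anyway.
    requested = params.get("view")
    if requested in VIEWS:
        return requested
    # Otherwise fall back to simple User-Agent string comparisons.
    for needle, view in USER_AGENT_RULES:
        if needle in user_agent:
            return view
    return DEFAULT_VIEW


if __name__ == "__main__":
    view = choose_view("Mozilla/4.0 (compatible; MSIE 6.0)", {"view": "pdf"})
    print(view, VIEWS[view])
```

Whichever view is chosen, the response still needs the matching mime-type header so the browser does not have to guess from the .xml extension.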

Life will be so much easier in 5 years when we're just serving up straight XML files. Unfortunately this puts Cocoon out of business ;)

Phew. THAT was way more than I was hoping to write :)

Tony
