Re: URL Theory & Best Practices

Miles Elam Thu, 07 Nov 2002 22:59:59 -0800

Justin Fagnani-Bell wrote:

I've wrestled with similar problems for a while with my content management system, which uses a database for content and structure. I'm in the process of setting the system up so that the client specifies the file type through the file extension and Cocoon returns that type. If they request /a.html, they get HTML; if they request /a.pdf, they get PDF; and so on. This seems elegant, but it has problems when you consider the points covered in the slashforward article. Here's the compromise I've come up with so far, adapted to a filesystem like the one you're using. I'm still toying with these ideas, so I'd like to hear comments.

1) Instead of having directories with index.xml files, have a directory and an XML file with the same name at the same level, so that /a/b/ actually returns /a/b.xml. You could map a request for /a/b/index.html to /a/b.xml as well. This way you can add a leaf, and if you later need to add sub-nodes and turn the leaf into a node, you just add a directory and some files underneath it.

Sounds good to me.
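As a rough illustration of point 1, here is a minimal Python sketch of the mapping from request path to backing file. The function name `backing_file` and the `documents` root are assumptions made for the example (the root is borrowed from the matcher quoted further down); only the two URL shapes mentioned above are handled.

```python
import posixpath


def backing_file(request_path, root="documents"):
    """Return the XML source for a request under the 'b/ next to b.xml' layout.

    Handles only the two shapes discussed in the thread:
    a trailing-slash URL (/a/b/) and its explicit index (/a/b/index.html).
    """
    path = request_path.lstrip("/")
    if path.endswith("/index.html"):
        # /a/b/index.html is just another name for /a/b/
        path = path[: -len("/index.html")]
    else:
        path = path.rstrip("/")
    return posixpath.join(root, path + ".xml")


print(backing_file("/a/b/"))            # documents/a/b.xml
print(backing_file("/a/b/index.html"))  # documents/a/b.xml
print(backing_file("/a/b/c/"))          # documents/a/b/c.xml
```

Turning the leaf b into a node only means adding documents/a/b/ and files such as documents/a/b/c.xml; the mapping for /a/b/ itself does not change.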

2) Redirect all URLs to *not* end in a slash. I see the point of the article you've linked to, and agree with it, but the file extension is the only form of file metadata that's pretty standard. Ending all URLs in slashes only works, in my opinion, if all the files are the same type. If they aren't, it's really nice to have a way of identifying the type from the URL, not just from the MIME type in the response header. So, considering that any request is going to point to a leaf (or an error page), I would redirect /a/b/ to /a/b.html.

But can't the delivered type differ depending on the requesting client?

This is where we differ slightly. In my mind /a/b/ is the intrinsic resource. /a/b/index.html is the explicit call for the HTML representation of /a/b/. If you redirect a client to /a/b/index.html and the client bookmarks it, they are bookmarking the HTML representation, not the intrinsic resource. I understand the efficiency issues, but a user agent match, when viewed in the context of sitemap matches, server-side logic, servlet request and response object creation, and other assorted method calls, is just a couple of string comparisons.

In particular, as new clients become more and more capable, a give and take can take place when the resource identifier is left ambiguous. For example, giving Opera the XHTML/CSS version and IE6 the XML with an XSLT processing instruction. I'm sure we're all aware of IE's fixation on file extensions (or at least anyone who's fought with serving PDFs when the URL didn't end in .pdf). If you pass XML with a processing instruction from a URL tagged with .html, I'm not entirely convinced that IE will get this straight. The file extension can become a straitjacket.

As clients become more advanced, some work (e.g. XSLT processing, XInclude work, etc.) can be offloaded from the server. If someone has the .html version bookmarked or copied into an email, we have basically made a contract with the user that they will always receive HTML for this resource, no matter the capabilities of the client.

In my opinion, URLs should not change. That is one of the main things that drew me to Cocoon: URI abstraction. Once the URL is abstracted enough to act as a true URI, it can start acting as a true identifier instead of ad hoc, vague gobbledygook. Of course this also assumes that the URL/URI remains set in stone and is not a moving target.

This way the extension isn't revealing the underlying technology of the site, but the type of file the client is expecting, and this goes for directories too.

Yup, although I think people underestimate the utility of the default directory listing when there is no index.html (or default.htm, home.html, etc.). If you think back to the beginnings of the web, what was index.html but a dressed up view of all resources in the general area?

The matchers would look something like this (I might have this wrong):

<map:match pattern="**/">
  <map:redirect-to uri="{1}.html"/>
</map:match>

<map:match pattern="**/*.html">
  <map:generate src="documents/{1}.xml"/>
  <map:transform src="stylesheets/page2html.xsl"/>
  <map:serialize type="xhtml"/>
</map:match>

Shouldn't this be <map:generate src="documents/{1}/{2}.xml"/>? But yeah, that's assuming that the resource will be HTML. A valid assumption for most sites...for the time being. A lot has changed in the last few years and a lot of new clients have jumped on the scene. As I mentioned before, I believe URLs should be as permanent as possible. This has no flexibility for the future.
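To make the capture numbering concrete, here is a small Python sketch that imitates wildcard matching, on the assumption that `*` matches any run of characters without a slash and `**` matches any run including slashes. The helper names `match` and `expand` are invented for the example; this is not Cocoon's own matcher code.

```python
import re


def match(pattern, uri):
    """Return the wildcard captures as a tuple, or None if the URI does not match."""
    pieces = []
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            pieces.append("(.*)")        # may span directory separators
        elif part == "*":
            pieces.append("([^/]*)")     # stays within one path segment
        else:
            pieces.append(re.escape(part))
    found = re.match("^" + "".join(pieces) + "$", uri)
    return found.groups() if found else None


def expand(template, captures):
    """Substitute {1}, {2}, ... in a src template with the captures."""
    return re.sub(r"\{(\d+)\}", lambda m: captures[int(m.group(1)) - 1], template)


captures = match("**/*.html", "a/b/c.html")
print(captures)                                    # ('a/b', 'c')
print(expand("documents/{1}.xml", captures))       # documents/a/b.xml
print(expand("documents/{1}/{2}.xml", captures))   # documents/a/b/c.xml
```

With only {1} in the src, a request for a/b/c.html would be served from the parent's file; {1}/{2} picks up the leaf name as well.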

This is based upon the sitemap we're using as a working model (I might also have something wrong):


<map:match pattern="**/index.xml">
  <map:generate type="directory" src="{1}"/>
  <map:transform src="dir2page.xsl"/>
  <map:transform src="stylesheets/processing-instruction.xsl">
    <map:parameter name="stylesheet" value="stylesheets/page2xhtml.xsl"/>
  </map:transform>
  <map:serialize type="xml"/>
</map:match>

<map:match pattern="**/index.*">
  <map:generate src="cocoon:/{1}/index.xml"/>
  <map:transform src="stylesheets/page2{2}.xsl"/>
  <map:serialize type="{2}"/>
</map:match>

<map:match pattern="**/page.xml">
  <map:generate src="{1}.xml"/>
  <map:transform src="stylesheets/processing-instruction.xsl">
    <map:parameter name="stylesheet" value="stylesheets/page2xhtml.xsl"/>
  </map:transform>
  <map:serialize type="xml"/>
</map:match>

<map:match pattern="**/page.*">
  <map:generate src="cocoon:/{1}.xml"/>
  <map:transform src="stylesheets/page2{2}.xsl"/>
  <map:serialize type="{2}"/>
</map:match>


<map:match pattern="**/">
  <map:select type="browser">
    <map:when test="wap">
      <map:generate src="cocoon:/{1}/page.wml"/>
      <map:serialize type="wml"/>
    </map:when>
    <map:when test="xslt">
      <map:generate src="cocoon:/{1}/page.xml"/>
      <map:serialize type="xml"/>
    </map:when>
    <map:when test="html">
      <map:generate src="cocoon:/{1}/page.html"/>
      <map:serialize type="html"/>
    </map:when>
    <map:otherwise>
      <map:generate src="cocoon:/{1}/page.xhtml"/>
      <map:serialize type="xhtml"/>
    </map:otherwise>
  </map:select>
</map:match>

This all works on the following assumptions:

"/a/b/d/" refers to a resource independent of presentation. From here, we do browser type checking for the appropriate output type.

"/a/b/d/index.xml" refers to a list of resources associated with "/a/b/d/"

"/a/b/d/page.xml" refers to the resource explicitly as XML.

"/a/b/d/page.html" refers to the resource explicitly as HTML.

--------------

This also reflects the change we made to the browser selector. In effect, we've turned it into a poor man's Deli.

<map:selector logger="sitemap.selector.browser" name="client" src="org.apache.cocoon.selection.BrowserSelector">
  <browser name="xslt" useragent="MSIE 6"/>
  <browser name="xhtml" useragent="MSIE"/>
  <browser name="xhtml" useragent="MSPIE"/>
  <browser name="xhtml" useragent="HandHTTP"/>
  <browser name="xhtml" useragent="AvantGo"/>
  <browser name="xhtml" useragent="DoCoMo"/>
  <browser name="xhtml" useragent="Opera"/>
  <browser name="xhtml" useragent="Lynx"/>
  <browser name="xhtml" useragent="Java"/>
  <browser name="wap" useragent="Nokia"/>
  <browser name="wap" useragent="UP"/>
  <browser name="wap" useragent="Wapalizer"/>
  <browser name="xhtml" useragent="Mozilla/5"/>
  <browser name="xhtml" useragent="Netscape6/"/>
  <browser name="xhtml" useragent="Netscape7"/>
  <browser name="html" useragent="Mozilla"/>
</map:selector>

This basically designates what class of content goes where as opposed to what the client name is. FYI: I know that Mozilla supports XSLT as well, but I ran into a CSS rendering bug with regard to background-color on the body tag that prevents its use.
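The intent of that ordering can be traced with a short Python sketch. It assumes the entries are tried top to bottom, that each useragent value is a plain substring test against the User-Agent header, and that anything unmatched falls into xhtml; whether the BrowserSelector evaluates them exactly this way is an assumption here, and `content_class` is a name made up for the example.

```python
# (class, substring) pairs in the same order as the selector declaration.
BROWSERS = [
    ("xslt", "MSIE 6"),
    ("xhtml", "MSIE"),
    ("xhtml", "MSPIE"),
    ("xhtml", "HandHTTP"),
    ("xhtml", "AvantGo"),
    ("xhtml", "DoCoMo"),
    ("xhtml", "Opera"),
    ("xhtml", "Lynx"),
    ("xhtml", "Java"),
    ("wap", "Nokia"),
    ("wap", "UP"),
    ("wap", "Wapalizer"),
    ("xhtml", "Mozilla/5"),
    ("xhtml", "Netscape6/"),
    ("xhtml", "Netscape7"),
    ("html", "Mozilla"),
]


def content_class(user_agent, default="xhtml"):
    """Return the class of the first entry whose substring occurs in the UA."""
    for name, needle in BROWSERS:
        if needle in user_agent:
            return name
    return default  # assumed fallback for unknown clients


print(content_class("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))  # xslt
print(content_class("Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"))      # xhtml
print(content_class("Mozilla/4.79 [en] (Windows NT 5.0; U)"))               # html
```

Under a first-match reading, the specific entries have to come before the general ones: IE6 also contains "MSIE" and "Mozilla", so moving "MSIE 6" below either of those would change the class it lands in.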

--------------

In other news, I found that just using the filesystem, while simple, lacks some flexibility for other purposes: for example being able to list files published in a certain time frame or by a particular author. We ended up looking at database solutions before too long. If however this really is just a document bucket, no problem.

The examples I gave were tailored to what I gather of the original problem set where it seems basically text documents are being handled.

For documents where it is assumed images, media files, etc. will be associated, I'd actually recommend a setup where index.* refers to the document itself instead of a directory listing. The document would have references to the media pieces and thus would fit the description of an overview or listing.

For our site, the directory structure is more like:

/articles/00000001/

with index.xml, index.html, index.xhtml, etc. being views for the article. Any images et al would be referenced as:

/articles/00000001/image1.png

This precludes hierarchy, however. We got around this by having alternate hierarchies independent of this one. This would be a listing of all articles having to do with software reviews:

/articles/reviews/software/

This would be articles published last month:

/articles/2002/10/

This would be articles written by me:

/users/melam/articles/

This fits articles being fixed, independent items (via article ID) that are also available by other organizational means. Unfortunately this isn't really feasible without a database -- we tried and realized that flat files weren't any easier or simpler for what we wanted to do. Whether the backing store is a relational, object, or XML database, the URLs/URIs need not change, because they identify the resource as a concept rather than a filesystem layout.
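For what it's worth, here is a minimal Python/SQLite sketch of how the alternate hierarchies could resolve to the same fixed article URLs. The table layout, column names, sample rows (including the author "jdoe") and the `listing` function are all hypothetical and exist only for illustration; this is not our actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id TEXT PRIMARY KEY, author TEXT, "
    "category TEXT, published TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [  # made-up sample rows
        ("00000001", "melam", "reviews/software", "2002-10-14"),
        ("00000002", "melam", "news", "2002-11-02"),
        ("00000003", "jdoe", "reviews/software", "2002-10-30"),
    ],
)


def listing(path):
    """Map a listing URL to the canonical article URLs it should show."""
    parts = path.strip("/").split("/")
    if parts[0] == "users" and parts[2:] == ["articles"]:
        where, args = "author = ?", (parts[1],)            # /users/<name>/articles/
    elif parts[0] == "articles" and len(parts) == 3 and parts[1].isdigit():
        where, args = "published LIKE ?", (f"{parts[1]}-{parts[2]}-%",)  # /articles/<yyyy>/<mm>/
    elif parts[0] == "articles":
        where, args = "category = ?", ("/".join(parts[1:]),)  # /articles/<category path>/
    else:
        return []
    rows = conn.execute(f"SELECT id FROM articles WHERE {where} ORDER BY id", args)
    return [f"/articles/{row[0]}/" for row in rows]


print(listing("/articles/reviews/software/"))
print(listing("/articles/2002/10/"))
print(listing("/users/melam/articles/"))
```

Every listing returns links of the form /articles/<id>/, so the article's own URL stays put no matter which hierarchy led to it.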

Okay... Anyone going to poke holes in my arguments? Our setup is pretty young and the database is relatively empty. I'd love to hear about problems before our dataset gets much larger and unwieldy.

- Miles

P.S. That was far longer than I was originally planning...
