<map:match pattern="*.doc">
  <map:act type="catch-view">
    <map:parameter name="view-name" value="content"/>
    <map:generate type="word2xml" src="{../1}.doc"/>
    <!-- complete the pipeline -->
  </map:act>
  <map:read src="{1}.doc"/>
</map:match>

Jeff mentioned getting metainformation from binary data for searching, but there are so many different types of binary data that a universal view seems rather heavy-handed. It works for search queries (barely, in my opinion). Content manipulation clients (like WebDAV), however, can't pass the query string trigger for views. This seems to me to be a one-trick pony. To make views available for readers, it seems as though specificity is lost.
The point of XML was specifically structured content, yes? Any conformant parser should be able to read any conformant file. Binary content has no such constraint. If both a reader and a generator are required in a matcher, I think some type of syntax that separates the two *visually* (not just conceptually) is necessary as a cue.
Putting in binary options makes all content one step worse than your typical HTML web page: lack of intelligent structure without hope of enforcing a schema. Generators that read from Word (and other similar formats) have taken some time to come to fruition precisely because of their arbitrary nature (varying character set assumptions, embedded OLE objects, various content encoding blocks, etc.). Remember, XML (in this case as metadata) is just one representation of structure. The important thing (in my opinion) is preserving the structure. I don't see that happening with further intermingling of arbitrary binary data.
I guess I'm in the camp that's glad that readers exist. Every time I have run into the dreaded error that comes from trying to load the output of a reader into the generator of another matcher, I have found a sitemap organization error. I guess I'm seeing the Cocoon version of "goto considered harmful." Sure it's flexible. Sure it's powerful. But will it impart more complexity and discomfort than it solves in actual practice?
Hacking the view internals seems overkill (emphasis on kill). In line with the resource reader's role as "arbitrary, unorganized bit bucket with a MIME type," there is no universal way of delivering appropriate content. The method of getting content from a Word document is very different from the method of gathering content from a PDF document. Views, as orthogonal access to similar resources (i.e. XML resources), don't apply. "View source" on a text file is straightforward. "View source" on an XML file even more so. What is "View source" on reader content? You would have to assign a different view to each class of reader or put in some MIME type matching hack. Neither is less work or easier to grok than simply putting an action or selector in the appropriate matchers, I think.
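To make the selector alternative concrete, here is a rough sketch rather than a tested sitemap. It assumes the stock "request-parameter" selector is declared in map:components, it reuses the hypothetical "word2xml" generator from the example at the top, and the "format" parameter name and its "xml" value are made up for illustration.

```xml
<map:match pattern="*.doc">
  <!-- Assumption: the "request-parameter" selector is declared in map:components -->
  <map:select type="request-parameter">
    <!-- "format" is a hypothetical request parameter, e.g. report.doc?format=xml -->
    <map:parameter name="parameter-name" value="format"/>
    <map:when test="xml">
      <!-- "word2xml" is the hypothetical generator from the example above -->
      <map:generate type="word2xml" src="{1}.doc"/>
      <map:serialize type="xml"/>
    </map:when>
    <map:otherwise>
      <!-- Default: hand back the original bytes untouched -->
      <map:read src="{1}.doc" mime-type="application/msword"/>
    </map:otherwise>
  </map:select>
</map:match>
```

The generator branch and the reader branch sit in visibly separate blocks, and a client that can't pass a query string (WebDAV, say) simply falls through to the reader.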
If this type of thing moves forward, I would rather see more specificity going into readers than twiddling with what comes out: a PDF reader, a Word reader, a PostScript reader, etc. In that case you're separating out by schema, or at least by some form of contract. The alternative is equivalent to saying, "let's just make one class of transformer because all XML is alike and only three transformation options are available anyway."
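Roughly what I have in mind, as a sketch only. The "pdf" and "word" readers and their org.example classes are hypothetical and don't exist anywhere; only the "resource" reader is stock Cocoon.

```xml
<map:components>
  <map:readers default="resource">
    <!-- Stock Cocoon reader: the generic bit bucket -->
    <map:reader name="resource" src="org.apache.cocoon.reading.ResourceReader"/>
    <!-- Hypothetical format-specific readers; these classes are placeholders -->
    <map:reader name="pdf" src="org.example.reading.PdfReader"/>
    <map:reader name="word" src="org.example.reading.WordReader"/>
  </map:readers>
</map:components>

<map:pipelines>
  <map:pipeline>
    <!-- Each matcher names the reader that understands its format -->
    <map:match pattern="*.pdf">
      <map:read type="pdf" src="{1}.pdf"/>
    </map:match>
    <map:match pattern="*.doc">
      <map:read type="word" src="{1}.doc"/>
    </map:match>
  </map:pipeline>
</map:pipelines>
```

Each reader then carries its own contract for the format it handles, instead of one generic reader having to guess.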
- Miles Elam
P.S. Sorry to start trouble, but I think someone had to mention it.
