Miles Elam wrote:

Ummm... Quick question: What are the use cases for this that are not handled by existing methods? I mean, couldn't this be handled with an (as-yet unwritten) action?

<map:match pattern="*.doc">
 <map:act type="catch-view">
   <map:parameter name="view-name" value="content"/>
   <map:generate type="word2xml" src="{../1}.doc"/>
   <!-- complete the pipeline -->
 </map:act>
 <map:read src="{1}.doc"/>
</map:match>


Go back to the first post of this thread, where (last paragraph) I proposed something similar. The whole discussion is about finding a syntax that doesn't introduce such verbosity in the sitemap.

Jeff mentioned getting metainformation from binary data for searching, but surely there are so many different types of binary data that a universal view seems rather heavy-handed. It works for search queries (barely, in my opinion). Content manipulation clients (like WebDAV) can't pass the query string trigger for views. This seems to me to be a one-trick pony. To make views available for readers, it seems as though specificity is lost.

The point of XML was specifically structured content, yes? Any conformant parser should be able to read any conformant file. Binary content has no such constraint. If both a reader and a generator are required in a matcher, I think some type of syntax that separates the two *visually* (not just conceptually) is necessary as a cue.

Putting in binary options makes all content one step worse than your typical HTML web page: lack of intelligent structure without hope of enforcing a schema. Generators that read from Word (and other similar formats) have taken some time to come to fruition precisely because of their arbitrary nature (varying character set assumptions, embedded OLE objects, various content encoding blocks, etc.). Remember, XML (in this case as metadata) is just one representation of structure. The important thing (in my opinion) is preserving the structure. I don't see that happening with further intermingling of arbitrary binary data.

I guess I'm in the camp that's glad that readers exist. Every time I have run into the dreaded error that comes from trying to load the output of a reader into the generator of another matcher, I have found a sitemap organization error. I guess I'm seeing the Cocoon version of "goto considered harmful." Sure it's flexible. Sure it's powerful. But will it impart more complexity and discomfort than it solves in actual practice?

Hacking the view internals seems overkill (emphasis on kill). In line with the resource reader's role as "arbitrary, unorganized bit bucket with a MIME type," there is no universal way of delivering appropriate content. The method of getting content from a Word document is very different from the method of content gathering from a PDF document. Views, as orthogonal access to similar resources (i.e. XML resources), don't apply. "View source" on a text file is straightforward. "View source" on an XML file even more so. What is "View source" on reader content? You would have to assign a different view to each class of reader or put in some MIME type matching hack. Neither is less work or easier to grok than simply putting in an action or selector in the appropriate matchers, I think.

If this type of thing moves forward, I would rather see more specificity going into readers than twiddling with what comes out: a PDF reader, a Word reader, a Postscript reader, etc. In that case you're separating out by schema, by at least some form of contract. The alternative is equivalent to saying, "let's just make one class of transformer because all XML is alike and only three transformation options are available anyway."


As I explained in several replies, there's no equivalence between a reader and a generator able to parse a given binary format. There needs to be some kind of adaptation/extraction before feeding the view.

And what you describe above as "a PDF reader, a Word reader, a Postscript reader, etc." are IMO nothing more than _generators_, just like the SWF and MIDI generators we already have.

Let's consider the MIDI example. Suppose we have a large collection of karaoke files (MIDI supports embedded text that can be played on screen while playing the music), and we want to index the text of these songs for easy retrieval (along with some other meta-data).

Here's a sitemap example, using the current syntax:
<map:match pattern="*.mid">
 <map:act type="catch-view" src="content">
   <map:generate type="midi" src="{../1}.mid"/>
   <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
   <!-- should never come here -->
   <map:serialize type="xml"/>
 </map:act>
 <map:read src="{1}.mid"/>
</map:match>

(the "content" view starts at the "content-label" label to clearly distinguish the two notions).

And the proposed shorter one:

<map:match pattern="*.mid">
 <map:read src="{1}.mid" unless-label="content"/>
 <map:generate type="midi" src="{1}.mid"/>
 <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
 <!-- should never come here -->
 <map:serialize type="xml"/>
</map:match>

Note also that the "catch-view" action is not easy to write, since the view is defined on the environment object, which is theoretically not visible to components.

Furthermore, it would be better to catch on labels, since several views can be plugged on a given label (e.g. "content" & "pretty-content"). And it would be impossible for the action to access this information.
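
To illustrate, here is a rough sketch of two views plugged on the same label. The from-label attribute is the regular sitemap view syntax; the xdoc2html.xsl stylesheet and the choice of serializers are assumptions made up for this example:

<map:views>
 <!-- raw XML starting at the "content-label" label -->
 <map:view name="content" from-label="content-label">
   <map:serialize type="xml"/>
 </map:view>
 <!-- same label, rendered for humans (hypothetical stylesheet) -->
 <map:view name="pretty-content" from-label="content-label">
   <map:transform src="xdoc2html.xsl"/>
   <map:serialize type="html"/>
 </map:view>
</map:views>

A request with ?cocoon-view=pretty-content appended would then run the pipeline up to the "content-label" label and finish with the second view. An action that only knows a view name has no way to tell that both views hang on the same label.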

P.S. Sorry to start trouble, but I think someone had to mention it.


No trouble. Just lots of misunderstandings in this thread, I guess.

Sylvain

--
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



