Re: [RT] Views for readers
On Thursday, Aug 14, 2003, at 19:07 Europe/Rome, Miles Elam wrote: Vadim Gritsenko wrote: Here is another wild (or not?) thought. Not so wild to me. All this discussion comes down to the requirement of generating some XML out of the content usually served by the reader, if that's possible (and it is possible for some of the types of the content), in order to feed this XMLized content into the view. This generated XML is somewhat equivalent to the binary represenation for the purpose of view building. So, I'm going to the conclusion that some types of readers can be paired with the generator producing equivalent, but XMLized, content. The best place to indicate such pairing is the time when you declare a reader: snip idea=interesting/ The syntax looks a bit ugly to me, but the idea seems much more sane to me. PS: Modifying sitemap syntax to allow reader/generator pairs with some unless attrbiutes looks awful to me. Complete agreement. One of the reasons for the sitemap (*the* reason?) is for the simple and easy management of a site. Some recent proposals seem to be pushing in the direction of Apache HTTPd's mod_rewrite; A lot of flexibility by adding just one more construct. From the mod_rewrite page: The great thing about mod_rewrite is it gives you all the configurability and flexibility of Sendmail. The downside to mod_rewrite is that it gives you all the configurability and flexibility of Sendmail. -- Brian Behlendorf Apache Group Despite the tons of examples and docs, mod_rewrite is voodoo. Damned cool voodoo, but still voodoo. -- Brian Moore [EMAIL PROTECTED] It'd be a shame if the sitemap became a cousin to mod_rewrite despite the cool voodoo. I can hardly agree more! - Miles Elam P.S. I shudder to think of what will happen to search index creation times when multi-megabyte Word documents and the like are sent down the pipe. 
The parsers, however efficient they may turn out to be, will still have to contend with seemingly endless streams of seemingly pointless formatting cruft. I'm sure we've all seen 10MB files that would be 100K in proper HTML. Ah well... 'tis the cost of progress, I guess.

Cocoon is not about binary files and should *NOT* touch them. Readers were implemented as helpers. Multi-views for binary files belong to the repository level, not to the publishing level!!!

I haven't read all the email left (300 more to go after 5 days offline) but I strongly hope you haven't implemented this or I'll scream!!!

-- Stefano.
Re: [RT] Views for readers
On Thursday, Aug 14, 2003, at 21:44 Europe/Rome, Andreas Hochsteger wrote:

> Hi! Sorry, but this discussion seems to tell us one thing: The current sitemap syntax and Cocoon processing model is not really suitable for this kind of processing.

I completely agree.

> All this reminds me of a proposal (which was actually a RT) I sent back in January this year, where I proposed a more intuitive and flexible pipeline concept.

Maybe more flexible (and this is half of what FS is!) but more intuitive?

> I don't want to say that this would be the solution to all problems, and I definitely made some mistakes because I didn't know that much of the Cocoon internals at the time of writing, but I think it's time to take a second look at it.

As long as I'm around, any proposal to make Cocoon more suitable for binary pipelines will receive a -1 from me (vote, not veto! remember).

> So here's the link:
> http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=104482372430759&w=2
>
> Again, please don't be so harsh concerning mistakes, but I think there are many ideas included, which give some food for thought.

The main idea of this thread, of Nicola's thoughts and of Jeff's proposal is that Cocoon should be instrumented to allow more binary processing. I strongly disagree. Cocoon should process XML and focus on that. Other systems (a content repository, for example) should process the binaries (at creation time! not at publishing time!)

Say no to voodoo!

-- Stefano.
Re: [RT] Views for readers
On 14 Aug 2003 at 15:34, Bertrand Delacretaz wrote:

> I find this more understandable (but dunno about implementation):
>
>   <!-- if reader is executed, the rest is not -->
>   <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
>   <map:generate src="docs/{1}.doc" type="wordToXml"/>
>   <map:transform .../>

Simplifying further:

  <map:read src="docs/{1}.doc" view-generator="wordToXml"/>

Surely that'd do it?

Regards, Upayavira
Re: [RT] Views for readers
Bertrand Delacretaz wrote:

> On Thursday, 14 Aug 2003, at 15:24 Europe/Zurich, Sylvain Wallez wrote:
>
>> ...But what if we write it the other way around:
>>
>>   <map:read src="docs/{1}.doc">
>>     <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
>>   </map:read>
>
> I find this more understandable (but dunno about implementation):
>
>   <!-- if reader is executed, the rest is not -->
>   <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
>   <map:generate src="docs/{1}.doc" type="wordToXml"/>
>   <map:transform .../>

Interesting. This looks like a more compact notation for the view-selector I was thinking of at first. We're leaving the RT world...

But shouldn't we keep the labels that are already used in pipelines? E.g.:

  <map:read src="docs/{1}.doc" label="raw, xdoc"/>
  <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
  <map:transform src="xword2xdoc.xsl" label="xdoc"/>

The label on the reader would skip the reader if the requested view corresponds to one of these labels. Now should this be named label or unless-label?

Ah, and this is very easily implementable ;-)

Sylvain

-- 
Sylvain Wallez Anyware Technologies http://www.apache.org/~sylvain http://www.anyware-tech.com { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects } Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
Jeff Turner wrote: On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote: Frederic's question about search engine integration led me to questioning myself at how Cocoon's Lucene integration could be able to transparently index Word PDF documents along with XML-produced documents. There exists some text-extraction libraries for Word PDF (e.g. http://www.textmining.org/). Now how can we integrate this as transparently as possible in Cocoon's search functionnality ? The Lucene indexer crawls a website and asks for a particular view (content) which is used to fill the index. But Word and PDF documents being binary files, they're handled by a map:read statement, which does not handle views. On the other hand, this use case shows that having views on binary content may make sense : the normal requests just sends back the binary content, while a view can use a text/XML extraction on these binary files. So the question is : how could views be plugged to readers ? I must say that I don't have an answer, as views contain transformers and a serializer, but no generator. So how could we express in the sitemap that a particular view on a reader should replace that reader by a particular generator ? Or should this go through some special readers that could also act as generators ? Or maybe these are silly thoughts and we should use a map:select directing to a map:read or map:generate depending on the view. But this introduces explicit view management in the pipelines, which doesn't seem nice to me. Solution: strongly typed pipelines! :) Imagine if, at each node in the sitemap, we knew what type of content we were dealing with (usually some flavour of XML). 
Then we could write a single view that behaves differently depending on the _type_ of data: map:view name=indexablecontent from-position=first map:select type=xml-type map:when test=docbook map:transform src=docbook2whatever.xsl/ /map:when map:when test=tei map:transform src=tei2whatever.xsl/ /map:when map:when test=msword map:transform src=word2whatever.xsl/ /map:when /map:select /map:view Ah, ok, the strongly type pipelines are a different wording for content-aware selectors ! So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would return XML representing the content of the .doc file. I described the same thing in a mail with subject 'Type-aware Views (Re: Link view goodness)'. Same need, different context, same proposed solution. Not exactly : the use case here is that we have a binary file which is normally sent as is to the browser using a reader. It is _not_ parsed as an XML stream. So we can't attach a view to these kinds of URLs since views provide a different _ending_ to a pipeline, meaning there must exist at least a generator and optionnaly one or more transformers at the point where processing is directed to the view. So even content-aware selectors don't solve this problem... Sylvain -- Sylvain Wallez Anyware Technologies http://www.apache.org/~sylvain http://www.anyware-tech.com { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects } Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
On Thursday, 14 Aug 2003, at 15:24 Europe/Zurich, Sylvain Wallez wrote:

> ...But what if we write it the other way around:
>
>   <map:read src="docs/{1}.doc">
>     <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
>   </map:read>

I find this more understandable (but dunno about implementation):

  <!-- if reader is executed, the rest is not -->
  <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
  <map:generate src="docs/{1}.doc" type="wordToXml"/>
  <map:transform .../>

-Bertrand
Re: [RT] Views for readers
Miles Elam wrote:

> Ummm... Quick question: What are the use cases for this that are not handled by existing methods? I mean, couldn't this be handled with an (as-yet unwritten) action?

Matcher *does* exist:

  <map:match pattern="*.doc">
    <map:match type="wildcard-request-parameter" pattern="content">
      <map:parameter name="parameter-name" value="cocoon-view"/>
      <map:generate type="word2xml" src="{../1}.doc"/>
      <!-- complete the pipeline -->
    </map:match>
    <map:read src="{1}.doc"/>
  </map:match>

<snip/>

Vadim
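A minimal sketch of what the completed pipeline from the message above might look like, using only constructs already mentioned in this thread. The `word2xml` generator and the `xword2xdoc.xsl` stylesheet are hypothetical names borrowed from other messages; nothing here says they exist. The matcher name `wildcard-request-parameter` is taken from Vadim's example and is assumed to be declared in the sitemap's components section.

```xml
<map:match pattern="*.doc">
  <!-- The inner match only succeeds when the request carries cocoon-view=content -->
  <map:match type="wildcard-request-parameter" pattern="content">
    <map:parameter name="parameter-name" value="cocoon-view"/>
    <!-- Hypothetical generator that parses the Word file into XML -->
    <map:generate type="word2xml" src="{../1}.doc"/>
    <!-- Hypothetical stylesheet adapting that XML to what the indexer expects -->
    <map:transform src="xword2xdoc.xsl"/>
    <map:serialize type="xml"/>
  </map:match>
  <!-- No view requested: send the binary file untouched -->
  <map:read src="{1}.doc"/>
</map:match>
```

The cost of this approach is the one discussed throughout the thread: the view handling is spelled out explicitly in every pipeline that serves binary content.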
Re: [RT] Views for readers
Sylvain Wallez wrote:

> Bertrand Delacretaz wrote:
>
>> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>>
>>> ...But shouldn't we keep the labels that are already used in pipelines? E.g.:
>>>
>>>   <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>>>   <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>>>   <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>>
>> If it's this way I'd prefer unless-label in map:read to make it clear. Or maybe
>>
>>   <map:read src="docs/{1}.doc" unless-label="*"/>
>>
>> would do, meaning "use this unless any views are requested" (and * would be the only allowed value).
>>
>>> Ah, and this is very easily implementable ;-)
>>
>> Quickquick, do it before the FS police hears us ;-) Seriously, I find this useful for indexing and other purposes (getting meta-information about binary files, images, etc. for example).
>
> Me too. But since this is a change in the sitemap syntax, we should have a vote on this. Any other proposal or opinion on this subject before we start a vote?

Can't you just enable generators in map:view for the case when the view starts with a reader?

Vadim
Re: [RT] Views for readers
On Thu, Aug 14, 2003 at 01:41:55PM +0200, Sylvain Wallez wrote: Jeff Turner wrote: ... map:view name=indexablecontent from-position=first map:select type=xml-type map:when test=docbook map:transform src=docbook2whatever.xsl/ /map:when map:when test=tei map:transform src=tei2whatever.xsl/ /map:when map:when test=msword map:transform src=word2whatever.xsl/ /map:when /map:select /map:view Ah, ok, the strongly type pipelines are a different wording for content-aware selectors ! Ah yes. Strange how the same concept can live two separate lives in one's head ;) Like the same class in two classloaders. So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would return XML representing the content of the .doc file. I described the same thing in a mail with subject 'Type-aware Views (Re: Link view goodness)'. Same need, different context, same proposed solution. Not exactly : the use case here is that we have a binary file which is normally sent as is to the browser using a reader. It is _not_ parsed as an XML stream. So we can't attach a view to these kinds of URLs since views provide a different _ending_ to a pipeline, meaning there must exist at least a generator and optionnaly one or more transformers at the point where processing is directed to the view. So even content-aware selectors don't solve this problem... Isn't the problem there that a map:read is a whole little pipeline unto itself? If it were broken into two atomic operations: map:generate type=binary src=foo.doc/ map:serialize type=binary/ then we could have a map:view from-position=first/ using a content-aware pipeline, and everything would work. I have the feeling that handling non-XML content in Cocoon is Just Wrong, and that map:read is just a hack. The fact that it doesn't integrate with Views is a symptom of this. 
In a theoretically pure world, we'd either make Cocoon an XML-only framework and kill map:read, or make Cocoon a generic data pipelining framework capable of handling and transforming binary content. Well it's a RT after all.. ;) --Jeff Sylvain -- Sylvain Wallez Anyware Technologies http://www.apache.org/~sylvain http://www.anyware-tech.com { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects } Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
Miles Elam wrote: Sylvain Wallez wrote: Go back to first post of this thread, where (last paragraph) I proposed something similar. The whole discussion is about how we could have a syntax which doesn't introduce such verbosity in the sitemap. Verbosity is not necessarily a bad thing. If it were, would any of us be using XML? ;-) Good point. snip/ Let's consider the MIDI example. Suppose we have a large collection of karaoke files (MIDI supports embedded text that can be played on screen while playing the music), and we want to index the text of these songs for easy retrieval (along with some other meta-data). Here's a sitemap example, using the current syntax snip/ And the proposed shorter one : map:match pattern=*.mid map:read src={1}.mid unless-label=content/ map:generate type=midi src={1}.mid/ map:transform src=xmidi2xdoc.xsl label=content-label/ !-- should never come here -- map:serialize type=xml/ /map:match Two lines. What does it give except obfuscation? Given the point above (Verbosity is not necessarily a bad thing (c) Miles Elam) more readable and already supported syntax is: map:resource name=midi/ map:match type=view pattern=content map:generate type=midi src={1}.mid/ map:transform src=xmidi2xdoc.xsl label=content/ map:serialize type=xml/ /map:match map:read mime-type=whatever/midi src={1}.mid/ /map:match map:match pattern=*.mid/ map:call resource=midi/ /map:match Moreover! Resource midi is reusable: map:match pattern=another/*.mid/ map:call resource=midi/ /map:match , while example above is not. This breaks current convention that either a reader or a generator/transformer/serializer can act in a pipeline. And, given this resource example, it does not break any sitemap semantics which we have today. In the first example, if content isn't specified, the action returns null and the reader is invoked; As far as the pipeline logic is concerned, there is only the reader. Serializers are already known as universal exit points. 
To use the second, the convention must be broken and readers must become universal exit points. In other words, map:match pattern=*.mid map:read src={1}.mid/ !-- without the unless-label -- map:generate type=midi src={1}.mid/ map:transform src=xmidi2xdoc.xsl label=content-label/ !-- should never come here -- map:serialize type=xml/ /map:match must become valid for consistency. A reader becomes an exit point and the rest of a pipeline is, by default, ignored. Is this an intended consequence? I fell strongly -1 on this one. Vadim
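As an illustration of the resource-based alternative argued for above, here is a hedged sketch. It swaps the `type="view"` matcher from the message for the request-parameter matcher shown elsewhere in the thread, and it passes the file name to the resource through an explicit `map:parameter`; both choices are assumptions made for this sketch, as are the `file` parameter name and the `audio/midi` MIME type (the original says only `whatever/midi`). How sitemap variables resolve inside a called resource should be checked against the Cocoon version in use.

```xml
<map:resources>
  <!-- Reusable resource: XML for the "content" view, raw bytes otherwise -->
  <map:resource name="midi">
    <map:match type="wildcard-request-parameter" pattern="content">
      <map:parameter name="parameter-name" value="cocoon-view"/>
      <!-- {../file} steps over the inner matcher's level (assumed behaviour) -->
      <map:generate type="midi" src="{../file}.mid"/>
      <map:transform src="xmidi2xdoc.xsl"/>
      <map:serialize type="xml"/>
    </map:match>
    <map:read mime-type="audio/midi" src="{file}.mid"/>
  </map:resource>
</map:resources>

<map:pipelines>
  <map:pipeline>
    <map:match pattern="*.mid">
      <map:call resource="midi">
        <map:parameter name="file" value="{1}"/>
      </map:call>
    </map:match>
    <!-- The same resource reused for a second URL space -->
    <map:match pattern="another/*.mid">
      <map:call resource="midi">
        <map:parameter name="file" value="another/{1}"/>
      </map:call>
    </map:match>
  </map:pipeline>
</map:pipelines>
```

Nothing in this sketch requires new sitemap syntax, which is the point of the argument: the reader stays a plain reader, and the view branch is an ordinary nested match.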
Re: [RT] Views for readers
Bertrand Delacretaz wrote:

> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>
>> ...But shouldn't we keep the labels that are already used in pipelines? E.g.:
>>
>>   <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>>   <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>>   <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>
> If it's this way I'd prefer unless-label in map:read to make it clear. Or maybe
>
>   <map:read src="docs/{1}.doc" unless-label="*"/>
>
> would do, meaning "use this unless any views are requested" (and * would be the only allowed value).
>
>> Ah, and this is very easily implementable ;-)
>
> Quickquick, do it before the FS police hears us ;-) Seriously, I find this useful for indexing and other purposes (getting meta-information about binary files, images, etc. for example).

Me too. But since this is a change in the sitemap syntax, we should have a vote on this. Any other proposal or opinion on this subject before we start a vote?

Sylvain

-- 
Sylvain Wallez Anyware Technologies http://www.apache.org/~sylvain http://www.anyware-tech.com { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects } Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
Sylvain Wallez wrote, On 14/08/2003 14.30: Nicola Ken Barozzi wrote: Jeff Turner wrote, On 14/08/2003 14.17: ... Isn't the problem there that a map:read is a whole little pipeline unto itself? If it were broken into two atomic operations: map:generate type=binary src=foo.doc/ map:serialize type=binary/ then we could have a map:view from-position=first/ using a content-aware pipeline, and everything would work. Well, why can't the view simply start from a reader? map:read src=foo.doc/ Because a view finishes a partial XML pipeline, meaning it requires a generator to be already present... That's because of how we define a view now ;-) If we had just pipelines that handle both binary and xml data, the viw would finish a partial pipeline, in this case starting from binary. I have the feeling that handling non-XML content in Cocoon is Just Wrong, and that map:read is just a hack. The fact that it doesn't integrate with Views is a symptom of this. In a theoretically pure world, we'd either make Cocoon an XML-only framework and kill map:read, or make Cocoon a generic data pipelining framework capable of handling and transforming binary content. Well, it can be done easily by allowing more than one reader and by allowing readers in the xml pipeline. Some time back I had proposed the following to be possible (and got touted as the usual FS man) map:read src=foo1.doc/ map:read type=stripstuff/ map:read type=otherfilter/ Mhhh... I guess stripstuff and otherfilter are actually map:transform-binary and not map:read as they do have an input. Now how do we close the pipeline ? Is there a map:serialize-binary ? Since streams are just streams, they don't need to be adapted like XML, so there is no notion of Generator or Serializer really, but only filter. So the reader is just a filter, and if in the middle it's just given a stream and has to output to a stream. So there is no need to open, and no need to close. 
And also: map:read src=foo1.doc/ map:generate src=foo1.doc/ map:serialize src=foo1.doc/ map:read type=zip/ Wow! What's the result of this ?? Oops, a bit too quick. !-- remove encription or do other stream preprocessing -- map:read type=decrypt src=foo1.doc/ !-- normal generation but from the previous reader output -- map:generate type=doc2xml/ !-- eventual transforms-- !-- give back html -- map:serialize type=html/ !-- zip that result so that it takes less bandwidth -- map:read type=zip/ We can already do this BTW by using the Cocooon protocol, but it's such a hack! Sounds interesting. Can you elaborate on the hack ? map:match pattern=mypage.html map:read src=internal/mypage.html type=zip/ /map:match map:match pattern=internal/mypage.html !-- generate, transform, serialize... -- /map:match BTW, maybe you may be interested in my RT about aspected pipeline snippets, it could be interesting. Basically it would make it possible to insert pipeline components inside all pipelines using certain rules. -- Nicola Ken Barozzi [EMAIL PROTECTED] - verba volant, scripta manent - (discussions get forgotten, just code remains) -
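The "hack" described above can be sketched a little more completely as follows. This is only an illustration: the `zip` reader type is hypothetical (it stands for any reader that post-processes the stream it is given), and the source document and stylesheet names are invented for the example. The `cocoon:/` prefix is the internal-request protocol referred to in the message as "the Cocoon protocol".

```xml
<!-- Public URL: a reader wraps the output of an internal pipeline -->
<map:match pattern="mypage.html">
  <!-- "zip" is a hypothetical reader type that compresses what it reads -->
  <map:read type="zip" src="cocoon:/internal/mypage.html"/>
</map:match>

<!-- Internal pipeline: ordinary generate / transform / serialize -->
<map:match pattern="internal/mypage.html">
  <map:generate src="docs/mypage.xml"/>        <!-- hypothetical source file -->
  <map:transform src="doc2html.xsl"/>          <!-- hypothetical stylesheet -->
  <map:serialize type="html"/>
</map:match>
```

The indirection through a second matcher is what makes this feel like a workaround: the byte-level post-processing cannot be expressed in the same pipeline that produces the HTML.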
Re: [RT] Views for readers
Bertrand Delacretaz wrote:

> How about making it the other way round, by allowing Generators to read from Readers?
>
>   <map:match pattern="*.doc" default-view="binary">
>     <map:generator label="xml-content-for-indexing" type="wordToXml">
>       <map:read src="word-documents/{1}.doc" label="binary" mime-type="..."/>
>     </map:generator>
>     <map:serialize type="xml"/>
>   </map:match>

Do you mean that the generator would be used if the xml-content-for-indexing view is selected? This doesn't fit with the existing sitemap behaviour, since generators are _always_ added to the pipeline.

But what if we write it the other way around:

  <map:read src="docs/{1}.doc">
    <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
  </map:read>

The meaning of the above is: if a view is requested, execute what's _inside_ the map:read. If it builds a complete pipeline then return its result, otherwise just perform the usual read operation.

> Is that RT-ish enough?

Mmmmh... not as wild as Nicola Ken's. Try again ;-P

Sylvain

-- 
Sylvain Wallez Anyware Technologies http://www.apache.org/~sylvain http://www.anyware-tech.com { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects } Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
Hi!

Sorry, but this discussion seems to tell us one thing: The current sitemap syntax and Cocoon processing model is not really suitable for this kind of processing.

All this reminds me of a proposal (which was actually a RT) I sent back in January this year, where I proposed a more intuitive and flexible pipeline concept. I don't want to say that this would be the solution to all problems, and I definitely made some mistakes because I didn't know that much of the Cocoon internals at the time of writing, but I think it's time to take a second look at it.

So here's the link:
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=104482372430759&w=2

Again, please don't be so harsh concerning mistakes, but I think there are many ideas included, which give some food for thought.

Bye,
Andreas Hochsteger
http://highstick.blogspot.com/

Sylvain Wallez wrote:

> Bertrand Delacretaz wrote:
>
>> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>>
>>> ...But shouldn't we keep the labels that are already used in pipelines? E.g.:
>>>
>>>   <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>>>   <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>>>   <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>>
>> If it's this way I'd prefer unless-label in map:read to make it clear. Or maybe
>>
>>   <map:read src="docs/{1}.doc" unless-label="*"/>
>>
>> would do, meaning "use this unless any views are requested" (and * would be the only allowed value).
>>
>>> Ah, and this is very easily implementable ;-)
>>
>> Quickquick, do it before the FS police hears us ;-) Seriously, I find this useful for indexing and other purposes (getting meta-information about binary files, images, etc. for example).
>
> Me too. But since this is a change in the sitemap syntax, we should have a vote on this. Any other proposal or opinion on this subject before we start a vote?
>
> Sylvain
Re: [RT] Views for readers
Vadim Gritsenko wrote: Sylvain Wallez wrote: Vadim Gritsenko wrote: Sylvain Wallez wrote: snip/ Any other proposal or opinion on this subject before we start a vote ? Can't you just enable generators in map:view in case when view starts with reader? No, since views capture the (XML) output at certain points of the pipeline to provide a different formatting. In case of the reader, there is no (XML) output in the pipeline. It's special case, unless you want to introduce binary pipelines (and I hope you don't want to), so it would require special handling. E.g. the processing for the indexable-content view Sidenote: It's called content -- the view which you use to build a site search index. Picky sidenote : this is configurable using the content-view-query config of the lucene-xml-indexer component ;-) is the same for all URIs, be them XML pipelines or a single reader. So there's no way other than having a generator _before_ jumping to the view, feeding that view with the kind of XML content it expects. Here is another wild (or not?) thought. All this discussion comes down to the requirement of generating some XML out of the content usually served by the reader, if that's possible (and it is possible for some of the types of the content), in order to feed this XMLized content into the view. This generated XML is somewhat equivalent to the binary represenation for the purpose of view building. So, I'm going to the conclusion that some types of readers can be paired with the generator producing equivalent, but XMLized, content. 
The best place to indicate such pairing is the time when you declare a reader: map:readers default=resource map:reader name=resource src=org.apache.cocoon.reading.ResourceReader/ map:reader name=html src=org.apache.cocoon.reading.ResourceReader generator-paired-to-this-readerhtml/generator-paired-to-this-reader /map:reader map:reader name=msexcel src=org.apache.cocoon.reading.ResourceReader generator-paired-to-this-readerpoi-excel-generator/generator-paired-to-this-reader /map:reader map:reader name=pdf src=org.apache.cocoon.reading.ResourceReader generator-paired-to-this-readerpdf-text-extractor-generator/generator-paired-to-this-reader /map:reader /map:readers I'm afraid this won't work : - a generator specific to a given content-type is very unlikely to produce the document type expected by the view. We will most often need an additional transformation (e.g. the xword2xdoc.xsl that was in my example) - views, through their associated labels, can be plugged at any point of the pipelines. Defining pair generators restricts views to be only from-label=start. PS: Modifying sitemap syntax to allow reader/generator pairs with some unless attrbiutes looks awful to me. Doesn't seem so awful to me, since the reader should be executed unless certain conditions are met, which are that the specified label(s) correspond to the one at which the requested view should start. Sylvain -- Sylvain Wallez Anyware Technologies http://www.apache.org/~sylvain http://www.anyware-tech.com { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects } Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
Jeff Turner wrote:

> <snip/>
>
> Isn't the problem there that a map:read is a whole little pipeline unto itself? If it were broken into two atomic operations:
>
>   <map:generate type="binary" src="foo.doc"/>
>   <map:serialize type="binary"/>
>
> then we could have a <map:view from-position="first"/> using a content-aware pipeline, and everything would work.
>
> I have the feeling that handling non-XML content in Cocoon is Just Wrong, and that map:read is just a hack. The fact that it doesn't integrate with Views is a symptom of this. In a theoretically pure world, we'd either make Cocoon an XML-only framework and kill map:read, or make Cocoon a generic data pipelining framework capable of handling and transforming binary content.
>
> Well it's a RT after all.. ;)

Content-aware and binary pipelines in the same post? Wow! Yes, it's definitely a RT ;-P

Sylvain

-- 
Sylvain Wallez Anyware Technologies http://www.apache.org/~sylvain http://www.anyware-tech.com { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects } Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
Vadim Gritsenko wrote: Sylvain Wallez wrote: Vadim Gritsenko wrote: Sylvain Wallez wrote: Vadim Gritsenko wrote: snip/ Here is another wild (or not?) thought. All this discussion comes down to the requirement of generating some XML out of the content usually served by the reader, if that's possible (and it is possible for some of the types of the content), in order to feed this XMLized content into the view. This generated XML is somewhat equivalent to the binary represenation for the purpose of view building. So, I'm going to the conclusion that some types of readers can be paired with the generator producing equivalent, but XMLized, content. The best place to indicate such pairing is the time when you declare a reader: map:readers default=resource map:reader name=resource src=org.apache.cocoon.reading.ResourceReader/ map:reader name=html src=org.apache.cocoon.reading.ResourceReader generator-paired-to-this-readerhtml/generator-paired-to-this-reader /map:reader map:reader name=msexcel src=org.apache.cocoon.reading.ResourceReader generator-paired-to-this-readerpoi-excel-generator/generator-paired-to-this-reader /map:reader map:reader name=pdf src=org.apache.cocoon.reading.ResourceReader generator-paired-to-this-readerpdf-text-extractor-generator/generator-paired-to-this-reader /map:reader /map:readers I'm afraid this won't work : Can you suggest some improvements so it does work? My goal is to have as little impact on sitemap syntax as possible. - a generator specific to a given content-type is very unlikely to produce the document type expected by the view. We will most often need an additional transformation (e.g. the xword2xdoc.xsl that was in my example) More wild suggestions. 1/ Do something with the views. 
Say, allow duplicate view names and make them work as selector: map:views !-- works if (when) reader -- map:view from-position=reader name=content map:transform src=wordml2content.xsl label=content/ map:serialize type=xml/ /map:view !-- works if (when) label -- map:view from-label=content name=content map:serialize type=xml/ /map:view !-- works if no label (otherwise) -- map:view from-position=first name=content map:serialize type=xml/ /map:view /map:views Still the same problem I desperatly pointing out again and again : how can the from-position=reader use different generators (i.e. parsers) depending on the binary content ? 2/ Do something with the readers. map:readers default=resource map:reader name=msword src=org.apache.cocoon.reading.ResourceReader map:generate type=msword/ map:transform src=wordml2content.xsl/ /map:reader /map:readers This introduces sitemap snippets into a component manager configuration, wich is not good at all. 3/ Alternative to 2: map:readers default=resource map:reader name=msword src=org.apache.cocoon.reading.ResourceReader xmlizer-uricocoon://word-2-content//xmlizer-uri /map:reader /map:readers map:views map:view from-label=content name=content map:serialize type=xml/ /map:view /map:views map:pipelines ... map:read src=my.doc/ ... map:match pattern=word-2-content/* map:generate type=msword src={1}/ map:transform src=wordml2content.xsl label=content/ map:serialize type=xml/ /map:match /map:pipelines Sounds better, but has the problem that it implies that every view should return xml content on my.doc. Or to we introduce a label attribute on map:read to define on which particular view the xmlizer-uri should be triggered ? I would not say that I like any of the suggestions above. The cleanest way ATM is the usage of map:resource I suggested in other email (I yet to see your comment on it). Sorry, I have no particular comment on the use of resources, as it's mainly a refactoring of the action/matcher proposals. 
- views, through their associated labels, can be plugged at any point of the pipelines. Defining pair generators restricts views to be only from-label=start. PS: Modifying sitemap syntax to allow reader/generator pairs with some unless attrbiutes looks awful to me. Doesn't seem so awful to me, since the reader should be executed unless certain conditions are met, which are that the specified label(s) correspond to the one at which the requested view should start. This unless attribute is nothing else than shortcut for map:match. Given point on verbosity and given the obfuscated result, I'm for verbosity. Not exacly : you can currently match on the view name (provided that the environment actually does rely on the cocoon-view parameter), but you cannot match on the labels. And only labels are currently used in the map:pipelines section. PS Keep sitemap syntax clean! Say No! to woodo! Funny. That's often me that says too much magic kills the confidence. Let's stop this discussion for now. I have the feeling
Re: [RT] Views for readers
Upayavira wrote:
> On 14 Aug 2003 at 15:34, Bertrand Delacretaz wrote:
>> I find this more understandable (but dunno about implementation):
>>
>>   <!-- if reader is executed, the rest is not -->
>>   <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
>>   <map:generate src="docs/{1}.doc" type="wordToXml"/>
>>   <map:transform...
>
> Simplifying further:
>
>   <map:read src="docs/{1}.doc" view-generator="wordToXml"/>
>
> Surely that'd do it?

This might be better, because what happens when someone comes along doing this:

  <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
  <map:generate src="docs/{2}.doc" type="wordToXml"/>

Then the same request represents two different sources, which could be either confusing or very useful. I don't fully understand the implications of everything. Just tossing my $0.02 in... it's early and I'm tired :)

Tony
Re: [RT] Views for readers
Miles Elam wrote:
> In other words, the pipeline is full of side effects and dependent upon things happening behind the curtain (to use a Wizard of Oz reference). You'd be right in that it adds to the confusion. I agree with Vadim. This is obfuscation in exchange for two lines of verboseness.

Just some additional precisions, mon frère!

Yes, the pipeline is full of side effects, which can break pipelines at any point and continue somewhere else without this being explicitly visible in the pipeline construction statements. These side effects are called views, and the way to define views is through labels.

And even worse: labels can be placed on component definitions, meaning a clean pipeline with no label attribute at all is full of these side effects.

So what you call obfuscation has been there *for years*. And everybody's happy with it.

Sylvain

--
Sylvain Wallez
Anyware Technologies
http://www.apache.org/~sylvain
http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
Miles Elam wrote:
> Sylvain Wallez wrote:
>> Go back to the first post of this thread, where (last paragraph) I proposed something similar. The whole discussion is about how we could have a syntax which doesn't introduce such verbosity in the sitemap.
>
> Verbosity is not necessarily a bad thing. If it were, would any of us be using XML? ;-)

Good point. However, the only verbosity currently added by views is the label attribute. This proposal is about achieving the same low verbosity for views with binary content.

>> As I explained in several replies, there's no equivalence between a reader and a generator able to parse a given binary format. There needs to be some kind of adaptation/extraction before feeding the view.
>
> Yup.
>
>> And what you describe above as a PDF reader, a Word reader, a Postscript reader, etc. are IMO nothing more than _generators_, just like the SWF and MIDI generators we already have.
>
> The functionality for all readers would obviously be the same: move these bytes from here to there. But yes, the codified mapping I think is important.

Please read carefully: I wrote *generators*!! This isn't about moving bytes, but about producing an XML document.

>> Let's consider the MIDI example. Suppose we have a large collection of karaoke files (MIDI supports embedded text that can be played on screen while playing the music), and we want to index the text of these songs for easy retrieval (along with some other meta-data). Here's a sitemap example, using the current syntax:
>>
>>   <map:match pattern="*.mid"/>
>>     <map:act type="catch-view" src="content">
>>       <map:generate type="midi" src="{1}.mid"/>
>>       <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>>       <!-- should never come here -->
>>       <map:serialize type="xml"/>
>>     </map:match>
>>     <map:read src="{1}.mid"/>
>>   </map:match>
>
> You're mixing the map:act with a /map:match, but I get the idea.

Picky guy, eh?

>> (the content view starts at the content-label label to clearly distinguish the two notions).
>>
>> And the proposed shorter one:
>>
>>   <map:match pattern="*.mid">
>>     <map:read src="{1}.mid" unless-label="content"/>
>>     <map:generate type="midi" src="{1}.mid"/>
>>     <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>>     <!-- should never come here -->
>>     <map:serialize type="xml"/>
>>   </map:match>
>
> This breaks the current convention that either a reader or a generator/transformer/serializer can act in a pipeline. In the first example, if content isn't specified, the action returns null and the reader is invoked; as far as the pipeline logic is concerned, there is only the reader. Serializers are already known as universal exit points. To use the second, the convention must be broken and readers must become universal exit points.

Readers already are universal exit points: once you encounter a reader, sitemap processing is terminated. map:read and map:serialize are like a return statement in Java.

> In other words,
>
>   <map:match pattern="*.mid">
>     <map:read src="{1}.mid"/> <!-- without the unless-label -->
>     <map:generate type="midi" src="{1}.mid"/>
>     <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>     <!-- should never come here -->
>     <map:serialize type="xml"/>
>   </map:match>
>
> must become valid for consistency. A reader becomes an exit point and the rest of a pipeline is, by default, ignored. Is this an intended consequence?

No consequence: this is how the sitemap works today, and the above is valid, even if we can consider that the sitemap engine should be more strict and signal that there's some unreachable code.

To add more to the confusion, in both your example and mine, we can even avoid writing the map:serialize statement. Since some additional filtering occurs beforehand (either through the action or through reader labels), this statement is never reached and is useless!

Sylvain
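A rough way to picture the semantics described in this message: statements are walked in order, map:read and map:serialize behave like a return, and the proposed unless-label lets a reader be skipped when the requested view starts at that label. The sketch below is only a model under those assumptions. UnlessLabelSketch, Step and execute are hypothetical names, not Cocoon classes, and this is not how the sitemap engine is implemented.

```java
// Hypothetical model of the *proposed* unless-label attribute; not Cocoon code.
final class UnlessLabelSketch {

    /** One sitemap statement: element name plus optional label / unless-label. */
    static final class Step {
        final String element;     // e.g. "map:read", "map:generate"
        final String label;       // value of label="...", or null
        final String unlessLabel; // value of the proposed unless-label="...", or null

        Step(String element, String label, String unlessLabel) {
            this.element = element;
            this.label = label;
            this.unlessLabel = unlessLabel;
        }
    }

    /**
     * Walks the statements in order and returns what was executed.
     * requestedLabel is the label at which the requested view starts,
     * or null for a normal (non-view) request.
     */
    static java.util.List<String> execute(java.util.List<Step> steps, String requestedLabel) {
        java.util.List<String> executed = new java.util.ArrayList<String>();
        for (Step s : steps) {
            if (s.element.equals("map:read")) {
                // Assumption: the reader is skipped only when its unless-label
                // names the label where the requested view starts.
                if (s.unlessLabel != null && s.unlessLabel.equals(requestedLabel)) {
                    continue;
                }
                executed.add(s.element);
                return executed; // a reader is an exit point, like "return"
            }
            executed.add(s.element);
            if (s.element.equals("map:serialize")) {
                return executed; // a serializer is an exit point too
            }
            if (requestedLabel != null && requestedLabel.equals(s.label)) {
                executed.add("view:" + requestedLabel); // branch into the view
                return executed;
            }
        }
        return executed;
    }
}
```

With the four statements of the MIDI example (using an illustrative label name), a normal request executes only map:read, while a request whose view starts at the labelled transformer skips the reader, runs map:generate and map:transform, and then branches into the view, so the trailing map:serialize is never reached.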
Re: [RT] Views for readers
Vadim Gritsenko wrote:
>> Ummm... Quick question: What are the use cases for this that are not handled by existing methods? I mean, couldn't this be handled with an (as-yet unwritten) action?
>
> Matcher *does* exist:

Heh heh... learning something new every day.

- Miles Elam
Re: [RT] Views for readers
Sylvain Wallez wrote:
> Vadim Gritsenko wrote:
>> Sylvain Wallez wrote:
>>> Vadim Gritsenko wrote:
>>>> <snip/>
>>
>> Here is another wild (or not?) thought.
>>
>> All this discussion comes down to the requirement of generating some XML out of the content usually served by the reader, if that's possible (and it is possible for some of the types of the content), in order to feed this XMLized content into the view. This generated XML is somewhat equivalent to the binary representation for the purpose of view building.
>>
>> So, I'm coming to the conclusion that some types of readers can be paired with the generator producing equivalent, but XMLized, content. The best place to indicate such pairing is the time when you declare a reader:
>>
>>   <map:readers default="resource">
>>     <map:reader name="resource" src="org.apache.cocoon.reading.ResourceReader"/>
>>     <map:reader name="html" src="org.apache.cocoon.reading.ResourceReader">
>>       <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>>     </map:reader>
>>     <map:reader name="msexcel" src="org.apache.cocoon.reading.ResourceReader">
>>       <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>>     </map:reader>
>>     <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
>>       <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>>     </map:reader>
>>   </map:readers>
>
> I'm afraid this won't work:

Can you suggest some improvements so it does work? My goal is to have as little impact on sitemap syntax as possible.

> - a generator specific to a given content-type is very unlikely to produce the document type expected by the view. We will most often need an additional transformation (e.g. the xword2xdoc.xsl that was in my example)

More wild suggestions.

1/ Do something with the views. Say, allow duplicate view names and make them work as selector:

  <map:views>
    <!-- works if (when) reader -->
    <map:view from-position="reader" name="content">
      <map:transform src="wordml2content.xsl" label="content"/>
      <map:serialize type="xml"/>
    </map:view>
    <!-- works if (when) label -->
    <map:view from-label="content" name="content">
      <map:serialize type="xml"/>
    </map:view>
    <!-- works if no label (otherwise) -->
    <map:view from-position="first" name="content">
      <map:serialize type="xml"/>
    </map:view>
  </map:views>

2/ Do something with the readers.

  <map:readers default="resource">
    <map:reader name="msword" src="org.apache.cocoon.reading.ResourceReader">
      <map:generate type="msword"/>
      <map:transform src="wordml2content.xsl"/>
    </map:reader>
  </map:readers>

3/ Alternative to 2:

  <map:readers default="resource">
    <map:reader name="msword" src="org.apache.cocoon.reading.ResourceReader">
      <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
    </map:reader>
  </map:readers>

  <map:views>
    <map:view from-label="content" name="content">
      <map:serialize type="xml"/>
    </map:view>
  </map:views>

  <map:pipelines>
    ...
    <map:read src="my.doc"/>
    ...
    <map:match pattern="word-2-content/*">
      <map:generate type="msword" src="{1}"/>
      <map:transform src="wordml2content.xsl" label="content"/>
      <map:serialize type="xml"/>
    </map:match>
  </map:pipelines>

I would not say that I like any of the suggestions above. The cleanest way ATM is the usage of map:resource I suggested in other email (I have yet to see your comment on it).

> - views, through their associated labels, can be plugged at any point of the pipelines. Defining pair generators restricts views to be only from-label="start".
>
>> PS: Modifying sitemap syntax to allow reader/generator pairs with some unless attributes looks awful to me.
>
> Doesn't seem so awful to me, since the reader should be executed unless certain conditions are met, which are that the specified label(s) correspond to the one at which the requested view should start.

This unless attribute is nothing else than shortcut for map:match. Given point on verbosity and given the obfuscated result, I'm for verbosity.

PS Keep sitemap syntax clean! Say No! to voodoo!

Vadim
Re: [RT] Views for readers
> Hmm, Frederic's question about search engine integration led me to wonder how Cocoon's Lucene integration could transparently index Word & PDF documents along with XML-produced documents.

I have been wondering about that too. At my company, we put together a simple web management tool to put small collections of documents into a web frame for a client. Pretty useless, but it's what he wanted. At the time I had thought it might be possible to just improve Lucene so it could understand binary files by introducing MIME-type-triggered filter modules that converted to text on the input stream. After all, if the text were only going to be used for indexing, it wouldn't matter if the text wasn't available within Cocoon itself. In any case he's happy with what he has and we're happily doing other stuff.

Perhaps if the individual extractors are part of specialised readers for specific types of documents, then you could configure the label for the XML they return? That would allow the duality of that behaviour to be mostly concealed and managed from within Cocoon with little effect on the sitemap.

I personally find it tempting to think that it may be possible to rip XML out of any of these formats and do with it as we wish, particularly when I saw that programs like catdoc could recognize the tables even in Word 2k documents. But I often find myself pushing back against that, thinking that maybe I should represent all content (even document content) semantically in XML and let rendering technologies (PDFSerializer, POI) handle binary output, and perhaps leverage document importers that map those documents back to XML (they all seem to be proprietary, big-buck solutions from what I see currently, though). In any case, it does seem that this is certainly a ways off in the future *sigh*

Hmm, an OCR extractor would be way cool for faxes too!

just my 2c, I never say anything most of the time, anyway

Sam
Re: [RT] Views for readers
Sylvain Wallez wrote:
> Miles Elam wrote:
>> In other words, the pipeline is full of side effects and dependent upon things happening behind the curtain (to use a Wizard of Oz reference). You'd be right in that it adds to the confusion. I agree with Vadim. This is obfuscation in exchange for two lines of verboseness.
>
> Just some additional precisions, mon frère!

I hope it wasn't taken the wrong way. I did not intend any offense.

> Yes, the pipeline is full of side effects, which can break pipelines at any point and continue somewhere else without this being explicitly visible in the pipeline construction statements. These side effects are called views, and the way to define views is through labels.

Don't get me wrong. I see clearly the reason why views exist. I see clearly why reader views are wanted. When working with XML data -- not just text, but structured text -- getting at that data before it is processed into a presentation format (such as viewing source, getting a true content view, etc.) can prove invaluable.

> And even worse: labels can be placed on component definitions, meaning a clean pipeline with no label attribute at all is full of these side effects.
>
> So what you call obfuscation has been there *for years*. And everybody's happy with it.

When grabbing from the presentation format as a source, you are comparing apples and oranges. Not only are there innumerable binary formats out there being squeezed into a few reader implementations, but they are not all desirable data. While you may want the data from a PDF file, you may not bother with a PNG image because it may index "Created with The Gimp" over and over.

Since putting in all binary format-to-generator mapping info seems out of the question, all of the pipeline path must be specified in the matcher, hence the discussion surrounding readers and generators in the same matcher. If everything is specified in the same matcher and not truly orthogonal, as is the case for views currently, why add the extra syntax for what amounts to a non-orthogonal if-else clause?

  if (!content-view)
    read
  else
    generate
    transform
    serialize

as opposed to

  generate
     +-- view-short-circuit! --+- transform-x
  transform-1                  +- serialize
  transform-2
  serialize

There is a discontinuity there that makes me uncomfortable. This is not an overt attachment to symmetry. This is seeing the same tool applied to two (in my opinion) very different tasks.

I am not a committer and can't vote. But these are my thoughts on the matter. Take with as many grains of salt as are necessary.

- Miles Elam
Re: [RT] Views for readers
Sylvain Wallez wrote:
>> The functionality for all readers would obviously be the same: move these bytes from here to there. But yes, the codified mapping I think is important.
>
> Please read carefully: I wrote *generators*!! This isn't about moving bytes, but about producing an XML document.

Au contraire mon frère, this is implemented with generators but it is about pulling searchable info out of arbitrary binary data. The first step to that goal is to standardize it, therefore generators are added. The issue is about *readers* and the custom formats they encompass not being indexable.

>> You're mixing the map:act with a /map:match, but I get the idea.
>
> Picky guy, eh?

You know it. :)

> Readers already are universal exit points: once you encounter a reader, sitemap processing is terminated. map:read and map:serialize are like a return statement in Java.

Not according to the code, they're not. Check out AbstractProcessingPipeline.java. There are method bodies like:

  public void setGenerator (String role, String source, Parameters param, Parameters hintParam)
  throws ProcessingException {
      if (this.generator != null) {
          throw new ProcessingException ("Generator already set. You can only select one Generator (" + role + ")");
      }
      if (this.reader != null) {
          throw new ProcessingException ("Reader already set. You cannot use a reader and a generator for one pipeline.");
      }
      ...

and

  public void setReader (String role, String source, Parameters param, String mimeType)
  throws ProcessingException {
      if (this.reader != null) {
          throw new ProcessingException ("Reader already set. You can only select one Reader (" + role + ")");
      }
      if (this.generator != null) {
          throw new ProcessingException ("Generator already set. You cannot use a reader and a generator for one pipeline.");
      }
      ...

Either the policy was in effect when this file (and its subclasses) were made or someone put constraining statements in that serve no purpose. The file was last modified on August 6th of this year. If the policy has changed, no one told the code.

> No consequence: this is how the sitemap works today, and the above is valid, even if we can consider that the sitemap engine should be more strict and signal that there's some unreachable code.

I can't speak to validity, but this is NOT how it works today.

> To add more to the confusion, in both your example and mine, we can even avoid writing the map:serialize statement. Since some additional filtering occurs beforehand (either through the action or through reader labels), this statement is never reached and is useless!

In other words, the pipeline is full of side effects and dependent upon things happening behind the curtain (to use a Wizard of Oz reference). You'd be right in that it adds to the confusion. I agree with Vadim. This is obfuscation in exchange for two lines of verboseness.

- Miles Elam
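The two method bodies quoted in this message boil down to a mutual-exclusion guard on a single pipeline object. A minimal, self-contained sketch of that guard, assuming a simplified class, might look like the following. PipelineGuardSketch is a hypothetical name, the setters take only a role, and IllegalStateException stands in for ProcessingException. It says nothing about what the sitemap engine does before it reaches the pipeline object, which is the part under dispute here.

```java
// Simplified, hypothetical model of the reader/generator guard; not the Cocoon class.
final class PipelineGuardSketch {
    private String generator; // role of the generator, once set
    private String reader;    // role of the reader, once set

    /** Registers a generator; fails if a generator or a reader is already present. */
    void setGenerator(String role) {
        if (generator != null) {
            throw new IllegalStateException(
                "Generator already set. You can only select one Generator (" + role + ")");
        }
        if (reader != null) {
            throw new IllegalStateException(
                "Reader already set. You cannot use a reader and a generator for one pipeline.");
        }
        generator = role;
    }

    /** Registers a reader; fails if a reader or a generator is already present. */
    void setReader(String role) {
        if (reader != null) {
            throw new IllegalStateException(
                "Reader already set. You can only select one Reader (" + role + ")");
        }
        if (generator != null) {
            throw new IllegalStateException(
                "Generator already set. You cannot use a reader and a generator for one pipeline.");
        }
        reader = role;
    }
}
```

In this model, calling setReader and then setGenerator on the same instance throws, which is the behaviour Miles points to. Whether the sitemap ever makes both calls for one request is a separate question that the sketch does not answer.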
Re: [RT] Views for readers
On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
> Frederic's question about search engine integration led me to wonder how Cocoon's Lucene integration could transparently index Word & PDF documents along with XML-produced documents.
>
> There exist some text-extraction libraries for Word & PDF (e.g. http://www.textmining.org/). Now how can we integrate this as transparently as possible in Cocoon's search functionality?
>
> The Lucene indexer crawls a website and asks for a particular view (content) which is used to fill the index. But Word and PDF documents being binary files, they're handled by a map:read statement, which does not handle views.
>
> On the other hand, this use case shows that having views on binary content may make sense: a normal request just sends back the binary content, while a view can use a text/XML extraction on these binary files.
>
> So the question is: how could views be plugged to readers?
>
> I must say that I don't have an answer, as views contain transformers and a serializer, but no generator. So how could we express in the sitemap that a particular view on a reader should replace that reader by a particular generator? Or should this go through some special readers that could also act as generators?
>
> Or maybe these are silly thoughts and we should use a map:select directing to a map:read or map:generate depending on the view. But this introduces explicit view management in the pipelines, which doesn't seem nice to me.

Solution: strongly typed pipelines! :)

Imagine if, at each node in the sitemap, we knew what type of content we were dealing with (usually some flavour of XML). Then we could write a single view that behaves differently depending on the _type_ of data:

  <map:view name="indexablecontent" from-position="first">
    <map:select type="xml-type">
      <map:when test="docbook">
        <map:transform src="docbook2whatever.xsl"/>
      </map:when>
      <map:when test="tei">
        <map:transform src="tei2whatever.xsl"/>
      </map:when>
      <map:when test="msword">
        <map:transform src="word2whatever.xsl"/>
      </map:when>
    </map:select>
  </map:view>

So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would return XML representing the content of the .doc file.

I described the same thing in a mail with subject 'Type-aware Views (Re: Link view goodness)'. Same need, different context, same proposed solution.

--Jeff

> Any thoughts?
>
> Sylvain
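The type-aware view proposed in this message amounts to a lookup from content type to stylesheet. As a hedged illustration only, the selection logic could be sketched as below. TypeAwareViewSketch and stylesheetFor are hypothetical names, the type and stylesheet names are the ones from the example view, and returning null for an unknown type is an assumption, since the example does not say what happens when no branch matches.

```java
// Hypothetical model of the xml-type selector in the example view; not Cocoon code.
final class TypeAwareViewSketch {

    /**
     * Returns the stylesheet the view would apply for a given content type,
     * or null when no map:when branch matches (an assumption of this sketch).
     */
    static String stylesheetFor(String xmlType) {
        if ("docbook".equals(xmlType)) {
            return "docbook2whatever.xsl";
        }
        if ("tei".equals(xmlType)) {
            return "tei2whatever.xsl";
        }
        if ("msword".equals(xmlType)) {
            return "word2whatever.xsl";
        }
        return null; // unknown type: no transformation selected
    }
}
```

The design point is that the view stays a single named entry and the per-format knowledge lives in one place, at the cost of the sitemap having to know the content type at each node.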
Re: [RT] Views for readers
On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
> ...But shouldn't we keep labels that are already used in pipelines? E.g.:
>
>   <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>   <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>   <map:transform src="xword2xdoc.xsl" label="xdoc"/>

If it's this way I'd prefer unless-label in map:read to make it clear.

Or maybe

  <map:read src="docs/{1}.doc" unless-label="*"/>

would do, meaning "use this unless any views are requested" (and * would be the only allowed value).

> Ah, and this is very easily implementable ;-)

Quickquick, do it before the FS police hears us ;-)

Seriously, I find this useful for indexing and other purposes (getting meta-information about binary files, images, etc. for example).

-Bertrand
Re: [RT] Views for readers
Sylvain Wallez wrote:
> Vadim Gritsenko wrote:
>> Sylvain Wallez wrote:
>>> Vadim Gritsenko wrote:
>>>> Sylvain Wallez wrote:
>>>>> Vadim Gritsenko wrote:
>>>>>> <snip/>
>>>>
>>>> Here is another wild (or not?) thought.
>>>>
>>>> All this discussion comes down to the requirement of generating some XML out of the content usually served by the reader, if that's possible (and it is possible for some of the types of the content), in order to feed this XMLized content into the view. This generated XML is somewhat equivalent to the binary representation for the purpose of view building.
>>>>
>>>> So, I'm coming to the conclusion that some types of readers can be paired with the generator producing equivalent, but XMLized, content. The best place to indicate such pairing is the time when you declare a reader:
>>>>
>>>>   <map:readers default="resource">
>>>>     <map:reader name="resource" src="org.apache.cocoon.reading.ResourceReader"/>
>>>>     <map:reader name="html" src="org.apache.cocoon.reading.ResourceReader">
>>>>       <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>>>>     </map:reader>
>>>>     <map:reader name="msexcel" src="org.apache.cocoon.reading.ResourceReader">
>>>>       <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>>>>     </map:reader>
>>>>     <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
>>>>       <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>>>>     </map:reader>
>>>>   </map:readers>
>>>
>>> I'm afraid this won't work:
>>
>> Can you suggest some improvements so it does work? My goal is to have as little impact on sitemap syntax as possible.
>>
>>> - a generator specific to a given content-type is very unlikely to produce the document type expected by the view. We will most often need an additional transformation (e.g. the xword2xdoc.xsl that was in my example)
>>
>> More wild suggestions.
>>
>> 1/ Do something with the views. Say, allow duplicate view names and make them work as selector:
>>
>>   <map:views>
>>     <!-- works if (when) reader -->
>>     <map:view from-position="reader" name="content">
>>       <map:transform src="wordml2content.xsl" label="content"/>
>>       <map:serialize type="xml"/>
>>     </map:view>
>>     <!-- works if (when) label -->
>>     <map:view from-label="content" name="content">
>>       <map:serialize type="xml"/>
>>     </map:view>
>>     <!-- works if no label (otherwise) -->
>>     <map:view from-position="first" name="content">
>>       <map:serialize type="xml"/>
>>     </map:view>
>>   </map:views>
>
> Still the same problem I keep desperately pointing out again and again: how can the from-position="reader" view use different generators (i.e. parsers) depending on the binary content?

I did not copy the reader-to-generator association (<generator-paired-to-this-reader/>) declared on top. Get the generator from there.

>> 2/ Do something with the readers.
>
> ...
> This introduces sitemap snippets into a component manager configuration, which is not good at all.

Yep. Not good.

>> 3/ Alternative to 2:
>>
>>   <map:readers default="resource">
>>     <map:reader name="msword" src="org.apache.cocoon.reading.ResourceReader">
>>       <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
>>     </map:reader>
>>   </map:readers>
>>
>>   <map:views>
>>     <map:view from-label="content" name="content">
>>       <map:serialize type="xml"/>
>>     </map:view>
>>   </map:views>
>>
>>   <map:pipelines>
>>     ...
>>     <map:read src="my.doc"/>
>>     ...
>>     <map:match pattern="word-2-content/*">
>>       <map:generate type="msword" src="{1}"/>
>>       <map:transform src="wordml2content.xsl" label="content"/>
>>       <map:serialize type="xml"/>
>>     </map:match>
>>   </map:pipelines>
>
> Sounds better, but has the problem that it implies that every view should return XML content on my.doc.

Yep. Unless you define one xmlizer URI per view... Awful!

> Or do we introduce a label attribute on map:read to define on which particular view the xmlizer-uri should be triggered?

Possible.

>> I would not say that I like any of the suggestions above. The cleanest way ATM is the usage of map:resource I suggested in other email (I have yet to see your comment on it).
>
> Sorry, I have no particular comment on the use of resources, as it's mainly a refactoring of the action/matcher proposals.

But it solves the problem! And it is the cleanest solution (with minimal impact) among all discussed here.

>>> - views, through their associated labels, can be plugged at any point of the pipelines. Defining pair generators restricts views to be only from-label="start".
>>>
>>>> PS: Modifying sitemap syntax to allow reader/generator pairs with some unless attributes looks awful to me.
>>>
>>> Doesn't seem so awful to me, since the reader should be executed unless certain conditions are met, which are that the specified label(s) correspond to the one at which the requested view should start.
>>
>> This unless attribute is nothing else than shortcut for map:match. Given point on verbosity and given the obfuscated result, I'm for verbosity.
>
> Not exactly: you can currently match on the view name (provided that the environment actually does rely on the cocoon-view parameter),

(Special view matcher is still possible)

> but you cannot match on the labels. And only labels are currently used in