Re: [RT] Views for readers

2003-08-19 Thread Stefano Mazzocchi
On Thursday, Aug 14, 2003, at 19:07 Europe/Rome, Miles Elam wrote:

Vadim Gritsenko wrote:

Here is another wild (or not?) thought.


Not so wild to me.

All this discussion comes down to the requirement of generating some 
XML out of the content usually served by the reader, if that's 
possible (and it is possible for some of the types of the content), 
in order to feed this XMLized content into the view. This generated 
XML is somewhat equivalent to the binary representation for the 
purpose of view building. So, I'm going to the conclusion that some 
types of readers can be paired with the generator producing 
equivalent, but XMLized, content. The best place to indicate such 
pairing is the time when you declare a reader:


<snip idea="interesting"/>

The syntax looks a bit ugly to me, but the idea seems much more sane 
to me.

PS: Modifying sitemap syntax to allow reader/generator pairs with 
some unless attributes looks awful to me.


Complete agreement.  One of the reasons for the sitemap (*the* 
reason?) is for the simple and easy management of a site.  Some recent 
proposals seem to be pushing in the direction of Apache HTTPd's 
mod_rewrite: a lot of flexibility by adding just one more construct.

From the mod_rewrite page:

   The great thing about mod_rewrite is it gives you all the
   configurability and flexibility of Sendmail. The downside to
   mod_rewrite is that it gives you all the configurability and
   flexibility of Sendmail.
   -- Brian Behlendorf
   Apache Group

   Despite the tons of examples and docs, mod_rewrite is voodoo.
   Damned cool voodoo, but still voodoo.
   -- Brian Moore
   [EMAIL PROTECTED]

It'd be a shame if the sitemap became a cousin to mod_rewrite despite 
the cool voodoo.

I can hardly agree more!

- Miles Elam

P.S.  I shudder to think of what will happen to search index creation 
times when multi-megabyte Word documents and the like are sent down 
the pipe.  The parsers, however efficient they may turn out to be, 
will still have to contend with seemingly endless streams of seemingly 
pointless formatting cruft.  I'm sure we've all seen 10MB files that 
would be 100K in proper HTML.  Ah well...'tis the cost of 
progress, I guess.

Cocoon is not about binaries and should *NOT* touch them. Readers were 
implemented as helpers. Multi-views for binary files belong at the 
repository level, not at the publishing level!!!

I haven't read all the email that's left (300 more to go after 5 days offline) 
but I strongly hope you haven't implemented this or I'll scream!!!

--
Stefano.


Re: [RT] Views for readers

2003-08-19 Thread Stefano Mazzocchi
On Thursday, Aug 14, 2003, at 21:44 Europe/Rome, Andreas Hochsteger 
wrote:

Hi!

Sorry, but this discussion seems to tell us one thing:
The current sitemap syntax and cocoon processing model is not really
suitable for such kind of processing.
I completely agree.

All this reminds me of a proposal (which was actually a RT) I've sent
back in January this year, where I proposed a more intuitive and
flexible pipeline concept.
Maybe more flexible (and this is half of what FS is!) but more 
intuitive?

I don't want to say, that this would be the solution to all problems and
I definitely made some mistakes because I didn't know that much of the
cocoon internals at the time of writing, but I think it's time to take a
second look at it.

As long as I'm around, any proposal to make cocoon more suitable for binary 
pipelines will receive a -1 from me (a vote, not a veto, remember!)

So here's the link:
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=104482372430759&w=2
Again, please don't be so harsh concerning mistakes, but I think there are
many ideas included, which give some food for thought.

The main idea of this thread, of Nicola's thoughts and of Jeff's proposal 
is that cocoon should be instrumented to allow more binary processing.

I strongly disagree.

Cocoon should process XML and focus on that. Other systems (a content 
repository, for example) should process the binaries (at creation time! 
not at publishing time!)

Say no to voodoo!

--
Stefano.


Re: [RT] Views for readers

2003-08-14 Thread Upayavira
On 14 Aug 2003 at 15:34, Bertrand Delacretaz wrote:

 I find this more understandable (but dunno about implementation):
 
 <!-- if reader is executed, the rest is not -->
 <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
 <map:generate src="docs/{1}.doc" type="wordToXml"/>
 <map:transform .../>

Simplifying further:
  <map:read src="docs/{1}.doc" view-generator="wordToXml"/>

Surely that'd do it?

Regards, Upayavira



Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Bertrand Delacretaz wrote:

On Thursday, 14 Aug 2003, at 15:24 Europe/Zurich, Sylvain Wallez wrote:

...But what if we write it the other way around :

<map:read src="docs/{1}.doc">
  <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
</map:read>


I find this more understandable (but dunno about implementation):

<!-- if reader is executed, the rest is not -->
<map:read src="docs/{1}.doc" unless-view="wordToXml"/>
<map:generate src="docs/{1}.doc" type="wordToXml"/>
<map:transform .../>


Interesting. This looks like a more compact notation for the 
view-selector I was thinking of at first. We're leaving the RT world...

But shouldn't we keep labels that are already used into pipelines ? E.g :

<map:read src="docs/{1}.doc" label="raw, xdoc"/>
<map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
<map:transform src="xword2xdoc.xsl" label="xdoc"/>

The label on the reader would skip the reader if the requested view 
corresponds to one of these labels. Now should this be named label or 
unless-label ?
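
As a rough illustration only (plain Python, not Cocoon code), the skip rule described above could be sketched like this. The function names and the comma-separated parsing of the label attribute are assumptions made for the example; the thread does not specify them.

```python
def parse_labels(label_attr):
    """Split a comma-separated label attribute such as "raw, xdoc" into a set."""
    return {part.strip() for part in label_attr.split(",") if part.strip()}


def reader_is_skipped(reader_label_attr, view_from_label):
    """Return True when the requested view starts at one of the reader's labels.

    view_from_label is None for a normal request (no view was asked for),
    in which case the reader runs and serves the binary file as usual.
    """
    if view_from_label is None:
        return False
    return view_from_label in parse_labels(reader_label_attr)


# Normal request: the reader is executed.
print(reader_is_skipped("raw, xdoc", None))
# View starting at label "xdoc": the reader is skipped so that the
# generator and transformer after it can feed the view.
print(reader_is_skipped("raw, xdoc", "xdoc"))
```

Under this reading, a view whose starting label is not listed on the reader would still get the plain binary read.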

Ah, and this is very easily implementable ;-)

Sylvain

--
Sylvain Wallez  Anyware Technologies
http://www.apache.org/~sylvain   http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Jeff Turner wrote:

On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
 

Frederic's question about search engine integration led me to 
wonder how Cocoon's Lucene integration could 
transparently index Word & PDF documents along with XML-produced documents.

There exist some text-extraction libraries for Word & PDF (e.g. 
http://www.textmining.org/). Now how can we integrate this as 
transparently as possible in Cocoon's search functionality ?

The Lucene indexer crawls a website and asks for a particular view 
(content) which is used to fill the index. But Word and PDF documents 
being binary files, they're handled by a map:read statement, which 
does not handle views. On the other hand, this use case shows that 
having views on binary content may make sense : the normal requests 
just sends back the binary content, while a view can use a text/XML 
extraction on these binary files.
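
To make the indexer's request concrete, here is a minimal sketch (Python, standard library only) of how a crawler might ask for a view of a URL by appending a request parameter. It assumes the view is selected with a cocoon-view parameter and that the index view is named content, both of which are mentioned elsewhere in this thread; with_view and example.org are made up for the illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_view(url, view="content", param="cocoon-view"):
    """Return url with the view request parameter appended to its query."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, view))
    return urlunsplit(parts._replace(query=urlencode(query)))


# A normal request returns the binary file; the crawler would request this:
print(with_view("http://example.org/docs/foo.doc"))
```

The open question in the mail is what the server should do with such a request when the URL is served by a reader rather than by an XML pipeline.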

So the question is : how could views be plugged to readers ? I must say 
that I don't have an answer, as views contain transformers and a 
serializer, but no generator. So how could we express in the sitemap 
that a particular view on a reader should replace that reader by a 
particular generator ? Or should this go through some special readers 
that could also act as generators ?

Or maybe these are silly thoughts and we should use a map:select 
directing to a map:read or map:generate depending on the view. But 
this introduces explicit view management in the pipelines, which doesn't 
seem nice to me.
   

Solution: strongly typed pipelines! :)

Imagine if, at each node in the sitemap, we knew what type of content we
were dealing with (usually some flavour of XML).  Then we could write a
single view that behaves differently depending on the _type_ of data:

<map:view name="indexablecontent" from-position="first">
  <map:select type="xml-type">
    <map:when test="docbook">
      <map:transform src="docbook2whatever.xsl"/>
    </map:when>
    <map:when test="tei">
      <map:transform src="tei2whatever.xsl"/>
    </map:when>
    <map:when test="msword">
      <map:transform src="word2whatever.xsl"/>
    </map:when>
  </map:select>
</map:view>

Ah, ok, strongly typed pipelines are a different wording for 
content-aware selectors !

So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
return XML representing the content of the .doc file.
I described the same thing in a mail with subject 'Type-aware Views (Re:
Link view goodness)'.  Same need, different context, same proposed
solution.
Not exactly : the use case here is that we have a binary file which is 
normally sent as is to the browser using a reader. It is _not_ parsed as 
an XML stream. So we can't attach a view to these kinds of URLs since 
views provide a different _ending_ to a pipeline, meaning there must 
exist at least a generator and optionally one or more transformers at 
the point where processing is directed to the view.

So even content-aware selectors don't solve this problem...

Sylvain




Re: [RT] Views for readers

2003-08-14 Thread Bertrand Delacretaz
On Thursday, 14 Aug 2003, at 15:24 Europe/Zurich, Sylvain Wallez wrote:

...But what if we write it the other way around :

<map:read src="docs/{1}.doc">
  <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
</map:read>

I find this more understandable (but dunno about implementation):

<!-- if reader is executed, the rest is not -->
<map:read src="docs/{1}.doc" unless-view="wordToXml"/>
<map:generate src="docs/{1}.doc" type="wordToXml"/>
<map:transform .../>

-Bertrand



Re: [RT] Views for readers

2003-08-14 Thread Vadim Gritsenko
Miles Elam wrote:

Ummm...  Quick question:  What are the use cases for this that are not 
handled by existing methods?  I mean, couldn't this be handled with an 
(as-yet unwritten) action?


Matcher *does* exist:


<map:match pattern="*.doc">
  <map:match type="wildcard-request-parameter" pattern="content">
    <map:parameter name="parameter-name" value="cocoon-view"/>
    <map:generate type="word2xml" src="{../1}.doc"/>
    <!-- complete the pipeline -->
  </map:match>
  <map:read src="{1}.doc"/>
</map:match>
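
The control flow of the nested matchers above can be paraphrased in a few lines of Python. This is only a sketch of the decision, not of Cocoon's matcher API: choose_branch and its return tuples are invented for the illustration, while the cocoon-view parameter and the content value come from the snippet.

```python
def choose_branch(path, params, view_param="cocoon-view"):
    """Mimic the two matchers: outer on *.doc, inner on the view parameter."""
    if not path.endswith(".doc"):
        return None  # outer matcher does not match, nothing happens here
    if params.get(view_param) == "content":
        # Inner matcher succeeded: build an XML pipeline from the document.
        return ("generate", "word2xml", path)
    # Inner matcher failed: fall through to the plain reader.
    return ("read", None, path)


print(choose_branch("report.doc", {}))
print(choose_branch("report.doc", {"cocoon-view": "content"}))
```

The price of this approach, as discussed in the thread, is that view handling becomes explicit in every pipeline that serves binary files.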


<snip/>

Vadim




Re: [RT] Views for readers

2003-08-14 Thread Vadim Gritsenko
Sylvain Wallez wrote:

Bertrand Delacretaz wrote:

On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:

...But shouldn't we keep labels that are already used in pipelines 
? E.g :

<map:read src="docs/{1}.doc" label="raw, xdoc"/>
<map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
<map:transform src="xword2xdoc.xsl" label="xdoc"/>


If it's this way I'd prefer unless-label in map:read to make it clear.

Or maybe

  <map:read src="docs/{1}.doc" unless-label="*"/>

would do, meaning use this unless any views are requested
(and * would be the only allowed value).

Ah, and this is very easily implementable ;-)


Quickquick, do it before the FS police hears us ;-)

Seriously, I find this useful for indexing and other purposes 
(getting meta-information about binary files, images, etc for example). 


Me too. But since this is a change in the sitemap syntax, we should have a 
vote on this.

Any other proposal or opinion on this subject before we start a vote ? 


Can't you just enable generators in map:view in the case where the view starts 
with a reader?

Vadim




Re: [RT] Views for readers

2003-08-14 Thread Jeff Turner
On Thu, Aug 14, 2003 at 01:41:55PM +0200, Sylvain Wallez wrote:
 Jeff Turner wrote:
...
 <map:view name="indexablecontent" from-position="first">
   <map:select type="xml-type">
     <map:when test="docbook">
       <map:transform src="docbook2whatever.xsl"/>
     </map:when>
     <map:when test="tei">
       <map:transform src="tei2whatever.xsl"/>
     </map:when>
     <map:when test="msword">
       <map:transform src="word2whatever.xsl"/>
     </map:when>
   </map:select>
 </map:view>
 
 
 Ah, ok, strongly typed pipelines are a different wording for 
 content-aware selectors !

Ah yes.  Strange how the same concept can live two separate lives in
one's head ;)  Like the same class in two classloaders.

 So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
 return XML representing the content of the .doc file.
 
 I described the same thing in a mail with subject 'Type-aware Views (Re:
 Link view goodness)'.  Same need, different context, same proposed
 solution.
 
 
 Not exactly : the use case here is that we have a binary file which is 
 normally sent as is to the browser using a reader. It is _not_ parsed as 
 an XML stream. So we can't attach a view to these kinds of URLs since 
 views provide a different _ending_ to a pipeline, meaning there must 
 exist at least a generator and optionally one or more transformers at 
 the point where processing is directed to the view.
 
 So even content-aware selectors don't solve this problem...

Isn't the problem there that a map:read is a whole little pipeline unto
itself?  If it were broken into two atomic operations:

<map:generate type="binary" src="foo.doc"/>
<map:serialize type="binary"/>

then we could have a <map:view from-position="first"/> using a
content-aware pipeline, and everything would work.

I have the feeling that handling non-XML content in Cocoon is Just Wrong,
and that map:read is just a hack.  The fact that it doesn't integrate
with Views is a symptom of this.  In a theoretically pure world, we'd
either make Cocoon an XML-only framework and kill map:read, or make
Cocoon a generic data pipelining framework capable of handling and
transforming binary content.

Well it's a RT after all.. ;)

--Jeff



Re: [RT] Views for readers

2003-08-14 Thread Vadim Gritsenko
Miles Elam wrote:

Sylvain Wallez wrote:

Go back to first post of this thread, where (last paragraph) I 
proposed something similar. The whole discussion is about how we 
could have a syntax which doesn't introduce such verbosity in the 
sitemap. 


Verbosity is not necessarily a bad thing.  If it were, would any of us 
be using XML?  ;-) 


Good point.

<snip/>


Let's consider the MIDI example. Suppose we have a large collection 
of karaoke files (MIDI supports embedded text that can be played on 
screen while playing the music), and we want to index the text of 
these songs for easy retrieval (along with some other meta-data).

Here's a sitemap example, using the current syntax 

<snip/>

And the proposed shorter one :

<map:match pattern="*.mid">
  <map:read src="{1}.mid" unless-label="content"/>
  <map:generate type="midi" src="{1}.mid"/>
  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
  <!-- should never come here -->
  <map:serialize type="xml"/>
</map:match>

Two lines. What does it give except obfuscation? Given the point above 
("Verbosity is not necessarily a bad thing" (c) Miles Elam), a more 
readable and already supported syntax is:

<map:resource name="midi">
  <map:match type="view" pattern="content">
    <map:generate type="midi" src="{1}.mid"/>
    <map:transform src="xmidi2xdoc.xsl" label="content"/>
    <map:serialize type="xml"/>
  </map:match>
  <map:read mime-type="whatever/midi" src="{1}.mid"/>
</map:resource>
<map:match pattern="*.mid">
  <map:call resource="midi"/>
</map:match>

Moreover! Resource midi is reusable:

<map:match pattern="another/*.mid">
  <map:call resource="midi"/>
</map:match>

, while the example above is not.



This breaks current convention that either a reader or a 
generator/transformer/serializer can act in a pipeline.


And, given this resource example, it does not break any sitemap 
semantics which we have today.



In the first example, if content isn't specified, the action returns 
null and the reader is invoked;  As far as the pipeline logic is 
concerned, there is only the reader.  Serializers are already known as 
universal exit points.  To use the second, the convention must be 
broken and readers must become universal exit points.

In other words,

<map:match pattern="*.mid">
  <map:read src="{1}.mid"/> <!-- without the unless-label -->
  <map:generate type="midi" src="{1}.mid"/>
  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
  <!-- should never come here -->
  <map:serialize type="xml"/>
</map:match>

must become valid for consistency.  A reader becomes an exit point and 
the rest of a pipeline is, by default, ignored.  Is this an intended 
consequence?


I feel strongly -1 on this one.

Vadim




Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Bertrand Delacretaz wrote:

On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:

...But shouldn't we keep labels that are already used in pipelines 
? E.g :

<map:read src="docs/{1}.doc" label="raw, xdoc"/>
<map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
<map:transform src="xword2xdoc.xsl" label="xdoc"/>


If it's this way I'd prefer unless-label in map:read to make it clear.

Or maybe

  <map:read src="docs/{1}.doc" unless-label="*"/>

would do, meaning use this unless any views are requested
(and * would be the only allowed value).

Ah, and this is very easily implementable ;-)


Quickquick, do it before the FS police hears us ;-)

Seriously, I find this useful for indexing and other purposes 
(getting meta-information about binary files, images, etc for example). 


Me too. But since this is a change in the sitemap syntax, we should have a 
vote on this.

Any other proposal or opinion on this subject before we start a vote ?

Sylvain




Re: [RT] Views for readers

2003-08-14 Thread Nicola Ken Barozzi
Sylvain Wallez wrote, On 14/08/2003 14.30:
Nicola Ken Barozzi wrote:

Jeff Turner wrote, On 14/08/2003 14.17:

...

Isn't the problem there that a map:read is a whole little pipeline unto
itself?  If it were broken into two atomic operations:

<map:generate type="binary" src="foo.doc"/>
<map:serialize type="binary"/>

then we could have a <map:view from-position="first"/> using a
content-aware pipeline, and everything would work.

Well, why can't the view simply start from a reader?

 <map:read src="foo.doc"/>

Because a view finishes a partial XML pipeline, meaning it requires a 
generator to be already present...

That's because of how we define a view now ;-)
If we had just pipelines that handle both binary and xml data, the view 
would finish a partial pipeline, in this case starting from binary.

I have the feeling that handling non-XML content in Cocoon is Just Wrong,
and that map:read is just a hack.  The fact that it doesn't integrate
with Views is a symptom of this.  In a theoretically pure world, we'd
either make Cocoon an XML-only framework and kill map:read, or make
Cocoon a generic data pipelining framework capable of handling and
transforming binary content.

Well, it can be done easily by allowing more than one reader and by 
allowing readers in the xml pipeline.

Some time back I had proposed the following to be possible (and got 
touted as the usual FS man)

 <map:read src="foo1.doc"/>
 <map:read type="stripstuff"/>
 <map:read type="otherfilter"/>

Mhhh... I guess stripstuff and otherfilter are actually 
map:transform-binary and not map:read as they do have an input. Now 
how do we close the pipeline ? Is there a map:serialize-binary ?
Since streams are just streams, they don't need to be adapted like XML, 
so there is no notion of Generator or Serializer really, but only 
filter. So the reader is just a filter, and if in the middle it's just 
given a stream and has to output to a stream. So there is no need to 
open, and no need to close.
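
The "a reader is just a filter" idea can be illustrated outside Cocoon with a tiny chain of byte-stream filters. This is a sketch in Python under stated assumptions: strip_padding and zip_filter are hypothetical stand-ins for the stripstuff and zip readers named in this mail, and nothing here reflects an actual Cocoon interface.

```python
import zlib
from functools import reduce


def strip_padding(data: bytes) -> bytes:
    """Hypothetical 'stripstuff' filter: drop trailing whitespace bytes."""
    return data.rstrip()


def zip_filter(data: bytes) -> bytes:
    """Hypothetical 'zip' filter: compress the stream with zlib."""
    return zlib.compress(data)


def run_filters(data: bytes, filters) -> bytes:
    """Feed the output of each filter into the next one, in order."""
    return reduce(lambda stream, f: f(stream), filters, data)


result = run_filters(b"binary payload   \n", [strip_padding, zip_filter])
print(zlib.decompress(result))
```

Because every stage takes bytes and returns bytes, there is nothing to open or close at either end of the chain, which is the point being made above.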

And also:

 <map:read src="foo1.doc"/>
 <map:generate src="foo1.doc"/>
 <map:serialize src="foo1.doc"/>
 <map:read type="zip"/>


Wow! What's the result of this ??

Oops, a bit too quick.

<!-- remove encryption or do other stream preprocessing -->
  <map:read type="decrypt" src="foo1.doc"/>
<!-- normal generation but from the previous reader output -->
  <map:generate type="doc2xml"/>
<!-- any transforms -->
<!-- give back html -->
  <map:serialize type="html"/>
<!-- zip that result so that it takes less bandwidth -->
  <map:read type="zip"/>

We can already do this BTW by using the Cocoon protocol, but it's 
such a hack!

Sounds interesting. Can you elaborate on the hack ?

<map:match pattern="mypage.html">
  <map:read src="internal/mypage.html" type="zip"/>
</map:match>
<map:match pattern="internal/mypage.html">
  <!-- generate, transform, serialize... -->
</map:match>

BTW, you may be interested in my RT about aspected pipeline 
snippets. Basically it would make it possible 
to insert pipeline components inside all pipelines using certain rules.

--
Nicola Ken Barozzi   [EMAIL PROTECTED]
- verba volant, scripta manent -
   (discussions get forgotten, just code remains)
-



Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Bertrand Delacretaz wrote:

How about making it the other way round, by allowing Generators to 
read from Readers?

<map:match pattern="*.doc" default-view="binary">
  <map:generator label="xml-content-for-indexing" type="wordToXml">
    <map:read src="word-documents/{1}.doc" label="binary" mime-type="..."/>
  </map:generator>
  <map:serialize type="xml"/>
</map:match>


Do you mean that the generator would be used if the 
xml-content-for-indexing view is selected ? This doesn't fit with the 
existing sitemap behaviour, since generators are _always_ added to the 
pipeline.

But what if we write it the other way around :

<map:read src="docs/{1}.doc">
  <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
</map:read>

The meaning of the above is : if a view is requested, execute what's 
_inside_ the map:read. If it builds a complete pipeline then return 
its result, otherwise just perform the usual read operation.
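
One possible reading of this rule, sketched in Python for illustration only. The assumption here is that the nested components "build a complete pipeline" when one of them carries the label at which the requested view starts; resolve_read and its data shapes are invented, not part of any Cocoon API.

```python
def resolve_read(src, nested, requested_label=None):
    """Decide between the plain read and the components nested in the reader.

    nested is a list of (component_name, label) pairs declared inside the
    reader element; requested_label is the label at which the requested
    view starts, or None when no view was requested.
    """
    if requested_label is not None:
        for index, (_, label) in enumerate(nested):
            if label == requested_label:
                # Run the nested components up to the labelled one;
                # the view would supply the rest of the pipeline.
                return ("pipeline", [name for name, _ in nested[: index + 1]])
    return ("read", src)


print(resolve_read("docs/a.doc", [("wordToXml", "content")]))
print(resolve_read("docs/a.doc", [("wordToXml", "content")], "content"))
```

A view whose label does not appear among the nested components falls back to the ordinary read in this sketch.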

Is that RT-ish enough? 


Mmmmh... not as wild as Nicola Ken's. Try again ;-P

Sylvain




Re: [RT] Views for readers

2003-08-14 Thread Andreas Hochsteger
Hi!

Sorry, but this discussion seems to tell us one thing:
The current sitemap syntax and cocoon processing model is not really 
suitable for such kind of processing.

All this reminds me of a proposal (which was actually a RT) I've sent 
back in January this year, where I proposed a more intuitive and 
flexible pipeline concept.
I don't want to say, that this would be the solution to all problems and 
I definitely made some mistakes because I didn't know that much of the
cocoon internals at the time of writing, but I think it's time to take a
second look at it.

So here's the link:
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=104482372430759&w=2

Again, please don't be so harsh concerning mistakes, but I think there are
many ideas included, which give some food for thought.

Bye,

Andreas Hochsteger
http://highstick.blogspot.com/








Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Vadim Gritsenko wrote:

Sylvain Wallez wrote:

Vadim Gritsenko wrote:

Sylvain Wallez wrote:

<snip/>

Any other proposal or opinion on this subject before we start a vote ? 


Can't you just enable generators in map:view in the case where the view 
starts with a reader? 


No, since views capture the (XML) output at certain points of the 
pipeline to provide a different formatting.


In case of the reader, there is no (XML) output in the pipeline. It's 
special case, unless you want to introduce binary pipelines (and I 
hope you don't want to), so it would require special handling.

E.g. the processing for the indexable-content view


Sidenote: It's called content -- the view which you use to build a 
site search index. 


Picky sidenote : this is configurable using the content-view-query 
config of the lucene-xml-indexer component ;-)

is the same for all URIs, be they XML pipelines or a single reader.

So there's no way other than having a generator _before_ jumping to 
the view, feeding that view with the kind of XML content it expects.


Here is another wild (or not?) thought.

All this discussion comes down to the requirement of generating some 
XML out of the content usually served by the reader, if that's 
possible (and it is possible for some of the types of the content), in 
order to feed this XMLized content into the view. This generated XML 
is somewhat equivalent to the binary representation for the purpose 
of view building. So, I'm going to the conclusion that some types of 
readers can be paired with the generator producing equivalent, but 
XMLized, content. The best place to indicate such pairing is the time 
when you declare a reader:

 <map:readers default="resource">
   <map:reader name="resource" src="org.apache.cocoon.reading.ResourceReader"/>
   <map:reader name="html" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
   </map:reader>
   <map:reader name="msexcel" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
   </map:reader>
   <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
   </map:reader>
 </map:readers>


I'm afraid this won't work :

- a generator specific to a given content-type is very unlikely to 
produce the document type expected by the view. We will most often need 
an additional transformation (e.g. the xword2xdoc.xsl that was in my 
example)

- views, through their associated labels, can be plugged at any point of 
the pipelines. Defining pair generators restricts views to be only 
from-label="start".

PS: Modifying sitemap syntax to allow reader/generator pairs with some 
unless attributes looks awful to me. 


Doesn't seem so awful to me, since the reader should be executed 
unless certain conditions are met, which are that the specified 
label(s) correspond to the one at which the requested view should start.

Sylvain




Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Jeff Turner wrote:

<snip/>

Isn't the problem there that a map:read is a whole little pipeline unto itself?  If it were broken into two atomic operations:

<map:generate type="binary" src="foo.doc"/>
<map:serialize type="binary"/>

then we could have a <map:view from-position="first"/> using a content-aware pipeline, and everything would work.

I have the feeling that handling non-XML content in Cocoon is Just Wrong, and that map:read is just a hack.  The fact that it doesn't integrate with Views is a symptom of this.  In a theoretically pure world, we'd either make Cocoon an XML-only framework and kill map:read, or make Cocoon a generic data pipelining framework capable of handling and transforming binary content.

Well it's a RT after all.. ;)

Content-aware and binary pipelines in the same post? Wow! Yes, it's 
definitely a RT ;-P

Sylvain




Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Vadim Gritsenko wrote:

Sylvain Wallez wrote:

Vadim Gritsenko wrote:

Sylvain Wallez wrote:

Vadim Gritsenko wrote: 


<snip/>

Here is another wild (or not?) thought.

All this discussion comes down to the requirement of generating some 
XML out of the content usually served by the reader, if that's 
possible (and it is possible for some of the types of the content), 
in order to feed this XMLized content into the view. This generated 
XML is somewhat equivalent to the binary representation for the 
purpose of view building. So, I'm going to the conclusion that some 
types of readers can be paired with the generator producing 
equivalent, but XMLized, content. The best place to indicate such 
pairing is the time when you declare a reader:

 <map:readers default="resource">
   <map:reader name="resource" src="org.apache.cocoon.reading.ResourceReader"/>
   <map:reader name="html" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
   </map:reader>
   <map:reader name="msexcel" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
   </map:reader>
   <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
   </map:reader>
 </map:readers>




I'm afraid this won't work :


Can you suggest some improvements so it does work? My goal is to have 
as little impact on sitemap syntax as possible.


- a generator specific to a given content-type is very unlikely to 
produce the document type expected by the view. We will most often 
need an additional transformation (e.g. the xword2xdoc.xsl that was 
in my example)


More wild suggestions.

1/ Do something with the views. Say, allow duplicate view names and 
make them work as selector:

 <map:views>
   <!-- works if (when) reader -->
   <map:view from-position="reader" name="content">
     <map:transform src="wordml2content.xsl" label="content"/>
     <map:serialize type="xml"/>
   </map:view>
   <!-- works if (when) label -->
   <map:view from-label="content" name="content">
     <map:serialize type="xml"/>
   </map:view>
   <!-- works if no label (otherwise) -->
   <map:view from-position="first" name="content">
     <map:serialize type="xml"/>
   </map:view>
 </map:views>


Still the same problem I keep desperately pointing out again and again : how 
can the from-position="reader" view use different generators (i.e. parsers) 
depending on the binary content ?

2/ Do something with the readers.

 <map:readers default="resource">
   <map:reader name="msword" src="org.apache.cocoon.reading.ResourceReader">
     <map:generate type="msword"/>
     <map:transform src="wordml2content.xsl"/>
   </map:reader>
 </map:readers>


This introduces sitemap snippets into a component manager configuration, 
which is not good at all.

3/ Alternative to 2:

 <map:readers default="resource">
   <map:reader name="msword" src="org.apache.cocoon.reading.ResourceReader">
     <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
   </map:reader>
 </map:readers>

 <map:views>
   <map:view from-label="content" name="content">
     <map:serialize type="xml"/>
   </map:view>
 </map:views>
 <map:pipelines>
   ...
   <map:read src="my.doc"/>
   ...
   <map:match pattern="word-2-content/*">
     <map:generate type="msword" src="{1}"/>
     <map:transform src="wordml2content.xsl" label="content"/>
     <map:serialize type="xml"/>
   </map:match>
 </map:pipelines>


Sounds better, but has the problem that it implies that every view 
should return xml content on my.doc. Or do we introduce a label 
attribute on map:read to define on which particular view the 
xmlizer-uri should be triggered ?

I would not say that I like any of the suggestions above. The cleanest 
way ATM is the usage of map:resource I suggested in another email (I have 
yet to see your comment on it).


Sorry, I have no particular comment on the use of resources, as it's 
mainly a refactoring of the action/matcher proposals.

- views, through their associated labels, can be plugged at any point 
of the pipelines. Defining pair generators restricts views to be only 
from-label=start.

PS: Modifying sitemap syntax to allow reader/generator pairs with 
some "unless" attributes looks awful to me.


Doesn't seem so awful to me, since the reader should be executed 
unless certain conditions are met, which are that the specified 
label(s) correspond to the one at which the requested view should start. 


This "unless" attribute is nothing more than a shortcut for map:match. 
Given your point on verbosity and given the obfuscated result, I'm for 
verbosity.


Not exactly : you can currently match on the view name (provided that the 
environment actually does rely on the cocoon-view parameter), but you 
cannot match on the labels. And only labels are currently used in the 
map:pipelines section.

PS Keep sitemap syntax clean! Say No! to voodoo!


Funny. I'm often the one who says that too much magic kills confidence.

Let's stop this discussion for now. I have the feeling 

Re: [RT] Views for readers

2003-08-14 Thread Tony Collen
Upayavira wrote:
On 14 Aug 2003 at 15:34, Bertrand Delacretaz wrote:


I find this more understandable (but dunno about implementation):

<!-- if reader is executed, the rest is not -->
<map:read src="docs/{1}.doc" unless-view="wordToXml"/>
<map:generate src="docs/{1}.doc" type="wordToXml"/>
<map:transform...


Simplifying further:
  <map:read src="docs/{1}.doc" view-generator="wordToXml"/>
Surely that'd do it?
This might be better, because what happens when someone comes along doing this:

<map:read src="docs/{1}.doc" unless-view="wordToXml"/>
<map:generate src="docs/{2}.doc" type="wordToXml"/>

Then the same request represents two different sources, which could be either confusing or very 
useful; I don't fully understand all the implications.

Just tossing my $0.02 in... it's early and I'm tired :)



Tony



Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Miles Elam wrote:

In other words, the pipeline is full of side effects and dependent 
upon things happening behind the curtain (to use a Wizard of Oz 
reference).  You'd be right in that it adds to the confusion.  I agree 
with Vadim.  This is obfuscation in exchange for two lines of 
verboseness.


Just some additional precisions, mon frère !

Yes, the pipeline is full of side effects, which can break pipelines at 
any point and continue somewhere else without this being explicitly 
visible in the pipeline construction statements.

These side effects are called views, and the way to define views is 
through labels.

And even worse : labels can be placed on component definitions, meaning 
a clean pipeline with no label attribute at all is full of these side 
effects.

So what you call obfuscation has been there *for years*. And everybody's 
happy with it.

Sylvain

--
Sylvain Wallez  Anyware Technologies
http://www.apache.org/~sylvain   http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



Re: [RT] Views for readers

2003-08-14 Thread Sylvain Wallez
Miles Elam wrote:

Sylvain Wallez wrote:

Go back to first post of this thread, where (last paragraph) I 
proposed something similar. The whole discussion is about how we 
could have a syntax which doesn't introduce such verbosity in the 
sitemap. 


Verbosity is not necessarily a bad thing.  If it were, would any of us 
be using XML?  ;-) 


Good point. However, the only verbosity currently added by views is the 
label attribute. This proposal is about achieving the same low 
verbosity for views with binary content.

As I explained in several replies, there's no equivalence between a 
reader and generator able to parse a given binary format. There needs 
to be some kind of adaptation/extraction before feeding the view. 


Yup.

And what you describe above as a PDF reader, a Word reader, a 
Postscript reader, etc. are IMO nothing more than _generators_, just 
like the SWF and MIDI generators we already have. 


The functionality for all readers would obviously be the same: move 
these bytes from here to there.  But yes, the codified mapping I think 
is important.


Please read carefully : I wrote *generators* !! This isn't about moving 
bytes, but about producing an XML document.

Let's consider the MIDI example. Suppose we have a large collection 
of karaoke files (MIDI supports embedded text that can be played on 
screen while playing the music), and we want to index the text of 
these songs for easy retrieval (along with some other meta-data).

Here's a sitemap example, using the current syntax
<map:match pattern="*.mid">
 <map:act type="catch-view" src="content">
   <map:generate type="midi" src="{1}.mid"/>
   <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
   <!-- should never come here -->
   <map:serialize type="xml"/>
 </map:match>
 <map:read src="{1}.mid"/>
</map:match>


You're mixing the <map:act> with a </map:match>, but I get the idea.


Picky guy, eh ?

(the content view starts at the content-label label to clearly 
distinguish the two notions).

And the proposed shorter one :

<map:match pattern="*.mid">
 <map:read src="{1}.mid" unless-label="content"/>
 <map:generate type="midi" src="{1}.mid"/>
 <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
 <!-- should never come here -->
 <map:serialize type="xml"/>
</map:match>


This breaks current convention that either a reader or a 
generator/transformer/serializer can act in a pipeline.  In the first 
example, if content isn't specified, the action returns null and the 
reader is invoked;  As far as the pipeline logic is concerned, there 
is only the reader.  Serializers are already known as universal exit 
points.  To use the second, the convention must be broken and readers 
must become universal exit points. 


Readers already are universal exit points : once you encounter a reader, 
sitemap processing is terminated. <map:read> and <map:serialize> are 
like a return statement in Java.

In other words,

<map:match pattern="*.mid">
  <map:read src="{1}.mid"/> <!-- without the unless-label -->
  <map:generate type="midi" src="{1}.mid"/>
  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
  <!-- should never come here -->
  <map:serialize type="xml"/>
</map:match>

must become valid for consistency.  A reader becomes an exit point and 
the rest of a pipeline is, by default, ignored.  Is this an intended 
consequence?


No consequence : this is how the sitemap works today, and the above is 
valid, even if we can consider that the sitemap engine should be more 
strict and signal that there's some unreachable code.

To add more to the confusion, in both your and my example, we can even 
avoid writing the map:serialize statement. Since some additional 
filtering occurs beforehand (either through the action or through reader 
labels), this statement is never reached and is useless !

Sylvain

--
Sylvain Wallez  Anyware Technologies
http://www.apache.org/~sylvain   http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



Re: [RT] Views for readers

2003-08-14 Thread Miles Elam
Vadim Gritsenko wrote:

Ummm...  Quick question:  What are the use cases for this that are 
not handled by existing methods?  I mean, couldn't this be handled 
with an (as-yet unwritten) action?


Matcher *does* exist:


Heh heh...  learning something new everyday.

- Miles Elam



Re: [RT] Views for readers

2003-08-14 Thread Vadim Gritsenko
Sylvain Wallez wrote:

Vadim Gritsenko wrote:

Sylvain Wallez wrote:

Vadim Gritsenko wrote: 

<snip/>

Here is another wild (or not?) thought.

All this discussion comes down to the requirement of generating some 
XML out of the content usually served by the reader, if that's 
possible (and it is possible for some of the types of the content), 
in order to feed this XMLized content into the view. This generated 
XML is somewhat equivalent to the binary representation for the 
purpose of view building. So, I'm coming to the conclusion that some 
types of readers can be paired with the generator producing 
equivalent, but XMLized, content. The best place to indicate such 
pairing is the time when you declare a reader:

 <map:readers default="resource">
   <map:reader name="resource" src="org.apache.cocoon.reading.ResourceReader"/>
   <map:reader name="html" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
   </map:reader>
   <map:reader name="msexcel" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
   </map:reader>
   <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
   </map:reader>
 </map:readers>


I'm afraid this won't work :


Can you suggest some improvements so it does work? My goal is to have as 
little impact on sitemap syntax as possible.


- a generator specific to a given content-type is very unlikely to 
produce the document type expected by the view. We will most often 
need an additional transformation (e.g. the xword2xdoc.xsl that was 
in my example)


More wild suggestions.

1/ Do something with the views. Say, allow duplicate view names and make 
them work as selector:

 <map:views>
   <!-- works if (when) reader -->
   <map:view from-position="reader" name="content">
     <map:transform src="wordml2content.xsl" label="content"/>
     <map:serialize type="xml"/>
   </map:view>
   <!-- works if (when) label -->
   <map:view from-label="content" name="content">
     <map:serialize type="xml"/>
   </map:view>
   <!-- works if no label (otherwise) -->
   <map:view from-position="first" name="content">
     <map:serialize type="xml"/>
   </map:view>
 </map:views>

2/ Do something with the readers.

 <map:readers default="resource">
   <map:reader name="msword" src="org.apache.cocoon.reading.ResourceReader">
     <map:generate type="msword"/>
     <map:transform src="wordml2content.xsl"/>
   </map:reader>
 </map:readers>

3/ Alternative to 2:

 <map:readers default="resource">
   <map:reader name="msword" src="org.apache.cocoon.reading.ResourceReader">
     <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
   </map:reader>
 </map:readers>

 <map:views>
   <map:view from-label="content" name="content">
     <map:serialize type="xml"/>
   </map:view>
 </map:views>
 <map:pipelines>
   ...
   <map:read src="my.doc"/>
   ...
   <map:match pattern="word-2-content/*">
     <map:generate type="msword" src="{1}"/>
     <map:transform src="wordml2content.xsl" label="content"/>
     <map:serialize type="xml"/>
   </map:match>
 </map:pipelines>

I would not say that I like any of the suggestions above. The cleanest 
way ATM is the usage of map:resource I suggested in another email (I have 
yet to see your comment on it).


- views, through their associated labels, can be plugged at any point 
of the pipelines. Defining pair generators restricts views to be only 
from-label=start.

PS: Modifying sitemap syntax to allow reader/generator pairs with 
some "unless" attributes looks awful to me.


Doesn't seem so awful to me, since the reader should be executed 
unless certain conditions are met, which are that the specified 
label(s) correspond to the one at which the requested view should start. 


This "unless" attribute is nothing more than a shortcut for map:match. 
Given your point on verbosity and given the obfuscated result, I'm for verbosity.

PS Keep sitemap syntax clean! Say No! to voodoo!

Vadim




Re: [RT] Views for readers

2003-08-14 Thread Sam Coward
Hmm,

Frederic's question about search engine integration led me to 
wonder how Cocoon's Lucene integration could 
transparently index Word & PDF documents along with XML-produced 
documents.

I have been wondering about that too. At my company, we put together a simple 
web management tool to put small collections of documents into a web 
frame for a client. Pretty useless, but it's what he wanted.

At the time I had thought it may be possible to just improve Lucene so 
it could understand binary files by introducing mime-type triggerable 
filter modules that converted to text on the input stream. After all, if 
the text were only going to be used for indexing, it wouldn't matter if 
the text wasn't available within Cocoon itself. In any case he's happy 
with what he has and we're happily doing other stuff.

Perhaps if the individual extractors are part of specialised readers for 
specific types of documents, then you could configure the label for the 
XML they return? That would allow for the duality of that behaviour to 
be mostly concealed and managed from within Cocoon with little effect to 
the sitemap.

I personally find it tempting to think that it may be possible to rip 
out XML from any of these formats, and do with it as we wish, 
particularly when I saw that programs like catdoc could recognize the 
tables even from Word 2k documents. But I often find myself thinking 
back against that, and that maybe I should represent all content (even 
document content) semantically in XML and let rendering technologies 
(PDFSerializer, POI) handle binary output, and perhaps leverage document 
importers that map those documents back to XML (they all seem to be 
proprietary, big buck solutions from what I see currently, though). In 
any case, it does seem that is certainly a ways off in the future *sigh*

Hmm, an OCR extractor would be way cool for faxes too!

just my 2c, i never say anything most of the time, anyway
Sam


Re: [RT] Views for readers

2003-08-14 Thread Miles Elam
Sylvain Wallez wrote:

Miles Elam wrote:

In other words, the pipeline is full of side effects and dependent 
upon things happening behind the curtain (to use a Wizard of Oz 
reference).  You'd be right in that it adds to the confusion.  I 
agree with Vadim.  This is obfuscation in exchange for two lines of 
verboseness.


Just some additional precisions, mon frère ! 


I hope it wasn't taken the wrong way.  I did not intend any offense.

Yes, the pipeline is full of side effects, which can break pipelines 
at any point and continue somewhere else without this being explicitly 
visible in the pipeline construction statements.

These side effects are called views, and the way to define views is 
through labels. 


Don't get me wrong.  I see clearly the reason why views exist.  I see 
clearly why reader views are wanted.  When working with XML data -- not 
just text, but structured text -- getting at that data before it is 
processed into a presentation format (such as viewing source, getting a 
true content view, etc.) can prove invaluable.

And even worse : labels can be placed on component definitions, 
meaning a clean pipeline with no label attribute at all is full of 
these side effects.

So what you call obfuscation has been there *for years*. And 
everybody's happy with it. 


When grabbing from the presentation format as a source, you are 
comparing apples and oranges.  Not only are there innumerable binary 
formats out there being squeezed into a few reader implementations, but 
they are not all desirable data.  While you may want the data from a PDF 
file, you may not bother with a PNG image because it may index "Created 
with The Gimp" over and over.

Since putting in all binary format-to-generator mapping info seems out 
of the question, all of the pipeline path must be specified in the 
matcher -- hence the discussion surrounding readers and generators in 
the same matcher.  If everything is specified in the same matcher and 
not truly orthogonal, as is the case for views currently, why add the 
extra syntax for what amounts to a non-orthogonal if-else clause?

if (!content-view)
   read
else
   generate
   transform
   serialize

as opposed to

generate
  +-- view-short-circuit! --+- transform-x
transform-1                 +- serialize
transform-2
serialize

There is a discontinuity there that makes me uncomfortable.  This is not 
an overt attachment to symmetry.  This is seeing the same tool applied 
to two (in my opinion) very different tasks.  I am not a committer and 
can't vote.  But these are my thoughts on the matter.  Take with as many 
grains of salt as are necessary.
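
The if-else reading above can be written out as a minimal Java sketch. The class name `ReaderViewChoiceSketch`, the view name "content" and the step names are assumptions made for this illustration only; none of them are Cocoon API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: models the if-else reading of the proposal.
// The step names are plain labels for this example, not Cocoon component calls.
class ReaderViewChoiceSketch {

    // requestedView is null when the client asked for no view at all.
    static List<String> steps(String requestedView) {
        if (!"content".equals(requestedView)) {
            // Normal request: the reader alone serves the binary file.
            return Collections.singletonList("read");
        }
        // Content view requested: the reader is skipped entirely.
        return Arrays.asList("generate", "transform", "serialize");
    }
}
```

The point of the sketch is that the two branches share nothing, which is why it reads as an if-else in the matcher rather than as a view branching off a common pipeline.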

- Miles Elam




Re: [RT] Views for readers

2003-08-14 Thread Miles Elam
Sylvain Wallez wrote:

The functionality for all readers would obviously be the same: move 
these bytes from here to there.  But yes, the codified mapping I 
think is important.


Please read carefully : I wrote *generators* !! This isn't about 
moving bytes, but about producing an XML document. 


Au contraire mon frère, this is implemented with generators but it is 
about pulling searchable info out of arbitrary binary data.  The first 
step to that goal is to standardize it -- therefore generators are 
added.  The issue is about *readers* and the custom formats they 
encompass not being indexable.

You're mixing the <map:act> with a </map:match>, but I get the idea.


Picky guy, eh ? 


You know it.  :)

Readers already are universal exit points : once you encounter a 
reader, sitemap processing is terminated. <map:read> and 
<map:serialize> are like a return statement in Java.


Not according to the code, they're not.  Check out 
AbstractProcessingPipeline.java.  There are method bodies like:

   public void setGenerator (String role, String source, Parameters param,
                             Parameters hintParam)
   throws ProcessingException {
       if (this.generator != null) {
           throw new ProcessingException ("Generator already set. You can only select one Generator (" + role + ")");
       }
       if (this.reader != null) {
           throw new ProcessingException ("Reader already set. You cannot use a reader and a generator for one pipeline.");
       }
       ...

and

   public void setReader (String role, String source, Parameters param,
                          String mimeType)
   throws ProcessingException {
       if (this.reader != null) {
           throw new ProcessingException ("Reader already set. You can only select one Reader (" + role + ")");
       }
       if (this.generator != null) {
           throw new ProcessingException ("Generator already set. You cannot use a reader and a generator for one pipeline.");
       }
       ...

Either the policy was in effect when this file (and its subclasses) were 
made or someone put constraining statements in that serve no purpose.  
The file was last modified on August 6th of this year.  If the policy 
has changed, no one told the code.
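
The effect of those guards can be shown with a minimal, self-contained sketch. `PipelineGuardSketch`, its string-valued fields and the use of `IllegalStateException` are assumptions for illustration; the real class is AbstractProcessingPipeline and it throws ProcessingException.

```java
// Illustrative sketch only: PipelineGuardSketch is a hypothetical stand-in,
// not Cocoon's AbstractProcessingPipeline.
class PipelineGuardSketch {
    private String generator; // role of the generator, if one was set
    private String reader;    // role of the reader, if one was set

    void setGenerator(String role) {
        if (generator != null) {
            throw new IllegalStateException("Generator already set (" + role + ")");
        }
        if (reader != null) {
            throw new IllegalStateException("Reader already set; cannot add generator (" + role + ")");
        }
        generator = role;
    }

    void setReader(String role) {
        if (reader != null) {
            throw new IllegalStateException("Reader already set (" + role + ")");
        }
        if (generator != null) {
            throw new IllegalStateException("Generator already set; cannot add reader (" + role + ")");
        }
        reader = role;
    }

    // Returns true if adding a generator after a reader is rejected.
    static boolean readerThenGeneratorIsRejected() {
        PipelineGuardSketch pipeline = new PipelineGuardSketch();
        pipeline.setReader("resource");
        try {
            pipeline.setGenerator("midi");
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }
}
```

With guards of this shape, a matcher that sets both a reader and a generator on the same pipeline object fails as soon as the second one is set, whichever comes first.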

No consequence : this is how the sitemap works today, and the above is 
valid, even if we can consider that the sitemap engine should be more 
strict and signal that there's some unreachable code.


I can't speak to validity, but this is NOT how it works today.

To add more to the confusion, in both your and my example, we can even 
avoid writing the map:serialize statement. Since some additional 
filtering occurs beforehand (either through the action or through 
reader labels), this statement is never reached and is useless ! 


In other words, the pipeline is full of side effects and dependent upon 
things happening behind the curtain (to use a Wizard of Oz reference).  
You'd be right in that it adds to the confusion.  I agree with Vadim.  
This is obfuscation in exchange for two lines of verboseness.

- Miles Elam




Re: [RT] Views for readers

2003-08-14 Thread Jeff Turner
On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
 Frederic's question about search engine integration led me to 
 wonder how Cocoon's Lucene integration could 
 transparently index Word & PDF documents along with XML-produced documents.
 
 There exist some text-extraction libraries for Word & PDF (e.g. 
 http://www.textmining.org/). Now how can we integrate this as 
 transparently as possible in Cocoon's search functionality ?
 
 The Lucene indexer crawls a website and asks for a particular view 
 (content) which is used to fill the index. But Word and PDF documents 
 being binary files, they're handled by a map:read statement, which 
 does not handle views. On the other hand, this use case shows that 
 having views on binary content may make sense : the normal requests 
 just sends back the binary content, while a view can use a text/XML 
 extraction on these binary files.
 
 So the question is : how could views be plugged to readers ? I must say 
 that I don't have an answer, as views contain transformers and a 
 serializer, but no generator. So how could we express in the sitemap 
 that a particular view on a reader should replace that reader by a 
 particular generator ? Or should this go through some special readers 
 that could also act as generators ?
 
 Or maybe these are silly thoughts and we should use a map:select 
 directing to a map:read or map:generate depending on the view. But 
 this introduces explicit view management in the pipelines, which doesn't 
 seem nice to me.

Solution: strongly typed pipelines! :)

Imagine if, at each node in the sitemap, we knew what type of content we
were dealing with (usually some flavour of XML).  Then we could write a
single view that behaves differently depending on the _type_ of data:

<map:view name="indexablecontent" from-position="first">
  <map:select type="xml-type">
    <map:when test="docbook">
      <map:transform src="docbook2whatever.xsl"/>
    </map:when>
    <map:when test="tei">
      <map:transform src="tei2whatever.xsl"/>
    </map:when>
    <map:when test="msword">
      <map:transform src="word2whatever.xsl"/>
    </map:when>
  </map:select>
</map:view>

So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
return XML representing the content of the .doc file.

I described the same thing in a mail with subject 'Type-aware Views (Re:
Link view goodness)'.  Same need, different context, same proposed
solution.
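
Stripped of sitemap syntax, the type-aware view above amounts to a lookup from content type to stylesheet. A minimal Java sketch, assuming a hypothetical `TypeAwareViewSketch` class (not part of Cocoon) and reusing the type and stylesheet names from the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a lookup equivalent to the map:select above.
// The class is hypothetical; the type and stylesheet names come from the example.
class TypeAwareViewSketch {
    private static final Map<String, String> STYLESHEET_BY_TYPE = new LinkedHashMap<>();

    static {
        STYLESHEET_BY_TYPE.put("docbook", "docbook2whatever.xsl");
        STYLESHEET_BY_TYPE.put("tei", "tei2whatever.xsl");
        STYLESHEET_BY_TYPE.put("msword", "word2whatever.xsl");
    }

    // Returns the stylesheet for a content type, or null when the type is unknown.
    static String stylesheetFor(String xmlType) {
        return STYLESHEET_BY_TYPE.get(xmlType);
    }
}
```

How the pipeline would learn the content type at each node is the open part of the proposal; the sketch simply takes it as an argument.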


--Jeff


 Any thoughts ?
 
 Sylvain
 
 -- 
 Sylvain Wallez  Anyware Technologies
 http://www.apache.org/~sylvain   http://www.anyware-tech.com
 { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
 Orixo, the opensource XML business alliance  -  http://www.orixo.com
 
 


Re: [RT] Views for readers

2003-08-14 Thread Bertrand Delacretaz
Le Jeudi, 14 aoû 2003, à 15:53 Europe/Zurich, Sylvain Wallez a écrit :

...But shouldn't we keep labels that are already used into pipelines ? 
E.g :

<map:read src="docs/{1}.doc" label="raw, xdoc"/>
<map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
<map:transform src="xword2xdoc.xsl" label="xdoc"/>

If it's this way I'd prefer unless-label in map:read to make it clear.

Or maybe

  <map:read src="docs/{1}.doc" unless-label="*"/>

would do, meaning "use this unless any views are requested"
(and * would be the only allowed value).
Ah, and this is very easily implementable ;-)
Quickquick, do it before the FS police hears us ;-)

Seriously, I find this useful for indexing and other purposes (getting 
meta-information about binary files, images, etc. for example).
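
One possible reading of the unless-label semantics, as a minimal Java sketch. `UnlessLabelSketch` and `readerRuns` are hypothetical names; accepting a comma-separated label list in addition to "*" is an assumption that combines the two variants discussed in this thread, and nothing here is implemented in Cocoon.

```java
// Illustrative sketch only: one possible reading of the proposed attribute.
class UnlessLabelSketch {

    // unlessLabel: value of the proposed attribute, "*" or a comma-separated label list.
    // viewStartLabel: label at which the requested view starts, or null if no view is requested.
    static boolean readerRuns(String unlessLabel, String viewStartLabel) {
        if (viewStartLabel == null) {
            return true; // plain request: always serve the binary content
        }
        if ("*".equals(unlessLabel.trim())) {
            return false; // any requested view bypasses the reader
        }
        for (String label : unlessLabel.split(",")) {
            if (label.trim().equals(viewStartLabel)) {
                return false; // the view starts at one of the listed labels
            }
        }
        return true;
    }
}
```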

-Bertrand


Re: [RT] Views for readers

2003-08-14 Thread Vadim Gritsenko
Sylvain Wallez wrote:

Vadim Gritsenko wrote:

Sylvain Wallez wrote:

Vadim Gritsenko wrote:

Sylvain Wallez wrote:

Vadim Gritsenko wrote: 



<snip/>

Here is another wild (or not?) thought.

All this discussion comes down to the requirement of generating 
some XML out of the content usually served by the reader, if that's 
possible (and it is possible for some of the types of the content), 
in order to feed this XMLized content into the view. This generated 
XML is somewhat equivalent to the binary representation for the 
purpose of view building. So, I'm coming to the conclusion that some 
types of readers can be paired with the generator producing 
equivalent, but XMLized, content. The best place to indicate such 
pairing is the time when you declare a reader:

 <map:readers default="resource">
   <map:reader name="resource" src="org.apache.cocoon.reading.ResourceReader"/>
   <map:reader name="html" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
   </map:reader>
   <map:reader name="msexcel" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
   </map:reader>
   <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
     <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
   </map:reader>
 </map:readers>




I'm afraid this won't work :




Can you suggest some improvements so it does work? My goal is to have 
as little impact on sitemap syntax as possible.


- a generator specific to a given content-type is very unlikely to 
produce the document type expected by the view. We will most often 
need an additional transformation (e.g. the xword2xdoc.xsl that 
was in my example)




More wild suggestions.

1/ Do something with the views. Say, allow duplicate view names and 
make them work as selector:

 <map:views>
   <!-- works if (when) reader -->
   <map:view from-position="reader" name="content">
     <map:transform src="wordml2content.xsl" label="content"/>
     <map:serialize type="xml"/>
   </map:view>
   <!-- works if (when) label -->
   <map:view from-label="content" name="content">
     <map:serialize type="xml"/>
   </map:view>
   <!-- works if no label (otherwise) -->
   <map:view from-position="first" name="content">
     <map:serialize type="xml"/>
   </map:view>
 </map:views>


Still the same problem I keep desperately pointing out again and again : how 
can from-position="reader" use different generators (i.e. parsers) 
depending on the binary content ?


I did not copy reader-to-generator association 
(generator-paired-to-this-reader/) declared on top. Get the generator 
from there.


2/ Do something with the readers.

...

This introduces sitemap snippets into a component manager 
configuration, which is not good at all.


Yep. Not good.


3/ Alternative to 2:

 <map:readers default="resource">
   <map:reader name="msword" src="org.apache.cocoon.reading.ResourceReader">
     <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
   </map:reader>
 </map:readers>

 <map:views>
   <map:view from-label="content" name="content">
     <map:serialize type="xml"/>
   </map:view>
 </map:views>
 <map:pipelines>
   ...
   <map:read src="my.doc"/>
   ...
   <map:match pattern="word-2-content/*">
     <map:generate type="msword" src="{1}"/>
     <map:transform src="wordml2content.xsl" label="content"/>
     <map:serialize type="xml"/>
   </map:match>
 </map:pipelines>


Sounds better, but has the problem that it implies that every view 
should return xml content on my.doc.


Yep. Unless you define one xmlizer URI per view... Awful!


Or do we introduce a label attribute on map:read to define on 
which particular view the xmlizer-uri should be triggered ?


Possible.


I would not say that I like any of the suggestions above. The 
cleanest way ATM is the usage of map:resource I suggested in another 
email (I have yet to see your comment on it).


Sorry, I have no particular comment on the use of resources, as it's 
mainly a refactoring of the action/matcher proposals.


But it solves the problem! And it is the cleanest solution (with minimal 
impact) among all discussed here.


- views, through their associated labels, can be plugged at any 
point of the pipelines. Defining pair generators restricts views to 
be only from-label=start.

PS: Modifying sitemap syntax to allow reader/generator pairs with 
some "unless" attributes looks awful to me.


Doesn't seem so awful to me, since the reader should be executed 
unless certain conditions are met, which are that the specified 
label(s) correspond to the one at which the requested view should 
start. 


This "unless" attribute is nothing more than a shortcut for 
map:match. Given your point on verbosity and given the obfuscated 
result, I'm for verbosity.


Not exactly : you can currently match on the view name (provided that 
the environment actually does rely on the cocoon-view parameter),


(Special view matcher is still possible)


but you cannot match on the labels. And only labels are currently used 
in