On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
Frederic's question about search engine integration led me to wonder how Cocoon's Lucene integration could transparently index Word & PDF documents along with XML-produced documents.
There exist some text-extraction libraries for Word & PDF (e.g. http://www.textmining.org/). Now how can we integrate this as transparently as possible into Cocoon's search functionality?
The Lucene indexer crawls a website and asks for a particular view ("content") which is used to fill the index. But since Word and PDF documents are binary files, they're handled by a <map:read> statement, which does not handle views. On the other hand, this use case shows that having views on binary content may make sense: a "normal" request just sends back the binary content, while a view can run a text/XML extraction on these binary files.
So the question is: how could views be plugged into readers? I must say that I don't have an answer, as views contain transformers and a serializer, but no generator. So how could we express in the sitemap that a particular view on a reader should "replace" that reader with a particular generator? Or should this go through some special readers that could also act as generators?
Or maybe these are silly thoughts and we should use a <map:select> directing to a <map:read> or <map:generate> depending on the view. But this introduces explicit view management in the pipelines, which doesn't seem nice to me.
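To make the <map:select> option concrete, here is a rough sitemap sketch. It assumes a selector named "request-parameter" is declared as in the default sitemap, and that the view is requested through a request parameter assumed here to be "cocoon-view". The "text-extract" generator type and the docs/ path are purely hypothetical: nothing like that generator is known to exist, it stands in for whatever would wrap a text-extraction library.

```xml
<!-- Sketch only: "text-extract" is a hypothetical generator type,
     and docs/ is an assumed location for the binary files. -->
<map:match pattern="**.doc">
  <map:select type="request-parameter">
    <map:parameter name="parameter-name" value="cocoon-view"/>
    <map:when test="content">
      <!-- Indexer request: extract text from the binary file as XML -->
      <map:generate type="text-extract" src="docs/{1}.doc"/>
      <map:serialize type="xml"/>
    </map:when>
    <map:otherwise>
      <!-- Normal request: send the binary file back as is -->
      <map:read src="docs/{1}.doc" mime-type="application/msword"/>
    </map:otherwise>
  </map:select>
</map:match>
```

This shows the drawback mentioned above: the view name leaks into the pipeline, and every matcher that serves binary files would need the same selection logic.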
Solution: strongly typed pipelines! :)
Imagine if, at each node in the sitemap, we knew what type of content we were dealing with (usually some flavour of XML). Then we could write a single view that behaves differently depending on the _type_ of data:
  <map:view name="indexablecontent" from-position="first">
    <map:select type="xml-type">
      <map:when test="docbook">
        <map:transform src="docbook2whatever.xsl"/>
      </map:when>
      <map:when test="tei">
        <map:transform src="tei2whatever.xsl"/>
      </map:when>
      <map:when test="msword">
        <map:transform src="word2whatever.xsl"/>
      </map:when>
    </map:select>
  </map:view>
Ah, OK, "strongly typed pipelines" is just different wording for "content-aware selectors"!
So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would return XML representing the content of the .doc file.
I described the same thing in a mail with subject 'Type-aware Views (Re: Link view goodness)'. Same need, different context, same proposed solution.
Not exactly: the use case here is that we have a binary file which is normally sent as is to the browser using a reader. It is _not_ parsed as an XML stream. So we can't attach a view to these kinds of URLs, since views provide a different _ending_ to a pipeline, meaning there must exist at least a generator, and optionally one or more transformers, at the point where processing is directed to the view.
So even content-aware selectors don't solve this problem...
Sylvain
-- 
Sylvain Wallez                  Anyware Technologies
http://www.apache.org/~sylvain  http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance - http://www.orixo.com