Hmm,

Frederic's question about search engine integration led me to wonder how Cocoon's Lucene integration could transparently index Word & PDF documents along with XML-produced documents.

I have been wondering about that too. At my company, we put together a simple web management tool to put small collections of documents into a web frame for a client. Pretty useless, but it's what he wanted.


At the time I thought it might be possible to just improve Lucene so it could understand binary files, by introducing filter modules, triggered by MIME type, that convert the input stream to text. After all, if the text were only going to be used for indexing, it wouldn't matter that the text wasn't available within Cocoon itself. In any case he's happy with what he has and we're happily doing other stuff.
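
To make the filter-module idea a bit more concrete, here is a minimal sketch in plain Java. It is not Lucene's or Cocoon's API: ExtractorRegistry, TextExtractor and toIndexableText are names I made up for illustration, and the real Word or PDF converters would have to be plugged in behind the interface. The string it returns is what you would hand to the indexer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry: picks a text-extracting filter by MIME type.
final class ExtractorRegistry {

    /** Hypothetical filter module: turns a raw document stream into plain text. */
    interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    /** Associates a MIME type (case-insensitive) with an extractor. */
    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType.toLowerCase(Locale.ROOT), extractor);
    }

    /** Converts the stream to text using whichever extractor the MIME type triggers. */
    String toIndexableText(String mimeType, InputStream in) {
        TextExtractor extractor = byMimeType.get(mimeType.toLowerCase(Locale.ROOT));
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for " + mimeType);
        }
        try {
            return extractor.extract(in);
        } catch (IOException e) {
            // Surface I/O failures without forcing checked exceptions on callers.
            throw new UncheckedIOException(e);
        }
    }
}
```

A plain-text extractor could be registered as `registry.register("text/plain", in -> new String(in.readAllBytes(), StandardCharsets.UTF_8))`; a Word or PDF one would wrap whatever converter you trust for that format.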

Perhaps if the individual extractors were part of specialised readers for specific types of documents, you could configure the label for the XML they return? That would allow the duality of that behaviour to be mostly concealed and managed from within Cocoon, with little effect on the sitemap.
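
Roughly what I have in mind for the configurable label, as a sketch only: the extracted text gets wrapped in a single element whose name is a setting, and it is emitted as SAX events, since that is what Cocoon pipelines pass around. LabelledTextEmitter and its trace helper are hypothetical names, not existing Cocoon classes, and trace just records the events in a readable form rather than producing properly escaped XML.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: wraps extracted text in one element with a configurable name.
final class LabelledTextEmitter {
    private final String label;

    LabelledTextEmitter(String label) {
        this.label = label;
    }

    /** Sends the text to the handler as a one-element SAX document. */
    void emit(String extractedText, ContentHandler handler) throws SAXException {
        char[] chars = extractedText.toCharArray();
        handler.startDocument();
        handler.startElement("", label, label, new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", label, label);
        handler.endDocument();
    }

    /** For trying it out: records the events as a readable (unescaped) trace. */
    static String trace(String label, String text) {
        StringBuilder out = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }
        };
        try {
            new LabelledTextEmitter(label).emit(text, recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Changing the label would then be a configuration matter for the reader rather than something the rest of the pipeline has to know about.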

I personally find it tempting to think that it may be possible to rip XML out of any of these formats and do with it as we wish, particularly when I saw that programs like catdoc could recognize the tables even in Word 2k documents. But I often find myself pushing back against that, thinking that maybe I should represent all content (even document content) semantically in XML and let rendering technologies (PDFSerializer, POI) handle binary output, and perhaps leverage document importers that map those documents back to XML (they all seem to be proprietary, big-bucks solutions from what I see currently, though). In any case, it does seem that this is certainly a ways off in the future *sigh*

Hmm, an OCR extractor would be way cool for faxes too!

just my 2c, i never say anything most of the time, anyway
Sam


