I would prefer to see a good open-source framework that pulls together a
collection of document parsers but isn't tied directly to Lucene (the
Lucene binding would live in *another* project).
If the parser framework extracted document text in a standard form that
is neutral to both document type and application (XML? a Java object?),
it could underpin *any* IR/IE project wanting to use the parsers, e.g.
the GATE framework. That would ultimately make it a much more valuable
piece of functionality, and it is the approach taken by Stellent (used by
many search engines, recently purchased by Oracle).
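
To make the idea concrete, here is a rough sketch of what the neutral
layer might look like as plain Java objects. Every name in it
(ParsedDocument, DocumentParser, PlainTextParser, ParserRegistry) is
hypothetical and not taken from any existing project; it only illustrates
the separation I have in mind, with nothing in it depending on Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// All types below are hypothetical, for illustration only.

/** Neutral parse result: extracted text plus metadata, no search-engine types. */
final class ParsedDocument {
    private final String text;
    private final Map<String, String> metadata;

    ParsedDocument(String text, Map<String, String> metadata) {
        this.text = text;
        // Defensive copy so callers cannot mutate the result afterwards.
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
    }

    String getText() { return text; }
    Map<String, String> getMetadata() { return metadata; }
}

/** One implementation per document format (PDF, Word, HTML, ...). */
interface DocumentParser {
    boolean supports(String mimeType);
    ParsedDocument parse(byte[] content);
}

/** Trivial parser for plain text, assuming UTF-8 input. */
final class PlainTextParser implements DocumentParser {
    @Override
    public boolean supports(String mimeType) {
        return "text/plain".equals(mimeType);
    }

    @Override
    public ParsedDocument parse(byte[] content) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("content-type", "text/plain");
        return new ParsedDocument(new String(content, StandardCharsets.UTF_8), meta);
    }
}

/** Picks the first registered parser that claims the given MIME type. */
final class ParserRegistry {
    private final List<DocumentParser> parsers = new ArrayList<>();

    void register(DocumentParser parser) {
        parsers.add(parser);
    }

    ParsedDocument parse(String mimeType, byte[] content) {
        for (DocumentParser parser : parsers) {
            if (parser.supports(mimeType)) {
                return parser.parse(content);
            }
        }
        throw new IllegalArgumentException("No parser registered for " + mimeType);
    }
}
```

A Lucene binding, a GATE binding, or anything else would then sit in its
own project and only consume ParsedDocument, e.g. by mapping getText()
and the metadata entries onto whatever document model it uses.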
Cheers
Mark