Hi all,

I've been thinking about this for a while too, namely what Sami suggested in another thread: "I think we should start looking at Apache Tika for most (or all) of our parsers."

This is actually part of my broader vision for Nutch: this project should not duplicate the functionality of other well-established projects by re-implementing it, only poorly. Our focus is not on parsers, plugins, mime/charset detection, or distributed RPC, but on building a robust platform for crawling.

We could start working on this particular issue by donating to Tika those Nutch parsers that are not already present there, and by using Tika's parsers in Nutch wherever that is already possible. Once Tika supports all the document types that our parsers handle, we should switch to Tika completely.

Of course, this will happen after the 1.0 release.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
