Moving Nutch parsers to Tika

Andrzej Bialecki Tue, 10 Mar 2009 02:58:19 -0700

Hi all,

I've been debating this for a while, too, what Sami suggested in another thread: "I think we should start looking at Apache Tika for most (or all) of our parsers."

This is actually a part of my broader vision for Nutch, that this project should not duplicate functionality of other well-established projects by re-implementing the same functionality, only poorly - because our focus is not on parsers, plugins, mime/charset detection, distributed RPC, but on building a robust platform for crawling.

We could start working on this particular issue by donating the Nutch parsers to Tika, those that are not already present there, and start using Tika's parsers in Nutch where it's already possible. Once Tika supports all types of parsers that we have, we should switch completely to Tika.


Of course, this will happen post-1.0 release.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
