RE: robust Tika and Hadoop

Allison, Timothy B. Mon, 20 Jul 2015 18:39:33 -0700

Thank you, Ken and Mark.  Will update wiki over the next few days!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop


Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/), we wrap it in a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against errors such as NoSuchMethodError, 
which Tika can throw if the mime-type detection code tries to use a parser 
that we have excluded in order to keep the Hadoop job jar at a reasonable 
size.
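
For the wiki, here is a minimal sketch of that pattern using only the JDK. It is 
not the Bixo code: the class name TimedParse, the method parseWithTimeout, and 
the fallback argument are made up for illustration, and the Callable stands in 
for whatever actually calls Tika. The idea is to run the parse on a worker 
thread, wait a bounded time for the result, and treat a timeout or any 
Throwable (including Errors such as NoSuchMethodError) as a failed document 
rather than a failed task.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; names are illustrative, not part of Tika or Bixo.
final class TimedParse {

    // Daemon threads, so an orphaned (hung) parse cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    private TimedParse() { }

    // Runs parseTask and returns its result, or fallback if the task
    // times out, throws, or the calling thread is interrupted.
    static String parseWithTimeout(Callable<String> parseTask, long timeoutMs, String fallback) {
        Future<String> future = POOL.submit(parseTask);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Request interruption. A parser that ignores interrupts keeps
            // running, so the thread is effectively orphaned.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // e.getCause() is whatever the task threw, Errors included.
            return fallback;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve interrupt status
            return fallback;
        }
    }
}
```

A caller would pass something like () -> extractText(bytes), where extractText 
is your own wrapper around the Tika parse call. Note that this only guards 
against hangs and thrown errors. An orphaned thread still holds its memory, so 
it does not by itself protect against OOM.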

-- Ken

________________________________

From: Allison, Timothy B.

Sent: July 15, 2015 4:38:56am PDT

To: user@tika.apache.org

Subject: robust Tika and Hadoop

All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against OOM errors and permanent hangs within Hadoop?

  Thank you!

        Best,

                  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




