Thank you, Ken and Mark. Will update the wiki over the next few days!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop
Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a TikaCallable (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against things like NoSuchMethodErrors, which Tika can throw if the mime-type detection code tries to use a parser that we exclude in order to keep the Hadoop job jar size to something reasonable.

-- Ken

________________________________
From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet.

Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM and permanent hangs within Hadoop?

Thank you!

Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
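
For the wiki, the pattern Ken describes (run the parse in a callable, give up and orphan the thread on timeout, and catch Errors such as NoSuchMethodError) could be sketched roughly as below. This is not Bixo's actual code; the names TimedParse, Result and runWithTimeout are made up for illustration, and the real Tika parse call would go inside the Callable that is passed in. It uses only java.util.concurrent.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or Bixo.
final class TimedParse {

    /** Outcome of one guarded parse attempt. */
    static final class Result<T> {
        final T value;          // null unless the parse succeeded
        final Throwable error;  // null on success
        final boolean timedOut; // true if the time limit was hit

        Result(T value, Throwable error, boolean timedOut) {
            this.value = value;
            this.error = error;
            this.timedOut = timedOut;
        }

        boolean ok() {
            return error == null && !timedOut;
        }
    }

    /**
     * Runs parseTask on its own daemon thread and waits at most
     * timeoutMillis for it. The caller never blocks longer than that,
     * even if the parser hangs forever.
     */
    static <T> Result<T> runWithTimeout(Callable<T> parseTask, long timeoutMillis) {
        FutureTask<T> task = new FutureTask<>(parseTask);
        Thread worker = new Thread(task, "parse-worker");
        // Daemon, so an orphaned worker cannot keep the task JVM alive.
        worker.setDaemon(true);
        worker.start();
        try {
            T value = task.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return new Result<>(value, null, false);
        } catch (TimeoutException e) {
            // Request an interrupt. If the parser ignores it, the thread
            // is simply left behind (orphaned) and we move on.
            task.cancel(true);
            return new Result<>(null, e, true);
        } catch (ExecutionException e) {
            // FutureTask captures any Throwable from the callable, so the
            // cause may be an Error such as NoSuchMethodError.
            return new Result<>(null, e.getCause(), false);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            task.cancel(true);
            return new Result<>(null, e, false);
        }
    }
}
```

One caveat with this approach: an orphaned thread that never returns still holds its memory and CPU, so it protects the job against hangs but does not by itself defend against OOM.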