All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1], and Morphlines [2], but I haven't looked carefully into these packages yet.

Does anyone have recommendations for specific configurations or design patterns that will defend against out-of-memory (OOM) errors and permanent hangs within Hadoop?

Thank you!

Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/