Re: robust Tika and Hadoop
awesome work Mark and Ken

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Mark Kerzner <mark.kerz...@shmsoft.com>
Reply-To: user@tika.apache.org
Date: Monday, July 20, 2015 at 4:22 PM
To: Tika User <user@tika.apache.org>
Subject: Re: robust Tika and Hadoop

Hi, Tim,

here is my Tika with Hadoop project, tested on Enron, http://frd.org/, and it works quite well.

Mark

On Mon, Jul 20, 2015 at 6:20 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a TikaCallable (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against things like NoSuchMethodErrors, which Tika can throw if the mime-type detection code tries to use a parser that we exclude in order to keep the Hadoop job jar size reasonable.

-- Ken

From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM and permanent hangs within Hadoop?

Thank you!

Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

--
Mark Kerzner, President & CEO, SHMsoft (http://shmsoft.com/)
To schedule a meeting with me: http://www.meetme.so/markkerzner
Mobile: 713-724-2534
Skype: mark.kerzner1
Office: One Riverway, Suite 1700, Houston, TX 77056
Privileged and Confidential
RE: robust Tika and Hadoop
Thank you, Ken!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, July 21, 2015 10:23 AM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

Responses inline below.

-- Ken

From: Allison, Timothy B.
Sent: July 21, 2015 5:29:37am PDT
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Ken,

To confirm your strategy: one new Thread for each call to Tika, add timeout exception handling, orphan the thread.

Correct.

Out of curiosity, three questions:

1) If I had more time to read your code, the answer would be obvious...sorry. How are you organizing your ingest? Are you concatenating files into a SequenceFile or doing something else? Are you processing each file in a single map step, or batching files in your mapper?

Files are effectively concatenated, as each record (Cascading Tuple, or Hadoop KV pair) has the raw bytes plus a bunch of other data (headers returned, etc.). The parse phase is a map operation, so it's batch processing of all files successfully downloaded during that fetch loop.

2) Somewhat related to the first question, in addition to orphaning the parsing thread, are you doing anything else, like setting a maximum number of tasks per JVM? Are you configuring a max number of retries, etc.?

If by tasks per JVM you mean the # of times we reuse the JVM, then yes - otherwise the orphaned threads would eventually clog things up. For retries, we typically don't set it (so it defaults to 4), but in practice I'd recommend something like 2 - so you get one retry, and then it fails. Otherwise you typically fail four times on that error that could never possibly happen, but does.

3) Are you adding the AutoDetectParser to your ParseContext so that you'll get content from embedded files?

No, not typically, as we're usually ignoring archive files. But that's a good point - with current versions of Tika we could now more easily handle those. It gets a bit tricky, though, as the UID for content is the URL, but now we'd have multiple sub-docs that we'd want to index separately.

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a TikaCallable (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against things like NoSuchMethodErrors, which Tika can throw if the mime-type detection code tries to use a parser that we exclude in order to keep the Hadoop job jar size reasonable.

-- Ken

From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM and permanent hangs within Hadoop?

Thank you!
Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
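A hedged sketch of the two job settings Ken recommends above (bounded JVM reuse plus a retry cap of 2), assuming the pre-YARN Hadoop 1.x mapred API that Bixo-era jobs used; the property values are illustrative assumptions, not numbers from this thread:

import org.apache.hadoop.conf.Configuration;

public class TikaJobSettings {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Reuse each task JVM a bounded number of times, so threads orphaned
        // by parse timeouts are eventually cleaned up when the JVM is recycled.
        // -1 would mean unlimited reuse; 100 is an assumed, illustrative value.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 100);
        // Default is 4 attempts; 2 gives "one retry, and then it fails".
        conf.setInt("mapred.map.max.attempts", 2);
        return conf;
    }
}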
RE: robust Tika and Hadoop
Ken,

To confirm your strategy: one new Thread for each call to Tika, add timeout exception handling, orphan the thread.

Out of curiosity, three questions:

1) If I had more time to read your code, the answer would be obvious...sorry. How are you organizing your ingest? Are you concatenating files into a SequenceFile or doing something else? Are you processing each file in a single map step, or batching files in your mapper?

2) Somewhat related to the first question, in addition to orphaning the parsing thread, are you doing anything else, like setting a maximum number of tasks per JVM? Are you configuring a max number of retries, etc.?

3) Are you adding the AutoDetectParser to your ParseContext so that you'll get content from embedded files?

Thank you, again.

Best,
Tim

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a TikaCallable (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against things like NoSuchMethodErrors, which Tika can throw if the mime-type detection code tries to use a parser that we exclude in order to keep the Hadoop job jar size reasonable.

-- Ken

From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM and permanent hangs within Hadoop?

Thank you!

Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
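For the SequenceFile approach Tim mentions in his first question above, one common pattern is to pack each small file into a single SequenceFile record, keyed by its path, so the mapper receives raw bytes instead of millions of small files. A minimal sketch using the Hadoop 2.x writer API; this is an illustration, not Bixo's actual ingest code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackFilesIntoSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path out = new Path(args[0]); // destination for the packed corpus
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (int i = 1; i < args.length; i++) {
                byte[] raw = Files.readAllBytes(Paths.get(args[i]));
                // Key = source path (the document's UID), value = raw bytes.
                writer.append(new Text(args[i]), new BytesWritable(raw));
            }
        }
    }
}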
RE: robust Tika and Hadoop
Hi Tim,

Responses inline below.

-- Ken

From: Allison, Timothy B.
Sent: July 21, 2015 5:29:37am PDT
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Ken,

To confirm your strategy: one new Thread for each call to Tika, add timeout exception handling, orphan the thread.

Correct.

Out of curiosity, three questions:

1) If I had more time to read your code, the answer would be obvious...sorry. How are you organizing your ingest? Are you concatenating files into a SequenceFile or doing something else? Are you processing each file in a single map step, or batching files in your mapper?

Files are effectively concatenated, as each record (Cascading Tuple, or Hadoop KV pair) has the raw bytes plus a bunch of other data (headers returned, etc.). The parse phase is a map operation, so it's batch processing of all files successfully downloaded during that fetch loop.

2) Somewhat related to the first question, in addition to orphaning the parsing thread, are you doing anything else, like setting a maximum number of tasks per JVM? Are you configuring a max number of retries, etc.?

If by tasks per JVM you mean the # of times we reuse the JVM, then yes - otherwise the orphaned threads would eventually clog things up. For retries, we typically don't set it (so it defaults to 4), but in practice I'd recommend something like 2 - so you get one retry, and then it fails. Otherwise you typically fail four times on that error that could never possibly happen, but does.

3) Are you adding the AutoDetectParser to your ParseContext so that you'll get content from embedded files?

No, not typically, as we're usually ignoring archive files. But that's a good point - with current versions of Tika we could now more easily handle those. It gets a bit tricky, though, as the UID for content is the URL, but now we'd have multiple sub-docs that we'd want to index separately.

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a TikaCallable (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against things like NoSuchMethodErrors, which Tika can throw if the mime-type detection code tries to use a parser that we exclude in order to keep the Hadoop job jar size reasonable.

-- Ken

From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM and permanent hangs within Hadoop?

Thank you!
Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
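On question 3 above, the Tika idiom Tim is asking about is registering the parser itself in the ParseContext, which is what enables recursive parsing of embedded documents. A minimal sketch under that assumption; the file handling and unlimited write limit are illustrative, not code from this thread:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class EmbeddedDocExtraction {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Registering the parser in the context is what turns on recursive
        // parsing of embedded documents (attachments, archive entries, etc.).
        context.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, metadata, context);
        }
        System.out.println(handler.toString()); // container text + embedded docs
    }
}

If the parser is not registered in the context, the container format is still parsed, but the content of embedded documents is generally not extracted.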
RE: robust Tika and Hadoop
Thank you, Ken and Mark. Will update the wiki over the next few days!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a TikaCallable (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against things like NoSuchMethodErrors, which Tika can throw if the mime-type detection code tries to use a parser that we exclude in order to keep the Hadoop job jar size reasonable.

-- Ken

From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM and permanent hangs within Hadoop?

Thank you!

Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
RE: robust Tika and Hadoop
Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a TikaCallable (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against things like NoSuchMethodErrors, which Tika can throw if the mime-type detection code tries to use a parser that we exclude in order to keep the Hadoop job jar size reasonable.

-- Ken

From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM and permanent hangs within Hadoop?

Thank you!

Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
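A minimal sketch of the timeout-and-orphan pattern Ken describes; this is not the actual Bixo TikaCallable, and the 30-second limit, class name, and use of a raw daemon thread are illustrative assumptions:

import java.io.InputStream;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TimedTikaParse {

    private static final long PARSE_TIMEOUT_SECONDS = 30; // assumed limit

    public static String parseWithTimeout(InputStream in) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();

        FutureTask<String> task = new FutureTask<>(() -> {
            parser.parse(in, handler, metadata, new ParseContext());
            return handler.toString();
        });

        // Daemon thread, so an orphaned parse can't keep the JVM alive forever.
        Thread t = new Thread(task, "tika-parse");
        t.setDaemon(true);
        t.start();

        try {
            return task.get(PARSE_TIMEOUT_SECONDS, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Orphan the thread: stop waiting, though it may keep running
            // until the task JVM is recycled - hence bounding JVM reuse,
            // as Ken notes elsewhere in the thread.
            throw new Exception("Tika parse timed out", e);
        }
    }
}

Because the orphaned thread can linger after a timeout, this pattern only stays safe when combined with the bounded JVM reuse Ken recommends in his follow-up.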
Re: robust Tika and Hadoop
I would add Nutch to the list too, Tim :-) +1 from me.

—
Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Allison, Timothy B. <talli...@mitre.org>
Reply-To: user@tika.apache.org
Date: Wednesday, July 15, 2015 at 4:38 AM
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,

I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM and permanent hangs within Hadoop?

Thank you!

Best,
Tim

[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/