Hi Tim, Responses inline below.
-- Ken

> From: Allison, Timothy B.
> Sent: July 21, 2015 5:29:37am PDT
> To: user@tika.apache.org
> Subject: RE: robust Tika and Hadoop
>
> Ken,
> To confirm your strategy: one new Thread for each call to Tika, add timeout
> exception handling, orphan the thread.

Correct.

> Out of curiosity, three questions:
>
> 1) If I had more time to read your code, the answer would be
> obvious…sorry…. How are you organizing your ingest? Are you concatenating
> files into a SequenceFile or doing something else? Are you processing each
> file in a single map step, or batching files in your mapper?

Files are effectively concatenated, as each record (Cascading Tuple, or Hadoop
key-value pair) has the raw bytes plus a bunch of other data (headers returned,
etc.). The parse phase is a map operation, so it's batch processing of all the
files successfully downloaded during that fetch loop.

> 2) Somewhat related to the first question, in addition to orphaning the
> parsing thread, are you doing anything else, like setting a maximum number
> of tasks per JVM? Are you configuring a max number of retries, etc.?

If by "tasks per JVM" you mean the number of times we reuse the JVM, then yes -
otherwise the orphaned threads would eventually clog things up. For retries we
typically don't set anything (so it defaults to 4), but in practice I'd
recommend something like 2 - that way you get one retry and then the task
fails, instead of failing four times on the error that could never possibly
happen but does.

> 3) Are you adding the AutoDetectParser to your ParseContext so that
> you’ll get content from embedded files?

No, not typically, as we're usually ignoring archive files. But that's a good
point: with current versions of Tika we could now more easily handle those. It
gets a bit tricky, though, as the UID for content is the URL, but now we'd have
multiple sub-docs that we'd want to index separately.
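For what it's worth, the wiring Tim asks about in question 3 is just registering the parser with the ParseContext before calling parse, so that Tika recurses into embedded documents. A minimal sketch (the class name, file-path argument, and unlimited write limit are illustrative choices, not anything from our crawler code):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class EmbeddedParseSketch {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();

        ParseContext context = new ParseContext();
        // Registering the parser in the context is what enables recursive
        // parsing of embedded documents (attachments, archive entries, etc.).
        context.set(Parser.class, parser);

        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

Note this flattens all embedded content into one handler; indexing sub-docs separately (the UID problem above) would instead need a custom EmbeddedDocumentExtractor in the context.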
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Monday, July 20, 2015 7:21 PM
> To: user@tika.apache.org
> Subject: RE: robust Tika and Hadoop
>
> Hi Tim,
>
> When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a
> TikaCallable
> (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).
>
> This lets us orphan the parsing thread if it times out
> (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).
>
> It also provides a bit of protection against things like NoSuchMethodErrors,
> which Tika can throw if the mime-type detection code tries to use a parser
> that we exclude in order to keep the Hadoop job jar size reasonable.
>
> -- Ken
>
> From: Allison, Timothy B.
> Sent: July 15, 2015 4:38:56am PDT
> To: user@tika.apache.org
> Subject: robust Tika and Hadoop
>
> All,
>
> I’d like to fill out our Wiki a bit more on using Tika robustly within
> Hadoop. I’m aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven’t
> looked carefully into these packages yet.
>
> Does anyone have any recommendations for specific configurations/design
> patterns that will defend against OOMs and permanent hangs within Hadoop?
>
> Thank you!
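The TikaCallable approach described above amounts to running the parse on a separate thread and abandoning that thread if it doesn't finish in time. A minimal, stdlib-only sketch of the pattern (class and method names are mine, not Bixo's; the real implementation is in the linked TikaCallable/SimpleParser sources):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutParser {
    // Run a potentially-hanging task with a time limit. On timeout we cancel
    // the Future (which interrupts the worker) and move on; a worker stuck in
    // uninterruptible code is simply orphaned.
    public static <T> T callWithTimeout(Callable<T> task, long timeoutSecs)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-thread");
            // Daemon threads can't keep the JVM alive after the job finishes.
            t.setDaemon(true);
            return t;
        });
        try {
            Future<T> future = executor.submit(task);
            try {
                return future.get(timeoutSecs, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt; orphaned if it ignores this
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Marking the worker as a daemon keeps orphaned threads from blocking JVM shutdown, but an abandoned parse may never release its memory, which is why limiting JVM reuse (as discussed above) still matters.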
>
> Best,
>
> Tim
>
> [0] https://github.com/DigitalPebble/behemoth
> [1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr