Hi Tim, Responses inline below.
-- Ken

> From: Allison, Timothy B.
> Sent: July 21, 2015 5:29:37am PDT
> To: user@tika.apache.org
> Subject: RE: robust Tika and Hadoop
>
> Ken,
> To confirm your strategy: one new Thread for each call to Tika, add timeout
> exception handling, orphan the thread.

Correct.

> Out of curiosity, three questions:
>
> 1) If I had more time to read your code, the answer would be
> obvious…sorry…. How are you organizing your ingest? Are you concatenating
> files into a SequenceFile or doing something else? Are you processing each
> file in a single map step, or batching files in your mapper?

Files are effectively concatenated, as each record (Cascading Tuple, or Hadoop
key-value pair) has the raw bytes plus a bunch of other data (headers returned,
etc.). The parse phase is a map operation, so it's batch processing of all the
files successfully downloaded during that fetch loop.

> 2) Somewhat related to the first question, in addition to orphaning the
> parsing thread, are you doing anything else, like setting a maximum number
> of tasks per JVM? Are you configuring a max number of retries, etc.?

If by "tasks per JVM" you mean the number of times we reuse the JVM, then yes -
otherwise the orphaned threads would eventually clog things up. For retries we
typically don't set anything (so it defaults to 4), but in practice I'd
recommend something like 2 - that way you get one retry and then the task
fails, instead of failing four times on the error that could never possibly
happen but does.

> 3) Are you adding the AutoDetectParser to your ParseContext so that
> you’ll get content from embedded files?

No, not typically, as we're usually ignoring archive files. But that's a good
point: with current versions of Tika we could now more easily handle those. It
gets a bit tricky, though, as the UID for content is the URL, but now we'd have
multiple sub-docs that we'd want to index separately.
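For what it's worth, the wiring Tim asks about in question 3 is just registering the parser with the ParseContext before calling parse, so that Tika recurses into embedded documents. A minimal sketch (the class name, file-path argument, and unlimited write limit are illustrative choices, not anything from our crawler code):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class EmbeddedParseSketch {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();

        ParseContext context = new ParseContext();
        // Registering the parser in the context is what enables recursive
        // parsing of embedded documents (attachments, archive entries, etc.).
        context.set(Parser.class, parser);

        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

Note this flattens all embedded content into one handler; indexing sub-docs separately (the UID problem above) would instead need a custom EmbeddedDocumentExtractor in the context.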
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Monday, July 20, 2015 7:21 PM
> To: user@tika.apache.org
> Subject: RE: robust Tika and Hadoop
>
> Hi Tim,
>
> When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a
> TikaCallable
> (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).
>
> This lets us orphan the parsing thread if it times out
> (https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).
>
> It also provides a bit of protection against things like NoSuchMethodErrors,
> which Tika can throw if the mime-type detection code tries to use a parser
> that we exclude in order to keep the Hadoop job jar size reasonable.
>
> -- Ken
>
> From: Allison, Timothy B.
> Sent: July 15, 2015 4:38:56am PDT
> To: user@tika.apache.org
> Subject: robust Tika and Hadoop
>
> All,
>
> I’d like to fill out our Wiki a bit more on using Tika robustly within
> Hadoop. I’m aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven’t
> looked carefully into these packages yet.
>
> Does anyone have any recommendations for specific configurations/design
> patterns that will defend against OOMs and permanent hangs within Hadoop?
>
> Thank you!
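The TikaCallable approach described above amounts to running the parse on a separate thread and abandoning that thread if it doesn't finish in time. A minimal, stdlib-only sketch of the pattern (class and method names are mine, not Bixo's; the real implementation is in the linked TikaCallable/SimpleParser sources):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutParser {
    // Run a potentially-hanging task with a time limit. On timeout we cancel
    // the Future (which interrupts the worker) and move on; a worker stuck in
    // uninterruptible code is simply orphaned.
    public static <T> T callWithTimeout(Callable<T> task, long timeoutSecs)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-thread");
            // Daemon threads can't keep the JVM alive after the job finishes.
            t.setDaemon(true);
            return t;
        });
        try {
            Future<T> future = executor.submit(task);
            try {
                return future.get(timeoutSecs, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt; orphaned if it ignores this
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Marking the worker as a daemon keeps orphaned threads from blocking JVM shutdown, but an abandoned parse may never release its memory, which is why limiting JVM reuse (as discussed above) still matters.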
>
> Best,
>
> Tim
>
> [0] https://github.com/DigitalPebble/behemoth
> [1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr