Re: robust Tika and Hadoop

2015-07-22 Thread Mattmann, Chris A (3980)
Awesome work, Mark and Ken!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Mark Kerzner <mark.kerz...@shmsoft.com>
Reply-To: user@tika.apache.org
Date: Monday, July 20, 2015 at 4:22 PM
To: Tika User <user@tika.apache.org>
Subject: Re: robust Tika and Hadoop

Hi, Tim,

Here is my Tika with Hadoop project, tested on Enron (http://frd.org/), and 
it works quite well.

Mark


RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
Thank you, Ken!


RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
Ken,
  To confirm your strategy: one new Thread for each call to Tika, add timeout 
exception handling, orphan the thread.

Out of curiosity, three questions:

1)  If I had more time to read your code, the answer would be 
obvious... sorry! How are you organizing your ingest?  Are you concatenating 
files into a SequenceFile or doing something else?  Are you processing each 
file in a single map step, or batching files in your mapper?

2)  Somewhat related to the first question: in addition to orphaning the 
parsing thread, are you doing anything else, like setting a maximum number of 
tasks per JVM?  Are you configuring a max number of retries, etc.?

3)  Are you adding the AutoDetectParser to your ParseContext so that you'll 
get content from embedded files?

Thank you again.

Best,

 Tim


RE: robust Tika and Hadoop

2015-07-21 Thread Ken Krugler
Hi Tim,

Responses inline below.

-- Ken

 From: Allison, Timothy B.
 Sent: July 21, 2015 5:29:37am PDT
 To: user@tika.apache.org
 Subject: RE: robust Tika and Hadoop
 
 Ken,
   To confirm your strategy: one new Thread for each call to Tika, add timeout 
 exception handling, orphan the thread.

Correct.

  
 Out of curiosity, three questions:
 1)  If I had more time to read your code, the answer would be 
 obvious... sorry! How are you organizing your ingest?  Are you concatenating 
 files into a SequenceFile or doing something else?  Are you processing each 
 file in a single map step, or batching files in your mapper?

Files are effectively concatenated, as each record (a Cascading Tuple, or Hadoop 
KV pair) has the raw bytes plus a bunch of other data (headers returned, etc.).

The parse phase is a map operation, so it's batch processing of all files 
successfully downloaded during that fetch loop.
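
[For reference, one common way to pack records like this using plain Hadoop 
APIs - a sketch only, not Bixo's actual code, which goes through Cascading; 
the file name and URL below are made up:]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackFetchedDocs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // One record per fetched document: key = URL, value = raw bytes.
            // A richer value type would carry the response headers etc. too.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("fetched-docs.seq"),
                Text.class, BytesWritable.class);
            try {
                byte[] rawBytes = new byte[0]; // stand-in for fetched content
                writer.append(new Text("http://example.com/doc.pdf"),
                              new BytesWritable(rawBytes));
            } finally {
                writer.close();
            }
        }
    }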

 2)  Somewhat related to the first question: in addition to orphaning the 
 parsing thread, are you doing anything else, like setting a maximum number of 
 tasks per JVM?  Are you configuring a max number of retries, etc.?

If by tasks per JVM you mean the number of times we reuse the JVM, then yes - 
otherwise the orphaned threads would eventually clog things up.

For retries, we typically don't set it (so it defaults to 4), but in practice 
I'd recommend something like 2 - that way you get one retry and then it fails; 
otherwise you typically fail four times on that error that could never possibly 
happen but does.
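
[In classic MRv1 property terms, that combination looks roughly like the 
sketch below; the values are illustrative, the property names are the 
Hadoop 1.x ones, and they changed in MRv2:]

    import org.apache.hadoop.mapred.JobConf;

    public class ParseJobSettings {
        public static JobConf configure(JobConf conf) {
            // Reuse each task JVM a bounded number of times, then let it
            // exit so its orphaned parser threads die with it; -1 would
            // mean unlimited reuse and let them pile up.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", 20);

            // Two attempts total = one retry instead of the default four.
            conf.setInt("mapred.map.max.attempts", 2);
            return conf;
        }
    }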

 3)  Are you adding the AutoDetectParser to your ParseContext so that 
 you’ll get content from embedded files?

No, not typically, as we're usually ignoring archive files. But that's a good 
point: with current versions of Tika we could now handle those more easily. It 
gets a bit tricky, though, as the UID for content is the URL, but now we'd have 
multiple sub-docs that we'd want to index separately.
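
[For reference, the pattern Tim is asking about is the standard Tika recipe 
for recursing into embedded documents - a minimal sketch:]

    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;

    public class EmbeddedAwareParse {
        public static String parse(InputStream stream) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            ParseContext context = new ParseContext();
            // Registering the parser in its own ParseContext tells Tika to
            // recurse into embedded documents (attachments, archive
            // entries) instead of skipping them.
            context.set(Parser.class, parser);
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            parser.parse(stream, handler, new Metadata(), context);
            return handler.toString();
        }
    }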


RE: robust Tika and Hadoop

2015-07-20 Thread Allison, Timothy B.
Thank you, Ken and Mark. Will update the wiki over the next few days!


RE: robust Tika and Hadoop

2015-07-20 Thread Ken Krugler
Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java).

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187).

It also provides a bit of protection against things like NoSuchMethodErrors, 
which Tika can throw if the mime-type detection code tries to use a parser 
that we exclude in order to keep the Hadoop job jar size to something 
reasonable.
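
[For readers without the source handy, a minimal sketch of the wrap-and-orphan 
pattern described above - illustrative only, not Bixo's actual TikaCallable; 
the class and method names are invented for the example:]

    import java.io.InputStream;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TimedTikaParse {

        // Daemon threads, so an orphaned (hung) parse can't keep the JVM alive.
        private static final ExecutorService POOL =
            Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r, "tika-parse");
                t.setDaemon(true);
                return t;
            });

        public static String parse(InputStream stream, long timeoutSecs)
                throws Exception {
            Future<String> result = POOL.submit(() -> {
                BodyContentHandler handler = new BodyContentHandler(-1);
                new AutoDetectParser().parse(stream, handler, new Metadata(),
                                             new ParseContext());
                return handler.toString();
            });
            try {
                // An Error thrown inside the callable (e.g. NoSuchMethodError)
                // comes back wrapped in an ExecutionException here, failing
                // this one document instead of killing the task JVM.
                return result.get(timeoutSecs, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // Best-effort interrupt, then orphan the thread; a truly hung
                // parser may never notice the interrupt.
                result.cancel(true);
                throw e;
            }
        }
    }

[Note that cancel(true) only interrupts; a wedged parser may never check the 
interrupt flag, so the daemon flag here plus the bounded JVM reuse discussed 
earlier in the thread is what actually reclaims those threads.]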

-- Ken


--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


Re: robust Tika and Hadoop

2015-07-15 Thread Chris Mattmann
I would add Nutch to the list too, Tim :-)

+1 from me.

—
Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Allison, Timothy B. <talli...@mitre.org>
Reply-To: user@tika.apache.org
Date: Wednesday, July 15, 2015 at 4:38 AM
To: user@tika.apache.org
Subject: robust Tika and Hadoop

All,
 
  I’d like to fill out our Wiki a bit more on using Tika robustly within
Hadoop.  I’m aware of Behemoth [0], Nanite [1] and Morphlines [2].  I
haven’t looked carefully into these packages yet.
 
  Does anyone have any recommendations for specific configurations/design
patterns that will defend against OOM errors and permanent hangs within Hadoop?
  
  Thank you!
 
Best,
 
  Tim
 
 
[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/