In an earlier version of tika-batch, we had a single AutoDetectParser per 
thread, and we had no problems.  I experimented with a single AutoDetectParser 
across the threads, and we didn’t have problems.

Because of configuration issues, tika-batch is now creating a new parser for 
each file.

In our unit test suite, the last time I experimented with this, the first 
initialization did take a while, but after that there was no measurable extra 
cost to instantiating a new parser. In short, we didn’t save anything by using 
a static AutoDetectParser instead of just instantiating a new one for each unit test.
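
If you want to check this on your own setup, one rough way is to time the 
first construction separately from the later ones. The sketch below is only 
an illustration, not how our test suite measured it: the class and method 
names (ConstructionTimer, time) are made up, and it uses a StringBuilder as a 
stand-in factory so that it runs with just the JDK. With Tika on the 
classpath you would pass () -> new AutoDetectParser() instead.

```java
import java.util.function.Supplier;

// Hypothetical helper: times repeated construction of some object.
final class ConstructionTimer {

    /**
     * Calls the factory `rounds` times and returns the elapsed nanoseconds
     * of each call, in order. Index 0 is the first (cold) construction.
     */
    static long[] time(Supplier<?> factory, int rounds) {
        long[] nanos = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            Object instance = factory.get();
            nanos[i] = System.nanoTime() - start;
            if (instance == null) {
                throw new IllegalStateException("factory returned null");
            }
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Stand-in factory (assumption) so this runs without Tika.
        // With Tika available: time(() -> new AutoDetectParser(), 10)
        long[] nanos = time(StringBuilder::new, 10);

        long rest = 0;
        for (int i = 1; i < nanos.length; i++) {
            rest += nanos[i];
        }
        System.out.println("first construction: " + nanos[0] + " ns");
        System.out.println("mean of the rest:   " + rest / (nanos.length - 1) + " ns");
    }
}
```

This is a crude wall-clock measurement, so treat the numbers as a sanity 
check rather than a benchmark; JIT warm-up and class loading are included in 
the first call by design.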

If you are going from file system to file system, you might want to consider 
tika-batch.

java -jar tika-app.jar -i <input_dir> -o <output_dir>

If you have a whole lot of files (millions), try to isolate Tika in its own JVM 
or server or data center; bad things can happen (hangs, out-of-memory errors, 
crashes).  See slide 17: 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf

And: 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/

From: Haris Osmanagic [mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:54 AM
To: user@tika.apache.org
Subject: Re: Is creating new AutoDetectParsers expensive?

I read the first sentence and thought: "Yes! I can save ourselves a bunch of 
memory!"
Then I read the second: "Oh, oh, do I dare try it out?" : )
Thank you very much for the super-speedy response!

On Fri, Sep 30, 2016 at 4:46 PM Allison, Timothy B. 
<talli...@mitre.org> wrote:
You can reuse AutoDetectParser in a multithreaded environment.  You shouldn’t 
have problems with performance or thread safety.

If you find otherwise, please let us know! ☺

From: Haris Osmanagic [mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:36 AM
To: user@tika.apache.org
Subject: Is creating new AutoDetectParsers expensive?

Hi all!
Let's assume there are really many files to be parsed, and the operation is 
repeated a relatively large number of times each day.
Is it, in that case, too expensive to create new AutoDetectParsers for every 
file? Or, in other words, if I were to reuse an AutoDetectParser for a large 
number of files, would I:
* Have problems with thread-safety?
* Have problems with performance?
Thank you very much!
Haris Osmanagić
