In an earlier version of tika-batch, we had a single AutoDetectParser per thread, and we had no problems. I also experimented with sharing a single AutoDetectParser across all of the threads, again with no problems.
Because of configuration issues, tika-batch is now creating a new parser for each file. In our unit test suite, the last time I experimented with this, the first initialization did take a while, but after that there was no measurable extra cost to instantiating a new parser. In short, we didn't save anything by using a static AutoDetectParser instead of just instantiating a new one for each unit test.

If you are going from file system to file system, you might want to consider tika-batch:

java -jar tika-app.jar -i <input_dir> -o <output_dir>

If you have a whole lot of files (millions), try to isolate Tika in its own JVM or server or data center; bad things can happen. See slide 17:
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
And:
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/

From: Haris Osmanagic [mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:54 AM
To: user@tika.apache.org
Subject: Re: Is creating new AutoDetectParsers expensive?

I read the first sentence and thought: "Yes! I can save us a bunch of memory!" Then I read the second: "Oh, oh, do I dare try it out?" : )

Thank you very much for the super-speedy response!

On Fri, Sep 30, 2016 at 4:46 PM Allison, Timothy B. <talli...@mitre.org> wrote:

You can reuse AutoDetectParser in a multithreaded environment. You shouldn't have problems with performance or thread safety. If you find otherwise, please let us know! ☺

From: Haris Osmanagic [mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:36 AM
To: user@tika.apache.org
Subject: Is creating new AutoDetectParsers expensive?

Hi all!

Let's assume there are really many files to be parsed, and the operation is repeated a relatively large number of times each day. Is it, in that case, too expensive to create a new AutoDetectParser for every file?
Or, in other words, if I were to reuse an AutoDetectParser for a large number of files, would I:

* Have problems with thread-safety?
* Have problems with performance?

Thank you very much!

Haris Osmanagić
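A minimal, self-contained sketch of the reuse pattern discussed in this thread: construct one parser instance up front and let every worker thread share it, rather than creating a parser per file. To keep the example runnable without Tika on the classpath, the `Parser` class below is a hypothetical stand-in for `org.apache.tika.parser.AutoDetectParser` (which the thread confirms is thread-safe); in real code you would share one `AutoDetectParser` the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedParserDemo {

    // Stand-in for Tika's AutoDetectParser: constructed once, then shared.
    // It only counts invocations, so the example runs without Tika installed.
    static class Parser {
        private final AtomicInteger calls = new AtomicInteger();

        String parse(String doc) {
            calls.incrementAndGet();
            return "parsed:" + doc;
        }

        int callCount() {
            return calls.get();
        }
    }

    public static void main(String[] args) throws Exception {
        Parser shared = new Parser();                  // single instance for all threads
        ExecutorService pool = Executors.newFixedThreadPool(4);

        List<Callable<String>> jobs = new ArrayList<>();
        for (String doc : new String[] {"a.pdf", "b.docx", "c.html", "d.txt"}) {
            jobs.add(() -> shared.parse(doc));         // every worker reuses the same parser
        }

        // invokeAll returns futures in the same order the jobs were submitted.
        for (Future<String> f : pool.invokeAll(jobs)) {
            System.out.println(f.get());
        }
        pool.shutdown();

        System.out.println("parse calls: " + shared.callCount());
    }
}
```

With a real `AutoDetectParser`, `parse(...)` would take an `InputStream`, a `ContentHandler`, and a `Metadata` object per file; those per-call objects are cheap, and it is the parser itself that is worth building once and sharing.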