RE: Maximizing performance when parsing a lot of files

Allison, Timothy B. Fri, 25 Sep 2015 05:29:04 -0700

It's best to keep Tika in its own JVM.

If you are working filesystem to filesystem, the simplest thing to do would 
be to call tika-batch via the command line of tika-app every so often.  By 
default, tika-batch skips files that it has already processed when you run it 
again, but you will pay the small performance cost of crawling the entire 
directory on each run and checking whether there is an output file for each 
input file.
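
A minimal sketch of what "call tika-batch every so often" could look like from 
a separate Java process, so that Tika itself stays in its own child JVM. It 
assumes tika-app's batch mode is selected with -i (input directory) and -o 
(output directory). The class name, method names, jar location and directory 
names are hypothetical placeholders, not part of Tika.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TikaBatchLauncher {

    // Builds the command line for a child JVM running tika-app in batch mode.
    // Assumption: "-i" names the input directory and "-o" the output directory.
    public static List<String> buildCommand(Path tikaAppJar, Path inputDir, Path outputDir) {
        return Arrays.asList(
                "java", "-jar", tikaAppJar.toString(),
                "-i", inputDir.toString(),
                "-o", outputDir.toString());
    }

    // Runs one batch pass in a separate JVM and blocks until it exits.
    public static int runOnce(Path tikaAppJar, Path inputDir, Path outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tikaAppJar, inputDir, outputDir))
                .inheritIO() // forward the child's stdout/stderr to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical paths; adjust them to your installation.
        int exitCode = runOnce(
                Paths.get("tika-app.jar"), Paths.get("input"), Paths.get("output"));
        System.out.println("tika-batch exited with code " + exitCode);
    }
}
```

Calling runOnce from a scheduler (cron, or a ScheduledExecutorService) gives 
the "every so often" behavior; each run only parses the files that do not yet 
have an output file.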


If you think this is a common enough use case, and I do, I'm wondering if it 
would make sense for us to experiment with adding a WatchService to 
tika-batch...Scratch that...probably wouldn't scale ("This API is not designed 
for indexing a hard drive. Most file system implementations have native support 
for file change notification."[0]).  Instead, I'm wondering if we could have 
the crawler automatically rerun from the start directory until the user tells 
tika-batch to stop or until no new files have been processed for X minutes.
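
To make the rerun idea concrete, here is a rough sketch of the loop I have in 
mind. It is not existing tika-batch code. The class name, runUntilIdle and the 
crawlPass callback (which returns how many new files one crawl processed) are 
all hypothetical, and "the user tells it to stop" is modeled simply as 
interrupting the thread.

```java
import java.util.function.IntSupplier;

public class RepeatingCrawl {

    // Repeats crawlPass until no pass has processed a new file for idleMillis,
    // or until the calling thread is interrupted (the "user says stop" case).
    // crawlPass is a hypothetical callback returning the number of files
    // processed in one crawl of the start directory.
    public static int runUntilIdle(IntSupplier crawlPass, long idleMillis, long pauseMillis) {
        int total = 0;
        long lastProgress = System.nanoTime();
        while (true) {
            int processed = crawlPass.getAsInt();
            if (processed > 0) {
                total += processed;
                lastProgress = System.nanoTime(); // reset the idle clock
            } else if ((System.nanoTime() - lastProgress) / 1_000_000 >= idleMillis) {
                return total; // nothing new for long enough: give up
            }
            try {
                Thread.sleep(pauseMillis); // wait before crawling again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the stop request
                return total;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in crawl: finds 2 files, then 1, then nothing more.
        int[] batches = {2, 1};
        int[] next = {0};
        IntSupplier fakeCrawl = () -> next[0] < batches.length ? batches[next[0]++] : 0;
        int total = runUntilIdle(fakeCrawl, 200, 20);
        System.out.println("Processed " + total + " files before going idle");
    }
}
```

In a real setup crawlPass would be one full crawl of the start directory, and 
idleMillis would be the "X minutes" above.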
 
If you are going db to db...that's another area for growth in tika-batch.

Finally, the real "big data" solution is probably to go with Spark and friends.

[0] https://docs.oracle.com/javase/tutorial/essential/io/notification.html

-----Original Message-----
From: zahlenm...@gmx.de [mailto:zahlenm...@gmx.de] 
Sent: Friday, September 25, 2015 7:33 AM
To: user@tika.apache.org
Subject: Maximizing performance when parsing a lot of files

So I have thousands of files to be run through Tika. Unfortunately, these are 
not available all at once but are "created" one by one. My tests have shown 
that the creator process is faster than Tika. So now I am wondering how I 
should combine the creator and parser processes to speed things up.
Btw, the creator is completely separate; otherwise I would include the parser 
calls directly in it. But this is not possible.
To achieve some kind of parallelism I thought of two options:
1) Spawn a new small Java code piece which parses a file
2) Send the file to Tika Jaxrs Server
But since the creator is so fast, it would fire off multiple calls to Tika per 
second. On the other hand, I don't want to wait for the creator to finish, 
because it runs for hours and in the meantime I could already start parsing.
Any ideas?
