I've made a few improvements to ForkParser performance in an internal fork. I will try to contribute them upstream...

On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I am attempting to Tika parse dozens of millions of office documents. Pdfs,
> docs, excels, xmls, etc. Wide assortment of types.
>
> Throughput is very important. I need to be able to parse these files in a
> reasonable amount of time, but at the same time, accuracy is also pretty
> important. I hope to have less than 10% of the documents parsed fail. (And
> by fail I mean fail due to tika stability, like a timeout while parsing. I
> do not mean fail due to the document itself).
>
> My question - how to configure Tika Server in a containerized environment
> to maximize throughput?
>
> My environment:
>
> - I am using Openshift.
> - Each tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory: *8
>   GiB to 10 GiB*.
> - I have 10 tika parsing pod replicas.
>
> On each pod, I run a java program where I have 8 parse threads.
>
> Each thread:
>
> - Starts a single tika server process (in spawn child mode)
>   - Tika server arguments: -s -spawnChild -maxChildStartupMillis 120000
>     -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500
>     -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures
>     -enableFileUrl
> - The thread will now continuously grab a file from the files-to-fetch
>   queue and will send it to the tika server, stopping when there are no
>   more files to parse.
>
> Each of these files is stored locally on the pod in a buffer, so the local
> file optimization is used:
>
> The Tika web service it is using is:
>
> Endpoint: `/rmeta/text`
> Method: `PUT`
> Headers:
> - writeLimit = 32000000
> - maxEmbeddedResources = 0
> - fileUrl = file:///path/to/file
>
> Files are no greater than 100Mb, the maximum number of bytes tika text will
> be (writeLimit) 32Mb.
>
> Each pod is parsing about 370,000 documents per day. I've been messing with
> a ton of different attempts at settings.
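
To put the quoted numbers in per-second terms, here is a rough back-of-the-envelope sketch in Java. It only uses figures from the message (370,000 documents per day per pod, 8 parse threads, 10 pods, -JXmx512m). The class and method names are made up for illustration, and it assumes all 8 threads are busy around the clock and share the load evenly, which may not hold in practice.

```java
// Hypothetical helper; names are illustrative, not part of Tika.
final class ThroughputEstimate {
    static final double SECONDS_PER_DAY = 86_400.0;

    // Average documents per second handled by one pod.
    static double docsPerSecondPerPod(long docsPerDayPerPod) {
        return docsPerDayPerPod / SECONDS_PER_DAY;
    }

    // Average wall-clock seconds one thread spends per document,
    // assuming every thread is busy all day and the load is even.
    static double secondsPerDocPerThread(long docsPerDayPerPod, int threads) {
        return threads * SECONDS_PER_DAY / docsPerDayPerPod;
    }

    // Total documents per day across all pod replicas.
    static long clusterDocsPerDay(long docsPerDayPerPod, int pods) {
        return docsPerDayPerPod * pods;
    }

    // Sum of the max heaps of the forked children only (-JXmx value per child);
    // parent server processes and non-heap memory are not counted.
    static long childHeapMiB(int threads, long xmxMiBPerChild) {
        return threads * xmxMiBPerChild;
    }

    public static void main(String[] args) {
        long docsPerDay = 370_000L; // per pod, from the message
        int threads = 8;            // parse threads per pod
        int pods = 10;              // pod replicas

        System.out.printf("docs/sec per pod:       %.2f%n", docsPerSecondPerPod(docsPerDay));
        System.out.printf("sec/doc per thread:     %.2f%n", secondsPerDocPerThread(docsPerDay, threads));
        System.out.printf("cluster docs per day:   %d%n", clusterDocsPerDay(docsPerDay, pods));
        System.out.printf("child heap total (MiB): %d%n", childHeapMiB(threads, 512));
    }
}
```

Under those assumptions this works out to a little over 4 documents per second per pod, or just under 2 seconds per document per thread, and about 3.7 million documents per day across the 10 pods.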
>
> I previously tried to use the actual Tika "ForkParser" but the performance
> was far worse than spawning tika servers. So that is why I am using Tika
> Server.
>
> I don't hate the performance results of this.... but I feel like I'd better
> reach out and make sure there isn't someone out there who sanity checks my
> numbers and is like "woah that's awful performance, you should be getting
> xyz like me!"
>
> Anyone have any similar things you are doing? If so, what settings did you
> end up settling on?
>
> Also, I'm wondering if Apache Http Client would be causing any overhead
> here when I am calling to my Tika Server /rmeta/text endpoint. I am using a
> shared connection pool. Would there be any benefit in say using a unique
> HttpClients.createDefault() for each thread instead of sharing a connection
> pool between the threads?
>
> Cross posted question here as well
> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
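
On the HTTP client question, here is a minimal sketch of the request described in the quoted message, written against the JDK's built-in `java.net.http.HttpClient` (Java 11+) rather than Apache HttpClient, so it is a stand-in and not the poster's actual code. The endpoint, method, and the three header names and values come from the message. The class name, the `http://localhost:9998` base URL, the empty request body, the 10-second connect timeout, and reusing the 120000 ms task timeout as the request timeout are all assumptions for illustration. A single client instance is shared across threads here; whether one client per thread would be any faster is something to measure, not something this sketch shows.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper; names are illustrative, not part of Tika.
final class TikaRmetaRequests {
    // One client instance that all parse threads can reuse.
    static final HttpClient SHARED_CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10)) // assumed value
            .build();

    // Builds the PUT /rmeta/text request with the headers from the message.
    static HttpRequest buildRequest(URI serverBase, Path localFile) {
        return HttpRequest.newBuilder()
                .uri(serverBase.resolve("/rmeta/text"))
                .timeout(Duration.ofMillis(120_000)) // assumed: mirrors -taskTimeoutMillis
                .header("writeLimit", "32000000")
                .header("maxEmbeddedResources", "0")
                .header("fileUrl", localFile.toUri().toString())
                .PUT(HttpRequest.BodyPublishers.noBody()) // assumed: file is read via fileUrl
                .build();
    }

    // Sends the request and returns the response body; needs a running server.
    static String parse(URI serverBase, Path localFile) throws Exception {
        HttpResponse<String> response = SHARED_CLIENT.send(
                buildRequest(serverBase, localFile),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) {
        // Only builds and prints the request; nothing is sent.
        HttpRequest request = buildRequest(
                URI.create("http://localhost:9998"), Path.of("/path/to/file"));
        System.out.println(request.method() + " " + request.uri());
        request.headers().map().forEach((name, values) ->
                System.out.println(name + ": " + values));
    }
}
```

Each parse thread would call `TikaRmetaRequests.parse(...)` with the base URL of its own server process; separating `buildRequest` from `parse` makes the request shape easy to check without a server running.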