Not what you asked, but related :)

Luis

On Wed, Nov 25, 2020 at 23:20, Luís Filipe Nassif <[email protected]> wrote:

> I've made a few improvements to ForkParser performance in an internal
> fork. I will try to contribute them upstream...
>
> On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza <[email protected]> wrote:
>
>> I am attempting to parse tens of millions of office documents with Tika:
>> PDFs, docs, Excel files, XMLs, etc. A wide assortment of types.
>>
>> Throughput is very important. I need to be able to parse these files in a
>> reasonable amount of time, but at the same time, accuracy is also pretty
>> important. I hope to have fewer than 10% of the documents fail to parse.
>> (And by fail I mean fail due to Tika stability, like a timeout while
>> parsing. I do not mean fail due to the document itself.)
>>
>> My question: how do I configure Tika Server in a containerized
>> environment to maximize throughput?
>>
>> My environment:
>>
>> - I am using OpenShift.
>> - Each Tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory: *8
>>   GiB to 10 GiB*.
>> - I have 10 Tika parsing pod replicas.
>>
>> On each pod, I run a Java program with 8 parse threads.
>>
>> Each thread:
>>
>> - Starts a single Tika server process (in spawn child mode)
>>   - Tika server arguments: -s -spawnChild -maxChildStartupMillis 120000
>>     -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500
>>     -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures
>>     -enableFileUrl
>> - The thread then continuously grabs a file from the files-to-fetch
>>   queue and sends it to the Tika server, stopping when there are no more
>>   files to parse.
>>
>> Each of these files is stored locally on the pod in a buffer, so the
>> local file optimization is used.
>>
>> The Tika web service it is using is:
>>
>> Endpoint: `/rmeta/text`
>> Method: `PUT`
>> Headers:
>> - writeLimit = 32000000
>> - maxEmbeddedResources = 0
>> - fileUrl = file:///path/to/file
>>
>> Files are no greater than 100 MB, and the maximum size of the extracted
>> text (writeLimit) is 32 MB.
>>
>> Each pod is parsing about 370,000 documents per day. I've been messing
>> with a ton of different settings.
>>
>> I previously tried to use the actual Tika "ForkParser", but the
>> performance was far worse than spawning Tika servers. That is why I am
>> using Tika Server.
>>
>> I don't hate the performance results of this... but I feel like I'd
>> better reach out and make sure there isn't someone out there who sanity
>> checks my numbers and is like "whoa, that's awful performance, you should
>> be getting xyz like me!"
>>
>> Is anyone doing anything similar? If so, what settings did you end up
>> settling on?
>>
>> Also, I'm wondering whether Apache HttpClient could be causing any
>> overhead when I call my Tika Server /rmeta/text endpoint. I am using a
>> shared connection pool. Would there be any benefit in, say, using a
>> unique HttpClients.createDefault() for each thread instead of sharing a
>> connection pool between the threads?
>>
>> Cross-posted question here as well:
>>
>> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
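
For readers following the thread, the per-thread request loop described in the quoted message could be sketched roughly as below. This is only an illustration under stated assumptions, not the poster's actual code: it uses the JDK's built-in java.net.http.HttpClient rather than Apache HttpClient, the class and method names (TikaRmetaSketch, buildRequest, drain) are hypothetical, the base URL http://localhost:9998 assumes a server listening on Tika's default port, and it assumes an empty PUT body is acceptable when the fileUrl header is set. The header names and values and the 120000 ms timeout are taken from the message.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch; names are illustrative, not from the thread.
class TikaRmetaSketch {

    /** Builds the PUT /rmeta/text request with the headers listed in the message. */
    public static HttpRequest buildRequest(String serverBase, String fileUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverBase + "/rmeta/text"))
                // Client-side timeout mirroring -taskTimeoutMillis 120000.
                .timeout(Duration.ofMillis(120_000))
                .header("writeLimit", "32000000")
                .header("maxEmbeddedResources", "0")
                .header("fileUrl", fileUrl)
                // Assumption: the server reads the document from fileUrl, so no body is sent.
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    /** Sends one request per queued file URL until the queue is empty; returns the success count. */
    public static int drain(HttpClient client, String serverBase, BlockingQueue<String> fileUrls)
            throws InterruptedException {
        int ok = 0;
        String fileUrl;
        while ((fileUrl = fileUrls.poll()) != null) {
            try {
                HttpResponse<String> resp = client.send(
                        buildRequest(serverBase, fileUrl),
                        HttpResponse.BodyHandlers.ofString());
                if (resp.statusCode() == 200) {
                    ok++;
                } else {
                    System.err.println(fileUrl + " -> HTTP " + resp.statusCode());
                }
            } catch (IOException e) {
                // Covers timeouts and connection failures, e.g. while a child process restarts.
                System.err.println(fileUrl + " failed: " + e);
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        // Only builds and prints a request; no server is contacted here.
        HttpRequest r = buildRequest("http://localhost:9998", "file:///path/to/file");
        System.out.println(r.method() + " " + r.uri());
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("file:///path/to/file");
        System.out.println("queued files: " + queue.size());
    }
}
```

Because the client is passed into drain, the two variants raised in the question (one client shared by all parse threads versus one client per thread) can be compared by changing only where the client is created, which makes it straightforward to measure rather than guess whether the shared pool adds overhead.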
