Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Luís Filipe Nassif Wed, 25 Nov 2020 18:20:47 -0800

I've made a few improvements to ForkParser performance in an internal
fork. I will try to contribute them upstream...


On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza <
[email protected]> wrote:

> I am attempting to parse tens of millions of office documents with Tika:
> PDFs, Word docs, Excel files, XML files, etc. A wide assortment of types.
>
> Throughput is very important. I need to be able to parse these files in a
> reasonable amount of time, but at the same time, accuracy is also pretty
> important. I hope to have fewer than 10% of the documents fail to parse. (By
> fail I mean fail due to Tika stability, such as a timeout while parsing. I
> do not mean fail due to the document itself.)
>
> My question - how to configure Tika Server in a containerized environment
> to maximize throughput?
>
> My environment:
>
>    - I am using OpenShift.
>    - Each Tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory: *8
>    GiB to 10 GiB*.
>    - I have 10 Tika parsing pod replicas.
>
> On each pod, I run a Java program with 8 parse threads.
>
> Each thread:
>
>    - Starts a single Tika server process (in spawn child mode)
>       - Tika server arguments: -s -spawnChild -maxChildStartupMillis 120000
>       -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500
>       -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures
>       -enableFileUrl
>    - The thread then continuously grabs a file from the files-to-fetch
>    queue and sends it to the Tika server, stopping when there are no more
>    files to parse.
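
The per-thread loop described above might be sketched roughly as follows. This is only an illustration, not the poster's actual program: the class name `ParseWorkers`, the `drain` method, and the `parseOne` callback (standing in for "send this file to this thread's Tika server") are all hypothetical, and starting the Tika server process is left out.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class ParseWorkers {

    /**
     * Drains the queue with a fixed number of worker threads.
     * parseOne is a hypothetical stand-in for the call to the Tika server.
     * Returns how many files were taken off the queue and handed to parseOne.
     */
    static int drain(Queue<Path> filesToFetch, int threads, Consumer<Path> parseOne)
            throws InterruptedException {
        AtomicInteger handled = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                Path next;
                // poll() returns null once the queue is empty, which ends this worker
                while ((next = filesToFetch.poll()) != null) {
                    try {
                        parseOne.accept(next);
                    } catch (RuntimeException e) {
                        // one failed document should not stop the worker
                        System.err.println("parse failed for " + next + ": " + e);
                    } finally {
                        handled.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return handled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<Path> queue = new ConcurrentLinkedQueue<>(
                List.of(Path.of("a.pdf"), Path.of("b.docx"), Path.of("c.xlsx")));
        // 8 threads as in the setup described above; the no-op callback is a placeholder
        int handled = drain(queue, 8, file -> { });
        System.out.println("handled " + handled + " files");
    }
}
```
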
>
> Each of these files is stored locally on the pod in a buffer, so the local
> file optimization (fileUrl) is used.
>
> The Tika web service I am calling is:
>
> Endpoint: `/rmeta/text`
> Method: `PUT`
> Headers:
>    - writeLimit = 32000000
>    - maxEmbeddedResources = 0
>    - fileUrl = file:///path/to/file
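
For reference, the request described above could be built along these lines. This sketch uses the JDK's built-in `java.net.http` client rather than Apache HttpClient, purely to stay dependency-free. The class name `RmetaRequest`, the `localhost` host, and the port are assumptions (each thread's server would listen on its own port), and it assumes an empty request body is acceptable when `fileUrl` is set. The request is only built and printed here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RmetaRequest {

    /** Builds the PUT /rmeta/text request with the three headers listed above. */
    static HttpRequest build(int port, String fileUrl) {
        return HttpRequest.newBuilder(URI.create("http://localhost:" + port + "/rmeta/text"))
                .header("writeLimit", "32000000")      // cap on extracted text
                .header("maxEmbeddedResources", "0")   // do not descend into embedded documents
                .header("fileUrl", fileUrl)            // server reads the file from local disk
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // port 9998 is a placeholder; file:///path/to/file is the placeholder from the post
        HttpRequest request = build(9998, "file:///path/to/file");
        System.out.println(request.method() + " " + request.uri());
        request.headers().map().forEach((name, values) ->
                System.out.println(name + " = " + values));
    }
}
```
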
>
> Files are no larger than 100 MB, and the maximum number of bytes of
> extracted text (writeLimit) is 32 MB.
>
> Each pod is parsing about 370,000 documents per day. I've been trying
> a ton of different settings.
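
To put the 370,000 documents per day figure in more comparable units: it works out to about 4.3 documents per second per pod, or roughly 1.9 seconds per document for each of the 8 threads. The small calculation below just does that arithmetic. The class name `ThroughputMath` is made up, and the 10 million document corpus in `main` is an illustrative number only, not a figure from the post.

```java
public class ThroughputMath {
    static final double SECONDS_PER_DAY = 86_400.0;

    /** Documents per second for one pod, given its daily document count. */
    static double docsPerSecond(double docsPerDay) {
        return docsPerDay / SECONDS_PER_DAY;
    }

    /** Average wall-clock seconds each thread spends per document. */
    static double secondsPerDocPerThread(double docsPerDay, int threads) {
        return threads * SECONDS_PER_DAY / docsPerDay;
    }

    /** Days needed to parse totalDocs at the given per-pod rate across all pods. */
    static double daysToParse(double totalDocs, double docsPerDayPerPod, int pods) {
        return totalDocs / (docsPerDayPerPod * pods);
    }

    public static void main(String[] args) {
        double perPodPerDay = 370_000; // from the post
        int threads = 8;               // from the post
        int pods = 10;                 // from the post
        double corpus = 10_000_000;    // illustrative corpus size, an assumption

        System.out.printf("docs/s per pod: %.2f%n", docsPerSecond(perPodPerDay));
        System.out.printf("seconds per doc per thread: %.2f%n",
                secondsPerDocPerThread(perPodPerDay, threads));
        System.out.printf("days for %.0f docs on %d pods: %.1f%n",
                corpus, pods, daysToParse(corpus, perPodPerDay, pods));
    }
}
```
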
>
> I previously tried to use the actual Tika "ForkParser" but the performance
> was far worse than spawning tika servers. So that is why I am using Tika
> Server.
>
> I don't hate the performance results of this.... but I feel like I'd better
> reach out and make sure there isn't someone out there who sanity checks my
> numbers and is like "woah that's awful performance, you should be getting
> xyz like me!"
>
> Is anyone doing something similar? If so, what settings did you
> end up settling on?
>
> Also, I'm wondering if Apache Http Client could be causing any overhead
> here when I call my Tika Server /rmeta/text endpoint. I am using a
> shared connection pool. Would there be any benefit in, say, using a unique
> HttpClients.createDefault() for each thread instead of sharing a connection
> pool between the threads?
>
>
> Cross posted question here as well
>
> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
>
