Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Luís Filipe Nassif Wed, 25 Nov 2020 18:23:33 -0800

Not what you asked but related :)

Luis


Em qua, 25 de nov de 2020 23:20, Luís Filipe Nassif <[email protected]>
escreveu:

> I've done some few improvements in ForkParser performance in an internal
> fork. Will try to contribute upstream...
>
> Em seg, 23 de nov de 2020 12:05, Nicholas DiPiazza <
> [email protected]> escreveu:
>
>> I am attempting to Tika parse dozens of millions of office documents.
>> Pdfs,
>> docs, excels, xmls, etc. Wide assortment of types.
>>
>> Throughput is very important. I need to be able parse these files in a
>> reasonable amount of time, but at the same time, accuracy is also pretty
>> important. I hope to have less than 10% of the documents parsed fail. (And
>> by fail I mean fail due to tika stability, like a timeout while parsing. I
>> do not mean fail due to the document itself).
>>
>> My question - how to configure Tika Server in a containerized environment
>> to maximize throughput?
>>
>> My environment:
>>
>>    - I am using Openshift.
>>    - Each tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory: *8
>>    GiB to 10 GiB*.
>>    - I have 10 tika parsing pod replicas.
>>
>> On each pod, I run a java program where I have 8 parse threads.
>>
>> Each thread:
>>
>>    - Starts a single tika server process (in spawn child mode)
>>       - Tika server arguments: -s -spawnChild -maxChildStartupMillis
>> 120000
>>       -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500
>>       -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures
>> -enableFileUrl
>>    - The thread will now continuously grab a file from the files-to-fetch
>>    queue and will send it to the tika server, stopping when there are no
>> more
>>    files to parse.
>>
>> Each of these files are stored locally on the pod in a buffer, so the
>> local
>> file optimization is used:
>>
>> The Tika web service it is using is:
>>
>> Endpoint: `/rmeta/text`
>> Method: `PUT`
>> Headers:    - writeLimit = 32000000    - maxEmbeddedResources = 0    -
>> fileUrl = file:///path/to/file
>>
>> Files are no greater than 100Mb, the maximum number of bytes tika text
>> will
>> be (writeLimit) 32Mb.
>>
>> Each pod is parsing about 370,000 documents per day. I've been messing
>> with
>> a ton of different attempts at settings.
>>
>> I previously tried to use the actual Tika "ForkParser" but the performance
>> was far worse than spawning tika servers. So that is why I am using Tika
>> Server.
>>
>> I don't hate the performance results of this.... but I feel like I'd
>> better
>> reach out and make sure there isn't someone out there who sanity checks my
>> numbers and is like "woah that's awful performance, you should be getting
>> xyz like me!"
>>
>> Anyone have any similar things you are doing? If so, what settings did you
>> end up settling on?
>>
>> Also, I'm wondering if Apache Http Client would be causing any overhead
>> here when I am calling to my Tika Server /rmeta/text endpoint. I am using
>> a
>> shared connection pool. Would there be any benefit in say using a unique
>> HttpClients.createDefault() for each thread instead of sharing a
>> connection
>> pool between the threads?
>>
>>
>> Cross posted question here as well
>>
>> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
>>
>

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Reply via email to