> - how many connections will it accept before not accepting new
> connections?

You will not hit Jetty's maximum request limits. You will hit CPU
saturation or out-of-memory conditions long before that.

> how many files can be scanned in parallel?

That depends entirely on the files you are parsing. Empirical testing is
the only way to tell.

> what is the return code to expect when there is contention on the server?

You will get timeouts when there is too much contention, and if you are
making the spawned Tika servers run out of memory with too many files, you
will see them keep restarting. CPU contention shows up only as sluggishness
and failure to respond. There are no error codes.

> - is it a safe assumption that for connections to be dropped, CPU will be
> saturated?

Any of three things can happen: 1) CPU saturation, 2) OOM, 3) an infinite
loop due to a parser bug.

The naive solution is to set your timeout to something reasonable, like
30 to 60 seconds, and then retry the documents that fail.
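
A minimal sketch of that timeout-and-retry approach, in case it helps. The
URL (tika-server on localhost:9998, PUT to /tika), the function names, and
the backoff values are assumptions for illustration, not anything Tika
prescribes. Since the server returns no error code under contention, the
client just treats a timeout or dropped connection as "try again later".

```python
import time
import urllib.request


def put_to_tika(url, data, timeout):
    """PUT raw file bytes to a tika-server endpoint and return the response body."""
    req = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def parse_with_retry(data, url="http://localhost:9998/tika", timeout=60,
                     attempts=3, backoff=5.0, send=put_to_tika,
                     sleep=time.sleep):
    """Try a parse up to `attempts` times, waiting longer between tries.

    OSError covers socket timeouts, refused or reset connections, and
    urllib's URLError (including HTTP error responses), so all of those
    are retried here.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send(url, data, timeout)
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                # Give a restarting server some time to come back up.
                sleep(backoff * attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

A document that still fails after the last attempt is worth logging and
setting aside, since it may be the one that is crashing the server.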

I dealt with all of this for many years, then last year I changed to a
new, much more successful strategy:

Kubernetes/Docker: deploy many tika-server instances, each in a
2 CPU / 4 GB Kubernetes pod.

Then add these server URLs to a resource pool, where each thread that needs
to parse checks out a server, then checks it back in.

By giving each thread its very own Tika server, you prevent the situation
where thread A submits an Excel document that causes an OOM error and
blows up all the active parses for threads B, C, D, E, F, etc.
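
Roughly, the check-out/check-in pool can look like the sketch below. The
class name and the pod URLs are made up for illustration; the only real
idea is a thread-safe queue of server URLs, so that a server is never
shared by two threads at once.

```python
import queue
from contextlib import contextmanager


class TikaServerPool:
    """Thread-safe pool of tika-server base URLs (hypothetical helper)."""

    def __init__(self, urls):
        self._idle = queue.Queue()
        for url in urls:
            self._idle.put(url)

    @contextmanager
    def checkout(self, wait=None):
        # Blocks until a server is free; raises queue.Empty if `wait`
        # seconds pass without one becoming available.
        url = self._idle.get(timeout=wait)
        try:
            yield url
        finally:
            # Always check the server back in, even if the parse failed.
            self._idle.put(url)


# Example URLs only; in practice these would be your pod addresses.
pool = TikaServerPool(["http://tika-0:9998", "http://tika-1:9998"])

with pool.checkout() as server_url:
    # Send exactly one document to server_url here.
    pass
```

Because each server only ever holds one document at a time, a crash or
restart points directly at the file that caused it.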

Then we created Tika Pipes in Tika 2.0.0 to do this in a more graceful
way: you create a Fetch/Emit pipeline and push the files you want to parse
into a queue. They are parsed asynchronously, and the parsed output is
emitted once each parse completes.
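
To make the shape of that concrete, here is a toy fetch/parse/emit queue.
This is not the Tika Pipes API or its configuration; every name in it is
hypothetical, and it uses threads in one process purely to show the flow.

```python
import queue
import threading


def run_pipeline(keys, fetch, parse, emit, workers=2):
    """Queue up keys, then fetch, parse and emit each one on worker threads."""
    tasks = queue.Queue()
    for key in keys:
        tasks.put(key)

    def worker():
        while True:
            try:
                key = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                result = parse(fetch(key))
            except Exception as exc:
                # Emit the failure so one bad file does not stop the rest.
                result = exc
            emit(key, result)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The caller only supplies where files come from (fetch) and where results
go (emit); the queue decides when each file actually gets parsed.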

When I dig out from a bunch of unrelated work, I hope to have a YouTube
video and a corresponding wiki article that show how Tika Pipes works. It's
likely exactly what you are looking for. I'll make sure to send you a link
when they are done.


On Wed, Jun 23, 2021 at 6:47 AM Tim Allison <[email protected]> wrote:

>
> Sorry… Sergey Beryozkin
>
> On Wed, Jun 23, 2021 at 6:46 AM Tim Allison <[email protected]> wrote:
>
>> Hi Cristi,
>>
>>    I regret that I don't have precise answers for these questions.
>> tika-server uses Apache cxf and most of your questions are handled at
>> that level.  There is no logic in Tika for number of connections,
>> identifying contention or even keeping track of the number of parallel
>> requests.
>>
>>    If you're running in --spawnChild mode in 1.x or running in default
>> in 2.x, the server can go down and drop connections if a file has
>> caused a catastrophic problem (timeout, oom or other crash), but that
>> doesn't necessarily mean that CPU will be saturated.
>>
>>    In practice, I've found that it is better to run multiple
>> tika-servers (on different ports?) and have one tika-server per client
>> so that you effectively avoid multithreading...this also enables you
>> to know which file caused a catastrophic problem.  If you're running
>> multiple requests on a single server, and one of the files causes a
>> shutdown/restart, you won't know which of the active files caused the
>> problem.
>>
>>    Nicholas DiPiazza has experience with pegging tika-servers.  He
>> might be willing to chime in?
>>
>>    Sergey Beryokin is our cxf expert...he might have better insight on
>> the cxf layer.
>>
>>    The above input applies to the standard /tika, /rmeta endpoints.
>> The new pipes /pipes and /async handlers fork multiple sub-processes
>> and do the parsing there.  I have not yet experimented with
>> overwhelming them in practice/production, but the /async handler at
>> least has a return value for "queue is full, please don't send any
>> more requests".
>>
>>      Best,
>>
>>           Tim
>>
>> On Tue, Jun 22, 2021 at 3:28 AM Cristian Zamfir <[email protected]>
>> wrote:
>> >
>> > Hello, please let me know if somebody has looked into this or I should
>> look at the source code instead? Thanks!
>> >
>> > On Fri, Jun 18, 2021 at 5:04 PM Cristian Zamfir <[email protected]>
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have a few questions about the concurrency level of tika-server in
>> the default configuration:
>> >> - how many connections will it accept before not accepting new
>> connections?
>> >> - how many files can be scanned in parallel?
>> >> - what is the return code to expect when there is contention on the
>> server?
>> >> - is it a safe assumption that for connections to be dropped, CPU will
>> be saturated?
>> >>
>> >> Thanks,
>> >> Cristi
>> >>
>>
>
