I'm not sure which endpoint you're using, but search for "writeLimit" on this page: https://cwiki.apache.org/confluence/display/TIKA/TikaServer
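As a quick sketch of what that looks like (this assumes tika-server is running on localhost:9998, that your version supports the writeLimit request header on the /rmeta endpoint -- check the page above against your version -- and a placeholder input file big.docx):

// Minimal sketch, untested against your deployment: PUT a file to
// tika-server's /rmeta/text endpoint with a writeLimit header to cap
// how many characters of text are extracted. The file name and the
// limit are placeholders; confirm the header is supported in your
// tika-server version.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class WriteLimitExample {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/rmeta/text"))
                .header("writeLimit", "100000") // cap extracted characters
                .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("big.docx")))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON metadata, possibly truncated text
    }
}

With that header set, extraction should stop after roughly the first 100,000 characters per file, so a pathological compressed file can't balloon the response no matter what its uncompressed size turns out to be.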
As you probably know, many file formats are actually compressed: PDF, docx, etc. For many formats, there is no way to know ahead of time how much text will be extracted. Tika-server should be restarting on OOMs, but they are frustrating, and if you're running in a multithreaded setup, it is impossible to tell which file caused the problem. We designed tika-pipes to isolate parsing _per file_ so that an OOM from one file will not cause problems for others. If you have an interest: https://cwiki.apache.org/confluence/display/tika/tika-pipes

Let me know if any of this is helpful.

On Fri, May 24, 2024 at 11:06 AM Alishah Momin <[email protected]> wrote:

> Hi,
>
> In using Tika Server, I've run into OOM issues with large compressed
> files, which is resulting in reduced availability. Are there any config
> flags available for limiting text extraction based on size? In most
> cases, I would do this by checking the size prior to sending the file
> to Tika, but with compressed files, I don't know the uncompressed size
> before sending it to Tika.
>
> So far I've attempted adding the following to my `tika-config.xml`, but
> I'm not sure if this is a parameter that gets loaded in from the config
> and into the parser. In my testing, I didn't see any effect. I'm also
> not sure if it would help with what I am trying to do, so perhaps
> that's an issue.
>
> <parser class="org.apache.tika.parser.pkg.CompressorParser">
>   <params>
>     <param name="memoryLimitInKb" type="int">100000</param>
>   </params>
> </parser>
>
> I'm currently running Tika Server Standard 2.3.0.
>
> Thanks.
