I'm not sure which endpoint you're using, but search for "writeLimit" on this page: https://cwiki.apache.org/confluence/display/TIKA/TikaServer
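As a quick sketch of what that looks like (this assumes tika-server is running on localhost:9998, that your version supports the writeLimit request header on the /rmeta endpoint -- check the page above against your version -- and a placeholder input file big.docx):

// Minimal sketch, untested against your deployment: PUT a file to
// tika-server's /rmeta/text endpoint with a writeLimit header to cap
// how many characters of text are extracted. The file name and the
// limit are placeholders; confirm the header is supported in your
// tika-server version.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class WriteLimitExample {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/rmeta/text"))
                .header("writeLimit", "100000") // cap extracted characters
                .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("big.docx")))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON metadata, possibly truncated text
    }
}

With that header set, extraction should stop after roughly the first 100,000 characters per file, so a pathological compressed file can't balloon the response no matter what its uncompressed size turns out to be.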
As you probably know, many file formats are actually compressed: PDF, docx, etc. For many formats, there is no way to know ahead of time how much text will be extracted. Tika-server should be restarting on OOMs, but they are frustrating, and if you're running in a multithreaded setup, it is impossible to tell which file caused the problem. We designed tika-pipes to isolate parsing _per file_ so that an OOM from one file will not cause problems for others. If you have an interest: https://cwiki.apache.org/confluence/display/tika/tika-pipes

Let me know if any of this is helpful.

On Fri, May 24, 2024 at 11:06 AM Alishah Momin <[email protected]> wrote:

> Hi,
>
> In using Tika Server, I've run into OOM issues with large compressed
> files, which is resulting in reduced availability. Are there any config
> flags available for limiting text extraction based on size? In most
> cases, I would do this by checking the size prior to sending the file
> to Tika, but with compressed files, I don't know the uncompressed size
> before sending it to Tika.
>
> So far I've attempted adding the following to my `tika-config.xml`, but
> I'm not sure if this is a parameter that gets loaded in from the config
> and into the parser. In my testing, I didn't see any effect. I'm also
> not sure if it would help with what I am trying to do, so perhaps
> that's an issue.
>
> <parser class="org.apache.tika.parser.pkg.CompressorParser">
>   <params>
>     <param name="memoryLimitInKb" type="int">100000</param>
>   </params>
> </parser>
>
> I'm currently running Tika Server Standard 2.3.0.
>
> Thanks.
