Ethan Wilansky created TIKA-3880:
------------------------------------

             Summary: Tika not picking-up setByteArrayMaxOverride from tika-config
                 Key: TIKA-3880
                 URL: https://issues.apache.org/jira/browse/TIKA-3880
             Project: Tika
          Issue Type: Improvement
          Components: app
    Affects Versions: 2.5.0
         Environment: We are running this through Docker on a machine with plenty of memory allocated to Docker.
Docker config: 32 GB, 8 processors
Host machine: 64 GB, 32 processors
Our docker-compose configuration is derived from: https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml
We are experienced with Docker and are confident that the issue isn't with Docker.
            Reporter: Ethan Wilansky

I have specified this parser parameter in tika-config.xml:

<properties>
  <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
    <params>
      <param name="byteArrayMaxOverride" type="int">700000000</param>
    </params>
  </parser>
</properties>

I've also verified that tika-config.xml is being picked up by Tika on startup:

org.apache.tika.server.core.TikaServerProcess Using custom config: /tika-config.xml

However, when I encounter a very large docx file, I can clearly see that the setting from tika-config is not being applied:

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 686,679,089, but the maximum length for this record type is 100,000,000. If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. You can set a higher override value with IOUtils.setByteArrayMaxOverride()

I understand that this is a very large docx file. However, we can handle this amount of text extraction, and we are fine with the time it takes Tika to perform the extraction and the amount of memory required to complete it.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
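For comparison, a minimal sketch of a well-formed tika-config.xml carrying this override is below. Note the <parsers> wrapper element, which Tika 2.x's documented configuration format places between <properties> and each <parser>; its absence in the snippet above may be worth ruling out. (This sketch declares only OOXMLParser for brevity; in a real config you would likely also keep DefaultParser so other formats still parse.)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- Tika 2.x expects parser declarations inside a <parsers> element -->
  <parsers>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <!-- Raise POI's byte-array allocation cap (default 100,000,000)
             so very large .docx records can be read -->
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The param name matches POI's IOUtils.setByteArrayMaxOverride(), which the RecordFormatException itself points at as the override mechanism.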