[ https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617879#comment-17617879 ]
Ethan Wilansky commented on TIKA-3880:
--------------------------------------

I'll try the config you posted.

> Tika not picking-up setByteArrayMaxOverride from tika-config
> ------------------------------------------------------------
>
>                 Key: TIKA-3880
>                 URL: https://issues.apache.org/jira/browse/TIKA-3880
>             Project: Tika
>          Issue Type: Improvement
>          Components: app
>    Affects Versions: 2.5.0
>         Environment: We are running this through Docker on a machine with
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from:
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with
> Docker.
>
>            Reporter: Ethan Wilansky
>            Priority: Blocker
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">700000000</param>
>     </params>
>   </parser>
> </properties>
>
> I've also verified that tika-config.xml is being picked up by Tika on
> startup:
> org.apache.tika.server.core.TikaServerProcess Using custom config: /tika-config.xml
>
> However, when I encounter a very large docx file, I can clearly see that the
> setting in tika-config.xml is not being applied:
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an
> array of length 686,679,089, but the maximum length for this record type is
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to
> request
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>
> I understand that this is a very large docx file. However, we can handle this
> amount of text extraction and are fine with the time it takes for Tika to
> perform this extraction and the amount of memory required to complete it.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
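
For reference, here is a sketch of how a per-parser parameter is commonly nested in a Tika 2.x tika-config.xml, with the <parser> element inside a <parsers> element rather than directly under <properties>. It reuses the class name, parameter name, and value from the description above. The DefaultParser entry with a parser-exclude is an assumption about the rest of the setup, added so the OOXML parser is not registered twice. Whether byteArrayMaxOverride is honored by OOXMLParser in 2.5.0 is not confirmed here, so this is something to try, not a verified fix.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: keep the default parsers, but exclude the OOXML parser
         so the explicitly configured instance below is the one used. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- Same parameter and value as in the issue description. -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The value 700000000 is above the 686,679,089 bytes requested in the stack trace, so if the override is applied, the allocation in that trace should no longer hit the 100,000,000 limit.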