Ethan Wilansky created TIKA-3880:
------------------------------------

             Summary: Tika not picking-up setByteArrayMaxOverride from 
tika-config
                 Key: TIKA-3880
                 URL: https://issues.apache.org/jira/browse/TIKA-3880
             Project: Tika
          Issue Type: Improvement
          Components: app
    Affects Versions: 2.5.0
         Environment: We are running this through docker on a machine with 
plenty of memory resources allocated to Docker.

Docker config: 32 GB, 8 processors
Host machine: 64 GB, 32 processors

Our docker-compose configuration is derived from: 
[https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]

We are experienced with Docker and are confident that the issue isn't with 
Docker.

 
            Reporter: Ethan Wilansky


I have specified this parser parameter in tika-config.xml:


<properties>
  <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
    <params>
      <param name="byteArrayMaxOverride" type="int">700000000</param>
    </params>
  </parser>
</properties>
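
As a sanity check on the XML itself (not on how Tika loads it), a minimal sketch along these lines could confirm that the file is well-formed and that the parameter value is readable. The class name TikaConfigCheck and the method findParam are hypothetical, the config is inlined as a string for illustration, and only the JDK's built-in XML parser is used; this does not reproduce Tika's own config handling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TikaConfigCheck {

    /**
     * Returns the trimmed text of <param name="paramName"> nested under
     * <parser class="parserClass">, or null if no such element exists.
     * Throws if the XML is not well-formed.
     */
    public static String findParam(String xml, String parserClass, String paramName)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue; // a different parser's settings
            }
            NodeList params = parser.getElementsByTagName("param");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                if (paramName.equals(param.getAttribute("name"))) {
                    return param.getTextContent().trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Inlined copy of the config for illustration; a real check would read /tika-config.xml.
        String xml =
                "<properties>"
              + "<parser class=\"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\">"
              + "<params>"
              + "<param name=\"byteArrayMaxOverride\" type=\"int\">700000000</param>"
              + "</params>"
              + "</parser>"
              + "</properties>";

        String value = findParam(xml,
                "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
                "byteArrayMaxOverride");
        System.out.println("byteArrayMaxOverride = " + value);
    }
}
```

A null result or a parse exception would point at the file rather than at Tika; a value of 700000000 only shows that the XML is readable, not that Tika applies it.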
 
I've also verified that the tika-config.xml is being picked up by Tika on 
startup:
  org.apache.tika.server.core.TikaServerProcess Using custom config: 
/tika-config.xml
 
However, when Tika parses a very large docx file, the error below shows that 
the byteArrayMaxOverride setting from tika-config is not being applied:
 
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
array of length 686,679,089, but the maximum length for this record type is 
100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to 
request 
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
 
I understand that this is a very large docx file. However, we can handle this 
amount of text extraction and are fine with the time it takes Tika to 
perform the extraction and with the amount of memory it requires.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
