[jira] [Commented] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

Ethan Wilansky (Jira) Fri, 14 Oct 2022 09:59:04 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617879#comment-17617879
 ]


Ethan Wilansky commented on TIKA-3880:
--------------------------------------

I'll try the config you posted.

> Tika not picking-up setByteArrayMaxOverride from tika-config
> ------------------------------------------------------------
>
>                 Key: TIKA-3880
>                 URL: https://issues.apache.org/jira/browse/TIKA-3880
>             Project: Tika
>          Issue Type: Improvement
>          Components: app
>    Affects Versions: 2.5.0
>         Environment: We are running this through docker on a machine with 
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: 
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with 
> Docker.
>  
>            Reporter: Ethan Wilansky
>            Priority: Blocker
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">700000000</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that the tika-config.xml is being picked up by Tika on 
> startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: 
> /tika-config.xml
>  
> However, when I encounter a very large docx file, I can clearly see that the 
> byteArrayMaxOverride setting in tika-config.xml is not being applied:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 686,679,089, but the maximum length for this record type is 
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to 
> request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this 
> amount of text extraction and are fine with the time it takes for Tika to 
> perform the extraction and the amount of memory it requires.
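
Note on the configuration quoted above: Tika's configuration examples nest
parser entries inside a <parsers> element under <properties>, which the quoted
snippet does not do. A minimal sketch of that layout follows, assuming Tika 2.x.
The DefaultParser exclusion is an assumption, intended to keep the default
OOXMLParser instance from handling the file instead of the configured one, and
700000000 is simply the value from the report. This sketch has not been
verified against the reporter's setup.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: keep the default parsers, but exclude the stock OOXMLParser
         so that the explicitly configured one below is used. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- OOXMLParser with the POI byte-array limit raised above the
         686,679,089 bytes reported in the stack trace. -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The value must fit in a Java int (at most 2147483647), and the JVM heap has to
be large enough to hold an array of that size plus normal parsing overhead.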



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
