[ 
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617902#comment-17617902
 ] 

Ethan Wilansky commented on TIKA-3880:
--------------------------------------

Thanks for the reference about large file processing, much appreciated!

About my comment, "...text extraction of large files, up to 50 MB file size in 
our case.": yes, we do check file size before sending a file to Tika for 
extraction. My point was that we are intentionally sending large files for 
extraction and want to configure Tika to parse them regardless of file type. We 
are careful not to send files to Tika that can't be parsed; we use the Tika 
detect endpoint to verify the MIME type before sending files for text 
extraction. Regarding your last comment, I will be sure to set the default 
parser element in the config. Thanks again for your responses, Tim.
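For reference, here is a minimal sketch of the tika-config.xml I plan to use, with the default parser element included as you suggested. This assumes the Tika 2.x config format with a <parsers> wrapper and a <parser-exclude> so the OOXML override takes precedence; whether byteArrayMaxOverride is actually honored from this config is of course the open question in this issue.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser handles every type not overridden below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- OOXML parser with a raised byte-array cap for very large docx files -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```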

> Tika not picking-up setByteArrayMaxOverride from tika-config
> ------------------------------------------------------------
>
>                 Key: TIKA-3880
>                 URL: https://issues.apache.org/jira/browse/TIKA-3880
>             Project: Tika
>          Issue Type: Improvement
>          Components: app
>    Affects Versions: 2.5.0
>         Environment: We are running this through docker on a machine with 
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: 
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with 
> Docker.
>  
>            Reporter: Ethan Wilansky
>            Priority: Blocker
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">700000000</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that the tika-config.xml is being picked-up by Tika on 
> startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: 
> /tika-config.xml
>  
> However, when I encounter a very large docx file, I can clearly see that the 
> configuration in tika-config is not being picked-up:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 686,679,089, but the maximum length for this record type is 
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to 
> request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this 
> amount of text extraction and are fine with the time it takes Tika to perform 
> the extraction and the amount of memory required to complete it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
