[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

Xiaohong Yang (Jira) Wed, 11 Aug 2021 06:12:08 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397345#comment-17397345
 ]


Xiaohong Yang commented on TIKA-3519:
-------------------------------------

I tried org.apache.tika.sax.WriteOutContentHandler with writeLimit in a test 
program and found that it is one of the features we want. However, I 
noticed that setting writeLimit does not help to avoid the 
ByteArrayMaxOverride error mentioned in the ticket (Caused by: 
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 
14523048, but 5000000 is the maximum for this record type…). I also noticed 
that if the ByteArrayMaxOverride error happens, we do not get any body text, 
regardless of the value of writeLimit.

When the ByteArrayMaxOverride error happens, we can catch the exception, get 
the required override value from the stack trace, set that value with 
IOUtils.setByteArrayMaxOverride(), and then try the parse method 
again (it will probably succeed if the machine has enough memory).
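
The catch-and-retry workaround described above could be sketched roughly as 
follows. This is only an illustration, not Tika or POI code: the class 
OverrideRetry, its methods, and the ceiling parameter are hypothetical names, 
and the regular expression assumes the exception message keeps the wording 
quoted in this ticket. The parse call and the setter are passed in, so the 
sketch itself depends only on the JDK.

```java
import java.util.OptionalInt;
import java.util.concurrent.Callable;
import java.util.function.IntConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: retry a parse once with the array length POI asked for. */
public final class OverrideRetry {

    // Assumes the message wording quoted in this ticket.
    private static final Pattern REQUIRED =
            Pattern.compile("Tried to allocate an array of length (\\d+)");

    private OverrideRetry() {}

    /** Walks the cause chain and returns the requested array length, if found. */
    public static OptionalInt requiredLength(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg == null) {
                continue;
            }
            Matcher m = REQUIRED.matcher(msg);
            if (m.find()) {
                try {
                    return OptionalInt.of(Integer.parseInt(m.group(1)));
                } catch (NumberFormatException e) {
                    return OptionalInt.empty(); // value does not fit in an int
                }
            }
        }
        return OptionalInt.empty();
    }

    /**
     * Runs the parse; on the allocation error, raises the override to the
     * requested length (if it is at most ceiling) and runs the parse once more.
     */
    public static <T> T parseWithRetry(Callable<T> parse,
                                       IntConsumer setOverride,
                                       int ceiling) throws Exception {
        try {
            return parse.call();
        } catch (Exception first) {
            OptionalInt needed = requiredLength(first);
            if (!needed.isPresent() || needed.getAsInt() > ceiling) {
                throw first; // unrelated failure, or too large to attempt
            }
            setOverride.accept(needed.getAsInt());
            return parse.call();
        }
    }
}
```

In a real program the setter would presumably be 
IOUtils::setByteArrayMaxOverride, and the Callable would open a fresh input 
stream and call the parser, since the first attempt has already consumed 
the stream.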

However, we wonder if you can add a feature so that the body text is still 
available when the ByteArrayMaxOverride error happens. We could then decide 
whether to try again or use the available body text (and metadata), depending 
on the required override value, because a very high value may not be feasible, 
for example when there is not enough memory available on the machine.
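
The decision we have in mind could look something like the sketch below. 
It is an assumption-laden illustration only: RetryPolicy, Action, and 
headroomFactor are hypothetical names, the heap figure is a rough estimate 
from java.lang.Runtime, and the USE_PARTIAL_TEXT branch presupposes the 
requested feature, since no body text is available after this error today.

```java
/** Hypothetical policy: retry with a larger override, or keep the partial text. */
public final class RetryPolicy {

    public enum Action { RETRY_WITH_OVERRIDE, USE_PARTIAL_TEXT }

    private RetryPolicy() {}

    /** Rough estimate of how much more heap the JVM could still allocate. */
    public static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /**
     * Retries only if the required array fits into the available heap with
     * some headroom; headroomFactor = 3 means "needs at most a third of it".
     */
    public static Action decide(long requiredBytes, long availableBytes, int headroomFactor) {
        if (headroomFactor <= 0) {
            throw new IllegalArgumentException("headroomFactor must be positive");
        }
        boolean fits = requiredBytes <= availableBytes / headroomFactor;
        return fits ? Action.RETRY_WITH_OVERRIDE : Action.USE_PARTIAL_TEXT;
    }
}
```

A caller would pass the length reported in the exception message together 
with availableHeapBytes(); the headroom factor is a guess that leaves room 
for whatever else the parser allocates.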

> Wonder if you can add a feature for Tika parser to stop reading  metadata and 
> body content if certain amount of memory or body content has reached
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3519
>                 URL: https://issues.apache.org/jira/browse/TIKA-3519
>             Project: Tika
>          Issue Type: Wish
>          Components: detector
>    Affects Versions: 1.25, 1.26
>         Environment: Linux
>            Reporter: Xiaohong Yang
>            Priority: Major
>
> We use org.apache.tika.parser.AutoDetectParser to get the metadata and body 
> content of MS Office files. We encountered the following exception with some 
> files:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 14523048, but 5000000 is the maximum for this record type. If 
> the file is not corrupt, please open an issue on bugzilla to request 
> increasing the maximum allowable size for this record type. As a temporary 
> workaround, consider setting a higher override value with 
> IOUtils.setByteArrayMaxOverride()
>  
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml 
> file as follows
>  
>               <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>                      <params>
>                            <param name="byteArrayMaxOverride" type="int">20000000</param>
>                      </params>
>               </parser>
>  
> This helped to parse some files that failed previously, but some other files 
> still failed. We then increased the value to 200 MB and then to 500 MB.
>  
> Some other files may still fail with byteArrayMaxOverride set to 500 MB. So 
> we wonder if you can add a feature to the Tika parser so that it stops reading 
> metadata and body content once a certain amount of memory or body content has 
> been reached. The parser would return the metadata and body content obtained 
> so far, and a warning message would be returned to the caller when this 
> happens. This would help us to get the metadata and body content from some 
> files that require a lot of memory. Without this feature we may not be able 
> to parse some files at all: after we set byteArrayMaxOverride to very high 
> values and the above-mentioned failure no longer happens, those files fail 
> somewhere else with an out-of-memory error. With this feature we would get 
> truncated body content for some files, but that is better than getting 
> nothing. We truncate the body content ourselves if it is too large anyway, so 
> we do not mind if the body content is truncated once it reaches a certain 
> amount.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
