[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395548#comment-17395548 ]
Tim Allison commented on TIKA-3519:
-----------------------------------

Are you using Tika server or Tika app or calling Tika programmatically?

> Wonder if you can add a feature for Tika parser to stop reading metadata and
> body content if certain amount of memory or body content has reached
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-3519
>                 URL: https://issues.apache.org/jira/browse/TIKA-3519
>             Project: Tika
>          Issue Type: Wish
>          Components: detector
>    Affects Versions: 1.25, 1.26
>         Environment: Linux
>            Reporter: Xiaohong Yang
>            Priority: Major
>
> We use org.apache.tika.parser.AutoDetectParser to get the metadata and body
> content of MS office files. We encountered the following exception with some
> files
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an
> array of length 14523048, but 5000000 is the maximum for this record type. If
> the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type. As a temporary
> workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml
> file as follows
>
>     <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>       <params>
>         <param name="byteArrayMaxOverride" type="int">20000000</param>
>       </params>
>     </parser>
>
> This helped to parse some files that failed previously. But some other files
> still failed. And then we increased the value to 200 MB and 500 MB.
>
> Some other file may still fail with byteArrayMaxOverride set to 500 MB. So
> we wonder if you can add a feature to the Tika parser for it to stop reading
> metadata and body content if certain amount of memory or body content has
> reached. The parser will return the metadata and body content obtained so
> far. A warning message will be returned to the caller if this happens. This
> will help us to get the metadata and body content from some files that
> requires a lot of memory. We may not be able to successfully parse some
> files without this feature because those files fail somewhere else with the
> out-of-memory error after we set byteArrayMaxOverride to very high values and
> the above mentioned failure does not happen. With this feature we will get
> truncated body content with some files but it is better than get nothing.
> Actually we will truncate the body content ourselves if it is too large. So
> we do not care if the body content is truncated if it reaches certain amount.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
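
For illustration only, the behaviour requested in the issue (keep the body content obtained so far, stop accumulating once a limit is hit, and flag this to the caller) might be sketched as below. `BoundedWriter` is a hypothetical stdlib-only helper, not a Tika class, and it only bounds the text that is retained; it does not stop the parser itself or limit memory used inside POI. Tika's `BodyContentHandler` also accepts a write limit in its constructor, which may be worth checking first.

```java
import java.io.Writer;

/**
 * Hypothetical helper (not part of Tika): retains at most `limit`
 * characters and records whether any input had to be dropped.
 */
public class BoundedWriter extends Writer {
    private final StringBuilder buffer = new StringBuilder();
    private final int limit;
    private boolean truncated = false;

    public BoundedWriter(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        int remaining = limit - buffer.length();
        if (len > remaining) {
            // Keep only what still fits and remember that we cut something.
            truncated = true;
            len = Math.max(remaining, 0);
        }
        buffer.append(cbuf, off, len);
    }

    // Overridden without a throws clause so callers need no try/catch.
    @Override
    public void write(String str) {
        write(str.toCharArray(), 0, str.length());
    }

    @Override
    public void flush() {
        // Nothing to flush: content is held in memory.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** True if at least one character was discarded because of the limit. */
    public boolean isTruncated() {
        return truncated;
    }

    /** The content accumulated so far, at most `limit` characters long. */
    public String getContent() {
        return buffer.toString();
    }
}
```

A caller could check `isTruncated()` after the parse and attach its own warning to the result, which roughly matches the "warning message" asked for in the description.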