[
https://issues.apache.org/jira/browse/TIKA-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036370#comment-18036370
]
ASF GitHub Bot commented on TIKA-4474:
--------------------------------------
tballison commented on PR #2388:
URL: https://github.com/apache/tika/pull/2388#issuecomment-3504652237
Checkstyle should work now. However, I'm noticing that the Microsoft tests
are now taking 12 minutes on my laptop. I need to figure out whether this
change is causing that or whether something else is going on with my laptop.
> Exception on ooxml office files with large entries
> --------------------------------------------------
>
> Key: TIKA-4474
> URL: https://issues.apache.org/jira/browse/TIKA-4474
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.2.2
> Environment: OS: Ubuntu 24.04.3 LTS x86_64
> Host: Precision 5560
> Kernel: 6.8.0-71-generic
> Shell: zsh 5.9
> Terminal: kitty
> CPU: 11th Gen Intel i7-11850H (16) @ 4.800GHz
> GPU: Intel TigerLake-H GT1 [UHD Graphics]
> Memory: 12574MiB / 15711MiB
> Reporter: Manish S N
> Priority: Major
> Labels: OOXML, XLSX, tika-parsers
> Attachments: OOXMLExtractPerfTest.java, OOXMLExtractPerfTest.output,
> non_spooling_perf_chart.png, overallStats, spooling_perf_chart.png,
> testRecordFormatExceeded.xlsx
>
>
> When we try to parse OOXML office files with an entry that expands to more
> than 100 MB, we get a RecordFormatException from POI's IOUtils.
> E.g., a large spreadsheet (one such file is attached; the attached Excel file
> is about 12 MB but has a single sheet that expands to over 300 MB).
> This happens when we use an InputStream-based TikaInputStream, not when we
> use a file-based one.
> It is caused by POI IOUtils' limit of 100 MB for a zip entry, which is hit
> when we try to make an OPCPackage out of the input stream we passed.
> Exception:
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@12d40609 at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312) at
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153) at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204) at
> redacted.for.privacy
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an
> array of length 327,956,216, but the maximum length for this record type is
> 100,000,000.If the file is not corrupt and not large, please open an issue on
> bugzilla to request increasing the maximum allowable size for this record
> type.You can set a higher override value with
> IOUtils.setByteArrayMaxOverride() at
> org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:622) at
> org.apache.poi.util.IOUtils.checkLength(IOUtils.java:307) at
> org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:261) at
> org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:235) at
> org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:93)
> at
> org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:114)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:164) at
> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:455) at
> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:430) at
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:117)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> ... 75 more
> {code}
> Solution:
> To solve this without overriding the byte array maximum and consuming even
> more RAM, we can force spooling of OOXML files to disk beforehand, just as
> is done for ODF. This keeps the load on RAM to a minimum and improves
> performance too.
> [The performance test I did for a similar
> issue|https://issues.apache.org/jira/browse/TIKA-4459?focusedCommentId=18010803&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-18010803]
> also covers MS Office files, and that issue gives further reasons to move to
> spooling entirely.
>
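The spooling idea in the description above can be sketched with JDK classes
alone. This is an illustration under stated assumptions, not Tika's or POI's
actual code: the class name SpoolingSketch and its methods are hypothetical,
and java.util.zip.ZipFile stands in for a file-backed package open. The point
it shows is that once the stream is copied to a temp file, each zip entry can
be read in small chunks instead of being buffered whole in a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of Tika or POI.
class SpoolingSketch {

    // Copy the whole stream to a temp file so the zip can be opened from disk.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("ooxml-spool-", ".zip");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Spool first, then read each entry through an 8 KiB buffer.
    // Returns "name:uncompressedByteCount" for every entry.
    static List<String> describeEntries(InputStream in) throws IOException {
        Path tmp = spoolToTempFile(in);
        List<String> out = new ArrayList<>();
        try (ZipFile zip = new ZipFile(tmp.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            byte[] buf = new byte[8192];
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                long total = 0;
                try (InputStream es = zip.getInputStream(entry)) {
                    int n;
                    while ((n = es.read(buf)) != -1) {
                        total += n;
                    }
                }
                out.add(entry.getName() + ":" + total);
            }
        } finally {
            Files.deleteIfExists(tmp); // clean up the spooled copy
        }
        return out;
    }

    // Build a tiny zip in memory whose second entry inflates to 1,000,000 bytes.
    static byte[] buildSampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zos.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
            zos.write(new byte[1_000_000]);
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = buildSampleZip();
        System.out.println(describeEntries(new ByteArrayInputStream(zip)));
    }
}
```

The trade-off is one extra copy of the compressed file to disk (about 12 MB
for the attached spreadsheet) in exchange for never holding a fully expanded
entry in memory.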
--
This message was sent by Atlassian Jira
(v8.20.10#820010)