[ https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920032#action_12920032 ]
Sjoerd Smeets commented on TIKA-521:
------------------------------------

Attached a proposed patch for bigger XLSX files. It has been tested with an XLSX spreadsheet of 70 MB with a heap size of 1024 MB. It should be able to handle bigger files, since it uses SAX parsing. However, using a smaller heap size for the test file resulted in an OutOfMemoryError when extracting the different parts of the XLSX document:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:118)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:154)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:68)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:146)
    at com.ravn.test.tika.XLSTester.parse(XLSTester.java:47)
    at com.ravn.test.TikaTester.main(TikaTester.java:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)

The proposed patch is an attempt to generate the same information about an XLSX document as the XSSFExcelExtractorDecorator parser does.
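
As a rough illustration of why SAX parsing should keep memory use bounded, here is a minimal sketch using only the JDK's SAX API. It is not the attached patch and does not use POI or Tika; the class name SaxCellTextSketch, its methods, and the tiny inline sheet are made up for this example. The handler sees one element at a time and keeps only the current cell's value, rather than building the whole sheet in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of the proposed patch, Tika, or POI.
class SaxCellTextSketch {

    /** Emits "ref=value" for each <v> element found inside a <c r="..."> cell. */
    static class CellHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private final StringBuilder value = new StringBuilder();
        private String cellRef;
        private boolean inValue;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("c".equals(qName)) {
                cellRef = atts.getValue("r");   // remember the cell reference, e.g. "A1"
            } else if ("v".equals(qName)) {
                inValue = true;
                value.setLength(0);             // start collecting a new value
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // characters() may be called several times per element, so append.
            if (inValue) {
                value.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("v".equals(qName)) {
                inValue = false;
                out.append(cellRef).append('=').append(value).append('\n');
            }
        }
    }

    /** Streams the sheet XML through the handler and returns the collected text. */
    static String extract(InputStream sheetXml) throws Exception {
        CellHandler handler = new CellHandler();
        // The default factory is not namespace-aware, so element names arrive in qName.
        SAXParserFactory.newInstance().newSAXParser().parse(sheetXml, handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for a worksheet part; real sheets also carry
        // namespaces, cell types, and shared-string indices.
        String xml = "<worksheet><sheetData><row r=\"1\">"
                + "<c r=\"A1\"><v>42</v></c><c r=\"B1\"><v>3.5</v></c>"
                + "</row></sheetData></worksheet>";
        System.out.print(extract(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

In this sketch the output is accumulated in a StringBuilder only to keep the example short; a real extractor would presumably forward each value to a ContentHandler as it is read. Note also that streaming the sheet XML does not by itself avoid the allocation shown in the stack trace, which happens earlier, when the package is opened.
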
There are still some issues to look into, which are marked with TODO comments. Some advice on these matters would be welcome.

Could someone check whether the proposed patch is acceptable? If so, I'll try to implement the TODO items and write some test cases. Maybe this can then become the default parser.

I also changed/created certain parts of POI in order to get the patch working. See https://issues.apache.org/bugzilla/show_bug.cgi?id=50076 for the proposed changes to POI.

> OutOfMemoryError Parsing XSLX File
> ----------------------------------
>
> Key: TIKA-521
> URL: https://issues.apache.org/jira/browse/TIKA-521
> Project: Tika
> Issue Type: Bug
> Affects Versions: 0.7, 0.8
> Reporter: Stephen Duncan Jr
> Attachments: memory-test.xlsx, tika-diff.txt, tika-new-files.tar.bz2
>
>
> I have several XLSX files I'm trying to parse with Tika that are failing with
> an OutOfMemoryError even when using a large heap size. For instance, the
> attached 1.26 MB Excel file fails using a 512 MB heap.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.