[ 
https://issues.apache.org/jira/browse/TIKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920032#action_12920032
 ] 

Sjoerd Smeets commented on TIKA-521:
------------------------------------

Attached a proposed patch for bigger XLSX files. It has been tested with an XLSX 
spreadsheet of 70 MB with a heap size of 1024 MB. It should be able to handle 
bigger files, since it uses SAX parsing. However, using a smaller heap size 
for the test file resulted in an OutOfMemoryError when extracting the 
different parts of the XLSX document.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
        at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:118)
        at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:154)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:68)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:146)
        at com.ravn.test.tika.XLSTester.parse(XLSTester.java:47)
        at com.ravn.test.TikaTester.main(TikaTester.java:39)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)
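
For illustration only, and not taken from the attached patch: the trace shows 
each zip entry being copied into a byte array before parsing, which is where 
the heap runs out. A minimal sketch of the streaming alternative, using only 
the JDK, feeds each worksheet entry straight from the zip stream into a SAX 
handler so that no entry is held in memory as a whole. The class name 
StreamingSheetSketch, the method countCells, and the assumption that worksheet 
parts live under xl/worksheets/ (the conventional location; a real parser would 
resolve them through the package relationships) are hypothetical choices made 
for this example.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or POI.
final class StreamingSheetSketch {

    /** Counts cell elements (<c>) in all worksheet parts without buffering any entry. */
    static long countCells(InputStream xlsx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);

        final long[] cells = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Each <c> element in a worksheet part is one cell.
                if ("c".equals(localName)) {
                    cells[0]++;
                }
            }
        };

        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: worksheets are stored as xl/worksheets/*.xml.
                if (name.startsWith("xl/worksheets/") && name.endsWith(".xml")) {
                    // The SAX parser may close its input when it finishes;
                    // ignore that so the remaining zip entries stay readable.
                    InputStream guarded = new FilterInputStream(zip) {
                        @Override
                        public void close() {
                            // intentionally left open
                        }
                    };
                    factory.newSAXParser().parse(new InputSource(guarded), handler);
                }
            }
        }
        return cells[0];
    }
}
```

With this shape, memory use depends on what the handler chooses to keep, not 
on the size of the largest entry in the file.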

The proposed patch is an attempt to generate the same information about an XLSX 
document as the XSSFExcelExtractorDecorator parser does. There are still some 
issues to look into, which are marked with TODO comments. Some advice on these 
matters would be welcome. Could someone check whether the proposed patch is 
acceptable? If so, I'll implement the TODO items and write some test cases. 
Maybe this can then become the default parser.

I also changed/created certain parts in POI in order to get the patch working. 
See https://issues.apache.org/bugzilla/show_bug.cgi?id=50076 for the proposed 
changes for POI.

> OutOfMemoryError Parsing XSLX File
> ----------------------------------
>
>                 Key: TIKA-521
>                 URL: https://issues.apache.org/jira/browse/TIKA-521
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.7, 0.8
>            Reporter: Stephen Duncan Jr
>         Attachments: memory-test.xlsx, tika-diff.txt, tika-new-files.tar.bz2
>
>
> I have several XLSX files I'm trying to parse with Tika that are failing with 
> an OutOfMemoryError even when using a large heap size. For instance, the 
> attached 1.26 MB Excel file fails using a 512 MB heap.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
