https://issues.apache.org/bugzilla/show_bug.cgi?id=54790

            Bug ID: 54790
           Summary: Word Document loading strategy is memory hungry and
                    causes OutOfMemoryError
           Product: POI
           Version: 3.8
          Hardware: PC
                OS: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
          Assignee: [email protected]
          Reporter: [email protected]
    Classification: Unclassified

In my case I have a 70MB Word document, which yields 50MB of plain text
(after "Save As..." to plain text). When this document is loaded using POI,
the following error occurs:

Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
    at java.lang.StringCoding.decode(StringCoding.java:173)
    at java.lang.String.<init>(String.java:443)
    at java.lang.String.<init>(String.java:515)
    at org.apache.poi.hwpf.model.TextPiece.buildInitSB(TextPiece.java:89)
    at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:66)
    at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:111)
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:267)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)

According to my observations, the 70MB document expands to approx. 900MB of heap.

Analysis:

As far as I can see, the TextPieceTable class creates thousands of TextPiece
objects (and thus thousands of StringBuilder objects with small char[] buffers).
Later, HWPFDocument proceeds as follows:

- it collects all text pieces again in line 275:
  _text = _tpt.getText();
- if preserveTextTable=false, then a new ComplexFileTable object is created,
  holding one TextPieceTable, which holds one SinglentonTextPiece (lines 314-318)

Perhaps this can be improved further. In particular, when
preserveTextTable=false, TextPieceTable should not copy the relevant part of
documentStream:

System.arraycopy( documentStream, start, buf, 0, textSizeBytes );

and should instead use a lightweight version of TextPiece without its own
buffer. Later, when all the text pieces need to be collected, they can be read
directly from documentStream.
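
To illustrate the idea, here is a minimal sketch of such a lightweight piece.
It is not existing POI code: the names LazyTextPiece and LazyTextPieces are
hypothetical, and the sketch assumes each piece is stored either as 8-bit
windows-1252 or as 16-bit UTF-16LE text. Each piece only remembers its offset
and length in the shared documentStream, and the text is decoded once, into a
single pre-sized StringBuilder, when the pieces are collected.

```java
import java.nio.charset.Charset;
import java.util.List;

/** Hypothetical lightweight piece: a view onto the shared stream, no own buffer. */
final class LazyTextPiece {
    // Assumed encodings for compressed (8-bit) and Unicode (16-bit) pieces.
    private static final Charset EIGHT_BIT = Charset.forName("windows-1252");
    private static final Charset SIXTEEN_BIT = Charset.forName("UTF-16LE");

    private final byte[] documentStream; // shared reference, never copied
    private final int start;             // byte offset of this piece
    private final int lengthBytes;       // size of this piece in bytes
    private final boolean unicode;       // true: UTF-16LE, false: windows-1252

    LazyTextPiece(byte[] documentStream, int start, int lengthBytes, boolean unicode) {
        if (start < 0 || lengthBytes < 0 || start + lengthBytes > documentStream.length) {
            throw new IllegalArgumentException("piece is outside the document stream");
        }
        this.documentStream = documentStream;
        this.start = start;
        this.lengthBytes = lengthBytes;
        this.unicode = unicode;
    }

    /** Number of chars this piece decodes to (2 bytes per char when unicode). */
    int lengthChars() {
        return unicode ? lengthBytes / 2 : lengthBytes;
    }

    /** Decodes straight from the shared stream; only a short-lived String is allocated. */
    void appendTo(StringBuilder out) {
        Charset charset = unicode ? SIXTEEN_BIT : EIGHT_BIT;
        out.append(new String(documentStream, start, lengthBytes, charset));
    }
}

/** Hypothetical helper that collects all pieces into one buffer. */
final class LazyTextPieces {
    static StringBuilder collect(List<LazyTextPiece> pieces) {
        int totalChars = 0;
        for (LazyTextPiece piece : pieces) {
            totalChars += piece.lengthChars();
        }
        // Pre-size the builder so it does not grow (and re-copy) while appending.
        StringBuilder text = new StringBuilder(totalChars);
        for (LazyTextPiece piece : pieces) {
            piece.appendTo(text);
        }
        return text;
    }
}
```

With this layout the per-piece cost is a few fields rather than a byte[] copy
plus a StringBuilder, so the large allocations are the original documentStream
and the final collected text.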
