https://issues.apache.org/bugzilla/show_bug.cgi?id=54790
Bug ID: 54790
Summary: Word Document loading strategy is memory hungry and
causes OutOfMemoryError
Product: POI
Version: 3.8
Hardware: PC
OS: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: HWPF
Assignee: [email protected]
Reporter: [email protected]
Classification: Unclassified
In my case I have a 70 MB Word document, which yields about 50 MB of plain
text (after "Save as..."). When this document is loaded using POI, the following
error occurs:
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
at java.lang.StringCoding.decode(StringCoding.java:173)
at java.lang.String.<init>(String.java:443)
at java.lang.String.<init>(String.java:515)
at org.apache.poi.hwpf.model.TextPiece.buildInitSB(TextPiece.java:89)
at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:66)
at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:111)
at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:267)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
By my observations, the 70 MB document expands to approximately 900 MB of heap.
Analysis:
As far as I can see, class TextPieceTable creates thousands of TextPiece objects
(and thus thousands of StringBuilder objects with small char[] buffers). After
that, HWPFDocument proceeds as follows:
- it collects all text pieces again in line 275:
_text = _tpt.getText();
- if preserveTextTable=false, a new ComplexFileTable object is created,
holding one TextPieceTable, which holds one SinglentonTextPiece (lines 314-318)
Perhaps this can be further improved. In particular, when
preserveTextTable=false, TextPieceTable should not copy the relevant part of
documentStream:
System.arraycopy( documentStream, start, buf, 0, textSizeBytes );
Instead it could use a lightweight version of TextPiece without a buffer. Later,
when all text pieces need to be collected, the text can be taken directly from
documentStream.
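A minimal sketch of the idea, not a patch against POI: the class
LazyTextPieceSketch, its nested Piece class and collectText are hypothetical
names that do not exist in HWPF. Each piece only records where its bytes live
in the shared documentStream, and the text is decoded once, into a single
pre-sized StringBuilder, when it is actually collected. The sketch assumes
8-bit pieces are windows-1252 and 16-bit pieces are UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names are part of POI.
final class LazyTextPieceSketch {

    /** A text piece that references a range of documentStream instead of copying it. */
    static final class Piece {
        // Assumption: 8-bit ("compressed") text is windows-1252 encoded.
        private static final Charset ANSI = Charset.forName("windows-1252");

        private final byte[] documentStream; // shared with all other pieces, never copied
        private final int start;
        private final int textSizeBytes;
        private final boolean unicode;       // true: UTF-16LE, false: 8-bit

        Piece(byte[] documentStream, int start, int textSizeBytes, boolean unicode) {
            if (start < 0 || textSizeBytes < 0
                    || start > documentStream.length - textSizeBytes) {
                throw new IllegalArgumentException("piece lies outside documentStream");
            }
            this.documentStream = documentStream;
            this.start = start;
            this.textSizeBytes = textSizeBytes;
            this.unicode = unicode;
        }

        int lengthInChars() {
            return unicode ? textSizeBytes / 2 : textSizeBytes;
        }

        /** Decodes this piece straight from documentStream into the target builder. */
        void appendTo(StringBuilder out) {
            Charset cs = unicode ? StandardCharsets.UTF_16LE : ANSI;
            out.append(new String(documentStream, start, textSizeBytes, cs));
        }
    }

    /** Collects the text of all pieces into one builder sized up front. */
    static StringBuilder collectText(List<Piece> pieces) {
        int totalChars = 0;
        for (Piece p : pieces) {
            totalChars += p.lengthInChars();
        }
        StringBuilder text = new StringBuilder(totalChars);
        for (Piece p : pieces) {
            p.appendTo(text);
        }
        return text;
    }

    public static void main(String[] args) {
        byte[] ansi = "Hello, ".getBytes(StandardCharsets.ISO_8859_1);
        byte[] uni = "world".getBytes(StandardCharsets.UTF_16LE);
        byte[] documentStream = new byte[ansi.length + uni.length];
        System.arraycopy(ansi, 0, documentStream, 0, ansi.length);
        System.arraycopy(uni, 0, documentStream, ansi.length, uni.length);

        List<Piece> pieces = Arrays.asList(
                new Piece(documentStream, 0, ansi.length, false),
                new Piece(documentStream, ansi.length, uni.length, true));
        System.out.println(collectText(pieces));
    }
}
```

Decoding still creates one temporary String per piece, but those are short-lived;
what goes away is the long-lived per-piece byte[] copy and StringBuilder.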