Ha!  I just ran into this with .docx as well [1][2].  Given that we only need 
to extract contents in Tika, the experimental SAX parser [3] is far more 
efficient, especially on very large documents.  The GC overhead from our 
current DOM approach was killing performance, especially under multithreading.

There are portions of the docx SAX parser that I think will fit well within POI 
as a parallel to XSSF's eventusermodel.  I hope to submit a patch for review 
sometime next week (or, in open-source time, January?)...
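To make the DOM-vs-SAX tradeoff concrete: with SAX, only the current element and a running buffer live in memory, rather than the whole tree.  The sketch below is not POI's or Tika's actual API; it is a hypothetical, self-contained illustration (using only the JDK's built-in SAX parser) of streaming the text runs out of an OOXML-like fragment, where "t" plays the role of the text element in both spreadsheetML and wordprocessingML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: stream text content out of an OOXML-like fragment
// with SAX.  Only the characters inside the current <t> element are buffered,
// unlike a DOM parse, which materializes the entire tree before any text can
// be read.
public class SaxTextExtractor {

    public static String extractText(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                // Collect character data only while inside a <t> element.
                if ("t".equals(qName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("t".equals(qName)) {
                    inText = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p><r><t>Hello </t></r><r><t>world</t></r></p>";
        System.out.println(extractText(xml));
    }
}
```

The same handler pattern is what an event-model extractor builds on: the callbacks fire as the parser streams the file, so peak memory stays proportional to the largest text run, not the document size.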

Cheers,

          Tim


[1] https://issues.apache.org/jira/browse/TIKA-1321
[2] https://issues.apache.org/jira/browse/TIKA-2180 
[3] Admittedly, the experimental SAX parser doesn't include all of the features 
that our current DOM parser does!  More work remains...

-----Original Message-----
From: Javen O'Neal [mailto:[email protected]] 
Sent: Friday, December 2, 2016 2:21 PM
To: POI Users List <[email protected]>
Subject: Re: Too much memory is used when reading a xlsx-file whose size is 
just 7.3M

Those numbers sound about right. I'm used to 4 MB ballooning to 1 GB.

We could significantly reduce memory consumption if we didn't maintain the XML 
DOM in memory, but replacing that requires thousands of hours of work.
