Thanks Andreas - could you raise a bug? https://bz.apache.org/bugzilla/

Definitely looks like a bug.

On Friday 11 March 2022, 16:15:54 GMT+1, Andreas Hubold 
<[email protected]> wrote: 





Hi,

I'm trying to upgrade POI from 5.2.0 to 5.2.1 for use with 
Apache Tika 2.3.0, but I now see memory problems when processing 
DOCX files with embedded images. This looks like a severe bug in POI 
5.2.1 to me:

POI 5.2.1 changed XWPFPictureData#getChecksum to call 
IOUtils.toByteArrayWithMaxLength with a default max length of 100MB 
(XWPFPictureData#DEFAULT_MAX_IMAGE_SIZE). The implementation of that 
method allocates a byte array of that size by instantiating an 
UnsynchronizedByteArrayOutputStream with that max value.

The effect is that 100MB of heap memory is allocated, even if the 
embedded image is quite small (less than 1MB in my case).
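
To illustrate the difference, here is a minimal sketch of a length-limited 
read that enforces the maximum while reading instead of pre-allocating it. 
It uses only the JDK. BoundedReader, readWithMaxLength and CHUNK_SIZE are 
hypothetical names for illustration, not POI's actual code or a proposed patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of POI.
final class BoundedReader {
    // Assumed chunk size; the buffer grows from here as data arrives.
    private static final int CHUNK_SIZE = 8192;

    private BoundedReader() {}

    /**
     * Reads the whole stream and fails if it yields more than maxLength bytes.
     * The output buffer starts small and grows with the data actually read,
     * so a small image never costs maxLength bytes of heap.
     */
    static byte[] readWithMaxLength(InputStream in, int maxLength) throws IOException {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must not be negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(CHUNK_SIZE);
        byte[] chunk = new byte[CHUNK_SIZE];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            // The limit is checked against bytes read, not used as a buffer size.
            if (total > maxLength) {
                throw new IOException("Stream exceeds the limit of " + maxLength + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

With this shape, the 100MB value acts purely as an upper bound; heap use 
follows the size of the embedded image.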

Here's an exception stack trace where the code is called from Apache Tika:

Caused by: java.io.IOException: java.lang.OutOfMemoryError: Java heap space
        at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:249)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:201)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
        ... 9 common frames omitted
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.commons.io.IOUtils.byteArray(IOUtils.java:338)
        at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
        at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
        at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:205)
        at org.apache.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:191)
        at org.apache.poi.xwpf.usermodel.XWPFPictureData.getChecksum(XWPFPictureData.java:168)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.registerPackagePictureData(XWPFDocument.java:1460)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:264)
        at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:169)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:145)
        at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:63)
        at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:224)
        ... 12 common frames omitted

IOUtils.toByteArrayWithMaxLength is also used in other places in the 
code, so the problem might affect other callers as well.

Maybe the checksum could even be computed in a streaming fashion 
without loading all the data into a byte array? There's already a method 
for that: 
org.apache.poi.util.IOUtils#calculateChecksum(java.io.InputStream).
That method wasn't used for this in earlier versions of POI either, so 
it may be a separate topic and not a necessary change.
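
As a rough sketch of what I mean by streaming, the checksum can be updated 
chunk by chunk, so that only a small fixed buffer is held in memory. This 
uses java.util.zip.CRC32 from the JDK as an assumed checksum algorithm. 
StreamingChecksum and crc32Of are hypothetical names, not the POI method 
mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical helper for illustration only; not part of POI.
final class StreamingChecksum {
    private StreamingChecksum() {}

    /**
     * Computes a CRC32 over the stream contents without buffering them.
     * Memory use is bounded by the 8 KB read buffer, whatever the stream size.
     */
    static long crc32Of(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            // Feed only the bytes actually read into the checksum.
            crc.update(buf, 0, n);
        }
        return crc.getValue();
    }
}
```

The caller would pass the picture part's input stream and close it 
afterwards, for example with try-with-resources.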

Thanks in advance for having a look!

Kind Regards,
Andreas



