I created https://bz.apache.org/bugzilla/show_bug.cgi?id=65950

On 2022/03/11 16:46:18 PJ Fanning wrote:
> Thanks Andreas - could you raise a bug? https://bz.apache.org/bugzilla/
>
> Definitely looks like a bug.
>
>
> On Friday 11 March 2022, 16:15:54 GMT+1, Andreas Hubold
> <[email protected]> wrote:
>
>
> Hi,
>
> I'm just trying to upgrade POI from 5.2.0 to 5.2.1 for using it with
> Apache Tika 2.3.0, but I suddenly see memory problems when processing
> DOCX files with embedded images. This looks like a severe bug in POI
> 5.2.1 to me:
>
> POI 5.2.1 changed XWPFPictureData#getChecksum to call
> IOUtils.toByteArrayWithMaxLength with a default max length of 100MB
> (XWPFPictureData#DEFAULT_MAX_IMAGE_SIZE). The implementation of that
> method allocates a byte array of that size by instantiating an
> UnsynchronizedByteArrayOutputStream with that max value.
>
> The effect is that 100MB of heap memory is allocated, even if the
> embedded image is quite small (less than 1MB in my case).
>
> Here's an exception stack trace where the code is called from Apache Tika:
>
> Caused by: java.io.IOException: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:249)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:201)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
>     ... 9 common frames omitted
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.commons.io.IOUtils.byteArray(IOUtils.java:338)
>     at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
>     at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:205)
>     at org.apache.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:191)
>     at org.apache.poi.xwpf.usermodel.XWPFPictureData.getChecksum(XWPFPictureData.java:168)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.registerPackagePictureData(XWPFDocument.java:1460)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:264)
>     at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:169)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:145)
>     at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:63)
>     at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:224)
>     ... 12 common frames omitted
>
> IOUtils.toByteArrayWithMaxLength is also used in other places in the
> code, so the problem might affect other calls as well.
>
> Maybe the checksum could even be implemented in a streaming fashion
> without loading the whole data into a byte array? There's even a method
> for that in
> org.apache.poi.util.IOUtils#calculateChecksum(java.io.InputStream).
> But that method also wasn't used for that in earlier versions of POI, so
> that's maybe a different topic and not necessary to change.
>
> Thanks in advance for having a look!
>
> Kind Regards,
> Andreas
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
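
To make the two points from the quoted report concrete, here is a minimal sketch in plain JDK Java. It is not POI's or Commons IO's code: the class name BoundedReadSketch and the methods readWithLimit and streamingChecksum are hypothetical, the 4096-byte chunk size is arbitrary, and CRC32 is chosen only for illustration. The first method enforces a maximum length while letting the buffer grow with the data actually read, instead of sizing the buffer to the limit up front. The second computes a checksum while streaming, so the data never sits in one large array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical illustration only; not the POI or Commons IO implementation.
public class BoundedReadSketch {

    /**
     * Reads the whole stream, failing once more than maxLength bytes are seen.
     * The buffer starts small and grows on demand, so a 1MB image with a
     * 100MB limit does not cost 100MB of heap up front.
     */
    public static byte[] readWithLimit(InputStream in, int maxLength) throws IOException {
        int initialSize = Math.max(0, Math.min(maxLength, 4096));
        ByteArrayOutputStream out = new ByteArrayOutputStream(initialSize);
        byte[] chunk = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > maxLength) {
                throw new IOException("Stream exceeds limit of " + maxLength + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Computes a CRC32 over the stream using only a fixed-size chunk buffer. */
    public static long streamingChecksum(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            crc.update(chunk, 0, n);
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] image = new byte[10_000]; // stand-in for a small embedded image
        int limit = 100 * 1024 * 1024;   // same order as the 100MB default in the report

        byte[] copy = readWithLimit(new ByteArrayInputStream(image), limit);
        long checksum = streamingChecksum(new ByteArrayInputStream(image));

        System.out.println("bytes read: " + copy.length);
        System.out.println("checksum: " + checksum);
    }
}
```

Whether the streaming variant fits XWPFPictureData#getChecksum depends on how the checksum and the picture bytes are used afterwards, which this sketch does not attempt to cover.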
