I created https://bz.apache.org/bugzilla/show_bug.cgi?id=65950

On 2022/03/11 16:46:18 PJ Fanning wrote:
> Thanks Andreas - could you raise a bug? https://bz.apache.org/bugzilla/
> 
> Definitely looks like a bug.
> 
> On Friday 11 March 2022, 16:15:54 GMT+1, Andreas Hubold 
> <[email protected]> wrote: 
> 
> Hi,
> 
> I'm just trying to upgrade POI from 5.2.0 to 5.2.1 for using it with 
> Apache Tika 2.3.0, but I suddenly see memory problems when processing 
> DOCX files with embedded images. This looks like a severe bug in POI 
> 5.2.1 to me:
> 
> POI 5.2.1 changed XWPFPictureData#getChecksum to call 
> IOUtils.toByteArrayWithMaxLength with a default max length of 100MB 
> (XWPFPictureData#DEFAULT_MAX_IMAGE_SIZE). The implementation of that 
> method allocates a byte array of that size by instantiating an 
> UnsynchronizedByteArrayOutputStream with that max value.
> 
> The effect is that 100MB of heap memory is allocated for every call, 
> even if the embedded image is quite small (less than 1MB in my case).
> 
> Here's an exception stack trace where the code is called from Apache Tika:
> 
> Caused by: java.io.IOException: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:249)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:201)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
>         ... 9 common frames omitted
> Caused by: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.commons.io.IOUtils.byteArray(IOUtils.java:338)
>         at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
>         at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
>         at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:205)
>         at org.apache.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:191)
>         at org.apache.poi.xwpf.usermodel.XWPFPictureData.getChecksum(XWPFPictureData.java:168)
>         at org.apache.poi.xwpf.usermodel.XWPFDocument.registerPackagePictureData(XWPFDocument.java:1460)
>         at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:264)
>         at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:169)
>         at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:145)
>         at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:63)
>         at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:224)
>         ... 12 common frames omitted
> 
> IOUtils.toByteArrayWithMaxLength is also used at other places in the 
> code, so the problem might affect other calls as well.
> 
> Maybe the checksum could even be implemented in a streaming fashion 
> without loading the whole data into a byte array? There's even a method 
> for that in 
> org.apache.poi.util.IOUtils#calculateChecksum(java.io.InputStream).
> But that method also wasn't used for that in earlier versions of POI, so 
> that's maybe a different topic and not necessary to change.
> 
> Thanks in advance for having a look!
> 
> Kind Regards,
> Andreas
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
> 
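
For illustration, regarding the allocation described in the quoted report: a maximum length can be enforced as an upper bound while the buffer starts small and grows on demand. The following is only a rough sketch using plain JDK classes, not the actual POI code or its fix. The names BoundedRead and readWithLimit are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of POI.
class BoundedRead {

    /**
     * Reads the whole stream into a byte array and fails if it holds more
     * than maxLength bytes. The limit is only checked while reading; it is
     * never used as the initial buffer capacity.
     */
    static byte[] readWithLimit(InputStream in, int maxLength) throws IOException {
        // Start with a small buffer; ByteArrayOutputStream grows as needed.
        ByteArrayOutputStream out = new ByteArrayOutputStream(4096);
        byte[] chunk = new byte[4096];
        long total = 0; // long avoids overflow when maxLength is close to Integer.MAX_VALUE
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > maxLength) {
                throw new IOException("Stream exceeds max length of " + maxLength + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] image = new byte[500_000]; // stands in for a small embedded image
        byte[] copy = readWithLimit(new ByteArrayInputStream(image), 100 * 1024 * 1024);
        System.out.println("Read " + copy.length + " bytes");
    }
}
```

With this approach a 500KB image costs roughly its own size in heap (plus growth overhead), independent of the configured 100MB limit.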

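To make the streaming idea from the quoted mail more concrete, here is a minimal sketch of a checksum computed chunk by chunk, so the image data never has to be held in one byte array. It assumes a CRC32 checksum and uses only java.util.zip.CRC32 from the JDK. StreamingChecksum and checksum are hypothetical names, and this is not necessarily how org.apache.poi.util.IOUtils#calculateChecksum is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of POI.
class StreamingChecksum {

    /** Computes a CRC32 checksum while holding only a 4KB buffer in memory. */
    static long checksum(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            crc.update(buf, 0, n); // feed each chunk as it arrives
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        long value = checksum(new ByteArrayInputStream(data));
        System.out.println(Long.toHexString(value));
    }
}
```

Memory use stays constant regardless of the image size, so no max-length guard is needed for the checksum itself.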