[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012145#comment-14012145 ]
Jeremias Maerki commented on PDFBOX-2101: ----------------------------------------- One thing here is that image compression can be extremely efficient, but if an image is decoded into a BufferedImage just so it can be exported into a compressed file again, it can take a lot of memory as we see here. In the case of PDJpeg, it's a bit unfortunate that the image is loaded into a BufferedImage since JPEG is a lossy compression format. Ideally, this class' write2OutputStream() method would just extract the compressed image since what's in there is almost exactly a normal JFIF/JPEG file. In Apache FOP, for example, we can embedd the compressed data stream into the PDF without actually decompressing and recompressing the image data (it's damn fast, too, and memory consumption is reduced to a little copy buffer). We're just filtering out stuff like the color profile which goes into a separate object. Here it would have to be implemented the other way around: Gathering the various objects associated with the JPEG image and re-assemble the JFIF/JPEG file as closely to the original as possible. > Surprising memory consumption when extracting images > ---------------------------------------------------- > > Key: PDFBOX-2101 > URL: https://issues.apache.org/jira/browse/PDFBOX-2101 > Project: PDFBox > Issue Type: Bug > Components: Utilities > Affects Versions: 1.8.5 > Environment: Windows 7 > java version "1.7.0_55" > Java(TM) SE Runtime Environment (build 1.7.0_55-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode) > Reporter: Tim Allison > Assignee: Andreas Lehmkühler > Priority: Minor > > ExtractImages seems to fail to release memory resources on some files in both > PDFBox 1.8.5 and trunk. > On this file 4MB file > [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if > extracting every image on every page (and there are many, many duplicate > images), there is an OOM with -Xmx1g. If there is no Xmx and there is > 2.5g > available, ExtractImages will work. > With some experimentation, the triggers seem to be JPEG images that have > masks. I'm not sure, though, whether the issue is with PDFBox or Java. > Commandlines: > 1.8.5: > java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages > 239665.pdf > 2.0_SNAPSHOT: > java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar > org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf > Results: > 1.8.5: 906 files before OOM > {noformat} > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:2271) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.ja > va:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) > at > org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java: > 514) > at > org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDP > ixelMap.java:217) > at > org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStr > eam(PDPixelMap.java:363) > at > org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file( > PDXObjectImage.java:254) > at > org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:2 > 02) > at > org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160) > at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65) > {noformat} > 2.0_SNAPSHOT: 428 files before OOM > {noformat} > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:2271) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.ja > va:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) > at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70) > at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52) > at > org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit( > SampledImageReader.java:171) > at > org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBIma > ge(SampledImageReader.java:154) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDIm > ageXObject.java:171) > at > org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:2 > 31) > at > org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages. > java:206) > at > org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.jav > a:164) > at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69) > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)