[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018977#comment-14018977 ]

Andreas Lehmkühler edited comment on PDFBOX-2101 at 6/5/14 5:12 PM:
--------------------------------------------------------------------

By default, all resources of a page are now cleared automatically after the page has been converted to an image. The user may disable that feature when creating the PDFRenderer. I've added those changes to the trunk in revision http://svn.apache.org/r1600699

The 1.8 branch doesn't have a PDFRenderer, so there I've only added a clear method to the PDPage class, in revision http://svn.apache.org/r1600701

[~tilman] IMHO it's more convenient to clear the resources automatically, as the user may not be aware of those cached values.


was (Author: lehmi):
By default, all resources of a page are now cleared automatically after the page has been converted to an image. The user may disable that feature when creating the PDFRenderer. I've added those changes to the trunk in revision http://svn.apache.org/r1600699

The 1.8 branch doesn't have a PDFRenderer, so there I've only added a clear method to the PDPage class, in revision http://svn.apache.org/r1600701

> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
> Key: PDFBOX-2101
> URL: https://issues.apache.org/jira/browse/PDFBOX-2101
> Project: PDFBox
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 1.8.5
> Environment: Windows 7
>     java version "1.7.0_55"
>     Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
>     Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
> Reporter: Tim Allison
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg, java.hprof.zip
>
>
> ExtractImages seems to fail to release memory resources on some files in both PDFBox 1.8.5 and trunk.
> On this 4MB file [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if extracting every image on every page (and there are many, many duplicate images), there is an OOM with -Xmx1g. If there is no Xmx and there is more than 2.5g available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have masks. I'm not sure, though, whether the issue is with PDFBox or Java.
> Commandlines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}
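
For illustration only, the clear-after-render behaviour described in the comment can be sketched in plain Java roughly as follows. This does not use PDFBox and is not the code from r1600699 or r1600701: `Page`, `Renderer`, `cachedEntries` and the `clearAfterRender` flag are hypothetical names, and the byte arrays merely stand in for decoded image data that a page keeps cached.

```java
import java.util.HashMap;
import java.util.Map;

public class PageCacheSketch {

    /** Hypothetical stand-in for a page that caches its decoded images. */
    public static class Page {
        private final Map<String, byte[]> decodedImages = new HashMap<>();

        /** "Decodes" two images (here: just allocates buffers) and keeps them cached. */
        public int render() {
            decodedImages.put("Im0", new byte[1024]);
            decodedImages.put("Im1", new byte[2048]);
            int total = 0;
            for (byte[] data : decodedImages.values()) {
                total += data.length;
            }
            return total; // number of bytes held by the cache after rendering
        }

        /** Drops all cached data so it becomes eligible for garbage collection. */
        public void clear() {
            decodedImages.clear();
        }

        public int cachedEntries() {
            return decodedImages.size();
        }
    }

    /** Hypothetical renderer that clears the page cache after rendering unless told not to. */
    public static class Renderer {
        private final boolean clearAfterRender;

        public Renderer() {
            this(true); // assumption: clearing is the default, as the comment describes
        }

        public Renderer(boolean clearAfterRender) {
            this.clearAfterRender = clearAfterRender;
        }

        public int renderPage(Page page) {
            try {
                return page.render();
            } finally {
                // Clear even if rendering throws, so a failed page does not pin memory.
                if (clearAfterRender) {
                    page.clear();
                }
            }
        }
    }

    public static void main(String[] args) {
        Page a = new Page();
        new Renderer().renderPage(a);
        System.out.println("default: " + a.cachedEntries() + " cached entries");

        Page b = new Page();
        new Renderer(false).renderPage(b);
        System.out.println("clearing disabled: " + b.cachedEntries() + " cached entries");
        b.clear(); // with automatic clearing off, the caller has to release the cache
    }
}
```

With automatic clearing switched off, the caller is responsible for calling `clear()` on each page once it is done with it, which is the situation the comment describes for the 1.8 branch.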