[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012906#comment-14012906 ]
John Hewson edited comment on PDFBOX-2101 at 5/29/14 10:05 PM:
---------------------------------------------------------------

I did some memory profiling, and COSStream is holding onto a copy of the stream data in a RandomAccessFileOutputStream instance variable "unFilteredStream", which it keeps as long as the COSStream is around.

Each PDPage holds a reference to its "Page" COSDictionary, which in turn contains the Resources, and ultimately a COSStream containing a named image XObject stream:

Page
... Resources
...... XObject
......... obj1 (COSStream)

This would be fine, except that COSStream caches its data once it has been read! Specifically, when reading an image, COSStream.getUnfilteredStream() will be called, which causes RandomAccessFileOutputStream unFilteredStream to be populated. The only way to close unFilteredStream is to call COSStream.close(), but that destroys the entire COSStream object, preventing it from being read in the future and clearing its dictionary.

Furthermore, the COSStream is kept around for the entire lifetime of the COSDocument, as its ancestor, the document Catalog, is retained by COSDocument.objectPool. That's by design, and it's ok. However, it means that every time a COSStream is read, its contents are cached until the document is closed.

As far as I can tell, the best solution seems to be to prevent COSStream from caching anything, then make sure callers of COSStream methods are equipped to handle that.

EDIT: I set up my JVM to dump the heap at 200MB, here's what I got:

Approx 25MB of ByteBandedRaster + 60MB of IntegerInterleavedRaster (cached images).
Approx 72MB (4500 x 16 KB) buffers in RandomAccessBuffer(s) belonging to COSStream.


> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg, java.hprof.zip
>
>
> ExtractImages seems to fail to release memory resources on some files in both PDFBox 1.8.5 and trunk.
> On this 4MB file [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if extracting every image on every page (and there are many, many duplicate images), there is an OOM with -Xmx1g. If there is no Xmx and there is > 2.5g available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have masks. I'm not sure, though, whether the issue is with PDFBox or Java.
> Commandlines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
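
For illustration only, the retention pattern described in John Hewson's comment can be sketched in plain Java. This is not PDFBox code: StreamHolder, CachingHolder, NonCachingHolder, decode and retainedAfterReadingAll are hypothetical names invented for this sketch, and decode is a placeholder that merely expands each byte fourfold instead of applying a real PDF filter. It is meant to show why a pool of long-lived objects that each keep a decoded copy grows with every stream that is read, while decoding on demand keeps nothing behind.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamCachingSketch {

    /** Placeholder "filter" (assumption): expands each input byte to four output bytes. */
    static byte[] decode(byte[] encoded) {
        byte[] out = new byte[encoded.length * 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = encoded[i / 4];
        }
        return out;
    }

    /** Common shape of the two hypothetical holders. */
    interface StreamHolder {
        InputStream openDecoded();
        long retainedBytes();
    }

    /** Keeps the decoded copy after the first read, for as long as the holder lives. */
    static final class CachingHolder implements StreamHolder {
        private final byte[] encoded;
        private byte[] decoded; // populated on first read, never released

        CachingHolder(byte[] encoded) { this.encoded = encoded; }

        public InputStream openDecoded() {
            if (decoded == null) {
                decoded = decode(encoded);
            }
            return new ByteArrayInputStream(decoded);
        }

        public long retainedBytes() { return decoded == null ? 0 : decoded.length; }
    }

    /** Decodes on every call and keeps no reference to the result. */
    static final class NonCachingHolder implements StreamHolder {
        private final byte[] encoded;

        NonCachingHolder(byte[] encoded) { this.encoded = encoded; }

        public InputStream openDecoded() {
            return new ByteArrayInputStream(decode(encoded));
        }

        public long retainedBytes() { return 0; }
    }

    /** Reads every holder in a long-lived pool once and sums what the holders still retain. */
    static long retainedAfterReadingAll(int count, int encodedSize, boolean caching) {
        List<StreamHolder> pool = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            if (caching) {
                pool.add(new CachingHolder(new byte[encodedSize]));
            } else {
                pool.add(new NonCachingHolder(new byte[encodedSize]));
            }
        }
        byte[] buf = new byte[8192];
        long retained = 0;
        for (StreamHolder holder : pool) {
            try (InputStream in = holder.openDecoded()) {
                while (in.read(buf) != -1) {
                    // drain the stream, as an image reader would
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            retained += holder.retainedBytes();
        }
        return retained;
    }

    public static void main(String[] args) {
        System.out.println("caching:     " + retainedAfterReadingAll(100, 4096, true) + " bytes retained");
        System.out.println("non-caching: " + retainedAfterReadingAll(100, 4096, false) + " bytes retained");
    }
}
```

In this sketch the caching variant's retained size scales with the number of holders read (count x 4 x encodedSize), which is the same shape as the "4500 x 16 KB" figure from the heap dump. The cost of the non-caching variant is that every read decodes again, so callers would need to avoid reading the same stream repeatedly.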