[jira] [Comment Edited] (PDFBOX-2101) Surprising memory consumption when extracting images

John Hewson (JIRA) Thu, 29 May 2014 15:07:25 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012906#comment-14012906
 ]


John Hewson edited comment on PDFBOX-2101 at 5/29/14 10:05 PM:
---------------------------------------------------------------

I did some memory profiling, and COSStream is holding onto a copy of the stream 
data in a RandomAccessFileOutputStream instance variable, "unFilteredStream", 
which it keeps for as long as the COSStream is around.

Each PDPage holds a reference to its "Page" COSDictionary, which in turn 
contains the Resources, and ultimately a COSStream containing a named image 
XObject stream:

Page
... Resources
...... XObject
......... obj1 (COSStream)

This would be fine, except that COSStream caches its data once it has been 
read! Specifically, when reading an image, COSStream.getUnfilteredStream() is 
called, which causes the RandomAccessFileOutputStream unFilteredStream to be 
populated. The only way to release unFilteredStream is to call 
COSStream.close(), but that destroys the entire COSStream object, clearing its 
dictionary and preventing it from being read in the future.
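
To make the retention pattern above concrete, here is a toy model in plain 
Java. It is only a sketch and not PDFBox code: CachingStream, Document and 
objectPool below are hypothetical stand-ins that mimic the behaviour described 
(a buffer that is filled on first read and then stays reachable for as long as 
its owner does).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every class here is a hypothetical stand-in, not a PDFBox type.
class CachingStreamDemo {

    /** Stand-in for a stream object that caches its decoded bytes on first read. */
    static class CachingStream {
        private final int decodedSize;
        private byte[] unfiltered; // filled on first read and never released

        CachingStream(int decodedSize) {
            this.decodedSize = decodedSize;
        }

        byte[] getUnfiltered() {
            if (unfiltered == null) {
                unfiltered = new byte[decodedSize]; // simulates decoding into a buffer
            }
            return unfiltered;
        }

        boolean isCached() {
            return unfiltered != null;
        }
    }

    /** Stand-in for a document whose pool keeps every stream reachable. */
    static class Document {
        final List<CachingStream> objectPool = new ArrayList<>();

        /** Sums the sizes of all buffers that are currently cached. */
        long cachedBytes() {
            long total = 0;
            for (CachingStream s : objectPool) {
                if (s.isCached()) {
                    total += s.decodedSize;
                }
            }
            return total;
        }
    }

    public static void main(String[] args) {
        Document doc = new Document();
        for (int i = 0; i < 100; i++) {
            doc.objectPool.add(new CachingStream(16 * 1024));
        }
        // Read every stream once and discard the result, as an extractor would.
        for (CachingStream s : doc.objectPool) {
            s.getUnfiltered();
        }
        // The caller holds no buffers, yet all of them are still reachable via the pool.
        System.out.println("bytes still cached: " + doc.cachedBytes());
    }
}
```

In this model the cached total only grows as more streams are read, because 
nothing short of dropping the whole Document makes the buffers unreachable.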

Furthermore, the COSStream is kept around for the entire lifetime of the 
COSDocument, because its ancestor, the document Catalog, is retained by 
COSDocument.objectPool. That's by design, and it's OK. However, it means that 
every time a COSStream is read, its contents are cached until the document is 
closed.

As far as I can tell, the best solution seems to be to prevent COSStream from 
caching anything, then make sure callers of COSStream methods are equipped to 
handle that.
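
As a rough illustration of that idea (again a sketch with hypothetical names, 
not the PDFBox API and not necessarily how a fix would look), the stream object 
below keeps only its encoded bytes and builds a fresh decoding InputStream on 
every call. The caller owns that stream and closes it, so no decoded copy 
outlives the read. java.util.zip stands in for a PDF Flate filter here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Toy model only: NonCachingStream is a hypothetical stand-in, not a PDFBox type.
class NonCachingStreamDemo {

    /** Holds only the encoded bytes; decoding happens on demand, per call. */
    static class NonCachingStream {
        private final byte[] encoded;

        NonCachingStream(byte[] encoded) {
            this.encoded = encoded;
        }

        /** Returns a new decoding stream each time; the caller must close it. */
        InputStream createUnfilteredStream() {
            return new InflaterInputStream(new ByteArrayInputStream(encoded));
        }
    }

    /** Helper for the demo: compresses raw bytes with Deflate. */
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(out)) {
            z.write(raw);
        }
        return out.toByteArray();
    }

    /** Reads a stream to the end through a small buffer and returns the byte count. */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        NonCachingStream s = new NonCachingStream(deflate(new byte[64 * 1024]));
        // Reading twice decodes twice; nothing decoded is retained in between.
        for (int i = 0; i < 2; i++) {
            try (InputStream in = s.createUnfilteredStream()) {
                System.out.println("decoded bytes: " + drain(in));
            }
        }
    }
}
```

The trade-off is the one hinted at above: callers that read the same stream 
repeatedly pay the decoding cost each time and must remember to close what they 
open.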

EDIT:

I set up my JVM to dump the heap at 200MB, here's what I got:

Approx. 25MB of ByteBandedRaster + 60MB of IntegerInterleavedRaster (cached 
images).
Approx. 72MB (4500 x 16KB) of buffers in RandomAccessBuffer instances 
belonging to COSStream.



> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
> PDFBOX-2101-714-poor.jpg, java.hprof.zip
>
>
> ExtractImages seems to fail to release memory resources on some files in both 
> PDFBox 1.8.5 and trunk.  
> On this 4MB file 
> [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if 
> extracting every image on every page (and there are many, many duplicate 
> images), there is an OOM with -Xmx1g.  If there is no Xmx and there is > 2.5g 
> available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have 
> masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
> Command lines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
