[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012145#comment-14012145
 ] 

Jeremias Maerki commented on PDFBOX-2101:
-----------------------------------------

Image compression can be extremely efficient, but if an image is decoded into 
a BufferedImage just so it can be exported to a compressed file again, it can 
take a lot of memory, as we see here. In the case of PDJpeg, it's a bit 
unfortunate that the image is loaded into a BufferedImage at all: JPEG is a 
lossy format, so decoding and re-encoding costs image quality as well as 
memory. Ideally, this class's write2OutputStream() method would just extract 
the compressed image, since what's in there is almost exactly a normal 
JFIF/JPEG file. In Apache FOP, for example, we can embed the compressed data 
stream into the PDF without actually decompressing and recompressing the image 
data (it's damn fast, too, and memory consumption is reduced to a small copy 
buffer). We just filter out things like the color profile, which goes into a 
separate object. Here it would have to be implemented the other way around: 
gathering the various objects associated with the JPEG image and reassembling 
the JFIF/JPEG file as closely to the original as possible. 
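
To illustrate the "little copy buffer" idea, here is a minimal sketch in plain 
Java. It is not PDFBox or FOP code: the class name JpegPassThrough and its 
methods are hypothetical, and it only shows the pass-through part (copying 
the still-compressed bytes unchanged). Reattaching objects such as the color 
profile is left out. It assumes the caller already has an InputStream over 
the raw, still DCT-encoded stream data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration only; not part of PDFBox or FOP.
final class JpegPassThrough {

    // Fixed-size copy buffer: memory use does not grow with the image size.
    private static final int BUFFER_SIZE = 8192;

    /** Returns true if the data starts with the JPEG start-of-image marker FF D8. */
    static boolean looksLikeJpeg(byte[] head) {
        return head != null
                && head.length >= 2
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xD8;
    }

    /** Copies the compressed bytes unchanged and returns the number of bytes copied. */
    static long copyRaw(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an encoded stream: 20000 bytes beginning with FF D8.
        byte[] fake = new byte[20000];
        fake[0] = (byte) 0xFF;
        fake[1] = (byte) 0xD8;

        // In real use the sink would be a FileOutputStream, so nothing
        // larger than the copy buffer is held in memory.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copyRaw(new ByteArrayInputStream(fake), sink);
        System.out.println(looksLikeJpeg(fake) + " " + copied); // prints "true 20000"
    }
}
```

Because nothing is decoded, no BufferedImage is ever allocated, and the output 
bytes are identical to the input bytes, so there is no generation loss. 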

> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>
> ExtractImages seems to fail to release memory resources on some files in both 
> PDFBox 1.8.5 and trunk.  
> On this 4MB file 
> [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if 
> extracting every image on every page (and there are many, many duplicate 
> images), there is an OOM with -Xmx1g.  If -Xmx is not set and more than 2.5g 
> is available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have 
> masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
> Commandlines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
