Re: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

Tilman Hausherr Wed, 23 Jul 2014 10:28:46 -0700

Hi Tim,

If you're working with pages (PDPage), you can also call .clear() after you're done.

Tilman


On 23.07.2014 18:26, Allison, Timothy B. wrote:

Andreas and Tilman,

   Thank you very much for fixing this so quickly.  I'm finally getting around 
to figuring out if we should change anything in the Tika code based on your 
fixes.  If I follow the example of the latest ExtractImages for the 1.8.x 
branch, I think I see that we should add:

1) resources.clear() at the end of processResources()
2) image.clear() after image.write2file()
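
To make the ordering of those two calls concrete, here is a minimal sketch of the pattern. It deliberately does not use the PDFBox classes: CachedImage and ResourceSet are hypothetical stand-ins for an image XObject and a resources object that cache decoded data, and decode() stands in for the write step. Only the call order (clear each image right after writing it, clear the resources at the end of processResources()) is taken from the list above; everything else is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ClearAfterUseSketch {

    /** Hypothetical stand-in for an image object that caches its decoded bytes. */
    public static class CachedImage {
        private final int decodedSize;
        private byte[] cache; // filled lazily by decode(), released by clear()

        public CachedImage(int decodedSize) {
            this.decodedSize = decodedSize;
        }

        /** Stands in for the write step: decoding populates the cache. */
        public byte[] decode() {
            if (cache == null) {
                cache = new byte[decodedSize];
            }
            return cache;
        }

        /** Drops the cached bytes so they can be garbage collected. */
        public void clear() {
            cache = null;
        }

        public boolean isCached() {
            return cache != null;
        }
    }

    /** Hypothetical stand-in for a resources object holding named images. */
    public static class ResourceSet {
        private final Map<String, CachedImage> images = new LinkedHashMap<>();

        public void put(String name, CachedImage image) {
            images.put(name, image);
        }

        public Map<String, CachedImage> getImages() {
            return images;
        }

        /** Drops the references held by this resource set. */
        public void clear() {
            images.clear();
        }
    }

    /** Processes every image, releasing each cache as soon as it is no longer needed. */
    public static int processResources(ResourceSet resources) {
        int written = 0;
        for (CachedImage image : resources.getImages().values()) {
            image.decode();   // placeholder for writing the image to a file
            written++;
            image.clear();    // step 2: release the image right after writing it
        }
        resources.clear();    // step 1: release the resources at the end
        return written;
    }

    public static void main(String[] args) {
        ResourceSet resources = new ResourceSet();
        resources.put("Im0", new CachedImage(1024));
        resources.put("Im1", new CachedImage(2048));
        System.out.println("images processed: " + processResources(resources));
    }
}
```

The point of clearing inside the loop rather than once at the end is that at most one decoded image is held at a time, which matters when a file contains many large or duplicated images.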

Is there anything else that our client code should do to decrease the memory 
footprint during extraction of images?  Thank you, again!

      Best,

               Tim

________________________________________
From: Andreas Lehmkühler (JIRA) [[email protected]]
Sent: Sunday, June 15, 2014 7:36 AM
To: [email protected]
Subject: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

      [ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-2101.
----------------------------------------

     Resolution: Fixed

Surprising memory consumption when extracting images
----------------------------------------------------

                 Key: PDFBOX-2101
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
             Project: PDFBox
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 1.8.5
         Environment: Windows 7
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
            Reporter: Tim Allison
            Assignee: Andreas Lehmkühler
            Priority: Minor
             Fix For: 1.8.6, 2.0.0

         Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
PDFBOX-2101-714-poor.jpg, java.hprof.zip


ExtractImages seems to fail to release memory resources on some files in both 
PDFBox 1.8.5 and trunk.
On this 4MB file 
[http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if extracting 
every image on every page (and there are many, many duplicate images), there is an 
OOM with -Xmx1g.  If there is no Xmx and there is > 2.5g available, 
ExtractImages will work.
With some experimentation, the triggers seem to be JPEG images that have masks.
I'm not sure, though, whether the issue is with PDFBox or Java.
Command lines:
1.8.5:
java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
2.0_SNAPSHOT:
java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
Results:
1.8.5: 906 files before OOM
{noformat}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
         at java.util.Arrays.copyOf(Arrays.java:2271)
         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
{noformat}
2.0_SNAPSHOT: 428 files before OOM
{noformat}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
         at java.util.Arrays.copyOf(Arrays.java:2271)
         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
