RE: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

Allison, Timothy B. Wed, 23 Jul 2014 11:26:29 -0700

Got it.  Will do.  Thank you.

________________________________________
From: Tilman Hausherr [thaush...@t-online.de]
Sent: Wednesday, July 23, 2014 1:28 PM
To: dev@pdfbox.apache.org
Subject: Re: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images


Hi Tim,
if you're working with pages (PDPage), you can also call .clear() after
you're done.
Tilman
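
The clear-after-use pattern discussed in this thread can be sketched roughly as follows. This is only an illustration, not PDFBox code: CachedImage is a hypothetical stand-in for any object that lazily caches decoded data and offers a clear() method, and extractAll() and countCached() are made-up helper names. The only assumption carried over from the thread is that calling clear() once you are done lets the cached data be garbage collected.

```java
import java.util.ArrayList;
import java.util.List;

public class ClearAfterUseSketch {

    /** Hypothetical stand-in for an object that caches decoded data (not a PDFBox class). */
    static final class CachedImage {
        private final int size;
        private byte[] decoded; // lazily filled cache

        CachedImage(int size) {
            this.size = size;
        }

        /** Simulates an expensive decode whose result is kept until clear() is called. */
        byte[] getDecoded() {
            if (decoded == null) {
                decoded = new byte[size];
            }
            return decoded;
        }

        /** Drops the cached data so it becomes eligible for garbage collection. */
        void clear() {
            decoded = null;
        }

        boolean isCached() {
            return decoded != null;
        }
    }

    /** Processes every image and clears each one right after use; returns total bytes seen. */
    static long extractAll(List<CachedImage> images) {
        long total = 0;
        for (CachedImage image : images) {
            try {
                total += image.getDecoded().length; // stand-in for writing the image to a file
            } finally {
                image.clear(); // release the cache even if the write step throws
            }
        }
        return total;
    }

    /** Counts how many images still hold decoded data. */
    static int countCached(List<CachedImage> images) {
        int count = 0;
        for (CachedImage image : images) {
            if (image.isCached()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<CachedImage> images = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            images.add(new CachedImage(1024));
        }
        System.out.println(extractAll(images));  // prints 3072
        System.out.println(countCached(images)); // prints 0
    }
}
```

Clearing inside the loop, rather than once at the end, keeps at most one decoded image alive at a time, which is the point when a document contains many large or duplicate images.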

On 23.07.2014 18:26, Allison, Timothy B. wrote:
> Andreas and Tilman,
>
>    Thank you very much for fixing this so quickly.  I'm finally getting 
> around to figuring out if we should change anything in the Tika code based on 
> your fixes.  If I follow the example of the latest ExtractImages for the 1.8.x 
> branch, I think I see that we should add:
>
> 1) resources.clear() at the end of processResources()
> 2) image.clear() after image.write2file()
>
> Is there anything else that our client code should do to decrease the memory 
> footprint during extraction of images?  Thank you, again!
>
>       Best,
>
>                Tim
>
> ________________________________________
> From: Andreas Lehmkühler (JIRA) [j...@apache.org]
> Sent: Sunday, June 15, 2014 7:36 AM
> To: dev@pdfbox.apache.org
> Subject: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images
>
>       [ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Andreas Lehmkühler resolved PDFBOX-2101.
> ----------------------------------------
>
>      Resolution: Fixed
>
>> Surprising memory consumption when extracting images
>> ----------------------------------------------------
>>
>>                  Key: PDFBOX-2101
>>                  URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>>              Project: PDFBox
>>           Issue Type: Bug
>>           Components: Utilities
>>     Affects Versions: 1.8.5
>>          Environment: Windows 7
>> java version "1.7.0_55"
>> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>>             Reporter: Tim Allison
>>             Assignee: Andreas Lehmkühler
>>             Priority: Minor
>>              Fix For: 1.8.6, 2.0.0
>>
>>          Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
>> PDFBOX-2101-714-poor.jpg, java.hprof.zip
>>
>>
>> ExtractImages seems to fail to release memory resources on some files in 
>> both PDFBox 1.8.5 and trunk.
>> On this 4MB file 
>> [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], 
>> extracting every image on every page (and there are many, many duplicate 
>> images) causes an OOM with -Xmx1g.  If -Xmx is not set and more than 2.5g 
>> is available, ExtractImages will work.
>> With some experimentation, the triggers seem to be JPEG images that have 
>> masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
>> Command lines:
>> 1.8.5:
>> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
>> 2.0_SNAPSHOT:
>> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
>> Results:
>> 1.8.5: 906 files before OOM
>> {noformat}
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>          at java.util.Arrays.copyOf(Arrays.java:2271)
>>          at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>          at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>          at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>>          at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>>          at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>>          at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>>          at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>>          at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>>          at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>>          at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
>> {noformat}
>> 2.0_SNAPSHOT: 428 files before OOM
>> {noformat}
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>          at java.util.Arrays.copyOf(Arrays.java:2271)
>>          at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>          at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>          at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>>          at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>>          at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>>          at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>>          at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>>          at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>>          at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>>          at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>>          at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>>          at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
>> {noformat}
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
