Hi Tim,
If you're working with pages (PDPage), you can also call .clear() on each
page once you're done with it.
Tilman
On 23.07.2014 18:26, Allison, Timothy B. wrote:
Andreas and Tilman,
Thank you very much for fixing this so quickly. I'm finally getting around
to figuring out if we should change anything in the Tika code based on your
fixes. If I follow the example of the latest ExtractImages for the 1.8.x
branch, I think I see that we should add:
1) resources.clear() at the end of processResources()
2) image.clear() after image.write2file()
Is there anything else that our client code should do to decrease the memory
footprint during extraction of images? Thank you, again!
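The two steps listed above follow a "use it, then clear its cache" pattern. Below is a minimal sketch of that pattern only. CachedImage and ClearAfterUseSketch are hypothetical stand-ins so that the example compiles without PDFBox on the classpath; they are not PDFBox classes. In real client code the clear() calls would be the ones named in the list above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an object that decodes lazily and caches the result.
// It is NOT a PDFBox class; it only mimics the "decode, cache, clear()" shape.
final class CachedImage {
    private final int decodedSize;
    private byte[] decoded; // cached decoded bytes, null until first use

    CachedImage(int decodedSize) {
        this.decodedSize = decodedSize;
    }

    // Decodes on the first call and keeps the result for later calls.
    byte[] getDecoded() {
        if (decoded == null) {
            decoded = new byte[decodedSize];
        }
        return decoded;
    }

    boolean isCached() {
        return decoded != null;
    }

    // Drops the cached bytes so the garbage collector can reclaim them.
    void clear() {
        decoded = null;
    }
}

final class ClearAfterUseSketch {
    // Uses each image once, clears it right away, and returns how many
    // images still hold cached data afterwards.
    static int extractAll(List<CachedImage> images) {
        long bytesHandled = 0;
        for (CachedImage image : images) {
            // Stands in for writing the image out to a file.
            bytesHandled += image.getDecoded().length;
            // Release the cache before moving on to the next image.
            image.clear();
        }
        int stillCached = 0;
        for (CachedImage image : images) {
            if (image.isCached()) {
                stillCached++;
            }
        }
        return stillCached;
    }

    public static void main(String[] args) {
        List<CachedImage> images = new ArrayList<CachedImage>();
        images.add(new CachedImage(1024));
        images.add(new CachedImage(2048));
        System.out.println("images still cached: " + extractAll(images));
    }
}
```

Clearing inside the loop means at most one decoded image is held at a time, which matters when a document repeats the same large image on many pages.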
Best,
Tim
________________________________________
From: Andreas Lehmkühler (JIRA) [j...@apache.org]
Sent: Sunday, June 15, 2014 7:36 AM
To: dev@pdfbox.apache.org
Subject: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when
extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-2101.
----------------------------------------
Resolution: Fixed
Surprising memory consumption when extracting images
----------------------------------------------------
Key: PDFBOX-2101
URL: https://issues.apache.org/jira/browse/PDFBOX-2101
Project: PDFBox
Issue Type: Bug
Components: Utilities
Affects Versions: 1.8.5
Environment: Windows 7
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
Fix For: 1.8.6, 2.0.0
Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg,
PDFBOX-2101-714-poor.jpg, java.hprof.zip
ExtractImages seems to fail to release memory resources on some files in both
PDFBox 1.8.5 and trunk.
On this 4MB file
[http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], extracting
every image on every page (and there are many, many duplicate images) causes an
OOM with -Xmx1g. If -Xmx is not set and more than 2.5g is available,
ExtractImages will work.
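To see how close a run gets to the heap limit, a small helper along these lines could be called between pages or images. This is only a rough sketch: HeapReport is a hypothetical name and is not part of PDFBox. It uses nothing beyond the standard java.lang.Runtime methods.

```java
// Hypothetical helper (not part of PDFBox) for logging heap usage during extraction.
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    // Bytes currently in use: heap the JVM has allocated minus the free part of it.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Upper bound the JVM will attempt to use; reflects -Xmx when that flag is set.
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // One-line summary suitable for printing after each processed page.
    static String describe() {
        return "heap used " + (usedBytes() / MIB) + " MiB of max " + (maxBytes() / MIB) + " MiB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The used figure includes garbage that has not been collected yet, so a steadily rising trend across pages is more telling than any single reading.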
With some experimentation, the triggers seem to be JPEG images that have masks.
I'm not sure, though, whether the issue is with PDFBox or Java.
Command lines:
1.8.5:
java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
2.0_SNAPSHOT:
java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
Results:
1.8.5: 906 files before OOM
{noformat}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
    at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
    at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
    at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
{noformat}
2.0_SNAPSHOT: 428 files before OOM
{noformat}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
    at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
    at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
    at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
    at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
    at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
{noformat}