Got it. Will do. Thank you.

________________________________________
From: Tilman Hausherr [thaush...@t-online.de]
Sent: Wednesday, July 23, 2014 1:28 PM
To: dev@pdfbox.apache.org
Subject: Re: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

Hi Tim,

If you're working with pages (PDPage), you can also call .clear() after you're done.

Tilman

On 23.07.2014 18:26, Allison, Timothy B. wrote:
> Andreas and Tilman,
>
> Thank you very much for fixing this so quickly. I'm finally getting
> around to figuring out if we should change anything in the Tika code based on
> your fixes. If I follow the example of the latest ExtractImages for the 1.8.x
> branch, I think I see that we should add:
>
> 1) resources.clear() at the end of processResources()
> 2) image.clear() after image.write2file()
>
> Is there anything else that our client code should do to decrease the memory
> footprint during extraction of images? Thank you, again!
>
> Best,
>
> Tim
>
> ________________________________________
> From: Andreas Lehmkühler (JIRA) [j...@apache.org]
> Sent: Sunday, June 15, 2014 7:36 AM
> To: dev@pdfbox.apache.org
> Subject: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when
> extracting images
>
> [
> https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Andreas Lehmkühler resolved PDFBOX-2101.
> ----------------------------------------
>
> Resolution: Fixed
>
>> Surprising memory consumption when extracting images
>> ----------------------------------------------------
>>
>> Key: PDFBOX-2101
>> URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>> Project: PDFBox
>> Issue Type: Bug
>> Components: Utilities
>> Affects Versions: 1.8.5
>> Environment: Windows 7
>> java version "1.7.0_55"
>> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>> Reporter: Tim Allison
>> Assignee: Andreas Lehmkühler
>> Priority: Minor
>> Fix For: 1.8.6, 2.0.0
>>
>> Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg,
>> PDFBOX-2101-714-poor.jpg, java.hprof.zip
>>
>>
>> ExtractImages seems to fail to release memory resources on some files in
>> both PDFBox 1.8.5 and trunk.
>> On this 4MB file
>> [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if
>> extracting every image on every page (and there are many, many duplicate
>> images), there is an OOM with -Xmx1g. If there is no Xmx and there is more
>> than 2.5g available, ExtractImages will work.
>> With some experimentation, the triggers seem to be JPEG images that have
>> masks. I'm not sure, though, whether the issue is with PDFBox or Java.
>> Command lines:
>> 1.8.5:
>> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
>> 2.0_SNAPSHOT:
>> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
>> Results:
>> 1.8.5: 906 files before OOM
>> {noformat}
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>     at java.util.Arrays.copyOf(Arrays.java:2271)
>>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>>     at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>>     at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>>     at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>>     at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>>     at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>>     at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>>     at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
>> {noformat}
>> 2.0_SNAPSHOT: 428 files before OOM
>> {noformat}
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>     at java.util.Arrays.copyOf(Arrays.java:2271)
>>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>>     at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>>     at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>>     at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>>     at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>>     at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>>     at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>>     at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>>     at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>>     at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
>> {noformat}
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)