[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031832#comment-14031832 ]

Andreas Lehmkühler commented on PDFBOX-2101:

IMHO for now we are done here, aren't we? I'd like to close this before releasing 1.8.6.

Surprising memory consumption when extracting images

Key: PDFBOX-2101
URL: https://issues.apache.org/jira/browse/PDFBOX-2101
Project: PDFBox
Issue Type: Bug
Components: Utilities
Affects Versions: 1.8.5
Environment: Windows 7
  java version 1.7.0_55
  Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
  Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
Fix For: 1.8.6, 2.0.0
Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg, java.hprof.zip

ExtractImages seems to fail to release memory resources on some files in both PDFBox 1.8.5 and trunk.

On this 4MB file [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if extracting every image on every page (and there are many, many duplicate images), there is an OOM with -Xmx1g. If there is no Xmx and there is 2.5g available, ExtractImages will work.

With some experimentation, the triggers seem to be JPEG images that have masks. I'm not sure, though, whether the issue is with PDFBox or Java.

Commandlines:
1.8.5: java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
2.0_SNAPSHOT: java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf

Results:
1.8.5: 906 files before OOM
{noformat}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
    at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
    at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
    at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
{noformat}

2.0_SNAPSHOT: 428 files before OOM
{noformat}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
    at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
    at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
    at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
    at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
    at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
{noformat}

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031834#comment-14031834 ]

Tilman Hausherr commented on PDFBOX-2101:

Yes
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020125#comment-14020125 ]

Tilman Hausherr commented on PDFBOX-2101:

Committed 2nd attempt in rev 1600968 to save JPEGs directly, only RGB and Gray. John, please test before I do the 1.8 branch.
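
Saving a JPEG "directly" presumably means writing the already-encoded stream bytes to the output without first decoding them into an in-memory image. The following is only a minimal sketch of that idea in plain Java, not the code from rev 1600968: the class name DirectStreamCopy and its copy method are hypothetical, and the byte array in main merely stands in for an encoded image stream. The point is that a fixed-size buffer keeps heap use constant regardless of the image size.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class DirectStreamCopy {
    private static final int BUFFER_SIZE = 8192;

    /**
     * Copies every byte from in to out through one fixed-size buffer,
     * so memory use does not grow with the size of the stream.
     * Returns the number of bytes copied.
     */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an encoded JPEG stream. A real caller would pass the
        // undecoded image stream and a FileOutputStream instead of these
        // in-memory streams, which are used here only to keep the demo
        // self-contained.
        byte[] fakeJpeg = new byte[100_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(fakeJpeg), sink);
        System.out.println(copied + " bytes copied");
    }
}
```

This contrasts with the stack traces in the issue description, where the whole stream is accumulated in a ByteArrayOutputStream that has to be reallocated and copied as it grows.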
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018712#comment-14018712 ]

Tilman Hausherr commented on PDFBOX-2101:

[~dave.smith] did you try this with the current version from svn after rendering?
{code}
if (pdPage.getResources() != null)
{
    pdPage.getResources().clear();
}
{code}
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018734#comment-14018734 ]

Dave Smith commented on PDFBOX-2101:

That works VERY nicely. Should we add an option to PDFRenderer so it could clear its resources after it has finished rendering the page?
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018745#comment-14018745 ]

Tilman Hausherr commented on PDFBOX-2101:

I don't think that this belongs in PDFRenderer. I'd rather add a clear() method to PDPage that does this call. [~lehmi] WDYT?
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018754#comment-14018754 ]

Dave Smith commented on PDFBOX-2101:

I would agree that adding a clearResources to PDPage would be great; however, it would be nice to bury the whole logic of clearing page resources in the renderer itself. It would alert anyone using PDFRenderer that resources can be cleaned up (plus we would not have to look up the page again).

Compare:

    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    PDFRenderer render = new PDFRenderer(document);
    for (int i = 0; i < pages.size(); i++)
    {
        BufferedImage pageImage = render.renderImageWithDPI(i, 200f, ImageType.GRAY);
        PDPage pdPage = pages.get(i);
        // clear resources once the page is finished loading
        if (pdPage.getResources() != null)
        {
            pdPage.getResources().clear();
        }
        // do something here
    }

With:

    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    PDFRenderer render = new PDFRenderer(document);
    render.clearPageResourcesAfterRender(); // proposed option
    for (int i = 0; i < pages.size(); i++)
    {
        BufferedImage pageImage = render.renderImageWithDPI(i, 200f, ImageType.GRAY);
        // do something with the image
    }
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018977#comment-14018977 ]

Andreas Lehmkühler commented on PDFBOX-2101:

All resources of a page are now automatically cleared by default after the conversion to an image. The user may decide to disable that feature when creating the PDFRenderer. I've added those changes to the trunk in revision http://svn.apache.org/r1600699.

The 1.8 branch doesn't have a PDFRenderer, so there I've only added a clear method to the PDPage class, in revision http://svn.apache.org/r1600701.
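
The behaviour described in this comment (clear a page's resources after rendering by default, with an opt-out chosen when the renderer is created) can be sketched roughly as follows. This is an illustration under stated assumptions only: CachingPage and ClearingRenderer are hypothetical stand-ins, not PDFBox classes, and the actual PDFRenderer constructor and option are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a page that caches decoded resources. */
final class CachingPage {
    private final List<byte[]> cachedResources = new ArrayList<>();

    /** Pretends to render: decoding a resource fills the per-page cache. */
    String render() {
        cachedResources.add(new byte[1024]);
        return "image";
    }

    /** Drops everything this page has cached. */
    void clearResources() {
        cachedResources.clear();
    }

    int cachedCount() {
        return cachedResources.size();
    }
}

/** Hypothetical renderer; clearing after render is on unless disabled. */
final class ClearingRenderer {
    private final boolean clearAfterRender;

    ClearingRenderer() {
        this(true); // default: release page resources after each render
    }

    ClearingRenderer(boolean clearAfterRender) {
        this.clearAfterRender = clearAfterRender;
    }

    String renderPage(CachingPage page) {
        try {
            return page.render();
        } finally {
            // Runs even if rendering throws, so a failed page does not
            // keep its cached resources alive.
            if (clearAfterRender) {
                page.clearResources();
            }
        }
    }

    public static void main(String[] args) {
        CachingPage page = new CachingPage();
        new ClearingRenderer().renderPage(page);
        System.out.println("cached after render: " + page.cachedCount());
    }
}
```

Putting the cleanup in a finally block is a design choice of this sketch; callers who render the same page repeatedly could pass false to keep the cache and trade memory for speed.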
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017814#comment-14017814 ]
Dave Smith commented on PDFBOX-2101:
What we do is convert each page of the PDF to an image. Once I have the image, I am done with the page. It would be nice if the references that the page was holding could be cleared out of the global cache. If page 2 needs a filter that was already loaded on page 1, then so be it. Right now we cannot render more than 30 pages without the JVM running out of memory. Sure, it might be a bit slower, but that is better than it not working at all.
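The trade-off described in this comment (drop cached resources after each page and accept reloading them later) can be illustrated with a minimal sketch. It does not use the PDFBox API: `PerPageCacheSketch`, its methods, and the 64-byte placeholder payload are hypothetical names invented for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a shared resource cache; not a PDFBox class. */
final class PerPageCacheSketch {
    private final Map<String, byte[]> cache = new HashMap<>();
    private int loads;       // how often a resource had to be (re)loaded
    private int peakEntries; // largest number of entries held at once

    private void use(String resourceName) {
        if (!cache.containsKey(resourceName)) {
            cache.put(resourceName, new byte[64]); // placeholder for decoded data
            loads++;
        }
        peakEntries = Math.max(peakEntries, cache.size());
    }

    /** Each page is modelled as the list of resource names it needs. */
    void renderAll(List<List<String>> pages, boolean clearAfterPage) {
        for (List<String> page : pages) {
            for (String name : page) {
                use(name);
            }
            if (clearAfterPage) {
                cache.clear(); // page is done, release everything it pulled in
            }
        }
    }

    int loads() { return loads; }

    int peakEntries() { return peakEntries; }
}
```

In this model, clearing after every page increases the number of loads when pages share resources, but bounds the cache by the needs of a single page, which is the "a bit slower but working" outcome the comment asks for.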
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015504#comment-14015504 ]
John Hewson commented on PDFBOX-2101:
Yes, that should work nicely.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014972#comment-14014972 ]
Tilman Hausherr commented on PDFBOX-2101:
I have inserted another clear() in extractImages in rev 1598978 for the trunk and rev 1598979 for the 1.8 branch. This makes it possible to handle the file of PDFBOX-1350 without getting an OutOfMemoryError. The current resources.clear() comes too late because it is triggered only after the whole list of images has been processed.
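Why the timing of clear() matters can be shown with a small, self-contained sketch. It is not the code from the revisions mentioned above: `CachedImage` and `ClearTimingSketch` are hypothetical classes that only model an image object holding a lazily decoded buffer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical image object with a lazily filled decode cache; not a PDFBox class. */
final class CachedImage {
    private final int decodedSize;
    private byte[] decoded;

    CachedImage(int decodedSize) { this.decodedSize = decodedSize; }

    byte[] getDecoded() {
        if (decoded == null) {
            decoded = new byte[decodedSize]; // placeholder for real decoding
        }
        return decoded;
    }

    boolean isCached() { return decoded != null; }

    void clear() { decoded = null; }
}

final class ClearTimingSketch {
    /** Returns the largest number of images that were cached at the same time. */
    static int peakCachedImages(List<CachedImage> images, boolean clearAfterEach) {
        int peak = 0;
        for (CachedImage image : images) {
            image.getDecoded(); // stands in for writing the image to a file
            int live = 0;
            for (CachedImage other : images) {
                if (other.isCached()) {
                    live++;
                }
            }
            peak = Math.max(peak, live);
            if (clearAfterEach) {
                image.clear(); // early clear: release before touching the next image
            }
        }
        for (CachedImage image : images) {
            image.clear(); // late clear: only after the whole list is done
        }
        return peak;
    }

    static List<CachedImage> sample(int count) {
        List<CachedImage> images = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            images.add(new CachedImage(1024));
        }
        return images;
    }
}
```

With only the late clear, every decoded buffer in the list is alive at once; with the additional clear inside the loop, at most one is. That is the effect attributed above to the extra clear() call, under the assumption that each image keeps its own decoded data until cleared.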
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015024#comment-14015024 ]
Tilman Hausherr commented on PDFBOX-2101:
John - Re: ExtractImages and YCCK etc.: after a lot of testing, it looks to me as if it would be enough to check that JPEG files have a Gray or RGB colorspace, i.e. that all of these can be written directly.
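The proposed check can be sketched as a single predicate. This is an illustration only: `JpegPassThroughSketch` and `canWriteDirectly` are hypothetical names, and testing against the PDF colour space names `DeviceGray` and `DeviceRGB` is an assumption about how "Gray or RGB" would be detected, not the check that was committed.

```java
/** Hypothetical helper; not part of PDFBox. */
final class JpegPassThroughSketch {
    /**
     * Returns true if an embedded JPEG with the given colour space name could be
     * written to disk unchanged. Anything else (for example CMYK/YCCK data)
     * would need to be decoded and re-encoded instead.
     */
    static boolean canWriteDirectly(String colorSpaceName) {
        return "DeviceGray".equals(colorSpaceName)
                || "DeviceRGB".equals(colorSpaceName);
    }
}
```

Writing the raw stream directly avoids decoding the image into memory at all, which is why the check is relevant to the memory problem in this issue.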
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014700#comment-14014700 ]
Andreas Lehmkühler commented on PDFBOX-2101:
I've fixed the regression in revisions 1598883 and 1598884. [~tilman] Thanks for the pointer!
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014795#comment-14014795 ]
Tilman Hausherr commented on PDFBOX-2101:
Yes, that solved it, thanks.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013797#comment-14013797 ]
Andreas Lehmkühler commented on PDFBOX-2101:
I've added a clear() method to PDFont and PDXObject to delete cached resources if necessary, in revisions 1598627 (trunk) and 1598633 (1.8 branch). Those methods are called when clearing PDResources. PDFont.clear is still empty, but I'm going to fill in some stuff soon.
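The delegation described here (clearing the resources container calls clear() on each font and XObject it holds) might look roughly like the following sketch. All types are hypothetical stand-ins (`Clearable`, `SketchFont`, `SketchXObject`, `SketchResources`); the real PDFont, PDXObject and PDResources classes are not reproduced.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical contract for anything that holds releasable cached data. */
interface Clearable {
    void clear();
}

/** Stand-in for a font; its clear() is empty, as the comment says PDFont.clear still is. */
final class SketchFont implements Clearable {
    @Override
    public void clear() {
        // nothing cached yet in this sketch
    }
}

/** Stand-in for an image XObject that caches decoded pixels. */
final class SketchXObject implements Clearable {
    private byte[] cachedPixels = new byte[1024];

    boolean hasCache() { return cachedPixels != null; }

    @Override
    public void clear() {
        cachedPixels = null;
    }
}

/** Stand-in for the resources container that owns fonts and XObjects. */
final class SketchResources implements Clearable {
    private final Map<String, Clearable> entries = new LinkedHashMap<>();

    void put(String name, Clearable entry) { entries.put(name, entry); }

    int size() { return entries.size(); }

    @Override
    public void clear() {
        for (Clearable entry : entries.values()) {
            entry.clear(); // let each member drop its own cached data first
        }
        entries.clear();   // then drop the references themselves
    }
}
```

Calling clear() on each member before dropping the map matters when other objects still reference a font or XObject: the container can no longer keep the cached data alive, and neither can the member itself.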
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013904#comment-14013904 ]
Andreas Lehmkühler commented on PDFBOX-2101:
I've implemented clear() for some of the classes inherited from PDFont in revisions 1598655 (trunk) and 1598657 (1.8 branch). This should lead to a smaller memory footprint, as some objects can be released earlier.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014083#comment-14014083 ]
Tilman Hausherr commented on PDFBOX-2101:
Sorry, but there's a rendering problem with the 2nd page of PDFBOX-2103:
{code}
Start rendering page 2
30.05.2014 20:39:20.854 WARN [main] org.apache.pdfbox.util.PDFStreamEngine:557 - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)
    at java.util.ArrayList.get(ArrayList.java:411)
    at org.apache.pdfbox.cos.COSArray.getObject(COSArray.java:188)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.init(PDType0Font.java:63)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:72)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:209)
    at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:615)
    at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:53)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:544)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:223)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:164)
    at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:214)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:147)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:96)
    at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:414)
    at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:208)
30.05.2014 20:39:20.866 WARN [main] org.apache.pdfbox.util.PDFStreamEngine:356 - java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:352)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:43)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:544)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:223)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:164)
    at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:214)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:147)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:96)
    at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:414)
    at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:208)
{code}
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014101#comment-14014101 ]
Tilman Hausherr commented on PDFBOX-2101:
The file from PDFBOX-1283 also has a rendering problem.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012145#comment-14012145 ]
Jeremias Maerki commented on PDFBOX-2101:
One thing here is that image compression can be extremely efficient, but if an image is decoded into a BufferedImage just so it can be exported into a compressed file again, it can take a lot of memory, as we see here. In the case of PDJpeg, it's a bit unfortunate that the image is loaded into a BufferedImage, since JPEG is a lossy compression format. Ideally, this class's write2OutputStream() method would just extract the compressed image, since what's in there is almost exactly a normal JFIF/JPEG file.
In Apache FOP, for example, we can embed the compressed data stream into the PDF without actually decompressing and recompressing the image data (it's damn fast, too, and memory consumption is reduced to a little copy buffer). We're just filtering out stuff like the color profile, which goes into a separate object. Here it would have to be implemented the other way around: gathering the various objects associated with the JPEG image and reassembling the JFIF/JPEG file as closely to the original as possible.
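The pass-through approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not PDFBox or FOP code: the class name `RawStreamCopy` and its `copy` method are hypothetical, and the sketch assumes the caller already has an InputStream over the still-compressed JPEG bytes. It shows only the "little copy buffer" part; it does not reassemble a color profile or any other associated object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper (not part of PDFBox): copies an already-compressed
// stream to its destination without decoding it into a BufferedImage.
class RawStreamCopy {

    // Copies all bytes from in to out through a fixed 8 KB buffer and
    // returns the number of bytes copied. Memory use stays at the buffer
    // size no matter how large the decoded image would be.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes for an embedded JPEG stream (assumption: real code
        // would obtain the undecoded stream from the PDF image object).
        byte[] fakeJpeg = {(byte) 0xFF, (byte) 0xD8, 1, 2, 3, (byte) 0xFF, (byte) 0xD9};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(fakeJpeg), out);
        System.out.println(copied + " bytes copied without decoding");
    }
}
```

The trade-off is the one raised in this thread: a raw copy avoids the decode and re-encode cost, but anything stored outside the stream (such as an explicitly assigned color profile) is not carried over unless it is added back separately.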
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012191#comment-14012191 ]
Tilman Hausherr commented on PDFBOX-2101:
Done in rev 1598218 for the 1.8 branch.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012199#comment-14012199 ]
Jeremias Maerki commented on PDFBOX-2101:
Please note that color fidelity will suffer this way, as the resulting JPEG will no longer have a color profile if one was assigned explicitly inside the PDF file.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012212#comment-14012212 ]
Tilman Hausherr commented on PDFBOX-2101:
[~jeremias.mae...@outline.ch] yes... this is a matter of how extractImages is defined. Should it extract only the payload, or should it do a full-service job, or something in between? Is there a method to insert the color profile into the JPEG stream? I could also revert the change... JPEG images are not that common anyway.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012247#comment-14012247 ]
Jeremias Maerki commented on PDFBOX-2101:
Well, I guess it's a matter of priorities. What's more important: lower memory consumption and better performance, or color fidelity (which can also be added later should anyone complain)? I'd leave it like this for now. The important thing is that it's documented. The only thing that could be useful would be a TODO in the source code.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012261#comment-14012261 ]
Tilman Hausherr commented on PDFBOX-2101:
Added TODO in rev 1598244 for the 1.8 branch and rev 1598245 for the trunk.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012278#comment-14012278 ]
Andreas Lehmkühler commented on PDFBOX-2101:
It depends on the point of view. We already extract just the raw data without any post-processing in some cases (e.g. masked images, grouped content, etc.). I guess we are done here for now, aren't we?
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012298#comment-14012298 ]
Tim Allison commented on PDFBOX-2101:

Thank you, all, for your work on this! I can't speak for the entire Tika community, but I suspect that the most common use case would be to extract one of each image (whether or not the image appears on 20 pages). A caching parameter would be very handy for this. Those who want to extract 20 copies of the same image can choose to take the potential memory hit for the sake of speed. We have a decent method to configure PDFBox in Tika, and it would be great to add this if it isn't too much effort. Thank you, again.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012306#comment-14012306 ]
Andreas Lehmkühler commented on PDFBOX-2101:

Just to avoid a misunderstanding: an image that is used more than once should be part of the document resources rather than the resources of each page on which it appears, so such images won't be extracted more than once. Tilman's approach takes into account that the images from page 1 may no longer be needed once page 2 is rendered, and therefore shouldn't stay cached. Those resources are dropped by calling clear() on PDResources, which is done automatically when parsing a PDF, and I've added the same behaviour to the ExtractImages class.
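The effect of dropping page resources after each page can be illustrated with a small model. This is only a sketch under stated assumptions: `PageResourceCacheSketch`, `getImage`, `cachedBytes` and `peakBytes` are hypothetical names invented for the illustration and are not PDFBox classes or methods; the only idea taken from the comment above is that a per-page `clear()` keeps decoded images from accumulating across pages.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-page resource cache; NOT a PDFBox class. */
class PageResourceCacheSketch {
    // Decoded image data, keyed by a (hypothetical) resource name.
    private final Map<String, byte[]> decoded = new HashMap<>();

    /** Returns the cached decoded image, "decoding" it on first access. */
    byte[] getImage(String name, int decodedSize) {
        return decoded.computeIfAbsent(name, k -> new byte[decodedSize]);
    }

    /** Total number of bytes currently held by the cache. */
    long cachedBytes() {
        long total = 0;
        for (byte[] data : decoded.values()) {
            total += data.length;
        }
        return total;
    }

    /** Drops everything that was cached, mirroring the idea of a per-page clear(). */
    void clear() {
        decoded.clear();
    }

    /** Walks all pages and reports the largest amount of cached data seen. */
    static long peakBytes(int pages, int imagesPerPage, int decodedSize, boolean clearAfterPage) {
        PageResourceCacheSketch cache = new PageResourceCacheSketch();
        long peak = 0;
        for (int p = 0; p < pages; p++) {
            for (int i = 0; i < imagesPerPage; i++) {
                cache.getImage("p" + p + "-Im" + i, decodedSize);
                peak = Math.max(peak, cache.cachedBytes());
            }
            if (clearAfterPage) {
                cache.clear();
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println(peakBytes(10, 3, 1024, false)); // 30720: grows with the page count
        System.out.println(peakBytes(10, 3, 1024, true));  // 3072: bounded by a single page
    }
}
```

Without the per-page clear, the peak grows with the number of pages; with it, the peak is bounded by the largest single page.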
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012326#comment-14012326 ]
Tim Allison commented on PDFBOX-2101:

Ah, ok, thank you. That makes sense. To confirm my understanding of [~jeremias.mae...@outline.ch]'s point...PDFBox is caching the uncompressed image? That would explain why I'm seeing this: I'm running hprof with trunk now with no -Xmx on a linux box, and ExtractImages has exported 223 images (many more to go!). The exported images take up ~17m, but Java is choosing to use 1.1gb of memory. I'll submit the hprof results when that completes for kicks...
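To see why cached uncompressed images can dwarf the exported files, it helps to work out the size of a decoded raster. The sketch below is an illustration only: the class and method names are made up, and the 2550 x 3300 RGB dimensions are an assumed example, not measurements taken from 239665.pdf.

```java
/** Back-of-the-envelope size of a decoded (uncompressed) image; illustrative only. */
class DecodedImageSizeSketch {

    /** Bytes needed for the raw samples, with each row padded to a whole byte. */
    static long decodedBytes(int width, int height, int components, int bitsPerComponent) {
        long bitsPerRow = (long) width * components * bitsPerComponent;
        long bytesPerRow = (bitsPerRow + 7) / 8;
        return bytesPerRow * height;
    }

    /** How many decoded images of the given size fit into a heap of heapBytes. */
    static long imagesThatFit(long heapBytes, long bytesPerImage) {
        return heapBytes / bytesPerImage;
    }

    public static void main(String[] args) {
        // Assumed example: a letter-size scan at 300 dpi, 8-bit RGB.
        long perImage = decodedBytes(2550, 3300, 3, 8);
        System.out.println(perImage);                              // 25245000
        System.out.println(imagesThatFit(1L << 30, perImage));     // 42
    }
}
```

Under these assumed dimensions, a few dozen retained rasters are enough to fill a 1 GB heap, even when the compressed JPEGs on disk are a small fraction of that.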
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012588#comment-14012588 ]
John Hewson commented on PDFBOX-2101:

We probably shouldn't be dumping the JPEG stream directly to disk: the result is not necessarily a useful JPEG file, because PDF can perform quite a bit of post-processing. Tilman's changes in r1598244/r1598245 cause a regression in ExtractImages for the DocuPrint file from PDFBOX-1058, for example - the extracted images are inverted. Perhaps we could dump the raw JPEG in cases where we're sure that there is no post-processing, but it's not a good default if you want useful images.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012619#comment-14012619 ]
Tilman Hausherr commented on PDFBOX-2101:

Maybe add some option so that both are possible? The regression that you mention is exactly what [~jerem...@apache.org] expected.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012622#comment-14012622 ]
John Hewson commented on PDFBOX-2101:

I'm not sure that's what Jeremias was expecting - it's not a classic color fidelity issue with the colors being slightly off due to the wrong profile - it's completely the wrong color model, with YCbCr being treated as CMYK!
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012784#comment-14012784 ]
Tilman Hausherr commented on PDFBOX-2101:

I thought about this again and decided to revert; this was done in rev 1598379 for the 1.8 branch and rev 1598382 for the trunk. The reason is that writing the raw JPEG stream by default would provide a bad user experience and lead to support requests. I'm thinking about a better idea: only write those JPEGs directly that have no decodeParams.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012796#comment-14012796 ]
John Hewson commented on PDFBOX-2101:

Yeah, we can dump the raw JPEG in cases where we're sure that there is no post-processing. The cases to exclude are images with DecodeParams, but also YCCK images (see DCTFilter) and images using PDF color spaces.
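The conditions listed in the comment above can be read as a simple predicate. The following is a minimal sketch, assuming the caller has already worked out each flag; `RawJpegPolicySketch` and `canWriteRawJpeg` are hypothetical names and do not correspond to anything in the PDFBox code base. `DCTDecode` is the PDF filter name for JPEG data.

```java
/** Hypothetical decision helper; NOT part of PDFBox. */
class RawJpegPolicySketch {

    /**
     * Returns true only when the stream is JPEG data (DCTDecode) and none of the
     * post-processing cases named in the discussion apply.
     *
     * @param filter            name of the stream filter, e.g. "DCTDecode"
     * @param hasDecodeParams   true if the stream carries decode parameters
     * @param isYcck            true if the JPEG data is YCCK encoded
     * @param usesPdfColorSpace true if the image relies on a PDF color space
     */
    static boolean canWriteRawJpeg(String filter, boolean hasDecodeParams,
                                   boolean isYcck, boolean usesPdfColorSpace) {
        return "DCTDecode".equals(filter)
                && !hasDecodeParams
                && !isYcck
                && !usesPdfColorSpace;
    }

    public static void main(String[] args) {
        System.out.println(canWriteRawJpeg("DCTDecode", false, false, false)); // true
        System.out.println(canWriteRawJpeg("DCTDecode", false, true, false));  // false
        System.out.println(canWriteRawJpeg("FlateDecode", false, false, false)); // false
    }
}
```

Any image that fails the predicate would go through the normal decode-and-re-encode path, so the fast path stays an optimisation rather than a change in output.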
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012906#comment-14012906 ]

John Hewson commented on PDFBOX-2101:

I did some memory profiling. COSStream holds a copy of the stream data in its RandomAccessFileOutputStream instance variable unFilteredStream, and keeps it for as long as the COSStream is around.

Each PDPage holds a reference to its page COSDictionary, which in turn contains the Resources and, ultimately, a COSStream containing a named image XObject stream:

Page
  Resources
    XObject
      obj1 (COSStream)

This would be fine, except that COSStream caches its data once it has been read. Specifically, reading an image calls COSStream.getUnfilteredStream(), which populates the RandomAccessFileOutputStream unFilteredStream. The only way to close unFilteredStream is to call COSStream.close(), but that destroys the entire COSStream object: it clears the dictionary and prevents the stream from being read again.

Furthermore, the COSStream is kept around for the entire lifetime of the COSDocument, because its ancestor, the document Catalog, is retained by COSDocument.objectPool. This means that every time a COSStream is read, its contents are cached until the document is closed.

As far as I can tell, the best solution is to prevent COSStream from caching anything, and then make sure callers of COSStream methods are equipped to handle that.
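The retention pattern described in this comment can be illustrated with a small, self-contained Java sketch. It does not use PDFBox: CachingStream is a hypothetical stand-in for a stream object that caches its decoded bytes on first read, and the list named pool is a hypothetical stand-in for a long-lived object pool. The point is only that, once every stream has been read, the retained memory grows with the number of streams and stays reachable for as long as the pool does.

```java
import java.util.ArrayList;
import java.util.List;

public class CachingStreamSketch {

    /** Hypothetical stream that caches its decoded data on first read (not a PDFBox class). */
    static class CachingStream {
        private final int decodedSize;
        private byte[] decoded; // populated on first read and never released

        CachingStream(int decodedSize) {
            this.decodedSize = decodedSize;
        }

        byte[] read() {
            if (decoded == null) {
                // Stands in for decoding the stream; the result is kept by the object.
                decoded = new byte[decodedSize];
            }
            return decoded;
        }

        boolean isCached() {
            return decoded != null;
        }
    }

    /** Sums the cached bytes held by every stream still reachable from the pool. */
    static long retainedBytes(List<CachingStream> pool) {
        long total = 0;
        for (CachingStream s : pool) {
            if (s.isCached()) {
                total += s.decodedSize;
            }
        }
        return total;
    }

    /** Builds a pool, reads every stream once, and reports what remains cached. */
    static long retainedAfterReadingAll(int streamCount, int decodedSize) {
        List<CachingStream> pool = new ArrayList<>();
        for (int i = 0; i < streamCount; i++) {
            pool.add(new CachingStream(decodedSize));
        }
        for (CachingStream s : pool) {
            s.read(); // the caller drops the returned array, but the stream keeps it
        }
        return retainedBytes(pool);
    }

    public static void main(String[] args) {
        // 100 streams of 16 KB each remain cached after a single pass.
        System.out.println(retainedAfterReadingAll(100, 16 * 1024));
    }
}
```

In this sketch nothing is freed until the pool itself becomes unreachable, which mirrors the document-lifetime retention described above.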
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012962#comment-14012962 ]

Tilman Hausherr commented on PDFBOX-2101:

This means that image objects are cached twice: once as a BufferedImage and once as unFilteredStream (the decoded version of what's in the PDF).
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013001#comment-14013001 ]

John Hewson commented on PDFBOX-2101:

Yep, that's right. I set up my JVM to dump the heap at 200MB; here's what I got:

Approx. 25MB of ByteBandedRaster + 60MB of IntegerInterleavedRaster (cached images).
Approx. 72MB (4500 x 16KB) of buffers in RandomAccessBuffer(s) belonging to COSStream.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011439#comment-14011439 ]

Andreas Lehmkühler commented on PDFBOX-2101:

I can confirm the described OOM on Windows 7 64-bit, but NOT on Fedora 20 Linux. I checked both with PDFBox 1.8.5.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011445#comment-14011445 ]

Jeremias Maerki commented on PDFBOX-2101:

While working on this, we should consider switching to the ByteArrayOutputStream provided by Commons IO, which doesn't need to constantly copy arrays when growing. That makes it faster and lowers its peak memory consumption.
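The difference between growing one array and appending fixed-size chunks can be sketched as follows. This is a minimal illustration of the general idea only, not the Commons IO source: ChunkedByteOutput and its 4096-byte CHUNK_SIZE are hypothetical names and values chosen for the example. Writes never copy previously written data; a single copy happens only when toByteArray() is called.

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chunk-list output stream: growth allocates a new chunk instead of copying. */
public class ChunkedByteOutput extends OutputStream {
    private static final int CHUNK_SIZE = 4096; // illustrative value

    private final List<byte[]> chunks = new ArrayList<>();
    private int posInChunk = CHUNK_SIZE; // "full", so the first write allocates a chunk
    private int size;

    /** Returns the chunk to write into, allocating a fresh one when the last is full. */
    private byte[] currentChunk() {
        if (posInChunk == CHUNK_SIZE) {
            chunks.add(new byte[CHUNK_SIZE]);
            posInChunk = 0;
        }
        return chunks.get(chunks.size() - 1);
    }

    @Override
    public void write(int b) {
        currentChunk()[posInChunk++] = (byte) b;
        size++;
    }

    @Override
    public void write(byte[] src, int off, int len) {
        if (off < 0 || len < 0 || off + len > src.length) {
            throw new IndexOutOfBoundsException();
        }
        while (len > 0) {
            byte[] chunk = currentChunk();
            int n = Math.min(len, CHUNK_SIZE - posInChunk);
            System.arraycopy(src, off, chunk, posInChunk, n);
            posInChunk += n;
            off += n;
            len -= n;
            size += n;
        }
    }

    public int size() {
        return size;
    }

    public int chunkCount() {
        return chunks.size();
    }

    /** Assembles the chunks into one array; this is the only full copy of the data. */
    public byte[] toByteArray() {
        byte[] out = new byte[size];
        int copied = 0;
        for (byte[] chunk : chunks) {
            int n = Math.min(CHUNK_SIZE, size - copied);
            System.arraycopy(chunk, 0, out, copied, n);
            copied += n;
        }
        return out;
    }
}
```

By contrast, java.io.ByteArrayOutputStream calls Arrays.copyOf when it grows, so the old and new arrays are both live during the copy. That is the frame at the top of the stack traces in this issue.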
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011452#comment-14011452 ]

Andreas Lehmkühler commented on PDFBOX-2101:

java.io.ByteArrayOutputStream is used in many places. Do you have a particular one in mind?
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011455#comment-14011455 ]

Andreas Lehmkühler commented on PDFBOX-2101:

I've fixed the issue in revisions 1598103 (trunk) and 1598104 (1.8 branch) by clearing the resources at the end of each step.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011457#comment-14011457 ]

Tilman Hausherr commented on PDFBOX-2101:

2.0 has higher memory consumption because of a clone in some class (I mentioned this a few months ago). The real cause of the problem, IMHO, is that images are cached. One option would be some sort of cache manager that stores a limited number of images and orphans older ones when new images are built.
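One possible shape for such a cache manager is a size-bounded, least-recently-used map. The sketch below is an assumption about how the idea could look, not anything taken from PDFBox: BoundedImageCache is a hypothetical name, and the eviction policy (drop the least recently used entry once a fixed entry count is exceeded) is one choice among several. It relies only on java.util.LinkedHashMap in access order with removeEldestEntry overridden.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache: keeps at most a fixed number of entries, evicting the LRU one. */
public class BoundedImageCache<K, V> {
    private final LinkedHashMap<K, V> map;

    public BoundedImageCache(final int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be at least 1");
        }
        // accessOrder = true: iteration order runs from least to most recently used.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each insertion; returning true drops the eldest entry.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, or null; a hit marks the entry as recently used. */
    public synchronized V get(K key) {
        return map.get(key);
    }

    public synchronized void put(K key, V value) {
        map.put(key, value);
    }

    /** Membership check that does not affect the usage order. */
    public synchronized boolean contains(K key) {
        return map.containsKey(key);
    }

    public synchronized int size() {
        return map.size();
    }
}
```

An evicted entry simply becomes unreachable from the cache, so the garbage collector can reclaim the image once no caller holds it. A bound on total bytes rather than entry count would suit images of widely varying size better, at the cost of tracking each entry's size.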
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011469#comment-14011469 ]

Jeremias Maerki commented on PDFBOX-2101:

java.io.ByteArrayOutputStream is just a bit naïvely implemented. The Commons IO variant has the exact same API but doesn't copy data around so much. So, no, no special occasion. It's just a better replacement in every case.
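To illustrate the point about copying: the stack traces in the issue description show java.io.ByteArrayOutputStream growing through Arrays.copyOf, which briefly keeps both the old and the new backing array alive on every resize. The following is a minimal sketch of the alternative idea, keeping a list of fixed-size chunks and copying only once at the end. The class name ChunkedByteBuffer and the chunk size are hypothetical; this is not the Commons IO or PDFBox source, only an illustration of the technique.

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: an output stream that appends new chunks when full
 * instead of reallocating and copying one ever-growing array.
 */
final class ChunkedByteBuffer extends OutputStream {
    private static final int CHUNK_SIZE = 8192; // assumed size, chosen for illustration

    private final List<byte[]> fullChunks = new ArrayList<>();
    private byte[] current = new byte[CHUNK_SIZE];
    private int posInCurrent = 0;
    private int total = 0;

    /** Retire the current chunk and start a fresh one; no existing data is copied. */
    private void startNewChunk() {
        fullChunks.add(current);
        current = new byte[CHUNK_SIZE];
        posInCurrent = 0;
    }

    @Override
    public void write(int b) {
        if (posInCurrent == current.length) {
            startNewChunk();
        }
        current[posInCurrent++] = (byte) b;
        total++;
    }

    @Override
    public void write(byte[] src, int off, int len) {
        while (len > 0) {
            if (posInCurrent == current.length) {
                startNewChunk();
            }
            int n = Math.min(len, current.length - posInCurrent);
            System.arraycopy(src, off, current, posInCurrent, n);
            posInCurrent += n;
            off += n;
            len -= n;
            total += n;
        }
    }

    int size() {
        return total;
    }

    /** The only full copy happens here, once, into an array of the exact final size. */
    byte[] toByteArray() {
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] chunk : fullChunks) {
            System.arraycopy(chunk, 0, out, pos, chunk.length);
            pos += chunk.length;
        }
        System.arraycopy(current, 0, out, pos, posInCurrent);
        return out;
    }
}
```

With this layout the peak overhead while writing is at most one partially filled chunk, whereas a doubling array can transiently need roughly the old and the new array at the same time.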
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011472#comment-14011472 ]

Andreas Lehmkühler commented on PDFBOX-2101:

Ok, I see. We should keep that in mind and refactor some of the code from time to time. That class is used a lot within PDFBox.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011475#comment-14011475 ]

Andreas Lehmkühler commented on PDFBOX-2101:

[~talli...@apache.org] Please double check if this works as it seems to be at least partly platform/environment specific.
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011725#comment-14011725 ]

John Hewson commented on PDFBOX-2101:

{quote}
The real cause of the problem IMHO is that images are cached. What could be done is to create some sort of cache manager that stores a certain amount of images and orphans them when new images are built.
{quote}

I don't know, it seems like the ImageXObject is being kept around for too long, the cached image only lives as long as this object. If that's a problem then I'd recommend just getting rid of image caching and letting the end-user do it in the (relatively rare?) case that they need it.
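The "cache manager" idea quoted above (store a certain number of images and drop older ones when new ones arrive) can be sketched with the JDK's LinkedHashMap in access order. The class name BoundedImageCache and the entry-count limit are assumptions made for illustration; nothing here is taken from the PDFBox code base, and a real cache would more likely bound memory use than entry count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a bounded cache: keeps at most maxEntries values and
 * evicts the least recently used one when a new entry pushes it over the limit.
 */
final class BoundedImageCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    BoundedImageCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true drops the eldest entry
        return size() > maxEntries;
    }
}
```

An evicted image becomes eligible for garbage collection as soon as nothing else references it, which is the "orphaning" behaviour described in the quote. Note that LinkedHashMap is not thread-safe, so concurrent rendering would need external synchronization.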
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011787#comment-14011787 ]

Tilman Hausherr commented on PDFBOX-2101:

What if the image is a company logo that is on every page of a big file? In that case caching would make sense, and the user can't do it if he just wants to render all the pages.
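One way to reconcile the two positions (reuse a decoded image that recurs on every page, but do not pin it in memory) is to hold the decoded result through a java.lang.ref.SoftReference, which the garbage collector is allowed to clear under memory pressure. The sketch below is an assumption-laden illustration: SoftCachedValue and its loader are hypothetical names, and this is not how PDFBox is stated to work anywhere in this thread.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch: caches an expensive-to-build value behind a SoftReference.
 * The value is reused while it is still reachable and rebuilt after the GC clears it.
 */
final class SoftCachedValue<T> {
    private final Callable<T> loader;               // assumed: decodes the image on demand
    private SoftReference<T> ref = new SoftReference<T>(null);
    private int loadCount = 0;

    SoftCachedValue(Callable<T> loader) {
        this.loader = loader;
    }

    synchronized T get() throws Exception {
        T value = ref.get();
        if (value == null) {
            // Either never loaded, or cleared by the garbage collector: decode again
            value = loader.call();
            ref = new SoftReference<T>(value);
            loadCount++;
        }
        return value;
    }

    /** Number of times the loader actually ran; useful for checking reuse. */
    synchronized int loadCount() {
        return loadCount;
    }
}
```

When the heap is not under pressure, a logo used on every page would typically be decoded once and reused. When memory is tight, the collector may reclaim it and the next access pays the decoding cost again instead of failing with an OutOfMemoryError.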
[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images
[ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011875#comment-14011875 ]

Tim Allison commented on PDFBOX-2101:

[~lehmi], thank you for looking into this so quickly. Your mods definitely took the edge off! Thank you.

Both 1.8.6 SNAPSHOT and 2.0 SNAPSHOT passed with -Xmx1g. On Windows 7, Java was choosing to use up to 800mb with 2.0 SNAPSHOT and ~600mb with 1.8.6 SNAPSHOT. Both versions of PDFBox were also still successful with -Xmx500m.

It still feels like something odd is going on in that Java is choosing to consume up to 800m to export images from a 4m file. It feels a bit like the old substring() feature of Java. Your modifications have definitely helped, but my poor gc is very, very tired. :) Thank you!

P.S. For the record, I tested 1.8.5 on:
Red Hat Enterprise Linux Server release 6.5 (Santiago)
java version 1.7.0_40
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)
and I still had OOM with -Xmx1g...it isn't just Windows (although I'm willing to believe it could be!)