[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031832#comment-14031832
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


IMHO for now we are done here, aren't we? I'd like to close this before 
releasing 1.8.6.

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Fix For: 1.8.6, 2.0.0

 Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
 PDFBOX-2101-714-poor.jpg, java.hprof.zip


 ExtractImages seems to fail to release memory resources on some files in both
 PDFBox 1.8.5 and trunk.
 On this 4MB file
 [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if
 extracting every image on every page (and there are many, many duplicate
 images), there is an OOM with -Xmx1g.  If there is no Xmx and there is 2.5g
 available, ExtractImages will work.
 With some experimentation, the triggers seem to be JPEG images that have
 masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
 Command lines:
 1.8.5:
 java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
 2.0_SNAPSHOT:
 java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
 Results:
 1.8.5: 906 files before OOM
 {noformat}
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
     at java.util.Arrays.copyOf(Arrays.java:2271)
     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
     at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
     at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
     at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
     at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
     at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
     at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
     at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
 {noformat}
 2.0_SNAPSHOT: 428 files before OOM
 {noformat}
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
     at java.util.Arrays.copyOf(Arrays.java:2271)
     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
     at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
     at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
     at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
     at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
     at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
     at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
     at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
     at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
     at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-15 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031834#comment-14031834
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

Yes



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020125#comment-14020125
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

Committed 2nd attempt in rev 1600968 to save JPEGs directly, only RGB and Gray. 
John, please test before I do the 1.8 branch.
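
As a rough illustration of what saving an already-encoded image "directly" can look like, here is a minimal sketch in plain java.io, independent of PDFBox. It copies the encoded bytes to the target in fixed-size chunks instead of decoding them into a raster or collecting them in a growing byte array. The class name `DirectCopySketch`, the method `copyEncoded`, and the placeholder data are assumptions made for this example; this is not the code committed in the revision mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class DirectCopySketch {
    /**
     * Copies encoded image bytes from in to out using one fixed 8 KB buffer.
     * Memory use stays constant regardless of the image size.
     * Returns the number of bytes copied.
     */
    static long copyEncoded(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for a JPEG-encoded stream (hypothetical data).
        byte[] encoded = new byte[20000];
        Path target = Files.createTempFile("image-", ".jpg");
        try (InputStream in = new ByteArrayInputStream(encoded);
             OutputStream out = Files.newOutputStream(target)) {
            long copied = copyEncoded(in, out);
            System.out.println("copied " + copied + " bytes to " + target);
        }
        Files.delete(target); // remove the demo file again
    }
}
```

The point of the sketch is only the shape of the loop: nothing in it needs a buffer proportional to the image, which is the kind of allocation that fails in the stack traces quoted in the issue description.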



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-05 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018712#comment-14018712
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

[~dave.smith] did you try this with the current version from svn after 
rendering?
{code}
if (pdPage.getResources() != null)
{   
  pdPage.getResources().clear(); 
}
{code}



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-05 Thread Dave Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018734#comment-14018734
 ] 

Dave Smith commented on PDFBOX-2101:


That works VERY nicely. Should we add an option to PDFRenderer so it could 
clear its resources after it has finished rendering the page?



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-05 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018745#comment-14018745
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

I don't think that this belongs in PDFRenderer. I'd rather add a clear() method 
to PDPage that does this call.

[~lehmi] WDYT ?



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-05 Thread Dave Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018754#comment-14018754
 ] 

Dave Smith commented on PDFBOX-2101:


I agree that adding a clearResources method to PDPage would be great. However, 
it would be nice to bury the whole logic of clearing page resources in the 
renderer itself. It would alert anyone using PDFRenderer that resources can be 
cleaned up (plus we would not have to look up the page again). Compare:

List<PDPage> pages = document.getDocumentCatalog().getAllPages();
PDFRenderer render = new PDFRenderer(document);
for (int i = 0; i < pages.size(); i++)
{
    BufferedImage pageImage = render.renderImageWithDPI(i, 200f, ImageType.GRAY);

    // clear resources once the page is finished loading
    PDPage pdPage = pages.get(i);
    if (pdPage.getResources() != null)
    {
        pdPage.getResources().clear();
    }
    // do something here
}

with:

List<PDPage> pages = document.getDocumentCatalog().getAllPages();
PDFRenderer render = new PDFRenderer(document);
render.clearPageResourcesAfterRender(); // proposed new option
for (int i = 0; i < pages.size(); i++)
{
    BufferedImage pageImage = render.renderImageWithDPI(i, 200f, ImageType.GRAY);
    // do something with the image
}


[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018977#comment-14018977
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


All resources of a page are now automatically cleared by default after the 
conversion to an image. The user may decide to disable that feature when 
creating the PDFRenderer. I've added those changes to the trunk in revision 
http://svn.apache.org/r1600699. There isn't any PDFRenderer in the 1.8 branch, 
so there I've only added a clear method to the PDPage class, in revision 
http://svn.apache.org/r1600701
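
The behaviour described here (clear a page's cached resources after rendering unless the caller opts out) can be sketched without PDFBox. Everything in the block below is a hypothetical stand-in written for illustration: `Page`, `Renderer`, the `clearResources` constructor flag, and the resource names are assumptions, not the PDFBox classes or the code from the revisions above.

```java
import java.util.HashMap;
import java.util.Map;

final class ClearAfterRenderSketch {
    /** Stand-in for a page that caches decoded resources (not a PDFBox class). */
    static final class Page {
        private final Map<String, byte[]> cache = new HashMap<>();

        /** Loads a resource into the cache unless it is already there. */
        void load(String name, int size) {
            cache.computeIfAbsent(name, k -> new byte[size]);
        }

        /** Drops all cached resources so they become eligible for garbage collection. */
        void clear() {
            cache.clear();
        }

        int cachedCount() {
            return cache.size();
        }
    }

    /** Stand-in renderer whose constructor flag controls clearing (not a PDFBox class). */
    static final class Renderer {
        private final boolean clearResources;

        Renderer(boolean clearResources) {
            this.clearResources = clearResources;
        }

        /** Pretends to render: loads two resources, then optionally clears them. */
        int render(Page page) {
            page.load("Im0", 1024);
            page.load("Im1", 2048);
            int used = page.cachedCount();
            if (clearResources) {
                page.clear();
            }
            return used;
        }
    }

    public static void main(String[] args) {
        Page cleared = new Page();
        new Renderer(true).render(cleared);
        Page kept = new Page();
        new Renderer(false).render(kept);
        System.out.println("cached after render, clearing on:  " + cleared.cachedCount());
        System.out.println("cached after render, clearing off: " + kept.cachedCount());
    }
}
```

Making clearing the default and letting the caller switch it off trades some speed on documents that reuse resources across pages for a bounded memory footprint on long documents.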

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
 PDFBOX-2101-714-poor.jpg, java.hprof.zip


 ExtractImages seems to fail to release memory resources on some files in both 
 PDFBox 1.8.5 and trunk.  
 On this file 4MB file 
 [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if 
 extracting every image on every page (and there are many, many duplicate 
 images), there is an OOM with -Xmx1g.  If there is no Xmx and there is  2.5g 
 available, ExtractImages will work.
 With some experimentation, the triggers seem to be JPEG images that have 
 masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
 Commandlines:
 1.8.5:
 java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 
 239665.pdf
 2.0_SNAPSHOT:
 java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar 
 org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
 Results:
 1.8.5: 906 files before OOM
 {noformat}
 Exception in thread main java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.ja
 va:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:
 514)
 at 
 org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDP
 ixelMap.java:217)
 at 
 org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStr
 eam(PDPixelMap.java:363)
 at 
 org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(
 PDXObjectImage.java:254)
 at 
 org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:2
 02)
 at 
 org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
 at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
 {noformat}
 2.0_SNAPSHOT: 428 files before OOM
 {noformat}
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
 at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
 at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
 at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
 at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
 at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
 at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
 at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
 at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
 {noformat}





[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-04 Thread Dave Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017814#comment-14017814
 ] 

Dave Smith commented on PDFBOX-2101:



What we do is convert each page of the PDF to an image. Once I have the image, 
I am done with the page. It would be nice if the references that the page 
was holding could be cleared out of the global cache. If page 2 needs a filter 
that was already loaded on page 1 and has to load it again, so be it. Right now 
we cannot render more than 30 pages without the JVM running out of memory. Sure, 
it might be a bit slower, but that is better than not working at all.





[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-02 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015504#comment-14015504
 ] 

John Hewson commented on PDFBOX-2101:
-

Yes, that should work nicely.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-01 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014972#comment-14014972
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

I have inserted another clear() in extractImages in rev 1598978 for the trunk 
and rev 1598979 for the 1.8 branch. This makes it possible to handle the file 
of PDFBOX-1350 without getting an OutOfMemoryError. The existing 
resources.clear() comes too late because it is triggered only after the whole 
list of images has been processed.
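
To illustrate why the position of the clear() call matters, here is a minimal 
sketch in plain Java. It does not use PDFBox; CachedImage, sampleImages and 
peakRetainedBytes are hypothetical names made up for this example, and the 
byte array only stands in for decoded image data. It compares the peak amount 
of cached data when each image is cleared right after use with the peak when 
everything is cleared only after the whole list has been handled.

```java
import java.util.ArrayList;
import java.util.List;

public class ClearTimingSketch {

    /** Hypothetical stand-in for an image that caches its decoded bytes. */
    static class CachedImage {
        private final int decodedSize;
        private byte[] cache;

        CachedImage(int decodedSize) {
            this.decodedSize = decodedSize;
        }

        /** Decodes on first use and keeps the result until clear() is called. */
        byte[] decode() {
            if (cache == null) {
                cache = new byte[decodedSize];
            }
            return cache;
        }

        void clear() {
            cache = null;
        }

        int retainedBytes() {
            return cache == null ? 0 : cache.length;
        }
    }

    /** Builds a fresh list of images that have not been decoded yet. */
    static List<CachedImage> sampleImages(int count, int decodedSize) {
        List<CachedImage> images = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            images.add(new CachedImage(decodedSize));
        }
        return images;
    }

    /** Returns the largest number of cached bytes held at any point in the loop. */
    static long peakRetainedBytes(List<CachedImage> images, boolean clearAfterEach) {
        long peak = 0;
        for (CachedImage image : images) {
            image.decode(); // stands in for writing the image to a file
            long retained = 0;
            for (CachedImage other : images) {
                retained += other.retainedBytes();
            }
            peak = Math.max(peak, retained);
            if (clearAfterEach) {
                image.clear(); // early clear: release before the next image
            }
        }
        if (!clearAfterEach) {
            for (CachedImage image : images) {
                image.clear(); // late clear: only after the whole list is done
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("late clear peak:  "
                + peakRetainedBytes(sampleImages(10, 1024), false));
        System.out.println("early clear peak: "
                + peakRetainedBytes(sampleImages(10, 1024), true));
    }
}
```

In this toy model the late variant holds all ten buffers at its peak, while 
the early variant never holds more than one. The real code paths are more 
involved, so this is only meant to show the shape of the problem.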



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-06-01 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015024#comment-14015024
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

John - Re: ExtractImages and YCCK etc.: after a lot of testing, it looks to me 
as if it would be enough to check that JPEG files have a Gray or RGB 
colorspace, i.e. that all of these can be written directly.
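
A rough sketch of such a check, in plain Java without any PDFBox types. The 
enum JpegColorSpace and the method canWriteDirectly are hypothetical names 
used only for illustration, and CMYK is listed next to YCCK as an assumed 
example of a colorspace that would need some other path.

```java
public class JpegColorSpaceSketch {

    /** Hypothetical labels for the colorspace of an embedded JPEG. */
    enum JpegColorSpace { GRAY, RGB, CMYK, YCCK }

    /** True when the embedded JPEG data could be written out unchanged. */
    static boolean canWriteDirectly(JpegColorSpace colorSpace) {
        return colorSpace == JpegColorSpace.GRAY
                || colorSpace == JpegColorSpace.RGB;
    }

    public static void main(String[] args) {
        for (JpegColorSpace cs : JpegColorSpace.values()) {
            String action = canWriteDirectly(cs)
                    ? "write directly"
                    : "needs another path";
            System.out.println(cs + " -> " + action);
        }
    }
}
```

What the "other path" looks like (for example decoding and re-encoding) is 
not settled by the comment above and is left open here.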



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014700#comment-14014700
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


I've fixed the regression in revisions 1598883 and 1598884. 

[~tilman] Thanks for the pointer!



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-31 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014795#comment-14014795
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

Yes that solved it, thanks.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013797#comment-14013797
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


I've added a clear() method to PDFont and PDXObject to delete cached resources 
if necessary in revisions 1598627 (trunk) and 1598633 (1.8 branch). Those 
methods are called when clearing PDResources.
PDFont.clear() is still empty, but I'm going to fill it in soon.
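
The general pattern can be sketched as follows. This is not the committed 
code: FontSketch, XObjectSketch and ResourcesSketch are hypothetical 
stand-ins written in plain Java, and dropping the map entries at the end of 
clear() is an assumption of this sketch rather than something stated above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceClearSketch {

    /** Hypothetical font stand-in; clear() has nothing to release yet. */
    static class FontSketch {
        int clearCalls;

        void clear() {
            clearCalls++; // placeholder, mirrors a clear() that is still empty
        }
    }

    /** Hypothetical XObject stand-in that caches some decoded data. */
    static class XObjectSketch {
        byte[] cachedData = new byte[4096];

        void clear() {
            cachedData = null; // make the cached data eligible for GC
        }
    }

    /** Hypothetical resources container that cascades clear() to its members. */
    static class ResourcesSketch {
        final Map<String, FontSketch> fonts = new LinkedHashMap<>();
        final Map<String, XObjectSketch> xobjects = new LinkedHashMap<>();

        void clear() {
            for (FontSketch font : fonts.values()) {
                font.clear();
            }
            for (XObjectSketch xobject : xobjects.values()) {
                xobject.clear();
            }
            // Assumption of this sketch: also drop the references themselves.
            fonts.clear();
            xobjects.clear();
        }
    }

    public static void main(String[] args) {
        ResourcesSketch resources = new ResourcesSketch();
        XObjectSketch image = new XObjectSketch();
        resources.fonts.put("F1", new FontSketch());
        resources.xobjects.put("Im1", image);

        resources.clear();
        System.out.println("cache released: " + (image.cachedData == null));
    }
}
```

The point of the cascade is that a caller only has to clear the container 
once it is done with a page; the individual objects decide what they release.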



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013904#comment-14013904
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


I've implemented clear() for some of the classes derived from PDFont in 
revisions 1598655 (trunk) and 1598657 (1.8 branch). 
This should lead to a smaller memory footprint, as some objects can be 
released earlier.

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
 PDFBOX-2101-714-poor.jpg, java.hprof.zip


 ExtractImages seems to fail to release memory resources on some files in both 
 PDFBox 1.8.5 and trunk.  
 On this file 4MB file 
 [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if 
 extracting every image on every page (and there are many, many duplicate 
 images), there is an OOM with -Xmx1g.  If there is no Xmx and there is  2.5g 
 available, ExtractImages will work.
 With some experimentation, the triggers seem to be JPEG images that have 
 masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
 Commandlines:
 1.8.5:
 java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 
 239665.pdf
 2.0_SNAPSHOT:
 java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar 
 org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
 Results:
 1.8.5: 906 files before OOM
 {noformat}
 Exception in thread main java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
 at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
 at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
 at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
 {noformat}
 2.0_SNAPSHOT: 428 files before OOM
 {noformat}
 Exception in thread main java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
 at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
 at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
 at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
 at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
 at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
 at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
 at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
 at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-30 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014083#comment-14014083
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

Sorry, but there's a rendering problem with the 2nd page of PDFBOX-2103:
{code}
Start rendering page 2
30.05.2014 20:39:20.854 WARN  [main] org.apache.pdfbox.util.PDFStreamEngine:557 - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at org.apache.pdfbox.cos.COSArray.getObject(COSArray.java:188)
at org.apache.pdfbox.pdmodel.font.PDType0Font.init(PDType0Font.java:63)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:72)
at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:209)
at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:615)
at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:53)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:544)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:223)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:164)
at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:214)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:147)
at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:96)
at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:414)
at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:208)
30.05.2014 20:39:20.866 WARN  [main] org.apache.pdfbox.util.PDFStreamEngine:356 - java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:352)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:43)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:544)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:223)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:205)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:164)
at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:214)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:147)
at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:96)
at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:414)
at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:208)
{code}




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-30 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014101#comment-14014101
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

The file from PDFBOX-1283 also has a rendering problem.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Jeremias Maerki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012145#comment-14012145
 ] 

Jeremias Maerki commented on PDFBOX-2101:
-

One thing here is that image compression can be extremely efficient, but if an 
image is decoded into a BufferedImage just so it can be exported into a 
compressed file again, it can take a lot of memory, as we see here. In the case 
of PDJpeg, it's a bit unfortunate that the image is loaded into a BufferedImage, 
since JPEG is a lossy compression format and re-encoding loses quality again. 
Ideally, this class's write2OutputStream() method would just extract the 
compressed image, since what's in there is almost exactly a normal JFIF/JPEG 
file. In Apache FOP, for example, we can embed the compressed data stream into 
the PDF without actually decompressing and recompressing the image data (it's 
damn fast, too, and memory consumption is reduced to a little copy buffer). 
We're just filtering out stuff like the color profile, which goes into a 
separate object. Here it would have to be implemented the other way around: 
gathering the various objects associated with the JPEG image and reassembling 
the JFIF/JPEG file as closely to the original as possible. 
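
To illustrate the "little copy buffer" idea, here is a minimal sketch in plain Java. It assumes the caller has already obtained the still-DCT-encoded image bytes as an InputStream; how that stream is obtained from the PDF library is not shown, and the class name RawStreamCopy is hypothetical. Heap use stays at the size of the buffer, regardless of the image dimensions, because no BufferedImage is ever created.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class RawStreamCopy {

    /**
     * Copies every byte from in to out through a fixed 8 KB buffer.
     * Nothing is decoded, so memory use does not depend on image size.
     * Returns the number of bytes copied.
     */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an encoded JPEG stream (assumption: real data would
        // come from the image object's undecoded stream).
        byte[] fakeJpeg = {(byte) 0xFF, (byte) 0xD8, 1, 2, 3, (byte) 0xFF, (byte) 0xD9};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(fakeJpeg), sink);
        System.out.println(copied + " bytes copied");
    }
}
```

In a real extractor the sink would be a FileOutputStream for the target .jpg file. Any color profile stored as a separate PDF object is not carried over by a plain copy like this.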


[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012191#comment-14012191
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

Done in rev 1598218 for the 1.8 branch.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Jeremias Maerki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012199#comment-14012199
 ] 

Jeremias Maerki commented on PDFBOX-2101:
-

Please note that color fidelity will suffer this way, as the resulting JPEG 
will no longer have a color profile if one was assigned explicitly inside the 
PDF file.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012212#comment-14012212
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

[~jeremias.mae...@outline.ch] yes... this is a matter of how extractImages 
is defined. Should it extract only the payload, or should it do a full 
service job, or some of it? Is there a method to insert the color profile into 
the JPEG stream?

I could also revert the change... JPEG images are not that common anyway.
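
On the color profile question: the usual convention for carrying an ICC profile in a JPEG file is one or more APP2 marker segments, each starting with the identifier "ICC_PROFILE" plus a zero byte, followed by a 1-based chunk number and the total chunk count. The sketch below only builds those segments. Splicing them into the stream (after the JFIF APP0 segment, if there is one) is left out, and the class name IccApp2Segments is hypothetical, not part of PDFBox.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class IccApp2Segments {

    // A marker segment holds at most 65535 bytes including its 2-byte length
    // field; 14 more bytes go to the identifier and the two chunk counters.
    static final int MAX_CHUNK = 65519;
    static final byte[] ID = "ICC_PROFILE\0".getBytes(StandardCharsets.US_ASCII);

    /** Splits an ICC profile into ready-to-write APP2 segments. */
    static List<byte[]> build(byte[] profile) {
        int count = (profile.length + MAX_CHUNK - 1) / MAX_CHUNK;
        if (count == 0 || count > 255) {
            throw new IllegalArgumentException("profile is empty or too large");
        }
        List<byte[]> segments = new ArrayList<byte[]>();
        for (int i = 0; i < count; i++) {
            int offset = i * MAX_CHUNK;
            int len = Math.min(MAX_CHUNK, profile.length - offset);
            int segLen = 2 + ID.length + 2 + len; // length field counts itself
            ByteArrayOutputStream seg = new ByteArrayOutputStream(segLen + 2);
            seg.write(0xFF);
            seg.write(0xE2);            // APP2 marker
            seg.write(segLen >> 8);     // big-endian segment length
            seg.write(segLen & 0xFF);
            seg.write(ID, 0, ID.length);
            seg.write(i + 1);           // chunk number, 1-based
            seg.write(count);           // total number of chunks
            seg.write(profile, offset, len);
            segments.add(seg.toByteArray());
        }
        return segments;
    }

    public static void main(String[] args) {
        List<byte[]> segments = build(new byte[100000]);
        System.out.println(segments.size() + " APP2 segments");
    }
}
```

This keeps the cheap stream copy for the image data itself and adds only the profile bytes, so memory use would still not depend on the pixel dimensions.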



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Jeremias Maerki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012247#comment-14012247
 ] 

Jeremias Maerki commented on PDFBOX-2101:
-

Well, I guess it's a matter of priorities. What's more important: lower memory 
consumption and better performance, or color fidelity (which can also be added 
later should anyone complain)? I'd leave it like this for now. The important 
thing is that it's documented. The only thing that could be useful would be a 
TODO in the source code.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012261#comment-14012261
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

Added TODO in rev 1598244 for the 1.8 branch and rev 1598245 for the trunk.

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg


 ExtractImages seems to fail to release memory resources on some files in both 
 PDFBox 1.8.5 and trunk.  
 On this 4MB file 
 [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if 
 extracting every image on every page (and there are many, many duplicate 
 images), there is an OOM with -Xmx1g.  If there is no Xmx and there is  2.5g 
 available, ExtractImages will work.
 With some experimentation, the triggers seem to be JPEG images that have 
 masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
 Commandlines:
 1.8.5:
 java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
 2.0_SNAPSHOT:
 java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
 Results:
 1.8.5: 906 files before OOM
 {noformat}
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
 at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
 at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
 at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
 {noformat}
 2.0_SNAPSHOT: 428 files before OOM
 {noformat}
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
 at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
 at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
 at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
 at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
 at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
 at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
 at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
 at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012278#comment-14012278
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


It depends on the point of view. In some cases (e.g. masked images, grouped 
content etc.) we already extract just the raw data without any post-processing.

I guess, for now we are done here, aren't we?

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012298#comment-14012298
 ] 

Tim Allison commented on PDFBOX-2101:
-

Thank you, all, for your work on this!  

I can't speak for the entire Tika community, but I suspect that the most common 
use case would be to extract one of each image (whether or not the image 
appears on 20 pages).  A caching parameter would be very handy for this.  For 
those who want to extract 20 copies of the same image, they can choose to take 
the potential memory hit for the sake of speed.  We have a decent method to 
configure PDFBox on Tika, and it would be great to add this if it isn't too 
much effort.

Thank you, again.
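
To make the "one of each image" use case concrete, here is a minimal sketch of how duplicates could be skipped by hashing the encoded image bytes. `SeenImages` and `firstTime` are hypothetical names for illustration only; nothing here is an existing PDFBox or Tika API, and hashing the raw stream is just one assumed way to recognise repeats.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: remembers which image streams
// have already been written so that repeats can be skipped.
class SeenImages {
    private final Set<String> digests = new HashSet<>();

    /** Returns true the first time these bytes are seen, false for repeats. */
    boolean firstTime(byte[] encodedImage) {
        return digests.add(sha256Hex(encodedImage));
    }

    private static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Only the digests are kept in memory, so the cost per distinct image is a short string rather than the decoded pixels.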

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012306#comment-14012306
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


Just to avoid a misunderstanding: an image that is used more than once should 
be part of the document resources rather than the page resources of each page 
on which it appears, so that such images won't be extracted more than once. 
Tilman's approach takes into account that images from page 1 may no longer be 
needed once page 2 is rendered and therefore shouldn't stay cached. 
Those resources are dropped by calling clear() on PDResources, which is done 
automatically when parsing a PDF, and I've added the same behaviour to the 
ExtractImages class.
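
The clear-after-each-page idea can be sketched roughly as follows. `PageImageCache` is a hypothetical stand-in for a page's resources, not the real PDResources class, and the 16-byte arrays merely stand for decoded pixel data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for per-page resources; the real PDResources class
// is more involved. Only the clear-after-use idea is shown here.
class PageImageCache {
    private final Map<String, byte[]> decoded = new HashMap<>();

    void put(String name, byte[] pixels) { decoded.put(name, pixels); }

    int size() { return decoded.size(); }

    // Drop the references so the decoded pixels become eligible for GC.
    void clear() { decoded.clear(); }
}

class ClearPerPageDemo {
    /** Returns the largest number of images held in the cache at once. */
    static int peakCached(int pages, int imagesPerPage) {
        PageImageCache cache = new PageImageCache();
        int peak = 0;
        for (int p = 0; p < pages; p++) {
            for (int i = 0; i < imagesPerPage; i++) {
                cache.put("page" + p + "/Im" + i, new byte[16]);
            }
            peak = Math.max(peak, cache.size());
            cache.clear(); // without this call the cache grows with every page
        }
        return peak;
    }
}
```

With the clear() call in place the peak stays at the number of images on a single page, however many pages are processed.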


 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012326#comment-14012326
 ] 

Tim Allison commented on PDFBOX-2101:
-

Ah, ok, thank you.  That makes sense.

To confirm my understanding of [~jeremias.mae...@outline.ch]'s point...PDFBox is
caching the uncompressed image?  That would explain why I'm seeing this:

I'm running hprof with trunk now with no -Xmx on a linux box, and ExtractImages 
has exported 223 images (many more to go!).  The exported images take up ~17m, 
but Java is choosing to use 1.1gb of memory.  

I'll submit the hprof results when that completes for kicks...
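
As a rough back-of-the-envelope check on why cached uncompressed images add up so quickly, the decoded size of a raster is simply width x height x bytes per pixel. The dimensions below are an assumed example (a letter-size page scanned at 300 dpi), not values measured from 239665.pdf.

```java
// Illustrative arithmetic only; object headers and any extra copies made
// while decoding are ignored.
class DecodedSize {
    /** Bytes needed to hold an uncompressed raster. */
    static long bytes(int width, int height, int bytesPerPixel) {
        // Widen to long before multiplying so large images do not overflow int.
        return (long) width * height * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed example: 2550 x 3300 pixels, 3 bytes per pixel (RGB).
        System.out.println(DecodedSize.bytes(2550, 3300, 3)); // 25245000
    }
}
```

At roughly 25 MB per decoded image of that size, a few dozen cached images are enough to approach a 1 GB heap even when the JPEG files on disk are small.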

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012588#comment-14012588
 ] 

John Hewson commented on PDFBOX-2101:
-

We probably shouldn't be dumping the JPEG stream directly to disk. It's not 
necessarily a useful JPEG file, because PDF can perform quite a bit of 
post-processing. Tilman's changes in r1598244/r1598245 cause a regression in 
ExtractImages for the DocuPrint file from PDFBOX-1058, for example: the 
extracted images are inverted. Perhaps we could dump the raw JPEG in cases 
where we're sure that there is no post-processing, but it's not a good default 
if you want useful images.

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg, 
 java.hprof.zip




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012619#comment-14012619
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

Maybe add an option so that both are possible? The regression that you mention 
is exactly what [~jerem...@apache.org] expected.

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
 PDFBOX-2101-714-poor.jpg, java.hprof.zip




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012622#comment-14012622
 ] 

John Hewson commented on PDFBOX-2101:
-

I'm not sure that's what Jeremias was expecting - it's not a classic color 
fidelity issue with the colors being slightly off due to the wrong profile - 
it's completely the wrong color model, with YCbCr being treated as CMYK!
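
To illustrate why the color model matters, here is a small sketch of the usual JFIF-style YCbCr to RGB conversion for 8-bit samples. It is not PDFBox code; `YCbCrDemo` is a hypothetical name. A viewer that skips this step and interprets the same sample values under a different model ends up with unrelated colors rather than slightly shifted ones.

```java
// Illustrative only: full-range YCbCr to RGB as used by JFIF-style JPEGs.
class YCbCrDemo {
    static int[] toRgb(int y, int cb, int cr) {
        double r = y + 1.402 * (cr - 128);
        double g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128);
        double b = y + 1.772 * (cb - 128);
        return new int[] { clamp(r), clamp(g), clamp(b) };
    }

    // Round to the nearest integer and limit to the 0..255 sample range.
    private static int clamp(double v) {
        return (int) Math.max(0, Math.min(255, Math.round(v)));
    }
}
```

For example, the neutral sample (128, 128, 128) maps to mid-gray in RGB, because Cb = Cr = 128 means "no chroma"; the chroma channels are offsets around 128, not ink or light intensities.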

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
 Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, 
 PDFBOX-2101-714-poor.jpg, java.hprof.zip




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012784#comment-14012784
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

I thought about this again and decided to revert; done in rev 1598379 for the 
1.8 branch and rev 1598382 for the trunk. The reason is that the change would 
give a bad user experience and lead to support requests. I'm thinking about a 
better idea: only write those JPEGs directly that have no decodeParams.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012796#comment-14012796
 ] 

John Hewson commented on PDFBOX-2101:
-

Yeah, we can dump the raw JPEG in cases where we're sure that there is no 
post-processing. That rules out images with DecodeParams, but also YCCK images 
(see DCTFilter) and images using PDF color spaces.
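
The rule discussed here could be sketched roughly as follows. This is only an
illustration under stated assumptions: ImageInfo and its fields are
hypothetical stand-ins, not PDFBox classes, and treating DeviceRGB and
DeviceGray as the only "safe" color spaces is an assumption made for the
example, not something the comments above confirm.

```java
import java.util.Arrays;
import java.util.List;

public class RawJpegCheck {
    /** Hypothetical summary of an image XObject; NOT a PDFBox class. */
    static final class ImageInfo {
        final List<String> filters;     // filter names, e.g. ["DCTDecode"]
        final boolean hasDecodeParams;  // true if decode parameters are present
        final boolean isYcck;           // true if the JPEG data is YCCK-encoded
        final String colorSpace;        // e.g. "DeviceRGB", "Indexed", "ICCBased"

        ImageInfo(List<String> filters, boolean hasDecodeParams,
                  boolean isYcck, String colorSpace) {
            this.filters = filters;
            this.hasDecodeParams = hasDecodeParams;
            this.isYcck = isYcck;
            this.colorSpace = colorSpace;
        }
    }

    /** Returns true only when no post-processing of the JPEG data is expected. */
    static boolean canWriteRawJpeg(ImageInfo img) {
        boolean onlyDct = img.filters.size() == 1
                && "DCTDecode".equals(img.filters.get(0));
        // Assumption for this sketch: only these two color spaces are "plain".
        boolean plainColorSpace = "DeviceRGB".equals(img.colorSpace)
                || "DeviceGray".equals(img.colorSpace);
        return onlyDct && !img.hasDecodeParams && !img.isYcck && plainColorSpace;
    }

    public static void main(String[] args) {
        ImageInfo plain = new ImageInfo(Arrays.asList("DCTDecode"), false, false, "DeviceRGB");
        ImageInfo ycck = new ImageInfo(Arrays.asList("DCTDecode"), false, true, "DeviceCMYK");
        System.out.println("plain JPEG can be dumped: " + canWriteRawJpeg(plain));
        System.out.println("YCCK JPEG can be dumped: " + canWriteRawJpeg(ycck));
    }
}
```

Any image that fails the check would still go through the normal decode and
re-encode path, so the shortcut only ever applies to the simple cases.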



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012906#comment-14012906
 ] 

John Hewson commented on PDFBOX-2101:
-

I did some memory profiling, and COSStream is holding onto a copy of the stream 
data in a RandomAccessFileOutputStream instance variable, unFilteredStream, 
which it keeps as long as the COSStream is around.

Each PDPage holds a reference to its Page COSDictionary, which in turn 
contains the Resources, and ultimately a COSStream containing a named image 
XObject stream:

Page
  Resources
    XObject
      obj1 (COSStream)

This would be fine, except that COSStream caches its data once it has been 
read! Specifically, when reading an image, COSStream.getUnfilteredStream() is 
called, which causes the RandomAccessFileOutputStream unFilteredStream to be 
populated. The only way to release unFilteredStream is to call 
COSStream.close(), but that destroys the entire COSStream object, preventing it 
from being read in the future and clearing its dictionary.

Furthermore, the COSStream is kept around for the entire lifetime of the 
COSDocument, because its ancestor, the document Catalog, is retained by 
COSDocument.objectPool. This means that every time a COSStream is read, its 
contents are cached until the document is closed.

As far as I can tell, the best solution seems to be to prevent COSStream from 
caching anything, then make sure callers of COSStream methods are equipped to 
handle that.
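
To make the retention pattern concrete, here is a minimal sketch. It does not
use PDFBox: CachingStream and NonCachingStream are hypothetical stand-ins for
the behaviour described above, decode() is a toy "filter" that expands its
input tenfold, and the list plays the role of a document-level object pool
that keeps every stream reachable.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class StreamCachingSketch {
    /** Hypothetical stream that keeps its decoded data after the first read. */
    static class CachingStream {
        private final byte[] encoded;
        private byte[] decodedCache; // stays populated for the object's lifetime

        CachingStream(byte[] encoded) { this.encoded = encoded; }

        InputStream getUnfilteredStream() {
            if (decodedCache == null) {
                decodedCache = decode(encoded);
            }
            return new ByteArrayInputStream(decodedCache);
        }

        long retainedBytes() { return decodedCache == null ? 0 : decodedCache.length; }
    }

    /** Hypothetical alternative: decode on every call, retain nothing. */
    static class NonCachingStream {
        private final byte[] encoded;

        NonCachingStream(byte[] encoded) { this.encoded = encoded; }

        InputStream getUnfilteredStream() {
            return new ByteArrayInputStream(decode(encoded));
        }

        long retainedBytes() { return 0; }
    }

    /** Toy "filter": every encoded byte expands to 10 decoded bytes. */
    static byte[] decode(byte[] encoded) {
        byte[] out = new byte[encoded.length * 10];
        for (int i = 0; i < out.length; i++) {
            out[i] = encoded[i / 10];
        }
        return out;
    }

    /** Reads every stream once and reports how many decoded bytes stay reachable. */
    static long retainedAfterReadingAll(int streamCount, int encodedSize) {
        List<CachingStream> pool = new ArrayList<>(); // stands in for the object pool
        for (int i = 0; i < streamCount; i++) {
            pool.add(new CachingStream(new byte[encodedSize]));
        }
        long total = 0;
        for (CachingStream s : pool) {
            s.getUnfilteredStream();
            total += s.retainedBytes();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("retained with caching: " + retainedAfterReadingAll(100, 1000));
        NonCachingStream n = new NonCachingStream(new byte[1000]);
        n.getUnfilteredStream();
        System.out.println("retained without caching: " + n.retainedBytes());
    }
}
```

The non-caching variant trades memory for repeated decoding work, which is why
callers would need to be prepared for it.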


[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012962#comment-14012962
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

This means that image objects are cached twice, i.e. once as BufferedImage and 
once as unfilteredStream (= the decoded version of what's in the PDF).



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-29 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013001#comment-14013001
 ] 

John Hewson commented on PDFBOX-2101:
-

Yep, that's right. I set up my JVM to dump the heap at 200MB, here's what I got:

Approx 25MB of ByteBandedRaster + 60MB of IntegerInterleavedRaster (cached 
images).
Approx 72MB (4500 x 16 KB) of buffers in RandomAccessBuffer(s) belonging to 
COSStream.
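
As a quick sanity check of these figures, the arithmetic can be reproduced
with a few lines of Java. The class below is purely illustrative; the only
inputs are the numbers quoted above, and whether "16 KB" means 16,000 or
16,384 bytes is left open, so both are computed.

```java
public class BufferArithmetic {
    /** Total size in bytes of bufferCount buffers of bytesPerBuffer each. */
    static long totalBytes(long bufferCount, long bytesPerBuffer) {
        return bufferCount * bytesPerBuffer;
    }

    public static void main(String[] args) {
        long decimal = totalBytes(4500, 16_000);   // 16 KB read as 16,000 bytes
        long binary = totalBytes(4500, 16 * 1024); // 16 KB read as 16,384 bytes
        System.out.println("decimal reading: " + decimal + " bytes");
        System.out.println("binary reading:  " + binary + " bytes");

        // Sum of the three categories listed in the heap dump, in MB.
        int accountedMb = 25 + 60 + 72;
        System.out.println("accounted for: " + accountedMb + " MB of the 200 MB heap");
    }
}
```

Either reading lands close to the quoted 72MB, and the three categories
together account for roughly 157MB of the 200MB dump.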




[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011439#comment-14011439
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


I can confirm the described OOM on Windows 7 64-bit but NOT on Fedora 20 Linux. 
I checked both with PDFBox 1.8.5.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread Jeremias Maerki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011445#comment-14011445
 ] 

Jeremias Maerki commented on PDFBOX-2101:
-

While working on this, consider switching to the ByteArrayOutputStream provided 
by Commons IO, which doesn't need to constantly copy arrays when growing. That 
makes it faster and gives it a lower peak memory consumption.
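
The general technique can be illustrated with a small self-contained class.
This is a simplified sketch, not the Commons IO implementation: the class
name, the fixed CHUNK_SIZE and the list-of-chunks layout are all choices made
for the example. The point is only that growing appends a new chunk instead of
copying everything written so far into a larger array.

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

public class ChunkedByteOutput extends OutputStream {
    private static final int CHUNK_SIZE = 4096; // arbitrary size for the sketch

    private final List<byte[]> fullChunks = new ArrayList<>();
    private byte[] current = new byte[CHUNK_SIZE];
    private int posInCurrent = 0;
    private int size = 0;

    /** Starts a fresh chunk; bytes already written are never copied. */
    private void startNewChunk() {
        fullChunks.add(current);
        current = new byte[CHUNK_SIZE];
        posInCurrent = 0;
    }

    @Override
    public void write(int b) {
        if (posInCurrent == CHUNK_SIZE) {
            startNewChunk();
        }
        current[posInCurrent++] = (byte) b;
        size++;
    }

    @Override
    public void write(byte[] src, int off, int len) {
        while (len > 0) {
            if (posInCurrent == CHUNK_SIZE) {
                startNewChunk();
            }
            int n = Math.min(len, CHUNK_SIZE - posInCurrent);
            System.arraycopy(src, off, current, posInCurrent, n);
            posInCurrent += n;
            off += n;
            len -= n;
            size += n;
        }
    }

    public int size() {
        return size;
    }

    /** Assembles the chunks into one array; the only full copy that is made. */
    public byte[] toByteArray() {
        byte[] out = new byte[size];
        int pos = 0;
        for (byte[] chunk : fullChunks) {
            System.arraycopy(chunk, 0, out, pos, CHUNK_SIZE);
            pos += CHUNK_SIZE;
        }
        System.arraycopy(current, 0, out, pos, posInCurrent);
        return out;
    }

    public static void main(String[] args) {
        ChunkedByteOutput out = new ChunkedByteOutput();
        out.write(new byte[10_000], 0, 10_000);
        System.out.println("bytes written: " + out.size());
    }
}
```

By contrast, java.io.ByteArrayOutputStream keeps a single backing array and
copies it into a larger one whenever it runs out of room, which is the
Arrays.copyOf call at the top of the stack traces in this issue.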



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011452#comment-14011452
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


java.io.ByteArrayOutputStream is used in many places; do you have a particular 
one in mind?

 Surprising memory consumption when extracting images
 

 Key: PDFBOX-2101
 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
 Project: PDFBox
  Issue Type: Bug
  Components: Utilities
Affects Versions: 1.8.5
 Environment: Windows 7
 java version 1.7.0_55
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor

 ExtractImages seems to fail to release memory resources on some files in both 
 PDFBox 1.8.5 and trunk.  
 On this file 4MB file 
 [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if 
 extracting every image on every page (and there are many, many duplicate 
 images), there is an OOM with -Xmx1g.  If there is no Xmx and there is  2.5g 
 available, ExtractImages will work.
 With some experimentation, the triggers seem to be JPEG images that have 
 masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
 Commandlines:
 1.8.5:
 java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 
 239665.pdf
 2.0_SNAPSHOT:
 java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar 
 org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
 Results:
 1.8.5: 906 files before OOM
 {noformat}
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
 at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
 at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
 at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
 at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
 {noformat}
 2.0_SNAPSHOT: 428 files before OOM
 {noformat}
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
 at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
 at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
 at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
 at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
 at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
 at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
 at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
 at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011455#comment-14011455
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


I've fixed the issue in revisions 1598103 (trunk) and 1598104 (1.8 branch) by 
clearing the resources at the end of each step.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011457#comment-14011457
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

2.0 has higher memory consumption because of a clone in some class 
(I mentioned this a few months ago).

The real cause of the problem IMHO is that images are cached. What could be 
done is to create some sort of cache manager that stores a certain amount of 
images and orphans them when new images are built.
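
To make the cache manager idea concrete, here is a minimal sketch of a bounded cache that keeps at most a fixed number of entries and drops the least recently used one when a new entry is added. The class name BoundedImageCache and its methods are hypothetical and are not part of PDFBox; the value type is generic, so it could hold a BufferedImage or anything else. It assumes that evicting by entry count (rather than by image size in bytes) is good enough.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical bounded cache (not PDFBox API): holds at most maxEntries
 * values and evicts the least recently used entry when a new one is added.
 */
final class BoundedImageCache<K, V> {
    private final int maxEntries;
    private final LinkedHashMap<K, V> entries;

    BoundedImageCache(final int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be >= 1");
        }
        this.maxEntries = maxEntries;
        // accessOrder = true: iteration order runs from least to most recently used.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each put; returning true drops the eldest entry,
                // which makes it eligible for garbage collection.
                return size() > BoundedImageCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached value, or null; a hit marks the entry as recently used. */
    synchronized V get(K key) {
        return entries.get(key);
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }

    /** Checks for presence without changing the access order. */
    synchronized boolean contains(K key) {
        return entries.containsKey(key);
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        BoundedImageCache<String, String> cache = new BoundedImageCache<String, String>(2);
        cache.put("a", "image A");
        cache.put("b", "image B");
        cache.get("a");            // "a" is now the most recently used entry
        cache.put("c", "image C"); // pushes the cache over its limit, so "b" is evicted
        System.out.println("a cached: " + cache.contains("a"));
        System.out.println("b cached: " + cache.contains("b"));
    }
}
```

A variant could hold values through java.lang.ref.SoftReference so that the garbage collector may reclaim images under memory pressure, at the cost of less predictable hit rates.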



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread Jeremias Maerki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011469#comment-14011469
 ] 

Jeremias Maerki commented on PDFBOX-2101:
-

java.io.ByteArrayOutputStream is just a bit naïvely implemented. The Commons IO 
variant has the exact same API but doesn't copy data around so much. So, no, no 
special occasion. It's just a better replacement in every case.
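
As an illustration of why such a variant copies less, here is a minimal sketch of a byte sink that appends fixed-size chunks instead of reallocating and copying one ever-growing array, which is what the Arrays.copyOf frames in the stack traces show java.io.ByteArrayOutputStream doing. The class ChunkedByteBuffer is hypothetical and is not the Commons IO source; it only demonstrates the general technique, and it assumes the total size fits in an int.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical chunked byte sink (illustration only): growth allocates a new
 * fixed-size chunk and never copies bytes that were already written.
 */
final class ChunkedByteBuffer {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<byte[]>();
    private int usedInLast; // bytes used in the last chunk
    private int total;      // total bytes written so far

    ChunkedByteBuffer(int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        this.chunkSize = chunkSize;
    }

    void write(byte[] src, int off, int len) {
        if (off < 0 || len < 0 || len > src.length - off) {
            throw new IndexOutOfBoundsException();
        }
        while (len > 0) {
            if (chunks.isEmpty() || usedInLast == chunkSize) {
                // Grow by adding a chunk; existing chunks stay where they are.
                chunks.add(new byte[chunkSize]);
                usedInLast = 0;
            }
            byte[] last = chunks.get(chunks.size() - 1);
            int n = Math.min(len, chunkSize - usedInLast);
            System.arraycopy(src, off, last, usedInLast, n);
            usedInLast += n;
            total += n;
            off += n;
            len -= n;
        }
    }

    int size() {
        return total;
    }

    int chunkCount() {
        return chunks.size();
    }

    /** Performs the single full copy, only when a contiguous array is requested. */
    byte[] toByteArray() {
        byte[] out = new byte[total];
        int pos = 0;
        for (int i = 0; i < chunks.size(); i++) {
            int n = (i == chunks.size() - 1) ? usedInLast : chunkSize;
            System.arraycopy(chunks.get(i), 0, out, pos, n);
            pos += n;
        }
        return out;
    }

    public static void main(String[] args) {
        ChunkedByteBuffer buffer = new ChunkedByteBuffer(4);
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        buffer.write(data, 0, data.length);
        System.out.println(buffer.size() + " bytes in " + buffer.chunkCount() + " chunks");
    }
}
```

The trade-off is that a final toByteArray() call still needs one contiguous array of the full size, so the peak memory is reduced only for the growth phase, not for the final result.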



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011472#comment-14011472
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


Ok, I see. We should keep that in mind and refactor some of the code from time 
to time. That class is used a lot within pdfbox.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011475#comment-14011475
 ] 

Andreas Lehmkühler commented on PDFBOX-2101:


[~talli...@apache.org] Please double check if this works as it seems to be at 
least partly platform/environment specific.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011725#comment-14011725
 ] 

John Hewson commented on PDFBOX-2101:
-

{quote}
The real cause of the problem IMHO is that images are cached. What could be 
done is to create some sort of cache manager that stores a certain amount of 
images and orphans them when new images are built.
{quote}

I don't know. It seems like the ImageXObject is being kept around for too long; 
the cached image only lives as long as this object. If that's a problem then 
I'd recommend just getting rid of image caching and letting the end-user do it 
in the (relatively rare?) case that they need it.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011787#comment-14011787
 ] 

Tilman Hausherr commented on PDFBOX-2101:
-

What if the image is a company logo that is on every page of a big file? In 
that case caching would make sense, and the user can't do it if he just wants 
to render all the pages.



[jira] [Commented] (PDFBOX-2101) Surprising memory consumption when extracting images

2014-05-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011875#comment-14011875
 ] 

Tim Allison commented on PDFBOX-2101:
-

[~lehmi], thank you for looking into this so quickly.  Your mods definitely 
took the edge off!  Thank you.  Both 1.8.6 SNAPSHOT and 2.0 SNAPSHOT passed 
with -Xmx1g.  On Windows 7, Java was choosing to use up to 800mb with 2.0 
SNAPSHOT and ~600mb with 1.8.6 SNAPSHOT.  Both versions of PDFBox were also 
still successful with -Xmx500m.

It still feels like something odd is going on in that Java is choosing to 
consume up to 800m to export images from a 4m file.  It feels a bit like the 
old substring() feature of Java.

Your modifications have definitely helped, but my poor gc is very, very tired. 
:)  Thank you!


P.S. For the record, I tested 1.8.5 on:
Red Hat Enterprise Linux Server release 6.5 (Santiago)

java version 1.7.0_40
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)

and I still had OOM with -Xmx1g...it isn't just Windows (although I'm willing 
to believe it could be!)





