[ 
https://issues.apache.org/jira/browse/PDFBOX-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205331#comment-13205331
 ] 

Antoni Mylka commented on PDFBOX-1227:
--------------------------------------

The PDF is obviously from PDFBOX-706, not PDFBOX-708. My test (as uploaded, with
default maven-surefire-plugin settings, i.e. the default -Xmx) now results in:

java.lang.OutOfMemoryError: Java heap space
        at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
        at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
        at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:117)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:229)
        at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
        at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:214)
        at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:468)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:143)
        at org.apache.pdfbox.antoni.pub.TestPdfbox706FlexEnableBeta1.testFile2(TestPdfbox706FlexEnableBeta1.java:39)

When I increase the heap size (found by binary search; it needs at least
-Xmx296m), pictures 51, 52 and 53 get extracted properly. They aren't black any
more, so there must have been some improvement in this respect. BUT now more
pictures are found. I commented out those 'ifs' in the test code, so that all
pictures are supposed to be non-black. The test then extracts 53 pictures and
dies with this at picture 54:

java.lang.ArrayIndexOutOfBoundsException: Coordinate out of bounds!
        at sun.awt.image.ByteInterleavedRaster.getDataElements(ByteInterleavedRaster.java:301)
        at java.awt.image.BufferedImage.getRGB(BufferedImage.java:871)
        at org.apache.pdfbox.antoni.pub.TestPdfbox706FlexEnableBeta1.testFile2(TestPdfbox706FlexEnableBeta1.java:45)
        
So now it seems there are two problems: increased memory consumption and an 
array index exception. Three pictures are extracted properly now though.
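
For reference, here is a minimal sketch of the kind of "is this picture black?" check discussed above. It is not the code from the attached test: the class name ImagePixelCheck and the method isAllBlack are made up for illustration. It only uses the standard BufferedImage API and never asks for a coordinate outside getWidth()/getHeight(). If getRGB still throws "Coordinate out of bounds!" inside a loop like this, that would suggest the image's raster does not match its reported dimensions, though this has not been verified against the PDF in question.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox or of the attached test.
final class ImagePixelCheck {
    private ImagePixelCheck() {}

    /** Returns true if every pixel has zero red, green and blue components (alpha is ignored). */
    static boolean isAllBlack(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // getRGB returns ARGB; mask off the alpha byte before comparing.
                if ((image.getRGB(x, y) & 0x00FFFFFF) != 0) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

A test could then assert that isAllBlack returns false for each BufferedImage it extracts.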
                
> File submitted to PDFBOX-708 throws OOME
> ----------------------------------------
>
>                 Key: PDFBOX-1227
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1227
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.0
>         Environment: Windows 7 64bit
>            Reporter: Antoni Mylka
>         Attachments: TestPdfbox706FlexEnableBeta1.java, 
> pdfbox-706-flex-enable-beta1.pdf
>
>
> I want to extract pictures from FLEX Enable Beta1 Feb13.pdf originally 
> submitted to PDFBOX-708. It used to work, but now it throws an OOME.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
