[jira] [Commented] (PDFBOX-1202) org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Ilija Pavlic (Commented) (JIRA) Tue, 10 Jan 2012 16:17:07 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183733#comment-13183733
 ]


Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

Here's the stack trace from the latest pdfbox built from svn.

11.01.2012. 01:10:42 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to an OutOfMemoryError
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
        at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
        at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:117)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
        at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
        at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:262)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
        at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
        at pdf.test.Main.main(Main.java:61)

You were right that it is a java.lang.OutOfMemoryError. What does that mean? 
Somewhat amusingly, a larger document of a similar type (947 pages long) 
can be read without the exception being thrown. 
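
For reference, "java.lang.OutOfMemoryError: Java heap space" means the JVM
could not allocate more memory on the heap, whose upper limit is set with the
-Xmx option. One way to see whether memory builds up from page to page is to
log heap usage inside the page loop. The following is a minimal sketch using
only java.lang.Runtime; the class HeapProbe and its method names are
hypothetical helpers for illustration and are not part of PDFBox.

```java
public class HeapProbe {

    /** Bytes of heap currently in use: allocated heap minus the free part of it. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Maximum heap the JVM will attempt to use (controlled by -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** One log line per page, e.g. to print at the top of the page loop. */
    static String describe(int pageIndex) {
        long mb = 1024L * 1024L;
        return "Page " + pageIndex + ": "
                + (usedHeapBytes() / mb) + " MB used of "
                + (maxHeapBytes() / mb) + " MB max";
    }

    public static void main(String[] args) {
        // Stand-in for the real page loop; call describe(i) once per page there.
        for (int i = 0; i < 3; i++) {
            System.out.println(describe(i));
        }
    }
}
```

If the "used" figure climbs steadily and never drops back after garbage
collection, something is holding on to per-page data. If it stays flat until
one particular page, a single large stream on that page is the more likely
cause. Running with a larger -Xmx value is a quick way to tell the two apart.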
                
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt 
> stream
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Mac OS X 10.7.2
>            Reporter: Ilija Pavlic
>            Priority: Critical
>         Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading 
> corrupt stream" thrown when extracting text.
> The error is thrown at:
> - page 397 if the page loop starts at zero -- for (int i = 0; i < allPages.size(); i++)
> - page 790 if the loop starts at 395 (that would make it approx. 397 pages 
> from the beginning of the loop)
> - page 848 if the loop starts at 450 (that would make it approx. 397 pages 
> from the beginning of the loop)
> The error is not thrown if:
> - the loop starts at page 452 or later
> - the loop starts at 0 and ends before 396
> - the loop starts at 200 and ends before 595
> Therefore I suspect that a loop spanning more than 396 pages will throw an 
> error. Is that an indication of a memory leak of some sort?
> Full code is below. Note that the result is the same when instantiating a 
> single PDFTextStripperByArea outside the page loop and invoking resetEngine() 
> on the stripper inside the page loop.
> import java.awt.geom.Rectangle2D;
> import java.io.IOException;
> import java.util.List;
> import org.apache.pdfbox.exceptions.COSVisitorException;
> import org.apache.pdfbox.exceptions.CryptographyException;
> import org.apache.pdfbox.exceptions.InvalidPasswordException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.util.PDFTextStripperByArea;
> public class Main {
>       public static void main(String[] args) throws IOException,
>                       COSVisitorException, CryptographyException {
>               
>               PDDocument document = null;
>               try {
>                       document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
>                       if (document.isEncrypted()) {
>                               try {
>                                       document.decrypt("");
>                               } catch (InvalidPasswordException e) {
>                                       System.err.println("Error: Document is encrypted with a password.");
>                                       System.exit(1);
>                               }
>                       }
>                       float x = 55f;
>                       float y = 40f;
>                       float width = 168.5f;
>                       float height = 689f;
>                       float evenOffset = -10f;
>                       List allPages = document.getDocumentCatalog().getAllPages();
>                       for (int i = 0; i < allPages.size(); i++) {
>                               System.out.println("Page " + i);
>                               PDPage page = (PDPage) allPages.get(i);
>                               PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>                               stripper.setSortByPosition(true);
>                               for (int j = 0; j < 3; j++)
>                               {
>                                       if (i % 2 == 0) {
>                                               Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
>                                               stripper.addRegion("region", region);
>                                       }
>                                       else {
>                                               Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
>                                               stripper.addRegion("region", region);
>                                       }
>                               }
>                               stripper.extractRegions(page);
>                               for (String regionName : stripper.getRegions())
>                               {
>                                       stripper.getTextForRegion(regionName);
>                               }
>                       }
>               }
>               
>               catch(Exception e) {
>                       e.printStackTrace();
>               }
>               finally {
>                       if (document != null) {
>                               document.close();
>                       }
>               }
>       }
> }

