[ 
https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181683#comment-13181683
 ] 

Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

I tried it just now: the error is thrown in the same way when a single 
PDFTextStripperByArea is instantiated outside the loop and resetEngine() is 
called on it inside the loop.

My original expectation was that each instance would be garbage collected, and 
that there would be no harm in a new instantiation for each page, since 
PDFTextStripperByArea's extractRegions is called on a single page, as in 
"stripper.extractRegions(page)". In either case, the error is not prevented by 
using a single instance of PDFTextStripperByArea and resetting the stripper 
by invoking resetEngine() inside the loop.
                
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt 
> stream
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1202
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1202
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Mac OS X 10.7.2
>            Reporter: Ilija Pavlic
>            Priority: Critical
>         Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading 
> corrupt stream" thrown when extracting text.
> The error is thrown at:
> - page 397 if the page loop starts at zero -- for (int i = 0; i < 
> allPages.size(); i++)
> - page 790 if the loop starts at 395 (that would make it approx. 397 pages 
> from the beginning of the loop)
> - page 848 if the loop starts at 450 (that would make it approx. 397 pages 
> from the beginning of the loop)
> The error is not thrown if:
> - the loop starts at page 452 or later
> - the loop starts at 0 and ends before 396
> - the loop starts at 200 and ends before 595
> Therefore I suspect that a loop spanning more than 396 pages will throw an 
> error. Is that an indication of a memory leak of some sort?
> Here is the full code:
> import java.awt.geom.Rectangle2D;
> import java.io.IOException;
> import java.util.List;
> import org.apache.pdfbox.exceptions.COSVisitorException;
> import org.apache.pdfbox.exceptions.CryptographyException;
> import org.apache.pdfbox.exceptions.InvalidPasswordException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.util.PDFTextStripperByArea;
>
> public class Main {
>     public static void main(String[] args) throws IOException,
>             COSVisitorException, CryptographyException {
>
>         PDDocument document = null;
>         try {
>             document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
>             if (document.isEncrypted()) {
>                 try {
>                     document.decrypt("");
>                 } catch (InvalidPasswordException e) {
>                     System.err.println("Error: Document is encrypted with a password.");
>                     System.exit(1);
>                 }
>             }
>             float x = 55f;
>             float y = 40f;
>             float width = 168.5f;
>             float height = 689f;
>             float evenOffset = -10f;
>             List allPages = document.getDocumentCatalog().getAllPages();
>             for (int i = 0; i < allPages.size(); i++) {
>                 System.out.println("Page " + i);
>                 PDPage page = (PDPage) allPages.get(i);
>                 PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>                 stripper.setSortByPosition(true);
>                 for (int j = 0; j < 3; j++) {
>                     if (i % 2 == 0) {
>                         Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
>                         stripper.addRegion("region", region);
>                     } else {
>                         Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
>                         stripper.addRegion("region", region);
>                     }
>                 }
>                 stripper.extractRegions(page);
>                 for (String regionName : stripper.getRegions()) {
>                     stripper.getTextForRegion(regionName);
>                 }
>             }
>         } catch (Exception e) {
>             e.printStackTrace();
>         } finally {
>             if (document != null) {
>                 document.close();
>             }
>         }
>     }
> }
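
As a quick check of the figures quoted in the description, the sketch below 
subtracts each loop start index from the page at which the error was reported. 
It is not part of the original report: the class and method names are 
hypothetical, and it only redoes the arithmetic; it does not touch PDFBox.

```java
public class FailureOffsets {

    /** Number of pages the loop had advanced when the error was reported. */
    static int pagesBeforeFailure(int startPage, int failurePage) {
        return failurePage - startPage;
    }

    public static void main(String[] args) {
        // {loop start index, page at which the error was reported},
        // taken from the three failing runs in the issue description.
        int[][] runs = { {0, 397}, {395, 790}, {450, 848} };
        for (int[] run : runs) {
            System.out.println("start " + run[0] + ", error at " + run[1]
                    + ": " + pagesBeforeFailure(run[0], run[1])
                    + " pages into the loop");
        }
    }
}
```

The offsets come out as 397, 395 and 398, so the "approx. 397 pages" estimate 
holds only roughly; the failure point does not appear to be an exact constant 
number of pages into the loop.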
