[ https://issues.apache.org/jira/browse/PDFBOX-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183733#comment-13183733 ]
Ilija Pavlic commented on PDFBOX-1202:
--------------------------------------

Here's the stack trace from the latest PDFBox built from svn.

11.01.2012. 01:10:42 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to an OutOfMemoryError
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
	at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
	at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:117)
	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
	at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
	at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:262)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
	at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
	at pdf.test.Main.main(Main.java:61)

You were right about it being a java.lang.OutOfMemoryError. What does that mean? Somewhat amusingly, a larger document of a similar type (947 pages long) can be read without the exception being thrown.
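For context: `java.lang.OutOfMemoryError: Java heap space` means the JVM could not allocate another object because the heap had reached its configured maximum (the `-Xmx` setting). Below is a minimal sketch, not part of the original report, of how heap usage could be logged once per page to see whether it climbs steadily across the loop. `HeapProbe`, `usedHeapBytes` and `report` are hypothetical names; only standard `java.lang.Runtime` calls are used, and the PDFBox loop is left out so the snippet runs on its own.

```java
public class HeapProbe {

    /** Bytes of heap currently in use, as reported by the JVM. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Builds one log line with used and maximum heap (in MB) for a page index. */
    public static String report(int pageIndex) {
        long mb = 1024L * 1024L;
        long usedMb = usedHeapBytes() / mb;
        long maxMb = Runtime.getRuntime().maxMemory() / mb;
        return "Page " + pageIndex + ": heap used " + usedMb + " MB of " + maxMb + " MB max";
    }

    public static void main(String[] args) {
        // Stand-in for the page loop; the real loop would do the PDFBox work
        // (assumed here) before each report() call.
        for (int i = 0; i < 3; i++) {
            System.out.println(report(i));
        }
    }
}
```

In the reproduction code, the call would presumably go right after `stripper.extractRegions(page)`. A figure that rises with every page would be consistent with memory being retained between pages, though it would not by itself prove a leak.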
> org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-1202
> URL: https://issues.apache.org/jira/browse/PDFBOX-1202
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.6.0
> Environment: Mac OS X 10.7.2
> Reporter: Ilija Pavlic
> Priority: Critical
> Attachments: IATAUnitedStates.pdf
>
>
> Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.
> The error is thrown at:
> - page 397 if the page loop starts at zero -- for (int i = 0; i < allPages.size(); i++)
> - page 790 if the loop starts at 395 (that would make it approx. 397 pages from the beginning of the loop)
> - page 848 if the loop starts at 450 (that would make it approx. 397 pages from the beginning of the loop)
> The error is not thrown if:
> - the loop starts at page 452 or later
> - the loop starts at 0 and ends before 396
> - the loop starts at 200 and ends before 595
> Therefore I suspect that a loop spanning more than 396 pages will throw an error. Is that an indication of a memory leak of some sort?
> Full code is below. Note that the result is the same when instantiating a single PDFTextStripperByArea outside the page loop and invoking resetEngine() on the stripper inside the page loop.
> import java.awt.geom.Rectangle2D;
> import java.io.IOException;
> import java.util.List;
> import org.apache.pdfbox.exceptions.COSVisitorException;
> import org.apache.pdfbox.exceptions.CryptographyException;
> import org.apache.pdfbox.exceptions.InvalidPasswordException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.util.PDFTextStripperByArea;
> public class Main {
>     public static void main(String[] args) throws IOException, COSVisitorException, CryptographyException {
>
>         PDDocument document = null;
>         try {
>             document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
>             if (document.isEncrypted()) {
>                 try {
>                     document.decrypt("");
>                 } catch (InvalidPasswordException e) {
>                     System.err.println("Error: Document is encrypted with a password.");
>                     System.exit(1);
>                 }
>             }
>             float x = 55f;
>             float y = 40f;
>             float width = 168.5f;
>             float height = 689f;
>             float evenOffset = -10f;
>             List allPages = document.getDocumentCatalog().getAllPages();
>             for (int i = 0; i < allPages.size(); i++) {
>                 System.out.println("Page " + i);
>                 PDPage page = (PDPage) allPages.get(i);
>                 PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>                 stripper.setSortByPosition(true);
>                 for (int j = 0; j < 3; j++)
>                 {
>                     if (i % 2 == 0) {
>                         Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
>                         stripper.addRegion("region", region);
>                     }
>                     else {
>                         Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
>                         stripper.addRegion("region", region);
>                     }
>                 }
>                 stripper.extractRegions(page);
>                 for (String regionName : stripper.getRegions())
>                 {
>                     stripper.getTextForRegion(regionName);
>                 }
>             }
>         }
>
>         catch(Exception e) {
>             e.printStackTrace();
>         }
>         finally {
>             if (document != null) {
>                 document.close();
>             }
>         }
>     }
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira