[ https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761607#action_12761607 ]

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Lars - thanks for posting the problematic file - I was able to reproduce the error.

This is actually a different error from the one Navendu was hitting, but it is similarly unrelated to the text extraction code. It happens in the PDFXrefStreamParser.parse() method because there is no objIter.hasNext() test to protect the objIter.next() call on line 115. This is an outright bug.

Specifically, the current code looks like so:

    public void parse() throws IOException
    {
        ...
        Iterator objIter = objNums.iterator(); //<------- here we create the Iterator
        /*
         * Calculating the size of the line in bytes
         */
        int w0 = xrefFormat.getInt(0);
        int w1 = xrefFormat.getInt(1);
        int w2 = xrefFormat.getInt(2);
        int lineSize = w0 + w1 + w2;

        while(pdfSource.available() > 0)
        {
            byte[] currLine = new byte[lineSize];
            pdfSource.read(currLine);

            int type = 0;
            /*
             * Grabs the number of bytes specified for the first column in
             * the W array and stores it.
             */
            for(int i = 0; i < w0; i++)
            {
                type += (currLine[i] & 0x00ff) << ((w0 - i - 1)* 8);
            }
            //Need to remember the current objID
            Integer objID = (Integer)objIter.next(); //<---- here we attempt to pull objects out of it.
            /*
             * 3 different types of entries.
             */
            switch(type)
            {
                // ... do stuff ...
            }
        }
        ...
    }

The code seems to assume that whenever pdfSource.available() > 0, the iterator still has another object number to hand out. That seems a bit vulnerable to corrupt streams. Further, it is a logic error, because the stream seems to contain lines of different types that are not processed as Xref objects. At least that seems clear from my cursory step-through.

I modified line 100 to look like

    while(pdfSource.available() > 0 && objIter.hasNext())

and it now parses your test document just fine. From what I can tell, all the text is captured.

If you use my PDFTextStripper2 you will need to adjust the vertical drop threshold used for paragraph tests.
The default is a bit too small and breaks most paragraphs up into separate chunks. I tried a value of 3 (the default is 2.5) and got decent results with your document. My Deutsch is very, very rusty, but I think it did a decent job. Note that I just uploaded a new version to PDFBOX-521 that fixes a small bug.

I will create a separate JIRA that covers this particular issue (the missing iterator test) and post the modified src file there (I am not a committer) for consideration by the devs. I will link back to this one.

> PDFTextStripper.writeCharacters is called no where in the class
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-533
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-533
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Navendu Garg
>         Attachments: TestPDFTextStripperPerf.java
>
>
> It seems writeCharacters method is not called anywhere in the PDFTextStripper
> class. This makes it impossible for handling character TextPosition as well
> as Line Separator because processLineSeparator method is no longer there and
> writeLineSeparator is called when actual writing happens.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
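
For reference, here is a minimal, standalone sketch of the iterator guard discussed in the comment above. It is not the PDFBox source: the class name XrefGuardSketch, the method pairEntries and the sample bytes are hypothetical and exist only for illustration. It shows a read loop that stops as soon as either the byte stream or the list of object numbers runs out, rather than assuming the two always end together.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; this is not PDFXrefStreamParser.
public class XrefGuardSketch
{
    /**
     * Reads fixed-width entries from the stream and pairs each entry's
     * type field with the next object number. Returns "objID:type" strings.
     */
    public static List<String> pairEntries(InputStream source,
            List<Integer> objNums, int w0, int lineSize) throws IOException
    {
        List<String> result = new ArrayList<String>();
        Iterator<Integer> objIter = objNums.iterator();

        // Guard both conditions: bytes remaining AND an object number remaining.
        // Without the hasNext() test, the next() call below throws
        // NoSuchElementException once the object numbers are exhausted.
        while (source.available() > 0 && objIter.hasNext())
        {
            byte[] currLine = new byte[lineSize];
            source.read(currLine);

            // Decode the first w0 bytes as a big-endian integer (the entry type).
            int type = 0;
            for (int i = 0; i < w0; i++)
            {
                type += (currLine[i] & 0x00ff) << ((w0 - i - 1) * 8);
            }

            Integer objID = objIter.next();
            result.add(objID + ":" + type);
        }
        return result;
    }

    public static void main(String[] args) throws IOException
    {
        // Three 3-byte entries, but only two object numbers to pair them with.
        byte[] data = { 1, 0, 10, 2, 0, 20, 1, 0, 30 };
        List<String> paired = pairEntries(
                new ByteArrayInputStream(data), Arrays.asList(7, 8), 1, 3);
        System.out.println(paired); // prints [7:1, 8:2]
    }
}
```

In this sketch the third entry is silently skipped because no object number is left for it. Whether a real parser should skip, log or reject such trailing entries is a separate design decision that this example does not try to settle.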