[ 
https://issues.apache.org/jira/browse/PDFBOX-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761607#action_12761607
 ] 

Mel Martinez commented on PDFBOX-533:
-------------------------------------

Lars - thanks for posting the problematic file - I was able to reproduce the 
error.

This is actually a different error than what Navendu was hitting, but similarly 
unrelated to the text extraction code.

This is happening in the PDFXrefStreamParser.parse() method because there is no 
objIter.hasNext() test to protect the objIter.next() call on line 115.  This is 
an outright bug.

Specifically, the current code looks like so:

public void parse() throws IOException {
    ...
            Iterator objIter = objNums.iterator();   // <--- here we create the Iterator
            /*
             * Calculating the size of the line in bytes
             */
            int w0 = xrefFormat.getInt(0);
            int w1 = xrefFormat.getInt(1);
            int w2 = xrefFormat.getInt(2);
            int lineSize = w0 + w1 + w2;
            
            while(pdfSource.available() > 0)
            {
                byte[] currLine = new byte[lineSize];
                pdfSource.read(currLine);

                int type = 0;
                /*
                 * Grabs the number of bytes specified for the first column in 
                 * the W array and stores it.
                 */
                for(int i = 0; i < w0; i++)
                {
                    type += (currLine[i] & 0x00ff) << ((w0 - i - 1)* 8);
                }
                //Need to remember the current objID
                Integer objID = (Integer)objIter.next();    // <--- here we attempt to pull objects out of it
                /*
                 * 3 different types of entries. 
                 */
                switch(type)
                {
                    // ... do stuff ...
                }
            }
    ...
}

The code seems to assume that whenever pdfSource.available() > 0, objIter 
still has another object number to return.  That is fragile in the face of 
corrupt streams.  It also looks like a logic error, because the stream appears 
to contain lines of other types that are not processed as xref objects.  At 
least that is how it looked from my cursory step-through.

I modified line 100 to look like 

            while(pdfSource.available() > 0 && objIter.hasNext())

and it now parses your test document just fine.  From what I can tell all the 
text is captured.  
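
For illustration only, here is a minimal, self-contained sketch of the guarded 
loop.  It is not the PDFBox source: the class name GuardedXrefLoop, the method 
pairRows, and the "objID:type" output format are hypothetical, and a 
ByteArrayInputStream stands in for pdfSource.  It only shows that the loop 
stops cleanly when the object numbers run out before the bytes do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class GuardedXrefLoop {

    /**
     * Hypothetical stand-in for the loop in PDFXrefStreamParser.parse().
     * Reads fixed-width rows and pairs each one with the next object number,
     * stopping as soon as either the bytes or the object numbers run out.
     */
    public static List<String> pairRows(InputStream src, List<Integer> objNums,
                                        int w0, int lineSize) throws IOException {
        List<String> out = new ArrayList<>();
        Iterator<Integer> objIter = objNums.iterator();

        // The hasNext() test is the guard the original loop was missing.
        while (src.available() > 0 && objIter.hasNext()) {
            byte[] currLine = new byte[lineSize];
            src.read(currLine); // return value ignored here, as in the snippet above

            // Decode the first w0 bytes as a big-endian integer (the entry type).
            int type = 0;
            for (int i = 0; i < w0; i++) {
                type += (currLine[i] & 0x00ff) << ((w0 - i - 1) * 8);
            }

            Integer objID = objIter.next(); // safe: guarded by hasNext() above
            out.add(objID + ":" + type);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Three 3-byte rows, but only two object numbers.  Without the
        // hasNext() guard the third pass would throw NoSuchElementException.
        byte[] data = {1, 0, 0, 2, 0, 0, 1, 0, 0};
        List<String> rows = pairRows(new ByteArrayInputStream(data),
                                     Arrays.asList(10, 11), 1, 3);
        System.out.println(rows); // prints [10:1, 11:2]
    }
}
```

Note that the guard silently ignores any trailing bytes; whether the parser 
should instead log or reject such a stream is a separate decision for the devs.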

If you use my PDFTextStripper2 you will need to adjust the vertical drop 
threshold used for paragraph tests.  The default is a bit too small and it 
breaks most paragraphs up into separate chunks.  I tried a value of 3 (the 
default is 2.5) and got decent results with your document.  My German is very 
rusty, but I think it did a decent job.  Note that I just uploaded a new 
version to PDFBOX-521 that fixes a small bug.

I will create a separate JIRA issue covering this particular bug (the missing 
iterator test) and, since I am not a committer, post the modified source file 
there for consideration by the devs.  I will link it back to this one.

> PDFTextStripper.writeCharacters is called no where in the class
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-533
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-533
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Navendu Garg
>         Attachments: TestPDFTextStripperPerf.java
>
>
> It seems writeCharacters method is not called anywhere in the PDFTextStripper 
> class. This makes it impossible for handling character TextPosition as well 
> as Line Separator because processLineSeparator method is no longer there and 
> writeLineSeparator is called when actual writing happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
