On 09.05.2015 at 23:50, Tilman Hausherr wrote:

Hello,

I’m trying to parse a PDF file that I didn’t create, using PDFBox v1.8.9.

My problem is that when I call getText(doc) to extract a certain section of the PDF, using setStartBookmark(item) and setEndBookmark(item), I get all the text rather than just the text from the specified section.

While trying to resolve this, I realized that the writeText(doc, outputStream) method always calls the resetEngine() method. That resets all the parameters and deletes the bookmarks I set.

That seems like a bug to me :-(

The two lines that reset the bookmarks were added to resetEngine in PDFBOX-1808, in rev 1553175:
https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java?r1=1553175&r2=1553174&pathrev=1553175
That change was meant to save some memory. (Andreas)
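
The effect described above can be illustrated with a small self-contained model. This is only a sketch, not PDFBox code: ToyStripper, its resetClearsBookmarks flag and the section names are hypothetical. Only the method names setStartBookmark, setEndBookmark, getText and resetEngine are taken from the discussion.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the reported behaviour; NOT the real PDFTextStripper.
public class ResetDemo {
    static class ToyStripper {
        // Hypothetical switch: true mimics a resetEngine() that drops the bookmarks.
        private final boolean resetClearsBookmarks;
        private String startBookmark;
        private String endBookmark;

        ToyStripper(boolean resetClearsBookmarks) {
            this.resetClearsBookmarks = resetClearsBookmarks;
        }

        void setStartBookmark(String name) { startBookmark = name; }
        void setEndBookmark(String name) { endBookmark = name; }

        private void resetEngine() {
            if (resetClearsBookmarks) {
                startBookmark = null;
                endBookmark = null;
            }
        }

        // Assumes any bookmark that is set names an existing section.
        String getText(List<String> sections) {
            resetEngine(); // runs before extraction, as described in the thread
            int from = startBookmark == null ? 0 : sections.indexOf(startBookmark);
            int to = endBookmark == null ? sections.size() - 1 : sections.indexOf(endBookmark);
            return String.join(" ", sections.subList(from, to + 1));
        }
    }

    public static void main(String[] args) {
        List<String> sections = Arrays.asList("intro", "body", "appendix");

        ToyStripper buggy = new ToyStripper(true);
        buggy.setStartBookmark("body");
        buggy.setEndBookmark("body");
        System.out.println(buggy.getText(sections)); // prints "intro body appendix"

        ToyStripper fixed = new ToyStripper(false);
        fixed.setStartBookmark("body");
        fixed.setEndBookmark("body");
        System.out.println(fixed.getText(sections)); // prints "body"
    }
}
```

In this model, the bookmarks set by the caller survive only if the reset leaves them alone, which matches the symptom of getting all the text back.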

I also found another weird piece of code:

        if (startPage != null && endPage != null &&
            startBookmark.getCOSObject() == endBookmark.getCOSObject())
        {
            // this is a special case where both the start and end bookmark
            // are the same but point to nothing.  In this case
            // we will not extract any text.
            startBookmarkPageNumber = 0;
            endBookmarkPageNumber = 0;
        }

(should probably be startPage == null && endPage == null && ....)

Earlier, that segment was:

        if( startBookmarkPageNumber == -1 && startBookmark != null &&
            endBookmarkPageNumber == -1 && endBookmark != null &&
            startBookmark.getCOSObject() == endBookmark.getCOSObject() )
        {
            //this is a special case where both the start and end bookmark
            //are the same but point to nothing.  In this case
            //we will not extract any text.
            startBookmarkPageNumber = 0;
            endBookmarkPageNumber = 0;
        }

which makes more sense. The change was made last year in rev 1634252 as part of the pagetree refactoring. (John)
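
The condition suggested above (startPage == null && endPage == null) can be sketched as a standalone predicate. This is a minimal sketch under assumptions: the Bookmark class and the method name sameBookmarkPointingNowhere are hypothetical stand-ins, and the null checks on the bookmarks are carried over from the earlier version of the segment rather than from the current code.

```java
// Toy stand-ins for illustration; NOT the PDFBox classes.
public class BookmarkCheck {
    static class Bookmark {
        private final Object cosObject;
        Bookmark(Object cosObject) { this.cosObject = cosObject; }
        Object getCOSObject() { return cosObject; }
    }

    // True only when neither bookmark resolved to a page and both
    // bookmarks wrap the very same underlying object.
    static boolean sameBookmarkPointingNowhere(Object startPage, Object endPage,
                                               Bookmark startBookmark, Bookmark endBookmark) {
        return startPage == null && endPage == null
            && startBookmark != null && endBookmark != null
            && startBookmark.getCOSObject() == endBookmark.getCOSObject();
    }

    public static void main(String[] args) {
        Object shared = new Object();
        Bookmark a = new Bookmark(shared);
        Bookmark b = new Bookmark(shared);

        // Same bookmark, no page found: the special case applies.
        System.out.println(sameBookmarkPointingNowhere(null, null, a, b)); // prints "true"

        // Same bookmark, but pages were found: text should still be extracted.
        System.out.println(sameBookmarkPointingNowhere(new Object(), new Object(), a, b)); // prints "false"
    }
}
```

With the != null form quoted earlier, the second call would be the one that triggers the special case, which is the opposite of what the code comment describes.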
