Re: Can't resolve page number

Tilman Hausherr Sat, 09 May 2015 15:12:06 -0700

On 09.05.2015 at 23:50, Tilman Hausherr wrote:

Hello,
I’m trying to parse a PDF file that I haven’t created; I’m using PDFBox v1.8.9.
My problem is that when trying to getText(doc) from a certain section of the PDF using setStartBookmark(item) and setEndBookmark(item), I get all the text rather than just the text from the specified section.
While trying to resolve this I realized that the writeText(doc, outputStream) method always calls the resetEngine() method. That will reset all the parameters and delete the bookmarks I set.
That seems like a bug to me :-(

The two lines that reset the bookmarks were added to resetEngine in PDFBOX-1808, in rev 1553175:

https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java?r1=1553175&r2=1553174&pathrev=1553175
That was meant to save some memory. (Andreas)
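
To make the effect concrete without depending on PDFBox itself, here is a minimal toy model. ToyStripper and its fields are hypothetical stand-ins, not the real PDFTextStripper; the sketch only mirrors the behavior reported in this thread (writeText calls resetEngine, and resetEngine clears both bookmarks), so treat it as an illustration rather than the library's actual code.

```java
// Toy model only: ToyStripper is a hypothetical stand-in, NOT PDFBox's PDFTextStripper.
class ToyStripper {
    private Object startBookmark;
    private Object endBookmark;

    void setStartBookmark(Object bookmark) { startBookmark = bookmark; }
    void setEndBookmark(Object bookmark) { endBookmark = bookmark; }

    // Mirrors the two lines reported as added to resetEngine: both bookmarks are cleared.
    void resetEngine() {
        startBookmark = null;
        endBookmark = null;
    }

    // Mirrors the reported call order: reset first, then decide what to extract.
    String writeText() {
        resetEngine();
        boolean restricted = startBookmark != null || endBookmark != null;
        return restricted ? "section only" : "whole document";
    }

    public static void main(String[] args) {
        ToyStripper stripper = new ToyStripper();
        stripper.setStartBookmark(new Object());
        stripper.setEndBookmark(new Object());
        // The bookmarks set above are gone by the time extraction starts.
        System.out.println(stripper.writeText()); // prints "whole document"
    }
}
```

In this model the caller has no way to make the bookmarks survive, because the reset happens inside the same call that consumes them, which matches the symptom of getting all the text.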

I also found another weird piece of code:

        if (startPage != null && endPage != null &&
            startBookmark.getCOSObject() == endBookmark.getCOSObject())
        {
            // this is a special case where both the start and end bookmark
            // are the same but point to nothing.  In this case
            // we will not extract any text.
            startBookmarkPageNumber = 0;
            endBookmarkPageNumber = 0;
        }

(should probably be startPage == null && endPage == null && ....)

Earlier, that segment was:

        if( startBookmarkPageNumber == -1 && startBookmark != null &&
            endBookmarkPageNumber == -1 && endBookmark != null &&
            startBookmark.getCOSObject() == endBookmark.getCOSObject() )
        {
            //this is a special case where both the start and end bookmark
            //are the same but point to nothing.  In this case
            //we will not extract any text.
            startBookmarkPageNumber = 0;
            endBookmarkPageNumber = 0;
        }

which makes more sense. The change was made last year in rev 1634252 as part of the pagetree refactoring. (John)
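
The difference between the current condition and the flipped one suggested above can be checked with a small stand-alone sketch. Plain Objects stand in for the resolved pages and for the bookmarks' COS objects; the class and method names are hypothetical and none of this is PDFBox API. It only evaluates the two boolean expressions as quoted, and it assumes both bookmarks are non-null.

```java
// Illustration only: plain Objects stand in for the resolved pages and the
// bookmarks' COS objects; none of these names are PDFBox API.
class BookmarkGuardSketch {
    // Condition as quoted from the current code.
    static boolean current(Object startPage, Object endPage, Object startCos, Object endCos) {
        return startPage != null && endPage != null && startCos == endCos;
    }

    // Condition with the null checks flipped, as suggested in this message.
    static boolean suggested(Object startPage, Object endPage, Object startCos, Object endCos) {
        return startPage == null && endPage == null && startCos == endCos;
    }

    public static void main(String[] args) {
        Object cos = new Object();
        Object page = new Object();
        // Same bookmark, no resolvable page: the case the code comment describes.
        System.out.println(current(null, null, cos, cos));   // false
        System.out.println(suggested(null, null, cos, cos)); // true
        // Same bookmark pointing at a real page.
        System.out.println(current(page, page, cos, cos));   // true
        System.out.println(suggested(page, page, cos, cos)); // false
    }
}
```

Read in isolation, the current condition fires when the same bookmark resolves to a real page and misses the "points to nothing" case, which is the opposite of what the comment says.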

