On 09.05.2015 at 23:50, Tilman Hausherr wrote:
Hello,
I’m trying to parse a PDF file that I didn’t create, using
PDFBox v1.8.9.
My problem is that when I call getText(doc) for a certain section
of the PDF using setStartBookmark(item) and setEndBookmark(item), I
get all the text rather than just the text from the specified section.
While trying to resolve this I realized that the writeText(doc,
outputStream) method always calls the resetEngine() method. That
resets all the parameters and deletes the bookmarks I set.
That seems like a bug to me :-(
The two lines that reset the bookmarks were added to resetEngine() in
PDFBOX-1808 in rev 1553175
https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java?r1=1553175&r2=1553174&pathrev=1553175
That change was meant to save some memory. (Andreas)
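To illustrate the effect, here is a minimal stand-in model in Java. It is not the PDFBox source: the class BookmarkStripperModel, the use of plain strings in place of PDOutlineItem, and the list of section names are all assumptions made up for this sketch. It only shows why clearing the bookmarks inside a reset that runs at the start of writeText() would make the range fall back to the whole document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the behaviour described above; NOT PDFBox code.
class BookmarkStripperModel {
    private String startBookmark; // stands in for a PDOutlineItem
    private String endBookmark;   // stands in for a PDOutlineItem
    private final boolean resetClearsBookmarks;

    BookmarkStripperModel(boolean resetClearsBookmarks) {
        this.resetClearsBookmarks = resetClearsBookmarks;
    }

    void setStartBookmark(String bookmark) { this.startBookmark = bookmark; }

    void setEndBookmark(String bookmark) { this.endBookmark = bookmark; }

    // Models a reset that also wipes the caller's bookmark settings.
    private void resetEngine() {
        if (resetClearsBookmarks) {
            startBookmark = null;
            endBookmark = null;
        }
    }

    // Returns the sections between the two bookmarks (inclusive).
    // With no bookmarks set, the whole document is returned.
    List<String> writeText(List<String> sections) {
        resetEngine(); // runs before the bookmarks are ever consulted
        int from = (startBookmark == null) ? 0 : sections.indexOf(startBookmark);
        int to = (endBookmark == null) ? sections.size() - 1 : sections.indexOf(endBookmark);
        if (from < 0 || to < from) {
            return new ArrayList<>(); // unknown or inverted range: nothing to extract
        }
        return new ArrayList<>(sections.subList(from, to + 1));
    }
}
```

With resetClearsBookmarks set to true, the model ignores whatever was passed to setStartBookmark() and setEndBookmark(), which matches the symptom reported above. With it set to false, only the requested range comes back.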
I also found another weird piece of code:
    if (startPage != null && endPage != null &&
            startBookmark.getCOSObject() == endBookmark.getCOSObject())
    {
        // this is a special case where both the start and end bookmark
        // are the same but point to nothing. In this case
        // we will not extract any text.
        startBookmarkPageNumber = 0;
        endBookmarkPageNumber = 0;
    }
(This should probably be startPage == null && endPage == null && ....)
Earlier, that segment was:
    if( startBookmarkPageNumber == -1 && startBookmark != null &&
        endBookmarkPageNumber == -1 && endBookmark != null &&
        startBookmark.getCOSObject() == endBookmark.getCOSObject() )
    {
        //this is a special case where both the start and end bookmark
        //are the same but point to nothing. In this case
        //we will not extract any text.
        startBookmarkPageNumber = 0;
        endBookmarkPageNumber = 0;
    }
which makes more sense. The change was made last year in rev 1634252 as
part of the pagetree refactoring. (John)
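The difference between the two conditions can be checked in isolation with a small Java sketch. The class and method names (BookmarkConditionCheck, currentCheck, suggestedCheck) are hypothetical, and plain Object references stand in for the page and COS objects. Only the boolean expressions mirror the snippets quoted above, and null bookmarks are not modelled.

```java
// Hypothetical helper for comparing the two conditions; NOT PDFBox code.
class BookmarkConditionCheck {

    // Condition as it reads after rev 1634252.
    static boolean currentCheck(Object startPage, Object endPage,
                                Object startCos, Object endCos) {
        return startPage != null && endPage != null && startCos == endCos;
    }

    // Condition with the suggested "== null" comparisons.
    static boolean suggestedCheck(Object startPage, Object endPage,
                                  Object startCos, Object endCos) {
        return startPage == null && endPage == null && startCos == endCos;
    }

    public static void main(String[] args) {
        Object cos = new Object();  // same bookmark used as start and end
        Object page = new Object(); // a resolved destination page

        // Same bookmark pointing to nothing: the special case in the comment.
        System.out.println(currentCheck(null, null, cos, cos));
        System.out.println(suggestedCheck(null, null, cos, cos));

        // Same bookmark pointing to a real page.
        System.out.println(currentCheck(page, page, cos, cos));
        System.out.println(suggestedCheck(page, page, cos, cos));
    }
}
```

In this model the current condition misses the "points to nothing" case and instead fires when the shared bookmark does resolve to a page, while the suggested condition behaves the way the code comment describes.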