On 08.05.2015 at 17:17, [email protected] wrote:
I’m trying to parse a PDF file that I didn’t create. I’m using PDFBox v1.8.9.

My problem is that when I call getText(doc) to extract a certain section of the 
PDF, using setStartBookmark(item) and setEndBookmark(item), I get all the text 
rather than just the text from the specified section.

While trying to resolve this I realized that the writeText(doc, outputStream) 
method always calls the resetEngine() method, which resets all the parameters 
and deletes the bookmarks I set.
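
The effect described above can be modelled with a small stand-alone sketch. 
This is a simplified illustration only, not PDFBox's actual code: ToyStripper, 
its String-based bookmarks and the two getText variants are hypothetical names 
invented here, and sections are plain {title, text} pairs instead of PDF pages.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a text stripper; NOT PDFBox's implementation.
class ToyStripper {
    // Hypothetical: bookmarks are section titles here, not PDOutlineItem objects.
    private String startBookmark;
    private String endBookmark;

    void setStartBookmark(String title) { startBookmark = title; }
    void setEndBookmark(String title) { endBookmark = title; }

    // Models the reported behaviour: a reset also drops the bookmarks.
    private void resetEngine() {
        startBookmark = null;
        endBookmark = null;
    }

    // Variant that resets before extracting, as reported for 1.8.9.
    String getTextWithReset(List<String[]> sections) {
        resetEngine();
        return extract(sections);
    }

    // Variant that keeps whatever the caller configured.
    String getTextKeepingBookmarks(List<String[]> sections) {
        return extract(sections);
    }

    // Each section is {title, text}. Output runs from the start bookmark to the
    // end bookmark (inclusive), or covers everything when no bookmark is set.
    private String extract(List<String[]> sections) {
        StringBuilder out = new StringBuilder();
        boolean inside = (startBookmark == null);
        for (String[] section : sections) {
            if (section[0].equals(startBookmark)) {
                inside = true;
            }
            if (inside) {
                out.append(section[1]);
            }
            if (section[0].equals(endBookmark)) {
                inside = false;
            }
        }
        return out.toString();
    }
}

class BookmarkResetDemo {
    public static void main(String[] args) {
        List<String[]> doc = Arrays.asList(
                new String[] {"Intro", "intro text. "},
                new String[] {"Methods", "methods text. "},
                new String[] {"Results", "results text. "});

        ToyStripper stripper = new ToyStripper();
        stripper.setStartBookmark("Methods");
        stripper.setEndBookmark("Methods");

        // Bookmarks respected: only the "Methods" section is returned.
        System.out.println(stripper.getTextKeepingBookmarks(doc));
        // Bookmarks wiped by the reset: all three sections are returned.
        System.out.println(stripper.getTextWithReset(doc));
    }
}
```

In this model the only difference between the two calls is whether the reset 
runs after the bookmarks have been set, which matches the symptom of getting 
the whole text back.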

So my first question is what is the correct way to get the text from a 
specified section of the pdf?

I've now hopefully fixed that problem in
https://issues.apache.org/jira/browse/PDFBOX-2792
A snapshot version will soon be available here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/

When I continued to try and resolve this, I created a new class that extends 
PDFTextStripper and changed the getText() and writeText() methods (also 
changing their names) so that they won’t call the resetEngine() method while 
keeping the rest of the functionality. I also had to delete the if 
(getAddMoreFormatting()) section as the parameters are private; is that a 
problem?

Now when I call the method I created I have a second problem: while it tries to 
determine the startBookmarkPageNumber in the processPages method, the 
getPageNumber method returns -1.

When I dug deeper I saw that in the findDestinationPage method the rawDest is 
of type PDNamedDestination.

The problem is that namesDict = doc.getDocumentCatalog().getNames() returns 
null, which means that the names dictionary doesn’t exist. What can be done?
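
The lookup chain described above can be sketched as a stand-alone model. 
ToyCatalog and findPageNumber are hypothetical names, not PDFBox API, and 
whether PDFBox 1.8.9 follows exactly this path for this particular file is an 
assumption; the sketch only shows how a missing names dictionary ends in -1.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of resolving a named destination to a page number.
// Hypothetical class, NOT part of PDFBox.
class ToyCatalog {
    // May be null, mirroring getNames() returning null in the report.
    private final Map<String, Integer> namedDestinations;

    ToyCatalog(Map<String, Integer> namedDestinations) {
        this.namedDestinations = namedDestinations;
    }

    // Returns the page number stored for the name, or -1 if it cannot be resolved.
    int findPageNumber(String destinationName) {
        if (namedDestinations == null) {
            // No names dictionary: there is nothing to look the name up in.
            return -1;
        }
        Integer page = namedDestinations.get(destinationName);
        return page == null ? -1 : page;
    }
}

class NamedDestinationDemo {
    public static void main(String[] args) {
        Map<String, Integer> names = new HashMap<String, Integer>();
        names.put("section2", 5);

        // With a names dictionary the bookmark resolves to its page.
        System.out.println(new ToyCatalog(names).findPageNumber("section2"));
        // Without one, the same lookup falls through to -1.
        System.out.println(new ToyCatalog(null).findPageNumber("section2"));
    }
}
```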

I should point out that in Acrobat all the bookmarks work.

I tested this on a document with names and didn't see that effect with 1.8.9. Whatever the problem is, it isn't a general one, so I need the file.

One thing to try is to load the document with loadNonSeq(file, null) instead of load().

Tilman





