Thanks Tilman for all your great and fast work.
Unfortunately I can't share the PDF publicly because it's copyrighted.
My code for extracting the text is (simplified):
public static void main(String[] args) throws IOException {
    if (args.length != 1 && args.length != 2) {
        usage();
        System.exit(0);
    }
    boolean hasOutputPath = (args.length == 2);

    PDDocument doc = null;
    try {
        doc = PDDocument.load(args[0]);
        if (doc.isEncrypted()) {
            // Try to decrypt with an empty password.
            StandardDecryptionMaterial sdm =
                    new StandardDecryptionMaterial("");
            doc.openProtection(sdm);
        }
    }
    catch (IOException e) {
        System.err.println("Error loading PDF file");
        e.printStackTrace();
        System.exit(1);
    }
    catch (BadSecurityHandlerException e) {
        e.printStackTrace();
        System.exit(1);
    }
    catch (CryptographyException e) {
        e.printStackTrace();
        System.exit(1);
    }

    // TextParser is a class of mine that parses the text received.
    TextParser parser = new TextParser(hasOutputPath ? args[1] : args[0]);
    PDDocumentOutline outlineRoot =
            doc.getDocumentCatalog().getDocumentOutline();
    PDOutlineItem parentItem = outlineRoot.getFirstChild();
    // PDFTextStripperExt is my PDFTextStripper subclass (described below).
    PDFTextStripperExt stripper = new PDFTextStripperExt();
    int docCount = 0; // declared elsewhere in the full code

    while (parentItem != null) {
        // The next top-level bookmark ends the last child's section.
        // It is null for the last parent (won't happen in this PDF).
        PDOutlineItem nextParent = parentItem.getNextSibling();
        if (Pattern.matches(".*Commands", parentItem.getTitle())) {
            PDOutlineItem item = parentItem.getFirstChild();
            while (item != null) {
                String currentChildTitleName = item.getTitle();
                PDOutlineItem nextItem = item.getNextSibling();
                // End at the next child, or at the next parent
                // after the last child.
                PDOutlineItem endItem =
                        (nextItem != null) ? nextItem : nextParent;
                String nextChildTitleName =
                        (endItem != null) ? endItem.getTitle() : null;
                stripper.setStartBookmark(item);
                stripper.setEndBookmark(endItem);
                parser.parseText(stripper.getTextBySpecification(doc),
                        currentChildTitleName, nextChildTitleName);
                docCount++;
                item = nextItem;
            }
        }
        // Always advance, whether or not this parent's children were parsed.
        parentItem = nextParent;
    }
    doc.close();
}
(there might be some syntax errors since I simplified the code, but this is
the main concept)
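
To show the main concept without PDFBox, here is a minimal sketch of how the start/end bookmark pairs are chosen. Node and Range are hypothetical stand-ins that I made up for this illustration (they are not PDFBox classes); Node plays the role of PDOutlineItem, and a null end title stands for "until the end of the document".

```java
import java.util.ArrayList;
import java.util.List;

public class SectionRanges {
    /** Hypothetical stand-in for an outline item: a title plus its children. */
    static final class Node {
        final String title;
        final List<Node> children = new ArrayList<>();
        Node(String title) { this.title = title; }
        Node add(Node child) { children.add(child); return this; }
    }

    /** One section to extract; end == null means "until the end of the document". */
    static final class Range {
        final String start;
        final String end;
        Range(String start, String end) { this.start = start; this.end = end; }
        @Override public String toString() { return start + " -> " + end; }
    }

    /** Collects one range per child of every top-level item whose title ends in "Commands". */
    static List<Range> commandSections(List<Node> topLevel) {
        List<Range> ranges = new ArrayList<>();
        for (int p = 0; p < topLevel.size(); p++) {
            Node parent = topLevel.get(p);
            if (!parent.title.matches(".*Commands")) {
                continue;
            }
            Node nextParent = (p + 1 < topLevel.size()) ? topLevel.get(p + 1) : null;
            for (int c = 0; c < parent.children.size(); c++) {
                Node child = parent.children.get(c);
                // A section ends at the next child, or at the next parent after the last child.
                Node end = (c + 1 < parent.children.size())
                        ? parent.children.get(c + 1) : nextParent;
                ranges.add(new Range(child.title, end == null ? null : end.title));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Node> outline = new ArrayList<>();
        outline.add(new Node("Intro"));
        outline.add(new Node("Basic Commands").add(new Node("open")).add(new Node("close")));
        outline.add(new Node("Appendix"));
        for (Range r : commandSections(outline)) {
            System.out.println(r);
        }
    }
}
```

Each Range corresponds to one setStartBookmark/setEndBookmark pair in my real code.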
The code I was referring to, where namesDict =
doc.getDocumentCatalog().getNames() returns null, is part of the PDFBox code:
it is in the findDestinationPage method of the PDOutlineItem class, in the
if (rawDest instanceof PDNamedDestination) branch.
It seems that there is an anomaly in this specific PDF. I'll try to load the
PDF with loadNonSeq(file, null) and see what the difference is.
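
One possible explanation for the null names dictionary, which I have not confirmed for this file: the PDF specification allows named destinations to be stored either in the name tree under the catalog's /Names entry (PDF 1.2 and later) or in an older /Dests dictionary placed directly in the catalog (PDF 1.1). A lookup that only checks the first place would fail on a file that only uses the second. The sketch below models that fallback with plain maps; Catalog, nameTreeDests, legacyDests and findPage are hypothetical names for illustration only, not PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

public class NamedDestLookup {
    /** Hypothetical, simplified view of a document catalog. */
    static final class Catalog {
        /** Models /Names -> /Dests (name tree, PDF 1.2 and later); null if absent. */
        Map<String, Integer> nameTreeDests;
        /** Models the catalog's own /Dests dictionary (PDF 1.1 style); null if absent. */
        Map<String, Integer> legacyDests;
    }

    /** Returns the 0-based page index for a named destination, or -1 if unresolved. */
    static int findPage(Catalog catalog, String name) {
        if (catalog.nameTreeDests != null) {
            Integer page = catalog.nameTreeDests.get(name);
            if (page != null) {
                return page;
            }
        }
        // Fall back to the older dictionary instead of giving up.
        if (catalog.legacyDests != null) {
            Integer page = catalog.legacyDests.get(name);
            if (page != null) {
                return page;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Catalog catalog = new Catalog();
        catalog.legacyDests = new HashMap<>();
        catalog.legacyDests.put("G1.1000", 12); // made-up destination name and page
        System.out.println(findPage(catalog, "G1.1000"));
        System.out.println(findPage(catalog, "missing"));
    }
}
```

If my file turns out to be of the second kind, that would explain why the bookmarks work in Acrobat while getPageNumber returns -1.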
Noam
On Sun, May 10, 2015 at 5:37 PM, Tilman Hausherr <[email protected]>
wrote:
> On 08.05.2015 at 17:17, [email protected] wrote:
>
>> I’m trying to parse a PDF file that I didn’t create. I’m using PDFBox
>> v1.8.9.
>>
>> My problem is that when trying to getText(doc) from a certain section of
>> the pdf using setStartBookmark(item) and setEndBookmark(item) I get all the
>> text rather than just the text from the specified section.
>>
>> While trying to resolve this I realized that the writeText(doc,
>> outputStream) method always calls resetEngine() method. That will reset all
>> the parameters and delete the bookmarks I set.
>>
>> So my first question is what is the correct way to get the text from a
>> specified section of the pdf?
>>
>
> I've now hopefully fixed that problem in
> https://issues.apache.org/jira/browse/PDFBOX-2792
> A snapshot version will soon be available here:
>
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/
>
>> When I continued to try and resolve this I created a new class that
>> extends PDFTextStripper and I changed the getText() and writeText() methods
>> (also changing their names) so that it won’t call the resetEngine() method
>> while keeping the rest of the functionality (I also had to delete the if
>> (getAddMoreFormatting()) section as the parameters are private, is that a
>> problem?).
>>
>> Now when I call the method I created I have a second problem: while it
>> tries to determine the startBookmarkPageNumber in the processPages method,
>> the getPageNumber method returns -1.
>>
>> When I dug deeper I saw that in findDestinationPage method the rawDest is
>> of type PDNamedDestination.
>>
>> The problem is that when trying to get namesDict =
>> doc.getDocumentCatalog().getNames() it returns null. That means that the
>> names dictionary doesn’t exist. What can be done?
>>
>> Just need to point out that in Acrobat the bookmarks all work.
>>
>
> I tested this on a document with names, and I didn't have that effect with
> 1.8.9, so whatever the problem is, it isn't a general problem, so I need
> the file.
>
> One thing to try is to load the document with loadNonSeq(file,null)
> instead of load().
>
> Tilman