PDFTextStripper doesn't process text annotations
------------------------------------------------

                 Key: PDFBOX-1143
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1143
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
            Reporter: Michael McCandless
            Priority: Minor


Users are able to add annotations (comments) to a PDF, and PDFBox
processes them correctly: you can retrieve them via
PDPage.getAnnotations.

But PDFTextStripper currently doesn't extract the text from
annotations.

I think it should, at least optionally.

I think we'd add a boolean option (shouldProcessAnnotations?). If it
is enabled, we'd visit the annotations whose sub-type is FreeText,
extract what text we can (Subject, TitlePopup, Contents, maybe
RichContents?), and pair each annotation's .getRectangle with its text
to make a TextPosition. We would then need some way to assign that
TextPosition to the right "article", so that annotations "over" a
given article are rendered with it.

Alternatively we just put all annotations into their own "article"?
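
To make the "own article" alternative concrete, here is a rough sketch
of the flow. It deliberately does not use the PDFBox API: the Note
class, the extract method, and the shouldProcessAnnotations parameter
are hypothetical stand-ins for whatever PDFTextStripper would really
need, and the top-to-bottom ordering is an assumption about how such
an article might be laid out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AnnotationTextSketch {

    /** Hypothetical stand-in for an annotation; NOT a PDFBox class. */
    static final class Note {
        final String subtype;   // e.g. "FreeText", "Link"
        final String subject;   // may be null
        final String title;     // the "TitlePopup" value, may be null
        final String contents;  // may be null
        final float x;          // lower-left x of the annotation rectangle
        final float y;          // lower-left y of the annotation rectangle

        Note(String subtype, String subject, String title,
             String contents, float x, float y) {
            this.subtype = subtype;
            this.subject = subject;
            this.title = title;
            this.contents = contents;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Collects the text of all FreeText notes into one separate "article".
     * Returns an empty string when the (hypothetical) option is disabled.
     */
    static String extract(List<Note> notes, boolean shouldProcessAnnotations) {
        if (!shouldProcessAnnotations) {
            return "";
        }
        List<Note> freeText = new ArrayList<>();
        for (Note n : notes) {
            if ("FreeText".equals(n.subtype)) {
                freeText.add(n);
            }
        }
        // PDF user space has y growing upward, so sort by descending y
        // (top of page first), then by ascending x (left to right).
        freeText.sort(Comparator.comparingDouble((Note n) -> -n.y)
                                .thenComparingDouble(n -> n.x));

        StringBuilder article = new StringBuilder();
        for (Note n : freeText) {
            for (String part : new String[] {n.subject, n.title, n.contents}) {
                if (part != null && !part.isEmpty()) {
                    article.append(part).append('\n');
                }
            }
        }
        return article.toString();
    }
}
```

A real patch would read these values from the page's annotations and
would still have to decide how the resulting text is positioned
relative to the page's existing articles.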

I'm not familiar enough with PDF text positioning or with
PDFTextStripper to work out a real patch here, but I think this
approach should work.



        
