[ https://issues.apache.org/jira/browse/PDFBOX-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-1143:
--------------------------------
    Fix Version/s: 2.0.0

> PDFTextStripper doesn't process text annotations
> ------------------------------------------------
>
>                 Key: PDFBOX-1143
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1143
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.0
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> Users are able to add annotations (comments) to a PDF, and PDFBox
> processes them correctly: you can retrieve them via
> PDPage.getAnnotations.
>
> But PDFTextStripper currently doesn't extract the text from
> annotations.
>
> I think it [optionally] should?
>
> I think we'd add a boolean (shouldProcessAnnotations?), and if
> enabled, we'd visit the annotations that have sub-type FreeText, and
> extract what text we can (Subject, TitlePopup, Contents, maybe
> RichContents?), associate the .getRectangle with the text to make a
> TextPosition, and then somehow associate that with the right
> "article" (so that annotations "over" a given article are rendered
> with it).
>
> Alternatively we just put all annotations into their own "article"?
>
> I'm not familiar enough with PDF text positioning nor PDFTextStripper
> to work out a real patch here... but I think this approach should
> work?
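
For illustration only, here is a rough sketch of the flow proposed in the report: a boolean switch, a filter on the FreeText sub-type, and text paired with the annotation rectangle. The Annotation and PositionedText classes and the extractAnnotationText method are hypothetical stand-ins invented for this sketch; they are not PDFBox classes and not an actual patch. How the result would be attached to the right "article" is left open, as in the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of this is PDFBox API.
class AnnotationTextSketch {

    /** Simplified annotation: sub-type, two text fields, and a bounding rectangle. */
    static final class Annotation {
        final String subtype;   // e.g. "FreeText" or "Link"
        final String subject;   // may be null
        final String contents;  // may be null
        final float x, y, width, height;

        Annotation(String subtype, String subject, String contents,
                   float x, float y, float width, float height) {
            this.subtype = subtype;
            this.subject = subject;
            this.contents = contents;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Extracted text plus the rectangle it came from (loosely, a text position). */
    static final class PositionedText {
        final String text;
        final float x, y, width, height;

        PositionedText(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Returns the text of FreeText annotations, or an empty list when the
     * (assumed) shouldProcessAnnotations switch is off.
     */
    static List<PositionedText> extractAnnotationText(List<Annotation> annotations,
                                                      boolean shouldProcessAnnotations) {
        List<PositionedText> result = new ArrayList<>();
        if (!shouldProcessAnnotations) {
            return result;
        }
        for (Annotation a : annotations) {
            if (!"FreeText".equals(a.subtype)) {
                continue; // only the sub-type named in the report is visited
            }
            StringBuilder sb = new StringBuilder();
            if (a.subject != null && !a.subject.isEmpty()) {
                sb.append(a.subject);
            }
            if (a.contents != null && !a.contents.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(": ");
                }
                sb.append(a.contents);
            }
            if (sb.length() > 0) {
                result.add(new PositionedText(sb.toString(), a.x, a.y, a.width, a.height));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Annotation> annotations = Arrays.asList(
            new Annotation("FreeText", "Note", "Check this figure", 10f, 20f, 100f, 30f),
            new Annotation("Link", null, "not extracted", 0f, 0f, 1f, 1f));
        for (PositionedText t : extractAnnotationText(annotations, true)) {
            System.out.println(t.text + " at x=" + t.x + ", y=" + t.y);
        }
    }
}
```

Keeping the rectangle next to the text leaves both options from the report open: the caller could match each rectangle against article bounds, or simply emit all annotation text as its own group.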