[jira] [Comment Edited] (TIKA-2845) Override ProcessPages in PDFTextStripper

Tim Allison (JIRA) Wed, 03 Apr 2019 06:18:21 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808708#comment-16808708
 ]


Tim Allison edited comment on TIKA-2845 at 4/3/19 1:17 PM:
-----------------------------------------------------------

The attached file opens in MS Edge. It has no "Contents" element, but it does 
have an embedded file in an annotation that can be downloaded/exported with 
Edge.  We are not currently extracting this embedded file.

NOTE: Adobe DC Reader is not able to open this file.


was (Author: [email protected]):
The attached file opens in Adobe, has no "contents" element but does have an 
embedded file in an annotation that can be downloaded/exported w Adobe.  We are 
not currently extracting this embedded file.

> Override ProcessPages in PDFTextStripper
> ----------------------------------------
>
>                 Key: TIKA-2845
>                 URL: https://issues.apache.org/jira/browse/TIKA-2845
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: testPDFFileEmbInAnnotation_noContents.pdf
>
>
> On the PDFBox user list, [~lehmi] confirmed (and [~tilman] clarified) that 
> PDFTextStripper's {{processPages}} skips pages that lack a "Contents" 
> element.  Inline images are part of the "Contents" element and would still be 
> processed (e.g. in OCR).  
>  
> However, there are other elements that might be on a page that does not have 
> a "Contents" element, such as an annotation with an embedded file.
>  
> We should override {{processPages()}} to process all pages.
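
A minimal sketch of the behavior change described in the issue, assuming a simplified page model. The class, field, and method names here (ProcessAllPagesSketch, Page, hasContents, embeddedFilesInAnnotations, and both extract methods) are hypothetical stand-ins for illustration and are not PDFBox or Tika API. It only contrasts a loop that skips pages lacking "Contents" with one that visits every page, so that an embedded file attached to an annotation is still found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ProcessAllPagesSketch {

    /** Hypothetical stand-in for a PDF page; NOT a PDFBox class. */
    static final class Page {
        final boolean hasContents;                      // does the page have a "Contents" element?
        final List<String> embeddedFilesInAnnotations;  // names of files embedded in annotations

        Page(boolean hasContents, List<String> embeddedFilesInAnnotations) {
            this.hasContents = hasContents;
            this.embeddedFilesInAnnotations = embeddedFilesInAnnotations;
        }
    }

    /** Models the reported behavior: pages without "Contents" are skipped entirely. */
    static List<String> extractSkippingPagesWithoutContents(List<Page> pages) {
        List<String> found = new ArrayList<>();
        for (Page page : pages) {
            if (!page.hasContents) {
                continue; // annotations on this page are never visited
            }
            found.addAll(page.embeddedFilesInAnnotations);
        }
        return found;
    }

    /** Models the proposed behavior: every page is visited, with or without "Contents". */
    static List<String> extractFromAllPages(List<Page> pages) {
        List<String> found = new ArrayList<>();
        for (Page page : pages) {
            found.addAll(page.embeddedFilesInAnnotations);
        }
        return found;
    }

    public static void main(String[] args) {
        // Second page mirrors the attached test file: no "Contents", one embedded file.
        List<Page> pages = Arrays.asList(
                new Page(true, Collections.<String>emptyList()),
                new Page(false, Arrays.asList("attachment.bin")));

        System.out.println("skipping: " + extractSkippingPagesWithoutContents(pages));
        System.out.println("all pages: " + extractFromAllPages(pages));
    }
}
```

In the real parser the equivalent change would live in the {{processPages()}} override. Page-range and bookmark handling are deliberately left out of this sketch.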



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
