[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

Ray Gauss II (JIRA) Wed, 14 May 2014 07:21:07 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997500#comment-13997500 ]


Ray Gauss II commented on TIKA-1294:
------------------------------------

I saw similarly problematic resource consumption, which was the reason I 
looked into how to disable this extraction :)

Perhaps a generic indication of why an embedded object is being parsed would 
be useful to have in the metadata object passed to the 
{{EmbeddedDocumentExtractor}}: something like an {{EmbeddedObjectContext}} enum 
with {{INLINE}} and {{ATTACHMENT}} options. Could the 
{{EmbeddedDocumentExtractor}} (which in most cases means the 
{{DocumentSelector}}) then use that value to decide whether to parse on a 
per-object basis?
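
A rough sketch of that idea, to make the proposal concrete. Everything named 
here is hypothetical: neither the {{EmbeddedObjectContext}} enum nor the 
metadata key exists in Tika, and a plain {{Map}} stands in for the metadata 
object so the snippet runs without Tika on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddedContextSketch {

    /** Hypothetical enum from the proposal: why an embedded object is being parsed. */
    enum EmbeddedObjectContext { INLINE, ATTACHMENT }

    /** Hypothetical metadata key; an assumption, not an existing Tika property. */
    static final String CONTEXT_KEY = "embeddedObjectContext";

    /**
     * Selector-style decision: returns true if the embedded object should be parsed.
     * The Map is a stand-in for the metadata object handed to the extractor.
     */
    static boolean select(Map<String, String> metadata) {
        String context = metadata.get(CONTEXT_KEY);
        if (context == null) {
            // No context recorded: fall back to parsing, as if the key were never set.
            return true;
        }
        // Skip inline objects (such as images drawn on a PDF page); parse everything else.
        return !EmbeddedObjectContext.INLINE.name().equals(context);
    }

    public static void main(String[] args) {
        Map<String, String> inlineImage = new HashMap<>();
        inlineImage.put(CONTEXT_KEY, EmbeddedObjectContext.INLINE.name());

        Map<String, String> attachment = new HashMap<>();
        attachment.put(CONTEXT_KEY, EmbeddedObjectContext.ATTACHMENT.name());

        System.out.println("inline parsed?     " + select(inlineImage));
        System.out.println("attachment parsed? " + select(attachment));
    }
}
```

The point of keying the decision on metadata is that the client chooses per 
object, rather than through a single parser-wide switch.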

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1294
>                 URL: https://issues.apache.org/jira/browse/TIKA-1294
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>         Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
