[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

Tim Allison (JIRA) Tue, 13 May 2014 19:43:22 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997127#comment-13997127
 ]


Tim Allison commented on TIKA-1294:
-----------------------------------

Ah, ok, that makes sense.  My subclassed parser would have to figure out whether 
the parent file was a PDF and then apply this logic; I wouldn't want to apply 
the rule of "don't extract attachment if the metadata object only contains media 
type" across all file types.  Am I right that the parent file type is not 
normally passed in to the embedded document extractor via metadata?
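
To make the rule above concrete, here is a minimal sketch of the check a 
subclass would need.  It deliberately avoids Tika's own classes: the class 
name, the method name, and the use of a plain Map as a stand-in for the 
metadata object are all hypothetical, and the parent media type is passed in 
explicitly because, as noted, it may not be available from the metadata.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika. A plain Map stands in for the
// metadata object handed to the embedded document extractor.
final class PdfInlineImageFilter {
    static final String CONTENT_TYPE = "Content-Type";
    static final String PDF = "application/pdf";

    private PdfInlineImageFilter() {
    }

    /**
     * Returns false only when the parent is a PDF and the embedded
     * resource's metadata carries nothing but a media type.
     * The caller must supply the parent media type (an assumption).
     */
    static boolean shouldParseEmbedded(String parentMediaType,
                                       Map<String, String> metadata) {
        boolean parentIsPdf = PDF.equals(parentMediaType);
        Set<String> keys = metadata.keySet();
        boolean onlyMediaType = keys.size() == 1 && keys.contains(CONTENT_TYPE);
        return !(parentIsPdf && onlyMediaType);
    }
}
```

The point of the explicit parentMediaType parameter is that the rule stays 
scoped to PDFs and never suppresses attachments from other container types.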

Perhaps instead of a boolean, we should use an enum to allow the user to 
control processing, with these values:
  PDXObjectImage
  EmbeddedFile
  PDXObjectImage and EmbeddedFile

This would allow a client to pick the current default, my use case, or the use 
case where you just want to extract the images that are used to render the PDF.
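
A rough sketch of what such an enum could look like follows.  All of the 
names here are hypothetical and are not taken from the attached patch or from 
Tika's PDF parser configuration; it only illustrates how three modes can 
answer the question "should this kind of embedded resource be extracted?".

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names throughout; none of this is existing Tika API.
enum EmbeddedResourceKind {
    X_OBJECT_IMAGE,  // images used to render the page (PDXObjectImage)
    EMBEDDED_FILE    // regular file attachments
}

enum PdfExtractionMode {
    X_OBJECT_IMAGES_ONLY(EnumSet.of(EmbeddedResourceKind.X_OBJECT_IMAGE)),
    EMBEDDED_FILES_ONLY(EnumSet.of(EmbeddedResourceKind.EMBEDDED_FILE)),
    BOTH(EnumSet.allOf(EmbeddedResourceKind.class));

    private final Set<EmbeddedResourceKind> kinds;

    PdfExtractionMode(Set<EmbeddedResourceKind> kinds) {
        this.kinds = kinds;
    }

    /** True if resources of the given kind should be handed to the extractor. */
    boolean shouldExtract(EmbeddedResourceKind kind) {
        return kinds.contains(kind);
    }
}
```

A parser would then consult something like 
mode.shouldExtract(EmbeddedResourceKind.X_OBJECT_IMAGE) before processing each 
image, which avoids a pair of booleans whose "both false" combination would be 
meaningless.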

On a related note, I think I just came across a nasty memory leak when 
extracting PDXObjectImages.  I have a 6-page, 1.2 MB PDF with 102 JPEGs.  It 
looks like the PDF generator created a separate image for every row in a table.  
Recursive text comes out with no problem in Tika 1.5 with -Xmx200m, but I need 
-Xmx500m to get recursive text or extract attachments with trunk.

I can't share this PDF, but I'll try to create a synthetic Apache-friendly doc 
for testing.  Initial testing suggests the leak may be in a javax component, in 
which case Tika is not to blame.  More work remains.



> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1294
>                 URL: https://issues.apache.org/jira/browse/TIKA-1294
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>         Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
