[jira] [Comment Edited] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

Tim Allison (JIRA) Fri, 23 May 2014 14:21:25 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007717#comment-14007717 ]


Tim Allison edited comment on TIKA-1294 at 5/23/14 9:19 PM:
------------------------------------------------------------

I investigated a bit more and sent a question to the PDFBox users list.  It 
looks like the memory consumption profile is far better in PDFBox 2.0 (constant 
130m), but I was getting errors when I tried to view the exported files.  With 
PDFBox 2.0, I found that govdocs 239665 (mentioned above as a JVM killer) had 
2,750 embedded images (2.6 GB) when they were fully extracted.

Given the OOM issues with PDFBox 1.8.5 on some files, I'd prefer to set the 
default behavior to not extract PDXObjectImages.  Since I hit this problem both 
in my small personal test set and in the test on the first 500 govdocs, it may 
be a fairly common issue.

Users who just want text and/or metadata will face a sizable increase in OOM 
exceptions if we leave this on by default. [~jukkaz], I won't turn off the 
feature you added without your consent, though! 

I'd also prefer to allow users to turn this on/off via config file so that 
non-dev folks who are using Tika don't have to add their own DocumentSelector.

Patch is attached. I've added a parameter in PDFParserConfig _and_ I've added 
some metadata that will allow consumers who want to use a DocumentSelector to 
tell what type of embedded object they're looking at.
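
For illustration only, here is a rough sketch of the kind of filtering that 
the added metadata would make possible for a DocumentSelector-style callback. 
It deliberately does not use Tika's classes: the metadata is modeled as a 
plain Map, and the key and value names (RESOURCE_TYPE_KEY, INLINE_IMAGE) are 
hypothetical stand-ins, not the names used in the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class InlineImageSelectorSketch {

    // Hypothetical metadata key and value; the real names are whatever the patch defines.
    public static final String RESOURCE_TYPE_KEY = "embeddedResourceType";
    public static final String INLINE_IMAGE = "INLINE";

    // Stand-in for a selector callback: returns true when the embedded
    // resource should be processed, false when it should be skipped.
    public static final Predicate<Map<String, String>> SKIP_INLINE_IMAGES =
            metadata -> !INLINE_IMAGE.equals(metadata.get(RESOURCE_TYPE_KEY));

    public static void main(String[] args) {
        // An embedded resource tagged as an inline image (e.g. a PDXObjectImage).
        Map<String, String> inlineImage = new HashMap<>();
        inlineImage.put(RESOURCE_TYPE_KEY, INLINE_IMAGE);

        // An embedded resource tagged as a regular attachment.
        Map<String, String> attachment = new HashMap<>();
        attachment.put(RESOURCE_TYPE_KEY, "ATTACHMENT");

        System.out.println("inline image selected: " + SKIP_INLINE_IMAGES.test(inlineImage));
        System.out.println("attachment selected:   " + SKIP_INLINE_IMAGES.test(attachment));
    }
}
```

A resource that carries no type metadata at all is still selected in this 
sketch, so only resources explicitly tagged as inline images are skipped.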

Any and all feedback is welcome.  I'm not wedded to the decisions I made in 
this patch.

 



> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1294
>                 URL: https://issues.apache.org/jira/browse/TIKA-1294
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>         Attachments: TIKA-1294.patch, TIKA-1294v1.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
