[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1294:
------------------------------
    Attachment: TIKA-1294v1.patch

I investigated a bit more and sent a question to the pdfbox users list. It looks like the memory consumption profile is far better in PDFBox 2.0 (constant 130m), but I was getting errors when I tried to view the exported files. With PDFBox 2.0, I found that govdocs 239665 (mentioned above as a JVM killer) had 2,750 embedded images (2.6 GB) when they were fully extracted.

Given the OOM issues with PDFBox 1.8.5 on some files, I'd prefer to set the default behavior to not extract PDXObjectImages. I found this problem both in my small personal test set and in a test of the first 500 govdocs files, so it may be a fairly common issue. Users who just want text and/or metadata will face a sizable increase in OOM exceptions if we leave this on by default.

[~jukkaz], I don't want to turn off the feature you added, though, without your consent!

I'd also prefer to allow users to turn this on/off via a config file so that non-dev folks who are using Tika don't have to add their own DocumentSelector.

Patch is attached. I've added a parameter in PDFParserConfig _and_ I've added some metadata that will allow consumers who want to use a DocumentSelector to tell what type of embedded object they're looking at.

Any and all feedback is welcome. I'm not tied to the decisions I made in this patch.

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1294
>                 URL: https://issues.apache.org/jira/browse/TIKA-1294
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>         Attachments: TIKA-1294.patch, TIKA-1294v1.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types
> of embedded resources.  I see two ways of allowing the client to choose
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them
> as embedded PDXObjectImages vs regular image attachments.  The client can
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
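
For readers weighing the two options in the issue description, here is a rough, self-contained sketch of how each one might behave. It is not the code in the attached patch. `EmbeddedSelector` is a stand-in for Tika's `DocumentSelector`, redeclared over a plain `Map` so the example compiles without Tika on the classpath. The metadata key `embeddedResourceType`, the value `PDXObjectImage`, and the `extractInlineImages` flag are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Tika's org.apache.tika.extractor.DocumentSelector, redeclared
// over a plain Map so this sketch has no dependency on Tika itself.
interface EmbeddedSelector {
    boolean select(Map<String, String> metadata);
}

final class InlineImageSelectorSketch {
    // Hypothetical metadata key and value; the names in the patch may differ.
    static final String RESOURCE_TYPE_KEY = "embeddedResourceType";
    static final String INLINE_IMAGE = "PDXObjectImage";

    // Option 1: the client inspects the metadata of each embedded resource
    // and declines to process anything tagged as an inline PDF image.
    static EmbeddedSelector skipInlineImages() {
        return metadata -> !INLINE_IMAGE.equals(metadata.get(RESOURCE_TYPE_KEY));
    }

    // Option 2: a single (hypothetical) config flag decides for all inline
    // images; other embedded resources are always passed through.
    static EmbeddedSelector fromConfig(boolean extractInlineImages) {
        return metadata -> extractInlineImages
                || !INLINE_IMAGE.equals(metadata.get(RESOURCE_TYPE_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> image = new LinkedHashMap<>();
        image.put(RESOURCE_TYPE_KEY, INLINE_IMAGE);

        Map<String, String> attachment = new LinkedHashMap<>();
        attachment.put("resourceName", "report.doc");

        EmbeddedSelector selector = skipInlineImages();
        System.out.println("inline image selected: " + selector.select(image));
        System.out.println("attachment selected: " + selector.select(attachment));
    }
}
```

With option 1 the decision stays in client code, which suits developers who already register their own selector. With option 2 the same decision can come from a config file, which is the case made above for non-dev users.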