[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995960#comment-13995960 ]
Ray Gauss II commented on TIKA-1294:
------------------------------------
bq. Can your MediaTypeDisablingDocumentSelector tell the difference between a
jpeg that was attached to a PDF (basic attachment) and one that was derived
from a PDXObjectImage?
If by basic attachment you mean those defined in
{{PDEmbeddedFilesNameTreeNode}}, then not exactly.
Both {{PDF2XHTML.extractImages}} and {{PDF2XHTML.extractEmbeddedDocuments}} end
up using the same {{getEmbeddedDocumentExtractor}} (a
{{ParsingEmbeddedDocumentExtractor}} by default), and therefore the same
{{DocumentSelector}}, in their calls to
{{extractor.shouldParseEmbedded(metadata)}}. Neither sets a special metadata
key indicating 'attached' vs 'embedded', so a document selector can't
explicitly distinguish the two.
However, the {{PDXObjectImage}} resources *only* get the media type set in the
metadata object, while the {{PDEmbeddedFilesNameTreeNode}} resources get media
type, name, and length set, so you could potentially check for the presence of
name and length to tell them apart.
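As a rough sketch of that presence check (this is not Tika code: a plain
{{Map}} stands in for the {{Metadata}} object, the key strings are what I
believe {{Metadata.CONTENT_TYPE}}, {{Metadata.RESOURCE_NAME_KEY}}, and
{{Metadata.CONTENT_LENGTH}} resolve to, and the class and method names are
hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical heuristic: treat an embedded image that carries only a media
 * type (no name, no length) as a PDXObjectImage rather than an attachment.
 */
final class InlineImageHeuristic {
    // Assumed string values of the corresponding Tika Metadata constants.
    static final String CONTENT_TYPE = "Content-Type";
    static final String RESOURCE_NAME = "resourceName";
    static final String CONTENT_LENGTH = "Content-Length";

    /** True when the metadata has an image media type but no name or length. */
    static boolean looksLikeInlineImage(Map<String, String> metadata) {
        String type = metadata.get(CONTENT_TYPE);
        boolean isImage = type != null && type.startsWith("image/");
        boolean hasName = metadata.get(RESOURCE_NAME) != null;
        boolean hasLength = metadata.get(CONTENT_LENGTH) != null;
        return isImage && !hasName && !hasLength;
    }

    /** Selector-style decision: parse everything except suspected inline images. */
    static boolean select(Map<String, String> metadata) {
        return !looksLikeInlineImage(metadata);
    }
}
```

In a real selector the same test would live in
{{DocumentSelector.select(Metadata)}}. It is only a heuristic: an attachment
stored without a name or length would be misclassified as an inline image.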
> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
> Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types
> of embedded resources. I see two ways of allowing the client to choose
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them
> as embedded PDXObjectImages vs regular image attachments. The client can
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.