[
https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134252#comment-15134252
]
Tim Allison commented on TIKA-1854:
-----------------------------------
Got it. This is very helpful. Thank you.
bq. Is the same mechanism used to determine the mime type of the embedded
documents?
Yes, sometimes. Depending on the container mime type and the embedded doc
type, we either rely on what the container document tells us the embedded file
is, or we run the same mime type detection algorithms on the embedded file
bytes as we do on the container files.
bq. If custom mime types worked for embedded documents that could also be
useful.
They do. Or, they should... let us know if you find that they don't!
bq. I think the specific formats I'm interested in are not in widespread use
Ok, yes, this makes sense. Given our goal of being the Babel fish, though, I
wonder if we wouldn't want to add whatever you're working on?
On a related note, would there be any utility in adding a detector that checks
for the storageClassId in the Metadata object and then returns a mime-type
based on a configurable lookup list?
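
To make the idea concrete, here is a rough sketch of such a lookup. It is
illustrative only, not an existing Tika class: a plain Map stands in for the
Metadata object so the snippet runs without Tika on the classpath, the
"storageClassId" key name is an assumption, and the class ID and mime type in
the usage note are made up. A real version would presumably implement
org.apache.tika.detect.Detector and read the key from Metadata.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: a plain Map<String, String> stands in for Tika's Metadata.
class StorageClassIdDetector {
    // Assumed metadata key; the real name depends on the patch.
    static final String STORAGE_CLASS_ID = "storageClassId";
    // Returned when there is no class ID or no configured mapping.
    static final String FALLBACK = "application/octet-stream";

    private final Map<String, String> lookup = new HashMap<>();

    // classIdToMime is the configurable lookup list: class ID -> mime type.
    StorageClassIdDetector(Map<String, String> classIdToMime) {
        for (Map.Entry<String, String> e : classIdToMime.entrySet()) {
            lookup.put(normalize(e.getKey()), e.getValue());
        }
    }

    // Upper-case the ID and strip optional surrounding braces so that
    // "{abc}" and "ABC" compare equal.
    private static String normalize(String classId) {
        String s = classId.trim().toUpperCase(Locale.ROOT);
        if (s.startsWith("{") && s.endsWith("}")) {
            s = s.substring(1, s.length() - 1);
        }
        return s;
    }

    // Returns the configured mime type for the class ID found in the
    // metadata, or the fallback so another detector can take over.
    String detect(Map<String, String> metadata) {
        String classId = metadata.get(STORAGE_CLASS_ID);
        if (classId == null) {
            return FALLBACK;
        }
        return lookup.getOrDefault(normalize(classId), FALLBACK);
    }
}
```

With a (made-up) entry mapping "{01234567-89ab-cdef-0123-456789abcdef}" to
"application/x-example-custom", metadata carrying that class ID in any case,
with or without braces, would resolve to the custom type, and anything else
would fall through to application/octet-stream.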
> Include the storage class ID of documents embedded in MS Office documents
> -------------------------------------------------------------------------
>
> Key: TIKA-1854
> URL: https://issues.apache.org/jira/browse/TIKA-1854
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Assignee: Tim Allison
> Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the
> storage class ID of the embedded document would be useful metadata to have,
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.