[ https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134252#comment-15134252 ]
Tim Allison commented on TIKA-1854:
-----------------------------------

Got it. This is very helpful. Thank you.

bq. Is the same mechanism used to determine the mime type of the embedded documents?

Y, sometimes it is. Depending on the container mime type and the embedded doc type, sometimes we rely on what the container document tells us the embedded file is, and sometimes we run the same mime type detection algorithms on the embedded file bytes as we do on the container files.

bq. If custom mime types worked for embedded documents that could also be useful.

They do. Or, they should... let us know if you find that not working!

bq. I think the specific formats I'm interested in are not in widespread use

Ok, y, this makes sense. Given our goal of the Babel fish, though, I wonder if we wouldn't want to add whatever you're working on?

On a related note, would there be any utility in adding a detector that checks for the storageClassId in the Metadata object and then returns a mime type based on a configurable lookup list?

> Include the storage class ID of documents embedded in MS Office documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1854
>                 URL: https://issues.apache.org/jira/browse/TIKA-1854
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Assignee: Tim Allison
>         Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the
> storage class ID of the embedded document would be a useful piece of metadata
> to have, but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
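
The detector idea raised in the comment (read the storageClassId from the metadata, then map it to a mime type through a configurable lookup list) could be sketched roughly as follows. This is a minimal, self-contained illustration and not Tika code: the class name StorageClassIdLookup, the plain Map standing in for Tika's Metadata object, the "storageClassId" key (taken from the wording of the comment; the real key may differ), and the example class ID and mime type are all assumptions made up for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical sketch: resolves a storage class ID (CLSID) to a mime type
 * through a configurable lookup table. Not part of Tika.
 */
final class StorageClassIdLookup {

    // Assumed metadata key, borrowed from the comment's wording.
    static final String STORAGE_CLASS_ID_KEY = "storageClassId";

    // Configurable table: normalized class ID -> mime type string.
    private final Map<String, String> classIdToMime = new LinkedHashMap<>();

    /** Registers one class ID -> mime type mapping and returns this for chaining. */
    StorageClassIdLookup register(String classId, String mimeType) {
        classIdToMime.put(normalize(classId), mimeType);
        return this;
    }

    /**
     * Returns the configured mime type for the class ID found in the metadata,
     * or empty when the key is absent or the class ID is not in the table, so
     * that a caller could fall back to other detection.
     */
    Optional<String> detect(Map<String, String> metadata) {
        String classId = metadata.get(STORAGE_CLASS_ID_KEY);
        if (classId == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(classIdToMime.get(normalize(classId)));
    }

    // Trims whitespace, strips optional surrounding braces, and upper-cases,
    // so "{abcd...}" and "ABCD..." compare equal.
    private static String normalize(String classId) {
        String s = classId.trim();
        if (s.startsWith("{") && s.endsWith("}")) {
            s = s.substring(1, s.length() - 1);
        }
        return s.toUpperCase(Locale.ROOT);
    }
}
```

In a real integration this lookup would presumably sit behind Tika's Detector interface and return a fallback type when nothing matches; here an empty Optional plays that role. Usage might look like `new StorageClassIdLookup().register("{00000000-0000-0000-0000-00000000abcd}", "application/x-example")`, where both values are placeholders.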