[ https://issues.apache.org/jira/browse/TIKA-4391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931277#comment-17931277 ]

Ross Johnson edited comment on TIKA-4391 at 2/27/25 7:46 PM:
-------------------------------------------------------------
I've worked a lot with msg files & normalizing attachments, so just thought I'd
give a bit of a brain dump of related info.
In HTML bodies & RTF-encapsulated HTML bodies, inline images normally have
_PidTagAttachmentHidden_ = true. Note that I have seen unusual emails where an
attachment image has _PidTagAttachmentHidden_ = true, but there doesn't seem
to be a place in the HTML where that image actually goes, i.e. no apparent
reference in the HTML. This flag is also used to hide other non-image
attachments, mostly related to calendar invites & calendar exceptions.
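
A minimal sketch of the HTML-body check described above, not Tika's implementation. It assumes the parser has already read the hidden flag and the attachment's content ID (PidTagAttachContentId), and that the HTML refers to inline images with "cid:" URLs. The Attachment class and the category names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attachment:
    # Hypothetical container for values a msg parser has already extracted.
    name: str
    hidden: bool                # PidTagAttachmentHidden
    content_id: Optional[str]   # PidTagAttachContentId, if present


def classify_html_attachment(att: Attachment, html_body: str) -> str:
    """Return 'inline', 'hidden-unreferenced', or 'regular'."""
    if not att.hidden:
        return "regular"
    referenced = False
    if att.content_id:
        cid = att.content_id.strip("<>")
        # Match "cid:<content id>" but not a longer ID that merely starts with it.
        pattern = r"cid:" + re.escape(cid) + r"(?![\w.@-])"
        referenced = re.search(pattern, html_body, re.IGNORECASE) is not None
    # Hidden but never referenced: the unusual case noted above, or a
    # calendar-related attachment that was hidden for other reasons.
    return "inline" if referenced else "hidden-unreferenced"
```

This sketch keys only on the hidden flag; an attachment that is referenced by "cid:" but not flagged as hidden is reported as 'regular', which a real detector may want to treat differently.
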
For real RTF bodies that reference attachments, things are a bit different.
These attachments don't have _PidTagAttachmentHidden_ = true but rather have
_PidTagRenderingPosition_ < 0xFFFFFFFF. The main issue with RTF attachments is
determining whether the attachment is actually fully shown inline or not. For
example, an embedded message or normal binary file will just show a thumbnail
in the body (stored in an OLE presentation stream). Other attachments, such as
an Excel file, may show a selection of a worksheet inline, and clicking on that
section in Outlook then opens the full Excel file. I think true inline images
won't have any OLE presentation defined, indicating that the original image
data is used inline directly instead.
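
The RTF case can be sketched the same way. This is an illustration under stated assumptions, not a confirmed rule: it treats 0xFFFFFFFF as "not placed in the body", and it follows the tentative observation above that a true inline image has no OLE presentation. The function name, the has_ole_presentation input, and the category names are hypothetical.

```python
from typing import Optional

# Assumed sentinel: PidTagRenderingPosition value for "not placed in the body".
NOT_RENDERED = 0xFFFFFFFF


def classify_rtf_attachment(rendering_position: Optional[int],
                            has_ole_presentation: bool) -> str:
    """Return 'regular', 'placeholder', or 'inline-candidate'."""
    if rendering_position is None:
        return "regular"
    # Some readers expose the property as a signed 32-bit int (-1 == 0xFFFFFFFF).
    position = rendering_position & 0xFFFFFFFF
    if position == NOT_RENDERED:
        return "regular"
    if has_ole_presentation:
        # A thumbnail or partial view (e.g. a worksheet selection) is shown,
        # not the attachment's own data.
        return "placeholder"
    # Positioned in the body with no OLE presentation: possibly a true inline image.
    return "inline-candidate"
```

The result is named 'inline-candidate' rather than 'inline' because the absence of an OLE presentation is only a heuristic here.
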
was (Author: rossj):
I've worked a lot with msg files & normalizing attachments, so just thought I'd
give a bit of a brain dump of related info.
In HTML bodies & RTF-encapsulated HTML bodies, inline images normally have
`PidTagAttachmentHidden` = true. Note that I have seen unusual emails where an
attachment image has `PidTagAttachmentHidden` = true, but there doesn't seem
to be a place in the HTML where that image actually goes, i.e. no apparent
reference in the HTML. This flag is also used to hide other non-image
attachments, mostly related to calendar invites & calendar exceptions.
For real RTF bodies that reference attachments, things are a bit different.
These attachments don't have `PidTagAttachmentHidden` = true but rather have
`PidTagRenderingPosition` < 0xFFFFFFFF. The main issue with RTF attachments is
determining whether the attachment is actually fully shown inline or not. For
example, an embedded message or normal binary file will just show a thumbnail
in the body (stored in an OLE presentation stream). Other attachments, such as
an Excel file, may show a selection of a worksheet inline, and clicking on that
section in Outlook then opens the full Excel file. I think true inline images
won't have any OLE presentation defined, indicating that the original image
data is used inline directly instead.
> Detect inline images in msg files
> ---------------------------------
>
> Key: TIKA-4391
> URL: https://issues.apache.org/jira/browse/TIKA-4391
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Images are stored as attachments. It would be helpful to be able to
> distinguish between "inline" images that are intended to be rendered in the
> email vs regular image attachments.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)