[
https://issues.apache.org/jira/browse/TIKA-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685330#comment-17685330
]
Ross Johnson commented on TIKA-3968:
------------------------------------
I'm not seeing any extra bits of information that go along with these
EMR_COMMENT records. In general, the EMR_COMMENT record has been highly
overloaded in the EMF spec for a bunch of different purposes, but these ones
look to just be generic / "private data" records as they don't have a special
"CommentIdentifier" value.
I agree with your suggestion of looking for this special sequence of 5
comment records and assuming the 2nd one is the file name. I would think that
this could also be done without necessarily parsing through all of the EMF
records, just by looking for the surrounding pair of 25-byte "IconOnly" records, which
should have a static hex sequence of "20 00 00 00 12 00 00 00 49 00 63 00 6F 00
6E 00 4F 00 6E 00 6C 00 79 00 00".
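A rough sketch of that byte-level scan, in Python for brevity. The function names are hypothetical and not part of Tika. It only locates the marker and returns the raw bytes between the first pair. That slice would still contain EMR_COMMENT record framing, and decoding the file name out of it is left out here because the encoding of the 2nd record is not established above.

```python
# The 25-byte "IconOnly" sequence quoted above.
ICON_ONLY_MARKER = bytes.fromhex(
    "20 00 00 00 12 00 00 00 49 00 63 00 6F 00 "
    "6E 00 4F 00 6E 00 6C 00 79 00 00"
)


def find_icononly_offsets(data):
    """Return the start offset of every non-overlapping marker occurrence."""
    offsets = []
    pos = data.find(ICON_ONLY_MARKER)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(ICON_ONLY_MARKER, pos + len(ICON_ONLY_MARKER))
    return offsets


def bytes_between_first_pair(data):
    """Return the raw bytes between the first two markers, or None.

    Assumption: the file name record sits somewhere inside this slice;
    the slice is NOT a decoded file name.
    """
    offsets = find_icononly_offsets(data)
    if len(offsets) < 2:
        return None
    start = offsets[0] + len(ICON_ONLY_MARKER)
    return data[start:offsets[1]]
```

Calling bytes_between_first_pair() on the contents of one of the attached .emf files would give the candidate region to inspect by hand before committing to a decoding rule.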
For extra certainty, you could also look for any EMR_EXTTEXTOUTW records, join
the text of those, and compare that to what is in the 2nd EMR_COMMENT record,
but this would likely require iterating through each of the EMF records.
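A minimal sketch of that cross-check, again in Python with hypothetical function names. It assumes the usual EMF framing (every record starts with a little-endian 4-byte type and 4-byte size) and my reading of the EMR_EXTTEXTOUTW layout, where the character count sits at byte 44 and the string offset, relative to the record start, at byte 48. Treat those offsets as something to verify against the spec rather than as settled.

```python
import struct

EMR_EOF = 14          # record type values from the EMF spec
EMR_EXTTEXTOUTW = 84


def iter_emf_records(data):
    """Yield (type, record_bytes) for each EMF record, stopping at EMR_EOF."""
    pos = 0
    while pos + 8 <= len(data):
        rec_type, rec_size = struct.unpack_from("<II", data, pos)
        if rec_size < 8 or pos + rec_size > len(data):
            break  # malformed or truncated record; stop rather than guess
        yield rec_type, data[pos:pos + rec_size]
        if rec_type == EMR_EOF:
            break
        pos += rec_size


def exttextoutw_text(record):
    """Decode the UTF-16LE string of one EMR_EXTTEXTOUTW record."""
    if len(record) < 52:
        return ""
    # Assumed layout: Chars at offset 44, offString at offset 48.
    n_chars, off_string = struct.unpack_from("<II", record, 44)
    end = off_string + 2 * n_chars
    if end > len(record):
        return ""
    return record[off_string:end].decode("utf-16-le", errors="replace")


def painted_text(data):
    """Join the text of all EMR_EXTTEXTOUTW records in file order."""
    return "".join(
        exttextoutw_text(rec)
        for rec_type, rec in iter_emf_records(data)
        if rec_type == EMR_EXTTEXTOUTW
    )
```

The joined string could then be compared with whatever is decoded from the 2nd EMR_COMMENT record. If a long name is painted across several lines, the pieces come back concatenated without separators.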
> Reconstruct embedded file names from recent docx files
> ------------------------------------------------------
>
> Key: TIKA-3968
> URL: https://issues.apache.org/jira/browse/TIKA-3968
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: Microsoft_Word_Document.docx,
> image-2023-02-06-15-46-05-678.png, image-2023-02-06-15-58-20-443.png,
> image1-1.emf, image1-2.emf, image1.emf, image2.emf, image3.emf,
> oleObject1.bin, oleObject2.bin, testWORD has attachment.docx
>
>
> I'm starting to see among several users communicating with me privately that
> Microsoft has changed their basic behavior for files attached to at least
> docx files (possibly pptx and xlsx?). Rather than storing the original file
> name, the file associates an EMF file with an attachment. The filename that
> a human sees in the application is spelled/painted out in the EMF file, but
> does NOT exist in any of the XML.
> I'm attaching an example file.