[
https://issues.apache.org/jira/browse/TIKA-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-4091:
------------------------------
Priority: Blocker (was: Major)
> OLE2 / CFB entry names should be treated case-insensitively
> -----------------------------------------------------------
>
> Key: TIKA-4091
> URL: https://issues.apache.org/jira/browse/TIKA-4091
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.8.0
> Reporter: Ross Johnson
> Priority: Blocker
> Attachments: protected - normal case.docx, protected - upper
> case.docx, simple - lower case.doc, simple - normal case.doc, simple - upper
> case.doc
>
>
> According to section [2.6.1 of
> MS-CFB|https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/60fe8611-66c3-496b-b70d-a504c94c9ace],
> entries (both "storage" and "stream" nodes) should be located by comparing
> names case-insensitively, using the uppercase mapping the specification
> defines. I believe Tika compares names case-sensitively, e.g. when looking
> for certain OLE2 objects in POIFSContainerDetector.java. As a result, Tika
> may perform incomplete or otherwise subpar type detection on OLE2 files, and
> may produce incomplete metadata and extracted text.
> Attached are some sample documents. The three "simple" ones demonstrate
> incomplete metadata and text extraction. These three files are equivalent
> except for the casing of the OLE2 entry names. Word opens all three normally
> and shows the correct metadata. Tika's output is missing all metadata and
> document content for the "upper case" and "lower case" variants.
> The two "protected" examples are likewise equivalent except for the casing.
> Tika throws an EncryptedDocumentException for "protected - normal case.docx"
> but not for "protected - upper case.docx". The password for both files is
> "password".
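
For illustration only, a minimal Java sketch of the kind of case-insensitive
entry lookup the report asks for. The class CfbNames and its methods are
hypothetical and are not part of Tika or Apache POI, and the per-character
upper-casing is an approximation of the mapping in MS-CFB that should be
checked against the specification. The entry names used in the usage note are
only examples.

```java
import java.util.Collection;
import java.util.Optional;

// Hypothetical helper for illustration; not part of Tika or Apache POI.
final class CfbNames {

    private CfbNames() {}

    // Upper-case each UTF-16 code unit with the simple, locale-independent,
    // one-to-one mapping. String.toUpperCase() is avoided because it is
    // locale-sensitive and can change the length of the string.
    static String normalize(String name) {
        char[] chars = name.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toUpperCase(chars[i]);
        }
        return new String(chars);
    }

    // Two names refer to the same entry if they have the same length and
    // are equal after upper-casing.
    static boolean sameEntry(String a, String b) {
        return a.length() == b.length() && normalize(a).equals(normalize(b));
    }

    // Return the stored name that matches the wanted name, ignoring case.
    static Optional<String> find(Collection<String> entryNames, String wanted) {
        for (String candidate : entryNames) {
            if (sameEntry(candidate, wanted)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

With a helper like this, a lookup for "WordDocument" would also match an entry
stored as "WORDDOCUMENT" or "worddocument", which is the casing variation the
attached samples exercise.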