[ https://issues.apache.org/jira/browse/TIKA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2311:
------------------------------
    Summary: Preserve "x-tika-ooxml" mime value for truncated ooxml files  (was: Create x-tika-ooxml-unk mime type (?))

> Preserve "x-tika-ooxml" mime value for truncated ooxml files
> ------------------------------------------------------------
>
>                 Key: TIKA-2311
>                 URL: https://issues.apache.org/jira/browse/TIKA-2311
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>             Fix For: 2.0, 1.15
>
>
> The following is an unintended consequence of TIKA-2212.
> The OOXML parser used to handle {{x-tika-ooxml}}. We have some truncated OOXML files in our regression corpus. The previous behavior was:
> 1) ZipPackage detector caught the zip truncation exception and returned "application/zip"
> 2) The mime detector recognized magic and returned {{x-tika-ooxml}}
> 3) The file was then routed to the OOXML parser, which didn't wind up doing much with the content because it hit the zip exception early on, but the final mime type was {{x-tika-ooxml}}.
> The current behavior is:
> 1) Same detection steps
> 2) However, because the OOXML parser no longer handles {{x-tika-ooxml}}, the file is handled by the PackageParser, which overwrites the magic-determined mime type, and the new mime type is {{application/zip}}.
> 3) Some content is extracted because the PackageParser handles the zip entries in order and only throws the exception once it hits the last entry in the zip file.
> Ideally, I'd like to keep the magic-determined mime type. Once we can chain parsers, the user should be able to back off to the PackageParser, but I don't think this should be the default behavior.
> One solution would be to create a new mime type that is not the parent of the other OOXML subtypes, but is itself a leaf subtype, something like: {{x-tika-ooxml-unk}}.
> Any objections/other recommendations?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
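
For illustration only, here is a minimal sketch of the outcome the issue asks for: when the container step can only report {{application/zip}} for a truncated file, the more specific magic-determined type is kept. This is not Tika code. The class and method names are hypothetical, and the magic rule (ZIP local file header signature, then a first entry named "[Content_Types].xml" at offset 30) is an assumption about how such files typically start, not a statement of Tika's actual magic definitions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; none of these names are Tika classes.
class TruncatedOoxmlSketch {

    static final String ZIP = "application/zip";
    static final String OOXML = "application/x-tika-ooxml";
    static final String UNKNOWN = "application/octet-stream";

    // A ZIP local file header starts with "PK\003\004"; its fixed part is
    // 30 bytes long, so the first entry's file name begins at offset 30.
    private static final byte[] ZIP_SIGNATURE = {'P', 'K', 3, 4};
    private static final int NAME_OFFSET = 30;

    // Assumption: the first entry of the OOXML file is "[Content_Types].xml".
    private static final byte[] OOXML_FIRST_ENTRY =
            "[Content_Types].xml".getBytes(StandardCharsets.US_ASCII);

    // True if pattern occurs in data starting exactly at offset.
    static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || data.length < offset + pattern.length) {
            return false;
        }
        byte[] window = Arrays.copyOfRange(data, offset, offset + pattern.length);
        return Arrays.equals(window, pattern);
    }

    // Magic-only detection: needs just the leading bytes, so it still works
    // when the rest of the archive is missing.
    static String detectByMagic(byte[] prefix) {
        if (!matchesAt(prefix, 0, ZIP_SIGNATURE)) {
            return UNKNOWN;
        }
        return matchesAt(prefix, NAME_OFFSET, OOXML_FIRST_ENTRY) ? OOXML : ZIP;
    }

    // Keep the magic-determined type when the container step fell back to
    // plain zip; otherwise trust the container step.
    static String resolve(String containerType, String magicType) {
        if (ZIP.equals(containerType) && OOXML.equals(magicType)) {
            return magicType;
        }
        return containerType;
    }

    // Builds the first bytes of a fake truncated OOXML file: signature,
    // zeroed header fields, then the first entry name and nothing else.
    static byte[] truncatedOoxmlPrefix() {
        byte[] data = new byte[NAME_OFFSET + OOXML_FIRST_ENTRY.length];
        System.arraycopy(ZIP_SIGNATURE, 0, data, 0, ZIP_SIGNATURE.length);
        System.arraycopy(OOXML_FIRST_ENTRY, 0, data, NAME_OFFSET,
                OOXML_FIRST_ENTRY.length);
        return data;
    }

    public static void main(String[] args) {
        String magicType = detectByMagic(truncatedOoxmlPrefix());
        // The container step is assumed to have returned application/zip
        // after hitting the truncation exception.
        System.out.println(resolve(ZIP, magicType));
    }
}
```

Running main prints application/x-tika-ooxml. A leaf type such as {{x-tika-ooxml-unk}} would presumably take the place of the OOXML constant in a sketch like this, so that the PackageParser does not claim the file by default.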