[
https://issues.apache.org/jira/browse/TIKA-3556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415626#comment-17415626
]
Tim Allison edited comment on TIKA-3556 at 9/15/21, 4:35 PM:
-------------------------------------------------------------
In addition to that (which is easily fixable), things get complicated around
the open container stored on the TikaInputStream and when it should be
overwritten. If the OPCPackage arrives second, it should overwrite the ZipFile,
but if the ZipFile arrives second, it shouldn't overwrite the open container.
It would be simpler to add a sort to the DefaultZipContainerDetector that
guarantees that the OPCPackage detector always runs last.
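A minimal sketch of that sort, to make the idea concrete. It is not the Tika implementation: detectors are represented here only by their simple class names, and {{DetectorOrderSketch}}, {{opcLast}} and "OtherZipDetector" are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class DetectorOrderSketch {

    // Name used in this sketch to recognise the OPC detector.
    static final String OPC = "OPCPackageDetector";

    // Returns a copy of the detector names with the OPC detector moved to the end.
    // Boolean.FALSE sorts before Boolean.TRUE, and List.sort is stable, so all
    // other detectors keep their original relative order.
    static List<String> opcLast(List<String> detectorNames) {
        List<String> sorted = new ArrayList<>(detectorNames);
        sorted.sort(Comparator.comparing((String name) -> name.equals(OPC)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> loaded = Arrays.asList(
                "OPCPackageDetector", "OpenDocumentDetector", "OtherZipDetector");
        // prints [OpenDocumentDetector, OtherZipDetector, OPCPackageDetector]
        System.out.println(opcLast(loaded));
    }
}
```

If the OPC detector is guaranteed to run last, the OPCPackage can only ever arrive second, so a plain "later container overwrites earlier one" rule would presumably be enough.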
As a separate topic: if we change all the odt types to be subtypes of
{{application/zip}}, which we should do, we'll likely have to come up with a
more elegant way of turning off OpenOffice parsing.
If someone wants to parse zip files but doesn't want to parse a specialization
of zip, they can turn off that specialized parser. But because we allow
backoff, the file is then parsed by the Zip parser. That means they have to
exclude the parser and also exclude every mime that is a subtype of zip, which
is messy. I'm not sure what the best solution for that is, but it is not a
small change.
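To illustrate the backoff problem, here is a toy model of parser selection. It is only a sketch under stated assumptions: it assumes the odt type has already been made a subtype of {{application/zip}}, and the class {{BackoffSketch}}, the parser names "OdtParser" and "ZipParser", and the {{pickParser}} method are all hypothetical, not Tika's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class BackoffSketch {
    static final String ZIP = "application/zip";
    static final String ODT = "application/vnd.oasis.opendocument.text";

    // Assumption: ODT has already been declared a subtype of application/zip.
    static final Map<String, String> SUPERTYPE = Map.of(ODT, ZIP);

    // Hypothetical parser table: parser name -> the media type it declares.
    static final Map<String, String> PARSERS = new LinkedHashMap<>();
    static {
        PARSERS.put("OdtParser", ODT);
        PARSERS.put("ZipParser", ZIP);
    }

    // Picks a parser for the type, backing off to supertypes when no enabled
    // parser handles the type itself. Types in excludedMimes are never parsed.
    static Optional<String> pickParser(String type,
                                       Set<String> excludedParsers,
                                       Set<String> excludedMimes) {
        if (excludedMimes.contains(type)) {
            return Optional.empty();
        }
        for (String t = type; t != null; t = SUPERTYPE.get(t)) {
            for (Map.Entry<String, String> e : PARSERS.entrySet()) {
                if (e.getValue().equals(t) && !excludedParsers.contains(e.getKey())) {
                    return Optional.of(e.getKey());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Excluding only the specialized parser: backoff hands the file to ZipParser.
        System.out.println(pickParser(ODT, Set.of("OdtParser"), Set.of()));
        // To really skip odt, the mime itself has to be excluded as well.
        System.out.println(pickParser(ODT, Set.of("OdtParser"), Set.of(ODT)));
    }
}
```

In this model the second exclusion set has to list every subtype of zip the user wants skipped, which is the messy part described above.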
> DefaultZipContainerDetector returns application/zip for .odt files when
> OPCPackageDetector is on the classpath
> --------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-3556
> URL: https://issues.apache.org/jira/browse/TIKA-3556
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.1.0
> Reporter: Simon Gaeremynck
> Priority: Major
>
> This is happening because the OPCPackageDetector.detect method will [fail and
> close the underlying zip
> stream|https://github.com/apache/tika/blob/2.1.0-rc2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java#L257].
> When the next detector runs (e.g. OpenDocumentDetector), the stream it
> receives has been closed and it won't be able to detect anything.
> After all detectors have effectively no-oped, [the
> DefaultZipContainerDetector falls back to
> application/zip|https://github.com/apache/tika/blob/2.1.0-rc2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L209].
> Now, when running with the default CompositeDetector, the next detector is
> usually the MimeTypes detector. This returns the proper
> application/vnd.oasis.opendocument.text, but the [CompositeDetector will
> ignore|https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java#L86]
> it as that mime type isn't marked up as a subclass of application/zip in
> [the
> registry|https://github.com/apache/tika/blob/2.1.0-rc2/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L2327].
>
> In short, I think there are two bugs here potentially:
> # The OPCPackageDetector either shouldn't close the zip while detecting or
> the DefaultZipContainerDetector should re-open it if necessary?
> # The registry should be updated to mark up
> application/vnd.oasis.opendocument.text as a subclass of application/zip?
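The second point in the quoted report can be seen in a simplified model of the "only accept a more specific type" rule the report attributes to the CompositeDetector. This is a sketch, not the Tika source: media types are plain strings, the parent table is a hypothetical stand-in for the registry, and {{CompositeRuleSketch}}, {{isSpecializationOf}} and {{detect}} are names chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class CompositeRuleSketch {
    static final String OCTET_STREAM = "application/octet-stream";
    static final String ZIP = "application/zip";
    static final String ODT = "application/vnd.oasis.opendocument.text";

    // True if 'candidate' is a strict descendant of 'current' in the parent table.
    // Types without an explicit parent are treated as children of octet-stream.
    static boolean isSpecializationOf(String candidate, String current,
                                      Map<String, String> parents) {
        String t = candidate;
        while (!t.equals(OCTET_STREAM)) {
            t = parents.getOrDefault(t, OCTET_STREAM);
            if (t.equals(current)) {
                return true;
            }
        }
        return false;
    }

    // Walks the per-detector results in order and keeps a result only when it
    // specializes the best type found so far.
    static String detect(List<String> detectorResults, Map<String, String> parents) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type, parents)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // Zip container detector falls back to zip, then MimeTypes reports ODT.
        List<String> results = Arrays.asList(ZIP, ODT);
        // ODT not registered under zip: the ODT result is ignored.
        System.out.println(detect(results, Collections.emptyMap())); // application/zip
        // ODT registered as a subtype of zip: the ODT result wins.
        System.out.println(detect(results, Map.of(ODT, ZIP)));
    }
}
```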
--
This message was sent by Atlassian Jira
(v8.3.4#803005)