[ 
https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-447:
-------------------------------

    Attachment: TIKA-447-TikaInputStream.patch

BTW, the new Detector implementations are a bit troublesome: they break the 
contract that the detect() method must not close() the given stream and should 
use mark() and reset() where necessary to avoid changing the state of the 
stream. The rationale behind this contract is that you should be able to call 
parse() on the same stream instance after detecting its type.
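
To illustrate the contract described above, here is a minimal sketch using only 
java.io. It is not the code from the patch: the class name MarkResetDetection 
and the simple ZIP magic-byte check are assumptions made for illustration. The 
point is that detect() peeks at the stream between mark() and reset() and 
never closes it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical example class, not part of Tika.
final class MarkResetDetection {

    // ZIP local file header signature: 'P' 'K' 0x03 0x04
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Peeks at the first bytes of the stream, then restores its position.
    // The stream is never closed here, so the caller can parse it afterwards.
    static String detect(InputStream input) throws IOException {
        if (input == null || !input.markSupported()) {
            return "application/octet-stream";
        }
        input.mark(ZIP_MAGIC.length);
        try {
            byte[] prefix = new byte[ZIP_MAGIC.length];
            int total = 0;
            while (total < prefix.length) {
                int n = input.read(prefix, total, prefix.length - total);
                if (n == -1) {
                    break; // stream is shorter than the signature
                }
                total += n;
            }
            if (total == ZIP_MAGIC.length && Arrays.equals(prefix, ZIP_MAGIC)) {
                return "application/zip";
            }
            return "application/octet-stream";
        } finally {
            input.reset(); // undo the reads so the stream state is unchanged
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        InputStream stream = new ByteArrayInputStream(data);
        System.out.println(detect(stream));
        // All bytes are still unread after detection.
        System.out.println(stream.available());
    }
}
```

Because the reset() sits in a finally block, the stream position is restored 
even if reading the prefix fails part way through.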

The attached patch fixes this issue by using the TikaInputStream.getFile() 
method to access the underlying file (when available or spooled) when detecting 
these kinds of complex container formats. If the given stream is not a 
TikaInputStream, then just the generic application/zip or 
application/x-tika-msoffice type is returned.
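
The fallback behaviour described above could look roughly like the following 
sketch. It does not use Tika itself: FileBackedStream is a hypothetical 
stand-in for a stream that exposes its underlying file, and the entry-name 
check for word/document.xml is an assumed heuristic, not necessarily what the 
patch does. The sketch also assumes an earlier check has already established 
that the input is a zip container.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical example class, not part of Tika.
final class ContainerFallbackSketch {

    // Hypothetical stand-in for a stream that knows its backing file.
    static final class FileBackedStream extends FilterInputStream {
        private final File file;

        FileBackedStream(File file) throws IOException {
            super(new FileInputStream(file));
            this.file = file;
        }

        File getFile() {
            return file;
        }
    }

    // Assumes the caller already knows the input is a zip container.
    static String detect(InputStream input) throws IOException {
        if (!(input instanceof FileBackedStream)) {
            // No file access: return only the generic container type.
            return "application/zip";
        }
        File file = ((FileBackedStream) input).getFile();
        // Inspect the file directly; the stream itself is not read or closed.
        try (ZipFile zip = new ZipFile(file)) {
            if (zip.getEntry("word/document.xml") != null) {
                return "application/vnd.openxmlformats-officedocument"
                        + ".wordprocessingml.document";
            }
            return "application/zip";
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("sketch", ".zip");
        file.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(file))) {
            out.putNextEntry(new ZipEntry("word/document.xml"));
            out.write("<doc/>".getBytes("UTF-8"));
            out.closeEntry();
        }
        try (InputStream plain = new ByteArrayInputStream(new byte[0]);
             InputStream backed = new FileBackedStream(file)) {
            System.out.println(detect(plain));   // generic type only
            System.out.println(detect(backed));  // refined via the file
        }
    }
}
```

Reading the container through the file rather than the stream is what keeps 
the stream untouched for a later parse() call.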

> Container aware mimetype detection
> ----------------------------------
>
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TIKA-447-TikaInputStream.patch, 
> TikaContainerDetection.patch
>
>
> As discussed on the dev list, Tika should ideally have a configurable way to 
> process container-based formats (e.g. zip files and OLE2 files) when trying 
> to detect the correct mime type for a document.
> This needs to be configurable, because some people won't want Tika to have to 
> do all the work of parsing the whole file when they're not interested in 
> knowing exactly what's in it.
> Once we have gone to the trouble of opening and parsing the container file, 
> we should try to keep the open container around to speed up parsing of the 
> contents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
