[ 
https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894486#action_12894486
 ] 

Jukka Zitting commented on TIKA-447:
------------------------------------

It would be great if the AutoDetectParser could automatically leverage 
detectors that use external parser libraries. The AutoDetectParser can't 
directly link to such parsers due to dependency issues, but we could use the 
service provider mechanism, just like we do with Parser classes, to automatically 
load all the Detectors available on the classpath. To do this effectively, I'd 
also add a Detector.getSupportedTypes() method like the one below, so that more 
complex and potentially more expensive detectors (ones that need to read the 
entire document) like POIFSContainerDetector are only called if a more generic 
detector first determines that the input document matches the supported base type.

    /**
     * Returns the set of base media types supported by this detector
     * when used with the given parse context. The base media type can
     * be <code>application/octet-stream</code> for generic detectors
     * or a more specific type like <code>text/plain</code> or
     * <code>application/zip</code> for detectors that can only
     * distinguish between subtypes of that base type.
     *
     * @since Apache Tika 0.8
     * @param context parse context
     * @return immutable set of media types
     */
    Set<MediaType> getSupportedTypes(ParseContext context);
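
To make the intended dispatch concrete, here is a minimal, self-contained 
sketch of how a composite detector might use getSupportedTypes() to gate the 
expensive detectors. It deliberately does not use the real Tika API: 
SketchDetector, MagicDetector, ZipContainerDetector and DetectorChainSketch 
are hypothetical names, media types are plain strings, the ParseContext 
argument is left out, and the "container inspection" is a toy substring check. 
In a real setup the detector list would presumably come from 
java.util.ServiceLoader rather than being built by hand.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DetectorChainSketch {

    static final String OCTET_STREAM = "application/octet-stream";
    static final String ZIP = "application/zip";
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    /** Hypothetical stand-in for Detector plus the proposed method. */
    interface SketchDetector {
        /** Base types this detector can refine. */
        Set<String> getSupportedTypes();

        /** Returns a more specific type, or baseType if it cannot refine it. */
        String detect(byte[] data, String baseType);
    }

    /** Cheap generic detector: looks only at the leading magic bytes. */
    static final class MagicDetector implements SketchDetector {
        public Set<String> getSupportedTypes() {
            return Collections.singleton(OCTET_STREAM);
        }

        public String detect(byte[] data, String baseType) {
            boolean zipMagic = data.length >= 2 && data[0] == 'P' && data[1] == 'K';
            return zipMagic ? ZIP : baseType;
        }
    }

    /** Toy "expensive" detector: scans the whole input, counts its calls. */
    static final class ZipContainerDetector implements SketchDetector {
        int calls = 0;

        public Set<String> getSupportedTypes() {
            return Collections.singleton(ZIP);
        }

        public String detect(byte[] data, String baseType) {
            calls++;
            // Assumption: a substring check stands in for real container parsing.
            String text = new String(data, StandardCharsets.ISO_8859_1);
            return text.contains("word/document.xml") ? DOCX : baseType;
        }
    }

    /**
     * Starts from application/octet-stream and repeatedly hands the input to
     * a not-yet-used detector whose supported types contain the current type.
     * Each detector runs at most once, so the loop always terminates.
     */
    static String detect(List<SketchDetector> detectors, byte[] data) {
        String type = OCTET_STREAM;
        Set<SketchDetector> used = new HashSet<>();
        boolean refined = true;
        while (refined) {
            refined = false;
            for (SketchDetector d : detectors) {
                if (used.contains(d) || !d.getSupportedTypes().contains(type)) {
                    continue;
                }
                used.add(d);
                String result = d.detect(data, type);
                if (!result.equals(type)) {
                    type = result;
                    refined = true;
                    break; // restart the scan with the more specific type
                }
            }
        }
        return type;
    }
}
```

With this shape the order in which detectors are discovered should not matter 
much: ZipContainerDetector is skipped entirely for input that MagicDetector 
does not first classify as application/zip, which is the cost saving the 
proposed method is meant to enable.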


> Container aware mimetype detection
> ----------------------------------
>
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TikaContainerDetection.patch
>
>
> As discussed on the dev list, Tika should ideally have a configurable way to 
> process container-based formats (e.g. zip files and OLE2 files) when trying to 
> detect the correct mime type for a document.
> This needs to be configurable, because some people won't want Tika to have to 
> do all the work of parsing the whole file when they're not interested in 
> knowing exactly what's in it.
> Once we have gone to the trouble of opening and parsing the container file, 
> we should try to keep the open container around to speed up parsing of the 
> contents.

