[ 
https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907319#action_12907319
 ] 

Nick Burch commented on TIKA-509:
---------------------------------

Also on the config side of things: currently there's just a straight list of 
ContainerExtractors, and each one does its own mini-detection to decide whether 
it's suitable. We may wish to change this later to work with MediaTypes, as the 
Parsers do. However, there are issues around how much extra work might be 
involved in the detection step: we don't want to duplicate the container 
parsing by forcing people to run a file through a ContainerAwareDetector and 
then process the container again in a ContainerExtractor. This is probably one 
to hold off deciding on until we have several extractors in place, when we'll 
have a better idea of how they might work and fit together.
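
As a rough illustration of the "straight list plus mini-detection" arrangement 
described above, here is a minimal sketch. All type and method names in it 
(ExtractorSketch, isSupported, AutoExtractorSketch, select) are hypothetical 
stand-ins, not the actual Tika interfaces, and the detection is reduced to a 
check of leading magic bytes on an in-memory header.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface: each extractor does its own mini-detection.
interface ExtractorSketch {
    boolean isSupported(byte[] header);
    String name();
}

// Recognises ZIP-based containers by the local file header signature "PK\3\4".
class ZipExtractorSketch implements ExtractorSketch {
    public boolean isSupported(byte[] h) {
        return h.length >= 4 && h[0] == 'P' && h[1] == 'K' && h[2] == 3 && h[3] == 4;
    }
    public String name() { return "zip"; }
}

// Recognises OLE2 compound documents by their 8-byte signature.
class Ole2ExtractorSketch implements ExtractorSketch {
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    public boolean isSupported(byte[] h) {
        return h.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(h, MAGIC.length), MAGIC);
    }
    public String name() { return "ole2"; }
}

// Holds a plain list and asks each extractor in turn; no MediaType lookup.
class AutoExtractorSketch {
    private final List<ExtractorSketch> extractors;

    AutoExtractorSketch(List<ExtractorSketch> extractors) {
        this.extractors = new ArrayList<>(extractors);
    }

    // Returns the first extractor that claims the input, or null if none does.
    ExtractorSketch select(byte[] header) {
        for (ExtractorSketch e : extractors) {
            if (e.isSupported(header)) {
                return e;
            }
        }
        return null;
    }
}
```

In this sketch the only detection cost is a cheap header check per extractor. 
A MediaType-keyed lookup would replace the loop with a map lookup, but it 
would need a full detection pass first, which is the duplication concern 
raised above.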

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3calpine.deb.1.10.1009010000250.5...@urchin.earth.li%3e
> This service will operate in a push mode, using streaming where possible (not 
> all container formats will support that). Users can control recursion, and 
> will be given the chance to process each embedded file in turn. It's up to 
> them if they process a file or skip it.
> It will work similarly to the current Parser code, with each container having 
> its own extractor in the parsers package, and the interface defined in the 
> core package. There will be an Auto extractor in the core package, configured 
> with a list of extractors just as AutoDetectParser is configured with parsers.
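
The push mode in the issue description can be pictured with the minimal 
sketch below. It is only an illustration under stated assumptions: the names 
(EmbeddedFileHandlerSketch, wants, handle, EntrySketch, PushExtractorSketch) 
are hypothetical, and an in-memory tree stands in for a real container that 
would normally be read from a stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback: the extractor pushes each embedded file to it.
interface EmbeddedFileHandlerSketch {
    // Return false to skip this embedded file.
    boolean wants(String name);
    // Called only for files the user chose to process.
    void handle(String name, byte[] data);
}

// In-memory stand-in for a container entry (assumption for the sketch).
final class EntrySketch {
    final String name;
    final byte[] data;
    final List<EntrySketch> children = new ArrayList<>();

    EntrySketch(String name, byte[] data) {
        this.name = name;
        this.data = data;
    }

    EntrySketch add(EntrySketch child) {
        children.add(child);
        return this;
    }
}

final class PushExtractorSketch {
    // Visits each embedded file in turn; descends into nested containers
    // only when the caller asked for recursion.
    static void extract(EntrySketch container, boolean recurse,
                        EmbeddedFileHandlerSketch handler) {
        for (EntrySketch child : container.children) {
            if (handler.wants(child.name)) {
                handler.handle(child.name, child.data);
            }
            if (recurse && !child.children.isEmpty()) {
                extract(child, true, handler);
            }
        }
    }
}
```

The caller never pulls entries; it only answers "process or skip" for each 
one and sets the recursion flag, which matches the control described in the 
issue.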

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.