[jira] Commented: (TIKA-509) Container contents extraction

Nick Burch (JIRA) Thu, 09 Sep 2010 04:39:18 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907599#action_12907599
 ]


Nick Burch commented on TIKA-509:
---------------------------------

Jukka - your patch looks good, I just thought I'd check a few things.

Are you thinking that the ContainerExtractor interface and 
ContainerEmbededResourceHandler will remain? And that this would be the way to 
work for people whose main interest is getting at the embedded documents? 

In terms of the parser / ParserContainerExtractor, are you thinking that we 
should try to make the container-related Parsers call the nested parser from 
the ParseContext? The only issue I can see with that is that it then isn't 
possible for users to say "I don't want that file, don't bother doing lots 
of work to extract it". Admittedly the ContainerExtractor doesn't support that 
either, but I had an idea for how to do that, and I can't easily see how that'd 
fit in with parsers.
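
To make the "I don't want that file" case concrete, here is a minimal sketch 
in plain Java of a push-style extractor that asks the caller before handing 
over an entry. It is only an illustration under stated assumptions: the 
EmbeddedHandler interface, its wants/handle methods and the PushExtractSketch 
class are hypothetical names, not part of the patch or of Tika, and a zip 
archive read with java.util.zip stands in for a real container format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class PushExtractSketch {

    /** Hypothetical callback; NOT the interface from the TIKA-509 patch. */
    public interface EmbeddedHandler {
        /** Asked before the entry is handed over, so the caller can decline it. */
        boolean wants(String name);

        /** Called only for entries the caller accepted. Must not close the stream. */
        void handle(String name, InputStream content) throws IOException;
    }

    /** Walks a zip container in order, pushing each accepted entry to the handler. */
    public static void extract(InputStream container, EmbeddedHandler handler)
            throws IOException {
        try (ZipInputStream zip = new ZipInputStream(container)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !handler.wants(entry.getName())) {
                    continue; // not wanted: move on without handing it over
                }
                handler.handle(entry.getName(), zip);
            }
        }
    }

    /** Builds a small in-memory zip so the sketch needs no external files. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"a.txt", "big.bin", "b.txt"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Runs the extractor with a handler that only accepts ".txt" entries. */
    public static List<String> handledTextEntries() throws IOException {
        final List<String> handled = new ArrayList<>();
        extract(new ByteArrayInputStream(sampleZip()), new EmbeddedHandler() {
            @Override
            public boolean wants(String name) {
                return name.endsWith(".txt");
            }

            @Override
            public void handle(String name, InputStream content) throws IOException {
                byte[] buffer = new byte[1024];
                while (content.read(buffer) != -1) {
                    // drain the entry; a real handler would process the bytes
                }
                handled.add(name);
            }
        });
        return handled;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(handledTextEntries());
    }
}
```

Whether a check like wants() could sit in front of a nested parser taken from 
the ParseContext is the open question above.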

I'll apply your patch shortly, then carry on with my work on making the office 
format embedded resources available, but using your new pattern.

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3calpine.deb.1.10.1009010000250.5...@urchin.earth.li%3e
> This service will operate in a push mode, using streaming where possible (not 
> all container formats will support that). Users can control recursion, and 
> will be given the chance to process each embedded file in turn. It's up to 
> them if they process a file or skip it.
> It will work similarly to the current Parser code, with each container having 
> its own extractor in the parsers package, and the interface defined in the 
> core package. There will be an Auto extractor in the core package, configured 
> with a list of parser extractors just as AutoDetectParser is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
