[jira] Updated: (TIKA-509) Container contents extraction

Jukka Zitting (JIRA) Wed, 08 Sep 2010 11:11:55 -0700

     [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting updated TIKA-509:
-------------------------------

    Attachment: 0001-TIKA-509-Container-contents-extraction.patch

I'm not too excited about the idea of introducing a completely new mechanism in 
parallel with the Parser API we already have. AFAIUI the Parser API already 
supports all the functionality you're looking for.

See the attached patch that copies the embedded document handling code from the 
POIFSContainerExtractor class to our existing OfficeParser implementation, and 
adds a generic ParserContainerExtractor class that implements the 
ContainerExtractor interface based on our existing Parser and Detector APIs.

This solution passes all the current test cases (see the modifications I made 
to POIFSContainerExtractorTest), implements the embedded document support asked 
for in TIKA-489, and as a bonus gives you ContainerExtractor support for all 
the package formats (zip, tar, cpio, etc.) that we already have parsers for.
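
To illustrate the adapter idea only (this is not the code in the attached 
patch), here is a minimal, JDK-only sketch. MiniParser, EmbeddedSink, 
MiniDetector, MiniHandler and GenericExtractor are hypothetical stand-ins 
invented for this example; they are not the Tika Parser, Detector or 
ContainerExtractor types, and the toy "name=content;..." container format is 
likewise made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ParserBackedExtractorSketch {

    /** Hypothetical stand-in for an existing parser that reports embedded documents. */
    interface MiniParser {
        void parse(byte[] document, EmbeddedSink sink);
    }

    /** Hypothetical callback the parser invokes once per embedded document. */
    interface EmbeddedSink {
        void embedded(String name, byte[] content);
    }

    /** Hypothetical stand-in for a type detector. */
    interface MiniDetector {
        String detect(byte[] content);
    }

    /** Hypothetical stand-in for the callback a container extractor hands to its caller. */
    interface MiniHandler {
        void handle(String name, String type, byte[] content);
    }

    /** Generic extractor: no format-specific code, it only wires parser and detector together. */
    static final class GenericExtractor {
        private final MiniParser parser;
        private final MiniDetector detector;

        GenericExtractor(MiniParser parser, MiniDetector detector) {
            this.parser = parser;
            this.detector = detector;
        }

        void extract(byte[] container, MiniHandler handler) {
            // Each embedded document the parser reports is typed by the detector
            // and then pushed to the caller's handler.
            parser.parse(container, (name, content) ->
                    handler.handle(name, detector.detect(content), content));
        }
    }

    static List<String> demo() {
        // Toy container format (an assumption for this sketch): "name=content" pairs joined by ';'.
        MiniParser toyParser = (document, sink) -> {
            String text = new String(document, StandardCharsets.UTF_8);
            for (String part : text.split(";")) {
                String[] pair = part.split("=", 2);
                sink.embedded(pair[0], pair[1].getBytes(StandardCharsets.UTF_8));
            }
        };
        // Toy detector: looks only at the leading bytes.
        MiniDetector toyDetector = content ->
                new String(content, StandardCharsets.UTF_8).startsWith("%PDF")
                        ? "application/pdf" : "text/plain";

        List<String> seen = new ArrayList<>();
        new GenericExtractor(toyParser, toyDetector).extract(
                "a.txt=hello;b.pdf=%PDF-1.4".getBytes(StandardCharsets.UTF_8),
                (name, type, content) -> seen.add(name + " -> " + type));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the sketch is that any format with a parser able to report its 
embedded documents gets extractor support without a per-format extractor class.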

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3calpine.deb.1.10.1009010000250.5...@urchin.earth.li%3e
> This service will operate in a push mode, using streaming where possible (not 
> all container formats will support that). Users can control recursion, and 
> will be given the chance to process each embedded file in turn. It's up to 
> them if they process a file or skip it.
> It will work similarly to the current Parser code, with each container having 
> its own extractor in the parsers package, and the interface defined in the 
> core package. There will be an Auto extractor in the core package, configured 
> with a list of parser extractors just like AutoDetectParser does.
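
The push-mode behaviour described above can be sketched with nothing but the 
JDK's zip support. This is an illustration under stated assumptions, not the 
proposed Tika interface: EmbeddedHandler, extract, zipOf and demo are 
hypothetical names, only zip containers are handled, and nested containers are 
recognised by a ".zip" file name rather than by real type detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PushExtractorSketch {

    /** Hypothetical callback: the caller decides per entry whether to process or skip it. */
    interface EmbeddedHandler {
        boolean wants(String name);
        void handle(String name, InputStream content) throws IOException;
    }

    /** Streams through a zip container and pushes each entry to the handler. */
    static void extract(InputStream container, EmbeddedHandler handler, boolean recurse)
            throws IOException {
        ZipInputStream zip = new ZipInputStream(container);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String name = entry.getName();
            if (entry.isDirectory()) {
                continue;
            }
            if (recurse && name.endsWith(".zip")) {
                // Nested container: read it straight from the current entry's stream.
                extract(zip, handler, true);
            } else if (handler.wants(name)) {
                // The handler reads the entry but must not close the stream.
                handler.handle(name, zip);
            }
            // Entries that are not wanted are skipped by the next getNextEntry() call.
        }
    }

    /** Builds an in-memory zip for the demo. */
    static byte[] zipOf(String[] names, byte[][] contents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                out.putNextEntry(new ZipEntry(names[i]));
                out.write(contents[i]);
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Extracts only *.txt entries from a small nested zip and records what was seen. */
    static List<String> demo(boolean recurse) {
        try {
            byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
            byte[] inner = zipOf(new String[] {"c.txt"}, new byte[][] {text});
            byte[] outer = zipOf(new String[] {"a.txt", "b.bin", "inner.zip"},
                                 new byte[][] {text, text, inner});
            List<String> seen = new ArrayList<>();
            extract(new ByteArrayInputStream(outer), new EmbeddedHandler() {
                public boolean wants(String name) {
                    return name.endsWith(".txt");
                }
                public void handle(String name, InputStream content) throws IOException {
                    String body = new String(content.readAllBytes(), StandardCharsets.UTF_8);
                    seen.add(name + ":" + body);
                }
            }, recurse);
            return seen;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(true));
        System.out.println(demo(false));
    }
}
```

With recursion switched off, the nested archive is offered to the handler like 
any other entry, so the caller still chooses whether to process or skip it.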

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
