On Mon, 12 Jul 2010, Paul Jakubik wrote:
> I'm using tika to parse packages (zip, tar.gz, tar.bz2, etc.) and I'd like to get access to the metadata for the individual files inside of the package.

I believe two different Tika enhancements are needed for container formats.

The first is detection of files that are held in a container format, e.g. .doc (several named streams in an OLE2 file) or .xlsx (several named XML files in a zip file). TIKA-391 and TIKA-447 cover these. There are some ideas for a possible solution in this area, but more work and review are needed.

Secondly, there's the issue of embedded documents. That could be a .zip file with half a dozen text files in it, but could equally be a .doc file with two embedded Excel spreadsheets in it.

For the latter, there is a little bit of support in Tika already, but it's not complete and certainly needs more work. OutlookExtractor is one place I know of that uses it.

Easy access to embedded document metadata is, I believe, still an outstanding issue. The solution needs to handle embedded documents as well as container formats filled with multiple files (e.g. "get me the metadata of the 2nd embedded Excel file" vs "get me the metadata of the file at /foo/bar.jpg in the zip"). Ideally it would also cope with a single file that has different metadata for different bits of it (I think PDF can do this?)
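
To make the two lookup styles concrete, here is a rough sketch of "by position" vs "by path" access for the plain zip case. It uses only java.util.zip and is not Tika's API or a proposed design: the class EmbeddedMetadataSketch, its methods, and the "resourceName" and "size" keys are all made-up names for illustration, and it does not attempt the harder cases (OLE2 embedded documents, per-section metadata).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only: none of these names exist in Tika.
public class EmbeddedMetadataSketch {

    /** Reads a zip and returns per-entry metadata keyed by entry path, in archive order. */
    public static Map<String, Map<String, String>> readEntryMetadata(byte[] zipBytes)
            throws IOException {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buffer = new byte[4096];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // only files carry metadata in this sketch
                }
                // Count the bytes ourselves, since a streamed entry may not report its size.
                long size = 0;
                int n;
                while ((n = zin.read(buffer)) != -1) {
                    size += n;
                }
                Map<String, String> meta = new LinkedHashMap<>();
                meta.put("resourceName", entry.getName());
                meta.put("size", Long.toString(size));
                result.put(entry.getName(), meta);
            }
        }
        return result;
    }

    /** "Get me the metadata of the file at this path" (null if there is no such entry). */
    public static Map<String, String> byPath(Map<String, Map<String, String>> all, String path) {
        return all.get(path);
    }

    /** "Get me the metadata of the Nth embedded file" (zero-based, archive order). */
    public static Map<String, String> byIndex(Map<String, Map<String, String>> all, int index) {
        return new ArrayList<>(all.values()).get(index);
    }

    /** Builds a small in-memory zip with two entries so the sketch is self-contained. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("readme.txt"));
            zout.write("hello".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("foo/bar.jpg"));
            zout.write("abc".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Map<String, String>> all = readEntryMetadata(sampleZip());
        System.out.println(byPath(all, "foo/bar.jpg"));
        System.out.println(byIndex(all, 0));
    }
}
```

Keying by path works for zip and tar, where entries have names, while the positional lookup is the only option for things like unnamed embedded spreadsheets in a .doc. A real solution would presumably need both, which is part of why I think this wants writing up properly.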

Assuming I've got all of the above correct, it might be worth creating a wiki page for this (probably along with a JIRA entry that references it), and starting to work up a proposed solution that handles all of the above problems and use cases.

Nick
