On Mon, Jul 12, 2010 at 10:37 AM, Nick Burch <nick.bu...@alfresco.com> wrote:
> On Mon, 12 Jul 2010, Paul Jakubik wrote:
>
>> I'm using tika to parse packages (zip, tar.gz, tar.bz2, etc.) and I'd like
>> to get access to the metadata for the individual files inside of the
>> package.
>>
>
> I believe there are two different tika enhancements for container formats
> needed.
>

I've tried to summarize the various use cases mentioned in your email.
Please let me know if I have correctly captured everything.

- *Containers that are conceptually a single document.* e.g. .doc (several
  named streams in an OLE2 file), or .xlsx (several named XML files in a
  zip file)
- *Containers that are conceptually containers of many separate documents.*
  e.g. a zip file with several text files in it, or a tar file with zip
  files, doc files, and text files in it.
- *Containers that are both a single document and separate documents.*
  e.g. an email with multiple parts and/or attachments, or a .doc with
  embedded spreadsheets.
- *Single documents with metadata associated with regions of the document.*
  e.g. PDF?

From the point of view of reporting metadata for documents, it might be
useful to group these use cases the following way:

- Single documents with multiple sets of metadata
   - Containers that are conceptually single documents
   - PDF?
- Containers that contain many distinct documents and/or containers
   - Containers that are conceptually containers
   - Containers that are conceptually documents and containers

Paul
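
For readers who find a lookup table easier to scan, the grouping proposed in this message could be written down roughly as follows. This is only an illustrative sketch: `UseCase`, `ReportingGroup`, and `reporting_group` are hypothetical names made up for the example and are not part of Tika, and the PDF entry carries the same question mark as in the message.

```python
from enum import Enum


class UseCase(Enum):
    """The four use cases from the message (hypothetical names)."""
    CONTAINER_AS_SINGLE_DOCUMENT = "doc_like"       # e.g. .doc, .xlsx
    CONTAINER_OF_DOCUMENTS = "archive_like"         # e.g. zip of text files, tar of mixed files
    DOCUMENT_AND_CONTAINER = "document_with_parts"  # e.g. email with attachments
    DOCUMENT_WITH_REGION_METADATA = "regions"       # e.g. PDF? (unconfirmed in the message)


class ReportingGroup(Enum):
    """The two proposed groups for reporting metadata (hypothetical names)."""
    MULTIPLE_METADATA_SETS = "single document with multiple sets of metadata"
    NESTED_DOCUMENTS = "container of many distinct documents and/or containers"


# Mapping taken directly from the nested list in the message.
REPORTING_GROUP = {
    UseCase.CONTAINER_AS_SINGLE_DOCUMENT: ReportingGroup.MULTIPLE_METADATA_SETS,
    UseCase.DOCUMENT_WITH_REGION_METADATA: ReportingGroup.MULTIPLE_METADATA_SETS,
    UseCase.CONTAINER_OF_DOCUMENTS: ReportingGroup.NESTED_DOCUMENTS,
    UseCase.DOCUMENT_AND_CONTAINER: ReportingGroup.NESTED_DOCUMENTS,
}


def reporting_group(use_case: UseCase) -> ReportingGroup:
    """Return the proposed metadata-reporting group for a use case."""
    return REPORTING_GROUP[use_case]


if __name__ == "__main__":
    for case in UseCase:
        print(f"{case.name} -> {reporting_group(case).value}")
```

Under this reading, an email with attachments and a plain zip archive land in the same reporting group, since both need metadata reported for each contained document.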