On Mon, Jul 12, 2010 at 10:37 AM, Nick Burch <nick.bu...@alfresco.com> wrote:

> On Mon, 12 Jul 2010, Paul Jakubik wrote:
>
>> I'm using tika to parse packages (zip, tar.gz, tar.bz2, etc.) and I'd like
>> to get access to the metadata for the individual files inside of the
>> package.
>>
>
> I believe there are two different tika enhancements for container formats
> needed.
>

I've tried to summarize the various use cases mentioned in your email.
Please let me know if I have correctly captured everything.

- *Containers that are conceptually a single document.* eg .doc (several
named streams in an OLE2 file), or .xlsx (several named xml files in a zip
file)

- *Containers that are conceptually containers of many separate documents.* eg
a zip file with several text files in it, or a tar file with zip files, doc
files, and text files in it.

- *Containers that are both a single document and separate documents.* eg an
email with multiple parts and/or attachments, or a .doc with embedded
spreadsheets.

- *Single documents with metadata associated with regions of the document.* eg
PDF?

From the point of view of reporting metadata for documents, it might be
useful to group these use cases the following way:

- Single documents with multiple sets of metadata
    - Containers that are conceptually single documents
    - PDF?

- Containers that contain many distinct documents and/or containers
    - Containers that are conceptually containers
    - Containers that are conceptually documents and containers
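
To make the grouping above a bit more concrete, here is a minimal sketch in
Java of how the two groups might be modelled. DocumentNode and all of its
methods are hypothetical names made up for illustration; they are not existing
Tika classes, and the file names and metadata keys are invented sample values.
The assumption is that one node type can cover both groups: a node may carry
several named metadata sets (first group) and may also hold nested nodes
(second group).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model (not a Tika class): a document with metadata sets and optional children. */
class DocumentNode {
    final String name;
    // Metadata sets keyed by the part they describe, e.g. an OLE2 stream or a document region.
    final Map<String, Map<String, String>> metadataSets = new LinkedHashMap<>();
    // Distinct documents or containers nested inside this one; empty for a plain document.
    final List<DocumentNode> children = new ArrayList<>();

    DocumentNode(String name) {
        this.name = name;
    }

    /** Adds one key/value pair to the metadata set for the given part, creating the set if needed. */
    DocumentNode addMetadata(String part, String key, String value) {
        metadataSets.computeIfAbsent(part, p -> new LinkedHashMap<>()).put(key, value);
        return this;
    }

    DocumentNode addChild(DocumentNode child) {
        children.add(child);
        return this;
    }

    /** True when this node holds at least one distinct nested document. */
    boolean isContainer() {
        return !children.isEmpty();
    }

    /** Counts this node plus every node nested below it. */
    int countDocuments() {
        int total = 1;
        for (DocumentNode child : children) {
            total += child.countDocuments();
        }
        return total;
    }
}

class ContainerModelDemo {
    public static void main(String[] args) {
        // Group 1: a single document with more than one set of metadata.
        DocumentNode doc = new DocumentNode("report.doc")
                .addMetadata("SummaryInformation", "title", "Report")
                .addMetadata("DocumentSummaryInformation", "company", "Example Corp");

        // Group 2: a container of distinct documents, which may themselves be containers.
        DocumentNode zip = new DocumentNode("inner.zip")
                .addChild(new DocumentNode("notes.txt"));
        DocumentNode tar = new DocumentNode("bundle.tar")
                .addChild(doc)
                .addChild(zip);

        System.out.println(tar.name + " holds " + tar.countDocuments() + " nodes in total");
    }
}
```

An email with attachments, or a .doc with embedded spreadsheets, would in this
sketch simply be a node that has both metadata sets and children.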

Paul
