Hi,

On Sat, Aug 16, 2008 at 10:52 PM, Keith R. Bennett <[EMAIL PROTECTED]> wrote:
> Do we intend to parse zip and tar files that contain multiple files?  I'll
> apologize in advance if we've already discussed this and I've forgotten.

See TIKA-149 where Dave Meikle has been helping us cross that bridge. :-)

> If so, I'm a little concerned that the code base might be made more
> difficult to maintain and extend, if we consider parsing these equivalent to
> parsing documents.  I think that unpacking these composite files is a task
> that is orthogonal to extracting text and metadata -- in fact, IMHO it would
> be better to use a word other than "parse" to refer to this action.

I disagree. The application/zip format is just another file format, and
in a way it's no different from something like a Word or PDF
document with attachments in it.
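
To make that concrete, here is a rough sketch of the idea. It is not
Tika's actual API: the CompositeSketch class and the Extractor
interface are hypothetical names invented for this illustration. It
only assumes the standard java.util.zip classes, and shows a zip being
handled as one document whose embedded parts are passed to the same
kind of extractor that would handle a standalone file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch for illustration only; not Tika's real API.
class CompositeSketch {

    // Hypothetical stand-in for whatever extracts text from a single document.
    interface Extractor {
        String extract(String name, InputStream content) throws IOException;
    }

    // Treats the zip as one document and delegates each embedded entry
    // to the given extractor, collecting the results in archive order.
    static List<String> extractZip(InputStream zip, Extractor delegate)
            throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(zip)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content of their own
                }
                // Reads on the ZipInputStream stop at the end of the current
                // entry, so the delegate sees only this entry's bytes. The
                // delegate must not close the stream.
                parts.add(delegate.extract(entry.getName(), in));
            }
        }
        return parts;
    }
}
```

A Word or PDF document with attachments would follow the same shape:
the container format is read, and each embedded part is passed on to
the extractor for its own type.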

I think composite files are well within the scope of Tika, while
things like crawling a file system or parsing an HTTP response are
clearly outside the scope.

BR,

Jukka Zitting
