Hi,

On Sat, Aug 16, 2008 at 10:52 PM, Keith R. Bennett <[EMAIL PROTECTED]> wrote:
> Do we intend to parse zip and tar files that contain multiple files? I'll
> apologize in advance if we've already discussed this and I've forgotten.

See TIKA-149 where Dave Meikle has been helping us cross that bridge. :-)

> If so, I'm a little concerned that the code base might be made more
> difficult to maintain and extend, if we consider parsing these equivalent to
> parsing documents. I think that unpacking these composite files is a task
> that is orthogonal to extracting text and metadata -- in fact, IMHO it would
> be better to use a word other than "parse" to refer to this action.

I disagree. The application/zip format is just another file format, and
in a way it's no different from something like a Word or PDF document
with attachments in it.

I think composite files are well within the scope of Tika, while
things like crawling a file system or parsing an HTTP response are
clearly outside the scope.

BR,

Jukka Zitting