Detecting container formats

Nick Burch Tue, 15 Jun 2010 10:25:47 -0700

Hi All

I've been thinking about TIKA-391 (intermittent incorrect mime type detection of office formats), and I think we might need to do something different for container formats.

At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, .ots, .sxw etc), I don't think the current method works well. AFAICT, we detect the container, then have sub-class matches that try to look for the appropriate children by hoping we can guess where the definition might hide within the container. However, I think this is too unreliable - for example, with a .doc file, the entry for the Word stream can come anywhere in the list of top level entries, so it is very hard to find reliably without properly parsing the OLE2 structure.

So, I'd like to suggest a slightly different approach, one of loading the container format to decide the mime type. This will, of course, make the detection step slower and more memory hungry for detecting these (but only these) kinds of documents. However, provided that we keep the open container around and pass it to the parser in a later step, it's work we would've done anyway.


I'd then see the mime process be something like:
* Loop over all magic rules
  * If the magic fits and the file extension fits, pick this one
  * Otherwise if the magic fits and it's a container:
    * Load the container
    * Check the top level entries against our list for that container
    * If we get a hit, pick that
    * If nothing hits, assume it's just the container
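
To make the flow above concrete, here is a rough sketch in Python rather than anything resembling Tika's real API. MagicRule, inspect_container and detect are all made-up names for illustration, and the octet-stream fallback is my assumption. The outline doesn't say what to do when a non-container magic fits but the extension doesn't, so the sketch just moves on to the next rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class MagicRule:
    """Hypothetical rule: a mime type, its leading magic bytes, its extensions."""
    mime: str
    magic: bytes
    extensions: Tuple[str, ...] = ()
    # Only set for container formats. Given the raw bytes, it loads the
    # container and returns a more specific mime type, or None on no hit.
    inspect_container: Optional[Callable[[bytes], Optional[str]]] = None


def detect(data: bytes, filename: Optional[str], rules: List[MagicRule]) -> str:
    ext = None
    if filename and "." in filename:
        ext = filename.rsplit(".", 1)[-1].lower()

    for rule in rules:
        if not data.startswith(rule.magic):
            continue  # magic does not fit
        if ext is not None and ext in rule.extensions:
            return rule.mime  # magic fits and the file extension fits
        if rule.inspect_container is not None:
            # Magic fits and it's a container: load it and check its entries.
            # If nothing hits, assume it's just the container.
            return rule.inspect_container(data) or rule.mime
        # Not covered by the outline: non-container magic without a matching
        # extension. This sketch simply tries the next rule.

    return "application/octet-stream"  # assumed fallback, not from the outline
```

In a real implementation the inspector would presumably hand back the opened container as well, so the parser can reuse it later instead of opening the file twice.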

eg we have a file with the zip magic, but no / unreliable filename.
 We open the zip file and look at the top level directory entries.
 If we spot [Content_Types].xml and /xl/ we know it's an OOXML Excel file
 If we spot meta.xml and mimetype then read mimetype and go from there
 ...
 Else decide it's just a zipfile of files, and handle appropriately
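
As a rough illustration of the zip case, here is a small Python sketch using only the standard zipfile module. The function name detect_zip_type is made up, the checks are only the two described above, and a real detector would need a much longer list of entries to look for.

```python
import io
import zipfile


def detect_zip_type(data: bytes) -> str:
    """Guess a mime type for zip data by looking at its directory entries."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        # First path component of every entry, e.g. "xl" for "xl/workbook.xml".
        top_level = {n.lstrip("/").split("/", 1)[0] for n in names}

        # [Content_Types].xml plus an xl/ directory: OOXML Excel file.
        if "[Content_Types].xml" in names and "xl" in top_level:
            return ("application/vnd.openxmlformats-officedocument"
                    ".spreadsheetml.sheet")

        # meta.xml plus mimetype: read the mimetype entry and go from there.
        if "meta.xml" in names and "mimetype" in names:
            return zf.read("mimetype").decode("ascii").strip()

    # Nothing matched: just a zip file of files.
    return "application/zip"
```

Note that this reads only the zip directory and, at most, the small mimetype entry, so the extra cost over a pure magic check is mostly the directory read.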

What does everyone else think? Is the extra work in the mime detection step (but only for container formats with no reliable filename) worth it for the improved detection?


Note - when we're given a filename with a useful extension, reliably picking
 the right mime type is still an issue that needs to be solved, but it
 largely wouldn't be affected by this.

Nick
