Re: Detecting container formats

Ken Krugler Tue, 15 Jun 2010 11:57:31 -0700

I think this is a reasonable approach, as long as (per Alex's suggestion) it's configurable in various ways.

E.g. if you know you don't want to parse OLE2-based files, so you've removed the jars for those parsers, then it would be great to have an easy way of disabling the (more expensive) mime-type detection, and potentially avoid the dependency on these same jars.

Separately, I think this issue might also trigger improvements to the existing "magic bytes" detection code in Tika. IIRC, we wound up adding full regex with some additional matching rules in Krugle, to extend the (from Nutch, same as Tika) mime-type detection code to better handle things like source code files. I imagine something similar might be needed to reliably handle container matching.


-- Ken


On Jun 15, 2010, at 10:25am, Nick Burch wrote:

Hi All
I've been thinking about TIKA-391 (intermittent incorrect mime type detection of office formats), and I think we might need to do something different for container formats.
At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, .ots, .sxw etc), I don't think the current method works well. AFAICT, we detect the container, then have sub-class matches that try to look for the appropriate children by hoping we can guess where the definition might hide within the container. However, I think this is too unreliable - for example, with a .doc file, the entry for the Word stream can come anywhere in the list of top level entries, so is very hard to reliably find without properly parsing the OLE2 structure.
So, I'd like to suggest a slightly different approach, one of loading the container format to decide the mime type. This will, of course, make the detection step slower and more memory hungry for detecting these (but only these) kinds of documents. However, provided that we keep the open container around and pass it to the parser in a later step, it's work we would've done anyway.
I'd then see the mime process be something like:
* Loop over all magic rules
 * If the magic fits and the file extension fits, pick this one
 * Otherwise if the magic fits and it's a container:
   * Load the container
   * Check the top level entries against our list for that container
   * If we get a hit, pick that
   * If nothing hits, assume it's just the container
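
The loop above could be sketched roughly as follows. This is only an illustration of the control flow, not Tika code: MagicRule, DetectionFlowSketch and the containerInspector callback are hypothetical names invented for the sketch. The outline does not say what happens when the magic fits but the extension does not and the rule is not a container, so the sketch assumes the first such rule is remembered as a fallback.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Illustrative sketch only: these types are hypothetical and are not Tika classes.
final class DetectionFlowSketch {

    /** One magic rule: the type it signals, its leading bytes, its extensions, and a container flag. */
    static final class MagicRule {
        final String mimeType;
        final byte[] magic;
        final Set<String> extensions;
        final boolean container;

        MagicRule(String mimeType, byte[] magic, Set<String> extensions, boolean container) {
            this.mimeType = mimeType;
            this.magic = magic;
            this.extensions = extensions;
            this.container = container;
        }

        /** True if the header starts with this rule's magic bytes. */
        boolean magicFits(byte[] header) {
            return header.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(header, magic.length), magic);
        }
    }

    /**
     * Walks the rules in order. containerInspector stands in for "load the
     * container and check its top level entries"; it returns empty on no hit.
     */
    static String detect(byte[] header, String extension, List<MagicRule> rules,
                         Function<MagicRule, Optional<String>> containerInspector) {
        String fallback = null;
        for (MagicRule rule : rules) {
            if (!rule.magicFits(header)) {
                continue;
            }
            // Magic fits and the file extension fits: pick this one.
            if (extension != null && rule.extensions.contains(extension)) {
                return rule.mimeType;
            }
            // Magic fits and it's a container: look inside, else assume the bare container.
            if (rule.container) {
                return containerInspector.apply(rule).orElse(rule.mimeType);
            }
            // Assumption (not in the outline): remember the first magic-only match.
            if (fallback == null) {
                fallback = rule.mimeType;
            }
        }
        return fallback != null ? fallback : "application/octet-stream";
    }
}
```

Passing the inspector in as a callback keeps the expensive container loading out of the rule loop, which would also make it easy to disable.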

eg we have a file with the zip magic, but no / unreliable filename.
We open the zip file and look at the top level directory entries.
If we spot [Content_Types].xml and /xl/ we know it's an OOXML Excel file
If we spot meta.xml and mimetype then read mimetype and go from there
...
Else decide it's just a zipfile of files, and handle appropriately
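
A minimal sketch of that zip example, using only the JDK's java.util.zip classes. The class and method names are made up for illustration, and this is not how Tika implements detection. It assumes that any entry under xl/ alongside [Content_Types].xml means a plain .xlsx workbook, which glosses over the macro-enabled and template variants, and it trusts whatever the mimetype entry says.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Illustrative sketch only; ZipContainerSketch is a hypothetical name, not a Tika class.
final class ZipContainerSketch {

    static final String XLSX =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    /** Reads the zip entries from the stream and guesses a mime type from their names. */
    static String detect(InputStream in) throws IOException {
        Set<String> names = new HashSet<>();
        String declaredType = null;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
                if (entry.getName().equals("mimetype")) {
                    // The "mimetype" entry holds the type as plain text.
                    declaredType = new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
                }
            }
        }
        // [Content_Types].xml plus something under xl/: treat as OOXML Excel.
        boolean hasXlDir = names.stream().anyMatch(n -> n.startsWith("xl/"));
        if (names.contains("[Content_Types].xml") && hasXlDir) {
            return XLSX;
        }
        // meta.xml plus mimetype: go with what the mimetype entry declares.
        if (names.contains("meta.xml") && declaredType != null && !declaredType.isEmpty()) {
            return declaredType;
        }
        // Nothing recognised: it's just a zip file of files.
        return "application/zip";
    }
}
```

Streaming with ZipInputStream visits every entry, so a real implementation would more likely read the central directory from a file and hand the opened container on to the parser.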
What does everyone else think? Is the extra work in the mime detection step (but only for container formats with no reliable filename) worth it for the improved detection?
Note - the issue of reliably picking the right mime type when given a filename with a useful extension still needs to be solved, but largely wouldn't be affected by this.

Nick


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
