Hello

Nick Burch at "Tue, 15 Jun 2010 18:25:13 +0100 (BST)" wrote:
NB> Hi All
NB> I've been thinking about TIKA-391 (intermittent incorrect mime type detection of office
NB> formats), and I think we might need to do something different for container formats.

NB> At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), and for ZIP based
NB> files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, .ots, .sxw etc), I don't think the
NB> current method works well. AFAICT,
NB> we detect the container, then have sub-class matches that try to look for the appropriate
NB> children by hoping we can guess where the definition might hide within the
NB> container. However, I think this is too unreliable - for example, with a .doc file, the
NB> entry for the Word stream can come anywhere in the list of top level entries, so is very
NB> hard to reliably find without properly parsing the OLE2 structure

Hmm, the WordDocument stream in a .doc file can only be under the root (/) directory entry, but yes, it could be anywhere in the list of OLE2 entries...

NB> So, I'd like to suggest a slightly different approach, one of loading the container format
NB> to decide the mime type. This will, of course, make the detection step slower and more
NB> memory hungry for detecting these (but only these) kinds of documents. However, provided
NB> that we keep the open container around and pass it to the parser in a later step, it's
NB> work we would've done anyway.

NB> I'd then see the mime process be something like:
NB> * Loop over all magic rules
NB>   * If the magic fits and the file extension fits, pick this one
NB>   * Otherwise if the magic fits and it's a container:
NB>     * Load the container
NB>     * Check the top level entries against our list for that container
NB>     * If we get a hit, pick that
NB>     * If nothing hits, assume it's just the container

Maybe it would be useful to make this configurable? Sometimes it's useful to force media type detection by magic only, not by extension (for example, when the file could have been renamed)...
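
To make the discussion concrete, here is a minimal sketch of the flow quoted above, including the configurable switch. It is written in Python only for brevity and is not Tika's API: the names `detect`, `use_extension`, `MAGIC_RULES`, `CONTAINER_EXTENSIONS`, `ZIP_ENTRY_TYPES` and `make_zip` are all hypothetical, only the ZIP container is handled (OLE2 would need a real parser), and the entry names in the table are illustrative assumptions rather than a complete list.

```python
import io
import os
import zipfile

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Hypothetical magic rules: (leading bytes, container media type).
MAGIC_RULES = [(b"PK\x03\x04", "application/zip")]

# Hypothetical per-container extension table.
CONTAINER_EXTENSIONS = {
    "application/zip": {".zip": "application/zip", ".docx": DOCX, ".xlsx": XLSX},
}

# Hypothetical list of entries that identify a ZIP-based format.
ZIP_ENTRY_TYPES = {"word/document.xml": DOCX, "xl/workbook.xml": XLSX}


def detect(data: bytes, filename: str = "", use_extension: bool = True) -> str:
    """Return a media type for `data`, following the steps quoted above."""
    ext = os.path.splitext(filename)[1].lower()
    for magic, container_type in MAGIC_RULES:
        if not data.startswith(magic):
            continue
        # Magic fits and the extension fits: pick this one (if allowed).
        by_ext = CONTAINER_EXTENSIONS.get(container_type, {})
        if use_extension and ext in by_ext:
            return by_ext[ext]
        # Otherwise load the container and check its entries.
        if container_type == "application/zip":
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
            for entry, media_type in ZIP_ENTRY_TYPES.items():
                if entry in names:
                    return media_type
        # Nothing hit: assume it is just the container.
        return container_type
    return "application/octet-stream"


def make_zip(entry_names):
    """Build a small in-memory ZIP with the given entry names (test helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in entry_names:
            zf.writestr(name, "x")
    return buf.getvalue()


# A Word document that someone renamed to "report.zip".
renamed = make_zip(["word/document.xml"])
by_name = detect(renamed, "report.zip")                          # trusts the extension
by_content = detect(renamed, "report.zip", use_extension=False)  # looks inside
```

With the extension shortcut enabled, the renamed file is reported as a plain ZIP; with `use_extension=False` the container is opened and the Word entry is found, which is the case where a magic-only mode pays off.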
-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/
http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott