On Tue, 15 Jun 2010, Alex Ott wrote:
Hmmm, the WordDocument stream in a .doc can only be under the / directory entry, but yes - it could be anywhere in the list of OLE2 entries...

And the list of OLE2 entries can come anywhere in the file - the header block contains a pointer to the block holding the entries, which is normally near the start but isn't required to be...
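To make the pointer idea concrete, here is a rough Python sketch of reading that pointer out of the header block. It assumes the usual OLE2 compound file header layout (8-byte signature, sector shift at byte 30, first directory sector at byte 48); the function name is made up for illustration and none of this is Tika or POI code.

```python
import struct

# Standard 8-byte OLE2 (compound file) signature.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def first_directory_offset(header: bytes) -> int:
    """Return the file offset of the first directory sector.

    `header` is the start of the file (at least the first 52 bytes).
    Hypothetical helper, assuming the standard header layout.
    """
    if len(header) < 52 or header[:8] != OLE2_MAGIC:
        raise ValueError("not an OLE2 compound file header")
    # Sector size is 2 ** sector_shift (9 gives 512-byte sectors).
    (sector_shift,) = struct.unpack_from("<H", header, 30)
    # Sector number of the first block holding directory entries.
    (first_dir_sector,) = struct.unpack_from("<I", header, 48)
    # Sector 0 starts right after the header sector, hence the + 1.
    return (first_dir_sector + 1) << sector_shift
```

Nothing stops the sector number from being large, so the directory (and with it the entry names you'd want to look at) can sit well beyond the first few KB of the file.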

Detecting OLE2 or Zip with magic seems easy enough, but as mentioned, it's what's inside them that matters, and I don't think magic + a few regexps on the first few KB will cut it :/
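As a rough illustration of the easy part, a magic check for the two container formats could look like the Python sketch below. The function name and the "ole2" / "zip" labels are made up for the example, and it only checks the common zip local-file-header signature.

```python
# Standard 8-byte OLE2 (compound file) signature.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature found at the start of most zip files.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(head: bytes):
    """Guess the container format from the first bytes of a file.

    Returns "ole2", "zip", or None. Hypothetical helper; the labels
    are placeholders, not real media type names.
    """
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return None
```

The catch is that .doc, .xls and .ppt would all come back as "ole2", and .docx, .odt and .jar would all come back as "zip", so this only tells you the container and not what is inside it.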

Maybe it would be useful to make this configurable? Sometimes it's useful to force media type detection by magic only, not by extension (for example, the file could have been renamed)...

IIRC, if you don't set the filename in the Metadata object that you pass into the detector, then it can't use the file extension!

Not sure how you could best turn it off though, short of a config that would disable the loading of OLE2 and zip files (and maybe other containers in the future), but then what (if anything) would we return for the mimetype? Maybe just a generic one?

Nick
