Hi Jukka,

Great, I'm glad that you broke this out into a separate thread, as I was just about to respond to Keith's prior message.

> The current MimeUtils.getType relies only on magic header matching,
> and should be fixed.

+1. As per my suggestion, we should probably put something in there that does something similar to the Nutch Content class changes that I just committed in Nutch.

> The main reason why I decided to implement my own version of the code
> based on MimeTypes in AutoDetectParser was that I was somewhat
> confused about the separation of concerns across MimeTypes and
> MimeUtils. The MimeTypes class already has a number of utility methods
> like getMimeType(String, byte[]) and getMimeType(URL), so I'm not sure
> why we need MimeUtils.

Good question. I was uncertain of that myself at first, as the code that I originally got from Jerome already had it. After looking through the code and trying to understand it better (when I was originally committing it), I decided that it makes sense to have MimeUtils as a decorator class that handles instantiation of the MimeTypes repository from a resourceName and a given mime magic boolean flag. AFAIK, that's currently the only need for it. It may make sense to simply move this capability down into MimeTypes, use that class, and remove MimeUtils altogether. If this is your suggestion, then I'm +1 for it.

>> Jukka, should I modify AutoDetectParser to call this method instead of its
>> own?
>
> OK once the method has been fixed.

Well, more generally, once we all agree on what to do :)

>> However, the bigger issue is, is the assessment that header based detection
>> fails with certain file types correct?
>
> Magic detection can never be 100% correct or complete, but there's a
> lot that we could still do to improve the current status in Tika.

+1 for this. Mime detection by magic header/byte matching is not exactly a science, but more a practice of heuristics and patterns picked up over time. I think that the framework in Tika is one of the most comprehensive I've seen. Additionally, the great thing about it is that it's extensible.
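
To make the "magic alone is not enough" point concrete, here is a minimal sketch of combining a magic match with a file-name hint. It is only an illustration: MimeSketch and all of its methods are hypothetical names, not the Tika MimeTypes/MimeUtils API and not the Nutch Content code. It assumes that the files in question are legacy OLE2 documents, whose 8-byte signature is shared by .doc, .xls and .ppt (the thread does not confirm that this is what happens with Keith's files). The container type string is a made-up placeholder, and the other type strings are the IANA-registered names, which may differ from what tika-mimetypes.xml uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch only; none of these names exist in Tika or Nutch.
final class MimeSketch {

    // OLE2 compound-document signature, shared by legacy .doc, .xls and .ppt files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    static final String OCTET_STREAM = "application/octet-stream";
    static final String OLE2_CONTAINER = "application/x-ole2-container"; // placeholder type
    static final String PPT = "application/vnd.ms-powerpoint";
    static final String DOC = "application/msword";
    static final String XLS = "application/vnd.ms-excel";
    static final String PDF = "application/pdf";

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Magic-only detection; returns null when no signature matches.
    static String byMagic(byte[] header) {
        if (startsWith(header, PDF_MAGIC)) return PDF;
        if (startsWith(header, OLE2_MAGIC)) return OLE2_CONTAINER;
        return null;
    }

    // Name-only detection by extension; returns null when the extension is unknown.
    static String byName(String name) {
        if (name == null) return null;
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".ppt")) return PPT;
        if (lower.endsWith(".doc")) return DOC;
        if (lower.endsWith(".xls")) return XLS;
        if (lower.endsWith(".pdf")) return PDF;
        return null;
    }

    static boolean isOle2Type(String type) {
        return PPT.equals(type) || DOC.equals(type) || XLS.equals(type);
    }

    // Magic wins when it is specific; the name only refines an ambiguous
    // container match or fills in when no magic matched at all.
    static String detect(String name, byte[] header) {
        String magic = byMagic(header);
        String named = byName(name);
        if (magic == null) {
            return named != null ? named : OCTET_STREAM;
        }
        if (OLE2_CONTAINER.equals(magic) && isOle2Type(named)) {
            return named;
        }
        return magic;
    }

    public static void main(String[] args) {
        System.out.println(detect("slides.ppt", OLE2_MAGIC.clone()));
    }
}
```

The design choice in this sketch is that the file name is never allowed to override a specific magic match; it only narrows a match that is ambiguous by nature. Whether that is the right precedence for Tika is exactly the sort of thing we would need to agree on.
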
If we decide it's not doing a great job at detecting Keith's .ppt files, we can add more byte headers to compare against by editing the tika-mimetypes.xml file, underneath the application/microsoft-powerpoint mime type.

Thanks!

Cheers,
  Chris

> BR,
>
> Jukka Zitting

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
