Hi Jukka,

Great, I'm glad that you broke this out into a separate thread; I was
just about to respond to Keith's previous message.

> The current MimeUtils.getType relies only on magic header matching,
> and should be fixed.

+1. As I suggested earlier, we should probably put something in there
similar to the Content class changes that I just committed in Nutch.
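
To make the idea concrete, here is a rough sketch of combining magic with a
name hint. It is not the Nutch or Tika code: every class, method and constant
name below is made up for illustration, and the generic OLE2 label is just a
placeholder. The point it shows is that .doc, .xls and .ppt all start with the
same OLE2 signature, so magic alone cannot tell them apart and the file name
has to break the tie.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names are Tika or Nutch APIs.
final class CombinedDetectionSketch {

    // Placeholder label for "some OLE2 container"; an assumption for this sketch.
    static final String GENERIC_OLE2 = "application/x-ole2-container";
    static final String UNKNOWN = "application/octet-stream";

    // OLE2 compound file signature, shared by .doc, .xls and .ppt alike.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};

    // Illustrative extension table; the PowerPoint name is the one used in this thread.
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "ppt", "application/microsoft-powerpoint",
        "pdf", "application/pdf");

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Magic-only detection: returns null when no signature matches.
    static String byMagic(byte[] header) {
        if (startsWith(header, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return GENERIC_OLE2;
        }
        return null;
    }

    // Name-only detection: returns null when the extension is missing or unknown.
    static String byName(String name) {
        if (name == null) {
            return null;
        }
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return null;
        }
        return BY_EXTENSION.get(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Trust a specific magic hit; otherwise let the name refine or replace it.
    static String detect(String name, byte[] header) {
        String magic = byMagic(header);
        if (magic != null && !GENERIC_OLE2.equals(magic)) {
            return magic;
        }
        String named = byName(name);
        if (named != null) {
            return named;
        }
        return magic != null ? magic : UNKNOWN;
    }
}
```

With something like this, an OLE2 header plus a name ending in .ppt resolves
to the PowerPoint type, while the same header with an unknown name only
resolves to the generic container label.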

> 
> The main reason why I decided to implement my own version of the code
> based on MimeTypes in AutoDetectParser was that I was somewhat
> confused about the separation of concerns across MimeTypes and
> MimeUtils. The MimeTypes class already has a number of utility methods
> like getMimeType(String, byte[]) and getMimeType(URL), so I'm not sure
> why we need MimeUtils.

Good question. I was uncertain about that myself at first, since the code I
got from Jerome already had it. After looking through the code and trying to
understand it better while committing it, I decided that it makes sense to
have MimeUtils as a decorator class that handles instantiating the MimeTypes
repository from a resourceName and a boolean mime magic flag. AFAIK, that's
currently the only need for it. It may make sense to simply move this
capability down into MimeTypes, use that class directly, and remove MimeUtils
altogether. If this is your suggestion, then I'm +1 for it.

> 
>> Jukka, should I modify AutoDetectParser to call this method instead of its
>> own?
> 
> OK once the method has been fixed.

Well, more generally, once we all agree on what to do :)

> 
>> However, the bigger issue is, is the assessment that header based detection
>> fails with certain file types correct?
> 
> Magic detection can never be 100% correct or complete, but there's a
> lot that we could still do to improve the current status in Tika.

+1 for this. Mime detection/magic header/byte detection is not exactly a
science, but more a practice of heuristics and patterns picked up over time.
I think that the framework in Tika is one of the most comprehensive I've
seen. Additionally, the great thing about it is that it's extensible. If we
decide it's not doing a great job at detecting Keith's .ppt files, we can
add more byte headers to compare against by editing the tika-mimetypes.xml
file underneath the application/microsoft-powerpoint mime type.
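
As a rough illustration of why that extensibility matters, here is a minimal
sketch of a rule table where each entry says "these bytes at this offset mean
this type". It is not how Tika's classes are written; the class and method
names are hypothetical. I don't want to guess at the right PowerPoint bytes
here, so the example uses two signatures I am sure of: "PK\003\004" at offset
0 for zip, and "ustar" at offset 257 for tar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an extensible magic-rule table; not Tika's classes.
final class MagicRuleTableSketch {

    // One rule: the given bytes at the given offset identify the given type.
    static final class Rule {
        final String type;
        final int offset;
        final byte[] bytes;

        Rule(String type, int offset, byte[] bytes) {
            this.type = type;
            this.offset = offset;
            this.bytes = bytes.clone();
        }

        boolean matches(byte[] data) {
            if (data == null || data.length < offset + bytes.length) {
                return false;
            }
            for (int i = 0; i < bytes.length; i++) {
                if (data[offset + i] != bytes[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Adding a rule plays the role of adding a magic entry to the XML file.
    void addRule(String type, int offset, byte[] bytes) {
        rules.add(new Rule(type, offset, bytes));
    }

    // First matching rule wins; null means no rule matched.
    String detect(byte[] data) {
        for (Rule rule : rules) {
            if (rule.matches(data)) {
                return rule.type;
            }
        }
        return null;
    }
}
```

A file that no rule recognizes comes back as null until someone adds a rule
for it, which is the same workflow as adding another magic entry for a type
that is being missed.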

Thanks!

Cheers,
  Chris


> 
> BR,
> 
> Jukka Zitting

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

