Hi Stephane,

Thanks for your email.

> I didn't know Tika mime type detection was based on freedesktop.org.
> I've also developed a mimeType detection system built on top of
> freedesktop, leveraging the shared-mime-info database to be accurate. Is
> this what you guys have done as well?

Yep, the mime type detection system in Tika is based on the one developed
for Nutch primarily by Jerome Charron. Jerome worked on an update to this
mime system, with the freedesktop.org-style interface, and then I worked to
clean this up and get the functionality into Tika.

> In any case, the point I was trying to make in my previous post was to
> leverage functionality that is available somewhere else as much as
> possible and focus on Tika core features.

Gotcha. My point is: mime detection _is_ one of Tika's core features :)

See: http://wiki.apache.org/incubator/TikaProposal

It's what we proposed as one of the core parts of the system to get Tika
approved as an Apache project and to get it to be more useful to the
community.

> True, mime type detection is important for Tika. However, as you pointed
> out, mime type detection is a project by itself. If the idea of creating
> a commons.xx project for mime detection was floating around earlier, why
> not start an Apache commons.xxx project based on the Tika detection
> schema then? Now would be a good time, don't you think?
> It would be a great addition to commons and would free Tika developers
> from maintaining the code base for it.

It was decided that, rather than go the commons-xxx route, we would
maintain the code as part of the core functionality of Tika. Several Tika
developers (myself included) see mime detection as going in natural lockstep
with content analysis, and because of this we are very happy to maintain the
mime detection code in Tika.

Thanks!

Cheers,
Chris


> Mattmann, Chris A wrote:
>> Hi Stephane,
>>
>>
>>
>>> This is definitely good news. Besides very good parsers, Aperture also
>>> has strong support for mime types. I know we also have support for
>>> detecting mime types, but at some point in time we may consider using
>>> theirs and focus solely on writing Parsers?
>>>
>>
>> I would be strongly against this, mainly because there is almost a
>> 1-to-1 correspondence between having a good mime detection system and
>> parsing content. Tika has a fairly robust mime system based on
>> freedesktop.org's system, and I think there is value in Apache having a good
>> mime detection system (in fact it was discussed, even before Tika's
>> inception, to take the Nutch mime type code and turn it into a commons-*
>> project).
>>
>> Thanks,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [EMAIL PROTECTED]
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Disclaimer:  The opinions presented within are my own and do not reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>>
>>
>>
>
>


