Mattmann, Chris A wrote:
> Hi Stephane,
>
>> This is definitely good news. Besides very good parsers, Aperture also
>> has strong support for mime type detection. I know we also have support
>> for detecting mime types, but at some point in time we may consider
>> using theirs and focus solely on writing Parsers?
>
> I would be strongly against this, mainly because there is almost a
> 1-to-1 correspondence between having a good mime detection system and
> parsing content. Tika has a fairly robust mime system based on
> freedesktop.org's system, and I think there is value in Apache having a
> good mime detection system (in fact it was discussed, even before Tika's
> inception, to take the Nutch mime type code and turn it into a commons-*
> project).

Mime type identification would be the easiest thing to collaborate on,
since in general the interface is identical (put in a byte array and
get back a string with the mime type). Both MimeTypeIdentifiers have
been actively maintained and used in production for years. I guess we
could all benefit from pooling our resources.
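
To make the "byte array in, string out" contract concrete, here is a
minimal sketch. The MagicSniffer class and its Identifier interface are
hypothetical names made up for this mail; they are neither Tika's nor
Aperture's actual API, and the three magic numbers are only a tiny
illustrative table.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MagicSniffer {
    /** Hypothetical shared contract: leading bytes in, mime type string out. */
    public interface Identifier {
        String identify(byte[] firstBytes);
    }

    // Illustrative magic-number table; a real one would be loaded from a
    // pattern file rather than hard-coded.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the first matching type, or a generic fallback if none matches. */
    public static String sniff(byte[] firstBytes) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(firstBytes, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    /** The same logic exposed through the hypothetical interface. */
    public static final Identifier SIMPLE = MagicSniffer::sniff;
}
```

Any implementation that fits behind such an interface could be swapped
in and compared on the same inputs.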
Clearly Tika's MimeTypes and MimeType classes have a friendlier API, and
Tika allows new patterns to be added at runtime, which Aperture doesn't,
but I wonder if anyone has tried to assess the real advantages and
disadvantages of the Tika vs. Aperture mime type identifiers (number of
recognized types, speed, memory consumption etc.)?
My dream is a project that maintains a single mime type identification
class in which every identifiable mime type is backed by a test document
that confirms it. Our mimetypes.xml file lists patterns for 162 mime
types, your tika-mimetypes.xml lists patterns for 78 mime types, but how
many we can really recognize is an open question.
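
The "one test document per mime type" idea could be checked with
something as small as the sketch below. MimeCoverageCheck and
unconfirmed() are hypothetical names; the identifier is passed in as a
plain function so that either project's implementation could be plugged
in, and the sample map stands in for a directory of test documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class MimeCoverageCheck {
    /**
     * Runs the identifier over one sample document per declared mime type
     * and returns the types whose sample was NOT identified as that type.
     * An empty result means every declared type is confirmed by a document.
     */
    public static List<String> unconfirmed(Function<byte[], String> identifier,
                                           Map<String, byte[]> samplesByType) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, byte[]> sample : samplesByType.entrySet()) {
            String detected = identifier.apply(sample.getValue());
            if (!sample.getKey().equals(detected)) {
                failures.add(sample.getKey());
            }
        }
        return failures;
    }
}
```

Declared types that have no sample at all would need a separate check
against the pattern file; this sketch only covers types that do have a
test document.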
Apart from that there are three ideas we could explore:
1. The issue of ASCII text files whose headers happen to be magic
numbers for some other type, http://tinyurl.com/66tabh
2. Specific treatment of the text/xml mime type, to detect xml-specific
mime types (by DTD, XSD, namespace etc.), http://tinyurl.com/6xolsx
3. Specific treatment of the ZIP mime type, to detect zip-specific mime
types (Office 2007, OpenOffice, jars etc.) without resorting to extensions.
None of these managed to gain critical mass within Aperture itself.
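
As an illustration of idea 3, here is a rough sketch of refining a ZIP
detection by looking at entry names instead of the file extension. The
ZipRefiner class is a hypothetical name, not code from Aperture or Tika.
It assumes three well-known markers: a "mimetype" entry (OpenDocument
packages store their type there as text), "[Content_Types].xml" plus a
word/, xl/ or ppt/ folder (Office 2007 formats), and
"META-INF/MANIFEST.MF" (jars). The zipOf() helper exists only to build
small in-memory archives for trying it out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipRefiner {
    private static final String OOXML = "application/vnd.openxmlformats-officedocument.";

    /** Returns a more specific mime type for a ZIP archive, or application/zip. */
    public static String refine(byte[] zipBytes) {
        boolean hasContentTypes = false;
        boolean hasManifest = false;
        String ooxmlType = null;
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("mimetype")) {
                    // OpenDocument-style package: the entry body is the mime type.
                    String declared = readText(in).trim();
                    if (!declared.isEmpty()) {
                        return declared;
                    }
                } else if (name.equals("[Content_Types].xml")) {
                    hasContentTypes = true;
                } else if (name.equals("META-INF/MANIFEST.MF")) {
                    hasManifest = true;
                } else if (name.startsWith("word/")) {
                    ooxmlType = OOXML + "wordprocessingml.document";
                } else if (name.startsWith("xl/")) {
                    ooxmlType = OOXML + "spreadsheetml.sheet";
                } else if (name.startsWith("ppt/")) {
                    ooxmlType = OOXML + "presentationml.presentation";
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (hasContentTypes && ooxmlType != null) {
            return ooxmlType;
        }
        if (hasManifest) {
            return "application/java-archive";
        }
        return "application/zip";
    }

    // Reads the remainder of the current ZIP entry as ASCII text.
    private static String readText(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.US_ASCII);
    }

    /** Test helper: builds an in-memory ZIP from (entry name, content) pairs. */
    public static byte[] zipOf(String... namesAndContents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (int i = 0; i + 1 < namesAndContents.length; i += 2) {
                out.putNextEntry(new ZipEntry(namesAndContents[i]));
                out.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For example, ZipRefiner.refine(ZipRefiner.zipOf("META-INF/MANIFEST.MF",
"Manifest-Version: 1.0\n")) classifies the archive as a jar. Ideas 1
and 2 would need analogous second-pass checks on text and XML content.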
I will take a closer look at the Tika MimeTypes class and will get back
to you with something more concrete, but I'd like to know what you
think about this in general.
Antoni Mylka