I tried asking this on the users' list, but no one responded, so I
guess I'll try the dev list.
Tika's MIME type detection routinely fails on fairly common files.
For instance, for every GIF I've tried, Tika returns
application/octet-stream rather than image/gif. Plain text files
without any extension also get marked application/octet-stream
instead of text/plain.
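To make the expected behavior concrete, here is a rough standalone sketch of the detection order I would expect. It is plain Java and not Tika's code: the class name `ExpectedDetection` and the crude text heuristic are made up purely for illustration. It checks the GIF magic bytes first, then falls back to the file-name glob, then to a printable-text check, and only then gives up with application/octet-stream.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not how Tika implements detection.
class ExpectedDetection {

    // Returns the MIME type I would expect for the given leading bytes and file name.
    static String detect(byte[] head, String name) {
        // 1. Magic bytes: GIF files start with "GIF87a" or "GIF89a".
        if (head.length >= 6) {
            String magic = new String(head, 0, 6, StandardCharsets.US_ASCII);
            if (magic.equals("GIF87a") || magic.equals("GIF89a")) {
                return "image/gif";
            }
        }
        // 2. Glob fallback, equivalent to the pattern *.gif (case-insensitive here).
        if (name != null && name.toLowerCase().endsWith(".gif")) {
            return "image/gif";
        }
        // 3. Crude text heuristic (an assumption): non-empty data with no
        //    control bytes other than tab, LF, and CR.
        if (head.length > 0 && looksLikeText(head)) {
            return "text/plain";
        }
        // 4. Nothing matched.
        return "application/octet-stream";
    }

    static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            boolean whitespace = c == '\t' || c == '\n' || c == '\r';
            if (c < 0x20 && !whitespace) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] gif = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "hello\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(gif, "picture"));   // GIF without an extension
        System.out.println(detect(text, "README"));   // text without an extension
    }
}
```

The point is only the ordering: a GIF with no extension and a text file with no extension should both be identifiable from their content.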
I copied the tika-mimetypes.xml that's included in
tika-0.3-SNAPSHOT.tar and added a glob for image/gif via:
    <mime-type type="image/gif">
        <glob pattern="*.gif" />
    </mime-type>
(It is odd that this has to be added, since the default
tika-config.xml already configures a parser for this MIME type.)
I loaded the XML file via:
    mimeTypes = MimeTypesFactory.create("/fullpath/tika-mimetypes.xml");
    parser = new AutoDetectParser();
    parser.setMimeTypes(mimeTypes);
but apparently the config file is silently failing to load, or is
being ignored, or AutoDetectParser's MIME detector isn't checking
the globs correctly, or something. None of this makes any sense. If
any of these were the case, something should either write to stderr
or throw an exception.
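As a sketch of the "fail loudly" behavior I mean, this is a check that can be run on the path before handing it to Tika. It uses only the JDK; the class `ConfigCheck` and the method `requireWellFormedXml` are hypothetical names, not part of Tika. It only proves the file exists, is readable, and is well-formed XML. It says nothing about whether Tika actually uses it afterwards.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika: refuse to continue with a bad config file.
class ConfigCheck {

    // Returns the file if it exists, is readable, and parses as XML;
    // otherwise throws an IOException naming the offending path.
    static File requireWellFormedXml(String path) throws IOException {
        File file = new File(path);
        if (!file.isFile() || !file.canRead()) {
            throw new IOException("Config file missing or unreadable: " + path);
        }
        try {
            // Parse only to check well-formedness; the resulting document is discarded.
            DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
        } catch (SAXException | ParserConfigurationException e) {
            throw new IOException("Config file is not well-formed XML: " + path, e);
        }
        return file;
    }
}
```

Calling `ConfigCheck.requireWellFormedXml("/fullpath/tika-mimetypes.xml")` first at least rules out a wrong path or a broken edit as the cause.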
MimeTypes doesn't have a way to list which MIME types are
registered, and I can't find a publicly accessible way to
programmatically add a new MimeType to the MimeTypes class. (You can
add glob patterns, but not actual types. `new MimeType()` fails
because the constructor is not accessible outside its package.)
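As a stopgap, the types and globs declared in the XML file itself can be listed with the JDK's DOM parser. This is only a sketch: `MimeXmlLister` is a made-up name, it assumes the file uses `mime-type` elements with a `type` attribute and nested `glob` elements with a `pattern` attribute (as in my snippet above), and the `mime-info` root in the demo is my assumption. It shows what the file declares, not what MimeTypes has actually registered.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical debugging aid, not part of Tika.
class MimeXmlLister {

    // Maps each declared type name to the glob patterns listed under it,
    // in document order.
    static Map<String, List<String>> listGlobs(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList types = doc.getElementsByTagName("mime-type");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            List<String> patterns =
                    result.computeIfAbsent(type.getAttribute("type"), k -> new ArrayList<>());
            NodeList globs = type.getElementsByTagName("glob");
            for (int j = 0; j < globs.getLength(); j++) {
                patterns.add(((Element) globs.item(j)).getAttribute("pattern"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Inline sample; the <mime-info> root element is an assumption.
        String xml = "<mime-info>"
                + "<mime-type type=\"image/gif\"><glob pattern=\"*.gif\"/></mime-type>"
                + "</mime-info>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(listGlobs(in));
    }
}
```

Passing a `FileInputStream` for the edited tika-mimetypes.xml instead of the inline sample would confirm whether the image/gif glob is really present in the file being loaded.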
There is either a bug here, or there's some trick that is completely
undocumented, or somehow tika-0.3-SNAPSHOT-standalone.jar is
overriding everything.
I've even tried creating a new tika-config.xml with the full path to
my tika-mimetypes.xml in it:
    try {
        configFile = new File("/fullpath/tika-config.xml");
        config = new TikaConfig(configFile);
        parser = new AutoDetectParser(config);
        contentHandler = getContentHandler();
    } catch (org.xml.sax.SAXException e) {
        System.err.println("can't parse " + e.getMessage());
    } catch (TikaException e) {
        System.err.println("tika exception " + e.getMessage());
    } catch (IOException e) {
        System.err.println("can't read configfile " + e.getMessage());
    }
but that just causes NullPointerExceptions to be thrown.
This is beyond frustrating.
--
Jonathan Koren
[email protected]
http://www.soe.ucsc.edu/~jonathan/