I tried asking this on the users' list, but no one responded, so I guess I'll try the dev list.

Tika's MIME type detection routinely fails on fairly common files. For instance, for every GIF I've tried, Tika returns application/octet-stream rather than image/gif. Plain text files without an extension also get marked application/octet-stream instead of text/plain.
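
For comparison, here is a minimal sketch of the behaviour I would expect from a detector, written against the Java standard library only. It is not Tika code: the class SimpleTypeSniffer and its methods are hypothetical names made up for illustration. It checks the GIF signature ("GIF87a" or "GIF89a" in the first six bytes), then a *.gif name glob, then falls back to text/plain when the leading bytes are all printable ASCII or common whitespace.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, NOT part of Tika: a stdlib-only illustration of
// magic-byte detection with a glob and plain-text fallback.
final class SimpleTypeSniffer {

    // Returns a MIME type guess from the file name and the first bytes of the file.
    static String detect(String fileName, byte[] head) {
        // 1. Magic bytes: every GIF starts with "GIF87a" or "GIF89a".
        if (head != null && head.length >= 6) {
            String signature = new String(head, 0, 6, StandardCharsets.US_ASCII);
            if (signature.equals("GIF87a") || signature.equals("GIF89a")) {
                return "image/gif";
            }
        }
        // 2. Name glob, equivalent to the pattern "*.gif".
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".gif")) {
            return "image/gif";
        }
        // 3. Assumed heuristic: non-empty, all printable ASCII or whitespace means text.
        if (head != null && head.length > 0 && looksLikeText(head)) {
            return "text/plain";
        }
        // 4. Nothing matched.
        return "application/octet-stream";
    }

    // True when every byte is printable ASCII, tab, line feed, or carriage return.
    static boolean looksLikeText(byte[] head) {
        for (byte b : head) {
            int c = b & 0xFF;
            boolean printable = c >= 0x20 && c < 0x7F;
            boolean whitespace = c == '\t' || c == '\n' || c == '\r';
            if (!printable && !whitespace) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] gifHead = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        byte[] textHead = "hello world\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("noextension", gifHead));
        System.out.println(detect("README", textHead));
    }
}
```

The point of the sketch is only that a GIF with or without an extension, and an extensionless ASCII file, should never come back as application/octet-stream.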

I copied the tika-mimetypes.xml that's included in tika-0.3-SNAPSHOT.tar and added a glob for image/gif via:

  <mime-type type="image/gif">
    <glob pattern="*.gif" />
  </mime-type>

(It is odd that this has to be added, since the default tika-config.xml configures a parser for this MIME type.)

I loaded the xml file via:

            mimeTypes = MimeTypesFactory.create("/fullpath/tika-mimetypes.xml");
            parser = new AutoDetectParser();
            parser.setMimeTypes(mimeTypes);

but apparently the config file is silently failing to load, or is being ignored, or AutoDetectParser's MIME detector isn't checking the globs correctly. None of these makes any sense. If any of them were the case, something should either write to stderr or throw an exception.

MimeTypes doesn't have a way to list which MIME types are registered, and I can't find a publicly accessible way to programmatically add a new MimeType to the MimeTypes class. (You can add glob patterns, but not actual types. `new MimeType()` fails because it's called outside its package.)

There is either a bug here, or there's some trick that is completely undocumented, or somehow tika-0.3-SNAPSHOT-standalone.jar is overriding everything.

I've even tried creating a new tika-config.xml with the full path to my tika-mimetypes.xml in it:

        try {
            configFile = new File("/fullpath/tika-config.xml");
            config = new TikaConfig(configFile);
            parser = new AutoDetectParser(config);
            contentHandler = getContentHandler();
        } catch (org.xml.sax.SAXException e) {
            System.err.println("can't parse " + e.getMessage());
        } catch (TikaException e) {
            System.err.println("tika exception " + e.getMessage());
        } catch (IOException e) {
            System.err.println("can't read configfile " + e.getMessage());
        }

but that just causes NullPointerExceptions to be thrown.

This is beyond frustrating.

--
Jonathan Koren
[email protected]
http://www.soe.ucsc.edu/~jonathan/

