Hi,
On Mon, Jan 26, 2009 at 6:49 AM, Jonathan Koren <[email protected]> wrote:
> I tried asking this on the users' list, but no one responded, so I guess
> I'll try the dev list.
Sorry for the silence.
> Tika's mime type detection routinely fails on fairly common files. For
> instance, every gif I've tried Tika returns application/octet-stream rather
> than image/gif.
Hmm, you're right. As you noted, the proper configuration for GIF is
missing. I'll fix that in TIKA-192.
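For context, GIF is easy to recognize from content alone, because every GIF file starts with the ASCII signature GIF87a or GIF89a. Below is a rough sketch of that kind of magic-byte check in plain Java. The class and method names are made up for illustration; this is not Tika's detector code, only the general idea behind a magic entry.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: checks for the GIF file signature.
public class GifSniffer {
    private static final byte[] GIF87A = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89A = "GIF89a".getBytes(StandardCharsets.US_ASCII);

    // Returns image/gif if the data starts with a GIF signature,
    // otherwise falls back to application/octet-stream.
    public static String detect(byte[] head) {
        if (startsWith(head, GIF87A) || startsWith(head, GIF89A)) {
            return "image/gif";
        }
        return "application/octet-stream";
    }

    // True if data begins with exactly the bytes in prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Only the first six bytes of the file matter here, so a check like this works even when the file name has no extension.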
> Plain text files without any extension also get marked
> application/octet-stream instead of text/plain.
See TIKA-154, where we just added support for automatically detecting
plain text from nothing but the document input stream.
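To give a feel for what stream-only detection involves, here is a minimal heuristic sketch in plain Java. It is an assumption-laden illustration, not the TIKA-154 code: the class name, the 512-byte sample size, and the rule "no control bytes other than whitespace means text" are all chosen for the example.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not Tika's implementation: guesses text/plain
// when the first bytes of a stream contain no control characters.
public class TextSniffer {
    private static final int PREFIX_LENGTH = 512; // assumed sample size

    public static String detect(InputStream in) throws IOException {
        byte[] buffer = new byte[PREFIX_LENGTH];
        int total = 0;
        int n;
        // Read until the sample buffer is full or the stream ends.
        while (total < buffer.length
                && (n = in.read(buffer, total, buffer.length - total)) != -1) {
            total += n;
        }
        if (total == 0) {
            return "application/octet-stream"; // nothing to inspect
        }
        for (int i = 0; i < total; i++) {
            int b = buffer[i] & 0xFF;
            boolean whitespace = b == '\t' || b == '\n' || b == '\r' || b == '\f';
            // Control bytes other than common whitespace suggest binary data.
            if ((b < 0x20 && !whitespace) || b == 0x7F) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }
}
```

Note that this sketch consumes the bytes it inspects. A caller that still needs to parse the document afterwards would want a stream that supports mark() and reset(), and would rewind it after detection.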
> I loaded the xml file via:
>
>     mimeTypes = MimeTypesFactory.create("/fullpath/tika-mimetypes.xml");
>     parser = new AutoDetectParser();
>     parser.setMimeTypes(mimeTypes);
>
> but apparently the config file is silently failing to be loaded, or being
> ignored, or AutoDetectParser's mime detector isn't correctly checking the
> globs or something, none of which makes any sense. Something should either
> write to stderr or throw an exception if this was the case.
Hmm, I'll look into that.
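In the meantime, one way to rule out a bad path on your side is to check the file yourself before passing it to MimeTypesFactory.create(). The sketch below uses only the JDK; the ConfigCheck class and requireReadable method are hypothetical names for this example, not anything in Tika.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check, not part of Tika: fails loudly when a
// configuration file is missing or unreadable instead of failing silently.
public class ConfigCheck {
    public static Path requireReadable(String location) throws IOException {
        Path path = Paths.get(location);
        if (!Files.isRegularFile(path)) {
            // Covers both a missing path and a path that is a directory.
            throw new NoSuchFileException(location);
        }
        if (!Files.isReadable(path)) {
            throw new IOException("Cannot read configuration file: " + location);
        }
        return path;
    }
}
```

If this check passes and detection still falls back to application/octet-stream, the problem is more likely in how the loaded types are used than in locating the file.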
> MimeTypes doesn't have a way to list what mime-types are registered, and
> I can't find a publicly accessible way to programmatically add a new MimeType
> to the MimeTypes class. (You can add glob patterns, but not actual types.
> `new MimeType()` fails because it's called outside its package.)
Agreed, the MimeTypes class is not as user friendly as it could be.
> I've even tried creating a new tika-config.xml with the fullpath to my
> tika-mimetypes.xml in it
>
>     try {
>         configFile = new File("/fullpath/tika-config.xml");
>         config = new TikaConfig(configFile);
>         parser = new AutoDetectParser(config);
>         contentHandler = getContentHandler();
>     } catch (org.xml.sax.SAXException e) {
>         System.err.println("can't parse " + e.getMessage());
>     } catch (TikaException e) {
>         System.err.println("tika exception " + e.getMessage());
>     } catch (IOException e) {
>         System.err.println("can't read configfile " + e.getMessage());
>     }
>
> but that just causes NullPointerExceptions to be thrown.
Where's the NPE coming from? Can you file a bug for that?
BR,
Jukka Zitting