Hi,

On Mon, Jan 26, 2009 at 6:49 AM, Jonathan Koren <[email protected]> wrote:
> I tried asking this on the users' list, but no one responded, so I guess
> I'll try the dev list.

Sorry for the silence.

> Tika's mime type detection routinely fails on fairly common files.  For
> instance, every gif I've tried Tika returns application/octet-stream rather
> than image/gif.

Hmm, you're right. As noted, the proper configuration for GIF is
missing. I'll fix that in TIKA-192.
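
For illustration only, here is roughly what content-based detection of a
GIF amounts to: matching the well-known "GIF87a" / "GIF89a" signature at
the start of the stream. This is a standalone sketch, not Tika's code;
the class name GifSniffer and its detect() method are hypothetical and
do not exist in Tika.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: matches the GIF signature bytes.
final class GifSniffer {
    private static final byte[] GIF87A = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89A = "GIF89a".getBytes(StandardCharsets.US_ASCII);

    // Returns image/gif when the prefix starts with a GIF signature,
    // otherwise falls back to application/octet-stream.
    static String detect(byte[] prefix) {
        if (startsWith(prefix, GIF87A) || startsWith(prefix, GIF89A)) {
            return "image/gif";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling GifSniffer.detect() on the first few bytes of a file is enough;
the file name and extension play no part in this kind of check.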

> Plain text files without any extension also get marked
> application/octet-stream instead of text/plain.

See TIKA-154, where we just added support for automatically detecting
plain text from nothing but the document input stream.
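
To give an idea of what detecting plain text from the stream alone can
look like, here is a deliberately simplified heuristic: treat the data
as text unless its leading bytes contain control characters that do not
normally occur in text. This is an assumption made for illustration, not
necessarily what TIKA-154 implements; the TextSniffer class is
hypothetical.

```java
// Hypothetical helper, not part of Tika: a simplified text heuristic.
final class TextSniffer {
    // Returns text/plain if the prefix is non-empty and contains no control
    // bytes other than tab, line feed, carriage return and form feed.
    static String detect(byte[] prefix) {
        if (prefix.length == 0) {
            return "application/octet-stream";
        }
        for (byte b : prefix) {
            int c = b & 0xFF; // treat the byte as unsigned
            boolean allowedControl = c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (c < 0x20 && !allowedControl) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }
}
```

A heuristic like this only needs a bounded prefix of the stream, which
is why it works for files without any extension.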

> I loaded the xml file via:
>
>            mimeTypes = MimeTypesFactory.create("/fullpath/tika-mimetypes.xml");
>            parser = new AutoDetectParser();
>            parser.setMimeTypes(mimeTypes);
>
> but apparently the config file is silently failing to be loaded, or being
> ignored, or AutoDetectParser's mime detector isn't correctly checking the
> globs or something, none of which makes any sense.  Something should either
> write to stderr or throw an exception if this was the case.

Hmm, I'll look into that.

> MimeTypes doesn't have a way to list what mime-types are registered, and
> I can't find a publicly accessible way to programmatically add a new MimeType
> to the MimeTypes class.  (You can add glob patterns, but not actual types.
>  `new MimeType()` fails because it's called outside its package.)

Agreed, the MimeTypes class is not as user friendly as it could be.

> I've even tried creating a new tika-config.xml with the fullpath to my
> tika-mimetypes.xml in it
>
>        try {
>            configFile= new File("/fullpath/tika-config.xml");
>            config = new TikaConfig(configFile);
>            parser = new AutoDetectParser(config);
>            contentHandler = getContentHandler();
>        } catch (org.xml.sax.SAXException e) {
>            System.err.println("cant parse " + e.getMessage());
>        } catch (TikaException e) {
>            System.err.println("tika exception " + e.getMessage());
>        } catch  (IOException e) {
>            System.err.println("can't read configfile " + e.getMessage());
>        }
>    }
>
> but that just causes NullPointerExceptions to be thrown.

Where's the NPE coming from? Can you file a bug for that?

BR,

Jukka Zitting
