Error thrown with TikaConfig() constructor

Ken Krugler Thu, 09 Sep 2010 20:23:43 -0700

Hi all,

In the past, we'd build our Hadoop job jars using a dependency on Tika-parsers but excluding the supporting jars for types that we know we don't need to process (e.g. Microsoft docs, PDFs, etc). This dramatically reduces the size of the resulting Hadoop job jar.

With 0.8-SNAPSHOT, the TikaConfig(Classpath) constructor now finds and instantiates all Parser-based classes found on the classpath. Since the supporting jars for many of those parsers have been excluded, this triggers, as expected, a storm of Exceptions and Errors.

I'm wondering how best to handle this type of configuration, in a way that's relatively resilient to changes in both the Tika configuration and my target set of formats.

The quick & cheesy hack is to change the TikaConfig constructor to catch exceptions thrown by parser instantiation, and ignore (or log) them. But that seems likely to create lots of pain and suffering for people who have broken setups, as it fails slowly & silently.
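
To make that hack concrete, here is a rough sketch of the catch-and-record idea. It is not the actual TikaConfig code: LenientLoader and loadAvailable are hypothetical names, and it works on plain class names rather than Tika's Parser type. It assumes a parser whose supporting jar is missing fails with a ClassNotFoundException or a NoClassDefFoundError when instantiated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: instantiates classes by name and
// skips the ones that cannot be loaded from the current classpath.
final class LenientLoader {

    // Returns the instances that could be created. Names that failed are
    // appended to 'failures' so the caller can log them instead of losing them.
    static List<Object> loadAvailable(List<String> classNames, List<String> failures) {
        List<Object> loaded = new ArrayList<>();
        for (String name : classNames) {
            try {
                loaded.add(Class.forName(name).getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                // LinkageError covers NoClassDefFoundError, the usual symptom
                // of a class whose supporting jar was excluded.
                failures.add(name + ": " + e);
            }
        }
        return loaded;
    }
}
```

Returning the failure list rather than swallowing the errors at least gives a broken setup something to report, which is the part of the hack that worries me.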

We could try to avoid triggering the construction of TikaConfig, and do our own dispatching based on mime-types, but that seems both kludgy and brittle.
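
For illustration only, doing our own dispatching would look roughly like the following. MimeDispatcher is a hypothetical class, and the handlers are plain functions from bytes to text rather than real Tika parsers, so this is a sketch of the shape and not something wired to the Tika API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher, not Tika API: maps a MIME type to a handler and
// rejects anything that was not explicitly registered.
final class MimeDispatcher {
    private final Map<String, Function<byte[], String>> handlers = new HashMap<>();

    // Registers the handler to use for one exact MIME type string.
    void register(String mimeType, Function<byte[], String> handler) {
        handlers.put(mimeType, handler);
    }

    // Looks up the handler for the MIME type and fails loudly if none exists.
    String extract(String mimeType, byte[] content) {
        Function<byte[], String> handler = handlers.get(mimeType);
        if (handler == null) {
            throw new IllegalArgumentException("No handler for " + mimeType);
        }
        return handler.apply(content);
    }
}
```

The brittleness shows up right away: every supported type has to be registered by hand, and the table has to be kept in sync with whatever MIME types the detector reports.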

We could build a custom version of Tika that only includes the parser classes we use, but that also seems brittle.


Any other thoughts/options?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
