Hi all,

In the past, we built our Hadoop job jars with a dependency on tika-parsers, but excluded the supporting jars for types that we know we don't need to process (e.g. Microsoft docs, PDFs, etc.). This dramatically reduces the size of the resulting Hadoop job jar.

With 0.8-SNAPSHOT, the TikaConfig(Classpath) constructor now finds and instantiates all Parser-based classes found on the classpath. As expected, this triggers a storm of Exceptions and Errors, since the supporting jars for many of those parsers have been excluded.

I'm wondering how best to handle this type of configuration, in a way that's relatively resilient to changes in both Tika's configuration and my target set of formats.

The quick & cheesy hack is to change the TikaConfig constructor to catch exceptions thrown by parser instantiation, and ignore (or log) them. But that seems likely to create lots of pain and suffering for people who have broken setups, as it fails slowly & silently.
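To make that hack concrete, here is a minimal sketch in plain Java. It is not the actual TikaConfig code: `LenientLoader` and its fields are hypothetical names, and the class names passed in stand in for whatever parser classes get discovered. It records each failure rather than swallowing it, so a broken setup can at least be reported once loading finishes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: instantiates classes by name and
// records failures instead of letting them abort the whole configuration.
final class LenientLoader {
    // Instances that were created successfully.
    final List<Object> loaded = new ArrayList<>();
    // Class name -> reason it could not be instantiated, in encounter order.
    final Map<String, Throwable> failed = new LinkedHashMap<>();

    static LenientLoader load(List<String> classNames) {
        LenientLoader result = new LenientLoader();
        for (String name : classNames) {
            try {
                Object instance = Class.forName(name)
                        .getDeclaredConstructor()
                        .newInstance();
                result.loaded.add(instance);
            } catch (ReflectiveOperationException | LinkageError e) {
                // A class whose supporting jar is missing typically fails here
                // with NoClassDefFoundError, which is a LinkageError.
                result.failed.put(name, e);
            }
        }
        return result;
    }
}
```

A caller could then log `failed` at startup, or treat it as fatal if any format it actually needs shows up there, which would soften the "fails silently" problem somewhat.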

We could try to avoid triggering the construction of TikaConfig, and do our own dispatching based on mime-types, but that seems both kludgy and brittle.
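For reference, the do-it-yourself dispatching would look roughly like the sketch below. `MimeDispatcher` is a hypothetical class, not a Tika API, and the type parameter `P` stands in for whatever parser interface is in use. The hand-maintained table is exactly what makes this approach brittle: every new format means another `register` call.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical registry, not a Tika API: maps a mime-type to a factory for
// the parser that handles it, so only explicitly registered parsers are built.
final class MimeDispatcher<P> {
    private final Map<String, Supplier<? extends P>> factories = new HashMap<>();

    void register(String mimeType, Supplier<? extends P> factory) {
        factories.put(normalize(mimeType), factory);
    }

    // Returns a new parser for the given type, or null if none is registered.
    P parserFor(String mimeType) {
        Supplier<? extends P> factory = factories.get(normalize(mimeType));
        return factory == null ? null : factory.get();
    }

    // Drops parameters such as "; charset=UTF-8" and lower-cases the rest.
    private static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi < 0 ? mimeType : mimeType.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```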

We could build a custom version of Tika that only includes the parser classes we use, but that also seems brittle.

Any other thoughts/options?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




