Hi all,
In the past, we'd build our Hadoop job jars using a dependency on
tika-parsers but excluding the supporting jars for types that we know we
don't need to process (e.g. Microsoft docs, PDFs). This
dramatically reduces the size of the resulting Hadoop job jar.
With 0.8-SNAPSHOT, the TikaConfig(Classpath) constructor now finds and
instantiates all Parser-based classes found on the classpath, which,
as expected, triggers a storm of Exceptions and Errors for the parsers
whose supporting jars we excluded.
I'm wondering how best to handle this type of configuration, in a way
that's relatively resilient to changes in both Tika's configuration and
my target set of formats.
The quick & cheesy hack is to change the TikaConfig constructor to
catch exceptions thrown by parser instantiation, and ignore (or log)
them. But that seems likely to create lots of pain and suffering for
people who have broken setups, as it fails slowly & silently.
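To make the catch-and-log idea concrete, here is a minimal sketch in plain Java. It does not use any Tika API: LenientLoader and its methods are hypothetical names, and the assumption is that parser class names are available as a list of strings. To soften the "fails silently" problem, it records every failure and lets the caller fail fast on any failure it did not explicitly expect.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: instantiates classes by name and
// records every failure instead of letting the first one abort the load.
class LenientLoader {

    // Successfully created instances, in the order requested.
    final List<Object> loaded = new ArrayList<>();
    // Class name -> reason it could not be instantiated.
    final Map<String, Throwable> failed = new LinkedHashMap<>();

    static LenientLoader load(List<String> classNames, ClassLoader loader) {
        LenientLoader result = new LenientLoader();
        for (String name : classNames) {
            try {
                Class<?> type = Class.forName(name, true, loader);
                result.loaded.add(type.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                // LinkageError covers NoClassDefFoundError, which is typically
                // what surfaces when a supporting jar has been excluded.
                result.failed.put(name, e);
            }
        }
        return result;
    }

    // Fail fast unless every failure was explicitly expected by the caller.
    void requireOnlyExpectedFailures(List<String> expectedMissing) {
        for (String name : failed.keySet()) {
            if (!expectedMissing.contains(name)) {
                throw new IllegalStateException(
                        "Unexpected parser load failure: " + name, failed.get(name));
            }
        }
    }
}
```

A job that deliberately excludes some parsers would pass their class names as expectedMissing, so a genuinely broken setup still fails loudly at startup.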
We could try to avoid triggering the construction of TikaConfig, and
do our own dispatching based on mime-types, but that seems both kludgy
and brittle.
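For reference, the do-it-yourself dispatching option might look roughly like the sketch below. DocParser and MimeDispatcher are hypothetical stand-ins, not Tika classes; a real job would plug in actual Tika parser instances. Parsers are created lazily, so nothing is instantiated for formats that never show up.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a parser; a real job would use Tika's Parser here.
interface DocParser {
    String parse(byte[] content);
}

// Hypothetical registry mapping MIME types to lazily created parsers.
class MimeDispatcher {
    private final Map<String, Supplier<DocParser>> factories = new HashMap<>();

    void register(String mimeType, Supplier<DocParser> factory) {
        factories.put(normalize(mimeType), factory);
    }

    // Returns a parser for the type, or null when the type is not supported.
    DocParser parserFor(String mimeType) {
        Supplier<DocParser> factory = factories.get(normalize(mimeType));
        return factory == null ? null : factory.get();
    }

    // Drop parameters such as "; charset=utf-8" and ignore case.
    private static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon < 0 ? mimeType : mimeType.substring(0, semicolon);
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

The brittleness is visible even in the sketch: every supported type has to be registered by hand, so the table must be revisited whenever Tika adds or renames a parser.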
We could build a custom version of Tika that only includes the parser
classes we use, but that also seems brittle.
Any other thoughts/options?
Thanks,
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g