Hi all,

This came up in TIKA-1623, but I thought it would be better brought out to
the list for discussion.

To configure parsers on a per-document basis, such as setting PDF spacing
tolerances, or telling Tesseract which language it should be OCRing for, we
have the *Config objects. You create one of these, use the setters to
configure it for your document, add it to the ParseContext, and it's
used when processing your document.
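
To make the per-document pattern concrete, here is a minimal,
self-contained sketch. SketchOcrConfig and SketchContext are hypothetical
stand-ins written for illustration, not Tika classes; in Tika those roles
are played by the parser-specific *Config classes and ParseContext. The
"eng" default is also just an assumption for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-document *Config object (not a Tika class).
final class SketchOcrConfig {
    private String language = "eng"; // assumed default, for illustration only

    String getLanguage() { return language; }

    void setLanguage(String language) { this.language = language; }
}

// Hypothetical stand-in for the parse context: a map keyed by config type.
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    // Returns the caller-supplied object for this type, or the fallback.
    <T> T get(Class<T> key, T fallback) {
        Object value = entries.get(key);
        return value == null ? fallback : key.cast(value);
    }
}

final class PerDocumentConfigSketch {
    // What a parser might do: prefer the caller's config, else use defaults.
    static String languageFor(SketchContext context) {
        return context.get(SketchOcrConfig.class, new SketchOcrConfig()).getLanguage();
    }

    public static void main(String[] args) {
        SketchOcrConfig config = new SketchOcrConfig();
        config.setLanguage("fra");              // configure for this document

        SketchContext context = new SketchContext();
        context.set(SketchOcrConfig.class, config); // hand it to the parser

        System.out.println(languageFor(context));
    }
}
```

The point of keying by type is that each parser can ask for exactly the
config class it understands and ignore everything else on the context.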

To configure parsers and translators on a per-JVM basis, so that the
settings apply to all documents processed, things are less consistent. Some
look for a properties file with a specific name, usually in the tika
namespace, and read their settings / keys / etc. from that. Others expect to
find a *Config with their program path on it, even though that path remains
constant between documents. None of them support getting their settings
from the Tika Config.
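
For anyone unfamiliar with the properties-file approach, it looks roughly
like the sketch below. The resource name and the key are made up for
illustration; each real parser or translator uses its own file name and
keys.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

final class JvmWidePropertiesSketch {
    // Hypothetical resource name and key, chosen only for this example.
    static final String RESOURCE = "org/apache/tika/example/example-parser.properties";
    static final String PATH_KEY = "example.program.path";

    // Loads the named properties file from the classpath.
    // Returns an empty Properties object if the file is not present.
    static Properties load(ClassLoader loader, String resource) {
        Properties props = new Properties();
        try (InputStream in = loader.getResourceAsStream(resource)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Reads the program path, falling back to "" (meaning: rely on the PATH).
    static String programPath(Properties props) {
        return props.getProperty(PATH_KEY, "");
    }
}
```

Because the file is read once from the classpath, the same values apply to
every document handled in that JVM, which is exactly why it fits
executable paths and API keys but not per-document options.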

As part of our evolution of parser preferences, we're moving towards
people either being able to set their preferences in code, or being able
to supply a Tika Config XML which sets their parser preferences or
overrides certain bits of the default. The code option works for people
who want to declare certain specific things; the Tika Config one gives the
same functionality but allows a consistent and clean way to set it across
Tika App, Tika Server and Java code.

Another related example is the External Parser support. Because you can
have multiple External Parser instances in your setup, one per format /
program, we look for all the
org/apache/tika/parser/external/tika-external-parsers.xml files on the
classpath, and create parser instances based on the definitions in there.
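
The "find every copy on the classpath" part can be sketched with the
standard ClassLoader.getResources call. This is only an illustration of the
lookup, not the actual External Parser loading code; the class and method
names are made up, and parsing the XML is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

final class ExternalParserDefinitionsSketch {
    // Resource name as given above.
    static final String RESOURCE =
            "org/apache/tika/parser/external/tika-external-parsers.xml";

    // Returns the URL of every classpath resource with the given name.
    // Unlike getResource, getResources does not stop at the first match,
    // so each jar can contribute its own definitions file.
    static List<URL> findAll(ClassLoader loader, String resource) {
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = ExternalParserDefinitionsSketch.class.getClassLoader();
        for (URL url : findAll(loader, RESOURCE)) {
            System.out.println("Would load parser definitions from " + url);
        }
    }
}
```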

What do we think about setting executable paths and keys/logins for
parsers like OCR, Strings, Translators etc.? Always on the ParseContext?
Properties? Custom XML config? Tika Config XML? Other? A combination?
Nick
- Configuring parsers and translators Nick Burch