Hi All

This came up in TIKA-1623, but I thought it would be better to bring it to the list for discussion.

To configure parsers on a per-document basis, such as setting PDF spacing tolerances, or telling Tesseract what language it should be OCRing for, we have the *Config objects. You create one of these, use the setters to configure it for your document, pop it onto the ParseContext, and it's used when processing your document.
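For anyone less familiar with the pattern, here is a rough sketch of the per-document lookup. It deliberately uses stand-in classes rather than the real Tika ones, so it runs on its own: ContextSketch, OcrConfigSketch and OcrParserSketch are all hypothetical names, and the "eng" default is just an assumption for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

class PerDocumentConfigSketch {
    public static void main(String[] args) {
        // Caller side: create a config, set it up for this document,
        // and put it on the context under its own type
        OcrConfigSketch ocr = new OcrConfigSketch();
        ocr.setLanguage("fra");
        ContextSketch context = new ContextSketch();
        context.set(OcrConfigSketch.class, ocr);

        System.out.println(OcrParserSketch.languageFor(context));             // prints "fra"
        System.out.println(OcrParserSketch.languageFor(new ContextSketch())); // prints "eng"
    }
}

/** Hypothetical stand-in for a *Config object: plain getters and setters. */
class OcrConfigSketch {
    private String language = "eng"; // assumed default, for illustration only
    String getLanguage() { return language; }
    void setLanguage(String language) { this.language = language; }
}

/** Hypothetical stand-in for the parse context: at most one object per type. */
class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

/** Parser side: use the per-document config if present, else the defaults. */
class OcrParserSketch {
    private static final OcrConfigSketch DEFAULTS = new OcrConfigSketch();

    static String languageFor(ContextSketch context) {
        return context.get(OcrConfigSketch.class, DEFAULTS).getLanguage();
    }
}
```

The point of keying by type is that each parser can ask for its own config class without knowing about anyone else's.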

To configure parsers and translators on a per-JVM basis, to apply to all documents processed, it's a bit less consistent. Some look for a properties file with a specific name, usually in the tika namespace, and grab their settings / keys / etc out of that. Others expect to find a *Config with their program path set on it, even though that remains constant between documents. None of them support getting their settings from the Tika Config.
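As a rough illustration of the properties-file style, a minimal sketch using only the JDK might look like the following. The resource name and the examplePath key are made up for the example and don't correspond to any particular parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

class JvmWideSettingsSketch {
    // Hypothetical resource name, not one that any real parser looks for
    static final String RESOURCE =
            "org/apache/tika/parser/example/ExampleParser.properties";

    /** Loads the named classpath resource; returns empty Properties if it is absent. */
    static Properties load(ClassLoader loader, String resource) throws IOException {
        Properties props = new Properties();
        try (InputStream in = loader.getResourceAsStream(resource)) {
            if (in != null) {
                props.load(in);
            }
        }
        return props;
    }

    /** Reads the hypothetical examplePath key; empty string means "not configured". */
    static String programPath(Properties props) {
        return props.getProperty("examplePath", "").trim();
    }

    public static void main(String[] args) throws IOException {
        Properties props = load(JvmWideSettingsSketch.class.getClassLoader(), RESOURCE);
        String path = programPath(props);
        System.out.println(path.isEmpty()
                ? "no path configured, relying on the system PATH"
                : "using program at " + path);
    }
}
```

Since the file is found by name on the classpath, the setting applies to every document the JVM processes, which is why it sits awkwardly next to the per-document *Config route.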


As part of our evolution of parser preferences, we're moving towards people being able either to set their preferences in code, or to supply a Tika Config XML which sets their parser preferences or overrides certain bits of the defaults. The code option works for people who want to declare certain specific things. The Tika Config one gives the same functionality, but allows a consistent and clean way to set it across Tika App, Tika Server and Java code.

Another related example is the External Parser support. Because you can have multiple External Parser instances in your setup, one per format / program, we look for all the org/apache/tika/parser/external/tika-external-parsers.xml files on the classpath, and create parser instances based on the definitions in there.


What do we think about setting executable paths and keys/logins for parsers like OCR, Strings, Translators etc? Always on ParseContext? Properties? Custom xml config? Tika config xml? Other? Combination?

Nick
