Hey Tyler,

I hear you, but balance that against all the hidden things here and there, and everywhere, that I constantly keep discovering and having to pore through lines of TikaConfig - service loaders, class loaders.
When things work right - no problem. When something goes wrong; HUGE waste of time.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Tyler Palsulich <tpalsul...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Saturday, June 6, 2015 at 3:59 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: Configuring parsers and translators

>(Devil's advocate hat slightly on.) My one hesitation about putting it all
>into tika-config is that the default might get to be a monstrosity --
>difficult for new users to use.
>
>Tyler
>
>On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> I think it would be great to have all this in the Tika Config.
>>
>> The one thing then is to provide an example default config and
>> to make it *hugely* clear rather than all the levels of indirection
>> that we currently have going on which makes it super hard when
>> there is a config error (SPI, swallowing print messages, etc.)
>>
>> -----Original Message-----
>> From: Tyler Palsulich <tpalsul...@gmail.com>
>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>> Date: Saturday, June 6, 2015 at 3:45 PM
>> To: "dev@tika.apache.org" <dev@tika.apache.org>
>> Subject: Re: Configuring parsers and translators
>>
>> >Hi Nick,
>> >
>> >I've been mulling this over since you sent the first message. But, I'm
>> >afraid I don't have a good solution or developed ideas.
>> >
>> >I agree, it would be very nice to consolidate all configuration for all
>> >parsers in the server and app.
>> >
>> >Is it feasible to put everything into tika-config? Then Parser
>> >implementations would read the config to pull out their own configuration.
>> >Or, would it be better to keep some configuration separate? Documentation
>> >would be an issue if every parser defines its own metadata keys... But, it
>> >might be an improvement since we don't have "free form" properties and
>> >configuration files.
>> >
>> >Tyler
>> >
>> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <apa...@gagravarr.org> wrote:
>> >
>> >> Anyone have any thoughts on this?
>> >>
>> >> On Fri, 8 May 2015, Nick Burch wrote:
>> >> > Hi All
>> >> >
>> >> > This came up in TIKA-1623, but I thought it might be better brought out
>> >> > to the list for discussion
>> >> >
>> >> > To configure parsers on a per-document basis, such as setting PDF
>> >> > spacing tolerances, or telling Tesseract what language it should be
>> >> > OCRing for, we have the *Config objects. You create one of these, use
>> >> > the setters to configure it for your document, pop it onto the Parse
>> >> > context and it's used when processing your document
>> >> >
>> >> > To configure parsers and translators on a per-JVM basis, to apply to all
>> >> > documents processed, it's a bit less consistent. At least some look for
>> >> > a properties file with a specific name, usually in the tika namespace,
>> >> > and grab their settings / keys / etc out of that. At least some expect
>> >> > to find a *Config with their program path on it, even though that
>> >> > remains constant between documents. None of them support getting their
>> >> > settings from the Tika Config
>> >> >
>> >> >
>> >> > As part of our evolution of parser preferences, we're moving towards
>> >> > people either being able to set their preferences in code, or being able
>> >> > to supply a Tika Config xml which sets their parser preferences or
>> >> > overrides certain bits of the default. The code option works for people
>> >> > who want to declare certain specific things, the Tika Config one gives
>> >> > the same functionality but allows a consistent and clean way to set it
>> >> > between Tika App, Tika Server and java code.
>> >> >
>> >> > Another related example is the External Parser support. Because you can
>> >> > have multiple External Parser instances in your setup, one per format /
>> >> > program, we look for all the
>> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on the
>> >> > classpath, and create parser instances based on definitions in there
>> >> >
>> >> >
>> >> > What do we think about setting executable paths and keys/logins for
>> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
>> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
>> >> >
>> >> > Nick
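
For what it's worth, the "Combination?" option in Nick's question could be pictured as a fixed lookup order: a per-document value wins, then a JVM-wide value from the Tika Config, then a legacy properties file. The sketch below is only an illustration of that idea under those assumptions. The class LayeredParserSettings, its methods, and the keys and paths are all hypothetical and are not part of Tika. It also reports which layer supplied a value, which is the kind of visibility the thread asks for when a config error occurs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch, NOT Tika API: one lookup order for a parser setting.
class LayeredParserSettings {
    // Most specific layer first. These maps only stand in for the real sources:
    // a *Config on the ParseContext, a Tika Config xml, and a properties file.
    private final Map<String, String> perDocument = new HashMap<>();
    private final Map<String, String> jvmWide = new HashMap<>();
    private final Properties legacy = new Properties();

    public void setPerDocument(String key, String value) { perDocument.put(key, value); }
    public void setJvmWide(String key, String value) { jvmWide.put(key, value); }
    public void setLegacy(String key, String value) { legacy.setProperty(key, value); }

    // Returns the value from the first layer that defines the key, or null.
    public String resolve(String key) {
        if (perDocument.containsKey(key)) return perDocument.get(key);
        if (jvmWide.containsKey(key)) return jvmWide.get(key);
        return legacy.getProperty(key);
    }

    // Names the layer that supplied the value, so a misconfiguration can be
    // traced instead of being silently swallowed.
    public String sourceOf(String key) {
        if (perDocument.containsKey(key)) return "per-document";
        if (jvmWide.containsKey(key)) return "tika-config";
        if (legacy.containsKey(key)) return "properties";
        return "unset";
    }

    public static void main(String[] args) {
        LayeredParserSettings s = new LayeredParserSettings();
        s.setLegacy("tesseract.path", "/usr/bin");       // hypothetical key and path
        s.setJvmWide("tesseract.language", "eng");       // JVM-wide default
        s.setPerDocument("tesseract.language", "fra");   // override for one document
        String[] keys = {"tesseract.path", "tesseract.language", "translator.key"};
        for (String key : keys) {
            System.out.println(key + " = " + s.resolve(key) + " (from " + s.sourceOf(key) + ")");
        }
    }
}
```

The ordering itself is a design choice, not something the thread settles; the point of the sketch is only that a single, documented resolution path is easier to debug than several independent lookups.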