It seems like there are two goals here, both aiming to centralize configuration:
1. Provide an easy mechanism to configure which parsers to use when
   (TIKA-1509).
2. Configure all individual parser parameters in Tika Config (not in, for
   example, TesseractOCRConfig.properties) (TIKA-1508).

I'm also in favor of consolidating everything in Tika Config.

Tyler

On Mon, Jun 8, 2015 at 7:25 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

> Tyler, I see your devil's advocate point.
>
> I strongly agree with Chris about the benefit of centralizing
> configuration and making it easy to dump and modify the TikaConfig file.
>
> Even though the TikaConfig file might get ugly, it would be far better to
> have everything nailed down there than searching through service
> loaders...IMHO.
>
> I opened TIKA-1508 a while ago and haven't had any time to work on
> it...this just deals with simple parameter settings for parsers, not the
> far more difficult/interesting stuff that we've discussed with composite
> parsers.
>
> >> My main worry with putting it all into config xml is that we
> >> accidentally end up re-inventing Spring badly...
>
> Yeah, or re-inventing Solr's parameter loading as my example does... :(
>
> I think that basic parameter setting should at least be fairly trivial to
> code...time allowing...argh.
>
>
> -----Original Message-----
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Saturday, June 06, 2015 7:01 PM
> To: dev@tika.apache.org
> Subject: Re: Configuring parsers and translators
>
> Hey Tyler,
>
> I hear you, but balance that against all the hidden things here
> and there, and everywhere, that I constantly keep discovering and
> having to pore through lines of TikaConfig - service loaders, class
> loaders.
>
> When things work right - no problem. When something goes wrong;
> HUGE waste of time.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Tyler Palsulich <tpalsul...@gmail.com>
> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> Date: Saturday, June 6, 2015 at 3:59 PM
> To: "dev@tika.apache.org" <dev@tika.apache.org>
> Subject: Re: Configuring parsers and translators
>
> >(Devil's advocate hat slightly on.) My one hesitation about putting it all
> >into tika-config is that the default might get to be a monstrosity --
> >difficult for new users to use.
> >
> >Tyler
> >
> >On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
> >chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> >> I think it would be great to have all this in the Tika Config.
> >>
> >> The one thing then is to provide an example default config and
> >> to make it *hugely* clear rather than all the levels of indirection
> >> that we currently have going on which makes it super hard when
> >> there is a config error (SPI, swallowing print messages, etc.)
> >>
> >> -----Original Message-----
> >> From: Tyler Palsulich <tpalsul...@gmail.com>
> >> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> >> Date: Saturday, June 6, 2015 at 3:45 PM
> >> To: "dev@tika.apache.org" <dev@tika.apache.org>
> >> Subject: Re: Configuring parsers and translators
> >>
> >> >Hi Nick,
> >> >
> >> >I've been mulling this over since you sent the first message. But, I'm
> >> >afraid I don't have a good solution or developed ideas.
> >> >
> >> >I agree, it would be very nice to consolidate all configuration for all
> >> >parsers in the server and app.
> >> >
> >> >Is it feasible to put everything into tika-config? Then Parser
> >> >implementations would read the config to pull out their own
> >> >configuration. Or, would it be better to keep some configuration
> >> >separate? Documentation would be an issue if every parser defines its
> >> >own metadata keys... But, it might be an improvement since we don't
> >> >have "free form" properties and configuration files.
> >> >
> >> >Tyler
> >> >
> >> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <apa...@gagravarr.org> wrote:
> >> >
> >> >> Anyone have any thoughts on this?
> >> >>
> >> >> On Fri, 8 May 2015, Nick Burch wrote:
> >> >> > Hi All
> >> >> >
> >> >> > This came up in TIKA-1623, but I thought it might be better brought
> >> >> > out to the list for discussion.
> >> >> >
> >> >> > To configure parsers on a per-document basis, such as setting PDF
> >> >> > spacing tolerances, or telling Tesseract what language it should be
> >> >> > OCRing for, we have the *Config objects. You create one of these, use
> >> >> > the setters to configure it for your document, pop it onto the
> >> >> > ParseContext and it's used when processing your document.
> >> >> >
> >> >> > To configure parsers and translators on a per-JVM basis, to apply to
> >> >> > all documents processed, it's a bit less consistent. At least some
> >> >> > look for a properties file with a specific name, usually in the tika
> >> >> > namespace, and grab their settings / keys / etc out of that. At least
> >> >> > some expect to find a *Config with their program path on it, even
> >> >> > though that remains constant between documents. None of them support
> >> >> > getting their settings from the Tika Config.
> >> >> >
> >> >> >
> >> >> > As part of our evolution of parser preferences, we're moving towards
> >> >> > people either being able to set their preferences in code, or being
> >> >> > able to supply a Tika Config xml which sets their parser preferences
> >> >> > or overrides certain bits of the default. The code option works for
> >> >> > people who want to declare certain specific things; the Tika Config
> >> >> > one gives the same functionality but allows a consistent and clean
> >> >> > way to set it between Tika App, Tika Server and Java code.
> >> >> >
> >> >> > Another related example is the External Parser support.
> >> >> > Because you can have multiple External Parser instances in your
> >> >> > setup, one per format / program, we look for all the
> >> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on
> >> >> > the classpath, and create parser instances based on definitions in
> >> >> > there
> >> >> >
> >> >> >
> >> >> > What do we think about setting executable paths and keys/logins for
> >> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
> >> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
> >> >> >
> >> >> > Nick
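
For illustration only, here is a rough Java sketch of what the second goal in this thread (per-parser parameters living in the Tika Config instead of in separate properties files) could look like. Nothing in it is Tika's actual API or schema: the <param> element, the parameter names "language" and "tesseractPath", and the ParserParamsSketch class are all assumptions made up for this example. It uses only the JDK's XML parser to collect each parser's parameters into a map keyed by parser class name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Hypothetical sketch: reads per-parser parameters from a single
 * Tika-Config-like XML document. The <param> layout is an assumption
 * for illustration, not a schema defined by Tika.
 */
final class ParserParamsSketch {

    /** Returns parser class name -> (parameter name -> value), in document order. */
    static Map<String, Map<String, String>> load(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            Map<String, String> params = new LinkedHashMap<>();
            // Collect every <param name="...">value</param> under this parser.
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
            result.put(parser.getAttribute("class"), params);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical config: parameter names are assumptions for this sketch.
        String xml =
              "<properties><parsers>"
            + "<parser class=\"org.apache.tika.parser.ocr.TesseractOCRParser\">"
            + "<param name=\"language\">eng</param>"
            + "<param name=\"tesseractPath\">/usr/local/bin</param>"
            + "</parser>"
            + "</parsers></properties>";
        System.out.println(load(xml));
    }
}
```

With something along these lines, each parser would look up its own entry by class name at startup, so per-JVM settings such as executable paths would sit in one file that can be dumped and edited. Per-document overrides could still go on the ParseContext as described above.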