I think there's a third concern that should be taken into account: dynamic configuration (e.g. based on metadata, such as a password provider on a per-document basis). Currently you can only inject dynamically configurable behavior via ParseContext, but that adds complexity to recursive parser implementations.
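To make the password-provider case concrete, here is a minimal, self-contained sketch of what I mean. DocMetadata, PasswordLookup and SimpleContext are hypothetical stand-ins written for this example, not Tika classes, and the "resourceName" key is only an assumption. They just mirror the idea of registering a per-document callback by type on a context that is then handed to every nested parser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for document metadata (not a Tika class).
final class DocMetadata {
    private final Map<String, String> values = new HashMap<>();
    void set(String key, String value) { values.put(key, value); }
    String get(String key) { return values.get(key); }
}

// Hypothetical callback: picks a password from the metadata of one document.
interface PasswordLookup {
    String passwordFor(DocMetadata metadata);
}

// Hypothetical type-keyed context, passed unchanged to every nested parser.
final class SimpleContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();
    <T> void set(Class<T> key, T value) { entries.put(key, value); }
    <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
}

final class DynamicConfigSketch {
    // What a parser would do: ask the context for the callback, if one was registered.
    static String resolvePassword(SimpleContext context, DocMetadata metadata) {
        PasswordLookup lookup = context.get(PasswordLookup.class);
        return lookup == null ? null : lookup.passwordFor(metadata);
    }

    public static void main(String[] args) {
        SimpleContext context = new SimpleContext();
        // Assumption: passwords are keyed by a "resourceName" metadata entry.
        Map<String, String> passwords = new HashMap<>();
        passwords.put("report.pdf", "s3cret");
        context.set(PasswordLookup.class, m -> passwords.get(m.get("resourceName")));

        DocMetadata metadata = new DocMetadata();
        metadata.set("resourceName", "report.pdf");
        System.out.println(resolvePassword(context, metadata));
    }
}
```

The complexity I mentioned comes from the fact that every recursive or composite parser has to pass the same context down and look the callback up itself. Nothing in a static config file can express the lambda above.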

--
Best regards,
Konstantin Gribov

Mon, 15 Jun 2015 at 16:52, Allison, Timothy B. <talli...@mitre.org>:

> Agreed. They are two separate but related issues. TIKA-1508 should be
> fairly straightforward. Should I start coding it? Any other
> recommendations/concerns?
>
> -----Original Message-----
> From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> Sent: Saturday, June 13, 2015 12:54 PM
> To: dev@tika.apache.org
> Subject: Re: Configuring parsers and translators
>
> It seems like there are two goals here, both aiming to centralize
> configuration:
>
> 1. Provide an easy mechanism to configure which parsers to use when
>    (TIKA-1509).
> 2. Configure all individual parser parameters in Tika Config (not in, for
>    example, TesseractOCRConfig.properties) (TIKA-1508).
>
> I'm also in favor of consolidating everything in Tika Config.
>
> Tyler
>
> On Mon, Jun 8, 2015 at 7:25 AM Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > Tyler, I see your devil's advocate point.
> >
> > I strongly agree with Chris about the benefit of centralizing
> > configuration and making it easy to dump and modify the TikaConfig file.
> >
> > Even though the TikaConfig file might get ugly, it would be far better to
> > have everything nailed down there than searching through service
> > loaders...IMHO.
> >
> > I opened TIKA-1508 a while ago and haven't had any time to work on
> > it...this just deals with simple parameter settings for parsers, not the
> > far more difficult/interesting stuff that we've discussed with composite
> > parsers.
> >
> > >> My main worry with putting it all into config xml is that we
> > >> accidentally end up re-inventing spring badly...
> >
> > Yeah, or re-inventing Solr's parameter loading as my example does... :(
> >
> > I think that basic parameter setting should at least be fairly trivial to
> > code...time allowing...argh.
> >
> > -----Original Message-----
> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > Sent: Saturday, June 06, 2015 7:01 PM
> > To: dev@tika.apache.org
> > Subject: Re: Configuring parsers and translators
> >
> > Hey Tyler,
> >
> > I hear you, but balance that against all the hidden things here
> > and there, and everywhere, that I constantly keep discovering and
> > having to pore through lines of TikaConfig - service loaders, class
> > loaders.
> >
> > When things work right - no problem. When something goes wrong;
> > HUGE waste of time.
> >
> > Cheers,
> > Chris
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: Tyler Palsulich <tpalsul...@gmail.com>
> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> > Date: Saturday, June 6, 2015 at 3:59 PM
> > To: "dev@tika.apache.org" <dev@tika.apache.org>
> > Subject: Re: Configuring parsers and translators
> >
> > >(Devil's advocate hat slightly on.) My one hesitation about putting it all
> > >into tika-config is that the default might get to be a monstrosity --
> > >difficult for new users to use.
> > >
> > >Tyler
> > >
> > >On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
> > >chris.a.mattm...@jpl.nasa.gov> wrote:
> > >
> > >> I think it would be great to have all this in the Tika Config.
> > >>
> > >> The one thing then is to provide an example default config and
> > >> to make it *hugely* clear rather than all the levels of indirection
> > >> that we currently have going on which makes it super hard when
> > >> there is a config error (SPI, swallowing print messages, etc.)
> > >>
> > >> -----Original Message-----
> > >> From: Tyler Palsulich <tpalsul...@gmail.com>
> > >> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> > >> Date: Saturday, June 6, 2015 at 3:45 PM
> > >> To: "dev@tika.apache.org" <dev@tika.apache.org>
> > >> Subject: Re: Configuring parsers and translators
> > >>
> > >> >Hi Nick,
> > >> >
> > >> >I've been mulling this over since you sent the first message. But, I'm
> > >> >afraid I don't have a good solution or developed ideas.
> > >> >
> > >> >I agree, it would be very nice to consolidate all configuration for all
> > >> >parsers in the server and app.
> > >> >
> > >> >Is it feasible to put everything into tika-config? Then Parser
> > >> >implementations would read the config to pull out their own configuration.
> > >> >Or, would it be better to keep some configuration separate? Documentation
> > >> >would be an issue if every parser defines its own metadata keys...
> > >> >But, it might be an improvement since we don't have "free form"
> > >> >properties and configuration files.
> > >> >
> > >> >Tyler
> > >> >
> > >> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <apa...@gagravarr.org> wrote:
> > >> >
> > >> >> Anyone have any thoughts on this?
> > >> >>
> > >> >> On Fri, 8 May 2015, Nick Burch wrote:
> > >> >> > Hi All
> > >> >> >
> > >> >> > This came up in TIKA-1623, but I thought it might be better brought
> > >> >> > out to the list for discussion
> > >> >> >
> > >> >> > To configure parsers on a per-document basis, such as setting PDF
> > >> >> > spacing tolerances, or telling Tesseract what language it should be
> > >> >> > OCRing for, we have the *Config objects. You create one of these, use
> > >> >> > the setters to configure it for your document, pop it onto the Parse
> > >> >> > context and it's used when processing your document
> > >> >> >
> > >> >> > To configure parsers and translators on a per-JVM basis, to apply to all
> > >> >> > documents processed, it's a bit less consistent. At least some look for
> > >> >> > a properties file with a specific name, usually in the tika namespace,
> > >> >> > and grab their settings / keys / etc out of that. At least some expect
> > >> >> > to find a *Config with their program path on it, even though that
> > >> >> > remains constant between documents. None of them support getting their
> > >> >> > settings from the Tika Config
> > >> >> >
> > >> >> > As part of our evolution of parser preferences, we're moving towards
> > >> >> > people either being able to set their preferences in code, or being able
> > >> >> > to supply a Tika Config xml which sets their parser preferences or
> > >> >> > overrides certain bits of the default.
> > >> >> > The code option works for people
> > >> >> > who want to declare certain specific things, the Tika Config one gives
> > >> >> > the same functionality but allows a consistent and clean way to set it
> > >> >> > between Tika App, Tika Server and java code.
> > >> >> >
> > >> >> > Another related example is the External Parser support. Because you can
> > >> >> > have multiple External Parser instances in your setup, one per format /
> > >> >> > program, we look for all the
> > >> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on the
> > >> >> > classpath, and create parser instances based on definitions in there
> > >> >> >
> > >> >> > What do we think about setting executable paths and keys/logins for
> > >> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
> > >> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
> > >> >> >
> > >> >> > Nick
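
For reference, the per-document versus per-JVM split Nick describes could be resolved in one place, roughly as sketched below. This is only an illustration under stated assumptions: OcrSettings and PerDocumentContext are hypothetical stand-ins (not Tika's *Config classes or ParseContext), and the property keys "language" and "executablePath" are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical per-parser settings object, in the spirit of the *Config classes.
final class OcrSettings {
    final String language;
    final String executablePath;
    OcrSettings(String language, String executablePath) {
        this.language = language;
        this.executablePath = executablePath;
    }
}

// Hypothetical type-keyed context (a stand-in, not Tika's ParseContext).
final class PerDocumentContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();
    <T> void set(Class<T> key, T value) { entries.put(key, value); }
    <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
}

final class LayeredConfigSketch {
    // Per-JVM defaults, e.g. loaded once from a properties file or a central config.
    static OcrSettings fromProperties(Properties props) {
        return new OcrSettings(
                props.getProperty("language", "eng"),
                props.getProperty("executablePath", ""));
    }

    // Per-document settings on the context win; otherwise use the per-JVM defaults.
    static OcrSettings resolve(PerDocumentContext context, OcrSettings jvmDefaults) {
        OcrSettings perDocument = context.get(OcrSettings.class);
        return perDocument != null ? perDocument : jvmDefaults;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("language", "eng");
        props.setProperty("executablePath", "/usr/bin"); // hypothetical path
        OcrSettings defaults = fromProperties(props);

        PerDocumentContext context = new PerDocumentContext();
        System.out.println(resolve(context, defaults).language);

        // Override the language for one document, keep the JVM-wide executable path.
        context.set(OcrSettings.class, new OcrSettings("deu", defaults.executablePath));
        System.out.println(resolve(context, defaults).language);
    }
}
```

With a lookup like this, values that stay constant between documents (executable paths, keys) live in the JVM-wide defaults, and only the values that really vary per document need to go on the context.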