It seems like there are two goals here, both aiming to centralize
configuration:

1. Provide an easy mechanism to configure which parsers to use when
(TIKA-1509).
2. Configure all individual parser parameters in Tika Config (not in, for
example, TesseractOCRConfig.properties) (TIKA-1508).

I'm also in favor of consolidating everything in Tika Config.
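
To make goal 2 a bit more concrete, here is a rough sketch of how
per-parser parameters could be pulled out of a single config file. This is
an illustration only, not how TikaConfig works today: the <param> element,
the parameter names (language, tesseractPath, spacingTolerance), and the
ParserParamsSketch class are all made up for the example. It uses only the
JDK's built-in XML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ParserParamsSketch {

    // Hypothetical config layout; the <param> element and its names are
    // assumptions for this sketch, not an existing Tika Config schema.
    public static final String SAMPLE =
          "<properties>"
        + "  <parsers>"
        + "    <parser class=\"org.apache.tika.parser.ocr.TesseractOCRParser\">"
        + "      <param name=\"language\">eng</param>"
        + "      <param name=\"tesseractPath\">/usr/bin</param>"
        + "    </parser>"
        + "    <parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
        + "      <param name=\"spacingTolerance\">0.5</param>"
        + "    </parser>"
        + "  </parsers>"
        + "</properties>";

    /** Maps each parser class name to its (param name -> value) settings. */
    public static Map<String, Map<String, String>> readParams(String xml)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            Map<String, String> params = new LinkedHashMap<>();
            // Collect every <param> nested under this <parser> element.
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"),
                           param.getTextContent().trim());
            }
            result.put(parser.getAttribute("class"), params);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Print what each parser would see as its own configuration.
        readParams(SAMPLE).forEach((parserClass, params) ->
            System.out.println(parserClass + " -> " + params));
    }
}
```

Each parser would then only need to look up its own class name in the
result, which is roughly the "Parser implementations would read the config
to pull out their own configuration" idea from earlier in the thread.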

Tyler

On Mon, Jun 8, 2015 at 7:25 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

> Tyler, I see your devil's advocate point.
>
> I strongly agree with Chris about the benefit of centralizing
> configuration and making it easy to dump and modify the TikaConfig file.
>
> Even though the TikaConfig file might get ugly, it would be far better to
> have everything nailed down there than searching through service
> loaders...IMHO.
>
> I opened TIKA-1508 a while ago and haven't had any time to work on
> it...this just deals with simple parameter settings for parsers, not the
> far more difficult/interesting stuff that we've discussed with composite
> parsers.
>
> >> My main worry with putting it all into config xml is that we
> >> accidentally end up re-inventing Spring badly...
>
> Yeah, or re-inventing Solr's parameter loading as my example does... :(
>
> I think that basic parameter setting should at least be fairly trivial to
> code...time allowing...argh.
>
>
> -----Original Message-----
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Saturday, June 06, 2015 7:01 PM
> To: dev@tika.apache.org
> Subject: Re: Configuring parsers and translators
>
> Hey Tyler,
>
> I hear you, but balance that against all the hidden things here
> and there, and everywhere, that I constantly keep discovering and
> having to pore through lines of TikaConfig - service loaders, class
> loaders.
>
> When things work right - no problem. When something goes wrong;
> HUGE waste of time.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
> -----Original Message-----
> From: Tyler Palsulich <tpalsul...@gmail.com>
> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> Date: Saturday, June 6, 2015 at 3:59 PM
> To: "dev@tika.apache.org" <dev@tika.apache.org>
> Subject: Re: Configuring parsers and translators
>
> >(Devil's advocate hat slightly on.) My one hesitation about putting it all
> >into tika-config is that the default might get to be a monstrosity --
> >difficult for new users to use.
> >
> >Tyler
> >
> >On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
> >chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> >> I think it would be great to have all this in the Tika Config.
> >>
> >> The one thing then is to provide an example default config and
> >> to make it *hugely* clear rather than all the levels of indirection
> >> that we currently have going on which makes it super hard when
> >> there is a config error (SPI, swallowing print messages, etc.)
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Tyler Palsulich <tpalsul...@gmail.com>
> >> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> >> Date: Saturday, June 6, 2015 at 3:45 PM
> >> To: "dev@tika.apache.org" <dev@tika.apache.org>
> >> Subject: Re: Configuring parsers and translators
> >>
> >> >Hi Nick,
> >> >
> >> >I've been mulling this over since you sent the first message. But, I'm
> >> >afraid I don't have a good solution or developed ideas.
> >> >
> >> >I agree, it would be very nice to consolidate all configuration for all
> >> >parsers in the server and app.
> >> >
> >> >Is it feasible to put everything into tika-config? Then Parser
> >> >implementations would read the config to pull out their own
> >> >configuration. Or, would it be better to keep some configuration
> >> >separate? Documentation would be an issue if every parser defines its
> >> >own metadata keys... But, it might be an improvement since we don't
> >> >have "free form" properties and configuration files.
> >> >
> >> >Tyler
> >> >
> >> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <apa...@gagravarr.org>
> >> >wrote:
> >> >
> >> >> Anyone have any thoughts on this?
> >> >>
> >> >> On Fri, 8 May 2015, Nick Burch wrote:
> >> >> > Hi All
> >> >> >
> >> >> > This came up in TIKA-1623, but I thought it might be better brought
> >> >> > out to the list for discussion
> >> >> >
> >> >> > To configure parsers on a per-document basis, such as setting PDF
> >> >> > spacing tolerances, or telling Tesseract what language it should be
> >> >> > OCRing for, we have the *Config objects. You create one of these, use
> >> >> > the setters to configure it for your document, pop it onto the Parse
> >> >> > context and it's used when processing your document
> >> >> >
> >> >> > To configure parsers and translators on a per-JVM basis, to apply to
> >> >> > all documents processed, it's a bit less consistent. At least some
> >> >> > look for a properties file with a specific name, usually in the tika
> >> >> > namespace, and grab their settings / keys / etc out of that. At least
> >> >> > some expect to find a *Config with their program path on it, even
> >> >> > though that remains constant between documents. None of them support
> >> >> > getting their settings from the Tika Config
> >> >> >
> >> >> >
> >> >> > As part of our evolution of parser preferences, we're moving towards
> >> >> > people either being able to set their preferences in code, or being
> >> >> > able to supply a Tika Config xml which sets their parser preferences
> >> >> > or overrides certain bits of the default. The code option works for
> >> >> > people who want to declare certain specific things, the Tika Config
> >> >> > one gives the same functionality but allows a consistent and clean
> >> >> > way to set it between Tika App, Tika Server and java code.
> >> >> >
> >> >> > Another related example is the External Parser support. Because you
> >> >> > can have multiple External Parser instances in your setup, one per
> >> >> > format / program, we look for all the
> >> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on
> >> >> > the classpath, and create parser instances based on definitions in
> >> >> > there
> >> >> >
> >> >> > What do we think about setting executable paths and keys/logins for
> >> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
> >> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
> >> >> >
> >> >> > Nick
> >> >> >
> >> >>
> >>
> >>
>
>
