I think there's a third concern that should be taken into account: dynamic
configuration (e.g. based on metadata, such as a password provider on a
per-document basis).
Currently you can only inject dynamically configurable behavior via
ParseContext, but that adds complexity to recursive parser implementations.
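
To make the concern concrete, here is a rough sketch of the pattern I mean.
It deliberately uses small stand-in types (Context, PasswordProvider and
parse are hypothetical and only mimic the shape of a type-keyed context and
a metadata-driven password lookup); it is not Tika's real API, only an
illustration of why every recursive call has to carry the context along.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types for illustration only; they mimic the general shape of a
// type-keyed parse context and a password lookup, NOT Tika's real classes.
public class DynamicConfigSketch {

    /** Hypothetical per-document password lookup driven by metadata. */
    interface PasswordProvider {
        String getPassword(Map<String, String> metadata);
    }

    /** Hypothetical type-keyed context handed to every parser call. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key));
        }
    }

    /** Toy recursive "parser": each level looks up its own password. */
    static String parse(String name, int depth, Context context) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("resourceName", name);

        // Dynamic configuration: the decision is made per document.
        PasswordProvider provider = context.get(PasswordProvider.class);
        String password = provider == null ? null : provider.getPassword(metadata);
        String result = name + ":" + password;

        if (depth > 0) {
            // The context must be forwarded by hand; forgetting this would
            // silently drop the dynamic configuration for embedded documents.
            result += " > " + parse("embedded-" + name, depth - 1, context);
        }
        return result;
    }

    public static void main(String[] args) {
        Context context = new Context();
        context.set(PasswordProvider.class,
                metadata -> metadata.get("resourceName").startsWith("embedded-")
                        ? "inner" : "outer");
        // prints "report.pdf:outer > embedded-report.pdf:inner"
        System.out.println(parse("report.pdf", 1, context));
    }
}
```

The point of the sketch is the hand-forwarded context argument in the
recursive call: static configuration in a config file would not need it,
but anything decided per document does.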

-- 
Best regards,
Konstantin Gribov

On Mon, 15 June 2015 at 16:52, Allison, Timothy B. <talli...@mitre.org> wrote:

> Agreed.  They are two separate but related issues.  TIKA-1508 should be
> fairly straightforward.  Should I start coding it?  Any other
> recommendations/concerns?
>
>
>
> -----Original Message-----
> From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
> Sent: Saturday, June 13, 2015 12:54 PM
> To: dev@tika.apache.org
> Subject: Re: Configuring parsers and translators
>
> It seems like there are two goals here, both aiming to centralize
> configuration:
>
> 1. Provide an easy mechanism to configure which parsers to use when
> (TIKA-1509).
> 2. Configure all individual parser parameters in Tika Config (not in, for
> example, TesseractOCRConfig.properties) (TIKA-1508).
>
> I'm also in favor of consolidating everything in Tika Config.
>
> Tyler
>
> On Mon, Jun 8, 2015 at 7:25 AM Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > Tyler, I see your devil's advocate point.
> >
> > I strongly agree with Chris about the benefit of centralizing
> > configuration and making it easy to dump and modify the TikaConfig file.
> >
> > Even though the TikaConfig file might get ugly, it would be far better to
> > have everything nailed down there than searching through service
> > loaders...IMHO.
> >
> > I opened TIKA-1508 a while ago and haven't had any time to work on
> > it...this just deals with simple parameter settings for parsers, not the
> > far more difficult/interesting stuff that we've discussed with composite
> > parsers.
> >
> > >> My main worry with putting it all into config xml is that we
> > >> accidentally end up re-inventing Spring badly...
> >
> > Yeah, or re-inventing Solr's parameter loading as my example does... :(
> >
> > I think that basic parameter setting should at least be fairly trivial to
> > code...time allowing...argh.
> >
> >
> > -----Original Message-----
> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > Sent: Saturday, June 06, 2015 7:01 PM
> > To: dev@tika.apache.org
> > Subject: Re: Configuring parsers and translators
> >
> > Hey Tyler,
> >
> > I hear you, but balance that against all the hidden things here
> > and there, and everywhere, that I constantly keep discovering and
> > having to pore through lines of TikaConfig - service loaders, class
> > loaders.
> >
> > When things work right - no problem. When something goes wrong;
> > HUGE waste of time.
> >
> > Cheers,
> > Chris
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> > -----Original Message-----
> > From: Tyler Palsulich <tpalsul...@gmail.com>
> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> > Date: Saturday, June 6, 2015 at 3:59 PM
> > To: "dev@tika.apache.org" <dev@tika.apache.org>
> > Subject: Re: Configuring parsers and translators
> >
> > >(Devil's advocate hat slightly on.) My one hesitation about putting it
> > >all into tika-config is that the default might get to be a monstrosity --
> > >difficult for new users to use.
> > >
> > >Tyler
> > >
> > >On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
> > >chris.a.mattm...@jpl.nasa.gov> wrote:
> > >
> > >> I think it would be great to have all this in the Tika Config.
> > >>
> > >> The one thing then is to provide an example default config and
> > >> to make it *hugely* clear rather than all the levels of indirection
> > >> that we currently have going on which makes it super hard when
> > >> there is a config error (SPI, swallowing print messages, etc.)
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Tyler Palsulich <tpalsul...@gmail.com>
> > >> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> > >> Date: Saturday, June 6, 2015 at 3:45 PM
> > >> To: "dev@tika.apache.org" <dev@tika.apache.org>
> > >> Subject: Re: Configuring parsers and translators
> > >>
> > >> >Hi Nick,
> > >> >
> > >> >I've been mulling this over since you sent the first message. But,
> > >> >I'm afraid I don't have a good solution or developed ideas.
> > >> >
> > >> >I agree, it would be very nice to consolidate all configuration for
> > >> >all parsers in the server and app.
> > >> >
> > >> >Is it feasible to put everything into tika-config? Then Parser
> > >> >implementations would read the config to pull out their own
> > >> >configuration. Or, would it be better to keep some configuration
> > >> >separate? Documentation would be an issue if every parser defines
> > >> >its own metadata keys... But, it might be an improvement since we
> > >> >don't have "free form" properties and configuration files.
> > >> >
> > >> >Tyler
> > >> >
> > >> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <apa...@gagravarr.org>
> > >> >wrote:
> > >> >
> > >> >> Anyone have any thoughts on this?
> > >> >>
> > >> >> On Fri, 8 May 2015, Nick Burch wrote:
> > >> >> > Hi All
> > >> >> >
> > >> >> > This came up in TIKA-1623, but I thought it might be better
> > >> >> > brought out to the list for discussion
> > >> >> >
> > >> >> > To configure parsers on a per-document basis, such as setting PDF
> > >> >> > spacing tolerances, or telling Tesseract what language it should
> > >> >> > be OCRing for, we have the *Config objects. You create one of
> > >> >> > these, use the setters to configure it for your document, pop it
> > >> >> > onto the Parse context and it's used when processing your document
> > >> >> >
> > >> >> > To configure parsers and translators on a per-JVM basis, to apply
> > >> >> > to all documents processed, it's a bit less consistent. At least
> > >> >> > some look for a properties file with a specific name, usually in
> > >> >> > the tika namespace, and grab their settings / keys / etc out of
> > >> >> > that. At least some expect to find a *Config with their program
> > >> >> > path on it, even though that remains constant between documents.
> > >> >> > None of them support getting their settings from the Tika Config
> > >> >> >
> > >> >> >
> > >> >> > As part of our evolution of parser preferences, we're moving
> > >> >> > towards people either being able to set their preferences in code,
> > >> >> > or being able to supply a Tika Config xml which sets their parser
> > >> >> > preferences or overrides certain bits of the default. The code
> > >> >> > option works for people who want to declare certain specific
> > >> >> > things, the Tika Config one gives the same functionality but
> > >> >> > allows a consistent and clean way to set it between Tika App, Tika
> > >> >> > Server and java code.
> > >> >> >
> > >> >> > Another related example is the External Parser support. Because
> > >> >> > you can have multiple External Parser instances in your setup, one
> > >> >> > per format / program, we look for all the
> > >> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on
> > >> >> > the classpath, and create parser instances based on definitions in
> > >> >> > there
> > >> >> >
> > >> >> >
> > >> >> > What do we think about setting executable paths and keys/logins
> > >> >> > for parsers like OCR, Strings, Translators etc? Always on
> > >> >> > ParseContext? Properties? Custom xml config? Tika config xml?
> > >> >> > Other? Combination?
> > >> >> >
> > >> >> > Nick
> > >> >> >
> > >> >>
> > >>
> > >>
> >
> >
>
