Hey Tyler,

I hear you, but balance that against all the hidden things here,
there, and everywhere that I constantly keep discovering, and
having to pore through lines of TikaConfig - service loaders, class
loaders.

When things work right - no problem. When something goes wrong -
HUGE waste of time.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Tyler Palsulich <tpalsul...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Saturday, June 6, 2015 at 3:59 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: Configuring parsers and translators

>(Devil's advocate hat slightly on.) My one hesitation about putting it all
>into tika-config is that the default might get to be a monstrosity --
>difficult for new users to use.
>
>Tyler
>
>On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> I think it would be great to have all this in the Tika Config.
>>
>> The one thing then is to provide an example default config and
>> to make it *hugely* clear rather than all the levels of indirection
>> that we currently have going on which makes it super hard when
>> there is a config error (SPI, swallowing print messages, etc.)
>>
>> -----Original Message-----
>> From: Tyler Palsulich <tpalsul...@gmail.com>
>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>> Date: Saturday, June 6, 2015 at 3:45 PM
>> To: "dev@tika.apache.org" <dev@tika.apache.org>
>> Subject: Re: Configuring parsers and translators
>>
>> >Hi Nick,
>> >
>> >I've been mulling this over since you sent the first message. But, I'm
>> >afraid I don't have a good solution or developed ideas.
>> >
>> >I agree, it would be very nice to consolidate all configuration for all
>> >parsers in the server and app.
>> >
>> >Is it feasible to put everything into tika-config? Then Parser
>> >implementations would read the config to pull out their own
>> >configuration. Or, would it be better to keep some configuration
>> >separate? Documentation would be an issue if every parser defines its
>> >own metadata keys... But, it might be an improvement since we don't
>> >have "free form" properties and configuration files.
>> >
>> >Tyler
>> >
>> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <apa...@gagravarr.org> wrote:
>> >
>> >> Anyone have any thoughts on this?
>> >>
>> >> On Fri, 8 May 2015, Nick Burch wrote:
>> >> > Hi All
>> >> >
>> >> > This came up in TIKA-1623, but I thought it might be better brought
>> >> > out to the list for discussion
>> >> >
>> >> > To configure parsers on a per-document basis, such as setting PDF
>> >> > spacing tolerances, or telling Tesseract what language it should be
>> >> > OCRing for, we have the *Config objects. You create one of these,
>> >> > use the setters to configure it for your document, pop it onto the
>> >> > Parse context and it's used when processing your document
>> >> >
>> >> > To configure parsers and translators on a per-JVM basis, to apply
>> >> > to all documents processed, it's a bit less consistent. At least
>> >> > some look for a properties file with a specific name, usually in
>> >> > the tika namespace, and grab their settings / keys / etc out of
>> >> > that. At least some expect to find a *Config with their program
>> >> > path on it, even though that remains constant between documents.
>> >> > None of them support getting their settings from the Tika Config
>> >> >
>> >> >
>> >> > As part of our evolution of parser preferences, we're moving
>> >> > towards people either being able to set their preferences in code,
>> >> > or being able to supply a Tika Config xml which sets their parser
>> >> > preferences or overrides certain bits of the default. The code
>> >> > option works for people who want to declare certain specific
>> >> > things, the Tika Config one gives the same functionality but allows
>> >> > a consistent and clean way to set it between Tika App, Tika Server
>> >> > and java code.
>> >> >
>> >> > Another related example is the External Parser support. Because
>> >> > you can have multiple External Parser instances in your setup, one
>> >> > per format / program, we look for all the
>> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on
>> >> > the classpath, and create parser instances based on definitions in
>> >> > there
>> >> >
>> >> >
>> >> > What do we think about setting executable paths and keys/logins for
>> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
>> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
>> >> >
>> >> > Nick
>> >> >
>> >>
>>
>>
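For anyone following the thread, the per-document pattern Nick describes
(create a *Config, call its setters, put it on the parse context, and the
parser picks it up) might look roughly like the sketch below. This is only
an illustration of the idea, not Tika code: SketchPdfConfig, SketchContext,
SketchParser and the 0.5 default are all hypothetical stand-ins invented
for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-document settings object (not a Tika class).
class SketchPdfConfig {
    private float spacingTolerance = 0.5f; // assumed default, illustration only

    float getSpacingTolerance() { return spacingTolerance; }
    void setSpacingTolerance(float value) { this.spacingTolerance = value; }
}

// Hypothetical type-keyed context: at most one settings object per class.
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    // Returns the stored object for this class, or the fallback if none was set.
    <T> T get(Class<T> key, T fallback) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : fallback;
    }
}

// Hypothetical parser that looks up its settings in the context per document.
class SketchParser {
    float effectiveTolerance(SketchContext context) {
        SketchPdfConfig config =
            context.get(SketchPdfConfig.class, new SketchPdfConfig());
        return config.getSpacingTolerance();
    }
}

class SketchDemo {
    public static void main(String[] args) {
        SketchParser parser = new SketchParser();

        // No config on the context: the parser falls back to its defaults.
        SketchContext plain = new SketchContext();
        System.out.println("default tolerance: " + parser.effectiveTolerance(plain));

        // Per-document override: configure via setters, pop onto the context.
        SketchPdfConfig config = new SketchPdfConfig();
        config.setSpacingTolerance(0.3f);
        SketchContext tuned = new SketchContext();
        tuned.set(SketchPdfConfig.class, config);
        System.out.println("tuned tolerance: " + parser.effectiveTolerance(tuned));
    }
}
```

The open question in the thread is where the fallback should come from
when nothing is set per document: in this sketch it is a hard-coded
default, which is the spot where a value read from the Tika Config could
plug in instead.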
