Hi,

On 9/20/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> 1. How do we configure Tika? This somewhat relates to the prior discussion
> on the Tika parser interface, but it extends beyond that.
> [...]
> 1. We define a TikaConfiguration object, and an xml file location/format
> within CM for Tika configuration properties. [...]

I'd like to keep the configuration part structurally separate from the parser and other components, so that one could easily integrate Tika components in various environments like IoC containers, etc. We could have a "native" Tika configuration file and a simple mechanism that converts the configuration to active parser (and other) instances, but it should be possible to use the Tika features even without such a configuration.

My preference would be to use the JavaBean conventions for any configuration options on parser and other Tika classes to avoid extra dependencies on custom configuration objects (see also TIKA-23). A native configuration mechanism could use the property setters just like a generic IoC container or even a hardcoded client application would.

> 2. What are the right data attributes to configure a parser? Could we get
> some documentation on them? [...]
> [...]
> 2. We sit down and baseline a set of properties, including documentation on
> them, for tika parsers.

I think the configuration of a parser class will be highly dependent on the content format it handles, so there may not be that many truly global configuration options. I even think that the mime type should be a part of content metadata and not of parser configuration. However, I very much agree with the drive to plan and document the available configuration options.

> We should also change everything in CM right now that says "Luis" to "Tika".

Agreed.

> 3. What are the entry points into Tika? As far as I can tell, there is a
> ParserFactory that can be used to get a Parser for a particular file or Url,
> etc. This implies that the ParserFactory performs some sort of mime type
> resolution (which it does), however, mime type resolution (using the new
> mime framework) requires the ability for Tika to have a configuration.

I think we should try to keep Tika as modular as possible and have multiple different entry points depending on the set of functionality and amount of customization a client wants. Currently I could foresee Tika being composed of three independent components (parsing, mime type detection, configuration) and a helper layer that binds these three together. It should be possible (and easy) for a client to reach directly to even a single parser class and use just that, but also to invoke a single helper method that looks up a configuration file, instantiates a set of parser and type detection components, retrieves a resource identified by a URI, and extracts the text content of the resource using all the configured components.

BR,

Jukka Zitting
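
As a rough illustration of the JavaBean-style configuration and the layered entry points discussed above, here is a minimal sketch. None of these classes exist in Tika: PlainTextParser, BeanConfigurer, TextExtractor, and the "encoding" and "maxLength" properties are all hypothetical names chosen for the example, and a plain Map stands in for whatever a native configuration file would provide. MIME type detection and URI retrieval are left out.

```java
import java.lang.reflect.Method;
import java.nio.charset.Charset;
import java.util.Map;

// Hypothetical parser (not a Tika class). Its options are plain JavaBean
// properties, so it has no dependency on any configuration object.
class PlainTextParser {
    private String encoding = "UTF-8";
    private int maxLength = Integer.MAX_VALUE;

    public String getEncoding() { return encoding; }
    public void setEncoding(String encoding) { this.encoding = encoding; }
    public int getMaxLength() { return maxLength; }
    public void setMaxLength(int maxLength) { this.maxLength = maxLength; }

    // Decodes the bytes and truncates the text to at most maxLength characters.
    public String parse(byte[] data) {
        String text = new String(data, Charset.forName(encoding));
        return text.length() > maxLength ? text.substring(0, maxLength) : text;
    }
}

// Hypothetical stand-in for a "native" configuration mechanism: it applies
// name/value pairs through public setters, much as an IoC container would.
// Only String and int properties are handled in this sketch.
class BeanConfigurer {
    static void configure(Object bean, Map<String, String> properties)
            throws ReflectiveOperationException {
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            String name = entry.getKey();
            String setterName =
                    "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            boolean applied = false;
            for (Method method : bean.getClass().getMethods()) {
                if (!method.getName().equals(setterName)
                        || method.getParameterCount() != 1) {
                    continue;
                }
                Class<?> type = method.getParameterTypes()[0];
                if (type == String.class) {
                    method.invoke(bean, entry.getValue());
                    applied = true;
                } else if (type == int.class) {
                    method.invoke(bean, Integer.parseInt(entry.getValue()));
                    applied = true;
                }
            }
            if (!applied) {
                throw new IllegalArgumentException("No usable setter for property: " + name);
            }
        }
    }
}

// Hypothetical helper layer: one call that binds configuration and parsing.
class TextExtractor {
    static String extract(Map<String, String> config, byte[] data)
            throws ReflectiveOperationException {
        PlainTextParser parser = new PlainTextParser();
        BeanConfigurer.configure(parser, config);
        return parser.parse(data);
    }
}
```

A client that wants just the parser can call `new PlainTextParser()` and its setters directly, while a client that wants the convenience path can hand a property map to `TextExtractor.extract`. Both paths go through the same setters, so the parser never needs to know where its settings came from.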
