Hi,

So, just to understand the breakdown. When you say:

tika-classic-parser-bundle/
        Tika-office-parser-bundle/ (including microsoft, opendocument, pst, rtf, iwork? Has dependency on html/text)
        Tika-pdf-parser-bundle/
        Tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml, kml, feed, iptc, crypto, etc?)
        Tika-sourcecode-parser-bundle (parsers that handle source code)
        Tika-package-parser-bundle (all zip/tar/etc)

Does that indicate 6 bundles, i.e. 5 individual bundles that could wrap into 1 uber jar? Breaking things down at different levels will add to the maintenance effort, so it may be better to start with broad strokes like tika-classic-parser-bundle. But if we just created a tika-classic-parser-bundle, are we attempting to group the bundles by type of use case? I think this approach is fine, but it does mean we're taking an opinion on what most of Tika's basic users want for simple use cases.

Another approach could be grouping the parsers by similar dependencies, which I think the tika-multimedia-parser-bundle does fairly well. From a dependency management perspective this is desirable. I've used tools like JDepend to break down which packages use which dependencies. The package-based dependencies within tika-parsers can also be seen here in sonar:

http://nemo.sonarqube.org/design/index/253571
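
To make the "group by similar dependencies" idea concrete, here is a rough sketch. It assumes you already have a package-to-dependencies map (for example, exported by hand from a JDepend or sonar report). The class name, method name, and the sample package/library names are all made up for illustration; none of them come from Tika.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class DependencyGrouping {

    /**
     * Inverts a "package -> dependencies" map into "dependency -> packages".
     * Packages that land under the same dependency are candidates for
     * living in the same bundle.
     */
    static Map<String, SortedSet<String>> packagesByDependency(
            Map<String, Set<String>> dependenciesByPackage) {
        Map<String, SortedSet<String>> result = new TreeMap<>();
        dependenciesByPackage.forEach((pkg, deps) -> {
            for (String dep : deps) {
                result.computeIfAbsent(dep, k -> new TreeSet<>()).add(pkg);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input; real data would come from a dependency report.
        Map<String, Set<String>> sample = Map.of(
                "parser.a", Set.of("lib-x"),
                "parser.b", Set.of("lib-x", "lib-y"),
                "parser.c", Set.of("lib-z"));
        packagesByDependency(sample)
                .forEach((dep, pkgs) -> System.out.println(dep + " <- " + pkgs));
    }
}
```

In this toy input, parser.a and parser.b share lib-x and would be candidates for one bundle, while parser.c would stand alone.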


With respect to bundles that don't fit, perhaps those live on their own until an obvious grouping emerges. It's much harder to remove something from a bundle than to add it later. I think this may apply to native bundles too.

- Bob

On 8/4/2015 8:32 AM, Allison, Timothy B. wrote:
Bob,
   Thank you, again.  This looks promising at first glance!

To continue down the strawman path and to start discussion on the elephant in 
the room...

We'd want bundles that allow enough control for users but aren't too much of a 
hassle to configure.  There will be trade-offs.

So, what do we think of this strawman for proposed bundles:

tika-classic-parser-bundle/
        Tika-office-parser-bundle/ (including microsoft, opendocument, pst, rtf, iwork? Has dependency on html/text)
        Tika-pdf-parser-bundle/
        Tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml, kml, feed, iptc, crypto, etc?)
        Tika-sourcecode-parser-bundle (parsers that handle source code)
        Tika-package-parser-bundle (all zip/tar/etc)

tika-multimedia-parser-bundle/  (parsers that pull metadata out of image, 
audio, audio+video files)
        Tika-image-parser-bundle
        Tika-image-ocr-parser-bundle
        Tika-audio-parser-bundle
        Tika-video-parser-bundle

tika-scientific-parser-bundle/ (all parsers that handle scientific data sets: grib, isatab, gdal, hdf, netcdf, geoinfo, dif...much hand-waving...input, Chris?)

tika-nativelib-parser-bundle/ (sqlite...any others at the moment? all parsers 
that rely on native libs...unfortunately, this doesn't fit well thematically...)

tika-advanced-bundle/ (all parsers that rely on nlp or other advanced techniques for extraction of information...
                these aren't really just pulling text and metadata out, but are operating on the text/metadata
                once it has been pulled out.  We may need separate bundles for each?)
        Tika-nlp-parser-bundle/ (ctakes, phone number, geo.topic, grobid(?) etc.
                ...or maybe we want separate bundles for each?)
        Tika-sentiment-parser-bundle (imaginary...?)
        Tika-object-parser-bundle
        
Where to put?
        font parser
        executable
        mat
        prt
        strings


Cheers,
Tim



-----Original Message-----
From: Bob Paulin [mailto:[email protected]]
Sent: Tuesday, August 04, 2015 8:56 AM
To: [email protected]
Subject: Re: [DISCUSS] A more modular parser project

So I just tried adding a META-INF/services/org.apache.tika.parser.Parser
file to each bundle in the straw man implementation and it seemed to do
the trick. Looks like the ServiceLoader code searches the classloader
for all of these files and iterates through them to pick up each jar's
META-INF/services/org.apache.tika.parser.Parser entries and adds them to
the list.  I've updated the code on github to include one per bundle.
This might be the way to go.

ex.
https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles/tika-image-parser-bundle/src/main/resources/META-INF/services
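
For anyone who wants to see the mechanism in isolation, below is a minimal sketch of how ServiceLoader picks up one META-INF/services file per classpath entry. It does not use Tika at all: the nested Parser interface is a hypothetical stand-in for org.apache.tika.parser.Parser, the two parser classes are invented, and temp directories stand in for the bundle jars.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class BundleDiscoveryDemo {

    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    public interface Parser {
        String name();
    }

    /** Invented provider, pretending to live in an "image" bundle. */
    public static class ImageParser implements Parser {
        public String name() { return "image"; }
    }

    /** Invented provider, pretending to live in a "pdf" bundle. */
    public static class PdfParser implements Parser {
        public String name() { return "pdf"; }
    }

    /** Builds a temp directory that mimics one bundle jar with its own services file. */
    static URL fakeBundle(String label, Class<? extends Parser> impl) throws IOException {
        Path root = Files.createTempDirectory(label);
        Path services = Files.createDirectories(root.resolve("META-INF/services"));
        // The file is named after the service type and lists provider class names.
        Files.write(services.resolve(Parser.class.getName()),
                (impl.getName() + "\n").getBytes(StandardCharsets.UTF_8));
        return root.toUri().toURL();
    }

    /** Loads every Parser advertised by either fake bundle and returns their names. */
    static List<String> discover() {
        try {
            URL[] bundles = {
                fakeBundle("image-bundle", ImageParser.class),
                fakeBundle("pdf-bundle", PdfParser.class)
            };
            List<String> names = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    bundles, BundleDiscoveryDemo.class.getClassLoader())) {
                // ServiceLoader reads each matching services file visible to the loader.
                for (Parser p : ServiceLoader.load(Parser.class, loader)) {
                    names.add(p.name());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

Both parsers should show up in the result even though each one is declared in a separate services file, which matches what I saw with one file per bundle.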


- Bob

On 8/3/2015 9:21 PM, Allison, Timothy B. wrote:
+1 to moving the source to bundles.  I think for a 2.0 it would be easier
to consolidate into a parser uber jar than trying to tease things out
like I did in the straw man impl. However deciding how to break things
up might take some experimentation.

Y, and the strawman is a great easy entry down this path towards 2.0.  I think 
the main hangup will be coming to consensus about granularity and nature of the 
packages, but we can burn that bridge when we get to it.  There are some 
dependencies between parsers, but we can work through that.

1) To spin up the GUI you need org.apache.tika.parser.util (perhaps
consider moving this up to core).
Y, I put that in tika-parsers because it relies on commons codec, and I wanted 
to keep that dependency out of tika-core.  But, I'm willing to add it to 
tika-core if there aren't objections.


2) Since the META-INF/services/org.apache.tika.parser.Parser is in
tika-parser we'd need to rethink the static ServiceLoader strategy to
either always be dynamic or figure out a way to have each jar bring
their own static loader.

Hmmm...is there a way to specify this in one overall tika-config file or in 
separate configs in each bundle (yuck)...

