Re: [DISCUSS] A more modular parser project

Bob Paulin Sat, 15 Aug 2015 07:33:48 -0700

Hi,

So, just to understand the breakdown.  When you say:


tika-classic-parser-bundle/
        Tika-office-parser-bundle/ (including microsoft, opendocument, pst,
                rtf, iwork? Has dependency on html/text)
        Tika-pdf-parser-bundle/
        Tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml, kml,
                feed, iptc, crypto, etc?)
        Tika-sourcecode-parser-bundle (parsers that handle source code)
        Tika-package-parser-bundle (all zip/tar/etc)

Does that indicate 6 bundles? 5 individual bundles that could wrap into 1 uberjar? Breaking things down at different levels will add to the maintenance effort, so it may be better to start with broad strokes like tika-classic-parser-bundle. But if we just created a tika-classic-parser-bundle, are we attempting to group the bundles by type of use case? I think this approach is fine, but it does mean we're taking an opinion on what most of Tika's basic users want for simple use cases.

Another approach could be grouping the parsers by similar dependencies, which I think the tika-multimedia-parser-bundle does fairly well. From a dependency management perspective this is desirable. I've used tools like JDepend to break down which packages use which dependencies. The package-based dependencies within tika-parsers can also be seen here in Sonar:


http://nemo.sonarqube.org/design/index/253571

With respect to bundles that don't fit, perhaps those live on their own until an obvious grouping emerges. It's much harder to remove something from a bundle than to add it later. I think this may apply to native bundles too.


- Bob

On 8/4/2015 8:32 AM, Allison, Timothy B. wrote:

Bob,
   Thank you, again.  This looks promising at first glance!

To continue down the strawman path and to start discussion on the elephant in 
the room...

We'd want bundles that allow enough control for users but aren't too much of a 
hassle to configure.  There will be trade-offs.

So, what do we think of this strawman for proposed bundles:

tika-classic-parser-bundle/
        Tika-office-parser-bundle/ (including microsoft, opendocument, pst,
                rtf, iwork? Has dependency on html/text)
        Tika-pdf-parser-bundle/
        Tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml, kml,
                feed, iptc, crypto, etc?)
        Tika-sourcecode-parser-bundle (parsers that handle source code)
        Tika-package-parser-bundle (all zip/tar/etc)

tika-multimedia-parser-bundle/  (parsers that pull metadata out of image, 
audio, audio+video files)
        Tika-image-parser-bundle
        Tika-image-ocr-parser-bundle
        Tika-audio-parser-bundle
        Tika-video-parser-bundle

tika-scientific-parser-bundle/ (all parsers that handle scientific data sets
(grib, isatab, gdal, hdf, netcdf, geoinfo, dif...much hand-waving...input, Chris?))

tika-nativelib-parser-bundle/ (sqlite...any others at the moment? all parsers 
that rely on native libs...unfortunately, this doesn't fit well thematically...)

tika-advanced-bundle/ (all parsers that rely on nlp or other advanced
techniques for extraction of information...these aren't really just pulling
text and metadata out, but are operating on the text/metadata once it has
been pulled out.  We may need separate bundles for each?)
        Tika-nlp-parser-bundle/ (ctakes, phone number, geo.topic, grobid(?) etc.
                ...or maybe we want separate bundles for each?)
        Tika-sentiment-parser-bundle (imaginary...?)
        Tika-object-parser-bundle
        
Where to put?
        font parser
        executable
        mat
        prt
        strings


Cheers,

Tim

-----Original Message-----
From: Bob Paulin [mailto:[email protected]]
Sent: Tuesday, August 04, 2015 8:56 AM
To: [email protected]
Subject: Re: [DISCUSS] A more modular parser project

So I just tried adding a META-INF/services/org.apache.tika.parser.Parser
file to each bundle in the straw man implementation and it seemed to do
the trick. Looks like the ServiceLoader code searches the classloader
for all of these files and iterates through them to pick up each jar's
META-INF/services/org.apache.tika.parser.Parser entries and adds them to
the list.  I've updated the code on github to include one per bundle.
This might be the way to go.

ex.
https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles/tika-image-parser-bundle/src/main/resources/META-INF/services
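For anyone who wants to see which entries would be picked up, here is a rough sketch of the lookup described above. It is not Tika's actual ServiceLoader code; the class name ServiceFileScan and its methods are made up for illustration. It only assumes that each bundle jar on the classpath ships its own META-INF/services/org.apache.tika.parser.Parser file, and it uses the standard ClassLoader.getResources call to visit every copy.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Tika: lists the provider class names
// declared in every copy of a service file visible to a class loader.
public class ServiceFileScan {

    // Resource name from the thread; each bundle jar is assumed to ship one.
    public static final String PARSER_SERVICE_FILE =
            "META-INF/services/org.apache.tika.parser.Parser";

    // Extracts class names from the lines of one provider-configuration file:
    // text after '#' is a comment, blank lines and duplicates are skipped.
    public static List<String> parseProviderNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty() && !names.contains(line)) {
                names.add(line);
            }
        }
        return names;
    }

    // Visits every resource with the given name (one per jar that has it)
    // and merges the declared class names into a single list.
    public static List<String> scan(ClassLoader loader, String resource) {
        List<String> all = new ArrayList<>();
        try {
            Enumeration<URL> urls = loader.getResources(resource);
            while (urls.hasMoreElements()) {
                URL url = urls.nextElement();
                List<String> lines = new ArrayList<>();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        lines.add(line);
                    }
                }
                for (String name : parseProviderNames(lines)) {
                    if (!all.contains(name)) {
                        all.add(name);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return all;
    }

    public static void main(String[] args) {
        // Prints nothing if no jar on the classpath ships the service file.
        for (String name : scan(ClassLoader.getSystemClassLoader(), PARSER_SERVICE_FILE)) {
            System.out.println(name);
        }
    }
}
```

Running main with the bundle jars on the classpath should print one line per declared parser class, which is a quick way to check that every bundle's entries are visible.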

- Bob

On 8/3/2015 9:21 PM, Allison, Timothy B. wrote:

+1 to moving the source to bundles.  I think for a 2.0 it would be easier
to consolidate into a parser uber jar than trying to tease things out
like I did in the straw man impl. However deciding how to break things
up might take some experimentation.

Y, and the strawman is a great easy entry down this path towards 2.0.  I think 
the main hangup will be coming to consensus about granularity and nature of the 
packages, but we can burn that bridge when we get to it.  There are some 
dependencies between parsers, but we can work through that.

1) To spin up the GUI you need org.apache.tika.parser.util (perhaps
consider moving this up to core).

Y, I put that in tika-parsers because it relies on commons codec, and I wanted 
to keep that dependency out of tika-core.  But, I'm willing to add it to 
tika-core if there aren't objections.

2) Since the META-INF/services/org.apache.tika.parser.Parser is in
tika-parser we'd need to rethink the static ServiceLoader strategy to
either always be dynamic or figure out a way to have each jar bring
their own static loader.

Hmmm...is there a way to specify this in one overall tika-config file or in 
separate configs in each bundle (yuck)...
