Subsets of tika parsers redux

Nick Burch Sun, 23 Nov 2014 09:13:16 -0800

Hi All

During ApacheCon, I had a chance to chat with Sergey about the "subset of Tika Parsers" issue that bubbles up from time to time. It seemed to work well, and I think we both now have a better idea of the other's needs and concerns, which is good :)

As is shown on our list from time to time, but more commonly elsewhere, we have some users who are confused already by the split between tika-core and tika-parsers. Anything that fragments further is going to cause more issues for that kind of user.

On the other hand, there are potential users out there who want just a handful of parsers, in a simple and easy and small way, who don't know a lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those are using OSGi, but not all.

One suggested solution is to just document which dependencies of tika-parsers can be excluded at the maven level to disable certain parsers + shrink the resulting dependency tree. However, that requires manual updates and manual checking and, like our examples on the website, risks getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the website into svn, with unit tests, and having the website pull those from svn on the fly to always get the latest tested version.



That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the "full" bundle as now, but also have ones for "pdf", "microsoft office", "images" etc. OSGi users (e.g. CXF users) could then opt to depend on pdf+image if they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit tests ensure that the desired parsers still work on their smaller bundle. These unit tests also ensure that unwanted parsers don't work, thus flagging up if extra dependencies have snuck through.
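
A rough sketch of what such a check could look like, to make the idea concrete. The class name BundleSubsetCheck and its methods are made up for illustration and are not part of Tika. The sketch also only probes whether a parser's underlying library is loadable (PDFBox's PDDocument for pdf, POI's POIFSFileSystem for microsoft office); a real test would more likely parse a small sample file of each type inside the bundle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Tika: verifies that a reduced bundle
// exposes the libraries it should and none of the ones it should not.
class BundleSubsetCheck {

    // Returns a list of problems; an empty list means the subset is as expected.
    static List<String> check(List<String> wanted, List<String> unwanted,
                              Predicate<String> isLoadable) {
        List<String> problems = new ArrayList<>();
        for (String cls : wanted) {
            if (!isLoadable.test(cls)) {
                problems.add("missing: " + cls);
            }
        }
        for (String cls : unwanted) {
            if (isLoadable.test(cls)) {
                problems.add("leaked: " + cls);
            }
        }
        return problems;
    }

    // Default probe: can the class be found by this class's loader?
    static boolean onClasspath(String className) {
        try {
            Class.forName(className, false, BundleSubsetCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Expectations for an assumed "pdf only" bundle.
        List<String> wanted = Arrays.asList("org.apache.pdfbox.pdmodel.PDDocument");
        List<String> unwanted = Arrays.asList("org.apache.poi.poifs.filesystem.POIFSFileSystem");
        for (String problem : check(wanted, unwanted, BundleSubsetCheck::onClasspath)) {
            System.out.println(problem);
        }
    }
}
```

Passing the probe in as a Predicate keeps the comparison logic separate from how availability is detected, so the same check could be driven by an OSGi bundle's classloader instead of the plain classpath.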

Finally, we pull out the includes/excludes information that went into the bundle, and display that for non-OSGi users. A non-OSGi person wanting "tika with pdf only" could then look at what the tika-pdf-bundle does and doesn't use, and from that know what maven level dependencies to keep and which to exclude.
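
One way the "what to exclude" list might be derived, sketched under assumptions: take the dependencies of the full tika-parsers artifact and subtract whatever the smaller bundle includes. The ExclusionList class is hypothetical, and the three groupId:artifactId entries below are only an illustrative sample, not the actual dependency list of any Tika release.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Tika's build.
class ExclusionList {

    // Everything the full artifact pulls in that the subset bundle does not use.
    static Set<String> toExclude(Set<String> fullDeps, Set<String> bundleDeps) {
        Set<String> result = new LinkedHashSet<>(fullDeps); // keeps input order
        result.removeAll(bundleDeps);
        return result;
    }

    public static void main(String[] args) {
        // Illustrative sample only; a real list would come from the build.
        Set<String> full = new LinkedHashSet<>(Arrays.asList(
                "org.apache.pdfbox:pdfbox",
                "org.apache.poi:poi",
                "com.drewnoakes:metadata-extractor"));
        Set<String> pdfOnly = new LinkedHashSet<>(Arrays.asList(
                "org.apache.pdfbox:pdfbox"));

        for (String coordinate : toExclude(full, pdfOnly)) {
            System.out.println("exclude " + coordinate);
        }
    }
}
```

Generating this list from the same data that builds the bundle is what would keep the documentation from drifting out of date.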

This new plan would mean having to tweak our build to support multiple bundles, and potentially tweaking our bundles so that you could load tika-pdf + tika-image and have those two play nicely together. It'd also need some new unit tests, and the work to figure out what to include/exclude for each of our handful of "common" cases. It should, however, deliver a way for OSGi and non-OSGi people to get just a subset if that's all they want.

Can anyone see a flaw with this plan? Anyone see a better way? Anyone want to help? :)


Nick
