Hi All

During ApacheCon, I had a chance to chat with Sergey about the "subset of Tika Parsers" issue that bubbles up from time to time. It seemed to work well, and I think we both now have a better idea of the other's needs and concerns, which is good :)

As comes up on our list from time to time, and more commonly elsewhere, some of our users are already confused by the split between tika-core and tika-parsers. Anything that fragments things further is going to cause more issues for that kind of user.

On the other hand, there are potential users out there who want just a handful of parsers, in a simple, easy and small package, who don't know a lot about Tika yet, and perhaps aren't Maven or OSGi gurus. Many of those are using OSGi, but not all.

One suggested solution is to just document which dependencies of tika-parsers can be excluded at the Maven level to disable certain parsers and shrink the resulting dependency tree. However, that requires manual updates and manual checking and, like our examples on the website, risks getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the website into svn, with unit tests, and having the website pull those from svn on the fly to always get the latest tested version.


That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the "full" bundle as now, but also have ones for "pdf", "microsoft office", "images" etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit tests ensure that the desired parsers still work on their smaller bundle. These unit tests also ensure that unwanted parsers don't work, thus flagging up if extra dependencies have snuck through.
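
As a very rough sketch of what such a check might look like: the helper below only tests whether named classes can be loaded from wherever it is running, which is weaker than "the parser works" (a real test would also parse a sample file). `BundleSubsetCheck` and its method names are hypothetical, not existing Tika code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tika): checks which classes are loadable. */
class BundleSubsetCheck {

    /** Returns true if the named class can be loaded by this class's own class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: we only care whether the class is present
            Class.forName(className, false, BundleSubsetCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Wanted classes that could NOT be loaded; should be empty for a healthy bundle. */
    static List<String> missing(List<String> wanted) {
        List<String> result = new ArrayList<>();
        for (String name : wanted) {
            if (!isLoadable(name)) {
                result.add(name);
            }
        }
        return result;
    }

    /** Unwanted classes that COULD be loaded; non-empty means extra dependencies leaked in. */
    static List<String> leaked(List<String> unwanted) {
        List<String> result = new ArrayList<>();
        for (String name : unwanted) {
            if (isLoadable(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

For a pdf-only bundle, the test might pass something like `org.apache.tika.parser.pdf.PDFParser` as wanted and `org.apache.tika.parser.microsoft.OfficeParser` as unwanted, then fail the build if either `missing(...)` or `leaked(...)` comes back non-empty. Which classes go on each list is exactly the per-bundle work that would still need doing.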

Finally, we pull out the includes/excludes information that went into the bundle, and display that for non-OSGi users. A non-OSGi person wanting "tika with pdf only" could then look at what the tika-pdf-bundle does and doesn't use, and from that know which Maven-level dependencies to keep and which to exclude.
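
The "what to exclude" list is essentially a set difference between everything tika-parsers pulls in and what the smaller bundle embeds. A minimal sketch, assuming both lists are available as plain `groupId:artifactId` strings (the `SubsetDependencies` class is hypothetical, and how the two lists get extracted from the build is left open):

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of Tika): derives Maven exclusions for a parser subset. */
class SubsetDependencies {

    /**
     * Returns the dependencies of the full parsers module that the subset
     * bundle does not use, sorted for stable display on the website.
     */
    static Set<String> toExclude(Set<String> fullDeps, Set<String> bundleDeps) {
        Set<String> result = new TreeSet<>(fullDeps);
        result.removeAll(bundleDeps);
        return result;
    }
}
```

So if the full set were, purely for illustration, `org.apache.pdfbox:pdfbox` and `org.apache.poi:poi`, and the pdf bundle embedded only the first, the result would be `org.apache.poi:poi`, which is what a "tika with pdf only" user would add as a Maven exclusion.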


This new plan would mean having to tweak our build to support multiple bundles, and potentially tweaking our bundles so that you could load tika-pdf + tika-image and have those two play nicely together. It'd also need some new unit tests, and the work to figure out what to include/exclude for each of our handful of "common" cases. It should, however, deliver a way for OSGi and non-OSGi people to get just a subset if that's all they want.

Can anyone see a flaw with this plan? Anyone see a better way? Anyone want to help? :)

Nick
