Hi All
During ApacheCon, I had a chance to chat with Sergey about the "subset of
Tika Parsers" issue that bubbles up from time to time. The chat went
well, and I think we both now have a better idea of the other's needs and
concerns, which is good :)
As shows up on our list from time to time, and more commonly elsewhere, we
have some users who are already confused by the split between tika-core
and tika-parsers. Anything that fragments things further is going to cause
more issues for that kind of user.
On the other hand, there are potential users out there who want just a
handful of parsers, in a way that is simple and small, who don't know a
lot about Tika yet, and perhaps aren't Maven or OSGi gurus. Many of those
are using OSGi, but not all.
One suggested solution is to just document which dependencies of
tika-parsers can be excluded at the Maven level to disable certain parsers
and shrink the resulting dependency tree. However, that requires manual
updates and manual checking and, like our examples on the website, risks
getting out of date without automated checking.
Discussion then turned to our move to get all the examples for the website
into svn, with unit tests, and having the website pull those from svn on
the fly to always get the latest tested version.
That led to an idea. Not sure if it'll work yet, but...
What about having multiple Tika OSGi bundles? Continue with the "full"
bundle as now, but also have ones for "pdf", "microsoft office", "images"
etc. OSGi users (e.g. CXF users) could then opt to depend on pdf+image if
they only wanted a handful of parsers, or the full one as now.
The smart bit: we have unit tests for these smaller bundles. These unit
tests ensure that the desired parsers still work on their smaller bundle.
They also ensure that unwanted parsers don't work, thus flagging up if
extra dependencies have snuck through.
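
To make the "unwanted parsers don't work" half concrete, here is a rough
sketch, assuming the test runs with only the small bundle's dependencies on
its classpath. The class BundleClasspathCheck and its methods are made up
for illustration and are not part of Tika; the PDFBox and POI class names
are just example probes. A real test would more likely go through Tika's
own parser lookup and try to parse sample files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika.
class BundleClasspathCheck {

    // True if the named class can be located (without initialising it).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                    BundleClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the classes from 'unwanted' that are present anyway,
    // i.e. dependencies that have leaked into the small bundle.
    static List<String> leaked(List<String> unwanted) {
        List<String> found = new ArrayList<>();
        for (String className : unwanted) {
            if (isOnClasspath(className)) {
                found.add(className);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Example probes for a "pdf only" bundle (assumed, not a real list).
        String wanted = "org.apache.pdfbox.pdmodel.PDDocument";
        List<String> unwanted = Arrays.asList(
                "org.apache.poi.hssf.usermodel.HSSFWorkbook");

        System.out.println("wanted present: " + isOnClasspath(wanted));
        System.out.println("leaked: " + leaked(unwanted));
    }
}
```

A unit test for the pdf bundle would then assert that the wanted probe is
found and that leaked(...) comes back empty.
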
Finally, we pull out the includes/excludes information that went into the
bundle, and display that for non-OSGi users. A non-OSGi person wanting
"tika with pdf only" could then look at what the tika-pdf-bundle does and
doesn't use, and from that know which Maven-level dependencies to keep and
which to exclude.
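
As a rough sketch of that last step, assuming each bundle records the
artifacts it embeds as plain names: the exclusion list is just the full
dependency set minus what the small bundle keeps. ExclusionList is a
made-up name, and the artifact names below are examples only, not the real
tika-parsers dependency list.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Tika or its build.
class ExclusionList {

    // Everything the full parser set depends on, minus what the small
    // bundle keeps, is what a Maven user would need to exclude.
    static SortedSet<String> toExclude(Collection<String> fullDeps,
                                       Collection<String> bundleDeps) {
        SortedSet<String> excluded = new TreeSet<>(fullDeps);
        excluded.removeAll(bundleDeps);
        return excluded;
    }

    public static void main(String[] args) {
        // Example artifact names only; not the real tika-parsers list.
        Collection<String> full = Arrays.asList(
                "pdfbox", "poi", "poi-ooxml", "metadata-extractor");
        Collection<String> pdfOnly = Arrays.asList("pdfbox");

        for (String artifact : toExclude(full, pdfOnly)) {
            System.out.println("exclude: " + artifact);
        }
    }
}
```

The output of something like this could be what gets published on the
website next to each bundle, regenerated on every build.
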
This new plan would mean having to tweak our build to support multiple
bundles, and potentially tweaking our bundles so that you could load
tika-pdf + tika-image and have those two play nicely together. It'd also
need some new unit tests, and the work to figure out what to
include/exclude for each of our handful of "common" cases. It should,
however, deliver a way for OSGi and non-OSGi people to get just a subset
if that's all they want.
Can anyone see a flaw with this plan? Anyone see a better way? Anyone want
to help? :)
Nick