On Thu, 19 Jun 2014, Ray Gauss wrote:
> The point of a tika-parsers-all artifact would be a single dependency
> that re-aggregates everything so that downstream projects could work the
> same way they do now and not worry about missing dependencies.
> What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
We already have users confused by the current split between tika-core and
tika-parsers - see the users list for examples. We already have users
confused by which dependencies they need with the current pom setup.
Splitting is going to make that a lot worse. (POI, as a related example,
already sees plenty of confused users with mismatched jars and the
problems that result.)
We have previously tried pushing parsers out of the tika parsers jar and
into other jars, e.g. ones maintained by external groups, but on the whole
it hasn't been a great success. Keeping them in sync, dealing with
different release cycles, applying updates, keeping them consistent, and
building in a sensible length of time would all be harder with a pile of
modules.
If we were to split things out to the level needed by some of the use
cases mentioned, we'd have so many parser modules that it'd be a nightmare
to maintain, and it would cause the problems mentioned above. (People in
other threads have cautioned about these problems.) If we split into just
a handful of sub-modules, then many of the use cases mentioned would still
have to do work to pick out the bits they need.
I still believe that the main use case of tika is "everything included",
and that's especially the beginner's use case, so I think we should focus
on keeping that easy. Peeling out just some bits feels like an advanced
use case to me, so I'd rather we put the effort onto those folks than
onto newbies and people with typical uses. I'd therefore much rather we
provide advanced docs/help on excluding some bits than pull things out
into a pile of different modules.
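As a rough sketch of what such docs might show, an advanced user could keep
the single tika-parsers dependency and exclude a transitive library they
don't need. The version and the excluded artifact below are assumptions
chosen for illustration; the real coordinates should be checked with
"mvn dependency:tree" against the Tika version in use.

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <!-- Assumed version, for illustration only -->
  <version>1.5</version>
  <exclusions>
    <!-- Illustrative exclusion: drop the NetCDF library, assuming
         scientific data formats are not needed in this project -->
    <exclusion>
      <groupId>edu.ucar</groupId>
      <artifactId>netcdf</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Any parser that relies on an excluded library will no longer work for its
formats, which is the trade-off the advanced user takes on.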
Nick