On Thu, 19 Jun 2014, Ray Gauss wrote:
The point of a tika-parsers-all artifact would be a single dependency that re-aggregates everything so that downstream projects could work the same way they do now and not worry about missing dependencies.

What’s the disadvantage for splitting things up (in a 2.0 timeframe)?

We already have users confused by the current split between tika-core and tika-parsers (see the users list for examples). We already have users confused about which dependencies they need with the current pom setup. Splitting is going to make that a lot worse. (POI, as a related example, sees plenty of confused users who've got mismatched jars and the problems that come with them.)

We have previously tried pushing parsers out of the Tika parsers jar and into other jars, e.g. ones maintained by external groups, but on the whole it hasn't been a great success. Keeping them in sync, dealing with different release cycles, applying updates, keeping them consistent, building in a sensible length of time: all of that would be harder with a pile of modules.

If we were to split things out to the level needed by some of the use cases mentioned, we'd have so many parser modules that it'd be a nightmare to maintain, and it would cause the problems mentioned above. (People in other threads have cautioned about these problems.) If we split into just a handful of sub-modules, then many of the use cases mentioned would still have to do work to pick out the bits they need.

I still believe that the main use case of Tika is "everything included", and that's especially the beginner's use case, so I think we should focus on keeping that easy. Peeling out just some bits feels like an advanced use case to me, so I'd rather we put the requirement for effort onto those folks, rather than onto newbies and people with the typical use cases. I'd therefore much rather we provide advanced docs/help on excluding some bits than pull things out into a pile of different modules.

Nick
