Hi,

On Tue, Jan 27, 2009 at 12:39 AM, Jonathan Koren <[email protected]> wrote:
> On Jan 26, 2009, at 2:15 PM, Jukka Zitting wrote:
>> Cool! However, see http://markmail.org/message/rgesbchrufeauxnw for a
>> discussion on how complex a parser implementation within Tika can
>> become until it would be better to look for (or create) an external
>> parser library for that format.
>
> I particularly liked the part where the example given as a good enough
> parser was the very parser I singled out. :)
>
> So the takeaway is "Don't be PDFBox," and "Don't be afraid to add yet
> another dependency, if reimplementing is easy?"

Yeah. If you can do something reasonable with at most a few hundred lines of code, then it's OK to have it in Tika. But as soon as you go beyond that, the effort is better spent contributing to an external parser library and using the result in Tika.

BR,

Jukka Zitting
