Sergey, your point is well taken. Yes, you'd need most parsers, but you can _probably_ live without advanced or scientific (sorry, Chris!).
I'd be hesitant to change the structure much. We should definitely document this well, though!

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 15, 2016 12:15 PM
To: dev@tika.apache.org
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts of embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT, which offers a nice option for users to pick up only individual parsers. So I've added PDFParser & OpenDocumentParser and tika-core to the project dependencies, and all works very nicely when I submit a simple PDF to the demo.

But if I were to write code that can handle the embedded attachments really well, then I think I'd probably need to revert to depending on all of tika-parsers; otherwise, how would I know which additional parser modules I should add?

If this reasoning is right, then one can only use individual modules in production if it is well known that the files to be processed will have no unexpected formats embedded in them...

I've been wondering: would it make sense, for Tika 2.0, to add a few more 'helper' modules for the most used formats, which would offer less than tika-parsers but more than individual modules? For example, this is what 2.x already has:

tika-parser-modules/
tika-parser-pdf-module
(individual parser modules for the most used ones)
tika-parsers (all of the parsers)

and now add:

tika-parser-pdf-module-all (or similarly named)

This tika-parser-pdf-module-all will depend on tika-parser-pdf-module plus the parsers needed to process various PDF attachments. This list of extra deps will be based on accumulated knowledge. Similarly for a few other most used formats.

tika-parser-pdf-module-all will be a 'compromise': it will pull in more modules than tika-parser-pdf-module but significantly fewer than tika-parsers.

Cheers, Sergey