Sergey, your point is well taken.

Yes, you'd need most parsers, but you can _probably_ live without advanced or 
scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely document 
this well, though!

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Thursday, September 15, 2016 12:15 PM
To: dev@tika.apache.org
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts of 
embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT, which offers a nice option 
for users to pick up only individual parsers. So I've added PDFParser & 
OpenDocumentParser and tika-core to the project dependencies, and everything 
works very nicely when I submit a simple PDF to the demo.

But if I were to write code which can handle the embedded attachments 
really well, then I think I'll probably need to revert to depending on all of 
tika-parsers - otherwise how would I know which additional parser modules I 
should add? If this reasoning is right, then one can only use individual 
modules in production if it is well known that the files to be processed will 
have no unexpected formats embedded in them...
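
To make the concern concrete, here is a toy sketch in plain Java. It does not 
use any Tika API: the class EmbeddedCoverage, the method unsupported() and the 
sample media types are all assumptions made up for illustration. It only shows 
that, with a fixed set of parsers on the classpath, any embedded type outside 
that set goes unhandled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; none of these names are Tika classes.
class EmbeddedCoverage {

    /** Returns the embedded media types that no available parser supports, in input order, without repeats. */
    static List<String> unsupported(Set<String> supportedTypes, List<String> embeddedTypes) {
        List<String> missing = new ArrayList<>();
        for (String type : embeddedTypes) {
            if (!supportedTypes.contains(type) && !missing.contains(type)) {
                missing.add(type);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: only PDF and OpenDocument parsers were added to the project.
        Set<String> supported = new HashSet<>(Arrays.asList(
                "application/pdf", "application/vnd.oasis.opendocument.text"));
        // Assumption: media types of the attachments found inside one PDF.
        List<String> embedded = Arrays.asList(
                "application/pdf", "image/jpeg", "application/msword");
        System.out.println("No parser available for: " + unsupported(supported, embedded));
    }
}
```

The point of the sketch is that the "missing" list is only known after the 
files have been seen, which is why picking modules up front is hard.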

I've been wondering - would it make sense, for Tika 2.0, to add a few more 
'helper' modules for the most used formats, which would offer less than 
tika-parsers but more than individual modules, for example:

this is what 2.x already has:


tika-parser-modules/
   tika-parser-pdf-module
   (individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

tika-parser-pdf-module-all would depend on tika-parser-pdf-module plus the 
parsers needed to process various PDF attachments. This list of extra deps 
would be based on the accumulated knowledge. Similarly for a few other most 
used formats.

tika-parser-pdf-module-all will be a 'compromise': it will pull in more modules 
than tika-parser-pdf-module but significantly fewer than tika-parsers.
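
The 'compromise' can be sketched as a small dependency graph, again in plain 
Java with no Tika or Maven API involved. Only the names tika-core, 
tika-parser-pdf-module and tika-parser-pdf-module-all come from this message; 
example-image-module, example-office-module and every dependency edge below 
are hypothetical placeholders, not the real module layout.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only; the graph is an assumption, not Tika's actual build.
class AggregateModuleSketch {

    // Hypothetical "module -> direct dependencies" map.
    static final Map<String, List<String>> DEPENDS_ON = new HashMap<>();
    static {
        DEPENDS_ON.put("tika-parser-pdf-module", Arrays.asList("tika-core"));
        DEPENDS_ON.put("example-image-module", Arrays.asList("tika-core"));
        DEPENDS_ON.put("example-office-module", Arrays.asList("tika-core"));
        DEPENDS_ON.put("tika-parser-pdf-module-all", Arrays.asList(
                "tika-parser-pdf-module", "example-image-module", "example-office-module"));
    }

    /** Collects a module and everything it pulls in transitively. */
    static Set<String> resolve(String module) {
        Set<String> resolved = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(module);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (resolved.add(current)) { // skip modules that were already visited
                for (String dep : DEPENDS_ON.getOrDefault(current, Collections.<String>emptyList())) {
                    pending.push(dep);
                }
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        System.out.println(resolve("tika-parser-pdf-module"));
        System.out.println(resolve("tika-parser-pdf-module-all"));
    }
}
```

In this model the "-all" module resolves to a superset of what the plain PDF 
module pulls in, while still leaving out anything that is not listed for it.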


Cheers, Sergey

