I am wondering if I am using Tika for purposes it was not aimed at. I am 
beginning to think that its main aim is to extract text from documents, whereas 
I really want to get the entire structure of a document and extract any or all 
pieces from it. For instance, when parsing a PDF that has embedded streams, I 
want to be able to extract each embedded stream (for instance, JavaScript). 
PDFBox can do this, but such information does not turn up in a ContentHandler 
passed to Tika.

If I want to do more than just get the text, should I use the underlying 
parsers directly rather than trying to abstract them with Tika?

Many thanks,

Jim
