I am wondering whether I am using Tika for purposes it was not designed for. I am beginning to think that its main aim is to extract text from documents, whereas I really want to get the entire structure of a document and extract any or all pieces from it. For instance, when parsing a PDF that has embedded streams, I want to be able to extract the embedded stream (for instance, a piece of JavaScript). PDFBox can do this, but such information does not turn up in the ContentHandler passed to Tika.
If I want to do more than just get the text, should I use the underlying parsers directly rather than trying to abstract them through Tika? Many thanks, Jim
