I forgot to answer "If there are other components that you'd like to have extracted, let us know, and we'll consider adding them." I am using Tika to extract, and later scan/process, the components of any document that may perform malicious actions. That means any script-like or macro-like construct, plus any binary data, embedded images and so forth. So essentially I need to break down all components of all documents, which is a tall order of course. But it seems like the collection of parsers that Tika provides is my best bet, and I can either add enhancements to them or maybe add new parsers.
For instance it seems that Flash is only supported via flv. There is what looks like a good parser here: https://www.free-decompiler.com/flash/ and perhaps a thing for me to do would be to add abstraction support for this parser to Tika.

Jim

From: Jim Idle [mailto:[email protected]]
Sent: Thursday, November 3, 2016 10:11
To: [email protected]
Subject: RE: PDF Processing

PDAction extraction is probably what I need. Embedded streams in general, though for non-text "pieces" it would be fine to get offset and length information from some event. I will take a look at your example output below.

I'll press on with Tika as an abstraction for now as I generally like what I see. I am just a bit worried that the one abstraction to rule them all may preclude me from easily handling more esoteric parts of some document formats.

I presume that the best way to request enhancements is to create a JIRA entry so it can be tracked?

Thanks for your help,

Jim

From: Allison, Timothy B. [mailto:[email protected]]
Sent: Wednesday, November 2, 2016 19:02
To: [email protected]
Subject: RE: PDF Processing

It depends (tm). As soon as 1.14 is released, I'll add PDAction extraction from PDFs (TIKA-2090), and that will include javascript (as stored in PDActions)... that capability doesn't currently exist. If there are other components that you'd like to have extracted, let us know, and we'll consider adding them.
If you want a look at what javascript extraction will look like, I recently extracted ~70k javascript elements from our 500k regression corpus: http://162.242.228.174/embedded_files

specifically: http://162.242.228.174/embedded_files/js_in_pdfs.tar.bz2

> entire structure of a document and extract any or all pieces from it.

Within reason(tm), that _is_ the goal of Tika. The focus is text, but we try to maintain some structural information where we can, e.g. bold/italic/lists and paragraph boundaries in MSOffice and related formats. We do not do full stylistic extraction (font name, size, etc), but the general formatting components that apply across formats, we try to maintain.

From: Jim Idle [mailto:[email protected]]
Sent: Wednesday, November 2, 2016 3:30 AM
To: [email protected]
Subject: PDF Processing

I am wondering if I am using Tika for purposes it was not aimed at. I am beginning to think that its main aim is to extract text from documents, whereas I really want to get an entire structure of a document and extract any or all pieces from it.

For instance when parsing a PDF, if it has embedded streams, I want to be able to extract the embedded stream (for instance a JavaScript). PDFBox can do this, but such information does not turn up in a ContentHandler passed to Tika.
If I want to do more than get just the text, should I really use the underlying parsers directly and not try to abstract them using Tika?

Many thanks,

Jim
