Hi, thanks for the information. I am trying to understand your workflow, but I get an error when I test it:

org.apache.nifi.processor.exception.ProcessException: javax.script.ScriptException: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script36800.groovy: 15: unable to resolve class PDFTextStripper
 @ line 15, column 9.
   def s = new PDFTextStripper()

I downloaded pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox inside my download folder. I then set the Module Directory property of the ExecuteScript processor to this folder. I did not change anything else, but I still get this error. Do you have any hints? That would be great.

To be honest, I am totally new to Groovy, and I also did not understand in detail what happens here:

flowFile = session.write(flowFile, {inputStream, outputStream ->
    doc = PDDocument.load(inputStream)
    info = doc.getDocumentInformation()
    s.writeText(doc, new OutputStreamWriter(outputStream))
} as StreamCallback
)

Thanks for your help.
BR
Ralf

> On 20.02.2016 at 16:44, Matt Burgess <mattyb...@gmail.com> wrote:
>
> I have a blog post on how to do this with NiFi using a Groovy script in the
> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
>
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>
> Jython is also supported but can't yet use Java libraries (it uses Jython
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript,
> JRuby) can use Java libraries like Tika and PDFBox.
>
> Regards,
> Matt
>
> On Feb 20, 2016, at 10:31 AM, Ralf Meier <n...@cht3.com> wrote:
>
>> Hi Everybody,
>>
>> I'm new to NiFi and I want to find out if it is possible to extract content
>> and metadata from PDFs using a library like Tika.
>> My first idea was to use the following processors:
>> - GetFile (watch a specific folder)
>> - IdentifyMimeType (identify whether the file is of type application/pdf)
>> - RouteOnAttribute (if it is a PDF)
>> - ExecuteStreamCommand:
>>   I changed the following settings:
>>   Command Arguments: {flowfilw_contents}
>>   Command Path: tika-python parse all
>>
>> I use the Python Tika wrapper from
>> https://github.com/chrismattmann/tika-python
>>
>> But it is not working.
>> Does anybody have an idea how to use Tika to extract the content and the
>> metadata using NiFi, or what I am doing wrong?
>>
>> Thanks for your help.
>> BR
>> Ralf
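
A note on the session.write(...) call asked about in the reply at the top of this thread: in Groovy, `{ inputStream, outputStream -> ... } as StreamCallback` turns the closure into an object implementing the one-method StreamCallback interface. The session then calls that method, passing the flow file's current content as the input stream and collecting whatever is written to the output stream as the new content. Below is a minimal Java sketch of that pattern, not NiFi code. It assumes Java 9+. The nested `StreamCallback` interface only mirrors the shape of NiFi's `process(InputStream, OutputStream)` method, and `write` and `demo` are hypothetical stand-ins for `session.write` and for the script body. Upper-casing the text stands in for the PDFBox text extraction.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class StreamCallbackSketch {

    // Stand-in with the same shape as NiFi's StreamCallback (assumption: simplified copy).
    public interface StreamCallback {
        void process(InputStream in, OutputStream out) throws IOException;
    }

    // Hypothetical stand-in for session.write(flowFile, callback):
    // hands the old content to the callback and returns what the callback wrote.
    public static byte[] write(byte[] oldContent, StreamCallback callback) throws IOException {
        try (InputStream in = new ByteArrayInputStream(oldContent);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            callback.process(in, out);
            return out.toByteArray();
        }
    }

    // Hypothetical demo: replaces the content with an upper-cased copy.
    public static String demo() {
        byte[] original = "pretend this is a pdf".getBytes(StandardCharsets.UTF_8);
        try {
            byte[] replaced = write(original, (in, out) -> {
                // The Groovy script reads the PDF from the input stream at this point.
                String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                // The Groovy script writes the extracted text to the output stream at this point.
                Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
                writer.write(text.toUpperCase(Locale.ROOT));
                writer.flush(); // push buffered characters into the output stream
            });
            return new String(replaced, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "PRETEND THIS IS A PDF"
    }
}
```

The lambda passed to `write` plays the role of the Groovy closure. Neither one runs on its own: the caller (here `write`, in NiFi the session) decides when to invoke it and supplies both streams.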