I have a blog post on how to do this with NiFi using a Groovy script in the ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1

Jython is also supported, but it can't yet use Java libraries (it uses Jython scripts/modules instead). The other languages (Groovy, Lua, JavaScript, JRuby) can use Java libraries like Tika and PDFBox.

Regards,
Matt

> On Feb 20, 2016, at 10:31 AM, Ralf Meier <n...@cht3.com> wrote:
>
> Hi Everybody,
>
> I’m new to NiFi and I want to find out if it is possible to extract content
> and metadata from PDFs using a library like Tika.
> My first idea was to use the following processors:
> - GetFile (watch a specific folder)
> - IdentifyMimeType (identify if the file is of type application/pdf)
> - RouteOnAttribute (if it is a PDF)
> - ExecuteStreamCommand:
>   I changed the following settings.
>   Command Arguments: {flowfilw_contents}
>   Command Path: tika-python parse all
>
> I use the Python Tika wrapper from
> (https://github.com/chrismattmann/tika-python)
>
> But it is not working.
> Does somebody have an idea how to use Tika to extract the content and the
> metadata using NiFi, or what I’m doing wrong?
>
> Thanks for your help.
> BR
> Ralf