Re: Using Apache Nifi and Tika to extract content from pdf

Russell Whitaker Sat, 20 Feb 2016 11:19:10 -0800

Don't forget Clojure as well. 

Russell Whitaker
Sent from my iPhone


> On Feb 20, 2016, at 7:44 AM, Matt Burgess <mattyb...@gmail.com> wrote:
> 
> I have a blog post on how to do this in NiFi with a Groovy script in the
> ExecuteScript processor (new in 0.5.0), using PDFBox instead of Tika:
> 
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
> 
> Jython is also supported but can't yet use Java libraries (it uses Jython 
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript, 
> JRuby) can use Java libraries like Tika and PDFBox.
> 
> Regards,
> Matt
> 
> Sent from my iPhone
> 
>> On Feb 20, 2016, at 10:31 AM, Ralf Meier <n...@cht3.com> wrote:
>> 
>> Hi Everybody, 
>> 
>> I’m new to NiFi and I want to find out whether it is possible to extract content
>> and metadata from PDFs using a library like Tika.
>> My first idea was to use the following processors:
>> - GetFile (Watch a specific Folder)
>> - IdentifyMimeType (Identify whether the file is of type application/pdf)
>> - RouteOnAttribute (If it is a pdf)
>> - ExecuteStreamCommand:
>>      I changed the following settings.
>>      Command Arguments: {flowfilw_contents}
>>      Command Path: tika-python parse all
>>      
>> I use the python tika wrapper from 
>> (https://github.com/chrismattmann/tika-python)
>> 
>> But it is not working.
>> Does anybody have an idea how to use Tika to extract the content and the metadata
>> using NiFi, or what I’m doing wrong?
>> 
>> Thanks for your help.
>> BR 
>> Ralf
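
For the ExecuteStreamCommand approach Ralf describes, a minimal Python sketch
might look like the following. It rests on several assumptions: the tika-python
package is installed on the NiFi host, the processor pipes the flow file
content to the command's standard input, and this script is what the processor
is configured to run. The names extract and main are illustrative and are not
part of NiFi or tika-python. The script reads the PDF bytes from stdin, passes
them to tika-python's parser.from_buffer, and writes the metadata and text to
stdout as JSON.

```python
import json
import sys


def extract(pdf_bytes, parse=None):
    """Return {'metadata': ..., 'content': ...} for the given PDF bytes.

    `parse` is a hypothetical hook so the function can be exercised without
    a Tika server; by default it falls back to tika-python's from_buffer.
    """
    if parse is None:
        # Assumption: tika-python is installed (pip install tika).
        from tika import parser
        parse = parser.from_buffer
    parsed = parse(pdf_bytes) or {}
    return {
        # Tika may return None for either field, so normalise both.
        "metadata": parsed.get("metadata") or {},
        "content": (parsed.get("content") or "").strip(),
    }


def main():
    # Assumption: the flow file content arrives on stdin as raw bytes.
    pdf_bytes = sys.stdin.buffer.read()
    json.dump(extract(pdf_bytes), sys.stdout)


if __name__ == "__main__":
    main()
```

Run by hand, this would be something like "python extract_pdf.py < some.pdf",
where extract_pdf.py is a hypothetical file name. Reading from stdin avoids
having to pass the flow file content as a command argument.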
