Re: Nifi 0.50 and GetKafka Issues

2016-02-20 Thread Juan Sequeiros
Excuse me, my message got cut short.
I point you to the state management section since you mention you are using
ZooKeeper and you are running NiFi in cluster mode.
The default cluster state provider is ZooKeeperStateProvider.
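
For reference, the cluster provider entry in conf/state-management.xml looks roughly like this (the values shown are illustrative defaults; adjust the connect string and root node for your environment):

```xml
<cluster-provider>
    <id>zk-provider</id>
    <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
    <!-- Illustrative values; point these at your own ZooKeeper ensemble -->
    <property name="Connect String">zookeeper-node1:2181,zookeeper-node2:2181</property>
    <property name="Root Node">/nifi</property>
    <property name="Session Timeout">10 seconds</property>
    <property name="Access Control">Open</property>
</cluster-provider>
```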

Thanks


On Sat, Feb 20, 2016 at 7:29 PM, Juan Sequeiros  wrote:

> Joshua,
>
> Did you edit the state-management.xml?
> Take a look at the state management section of the admin guide through
> the instance's "help" link.
>
>
> On Sat, Feb 20, 2016 at 6:47 PM, Oleg Zhurakousky <
> ozhurakou...@hortonworks.com> wrote:
>
>> Josh
>>
>> The only change that went in that is relevant to your issue is that we've
>> upgraded the client libraries to Kafka 0.9, and between 0.8 and 0.9 Kafka
>> introduced wire protocol changes that break compatibility.
>> I am still digging so stay tuned.
>>
>> Oleg
>>
>> On Feb 20, 2016, at 4:10 PM, West, Joshua  wrote:
>>
>> Hi Oleg and Joe,
>>
>> Kafka 0.8.2.1
>>
>> Attached is the app log with hostnames scrubbed.
>>
>> Thanks for your help.  Much appreciated.
>>
>> --
>> Josh West 
>> Bose Corporation
>>
>>
>>
>> On Sat, 2016-02-20 at 15:46 -0500, Joe Witt wrote:
>>
>> And also what version of Kafka are you using?
>> On Feb 20, 2016 3:37 PM, "Oleg Zhurakousky" 
>> wrote:
>>
>> Josh
>>
>> Any chance to attach the app log or relevant stack trace?
>>
>> Thanks
>> Oleg
>>
>> On Feb 20, 2016, at 3:30 PM, West, Joshua  wrote:
>>
>> Hi folks,
>>
>> I've upgraded from Nifi 0.4.1 to 0.5.0 and I am no longer able to use the
>> GetKafka processor.  I'm seeing errors like so:
>>
>> 2016-02-20 20:10:14,953 WARN
>> [ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0]
>> kafka.consumer.ConsumerFetcherThread
>> [ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0],
>> Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@7b49a642.
>> Possible cause: java.lang.IllegalArgumentException
>>
>> ^ Note: the hostname of the server has been scrubbed.
>>
>> My configuration is pretty generic, except that with Zookeeper we use a
>> different root path, so our Zookeeper connect string looks like so:
>>
>> zookeeper-node1:2181,zookeeper-node2:2181,zookeeper-node3:2181/kafka
>>
>> Is anybody else experiencing issues?
>>
>> Thanks.
>>
>> --
>> Josh West 
>>
>> Cloud Architect
>> Bose Corporation
>>
>
>
> --
> Juan Carlos Sequeiros
>



-- 
Juan Carlos Sequeiros


Re: Nifi 0.50 and GetKafka Issues

2016-02-20 Thread Oleg Zhurakousky
Josh

The only change that went in that is relevant to your issue is that we've
upgraded the client libraries to Kafka 0.9, and between 0.8 and 0.9 Kafka
introduced wire protocol changes that break compatibility.
I am still digging so stay tuned.

Oleg

On Feb 20, 2016, at 4:10 PM, West, Joshua wrote:

Hi Oleg and Joe,

Kafka 0.8.2.1

Attached is the app log with hostnames scrubbed.

Thanks for your help.  Much appreciated.


--
Josh West
Bose Corporation



On Sat, 2016-02-20 at 15:46 -0500, Joe Witt wrote:

And also what version of Kafka are you using?

On Feb 20, 2016 3:37 PM, "Oleg Zhurakousky" wrote:
Josh

Any chance to attach the app log or relevant stack trace?

Thanks
Oleg

On Feb 20, 2016, at 3:30 PM, West, Joshua wrote:

Hi folks,

I've upgraded from Nifi 0.4.1 to 0.5.0 and I am no longer able to use the 
GetKafka processor.  I'm seeing errors like so:

2016-02-20 20:10:14,953 WARN
[ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0]
kafka.consumer.ConsumerFetcherThread
[ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0],
Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@7b49a642.
Possible cause: java.lang.IllegalArgumentException

^ Note: the hostname of the server has been scrubbed.

My configuration is pretty generic, except that with Zookeeper we use a 
different root path, so our Zookeeper connect string looks like so:

zookeeper-node1:2181,zookeeper-node2:2181,zookeeper-node3:2181/kafka

Is anybody else experiencing issues?

Thanks.


--
Josh West

Cloud Architect
Bose Corporation









Re: Connecting Spark to Nifi 0.4.0

2016-02-20 Thread Bryan Bende
Just wanted to point out that the stack trace doesn't actually show the
error coming from code in the NiFi Site-To-Site client, so I wonder if it
is something else related to Spark.

Seems similar to this error, but not sure:
https://stackoverflow.com/questions/27013795/failed-to-run-the-spark-example-locally-on-a-macbook-with-error-lost-task-1-0-i

On Sat, Feb 20, 2016 at 5:16 PM, Joe Witt  wrote:

> Kyle
>
> Can you try connecting to that nifi port using telnet and see if you are
> able?
>
> Use the same host and port as you are in your spark job.
>
> Thanks
> Joe
> On Feb 20, 2016 4:55 PM, "Kyle Burke"  wrote:
>
>> All,
>>I’m attempting to connect Spark to Nifi but I’m getting a “connect
>> timed out” error when spark tries to pull records from the input port. I
>> don’t understand why I’m getting the issue, because NiFi and Spark are both
>> running on my local laptop. Any suggestions about how to get around the
>> issue?
>>
>> *It appears that nifi is listening on the port because I see the
>> following when running the lsof command:*
>>
>> java      31455 kyle.burke 1054u  IPv4 0x1024ddd67a640091  0t0  TCP
>> *:9099 (LISTEN)
>>
>>
>> *I’ve been following the instructions given in these two articles:*
>> https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
>>
>> https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html
>>
>> *Here is how I have my nifi.properties setting:*
>>
>> # Site to Site properties
>>
>> nifi.remote.input.socket.host=
>>
>> nifi.remote.input.socket.port=9099
>>
>> nifi.remote.input.secure=false
>>
>>
>> *Below is the full error stack:*
>>
>> 16/02/20 16:34:45 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
>> 0)
>>
>> java.net.SocketTimeoutException: connect timed out
>>
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>
>> at
>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>>
>> at
>> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>>
>> at
>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>>
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>
>> at java.net.Socket.connect(Socket.java:589)
>>
>> at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>>
>> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>>
>> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>>
>> at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
>>
>> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>>
>> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>>
>> at
>> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
>>
>> at
>> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
>>
>> at
>> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
>>
>> at
>> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
>>
>> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>>
>> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
>>
>> at
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>>
>> at
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>>
>> at
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>
>> at
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>>
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>>
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>>
>> at
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>
>> at org.apache.spark.executor.Executor.org
>> $apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
>>
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>> Respectfully,
>>
>> *Kyle Burke *| Data Science Engineer
>> *IgnitionOne - *Marketing Technology. Simplified.
>>
>


Re: Connecting Spark to Nifi 0.4.0

2016-02-20 Thread Joe Witt
Kyle

Can you try connecting to that nifi port using telnet and see if you are
able?

Use the same host and port as you are in your spark job.
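
If telnet isn't handy, the same reachability check can be scripted. This is a generic TCP probe (nothing NiFi-specific; the host and port are whatever the Spark job is configured with):

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, so success means
        # something is actually accepting connections on that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False for the same host and port the Spark job uses, the timeout is a plain networking problem (wrong host, firewall, or NiFi bound to a different interface) rather than a Site-to-Site issue.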

Thanks
Joe
On Feb 20, 2016 4:55 PM, "Kyle Burke"  wrote:

> All,
>I’m attempting to connect Spark to Nifi but I’m getting a “connect
> timed out” error when spark tries to pull records from the input port. I
> don’t understand why I’m getting the issue, because NiFi and Spark are both
> running on my local laptop. Any suggestions about how to get around the
> issue?
>
> *It appears that nifi is listening on the port because I see the following
> when running the lsof command:*
>
> java      31455 kyle.burke 1054u  IPv4 0x1024ddd67a640091  0t0  TCP
> *:9099 (LISTEN)
>
>
> *I’ve been following the instructions given in these two articles:*
> https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
>
> https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html
>
> *Here is how I have my nifi.properties setting:*
>
> # Site to Site properties
>
> nifi.remote.input.socket.host=
>
> nifi.remote.input.socket.port=9099
>
> nifi.remote.input.secure=false
>
>
> *Below is the full error stack:*
>
> 16/02/20 16:34:45 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
> 0)
>
> java.net.SocketTimeoutException: connect timed out
>
> at java.net.PlainSocketImpl.socketConnect(Native Method)
>
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>
> at java.net.Socket.connect(Socket.java:589)
>
> at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>
> at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
>
> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>
> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>
> at
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
>
> at
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
>
> at
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
>
> at
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
>
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
>
> at
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>
> at
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>
> at org.apache.spark.executor.Executor.org
> $apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
> Respectfully,
>
> *Kyle Burke *| Data Science Engineer
> *IgnitionOne - *Marketing Technology. Simplified.
>


Connecting Spark to Nifi 0.4.0

2016-02-20 Thread Kyle Burke
All,
   I’m attempting to connect Spark to NiFi but I’m getting a “connect timed
out” error when Spark tries to pull records from the input port. I don’t
understand why I’m getting the issue, because NiFi and Spark are both running on
my local laptop. Any suggestions about how to get around the issue?

It appears that nifi is listening on the port because I see the following when 
running the lsof command:

java      31455 kyle.burke 1054u  IPv4 0x1024ddd67a640091  0t0  TCP *:9099
(LISTEN)


I’ve been following the instructions given in these two articles:
https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html

Here is how I have my nifi.properties setting:

# Site to Site properties

nifi.remote.input.socket.host=

nifi.remote.input.socket.port=9099

nifi.remote.input.secure=false
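
One thing worth double-checking (a guess on my part, not a confirmed fix): nifi.remote.input.socket.host is blank above. When it is blank, NiFi advertises the machine's own hostname to Site-to-Site clients, and if the client cannot resolve or reach that name, the fetch can time out. On a single laptop it may help to pin it explicitly, e.g.:

```
# Site to Site properties
nifi.remote.input.socket.host=localhost
nifi.remote.input.socket.port=9099
nifi.remote.input.secure=false
```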


Below is the full error stack:

16/02/20 16:34:45 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.net.SocketTimeoutException: connect timed out

at java.net.PlainSocketImpl.socketConnect(Native Method)

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)

at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

at java.net.Socket.connect(Socket.java:589)

at sun.net.NetworkClient.doConnect(NetworkClient.java:175)

at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)

at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)

at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)

at sun.net.www.http.HttpClient.New(HttpClient.java:308)

at sun.net.www.http.HttpClient.New(HttpClient.java:326)

at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)

at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)

at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)

at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)

at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)

at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)

at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)

at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)

at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)

at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)

at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)


Respectfully,

Kyle Burke | Data Science Engineer
IgnitionOne - Marketing Technology. Simplified.


Re: Using Apache Nifi and Tika to extract content from pdf

2016-02-20 Thread Matt Burgess
I will update the blog to make these more clear. I used PDFBox 1.8.10, so I'm
not sure what else you need for the 2.0 series. For the JAR issue with 1.8.10,
the PDFBox docs say you need three JARs: pdfbox, fontbox, and jempbox, plus
commons-logging, but I think that's already in NiFi.

The stack trace from the script error should be in logs/nifi-app.log; if you
send it along, I can take a look. You should be able to point to the folder
containing the JARs, or supply a comma-separated list of the individual JARs, in
the Module Directory property.
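
For example, with the 1.8.10 JARs the property could be set either way (the paths here are purely illustrative):

```
Module Directory: /path/to/pdfbox
Module Directory: /path/to/pdfbox/pdfbox-1.8.10.jar,/path/to/pdfbox/fontbox-1.8.10.jar,/path/to/pdfbox/jempbox-1.8.10.jar
```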

For the Groovy "magic" stuff (syntactic sugar and closure coercion while using 
the NiFi APIs), I explain some of that in another post on that blog: 
http://funnifi.blogspot.com/2016/02/executescript-processor-replacing-flow.html?m=1

Hope this helps,
Matt

> On Feb 20, 2016, at 3:54 PM, Ralf Meier  wrote:
> 
> Hi,
> 
> thanks for your information. I'm trying to understand your workflow but I get
> some errors when I test it:
> 
> : org.apache.nifi.processor.exception.ProcessException: 
> javax.script.ScriptException: 
> org.codehaus.groovy.control.MultipleCompilationErrorsException: startup 
> failed:
> Script36800.groovy: 15: unable to resolve class PDFTextStripper 
>  @ line 15, column 9.
>def s = new PDFTextStripper()
> 
> I downloaded the pdfbox-2.0.0-RC3.jar and copied it into a folder named
> pdfbox in my download folder. I then changed the path (Module Directory) in
> the ExecuteScript processor to this folder. The rest I didn't change.
> 
> But I get this error. Do you have some hints? This would be great.
> 
> 
> To be honest (I'm totally new to Groovy), I also did not fully understand
> what happens here in detail:
> 
> flowFile = session.write(flowFile, {inputStream, outputStream ->
>     doc = PDDocument.load(inputStream)
>     info = doc.getDocumentInformation()
>     s.writeText(doc, new OutputStreamWriter(outputStream))
> } as StreamCallback
> )
> 
> Thanks for your help.
> 
> BR
> Ralf
> 
> 
> 
> 
>> Am 20.02.2016 um 16:44 schrieb Matt Burgess :
>> 
>> I have a blog post on how to do this with NiFi using a Groovy script in the 
>> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
>> 
>> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>> 
>> Jython is also supported but can't yet use Java libraries (it uses Jython 
>> scripts/modules instead). The other languages (Groovy, Lua, JavaScript, 
>> JRuby) can use Java libraries like Tika and PDFBox.
>> 
>> Regards,
>> Matt
>> 
>> Sent from my iPhone
>> 
>>> On Feb 20, 2016, at 10:31 AM, Ralf Meier  wrote:
>>> 
>>> Hi Everybody, 
>>> 
>>> I’m new to Nifi and I want to find out if it is possible to extract content 
>>> and metadata from PDF’s using a library like tika. 
>>> My first idea was to use the following processors:
>>> - GetFile (Watch a specific Folder)
>>> - IdentifyMimeType (Identify if the file is of type application/pdf) 
>>> - RouteOnAttribute (If it is a pdf)
>>> - ExecuteStreamCommand:
>>> I changed the following settings.
>>> Command Arguments: {flowfilw_contents}
>>> Command Path: tika-python parse all
>>> 
>>> I use the python tika wrapper from 
>>> (https://github.com/chrismattmann/tika-python)
>>> 
>>> But it is not working. 
>>> Does somebody have an idea how to use Tika to extract the content and the
>>> metadata using NiFi, or what I'm doing wrong?
>>> 
>>> Thanks for your help.
>>> BR 
>>> Ralf
> 


Re: Using Apache Nifi and Tika to extract content from pdf

2016-02-20 Thread Ralf Meier
Hi,

thanks for your information. I'm trying to understand your workflow but I get
some errors when I test it:

: org.apache.nifi.processor.exception.ProcessException: 
javax.script.ScriptException: 
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script36800.groovy: 15: unable to resolve class PDFTextStripper 
 @ line 15, column 9.
   def s = new PDFTextStripper()

I downloaded the pdfbox-2.0.0-RC3.jar and copied it into a folder named pdfbox
in my download folder. I then changed the path (Module Directory) in the
ExecuteScript processor to this folder. The rest I didn't change.

But I get this error. Do you have some hints? This would be great.


To be honest (I'm totally new to Groovy), I also did not fully understand what
happens here in detail:

flowFile = session.write(flowFile, {inputStream, outputStream ->
    doc = PDDocument.load(inputStream)
    info = doc.getDocumentInformation()
    s.writeText(doc, new OutputStreamWriter(outputStream))
} as StreamCallback
)

Thanks for your help.

BR
Ralf




> Am 20.02.2016 um 16:44 schrieb Matt Burgess :
> 
> I have a blog post on how to do this with NiFi using a Groovy script in the 
> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
> 
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>  
> 
> 
> Jython is also supported but can't yet use Java libraries (it uses Jython 
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript, 
> JRuby) can use Java libraries like Tika and PDFBox.
> 
> Regards,
> Matt
> 
> Sent from my iPhone
> 
> On Feb 20, 2016, at 10:31 AM, Ralf Meier wrote:
> 
>> Hi Everybody, 
>> 
>> I’m new to Nifi and I want to find out if it is possible to extract content 
>> and metadata from PDF’s using a library like tika. 
>> My first idea was to use the following processors:
>> - GetFile (Watch a specific Folder)
>> - IdentifyMimeType (Identify if the file is of type application/pdf) 
>> - RouteOnAttribute (If it is a pdf)
>> - ExecuteStreamCommand:
>>  I changed the following settings.
>>  Command Arguments: {flowfilw_contents}
>>  Command Path: tika-python parse all
>>  
>> I use the python tika wrapper from 
>> (https://github.com/chrismattmann/tika-python)
>> 
>> But it is not working. 
>> Does somebody have an idea how to use Tika to extract the content and the
>> metadata using NiFi, or what I'm doing wrong?
>> 
>> Thanks for your help.
>> BR 
>> Ralf



Re: Nifi 0.50 and GetKafka Issues

2016-02-20 Thread Joe Witt
And also what version of Kafka are you using?
On Feb 20, 2016 3:37 PM, "Oleg Zhurakousky" 
wrote:

> Josh
>
> Any chance to attach the app log or relevant stack trace?
>
> Thanks
> Oleg
>
> On Feb 20, 2016, at 3:30 PM, West, Joshua  wrote:
>
> Hi folks,
>
> I've upgraded from Nifi 0.4.1 to 0.5.0 and I am no longer able to use the
> GetKafka processor.  I'm seeing errors like so:
>
> 2016-02-20 20:10:14,953 WARN
> [ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0]
> kafka.consumer.ConsumerFetcherThread
> [ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0],
> Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@7b49a642.
> Possible cause: java.lang.IllegalArgumentException
>
> ^ Note: the hostname of the server has been scrubbed.
>
> My configuration is pretty generic, except that with Zookeeper we use a
> different root path, so our Zookeeper connect string looks like so:
>
> zookeeper-node1:2181,zookeeper-node2:2181,zookeeper-node3:2181/kafka
>
> Is anybody else experiencing issues?
>
> Thanks.
>
> --
> Josh West 
>
> Cloud Architect
> Bose Corporation
>


Re: Nifi 0.50 and GetKafka Issues

2016-02-20 Thread Oleg Zhurakousky
Josh

Any chance to attach the app log or relevant stack trace?

Thanks
Oleg

On Feb 20, 2016, at 3:30 PM, West, Joshua wrote:

Hi folks,

I've upgraded from Nifi 0.4.1 to 0.5.0 and I am no longer able to use the 
GetKafka processor.  I'm seeing errors like so:

2016-02-20 20:10:14,953 WARN
[ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0]
kafka.consumer.ConsumerFetcherThread
[ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0],
Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@7b49a642.
Possible cause: java.lang.IllegalArgumentException

^ Note: the hostname of the server has been scrubbed.

My configuration is pretty generic, except that with Zookeeper we use a 
different root path, so our Zookeeper connect string looks like so:

zookeeper-node1:2181,zookeeper-node2:2181,zookeeper-node3:2181/kafka

Is anybody else experiencing issues?

Thanks.


--
Josh West

Cloud Architect
Bose Corporation






Nifi 0.50 and GetKafka Issues

2016-02-20 Thread West, Joshua
Hi folks,

I've upgraded from Nifi 0.4.1 to 0.5.0 and I am no longer able to use the 
GetKafka processor.  I'm seeing errors like so:

2016-02-20 20:10:14,953 WARN
[ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0]
kafka.consumer.ConsumerFetcherThread
[ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0],
Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@7b49a642.
Possible cause: java.lang.IllegalArgumentException

^ Note: the hostname of the server has been scrubbed.

My configuration is pretty generic, except that with Zookeeper we use a 
different root path, so our Zookeeper connect string looks like so:

zookeeper-node1:2181,zookeeper-node2:2181,zookeeper-node3:2181/kafka
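
That connect string is valid; the trailing /kafka is a ZooKeeper chroot that applies to the whole ensemble, not just the last host. A small sketch of how such a string splits into hosts and chroot (illustrative only, not Kafka's actual parsing code):

```python
def split_connect_string(connect):
    """Split a ZooKeeper connect string into (host list, chroot path).

    The chroot suffix, if present, applies to the entire ensemble,
    so it appears once, after the last host:port pair.
    """
    hosts, slash, chroot = connect.partition("/")
    return hosts.split(","), ("/" + chroot) if slash else "/"

hosts, chroot = split_connect_string(
    "zookeeper-node1:2181,zookeeper-node2:2181,zookeeper-node3:2181/kafka"
)
# hosts  -> ['zookeeper-node1:2181', 'zookeeper-node2:2181', 'zookeeper-node3:2181']
# chroot -> '/kafka'
```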

Is anybody else experiencing issues?

Thanks.


--
Josh West 

Cloud Architect
Bose Corporation





Re: Problem with PutS3Object processor

2016-02-20 Thread Joe Skora
I spoke to Joseph off list; he is not using Amazon but an S3-compatible
back-end that authenticates requests differently than Amazon does. Adding
credentials to his request eliminated the error message.

The underlying AWS library assumes an Amazon back-end, which does not require
the date header for anonymous requests, and without credentials it treats the
request as anonymous. Though this back-end doesn't require them, it can use
Amazon credentials, and supplying them fixes the problem.

I'll look into more testing with other non-Amazon S3-compatible platforms and
into finer-grained control over the library's behavior. But it seems reasonable
that the Amazon library might not accommodate deviation from the anticipated
AWS S3 behaviors.
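
For context, the "date" the error refers to is the HTTP Date header that signed S3 requests carry, in RFC 1123 format; the SDK adds it automatically when signing. A minimal sketch of producing such a timestamp (illustrative only; you should not need to build it by hand):

```python
from email.utils import formatdate

# RFC 1123 timestamp in GMT, e.g. "Sat, 20 Feb 2016 20:10:14 GMT".
# This is the form the Date header takes on a signed S3 request.
date_header = formatdate(usegmt=True)
```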

On Fri, Feb 19, 2016 at 3:02 PM, Joseph E. Gottman 
wrote:

> I am using an SSLContextService for my credentials.
>
>
>  *Joe Gottman*
> *Senior Member of Technical Staff*
> jgott...@proteuseng.com 
> 133 National Business Pkwy, Ste 150
> Annapolis Junction, MD 20701
> (Office) 301.377.7144
> *www.proteus-technologies.com* 
> *TheBlend.ProteusEng.com* *(Digital magazine)*
> --
> *From:* Joe Skora 
> *Sent:* Friday, February 19, 2016 2:56 PM
> *To:* users@nifi.apache.org
> *Subject:* Re: Problem with PutS3Object processor
>
> Joseph,
>
> I ran into this same problem last week when I forgot to provide
> credentials to an S3 compatible endpoint.  AWS S3 requires the date header
> when an authorization header is provided, so the underlying Amazon library
> provides it automatically if authorization is used by the processor, but if
> it thinks the request is anonymous it leaves off the date header.
>
> Does your endpoint require authentication Amazon (or Amazon-like) Access
> Key and Secret Key credentials?  If not, can you try providing credentials
> and see if that helps?
>
> Regards,
> Joe Skora
>
> On Fri, Feb 19, 2016 at 12:56 PM, Joseph E. Gottman <
> jgott...@proteuseng.com> wrote:
>
>> I am trying to use a PutS3Object processor with the "Endpoint Override
>> URL" pointing to a custom endpoint.  I keep failing with the error message
>> "You must specify a date for this operation".  I am using NiFi version
>> 0.4.1 with Java 8 and Centos 6.7.  I suspect this might have something to
>> do with bug #1025, but according to your notes it was fixed for version
>> 0.4.0.
>>
>>
>>
>>
>>
>
>


Re: Using Apache Nifi and Tika to extract content from pdf

2016-02-20 Thread Russell Whitaker
Yes! I, for one, will weigh in with my interest in Clojure support in
the scripting processors.

Russell

On Sat, Feb 20, 2016 at 11:34 AM, Matt Burgess  wrote:
> Clojure libraries (or any JARs) can be used by the supported scripting
> languages. However, Clojure itself is not yet supported by the NiFi scripting
> processors; there were issues with the Clojure ScriptEngine bridge, so it was
> left off the original list. If there is interest in adding Clojure, I can
> write up an improvement Jira with the initial findings.
>
> Regards,
> Matt
>
>
> On Feb 20, 2016, at 2:18 PM, Russell Whitaker wrote:
>
> Don't forget Clojure as well.
>
> Russell Whitaker
> Sent from my iPhone
>
> On Feb 20, 2016, at 7:44 AM, Matt Burgess  wrote:
>
> I have a blog post on how to do this with NiFi using a Groovy script in the
> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
>
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>
> Jython is also supported but can't yet use Java libraries (it uses Jython
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript,
> JRuby) can use Java libraries like Tika and PDFBox.
>
> Regards,
> Matt
>
> Sent from my iPhone
>
> On Feb 20, 2016, at 10:31 AM, Ralf Meier  wrote:
>
> Hi Everybody,
>
> I’m new to Nifi and I want to find out if it is possible to extract content
> and metadata from PDF’s using a library like tika.
> My first idea was to use the following processors:
> - GetFile (Watch a specific Folder)
> - IdentifyMimeType (Identify if the file is of type application/pdf)
> - RouteOnAttribute (If it is a pdf)
> - ExecuteStreamCommand:
> I changed the following settings.
> Command Arguments: {flowfilw_contents}
> Command Path: tika-python parse all
> I use the python tika wrapper from
> (https://github.com/chrismattmann/tika-python)
>
> But it is not working.
> Does somebody have an idea how to use Tika to extract the content and the
> metadata using NiFi, or what I'm doing wrong?
>
> Thanks for your help.
> BR
> Ralf



-- 
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329


Re: Using Apache Nifi and Tika to extract content from pdf

2016-02-20 Thread Matt Burgess
Clojure libraries (or any JARs) can be used by the supported scripting
languages. However, Clojure itself is not yet supported by the NiFi scripting
processors; there were issues with the Clojure ScriptEngine bridge, so it was
left off the original list. If there is interest in adding Clojure, I can write
up an improvement Jira with the initial findings.

Regards,
Matt


> On Feb 20, 2016, at 2:18 PM, Russell Whitaker  
> wrote:
> 
> Don't forget Clojure as well. 
> 
> Russell Whitaker
> Sent from my iPhone
> 
>> On Feb 20, 2016, at 7:44 AM, Matt Burgess  wrote:
>> 
>> I have a blog post on how to do this with NiFi using a Groovy script in the 
>> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
>> 
>> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>> 
>> Jython is also supported but can't yet use Java libraries (it uses Jython 
>> scripts/modules instead). The other languages (Groovy, Lua, JavaScript, 
>> JRuby) can use Java libraries like Tika and PDFBox.
>> 
>> Regards,
>> Matt
>> 
>> Sent from my iPhone
>> 
>>> On Feb 20, 2016, at 10:31 AM, Ralf Meier  wrote:
>>> 
>>> Hi everybody,
>>>
>>> I'm new to NiFi and I want to find out whether it is possible to extract
>>> content and metadata from PDFs using a library like Tika.
>>> My first idea was to use the following processors:
>>> - GetFile (watch a specific folder)
>>> - IdentifyMimeType (identify whether the file is of type application/pdf)
>>> - RouteOnAttribute (if it is a PDF)
>>> - ExecuteStreamCommand:
>>> I changed the following settings.
>>> Command Arguments: {flowfilw_contents}
>>> Command Path: tika-python parse all
>>>
>>> I use the Python Tika wrapper from
>>> https://github.com/chrismattmann/tika-python
>>>
>>> But it is not working.
>>> Does somebody have an idea how to use Tika to extract the content and the
>>> metadata using NiFi, or what I'm doing wrong?
>>>
>>> Thanks for your help.
>>> BR
>>> Ralf


Re: Using Apache Nifi and Tika to extract content from pdf

2016-02-20 Thread Russell Whitaker
Don't forget Clojure as well. 

Russell Whitaker
Sent from my iPhone

> On Feb 20, 2016, at 7:44 AM, Matt Burgess  wrote:
> 
> I have a blog post on how to do this with NiFi using a Groovy script in the 
> ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:
> 
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
> 
> Jython is also supported but can't yet use Java libraries (it uses Jython 
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript, 
> JRuby) can use Java libraries like Tika and PDFBox.
> 
> Regards,
> Matt
> 
> Sent from my iPhone
> 
>> On Feb 20, 2016, at 10:31 AM, Ralf Meier  wrote:
>> 
>> Hi everybody,
>>
>> I'm new to NiFi and I want to find out whether it is possible to extract
>> content and metadata from PDFs using a library like Tika.
>> My first idea was to use the following processors:
>> - GetFile (watch a specific folder)
>> - IdentifyMimeType (identify whether the file is of type application/pdf)
>> - RouteOnAttribute (if it is a PDF)
>> - ExecuteStreamCommand:
>> I changed the following settings.
>> Command Arguments: {flowfilw_contents}
>> Command Path: tika-python parse all
>>
>> I use the Python Tika wrapper from
>> https://github.com/chrismattmann/tika-python
>>
>> But it is not working.
>> Does somebody have an idea how to use Tika to extract the content and the
>> metadata using NiFi, or what I'm doing wrong?
>>
>> Thanks for your help.
>> BR
>> Ralf


Re: Using Apache Nifi and Tika to extract content from pdf

2016-02-20 Thread Matt Burgess
I have a blog post on how to do this with NiFi using a Groovy script in the 
ExecuteScript (new in 0.5.0) processor using PDFBox instead of Tika:

http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1

Jython is also supported but can't yet use Java libraries (it uses Jython 
scripts/modules instead). The other languages (Groovy, Lua, JavaScript, JRuby) 
can use Java libraries like Tika and PDFBox.

Regards,
Matt

Sent from my iPhone

> On Feb 20, 2016, at 10:31 AM, Ralf Meier  wrote:
> 
> Hi Everybody, 
> 
> I’m new to Nifi and I want to find out if it is possible to extract content 
> and metadata from PDF’s using a library like tika. 
> My first Idea was to to use the following processors:
> - GetFile (Watch a specific Folder)
> - IdentifyMimeType (Identify if the file is a typ application/pdf) 
> - RouteOnAttribute (If it is a pdf)
> - ExecuteStreamCommand:
>   I changed the following settings.
>   Command Arguments: {flowfilw_contents}
>   Command Path: tika-python parse all
>   
> I use the python tika wrapper from 
> (https://github.com/chrismattmann/tika-python)
> 
> But it is not working. 
> Has somebody an Idea how to use tika to extract the content and the metadata 
> using nifi or what I’m doing wrong.
> 
> Thanks for your help.
> BR 
> Ralf
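
[Editor's note: the failing ExecuteStreamCommand configuration quoted above mixes the command and its arguments. In NiFi, Command Path takes only the executable, Command Arguments takes the delimited argument list, and the flowfile content is piped to the command's stdin rather than passed through a placeholder. A rough sketch of how the properties could look with the tika-python CLI; the path, the `/dev/stdin` trick (Linux only), and the delimiter choice are assumptions, not tested settings:

```
Command Path:        /usr/local/bin/tika-python
Command Arguments:   parse;all;/dev/stdin
Argument Delimiter:  ;
```

Since tika-python's `parse` command expects a file path, pointing it at `/dev/stdin` is one way to consume the streamed content; a small wrapper script that reads stdin is the more portable option.]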


Using Apache Nifi and Tika to extract content from pdf

2016-02-20 Thread Ralf Meier
Hi everybody,

I'm new to NiFi and I want to find out whether it is possible to extract content
and metadata from PDFs using a library like Tika.
My first idea was to use the following processors:
- GetFile (watch a specific folder)
- IdentifyMimeType (identify whether the file is of type application/pdf)
- RouteOnAttribute (if it is a PDF)
- ExecuteStreamCommand:
I changed the following settings.
Command Arguments: {flowfilw_contents}
Command Path: tika-python parse all

I use the Python Tika wrapper from
https://github.com/chrismattmann/tika-python

But it is not working.
Does somebody have an idea how to use Tika to extract the content and the
metadata using NiFi, or what I'm doing wrong?

Thanks for your help.
BR
Ralf
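
[Editor's note: one way to make the flow above work is a small wrapper script invoked by ExecuteStreamCommand, which pipes the flowfile content to the command's stdin and replaces the content with the command's stdout. A minimal sketch, assuming tika-python and Java are installed on the NiFi host; the `extract` helper and its injectable `parse_func` parameter are hypothetical, introduced here for illustration:

```python
#!/usr/bin/env python
"""Hypothetical ExecuteStreamCommand wrapper: NiFi streams the flowfile
content to stdin and reads the new content from stdout, so no
{placeholder} argument is needed in Command Arguments."""
import sys
import tempfile


def extract(pdf_bytes, parse_func):
    # Stage the streamed bytes in a temp file, since Tika parsers
    # typically expect a file path, then return the extracted text.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        path = tmp.name
    parsed = parse_func(path)  # e.g. tika.parser.from_file
    return parsed.get("content") or ""


if __name__ == "__main__":
    # Real wiring (requires tika-python and a reachable Tika server):
    #   from tika import parser
    #   sys.stdout.write(extract(sys.stdin.buffer.read(), parser.from_file))
    pass
```

Command Path would then point at this script with no arguments, and the GetFile / IdentifyMimeType / RouteOnAttribute steps stay as in the original flow.]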