Re: Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-10-02 Thread maddenpj
I am seeing this same issue. Bumping for visibility. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-job-has-an-issue-when-the-worker-reading-from-Kafka-is-killed-tp12595p15611.html Sent from the Apache Spark User List mailing list

Block removal causes Akka timeouts

2014-10-02 Thread maddenpj
I'm seeing a lot of Akka timeouts that eventually lead to job failure in Spark Streaming when removing blocks (example stack trace below). It appears to be related to these issues: SPARK-3015 https://issues.apache.org/jira/browse/SPARK-3015 and SPARK-3139

Re: shuffle memory requirements

2014-09-29 Thread maddenpj
Hey Ameet, thanks for the info. I'm running into the same issue myself; my last attempt crashed and my ulimit was 16834. I'm going to up it and try again, but yes, I would like to know the best practice for computing this. Can you talk about the worker nodes? What are their specs? At least 45

Re: Build spark with Intellij IDEA 13

2014-09-27 Thread maddenpj
I actually got this same exact issue compiling an unrelated project (not using Spark). Maybe it's a protobuf issue?

Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
different values, but it looks like only one worker is actually doing the writing to MySQL. Obviously this is not ideal because I need the parallelism to insert this data in a timely manner. Here's the code: https://gist.github.com/maddenpj/5032c76aeb330371a6e6

Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
Update for posterity: once again I solved the problem shortly after posting to the mailing list. updateStateByKey uses the default partitioner, which in my case seemed to be set to one partition. Changing my call from .updateStateByKey[Long](updateFn) to .updateStateByKey[Long](updateFn,
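The fix described above can be sketched as follows. This is a minimal illustration against the Spark Streaming 1.x Scala API; the update function and the stream name `counts` are hypothetical stand-ins for the poster's code, and the partition count of 8 is an arbitrary example.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext._

// Hypothetical update function: fold new counts into the running total.
val updateFn = (values: Seq[Long], state: Option[Long]) =>
  Some(values.sum + state.getOrElse(0L))

// `counts` is assumed to be a DStream[(String, Long)].
// The default partitioner can leave you with a single partition;
// passing an explicit Partitioner (or a partition count) spreads the
// state, and the downstream database writes, across workers.
val totals = counts.updateStateByKey[Long](updateFn, new HashPartitioner(8))
```

There is also an overload that takes a plain `numPartitions: Int` if constructing a `Partitioner` by hand feels heavyweight.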

Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
Yup, it's all in the gist: https://gist.github.com/maddenpj/5032c76aeb330371a6e6 Lines 6-9 deal with setting up the driver specifically. This sets the driver up on each partition, which keeps the connection pool around instead of opening a connection per record.

Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
of data (so if we've already seen that particular piece, we just update the existing total in MySQL with the total Spark just computed in the current window): https://gist.github.com/maddenpj/74a4c8ce372888ade92d One thing I have noticed is my

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Oh, I should add that I've tried a range of batch durations and reduce-by-window durations to no effect; I'm not sure how to choose these. Currently I've been testing with a batch duration of 1 minute to 10 minutes and a reduce window duration of 10 or 20 minutes.

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Another update: it just hit me that my problem is probably right here: https://gist.github.com/maddenpj/74a4c8ce372888ade92d#file-gistfile1-scala-L22 I'm creating a JDBC connection on every record; that's probably what's killing the performance. I assume the fix is just broadcast
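For what it's worth, broadcasting a connection won't work (JDBC connections aren't serializable); the usual fix for the per-record connection problem is to open one connection per partition with foreachPartition. A hedged sketch, where the stream name `totals`, the JDBC URL, credentials, and table schema are all placeholder assumptions:

```scala
import java.sql.DriverManager

// `totals` is assumed to be a DStream[(String, Long)] of running totals.
totals.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One connection per partition, not per record.
    // Connection details below are illustrative placeholders.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/metrics", "user", "password")
    val stmt = conn.prepareStatement(
      "INSERT INTO totals (k, total) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE total = VALUES(total)")
    try {
      records.foreach { case (k, total) =>
        stmt.setString(1, k)
        stmt.setLong(2, total)
        stmt.executeUpdate()
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
```

A connection pool held in a per-JVM singleton would amortize the cost further across batches, which appears to be what the gist's lines 6-9 ended up doing.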

Re: Kafka - streaming from multiple topics

2014-08-13 Thread maddenpj
Can you link to the JIRA issue? I'm having to work around this bug, and it would be nice to monitor the JIRA so I can change my code when it's fixed.

Spark Streaming worker underutilized?

2014-08-08 Thread maddenpj
I currently have a 4-node Spark setup, 1 master and 3 workers, running in Spark standalone mode. I am currently stress testing a Spark application I wrote that reads data from Kafka and puts it into Redshift. I'm pretty happy with the performance (reading about 6k messages per second out of Kafka)

Re: memory issue on standalone master

2014-08-07 Thread maddenpj
It looks like your Java heap space is too low: -Xmx512m. It's only using 0.5 GB of RAM; try bumping this up.
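In standalone mode the relevant knobs live in conf/spark-env.sh; a sketch of bumping the heap, with the values below purely illustrative (size them to your machines):

```shell
# conf/spark-env.sh (illustrative values, not recommendations)
export SPARK_DAEMON_MEMORY=2g    # heap for the standalone master/worker daemons themselves
export SPARK_WORKER_MEMORY=8g    # total memory a worker may hand out to executors
```

Restart the master and workers after editing so the new -Xmx takes effect.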

Using Spark Streaming with Kafka 0.7.2

2014-07-25 Thread maddenpj
Hi all, currently we have Kafka 0.7.2 running in production and can't upgrade for external reasons; however, Spark Streaming (1.0.1) was built against Kafka 0.8.0. What is the best way to use Spark Streaming with older versions of Kafka? Currently I'm investigating trying to build Spark Streaming