Block removal causes Akka timeouts

2014-10-02 Thread maddenpj
I'm seeing a lot of Akka timeouts that eventually lead to job failure in
Spark Streaming when removing blocks (example stack trace below). It appears
to be related to SPARK-3015 and SPARK-3139, but while workarounds were
provided for those scenarios, there doesn't seem to be a workaround for
block removal. Any suggestions?






View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Block-removal-causes-Akka-timeouts-tp15632.html



Re: Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-10-02 Thread maddenpj
I am seeing this same issue. Bumping for visibility.



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-job-has-an-issue-when-the-worker-reading-from-Kafka-is-killed-tp12595p15611.html



Re: shuffle memory requirements

2014-09-29 Thread maddenpj
Hey Ameet,

Thanks for the info. I'm running into the same issue myself; my last
attempt crashed with the ulimit at 16834. I'm going to raise it and try again,
but I would like to know the best practice for computing this. Can you
say more about the worker nodes? What are their specs? At least 45 GB of
memory and 6 cores?

Also, I left my worker at the default memory size (512 MB, I think) and gave all
of the memory to the executor. My understanding was that the worker just
spawns the executor, and all of the work is done in the executor. What was your
reasoning for giving 24G to the worker?
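
For reference, here is how I understand the split; a minimal sketch assuming standalone mode, with placeholder values rather than recommendations:

import org.apache.spark.SparkConf

// The worker daemon (SPARK_WORKER_MEMORY in spark-env.sh) only caps what it
// can hand out to executors; spark.executor.memory is the heap each executor
// JVM actually gets for tasks and shuffles. The 6g below is a placeholder.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.executor.memory", "6g")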



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/shuffle-memory-requirements-tp4048p15375.html



Re: Build spark with Intellij IDEA 13

2014-09-27 Thread maddenpj
I actually got this exact same issue compiling an unrelated project (not using
Spark). Maybe it's a protobuf issue?



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Build-spark-with-Intellij-IDEA-13-tp9904p15284.html



Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
Yup, it's all in the gist:
https://gist.github.com/maddenpj/5032c76aeb330371a6e6

Lines 6-9 deal with setting up the driver specifically. This sets up the driver
on each partition, which keeps a connection around per partition instead of
opening one for every record.
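
For anyone finding this later, the shape of it is roughly this; a minimal sketch rather than the actual gist code, with the stream type, table, and connection details as placeholders:

import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

// One JDBC connection per partition, reused for every record in it,
// instead of opening a new connection per record.
def saveToMySQL(totals: DStream[(String, Long)]): Unit =
  totals.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val conn = DriverManager.getConnection(
        "jdbc:mysql://db-host/metrics", "user", "pass")  // placeholder details
      val stmt = conn.prepareStatement(
        "INSERT INTO totals (k, total) VALUES (?, ?) ON DUPLICATE KEY UPDATE total = ?")
      try {
        records.foreach { case (key, total) =>
          stmt.setString(1, key)
          stmt.setLong(2, total)
          stmt.setLong(3, total)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }
  }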



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-No-parallelism-in-writing-to-database-MySQL-tp15174p15202.html



Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
Update for posterity: once again I solved the problem shortly after
posting to the mailing list. updateStateByKey uses the default partitioner,
which in my case seemed to resolve to a single partition.

Changing my call from .updateStateByKey[Long](updateFn) to
.updateStateByKey[Long](updateFn, numPartitions) resolved it for me.
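
In other words, something like this minimal sketch; updateFn and the stream here are stand-ins for the real code in the gist:

import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

def keyedTotals(events: DStream[(String, Long)], numPartitions: Int) = {
  val updateFn = (values: Seq[Long], state: Option[Long]) =>
    Some(values.sum + state.getOrElse(0L))
  // Without the second argument this inherits the default partitioner, which
  // in my case resolved to a single partition; passing numPartitions
  // explicitly spreads the state (and the downstream writes) across the cluster.
  events.updateStateByKey[Long](updateFn, numPartitions)
}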



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-No-parallelism-in-writing-to-database-MySQL-tp15174p15182.html



Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
I posted yesterday about a related issue but resolved it shortly after. I'm
using Spark Streaming to summarize event data from Kafka and save it to a
MySQL table. Currently the bottleneck is writing to MySQL, and I'm puzzled
about how to speed it up. I've tried repartitioning with several different
values, but it looks like only one worker actually does the writing to
MySQL. Obviously this is not ideal, because I need parallelism to insert
this data in a timely manner.

Here's the code: https://gist.github.com/maddenpj/5032c76aeb330371a6e6

I'm running this on a cluster of 6 Spark nodes (2 cores, 7.5 GB memory each) and
tried repartition sizes of 6, 12, and 48. How do I ensure that there is
parallelism in writing to the database?



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-No-parallelism-in-writing-to-database-MySQL-tp15174.html

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Another update: it actually just hit me that my problem is probably right here:
https://gist.github.com/maddenpj/74a4c8ce372888ade92d#file-gistfile1-scala-L22

I'm creating a JDBC connection on every record, and that's probably what's
killing the performance. I assume the fix is just to broadcast the connection
pool?
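
Thinking about it more, a java.sql.Connection isn't serializable, so it probably can't go in a broadcast variable; a lazily initialized singleton that each executor JVM builds on first use seems like the usual pattern instead. A rough sketch, with placeholder connection details:

import java.sql.{Connection, DriverManager}

// One connection per executor JVM: the lazy val is initialized on first use
// on each executor, so nothing has to be serialized or broadcast.
object LazyConnection {
  lazy val conn: Connection =
    DriverManager.getConnection("jdbc:mysql://db-host/metrics", "user", "pass")
}

Inside a foreachPartition (or even a per-record map), LazyConnection.conn would then be created once per JVM and reused.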



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-unable-to-handle-production-Kafka-load-tp15077p15081.html



Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Oh, I should add that I've tried a range of batch durations and reduce-by-window
durations to no effect. I'm not sure how to choose these.

Today I've been testing with batch durations from 1 to 10 minutes and reduce
window durations of 10 or 20 minutes.



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-unable-to-handle-production-Kafka-load-tp15077p15080.html



Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
I am attempting to use Spark Streaming to summarize event data streaming in
and save it to a MySQL table. The input data is stored in 4 topics on Kafka,
each topic having 12 partitions. My approach worked both in local development
and in a simulated load-testing environment, but I cannot seem to get it
working when hooking it up to our production source. I'm having a hard time
figuring out what is going on because I don't see any obvious errors in the
logs; the first batch just never finishes processing. I believe it's a data
rate problem (the most active topic clocks in around 4k messages per second;
the least active is around 0.5 msg/s), but I'm completely stuck on the best
way to resolve this, and maybe I'm not following best practices.

Here is a gist of the essentials of my program:
https://gist.github.com/maddenpj/74a4c8ce372888ade92d

I use an updateStateByKey approach to keep around the MySQL id of each piece
of data (so if we've already seen a particular piece, we just update its
existing total in MySQL with the total Spark computed in the current window).

One thing I have noticed is that my Kafka receiver is only on one machine, and
I have not yet tried to increase the parallelism of reading out of Kafka,
along the lines of this solution (sketched after the next paragraph):
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Kafka-Receivers-and-Union-td14901.html

So that's next on my list, but I'm still in need of insight into figuring out
what's going on. When I watch the stages execute on the web UI, I see
occasional activity (a map stage processing), but most of the time it looks
like I'm stuck in some arbitrary stage (e.g. take and runJob stay active for
the entire life of the program with 0 tasks ever completing). This is contrary
to what I see when I watch the Kafka topics being consumed: the program is
always consuming messages, it just gives no indication that it's doing any
actual processing on them.
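
For reference, here is roughly what I understand the multiple-receiver approach from that thread to look like; a minimal sketch, with the ZooKeeper quorum, group id, topic map, and receiver count all placeholders:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Each createStream call starts its own receiver on some executor, so
// unioning several of them lets more than one node pull from Kafka at once.
def unionedKafkaStream(ssc: StreamingContext, zkQuorum: String, group: String,
                       topics: Map[String, Int], numReceivers: Int) = {
  val streams = (1 to numReceivers).map { _ =>
    KafkaUtils.createStream(ssc, zkQuorum, group, topics)
  }
  ssc.union(streams)
}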

On a somewhat related note, how does everyone capacity plan for building out
a Spark cluster? So far I've been using trial and error, but I still haven't
found the right number of nodes to handle our 4k msg/s topic. I've tried up
to 6 Amazon m3.larges (2 cores, 7.5 GB memory), but even that feels excessive,
since we currently process this data load on a single-node MapReduce cluster,
an m3.xlarge (4 cores, 15 GB memory).

Thanks,
Patrick



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-unable-to-handle-production-Kafka-load-tp15077.html

Re: Kafka - streaming from multiple topics

2014-08-13 Thread maddenpj
Can you link to the JIRA issue? I'm having to work around this bug, and it
would be nice to watch the JIRA so I can change my code when it's fixed.



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p12053.html



Spark Streaming worker underutilized?

2014-08-08 Thread maddenpj
I currently have a 4-node Spark setup: 1 master and 3 workers running in
Spark standalone mode. I am currently stress testing a Spark application I
wrote that reads data from Kafka and puts it into Redshift. I'm pretty happy
with the performance (reading about 6k messages per second out of Kafka), but
I've noticed, just from watching top on the worker nodes, that one node seems
totally underutilized. It stays at near-zero load and its memory profile
never changes.

For instance, on the two workers that are doing work, I see their memory go
from 3.0 GB free to about 1.5 GB free as I start loading data into Kafka, but
the third node stays at 3 GB free.

Have I misconfigured something? I followed the standalone setup and see 3
workers registered, with all cores reported as in use.
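
One guess on my part: with a single receiver, received blocks land on the receiving node plus a replica, which would leave a third node idle. A rough sketch of spreading the stream across the cluster before the heavy work, with placeholder names:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Shuffle received blocks across all executors before processing, instead of
// running tasks only on the nodes that hold the receiver's blocks.
def spreadLoad(ssc: StreamingContext, zkQuorum: String, group: String,
               topics: Map[String, Int]) = {
  val raw = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
  raw.repartition(ssc.sparkContext.defaultParallelism)
}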



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-worker-underutilized-tp11808.html



Re: memory issue on standalone master

2014-08-07 Thread maddenpj
It looks like your Java heap space is too low: -Xmx512m. It's only using
0.5 GB of RAM; try bumping it up.



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/memory-issue-on-standalone-master-tp11610p11711.html



Using Spark Streaming with Kafka 0.7.2

2014-07-25 Thread maddenpj
Hi all,

Currently we have Kafka 0.7.2 running in production and can't upgrade for
external reasons, but Spark Streaming (1.0.1) was built against Kafka 0.8.0.
What is the best way to use Spark Streaming with older versions of Kafka?
At the moment I'm investigating building Spark Streaming myself, but I can't
find any documentation specifically about building Spark Streaming.



View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-Streaming-with-Kafka-0-7-2-tp10674.html