I am seeing this same issue. Bumping for visibility.
--
I'm seeing a lot of Akka timeouts which eventually lead to job failure in
Spark Streaming when removing blocks (example stack trace below). It appears
to be related to these issues: SPARK-3015
(https://issues.apache.org/jira/browse/SPARK-3015) and SPARK-3139.
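While those JIRAs are open, here's a minimal workaround sketch I've been
poking at; the property names are the Spark 1.x Akka settings, and the
values are guesses to tune, not recommendations:

    // Hedged workaround sketch while SPARK-3015/SPARK-3139 are open; the
    // values below are guesses, tune for your cluster.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-job")
      .set("spark.akka.timeout", "300")    // seconds; 1.x default is 100
      .set("spark.akka.askTimeout", "60")  // seconds; used for control-plane asks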
Hey Ameet,
Thanks for the info. I'm running into the same issue myself; my last
attempt crashed with ulimit set to 16834. I'm going to raise it and try again,
but I would like to know the best practice for computing this value. Can you
talk about the worker nodes? What are their specs? At least 45
I actually got this same exact issue compiling an unrelated project (not using
Spark). Maybe it's a protobuf issue?
--
different
values but it looks like only one worker is actually doing the writing to
MySQL. Obviously this is not ideal because I need the parallelism to insert
this data in a timely manner.
Here's the code https://gist.github.com/maddenpj/5032c76aeb330371a6e6
Update for posterity: so once again I solved the problem shortly after
posting to the mailing list. updateStateByKey uses the default
partitioner, which in my case seemed to collapse to a single partition.
Changing my call from .updateStateByKey[Long](updateFn) to the overload that
takes an explicit partition count, .updateStateByKey[Long](updateFn,
numPartitions), fixed it.
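For anyone who lands here later, a minimal self-contained sketch of that fix
(updateFn, the socket source, and the partition count of 12 are illustrative;
the gist below has the real code):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair DStream ops

    val ssc = new StreamingContext(
      new SparkConf().setAppName("update-state-example"), Seconds(60))
    ssc.checkpoint("/tmp/spark-checkpoints")  // required by updateStateByKey

    val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

    def updateFn(newValues: Seq[Long], state: Option[Long]): Option[Long] =
      Some(newValues.sum + state.getOrElse(0L))

    // Before: default partitioner, which here collapsed to one partition.
    // val totals = pairs.updateStateByKey[Long](updateFn)

    // After: an explicit partition count spreads the state across workers.
    val totals = pairs.updateStateByKey[Long](updateFn, 12)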
Yup, it's all in the gist:
https://gist.github.com/maddenpj/5032c76aeb330371a6e6
Lines 6-9 deal with setting up the driver specifically. This sets the driver
up once per partition, so the connection is kept around for the whole
partition instead of being created per record.
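In case the gist ever disappears, the general shape of that per-partition
setup is roughly this (the URL, credentials, and table name are placeholders,
not the real ones):

    import java.sql.DriverManager
    import org.apache.spark.streaming.dstream.DStream

    // Assumed input shape: (key, total) pairs computed by the streaming job.
    def saveToMySql(totals: DStream[(String, Long)]): Unit =
      totals.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // One connection per partition, reused for every record in it.
          val conn = DriverManager.getConnection(
            "jdbc:mysql://db-host:3306/mydb", "user", "password")
          val stmt = conn.prepareStatement(
            "INSERT INTO totals (k, total) VALUES (?, ?)")
          try {
            records.foreach { case (k, total) =>
              stmt.setString(1, k)
              stmt.setLong(2, total)
              stmt.executeUpdate()
            }
          } finally {
            stmt.close()
            conn.close()
          }
        }
      }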
--
of data (so if we've already seen that particular piece we just update the
existing total in MySQL with the total Spark just computed in the current
window).
https://gist.github.com/maddenpj/74a4c8ce372888ade92d
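The "update the existing total" part maps to a MySQL upsert, roughly like
this (the table and column names are assumptions; the gist has the real
schema):

    // Upsert keyed totals: insert a new row, or overwrite the existing
    // total when we've already seen that key (needs a unique key on k).
    val upsertSql =
      """INSERT INTO totals (k, total) VALUES (?, ?)
        |ON DUPLICATE KEY UPDATE total = VALUES(total)""".stripMargin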
One thing I have noticed is my
Oh, I should add that I've tried a range of batch durations and reduce-by-window
durations to no effect. I'm not too sure how to choose these.

Today I've been testing with batch durations of 1 to 10 minutes and reduce
window durations of 10 or 20 minutes.
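For concreteness, this is where those two durations plug in (the values are
just the ones from my tests, and the socket source is a stand-in):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair DStream ops

    // Batch duration: how often a new micro-batch is produced.
    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-test"), Minutes(1))

    val counts = ssc.socketTextStream("localhost", 9999)
      .map(word => (word, 1L))
      // Window duration (10 min) and slide duration (10 min); both must be
      // multiples of the batch duration.
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(10), Minutes(10))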
--
Another update: actually it just hit me that my problem is probably right here:
https://gist.github.com/maddenpj/74a4c8ce372888ade92d#file-gistfile1-scala-L22
I'm creating a JDBC connection on every record; that's probably what's
killing the performance. I assume the fix is just broadcast
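A note for later readers: a live JDBC connection isn't serializable, so it
can't be broadcast as-is. A common pattern instead is a lazily initialized
connection per executor JVM, roughly like this (URL and credentials are
placeholders):

    import java.sql.{Connection, DriverManager}

    object ConnectionPool {
      // A lazy val initializes once per JVM (so once per executor),
      // not once per record or per task.
      lazy val connection: Connection = DriverManager.getConnection(
        "jdbc:mysql://db-host:3306/mydb", "user", "password")
    }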
Can you link to the JIRA issue? I'm having to work around this bug and it
would be nice to monitor the JIRA so I can change my code when it's fixed.
--
I currently have a 4 node Spark setup, 1 master and 3 workers, running in
Spark standalone mode. I am currently stress testing a Spark application I
wrote that reads data from Kafka and puts it into Redshift. I'm pretty happy
with the performance (reading about 6k messages per second out of Kafka).
It looks like your Java heap space is too low: -Xmx512m. It's only using 0.5 GB
of RAM; try bumping this up.
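Depending on which JVM is sitting at 512m, one sketch of raising it (the
property name is the standard Spark one; "4g" is just an example value):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")  // executor heap
    // Driver and daemon heaps must be set before the JVM starts, e.g. via
    // spark-submit --driver-memory 4g, or SPARK_DAEMON_MEMORY in
    // spark-env.sh for the standalone master/worker daemons.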
--
Hi all,
Currently we have Kafka 0.7.2 running in production and can't upgrade for
external reasons; however, Spark Streaming (1.0.1) was built against Kafka 0.8.0.
What is the best way to use Spark Streaming with older versions of Kafka?
Currently I'm investigating trying to build Spark Streaming
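One option worth sketching while investigating: wrap the old Kafka 0.7
consumer in a custom receiver. Only the Receiver API below is real Spark 1.x;
the Kafka-side calls are placeholders you'd fill in with the 0.7 client:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class Kafka07Receiver(topic: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("kafka-0.7-consumer") {
          override def run(): Unit = {
            // Placeholder: build a Kafka 0.7 ConsumerConnector here, then
            // loop over its message stream calling store(message) until
            // isStopped() returns true.
          }
        }.start()
      }

      def onStop(): Unit = ()  // consumer shutdown would go here
    }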