Tobias,
Your help with the problems I have run into has been very valuable. Thanks a lot!
Bill
Hi Tobias,
I was using Spark 0.9 before, and the master I used was yarn-standalone. In
Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not
sure whether that is the reason why more machines do not provide better
scalability. What is the difference between these two modes?
Bill,
I haven't worked with Yarn, but I would try adding a repartition() call
after you receive your data from Kafka. I would be surprised if that didn't
help.
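Something along these lines is what I have in mind. This is an untested sketch for Spark 1.0; the ZooKeeper address, consumer group, topic name, and partition count are placeholders you would adapt to your setup:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaRepartitionSketch")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Placeholder connection settings, replace with your own.
    val kafkaStream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))

    // Spread the received data over the cluster before the expensive stages.
    // numPartitions would be roughly the number of worker nodes, or twice that.
    val numPartitions = 300
    val repartitioned = kafkaStream.repartition(numPartitions)

Without the repartition, the data received through a single receiver typically ends up in only a few partitions, so only a few executors get work in the downstream stages.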
Hi Tobias,
Now I have done the repartition and run the program again. I have found a
bottleneck in the whole program. In the streaming job, there is a stage marked as
*combineByKey at ShuffledDStream.scala:42* in the Spark UI. This stage is
executed repeatedly. However, during some batches, the number of executors
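For context, that stage comes from a key-based aggregation on the stream. A heavily simplified sketch of the kind of operation involved, assuming kafkaStream is the DStream read from Kafka (illustrative only, not my actual code):

    import org.apache.spark.streaming.StreamingContext._   // pair operations on DStreams

    // The shape of a per-batch key aggregation; on a DStream, reduceByKey
    // runs as combineByKey on a ShuffledDStream, which is the stage name
    // shown in the Spark UI.
    val counts = kafkaStream
      .map { case (_, value) => (value, 1L) }   // placeholder key extraction
      .reduceByKey(_ + _)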
Bill,
good to know you found your bottleneck. Unfortunately, I don't know how to
solve this; until now, I have used Spark only with embarrassingly parallel
operations such as map or filter. I hope someone else might provide more
insight here.
Tobias
Hi Tobias,
Thanks for the suggestion. I have tried adding more nodes, going from 300 to 400.
It seems the running time did not improve.
On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Bill,
can't you just add more nodes in order to speed up the processing?
Tobias
Bill,
do the additional 100 nodes receive any tasks at all? (I don't know which
cluster you use, but with Mesos you could check client logs in the web
interface.) You might want to try something like repartition(N) or
repartition(N*2) (with N the number of your nodes) after you receive your
data.
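One quick way to see how much parallelism you actually get, assuming kafkaStream is the DStream returned by KafkaUtils.createStream (untested sketch):

    // Log the number of partitions, i.e. parallel tasks for the following
    // stages, in every batch; if this number stays tiny, the extra nodes
    // have nothing to do.
    kafkaStream.foreachRDD { rdd =>
      println("partitions in this batch: " + rdd.partitions.size)
    }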
Hi all,
I have a problem using Spark Streaming to accept input data and update a
result.
The input data comes from Kafka, and the output is a report map that is
updated with historical data every minute. My current method is to set the
batch size to 1 minute and use foreachRDD to update the result.
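In code, the structure is roughly the following. It is heavily simplified, and the topic, keys, and update logic are placeholders rather than my real job:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair operations on DStreams
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("MinuteReportSketch")
    val ssc = new StreamingContext(conf, Seconds(60))   // 1-minute batches

    val stream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "report-group", Map("events" -> 1))

    // The running report is kept on the driver and updated once per batch.
    var report = Map.empty[String, Long]

    stream
      .map { case (_, line) => (line, 1L) }              // placeholder key extraction
      .reduceByKey(_ + _)
      .foreachRDD { rdd =>
        rdd.collect().foreach { case (k, v) =>
          report = report.updated(k, report.getOrElse(k, 0L) + v)
        }
      }

    ssc.start()
    ssc.awaitTermination()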