Re: spark streaming rate limiting from kafka

2014-07-22 Thread Bill Jay
Hi Tobias, I tried to use 10 as numPartitions. The number of executors allocated equals the number of DStreams. Therefore, it seems the parameter does not spread data into many partitions. In order to do that, it seems we have to do a repartition. If numPartitions will distribute the data to multiple
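
A minimal sketch of the repartition approach discussed here (assumed names: ssc and zkQuorum are an existing StreamingContext and ZooKeeper quorum; the topic, group, and partition count are illustrative):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // One receiver-based Kafka stream with a single consumer thread.
    val stream = KafkaUtils.createStream(ssc, zkQuorum, "my-group", Map("my-topic" -> 1))
    // repartition(10) reshuffles each batch into 10 Spark partitions, so the
    // stages after receiving can use up to 10 cores in parallel.
    val spread = stream.repartition(10)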

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Bill Jay
Hi Tathagata, I am currently creating multiple DStreams to consume from different topics. How can I let each consumer consume from different partitions? I found the following signature in the Spark API: createStream[K, V, U <: Decoder[_], T <: Decoder[_]](jssc: JavaStreamingContext
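
A minimal sketch of one way to do this with the high-level consumer (assumed names ssc and zkQuorum; with Kafka's high-level consumer, members of the same consumer group divide a topic's partitions among themselves):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // One DStream per topic, all sharing one consumer group, so Kafka's
    // group rebalancing assigns each topic's partitions across the consumers.
    val streams = Seq("topicA", "topicB").map { topic =>
      KafkaUtils.createStream(ssc, zkQuorum, "shared-group", Map(topic -> 1))
    }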

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Tobias Pfeiffer
Bill, numPartitions means the number of Spark partitions that the data received from Kafka will be split into. It has nothing to do with Kafka partitions, as far as I know. If you create multiple Kafka consumers, it doesn't seem like you can specify which consumer will consume which Kafka

Re: spark streaming rate limiting from kafka

2014-07-20 Thread Bill Jay
Hi Tobias, It seems that repartition can create more executors for the stages following data receiving. However, the number of executors is still far fewer than what I require (I specify one core for each executor). Based on the indexes of the executors in the stage, I find many numbers are missing

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Chen Song
Thanks Tathagata, That would be awesome if Spark Streaming could support limiting the receiving rate in general. I tried to explore the link you provided but could not find any specific JIRA related to this. Do you have the JIRA number for this? On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Oops, wrong link! JIRA: https://github.com/apache/spark/pull/945/files Github PR: https://github.com/apache/spark/pull/945/files On Fri, Jul 18, 2014 at 7:19 AM, Chen Song chen.song...@gmail.com wrote: Thanks Tathagata, That would be awesome if Spark streaming can support receiving rate in

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Dang! Messed it up again! JIRA: https://issues.apache.org/jira/browse/SPARK-1341 Github PR: https://github.com/apache/spark/pull/945/files On Fri, Jul 18, 2014 at 11:35 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Oops, wrong link! JIRA:
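
For reference, the patch in that PR adds a receiver-side rate limit controlled by the spark.streaming.receiver.maxRate property. A minimal sketch of setting it (the app name and the 10000 records/second limit are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-consumer")
      // Cap each receiver at 10000 records per second; unset means unlimited.
      .set("spark.streaming.receiver.maxRate", "10000")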

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Chen Song
Thanks Luis and Tobias. On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote: * Is there a way to control how far a Kafka DStream can read on a topic-partition (via offsets, for example)? By setting

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Bill Jay
I also have an issue consuming from Kafka. When I consume from Kafka, there is always a single executor working on this job. Even if I use repartition, it seems that there is still a single executor. Does anyone have an idea how to add parallelism to this job? On Thu, Jul 17, 2014 at 2:06 PM, Chen

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tobias Pfeiffer
Bill, are you saying that, after repartition(400), you have 400 partitions on one host and the other hosts receive none of the data? Tobias On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com wrote: I also have an issue consuming from Kafka. When I consume from Kafka, there

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tathagata Das
You can create multiple Kafka streams to partition your topics across them, which will run multiple receivers on multiple executors. This is covered in the Spark Streaming guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving And for the
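
Roughly the pattern from that section of the guide (assumed names ssc, zkQuorum, and topic; the number of streams is illustrative):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Several receiver-based streams, each backed by its own receiver,
    // unioned into a single DStream for downstream processing.
    val numStreams = 5
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, "my-group", Map(topic -> 1))
    }
    val unified = ssc.union(kafkaStreams)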

spark streaming rate limiting from kafka

2014-07-01 Thread Chen Song
In my use case, if I need to stop Spark Streaming for a while, data accumulates a lot on the Kafka topic-partitions. After I restart the Spark Streaming job, the worker's heap goes out of memory on the fetch of the 1st batch. I am wondering: * Is there a way to throttle reading from Kafka in
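
As an illustration of the knobs available here (not necessarily what the replies below recommend): the receiver-based stream accepts arbitrary Kafka consumer properties via a kafkaParams map, including auto.offset.reset. A sketch, with ssc and zkQuorum assumed:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> zkQuorum,
      "group.id" -> "my-group",
      // Only applies when the group has no committed offsets: start from
      // the latest offset instead of replaying the whole backlog.
      "auto.offset.reset" -> "largest")
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)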

Re: spark streaming rate limiting from kafka

2014-07-01 Thread Luis Ángel Vicente Sánchez
Maybe reducing the batch duration would help :\ 2014-07-01 17:57 GMT+01:00 Chen Song chen.song...@gmail.com: In my use case, if I need to stop spark streaming for a while, data would accumulate a lot on kafka topic-partitions. After I restart spark streaming job, the worker's heap will go
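
For completeness, the batch duration is the second argument to the StreamingContext constructor; a minimal sketch (conf is an existing SparkConf):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A shorter batch interval means each batch pulls a smaller slice of the
    // backlog, at the cost of more frequent scheduling overhead.
    val ssc = new StreamingContext(conf, Seconds(1))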

Re: spark streaming rate limiting from kafka

2014-07-01 Thread Tobias Pfeiffer
Hi, On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote: * Is there a way to control how far a Kafka DStream can read on a topic-partition (via offsets, for example)? By setting this to a small number, it would force the DStream to read less data initially. Please see the post at