Re: spark streaming rate limiting from kafka

2014-07-22 Thread Bill Jay
Hi Tobias,

I tried 10 as numPartitions. The number of executors allocated equals the
number of DStreams, so the parameter does not seem to spread the data across
many partitions. To do that, it seems we have to call repartition. If
numPartitions did distribute the data to multiple executors/partitions, I
would be able to save the running time incurred by repartition.
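
To make this concrete, here is a minimal sketch of the two-step pattern I am
describing, assuming the Spark 1.0 high-level Kafka receiver (host, group,
topic, and the counts below are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-repartition-sketch")
val ssc = new StreamingContext(conf, Seconds(10))
// The 10 below is the number of consumer threads inside this single receiver,
// not the number of Spark partitions the received data ends up in.
val kafkaStream =
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 10))
// repartition() is the extra step whose running time I would like to avoid;
// it is what actually spreads the received data across executors.
val spread = kafkaStream.repartition(40)
spread.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()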

Bill




On Mon, Jul 21, 2014 at 6:43 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 numPartitions means the number of Spark partitions that the data received
 from Kafka will be split into. It has nothing to do with Kafka partitions, as
 far as I know.

 If you create multiple Kafka consumers, it doesn't seem like you can
 specify which consumer will consume which Kafka partitions. Instead, Kafka
 (at least with the interface that is exposed by the Spark Streaming API)
 will do something called a rebalance and assign Kafka partitions to consumers
 evenly; you can see this in the client logs.

 When using multiple Kafka consumers with auto.offset.reset = true, please
 expect to run into this one:
 https://issues.apache.org/jira/browse/SPARK-2383

 Tobias


 On Tue, Jul 22, 2014 at 3:40 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tathagata,

 I am currently creating multiple DStreams to consume from different topics.
 How can I let each consumer consume from different partitions? I found the
 following method in the Spark API:

 createStream[K, V, U <: Decoder[_], T <: Decoder[_]](
     jssc: JavaStreamingContext,
     keyTypeClass: Class[K],
     valueTypeClass: Class[V],
     keyDecoderClass: Class[U],
     valueDecoderClass: Class[T],
     kafkaParams: Map[String, String],
     topics: Map[String, Integer],
     storageLevel: StorageLevel
 ): JavaPairReceiverInputDStream[K, V]

 Create an input stream that pulls messages from a Kafka Broker.




 The topics parameter is:
 *Map of (topic_name -> numPartitions) to consume. Each partition is
 consumed in its own thread*

 Does numPartitions mean the total number of partitions to consume from
 topic_name, or the index of a partition? How can we specify, for each
 createStream, which partitions of the Kafka topic to consume? If that is
 possible, I will get a lot of parallelism from the source of the data. Thanks!
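
 For illustration, this is a sketch of how I am calling it now, under the
 assumption that numPartitions is a consumer-thread count rather than a
 partition index (topic names and thread counts are placeholders; I use the
 Scala overload for brevity and assume an existing StreamingContext ssc):

 val topics = Map("clicks" -> 4, "views" -> 2) // threads per topic, in this one receiver
 val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", topics)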

 Bill

 On Thu, Jul 17, 2014 at 6:21 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 You can create multiple Kafka streams to partition your topics across
 them, which will run as multiple receivers on multiple executors. This is
 covered in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
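
 Roughly, the pattern from that guide looks like this (a sketch; names and
 counts are placeholders, and ssc is an existing StreamingContext): several
 streams, one receiver each, unioned so downstream stages see a single DStream.

 val numStreams = 5
 val kafkaStreams = (1 to numStreams).map { _ =>
   KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
 }
 val unified = ssc.union(kafkaStreams)
 val spread = unified.repartition(20) // optionally spread further before heavy work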

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.
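
 In practice it is a plain configuration property; a hedged sketch (the name
 spark.streaming.receiver.maxRate is taken from the linked pull request; the
 value is records per second, applied per receiver, and my understanding is
 that 0 or a negative value means unlimited):

 val conf = new SparkConf()
   .setAppName("rate-limited-stream")
   .set("spark.streaming.receiver.maxRate", "10000") // at most 10k records/sec/receiver
 val ssc = new StreamingContext(conf, Seconds(10))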

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song









Re: spark streaming rate limiting from kafka

2014-07-21 Thread Bill Jay
Hi Tathagata,

I am currently creating multiple DStreams to consume from different topics.
How can I let each consumer consume from different partitions? I found the
following method in the Spark API:

createStream[K, V, U <: Decoder[_], T <: Decoder[_]](
    jssc: JavaStreamingContext,
    keyTypeClass: Class[K],
    valueTypeClass: Class[V],
    keyDecoderClass: Class[U],
    valueDecoderClass: Class[T],
    kafkaParams: Map[String, String],
    topics: Map[String, Integer],
    storageLevel: StorageLevel
): JavaPairReceiverInputDStream[K, V]

Create an input stream that pulls messages from a Kafka Broker.




The topics parameter is:
*Map of (topic_name -> numPartitions) to consume. Each partition is
consumed in its own thread*

Does numPartitions mean the total number of partitions to consume from
topic_name, or the index of a partition? How can we specify, for each
createStream, which partitions of the Kafka topic to consume? If that is
possible, I will get a lot of parallelism from the source of the data. Thanks!

Bill

On Thu, Jul 17, 2014 at 6:21 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 You can create multiple Kafka streams to partition your topics across them,
 which will run as multiple receivers on multiple executors. This is covered
 in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song







Re: spark streaming rate limiting from kafka

2014-07-21 Thread Tobias Pfeiffer
Bill,

numPartitions means the number of Spark partitions that the data received
from Kafka will be split into. It has nothing to do with Kafka partitions, as
far as I know.

If you create multiple Kafka consumers, it doesn't seem like you can
specify which consumer will consume which Kafka partitions. Instead, Kafka
(at least with the interface that is exposed by the Spark Streaming API)
will do something called a rebalance and assign Kafka partitions to consumers
evenly; you can see this in the client logs.

When using multiple Kafka consumers with auto.offset.reset = true, please
expect to run into this one:
https://issues.apache.org/jira/browse/SPARK-2383

Tobias


On Tue, Jul 22, 2014 at 3:40 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Hi Tathagata,

 I am currently creating multiple DStreams to consume from different topics.
 How can I let each consumer consume from different partitions? I found the
 following method in the Spark API:

 createStream[K, V, U <: Decoder[_], T <: Decoder[_]](
     jssc: JavaStreamingContext,
     keyTypeClass: Class[K],
     valueTypeClass: Class[V],
     keyDecoderClass: Class[U],
     valueDecoderClass: Class[T],
     kafkaParams: Map[String, String],
     topics: Map[String, Integer],
     storageLevel: StorageLevel
 ): JavaPairReceiverInputDStream[K, V]

 Create an input stream that pulls messages from a Kafka Broker.




 The topics parameter is:
 *Map of (topic_name -> numPartitions) to consume. Each partition is
 consumed in its own thread*

 Does numPartitions mean the total number of partitions to consume from
 topic_name, or the index of a partition? How can we specify, for each
 createStream, which partitions of the Kafka topic to consume? If that is
 possible, I will get a lot of parallelism from the source of the data. Thanks!

 Bill

 On Thu, Jul 17, 2014 at 6:21 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 You can create multiple Kafka streams to partition your topics across
 them, which will run as multiple receivers on multiple executors. This is
 covered in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song








Re: spark streaming rate limiting from kafka

2014-07-20 Thread Bill Jay
Hi Tobias,

It seems that repartition can create more executors for the stages that
follow data receiving. However, the number of executors is still far less
than what I require (I specify one core for each executor). Based on the
indices of the executors in the stage, I find many numbers are missing in
between. For example, if I repartition(100), the executor indices may be
1, 3, 5, 10, etc. In the end, there may be only 45 executors although I
request 100 partitions.

Bill


On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song






Re: spark streaming rate limiting from kafka

2014-07-18 Thread Chen Song
Thanks Tathagata,

It would be awesome if Spark Streaming could support limiting the receiving
rate in general. I tried to explore the link you provided but could not find
any specific JIRA related to this. Do you have the JIRA number for this?



On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 You can create multiple Kafka streams to partition your topics across them,
 which will run as multiple receivers on multiple executors. This is covered
 in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song







-- 
Chen Song


Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Oops, wrong link!
JIRA: https://github.com/apache/spark/pull/945/files
Github PR: https://github.com/apache/spark/pull/945/files


On Fri, Jul 18, 2014 at 7:19 AM, Chen Song chen.song...@gmail.com wrote:

 Thanks Tathagata,

 It would be awesome if Spark Streaming could support limiting the receiving
 rate in general. I tried to explore the link you provided but could not find
 any specific JIRA related to this. Do you have the JIRA number for this?



 On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 You can create multiple Kafka streams to partition your topics across
 them, which will run as multiple receivers on multiple executors. This is
 covered in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song







 --
 Chen Song




Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Dang! Messed it up again!

JIRA: https://issues.apache.org/jira/browse/SPARK-1341
Github PR: https://github.com/apache/spark/pull/945/files


On Fri, Jul 18, 2014 at 11:35 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 Oops, wrong link!
 JIRA: https://github.com/apache/spark/pull/945/files
 Github PR: https://github.com/apache/spark/pull/945/files


 On Fri, Jul 18, 2014 at 7:19 AM, Chen Song chen.song...@gmail.com wrote:

 Thanks Tathagata,

 It would be awesome if Spark Streaming could support limiting the receiving
 rate in general. I tried to explore the link you provided but could not find
 any specific JIRA related to this. Do you have the JIRA number for this?



 On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 You can create multiple Kafka streams to partition your topics across
 them, which will run as multiple receivers on multiple executors. This is
 covered in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song







 --
 Chen Song





Re: spark streaming rate limiting from kafka

2014-07-17 Thread Chen Song
Thanks Luis and Tobias.


On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




-- 
Chen Song


Re: spark streaming rate limiting from kafka

2014-07-17 Thread Bill Jay
I also have an issue consuming from Kafka. When I consume from Kafka, there
is always a single executor working on this job. Even if I use repartition,
it seems that there is still a single executor. Does anyone have an idea how
to add parallelism to this job?



On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song




Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tobias Pfeiffer
Bill,

are you saying that, after repartition(400), you have 400 partitions on one
host and the other hosts receive none of the data?

Tobias


On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song





Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tathagata Das
You can create multiple Kafka streams to partition your topics across them,
which will run as multiple receivers on multiple executors. This is covered
in the Spark Streaming guide.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

And for the purpose of this thread, to answer the original question: we now
have the ability
https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
to limit the receiving rate. It's in the master branch, and will be
available in Spark 1.1. It basically sets a limit at the receiver level
(so it applies to all sources) on the maximum number of records per second
that will be received by the receiver.

TD


On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song






Re: spark streaming rate limiting from kafka

2014-07-01 Thread Luis Ángel Vicente Sánchez
Maybe reducing the batch duration would help :\
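
For example (a sketch; the interval is a placeholder), the batch duration is
fixed when the StreamingContext is constructed, so a smaller value means less
data per batch:

val ssc = new StreamingContext(conf, Seconds(2)) // e.g. 2s batches instead of 10s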


2014-07-01 17:57 GMT+01:00 Chen Song chen.song...@gmail.com:

 In my use case, if I need to stop Spark Streaming for a while, data
 accumulates heavily on the Kafka topic-partitions. After I restart the
 Spark Streaming job, the worker's heap goes out of memory on the fetch of
 the first batch.

 I am wondering:

 * Is there a way to throttle reading from Kafka in Spark Streaming jobs?
 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.
 * Is there a way to limit the consumption rate on the Kafka side? (This one
 is not actually about Spark Streaming and doesn't seem to be a question for
 this group, but I am raising it anyway.)

 I have looked at the code example below, but this doesn't seem to be
 supported.

 KafkaUtils.createStream ...
 Thanks, All
 --
 Chen Song




Re: spark streaming rate limiting from kafka

2014-07-01 Thread Tobias Pfeiffer
Hi,

On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


Please see the post at
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
Kafka's auto.offset.reset parameter may be what you are looking for.
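
For reference, a minimal sketch of passing it in via the kafkaParams overload
of createStream (assuming an existing StreamingContext ssc; host, group, and
topic are placeholders; "smallest" replays from the earliest retained offset,
"largest" starts from new messages only):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zk-host:2181",
  "group.id" -> "my-group",
  "auto.offset.reset" -> "largest") // avoid replaying the whole backlog on restart
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)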

Tobias