Re: spark streaming rate limiting from kafka

2014-07-22 Thread Bill Jay
Hi Tobias,

I tried 10 as numPartitions. The number of executors allocated equals the
number of DStreams, so the parameter does not seem to spread the data across
many partitions. To do that, it seems we have to call repartition. If
numPartitions did distribute the data to multiple executors/partitions, I
would be able to save the running time incurred by repartition.
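
To make this concrete, here is a minimal sketch of the two-step pattern I am
describing, assuming the Spark 1.0 high-level Kafka receiver (host, group,
topic, and the counts below are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-repartition-sketch")
val ssc = new StreamingContext(conf, Seconds(10))
// The 10 below is the number of consumer threads inside this single receiver,
// not the number of Spark partitions the received data ends up in.
val kafkaStream =
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 10))
// repartition() is the extra step whose running time I would like to avoid;
// it is what actually spreads the received data across executors.
val spread = kafkaStream.repartition(40)
spread.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()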

Bill




On Mon, Jul 21, 2014 at 6:43 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 numPartitions means the number of Spark partitions that the data received
 from Kafka will be split into. It has nothing to do with Kafka partitions, as
 far as I know.

 If you create multiple Kafka consumers, it doesn't seem like you can
 specify which consumer will consume which Kafka partitions. Instead, Kafka
 (at least with the interface that is exposed by the Spark Streaming API)
 will do something called a rebalance and assign Kafka partitions to consumers
 evenly; you can see this in the client logs.

 When using multiple Kafka consumers with auto.offset.reset = true, please
 expect to run into this one:
 https://issues.apache.org/jira/browse/SPARK-2383

 Tobias


 On Tue, Jul 22, 2014 at 3:40 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tathagata,

 I am currently creating multiple DStreams to consume from different topics.
 How can I let each consumer consume from different partitions? I found the
 following method in the Spark API:

 createStream[K, V, U <: Decoder[_], T <: Decoder[_]](
     jssc: JavaStreamingContext,
     keyTypeClass: Class[K],
     valueTypeClass: Class[V],
     keyDecoderClass: Class[U],
     valueDecoderClass: Class[T],
     kafkaParams: Map[String, String],
     topics: Map[String, Integer],
     storageLevel: StorageLevel
 ): JavaPairReceiverInputDStream[K, V]

 Create an input stream that pulls messages from a Kafka Broker.




 The topics parameter is:
 *Map of (topic_name -> numPartitions) to consume. Each partition is
 consumed in its own thread*

 Does numPartitions mean the total number of partitions to consume from
 topic_name, or the index of a partition? How can we specify, for each
 createStream, which partitions of the Kafka topic to consume? If that is
 possible, I will get a lot of parallelism from the source of the data. Thanks!
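
 For illustration, this is a sketch of how I am calling it now, under the
 assumption that numPartitions is a consumer-thread count rather than a
 partition index (topic names and thread counts are placeholders; I use the
 Scala overload for brevity and assume an existing StreamingContext ssc):

 val topics = Map("clicks" -> 4, "views" -> 2) // threads per topic, in this one receiver
 val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", topics)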

 Bill

 On Thu, Jul 17, 2014 at 6:21 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 You can create multiple Kafka streams to partition your topics across
 them, which will run as multiple receivers on multiple executors. This is
 covered in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
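
 Roughly, the pattern from that guide looks like this (a sketch; names and
 counts are placeholders, and ssc is an existing StreamingContext): several
 streams, one receiver each, unioned so downstream stages see a single DStream.

 val numStreams = 5
 val kafkaStreams = (1 to numStreams).map { _ =>
   KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
 }
 val unified = ssc.union(kafkaStreams)
 val spread = unified.repartition(20) // optionally spread further before heavy work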

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.
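
 In practice it is a plain configuration property; a hedged sketch (the name
 spark.streaming.receiver.maxRate is taken from the linked pull request; the
 value is records per second, applied per receiver, and my understanding is
 that 0 or a negative value means unlimited):

 val conf = new SparkConf()
   .setAppName("rate-limited-stream")
   .set("spark.streaming.receiver.maxRate", "10000") // at most 10k records/sec/receiver
 val ssc = new StreamingContext(conf, Seconds(10))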

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song









Re: spark streaming rate limiting from kafka

2014-07-21 Thread Bill Jay
Hi Tathagata,

I am currently creating multiple DStreams to consume from different topics.
How can I let each consumer consume from different partitions? I found the
following method in the Spark API:

createStream[K, V, U <: Decoder[_], T <: Decoder[_]](
    jssc: JavaStreamingContext,
    keyTypeClass: Class[K],
    valueTypeClass: Class[V],
    keyDecoderClass: Class[U],
    valueDecoderClass: Class[T],
    kafkaParams: Map[String, String],
    topics: Map[String, Integer],
    storageLevel: StorageLevel
): JavaPairReceiverInputDStream[K, V]

Create an input stream that pulls messages from a Kafka Broker.




The topics parameter is:
*Map of (topic_name -> numPartitions) to consume. Each partition is
consumed in its own thread*

Does numPartitions mean the total number of partitions to consume from
topic_name, or the index of a partition? How can we specify, for each
createStream, which partitions of the Kafka topic to consume? If that is
possible, I will get a lot of parallelism from the source of the data. Thanks!

Bill

On Thu, Jul 17, 2014 at 6:21 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 You can create multiple Kafka streams to partition your topics across them,
 which will run as multiple receivers on multiple executors. This is covered
 in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song







Re: spark streaming rate limiting from kafka

2014-07-21 Thread Tobias Pfeiffer
Bill,

numPartitions means the number of Spark partitions that the data received
from Kafka will be split into. It has nothing to do with Kafka partitions, as
far as I know.

If you create multiple Kafka consumers, it doesn't seem like you can
specify which consumer will consume which Kafka partitions. Instead, Kafka
(at least with the interface that is exposed by the Spark Streaming API)
will do something called a rebalance and assign Kafka partitions to consumers
evenly; you can see this in the client logs.

When using multiple Kafka consumers with auto.offset.reset = true, please
expect to run into this one:
https://issues.apache.org/jira/browse/SPARK-2383

Tobias


On Tue, Jul 22, 2014 at 3:40 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Hi Tathagata,

 I am currently creating multiple DStreams to consume from different topics.
 How can I let each consumer consume from different partitions? I found the
 following method in the Spark API:

 createStream[K, V, U <: Decoder[_], T <: Decoder[_]](
     jssc: JavaStreamingContext,
     keyTypeClass: Class[K],
     valueTypeClass: Class[V],
     keyDecoderClass: Class[U],
     valueDecoderClass: Class[T],
     kafkaParams: Map[String, String],
     topics: Map[String, Integer],
     storageLevel: StorageLevel
 ): JavaPairReceiverInputDStream[K, V]

 Create an input stream that pulls messages from a Kafka Broker.




 The topics parameter is:
 *Map of (topic_name -> numPartitions) to consume. Each partition is
 consumed in its own thread*

 Does numPartitions mean the total number of partitions to consume from
 topic_name, or the index of a partition? How can we specify, for each
 createStream, which partitions of the Kafka topic to consume? If that is
 possible, I will get a lot of parallelism from the source of the data. Thanks!

 Bill

 On Thu, Jul 17, 2014 at 6:21 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 You can create multiple Kafka streams to partition your topics across
 them, which will run as multiple receivers on multiple executors. This is
 covered in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song








Re: spark streaming rate limiting from kafka

2014-07-20 Thread Bill Jay
Hi Tobias,

It seems that repartition can create more executors for the stages that
follow data receiving. However, the number of executors is still far less
than what I require (I specify one core for each executor). Based on the
indices of the executors in the stage, I find many numbers are missing in
between. For example, if I repartition(100), the executor indices may be
1, 3, 5, 10, etc. In the end, there may be only 45 executors although I
request 100 partitions.

Bill


On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song






Re: spark streaming rate limiting from kafka

2014-07-18 Thread Chen Song
Thanks Tathagata,

It would be awesome if Spark Streaming could support limiting the receiving
rate in general. I tried to explore the link you provided but could not find
any specific JIRA related to this. Do you have the JIRA number for this?



On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 You can create multiple Kafka streams to partition your topics across them,
 which will run as multiple receivers on multiple executors. This is covered
 in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song







-- 
Chen Song


Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Oops, wrong link!
JIRA: https://github.com/apache/spark/pull/945/files
Github PR: https://github.com/apache/spark/pull/945/files


On Fri, Jul 18, 2014 at 7:19 AM, Chen Song chen.song...@gmail.com wrote:

 Thanks Tathagata,

 It would be awesome if Spark Streaming could support limiting the receiving
 rate in general. I tried to explore the link you provided but could not find
 any specific JIRA related to this. Do you have the JIRA number for this?



 On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 You can create multiple Kafka streams to partition your topics across
 them, which will run as multiple receivers on multiple executors. This is
 covered in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song







 --
 Chen Song




Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Dang! Messed it up again!

JIRA: https://issues.apache.org/jira/browse/SPARK-1341
Github PR: https://github.com/apache/spark/pull/945/files


On Fri, Jul 18, 2014 at 11:35 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 Oops, wrong link!
 JIRA: https://github.com/apache/spark/pull/945/files
 Github PR: https://github.com/apache/spark/pull/945/files


 On Fri, Jul 18, 2014 at 7:19 AM, Chen Song chen.song...@gmail.com wrote:

 Thanks Tathagata,

 It would be awesome if Spark Streaming could support limiting the receiving
 rate in general. I tried to explore the link you provided but could not find
 any specific JIRA related to this. Do you have the JIRA number for this?



 On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 You can create multiple Kafka streams to partition your topics across
 them, which will run as multiple receivers on multiple executors. This is
 covered in the Spark Streaming guide.
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

 And for the purpose of this thread, to answer the original question: we now
 have the ability
 https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
 to limit the receiving rate. It's in the master branch, and will be
 available in Spark 1.1. It basically sets a limit at the receiver level
 (so it applies to all sources) on the maximum number of records per second
 that will be received by the receiver.

 TD


 On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song







 --
 Chen Song





Re: spark streaming rate limiting from kafka

2014-07-17 Thread Chen Song
Thanks Luis and Tobias.


On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




-- 
Chen Song


Re: spark streaming rate limiting from kafka

2014-07-17 Thread Bill Jay
I also have an issue consuming from Kafka. When I consume from Kafka, there
is always a single executor working on this job. Even if I use repartition,
it seems that there is still a single executor. Does anyone have an idea how
to add parallelism to this job?



On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song




Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tobias Pfeiffer
Bill,

are you saying that, after repartition(400), you have 400 partitions on one
host and the other hosts receive none of the data?

Tobias


On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song





Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tathagata Das
You can create multiple Kafka streams to partition your topics across them,
which will run as multiple receivers on multiple executors. This is covered
in the Spark Streaming guide.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

And for the purpose of this thread, to answer the original question: we now
have the ability
https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC
to limit the receiving rate. It's in the master branch, and will be
available in Spark 1.1. It basically sets a limit at the receiver level
(so it applies to all sources) on the maximum number of records per second
that will be received by the receiver.

TD


On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 are you saying that, after repartition(400), you have 400 partitions on one
 host and the other hosts receive none of the data?

 Tobias


 On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 I also have an issue consuming from Kafka. When I consume from Kafka,
 there is always a single executor working on this job. Even if I use
 repartition, it seems that there is still a single executor. Does anyone
 have an idea how to add parallelism to this job?



 On Thu, Jul 17, 2014 at 2:06 PM, Chen Song chen.song...@gmail.com
 wrote:

 Thanks Luis and Tobias.


 On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
 wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias




 --
 Chen Song






Re: spark streaming rate limiting from kafka

2014-07-01 Thread Luis Ángel Vicente Sánchez
Maybe reducing the batch duration would help :\
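
For example (a sketch; the interval is a placeholder), the batch duration is
fixed when the StreamingContext is constructed, so a smaller value means less
data per batch:

val ssc = new StreamingContext(conf, Seconds(2)) // e.g. 2s batches instead of 10s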


2014-07-01 17:57 GMT+01:00 Chen Song chen.song...@gmail.com:

 In my use case, if I need to stop Spark Streaming for a while, data
 accumulates heavily on the Kafka topic-partitions. After I restart the
 Spark Streaming job, the worker's heap goes out of memory on the fetch of
 the first batch.

 I am wondering:

 * Is there a way to throttle reading from Kafka in Spark Streaming jobs?
 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.
 * Is there a way to limit the consumption rate on the Kafka side? (This one
 is not actually about Spark Streaming and doesn't seem to be a question for
 this group, but I am raising it anyway.)

 I have looked at the code example below, but this doesn't seem to be
 supported.

 KafkaUtils.createStream ...
 Thanks, All
 --
 Chen Song




Re: spark streaming rate limiting from kafka

2014-07-01 Thread Tobias Pfeiffer
Hi,

On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via an offset, for example)? Setting this to a small
 number would force the DStream to read less data initially.


Please see the post at
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
Kafka's auto.offset.reset parameter may be what you are looking for.
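
For reference, a minimal sketch of passing it in via the kafkaParams overload
of createStream (assuming an existing StreamingContext ssc; host, group, and
topic are placeholders; "smallest" replays from the earliest retained offset,
"largest" starts from new messages only):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zk-host:2181",
  "group.id" -> "my-group",
  "auto.offset.reset" -> "largest") // avoid replaying the whole backlog on restart
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)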

Tobias