Hi Tathagata,

I am currently creating multiple DStreams to consume from different topics.
How can I let each consumer consume from different partitions? I found the
following method in the Spark API:

createStream[K, V, U <: Decoder[_], T <: Decoder[_]](
    jssc: JavaStreamingContext,
    keyTypeClass: Class[K],
    valueTypeClass: Class[V],
    keyDecoderClass: Class[U],
    valueDecoderClass: Class[T],
    kafkaParams: Map[String, String],
    topics: Map[String, Integer],
    storageLevel: StorageLevel
): JavaPairReceiverInputDStream[K, V]

Create an input stream that pulls messages from a Kafka broker.

The topics parameter is:
*Map of (topic_name -> numPartitions) to consume. Each partition is
consumed in its own thread*
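
For reference, here is roughly how I am creating each stream at the moment.
This is only a minimal sketch using the simpler Scala overload of
KafkaUtils.createStream; the ZooKeeper address, group id, topic name, and
thread count are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaConsumerSketch") // placeholder name
    val ssc = new StreamingContext(conf, Seconds(10))

    // The last argument is the Map(topic_name -> numPartitions) from the doc above;
    // each entry in the map is consumed within this ONE receiver.
    val stream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "my-consumer-group", Map("events" -> 2))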

Does numPartitions mean the total number of partitions to consume from
topic_name, or the index of a specific partition? How can we specify, for
each createStream call, which partition of the Kafka topic to consume? If
that is possible, I should get a lot of parallelism from the data source.
Thanks!

Bill

On Thu, Jul 17, 2014 at 6:21 PM, Tathagata Das <tathagata.das1...@gmail.com>
wrote:

> You can create multiple Kafka streams to partition your topics across them,
> which will run multiple receivers on multiple executors. This is covered in
> the Spark Streaming guide.
> <http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving>
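>
> For example, something along these lines (a rough sketch of the pattern from
> the guide; ssc, the ZooKeeper host, group, topic name, and stream count are
> placeholders):
>
>     val numStreams = 5
>     val kafkaStreams = (1 to numStreams).map { _ =>
>       // One receiver per stream, each consuming one thread on "events"
>       KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("events" -> 1))
>     }
>     // Union the streams so downstream operations see a single DStream
>     val unifiedStream = ssc.union(kafkaStreams)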
>
> And for the purpose of this thread, to answer the original question, we now
> have the ability
> <https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC>
> to limit the receiving rate. It's in the master branch and will be
> available in Spark 1.1. It basically sets a limit at the receiver level
> (so it applies to all sources) on the maximum number of records per second
> that a receiver will accept.
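>
> As a sketch of what that will look like (the property name is from memory,
> so double-check it against the 1.1 docs once they are out):
>
>     // Cap each receiver at roughly 10,000 records per second
>     sparkConf.set("spark.streaming.receiver.maxRate", "10000")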
>
> TD
>
>
> On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>
>> Bill,
>>
>> are you saying that, after repartition(400), you have 400 partitions on one
>> host and the other hosts receive none of the data?
>>
>> Tobias
>>
>>
>> On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay <bill.jaypeter...@gmail.com>
>> wrote:
>>
>>> I also have an issue consuming from Kafka. When I consume from Kafka,
>>> there is always a single executor working on this job. Even when I use
>>> repartition, it seems that there is still a single executor. Does anyone
>>> have an idea how to add parallelism to this job?
>>>
>>>
>>>
>>> On Thu, Jul 17, 2014 at 2:06 PM, Chen Song <chen.song...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Luis and Tobias.
>>>>
>>>>
>>>> On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer <t...@preferred.jp>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On Wed, Jul 2, 2014 at 1:57 AM, Chen Song <chen.song...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> * Is there a way to control how far the Kafka DStream can read on a
>>>>>> topic-partition (via offsets, for example)? Setting this to a small
>>>>>> number would force the DStream to read less data initially.
>>>>>>
>>>>>
>>>>> Please see the post at
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
>>>>> Kafka's auto.offset.reset parameter may be what you are looking for.
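>>>>>
>>>>> As a rough sketch (hosts and names are placeholders; note that
>>>>> auto.offset.reset only applies when the consumer group has no committed
>>>>> offsets yet):
>>>>>
>>>>>     import kafka.serializer.StringDecoder
>>>>>     import org.apache.spark.storage.StorageLevel
>>>>>     import org.apache.spark.streaming.kafka.KafkaUtils
>>>>>
>>>>>     val kafkaParams = Map(
>>>>>       "zookeeper.connect" -> "zkhost:2181",
>>>>>       "group.id" -> "my-group",
>>>>>       "auto.offset.reset" -> "smallest") // or "largest" to start at the tail
>>>>>     val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>>>>>       ssc, kafkaParams, Map("events" -> 1), StorageLevel.MEMORY_AND_DISK_SER)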
>>>>>
>>>>> Tobias
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>
>>
>
