Hi, you can always use the RDD's properties, which already include the partition information.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html

On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon <cyril.scet...@free.fr> wrote:

> Your understanding is the right one (having re-read the documentation).
> I'm still wondering how I can verify that 5 partitions have been created.
> My job reads from a Kafka topic that has 5 partitions and sends the data
> to E/S. I can see that when there is one task reading from Kafka, there
> are 5 tasks writing to E/S. So I'm supposing that the task reading from
> Kafka does it in parallel using 5 partitions, and that's why there are
> then 5 tasks writing to E/S. But I'm only supposing ...
>
> On Feb 16, 2016, at 21:12, ayan guha <guha.a...@gmail.com> wrote:
>
> I have a slightly different understanding.
>
> The direct stream generates 1 RDD per batch; however, the number of
> partitions in that RDD equals the number of partitions in the Kafka
> topic.
>
> On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <cyril.scet...@free.fr>
> wrote:
>
>> Hi guys,
>>
>> I'm running some tests with Spark and Kafka using a Python script. I use
>> the second method, which doesn't need any receiver (the Direct
>> Approach). It should adapt the number of RDDs to the number of
>> partitions in the topic, and I'm trying to verify that. What's the
>> easiest way to verify it? I also tried to co-locate Yarn, Spark and
>> Kafka to check whether RDDs are created depending on the leaders of the
>> partitions in a topic, and they are not. Can you confirm that RDDs are
>> not created depending on the location of partitions, and that
>> co-locating Kafka with Spark is not a must-have, or that Spark does not
>> take advantage of it?
>>
>> As the parallelism is simplified (by creating as many RDDs as there are
>> partitions), I suppose that the biggest part of the tuning is playing
>> with Kafka partitions (not talking about network configuration or
>> management of Spark resources)?
>>
>> Thank you
>
> --
> Best Regards,
> Ayan Guha

--
Best Regards,
Ayan Guha
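To verify the behaviour discussed above, a sketch along these lines could log each batch RDD's partition count from the direct stream. The broker address, topic name, and batch interval are assumptions; the driver wiring is left in a `main()` you would call on a real cluster:

```python
def log_partition_count(time, rdd):
    # In the direct approach there is one RDD per batch; its partition
    # count should match the number of partitions of the Kafka topic
    # (5 in the example discussed in this thread).
    n = rdd.getNumPartitions()
    print("batch %s -> %d partitions" % (time, n))
    return n

def main():
    # Driver wiring: broker address, topic name, and batch interval are
    # assumptions -- adjust for your setup, then call main() on a cluster.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="direct-stream-partition-check")
    ssc = StreamingContext(sc, 10)  # 10-second batches (assumption)
    stream = KafkaUtils.createDirectStream(
        ssc, ["mytopic"], {"metadata.broker.list": "broker1:9092"})
    stream.foreachRDD(log_partition_count)
    ssc.start()
    ssc.awaitTermination()
```

The helper is plain Python, so it can be exercised without a cluster by passing any object exposing `getNumPartitions()`.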