You can print whatever you want wherever you want; it's just a question of
whether it will show up in the driver's log or in the logs of the various executors.
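For example, a rough PySpark sketch (assuming an existing DStream named `stream`;
the names are illustrative, not from this thread):

    def handle_batch(rdd):
        # Runs on the driver: this print ends up in the driver's stdout/log.
        print("batch has %d partitions" % rdd.getNumPartitions())

        def handle_partition(records):
            # Runs on the executors: these prints end up in each executor's stdout
            # (Executors tab of the Spark UI, or the YARN container logs).
            for r in records:
                print("record: %s" % (r,))

        rdd.foreachPartition(handle_partition)

    stream.foreachRDD(handle_batch)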

On Wed, Feb 17, 2016 at 5:50 AM, Cyril Scetbon <cyril.scet...@free.fr>
wrote:

> I don't think we can print an integer value in a Spark Streaming process,
> as opposed to a Spark job. I think I can print the content of an RDD, but
> not debug messages. Am I wrong?
>
> Cyril Scetbon
>
> On Feb 17, 2016, at 12:51 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi
>
> You can always use the RDD's properties, which already include the partition information.
>
>
> https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
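>
> For example (a minimal PySpark sketch, just for illustration):
>
>     rdd = sc.parallelize(range(100), 5)  # toy RDD created with 5 partitions
>     print(rdd.getNumPartitions())        # -> 5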
>
>
> On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon <cyril.scet...@free.fr>
> wrote:
>
>> Your understanding is the right one (having re-read the documentation).
>> I'm still wondering how I can verify that 5 partitions have been created. My
>> job reads from a Kafka topic that has 5 partitions and sends the data to
>> E/S. I can see that when there is one task reading from Kafka, there are
>> 5 tasks writing to E/S. So I'm supposing that the task reading from Kafka
>> does it in parallel across 5 partitions, and that's why there are then 5
>> tasks writing to E/S. But I'm only supposing ...
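>>
>> One way I could check it (a rough sketch, assuming `rdd` is the batch RDD
>> inside foreachRDD):
>>
>>     def tag_with_partition(index, records):
>>         # Runs on the executors; attaches the Spark partition index to each record.
>>         return (("partition-%d" % index, r) for r in records)
>>
>>     tagged = rdd.mapPartitionsWithIndex(tag_with_partition)
>>     # Driver-side: lists the partition indexes that actually held data.
>>     print(tagged.keys().distinct().collect())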
>>
>> On Feb 16, 2016, at 21:12, ayan guha <guha.a...@gmail.com> wrote:
>>
>> I have a slightly different understanding.
>>
>> The direct stream generates one RDD per batch; however, the number of
>> partitions in that RDD equals the number of partitions in the Kafka topic.
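>>
>> For example, a rough sketch using the Spark 1.x Python API (broker address
>> and topic name are placeholders):
>>
>>     from pyspark.streaming.kafka import KafkaUtils
>>
>>     stream = KafkaUtils.createDirectStream(
>>         ssc,                                       # existing StreamingContext
>>         ["my-topic"],                              # topic with 5 partitions
>>         {"metadata.broker.list": "broker1:9092"})  # Kafka broker list
>>
>>     def show_partitions(time, rdd):
>>         # One RDD per batch; its partition count mirrors the topic's partitions.
>>         print("%s: %d partitions" % (time, rdd.getNumPartitions()))
>>
>>     stream.foreachRDD(show_partitions)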
>>
>> On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <cyril.scet...@free.fr>
>> wrote:
>>
>>> Hi guys,
>>>
>>> I'm running some tests with Spark and Kafka using a Python script. I use
>>> the second method, the one that doesn't need any receiver (the Direct
>>> Approach). It should adapt the number of RDDs to the number of partitions
>>> in the topic, and I'm trying to verify that. What's the easiest way to
>>> verify it? I also tried to co-locate YARN, Spark and Kafka to check whether
>>> RDDs are created depending on the leaders of the partitions in a topic, and
>>> they are not. Can you confirm that RDDs are not created depending on the
>>> location of the partitions, and that co-locating Kafka with Spark is not a
>>> must-have, or that Spark does not take advantage of it?
>>>
>>> As the parallelism is simplified (by creating as many RDDs as there are
>>> partitions), I suppose the biggest part of the tuning is playing with the
>>> Kafka partitions (leaving aside network configuration and the management of
>>> Spark resources)?
>>>
>>> Thank you
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
