Hi Gerard,
I am also trying to understand the same issue. From the code I have seen, it
looks like once the KafkaRDD is constructed, the execution of that RDD is up
to the task scheduler, which can schedule the partitions based on the load
on the nodes. There is a preferred location specified in KafkaRDD, but AFAIK
it maps to the host of the Kafka partition's leader. So if Kafka and Spark
are co-hosted, this will probably work. If not, I am not sure how to get
data locality for a partition.
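
Not the actual KafkaRDD source, but a minimal Scala sketch of how such a
preferred-location hint looks to the scheduler (the class and field names
here are assumed for illustration):

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical partition that carries the host of the Kafka leader
  // for its topic/partition, analogous to what KafkaRDD tracks.
  case class KafkaLikePartition(index: Int, topic: String,
                                partition: Int, leaderHost: String)
    extends Partition

  class KafkaLikeRDD(sc: SparkContext, parts: Array[KafkaLikePartition])
    extends RDD[String](sc, Nil) {

    override def getPartitions: Array[Partition] =
      parts.asInstanceOf[Array[Partition]]

    // The locality hint: the scheduler prefers to run each partition's
    // task on the leader broker's host, but this only helps if a Spark
    // executor is actually running there; otherwise it falls back to ANY.
    override def getPreferredLocations(split: Partition): Seq[String] =
      Seq(split.asInstanceOf[KafkaLikePartition].leaderHost)

    override def compute(split: Partition, context: TaskContext): Iterator[String] =
      Iterator.empty // the fetch from the Kafka leader would go here
  }

So the hint is only honoured when executors share hosts with the brokers;
otherwise the data is fetched over the network.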
Others,
correct me if there is a way.

On Wed, Oct 14, 2015 at 3:08 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> In the receiver-based Kafka streaming model, given that each receiver
> starts as a long-running task, one can rely on a certain degree of data
> locality based on the Kafka partitioning: data published on a given
> topic/partition will land on the same Spark Streaming receiving node until
> the receiver dies and needs to be restarted somewhere else.
>
> As I understand it, the direct Kafka streaming model just computes offsets
> and relays the work to a KafkaRDD. How does the execution locality compare
> to the receiver-based approach?
>
> thanks, Gerard.
>
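
For context, the direct stream that produces these KafkaRDDs is created
roughly as below (Kafka 0.8 direct API; the broker list and topic name are
placeholders):

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object DirectStreamSketch {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(
        new SparkConf().setAppName("direct-kafka-locality"), Seconds(5))

      // One RDD partition per Kafka topic/partition; each partition's
      // preferred location is its leader broker's host.
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val stream = KafkaUtils
        .createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("mytopic"))

      stream.foreachRDD { rdd =>
        println(rdd.count()) // tasks are node-local only if executors
                             // run on the broker hosts
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }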



-- 

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra
