If you check the code of KafkaRDD, you'll see that the locality (host) is obtained from the Kafka partition and set in the KafkaRDD; this serves as a hint for Spark to schedule the task on the preferred location.
override def getPreferredLocations(thePart: Partition): Seq[String] = {
  val part = thePart.asInstanceOf[KafkaRDDPartition]
  // TODO is additional hostname resolution necessary here
  Seq(part.host)
}

On Wed, Oct 14, 2015 at 5:38 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> In the receiver-based Kafka streaming model, given that each receiver
> starts as a long-running task, one can rely on a certain degree of data
> locality based on the Kafka partitioning: data published on a given
> topic/partition will land on the same Spark Streaming receiving node until
> the receiver dies and needs to be restarted somewhere else.
>
> As I understand it, the direct-Kafka streaming model just computes offsets
> and relays the work to a KafkaRDD. How does the execution locality compare
> to the receiver-based approach?
>
> thanks, Gerard.
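To make the pattern concrete, here is a minimal sketch of a custom RDD that supplies a per-partition locality hint the same way KafkaRDD does. It assumes Spark is on the classpath; the class names (HostAwarePartition, HostAwareRDD) are hypothetical and chosen for illustration only:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition that carries a preferred host, analogous to KafkaRDDPartition
case class HostAwarePartition(index: Int, host: String) extends Partition

// Minimal RDD whose only job is to expose one locality preference per partition
class HostAwareRDD(sc: SparkContext, hosts: Seq[String])
  extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => HostAwarePartition(i, h) }.toArray

  // Same pattern as KafkaRDD: return the partition's host as the preferred location.
  // The scheduler treats this as a hint, not a guarantee -- if no executor is
  // running on that host, the task is scheduled elsewhere.
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostAwarePartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}
```

So with the direct approach the locality hint exists per batch and per partition, but it only pays off when Spark executors are co-located with the Kafka brokers; otherwise the scheduler falls back to any available executor.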