Hi, I am using a Kafka and Spark cluster for a real-time aggregation analytics use case in production.

*Cluster details*

*6 nodes*, each running both Spark and Kafka processes:

Node 1 -> 1 Master, 1 Worker, 1 Driver, 1 Kafka process
Nodes 2, 3, 4, 5, 6 -> 1 Worker process each, 1 Kafka process each

Spark version: 1.3.0
Kafka version: 0.8.1

I am using the *Kafka* *Direct Stream* for the Kafka-Spark integration. The analytics code is written using the Spark Java API.

*Problem statement*

We are dealing with about *10 M records per hour*. My Spark Streaming batch runs at a *1 hour interval* (at 11:30, 12:30, 1:30 and so on). Since I am using the Direct Stream, each batch reads all the data for the past hour at once.

As of now it takes *about 3 minutes* to read the data, with network bandwidth utilization of *100-200 MBPS per node* (across the 6-node Spark cluster).

Since I am running both Spark and Kafka on the same machines, *I want to bind each Spark executor to the Kafka partition leader on its own node*, so as to eliminate the network bandwidth consumed by Spark when reading from Kafka.

I understand that the number of partitions Spark creates for a Direct Stream equals the number of partitions in Kafka, which is why I am curious whether Spark has such a provision.

Regards,
Gaurav