Hi,

I have implemented a low-level Kafka consumer for Spark Streaming using the Kafka Simple Consumer API. This API gives better control over Kafka offset management and recovery from failures. The present Spark KafkaUtils uses the high-level Kafka consumer API, which does not allow that kind of control over offsets, so I wanted an alternative that does. A short sketch of the offset control the Simple Consumer API provides follows below.
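To make the difference concrete, here is a minimal sketch against the Kafka 0.8 Simple Consumer API showing the explicit offset control it allows: the caller chooses the fetch offset itself and can persist it wherever it likes after processing. The broker host, client id, and topic below are placeholders, and the offset bookkeeping is only hinted at in comments.

import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;

public class OffsetControlSketch {
  public static void main(String[] args) {
    String topic = "mytopic";   // placeholder topic name
    int partition = 0;
    long offset = 0L;           // the caller decides where to start reading;
                                // this is the control the high-level consumer hides

    SimpleConsumer consumer =
        new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "sample-client");

    FetchRequest req = new FetchRequestBuilder()
        .clientId("sample-client")
        .addFetch(topic, partition, offset, 100000)
        .build();
    FetchResponse resp = consumer.fetch(req);

    if (!resp.hasError()) {
      ByteBufferMessageSet messages = resp.messageSet(topic, partition);
      for (MessageAndOffset mao : messages) {
        ByteBuffer payload = mao.message().payload();
        byte[] bytes = new byte[payload.limit()];
        payload.get(bytes);
        // Process the message, then persist mao.nextOffset() (e.g. in
        // ZooKeeper) so a restart can resume exactly where it stopped.
        offset = mao.nextOffset();
      }
    }
    consumer.close();
  }
}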
This project is available in the repo below:

https://github.com/dibbhatt/kafka-spark-consumer

I have implemented a custom receiver, consumer.kafka.client.KafkaReceiver. The KafkaReceiver uses the low-level Kafka Consumer API (implemented in the consumer.kafka package) to fetch messages from Kafka and 'store' them in Spark. The logic detects the number of partitions for a topic and spawns that many threads (individual consumer instances). The consumer uses ZooKeeper to store the latest offset for each partition, which helps recovery in case of failure. The consumer logic is tolerant to ZooKeeper failures, changes of a partition's Kafka leader, Kafka broker failures, recovery from offset errors, and other fail-over aspects.

consumer.kafka.client.Consumer is a sample consumer that uses these Kafka receivers to generate DStreams from Kafka and applies an output operation to every message of the RDD; a rough usage sketch is in the P.S. below.

We are planning to use this Kafka Spark consumer for near-real-time indexing of Kafka messages into a target search cluster, and for near-real-time aggregation into a target NoSQL store.

Kindly let me know your views. Also, if this looks good, can I contribute it to the Spark Streaming project?

Regards,
Dibyendu
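P.S. For anyone who wants to try the receiver, here is a minimal usage sketch in Java. The property keys, the KafkaReceiver constructor, and the byte[] payload type shown here are assumptions for illustration only; please check the repo for the actual API.

import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import consumer.kafka.client.KafkaReceiver;

public class SampleConsumer {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("KafkaReceiverSample");
    JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(5000));

    // ZooKeeper / Kafka connection details. The property keys are
    // placeholders for illustration; see the repo for the exact keys.
    Properties props = new Properties();
    props.put("zookeeper.hosts", "zkhost1:2181,zkhost2:2181");
    props.put("kafka.topic", "mytopic");
    props.put("kafka.consumer.id", "sample-consumer");

    // Plug the custom receiver into Spark Streaming. The constructor and the
    // byte[] payload type are assumptions; per the description above, the
    // receiver detects the partition count and spawns one thread per partition.
    JavaDStream<byte[]> messages =
        jsc.receiverStream(new KafkaReceiver(props, StorageLevel.MEMORY_ONLY()));

    // Output operation applied to every message of each RDD.
    messages.foreachRDD(rdd ->
        rdd.foreach(payload ->
            System.out.println("Received message of " + payload.length + " bytes")));

    jsc.start();
    jsc.awaitTermination();
  }
}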