Hi,

I have implemented a low-level Kafka consumer for Spark Streaming using the
Kafka SimpleConsumer API. This API gives better control over Kafka offset
management and recovery from failures. As the present Spark KafkaUtils uses
the high-level Kafka consumer API, I wanted the finer control over offset
management that is not possible with the high-level consumer.

This project is available in the repo below:

https://github.com/dibbhatt/kafka-spark-consumer


I have implemented a custom receiver, consumer.kafka.client.KafkaReceiver.
The KafkaReceiver uses the low-level Kafka Consumer API (implemented in the
consumer.kafka packages) to fetch messages from Kafka and 'store' them in
Spark.
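
To illustrate the idea, here is a minimal sketch of such a receiver. The
class name, constructor parameters and fetch sizes below are placeholders
for illustration, not the actual repo code; the real KafkaReceiver also
handles offset commits, leader discovery and fail-over:

import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Simplified sketch only; consumer.kafka.client.KafkaReceiver in the repo
// additionally tracks offsets in ZooKeeper and handles broker failures.
public class SimpleKafkaReceiver extends Receiver<String> {

  private final String brokerHost;
  private final int brokerPort;
  private final String topic;
  private final int partition;
  private long offset; // starting offset; the real code reads it from ZooKeeper

  public SimpleKafkaReceiver(String brokerHost, int brokerPort,
                             String topic, int partition, long startOffset) {
    super(StorageLevel.MEMORY_AND_DISK_2());
    this.brokerHost = brokerHost;
    this.brokerPort = brokerPort;
    this.topic = topic;
    this.partition = partition;
    this.offset = startOffset;
  }

  @Override
  public void onStart() {
    // Fetch on a separate thread so onStart() returns immediately.
    new Thread(new Runnable() {
      public void run() { receive(); }
    }).start();
  }

  @Override
  public void onStop() {
    // receive() polls isStopped() and exits on its own.
  }

  private void receive() {
    String clientId = "client_" + topic + "_" + partition;
    SimpleConsumer consumer =
        new SimpleConsumer(brokerHost, brokerPort, 100000, 64 * 1024, clientId);
    try {
      while (!isStopped()) {
        FetchRequest req = new FetchRequestBuilder()
            .clientId(clientId)
            .addFetch(topic, partition, offset, 100000)
            .build();
        FetchResponse resp = consumer.fetch(req);
        if (resp.hasError()) {
          // The real receiver handles leader changes and
          // offset-out-of-range errors here instead of just restarting.
          restart("Fetch error: " + resp.errorCode(topic, partition));
          return;
        }
        for (MessageAndOffset mo : resp.messageSet(topic, partition)) {
          ByteBuffer payload = mo.message().payload();
          byte[] bytes = new byte[payload.limit()];
          payload.get(bytes);
          store(new String(bytes, "UTF-8")); // hand the message to Spark
          offset = mo.nextOffset();          // the real code commits this to ZK
        }
      }
    } catch (Exception e) {
      restart("Error receiving data from Kafka", e);
    } finally {
      consumer.close();
    }
  }
}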

The logic detects the number of partitions for a topic and spawns that many
threads (individual consumer instances). The Kafka consumer stores the
latest offset for each partition in ZooKeeper, which helps it recover in
case of failure. The consumer logic is tolerant of ZK failures, Kafka
partition-leader changes, Kafka broker failures, offset errors and other
fail-over scenarios.
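
For example, the per-partition offsets can be kept in ZooKeeper roughly as
in the sketch below. The znode layout and helper class are assumptions for
illustration (the actual layout in the repo may differ), using the plain
org.apache.zookeeper client:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical helper: commits/reads the latest consumed offset per partition.
// Assumes 'zk' is an already-connected client and the parent znodes exist.
public class ZkOffsetStore {

  private final ZooKeeper zk;
  private final String basePath; // e.g. "/consumers/<id>/offsets/<topic>" (assumed)

  public ZkOffsetStore(ZooKeeper zk, String basePath) {
    this.zk = zk;
    this.basePath = basePath;
  }

  public void commit(int partition, long offset)
      throws KeeperException, InterruptedException {
    String path = basePath + "/" + partition;
    byte[] data = Long.toString(offset).getBytes();
    if (zk.exists(path, false) == null) {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      zk.setData(path, data, -1); // version -1 = overwrite any version
    }
  }

  public long read(int partition) throws KeeperException, InterruptedException {
    String path = basePath + "/" + partition;
    if (zk.exists(path, false) == null) {
      return -1L; // no offset committed yet; caller decides where to start
    }
    return Long.parseLong(new String(zk.getData(path, false, null)));
  }
}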

consumer.kafka.client.Consumer is the sample consumer that uses these Kafka
receivers to generate DStreams from Kafka and applies an output operation
to every message of the RDD.
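
A driver program wiring such receivers into a DStream would look roughly
like this, assuming the hypothetical SimpleKafkaReceiver sketched above,
made-up broker/topic names, and the Spark 1.x Java API:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ConsumerExample {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("KafkaSparkConsumer");
    JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(1000));

    // One receiver per partition; the real code discovers the partition
    // count from Kafka instead of hard-coding it.
    int numPartitions = 3; // assumed for illustration
    List<JavaDStream<String>> streams = new ArrayList<JavaDStream<String>>();
    for (int p = 0; p < numPartitions; p++) {
      streams.add(jsc.receiverStream(
          new SimpleKafkaReceiver("broker-host", 9092, "my-topic", p, 0L)));
    }

    // Union the per-partition streams into a single DStream.
    JavaDStream<String> unified = streams.get(0);
    for (int i = 1; i < streams.size(); i++) {
      unified = unified.union(streams.get(i));
    }

    // Apply an output operation to every message of each RDD.
    unified.foreachRDD(new Function<JavaRDD<String>, Void>() {
      public Void call(JavaRDD<String> rdd) {
        rdd.foreach(new VoidFunction<String>() {
          public void call(String msg) {
            System.out.println(msg); // e.g. index into search / NoSQL here
          }
        });
        return null;
      }
    });

    jsc.start();
    jsc.awaitTermination();
  }
}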

We are planning to use this Kafka Spark consumer to perform near-real-time
indexing of Kafka messages into a target search cluster, and also
near-real-time aggregation into a target NoSQL store.

Kindly let me know your views. Also, if this looks good, can I contribute
it to the Spark Streaming project?

Regards,
Dibyendu