Another suggestion that may help: you can consider using Kafka itself to store the latest offset instead of Zookeeper. There are at least two benefits: 1) it lowers the workload on ZK, and 2) it supports replay from a given offset. This is how Samza <http://samza.incubator.apache.org/> deals with Kafka offsets; the doc is here <http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html>. Thank you.
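To make the idea concrete, here is a minimal sketch of checkpointing consumed offsets to a Kafka-backed store (in the style of Samza's checkpointing) rather than ZK. This is an illustration only: the checkpoint topic is simulated with an in-memory map, and the `OffsetCheckpoint` class and its methods are hypothetical names, not any real Kafka or Samza API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: commit the latest consumed offset per topic-partition
// to a checkpoint store (a compacted Kafka topic in practice; an in-memory
// map here), so a restarted consumer can replay from where it left off.
class OffsetCheckpoint {
    // Stand-in for a compacted checkpoint topic, keyed by "topic-partition".
    private final Map<String, Long> checkpointStore = new HashMap<>();

    // Record the latest consumed offset for a partition.
    public void commit(String topic, int partition, long offset) {
        checkpointStore.put(topic + "-" + partition, offset);
    }

    // On restart, read the last committed offset; -1L means no checkpoint
    // yet, i.e. start from the earliest available offset.
    public long lastCommitted(String topic, int partition) {
        return checkpointStore.getOrDefault(topic + "-" + partition, -1L);
    }
}
```

The point of the design is that the checkpoint write is just another Kafka produce, so it scales with the brokers instead of adding per-message write load to ZooKeeper.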
Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108

On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> I'll let TD chime in on this one, but I'm guessing this would be a welcome
> addition. It's great to see community effort on adding new
> streams/receivers; adding a Java API for receivers was something we did
> specifically to allow this :)
>
> - Patrick
>
> On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Hi,
>>
>> I have implemented a low-level Kafka consumer for Spark Streaming using
>> the Kafka Simple Consumer API. This API gives better control over Kafka
>> offset management and recovery from failures. As the present Spark
>> KafkaUtils uses the high-level Kafka Consumer API, I wanted finer control
>> over offset management, which is not possible with the Kafka high-level
>> consumer.
>>
>> This project is available in the repo below:
>>
>> https://github.com/dibbhatt/kafka-spark-consumer
>>
>> I have implemented a custom receiver, consumer.kafka.client.KafkaReceiver.
>> The KafkaReceiver uses the low-level Kafka Consumer API (implemented in
>> the consumer.kafka packages) to fetch messages from Kafka and 'store' them
>> in Spark.
>>
>> The logic detects the number of partitions for a topic and spawns that
>> many threads (individual Consumer instances). The Kafka consumer uses
>> Zookeeper to store the latest offset for each partition, which helps
>> recovery in case of failure. The consumer logic is tolerant to ZK
>> failures, Kafka partition-leader changes, Kafka broker failures, recovery
>> from offset errors, and other fail-over aspects.
>>
>> consumer.kafka.client.Consumer is a sample consumer that uses these
>> Kafka receivers to generate DStreams from Kafka and applies an output
>> operation to every message of the RDD.
>>
>> We are planning to use this Kafka Spark consumer to perform near-real-time
>> indexing of Kafka messages into a target search cluster, and also
>> near-real-time aggregation into a target NoSQL store.
>>
>> Kindly let me know your views. Also, if this looks good, can I contribute
>> it to the Spark Streaming project?
>>
>> Regards,
>> Dibyendu
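The partition-discovery-and-spawn logic described in the quoted message (one consumer thread per Kafka partition, each tracking the latest offset it has processed so recovery can resume from there) can be sketched roughly as below. This is a self-contained simulation, not the kafka-spark-consumer code: `PartitionConsumers` and its methods are hypothetical names, and a real receiver would fetch from the partition's Kafka leader and hand each message to Spark's store().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch: spawn one consumer thread per partition and record
// the latest offset each thread has processed, so a restart can resume
// from the last recorded offset per partition.
class PartitionConsumers {
    private final ConcurrentMap<Integer, Long> latestOffsets = new ConcurrentHashMap<>();

    // Spawn one thread per partition; each "consumes" messagesPerPartition
    // messages (offsets 0..n-1) and records the last offset it saw.
    public void run(int numPartitions, long messagesPerPartition) {
        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            final int partition = p;
            Thread t = new Thread(() -> {
                for (long offset = 0; offset < messagesPerPartition; offset++) {
                    // A real consumer would fetch this message from the
                    // Kafka leader for the partition and store() it in Spark.
                    latestOffsets.put(partition, offset);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Latest processed offset for a partition; -1L if nothing consumed yet.
    public long latestOffset(int partition) {
        return latestOffsets.getOrDefault(partition, -1L);
    }
}
```

Because each partition is owned by exactly one thread, per-partition ordering is preserved and offset tracking stays simple, which is the property the per-partition-thread design relies on.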