Another suggestion that may help: you could consider using Kafka itself to
store the latest offsets instead of Zookeeper. There are at least two
benefits: 1) it lowers the workload on ZK, and 2) it supports replay from a
given offset. This is how Samza <http://samza.incubator.apache.org/> deals
with Kafka offsets; the doc is here:
<http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html>
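
For illustration, storing offsets in Kafka amounts to producing the latest
offset for each topic-partition to a (log-compacted) checkpoint topic, keyed
by topic and partition. Below is a minimal sketch using the Kafka 0.8
producer API; the class name and the "streaming-offsets" topic are
placeholders, and Samza's actual checkpointing differs in detail (see the
doc above).

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

// Hypothetical helper: checkpoints the latest consumed offset for a
// topic/partition into a Kafka topic instead of ZK.
public class KafkaOffsetCheckpointer {
    private final Producer<String, String> producer;
    private final String checkpointTopic;  // e.g. "streaming-offsets" (assumed)

    public KafkaOffsetCheckpointer(String brokerList, String checkpointTopic) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");
        this.producer = new Producer<String, String>(new ProducerConfig(props));
        this.checkpointTopic = checkpointTopic;
    }

    // Key by topic-partition so log compaction keeps only the newest offset.
    public void commit(String topic, int partition, long offset) {
        String key = topic + "-" + partition;
        producer.send(new KeyedMessage<String, String>(
            checkpointTopic, key, Long.toString(offset)));
    }
}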
Thank you.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> I'll let TD chime in on this one, but I'm guessing this would be a welcome
> addition. It's great to see community effort on adding new
> streams/receivers; adding a Java API for receivers was something we did
> specifically to allow this :)
>
> - Patrick
>
>
> On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Hi,
>>
>> I have implemented a low-level Kafka consumer for Spark Streaming using the
>> Kafka Simple Consumer API. This API gives better control over Kafka
>> offset management and recovery from failures. As the present Spark
>> KafkaUtils uses the high-level Kafka consumer API, I wanted better control
>> over offset management, which is not possible with the Kafka high-level
>> consumer.
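>>
>> To give a feel for the control the Simple Consumer API allows, a fetch
>> from an explicit offset looks roughly like the sketch below. The broker
>> host, topic, partition, and offset are placeholders, and leader discovery
>> and error handling are omitted; this is not the actual project code.
>>
>> import kafka.api.FetchRequest;
>> import kafka.api.FetchRequestBuilder;
>> import kafka.javaapi.FetchResponse;
>> import kafka.javaapi.consumer.SimpleConsumer;
>> import kafka.message.MessageAndOffset;
>>
>> public class SimpleFetchSketch {
>>     public static void main(String[] args) {
>>         // Connect directly to the partition leader; in practice the
>>         // leader is discovered first via a TopicMetadataRequest.
>>         SimpleConsumer consumer = new SimpleConsumer(
>>             "broker-host", 9092, 100000, 64 * 1024, "sample-client");
>>
>>         long readOffset = 12345L;  // resume from a known offset
>>         FetchRequest req = new FetchRequestBuilder()
>>             .clientId("sample-client")
>>             .addFetch("my-topic", 0, readOffset, 100000)
>>             .build();
>>         FetchResponse resp = consumer.fetch(req);
>>
>>         for (MessageAndOffset mao : resp.messageSet("my-topic", 0)) {
>>             // Process mao.message() here, then advance the offset.
>>             readOffset = mao.nextOffset();
>>         }
>>         consumer.close();
>>     }
>> }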
>>
>> This project is available in the repo below:
>>
>> https://github.com/dibbhatt/kafka-spark-consumer
>>
>>
>> I have implemented a custom receiver, consumer.kafka.client.KafkaReceiver.
>> The KafkaReceiver uses the low-level Kafka Consumer API (implemented in the
>> consumer.kafka packages) to fetch messages from Kafka and 'store' them in
>> Spark.
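>>
>> For reference, the general shape of a Spark Streaming custom receiver
>> (just a minimal sketch of the Receiver API the KafkaReceiver plugs into,
>> not the actual project code; fetchNextMessage() is a stand-in) is:
>>
>> import org.apache.spark.storage.StorageLevel;
>> import org.apache.spark.streaming.receiver.Receiver;
>>
>> public class SketchReceiver extends Receiver<String> {
>>     public SketchReceiver() {
>>         super(StorageLevel.MEMORY_AND_DISK_SER());
>>     }
>>
>>     @Override
>>     public void onStart() {
>>         // Spawn a worker thread so onStart() returns immediately.
>>         new Thread(new Runnable() {
>>             public void run() {
>>                 while (!isStopped()) {
>>                     String msg = fetchNextMessage();  // hypothetical fetch
>>                     store(msg);  // hand the message to Spark
>>                 }
>>             }
>>         }).start();
>>     }
>>
>>     @Override
>>     public void onStop() {
>>         // The isStopped() check above lets the worker thread exit.
>>     }
>>
>>     private String fetchNextMessage() {
>>         return "payload";  // placeholder for real Kafka fetch logic
>>     }
>> }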
>>
>> The logic detects the number of partitions for a topic and spawns that
>> many threads (individual consumer instances). The Kafka consumer uses
>> Zookeeper to store the latest offset for each partition, which helps
>> recovery in case of failure. The Kafka consumer logic is tolerant to ZK
>> failures, Kafka partition-leader changes, Kafka broker failures, recovery
>> from offset errors, and other fail-over aspects.
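>>
>> As an illustration of the offset bookkeeping, persisting an offset to ZK
>> can look like the sketch below; the znode path layout is only an
>> assumption, and the actual project may use a different scheme.
>>
>> import org.apache.zookeeper.CreateMode;
>> import org.apache.zookeeper.ZooDefs;
>> import org.apache.zookeeper.ZooKeeper;
>>
>> public class ZkOffsetStore {
>>     // Writes the last processed offset under a per-partition znode,
>>     // e.g. /consumers/<group>/offsets/<topic>/<partition> (assumed).
>>     public static void saveOffset(ZooKeeper zk, String path, long offset)
>>             throws Exception {
>>         byte[] data = Long.toString(offset).getBytes("UTF-8");
>>         if (zk.exists(path, false) == null) {
>>             zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
>>                       CreateMode.PERSISTENT);
>>         } else {
>>             zk.setData(path, data, -1);  // -1 = ignore znode version
>>         }
>>     }
>> }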
>>
>> consumer.kafka.client.Consumer is a sample consumer that uses these
>> Kafka receivers to generate DStreams from Kafka and applies an output
>> operation to every message of the RDDs.
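>>
>> The driver-side wiring follows the standard receiver-stream pattern. A
>> minimal sketch (SketchReceiver stands in for the project's KafkaReceiver):
>>
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.streaming.Duration;
>> import org.apache.spark.streaming.api.java.JavaDStream;
>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>>
>> public class SampleDriver {
>>     public static void main(String[] args) throws Exception {
>>         SparkConf conf = new SparkConf().setAppName("kafka-spark-consumer");
>>         JavaStreamingContext jssc =
>>             new JavaStreamingContext(conf, new Duration(2000));
>>
>>         // Plug in the custom receiver to get a DStream of messages.
>>         JavaDStream<String> messages =
>>             jssc.receiverStream(new SketchReceiver());
>>
>>         // Output operation applied to every RDD of the stream.
>>         messages.print();
>>
>>         jssc.start();
>>         jssc.awaitTermination();
>>     }
>> }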
>>
>> We are planning to use this Kafka Spark consumer to perform near-real-time
>> indexing of Kafka messages into a target search cluster and also
>> near-real-time aggregation into a target NoSQL store.
>>
>> Kindly let me know your views. Also, if this looks good, can I contribute
>> it to the Spark Streaming project?
>>
>> Regards,
>> Dibyendu
>>
>
>
